
8289943: Simplify some object allocation merges #9073

Closed
wants to merge 43 commits

Conversation

JohnTortugo
Contributor

@JohnTortugo JohnTortugo commented Jun 7, 2022

Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge?

The basic idea for this approach was discussed in this thread and it consists of:

  1. Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node).
  2. Scalar Replace the incoming allocations to the RAM node.
  3. Scalar Replace the RAM node itself.
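For illustration, here is a minimal Java sketch of the kind of code shape this targets (hypothetical class and method names, not taken from the actual tests):

    // 'p' becomes a Phi merging two non-escaping allocations. Today the merge
    // forces both allocations to be marked not scalar replaceable; with this
    // change both can be eliminated.
    class Point {
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static int mergedAccess(boolean cond) {
        Point p = cond ? new Point(1, 2) : new Point(3, 4);
        return p.x; // the only use after the merge is a field load (AddP->Load)
    }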

There are a few conditions for replacing the Phi with a RAM node, although I plan to work on removing them in subsequent PRs:

  • The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN.

These are the critical parts of the implementation and I'd appreciate it very much if you could tell me whether what I implemented violates any C2 IR constraints:

  • The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields.
  • The way I check if there is an incoming Allocate node to the original Phi node.
  • The way I check if there is no store to the merged objects after they are merged.

Testing:

  • Windows/Linux/macOS fastdebug/release
    • hotspot_all
    • tier1
    • Renaissance
    • dacapo
    • new IR-based tests

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8289943: Simplify some object allocation merges

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9073/head:pull/9073
$ git checkout pull/9073

Update a local copy of the PR:
$ git checkout pull/9073
$ git pull https://git.openjdk.org/jdk pull/9073/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9073

View PR using the GUI difftool:
$ git pr show -t 9073

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9073.diff


JohnTortugo and others added 21 commits May 18, 2022 15:30

Merge branch 'allocation-merges' of https://github.com/JohnTortugo/jdk into allocation-merges
@bridgekeeper

bridgekeeper bot commented Jun 7, 2022

👋 Welcome back JohnTortugo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@JohnTortugo JohnTortugo marked this pull request as draft June 7, 2022 23:24
@openjdk

openjdk bot commented Jun 7, 2022

@JohnTortugo this pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, you can run the following commands in the local repository for your personal fork:

git checkout allocation-merges
git fetch https://git.openjdk.java.net/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk

openjdk bot commented Jun 7, 2022

⚠️ @JohnTortugo This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Jun 7, 2022
@openjdk

openjdk bot commented Jun 7, 2022

@JohnTortugo The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jun 7, 2022
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Jun 8, 2022
@vnkozlov
Contributor

vnkozlov commented Oct 5, 2022

My idea was to have it as false by default. TBH, I thought that was kind of the "policy" for adding new optimizations.

Yes, an experimental flag is good for now if you DO plan to switch it ON later (and convert the flag into another type).
The trouble with false experimental flags is that the code becomes dead and "rots" if nobody cares about it.

@vnkozlov
Contributor

vnkozlov commented Oct 5, 2022

@JohnTortugo I just noticed that both new tests are missing {} in a lot of conditional cases. Please fix it. Code style.

@JohnTortugo
Contributor Author

Yes, an experimental flag is good for now if you DO plan to switch it ON later (and convert the flag into another type).

Yes, our plan is to keep working on this until we can remove all the constraints in can_reduce_this_phi. We have already discussed ideas of how to remove those constraints and we'll be tackling that (and discussing it with you & the community) after we get this initial implementation upstreamed.

The trouble with false experimental flags is that the code becomes dead and "rots" if nobody cares about it.

Totally understand your concern. I'm fine with setting the flag to true and making it "Diagnostic" - I just don't see it as a "Diagnostic" patch, but if it's fine for you then it's fine for me.

@JohnTortugo
Contributor Author

Maybe for future work, based on these benchmarks (and others), we can collect cases where this optimization does not work (or even bails out compilation).

The TraceReducedAllocationMerges option prints information about this. I actually have a spreadsheet where I list the cause and frequency of each case where the optimization cannot be applied.

BTW, were you able to remove all allocations in your test run_IfElseInLoop()?

Yes, in that case both allocations are removed. I just confirmed it with a test locally. Also, there is an IR-based test for that case.
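For reference, a hedged sketch of that shape, reusing the hypothetical Point class from the sketch in the PR description (the actual run_IfElseInLoop test may differ in its details):

    // Each iteration merges two allocations and only loads a field afterwards,
    // so both allocations can be scalar replaced once the merge Phi is reduced.
    static long runIfElseInLoop(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            Point p = (i % 2 == 0) ? new Point(i, i) : new Point(i, -i);
            acc += p.x;
        }
        return acc;
    }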

What about test case in https://bugs.openjdk.org/browse/JDK-6853701

The current patch bails out on that test because there is a Phi (or CmpP) consuming the merge Phi. Actually, that code example is one of the tests that I run "internally". There is already work going on to improve the current patch so that it can handle CmpP with NULL.
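A hedged sketch of that problematic shape (paraphrasing the kind of code in JDK-6853701, not the exact test, and reusing the hypothetical Point class from above):

    // The null check below becomes a CmpP consuming the merge Phi,
    // which the current patch does not yet handle, so it bails out.
    static int nullCheckedMerge(boolean cond) {
        Point p = cond ? new Point(1, 2) : null;
        if (p != null) {
            return p.x;
        }
        return 0;
    }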

@vnkozlov
Contributor

vnkozlov commented Oct 6, 2022

Thank you for answering my questions and doing all this research.

Let's do this: make the flag "experimental" and true by default. This will allow us to test it for some time after the changes are integrated. If everything looks good, we keep it that way; otherwise we can switch it off before JDK 20 is shipped (or switch it off regardless before shipping).

I will start a new round of testing (more tiers) and I will ask others to review the changes.
I think we should go with the current state of the changes if it satisfies the other reviewers.

@vnkozlov
Contributor

vnkozlov commented Oct 6, 2022

@JohnTortugo Looks like your latest change/fix causes a new issue. I see that GitHub Actions testing failed with:

#  Internal Error (/home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:1825), pid=3815, tid=3830
#  assert(_base == Int) failed: Not an Int

@merykitty
Member

@JohnTortugo I meant you can specify the warmup iterations for the whole test, not just for some methods inside it, with TestFramework::setDefaultWarmup.
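Something like the following, assuming the instance-based runner API of the IR framework (a sketch; the exact invocation in the test may differ):

    // Set one warmup count for every @Test method in the class instead of
    // annotating individual methods with @Warmup.
    TestFramework framework = new TestFramework();
    framework.setDefaultWarmup(5000);
    framework.start();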

The TraceReducedAllocationMerges option prints information about this. I actually have a spreadsheet where I list the cause and frequency of each case where the optimization cannot be applied.

It would be great if you could include all known failure cases of scalar replacement in the IR test. Thanks.

Contributor

@iwanowww iwanowww left a comment


Very nice work, Cesar. Most notably, I'm happy to see the test with so many non-trivial cases enumerated.

Speaking of the proposed patch itself, I'm not done reviewing it yet.

As of now, I don't fully grasp the purpose of and motivation for introducing ReducedAllocationMerge. I would be grateful for additional information about how you ended up with the current design. (I went through the email thread, but it didn't help me.)

In particular, I still don't understand how it interacts with existing scalar replacement logic when it comes to unique (per-allocation) memory slices.

Also, on bailouts when the new analysis fails: I instrumented all possible failure modes with asserts, and all the test failures I saw were in two places:

src/hotspot/share/opto/callnode.cpp:1912
    } else if (input->bottom_type()->base() == Type::Memory) {
      // Somehow the base was eliminated and we still have a memory reference left
      assert(false, "");
      return NULL;
    }

src/hotspot/share/opto/macro.cpp:2612
        // In some cases the region controlling the RAM might go away due to some simplification
        // of the IR graph. For now, we'll just bail out if this happens.
        if (n->in(0) == NULL || !n->in(0)->is_Region()) {
          assert(false, "");
          C->record_failure(C2Compiler::retry_no_reduce_allocation_merges());
          return;
        }

How hard would it be to extend the test with cases which demonstrate existing limitations?

@iwanowww
Contributor

Additional high-level comments.

(1) Overall, I don't see (yet) enough motivation to justify the addition of ReducedAllocationMerge. I'd prefer to see the new node go away.

ReducedAllocationMerge nodes are short-lived. As I can infer from the code, they are used solely for the purpose of caching field information (in addition to marking that the original phi satisfied some requirements). Have you considered using the existing ConnectionGraph instance for bookkeeping purposes? It's available during IGVN and macro expansion.

Also, I believe you face some ideal graph inconsistencies because you capture information too early (before split_unique_types and following IGVN pass; and previous allocation eliminations during eliminate_macro_nodes() may contribute to that).

(2) Following up on my earlier question about interactions with split_unique_types(): I'm worried that you remove the corresponding LocalVars from the ConnectionGraph, and about how that interacts with unique memory slices. I'd feel much more confident in the correctness if you split slices for unions of interacting allocations instead.

Contributor

@iwanowww iwanowww left a comment


Minor comments. No need to act until we agree on high-level aspects of the patch.

@JohnTortugo
Contributor Author

Hi Vladimir, first of all, thank you for reviewing the proposed patch. Sorry for the delay in answering your questions; I'm currently on paternity leave from work and my time in front of a computer is quite limited.

As of now, I don't fully grasp the purpose of and motivation for introducing ReducedAllocationMerge. I would be grateful for additional information about how you ended up with the current design. (I went through the email thread, but it didn't help me.)

My first implementation to deal with the problem was just replacing Phis merging object allocations with sets of Phis merging field loads from each different base. That works fine in some cases. However, one challenge that I faced in this approach, IIRC, was that the new Phi nodes (merging loads of fields) were being factored away (eliminated) as part of IGVN optimizations, and consequently the graph ended up again with Phis merging object allocations. I considered inserting Opaque nodes in some branches of the graph to prevent unintended optimizations, but in the end I found the approach below more "robust".

The proposed idea of using a new type of node to represent the merges has these pros, IMO:

  • By replacing Phis merging allocations with a new type of node (i.e., RAM), the
    existing code (split_unique_types) will be able to assign instance_ids to the
    scalar replaceable inputs of the merging Phi. That means that those inputs will
    be scalar replaced using existing code in C2. Additional code will be necessary
    only to scalar replace the new type of node.

  • As a side effect of the point described above, we'll also be able to use existing
    code in C2 to find the last stored values to fields (i.e., find_inst_mem,
    value_from_mem, etc.).

  • Existing optimizations will not interfere with the scalar replacement (as they
    did in the previous approach outlined) because they don't know how to handle the
    new type of node. In essence, the new type of node will be Opaque to existing
    optimizations.

  • We can safely, AFAIU, run igvn.optimize() after replacing the allocation merge
    Phis with the new type of node.

  • Lastly, an additional benefit of using a new type of node to store
    information, instead of storing it in the ConnectionGraph, for instance, is
    that by using graph edges to capture the required "information" that the node
    needs, the value is never in an "outdated" state. For instance, the current patch
    uses input slots of the RAM node to store references to required memory edges.
    Anytime a transformation happens in the graph and affects that memory edge, the
    RAM node will be "automatically" updated using existing code in C2. If, instead,
    the memory edge were just stored as part of the internal data of some class, then
    we'd need to handle those updates manually, AFAIU.

Some arguable cons of the current approach:

  • Seems complex at first glance.

  • A new (non-functional) node in the IR. The node is quite similar to a
    PhiNode in the sense that it's there just to represent a state (or some set of
    information).

  • The new node is a Macro node. Failure to remove it from the IR graph can cause
    compilation failure.

In particular, I still don't understand how it interacts with existing scalar
replacement logic when it comes to unique (per-allocation) memory slices.

The RAM node itself doesn't interfere with alias index / memory slice creation at all. RAM nodes are created before "adjust_scalar_replaceable_state" is executed, and because of that the allocations participating in merges may be marked as ScalarReplaceable. Later, "split_unique_types" will run and be able to assign instance_ids to allocations participating in the merge because there is "virtually" no merge anymore at this point.

During the macro node elimination phase the allocation nodes will be visited and potentially scalar replaced before any RAM node is visited. When an allocation node being scalar replaced is consumed by a RAM node, some information is "registered" in the RAM node. That information will later be used when the time comes to scalar replace the RAM node itself.

I hope I have answered your question. If not please let me know and I'll be
happy to give more details.

How hard would it be to extend the test with cases which demonstrate existing
limitations?

I'll try and create some test cases that trigger those limitations.

Also, I believe you face some ideal graph inconsistencies because you capture
information too early (before split_unique_types and following IGVN pass; and
previous allocation eliminations during eliminate_macro_nodes() may contribute
to that).

Can you please elaborate on that?

Following up on my earlier question about interactions with split_unique_types(): I'm worried that you remove the corresponding LocalVars from the ConnectionGraph, and about how that interacts with unique memory slices. I'd feel much more confident in the correctness if you split slices for unions of interacting allocations instead.

Can you please explain a bit more about this idea? Are you proposing to split the
phi into slices instead of removing it?

@iwanowww
Contributor

iwanowww commented Oct 26, 2022

Thanks for the clarifications, Cesar.

The concept of a RAM node still looks like a hack to me. And I'd like the patch to better fit the overall design of EA rather than trying to work around different peculiarities of the current implementation.

I'll try to elaborate on why I see RAMs as redundant.

As of now, it serves a dual purpose. It (1) marks a merge point as safe to be untangled during SR; and (2) caches information about field values.

I believe you can solve it in a cleaner manner without introducing placeholder
nodes and connection graph adjustments. IMO it's all about keeping escape
status and properly handling "safe" merges in split_unique_types.

One possible way to handle merge points is:

  • Handle merge points in adjust_scalar_replaceable_state and refrain from marking relevant bases as NSR when possible.
  • After adjust_scalar_replaceable_state is over, every merge point should have all its inputs as either NSR or SR.
  • split_unique_types incrementally builds value phis to eventually replace the base phi at merge point while processing SR allocations one by one.
  • After split_unique_types is done, there are no merge points anymore, each allocation has a dedicated memory graph and allocation elimination can proceed as before.

Do you see any problems with such an approach?

One thing still confuses me though: the patch mentions that RAMs can merge both eliminated and not-yet-eliminated allocations. What's the intended use case? I believe it's still required that all merged allocations eventually be eliminated. Do you try to handle the case during allocation elimination when some of the inputs are already eliminated and the rest are pending their turn?

@iwanowww
Contributor

Also, I believe you face some ideal graph inconsistencies because you capture
information too early (before split_unique_types and following IGVN pass; and
previous allocation eliminations during eliminate_macro_nodes() may contribute
to that).

Can you please elaborate on that?

There are places in the patch where you check for unintended graph modifications, like in PhaseMacroExpand::eliminate_macro_nodes():

        // In some cases the region controlling the RAM might go away due to some simplification
        // of the IR graph. For now, we'll just bail out if this happens.
        if (n->in(0) == NULL || !n->in(0)->is_Region()) {
          C->record_failure(C2Compiler::retry_no_reduce_allocation_merges());
          return;
        }

I consider that an implementation peculiarity contributed by RAMs rather than an inherent complication coming from the transformation being performed.

@vnkozlov
Contributor

vnkozlov commented Oct 27, 2022

As of now, it serves a dual purpose. It (1) marks a merge point as safe to be untangled during SR; and (2) caches information about field values.

Another important purpose of RAM is to have information at SafePoints after the merge point for re-allocation during deoptimization. You need Klass information. I don't think having only Phis for values is enough.

I believe you can solve it in a cleaner manner without introducing placeholder nodes and connection graph adjustments. IMO it's all about keeping escape status and properly handling "safe" merges in split_unique_types.

One possible way to handle merge points is:

  • Handle merge points in adjust_scalar_replaceable_state and refrain from marking relevant bases as NSR when possible.
  • After adjust_scalar_replaceable_state is over, every merge point should have all its inputs as either NSR or SR.
  • split_unique_types incrementally builds value phis to eventually replace the base phi at merge point while processing SR allocations one by one.
  • After split_unique_types is done, there are no merge points anymore, each allocation has a dedicated memory graph and allocation elimination can proceed as before.

I am not sure how this could be possible. Currently EA relies on IGVN to propagate field values based on a unique memory slice. What do you do with memory Load or Store nodes after the merge point? Which memory slice will you use for them?

Do you see any problems with such an approach?

One thing still confuses me though: the patch mentions that RAMs can merge both eliminated and not-yet-eliminated allocations. What's the intended use case? I believe it's still required that all merged allocations eventually be eliminated. Do you try to handle the case during allocation elimination when some of the inputs are already eliminated and the rest are pending their turn?

There is a check for it in ConnectionGraph::can_reduce_this_phi(). The only supported case is when there is no deoptimization point (SFP or UNCT) after the merge point. It allows eliminating SR allocations even if they merge with NSR allocations. That was the idea.
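A hedged Java sketch of that situation (hypothetical names, reusing the Point class from the earlier sketch):

    static Point cached = new Point(0, 0); // escapes via the static field -> NSR

    static int mixedMerge(boolean cond) {
        // One merge input is scalar replaceable, the other has escaped. With no
        // safepoint or trap after the merge, the SR allocation can still be
        // eliminated while the NSR input is left as a real allocation.
        Point p = cond ? new Point(1, 2) : cached;
        return p.x;
    }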

@iwanowww
Contributor

iwanowww commented Oct 27, 2022

Another important purpose of RAM is to have information at SafePoints after the merge point for re-allocation during deoptimization. You need Klass information. I don't think having only Phis for values is enough.

Klass information is available either from the Allocation node in split_unique_types or from the ConnectionGraph instance the Phi is part of.

I am not sure how this could be possible. Currently EA relies on IGVN to propagate field values based on a unique memory slice. What do you do with memory Load or Store nodes after the merge point? Which memory slice will you use for them?

My understanding of how the proposed approach is expected to work: merge points have to be simple enough to still allow splitting unique types for individual allocations.

For example, eliminate_ram_addp_use() replaces Load (AddP (Phi base1 ... basen) off) mem with Phi (val1 ... valn), and eliminate_reduced_allocation_merge() performs a similar transformation for SafePoints.

Alternatively, corresponding Phis can be built incrementally while processing each individual base by split_unique_types. Or, just by splitting Loads through Phi:

Load (AddP (Phi base_1 ... base_n) off) mem 
== split-through-phi ==>
Phi ((Load (AddP base_1 off) mem) ... (Load (AddP base_n off) mem)) 
== split_unique_types ==>
Phi ((Load (AddP base_1 off) mem_1) ... (Load (AddP base_n off) mem_n)) 
== IGVN ==>
Phi (val_1 ... val_n)

There is a check for it in ConnectionGraph::can_reduce_this_phi(). The only supported case is when there is no deoptimization point (SFP or UNCT) after the merge point. It allows eliminating SR allocations even if they merge with NSR allocations. That was the idea.

That's nice! Now I see has_call_as_user-related code.

It means that only Load (AddP (Phi base_1 ... base_n) off) mem shapes are allowed now.
I believe the aforementioned split-through-phi transformation should handle it well:

Load (AddP (Phi base_1 ... base_n) off) mem 
== split-through-phi ==>
Phi ((Load (AddP base_1 off) mem) ... (Load (AddP base_n off) mem)) 
== split_unique_types ==>
Phi (... (Load (AddP base_SR_i off) mem_i) ... (Load (AddP base_NSR_n off) mem) ...) 
== IGVN ==>
Phi (... val_i ... (Load (AddP base_NSR_n off) mem) ... )

@JohnTortugo
Contributor Author

Hi @iwanowww - Thank you for clarifying things!

After much thought and some testing, I think I can make the RAM node go away and achieve the results I want. Below are some additional comments & the overall approach that I'm going to switch to.

  • LoadNode::split_through_phi requires is_known_instance_field and therefore can't be run before split_unique_types without changes.

  • I think splitting the merge Phi as part of adjust_scalar_replaceable_state might be the best place to create the value Phi for the fields. However, if the merge Phi is used by a SafePoint/UncommonTrap then it's necessary to also create a SafePointScalarObjectNode (SSON). As you know, there is already logic to create SSON as part of PhaseMacroExpand::scalar_replacement. So, we have to decide if it's best to re-use the code in PhaseMacroExpand::scalar_replacement in adjust_scalar_replaceable_state or if we want to add new code to create SSON for merge Phis in PhaseMacroExpand::scalar_replacement. Here are the Pros & Cons of the two options I mentioned above:

    • Create SSON in adjust_scalar_replaceable_state:
      Pros: all logic to split phi is centered in one place.
      Cons: the logic to create SSON is used outside scalar_replacement routine.

    • Create SSON in the scalar_replacement routine.
      Pros: Scalar replacement of merge Phi is contained in scalar_replacement routine.
      Cons: the logic to split merge phi is spread throughout escape analysis and scalar replacement.

Does it look like a better approach? TIA!

@iwanowww
Contributor

iwanowww commented Nov 11, 2022

Does it look like a better approach?

It definitely does! Thanks a lot for thinking it through.

LoadNode::split_through_phi requires is_known_instance_field and therefore can't be run before split_unique_types without changes.

IMO it should be fine to customize LoadNode::split_through_phi for our purposes.

So, we have to decide if it's best to re-use the code in PhaseMacroExpand::scalar_replacement in adjust_scalar_replaceable_state or if we want to add new code to create SSON for merge Phis in PhaseMacroExpand::scalar_replacement.

From a design perspective, it's cleaner to modify the IR during PhaseMacroExpand::scalar_replacement(), since ConnectionGraph::adjust_scalar_replaceable_state() operates solely on the ConnectionGraph instance being built.

Cons: the logic to split merge phi is spread throughout escape analysis and scalar replacement.

It would be helpful to see an implementation sketch to get a better understanding of how much complexity it adds.

@bridgekeeper

bridgekeeper bot commented Dec 9, 2022

@JohnTortugo This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@JohnTortugo
Contributor Author

Hi all! I'm back from leave and I'm resuming the work on this project. I'll keep you updated. Thanks again for all the comments and feel free to add more in the meantime.

@bridgekeeper

bridgekeeper bot commented Feb 14, 2023

@JohnTortugo This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Feb 17, 2023
@JohnTortugo
Contributor Author

Hi, ALL.

I decided to create a new PR since, after the latest changes, the code looked much different from the version in this PR. I also attacked the problem from another direction: I decided to create the infrastructure for re-materializing objects before anything else, since merges being used as debug information is the most common use case (see the charts in the new PR).

Still, in this new approach, I plan to include all the feedback I received here: 1) no need for a RAM node; 2) improve split_unique_types; 3) make use of split-through-phi.

Thank you all again!

@JohnTortugo JohnTortugo closed this Mar 7, 2023
@JohnTortugo JohnTortugo deleted the allocation-merges branch April 18, 2024 16:51