-
Notifications
You must be signed in to change notification settings - Fork 778
x86: Improve String.hashCode for medium/long strings #22875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| int32_t unrollCount = 1; | ||
| #endif | ||
|
|
||
| int32_t charsPerMainLoopIteration = unrollCount * vectorSizes[vl]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On x86, getMaxPreferredVectorLength() only returns one of TR::VectorLength[128|256|512], whose enum values map to integers 1,2,3. This doesn't map correctly to the zero-indexed vectorSizes array and will exceed it in the case of TR::VectorLength512.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great catch, and likely hurts performance by causing extra iterations in the small loop.
| deps->addPostCondition(hashXMM, TR::RealRegister::NoReg, cg); | ||
|
|
||
| // Generate Main Loop; 4x Unrolled seems to yield the best performance for large arrays | ||
| static char *unrollVar = feGetEnv("TR_setInlineStringHashCodeUnrollCount"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this under the #ifdef if it only applies to the 64-bit case.
|
|
||
| generateLabelInstruction(TR::InstOpCode::label, node, bigLoopLabel, cg); | ||
|
|
||
| // Secondary unrolled vectorized loop with larget vector lengths |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
larget == large?
2168228 to
768f1b1
Compare
|
@0xdaryl See the force-push. |
| generateRegRegInstruction(TR::InstOpCode::XORRegReg(), node, index, index, cg); | ||
| generateRegRegInstruction(TR::InstOpCode::XORRegReg(), node, hash, hash, cg); | ||
| generateRegRegInstruction(TR::InstOpCode::MOV4RegReg, node, loopLimit, length, cg); | ||
| generateRegImmInstruction(TR::InstOpCode::AND4RegImm4, node, loopLimit, charsPerMainLoopIteration - 1, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only works accurately if charsPerMainLoopIteration is a power of 2 (I think). Maybe add an assert in the code here to check for that. Practically speaking, I think this can only happen if someone specifies an unrollCount that is not a power of 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an assert checking for this in the vectorizedHashCodeHelper. Setting TR_setInlineStringHashCodeUnrollCount=7 fires the following assert "Unroll count must be 1/2/4".
| generateRegRegInstruction(TR::InstOpCode::CMP4RegReg, node, index, length, cg); | ||
| generateRegRegInstruction(TR::InstOpCode::CMP4RegReg, node, index, loopLimit, cg); | ||
| generateLabelInstruction(TR::InstOpCode::JL4, node, loopLabel, cg); | ||
| generateLabelInstruction(TR::InstOpCode::label, node, endLabel, deps, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deps should be on the doneLabel to capture all register usage across this entire evaluator to ensure correct register assignment. Be sure all register usage is reflected (for example, you removed length from the dep conditions earlier) if it appears inside any codegen-created control flow.
768f1b1 to
a59b029
Compare
| deps->addPostCondition(loopLimit, TR::RealRegister::NoReg, cg); | ||
| deps->addPostCondition(multiplierXMM, TR::RealRegister::NoReg, cg); | ||
| deps->addPostCondition(tmpXMM, TR::RealRegister::NoReg, cg); | ||
| deps->addPostCondition(hashXMM, TR::RealRegister::NoReg, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
length and hash are missing from the deps and used in the control flow below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See force-push
This commit improves String.hashCode() performance by calling vectorizedHashCodeLoopHelper(...) to process longer string in an unrolled vectorized loop with larger vector lengths. Minor tweaks were made, including to change how multiplication vectors are loaded. Signed-off-by: Bradley Wood <bradley.wood@ibm.com>
a59b029 to
c4ce760
Compare
|
Jenkins test sanity.functional,sanity.openjdk xlinux,win jdk21 |
|
There are some failures in the PR testing that are reproducible with a grinder. I believe they might be related to this PR because similar failures were seen a couple of years back in #18371 that were "fixed" when a suspected PR was reverted. I don't believe that PR has anything to do with this PR, but it does demonstrate that JIT changes can cause failures in these tests. Please investigate. |
|
I reviewed the failures, all seem to happen the same way. In each case, a NullPointerException is preceded by a testng error message complaining that class is not found. The NPE seems to be genuine, and not an artifact of the signal handler. It also happens in hotspot if the class under test is truly missing from the classpath. The 'missing' class seems to have no relation to the underlying test. I am not sure why it would even attempt to load the class, as it is a part of a different test completely. I have not seen any failures that manifest in a different way. I attempted to produce the issue locally, and create a reduced test case, but wasn't able to reproduce the issue. I tried increasing the register pressure, but this also did not yield any issue yet. Despite that, I think register assignment is the only plausible explanation. I reviewed the way registers are handled in this PR. Other than a potential issue with the start/end of inner-control-flow flags on labels, nothing sticks out to me. I tried updating the ICF flags key labels, but still had a couple occurrences in Jenkins. I will try reverting the last few edits made surrounding RA, the code does look suspicious after your review comment, but there may have been a reason I decided to implement it that way. I could use another set of eyes on the RA related code, especially the part where we call out to the |
|
Are you able to run a grinder (not locally, but using the farm machines) on the test to get an idea of what the failure rate is? I believe I restarted the test twice during the CI testing and it failed both times, but perhaps you could set up 10 grinders to see if this is an intermittent issue or not. |
This commit improves String.hashCode() performance by calling vectorizedHashCodeLoopHelper(...) to
process longer string in an unrolled vectorized loop with larger vector lengths. Minor tweaks were made, including to change how multiplication vectors are loaded.