-
Notifications
You must be signed in to change notification settings - Fork 1.8k
common: Add hashing support for REE arrays #18981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're aiming to close #16011 with this PR, perhaps we should add an end-to-end test showcasing REE arrays can be aggregated on?
Can do! Where do you think is the right place for a test like that? |
|
I have an example, but maybe in the interest of keeping PRs small and easily reviewable, I'll contribute the end-to-end example in a follow up? |
|
Let me know how you feel about the clippy lints, I personally find the suggestions from clippy in this case far harder to read since we're dealing with indexes, so I'd be inclined to add ignores, but let me know what you think! |
I prefer the clippy suggested way. Using indices is more C-style. |
c6669cf to
1e37fce
Compare
|
Adjusted to using mutable iterators. |
I think it's fine to add the tests in this PR. Perhaps we can add the test as an SLT, assuming |
|
In addition to the comments, I also realized that we were also always hashing all values, which is unnecessary, so I changed the implementation to first figure out via the runs which values are relevant and only hash those and then process them accordingly. |
|
Ok I went down a bit of a rabbithole, and I have a working SLT, but I think it's better as a follow up, as we need arrow 57.1.0 as we need apache/arrow-rs#8765 and another patch in datafusion for the arrow cast to work. |
That makes sense; I've edited the body to not close the issue on this PR merge. I hope to take a look at this PR again in next few days. |
vegarsti
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
| } | ||
| DataType::RunEndEncoded(_, _) => downcast_run_array! { | ||
| array => hash_run_array(array, random_state, hashes_buffer, rehash)?, | ||
| _ => unreachable!() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe return _internal_err like below (line 646) here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt like the unreachable was more appropriate, as it's what the dictionary case does as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, makes sense! Andrew has nudged me to return an error instead of unreachable a couple of times, so I figured maybe it's a stance he takes generally, that's why I remembered to comment on it. But if the dict case does it, it makes sense to keep it!
(e.g. here, but I think that was slightly different apache/arrow-rs#8589 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no strong opinion, whatever we do we should be consistent
4288e26 to
47abf30
Compare
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good to me, though I admit the logic around handling the REE array is a bit hard to follow at times; I hope to look into the arrow code for it and see if it can be made ergonomic 🤔
datafusion/common/src/hash_utils.rs
Outdated
| let start_in_slice = if absolute_run_end > array_offset { | ||
| logical_position | ||
| } else { | ||
| continue; | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what cases can this else branch occur? Is it for when the slice starts mid run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me think about this one more time, now that we're using the arrow functions for getting the physical start and end this might indeed be no longer necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at it with fresh eyes, I think this must have been a defensive leftover from when I was calculating/handling slicing/offset manually, but I think we need to rely on arrow being correct and I think with the tests we have we're well covered.
Removed this, and generally simplified things a bit more.
|
Are we good to merge this? |
|
Yep, planning to merge soon; just wanted to leave this open in case anyone else was interested in reviewing |
Which issue does this PR close?
Part of #16011
What changes are included in this PR?
Support for hashing of REE arrays
Are these changes tested?
Unit tests are added.
Are there any user-facing changes?
No, strictly additive behavior/new feature.
@alamb @vegarsti