Skip to content

Conversation

@brancz
Copy link
Contributor

@brancz brancz commented Nov 28, 2025

Which issue does this PR close?

Part of #16011

What changes are included in this PR?

Support for hashing of REE arrays

Are these changes tested?

Unit tests are added.

Are there any user-facing changes?

No, strictly additive behavior/new feature.

@alamb @vegarsti

@github-actions github-actions bot added the common Related to common crate label Nov 28, 2025
@brancz brancz mentioned this pull request Nov 28, 2025
4 tasks
@brancz brancz changed the title Add hashing of REE arrays and add REE aggregation example common: Add hashing support for REE arrays Nov 28, 2025
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're aiming to close #16011 with this PR, perhaps we should add an end-to-end test showcasing REE arrays can be aggregated on?

@brancz
Copy link
Contributor Author

brancz commented Nov 28, 2025

If we're aiming to close #16011 with this PR, perhaps we should add an end-to-end test showcasing REE arrays can be aggregated on?

Can do! Where do you think is the right place for a test like that?

@brancz
Copy link
Contributor Author

brancz commented Nov 28, 2025

I have an example, but maybe in the interest of keeping PRs small and easily reviewable, I'll contribute the end-to-end example in a follow up?

@brancz
Copy link
Contributor Author

brancz commented Nov 28, 2025

Let me know how you feel about the clippy lints, I personally find the suggestions from clippy in this case far harder to read since we're dealing with indexes, so I'd be inclined to add ignores, but let me know what you think!

@martin-g
Copy link
Member

Let me know how you feel about the clippy lints,

I prefer the clippy suggested way. Using indices is more C-style.
But this is just my personal preference !

@brancz
Copy link
Contributor Author

brancz commented Nov 28, 2025

Adjusted to using mutable iterators.

@Jefffrey
Copy link
Contributor

I have an example, but maybe in the interest of keeping PRs small and easily reviewable, I'll contribute the end-to-end example in a follow up?

I think it's fine to add the tests in this PR. Perhaps we can add the test as an SLT, assuming arrow_cast supports REE arrays?

@brancz
Copy link
Contributor Author

brancz commented Nov 29, 2025

In addition to the comments, I also realized that we were also always hashing all values, which is unnecessary, so I changed the implementation to first figure out via the runs which values are relevant and only hash those and then process them accordingly.

@brancz
Copy link
Contributor Author

brancz commented Nov 29, 2025

Ok I went down a bit of a rabbithole, and I have a working SLT, but I think it's better as a follow up, as we need arrow 57.1.0 as we need apache/arrow-rs#8765 and another patch in datafusion for the arrow cast to work.

@Jefffrey
Copy link
Contributor

Ok I went down a bit of a rabbithole, and I have a working SLT, but I think it's better as a follow up, as we need arrow 57.1.0 as we need apache/arrow-rs#8765 and another patch in datafusion for the arrow cast to work.

That makes sense; I've edited the body to not close the issue on this PR merge.

I hope to take a look at this PR again in next few days.

Copy link
Contributor

@vegarsti vegarsti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

}
DataType::RunEndEncoded(_, _) => downcast_run_array! {
array => hash_run_array(array, random_state, hashes_buffer, rehash)?,
_ => unreachable!()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe return _internal_err like below (line 646) here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt like the unreachable was more appropriate, as it's what the dictionary case does as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, makes sense! Andrew has nudged me to return an error instead of unreachable a couple of times, so I figured maybe it's a stance he takes generally, that's why I remembered to comment on it. But if the dict case does it, it makes sense to keep it!

(e.g. here, but I think that was slightly different apache/arrow-rs#8589 (comment))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no strong opinion, whatever we do we should be consistent

@brancz brancz force-pushed the ree-agg branch 3 times, most recently from 4288e26 to 47abf30 Compare December 1, 2025 17:52
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good to me, though I admit the logic around handling the REE array is a bit hard to follow at times; I hope to look into the arrow code for it and see if it can be made ergonomic 🤔

Comment on lines 533 to 537
let start_in_slice = if absolute_run_end > array_offset {
logical_position
} else {
continue;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what cases can this else branch occur? Is it for when the slice starts mid run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me think about this one more time, now that we're using the arrow functions for getting the physical start and end this might indeed be no longer necessary

Copy link
Contributor Author

@brancz brancz Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at it with fresh eyes, I think this must have been a defensive leftover from when I was calculating/handling slicing/offset manually, but I think we need to rely on arrow being correct and I think with the tests we have we're well covered.

Removed this, and generally simplified things a bit more.

@brancz
Copy link
Contributor Author

brancz commented Dec 4, 2025

Are we good to merge this?

@Jefffrey
Copy link
Contributor

Jefffrey commented Dec 4, 2025

Yep, planning to merge soon; just wanted to leave this open in case anyone else was interested in reviewing

@Jefffrey Jefffrey added this pull request to the merge queue Dec 4, 2025
Merged via the queue into apache:main with commit 63a8c65 Dec 4, 2025
27 checks passed
@Jefffrey
Copy link
Contributor

Jefffrey commented Dec 4, 2025

Thanks @brancz & @martin-g & @vegarsti

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants