-
Notifications
You must be signed in to change notification settings - Fork 6.9k
Add namespaced expressions that expose pyarrow functions #58465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add namespaced expressions that expose pyarrow functions #58465
Conversation
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a powerful new feature by exposing PyArrow compute functions through namespaced expressions (.str, .list, .struct). The implementation is well-structured, using dynamic method generation from a configuration, which is a great pattern for extensibility. The addition of a .pyi stub file is excellent for static analysis and IDE support, and the new tests are comprehensive.
My main feedback is a medium-severity issue regarding the placement of pyarrow.compute imports in the manually defined namespace methods. Moving these imports inside the UDF wrappers will improve robustness by preventing potential serialization issues. I've left comments on all affected methods with suggestions.
Signed-off-by: Goutam <goutam@anyscale.com>
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a powerful and well-designed feature for namespaced expressions, exposing a wide range of pyarrow compute functions for list, str, and struct types. The use of dynamic method generation via configuration dictionaries is clean and extensible, and the inclusion of a .pyi stub file for type hinting is excellent for developer experience and static analysis. The accompanying tests are comprehensive and well-structured. I have a few suggestions to improve type hint correctness and simplify some of the implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Alias Expressions: Incorrect Rename State
The AliasExpr.alias method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is called, it should always create an alias expression with _is_rename=False, regardless of whether the underlying expression was a rename. Preserving _is_rename=True causes the new alias to be incorrectly treated as a rename operation, which affects logical plan optimization and projection pushdown.
python/ray/data/expressions.py#L1306-L1311
ray/python/ray/data/expressions.py
Lines 1306 to 1311 in c39c65b
| return self._name | |
| def alias(self, name: str) -> "Expr": | |
| # Always unalias before creating new one | |
| return AliasExpr( | |
| self.expr.data_type, self.expr, _name=name, _is_rename=self._is_rename |
Signed-off-by: Goutam <goutam@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Alias method mixes up rename and alias.
The AliasExpr.alias() method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is explicitly called, it creates an alias operation (not a rename), so _is_rename should always be False in the returned AliasExpr, regardless of the original expression's _is_rename value. This causes incorrect semantics when chaining operations like col("x")._rename("y").alias("z").
python/ray/data/expressions.py#L1152-L1157
ray/python/ray/data/expressions.py
Lines 1152 to 1157 in 3b5f1a4
| function_name: Optional name for the function (for debugging) | |
| Example: | |
| >>> from ray.data.expressions import col, udf | |
| >>> import pyarrow as pa | |
| >>> import pyarrow.compute as pc |
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
|
|
||
| @udf(return_dtype=DataType(object)) | ||
| def _list_slice(arr): | ||
| return pc.list_slice(arr, start=start or 0, stop=stop, step=step or 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Slice method: Zero values silently misinterpreted.
The _ListNamespace.slice method uses start or 0 and step or 1 to provide defaults, which incorrectly treats explicit 0 values as falsy. If a user passes step=0 explicitly, it gets silently converted to 1 instead of letting PyArrow raise an appropriate error. The code should use if start is None and if step is None checks instead of the or operator to properly distinguish between None and explicit 0 values.
Description
Adds support to expose pyarrow compute functions to expressions to make
with_columntransforms more powerful.Related issues
Closes #57668
Additional information