Skip to content

Conversation

@goutamvenkat-anyscale
Copy link
Contributor

Description

Adds support to expose pyarrow compute functions to expressions to make with_column transforms more powerful.

Related issues

Closes #57668

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner November 7, 2025 23:59
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a powerful new feature by exposing PyArrow compute functions through namespaced expressions (.str, .list, .struct). The implementation is well-structured, using dynamic method generation from a configuration, which is a great pattern for extensibility. The addition of a .pyi stub file is excellent for static analysis and IDE support, and the new tests are comprehensive.

My main feedback is a medium-severity issue regarding the placement of pyarrow.compute imports in the manually defined namespace methods. Moving these imports inside the UDF wrappers will improve robustness by preventing potential serialization issues. I've left comments on all affected methods with suggestions.

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Nov 8, 2025
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale
Copy link
Contributor Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a powerful and well-designed feature for namespaced expressions, exposing a wide range of pyarrow compute functions for list, str, and struct types. The use of dynamic method generation via configuration dictionaries is clean and extensible, and the inclusion of a .pyi stub file for type hinting is excellent for developer experience and static analysis. The accompanying tests are comprehensive and well-structured. I have a few suggestions to improve type hint correctness and simplify some of the implementations.

Signed-off-by: Goutam <goutam@anyscale.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Alias Expressions: Incorrect Rename State

The AliasExpr.alias method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is called, it should always create an alias expression with _is_rename=False, regardless of whether the underlying expression was a rename. Preserving _is_rename=True causes the new alias to be incorrectly treated as a rename operation, which affects logical plan optimization and projection pushdown.

python/ray/data/expressions.py#L1306-L1311

return self._name
def alias(self, name: str) -> "Expr":
# Always unalias before creating new one
return AliasExpr(
self.expr.data_type, self.expr, _name=name, _is_rename=self._is_rename

Fix in Cursor Fix in Web


Signed-off-by: Goutam <goutam@anyscale.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Alias method mixes up rename and alias.

The AliasExpr.alias() method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is explicitly called, it creates an alias operation (not a rename), so _is_rename should always be False in the returned AliasExpr, regardless of the original expression's _is_rename value. This causes incorrect semantics when chaining operations like col("x")._rename("y").alias("z").

python/ray/data/expressions.py#L1152-L1157

function_name: Optional name for the function (for debugging)
Example:
>>> from ray.data.expressions import col, udf
>>> import pyarrow as pa
>>> import pyarrow.compute as pc

Fix in Cursor Fix in Web


Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner November 8, 2025 02:38
Signed-off-by: Goutam <goutam@anyscale.com>

@udf(return_dtype=DataType(object))
def _list_slice(arr):
return pc.list_slice(arr, start=start or 0, stop=stop, step=step or 1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Slice method: Zero values silently misinterpreted.

The _ListNamespace.slice method uses start or 0 and step or 1 to provide defaults, which incorrectly treats explicit 0 values as falsy. If a user passes step=0 explicitly, it gets silently converted to 1 instead of letting PyArrow raise an appropriate error. The code should use if start is None and if step is None checks instead of the or operator to properly distinguish between None and explicit 0 values.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ray Data] Enhance the expression for getting a field or slice

1 participant