Add namespaced expressions that expose pyarrow functions #58465

goutamvenkat-anyscale · 2025-11-07T23:59:10Z

Description

Exposes pyarrow compute functions through namespaced expressions, making with_column transforms more powerful.

Related issues

Closes #57668

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a powerful new feature by exposing PyArrow compute functions through namespaced expressions (.str, .list, .struct). The implementation is well-structured, using dynamic method generation from a configuration, which is a great pattern for extensibility. The addition of a .pyi stub file is excellent for static analysis and IDE support, and the new tests are comprehensive.

My main feedback is a medium-severity issue regarding the placement of pyarrow.compute imports in the manually defined namespace methods. Moving these imports inside the UDF wrappers will improve robustness by preventing potential serialization issues. I've left comments on all affected methods with suggestions.
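The import placement suggested above can be sketched as follows. This is a minimal illustration, not the PR's code: `make_upper_udf` and `_upper` are hypothetical names, and the real methods are wrapped with Ray's `udf` decorator, which is omitted here so the snippet stays self-contained.

```python
def make_upper_udf():
    """Build a function that upper-cases a string array.

    Hypothetical helper for illustration only; it is not part of Ray.
    """

    def _upper(arr):
        # Deferred import: pyarrow.compute is resolved when the function
        # runs, so the closure does not capture the module object.
        import pyarrow.compute as pc

        return pc.utf8_upper(arr)

    return _upper


upper_udf = make_upper_udf()
```

Defining `upper_udf` does not require pyarrow to be importable; only calling it does, which is the property the review comment is after.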

python/ray/data/expressions.py

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale · 2025-11-08T01:38:38Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a powerful and well-designed feature for namespaced expressions, exposing a wide range of pyarrow compute functions for list, str, and struct types. The use of dynamic method generation via configuration dictionaries is clean and extensible, and the inclusion of a .pyi stub file for type hinting is excellent for developer experience and static analysis. The accompanying tests are comprehensive and well-structured. I have a few suggestions to improve type hint correctness and simplify some of the implementations.

python/ray/data/expressions.py

Signed-off-by: Goutam <goutam@anyscale.com>

cursor

Bug: Alias Expressions: Incorrect Rename State

The AliasExpr.alias method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is called, it should always create an alias expression with _is_rename=False, regardless of whether the underlying expression was a rename. Preserving _is_rename=True causes the new alias to be incorrectly treated as a rename operation, which affects logical plan optimization and projection pushdown.

python/ray/data/expressions.py#L1306-L1311

ray/python/ray/data/expressions.py

Lines 1306 to 1311 in c39c65b

    
        return self._name

    def alias(self, name: str) -> "Expr":
        # Always unalias before creating new one
        return AliasExpr(
            self.expr.data_type, self.expr, _name=name, _is_rename=self._is_rename
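The behavior this comment asks for can be sketched with a small stand-in class. `AliasSketch` and its fields are hypothetical and only mirror the `_name` and `_is_rename` arguments visible in the snippet; this is not Ray's `AliasExpr`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AliasSketch:
    """Hypothetical stand-in for AliasExpr, reduced to the relevant fields."""

    expr: str  # the wrapped expression, a plain column name in this sketch
    name: str
    is_rename: bool = False

    def alias_buggy(self, name: str) -> "AliasSketch":
        # Carries the rename flag over, as in the snippet above.
        return AliasSketch(self.expr, name, is_rename=self.is_rename)

    def alias_fixed(self, name: str) -> "AliasSketch":
        # An explicit alias is never a rename, so the flag is reset.
        return AliasSketch(self.expr, name, is_rename=False)


# A rename of column "x" to "y", followed by an explicit alias to "z".
renamed = AliasSketch(expr="x", name="y", is_rename=True)
buggy = renamed.alias_buggy("z")
fixed = renamed.alias_fixed("z")
```

Here `buggy.is_rename` stays `True` while `fixed.is_rename` is `False`; both keep the wrapped expression and take the new name.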

Signed-off-by: Goutam <goutam@anyscale.com>

cursor

Bug: Alias method mixes up rename and alias.

The AliasExpr.alias() method incorrectly preserves the _is_rename flag from the original expression when creating a new alias. When .alias() is explicitly called, it creates an alias operation (not a rename), so _is_rename should always be False in the returned AliasExpr, regardless of the original expression's _is_rename value. This causes incorrect semantics when chaining operations like col("x")._rename("y").alias("z").

python/ray/data/expressions.py#L1152-L1157

ray/python/ray/data/expressions.py

Lines 1152 to 1157 in 3b5f1a4

    
            function_name: Optional name for the function (for debugging)

        Example:
            >>> from ray.data.expressions import col, udf
            >>> import pyarrow as pa
            >>> import pyarrow.compute as pc

Signed-off-by: Goutam <goutam@anyscale.com>

cursor · 2025-11-08T03:05:58Z

python/ray/data/expressions.py

+
+        @udf(return_dtype=DataType(object))
+        def _list_slice(arr):
+            return pc.list_slice(arr, start=start or 0, stop=stop, step=step or 1)


Bug: Slice method: Zero values silently misinterpreted.

The _ListNamespace.slice method uses start or 0 and step or 1 to provide defaults, which incorrectly treats explicit 0 values as falsy. If a user passes step=0 explicitly, it gets silently converted to 1 instead of letting PyArrow raise an appropriate error. The code should use if start is None and if step is None checks instead of the or operator to properly distinguish between None and explicit 0 values.
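The difference between the two defaulting styles can be shown in plain Python, without pyarrow. The helper names below are hypothetical; in the PR the resolved values would be passed on to `pc.list_slice`.

```python
def resolve_with_or(start=None, step=None):
    """Defaulting with `or`: any falsy value, including 0, is replaced."""
    return (start or 0, step or 1)


def resolve_with_none_check(start=None, step=None):
    """Defaulting with `is None`: only a missing argument is replaced."""
    resolved_start = 0 if start is None else start
    resolved_step = 1 if step is None else step
    return (resolved_start, resolved_step)


# An explicit step=0 is silently rewritten to 1 by the `or` version...
or_result = resolve_with_or(start=None, step=0)
# ...but is preserved by the `is None` version, so the callee can validate it.
none_check_result = resolve_with_none_check(start=None, step=0)
```

For `start` the two versions agree, because the default is itself 0; the visible difference is in `step`, where `or` hides an invalid argument.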

goutamvenkat-anyscale added 3 commits November 7, 2025 15:30

[Data] - Pyarrow Functions as Expressions

2a721de

Signed-off-by: Goutam <goutam@anyscale.com>

Merge branch 'master' into goutam/pyarrow_expr

d6a6229

Add .pyi file

e202479

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale requested a review from a team as a code owner November 7, 2025 23:59

gemini-code-assist bot reviewed Nov 8, 2025

ray-gardener bot added the data Ray Data-related issues label Nov 8, 2025

Some doc failures

9665ae3

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist bot reviewed Nov 8, 2025

Docs

c39c65b

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 8, 2025

Fix typing

3b5f1a4

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 8, 2025

goutamvenkat-anyscale added 2 commits November 7, 2025 18:17

Vale linter

dab16b2

Signed-off-by: Goutam <goutam@anyscale.com>

One more try

336d882

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale requested a review from a team as a code owner November 8, 2025 02:38

cursor bot reviewed Nov 8, 2025

idk

b1108a3

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 8, 2025
