Commit 7e92f99

Merge remote-tracking branch 'upstream/2.3.x' into backport-61909
2 parents ade1027 + c8b9658 commit 7e92f99

7 files changed

+198
-27
lines changed

doc/source/user_guide/migration-3-strings.rst

Lines changed: 76 additions & 20 deletions
Original file line number | Diff line number | Diff line change
@@ -188,6 +188,14 @@ let pandas do the inference. But if you want to be specific, you can specify the
188188
This is actually compatible with pandas 2.x as well, since in pandas < 3,
189189
``dtype="str"`` was essentially treated as an alias for object dtype.
190190

191+
.. attention::
192+
193+
While using ``dtype="str"`` in constructors is compatible with pandas 2.x,
194+
specifying it as the dtype in :meth:`~Series.astype` runs into the issue
195+
of also stringifying missing values in pandas 2.x. See the section
196+
:ref:`string_migration_guide-astype_str` for more details.
197+
198+
191199
The missing value sentinel is now always NaN
192200
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
193201

@@ -307,55 +315,103 @@ the :meth:`~pandas.Series.str.decode` method now has a ``dtype`` parameter to be
307315
able to specify object dtype instead of the default of string dtype for this use
308316
case.
309317

318+
:meth:`Series.values` now returns an :class:`~pandas.api.extensions.ExtensionArray`
319+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
320+
321+
With object dtype, using ``.values`` on a Series will return the underlying NumPy array.
322+
323+
.. code-block:: python
324+
325+
>>> ser = pd.Series(["a", "b", np.nan], dtype="object")
326+
>>> type(ser.values)
327+
<class 'numpy.ndarray'>
328+
329+
However, with the new string dtype, the underlying ExtensionArray is returned instead.
330+
331+
.. code-block:: python
332+
333+
>>> ser = pd.Series(["a", "b", pd.NA], dtype="str")
334+
>>> ser.values
335+
<ArrowStringArray>
336+
['a', 'b', nan]
337+
Length: 3, dtype: str
338+
339+
If your code requires a NumPy array, you should use :meth:`Series.to_numpy`.
340+
341+
.. code-block:: python
342+
343+
>>> ser = pd.Series(["a", "b", pd.NA], dtype="str")
344+
>>> ser.to_numpy()
345+
['a' 'b' nan]
346+
347+
In general, prefer :meth:`Series.to_numpy` to get a NumPy array, or :attr:`Series.array` to get an ExtensionArray, over using :attr:`Series.values`.
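As a minimal sketch of that recommendation (the variable names are illustrative, and the concrete array classes depend on the pandas version and on whether PyArrow is installed):

```python
import numpy as np
import pandas as pd

ser = pd.Series(["a", "b", None], dtype="str")

# to_numpy() hands back a NumPy array regardless of the Series dtype.
as_numpy = ser.to_numpy()

# .array hands back an ExtensionArray; the concrete class varies with the
# pandas version and the string storage backend.
as_extension = ser.array

print(type(as_numpy).__name__)
print(isinstance(as_extension, pd.api.extensions.ExtensionArray))
```

Code written this way does not need to care whether ``.values`` would have returned an ndarray or an ExtensionArray.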
348+
310349
Notable bug fixes
311350
~~~~~~~~~~~~~~~~~
312351

352+
.. _string_migration_guide-astype_str:
353+
313354
``astype(str)`` preserving missing values
314355
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
315356

316-
This is a long standing "bug" or misfeature, as discussed in https://github.com/pandas-dev/pandas/issues/25353.
357+
The stringifying of missing values is a long standing "bug" or misfeature, as
358+
discussed in https://github.com/pandas-dev/pandas/issues/25353, but fixing it
359+
introduces a significant behaviour change.
317360

318-
With pandas < 3, when using ``astype(str)`` (using the built-in :func:`str`, not
319-
``astype("str")``!), the operation would convert every element to a string,
320-
including the missing values:
361+
With pandas < 3, when using ``astype(str)`` or ``astype("str")``, the operation
362+
would convert every element to a string, including the missing values:
321363

322364
.. code-block:: python
323365
324366
# OLD behavior in pandas < 3
325-
>>> ser = pd.Series(["a", np.nan], dtype=object)
367+
>>> ser = pd.Series([1.5, np.nan])
326368
>>> ser
327-
0 a
369+
0 1.5
328370
1 NaN
329-
dtype: object
330-
>>> ser.astype(str)
331-
0 a
371+
dtype: float64
372+
>>> ser.astype("str")
373+
0 1.5
332374
1 nan
333375
dtype: object
334-
>>> ser.astype(str).to_numpy()
335-
array(['a', 'nan'], dtype=object)
376+
>>> ser.astype("str").to_numpy()
377+
array(['1.5', 'nan'], dtype=object)
336378
337379
Note how ``NaN`` (``np.nan``) was converted to the string ``"nan"``. This was
338380
not the intended behavior, and it was inconsistent with how other dtypes handled
339381
missing values.
340382

341-
With pandas 3, this behavior has been fixed, and now ``astype(str)`` is an alias
342-
for ``astype("str")``, i.e. casting to the new string dtype, which will preserve
343-
the missing values:
383+
With pandas 3, this behavior has been fixed, and now ``astype("str")`` will cast
384+
to the new string dtype, which preserves the missing values:
344385

345386
.. code-block:: python
346387
347388
# NEW behavior in pandas 3
348389
>>> pd.options.future.infer_string = True
349-
>>> ser = pd.Series(["a", np.nan], dtype=object)
350-
>>> ser.astype(str)
351-
0 a
390+
>>> ser = pd.Series([1.5, np.nan])
391+
>>> ser.astype("str")
392+
0 1.5
352393
1 NaN
353394
dtype: str
354-
>>> ser.astype(str).values
355-
array(['a', nan], dtype=object)
395+
>>> ser.astype("str").to_numpy()
396+
array(['1.5', nan], dtype=object)
356397
357398
If you want to preserve the old behaviour of converting every object to a
358-
string, you can use ``ser.map(str)`` instead.
399+
string, you can use ``ser.map(str)`` instead. If you want to do such a conversion
400+
while preserving the missing values in a way that works with both pandas 2.x and
401+
3.x, you can use ``ser.map(str, na_action="ignore")`` (for pandas 3.x only, you
402+
can do ``ser.astype("str")``).
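A short sketch of the version-independent variant described above, assuming a float Series like the one in the earlier examples (``result`` is an illustrative name):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.5, np.nan])

# Stringify only the non-missing elements; NaN is passed through untouched.
result = ser.map(str, na_action="ignore")

print(result.tolist())
```

The resulting dtype may differ between pandas versions, but the missing value stays missing instead of becoming the string ``"nan"``.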
403+
404+
If you want to convert to object or string dtype for pandas 2.x and 3.x,
405+
respectively, without needing to stringify each individual element, you will
406+
have to use a conditional check on the pandas version.
407+
For example, to convert a categorical Series with string categories to its
408+
dense non-categorical version with object or string dtype:
409+
410+
.. code-block:: python
411+
412+
>>> import pandas as pd
413+
>>> ser = pd.Series(["a", np.nan], dtype="category")
414+
>>> ser.astype(object if pd.__version__ < "3" else "str")
359415
360416
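The check above compares version strings lexicographically, which would misorder a hypothetical major version such as ``"10"`` against ``"3"``. One possible alternative, sketched here under the assumption that ``pd.__version__`` starts with a numeric major component (``PANDAS_GE_3`` and ``dense`` are illustrative names, not pandas API), compares the major version as an integer:

```python
import numpy as np
import pandas as pd

# Compare the major version numerically rather than as a string.
PANDAS_GE_3 = int(pd.__version__.split(".")[0]) >= 3

ser = pd.Series(["a", np.nan], dtype="category")

# Dense, non-categorical result: string dtype on pandas 3.x, object on 2.x.
dense = ser.astype("str" if PANDAS_GE_3 else object)

print(dense.dtype)
```

Either way, the missing value is preserved rather than stringified.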
361417
``prod()`` raising for string data

doc/source/whatsnew/v2.3.1.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -73,4 +73,4 @@ Bug fixes
7373
Contributors
7474
~~~~~~~~~~~~
7575

76-
.. contributors:: v2.3.0..v2.3.1|HEAD
76+
.. contributors:: v2.3.0..v2.3.1

doc/source/whatsnew/v2.3.2.rst

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
.. _whatsnew_232:
22

3-
What's new in 2.3.2 (August XX, 2025)
3+
What's new in 2.3.2 (August 21, 2025)
44
-------------------------------------
55

66
These are the changes in pandas 2.3.2. See :ref:`release` for a full changelog
@@ -25,11 +25,16 @@ Bug fixes
2525
- Fix :meth:`~DataFrame.to_json` with ``orient="table"`` to correctly use the
2626
"string" type in the JSON Table Schema for :class:`StringDtype` columns
2727
(:issue:`61889`)
28-
- Fixed ``~Series.str.match``, ``~Series.str.fullmatch`` and ``~Series.str.contains``
29-
with compiled regex for the Arrow-backed string dtype (:issue:`61964`, :issue:`61942`)
28+
- Boolean operations (``|``, ``&``, ``^``) with bool-dtype objects on the left and :class:`StringDtype` objects on the right now cast the string to bool, with a deprecation warning (:issue:`60234`)
29+
- Fixed :meth:`~Series.str.match`, :meth:`~Series.str.fullmatch` and :meth:`~Series.str.contains`
30+
string methods with compiled regex for the Arrow-backed string dtype (:issue:`61964`, :issue:`61942`)
31+
- Bug in :meth:`Series.replace` and :meth:`DataFrame.replace` inconsistently
32+
replacing matching values when missing values are present for string dtypes (:issue:`56599`)
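To avoid the deprecation warning mentioned in the boolean-operations entry, the strings can be cast explicitly before operating, as the warning message itself suggests. A minimal sketch with illustrative data that contains no missing values (how ``astype(bool)`` treats missing values is not covered here):

```python
import pandas as pd

left = pd.Series([True, False, False, True])
right = pd.Series(["", "", "b", "c"])

# Cast explicitly instead of relying on the deprecated implicit conversion:
# empty strings become False, non-empty strings become True.
right_bool = right.astype(bool)

print((left | right_bool).tolist())
print((left & right_bool).tolist())
print((left ^ right_bool).tolist())
```

This mirrors how the expected values are built in the test added by this commit (``left | right.astype(bool)``).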
3033

3134
.. ---------------------------------------------------------------------------
3235
.. _whatsnew_232.contributors:
3336

3437
Contributors
3538
~~~~~~~~~~~~
39+
40+
.. contributors:: v2.3.1..v2.3.2|HEAD

pandas/core/arrays/arrow/array.py

Lines changed: 37 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22

33
import functools
44
import operator
5+
from pathlib import Path
56
import re
67
import textwrap
78
from typing import (
@@ -829,10 +830,45 @@ def _logical_method(self, other, op):
829830
# integer types. Otherwise these are boolean ops.
830831
if pa.types.is_integer(self._pa_array.type):
831832
return self._evaluate_op_method(other, op, ARROW_BIT_WISE_FUNCS)
833+
elif (
834+
(
835+
pa.types.is_string(self._pa_array.type)
836+
or pa.types.is_large_string(self._pa_array.type)
837+
)
838+
and op in (roperator.ror_, roperator.rand_, roperator.rxor)
839+
and isinstance(other, np.ndarray)
840+
and other.dtype == bool
841+
):
842+
# GH#60234 backward compatibility for the move to StringDtype in 3.0
843+
op_name = op.__name__[1:].strip("_")
844+
warnings.warn(
845+
f"'{op_name}' operations between boolean dtype and {self.dtype} are "
846+
"deprecated and will raise in a future version. Explicitly "
847+
"cast the strings to a boolean dtype before operating instead.",
848+
DeprecationWarning,
849+
stacklevel=find_stack_level(),
850+
)
851+
return op(other, self.astype(bool))
832852
else:
833853
return self._evaluate_op_method(other, op, ARROW_LOGICAL_FUNCS)
834854

835-
def _arith_method(self, other, op):
855+
def _arith_method(self, other, op) -> Self | npt.NDArray[np.object_]:
856+
if (
857+
op in [operator.truediv, roperator.rtruediv]
858+
and isinstance(other, Path)
859+
and (
860+
pa.types.is_string(self._pa_array.type)
861+
or pa.types.is_large_string(self._pa_array.type)
862+
)
863+
):
864+
# GH#61940
865+
return np.array(
866+
[
867+
op(x, other) if isinstance(x, str) else self.dtype.na_value
868+
for x in self
869+
],
870+
dtype=object,
871+
)
836872
return self._evaluate_op_method(other, op, ARROW_ARITHMETIC_FUNCS)
837873

838874
def equals(self, other) -> bool:

pandas/core/arrays/string_.py

Lines changed: 26 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22

33
from functools import partial
44
import operator
5+
from pathlib import Path
56
from typing import (
67
TYPE_CHECKING,
78
Any,
@@ -49,6 +50,7 @@
4950
missing,
5051
nanops,
5152
ops,
53+
roperator,
5254
)
5355
from pandas.core.algorithms import isin
5456
from pandas.core.array_algos import masked_reductions
@@ -385,6 +387,26 @@ class BaseStringArray(ExtensionArray):
385387

386388
dtype: StringDtype
387389

390+
# TODO(4.0): Once the deprecation here is enforced, this method can be
391+
# removed and we use the parent class method instead.
392+
def _logical_method(self, other, op):
393+
if (
394+
op in (roperator.ror_, roperator.rand_, roperator.rxor)
395+
and isinstance(other, np.ndarray)
396+
and other.dtype == bool
397+
):
398+
# GH#60234 backward compatibility for the move to StringDtype in 3.0
399+
op_name = op.__name__[1:].strip("_")
400+
warnings.warn(
401+
f"'{op_name}' operations between boolean dtype and {self.dtype} are "
402+
"deprecated and will raise in a future version. Explicitly "
403+
"cast the strings to a boolean dtype before operating instead.",
404+
DeprecationWarning,
405+
stacklevel=find_stack_level(),
406+
)
407+
return op(other, self.astype(bool))
408+
return NotImplemented
409+
388410
@doc(ExtensionArray.tolist)
389411
def tolist(self):
390412
if self.ndim > 1:
@@ -1052,7 +1074,7 @@ def _cmp_method(self, other, op):
10521074
mask = isna(self) | isna(other)
10531075
valid = ~mask
10541076

1055-
if not lib.is_scalar(other):
1077+
if lib.is_list_like(other):
10561078
if len(other) != len(self):
10571079
# prevent improper broadcasting when other is 2D
10581080
raise ValueError(
@@ -1068,6 +1090,9 @@ def _cmp_method(self, other, op):
10681090
result = np.empty_like(self._ndarray, dtype="object")
10691091
result[mask] = self.dtype.na_value
10701092
result[valid] = op(self._ndarray[valid], other)
1093+
if isinstance(other, Path):
1094+
# GH#61940
1095+
return result
10711096
return self._from_backing_data(result)
10721097
else:
10731098
# logical

pandas/tests/copy_view/test_methods.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@
1818
)
1919
import pandas._testing as tm
2020
from pandas.tests.copy_view.util import get_array
21+
from pandas.util.version import Version
2122

2223

2324
def test_copy(using_copy_on_write):
@@ -1199,8 +1200,9 @@ def test_round(using_copy_on_write, warn_copy_on_write, decimals):
11991200
if using_copy_on_write:
12001201
assert tm.shares_memory(get_array(df2, "b"), get_array(df, "b"))
12011202
# TODO: Make inplace by using out parameter of ndarray.round?
1202-
if decimals >= 0:
1203+
if decimals >= 0 and Version(np.__version__) < Version("2.4.0.dev0"):
12031204
# Ensure lazy copy if no-op
1205+
# TODO: Cannot rely on Numpy returning view after version 2.3
12041206
assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
12051207
else:
12061208
assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))

pandas/tests/strings/test_strings.py

Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,11 +2,13 @@
22
datetime,
33
timedelta,
44
)
5+
from pathlib import Path
56

67
import numpy as np
78
import pytest
89

910
from pandas import (
11+
NA,
1012
DataFrame,
1113
Index,
1214
MultiIndex,
@@ -776,3 +778,48 @@ def test_series_str_decode():
776778
result = Series([b"x", b"y"]).str.decode(encoding="UTF-8", errors="strict")
777779
expected = Series(["x", "y"], dtype="str")
778780
tm.assert_series_equal(result, expected)
781+
782+
783+
def test_reversed_logical_ops(any_string_dtype):
784+
# GH#60234
785+
dtype = any_string_dtype
786+
warn = None if dtype == object else DeprecationWarning
787+
left = Series([True, False, False, True])
788+
right = Series(["", "", "b", "c"], dtype=dtype)
789+
790+
msg = "operations between boolean dtype and"
791+
with tm.assert_produces_warning(warn, match=msg):
792+
result = left | right
793+
expected = left | right.astype(bool)
794+
tm.assert_series_equal(result, expected)
795+
796+
with tm.assert_produces_warning(warn, match=msg):
797+
result = left & right
798+
expected = left & right.astype(bool)
799+
tm.assert_series_equal(result, expected)
800+
801+
with tm.assert_produces_warning(warn, match=msg):
802+
result = left ^ right
803+
expected = left ^ right.astype(bool)
804+
tm.assert_series_equal(result, expected)
805+
806+
807+
def test_pathlib_path_division(any_string_dtype, request):
808+
# GH#61940
809+
if any_string_dtype == object:
810+
mark = pytest.mark.xfail(
811+
reason="with NA present we go through _masked_arith_op which "
812+
"raises TypeError bc Path is not recognized by lib.is_scalar."
813+
)
814+
request.applymarker(mark)
815+
816+
item = Path("/Users/Irv/")
817+
ser = Series(["A", "B", NA], dtype=any_string_dtype)
818+
819+
result = item / ser
820+
expected = Series([item / "A", item / "B", ser.dtype.na_value], dtype=object)
821+
tm.assert_series_equal(result, expected)
822+
823+
result = ser / item
824+
expected = Series(["A" / item, "B" / item, ser.dtype.na_value], dtype=object)
825+
tm.assert_series_equal(result, expected)
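For code that has to run on pandas versions without the GH#61940 fix, an elementwise fallback is one option. The sketch below is not part of this commit; it assumes an object-dtype Series and uses ``PurePosixPath`` so the result does not depend on the operating system (``base`` and ``joined`` are illustrative names):

```python
from pathlib import PurePosixPath

import pandas as pd

base = PurePosixPath("/Users/Irv")
ser = pd.Series(["A", "B", None], dtype=object)

# Join each string onto the base path; missing values are skipped.
joined = ser.map(lambda name: base / name, na_action="ignore")

print(joined.tolist())
```

Like the patched ``_arith_method``, this yields an object-dtype result in which missing entries stay missing.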
