Merge remote-tracking branch 'upstream/2.3.x' into backport-61909

jorisvandenbossche · jorisvandenbossche · commit 7e92f9978a3c · 2025-09-05T10:16:11.000+02:00
diff --git a/doc/source/user_guide/migration-3-strings.rst b/doc/source/user_guide/migration-3-strings.rst
@@ -188,6 +188,14 @@ let pandas do the inference. But if you want to be specific, you can specify the
 This is actually compatible with pandas 2.x as well, since in pandas < 3,
 ``dtype="str"`` was essentially treated as an alias for object dtype.
 
+.. attention::
+
+   While using ``dtype="str"`` in constructors is compatible with pandas 2.x,
+   specifying it as the dtype in :meth:`~Series.astype` runs into the issue
+   of also stringifying missing values in pandas 2.x. See the section
+   :ref:`string_migration_guide-astype_str` for more details.
+
+
 The missing value sentinel is now always NaN
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -307,55 +315,103 @@ the :meth:`~pandas.Series.str.decode` method now has a ``dtype`` parameter to be
 able to specify object dtype instead of the default of string dtype for this use
 case.
 
+:meth:`Series.values` now returns an :class:`~pandas.api.extensions.ExtensionArray`
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+With object dtype, using ``.values`` on a Series will return the underlying NumPy array.
+
+.. code-block:: python
+
+   >>> ser = pd.Series(["a", "b", np.nan], dtype="object")
+   >>> type(ser.values)
+   <class 'numpy.ndarray'>
+
+However with the new string dtype, the underlying ExtensionArray is returned instead.
+
+.. code-block:: python
+
+   >>> ser = pd.Series(["a", "b", pd.NA], dtype="str")
+   >>> ser.values
+   <ArrowStringArray>
+   ['a', 'b', nan]
+   Length: 3, dtype: str
+
+If your code requires a NumPy array, you should use :meth:`Series.to_numpy`.
+
+.. code-block:: python
+
+   >>> ser = pd.Series(["a", "b", pd.NA], dtype="str")
+   >>> ser.to_numpy()
+   ['a' 'b' nan]
+
+In general, you should always prefer :meth:`Series.to_numpy` to get a NumPy array or :meth:`Series.array` to get an ExtensionArray over using :meth:`Series.values`.
+
 Notable bug fixes
 ~~~~~~~~~~~~~~~~~
 
+.. _string_migration_guide-astype_str:
+
 ``astype(str)`` preserving missing values
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-This is a long standing "bug" or misfeature, as discussed in https://github.com/pandas-dev/pandas/issues/25353.
+The stringifying of missing values is a long standing "bug" or misfeature, as
+discussed in https://github.com/pandas-dev/pandas/issues/25353, but fixing it
+introduces a significant behaviour change.
 
-With pandas < 3, when using ``astype(str)`` (using the built-in :func:`str`, not
-``astype("str")``!), the operation would convert every element to a string,
-including the missing values:
+With pandas < 3, when using ``astype(str)`` or ``astype("str")``, the operation
+would convert every element to a string, including the missing values:
 
 .. code-block:: python
 
    # OLD behavior in pandas < 3
-   >>> ser = pd.Series(["a", np.nan], dtype=object)
+   >>> ser = pd.Series([1.5, np.nan])
    >>> ser
-   0      a
+   0    1.5
    1    NaN
-   dtype: object
-   >>> ser.astype(str)
-   0      a
+   dtype: float64
+   >>> ser.astype("str")
+   0    1.5
    1    nan
    dtype: object
-   >>> ser.astype(str).to_numpy()
-   array(['a', 'nan'], dtype=object)
+   >>> ser.astype("str").to_numpy()
+   array(['1.5', 'nan'], dtype=object)
 
 Note how ``NaN`` (``np.nan``) was converted to the string ``"nan"``. This was
 not the intended behavior, and it was inconsistent with how other dtypes handled
 missing values.
 
-With pandas 3, this behavior has been fixed, and now ``astype(str)`` is an alias
-for ``astype("str")``, i.e. casting to the new string dtype, which will preserve
-the missing values:
+With pandas 3, this behavior has been fixed, and now ``astype("str")`` will cast
+to the new string dtype, which preserves the missing values:
 
 .. code-block:: python
 
    # NEW behavior in pandas 3
    >>> pd.options.future.infer_string = True
-   >>> ser = pd.Series(["a", np.nan], dtype=object)
-   >>> ser.astype(str)
-   0      a
+   >>> ser = pd.Series([1.5, np.nan])
+   >>> ser.astype("str")
+   0    1.5
    1    NaN
    dtype: str
-   >>> ser.astype(str).values
-   array(['a', nan], dtype=object)
+   >>> ser.astype("str").to_numpy()
+   array(['1.5', nan], dtype=object)
 
 If you want to preserve the old behaviour of converting every object to a
-string, you can use ``ser.map(str)`` instead.
+string, you can use ``ser.map(str)`` instead. If you want do such conversion
+while preserving the missing values in a way that works with both pandas 2.x and
+3.x, you can use ``ser.map(str, na_action="ignore")`` (for pandas 3.x only, you
+can do ``ser.astype("str")``).
+
+If you want to convert to object or string dtype for pandas 2.x and 3.x,
+respectively, without needing to stringify each individual element, you will
+have to use a conditional check on the pandas version.
+For example, to convert a categorical Series with string categories to its
+dense non-categorical version with object or string dtype:
+
+.. code-block:: python
+
+   >>> import pandas as pd
+   >>> ser = pd.Series(["a", np.nan], dtype="category")
+   >>> ser.astype(object if pd.__version__ < "3" else "str")
 
 
 ``prod()`` raising for string data
diff --git a/doc/source/whatsnew/v2.3.1.rst b/doc/source/whatsnew/v2.3.1.rst
@@ -73,4 +73,4 @@ Bug fixes
 Contributors
 ~~~~~~~~~~~~
 
-.. contributors:: v2.3.0..v2.3.1|HEAD
+.. contributors:: v2.3.0..v2.3.1
diff --git a/doc/source/whatsnew/v2.3.2.rst b/doc/source/whatsnew/v2.3.2.rst
@@ -1,6 +1,6 @@
 .. _whatsnew_232:
 
-What's new in 2.3.2 (August XX, 2025)
+What's new in 2.3.2 (August 21, 2025)
 -------------------------------------
 
 These are the changes in pandas 2.3.2. See :ref:`release` for a full changelog
@@ -25,11 +25,16 @@ Bug fixes
 - Fix :meth:`~DataFrame.to_json` with ``orient="table"`` to correctly use the
   "string" type in the JSON Table Schema for :class:`StringDtype` columns
   (:issue:`61889`)
-- Fixed ``~Series.str.match``, ``~Series.str.fullmatch`` and ``~Series.str.contains``
-  with compiled regex for the Arrow-backed string dtype (:issue:`61964`, :issue:`61942`)
+- Boolean operations (``|``, ``&``, ``^``) with bool-dtype objects on the left and :class:`StringDtype` objects on the right now cast the string to bool, with a deprecation warning (:issue:`60234`)
+- Fixed :meth:`~Series.str.match`, :meth:`~Series.str.fullmatch` and :meth:`~Series.str.contains`
+  string methods with compiled regex for the Arrow-backed string dtype (:issue:`61964`, :issue:`61942`)
+- Bug in :meth:`Series.replace` and :meth:`DataFrame.replace` inconsistently
+  replacing matching values when missing values are present for string dtypes (:issue:`56599`)
 
 .. ---------------------------------------------------------------------------
 .. _whatsnew_232.contributors:
 
 Contributors
 ~~~~~~~~~~~~
+
+.. contributors:: v2.3.1..v2.3.2|HEAD
diff --git a/pandas/core/arrays/arrow/array.py b/pandas/core/arrays/arrow/array.py
@@ -2,6 +2,7 @@
 
 import functools
 import operator
+from pathlib import Path
 import re
 import textwrap
 from typing import (
@@ -829,10 +830,45 @@ def _logical_method(self, other, op):
         # integer types. Otherwise these are boolean ops.
         if pa.types.is_integer(self._pa_array.type):
             return self._evaluate_op_method(other, op, ARROW_BIT_WISE_FUNCS)
+        elif (
+            (
+                pa.types.is_string(self._pa_array.type)
+                or pa.types.is_large_string(self._pa_array.type)
+            )
+            and op in (roperator.ror_, roperator.rand_, roperator.rxor)
+            and isinstance(other, np.ndarray)
+            and other.dtype == bool
+        ):
+            # GH#60234 backward compatibility for the move to StringDtype in 3.0
+            op_name = op.__name__[1:].strip("_")
+            warnings.warn(
+                f"'{op_name}' operations between boolean dtype and {self.dtype} are "
+                "deprecated and will raise in a future version. Explicitly "
+                "cast the strings to a boolean dtype before operating instead.",
+                DeprecationWarning,
+                stacklevel=find_stack_level(),
+            )
+            return op(other, self.astype(bool))
         else:
             return self._evaluate_op_method(other, op, ARROW_LOGICAL_FUNCS)
 
-    def _arith_method(self, other, op):
+    def _arith_method(self, other, op) -> Self | npt.NDArray[np.object_]:
+        if (
+            op in [operator.truediv, roperator.rtruediv]
+            and isinstance(other, Path)
+            and (
+                pa.types.is_string(self._pa_array.type)
+                or pa.types.is_large_string(self._pa_array.type)
+            )
+        ):
+            # GH#61940
+            return np.array(
+                [
+                    op(x, other) if isinstance(x, str) else self.dtype.na_value
+                    for x in self
+                ],
+                dtype=object,
+            )
         return self._evaluate_op_method(other, op, ARROW_ARITHMETIC_FUNCS)
 
     def equals(self, other) -> bool:
diff --git a/pandas/core/arrays/string_.py b/pandas/core/arrays/string_.py
@@ -2,6 +2,7 @@
 
 from functools import partial
 import operator
+from pathlib import Path
 from typing import (
     TYPE_CHECKING,
     Any,
@@ -49,6 +50,7 @@
     missing,
     nanops,
     ops,
+    roperator,
 )
 from pandas.core.algorithms import isin
 from pandas.core.array_algos import masked_reductions
@@ -385,6 +387,26 @@ class BaseStringArray(ExtensionArray):
 
     dtype: StringDtype
 
+    # TODO(4.0): Once the deprecation here is enforced, this method can be
+    #  removed and we use the parent class method instead.
+    def _logical_method(self, other, op):
+        if (
+            op in (roperator.ror_, roperator.rand_, roperator.rxor)
+            and isinstance(other, np.ndarray)
+            and other.dtype == bool
+        ):
+            # GH#60234 backward compatibility for the move to StringDtype in 3.0
+            op_name = op.__name__[1:].strip("_")
+            warnings.warn(
+                f"'{op_name}' operations between boolean dtype and {self.dtype} are "
+                "deprecated and will raise in a future version. Explicitly "
+                "cast the strings to a boolean dtype before operating instead.",
+                DeprecationWarning,
+                stacklevel=find_stack_level(),
+            )
+            return op(other, self.astype(bool))
+        return NotImplemented
+
     @doc(ExtensionArray.tolist)
     def tolist(self):
         if self.ndim > 1:
@@ -1052,7 +1074,7 @@ def _cmp_method(self, other, op):
         mask = isna(self) | isna(other)
         valid = ~mask
 
-        if not lib.is_scalar(other):
+        if lib.is_list_like(other):
             if len(other) != len(self):
                 # prevent improper broadcasting when other is 2D
                 raise ValueError(
@@ -1068,6 +1090,9 @@ def _cmp_method(self, other, op):
             result = np.empty_like(self._ndarray, dtype="object")
             result[mask] = self.dtype.na_value
             result[valid] = op(self._ndarray[valid], other)
+            if isinstance(other, Path):
+                # GH#61940
+                return result
             return self._from_backing_data(result)
         else:
             # logical
diff --git a/pandas/tests/copy_view/test_methods.py b/pandas/tests/copy_view/test_methods.py
@@ -18,6 +18,7 @@
 )
 import pandas._testing as tm
 from pandas.tests.copy_view.util import get_array
+from pandas.util.version import Version
 
 
 def test_copy(using_copy_on_write):
@@ -1199,8 +1200,9 @@ def test_round(using_copy_on_write, warn_copy_on_write, decimals):
     if using_copy_on_write:
         assert tm.shares_memory(get_array(df2, "b"), get_array(df, "b"))
         # TODO: Make inplace by using out parameter of ndarray.round?
-        if decimals >= 0:
+        if decimals >= 0 and Version(np.__version__) < Version("2.4.0.dev0"):
             # Ensure lazy copy if no-op
+            # TODO: Cannot rely on Numpy returning view after version 2.3
             assert np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
         else:
             assert not np.shares_memory(get_array(df2, "a"), get_array(df, "a"))
diff --git a/pandas/tests/strings/test_strings.py b/pandas/tests/strings/test_strings.py
@@ -2,11 +2,13 @@
     datetime,
     timedelta,
 )
+from pathlib import Path
 
 import numpy as np
 import pytest
 
 from pandas import (
+    NA,
     DataFrame,
     Index,
     MultiIndex,
@@ -776,3 +778,48 @@ def test_series_str_decode():
     result = Series([b"x", b"y"]).str.decode(encoding="UTF-8", errors="strict")
     expected = Series(["x", "y"], dtype="str")
     tm.assert_series_equal(result, expected)
+
+
+def test_reversed_logical_ops(any_string_dtype):
+    # GH#60234
+    dtype = any_string_dtype
+    warn = None if dtype == object else DeprecationWarning
+    left = Series([True, False, False, True])
+    right = Series(["", "", "b", "c"], dtype=dtype)
+
+    msg = "operations between boolean dtype and"
+    with tm.assert_produces_warning(warn, match=msg):
+        result = left | right
+    expected = left | right.astype(bool)
+    tm.assert_series_equal(result, expected)
+
+    with tm.assert_produces_warning(warn, match=msg):
+        result = left & right
+    expected = left & right.astype(bool)
+    tm.assert_series_equal(result, expected)
+
+    with tm.assert_produces_warning(warn, match=msg):
+        result = left ^ right
+    expected = left ^ right.astype(bool)
+    tm.assert_series_equal(result, expected)
+
+
+def test_pathlib_path_division(any_string_dtype, request):
+    # GH#61940
+    if any_string_dtype == object:
+        mark = pytest.mark.xfail(
+            reason="with NA present we go through _masked_arith_op which "
+            "raises TypeError bc Path is not recognized by lib.is_scalar."
+        )
+        request.applymarker(mark)
+
+    item = Path("/Users/Irv/")
+    ser = Series(["A", "B", NA], dtype=any_string_dtype)
+
+    result = item / ser
+    expected = Series([item / "A", item / "B", ser.dtype.na_value], dtype=object)
+    tm.assert_series_equal(result, expected)
+
+    result = ser / item
+    expected = Series(["A" / item, "B" / item, ser.dtype.na_value], dtype=object)
+    tm.assert_series_equal(result, expected)