
Commit 6882c50

Merge branch 'main' into enh-value_counts

2 parents cd9b165 + 734f519

36 files changed: +382 −122 lines

doc/source/user_guide/categorical.rst
Lines changed: 2 additions & 2 deletions

@@ -77,7 +77,7 @@ By passing a :class:`pandas.Categorical` object to a ``Series`` or assigning it
 .. ipython:: python

     raw_cat = pd.Categorical(
-        ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
+        [None, "b", "c", None], categories=["b", "c", "d"], ordered=False
     )
     s = pd.Series(raw_cat)
     s

@@ -145,7 +145,7 @@ of :class:`~pandas.api.types.CategoricalDtype`.

     from pandas.api.types import CategoricalDtype

-    s = pd.Series(["a", "b", "c", "a"])
+    s = pd.Series([None, "b", "c", None])
     cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
     s_cat = s.astype(cat_type)
     s_cat
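The documentation examples above drop the out-of-category value "a" because passing values absent from `categories` silently coerces them to NaN, which is exactly the behavior this commit deprecates. A minimal sketch of the current (pre-deprecation) behavior with released pandas:

```python
import pandas as pd

# "a" is absent from the declared categories, so today it is silently
# coerced to NaN -- the behavior being deprecated, and the reason the
# doc examples now spell the missing entries as None explicitly.
raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"])
mask = raw_cat.isna().tolist()  # which entries were lost
```

Under the new deprecation, constructing this `Categorical` emits a `Pandas4Warning` and will eventually raise.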

doc/source/user_guide/io.rst
Lines changed: 4 additions & 1 deletion

@@ -499,11 +499,14 @@ When using ``dtype=CategoricalDtype``, "unexpected" values outside of
 ``dtype.categories`` are treated as missing values.

 .. ipython:: python
+    :okwarning:

     dtype = CategoricalDtype(["a", "b", "d"])  # No 'c'
     pd.read_csv(StringIO(data), dtype={"col1": dtype}).col1

-This matches the behavior of :meth:`Categorical.set_categories`.
+This matches the behavior of :meth:`Categorical.set_categories`. This behavior is
+deprecated. In a future version, the presence of non-NA values that are not
+among the specified categories will raise.

 .. note::
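The `:okwarning:` flag is needed because this snippet now emits the deprecation warning when parsing. The behavior the doc describes can be reproduced directly under released pandas; the `data` string below is a hypothetical stand-in for the CSV snippet defined earlier in that doc page:

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

data = "col1\na\nb\nc\n"                   # hypothetical stand-in CSV
dtype = CategoricalDtype(["a", "b", "d"])  # No 'c'
col = pd.read_csv(StringIO(data), dtype={"col1": dtype}).col1
# "c" falls outside the declared categories and parses as NaN today;
# after the deprecation window it is slated to raise instead.
```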
doc/source/whatsnew/v3.0.0.rst
Lines changed: 6 additions & 0 deletions

@@ -647,6 +647,7 @@ Other Deprecations
 - Deprecated :meth:`Timestamp.utcfromtimestamp`, use ``Timestamp.fromtimestamp(ts, "UTC")`` instead (:issue:`56680`)
 - Deprecated :meth:`Timestamp.utcnow`, use ``Timestamp.now("UTC")`` instead (:issue:`56680`)
 - Deprecated ``pd.core.internals.api.maybe_infer_ndim`` (:issue:`40226`)
+- Deprecated allowing constructing or casting to :class:`Categorical` with non-NA values that are not present in specified ``dtype.categories`` (:issue:`40996`)
 - Deprecated allowing non-keyword arguments in :meth:`DataFrame.all`, :meth:`DataFrame.min`, :meth:`DataFrame.max`, :meth:`DataFrame.sum`, :meth:`DataFrame.prod`, :meth:`DataFrame.mean`, :meth:`DataFrame.median`, :meth:`DataFrame.sem`, :meth:`DataFrame.var`, :meth:`DataFrame.std`, :meth:`DataFrame.skew`, :meth:`DataFrame.kurt`, :meth:`Series.all`, :meth:`Series.min`, :meth:`Series.max`, :meth:`Series.sum`, :meth:`Series.prod`, :meth:`Series.mean`, :meth:`Series.median`, :meth:`Series.sem`, :meth:`Series.var`, :meth:`Series.std`, :meth:`Series.skew`, and :meth:`Series.kurt`. (:issue:`57087`)
 - Deprecated allowing non-keyword arguments in :meth:`Series.to_markdown` except ``buf``. (:issue:`57280`)
 - Deprecated allowing non-keyword arguments in :meth:`Series.to_string` except ``buf``. (:issue:`57280`)

@@ -969,6 +970,8 @@ Indexing
 - Bug in reindexing of :class:`DataFrame` with :class:`PeriodDtype` columns in case of consolidated block (:issue:`60980`, :issue:`60273`)
 - Bug in :meth:`DataFrame.loc.__getitem__` and :meth:`DataFrame.iloc.__getitem__` with a :class:`CategoricalDtype` column with integer categories raising when trying to index a row containing a ``NaN`` entry (:issue:`58954`)
 - Bug in :meth:`Index.__getitem__` incorrectly raising with a 0-dim ``np.ndarray`` key (:issue:`55601`)
+- Bug in indexing on a :class:`DatetimeIndex` with a ``timestamp[pyarrow]`` dtype or on a :class:`TimedeltaIndex` with a ``duration[pyarrow]`` dtype (:issue:`62277`)
+-

 Missing
 ^^^^^^^

@@ -1137,6 +1140,7 @@ Other
 - Bug in :meth:`Series.diff` allowing non-integer values for the ``periods`` argument. (:issue:`56607`)
 - Bug in :meth:`Series.dt` methods in :class:`ArrowDtype` that were returning incorrect values. (:issue:`57355`)
 - Bug in :meth:`Series.isin` raising ``TypeError`` when series is large (>10**6) and ``values`` contains NA (:issue:`60678`)
+- Bug in :meth:`Series.map` with a ``timestamp[pyarrow]`` dtype or ``duration[pyarrow]`` dtype incorrectly returning all-``NaN`` entries (:issue:`61231`)
 - Bug in :meth:`Series.mode` where an exception was raised when taking the mode with nullable types with no null values in the series. (:issue:`58926`)
 - Bug in :meth:`Series.rank` that doesn't preserve missing values for nullable integers when ``na_option='keep'``. (:issue:`56976`)
 - Bug in :meth:`Series.replace` and :meth:`DataFrame.replace` throwing ``ValueError`` when ``regex=True`` and all NA values. (:issue:`60688`)

@@ -1149,8 +1153,10 @@ Other
 - Bug in ``divmod`` and ``rdivmod`` with :class:`DataFrame`, :class:`Series`, and :class:`Index` with ``bool`` dtypes failing to raise, which was inconsistent with ``__floordiv__`` behavior (:issue:`46043`)
 - Bug in printing a :class:`DataFrame` with a :class:`DataFrame` stored in :attr:`DataFrame.attrs` raised a ``ValueError`` (:issue:`60455`)
 - Bug in printing a :class:`Series` with a :class:`DataFrame` stored in :attr:`Series.attrs` raised a ``ValueError`` (:issue:`60568`)
+- Deprecated the keyword ``check_datetimelike_compat`` in :meth:`testing.assert_frame_equal` and :meth:`testing.assert_series_equal` (:issue:`55638`)
 - Fixed bug where the :class:`DataFrame` constructor misclassified array-like objects with a ``.name`` attribute as :class:`Series` or :class:`Index` (:issue:`61443`)
 - Fixed regression in :meth:`DataFrame.from_records` not initializing subclasses properly (:issue:`57008`)
+-

 .. ***DO NOT USE THIS SECTION***

pandas/_testing/asserters.py
Lines changed: 33 additions & 16 deletions

@@ -7,6 +7,7 @@
     NoReturn,
     cast,
 )
+import warnings

 import numpy as np

@@ -15,6 +16,8 @@
 from pandas._libs.sparse import SparseIndex
 import pandas._libs.testing as _testing
 from pandas._libs.tslibs.np_datetime import compare_mismatched_resolutions
+from pandas.errors import Pandas4Warning
+from pandas.util._decorators import deprecate_kwarg

 from pandas.core.dtypes.common import (
     is_bool,

@@ -843,6 +846,7 @@ def assert_extension_array_equal(


 # This could be refactored to use the NDFrame.equals method
+@deprecate_kwarg(Pandas4Warning, "check_datetimelike_compat", new_arg_name=None)
 def assert_series_equal(
     left,
     right,

@@ -897,6 +901,9 @@ def assert_series_equal(
     check_datetimelike_compat : bool, default False
         Compare datetime-like which is comparable ignoring dtype.
+
+        .. deprecated:: 3.0
+
     check_categorical : bool, default True
         Whether to compare internal Categorical exactly.
     check_category_order : bool, default True

@@ -1132,6 +1139,7 @@ def assert_series_equal(


 # This could be refactored to use the NDFrame.equals method
+@deprecate_kwarg(Pandas4Warning, "check_datetimelike_compat", new_arg_name=None)
 def assert_frame_equal(
     left,
     right,

@@ -1194,6 +1202,9 @@ def assert_frame_equal(
         ``check_exact``, ``rtol`` and ``atol`` are specified.
     check_datetimelike_compat : bool, default False
         Compare datetime-like which is comparable ignoring dtype.
+
+        .. deprecated:: 3.0
+
     check_categorical : bool, default True
         Whether to compare internal Categorical exactly.
     check_like : bool, default False

@@ -1320,22 +1331,28 @@ def assert_frame_equal(
         # use check_index=False, because we do not want to run
         # assert_index_equal for each column,
         # as we already checked it for the whole dataframe before.
-        assert_series_equal(
-            lcol,
-            rcol,
-            check_dtype=check_dtype,
-            check_index_type=check_index_type,
-            check_exact=check_exact,
-            check_names=check_names,
-            check_datetimelike_compat=check_datetimelike_compat,
-            check_categorical=check_categorical,
-            check_freq=check_freq,
-            obj=f'{obj}.iloc[:, {i}] (column name="{col}")',
-            rtol=rtol,
-            atol=atol,
-            check_index=False,
-            check_flags=False,
-        )
+        with warnings.catch_warnings():
+            warnings.filterwarnings(
+                "ignore",
+                message="the 'check_datetimelike_compat' keyword",
+                category=Pandas4Warning,
+            )
+            assert_series_equal(
+                lcol,
+                rcol,
+                check_dtype=check_dtype,
+                check_index_type=check_index_type,
+                check_exact=check_exact,
+                check_names=check_names,
+                check_datetimelike_compat=check_datetimelike_compat,
+                check_categorical=check_categorical,
+                check_freq=check_freq,
+                obj=f'{obj}.iloc[:, {i}] (column name="{col}")',
+                rtol=rtol,
+                atol=atol,
+                check_index=False,
+                check_flags=False,
+            )


 def assert_equal(left, right, **kwargs) -> None:
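The `warnings.catch_warnings` wrapper added around the per-column `assert_series_equal` call uses the standard suppress-by-message pattern so the deprecation fires once at the `assert_frame_equal` boundary rather than once per column. A stdlib-only sketch of that pattern, with `DeprecationWarning` standing in for the internal `Pandas4Warning` and hypothetical names `inner`/`outer`/`check_compat`:

```python
import warnings

def inner(check_compat=False):
    # Stand-in for the decorated assert_series_equal: warns when the
    # deprecated keyword is passed explicitly.
    if check_compat is not False:
        warnings.warn(
            "the 'check_compat' keyword is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )

def outer(check_compat=False):
    # Stand-in for assert_frame_equal: warn once at the public boundary...
    if check_compat is not False:
        warnings.warn(
            "the 'check_compat' keyword is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
    # ...then suppress the matching warning around internal re-dispatch,
    # so the caller sees one warning, not one per column.
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="the 'check_compat' keyword",
            category=DeprecationWarning,
        )
        for _ in range(3):  # pretend the frame has three columns
            inner(check_compat=check_compat)

with warnings.catch_warnings(record=True) as rec:
    warnings.simplefilter("always")
    outer(check_compat=True)
n_warnings = len(rec)  # exactly one warning escapes
```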

pandas/core/arrays/arrow/array.py
Lines changed: 4 additions & 0 deletions

@@ -1616,6 +1616,10 @@ def map(self, mapper, na_action: Literal["ignore"] | None = None):
         if is_numeric_dtype(self.dtype):
             return map_array(self.to_numpy(), mapper, na_action=na_action)
         else:
+            # For "mM" cases, the super() method passes `self` without the
+            # to_numpy call, which inside map_array casts to ndarray[object].
+            # Without the to_numpy() call, NA is preserved instead of changed
+            # to None.
             return super().map(mapper, na_action)

     @doc(ExtensionArray.duplicated)
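The new comment documents why datetime/timedelta ("mM") dtypes take the `super().map` path: NA is preserved rather than converted to None. The user-facing contract involved here is `Series.map`'s `na_action` handling, which can be sketched without pyarrow (plain pandas, hypothetical data):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])
# With na_action="ignore", missing entries are passed through as missing
# instead of being handed to the mapper function.
out = s.map(lambda x: x * 10, na_action="ignore")
```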

pandas/core/arrays/categorical.py
Lines changed: 61 additions & 13 deletions

@@ -11,6 +11,7 @@
     cast,
     overload,
 )
+import warnings

 import numpy as np

@@ -23,6 +24,8 @@
 )
 from pandas._libs.arrays import NDArrayBacked
 from pandas.compat.numpy import function as nv
+from pandas.errors import Pandas4Warning
+from pandas.util._exceptions import find_stack_level
 from pandas.util._validators import validate_bool_kwarg

 from pandas.core.dtypes.cast import (

@@ -476,7 +479,11 @@ def __init__(
         elif isinstance(values.dtype, CategoricalDtype):
             old_codes = extract_array(values)._codes
             codes = recode_for_categories(
-                old_codes, values.dtype.categories, dtype.categories, copy=copy
+                old_codes,
+                values.dtype.categories,
+                dtype.categories,
+                copy=copy,
+                warn=True,
             )

         else:

@@ -528,7 +535,12 @@ def _from_sequence(

     def _cast_pointwise_result(self, values) -> ArrayLike:
         res = super()._cast_pointwise_result(values)
-        cat = type(self)._from_sequence(res, dtype=self.dtype)
+        with warnings.catch_warnings():
+            warnings.filterwarnings(
+                "ignore",
+                "Constructing a Categorical with a dtype and values containing",
+            )
+            cat = type(self)._from_sequence(res, dtype=self.dtype)
         if (cat.isna() == isna(res)).all():
             # i.e. the conversion was non-lossy
             return cat

@@ -565,6 +577,15 @@ def astype(self, dtype: AstypeArg, copy: bool = True) -> ArrayLike:
             dtype = self.dtype.update_dtype(dtype)
             self = self.copy() if copy else self
             result = self._set_dtype(dtype, copy=False)
+            wrong = result.isna() & ~self.isna()
+            if wrong.any():
+                warnings.warn(
+                    "Constructing a Categorical with a dtype and values containing "
+                    "non-null entries not in that dtype's categories is deprecated "
+                    "and will raise in a future version.",
+                    Pandas4Warning,
+                    stacklevel=find_stack_level(),
+                )

         elif isinstance(dtype, ExtensionDtype):
             return super().astype(dtype, copy=copy)

@@ -659,14 +680,16 @@ def _from_inferred_categories(
         if known_categories:
             # Recode from observation order to dtype.categories order.
             categories = dtype.categories
-            codes = recode_for_categories(inferred_codes, cats, categories, copy=False)
+            codes = recode_for_categories(
+                inferred_codes, cats, categories, copy=False, warn=True
+            )
         elif not cats.is_monotonic_increasing:
             # Sort categories and recode for unknown categories.
             unsorted = cats.copy()
             categories = cats.sort_values()

             codes = recode_for_categories(
-                inferred_codes, unsorted, categories, copy=False
+                inferred_codes, unsorted, categories, copy=False, warn=True
             )
             dtype = CategoricalDtype(categories, ordered=False)
         else:

@@ -787,7 +810,7 @@ def categories(self) -> Index:
         >>> ser.cat.categories
         Index(['a', 'b', 'c'], dtype='str')

-        >>> raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"])
+        >>> raw_cat = pd.Categorical([None, "b", "c", None], categories=["b", "c", "d"])
         >>> ser = pd.Series(raw_cat)
         >>> ser.cat.categories
         Index(['b', 'c', 'd'], dtype='str')

@@ -1095,7 +1118,7 @@ def set_categories(
         For :class:`pandas.Series`:

         >>> raw_cat = pd.Categorical(
-        ...     ["a", "b", "c", "A"], categories=["a", "b", "c"], ordered=True
+        ...     ["a", "b", "c", None], categories=["a", "b", "c"], ordered=True
         ... )
         >>> ser = pd.Series(raw_cat)
         >>> ser

@@ -1117,7 +1140,7 @@ def set_categories(
         For :class:`pandas.CategoricalIndex`:

         >>> ci = pd.CategoricalIndex(
-        ...     ["a", "b", "c", "A"], categories=["a", "b", "c"], ordered=True
+        ...     ["a", "b", "c", None], categories=["a", "b", "c"], ordered=True
         ... )
         >>> ci
         CategoricalIndex(['a', 'b', 'c', nan], categories=['a', 'b', 'c'],

@@ -1145,7 +1168,7 @@ def set_categories(
             codes = cat._codes
         else:
             codes = recode_for_categories(
-                cat.codes, cat.categories, new_dtype.categories, copy=False
+                cat.codes, cat.categories, new_dtype.categories, copy=False, warn=False
             )
         NDArrayBacked.__init__(cat, codes, new_dtype)
         return cat

@@ -2956,7 +2979,7 @@ def codes(self) -> Series:

         Examples
         --------
-        >>> raw_cate = pd.Categorical(["a", "b", "c", "a"], categories=["a", "b"])
+        >>> raw_cate = pd.Categorical(["a", "b", None, "a"], categories=["a", "b"])
         >>> ser = pd.Series(raw_cate)
         >>> ser.cat.codes
         0    0

@@ -2991,11 +3014,25 @@ def _get_codes_for_values(
     If `values` is known to be a Categorical, use recode_for_categories instead.
     """
     codes = categories.get_indexer_for(values)
+    wrong = (codes == -1) & ~isna(values)
+    if wrong.any():
+        warnings.warn(
+            "Constructing a Categorical with a dtype and values containing "
+            "non-null entries not in that dtype's categories is deprecated "
+            "and will raise in a future version.",
+            Pandas4Warning,
+            stacklevel=find_stack_level(),
+        )
     return coerce_indexer_dtype(codes, categories)


 def recode_for_categories(
-    codes: np.ndarray, old_categories, new_categories, *, copy: bool
+    codes: np.ndarray,
+    old_categories,
+    new_categories,
+    *,
+    copy: bool = True,
+    warn: bool = False,
 ) -> np.ndarray:
     """
     Convert a set of codes for to a new set of categories

@@ -3006,6 +3043,8 @@
     old_categories, new_categories : Index
     copy: bool, default True
         Whether to copy if the codes are unchanged.
+    warn : bool, default False
+        Whether to warn on silent-NA mapping.

     Returns
     -------

@@ -3030,9 +3069,18 @@
             return codes.copy()
         return codes

-    indexer = coerce_indexer_dtype(
-        new_categories.get_indexer_for(old_categories), new_categories
-    )
+    codes_in_old_cats = new_categories.get_indexer_for(old_categories)
+    if warn:
+        wrong = codes_in_old_cats == -1
+        if wrong.any():
+            warnings.warn(
+                "Constructing a Categorical with a dtype and values containing "
+                "non-null entries not in that dtype's categories is deprecated "
+                "and will raise in a future version.",
+                Pandas4Warning,
+                stacklevel=find_stack_level(),
+            )
+    indexer = coerce_indexer_dtype(codes_in_old_cats, new_categories)
    new_codes = take_nd(indexer, codes, fill_value=-1)
    return new_codes
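Both `_get_codes_for_values` and the new `warn=` branch of `recode_for_categories` detect out-of-category values the same way: `get_indexer_for` returns `-1` wherever no match exists. That primitive is public `Index` API and can be checked in isolation (a sketch of the detection idea, not the internal call path):

```python
import pandas as pd

new_categories = pd.Index(["b", "c", "d"])
old_categories = pd.Index(["a", "b"])
# -1 flags an entry with no counterpart in new_categories; a -1 arising
# from a non-NA value is what triggers the new deprecation warning.
codes = new_categories.get_indexer_for(list(old_categories))
```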
pandas/core/dtypes/dtypes.py
Lines changed: 1 addition & 1 deletion

@@ -203,7 +203,7 @@ class CategoricalDtype(PandasExtensionDtype, ExtensionDtype):
     Examples
     --------
     >>> t = pd.CategoricalDtype(categories=["b", "a"], ordered=True)
-    >>> pd.Series(["a", "b", "a", "c"], dtype=t)
+    >>> pd.Series(["a", "b", "a", None], dtype=t)
     0    a
     1    b
     2    a
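The docstring swaps "c" for `None` because "c" would now trigger the deprecation warning. A sketch of why the two cases must be distinguished, runnable on released pandas:

```python
import pandas as pd

t = pd.CategoricalDtype(categories=["b", "a"], ordered=True)
s = pd.Series(["a", "b", "a", None], dtype=t)
# None is a genuine missing value (code -1) and raises no warning; the
# old docstring's out-of-category "c" produced the same -1 code
# silently, which is exactly the ambiguity the deprecation surfaces.
codes = s.cat.codes.tolist()
```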

pandas/core/groupby/ops.py
Lines changed: 1 addition & 1 deletion

@@ -718,7 +718,7 @@ def groups(self) -> dict[Hashable, Index]:
             return self.groupings[0].groups
         result_index, ids = self.result_index_and_ids
         values = result_index._values
-        categories = Categorical(ids, categories=range(len(result_index)))
+        categories = Categorical.from_codes(ids, categories=range(len(result_index)))
         result = {
             # mypy is not aware that group has to be an integer
             values[group]: self.axis.take(axis_ilocs)  # type: ignore[call-overload]
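The `ids` here are already integer positions into `result_index`, so `Categorical.from_codes` is the natural constructor: it interprets its input directly as codes, whereas the plain `Categorical(...)` call treats the input as values to be looked up among the categories (a lookup that is redundant here, and that now runs through the deprecation check). A sketch of the difference between the two constructors:

```python
import pandas as pd

cats = ["x", "y", "z"]
# from_codes: input is positions into `categories`
by_codes = pd.Categorical.from_codes([0, 2, 1], categories=cats)
# plain constructor: input is values matched against `categories`
by_values = pd.Categorical(["x", "z", "y"], categories=cats)
# the two spellings produce the same categorical here
```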
