Commit 2938ec7

committed
Refinements
1 parent 5b2afe7 commit 2938ec7

2 files changed

+92
-142
lines changed

doc/source/user_guide/migration-3-strings.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -115,6 +115,8 @@ through the ``str`` accessor will work the same:
115115
class. The dtype can be constructed as ``pd.StringDtype(na_value=np.nan)``,
116116
but for general usage we recommend to use the shorter ``"str"`` alias.
117117

118+
.. _string_migration_guide-differences:
119+
118120
Overview of behavior differences and how to address them
119121
---------------------------------------------------------
120122

doc/source/user_guide/text.rst

Lines changed: 90 additions & 142 deletions
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ Text data types
1818
There are two ways to store text data in pandas:
1919

2020
1. :class:`StringDtype` extension type.
21-
2. ``object`` dtype NumPy array.
21+
2. NumPy ``object`` dtype.
2222

2323
We recommend using :class:`StringDtype` to store text data via the alias
2424
``dtype="str"`` (the default when dtype of strings is inferred), see
@@ -87,153 +87,16 @@ or convert from existing pandas data:
8787
type(s2[0])
8888
8989
However there are four distinct :class:`StringDtype` variants that may be utilized.
90-
91-
Python storage with ``np.nan`` values
92-
-------------------------------------
93-
94-
.. note::
95-
This is the same as ``dtype='str'`` *when PyArrow is not installed*.
96-
97-
The implementation uses a NumPy object array, which directly stores the
98-
Python string objects, hence why the storage here is called ``'python'``.
99-
NA values in this array are stored using ``np.nan``.
100-
101-
.. ipython:: python
102-
103-
pd.Series(
104-
["a", "b", None, np.nan, pd.NA],
105-
dtype=pd.StringDtype(storage="python", na_value=np.nan)
106-
)
107-
108-
Notice that the last three values are all inferred by pandas as being
109-
an NA values, and hence stored as ``np.nan``.
110-
111-
PyArrow storage with ``np.nan`` values
112-
--------------------------------------
113-
114-
.. note::
115-
This is the same as ``dtype='str'`` *when PyArrow is installed*.
116-
117-
The implementation uses a PyArrow array, however NA values in this array
118-
are stored using ``np.nan``.
119-
120-
.. ipython:: python
121-
122-
pd.Series(
123-
["a", "b", None, np.nan, pd.NA],
124-
dtype=pd.StringDtype(storage="pyarrow", na_value=np.nan)
125-
)
126-
127-
Notice that the last three values are all inferred by pandas as being
128-
an NA values, and hence stored as ``np.nan``.
129-
130-
Python storage with ``pd.NA`` values
131-
------------------------------------
132-
133-
.. note::
134-
This is the same as ``dtype='string'`` *when PyArrow is not installed*.
135-
136-
The implementation uses a NumPy object array, which directly stores the
137-
Python string objects, hence why the storage here is called ``'python'``.
138-
NA values in this array are stored using ``np.nan``.
139-
140-
.. ipython:: python
141-
142-
pd.Series(
143-
["a", "b", None, np.nan, pd.NA],
144-
dtype=pd.StringDtype(storage="python", na_value=pd.NA)
145-
)
146-
147-
Notice that the last three values are all inferred by pandas as
148-
being an NA values, and hence stored as ``pd.NA``.
149-
150-
PyArrow storage with ``pd.NA`` values
151-
-------------------------------------
152-
153-
.. note::
154-
This is the same as ``dtype='string'`` *when PyArrow is installed*.
155-
156-
The implementation uses a PyArrow array. NA values in this array are
157-
stored using ``pd.NA``.
158-
159-
.. ipython:: python
160-
161-
pd.Series(
162-
["a", "b", None, np.nan, pd.NA],
163-
dtype=pd.StringDtype(storage="python", na_value=pd.NA)
164-
)
165-
166-
Notice that the last three values are all inferred by pandas as being an NA
167-
values, and hence stored as ``pd.NA``.
90+
See the :ref:`text.four_string_variants` section below for details.
16891

16992
.. _text.differences:
17093

17194
Behavior differences
17295
====================
17396

174-
``StringDtype`` with ``np.nan`` NA values
175-
-----------------------------------------
176-
177-
1. Like ``dtype="object"``, :ref:`string accessor methods<api.series.str>`
178-
that return **integer** output will return a NumPy array that is
179-
either dtype int or float depending on the presence of NA values.
180-
Methods returning **boolean** output will return a NumPy array this is
181-
dtype bool, with the value ``False`` when an NA value is encountered.
182-
183-
.. ipython:: python
184-
185-
s = pd.Series(["a", None, "b"], dtype="str")
186-
s
187-
s.str.count("a")
188-
s.dropna().str.count("a")
189-
190-
When NA values are present, the output dtype is float64. However
191-
**boolean** output results in ``False`` for the NA values.
192-
193-
.. ipython:: python
194-
195-
s.str.isdigit()
196-
s.str.match("a")
197-
198-
2. Some string methods, like :meth:`Series.str.decode`, are not
199-
available because the underlying array can only contain
200-
strings, not bytes.
201-
3. Comparison operations will return a NumPy array with dtype bool. Missing
202-
values will always compare as unequal just as :attr:`np.nan` does.
203-
204-
``StringDtype`` with ``pd.NA`` NA values
205-
----------------------------------------
206-
207-
1. :ref:`String accessor methods<api.series.str>`
208-
that return **integer** output will always return a nullable integer dtype,
209-
rather than either int or float dtype (depending on the presence of NA values).
210-
Methods returning **boolean** output will return a nullable boolean dtype.
211-
212-
.. ipython:: python
213-
214-
s = pd.Series(["a", None, "b"], dtype="string")
215-
s
216-
s.str.count("a")
217-
s.dropna().str.count("a")
218-
219-
Both outputs are ``Int64`` dtype. Similarly for methods returning boolean values.
220-
221-
.. ipython:: python
222-
223-
s.str.isdigit()
224-
s.str.match("a")
225-
226-
2. Some string methods, like :meth:`Series.str.decode` because the underlying
227-
array can only contain strings, not bytes.
228-
3. Comparison operations will return an object with :class:`BooleanDtype`,
229-
rather than a ``bool`` dtype object. Missing values will propagate
230-
in comparison operations, rather than always comparing
231-
unequal like :attr:`numpy.nan`.
232-
233-
234-
.. important::
235-
Everything else that follows in the rest of this document applies equally to
236-
``'str'``, ``'string'``, and ``object`` dtype.
97+
There are various behavior differences between using NumPy ``object`` dtype,
98+
``dtype="str"``, and ``dtype="string"``. See the
99+
:ref:`String migration guide <string_migration_guide-differences>` section for further details.
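As a rough illustration of the kind of differences this paragraph refers to, the sketch below compares NumPy ``object`` dtype with ``dtype="string"`` only. It leaves out ``dtype="str"`` because that alias depends on the installed pandas version. The variable names are illustrative and not part of this commit.

```python
import pandas as pd

# The same data stored with two different dtypes.
obj = pd.Series(["a", None, "b"], dtype=object)
nullable = pd.Series(["a", None, "b"], dtype="string")

# Integer-returning string methods: with a missing value present, object
# dtype falls back to float64, while "string" gives a nullable Int64 result.
obj_counts = obj.str.count("a")
nullable_counts = nullable.str.count("a")
print(obj_counts.dtype, nullable_counts.dtype)

# Comparisons: object dtype yields a plain bool result (missing compares as
# unequal), while "string" yields a nullable boolean result in which the
# missing value propagates.
obj_eq = obj == "a"
nullable_eq = nullable == "a"
print(obj_eq.dtype, nullable_eq.dtype)
print(nullable_eq.isna().tolist())
```

The migration guide remains the authoritative list of differences; this only shows two of them side by side.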
237100

238101
.. _text.string_methods:
239102

@@ -823,6 +686,91 @@ String ``Index`` also supports ``get_dummies`` which returns a ``MultiIndex``.
823686
824687
See also :func:`~pandas.get_dummies`.
825688

689+
.. _text.four_string_variants:
690+
691+
The four :class:`StringDtype` variants
692+
======================================
693+
694+
There are four :class:`StringDtype` variants that are available to users.
695+
696+
Python storage with ``np.nan`` values
697+
-------------------------------------
698+
699+
.. note::
700+
This is the same as ``dtype='str'`` *when PyArrow is not installed*.
701+
702+
The implementation uses a NumPy object array, which directly stores the
703+
Python string objects, hence why the storage here is called ``'python'``.
704+
NA values in this array are represented and behave as ``np.nan``.
705+
706+
.. ipython:: python
707+
708+
pd.Series(
709+
["a", "b", None, np.nan, pd.NA],
710+
dtype=pd.StringDtype(storage="python", na_value=np.nan)
711+
)
712+
713+
Notice that the last three values are all inferred by pandas as being
714+
NA values, and hence stored as ``np.nan``.
715+
716+
PyArrow storage with ``np.nan`` values
717+
--------------------------------------
718+
719+
.. note::
720+
This is the same as ``dtype='str'`` *when PyArrow is installed*.
721+
722+
The implementation uses a PyArrow array; however, NA values in this array
723+
are represented and behave as ``np.nan``.
724+
725+
.. ipython:: python
726+
727+
pd.Series(
728+
["a", "b", None, np.nan, pd.NA],
729+
dtype=pd.StringDtype(storage="pyarrow", na_value=np.nan)
730+
)
731+
732+
Notice that the last three values are all inferred by pandas as being
733+
NA values, and hence stored as ``np.nan``.
734+
735+
Python storage with ``pd.NA`` values
736+
------------------------------------
737+
738+
.. note::
739+
This is the same as ``dtype='string'`` *when PyArrow is not installed*.
740+
741+
The implementation uses a NumPy object array, which directly stores the
742+
Python string objects, hence why the storage here is called ``'python'``.
743+
NA values in this array are represented and behave as ``pd.NA``.
744+
745+
.. ipython:: python
746+
747+
pd.Series(
748+
["a", "b", None, np.nan, pd.NA],
749+
dtype=pd.StringDtype(storage="python", na_value=pd.NA)
750+
)
751+
752+
Notice that the last three values are all inferred by pandas as
753+
being NA values, and hence stored as ``pd.NA``.
754+
755+
PyArrow storage with ``pd.NA`` values
756+
-------------------------------------
757+
758+
.. note::
759+
This is the same as ``dtype='string'`` *when PyArrow is installed*.
760+
761+
The implementation uses a PyArrow array. NA values in this array are
762+
represented and behave as ``pd.NA``.
763+
764+
.. ipython:: python
765+
766+
pd.Series(
767+
["a", "b", None, np.nan, pd.NA],
768+
dtype=pd.StringDtype(storage="pyarrow", na_value=pd.NA)
769+
)
770+
771+
Notice that the last three values are all inferred by pandas as being NA
772+
values, and hence stored as ``pd.NA``.
773+
826774
Method summary
827775
==============
828776
