@@ -18,7 +18,7 @@ Text data types
1818There are two ways to store text data in pandas:
1919
20201. :class: `StringDtype ` extension type.
21- 2. ``object `` dtype NumPy array .
21+ 2. NumPy ``object `` dtype.
2222
2323We recommend using :class: `StringDtype ` to store text data via the alias
2424``dtype="str" `` (the default when dtype of strings is inferred), see
@@ -87,153 +87,16 @@ or convert from existing pandas data:
8787 type (s2[0 ])
8888
8989 However, there are four distinct :class:`StringDtype` variants that may be used.
90-
91- Python storage with ``np.nan `` values
92- -------------------------------------
93-
94- .. note ::
95- This is the same as ``dtype='str' `` *when PyArrow is not installed *.
96-
97- The implementation uses a NumPy object array, which directly stores the
98- Python string objects, hence why the storage here is called ``'python' ``.
99- NA values in this array are stored using ``np.nan ``.
100-
101- .. ipython :: python
102-
103- pd.Series(
104- [" a" , " b" , None , np.nan, pd.NA ],
105- dtype = pd.StringDtype(storage = " python" , na_value = np.nan)
106- )
107-
108- Notice that the last three values are all inferred by pandas as being
109- an NA values, and hence stored as ``np.nan ``.
110-
111- PyArrow storage with ``np.nan `` values
112- --------------------------------------
113-
114- .. note ::
115- This is the same as ``dtype='str' `` *when PyArrow is installed *.
116-
117- The implementation uses a PyArrow array, however NA values in this array
118- are stored using ``np.nan ``.
119-
120- .. ipython :: python
121-
122- pd.Series(
123- [" a" , " b" , None , np.nan, pd.NA ],
124- dtype = pd.StringDtype(storage = " pyarrow" , na_value = np.nan)
125- )
126-
127- Notice that the last three values are all inferred by pandas as being
128- an NA values, and hence stored as ``np.nan ``.
129-
130- Python storage with ``pd.NA `` values
131- ------------------------------------
132-
133- .. note ::
134- This is the same as ``dtype='string' `` *when PyArrow is not installed *.
135-
136- The implementation uses a NumPy object array, which directly stores the
137- Python string objects, hence why the storage here is called ``'python' ``.
138- NA values in this array are stored using ``np.nan ``.
139-
140- .. ipython :: python
141-
142- pd.Series(
143- [" a" , " b" , None , np.nan, pd.NA ],
144- dtype = pd.StringDtype(storage = " python" , na_value = pd.NA )
145- )
146-
147- Notice that the last three values are all inferred by pandas as
148- being an NA values, and hence stored as ``pd.NA ``.
149-
150- PyArrow storage with ``pd.NA `` values
151- -------------------------------------
152-
153- .. note ::
154- This is the same as ``dtype='string' `` *when PyArrow is installed *.
155-
156- The implementation uses a PyArrow array. NA values in this array are
157- stored using ``pd.NA ``.
158-
159- .. ipython :: python
160-
161- pd.Series(
162- [" a" , " b" , None , np.nan, pd.NA ],
163- dtype = pd.StringDtype(storage = " python" , na_value = pd.NA )
164- )
165-
166- Notice that the last three values are all inferred by pandas as being an NA
167- values, and hence stored as ``pd.NA ``.
90+ See the :ref:`text.four_string_variants` section below for details.
16891
16992.. _text.differences :
17093
17194Behavior differences
17295====================
17396
174- ``StringDtype `` with ``np.nan `` NA values
175- -----------------------------------------
176-
177- 1. Like ``dtype="object" ``, :ref: `string accessor methods<api.series.str> `
178- that return **integer ** output will return a NumPy array that is
179- either dtype int or float depending on the presence of NA values.
180- Methods returning **boolean ** output will return a NumPy array this is
181- dtype bool, with the value ``False `` when an NA value is encountered.
182-
183- .. ipython :: python
184-
185- s = pd.Series([" a" , None , " b" ], dtype = " str" )
186- s
187- s.str.count(" a" )
188- s.dropna().str.count(" a" )
189-
190- When NA values are present, the output dtype is float64. However
191- **boolean ** output results in ``False `` for the NA values.
192-
193- .. ipython :: python
194-
195- s.str.isdigit()
196- s.str.match(" a" )
197-
198- 2. Some string methods, like :meth: `Series.str.decode `, are not
199- available because the underlying array can only contain
200- strings, not bytes.
201- 3. Comparison operations will return a NumPy array with dtype bool. Missing
202- values will always compare as unequal just as :attr: `np.nan ` does.
203-
204- ``StringDtype `` with ``pd.NA `` NA values
205- ----------------------------------------
206-
207- 1. :ref: `String accessor methods<api.series.str> `
208- that return **integer ** output will always return a nullable integer dtype,
209- rather than either int or float dtype (depending on the presence of NA values).
210- Methods returning **boolean ** output will return a nullable boolean dtype.
211-
212- .. ipython :: python
213-
214- s = pd.Series([" a" , None , " b" ], dtype = " string" )
215- s
216- s.str.count(" a" )
217- s.dropna().str.count(" a" )
218-
219- Both outputs are ``Int64 `` dtype. Similarly for methods returning boolean values.
220-
221- .. ipython :: python
222-
223- s.str.isdigit()
224- s.str.match(" a" )
225-
226- 2. Some string methods, like :meth: `Series.str.decode ` because the underlying
227- array can only contain strings, not bytes.
228- 3. Comparison operations will return an object with :class: `BooleanDtype `,
229- rather than a ``bool `` dtype object. Missing values will propagate
230- in comparison operations, rather than always comparing
231- unequal like :attr: `numpy.nan `.
232-
233-
234- .. important ::
235- Everything else that follows in the rest of this document applies equally to
236- ``'str' ``, ``'string' ``, and ``object `` dtype.
97+ There are several behavior differences between NumPy ``object`` dtype,
98+ ``dtype="str"``, and ``dtype="string"``. See the
99+ :ref:`String migration guide <string_migration_guide-differences>` section for further details.
237100
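As a rough illustration of the kind of difference meant here, the sketch below compares NumPy ``object`` dtype with ``dtype="string"`` for an integer-returning string method and for a comparison. ``dtype="str"`` is left out on purpose, because what that alias resolves to depends on the installed pandas version. The variable names are illustrative only.

```python
import pandas as pd

# Same data, stored once as NumPy object dtype and once as the
# pd.NA-based StringDtype (alias "string").
obj = pd.Series(["a", None, "b"], dtype=object)
nullable = pd.Series(["a", None, "b"], dtype="string")

# Integer-returning string methods: object dtype falls back to float64
# when a missing value is present, while "string" keeps a nullable Int64.
counts_obj = obj.str.count("a")
counts_nullable = nullable.str.count("a")
print(counts_obj.dtype, counts_nullable.dtype)

# Comparisons: object dtype gives a plain bool result in which the missing
# value compares unequal, while "string" gives BooleanDtype and propagates
# the missing value.
eq_obj = obj == "a"
eq_nullable = nullable == "a"
print(eq_obj.dtype, eq_nullable.dtype)
print(eq_nullable.isna().tolist())
```

Checking ``.dtype`` on the result of a string method is a quick way to see which of the behaviors applies to a given Series.
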
238101.. _text.string_methods :
239102
@@ -823,6 +686,91 @@ String ``Index`` also supports ``get_dummies`` which returns a ``MultiIndex``.
823686
824687 See also :func: `~pandas.get_dummies `.
825688
689+ .. _text.four_string_variants:
690+
691+ The four :class:`StringDtype` variants
692+ ======================================
693+
694+ There are four :class:`StringDtype` variants available to users.
695+
696+ Python storage with ``np.nan `` values
697+ -------------------------------------
698+
699+ .. note::
700+     This is the same as ``dtype='str'`` *when PyArrow is not installed*.
701+
702+ The implementation uses a NumPy object array that directly stores the
703+ Python string objects, which is why the storage here is called ``'python'``.
704+ NA values in this array are represented and behave as ``np.nan``.
705+
706+ .. ipython:: python
707+
708+     pd.Series(
709+         ["a", "b", None, np.nan, pd.NA],
710+         dtype=pd.StringDtype(storage="python", na_value=np.nan)
711+     )
712+
713+ Notice that the last three values are all inferred by pandas as being
714+ NA values, and hence stored as ``np.nan``.
715+
716+ PyArrow storage with ``np.nan `` values
717+ --------------------------------------
718+
719+ .. note::
720+     This is the same as ``dtype='str'`` *when PyArrow is installed*.
721+
722+ The implementation uses a PyArrow array; however, NA values in this array
723+ are represented and behave as ``np.nan``.
724+
725+ .. ipython:: python
726+
727+     pd.Series(
728+         ["a", "b", None, np.nan, pd.NA],
729+         dtype=pd.StringDtype(storage="pyarrow", na_value=np.nan)
730+     )
731+
732+ Notice that the last three values are all inferred by pandas as being
733+ NA values, and hence stored as ``np.nan``.
734+
735+ Python storage with ``pd.NA `` values
736+ ------------------------------------
737+
738+ .. note::
739+     This is the same as ``dtype='string'`` *when PyArrow is not installed*.
740+
741+ The implementation uses a NumPy object array that directly stores the
742+ Python string objects, which is why the storage here is called ``'python'``.
743+ NA values in this array are represented and behave as ``pd.NA``.
744+
745+ .. ipython:: python
746+
747+     pd.Series(
748+         ["a", "b", None, np.nan, pd.NA],
749+         dtype=pd.StringDtype(storage="python", na_value=pd.NA)
750+     )
751+
752+ Notice that the last three values are all inferred by pandas as
753+ being NA values, and hence stored as ``pd.NA``.
754+
755+ PyArrow storage with ``pd.NA `` values
756+ -------------------------------------
757+
758+ .. note::
759+     This is the same as ``dtype='string'`` *when PyArrow is installed*.
760+
761+ The implementation uses a PyArrow array. NA values in this array are
762+ represented and behave as ``pd.NA``.
763+
764+ .. ipython:: python
765+
766+     pd.Series(
767+         ["a", "b", None, np.nan, pd.NA],
768+         dtype=pd.StringDtype(storage="pyarrow", na_value=pd.NA)
769+     )
770+
771+ Notice that the last three values are all inferred by pandas as being NA
772+ values, and hence stored as ``pd.NA``.
773+
826774Method summary
827775==============
828776