@@ -35,14 +35,21 @@ for many reasons:
 3. When reading code, the contents of an ``object`` dtype array is less clear
    than ``'string'``.

-Currently, the performance of ``object`` dtype arrays of strings and
-:class:`arrays.StringArray` are about the same. We expect future enhancements
+When using :class:`StringDtype` with PyArrow as the storage (see below),
+users will see large performance improvements in memory as well as time
+for certain operations when compared to ``object`` dtype arrays. When
+not using PyArrow as the storage, the performance of :class:`StringDtype`
+is about the same as that of ``object``. We expect future enhancements
 to significantly increase the performance and lower the memory overhead of
-:class:`~arrays.StringArray`.
+:class:`StringDtype` in this case.

 .. versionchanged:: 3.0

-   The default when pandas infers the dtype of a collection of strings is to use ``dtype='str'``.
+   The default when pandas infers the dtype of a collection of
+   strings is to use ``dtype='str'``. This will use ``np.nan``
+   as its NA value and be backed by a PyArrow string array when
+   PyArrow is installed, or backed by a NumPy ``object`` array
+   when PyArrow is not installed.

 .. ipython:: python

@@ -51,15 +58,17 @@ to significantly increase the performance and lower the memory overhead of
 Specifying :class:`StringDtype` explicitly
 ==========================================

-When it is desired to explicitly specify the dtype, we generally recommend using the alias ``dtype="str"``.
+When it is desired to explicitly specify the dtype, we generally recommend
+using the alias ``dtype="str"`` if you desire to have ``np.nan`` as the NA
+value or the alias ``dtype="string"`` if you desire to have ``pd.NA`` as
+the NA value.

 .. ipython:: python

-   pd.Series(["a", "b", "c"], dtype="str")
+   pd.Series(["a", "b", None], dtype="str")
+   pd.Series(["a", "b", None], dtype="string")

-However there are four distinct :class:`StringDtype` variants that may be utilized.
-You can also use :class:`StringDtype`/``"str"``/``"string"`` as the dtype
-on non-string data and it will be converted to strings:
+Specifying either alias will also convert non-string data to strings:

 .. ipython:: python

@@ -73,10 +82,12 @@ or convert from existing pandas data:

    s1 = pd.Series([1, 2, pd.NA], dtype="Int64")
    s1
-   s2 = s1.astype("str")
+   s2 = s1.astype("string")
    s2
    type(s2[0])

+However there are four distinct :class:`StringDtype` variants that may be utilized.
+
 Python storage with ``np.nan`` values
 -------------------------------------

@@ -184,15 +195,16 @@ Behavior differences
    s.str.isdigit()
    s.str.match("a")

-2. Some string methods, like :meth:`Series.str.decode` because the underlying
-   array can only contain strings, not bytes.
+2. Some string methods, like :meth:`Series.str.decode`, are not
+   available because the underlying array can only contain
+   strings, not bytes.
 3. Comparison operations will return a NumPy array with dtype bool. Missing
-   values will always compare as unequal just as :attr:`numpy.nan` does.
+   values will always compare as unequal just as :attr:`np.nan` does.

 ``StringDtype`` with ``pd.NA`` NA values
 ----------------------------------------

-1. For ``StringDtype``, :ref:`string accessor methods<api.series.str>`
+1. :ref:`String accessor methods<api.series.str>`
    that return **integer** output will always return a nullable integer dtype,
    rather than either int or float dtype (depending on the presence of NA values).
    Methods returning **boolean** output will return a nullable boolean dtype.
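
The behavior described in the last hunk for the ``pd.NA`` variant can be checked with a short script. This is a minimal sketch, not part of the commit: the sample values are made up for illustration, and it assumes a pandas version in which the ``"string"`` alias selects the ``pd.NA``-backed ``StringDtype``.

```python
import pandas as pd

# Assumption: dtype="string" selects the StringDtype variant that uses pd.NA.
# The values below are illustrative only.
s = pd.Series(["a", "12", None], dtype="string")

lengths = s.str.len()        # accessor method with integer output
is_digit = s.str.isdigit()   # accessor method with boolean output

# Both results use nullable dtypes, so the missing entry stays missing
# instead of forcing a float or object result.
print(lengths.dtype)
print(is_digit.dtype)
print(lengths.isna().tolist())
```

The missing value propagates through both accessor calls, which is why the result dtype does not depend on whether NA values are present.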