@@ -160,6 +160,44 @@ some variant of the SMOTE algorithm::
160160 >>> print(sorted(Counter(y_resampled).items()))
161161 [(0, 4674), (1, 4674), (2, 4674)]
162162
163+ When dealing with mixed data types, such as continuous and categorical features,
164+ none of the methods presented so far (apart from :class:`RandomOverSampler`)
165+ can handle the categorical features. :class:`SMOTENC` [CBHK2002]_ is an
166+ extension of the :class:`SMOTE` algorithm in which categorical data are
167+ treated differently::
168+
169+ >>> # create a synthetic data set with continuous and categorical features
170+ >>> rng = np.random.RandomState(42)
171+ >>> n_samples = 50
172+ >>> X = np.empty((n_samples, 3), dtype=object)
173+ >>> X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)
174+ >>> X[:, 1] = rng.randn(n_samples)
175+ >>> X[:, 2] = rng.randint(3, size=n_samples)
176+ >>> y = np.array([0] * 20 + [1] * 30)
177+ >>> print(sorted(Counter(y).items()))
178+ [(0, 20), (1, 30)]
179+
180+ In this data set, the first and last features are treated as categorical
181+ features. This information must be provided to :class:`SMOTENC` via the
182+ ``categorical_features`` parameter, either by passing the indices of these
183+ features or a boolean mask marking them::
184+
185+ >>> from imblearn.over_sampling import SMOTENC
186+ >>> smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
187+ >>> X_resampled, y_resampled = smote_nc.fit_resample(X, y)
188+ >>> print(sorted(Counter(y_resampled).items()))
189+ [(0, 30), (1, 30)]
190+ >>> print(X_resampled[-5:])
191+ [['B' 0.1989993778979113 0]
192+ ['A' -0.3657680728116921 1]
193+ ['B' 0.8790828729585258 0]
194+ ['A' 0.3710891618824609 0]
195+ ['A' 0.3327240726719727 0]]
196+
197+ The values generated in the first and last columns therefore belong to the
198+ categories originally present in the data; no interpolation is applied to
199+ these columns.
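
As a minimal sketch of the boolean-mask alternative mentioned above, the index list ``[0, 2]`` can be turned into a mask with one entry per feature. Only NumPy is used here; the variable names are illustrative and not part of the library:

```python
import numpy as np

n_features = 3
categorical_indices = [0, 2]  # first and last columns, as in the example above

# Boolean mask with one entry per feature; True marks a categorical column.
categorical_mask = np.zeros(n_features, dtype=bool)
categorical_mask[categorical_indices] = True

print(categorical_mask.tolist())
```

Such a mask marks the same columns as the index list and could be passed as ``categorical_features`` in its place.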
200+
163201.. topic:: References
164202
165203 .. [HWB2005] H. Han, W. Wen-Yuan, M. Bing-Huan, "Borderline-SMOTE: a new
@@ -198,8 +236,13 @@ interpolation will create a sample on the line between :math:`x_{i}` and
198236 :scale: 60
199237 :align: center
200238
201- Each SMOTE variant and ADASYN differ from each other by selecting the samples
202- :math:`x_i` ahead of generating the new samples.
239+ SMOTE-NC slightly changes the way a new sample is generated by handling the
240+ categorical features separately. For each categorical feature, the category
241+ of the newly generated sample is the most frequent category among the
242+ nearest neighbors used during the generation.
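
The generation step described above can be illustrated with a simplified sketch. It assumes the nearest neighbors have already been found, and the helper ``sketch_smote_nc_sample`` is hypothetical: it is not the implementation used by :class:`SMOTENC`, only an illustration of interpolating continuous features while taking a majority vote for categorical ones:

```python
from collections import Counter

import numpy as np


def sketch_smote_nc_sample(x_i, neighbors, categorical_idx, rng):
    """Hypothetical helper: build one synthetic sample from x_i and its neighbors."""
    n_features = len(x_i)
    new_sample = np.empty(n_features, dtype=object)
    # Continuous features: interpolate between x_i and one random neighbor.
    x_nn = neighbors[rng.randint(len(neighbors))]
    lam = rng.uniform(0, 1)
    for j in range(n_features):
        if j in categorical_idx:
            # Categorical features: most frequent category among the neighbors.
            new_sample[j] = Counter(neighbors[:, j]).most_common(1)[0][0]
        else:
            new_sample[j] = x_i[j] + lam * (x_nn[j] - x_i[j])
    return new_sample


rng = np.random.RandomState(0)
x_i = np.array(['A', 0.0, 1], dtype=object)
neighbors = np.array([['A', 1.0, 0],
                      ['B', 2.0, 0],
                      ['A', 3.0, 0]], dtype=object)
new_sample = sketch_smote_nc_sample(x_i, neighbors, categorical_idx=[0, 2], rng=rng)
```

In this toy input, the categorical columns of ``new_sample`` take the majority values ``'A'`` and ``0``, while the continuous column lies between the value of ``x_i`` and that of the chosen neighbor.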
243+
244+ The other SMOTE variants and ADASYN differ from each other in how they select
245+ the samples :math:`x_i` before generating the new samples.
203246
204247The **regular** SMOTE algorithm --- cf. to the :class:`SMOTE` object --- does not
205248impose any rule and will randomly pick-up all possible :math:`x_i` available.