User guide
The package cluster-over-sampling
extends the functionality of imbalanced-learn oversamplers by introducing the
ClusterOverSampler
class, while KMeansSMOTE
and SOMO
classes are provided for convinience. The distribution of the generated
samples to the clusters is controled by the distributor
parameter with DensityDistributor
being an example of distribution
that is based on the density of the clusters.
Initially, we generate multi-class imbalanced data represented by the imput data X
and targets y
:
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_classes=3, weights=[0.10, 0.10, 0.80], random_state=0, n_informative=10)
>>> print(sorted(Counter(y).items()))
[(0, 10), (1, 10), (2, 80)]
Below we provided some examples of the cluster-over-sampling
functionality.
KMeans-SMOTE algorithm
KMeans-SMOTE[^2] algorithm is a combination of KMeans clusterer and SMOTE oversampler and it is implemented by the
KMeansSMOTE
class. We initialize it with the default parameters and use it to resample the data:
>>> from clover.over_sampling import KMeansSMOTE
>>> kmeans_smote = KMeansSMOTE(random_state=5)
>>> X_resampled, y_resampled = kmeans_smote.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 80), (1, 80), (2, 80)]
The augmented data set can be used instead of the original data set to train a classifier:
```python
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier()
>>> clf.fit(X_resampled, y_resampled)
Combining clusterers and oversamplers
The ClusterOverSampler
class allows to combine imbalanced-learn oversamplers with scikit-learn clusterers. This achieved
through the use of the parameters oversampler
and clusterer
. For example, we can select BorderlineSMOTE as the oversampler
and DBSCAN as the clustering algorithm:
>>> from sklearn.cluster import DBSCAN
>>> from imblearn.over_sampling import BorderlineSMOTE
>>> from clover.over_sampling import ClusterOverSampler
>>> dbscan_bsmote = ClusterOverSampler(oversampler=BorderlineSMOTE(random_state=5), clusterer=DBSCAN())
>>> X_resampled, y_resampled = dbscan_bsmote.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 80), (1, 80), (2, 80)]
Additionally, if the clusterer supports a neighboring structure for the clusters through a neighbors_
attribute, then it can
be used to generate inter-cluster artificial data similarly to SOMO[^1] and G-SOMO[^3] algorithms.
Adjusting the distribution of generated samples
The parameter distributor
of the ClusterOverSampler
is used to define the distribution of the generated samples to the
clusters. The DensityDistributor
class implements a density based distribution and it is the default distributor
for all
objects of the ClusterOverSampler
class:
>>> from sklearn.cluster import AgglomerativeClustering
>>> from imblearn.over_sampling import SMOTE
>>> agg_smote = ClusterOverSampler(oversampler=SMOTE(random_state=5), clusterer=AgglomerativeClustering())
>>> agg_smote.fit(X, y)
>>> agg_smote.distributor_
DensityDistributor()
The DensityDistributor
objects can be parametrized:
>>> distributor = DensityDistributor(distances_exponent=0)
In order to distribute the samples a labels
parameter is required, while neighbors
is optional:
>>> from sklearn.cluster import KMeans
>>> clusterer = KMeans(n_clusters=4, random_state=1).fit(X, y)
>>> labels = clusterer.labels_
The distribution samples of the samples is provided by the fit_distribute
method and it is described in the intra_distribution
and inter_distribution
dictionaries:
>>> intra_distribution, inter_distribution = distributor.fit_distribute(X, y, labels, neighbors=None)
>>> print(distributor.filtered_clusters_)
[(0, 1), (1, 0), (1, 1)]
>>> print(distributor.clusters_density_)
{(0, 1): 3.0, (1, 0): 7.0, (1, 1): 7.0}
>>> print(intra_distribution)
{(0, 1): 0.7, (1, 0): 1.0, (1, 1): 0.3}
>>> print(inter_distribution)
{}
The keys of the above dictionaries are tuples of (cluster_label, class_label)
shape, while their values are proportions of the
total generated samples for the particular class. For example (0, 1): 0.7
means that 70% of samples of class 1
will be
generated in the cluster 0
. Any other distributor can be defined by extending the BaseDistributor
class.
Compatibility
The API of cluster-over-sampling
is fully compatible to imbalanced-learn. Any oversampler from cluster-over-sampling that does
not use clustering, i.e. when clusterer=None
, is equivalent to the corresponding imbalanced-learn oversampler:
>>> import numpy as np
>>> X_res_im, y_res_im = SMOTE(random_state=5).fit_resample(X, y)
>>> X_res_cl, y_res_cl = ClusterOverSampler(SMOTE(random_state=5), clusterer=None).fit_resample(X, y)
>>> np.testing.assert_equal(X_res_im, X_res_cl)
>>> np.testing.assert_equal(y_res_im, y_res_cl)
References
[^1]: G. Douzas, F. Bacao, "Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning", Expert Systems with Applications, vol. 82, pp. 40-52, 2017. [^2]: G. Douzas, F. Bacao, F. Last, "Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE", Information Sciences, vol. 465, pp. 1-20, 2018. [^3]: G. Douzas, F. Bacao, F. Last, "G-SOMO: An oversampling approach based on self-organized maps and geometric SMOTE", Expert Systems with Applications, vol. 183,115230, 2021.