Distribution

Distributor classes for clustering-based oversampling.

DensityDistributor(filtering_threshold='auto', distances_exponent='auto', sparsity_based=True, distribution_ratio=1.0)

Bases: BaseDistributor

Class to perform density based distribution.

Samples are distributed based on the density of clusters.

Read more in the [user_guide].

Parameters:

Name Type Description Default
filtering_threshold float | str

The threshold of a filtered cluster. It can be any non-negative number or 'auto' to be calculated automatically. Any cluster that has an imbalance ratio smaller than the filtering threshold is identified as a filtered cluster and can potentially be used to generate minority class instances. Higher values increase the number of filtered clusters.

  • If 'auto', the filtering threshold is calculated from the imbalance ratio of the target in the binary case, or from the maximum of the target's imbalance ratios in the multiclass case.

  • If float, it is set manually to this number.

'auto'
distances_exponent float | str

The exponent of the mean distance in the density calculation. It can be any non-negative number or 'auto' to be calculated automatically. Higher values make the density calculation more sensitive to the cluster's size, i.e. clusters with a large mean Euclidean distance between samples are penalized.

  • If 'auto', it is set equal to the number of features.

  • If float, it is set manually to this number.

'auto'
sparsity_based bool

Whether sparse clusters receive more generated samples.

  • When True, each cluster receives a number of generated samples inversely proportional to its density.

  • When False, each cluster receives a number of generated samples proportional to its density.

True
distribution_ratio float

The ratio of intra-cluster to inter-cluster generated samples. It is a number in the [0.0, 1.0] range. The default value is 1.0, which corresponds to intra-cluster generation only. As the number decreases, fewer intra-cluster samples are generated. Inter-cluster generation, i.e. when distribution_ratio is less than 1.0, requires a neighborhood structure for the clusters: a neighbors_ attribute should be created after fitting, and an error is raised when it is not found.

1.0
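
The interplay of filtering_threshold, distances_exponent and sparsity_based can be sketched in plain NumPy. This is an illustration under stated assumptions, not the library's implementation: the imbalance ratio is assumed to be the majority count divided by the minority count, the density is assumed to be the sample count divided by the mean pairwise Euclidean distance raised to the exponent, and all function names (auto_threshold, filtered_clusters, density, weights) are hypothetical.

```python
from collections import Counter

import numpy as np


def auto_threshold(y, majority_label):
    """Assumed 'auto' rule: the largest majority/minority count ratio."""
    counts = Counter(y)
    return max(
        counts[majority_label] / n
        for label, n in counts.items()
        if label != majority_label
    )


def filtered_clusters(labels, y, majority_label, threshold):
    """Return (cluster_label, class_label) pairs whose ratio is below threshold."""
    counts = Counter(zip(labels, y))
    selected = []
    for (cluster, cls), n_minority in counts.items():
        if cls == majority_label:
            continue
        # Assumed imbalance ratio of the cluster for this minority class.
        ratio = counts[(cluster, majority_label)] / n_minority
        if ratio < threshold:
            selected.append((cluster, cls))
    return selected


def density(X_cluster, exponent):
    """Assumed density: sample count over (mean pairwise distance ** exponent)."""
    diffs = X_cluster[:, None, :] - X_cluster[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(X_cluster)
    mean_dist = dists.sum() / (n * (n - 1))  # mean over off-diagonal pairs
    return n / mean_dist ** exponent


def weights(densities, sparsity_based=True):
    """Normalize densities (or their inverses) so the weights sum to 1."""
    raw = {k: (1 / d if sparsity_based else d) for k, d in densities.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}


labels = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 0, 1, 1]
threshold = auto_threshold(y, majority_label=0)
selected = filtered_clusters(labels, y, majority_label=0, threshold=threshold)
```

In this toy data only cluster 1 passes the filter for class 1, because its assumed ratio (2 / 2) falls below the threshold (5 / 3), while cluster 0 has a ratio of 3 / 1. With sparsity_based=True the weights function favors low-density clusters; with False it favors dense ones.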

Attributes:

Name Type Description
clusters_density_ Density

Each dict key is a multi-label tuple of shape (cluster_label, class_label), while the values correspond to the density.

distances_exponent_ float

Actual exponent of the mean distance used in the calculations.

distribution_ratio_ float

A copy of the parameter in the constructor.

filtered_clusters_ List[MultiLabel]

Each element is a (cluster_label, class_label) tuple.

filtering_threshold_ float

Actual filtering threshold used in the calculations.

inter_distribution_ InterDistribution

Each dict key is a multi-label tuple of shape ((cluster_label1, cluster_label2), class_label) while the values are the proportion of samples per class.

intra_distribution_ IntraDistribution

Each dict key is a multi-label tuple of shape (cluster_label, class_label) while the values are the proportion of samples per class.

labels_ Labels

Labels of each sample.

neighbors_ Neighbors

An array that contains all neighboring pairs. Each row is a unique neighboring pair.

majority_class_label_ int

The majority class label.

n_samples_ int

The number of samples.

sparsity_based_ bool

A copy of the parameter in the constructor.

unique_class_labels_ Labels

An array of unique class labels.

unique_cluster_labels_ Labels

An array of unique cluster labels.
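
To make the key shapes of intra_distribution_ and inter_distribution_ concrete, the following sketch shows one way such dictionaries could be combined with distribution_ratio to allocate a number of generated samples. It is an illustration only: split_samples is a hypothetical helper, and it assumes that the weights in each dictionary sum to one for a given class and have not already been scaled by the ratio, which may differ from what the fitted attributes actually store.

```python
def split_samples(n_samples, intra, inter, distribution_ratio):
    """Allocate n_samples across intra- and inter-cluster targets."""
    if not 0.0 <= distribution_ratio <= 1.0:
        raise ValueError("distribution_ratio must lie in [0.0, 1.0]")
    if distribution_ratio < 1.0 and not inter:
        # Mirrors the documented requirement of a neighborhood structure.
        raise ValueError("inter-cluster generation needs neighboring clusters")
    # Keys follow the documented shapes:
    #   intra: (cluster_label, class_label)
    #   inter: ((cluster_label1, cluster_label2), class_label)
    n_intra = {
        key: distribution_ratio * weight * n_samples
        for key, weight in intra.items()
    }
    n_inter = {
        key: (1 - distribution_ratio) * weight * n_samples
        for key, weight in inter.items()
    }
    return n_intra, n_inter


intra = {(0, 1): 0.75, (1, 1): 0.25}
inter = {((0, 1), 1): 1.0}
n_intra, n_inter = split_samples(40, intra, inter, distribution_ratio=0.5)
```

With distribution_ratio=1.0 and an empty inter dictionary, the same helper assigns every sample to intra-cluster generation, which matches the default behavior described for the parameter.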

Examples:

>>> from clover.distribution import DensityDistributor
>>> from sklearn.datasets import load_iris
>>> from sklearn.cluster import KMeans
>>> from imblearn.datasets import make_imbalance
>>> X, y = make_imbalance(
...     *load_iris(return_X_y=True),
...     sampling_strategy={0:50, 1:40, 2:30},
...     random_state=0
... )
>>> labels = KMeans(random_state=0, n_init='auto').fit_predict(X, y)
>>> density_distributor = DensityDistributor().fit(X, y, labels)
>>> density_distributor.filtered_clusters_
[(6, 1), (0, 1), (3, 1), (7, 1), (5, 2), (2, 2), (3, 2), (6, 2), (0, 2)]
>>> density_distributor.intra_distribution_
{(6, 1): 0.50604609281056... (0, 1): 0.143311766542165...}
>>> density_distributor.inter_distribution_
{}
Source code in src/clover/distribution/_density.py
def __init__(
    self: Self,
    filtering_threshold: float | str = 'auto',
    distances_exponent: float | str = 'auto',
    sparsity_based: bool = True,
    distribution_ratio: float = 1.0,
) -> None:
    self.filtering_threshold = filtering_threshold
    self.distances_exponent = distances_exponent
    self.sparsity_based = sparsity_based
    self.distribution_ratio = distribution_ratio