Distribution

Distributor classes for clustering-based oversampling.

`DensityDistributor(filtering_threshold='auto', distances_exponent='auto', sparsity_based=True, distribution_ratio=1.0)`

Bases: BaseDistributor

Class to perform density-based distribution.

Samples are distributed based on the density of clusters.

Read more in the [user_guide].

Parameters:

Name	Type	Description	Default
`filtering_threshold`	`float | str`	The threshold used to identify filtered clusters. It can be any non-negative number or `'auto'`. If `'auto'`, the filtering threshold is calculated from the imbalance ratio of the target in the binary case, or from the maximum of the target's imbalance ratios in the multiclass case. If a `float`, it is used as given. Any cluster whose imbalance ratio is smaller than the filtering threshold is identified as a filtered cluster and can potentially be used to generate minority class instances. Higher values increase the number of filtered clusters.	`'auto'`
`distances_exponent`	`float | str`	The exponent of the mean distance in the density calculation. It can be any non-negative number or `'auto'`. If `'auto'`, it is set equal to the number of features. If a `float`, it is used as given. Higher values make the density calculation more sensitive to the cluster's size, i.e. clusters with a large mean Euclidean distance between samples are penalized.	`'auto'`
`sparsity_based`	`bool`	Whether sparse clusters receive more generated samples. When `True`, the number of generated samples a cluster receives is inversely proportional to its density. When `False`, it is proportional to its density.	`True`
`distribution_ratio`	`float`	The share of generated samples that are intra-cluster; the remainder are inter-cluster. It is a number in the `[0.0, 1.0]` range. The default value is `1.0`, which corresponds to intra-cluster generation only. As the number decreases, fewer intra-cluster samples are generated. Inter-cluster generation, i.e. a `distribution_ratio` less than `1.0`, requires a neighborhood structure for the clusters: a `neighbors_` attribute must be available after fitting, and an error is raised when it is not found.	`1.0`
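
To make the interaction between `distances_exponent` and `sparsity_based` concrete, here is a minimal pure-Python sketch. It does not use the library and is not taken from its source. It assumes, for illustration only, that a cluster's density is its number of minority samples divided by the mean pairwise Euclidean distance raised to `distances_exponent`. The helper names `mean_pairwise_distance` and `intra_distribution` and the two toy clusters are hypothetical.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of points."""
    pairs = list(combinations(points, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def intra_distribution(clusters, distances_exponent, sparsity_based=True):
    """Map each cluster label to its share of generated samples.

    Assumed density (illustration only):
        n_minority / mean_distance ** distances_exponent
    """
    weights = {}
    for label, minority_points in clusters.items():
        mean_distance = mean_pairwise_distance(minority_points)
        density = len(minority_points) / mean_distance**distances_exponent
        # Sparse clusters get more samples when sparsity_based is True.
        weights[label] = 1 / density if sparsity_based else density
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


# Two hypothetical clusters with four minority samples each, in 2 features.
# "spread" has the same layout as "tight", scaled by a factor of 4.
clusters = {
    "tight": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)],
    "spread": [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)],
}
sparse_first = intra_distribution(clusters, distances_exponent=2, sparsity_based=True)
dense_first = intra_distribution(clusters, distances_exponent=2, sparsity_based=False)
```

Under this assumed formula, scaling a cluster by 4 with an exponent of 2 divides its density by 16, so `sparse_first` assigns 16/17 of the samples to `"spread"`, while `dense_first` assigns 16/17 to `"tight"`. A larger exponent widens that gap further.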

Attributes:

Name	Type	Description
`clusters_density_`	`Density`	Each dict key is a multi-label tuple of shape `(cluster_label, class_label)`, while the values correspond to the density.
`distances_exponent_`	`float`	Actual exponent of the mean distance used in the calculations.
`distribution_ratio_`	`float`	A copy of the parameter in the constructor.
`filtered_clusters_`	`List[MultiLabel]`	Each element is a `(cluster_label, class_label)` tuple.
`filtering_threshold_`	`float`	Actual filtering threshold used in the calculations.
`inter_distribution_`	`InterDistribution`	Each dict key is a multi-label tuple of shape `((cluster_label1, cluster_label2), class_label)` while the values are the proportion of samples per class.
`intra_distribution_`	`IntraDistribution`	Each dict key is a multi-label tuple of shape `(cluster_label, class_label)` while the values are the proportion of samples per class.
`labels_`	`Labels`	Labels of each sample.
`neighbors_`	`Neighbors`	An array that contains all neighboring pairs. Each row is a unique neighboring pair.
`majority_class_label_`	`int`	The majority class label.
`n_samples_`	`int`	The number of samples.
`sparsity_based_`	`bool`	A copy of the parameter in the constructor.
`unique_class_labels_`	`Labels`	An array of unique class labels.
`unique_cluster_labels_`	`Labels`	An array of unique cluster labels.
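
Since `intra_distribution_` and `inter_distribution_` hold proportions rather than counts, a consumer has to turn them into whole numbers of samples. The sketch below shows one possible way to do that with largest-remainder rounding. How the library's oversamplers actually round is not stated here, so treat this as an assumption. The helper `allocate_samples` and the example proportions are hypothetical.

```python
def allocate_samples(distribution, n_samples):
    """Split n_samples across keys in proportion to the distribution values.

    Uses largest-remainder rounding so the counts sum exactly to n_samples.
    """
    raw = {key: share * n_samples for key, share in distribution.items()}
    # Start from the floor of each raw allocation (shares are non-negative).
    counts = {key: int(value) for key, value in raw.items()}
    leftover = n_samples - sum(counts.values())
    # Hand the remaining samples to the keys with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda key: raw[key] - counts[key], reverse=True)
    for key in by_remainder[:leftover]:
        counts[key] += 1
    return counts


# Hypothetical proportions for class 1, keyed by (cluster_label, class_label).
intra = {(0, 1): 0.5, (1, 1): 0.3, (2, 1): 0.2}
counts = allocate_samples(intra, n_samples=11)
```

Here the raw allocations are 5.5, 3.3 and 2.2, so the one leftover sample goes to cluster `0`, which has the largest fractional part.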

Examples:

>>> from clover.distribution import DensityDistributor
>>> from sklearn.datasets import load_iris
>>> from sklearn.cluster import KMeans
>>> from imblearn.datasets import make_imbalance
>>> X, y = make_imbalance(
...     *load_iris(return_X_y=True),
...     sampling_strategy={0:50, 1:40, 2:30},
...     random_state=0
... )
>>> labels = KMeans(random_state=0, n_init='auto').fit_predict(X, y)
>>> density_distributor = DensityDistributor().fit(X, y, labels)
>>> density_distributor.filtered_clusters_
[(6, 1), (0, 1), (3, 1), (7, 1), (5, 2), (2, 2), (3, 2), (6, 2), (0, 2)]
>>> density_distributor.intra_distribution_
{(6, 1): 0.50604609281056... (0, 1): 0.143311766542165...}
>>> density_distributor.inter_distribution_
{}

Source code in src/clover/distribution/_density.py

def __init__(
    self: Self,
    filtering_threshold: float | str = 'auto',
    distances_exponent: float | str = 'auto',
    sparsity_based: bool = True,
    distribution_ratio: float = 1.0,
) -> None:
    self.filtering_threshold = filtering_threshold
    self.distances_exponent = distances_exponent
    self.sparsity_based = sparsity_based
    self.distribution_ratio = distribution_ratio