
Geometric SMOTE

Class to perform over-sampling using Geometric SMOTE.

GeometricSMOTE(sampling_strategy='auto', k_neighbors=5, truncation_factor=1.0, deformation_factor=0.0, selection_strategy='combined', categorical_features=None, random_state=None, n_jobs=1)

Bases: BaseOverSampler


This algorithm is an implementation of Geometric SMOTE, a geometrically enhanced drop-in replacement for SMOTE. Read more in the user guide.

Parameters:

Name Type Description Default
categorical_features ArrayLike | None

Specifies which features are categorical. Can be either:

- an array of indices specifying the categorical features.

- a mask array of shape (n_features,) and `bool` dtype, in which `True` marks the categorical features.
None
sampling_strategy dict[int, int] | str

Sampling information to resample the data set.

  • When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. It is only available for binary classification.

  • When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:

    • 'minority': resample only the minority class.
    • 'not minority': resample all classes but the minority class.
    • 'not majority': resample all classes but the majority class.
    • 'all': resample all classes.
    • 'auto': equivalent to 'not majority'.
  • When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.

  • When callable, a function that takes y and returns a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.

'auto'
random_state RandomState | int | None

Control the randomization of the algorithm.

  • If int, it is the seed used by the random number generator.
  • If np.random.RandomState instance, it is the random number generator.
  • If None, the random number generator is the RandomState instance used by np.random.
None
truncation_factor float

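The type of truncation. The values should be in the [-1.0, 1.0] range. A value of 0.0 applies no truncation, while positive and negative values reflect generated points toward or away from the surface point, respectively.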

1.0
deformation_factor float

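The type of geometry. The values should be in the [0.0, 1.0] range. A value of 0.0 keeps the hypersphere, while 1.0 collapses it onto the line through the center and the surface point.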

0.0
selection_strategy str

The type of Geometric SMOTE algorithm with the following options: 'combined', 'majority', 'minority'.

'combined'
k_neighbors NearestNeighbors | int

If int, number of nearest neighbours to use when synthetic samples are constructed for the minority method. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin class that will be used to find the k_neighbors.

5
n_jobs int | None

The number of threads to open if possible.

1
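
As a rough illustration of how the forms of sampling_strategy described above translate into per-class target counts, the sketch below uses a hypothetical helper, resolve_sampling_strategy, written in plain Python. It is not the library's implementation. It only mirrors the documented rules, assuming that over-sampling equalizes each targeted class to the majority class count.

```python
from collections import Counter


def resolve_sampling_strategy(y, sampling_strategy="auto"):
    """Hypothetical helper: map a sampling strategy to target class counts."""
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    n_majority = counts[majority]

    if callable(sampling_strategy):
        # A callable receives y and returns the dict form directly.
        return dict(sampling_strategy(y))
    if isinstance(sampling_strategy, dict):
        return dict(sampling_strategy)
    if isinstance(sampling_strategy, float):
        if len(counts) != 2:
            raise ValueError("A float strategy is only defined for binary problems.")
        # Desired ratio of minority samples over majority samples.
        return {minority: int(sampling_strategy * n_majority)}

    if sampling_strategy == "auto":
        sampling_strategy = "not majority"
    targeted = {
        "minority": [minority],
        "not minority": [c for c in counts if c != minority],
        "not majority": [c for c in counts if c != majority],
        "all": list(counts),
    }[sampling_strategy]
    # Assumption: every targeted class is raised to the majority count.
    return {c: n_majority for c in targeted}


y = [0] * 100 + [1] * 900
print(resolve_sampling_strategy(y))            # {0: 900}
print(resolve_sampling_strategy(y, 0.5))       # {0: 450}
print(resolve_sampling_strategy(y, {0: 300}))  # {0: 300}
```

With the default 'auto', only class 0 is targeted here, which matches the resampled shape shown in the Examples section.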

Attributes:

Name Type Description
n_features_in_ int

Number of features in the input dataset.

nns_pos_ estimator object

Validated k-nearest neighbours created from the k_neighbors parameter. It is used to find the nearest neighbors of the same class of a selected observation.

nn_neg_ estimator object

Validated k-nearest neighbours created from the k_neighbors parameter. It is used to find the nearest neighbor of the remaining classes (k=1) of a selected observation.

random_state_ RandomState

An instance of np.random.RandomState class.

sampling_strategy_ dict[int, int]

Actual sampling strategy.

Examples:

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from gsmote import GeometricSMOTE
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> gsmote = GeometricSMOTE(random_state=1)
>>> X_resampled, y_resampled = gsmote.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_resampled))
Resampled dataset shape Counter({0: 900, 1: 900})
Source code in src/gsmote/geometric_smote.py
def __init__(
    self: Self,
    sampling_strategy: dict[int, int] | str = 'auto',
    k_neighbors: NearestNeighbors | int = 5,
    truncation_factor: float = 1.0,
    deformation_factor: float = 0.0,
    selection_strategy: str = 'combined',
    categorical_features: ArrayLike | None = None,
    random_state: np.random.RandomState | int | None = None,
    n_jobs: int | None = 1,
) -> None:
    """Initialize oversampler."""
    super().__init__(sampling_strategy=sampling_strategy)
    self.k_neighbors = k_neighbors
    self.truncation_factor = truncation_factor
    self.deformation_factor = deformation_factor
    self.selection_strategy = selection_strategy
    self.categorical_features = categorical_features
    self.random_state = random_state
    self.n_jobs = n_jobs

make_geometric_sample(center, surface_point, truncation_factor, deformation_factor, random_state)

A support function that returns an artificial point.

Parameters:

Name Type Description Default
center NDArray

The center point.

required
surface_point NDArray

The point on the surface of the hypersphere.

required
truncation_factor float

The truncation factor of the algorithm.

required
deformation_factor float

The deformation factor of the algorithm.

required
random_state RandomState

The random state of the process.

required

Returns:

Name Type Description
geometric_sample NDArray

The generated geometric sample.
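
Read off the function's source, the construction can be summarized as follows. The symbols are introduced here for illustration only: c is center, s is surface_point, d is the number of features, t is truncation_factor and δ is deformation_factor. When s equals c, the function returns c unchanged.

```latex
\begin{aligned}
R &= \lVert s - c \rVert, \qquad e = \frac{s - c}{R}, \\
v &= u^{1/d}\,\frac{z}{\lVert z \rVert}, \qquad z \sim \mathcal{N}(0, I_d),\; u \sim \mathcal{U}(0, 1), \\
v &\leftarrow v - 2\,(v \cdot e)\,e \quad \text{if } (t > 0 \text{ and } v \cdot e < t - 1) \text{ or } (t < 0 \text{ and } v \cdot e > t + 1), \\
v &\leftarrow (v \cdot e)\,e + (1 - \delta)\,\bigl(v - (v \cdot e)\,e\bigr), \\
x &= c + R\,v.
\end{aligned}
```

Under these formulas, t = 0 leaves the whole ball available, t = 1 reflects every point with a negative projection onto e, and δ = 1 removes the component perpendicular to e.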

Source code in src/gsmote/geometric_smote.py
def make_geometric_sample(
    center: NDArray,
    surface_point: NDArray,
    truncation_factor: float,
    deformation_factor: float,
    random_state: np.random.RandomState,
) -> NDArray:
    """A support function that returns an artificial point.

    Args:
        center:
            The center point.

        surface_point:
            The point on the surface of the hypersphere.

        truncation_factor:
            The truncation factor of the algorithm.

        deformation_factor:
            The deformation factor of the algorithm.

        random_state:
            The random state of the process.

    Returns:
        geometric_sample:
            The generated geometric sample.
    """

    # Zero radius case
    if np.array_equal(center, surface_point):
        return center

    # Generate a point on the surface of a unit hyper-sphere
    radius = norm(center - surface_point)
    normal_samples = random_state.normal(size=center.size)
    point_on_unit_sphere = normal_samples / norm(normal_samples)
    point: NDArray = (random_state.uniform(size=1) ** (1 / center.size)) * point_on_unit_sphere

    # Parallel unit vector
    parallel_unit_vector = (surface_point - center) / norm(surface_point - center)

    # Truncation
    close_to_opposite_boundary = truncation_factor > 0 and np.dot(point, parallel_unit_vector) < truncation_factor - 1
    close_to_boundary = truncation_factor < 0 and np.dot(point, parallel_unit_vector) > truncation_factor + 1
    if close_to_opposite_boundary or close_to_boundary:
        point -= 2 * np.dot(point, parallel_unit_vector) * parallel_unit_vector

    # Deformation
    parallel_point_position = np.dot(point, parallel_unit_vector) * parallel_unit_vector
    perpendicular_point_position = point - parallel_point_position
    point = parallel_point_position + (1 - deformation_factor) * perpendicular_point_position

    # Translation
    point = center + radius * point

    return point
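
To make the truncation and deformation steps above concrete, here is a minimal sketch in plain Python that applies them to a single fixed point rather than a random one. The helper names dot and truncate_and_deform are hypothetical and not part of the package, and the final translation step (center + radius * point) is left out.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def truncate_and_deform(point, unit_vector, truncation_factor, deformation_factor):
    """Hypothetical helper applying the truncation and deformation steps to one point."""
    projection = dot(point, unit_vector)

    # Truncation: reflect the point through the hyperplane orthogonal to unit_vector.
    flip_far_side = truncation_factor > 0 and projection < truncation_factor - 1
    flip_near_side = truncation_factor < 0 and projection > truncation_factor + 1
    if flip_far_side or flip_near_side:
        point = [p - 2 * projection * e for p, e in zip(point, unit_vector)]
        projection = -projection

    # Deformation: shrink the component perpendicular to unit_vector.
    parallel = [projection * e for e in unit_vector]
    return [
        par + (1 - deformation_factor) * (p - par)
        for p, par in zip(point, parallel)
    ]


e = [1.0, 0.0]   # unit vector from the center to the surface point
v = [-0.5, 0.5]  # a point in the unit disc, on the side away from the surface point

print(truncate_and_deform(v, e, 0.0, 0.0))   # [-0.5, 0.5]
print(truncate_and_deform(v, e, 1.0, 0.0))   # [0.5, 0.5]
print(truncate_and_deform(v, e, 1.0, 1.0))   # [0.5, 0.0]
print(truncate_and_deform(v, e, -1.0, 0.0))  # [-0.5, 0.5]
```

The second call shows the reflection toward the surface point, and the third shows a deformation factor of 1.0 collapsing the point onto the line through the center.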

populate_categorical_features(X_new, neighbors, categories_size, random_state)

A support function that populates categorical features.

Source code in src/gsmote/geometric_smote.py
def populate_categorical_features(
    X_new: NDArray,
    neighbors: NDArray,
    categories_size: list[int] | None,
    random_state: np.random.RandomState,
) -> NDArray:
    """A support function that populates categorical features."""
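    if categories_size is not None:
        # Consecutive cumulative sums delimit each one-hot encoded block of columns.
        for start_idx, end_idx in zip(
            np.cumsum(categories_size)[:-1],
            np.cumsum(categories_size)[1:],
        ):
            # Count how often each category occurs among the neighbors.
            col_maxs = neighbors[:, start_idx:end_idx].sum(axis=0)
            # Pick one of the most frequent categories, breaking ties at random.
            is_max = np.isclose(col_maxs, col_maxs.max(axis=0))
            max_idxs = random_state.permutation(np.argwhere(is_max))
            col_sels = max_idxs[0]
            ys = start_idx + col_sels
            # Reset the block and switch on the selected category.
            X_new[start_idx:end_idx] = 0
            X_new[ys] = 1
    return X_new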
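
The majority vote performed for each one-hot block can be sketched in plain Python as below. The helper majority_one_hot and the toy data are hypothetical. Because the source pairs consecutive cumulative sums of categories_size, columns before the first cumulative sum are never touched, so the sketch assumes the leading entry counts the continuous columns.

```python
import random


def majority_one_hot(neighbor_rows, start, end, rng):
    """Hypothetical helper: one-hot vector of the most frequent category in a block."""
    # Count how many neighbors have each category of the block switched on.
    votes = [sum(row[j] for row in neighbor_rows) for j in range(start, end)]
    best = max(votes)
    # Break ties uniformly at random among the most frequent categories.
    winner = rng.choice([j for j, v in enumerate(votes) if v == best])
    return [1 if j == winner else 0 for j in range(end - start)]


# Two continuous columns followed by one categorical feature with three categories.
neighbors = [
    [0.1, 0.4, 1, 0, 0],
    [0.3, 0.2, 0, 1, 0],
    [0.2, 0.9, 0, 1, 0],
]
categories_size = [2, 3]  # assumed layout: leading entry is the offset of the first block

x_new = [0.2, 0.5, 0, 0, 0]
start = categories_size[0]
end = start + categories_size[1]
x_new[start:end] = majority_one_hot(neighbors, start, end, random.Random(0))
print(x_new)  # [0.2, 0.5, 0, 1, 0]
```

The second category wins with two of the three votes, so no tie-breaking is needed in this example.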