GitHub - SandroMartens/DBGSOM: A scikit-learn compatible Python implementation of the Directed Batch Growing Self-Organizing Map

DBGSOM (Directed Batch Growing Self-Organizing Map): A Neural Network for Clustering, Classification, Nonlinear Projection/Manifold learning, Data Visualization.

The network automatically determines the number of prototypes needed to represent the data. Starting from four neurons, the map expands at boundary positions where the quantization error exceeds a configurable threshold, so the cluster count does not need to be specified in advance. The result is a topology-preserving 2D grid where neighboring neurons represent similar inputs.

Features

No cluster count needed — map grows until quantization error falls below threshold; lambda_ controls sensitivity
sklearn-compatible — drop-in for KMeans, DBSCAN: implements fit_predict, transform, score, and predict_proba
Topology-preserving — related samples cluster as grid neighbors; topographic error < 5% on Digits
Faster than classical SOMs — batch learning rule trains on all samples per epoch (vs. online, sample-by-sample)
Built-in visualization — plot() renders neuron grid coloured by density, label, error or hit count.

How it works

In brief: Four neurons initialize → samples assigned to nearest neuron → weights update toward assigned samples → boundary neurons with high error spawn new neighbors → σ decays → repeat until max_neurons or n_iter reached. Neighboring neurons influence each other's weight update → topology preserved during training.

DBGSOM builds a 2D rectangular prototype map in which each neuron connects to up to four neighbors. The four initial neurons are initialized with random weights drawn from the input data. In each epoch, every sample is assigned to its nearest neuron (the best matching unit, BMU), and the weights are updated toward the mean of the mapped samples. A neighborhood function couples neighboring neurons so that the ordering of the low-dimensional map is preserved; the neighborhood width shrinks over time, so training moves from global to local structure. A growing mechanism inserts new neurons at boundary positions where the quantization error exceeds the growing threshold.
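The batch step described above can be sketched in plain NumPy. This is a minimal illustration, not the code used by dbgsom: the function name `batch_som_epoch`, the Gaussian neighborhood, the multiplicative decay of `sigma`, and the value of `growing_threshold` are all assumptions made for the example.

```python
import numpy as np


def batch_som_epoch(X, weights, grid_pos, sigma):
    """One batch update. Returns new weights and the summed error per neuron."""
    # Distance from every sample to every neuron: shape (n_samples, n_neurons).
    dists = np.linalg.norm(X[:, None, :] - weights[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)  # best matching unit of each sample

    # Gaussian neighborhood measured on the map grid, not in input space.
    grid_d2 = ((grid_pos[:, None, :] - grid_pos[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-grid_d2 / (2.0 * sigma**2))

    # Per-neuron sums and counts of the samples mapped to each neuron.
    n_neurons = len(weights)
    counts = np.bincount(bmu, minlength=n_neurons).astype(float)
    sums = np.zeros_like(weights)
    np.add.at(sums, bmu, X)

    # Each neuron moves to the neighborhood-weighted mean of the samples.
    new_weights = (h @ sums) / (h @ counts)[:, None]

    # Quantization error accumulated per neuron, used for the growth check.
    bmu_dist = dists[np.arange(len(X)), bmu]
    errors = np.bincount(bmu, weights=bmu_dist, minlength=n_neurons)
    return new_weights, errors


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Start from a 2x2 grid whose weights are random input samples.
grid_pos = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
weights = X[rng.choice(len(X), size=4, replace=False)].copy()

sigma = 1.0
for epoch in range(10):
    weights, errors = batch_som_epoch(X, weights, grid_pos, sigma)
    sigma *= 0.9  # assumed decay schedule, for illustration only

growing_threshold = 30.0  # hypothetical value
# In a 2x2 grid every neuron is a boundary neuron, so all are candidates.
to_grow = np.flatnonzero(errors > growing_threshold)
print("neurons that would spawn neighbors:", to_grow)
```

The sketch stops at identifying growth candidates; where the new neurons are placed and how their weights are set is specified in the Vasighi and Amini paper listed under References.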

How to install

Download from PyPI

Install from PyPI via uv (recommended):

or with pip:

Install from source

Clone and install with uv (recommended):

git clone https://github.com/SandroMartens/DBGSOM.git
cd DBGSOM
uv sync

Alternatively with pip:

git clone https://github.com/SandroMartens/DBGSOM.git
cd DBGSOM
pip install -e .

Usage

DBGSOM implements the scikit-learn API and provides two estimators:

Class	Use case
`SomVQ`	Unsupervised clustering / vector quantization
`SomClassifier`	Supervised classification

Clustering / Vector Quantization

from dbgsom import SomVQ
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

vq = SomVQ(lambda_=80.0, max_neurons=80)
labels = vq.fit_predict(X)

print(f"Neurons: {len(vq.neurons_)}")
print(f"Quantization error: {vq.quantization_error_:.4f}")
print(f"Topographic error:  {vq.topographic_error_:.4f}")
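For readers unfamiliar with the two metrics printed above, the following sketch uses their common textbook definitions: quantization error as the mean distance from a sample to its nearest prototype, and topographic error as the fraction of samples whose two nearest prototypes are not direct grid neighbors. The exact definitions behind `quantization_error_` and `topographic_error_` in dbgsom may differ; the helper names here are hypothetical.

```python
import numpy as np


def quantization_error(X, weights):
    """Mean distance from each sample to its nearest prototype."""
    dists = np.linalg.norm(X[:, None, :] - weights[None, :, :], axis=2)
    return dists.min(axis=1).mean()


def topographic_error(X, weights, grid_pos):
    """Fraction of samples whose best and second-best units are not adjacent."""
    dists = np.linalg.norm(X[:, None, :] - weights[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    first, second = order[:, 0], order[:, 1]
    # Manhattan distance 1 on the grid means a direct 4-neighborhood link.
    grid_gap = np.abs(grid_pos[first] - grid_pos[second]).sum(axis=1)
    return (grid_gap > 1).mean()


# Three neurons in a row on the grid, one-dimensional input data.
grid_pos = np.array([[0, 0], [1, 0], [2, 0]])
X = np.array([[0.1], [0.9], [1.6]])

ordered = np.array([[0.0], [1.0], [2.0]])  # map order matches data order
twisted = np.array([[0.0], [2.0], [1.0]])  # middle and last neuron swapped

qe = quantization_error(X, ordered)
te_ordered = topographic_error(X, ordered, grid_pos)
te_twisted = topographic_error(X, twisted, grid_pos)
print(qe, te_ordered, te_twisted)
```

Both maps have the same prototypes and therefore the same quantization error; only the topographic error reveals that the second map is folded.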

Key growth parameters:

Parameter	Default	Effect
`lambda_`	115.0	Growing threshold — higher → fewer neurons
`max_neurons`	`5 x sqrt(n_samples)`	Hard cap on neuron count
`n_iter`	500	Training epochs; growth only happens in first half
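As a worked example of the default cap in the table, the Digits dataset has 1797 samples, so `5 x sqrt(n_samples)` comes to roughly 212 neurons. How the library rounds this value is not stated here, so the sketch leaves it as a float.

```python
import math

n_samples = 1797  # size of sklearn's Digits dataset
default_cap = 5 * math.sqrt(n_samples)
print(f"default max_neurons for Digits is about {default_cap:.0f}")
```

The usage example above passes `max_neurons=80`, which is well below this default and therefore keeps the map comparatively small.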

Classification

from dbgsom import SomClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = SomClassifier(lambda_=80.0, max_neurons=80)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))           # accuracy
proba = clf.predict_proba(X_test)          # class probabilities
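One common way for a prototype-based classifier to produce class probabilities is to store, for each neuron, the class frequencies of the training samples mapped to it, and to return the frequencies of a new sample's best matching unit. The sketch below illustrates that idea only; it is an assumption, not a description of how `SomClassifier.predict_proba` is implemented, and both function names are hypothetical.

```python
import numpy as np


def neuron_class_frequencies(X, y, weights, n_classes):
    """Class frequency table per neuron, built from the training samples."""
    dists = np.linalg.norm(X[:, None, :] - weights[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)
    counts = np.zeros((len(weights), n_classes))
    np.add.at(counts, (bmu, y), 1.0)
    totals = counts.sum(axis=1, keepdims=True)
    # Neurons without any training hits fall back to a uniform distribution.
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), 1.0 / n_classes)


def bmu_class_proba(X, weights, class_freq):
    """Look up the class frequencies of each sample's best matching unit."""
    dists = np.linalg.norm(X[:, None, :] - weights[None, :, :], axis=2)
    return class_freq[dists.argmin(axis=1)]


weights = np.array([[0.0, 0.0], [5.0, 5.0]])
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.1], [5.0, 5.1], [4.9, 5.0]])
y_train = np.array([0, 0, 1, 1, 1])

class_freq = neuron_class_frequencies(X_train, y_train, weights, n_classes=2)
proba = bmu_class_proba(np.array([[0.0, 0.0], [5.0, 5.0]]), weights, class_freq)
print(proba)
```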

Transform

Both estimators implement transform(), which represents each sample as a sparse, non-negative linear combination of prototype weights:

coefs = vq.transform(X)   # shape (n_samples, n_prototypes)
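To make the idea of a non-negative linear combination concrete, the sketch below solves a non-negative least squares problem per sample with `scipy.optimize.nnls`. NNLS is used here as a stand-in because its solutions are non-negative and often sparse; the solver that dbgsom uses inside `transform()` is not stated in this README, and `nonneg_codes` is a hypothetical helper.

```python
import numpy as np
from scipy.optimize import nnls


def nonneg_codes(X, weights):
    """Solve min ||weights.T @ c - x|| subject to c >= 0 for every sample x."""
    codes = np.zeros((len(X), len(weights)))
    for i, x in enumerate(X):
        codes[i], _ = nnls(weights.T, x)
    return codes


weights = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three prototypes
X = np.array([[2.0, 0.0], [0.5, 0.5]])

codes = nonneg_codes(X, weights)  # shape (n_samples, n_prototypes)
reconstruction = codes @ weights  # back in input space
print(codes)
```

Multiplying the codes by the prototype matrix recovers an approximation of the input, which is what makes the representation usable as a feature transform.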

Visualization

plot() renders SOM neurons as dots and neighborhood edges as grey lines via seaborn objects.

vq.plot(color="density")                       # continuous -> colour gradient
clf.plot(color="label")                        # categorical -> colour legend
vq.plot(color="hit_count", pointsize="error")  # colour + size encoding
vq.plot(color="density", layout="pca", palette="magma_r")

Supported attributes for color / pointsize: 'label', 'epoch_created', 'error', 'average_distance', 'density', 'hit_count'

Parameter	Options	Description
`color`	any node attribute	Numeric attributes → continuous colour scale; int/str with ≤ 20 unique values → legend
`pointsize`	any numeric attribute	Node size proportional to attribute value
`layout`	`'grid'` (default), `'pca'`	Node placement algorithm
`palette`	any Matplotlib colormap	Applied to colour mapping
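The `'pca'` layout places nodes according to a projection of the data rather than their grid coordinates. A plausible reading, sketched below with NumPy only, is that each prototype's weight vector is projected onto its first two principal components; whether dbgsom computes the layout exactly this way is an assumption, and `pca_layout` is a hypothetical helper.

```python
import numpy as np


def pca_layout(weights):
    """Project prototype weight vectors onto their first two principal axes."""
    centered = weights - weights.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
weights = rng.normal(size=(10, 5))  # 10 prototypes in a 5-dimensional space

coords = pca_layout(weights)  # one (x, y) position per prototype
print(coords.shape)
```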

Examples

Example	Description
	2D input: prototypes (red) approximate input distribution (white), square topology preserved.
	Fashion-MNIST: weight of each prototype plotted; neighboring prototypes pairwise similar.
	Each prototype coloured by majority class; same-class samples cluster together. Trained on MNIST digits.

Comparisons

SOM algorithm comparison (Digits, PCA projection)

DBGSOM (dynamic grid, size determined automatically) vs. MiniSom and SuSi (fixed grids) vs. KMeans (no topology). All trained on same Digits embedding.

Clustering metrics (Digits dataset)

ARI, Silhouette, Davies-Bouldin, training time. All algorithms use same cluster count — determined automatically by DBGSOM.
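The three quality metrics named above are available in scikit-learn. The sketch below computes them for KMeans on Digits; `n_clusters=10` is a placeholder chosen for the example, whereas the benchmark uses the cluster count found by DBGSOM.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import (
    adjusted_rand_score,
    davies_bouldin_score,
    silhouette_score,
)

X, y = load_digits(return_X_y=True)

# Placeholder cluster count; the benchmark takes it from the trained DBGSOM.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

ari = adjusted_rand_score(y, labels)   # agreement with true labels, 1 is perfect
sil = silhouette_score(X, labels)      # in [-1, 1], higher is better
dbi = davies_bouldin_score(X, labels)  # non-negative, lower is better
print(f"ARI={ari:.3f}  Silhouette={sil:.3f}  Davies-Bouldin={dbi:.3f}")
```

ARI needs ground-truth labels, while Silhouette and Davies-Bouldin are computed from the data and the predicted labels alone.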

Full benchmark notebooks:

Notebook	What it shows
`clustering_comparison.ipynb`	DBGSOM vs. KMeans, MiniBatchKMeans, AgglomerativeClustering on Iris and Digits
`som_comparison.ipynb`	DBGSOM vs. MiniSom, SuSi on Digits and Fashion-MNIST (QE, TE, training time, scaling)
`manifold_comparison.ipynb`	DBGSOM vs. Isomap, t-SNE, UMAP on MNIST: trustworthiness, continuity, folds/tears, runtime

Dependencies

Python >= 3.12
numpy
numba
NetworkX
tqdm
scikit-learn
seaborn
pandas

Citation

If you use DBGSOM in your research, please cite:

Martens, S. (2025). DBGSOM: A Python implementation of the Directed Batch Growing Self-Organizing Map. Zenodo. https://doi.org/10.5281/zenodo.20525611

References

A directed batch growing approach to enhance the topology preservation of self-organizing map, Mahdi Vasighi and Homa Amini, 2017, https://doi.org/10.1016/j.asoc.2017.02.015
Reference implementation by the authors in Matlab: https://github.com/mvasighi/DBGSOM
Statistics-enhanced Direct Batch Growth Self-Organizing Mapping for efficient DoS Attack Detection, Xiaofei Qu et al., 2019, https://doi.org/10.1109/ACCESS.2019.2922737
Entropy-Defined Direct Batch Growing Hierarchical Self-Organizing Mapping for Efficient Network Anomaly Detection, Xiaofei Qu et al., 2021, https://doi.org/10.1109/ACCESS.2021.3064200
Self-Organizing Maps, 3rd Edition, Teuvo Kohonen, 2003
MATLAB Implementations and Applications of the Self-Organizing Map, Teuvo Kohonen, 2014
Smoothed self-organizing map for robust clustering, P. D'Urso, L. De Giovanni and R. Massari, 2019, https://doi.org/10.1016/j.ins.2019.06.038

License

dbgsom is licensed under the MIT license.
