Clustering Workflow#

vdfpy takes velocity distribution functions (VDFs) from observational and simulation data and clusters the distributions using unsupervised machine learning algorithms from sklearn and other popular packages. We aim to provide uniform interfaces for data from different sources, along with standard methods for training and evaluating clustering performance. At the top level, the task is divided into several steps:

  • Data preparation

    • Data collection

    • Data cleaning

    • Feature engineering

    • Data splitting

  • Modeling

    • Hyperparameter tuning

    • Training

    • Predicting

    • Assessing performance

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import vdfpy

pd.set_option("display.max_rows", 10)

sns.set_theme(rc={'figure.figsize':(10, 3)})

Data Preparation#

Currently, we support VDF and plasma moment collection from Vlasiator and FLEKS simulation data, and from MMS and MAVEN observations. For this tutorial, we generate pseudo-particle data with vdfpy.

from vdfpy.generator import make_clusters

df = make_clusters(n_clusters=3, n_dims=1, n_points=100, n_samples=20)
df
    class                      particle velocity  density  bulk velocity  temperature
0       1  vx 0 -2.307019 1 -0.074709 2 -0.6...      31.0      -0.119262     0.845227
1       1  vx 0  0.725838 1  1.134048 2 -1.1...      14.0      -0.211611     1.237248
2       1  vx 0  0.561785 1 -0.253384 2 -1.0...      85.0      -0.140093     0.988338
3       1  vx 0 -1.785626 1  1.289766 2 -0.6...      98.0       0.018187     0.998213
4       1  vx 0  1.007589 1 -0.659857 2  0.0...      33.0      -0.158113     0.784808
..    ...                                    ...      ...            ...          ...
15      3  vx 0  0.055655 1  0.406755 2  0.0...      27.0      -1.291615     1.979301
16      3  vx 0 -0.192878 1 -0.630142 2  0.2...      88.0      -1.364063     2.237286
17      3  vx 0 -1.280611 1  1.832374 2  1.7...      82.0      -1.215669     2.023891
18      3  vx 0 -1.890527 1  0.082621 2  1.5...      85.0      -1.387498     2.340436
19      3  vx 0  0.090362 1  0.891127 2  0.4...      75.0      -1.150039     2.004875

20 rows × 5 columns

fig, ax = plt.subplots(figsize=(10, 3), layout="constrained")
# Overlay the kernel density estimate of each sampled velocity distribution
for i in range(df["particle velocity"].size):
    sns.kdeplot(df["particle velocity"][i], x="vx", ax=ax)
plt.show()
[Figure: overlaid KDE curves of vx for each pseudo-particle sample]

Normalizing the data#

When working with distance-based algorithms (e.g. k-means), we must normalize the data. Otherwise, variables on different scales are weighted differently in the distance that is optimized during training. For example, if we were to cluster on velocity [km/s] together with density [amu/cc] and pressure [nPa], the feature with the largest numerical range would dominate the distance computation simply because of its units, not because it carries more information.
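
As a toy illustration with made-up numbers (not drawn from the dataset above), the raw Euclidean distance between two observations is dominated by whichever feature has the largest units:

import numpy as np

# Two hypothetical observations: (velocity [km/s], density [amu/cc], pressure [nPa])
a = np.array([400.0, 5.0, 0.05])
b = np.array([420.0, 1.0, 0.50])

# Without scaling, the 20 km/s velocity difference dominates the distance,
# even though the relative changes in density and pressure are larger.
print(np.linalg.norm(a - b))  # ~20.4, almost entirely from the velocity term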

We first set up training and test splits using train_test_split from sklearn.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[["density", "bulk velocity", "temperature"]],
    df[["class"]],
    test_size=0.2,
    random_state=0,
)

Next, we standardize the training and test data using StandardScaler from sklearn.preprocessing.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Scale the moments: fit the scaler on the training set only,
# then apply the same transformation to the test set
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)

sklearn provides several methods for data normalization and scaling. Check the documentation for more details.
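
For instance, MinMaxScaler and RobustScaler are alternatives to standardization; a quick sketch applying them to the same training moments:

from sklearn.preprocessing import MinMaxScaler, RobustScaler

# MinMaxScaler rescales each feature to the [0, 1] range
X_train_minmax = MinMaxScaler().fit_transform(X_train)
# RobustScaler centers on the median and scales by the interquartile range,
# which is less sensitive to outliers than standardization
X_train_robust = RobustScaler().fit_transform(X_train)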

Modeling#

Fitting the model#

For the first iteration, we will arbitrarily choose three clusters. Building and fitting models in sklearn is very simple. We create an instance of KMeans, define the number of clusters with the n_clusters parameter, set n_init (the number of times the algorithm is run with different centroid seeds) to “auto”, and set random_state to 0 so we get the same result each time we run the code. We then fit the model to the normalized training data using the fit() method.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0, n_init="auto")
kmeans.fit(X_train_norm)
KMeans(n_clusters=3, random_state=0)

Once the model is fit, we can access the cluster assignments through the labels_ attribute.
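
For example (labels_ and cluster_centers_ are standard sklearn KMeans attributes):

# Cluster assignment of each training sample
print(kmeans.labels_)
# Cluster centers in the normalized feature space
print(kmeans.cluster_centers_)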

Evaluating the model#

Below, we visualize the data we just fit.

import seaborn as sns

sns.scatterplot(data=X_train, x="density", y="temperature", hue=kmeans.labels_)
plt.show()
[Figure: training data scatter of density vs. temperature, colored by k-means label]

We can evaluate the performance of the clustering algorithm using the silhouette score, available in sklearn.metrics. Scores range from -1 to 1, and a higher score indicates better-separated clusters.

from sklearn.metrics import silhouette_score

silhouette_score(X_train_norm, kmeans.labels_, metric="euclidean")
0.6368993367038387

Choosing the best number of clusters#

A weakness of k-means clustering is that we don’t know how many clusters we need just by running the model; we have to test a range of values and decide on the best value of k. We typically make that decision using the elbow method, choosing a number of clusters that neither overfits the data with too many clusters nor underfits it with too few.

We create the below loop to test and store different model results so that we can make a decision on the best number of clusters.

K = range(2, 6)
fits = []
score = []

# Train the model for current value of k on training data, and append to lists
for k in K:
    model = KMeans(n_clusters=k, random_state=0).fit(X_train_norm)
    fits.append(model)
    score.append(silhouette_score(X_train_norm, model.labels_, metric="euclidean"))

results_eval = pd.DataFrame({"K": K, "score": score})

Typically, as we increase the value of k, the clusters and what they represent improve up to a certain point, after which we see diminishing returns or even worse performance. An elbow plot, with a goodness-of-fit measure on the y-axis and the value of k on the x-axis, helps us visualize this and decide on k.

sns.lineplot(results_eval, x="K", y="score", marker='*', markersize=10, markerfacecolor="tab:red")
plt.show()
[Figure: silhouette score vs. number of clusters K]

We typically choose the point where the improvements in performance start to flatten or get worse.
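
If we prefer an automatic choice over visual inspection, one simple option (a minimal sketch using the sweep above) is to take the k with the highest silhouette score:

# Pick the number of clusters with the highest silhouette score
best_k = results_eval.loc[results_eval["score"].idxmax(), "K"]
print(f"Best number of clusters by silhouette score: {best_k}")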

When will k-means cluster analysis fail?#

K-means clustering performs best on data that form roughly spherical groups, i.e. points that cluster in close proximity around a common center; this is easiest to visualize in two or three dimensions. Data that are not spherical do not work well with k-means clustering. For example, k-means would not do well on two concentric circles or two interleaving arcs: no set of distinct centroids can separate them, even though they are clearly two distinct groups that should be labeled as such, as the sketch below illustrates.
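
A quick sketch with sklearn’s make_moons dataset (not part of the vdfpy pipeline) illustrates this failure mode:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

# Two interleaving half-circles: visually two clusters, but not separable by centroids
X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels_moons = KMeans(n_clusters=2, random_state=0, n_init="auto").fit_predict(X_moons)

# The centroid-based labels split the moons roughly left/right instead of
# following the two arcs, so they disagree with the visually obvious grouping.
sns.scatterplot(x=X_moons[:, 0], y=X_moons[:, 1], hue=labels_moons)
plt.show()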

Quick Demo#

df = make_clusters(n_clusters=2, n_dims=1, n_points=1000, n_samples=6)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Scale the moments
data_scaled = scaler.fit_transform(df.iloc[:, 2:])
df_scaled = pd.DataFrame(data_scaled)
df_scaled
          0         1         2
0  0.388737 -1.034160 -1.042167
1  0.627280 -0.977120 -1.003326
2  0.326892 -0.988160 -0.952576
3 -0.671454  1.013038  1.046506
4  1.192715  1.001636  0.988979
5 -1.864169  0.984768  0.962584
method = "kmeans"
labels = vdfpy.cluster(df_scaled, n_clusters=2, method=method)
# clusters: 2; # samples: 6; # features: 3
xrange = np.linspace(1, labels.size, labels.size)
ylocs = np.zeros(labels.size)

fig, ax = plt.subplots(figsize=(10, 2))

for g in np.unique(labels):
    ix = np.where(labels == g)
    ax.scatter(xrange[ix], ylocs[ix], label=g, s=30)

ax.get_yaxis().set_visible(False)
ax.legend(loc="upper center", fancybox=True, shadow=True, ncol=4, fontsize=16)
ax.set_title('Pseudo distribution clustering, 2-class k-means')

plt.show()
[Figure: pseudo distribution clustering, 2-class k-means; samples colored by cluster label]

GMM Fitting#

To demonstrate the capability of the GMM fitting and parameter extraction, we can create a synthetic dataset of particles with known properties, fit a GMM to it, and verify that the extracted parameters match the input parameters.

from scipy.stats import multivariate_normal
from matplotlib.colors import LogNorm
from sklearn.mixture import GaussianMixture

from vdfpy.generator import generate_synthetic_gmm_data, compare_gmm_results
# Define original parameters for comparison
params_isotropic = [
    {"center": np.array([0.0, 0.0]), "temp": 0.5},
    {"center": np.array([1.0, 2.0]), "temp": 0.2},
]

# Define the parameters for two synthetic Maxwellian distributions
components_isotropic = [
    {
        "n_samples": 1500,
        "mean": params_isotropic[0]["center"],
        "cov": np.array(
            [[params_isotropic[0]["temp"], 0], [0, params_isotropic[0]["temp"]]]
        ),
    },
    {
        "n_samples": 1000,
        "mean": params_isotropic[1]["center"],
        "cov": np.array(
            [[params_isotropic[1]["temp"], 0], [0, params_isotropic[1]["temp"]]]
        ),
    },
]

# Generate the synthetic data using the utility function
synthetic_data = generate_synthetic_gmm_data(components_isotropic, random_state=0)

print(f"Shape of the synthetic dataset: {synthetic_data.shape}")
Shape of the synthetic dataset: (2500, 2)

Fit a GMM to the synthetic data with 2 components:

gmm_synthetic = GaussianMixture(n_components=2, random_state=0)
gmm_synthetic.fit(synthetic_data)
GaussianMixture(n_components=2, random_state=0)

Compare the results. Since the synthetic data was generated with \(m=1\) and \(k_B=1\), the temperature is numerically equal to the variance, so we can compare directly.

compare_gmm_results(params_isotropic, gmm_synthetic, isotropic=True)
--- Original Synthetic Data Parameters ---
  Population 1: Center = [0.0, 0.0], Temperature = 0.5
  Population 2: Center = [1.0, 2.0], Temperature = 0.2

--- GMM Extracted Parameters ---
  Component 1: Center = [1.00069, 2.0008], v_th_sq = 0.21018
  Component 2: Center = [-0.03793, -0.04825], v_th_sq = 0.47898

Note that the ordering of the fitted GMM components is arbitrary: here Component 1 corresponds to the second synthetic population and Component 2 to the first, but the extracted centers and thermal speeds closely match the inputs. Finally, we visualize the result. The plot below shows a 2D histogram of the synthetic particle data, with the fitted GMM components overlaid as red dashed ellipses. The ellipses represent the 1, 2, and 3-sigma contours of the Gaussian distributions, clearly showing how the GMM has identified the two distinct populations in the data.

fig, ax = plt.subplots(figsize=(8, 6), constrained_layout=True)

# Plot the 2D histogram of the synthetic data
ax.hist2d(
    synthetic_data[:, 0],
    synthetic_data[:, 1],
    bins=50,
    norm=LogNorm(),
    cmap="turbo",
    density=True,
)
ax.set_title("Synthetic Data with GMM Fit Overlay")
ax.set_xlabel("vx")
ax.set_ylabel("vy")

# Create a meshgrid for contour plotting
x = np.linspace(synthetic_data[:, 0].min(), synthetic_data[:, 0].max(), 100)
y = np.linspace(synthetic_data[:, 1].min(), synthetic_data[:, 1].max(), 100)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))

# Overlay the GMM contours
for i in range(gmm_synthetic.n_components):
    mean = gmm_synthetic.means_[i]
    cov = gmm_synthetic.covariances_[i]
    rv = multivariate_normal(mean, cov)
    levels = rv.pdf(mean) * np.exp(-0.5 * np.array([1.0, 2.0, 3.0]) ** 2)
    ax.contour(X, Y, rv.pdf(pos), levels=np.sort(levels), colors="red", linestyles="--")

ax.set_aspect("equal", "box")
plt.show()
[Figure: 2D histogram of the synthetic data with GMM 1-, 2-, and 3-sigma contours overlaid]

Anisotropic Example#

Now, let’s do the same for an anisotropic distribution, where the temperature is different in different directions. This is common in magnetized plasmas, where temperature is often described in terms of \(T_\parallel\) and \(T_\perp\) with respect to the magnetic field.

# Define original parameters for comparison
params_anisotropic = [
    {
        "center": np.array([0.5, 0.5]),
        "temp_parallel": 0.8,
        "temp_perp": 0.2,
    }
]

# Define parameters for a synthetic anisotropic distribution
components_anisotropic = [
    {
        "n_samples": 2500,
        "mean": params_anisotropic[0]["center"],
        "cov": np.array(
            [
                [params_anisotropic[0]["temp_parallel"], 0],
                [0, params_anisotropic[0]["temp_perp"]],
            ]
        ),
    }
]

# Generate the synthetic data
anisotropic_data = generate_synthetic_gmm_data(
    components_anisotropic, shuffle=False, random_state=42
)

Fit the GMM:

gmm_anisotropic = GaussianMixture(n_components=1, random_state=0)
gmm_anisotropic.fit(anisotropic_data)

compare_gmm_results(params_anisotropic, gmm_anisotropic, isotropic=False)
--- Original Synthetic Data Parameters ---
  Population 1: Center = [0.5, 0.5], Temp Parallel = 0.8, Temp Perp = 0.2

--- GMM Extracted Parameters ---
  Component 1: Center = [0.48075, 0.49185], v_th_sq_parallel = 0.78808, v_th_sq_perp = 0.20246
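
From the fitted covariance we can also estimate the temperature anisotropy; a minimal sketch, assuming the covariance is close to diagonal as it is here:

# The diagonal of the fitted covariance gives v_th^2 along v_parallel and v_perp,
# so their ratio estimates T_parallel / T_perp (expect ~4 for the inputs 0.8 / 0.2)
cov = gmm_anisotropic.covariances_[0]
print(f"T_parallel / T_perp ~ {cov[0, 0] / cov[1, 1]:.2f}")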

Plot the 2D histogram of the synthetic anisotropic data:

fig, ax = plt.subplots(figsize=(8, 6), constrained_layout=True)

ax.hist2d(
    anisotropic_data[:, 0],
    anisotropic_data[:, 1],
    bins=50,
    norm=LogNorm(),
    cmap="turbo",
    density=True,
)
ax.set_title("Anisotropic Synthetic Data with GMM Fit Overlay")
ax.set_xlabel(r"$v_{\parallel}$")
ax.set_ylabel(r"$v_{\perp}$")

# Create a meshgrid for contour plotting
x = np.linspace(anisotropic_data[:, 0].min(), anisotropic_data[:, 0].max(), 100)
y = np.linspace(anisotropic_data[:, 1].min(), anisotropic_data[:, 1].max(), 100)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))

# Overlay the GMM contours
for i in range(gmm_anisotropic.n_components):
    mean = gmm_anisotropic.means_[i]
    cov = gmm_anisotropic.covariances_[i]
    rv = multivariate_normal(mean, cov)
    levels = rv.pdf(mean) * np.exp(-0.5 * np.array([1.0, 2.0, 3.0]) ** 2)
    ax.contour(X, Y, rv.pdf(pos), levels=np.sort(levels), colors="red", linestyles="--")

ax.set_aspect("equal", "box")
plt.show()
[Figure: 2D histogram of the anisotropic synthetic data with GMM 1-, 2-, and 3-sigma contours overlaid]