Dimension Reduction

Dimension reduction is closely related to feature engineering. Before performing the clustering, we wish to keep the main features of the data instead of feeding the whole raw dataset into the clustering models.

# pandas handles the tabular sample set produced below.
import pandas as pd 

# Truncate DataFrame previews to 10 rows so displayed tables stay compact.
pd.set_option("display.max_rows", 10)

Data Preparation

When we have the raw data of discrete particle velocities at a given location, we first need to convert that into a distribution function. Here we take advantage of the Kernel Density Estimation (KDE) technique implemented in scikit-learn. Note that an alternative implementation exists in seaborn, but it is intended for plotting purposes only.

from sklearn.neighbors import KernelDensity
import numpy as np
import matplotlib.pyplot as plt

from vdfpy.generator import make_clusters

# Generate 50 synthetic 1-D velocity samples (100 particles each) drawn from
# 3 clusters; the returned DataFrame carries "class", "particle velocity",
# "density", "bulk velocity" and "temperature" columns (see printed preview).
# random_state fixes the RNG so the notebook is reproducible.
df = make_clusters(n_clusters=3, n_dims=1, n_points=100, n_samples=50, random_state=1)
df
class particle velocity density bulk velocity temperature
0 1 vx 0 -1.889013 1 -0.174772 2 -0.4... 47.0 -0.095871 0.864831
1 1 vx 0 -0.810815 1 0.752244 2 0.2... 51.0 -0.083029 0.928393
2 1 vx 0 0.616879 1 2.547898 2 -1.0... 75.0 -0.073963 0.989770
3 1 vx 0 0.548405 1 -1.065125 2 1.8... 95.0 -0.201632 0.909026
4 1 vx 0 0.782085 1 0.134622 2 0.26290... 4.0 0.099152 0.564061
... ... ... ... ... ...
45 3 vx 0 -0.965805 1 -1.218734 2 0.5... 73.0 -1.205324 1.962643
46 3 vx 0 0.528009 1 0.129235 2 -0.7... 18.0 -1.582758 1.908900
47 3 vx 0 1.391988 1 -0.746404 2 ... 147.0 -1.354992 2.117926
48 3 vx 0 -0.305792 1 -1.474984 2 ... 112.0 -1.399677 1.998004
49 3 vx 0 -0.276332 1 0.977128 2 ... 144.0 -1.301792 2.110255

50 rows × 5 columns

# Evaluation grid: 200 velocity points on [-4, 4], shaped (200, 1)
# because scikit-learn expects 2-D (n_samples, n_features) input.
n_features = 200
X_plot = np.linspace(-4, 4, n_features)[:, np.newaxis]

# Estimate one distribution function per sample via KDE and row-stack them.
curves = []
for pv in df["particle velocity"]:
    # Silverman's rule of thumb derives the bandwidth from the sample itself.
    estimator = KernelDensity(bandwidth="silverman").fit(pv.values)
    # score_samples returns log-density; exponentiate to get the density.
    curves.append(np.exp(estimator.score_samples(X_plot)))
density = np.stack(curves)

fig, ax = plt.subplots(1, 1, figsize=(8, 3), layout="constrained")

# Transpose so each column (one per sample) is drawn as its own curve.
ax.plot(X_plot[:, 0], density.T, alpha=0.6)
plt.show()
_images/2e076cf5ff1fb76951bf9faa46eb383f4f839cdefa1847b0072583c9e910f01e.png
from sklearn.decomposition import PCA

# Project the 200-point density curves onto their two leading principal
# components to get a 2-D view of the sample set.
n_components = 2
pca = PCA(n_components=n_components)
# fit_transform is equivalent to fit followed by transform on the same data.
X = pca.fit_transform(density)

fig, ax = plt.subplots(1, 1, figsize=(6, 4), layout="constrained")
# Color each projected sample by its ground-truth cluster label.
ax.scatter(X[:, 0], X[:, 1], s=10, c=df["class"], cmap="Set1")
# Spelling fix: "Principal Component", not "Principle Component".
ax.set_xlabel("1st Principal Component", fontsize=14)
ax.set_ylabel("2nd Principal Component", fontsize=14)
plt.show()
_images/90991cd0a5133681b945bb8d63dbbdac61479e8b461bc2822b2f35df92830b06.png
import umap

# Non-linear 2-D embedding of the same density curves with UMAP.
reducer = umap.UMAP(n_components=2)
X = reducer.fit_transform(density)

fig, ax = plt.subplots(1, 1, figsize=(6, 4), layout="constrained")
ax.scatter(X[:, 0], X[:, 1], s=10, c=df["class"], cmap="Set1")
# UMAP coordinates carry no physical units, so hide both axes.
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
ax.set_title("UMAP projection", fontsize=16)
plt.show()
/home/runner/.cache/pypoetry/virtualenvs/vdfpy-djU7PhlF-py3.11/lib/python3.11/site-packages/sklearn/utils/deprecation.py:151: FutureWarning: 'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.
  warnings.warn(
_images/2fb8067cf27eb81830c4c9faf2209a8ac97a9f19ced8496e634069165ce6b483.png

Note that umap has its own plotting support with some extra dependencies.