# Dimension Reduction

Dimension reduction is closely related to feature engineering. Before clustering, we want to extract the main features of the data rather than feed the whole raw data set directly into the clustering models.

In [1]:
import pandas as pd 

pd.set_option("display.max_rows", 10)

## Data Preparation

Given raw data of discrete particle velocities at a location, we first need to convert them into a distribution function. Here we use the Kernel Density Estimation (KDE) technique as implemented in scikit-learn. An alternative implementation exists in `seaborn`, but it is intended for plotting.
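
As a rough illustration of what a Gaussian KDE computes, the sketch below evaluates one in plain NumPy. The helper names `silverman_bandwidth` and `gaussian_kde_1d` are hypothetical, the velocities are synthetic, and the bandwidth uses the textbook Silverman rule of thumb, which is an assumption here and may differ from what scikit-learn's `bandwidth="silverman"` option computes.

```python
import numpy as np


def silverman_bandwidth(samples):
    """Textbook Silverman rule of thumb for 1D data (an assumption;
    scikit-learn's "silverman" option may use a different formula)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    std = samples.std(ddof=1)
    q75, q25 = np.percentile(samples, [75, 25])
    spread = min(std, (q75 - q25) / 1.34)
    return 0.9 * spread * n ** (-0.2)


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE of `samples` at the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Pairwise scaled distances, shape (n_grid, n_samples)
    z = (grid[:, np.newaxis] - samples[np.newaxis, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale so the estimate integrates to one
    return kernels.mean(axis=1) / bandwidth


# Hypothetical stand-in for one sample of particle velocities
rng = np.random.default_rng(0)
velocities = rng.normal(loc=0.5, scale=0.8, size=100)

grid = np.linspace(-4, 4, 200)
h = silverman_bandwidth(velocities)
pdf = gaussian_kde_1d(velocities, grid, h)
```

Each row of the `density` matrix built later in this section plays the same role as `pdf` here: one distribution function sampled on a fixed velocity grid.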

In [None]:
from sklearn.neighbors import KernelDensity
import numpy as np
import matplotlib.pyplot as plt

from vdfpy.generator import make_clusters

df = make_clusters(n_clusters=3, n_dims=1, n_points=100, n_samples=50, random_state=1)
df

In [3]:
n_features = 200
X_plot = np.linspace(-4, 4, n_features)[:, np.newaxis]

density = np.zeros(shape=(df["particle velocity"].size, n_features))

for isample, pv in enumerate(df["particle velocity"]):
    # Fit a KDE to this sample; score_samples returns the log-density on the grid
    kde = KernelDensity(bandwidth="silverman").fit(pv.values)
    log_den = kde.score_samples(X_plot)
    density[isample, :] = np.exp(log_den)


In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 3), layout="constrained")

ax.plot(X_plot[:, 0], np.transpose(density), alpha=0.6)
plt.show()

In [None]:
from sklearn.decomposition import PCA

n_components = 2
pca = PCA(n_components=n_components)
pca.fit(density)
X = pca.transform(density)

fig, ax = plt.subplots(1, 1, figsize=(6, 4), layout="constrained")
ax.scatter(X[:,0], X[:,1], s=10, c=df["class"], cmap="Set1")
ax.set_xlabel("1st Principal Component", fontsize=14)
ax.set_ylabel("2nd Principal Component", fontsize=14)
plt.show()
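
To make the projection step less of a black box, here is a minimal sketch of PCA through a singular value decomposition in plain NumPy. The helper `pca_svd` and the toy density matrix are hypothetical stand-ins, not part of `vdfpy` or scikit-learn, and the signs of the resulting components may differ from those returned by scikit-learn.

```python
import numpy as np


def pca_svd(data, n_components):
    """Project rows of `data` onto the leading principal components."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    # Fraction of the total variance captured by each kept component
    explained_ratio = s[:n_components] ** 2 / np.sum(s**2)
    return scores, explained_ratio


# Hypothetical stand-in for the density matrix: 50 samples x 200 grid points,
# built from Gaussian profiles with two distinct centers
rng = np.random.default_rng(0)
grid = np.linspace(-4, 4, 200)
centers = np.where(np.arange(50) < 25, -1.0, 1.0) + 0.1 * rng.normal(size=50)
toy_density = np.exp(-0.5 * (grid[np.newaxis, :] - centers[:, np.newaxis]) ** 2)

scores, ratio = pca_svd(toy_density, n_components=2)
print(ratio)
```

The explained variance ratio gives a quick check on whether two components retain most of the variation in the distribution functions. With scikit-learn, the fitted `pca` object exposes the same quantity as `pca.explained_variance_ratio_`.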

In [None]:
import umap

reducer = umap.UMAP(n_components=2)

X = reducer.fit_transform(density)
fig, ax = plt.subplots(1, 1, figsize=(6, 4), layout="constrained")
ax.scatter(X[:,0], X[:,1], s=10, c=df["class"], cmap="Set1")
xax = ax.axes.get_xaxis()
xax.set_visible(False)
yax = ax.axes.get_yaxis()
yax.set_visible(False)
plt.title("UMAP projection", fontsize=16)
plt.show()

Note that `umap` has its own [plotting support](https://umap-learn.readthedocs.io/en/latest/plotting.html#) with some extra dependencies.