{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dimension Reduction\n", "\n", "Dimension reduction is closely related to feature engineering. Before clustering, we want to keep the main features of the data rather than feed the entire raw dataset into the clustering models." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pd.set_option(\"display.max_rows\", 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation\n", "\n", "Given raw data of discrete particle velocities at a given location, we first need to convert them into a distribution function. Here we use the kernel density estimation (KDE) implementation in scikit-learn. Note that `seaborn` provides an alternative KDE implementation, but it is intended for plotting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KernelDensity\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from vdfpy.generator import make_clusters\n", "\n", "df = make_clusters(n_clusters=3, n_dims=1, n_points=100, n_samples=50, random_state=1)\n", "df" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "n_features = 200\n", "# Common velocity grid on which every KDE is evaluated\n", "X_plot = np.linspace(-4, 4, n_features)[:, np.newaxis]\n", "\n", "# One row per sample, one column per grid point\n", "density = np.zeros(shape=(df[\"particle velocity\"].size, n_features))\n", "\n", "for isample, pv in enumerate(df[\"particle velocity\"]):\n", "    kde = KernelDensity(bandwidth=\"silverman\").fit(pv.values)\n", "    # score_samples returns the log-density, so exponentiate it\n", "    log_den = kde.score_samples(X_plot)\n", "    density[isample, :] = np.exp(log_den)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(8, 3), layout=\"constrained\")\n", "\n", "ax.plot(X_plot[:, 0], np.transpose(density), alpha=0.6)\n", "plt.show()" ] }, {
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "\n", "n_components = 2\n", "pca = PCA(n_components=n_components)\n", "pca.fit(density)\n", "X = pca.transform(density)\n", "\n", "fig, ax = plt.subplots(1, 1, figsize=(6, 4), layout=\"constrained\")\n", "ax.scatter(X[:, 0], X[:, 1], s=10, c=df[\"class\"], cmap=\"Set1\")\n", "ax.set_xlabel(\"1st Principal Component\", fontsize=14)\n", "ax.set_ylabel(\"2nd Principal Component\", fontsize=14)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import umap\n", "\n", "reducer = umap.UMAP(n_components=2)\n", "\n", "X = reducer.fit_transform(density)\n", "fig, ax = plt.subplots(1, 1, figsize=(6, 4), layout=\"constrained\")\n", "ax.scatter(X[:, 0], X[:, 1], s=10, c=df[\"class\"], cmap=\"Set1\")\n", "xax = ax.axes.get_xaxis()\n", "xax.set_visible(False)\n", "yax = ax.axes.get_yaxis()\n", "yax.set_visible(False)\n", "plt.title(\"UMAP projection\", fontsize=16)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `umap` has its own [plotting support](https://umap-learn.readthedocs.io/en/latest/plotting.html#), which requires some extra dependencies." ] } ], "metadata": { "kernelspec": { "display_name": "vdfpy-L8WReCYd-py3.11", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }