{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering Workflow\n", "\n", "`vdfpy` takes velocity distribution functions (VDFs) from observation and simulation data, and clusters the distributions using unsupervised machine learning algorithms from `sklearn` and other population packages. We aim to provide uniform interfaces for data from different sources, and standard methods for training and evaluating the clustering performance. On the top level, the task is divided into several steps:\n", "\n", "* Data preparation\n", " * Data collection\n", " * Data cleaning\n", " * Feature engineering\n", " * Data splitting\n", "* Modeling\n", " * Hyperparameter tuning\n", " * Training\n", " * Predicting\n", " * Assessing performance" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import vdfpy\n", "\n", "pd.set_option(\"display.max_rows\", 10)\n", "\n", "sns.set_theme(rc={'figure.figsize':(10, 3)})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation\n", "\n", "Currently, we support VDF and plasma moment collection from simulation data of Vlasiator, and FLEKS, and observation data from MMS and MAVEN. For this tutorial, we generate pseudo-particle data from `vdfpy`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from vdfpy.generator import make_clusters\n", "\n", "df = make_clusters(n_clusters=3, n_dims=1, n_points=100, n_samples=20)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(10, 3), layout=\"constrained\")\n", "[\n", " sns.kdeplot(df[\"particle velocity\"][i], x=\"vx\", ax=ax)\n", " for i in range(df[\"particle velocity\"].size)\n", "]\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalizing the data\n", "\n", "When working with distance-based algorithms (e.g. k-Means), we must normalize the data. Otherwise, variables with different scaling will be weighted differently in the distance formula that is being optimized during training. For example, if we were to include velocity [km/s] in the cluster, in addition to density [amu/cc] and pressure [nPa], density would have an outsized impact on the optimizations because its scale is significantly larger and wider than the bounded location variables.\n", "\n", "We first set up training and test splits using `train_test_split` from `sklearn`." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " df[[\"density\", \"bulk velocity\", \"temperature\"]],\n", " df[[\"class\"]],\n", " test_size=0.2,\n", " random_state=0,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we normalize the training and test data using the `preprocessing.normalize` method from `sklearn`." 
] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "# Scale the moments\n", "X_train_norm = scaler.fit_transform(X_train)\n", "X_test_norm = scaler.fit_transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`sklearn` provides several methods for data normalization and scaling. Check the documentation for more details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modeling\n", "\n", "### Fitting the model\n", "\n", "For the first iteration, we will arbitrarily choose a number of clusters of 3. Building and fitting models in `sklearn` is very simple. We will create an instance of `KMeans`, define the number of clusters using the `n_clusters` attribute, set `n_init`, which defines the number of iterations the algorithm will run with different centroid seeds, to \"auto\", and we will set the `random_state` to 0 so we get the same result each time we run the code. We can then fit the model to the normalized training data using the `fit()` method.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "\n", "kmeans = KMeans(n_clusters=3, random_state=0, n_init=\"auto\")\n", "kmeans.fit(X_train_norm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the data are fit, we can access labels from the `labels_` attribute.\n", "\n", "### Evaluating the model\n", "\n", "Below, we visualize the data we just fit." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "sns.scatterplot(data=X_train, x=\"density\", y=\"temperature\", hue=kmeans.labels_)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can evaluate performance of the clustering algorithm using a Silhouette score which is a part of `sklearn.metrics` where a higher score (upper limit 1) represents a better fit." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import silhouette_score\n", "\n", "silhouette_score(X_train_norm, kmeans.labels_, metric=\"euclidean\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Choosing the best number of clusters\n", "\n", "The weakness of k-means clustering is that we don’t know how many clusters we need by just running the model. We need to test ranges of values and make a decision on the best value of k. We typically make a decision using the Elbow method to determine the optimal number of clusters where we are both not overfitting the data with too many clusters, and also not underfitting with too few.\n", "\n", "We create the below loop to test and store different model results so that we can make a decision on the best number of clusters." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "K = range(2, 6)\n", "fits = []\n", "score = []\n", "\n", "# Train the model for current value of k on training data, and append to lists\n", "for k in K:\n", " model = KMeans(n_clusters=k, random_state=0).fit(X_train_norm)\n", " fits.append(model)\n", " score.append(silhouette_score(X_train_norm, model.labels_, metric=\"euclidean\"))\n", "\n", "results_eval = pd.DataFrame({\"K\": K, \"score\": score})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typically, as we increase the value of K, we see improvements in clusters and what they represent until a certain point. We then start to see diminishing returns or even worse performance. We can visually see this to help make a decision on the value of k by using an elbow plot where the y-axis is a measure of goodness of fit and the x-axis is the value of k." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.lineplot(results_eval, x=\"K\", y=\"score\", marker='*', markersize=10, markerfacecolor=\"tab:red\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We typically choose the point where the improvements in performance start to flatten or get worse." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When will k-means cluster analysis fail?\n", "\n", "K-means clustering performs best on data that are spherical. Spherical data are data that group in space in close proximity to each other either. This can be visualized in 2 or 3 dimensional space more easily. Data that aren’t spherical or should not be spherical do not work well with k-means clustering. For example, k-means clustering would not do well on the below data as we would not be able to find distinct centroids to cluster the two circles or arcs differently, despite them clearly visually being two distinct circles and arcs that should be labeled as such." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quick Demo" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = make_clusters(n_clusters=2, n_dims=1, n_points=1000, n_samples=6)\n", "\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "# Scale the moments\n", "data_scaled = scaler.fit_transform(df.iloc[:, 2:])\n", "df_scaled = pd.DataFrame(data_scaled)\n", "df_scaled" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "method = \"kmeans\"\n", "labels = vdfpy.cluster(df_scaled, n_clusters=2, method=method)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xrange = np.linspace(1, labels.size, labels.size)\n", "ylocs = np.zeros(labels.size)\n", "\n", "fig, ax = plt.subplots(figsize=(10, 2))\n", "\n", "for g in np.unique(labels):\n", " ix = np.where(labels == g)\n", " ax.scatter(xrange[ix], ylocs[ix], label=g, s=30)\n", "\n", "ax.get_yaxis().set_visible(False)\n", "ax.legend(loc=\"upper center\", fancybox=True, shadow=True, ncol=4, fontsize=16)\n", "ax.set_title('Pseudo distribution clustering, 2-class k-means')\n", "\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "vdfpy-L8WReCYd-py3.11", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }