{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generating Distribution Functions\n",
    "\n",
    "NumPy provides efficient and flexible tools for random sampling. Earlier versions of numpy (< v1.17) uses `numpy.random` methods directly (e.g. `numpy.random.normal`); users of newer versions are recommended to use the `Generator` instance with `default_rng` and call the various methods on it to obtain samples from different distributions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "sns.set_theme(rc={'figure.figsize':(10, 3)})\n",
    "# For reproducibility\n",
    "seed = 1234\n",
    "rng = np.random.default_rng(seed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## One-dimensional Distributions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discrete distribution\n",
    "\n",
    "In the simplest case, if we have a discrete distribution (a set of values with associated probabilities), we can sample points like the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "population = [1, 2, 3, 4, 5]\n",
    "probabilities = [0.1, 0.2, 0.3, 0.25, 0.15]\n",
    "\n",
    "samples = rng.choice(population, size=500, p=probabilities)\n",
    "df_discrete = pd.DataFrame(samples, columns=[\"x\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hp = sns.histplot(df_discrete,\n",
    "                  x=\"x\",\n",
    "                  stat=\"probability\",\n",
    "                  discrete=True\n",
    "                 )\n",
    "hp.set(title=\"Discrete Distribution\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summation of two Maxwellian distributions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean1, std1 = -2, 2\n",
    "mean2, std2 = 4, 1\n",
    "n1 = 9000\n",
    "n2 = 1000\n",
    "\n",
    "samples1 = rng.normal(mean1, std1, size=n1)\n",
    "samples2 = rng.normal(mean2, std2, size=n2)\n",
    "samples = np.concatenate([samples1, samples2])\n",
    "np.random.shuffle(samples) # Shuffle to mix the samples randomly\n",
    "df_double_peak = pd.DataFrame(samples, columns=[\"x\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hp = sns.histplot(df_double_peak,\n",
    "                  x=\"x\",\n",
    "                  stat=\"density\",\n",
    "                  kde=True\n",
    "                 )\n",
    "hp.set(title=\"Mixture of two Maxwellian Distributions\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Two-dimensional Distributions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2D Maxwellian"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean = [0, 0]\n",
    "cov = [[1, 0], [0, 1]]\n",
    "n1 = 5000\n",
    "\n",
    "samples = rng.multivariate_normal(mean, cov, size=n1)\n",
    "df_double_peak_2d = pd.DataFrame(samples, columns=[\"x\", \"y\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(7, 6), layout=\"constrained\")\n",
    "hp = sns.histplot(df_double_peak_2d,\n",
    "                  x=\"x\", y=\"y\",\n",
    "                  stat=\"density\",\n",
    "                  ax=ax,\n",
    "                 )\n",
    "hp.set(title=\"2D Maxwellian Distribution\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2D double-peak Maxwellian\n",
    "\n",
    "Here is an analytical construction of a mixture of two Maxwellian distribution in 2D. Note that the kernel density estimation in high dimension is an expensive operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean = [0, 0]\n",
    "cov = [[1, 0], [0, 1]]\n",
    "n1 = 8000\n",
    "samples1 = rng.multivariate_normal(mean, cov, size=n1)\n",
    "\n",
    "mean = [3, 3]\n",
    "cov = [[1, 0], [0, 1]]\n",
    "n2 = 2000\n",
    "samples2 = rng.multivariate_normal(mean, cov, size=n2)\n",
    "\n",
    "samples = np.concatenate((samples1, samples2))\n",
    "df_double_peak_2d = pd.DataFrame(samples, columns=[\"x\", \"y\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(7, 6), layout=\"constrained\")\n",
    "hp = sns.histplot(df_double_peak_2d,\n",
    "                  x=\"x\", y=\"y\",\n",
    "                  stat=\"density\",\n",
    "                  ax=ax\n",
    "                 )\n",
    "hp.set(title=\"2D Double Peak Distribution\")\n",
    "sns.kdeplot(df_double_peak_2d, x=\"x\", y=\"y\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating Pseudo-Data from VDFpy\n",
    "\n",
    "In `VDFpy`, we provide a utility function `make_clusters` for generating pseudo-data for testing the clustering algorithms.\n",
    "\n",
    "### 1D\n",
    "\n",
    "For instance, to generate 10 samples of one-dimensional VDFs from 2 clusters: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from vdfpy.generator import make_clusters\n",
    "\n",
    "df = make_clusters(n_clusters=2, n_dims=1, n_points=100, n_samples=3)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hp = sns.kdeplot(df[\"particle velocity\"][0], x=\"vx\")\n",
    "[sns.kdeplot(d, x=\"vx\") for d in df[\"particle velocity\"][1:]]\n",
    "hp.set(title=\"Pseudo distributions, 2 classes\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2D\n",
    "\n",
    "To generate 3 samples of two-dimensional distributions from 2 classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = make_clusters(n_clusters=2, n_dims=2, n_points=30, n_samples=3)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(7, 6), layout=\"constrained\")\n",
    "hp = sns.kdeplot(df[\"particle velocity\"][0], x=\"vx\", y=\"vy\", ax=ax, fill=True)\n",
    "[sns.kdeplot(d, x=\"vx\", y=\"vy\", ax=ax, fill=True, alpha=0.5) for d in df[\"particle velocity\"][1:]]\n",
    "hp.set(title=\"Pseudo distributions, 1 class\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More detailed usages can be found in the documentation. Also note that kernel density estimation in high dimension (N > 1) is a very expensive operation; users are recommended to use `histplot` instead of `kdeplot` in general."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "vdfpy-L8WReCYd-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}