Generating Distribution Functions

NumPy provides efficient and flexible tools for random sampling. With earlier versions of NumPy (< v1.17), the numpy.random functions are called directly (e.g. numpy.random.normal); with newer versions it is recommended to create a Generator instance via default_rng and call its methods to obtain samples from different distributions.
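
As a minimal, illustrative comparison (the seed and parameters here are arbitrary and separate from the setup below), the two interfaces look like this:

import numpy as np

# Legacy interface: module-level functions sharing global state
legacy_samples = np.random.normal(loc=0.0, scale=1.0, size=5)

# Recommended interface (NumPy >= 1.17): an explicit Generator object
rng_demo = np.random.default_rng(42)
new_samples = rng_demo.normal(loc=0.0, scale=1.0, size=5)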

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(rc={'figure.figsize':(10, 3)})
# For reproducibility
seed = 1234
rng = np.random.default_rng(seed)

One-dimensional Distributions

Discrete distribution

In the simplest case, if we have a discrete distribution (a set of values with associated probabilities), we can sample points like the following:

population = [1, 2, 3, 4, 5]
probabilities = [0.1, 0.2, 0.3, 0.25, 0.15]

samples = rng.choice(population, size=500, p=probabilities)
df_discrete = pd.DataFrame(samples, columns=["x"])
hp = sns.histplot(df_discrete,
                  x="x",
                  stat="probability",
                  discrete=True
                 )
hp.set(title="Discrete Distribution");
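
As a quick sanity check (a sketch, not part of the original example), the empirical frequencies of samples can be compared with the specified probabilities:

values, counts = np.unique(samples, return_counts=True)
# With 500 samples the frequencies should be close to p = [0.1, 0.2, 0.3, 0.25, 0.15]
print(dict(zip(values.tolist(), np.round(counts / counts.sum(), 3))))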

Mixture of two Maxwellian distributions

mean1, std1 = -2, 2
mean2, std2 = 4, 1
n1 = 9000
n2 = 1000

samples1 = rng.normal(mean1, std1, size=n1)
samples2 = rng.normal(mean2, std2, size=n2)
samples = np.concatenate([samples1, samples2])
rng.shuffle(samples) # Shuffle to mix the samples randomly (use the seeded Generator for reproducibility)
df_double_peak = pd.DataFrame(samples, columns=["x"])
hp = sns.histplot(df_double_peak,
                  x="x",
                  stat="density",
                  kde=True
                 )
hp.set(title="Mixture of two Maxwellian Distributions");
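
To check that the concatenated samples indeed follow the weighted mixture density, one can overlay the analytical PDF on the histogram. This sketch assumes SciPy is available; the weights come from the sample counts n1 and n2 above.

from scipy.stats import norm

x = np.linspace(-10, 8, 400)
w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
pdf = w1 * norm.pdf(x, mean1, std1) + w2 * norm.pdf(x, mean2, std2)

ax = sns.histplot(df_double_peak, x="x", stat="density")
ax.plot(x, pdf, color="k", label="analytical mixture PDF")
ax.legend();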

Two-dimensional Distributions

2D Maxwellian

mean = [0, 0]
cov = [[1, 0], [0, 1]]
n1 = 5000

samples = rng.multivariate_normal(mean, cov, size=n1)
df_double_peak_2d = pd.DataFrame(samples, columns=["x", "y"])
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
hp = sns.histplot(df_double_peak_2d,
                  x="x", y="y",
                  stat="density",
                  ax=ax,
                 )
hp.set(title="2D Maxwellian Distribution")
plt.show()
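
As a quick check (not part of the original example), the sample mean and covariance of samples should be close to the requested mean and cov:

print("sample mean:", samples.mean(axis=0))                   # close to [0, 0]
print("sample covariance:\n", np.cov(samples, rowvar=False))  # close to the identity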

2D double-peak Maxwellian

Here is an analytical construction of a mixture of two Maxwellian distributions in 2D. Note that kernel density estimation in higher dimensions is an expensive operation.

mean = [0, 0]
cov = [[1, 0], [0, 1]]
n1 = 8000
samples1 = rng.multivariate_normal(mean, cov, size=n1)

mean = [3, 3]
cov = [[1, 0], [0, 1]]
n2 = 2000
samples2 = rng.multivariate_normal(mean, cov, size=n2)

samples = np.concatenate((samples1, samples2))
df_double_peak_2d = pd.DataFrame(samples, columns=["x", "y"])
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
hp = sns.histplot(df_double_peak_2d,
                  x="x", y="y",
                  stat="density",
                  ax=ax
                 )
hp.set(title="2D Double Peak Distribution")
sns.kdeplot(df_double_peak_2d, x="x", y="y")
plt.show()
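
Because the KDE over all 10,000 points dominates the cost, one possible workaround (a sketch, not a VDFpy recommendation) is to draw the density contours from a random subset of the data:

subset = df_double_peak_2d.sample(n=2000, random_state=seed)  # KDE on a subset is much cheaper
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
sns.histplot(df_double_peak_2d, x="x", y="y", stat="density", ax=ax)
sns.kdeplot(subset, x="x", y="y", ax=ax)
ax.set(title="2D Double Peak Distribution (KDE on a 2000-point subset)")
plt.show()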

Generating Pseudo-Data from VDFpy

VDFpy provides a utility function make_clusters for generating pseudo-data to test the clustering algorithms.

1D

For instance, to generate 3 samples of one-dimensional VDFs from 2 clusters:

from vdfpy.generator import make_clusters

df = make_clusters(n_clusters=2, n_dims=1, n_points=100, n_samples=3)
df
   class                       particle velocity  density  bulk velocity  temperature
0      1  vx  0 -0.947819  1  0.667292  2 -1.3...     85.0       0.075374     1.139530
1      2  vx  0  0.241391  1  1.300642  2 -0.8...     94.0       1.379030     2.182799
2      2  vx  0  0.860236  1 -1.186913  2 -1.1...     27.0       1.073050     2.180692
hp = sns.kdeplot(df["particle velocity"][0], x="vx")
for d in df["particle velocity"][1:]:
    sns.kdeplot(d, x="vx", ax=hp)
hp.set(title="Pseudo distributions, 2 classes");

2D

To generate 3 samples of two-dimensional distributions from 2 classes:

df = make_clusters(n_clusters=2, n_dims=2, n_points=30, n_samples=3)
df
   class                    particle velocity  density                               bulk velocity  temperature
0      1  vx vy  0  0.459237 -0.416954 ...        16.0  vx  0.297316  vy -0.380066  dtype: float64     0.861148
1      2  vx vy  0  1.631649 -1.420858 ...        36.0  vx  1.351862  vy  1.057606  dtype: float64     2.082108
2      2  vx vy  0 -0.123663  1.762656  1 ...     10.0  vx  1.003698  vy  1.231401  dtype: float64     2.307296
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
hp = sns.kdeplot(df["particle velocity"][0], x="vx", y="vy", ax=ax, fill=True)
for d in df["particle velocity"][1:]:
    sns.kdeplot(d, x="vx", y="vy", ax=ax, fill=True, alpha=0.5)
hp.set(title="Pseudo distributions, 2 classes");

More detailed usage can be found in the documentation. Also note that kernel density estimation in higher dimensions (N > 1) is a very expensive operation; in general we recommend using histplot instead of kdeplot.
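
For example, the 2D pseudo-data above can be visualized with binned 2D histograms instead of KDE contours (a sketch reusing the DataFrame df from the previous cell):

fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
for d in df["particle velocity"]:
    sns.histplot(d, x="vx", y="vy", stat="density", ax=ax, alpha=0.5)
ax.set(title="Pseudo distributions, 2 classes (2D histograms)")
plt.show()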