Generating Distribution Functions

NumPy provides efficient and flexible tools for random sampling. With earlier versions of NumPy (< v1.17), the numpy.random functions are called directly (e.g. numpy.random.normal); with newer versions it is recommended to create a Generator instance via default_rng and call its methods to obtain samples from different distributions.
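
As a minimal, illustrative comparison (the seed and parameters here are arbitrary and separate from the setup below), the two interfaces look like this:

import numpy as np

# Legacy interface: module-level functions sharing global state
legacy_samples = np.random.normal(loc=0.0, scale=1.0, size=5)

# Recommended interface (NumPy >= 1.17): an explicit Generator object
rng_demo = np.random.default_rng(42)
new_samples = rng_demo.normal(loc=0.0, scale=1.0, size=5)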

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(rc={'figure.figsize':(10, 3)})
# For reproducibility
seed = 1234
rng = np.random.default_rng(seed)

One-dimensional Distributions

Discrete distribution

In the simplest case, if we have a discrete distribution (a set of values with associated probabilities), we can sample points like the following:

population = [1, 2, 3, 4, 5]
probabilities = [0.1, 0.2, 0.3, 0.25, 0.15]

samples = rng.choice(population, size=500, p=probabilities)
df_discrete = pd.DataFrame(samples, columns=["x"])
hp = sns.histplot(df_discrete,
                  x="x",
                  stat="probability",
                  discrete=True
                 )
hp.set(title="Discrete Distribution");
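
As a quick sanity check (a sketch, not part of the original example), the empirical frequencies of samples can be compared with the specified probabilities:

values, counts = np.unique(samples, return_counts=True)
# With 500 samples the frequencies should be close to p = [0.1, 0.2, 0.3, 0.25, 0.15]
print(dict(zip(values.tolist(), np.round(counts / counts.sum(), 3))))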

Mixture of two Maxwellian distributions

mean1, std1 = -2, 2
mean2, std2 = 4, 1
n1 = 9000
n2 = 1000

samples1 = rng.normal(mean1, std1, size=n1)
samples2 = rng.normal(mean2, std2, size=n2)
samples = np.concatenate([samples1, samples2])
rng.shuffle(samples) # Shuffle to mix the samples randomly (use the seeded Generator for reproducibility)
df_double_peak = pd.DataFrame(samples, columns=["x"])
hp = sns.histplot(df_double_peak,
                  x="x",
                  stat="density",
                  kde=True
                 )
hp.set(title="Mixture of two Maxwellian Distributions");
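
To check that the concatenated samples indeed follow the weighted mixture density, one can overlay the analytical PDF on the histogram. This sketch assumes SciPy is available; the weights come from the sample counts n1 and n2 above.

from scipy.stats import norm

x = np.linspace(-10, 8, 400)
w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
pdf = w1 * norm.pdf(x, mean1, std1) + w2 * norm.pdf(x, mean2, std2)

ax = sns.histplot(df_double_peak, x="x", stat="density")
ax.plot(x, pdf, color="k", label="analytical mixture PDF")
ax.legend();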

Two-dimensional Distributions

2D Maxwellian

mean = [0, 0]
cov = [[1, 0], [0, 1]]
n1 = 5000

samples = rng.multivariate_normal(mean, cov, size=n1)
df_double_peak_2d = pd.DataFrame(samples, columns=["x", "y"])
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
hp = sns.histplot(df_double_peak_2d,
                  x="x", y="y",
                  stat="density",
                  ax=ax,
                 )
hp.set(title="2D Maxwellian Distribution")
plt.show()
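
As a quick check (not part of the original example), the sample mean and covariance of samples should be close to the requested mean and cov:

print("sample mean:", samples.mean(axis=0))                   # close to [0, 0]
print("sample covariance:\n", np.cov(samples, rowvar=False))  # close to the identity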

2D double-peak Maxwellian

Here is an analytical construction of a mixture of two Maxwellian distributions in 2D. Note that kernel density estimation in higher dimensions is an expensive operation.

mean = [0, 0]
cov = [[1, 0], [0, 1]]
n1 = 8000
samples1 = rng.multivariate_normal(mean, cov, size=n1)

mean = [3, 3]
cov = [[1, 0], [0, 1]]
n2 = 2000
samples2 = rng.multivariate_normal(mean, cov, size=n2)

samples = np.concatenate((samples1, samples2))
df_double_peak_2d = pd.DataFrame(samples, columns=["x", "y"])
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
hp = sns.histplot(df_double_peak_2d,
                  x="x", y="y",
                  stat="density",
                  ax=ax
                 )
hp.set(title="2D Double Peak Distribution")
sns.kdeplot(df_double_peak_2d, x="x", y="y")
plt.show()
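
Because the KDE over all 10,000 points dominates the cost, one possible workaround (a sketch, not a VDFpy recommendation) is to draw the density contours from a random subset of the data:

subset = df_double_peak_2d.sample(n=2000, random_state=seed)  # KDE on a subset is much cheaper
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
sns.histplot(df_double_peak_2d, x="x", y="y", stat="density", ax=ax)
sns.kdeplot(subset, x="x", y="y", ax=ax)
ax.set(title="2D Double Peak Distribution (KDE on a 2000-point subset)")
plt.show()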

Generating Pseudo-Data from VDFpy

VDFpy provides a utility function make_clusters for generating pseudo-data to test the clustering algorithms.

1D

For instance, to generate 3 samples of one-dimensional VDFs from 2 clusters:

from vdfpy.generator import make_clusters

df = make_clusters(n_clusters=2, n_dims=1, n_points=100, n_samples=3)
df
   class                       particle velocity  density  bulk velocity  temperature
0      1  vx  0 -0.947819  1  0.667292  2 -1.3...     85.0       0.075374     1.139530
1      2  vx  0  0.241391  1  1.300642  2 -0.8...     94.0       1.379030     2.182799
2      2  vx  0  0.860236  1 -1.186913  2 -1.1...     27.0       1.073050     2.180692
hp = sns.kdeplot(df["particle velocity"][0], x="vx")
for d in df["particle velocity"][1:]:
    sns.kdeplot(d, x="vx", ax=hp)
hp.set(title="Pseudo distributions, 2 classes");

2D

To generate 3 samples of two-dimensional distributions from 2 classes:

df = make_clusters(n_clusters=2, n_dims=2, n_points=30, n_samples=3)
df
   class                    particle velocity  density                               bulk velocity  temperature
0      1  vx vy  0  0.459237 -0.416954 ...        16.0  vx  0.297316  vy -0.380066  dtype: float64     0.861148
1      2  vx vy  0  1.631649 -1.420858 ...        36.0  vx  1.351862  vy  1.057606  dtype: float64     2.082108
2      2  vx vy  0 -0.123663  1.762656  1 ...     10.0  vx  1.003698  vy  1.231401  dtype: float64     2.307296
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
hp = sns.kdeplot(df["particle velocity"][0], x="vx", y="vy", ax=ax, fill=True)
for d in df["particle velocity"][1:]:
    sns.kdeplot(d, x="vx", y="vy", ax=ax, fill=True, alpha=0.5)
hp.set(title="Pseudo distributions, 2 classes");

More detailed usage can be found in the documentation. Also note that kernel density estimation in higher dimensions (N > 1) is a very expensive operation; in general we recommend using histplot instead of kdeplot.
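
For example, the 2D pseudo-data above can be visualized with binned 2D histograms instead of KDE contours (a sketch reusing the DataFrame df from the previous cell):

fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
for d in df["particle velocity"]:
    sns.histplot(d, x="vx", y="vy", stat="density", ax=ax, alpha=0.5)
ax.set(title="Pseudo distributions, 2 classes (2D histograms)")
plt.show()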