Generating Distribution Functions
NumPy provides efficient and flexible tools for random sampling. Earlier versions of numpy (< v1.17) uses numpy.random
methods directly (e.g. numpy.random.normal
); users of newer versions are recommended to use the Generator
instance with default_rng
and call the various methods on it to obtain samples from different distributions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(rc={'figure.figsize':(10, 3)})
# For reproducibility
seed = 1234
rng = np.random.default_rng(seed)
One-dimensional Distributions
Discrete distribution
In the simplest case, if we have a discrete distribution (a set of values with associated probabilities), we can sample points like the following:
population = [1, 2, 3, 4, 5]
probabilities = [0.1, 0.2, 0.3, 0.25, 0.15]
samples = rng.choice(population, size=500, p=probabilities)
df_discrete = pd.DataFrame(samples, columns=["x"])
hp = sns.histplot(df_discrete,
x="x",
stat="probability",
discrete=True
)
hp.set(title="Discrete Distribution");

Summation of two Maxwellian distributions
mean1, std1 = -2, 2
mean2, std2 = 4, 1
n1 = 9000
n2 = 1000
samples1 = rng.normal(mean1, std1, size=n1)
samples2 = rng.normal(mean2, std2, size=n2)
samples = np.concatenate([samples1, samples2])
np.random.shuffle(samples) # Shuffle to mix the samples randomly
df_double_peak = pd.DataFrame(samples, columns=["x"])
hp = sns.histplot(df_double_peak,
x="x",
stat="density",
kde=True
)
hp.set(title="Mixture of two Maxwellian Distributions");

Two-dimensional Distributions
2D Maxwellian
mean = [0, 0]
cov = [[1, 0], [0, 1]]
n1 = 5000
samples = rng.multivariate_normal(mean, cov, size=n1)
df_double_peak_2d = pd.DataFrame(samples, columns=["x", "y"])
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
hp = sns.histplot(df_double_peak_2d,
x="x", y="y",
stat="density",
ax=ax,
)
hp.set(title="2D Maxwellian Distribution")
plt.show()

2D double-peak Maxwellian
Here is an analytical construction of a mixture of two Maxwellian distribution in 2D. Note that the kernel density estimation in high dimension is an expensive operation.
mean = [0, 0]
cov = [[1, 0], [0, 1]]
n1 = 8000
samples1 = rng.multivariate_normal(mean, cov, size=n1)
mean = [3, 3]
cov = [[1, 0], [0, 1]]
n2 = 2000
samples2 = rng.multivariate_normal(mean, cov, size=n2)
samples = np.concatenate((samples1, samples2))
df_double_peak_2d = pd.DataFrame(samples, columns=["x", "y"])
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
hp = sns.histplot(df_double_peak_2d,
x="x", y="y",
stat="density",
ax=ax
)
hp.set(title="2D Double Peak Distribution")
sns.kdeplot(df_double_peak_2d, x="x", y="y")
plt.show()

Generating Pseudo-Data from VDFpy
In VDFpy
, we provide a utility function make_clusters
for generating pseudo-data for testing the clustering algorithms.
1D
For instance, to generate 10 samples of one-dimensional VDFs from 2 clusters:
from vdfpy.generator import make_clusters
df = make_clusters(n_clusters=2, n_dims=1, n_points=100, n_samples=3)
df
class | particle velocity | density | bulk velocity | temperature | |
---|---|---|---|---|---|
0 | 1 | vx 0 -0.947819 1 0.667292 2 -1.3... | 85.0 | 0.075374 | 1.139530 |
1 | 2 | vx 0 0.241391 1 1.300642 2 -0.8... | 94.0 | 1.379030 | 2.182799 |
2 | 2 | vx 0 0.860236 1 -1.186913 2 -1.1... | 27.0 | 1.073050 | 2.180692 |
hp = sns.kdeplot(df["particle velocity"][0], x="vx")
[sns.kdeplot(d, x="vx") for d in df["particle velocity"][1:]]
hp.set(title="Pseudo distributions, 2 classes");

2D
To generate 3 samples of two-dimensional distributions from 2 classes:
df = make_clusters(n_clusters=2, n_dims=2, n_points=30, n_samples=3)
df
class | particle velocity | density | bulk velocity | temperature | |
---|---|---|---|---|---|
0 | 1 | vx vy 0 0.459237 -0.416954 ... | 16.0 | vx 0.297316 vy -0.380066 dtype: float64 | 0.861148 |
1 | 2 | vx vy 0 1.631649 -1.420858 ... | 36.0 | vx 1.351862 vy 1.057606 dtype: float64 | 2.082108 |
2 | 2 | vx vy 0 -0.123663 1.762656 1 ... | 10.0 | vx 1.003698 vy 1.231401 dtype: float64 | 2.307296 |
fig, ax = plt.subplots(figsize=(7, 6), layout="constrained")
hp = sns.kdeplot(df["particle velocity"][0], x="vx", y="vy", ax=ax, fill=True)
[sns.kdeplot(d, x="vx", y="vy", ax=ax, fill=True, alpha=0.5) for d in df["particle velocity"][1:]]
hp.set(title="Pseudo distributions, 1 class");

More detailed usages can be found in the documentation. Also note that kernel density estimation in high dimension (N > 1) is a very expensive operation; users are recommended to use histplot
instead of kdeplot
in general.