The Dynamic Dataset Generator (DDG) is a Python framework for generating dynamic datasets with controllable characteristics. It is specifically designed for benchmarking and evaluating clustering algorithms in dynamic environments where data distributions evolve over time.
DDG simulates realistic dynamic scenarios using Dynamic Gaussian Components (DGCs), which can vary in location, scale, rotation, and weight, enabling the creation of diverse and challenging benchmark datasets.
Paper: Clustering in Dynamic Environments: A Framework for Benchmark Dataset Generation With Heterogeneous Changes
Published in: GECCO 2024 (Genetic and Evolutionary Computation Conference)
Danial Yazdani et al., 2024
- Features
- Installation
- Quick Start
- Class Reference
- Configuration Guide
- Dynamic Change Types
- Examples
- Citation
- License
- Contact
- Multiple Dynamic Gaussian Components (DGCs): Generate multimodal data distributions
- Gradual Local Changes: Smooth, correlated changes to individual DGC parameters
- Severe Global Changes: Abrupt changes affecting all DGCs simultaneously
- Configurable Dynamics: Control over number of DGCs, variables, and clusters
- Rotation Support: Full rotation matrix control for each DGC
- Boundary Control: Reflect method ensures parameters stay within valid ranges (Equation 4)
- Performance Tracking: Built-in evaluation and performance measurement
- Reproducible Results: Seed-based random number generation
- Vectorized Operations: Optimized NumPy implementation for efficiency
- Python 3.8 or later
- NumPy 1.20 or later
- Clone or download this repository:
    git clone https://github.com/Danial-Yazdani/DDG-Python.git
    cd DDG-Python
- Install dependencies:
    pip install numpy
- (Optional) Install in development mode:
    pip install -e .
from DDG_Python import DDG
import numpy as np
# Initialize DDG with default settings
ddg = DDG()
# Access the generated dataset
print(f"Dataset shape: {ddg.dataset.shape}")
print(f"Number of DGCs: {ddg.DGC_number}")
print(f"Number of clusters: {ddg.cluster_number}")
# Evaluate a random clustering solution
solution = np.random.uniform(-70, 70, ddg.cluster_number * ddg.num_of_variables)
fitness = ddg.clustering_evaluation(solution)
print(f"Fitness: {fitness[0]:.4f}")

The main class, DDG, encapsulates all DDG functionality.

ddg = DDG()

This initializes DDG with default parameters. Customize it by modifying attributes after initialization or by editing the __init__ method.
| Attribute | Type | Description |
|---|---|---|
| `dataset` | ndarray | Current dataset [data_size × num_of_variables] |
| `dgc` | list[DGC] | List of Dynamic Gaussian Components |
| `DGC_number` | int | Number of DGCs |
| `num_of_variables` | int | Dimensionality of data |
| `cluster_number` | int | Number of clusters |
| `FE` | int | Current function evaluation count |
| `current_best_solution` | ndarray | Best solution found |
| `current_best_solution_value` | float | Best fitness value |
The method clustering_evaluation(X) evaluates clustering solutions using the sum of intra-cluster distances.
# Single solution
fitness = ddg.clustering_evaluation(solution) # Returns array of shape (1,)
# Multiple solutions
solutions = np.random.uniform(-70, 70, (10, ddg.cluster_number * ddg.num_of_variables))
fitness = ddg.clustering_evaluation(solutions)  # Returns array of shape (10,)

Parameters:
X: Solution array [n_solutions × (n_clusters × n_variables)] or [n_clusters × n_variables]
Returns: NumPy array of fitness values (lower is better)
Side Effects:
- Increments the `FE` counter
- May trigger environmental changes
- Updates `current_best_solution` if improved
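
To make the objective concrete, here is a minimal NumPy sketch of a sum-of-intra-cluster-distances evaluation. It assumes that each data point is assigned to its nearest cluster center, that distances are Euclidean, and that solutions are passed as flat vectors. The function name `sum_intra_cluster_distances` is hypothetical and this is not DDG's implementation: it leaves out the `FE` counter, environmental changes, and best-solution tracking.

```python
import numpy as np

def sum_intra_cluster_distances(solutions, data, n_clusters):
    """Illustrative objective (not DDG's code): sum of distances from each
    point to its nearest cluster center, for one or more flat solutions."""
    data = np.asarray(data, dtype=float)
    n_variables = data.shape[1]
    # Accept a single flat solution or a batch of flat solutions.
    solutions = np.atleast_2d(np.asarray(solutions, dtype=float))
    centers = solutions.reshape(len(solutions), n_clusters, n_variables)
    # Pairwise distances with shape (n_solutions, n_points, n_clusters).
    diffs = data[None, :, None, :] - centers[:, None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)
    # Assign each point to its nearest center and sum those distances.
    return distances.min(axis=2).sum(axis=1)

data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
solution = np.array([0.0, 1.0, 10.0, 0.0])  # two centers: (0, 1) and (10, 0)
fitness = sum_intra_cluster_distances(solution, data, n_clusters=2)
print(fitness)
```

In this toy case the first two points are each one unit from the center at (0, 1) and the third point sits on the center at (10, 0), so the single fitness value is 2.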
The DGC class represents a single Dynamic Gaussian Component.
| Attribute | Description |
|---|---|
| `center` | Mean position (ndarray) |
| `weight` | Probability weight (float) |
| `sigma` | Standard deviations (ndarray) |
| `rotation_matrix` | Rotation matrix (ndarray) |
| `theta_matrix` | Rotation angles (ndarray) |
| `shift_severity` | Center shift magnitude |
| `shift_correlation_factor` | Movement correlation (ρ) |
| `previous_shift_direction` | Last movement direction |
| `sigma_severity` | Sigma change magnitude |
| `sigma_direction` | Sigma change directions |
| `weight_severity` | Weight change magnitude |
| `weight_direction` | Weight change direction |
| `rotation_severity` | Rotation change magnitude |
| `rotation_direction` | Rotation change directions |
| `local_change_likelihood` | Probability of local change |
| `direction_change_probability` | Probability of direction flip |
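
The sketch below illustrates how `center`, `sigma`, `rotation_matrix`, and `weight` could jointly define a dataset. It assumes each DGC is a Gaussian whose axis-aligned noise is scaled by `sigma`, rotated, and shifted to `center`, and that samples are allocated to DGCs in proportion to their weights. The helper `sample_from_dgcs` is hypothetical; DDG's own sampling routine may differ.

```python
import numpy as np

def sample_from_dgcs(centers, sigmas, rotations, weights, n_samples, rng):
    """Illustrative mixture sampler (an assumption, not DDG's code)."""
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()
    # Pick the generating component of every sample in proportion to weight.
    labels = rng.choice(len(weights), size=n_samples, p=probs)
    samples = np.empty((n_samples, centers.shape[1]))
    for k in range(len(weights)):
        idx = np.flatnonzero(labels == k)
        # Axis-aligned Gaussian noise scaled by the per-variable sigma.
        noise = rng.standard_normal((idx.size, centers.shape[1])) * sigmas[k]
        # Rotate the noise, then translate it to the component center.
        samples[idx] = noise @ rotations[k].T + centers[k]
    return samples, labels

rng = np.random.default_rng(2151)
theta = np.pi / 6
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
centers = np.array([[-30.0, 0.0], [40.0, 20.0]])
sigmas = np.array([[7.0, 20.0], [10.0, 10.0]])
rotations = np.array([rot, np.eye(2)])
samples, labels = sample_from_dgcs(centers, sigmas, rotations, [1.0, 3.0], 1000, rng)
print(samples.shape, np.bincount(labels, minlength=2))
```

With weights 1 and 3, roughly three quarters of the points should come from the second component.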
Modify these attributes after creating a DDG instance or edit the __init__ method.

General settings:

| Parameter | Default | Description |
|---|---|---|
| `seed` | 2151 | Random seed for reproducibility |
| `max_evals` | 500000 | Maximum function evaluations |
| `num_of_variables` | 2 | Dimensionality of data |
| `DGC_number` | 7 | Number of Dynamic Gaussian Components |
| `cluster_number` | 5 | Number of clusters |

Parameter bounds:

| Parameter | Default | Description |
|---|---|---|
| `min_coordinate` / `max_coordinate` | -70 / 70 | DGC center bounds |
| `min_sigma` / `max_sigma` | 7 / 20 | Standard deviation bounds |
| `min_weight` / `max_weight` | 1 / 3 | DGC weight bounds |
| `min_angle` / `max_angle` | -π / π | Rotation angle bounds |

Local change severity (ranges from which each DGC's values are drawn):

| Parameter | Default | Description |
|---|---|---|
| `local_shift_severity_range` | [0.1, 0.2] | Center shift magnitude |
| `relocation_correlation_range` | [0.99, 0.995] | Movement correlation (ρ) |
| `local_sigma_severity_range` | [0.05, 0.1] | Sigma change magnitude |
| `local_weight_severity_range` | [0.02, 0.05] | Weight change magnitude |
| `local_rotation_severity_range` | [π/360, π/180] | Rotation change magnitude |

Global change severity:

| Parameter | Default | Description |
|---|---|---|
| `global_shift_severity_value` | 10 | Global center shift |
| `global_sigma_severity_value` | 5 | Global sigma change |
| `global_weight_severity_value` | 0.5 | Global weight change |
| `global_angle_severity_value` | π/4 | Global rotation change |
| `global_severity_control` | 0.1 | Beta distribution parameter (α=β) |

Change likelihoods:

| Parameter | Default | Description |
|---|---|---|
| `local_temporal_severity_range` | [0.05, 0.1] | Probability of local change per DGC |
| `global_change_likelihood` | 0.0001 | Probability of global change |
| `DGC_number_change_likelihood` | 0.0001 | Probability of DGC count change |
| `variable_number_change_likelihood` | 0.0001 | Probability of dimension change |
| `cluster_number_change_likelihood` | 0.0001 | Probability of cluster count change |
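
As a rough worked example of what these defaults imply, the snippet below estimates how often each rare change fires over a full run. It assumes every likelihood is an independent probability checked once per function evaluation and that the run uses all `max_evals` evaluations. Both are assumptions made for illustration, not statements about DDG's internals.

```python
# Assumption: every likelihood is an independent per-evaluation probability.
max_evals = 500_000
likelihoods = {
    "global_change_likelihood": 0.0001,
    "DGC_number_change_likelihood": 0.0001,
    "variable_number_change_likelihood": 0.0001,
    "cluster_number_change_likelihood": 0.0001,
}

summary = {}
for name, p in likelihoods.items():
    expected_count = max_evals * p  # mean of a Binomial(max_evals, p) count
    mean_gap = 1.0 / p              # mean of a geometric waiting time
    summary[name] = (expected_count, mean_gap)
    print(f"{name}: about {expected_count:.0f} changes, "
          f"one per {mean_gap:.0f} evaluations on average")
```

Under these assumptions each of the four events occurs about 50 times per run, or once every 10,000 evaluations on average.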

Sampling:

| Parameter | Default | Description |
|---|---|---|
| `data_size` | 1000 | Dataset size |
| `frequent_sampling_likelihood` | 0.1 | Incremental sampling probability |
| `incremental_sampling_size` | 50 (5% of `data_size`) | Samples added per incremental update |
Gradual local changes are applied to each DGC independently, with probability local_change_likelihood, at every function evaluation.
- Center Relocation (Equations 5-6): Correlated random walk
    v(t+1) = normalize((1-ρ)·random + ρ·v(t))
    c(t+1) = c(t) + |N(0,1)| · severity · v(t+1)
- Sigma Changes (Equation 7): Direction-based incremental
- Weight Changes (Equation 8): Direction-based incremental
- Rotation Changes (Equation 9): Angle adjustments
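
The center relocation rule can be sketched in a few lines of NumPy. This is an illustration of the two update formulas for v(t+1) and c(t+1) only. It assumes the `random` term is a random unit vector and it skips boundary handling. The function name `relocate_center` is hypothetical.

```python
import numpy as np

def relocate_center(center, prev_direction, severity, rho, rng):
    """One relocation step of a correlated random walk (illustrative only;
    boundary reflection is left out)."""
    # Assumption: the random term is a uniformly random unit vector.
    random_dir = rng.standard_normal(center.shape)
    random_dir /= np.linalg.norm(random_dir)
    # Blend the fresh direction with the previous one, then normalize.
    direction = (1.0 - rho) * random_dir + rho * prev_direction
    direction /= np.linalg.norm(direction)
    # Step length is a half-normal draw scaled by the shift severity.
    step = abs(rng.standard_normal()) * severity
    return center + step * direction, direction

rng = np.random.default_rng(2151)
center = np.zeros(2)
direction = np.array([1.0, 0.0])
cosines = []
for _ in range(200):
    center, new_direction = relocate_center(
        center, direction, severity=0.15, rho=0.99, rng=rng
    )
    cosines.append(float(new_direction @ direction))
    direction = new_direction
print(f"smallest cosine between consecutive directions: {min(cosines):.5f}")
```

With ρ close to 1, consecutive directions stay almost parallel, which is why the default correlation range of [0.99, 0.995] produces smooth drifting rather than jittery motion.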
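Severe global changes are applied with probability global_change_likelihood and affect all DGCs at once. Their magnitudes are drawn from a heavy-tail Beta distribution: with α=β=0.1 the density is U-shaped, so most draws land near the extremes.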
change = severity · (2·Beta(α,β) - 1) where α=β=0.1
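
The effect of α=β=0.1 is easiest to see by sampling. The sketch below draws the signed change from the formula above using NumPy's Beta sampler and measures how many draws fall within 10% of the maximum magnitude. The function name `global_change` is hypothetical, and the severity of 10 is borrowed from the default `global_shift_severity_value`.

```python
import numpy as np

def global_change(severity, alpha, size, rng):
    """Signed change: severity * (2 * Beta(alpha, alpha) - 1)."""
    return severity * (2.0 * rng.beta(alpha, alpha, size=size) - 1.0)

rng = np.random.default_rng(2151)
draws = global_change(severity=10.0, alpha=0.1, size=10_000, rng=rng)
near_extremes = float(np.mean(np.abs(draws) > 9.0))
print(f"fraction of draws with |change| > 9: {near_extremes:.2f}")
```

A uniform distribution over the same interval would put only 10% of its draws in that band, so a much larger fraction here reflects the abrupt, large-magnitude character of global changes.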
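The problem structure can also change, each aspect with its own likelihood:
- DGC Count: The number of DGCs can increase or decrease (Equation 14)
- Variable Count: The number of dimensions can change (Equation 15)
- Cluster Count: The target number of clusters can change (Equation 16)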
All parameters use the reflect method:
if value > max: value = 2·max - value
if value < min: value = 2·min - value
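
A direct transcription of the two reflect rules into a small helper looks like this. The function name `reflect` is illustrative. A single reflection only returns a value to the valid range when the overshoot is smaller than the range width, which holds comfortably for the default severities and bounds.

```python
def reflect(value, lower, upper):
    """Mirror an out-of-range value back across the violated bound."""
    if value > upper:
        return 2.0 * upper - value
    if value < lower:
        return 2.0 * lower - value
    return value

print(reflect(75.0, -70.0, 70.0))   # 65.0
print(reflect(-72.5, -70.0, 70.0))  # -67.5
print(reflect(10.0, -70.0, 70.0))   # 10.0
```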
from DDG_Python import DDG
import numpy as np

# Initialize
ddg = DDG()

# Simple optimization loop
best_fitness = np.inf
for iteration in range(100):
    # Generate random solutions (shape is recomputed each iteration because
    # cluster_number and num_of_variables can change over time)
    solutions = np.random.uniform(
        ddg.min_coordinate, ddg.max_coordinate,
        (20, ddg.cluster_number * ddg.num_of_variables)
    )
    # Evaluate
    fitness = ddg.clustering_evaluation(solutions)
    # Track best
    if np.nanmin(fitness) < best_fitness:
        best_fitness = np.nanmin(fitness)
    # Check termination
    if ddg.FE >= ddg.max_evals:
        break

print(f"Best fitness found: {best_fitness:.4f}")
print(f"Function evaluations used: {ddg.FE}")

ddg = DDG()
# Disable all changes
ddg.global_change_likelihood = 0
ddg.DGC_number_change_likelihood = 0
ddg.variable_number_change_likelihood = 0
ddg.cluster_number_change_likelihood = 0
ddg.frequent_sampling_likelihood = 0
for dgc in ddg.dgc:
    dgc.local_change_likelihood = 0

ddg = DDG()
# Increase change frequency
ddg.global_change_likelihood = 0.001
# Increase local change likelihood for all DGCs
for dgc in ddg.dgc:
    dgc.local_change_likelihood = 0.2

import matplotlib.pyplot as plt
import numpy as np
from DDG_Python import DDG

ddg = DDG()
# Only works for 2D data
if ddg.num_of_variables == 2:
    plt.figure(figsize=(10, 8))
    plt.scatter(ddg.dataset[:, 0], ddg.dataset[:, 1], alpha=0.5, s=10)
    # Plot DGC centers
    centers = np.array([dgc.center for dgc in ddg.dgc])
    plt.scatter(centers[:, 0], centers[:, 1], c='red', s=100, marker='x', linewidths=3)
    plt.xlabel('Variable 1')
    plt.ylabel('Variable 2')
    plt.title('DDG Dataset with DGC Centers')
    plt.grid(True, alpha=0.3)
    plt.show()

ddg = DDG()
# Store DGC center positions over time
center_history = []
# Note: the analysis below assumes the DGC count and the dimensionality
# stay fixed for the whole run
for _ in range(1000):
    solution = np.random.uniform(-70, 70, ddg.cluster_number * ddg.num_of_variables)
    ddg.clustering_evaluation(solution)
    # Record current centers
    centers = [dgc.center.copy() for dgc in ddg.dgc]
    center_history.append(centers)

# Analyze movement of first DGC
dgc0_positions = np.array([h[0] for h in center_history])
print(f"DGC 0 moved from {dgc0_positions[0]} to {dgc0_positions[-1]}")
print(f"Total distance traveled: {np.sum(np.linalg.norm(np.diff(dgc0_positions, axis=0), axis=1)):.2f}")

If you use DDG in your research, please cite:
@inproceedings{yazdani2024clustering,
title={Clustering in Dynamic Environments: A Framework for Benchmark Dataset Generation With Heterogeneous Changes},
author={Yazdani, Danial and Branke, J{\"u}rgen and Meghdadi, Amir Hossein and Omidvar, Mohammad Nabi and Stoean, Catalin and Stoean, Ruxandra and Gandomi, Amir H and Shi, Yuhui},
booktitle={Proceedings of the Genetic and Evolutionary Computation Conference (GECCO)},
year={2024},
organization={ACM}
}

arXiv preprint: arXiv:2402.15731
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
Danial Yazdani
Email: danial.yazdani@gmail.com
GitHub: Danial-Yazdani
- MATLAB Implementation: DDG-MATLAB - Functionally identical MATLAB version
- Both implementations produce consistent results when using the same random seed