Dynamic Dataset Generator (DDG) - Python Implementation

Overview

The Dynamic Dataset Generator (DDG) is a Python framework for generating dynamic datasets with controllable characteristics. It is specifically designed for benchmarking and evaluating clustering algorithms in dynamic environments where data distributions evolve over time.

DDG simulates realistic dynamic scenarios using Dynamic Gaussian Components (DGCs), which can vary in location, scale, rotation, and weight, enabling the creation of diverse and challenging benchmark datasets.

Paper: Clustering in Dynamic Environments: A Framework for Benchmark Dataset Generation With Heterogeneous Changes
Published in: GECCO 2024 (Genetic and Evolutionary Computation Conference)
Danial Yazdani et al., 2024

Features

Multiple Dynamic Gaussian Components (DGCs): Generate multimodal data distributions
Gradual Local Changes: Smooth, correlated changes to individual DGC parameters
Severe Global Changes: Abrupt changes affecting all DGCs simultaneously
Configurable Dynamics: Control over number of DGCs, variables, and clusters
Rotation Support: Full rotation matrix control for each DGC
Boundary Control: Reflect method ensures parameters stay within valid ranges (Equation 4)
Performance Tracking: Built-in evaluation and performance measurement
Reproducible Results: Seed-based random number generation
Vectorized Operations: Optimized NumPy implementation for efficiency

Installation

Prerequisites

Python 3.8 or later
NumPy 1.20 or later

Setup

Clone or download this repository:

```
git clone https://github.com/Danial-Yazdani/DDG-Python.git
cd DDG-Python
```

Install dependencies:
```
pip install numpy
```
(Optional) Install in development mode (this requires a `setup.py` or `pyproject.toml` in the repository root):
```
pip install -e .
```

Quick Start

```python
from DDG_Python import DDG
import numpy as np

# Initialize DDG with default settings
ddg = DDG()

# Access the generated dataset
print(f"Dataset shape: {ddg.dataset.shape}")
print(f"Number of DGCs: {ddg.DGC_number}")
print(f"Number of clusters: {ddg.cluster_number}")

# Evaluate a random clustering solution
solution = np.random.uniform(-70, 70, ddg.cluster_number * ddg.num_of_variables)
fitness = ddg.clustering_evaluation(solution)
print(f"Fitness: {fitness[0]:.4f}")
```
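
The import in the Quick Start assumes the module is reachable under the name `DDG_Python`. If your copy of the source file has a hyphen in its name (for example `DDG-Python.py`), a plain `import` statement cannot refer to it. One possible workaround, sketched here with the standard-library `importlib`, is to load the file by path. `load_module_from_path` is a hypothetical helper and not part of DDG; renaming the file to `DDG_Python.py` is the simpler alternative.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_module_from_path(path, module_name):
    """Load a Python source file under `module_name`, whatever its filename."""
    spec = importlib.util.spec_from_file_location(module_name, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load a module from {path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register so later imports reuse it
    spec.loader.exec_module(module)
    return module


# Self-contained demonstration with a throwaway file whose name contains a hyphen.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo-module.py"
    demo_file.write_text("VALUE = 42\n")
    demo = load_module_from_path(demo_file, "demo_module")

print(demo.VALUE)  # prints 42

# With the real file (assumed to sit in the current directory) this would be:
# DDG = load_module_from_path("DDG-Python.py", "DDG_Python").DDG
```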

Class Reference

`DDG` Class

The main class that encapsulates all DDG functionality.

Constructor

```python
ddg = DDG()
```

Initializes DDG with default parameters. To customize, modify attributes after initialization or edit the `__init__` method.

Key Attributes

| Attribute | Type | Description |
|---|---|---|
| `dataset` | ndarray | Current dataset [data_size × num_of_variables] |
| `dgc` | list[DGC] | List of Dynamic Gaussian Components |
| `DGC_number` | int | Number of DGCs |
| `num_of_variables` | int | Dimensionality of data |
| `cluster_number` | int | Number of clusters |
| `FE` | int | Current function evaluation count |
| `current_best_solution` | ndarray | Best solution found |
| `current_best_solution_value` | float | Best fitness value |

Methods

`clustering_evaluation(X)`

Evaluates one or more clustering solutions using the sum of intra-cluster distances.

```python
# Single solution
fitness = ddg.clustering_evaluation(solution)  # Returns array of shape (1,)

# Multiple solutions
solutions = np.random.uniform(-70, 70, (10, ddg.cluster_number * ddg.num_of_variables))
fitness = ddg.clustering_evaluation(solutions)  # Returns array of shape (10,)
```

Parameters:

X: Solution array of shape [n_solutions × (n_clusters × n_variables)], or a single flattened solution of length n_clusters × n_variables

Returns: NumPy array of fitness values (lower is better)

Side Effects:

Increments FE counter
May trigger environmental changes
Updates current_best_solution if improved
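
To illustrate what a sum of intra-cluster distances measures, here is a minimal standalone sketch. It assumes Euclidean distance and that each data point is assigned to its nearest cluster center; DDG's own implementation may differ in detail, and this function reproduces none of the side effects listed above. The name `sum_intra_cluster_distances` is hypothetical.

```python
import numpy as np


def sum_intra_cluster_distances(dataset, solutions, n_clusters):
    """Sum of distances from each point to its nearest center, per solution.

    Assumption: each solution is a flattened array of n_clusters centers.
    """
    dataset = np.asarray(dataset, dtype=float)
    solutions = np.atleast_2d(np.asarray(solutions, dtype=float))
    n_variables = dataset.shape[1]
    # centers has shape (n_solutions, n_clusters, n_variables)
    centers = solutions.reshape(solutions.shape[0], n_clusters, n_variables)
    # diffs has shape (n_solutions, n_points, n_clusters, n_variables)
    diffs = dataset[None, :, None, :] - centers[:, None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)
    # nearest center per point, then sum over all points
    return distances.min(axis=2).sum(axis=1)


data = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
single = np.array([0.0, 0.0, 10.0, 0.0])           # centers (0,0) and (10,0)
batch = np.array([[0.0, 0.0, 10.0, 0.0],
                  [5.0, 0.0, 5.0, 0.0]])           # second solution: both centers at (5,0)

print(sum_intra_cluster_distances(data, single, n_clusters=2))  # [1.]
print(sum_intra_cluster_distances(data, batch, n_clusters=2))   # [ 1. 14.]
```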

`DGC` Class (Inner Class)

Represents a single Dynamic Gaussian Component.

Attributes

| Attribute | Description |
|---|---|
| `center` | Mean position (ndarray) |
| `weight` | Probability weight (float) |
| `sigma` | Standard deviations (ndarray) |
| `rotation_matrix` | Rotation matrix (ndarray) |
| `theta_matrix` | Rotation angles (ndarray) |
| `shift_severity` | Center shift magnitude |
| `shift_correlation_factor` | Movement correlation (ρ) |
| `previous_shift_direction` | Last movement direction |
| `sigma_severity` | Sigma change magnitude |
| `sigma_direction` | Sigma change directions |
| `weight_severity` | Weight change magnitude |
| `weight_direction` | Weight change direction |
| `rotation_severity` | Rotation change magnitude |
| `rotation_direction` | Rotation change directions |
| `local_change_likelihood` | Probability of local change |
| `direction_change_probability` | Probability of direction flip |
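
To make the roles of `center`, `sigma`, `rotation_matrix`, and `weight` concrete, the following sketch draws samples from a weighted mixture of rotated Gaussians. How DDG actually combines these attributes is not shown in this README, so the details here are assumptions: weights are normalized into selection probabilities, and each sample is scaled by sigma and then rotated. The function names are hypothetical.

```python
import numpy as np


def rotation_matrix_2d(theta):
    """Standard 2-D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def sample_from_components(centers, sigmas, rotations, weights, n_samples, rng):
    """Draw n_samples points from a weighted mixture of rotated Gaussians."""
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()              # assumption: weights act as mixing proportions
    labels = rng.choice(len(weights), size=n_samples, p=probs)
    z = rng.standard_normal((n_samples, centers.shape[1]))
    scaled = z * sigmas[labels]                  # per-axis standard deviations
    rotated = np.einsum("nij,nj->ni", rotations[labels], scaled)
    return centers[labels] + rotated, labels


rng = np.random.default_rng(2151)
centers = np.array([[-30.0, 0.0], [40.0, 20.0]])
sigmas = np.array([[7.0, 20.0], [10.0, 10.0]])
rotations = np.stack([rotation_matrix_2d(np.pi / 4), rotation_matrix_2d(0.0)])
weights = np.array([1.0, 3.0])

samples, labels = sample_from_components(centers, sigmas, rotations, weights, 1000, rng)
print(samples.shape)                     # (1000, 2)
print(np.bincount(labels, minlength=2))  # roughly a 1:3 split between the two components
```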

Configuration Guide

Modify these attributes after creating a DDG instance, or edit the `__init__` method.

Basic Settings

| Parameter | Default | Description |
|---|---|---|
| `seed` | 2151 | Random seed for reproducibility |
| `max_evals` | 500000 | Maximum function evaluations |
| `num_of_variables` | 2 | Dimensionality of data |
| `DGC_number` | 7 | Number of Dynamic Gaussian Components |
| `cluster_number` | 5 | Number of clusters |

Bounds

| Parameter | Default | Description |
|---|---|---|
| `min_coordinate` / `max_coordinate` | -70 / 70 | DGC center bounds |
| `min_sigma` / `max_sigma` | 7 / 20 | Standard deviation bounds |
| `min_weight` / `max_weight` | 1 / 3 | DGC weight bounds |
| `min_angle` / `max_angle` | -π / π | Rotation angle bounds |

Change Severity (Local/Gradual)

| Parameter | Default | Description |
|---|---|---|
| `local_shift_severity_range` | [0.1, 0.2] | Center shift magnitude |
| `relocation_correlation_range` | [0.99, 0.995] | Movement correlation (ρ) |
| `local_sigma_severity_range` | [0.05, 0.1] | Sigma change magnitude |
| `local_weight_severity_range` | [0.02, 0.05] | Weight change magnitude |
| `local_rotation_severity_range` | [π/360, π/180] | Rotation change magnitude |

Change Severity (Global/Severe)

| Parameter | Default | Description |
|---|---|---|
| `global_shift_severity_value` | 10 | Global center shift |
| `global_sigma_severity_value` | 5 | Global sigma change |
| `global_weight_severity_value` | 0.5 | Global weight change |
| `global_angle_severity_value` | π/4 | Global rotation change |
| `global_severity_control` | 0.1 | Beta distribution parameter (α=β) |

Change Likelihoods

| Parameter | Default | Description |
|---|---|---|
| `local_temporal_severity_range` | [0.05, 0.1] | Probability of local change per DGC |
| `global_change_likelihood` | 0.0001 | Probability of global change |
| `DGC_number_change_likelihood` | 0.0001 | Probability of DGC count change |
| `variable_number_change_likelihood` | 0.0001 | Probability of dimension change |
| `cluster_number_change_likelihood` | 0.0001 | Probability of cluster count change |
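
These likelihoods are easier to interpret as expected counts. The short calculation below assumes that each likelihood is checked as an independent trial once per function evaluation; if DDG counts evaluations differently, the numbers shift accordingly. The helper names are hypothetical.

```python
def expected_change_count(likelihood, evaluations):
    """Expected number of changes over a run of independent per-evaluation trials."""
    return likelihood * evaluations


def probability_of_any_change(likelihood, evaluations):
    """Probability that at least one change occurs within the given evaluations."""
    return 1.0 - (1.0 - likelihood) ** evaluations


global_change_likelihood = 0.0001
max_evals = 500_000
dgc_number = 7

# About 50 global changes are expected over the default evaluation budget.
print(round(expected_change_count(global_change_likelihood, max_evals), 6))

# On average one global change every 1 / 0.0001 = 10000 evaluations.
print(round(1.0 / global_change_likelihood))

# Chance of seeing at least one global change within 10000 evaluations (roughly 0.63).
print(round(probability_of_any_change(global_change_likelihood, 10_000), 3))

# With 7 DGCs and a per-DGC local likelihood between 0.05 and 0.1,
# between about 0.35 and 0.7 local changes are expected per evaluation.
print(round(dgc_number * 0.05, 2), round(dgc_number * 0.1, 2))
```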

Dataset Settings

| Parameter | Default | Description |
|---|---|---|
| `data_size` | 1000 | Dataset size |
| `frequent_sampling_likelihood` | 0.1 | Incremental sampling probability |
| `incremental_sampling_size` | 50 (5%) | Samples added per incremental update |

Dynamic Change Types

1. Gradual Local Changes (per DGC)

Applied with probability local_change_likelihood for each DGC at every function evaluation.

Center Relocation (Equations 5-6): Correlated random walk

```
v(t+1) = normalize((1-ρ)·random + ρ·v(t))
c(t+1) = c(t) + |N(0,1)| · severity · v(t+1)
```

Sigma Changes (Equation 7): Direction-based incremental
Weight Changes (Equation 8): Direction-based incremental
Rotation Changes (Equation 9): Angle adjustments
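
The center relocation rule can be sketched in a few lines of NumPy. This is an illustration of the two update formulas as written above, not DDG's code. It assumes that `random` denotes a random unit vector and that `v(t)` is stored with unit length; `relocate_center` is a hypothetical name.

```python
import numpy as np


def relocate_center(center, previous_direction, severity, rho, rng):
    """One correlated random-walk step for a DGC center."""
    random_direction = rng.standard_normal(center.shape)
    random_direction /= np.linalg.norm(random_direction)      # assumed: unit-length random vector
    direction = (1.0 - rho) * random_direction + rho * previous_direction
    direction /= np.linalg.norm(direction)                     # the normalize(...) step
    step_length = abs(rng.standard_normal()) * severity        # |N(0,1)| · severity
    return center + step_length * direction, direction


rng = np.random.default_rng(2151)
center = np.array([10.0, -5.0])
direction = np.array([1.0, 0.0])

# With rho close to 1 the walk keeps heading in almost the same direction.
for _ in range(5):
    center, direction = relocate_center(center, direction, severity=0.2, rho=0.99, rng=rng)
    print(center, direction)
```

In this sketch, a high ρ (the default range is [0.99, 0.995]) makes consecutive directions nearly identical, so the center drifts smoothly rather than jittering.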

2. Severe Global Changes (all DGCs)

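Applied with probability `global_change_likelihood`. Uses a heavy-tail Beta distribution: with α = β = `global_severity_control` = 0.1 the density is U-shaped, so most sampled changes lie close to ±severity.

```
change = severity · (2·Beta(α,β) - 1)  where α=β=0.1
```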
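
The effect of the small Beta parameter is easy to check numerically. The sketch below samples the formula above with NumPy's `Generator.beta`; `global_change` is a hypothetical helper and only illustrates the distribution of change magnitudes, not how DDG applies them to each parameter.

```python
import numpy as np


def global_change(severity, alpha, size, rng):
    """Sample severity * (2 * Beta(alpha, alpha) - 1), a value in [-severity, severity]."""
    return severity * (2.0 * rng.beta(alpha, alpha, size=size) - 1.0)


rng = np.random.default_rng(2151)
changes = global_change(severity=10.0, alpha=0.1, size=10_000, rng=rng)

print(changes.min(), changes.max())        # both within [-10, 10]
print(np.mean(np.abs(changes) > 9.0))      # fraction of draws near the extremes
```

With α = β = 0.1, well over half of the draws fall within 10% of ±severity, which is what makes a global change abrupt.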

3. Structural Changes

DGC Count: Can increase/decrease (Equation 14)
Variable Count: Dimensions can change (Equation 15)
Cluster Count: Target clusters can change (Equation 16)

Boundary Control (Equation 4)

All parameters use the reflect method:

```
if value > max: value = 2·max - value
if value < min: value = 2·min - value
```
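
A vectorized version of the reflect rule might look as follows. This is a sketch of the two conditions above rather than DDG's own code, and `reflect` is a hypothetical name.

```python
import numpy as np


def reflect(value, lower, upper):
    """Mirror values that overshoot a bound back into [lower, upper] (single pass)."""
    value = np.asarray(value, dtype=float)
    value = np.where(value > upper, 2.0 * upper - value, value)
    value = np.where(value < lower, 2.0 * lower - value, value)
    return value


print(reflect(75.0, -70.0, 70.0))                   # 65.0
print(reflect(-80.0, -70.0, 70.0))                  # -60.0
print(reflect([0.0, 71.5, -70.5], -70.0, 70.0))     # [  0.   68.5 -69.5]
```

Each rule is applied once, so a value that overshoots by more than the width of the range could still end up outside it; handling that case would need repeated reflection.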

Examples

Example 1: Basic Usage

```python
from DDG_Python import DDG
import numpy as np

# Initialize
ddg = DDG()

# Simple optimization loop
best_fitness = np.inf
for iteration in range(100):
    # Generate random solutions
    solutions = np.random.uniform(
        ddg.min_coordinate, ddg.max_coordinate,
        (20, ddg.cluster_number * ddg.num_of_variables)
    )

    # Evaluate
    fitness = ddg.clustering_evaluation(solutions)

    # Track best
    if np.nanmin(fitness) < best_fitness:
        best_fitness = np.nanmin(fitness)

    # Check termination
    if ddg.FE >= ddg.max_evals:
        break

print(f"Best fitness found: {best_fitness:.4f}")
print(f"Function evaluations used: {ddg.FE}")
```

Example 2: Static Environment (No Changes)

```python
from DDG_Python import DDG

ddg = DDG()

# Disable all global, structural, and sampling changes
ddg.global_change_likelihood = 0
ddg.DGC_number_change_likelihood = 0
ddg.variable_number_change_likelihood = 0
ddg.cluster_number_change_likelihood = 0
ddg.frequent_sampling_likelihood = 0

# Disable gradual local changes for every DGC
for dgc in ddg.dgc:
    dgc.local_change_likelihood = 0
```

Example 3: High-Frequency Dynamic Environment

```python
from DDG_Python import DDG

ddg = DDG()

# Increase global change frequency (default is 0.0001)
ddg.global_change_likelihood = 0.001

# Increase local change likelihood for all DGCs
for dgc in ddg.dgc:
    dgc.local_change_likelihood = 0.2
```

Example 4: Visualize Dataset

```python
import matplotlib.pyplot as plt
import numpy as np
from DDG_Python import DDG

ddg = DDG()

# Only works for 2D data
if ddg.num_of_variables == 2:
    plt.figure(figsize=(10, 8))
    plt.scatter(ddg.dataset[:, 0], ddg.dataset[:, 1], alpha=0.5, s=10)

    # Plot DGC centers
    centers = np.array([dgc.center for dgc in ddg.dgc])
    plt.scatter(centers[:, 0], centers[:, 1], c='red', s=100, marker='x', linewidths=3)

    plt.xlabel('Variable 1')
    plt.ylabel('Variable 2')
    plt.title('DDG Dataset with DGC Centers')
    plt.grid(True, alpha=0.3)
    plt.show()
```

Example 5: Track Changes Over Time

```python
import numpy as np
from DDG_Python import DDG

ddg = DDG()

# Keep the number of DGCs and variables fixed so the recorded centers
# all have the same shape and can be stacked into one array
ddg.DGC_number_change_likelihood = 0
ddg.variable_number_change_likelihood = 0

# Store DGC center positions over time
center_history = []

for _ in range(1000):
    solution = np.random.uniform(-70, 70, ddg.cluster_number * ddg.num_of_variables)
    ddg.clustering_evaluation(solution)

    # Record current centers
    centers = [dgc.center.copy() for dgc in ddg.dgc]
    center_history.append(centers)

# Analyze movement of first DGC
dgc0_positions = np.array([h[0] for h in center_history])
print(f"DGC 0 moved from {dgc0_positions[0]} to {dgc0_positions[-1]}")
print(f"Total distance traveled: {np.sum(np.linalg.norm(np.diff(dgc0_positions, axis=0), axis=1)):.2f}")
```

Citation

If you use DDG in your research, please cite:

```bibtex
@inproceedings{yazdani2024clustering,
  title={Clustering in Dynamic Environments: A Framework for Benchmark Dataset Generation With Heterogeneous Changes},
  author={Yazdani, Danial and Branke, J{\"u}rgen and Meghdadi, Amir Hossein and Omidvar, Mohammad Nabi and Stoean, Catalin and Stoean, Ruxandra and Gandomi, Amir H and Shi, Yuhui},
  booktitle={Proceedings of the Genetic and Evolutionary Computation Conference (GECCO)},
  year={2024},
  organization={ACM}
}
```

arXiv preprint: arXiv:2402.15731

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Contact

Danial Yazdani
Email: danial.yazdani@gmail.com
GitHub: Danial-Yazdani

Related Projects

MATLAB Implementation: DDG-MATLAB - Functionally identical MATLAB version
Both implementations produce consistent results when using the same random seed
