Danial-Yazdani/DDG-Python

Dynamic Dataset Generator (DDG) - Python Implementation


Overview

The Dynamic Dataset Generator (DDG) is a Python framework for generating dynamic datasets with controllable characteristics. It is specifically designed for benchmarking and evaluating clustering algorithms in dynamic environments where data distributions evolve over time.

DDG simulates realistic dynamic scenarios using Dynamic Gaussian Components (DGCs), which can vary in location, scale, rotation, and weight, enabling the creation of diverse and challenging benchmark datasets.

Paper: Clustering in Dynamic Environments: A Framework for Benchmark Dataset Generation With Heterogeneous Changes
Published in: GECCO 2024 (Genetic and Evolutionary Computation Conference)
Danial Yazdani et al., 2024


Features

  • Multiple Dynamic Gaussian Components (DGCs): Generate multimodal data distributions
  • Gradual Local Changes: Smooth, correlated changes to individual DGC parameters
  • Severe Global Changes: Abrupt changes affecting all DGCs simultaneously
  • Configurable Dynamics: Control over number of DGCs, variables, and clusters
  • Rotation Support: Full rotation matrix control for each DGC
  • Boundary Control: Reflect method ensures parameters stay within valid ranges (Equation 4)
  • Performance Tracking: Built-in evaluation and performance measurement
  • Reproducible Results: Seed-based random number generation
  • Vectorized Operations: Optimized NumPy implementation for efficiency

Installation

Prerequisites

  • Python 3.8 or later
  • NumPy 1.20 or later

Setup

  1. Clone or download this repository:

    git clone https://github.com/Danial-Yazdani/DDG-Python.git
    cd DDG-Python
  2. Install dependencies:

    pip install numpy
  3. (Optional) Install in development mode:

    pip install -e .

Quick Start

from DDG_Python import DDG
import numpy as np

# Initialize DDG with default settings
ddg = DDG()

# Access the generated dataset
print(f"Dataset shape: {ddg.dataset.shape}")
print(f"Number of DGCs: {ddg.DGC_number}")
print(f"Number of clusters: {ddg.cluster_number}")

# Evaluate a random clustering solution
solution = np.random.uniform(-70, 70, ddg.cluster_number * ddg.num_of_variables)
fitness = ddg.clustering_evaluation(solution)
print(f"Fitness: {fitness[0]:.4f}")

Class Reference

DDG Class

The main class that encapsulates all DDG functionality.

Constructor

ddg = DDG()

Initializes DDG with default parameters. Customize by modifying attributes after initialization or editing the __init__ method.

Key Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| dataset | ndarray | Current dataset [data_size × num_of_variables] |
| dgc | list[DGC] | List of Dynamic Gaussian Components |
| DGC_number | int | Number of DGCs |
| num_of_variables | int | Dimensionality of data |
| cluster_number | int | Number of clusters |
| FE | int | Current function evaluation count |
| current_best_solution | ndarray | Best solution found |
| current_best_solution_value | float | Best fitness value |

Methods

clustering_evaluation(X)

Evaluates one or more clustering solutions using the sum of intra-cluster distances.

# Single solution
fitness = ddg.clustering_evaluation(solution)  # Returns array of shape (1,)

# Multiple solutions
solutions = np.random.uniform(-70, 70, (10, ddg.cluster_number * ddg.num_of_variables))
fitness = ddg.clustering_evaluation(solutions)  # Returns array of shape (10,)

Parameters:

  • X: Solution array of shape [n_solutions × (n_clusters × n_variables)], or a single flat solution vector of length n_clusters × n_variables

Returns: NumPy array of fitness values (lower is better)

Side Effects:

  • Increments FE counter
  • May trigger environmental changes
  • Updates current_best_solution if improved

DGC Class (Inner Class)

Represents a single Dynamic Gaussian Component.

Attributes

| Attribute | Description |
| --- | --- |
| center | Mean position (ndarray) |
| weight | Probability weight (float) |
| sigma | Standard deviations (ndarray) |
| rotation_matrix | Rotation matrix (ndarray) |
| theta_matrix | Rotation angles (ndarray) |
| shift_severity | Center shift magnitude |
| shift_correlation_factor | Movement correlation (ρ) |
| previous_shift_direction | Last movement direction |
| sigma_severity | Sigma change magnitude |
| sigma_direction | Sigma change directions |
| weight_severity | Weight change magnitude |
| weight_direction | Weight change direction |
| rotation_severity | Rotation change magnitude |
| rotation_direction | Rotation change directions |
| local_change_likelihood | Probability of local change |
| direction_change_probability | Probability of direction flip |
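
As a rough illustration of how center, sigma, rotation, and weight could combine to produce data, the sketch below draws 2D points from a weighted mixture of rotated Gaussians. It rests on assumptions rather than the repository's sampling code: the dictionary layout, the function names, and the use of a single angle theta in place of the full rotation_matrix are all hypothetical.

```python
import math
import random

def sample_from_dgc(center, sigma, theta, rng):
    """Draw one 2D point: scale an axis-aligned Gaussian, rotate it, then shift it."""
    x = rng.gauss(0.0, sigma[0])
    y = rng.gauss(0.0, sigma[1])
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (center[0] + cos_t * x - sin_t * y,
            center[1] + sin_t * x + cos_t * y)

def sample_dataset(components, size, rng):
    """Pick a component for each point in proportion to its weight, then sample from it."""
    weights = [c["weight"] for c in components]
    chosen = rng.choices(components, weights=weights, k=size)
    return [sample_from_dgc(c["center"], c["sigma"], c["theta"], rng) for c in chosen]

# Illustrative components with values inside the default bounds listed in this README
components = [
    {"center": (-30.0, 10.0), "sigma": (7.0, 15.0), "theta": math.pi / 6, "weight": 1.0},
    {"center": (40.0, -20.0), "sigma": (12.0, 8.0), "theta": -math.pi / 4, "weight": 3.0},
]
rng = random.Random(2151)
points = sample_dataset(components, 1000, rng)
print(len(points), points[0])
```

With weights 1 and 3, roughly three quarters of the points should come from the second component.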

Configuration Guide

Modify these attributes after creating a DDG instance or edit the __init__ method.

Basic Settings

| Parameter | Default | Description |
| --- | --- | --- |
| seed | 2151 | Random seed for reproducibility |
| max_evals | 500000 | Maximum function evaluations |
| num_of_variables | 2 | Dimensionality of data |
| DGC_number | 7 | Number of Dynamic Gaussian Components |
| cluster_number | 5 | Number of clusters |

Bounds

| Parameter | Default | Description |
| --- | --- | --- |
| min_coordinate / max_coordinate | -70 / 70 | DGC center bounds |
| min_sigma / max_sigma | 7 / 20 | Standard deviation bounds |
| min_weight / max_weight | 1 / 3 | DGC weight bounds |
| min_angle / max_angle | -π / π | Rotation angle bounds |

Change Severity (Local/Gradual)

| Parameter | Default | Description |
| --- | --- | --- |
| local_shift_severity_range | [0.1, 0.2] | Center shift magnitude |
| relocation_correlation_range | [0.99, 0.995] | Movement correlation (ρ) |
| local_sigma_severity_range | [0.05, 0.1] | Sigma change magnitude |
| local_weight_severity_range | [0.02, 0.05] | Weight change magnitude |
| local_rotation_severity_range | [π/360, π/180] | Rotation change magnitude |

Change Severity (Global/Severe)

| Parameter | Default | Description |
| --- | --- | --- |
| global_shift_severity_value | 10 | Global center shift |
| global_sigma_severity_value | 5 | Global sigma change |
| global_weight_severity_value | 0.5 | Global weight change |
| global_angle_severity_value | π/4 | Global rotation change |
| global_severity_control | 0.1 | Beta distribution parameter (α=β) |

Change Likelihoods

| Parameter | Default | Description |
| --- | --- | --- |
| local_temporal_severity_range | [0.05, 0.1] | Probability of local change per DGC |
| global_change_likelihood | 0.0001 | Probability of global change |
| DGC_number_change_likelihood | 0.0001 | Probability of DGC count change |
| variable_number_change_likelihood | 0.0001 | Probability of dimension change |
| cluster_number_change_likelihood | 0.0001 | Probability of cluster count change |
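
To get a feel for these defaults, the short calculation below estimates how often a change with likelihood 0.0001 would fire over a full run. It assumes each likelihood is tested once per function evaluation as an independent trial. This README states that explicitly only for local changes, so treat the numbers as a rough guide. The helper names are hypothetical.

```python
def expected_changes(likelihood: float, evaluations: int) -> float:
    """Expected number of changes over a run of independent per-evaluation trials."""
    return likelihood * evaluations

def mean_interval(likelihood: float) -> float:
    """Average number of evaluations between two consecutive changes."""
    return 1.0 / likelihood

def prob_at_least_one(likelihood: float, evaluations: int) -> float:
    """Probability that at least one change occurs within the given evaluations."""
    return 1.0 - (1.0 - likelihood) ** evaluations

max_evals = 500_000            # default max_evals
global_likelihood = 0.0001     # default global_change_likelihood

print(f"Expected global changes per run: {expected_changes(global_likelihood, max_evals):.0f}")
print(f"Mean evaluations between changes: {mean_interval(global_likelihood):.0f}")
print(f"P(at least one change in 1000 evals): {prob_at_least_one(global_likelihood, 1000):.3f}")
```

Under this assumption the defaults give about 50 global changes per run, one every 10,000 evaluations on average. Raising the likelihood to 0.001, as in Example 3, shortens the mean interval to 1,000 evaluations.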

Dataset Settings

| Parameter | Default | Description |
| --- | --- | --- |
| data_size | 1000 | Dataset size |
| frequent_sampling_likelihood | 0.1 | Incremental sampling probability |
| incremental_sampling_size | 50 (5%) | Samples added per incremental update |

Dynamic Change Types

1. Gradual Local Changes (per DGC)

Applied with probability local_change_likelihood for each DGC at every function evaluation.

  • Center Relocation (Equations 5-6): Correlated random walk
    v(t+1) = normalize((1-ρ)·random + ρ·v(t))
    c(t+1) = c(t) + |N(0,1)| · severity · v(t+1)
    
  • Sigma Changes (Equation 7): Direction-based incremental
  • Weight Changes (Equation 8): Direction-based incremental
  • Rotation Changes (Equation 9): Angle adjustments
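
The center relocation rule can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the repository's vectorized NumPy code: the function names are hypothetical, and the "random" term is assumed to be a direction drawn uniformly on the unit sphere.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere (assumed form of 'random')."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]

def relocate_center(center, prev_direction, rho, severity, rng):
    """One correlated random-walk step: blend directions, normalize, then move."""
    r = random_unit_vector(len(center), rng)
    blended = [(1.0 - rho) * ri + rho * vi for ri, vi in zip(r, prev_direction)]
    norm = math.sqrt(sum(x * x for x in blended))
    # Keep the previous direction in the degenerate case of a zero-length blend
    direction = [x / norm for x in blended] if norm > 0.0 else list(prev_direction)
    step = abs(rng.gauss(0.0, 1.0)) * severity
    new_center = [c + step * d for c, d in zip(center, direction)]
    return new_center, direction

rng = random.Random(2151)
center = [0.0, 0.0]
direction = random_unit_vector(2, rng)
for _ in range(100):
    # rho and severity are taken from the default ranges listed in this README
    center, direction = relocate_center(center, direction, rho=0.99, severity=0.15, rng=rng)
print(center, direction)
```

With ρ close to 1 the direction changes only slightly from one step to the next, so a center drifts along a smooth path rather than jittering in place.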

2. Severe Global Changes (all DGCs)

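Applied with probability global_change_likelihood. Change magnitudes come from a heavy-tailed, U-shaped symmetric Beta distribution whose parameter is set by global_severity_control, so most draws fall close to ±severity rather than near zero.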

change = severity · (2·Beta(α,β) - 1)  where α=β=0.1
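
The following sketch draws changes from this formula with the standard library's random.betavariate, to show how a symmetric Beta with α = β = 0.1 concentrates draws near the two extremes. The function name is hypothetical, and the sketch does not claim to reproduce the repository's random stream.

```python
import random

def global_change(severity, alpha, rng):
    """Draw one signed change in [-severity, severity] from a symmetric Beta."""
    return severity * (2.0 * rng.betavariate(alpha, alpha) - 1.0)

rng = random.Random(2151)
severity = 10.0  # default global_shift_severity_value
draws = [global_change(severity, 0.1, rng) for _ in range(10_000)]

# Share of draws whose magnitude exceeds 90% of the severity
near_extremes = sum(abs(d) > 0.9 * severity for d in draws) / len(draws)
print(f"Share of draws beyond 90% of the severity: {near_extremes:.2f}")
```

A clear majority of the draws land in the outer 10% of the range on either side, which is why global changes tend to be abrupt.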

3. Structural Changes

  • DGC Count: Can increase/decrease (Equation 14)
  • Variable Count: Dimensions can change (Equation 15)
  • Cluster Count: Target clusters can change (Equation 16)

Boundary Control (Equation 4)

All parameters use the reflect method:

if value > max: value = 2·max - value
if value < min: value = 2·min - value
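
As a scalar sketch, the reflect rule looks like this in Python. The function name is hypothetical, and the repository presumably applies the same rule to whole NumPy arrays.

```python
def reflect(value: float, lower: float, upper: float) -> float:
    """Reflect an out-of-range value back across the bound it violated."""
    if value > upper:
        return 2 * upper - value
    if value < lower:
        return 2 * lower - value
    return value

# Center coordinates checked against the default bounds of -70 and 70
print(reflect(73.0, -70.0, 70.0))   # 67.0
print(reflect(-75.5, -70.0, 70.0))  # -64.5
print(reflect(12.0, -70.0, 70.0))   # 12.0
```

A single reflection brings the value back in range only when the overshoot is smaller than the width of the range. With the default bounds and severities in this README, steps are far smaller than the ranges.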

Examples

Example 1: Basic Usage

from DDG_Python import DDG
import numpy as np

# Initialize
ddg = DDG()

# Simple optimization loop
best_fitness = np.inf
for iteration in range(100):
    # Generate random solutions
    solutions = np.random.uniform(
        ddg.min_coordinate, ddg.max_coordinate,
        (20, ddg.cluster_number * ddg.num_of_variables)
    )
    
    # Evaluate
    fitness = ddg.clustering_evaluation(solutions)
    
    # Track best
    if np.nanmin(fitness) < best_fitness:
        best_fitness = np.nanmin(fitness)
    
    # Check termination
    if ddg.FE >= ddg.max_evals:
        break

print(f"Best fitness found: {best_fitness:.4f}")
print(f"Function evaluations used: {ddg.FE}")

Example 2: Static Environment (No Changes)

from DDG_Python import DDG

ddg = DDG()

# Disable all changes
ddg.global_change_likelihood = 0
ddg.DGC_number_change_likelihood = 0
ddg.variable_number_change_likelihood = 0
ddg.cluster_number_change_likelihood = 0
ddg.frequent_sampling_likelihood = 0

for dgc in ddg.dgc:
    dgc.local_change_likelihood = 0

Example 3: High-Frequency Dynamic Environment

from DDG_Python import DDG

ddg = DDG()

# Increase change frequency
ddg.global_change_likelihood = 0.001

# Increase local change likelihood for all DGCs
for dgc in ddg.dgc:
    dgc.local_change_likelihood = 0.2

Example 4: Visualize Dataset

import matplotlib.pyplot as plt
import numpy as np
from DDG_Python import DDG

ddg = DDG()

# Only works for 2D data
if ddg.num_of_variables == 2:
    plt.figure(figsize=(10, 8))
    plt.scatter(ddg.dataset[:, 0], ddg.dataset[:, 1], alpha=0.5, s=10)
    
    # Plot DGC centers
    centers = np.array([dgc.center for dgc in ddg.dgc])
    plt.scatter(centers[:, 0], centers[:, 1], c='red', s=100, marker='x', linewidths=3)
    
    plt.xlabel('Variable 1')
    plt.ylabel('Variable 2')
    plt.title('DDG Dataset with DGC Centers')
    plt.grid(True, alpha=0.3)
    plt.show()

Example 5: Track Changes Over Time

from DDG_Python import DDG
import numpy as np

ddg = DDG()

# Store DGC center positions over time
center_history = []

for _ in range(1000):
    solution = np.random.uniform(-70, 70, ddg.cluster_number * ddg.num_of_variables)
    ddg.clustering_evaluation(solution)
    
    # Record current centers
    centers = [dgc.center.copy() for dgc in ddg.dgc]
    center_history.append(centers)

# Analyze movement of first DGC
dgc0_positions = np.array([h[0] for h in center_history])
print(f"DGC 0 moved from {dgc0_positions[0]} to {dgc0_positions[-1]}")
print(f"Total distance traveled: {np.sum(np.linalg.norm(np.diff(dgc0_positions, axis=0), axis=1)):.2f}")

Citation

If you use DDG in your research, please cite:

@inproceedings{yazdani2024clustering,
  title={Clustering in Dynamic Environments: A Framework for Benchmark Dataset Generation With Heterogeneous Changes},
  author={Yazdani, Danial and Branke, J{\"u}rgen and Meghdadi, Amir Hossein and Omidvar, Mohammad Nabi and Stoean, Catalin and Stoean, Ruxandra and Gandomi, Amir H and Shi, Yuhui},
  booktitle={Proceedings of the Genetic and Evolutionary Computation Conference (GECCO)},
  year={2024},
  organization={ACM}
}

arXiv preprint: arXiv:2402.15731


License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.


Contact

Danial Yazdani
Email: danial.yazdani@gmail.com
GitHub: Danial-Yazdani


Related Projects

  • MATLAB Implementation: DDG-MATLAB - Functionally identical MATLAB version
  • Both implementations produce consistent results when using the same random seed
