Skip to content

19ska/Routing-and-Spectrum-Allocation-Problem

Repository files navigation

Routing and Spectrum Allocation using Deep Q-Network

Designed a reinforcement learning environment for a constrained network optimization problem, modeled network state/action spaces, trained DQN agents over 1M timesteps, and achieved sub-1% blocking rates on unseen traffic traces.

How to Execute

Prerequisites

Install required Python packages:

pip install -r requirements.txt

Training and Evaluation

Option 1: Standard Training (Default Hyperparameters)

python dqn_runner.py

Option 2: With Hyperparameter Optimization

python dqn_runner.py --optimize --trials 20

This will:

  1. Run Optuna hyperparameter optimization (20 trials, ~30-60 minutes)
  2. Train both models with the best hyperparameters found
  3. Save models with "_optimized" suffix

Training Process

The training will:

  1. Train DQN agent with capacity=20 (1M timesteps, ~10K episodes)
  2. Evaluate the capacity=20 model on 1000 evaluation files
  3. Generate 3 plots for capacity=20 results (saved in plots/ directory)
  4. Train DQN agent with capacity=10 (1M timesteps, ~10K episodes)
  5. Evaluate the capacity=10 model on 1000 evaluation files
  6. Generate 3 plots for capacity=10 results (saved in plots/ directory)
  7. Save both trained models as .zip files

Expected Duration:

  • Standard training: 2-4 hours total for both models
  • With optimization: 30-60 minutes + 2-4 hours for final training

Command Line Options

python dqn_runner.py --help

Available options:

  • --optimize: Enable hyperparameter optimization with Optuna
  • --trials N: Number of optimization trials (default: 20)

Quick Testing

To test the setup with minimal time investment:

python dqn_runner.py --optimize --trials 2

This runs a quick 2-trial optimization (~5 minutes) to verify everything works.

Environment

State Representation

The environment state is represented as a 15-dimensional vector containing:

Link Utilizations (12 values):

  • Fraction of occupied wavelengths for each of the 12 network links
  • Values range from 0.0 (no wavelengths used) to 1.0 (all wavelengths occupied)
  • Links are ordered consistently: sorted by (u,v) where u < v

Request Features (3 values):

  • Normalized source node: source_id / 8.0 (nodes 0-8)
  • Normalized destination node: destination_id / 8.0
  • Normalized holding time: min(holding_time / 50.0, 1.0) (capped at 50)

State Transitions

State transitions occur at each discrete time slot following this sequence:

  1. Release Phase: Expired lightpaths are released based on holding time

    • Check all active lightpaths for expiry_time <= current_time
    • Free wavelengths and update link utilizations
  2. Action Execution: Agent selects a path (actions 0-7) or blocks (action 8)

    • Action validated against current request's source-destination pair
  3. Allocation Phase: If valid path selected, attempt wavelength allocation

    • Use first-fit algorithm to find available wavelength across all path links
    • If wavelength found, allocate lightpath with expiry time
  4. State Update: Recompute link utilizations based on wavelength occupancy

    • utilization = occupied_wavelengths / total_capacity
  5. Time Advancement: Increment time slot and load next request

Network State Storage in LinkState Data Structure

Located in nwutil.py, the LinkState class stores complete link information:

class LinkState(BaseLinkState):
    def __init__(self, u, v, capacity=20, utilization=0.0):
        super().__init__(u, v, capacity, utilization)
        self.wavelengths = [True] * capacity      # Boolean array: True=available, False=occupied
        self.lightpaths = {}                      # {wavelength_idx: (src, dst, expiry_time)}

Key Attributes:

  • endpoints: Tuple (u, v) with u < v for consistent link identification
  • capacity: Total number of wavelengths available on this link
  • utilization: Current fraction of wavelengths in use (0.0 to 1.0)
  • wavelengths: Boolean array tracking availability of each wavelength
  • lightpaths: Dictionary mapping wavelength index to lightpath details

State Update Process:

  • When lightpath allocated: wavelengths[w] = False, add to lightpaths
  • When lightpath expires: wavelengths[w] = True, remove from lightpaths
  • Utilization automatically recomputed: sum(not w for w in wavelengths) / capacity

Action Representation

The action space consists of 9 discrete actions:

Path Selection Actions (0-7):

  • Actions 0-1: Paths P1-P2 for source-destination pair (0,3)
    • Action 0: Path [0,1,2,3]
    • Action 1: Path [0,8,7,6,3]
  • Actions 2-3: Paths P3-P4 for source-destination pair (0,4)
    • Action 2: Path [0,1,5,4]
    • Action 3: Path [0,8,7,6,3,4]
  • Actions 4-5: Paths P5-P6 for source-destination pair (7,3)
    • Action 4: Path [7,1,2,3]
    • Action 5: Path [7,6,3]
  • Actions 6-7: Paths P7-P8 for source-destination pair (7,4)
    • Action 6: Path [7,1,5,4]
    • Action 7: Path [7,6,3,4]

Block Action (8):

  • Explicit decision to block the current request

Action Validation: Actions are validated based on the current request's source-destination pair. Invalid actions (e.g., selecting a (0,3) path for a (7,4) request) result in automatic blocking with penalty.

Reward Function

The reward function provides clear learning signals:

  • +1.0: Successful lightpath allocation

    • Valid path selected and wavelength available on all links
    • Lightpath successfully established
  • -1.0: Request blocked

    • No available wavelength on selected path, or
    • Explicit block action (action 8)
  • -2.0: Invalid action penalty

    • Action does not correspond to current request's source-destination pair
    • Encourages agent to learn valid action selection

This sparse reward structure directly aligns with the optimization objective of minimizing blocking rate.

Additional Constraints

The implementation enforces three critical optical network constraints:

  1. Wavelength Continuity Constraint

    • Same wavelength must be used across all links in the selected path
    • Enforced by find_available_wavelength() checking all path links
    • No wavelength conversion allowed
  2. Link Capacity Constraint

    • Total active lightpaths per link cannot exceed link capacity
    • Enforced by wavelengths array size and availability checking
    • Prevents oversubscription of network resources
  3. Wavelength Conflict Constraint

    • No two lightpaths can use the same wavelength on the same link
    • Enforced by lightpaths dictionary and wavelength marking
    • Ensures proper resource isolation

Training Setup

Agent Training Process

The DQN agent is trained using the stable-baselines3 library with the following setup:

Environment Wrapper: MultiFileEnvWrapper cycles through 10,000 training request files to ensure diverse traffic patterns and prevent overfitting to specific request sequences.

Learning Algorithm: Deep Q-Network (DQN) with:

  • Experience replay buffer for breaking temporal correlations
  • Target network for stable Q-value targets
  • Epsilon-greedy exploration strategy

Training Duration: 1,000,000 timesteps per model (~10,000 episodes)

Hyperparameter Tuning

Manual Tuning (Original)

Hyperparameters were systematically tuned through experimentation. The final configuration is implemented in dqn_runner.py:

# Tuned hyperparameters
learning_rate = 1e-4
buffer_size = 50000
learning_starts = 1000
batch_size = 64
tau = 1.0
gamma = 0.99
target_update_interval = 1000
exploration_fraction = 0.3
exploration_initial_eps = 1.0
exploration_final_eps = 0.05

Tuning Process:

  1. Started with default stable-baselines3 parameters
  2. Learning Rate: Reduced from 1e-3 to 1e-4 to prevent training instability and oscillations
  3. Buffer Size: Increased from 10,000 to 50,000 to store more diverse experiences and break correlations
  4. Exploration: Extended exploration_fraction from 0.1 to 0.3 for more thorough policy discovery
  5. Target Update: Set to 1000 steps to balance stability and responsiveness

The tuning focused on stability and exploration, critical for the sequential decision-making nature of the RSA problem.

Automated Tuning with Optuna (Integrated)

The implementation includes integrated Optuna hyperparameter optimization accessible via command-line flags:

Search Space:

  • learning_rate: 1e-5 to 1e-2 (log scale)
  • buffer_size: [10000, 25000, 50000, 100000]
  • learning_starts: 500 to 2000
  • batch_size: [32, 64, 128]
  • gamma: 0.95 to 0.999
  • target_update_interval: [500, 1000, 2000]
  • exploration_fraction: 0.1 to 0.5
  • exploration_final_eps: 0.01 to 0.1

Optimization Features:

  • TPE Sampler: Tree-structured Parzen Estimator for efficient search
  • Median Pruning: Stops unpromising trials early to save time
  • Objective: Maximize (1 - blocking_rate) on validation subset
  • Fast Search: Uses reduced training time (100K timesteps) and subset of files per trial
  • Automatic Integration: Best hyperparameters automatically used for full training

Usage Examples:

# Standard training with manual hyperparameters
python dqn_runner.py

# With optimization (recommended)
python dqn_runner.py --optimize --trials 20

# Quick test with 5 trials
python dqn_runner.py --optimize --trials 5

# Extensive search with 50 trials
python dqn_runner.py --optimize --trials 50

Output Files:

  • Standard training: dqn_rsa_capacity20.zip, dqn_rsa_capacity10.zip
  • With optimization: dqn_rsa_capacity20_optimized.zip, dqn_rsa_capacity10_optimized.zip
  • Optimization study: optuna_study.pkl (for analysis)

Results

Capacity = 20 Results

Training Learning Curve - Capacity 20

Training Learning Curve: Shows the learning progression over 10,000 episodes. The agent starts with poor performance due to random policy and gradually learns effective routing strategies, achieving strong performance by training completion.

Training Objective - Capacity 20

Training Objective: The objective function (1 - blocking rate) shows the learning progression during training. The model successfully learns to minimize blocking, achieving excellent performance.

Evaluation Objective - Capacity 20

Evaluation Performance: Evaluated on 1000 unseen episodes, achieving 99.80% objective (0.20% final blocking rate, 0.50% average). The model demonstrates exceptional generalization with median performance of 0.00% blocking rate across diverse request patterns.

Capacity = 10 Results

Training Learning Curve - Capacity 10

Training Learning Curve: Demonstrates more challenging learning due to limited resources. The agent shows slower initial progress but successfully learns effective strategies for the resource-constrained environment.

Training Objective - Capacity 10

Training Objective: Shows gradual improvement throughout training as the agent learns to handle the more challenging resource-constrained environment. The model successfully adapts to limited capacity.

Evaluation Objective - Capacity 10

Evaluation Performance: Evaluated on 1000 episodes, achieving 96.90% objective (3.10% final blocking rate, 4.88% average). Despite 50% reduced capacity, the model maintains excellent performance with median blocking rate of only 0.50%, demonstrating superior resource utilization.

Performance Analysis

Actual Training Results:

Metric Capacity=20 Capacity=10
Final Training Reward 86.30 85.90
Final Training Objective 95.30% 94.50%
Evaluation Blocking Rate (Final 10) 0.20% 3.10%
Average Evaluation Blocking Rate 0.50% 4.88%
Evaluation Objective (Final 10) 99.80% 96.90%
Training Episodes 10,000 10,000
Training Timesteps 1,000,000 1,000,000

Key Findings:

  1. Exceptional Performance: Both models achieve outstanding results with very low blocking rates
  2. Initial vs Final Training: Rewards improved from -121.80 to 86.30 (Capacity=20) and -129.10 to 85.90 (Capacity=10)
  3. Capacity=20 Superior Performance: Achieves exceptional 0.20% final blocking rate vs 3.10% for Capacity=10
  4. Learning Milestones: Capacity=20 achieved first positive reward at episode 1,274, Capacity=10 at episode 1,351

Learning Characteristics:

  • Capacity=20: Exceptional performance with 75% of episodes achieving ≤1.00% blocking
  • Capacity=10: Strong performance despite resource constraints, with 50% achieving ≤0.50% blocking

Training Progression Details

Learning Milestones:

  • Capacity=20: First positive reward achieved at episode 1,274
  • Capacity=10: First positive reward achieved at episode 1,351
  • Both models required ~1,300 episodes to transition from random exploration to effective policy

Training Improvement:

  • Capacity=20: Improved from -121.80 initial reward to +86.30 final reward (208.10 point improvement)
  • Capacity=10: Improved from -129.10 initial reward to +85.90 final reward (215.00 point improvement)

All plots show rolling 10-episode averages to smooth out noise and highlight learning trends. The results demonstrate that DQN successfully learns highly effective routing and spectrum allocation policies, with the capacity=20 model achieving exceptional performance (0.20% blocking) and the capacity=10 model maintaining excellent results despite resource constraints.

Project Structure

├── dqn_runner.py          # Main training script with integrated Optuna optimization
├── rsaenv.py              # Custom Gymnasium environment for RSA problem
├── nwutil.py              # Network utilities and LinkState implementation
├── requirements.txt       # Python dependencies
├── README.md              # This documentation
├── data/                  # Training and evaluation datasets
│   ├── train/             # 10,000 training request files
│   └── eval/              # 1,000 evaluation request files
└── plots/                 # Generated training and evaluation plots
    ├── training_capacity20_learning_curve.png
    ├── training_capacity20_objective.png
    ├── eval_capacity20_objective.png
    ├── training_capacity10_learning_curve.png
    ├── training_capacity10_objective.png
    └── eval_capacity10_objective.png

Key Implementation Files

dqn_runner.py - Main training script featuring:

  • Integrated Optuna hyperparameter optimization
  • Command-line interface with --optimize and --trials flags
  • Training pipeline for both capacity configurations
  • Automatic model evaluation and plot generation
  • Support for both manual and optimized hyperparameters

rsaenv.py - Custom Gymnasium environment implementing:

  • 15-dimensional state space (12 link utilizations + 3 request features)
  • 9-action discrete space (8 paths + 1 block action)
  • Reward function (+1.0, -1.0, -2.0) aligned with blocking minimization
  • Proper constraint enforcement and state transitions

nwutil.py - Network utilities providing:

  • Extended LinkState class with wavelength tracking
  • First-fit wavelength allocation algorithm
  • Lightpath management and expiry handling
  • Network topology and pre-defined path definitions

Getting Started

  1. Install dependencies:

    pip install -r requirements.txt
  2. Run with optimization (recommended):

    python dqn_runner.py --optimize --trials 20
  3. Or run with default hyperparameters:

    python dqn_runner.py

The implementation is designed to be self-contained and easy to run, with all optimization features integrated into the main training script.

Summary of Results

Based on actual training runs with 1,000,000 timesteps each:

  • Capacity=20 Model: 99.80% final objective (0.20% blocking rate) - Exceptional performance
  • Capacity=10 Model: 96.90% final objective (3.10% blocking rate) - Excellent performance
  • Training Duration: ~10,000 episodes per model
  • Evaluation: 1,000 episodes each on unseen data
  • Key Insight: Both models achieve outstanding results, with capacity=20 reaching near-perfect performance

Both models successfully learned highly effective RSA policies, with detailed metrics saved to results.txt for reproducibility.

About

Built a Deep Reinforcement Learning framework for Routing and Spectrum Allocation in optical networks using DQN, custom network simulation, and automated hyperparameter optimization.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages