Designed a reinforcement learning environment for a constrained network optimization problem, modeled network state/action spaces, trained DQN agents over 1M timesteps, and achieved sub-1% blocking rates on unseen traffic traces.
Install required Python packages:
pip install -r requirements.txtpython dqn_runner.pypython dqn_runner.py --optimize --trials 20This will:
- Run Optuna hyperparameter optimization (20 trials, ~30-60 minutes)
- Train both models with the best hyperparameters found
- Save models with "_optimized" suffix
The training will:
- Train DQN agent with capacity=20 (1M timesteps, ~10K episodes)
- Evaluate the capacity=20 model on 1000 evaluation files
- Generate 3 plots for capacity=20 results (saved in
plots/directory) - Train DQN agent with capacity=10 (1M timesteps, ~10K episodes)
- Evaluate the capacity=10 model on 1000 evaluation files
- Generate 3 plots for capacity=10 results (saved in
plots/directory) - Save both trained models as
.zipfiles
Expected Duration:
- Standard training: 2-4 hours total for both models
- With optimization: 30-60 minutes + 2-4 hours for final training
python dqn_runner.py --helpAvailable options:
--optimize: Enable hyperparameter optimization with Optuna--trials N: Number of optimization trials (default: 20)
To test the setup with minimal time investment:
python dqn_runner.py --optimize --trials 2This runs a quick 2-trial optimization (~5 minutes) to verify everything works.
The environment state is represented as a 15-dimensional vector containing:
Link Utilizations (12 values):
- Fraction of occupied wavelengths for each of the 12 network links
- Values range from 0.0 (no wavelengths used) to 1.0 (all wavelengths occupied)
- Links are ordered consistently: sorted by (u,v) where u < v
Request Features (3 values):
- Normalized source node:
source_id / 8.0(nodes 0-8) - Normalized destination node:
destination_id / 8.0 - Normalized holding time:
min(holding_time / 50.0, 1.0)(capped at 50)
State transitions occur at each discrete time slot following this sequence:
-
Release Phase: Expired lightpaths are released based on holding time
- Check all active lightpaths for
expiry_time <= current_time - Free wavelengths and update link utilizations
- Check all active lightpaths for
-
Action Execution: Agent selects a path (actions 0-7) or blocks (action 8)
- Action validated against current request's source-destination pair
-
Allocation Phase: If valid path selected, attempt wavelength allocation
- Use first-fit algorithm to find available wavelength across all path links
- If wavelength found, allocate lightpath with expiry time
-
State Update: Recompute link utilizations based on wavelength occupancy
utilization = occupied_wavelengths / total_capacity
-
Time Advancement: Increment time slot and load next request
Located in nwutil.py, the LinkState class stores complete link information:
class LinkState(BaseLinkState):
def __init__(self, u, v, capacity=20, utilization=0.0):
super().__init__(u, v, capacity, utilization)
self.wavelengths = [True] * capacity # Boolean array: True=available, False=occupied
self.lightpaths = {} # {wavelength_idx: (src, dst, expiry_time)}Key Attributes:
endpoints: Tuple (u, v) with u < v for consistent link identificationcapacity: Total number of wavelengths available on this linkutilization: Current fraction of wavelengths in use (0.0 to 1.0)wavelengths: Boolean array tracking availability of each wavelengthlightpaths: Dictionary mapping wavelength index to lightpath details
State Update Process:
- When lightpath allocated:
wavelengths[w] = False, add tolightpaths - When lightpath expires:
wavelengths[w] = True, remove fromlightpaths - Utilization automatically recomputed:
sum(not w for w in wavelengths) / capacity
The action space consists of 9 discrete actions:
Path Selection Actions (0-7):
- Actions 0-1: Paths P1-P2 for source-destination pair (0,3)
- Action 0: Path [0,1,2,3]
- Action 1: Path [0,8,7,6,3]
- Actions 2-3: Paths P3-P4 for source-destination pair (0,4)
- Action 2: Path [0,1,5,4]
- Action 3: Path [0,8,7,6,3,4]
- Actions 4-5: Paths P5-P6 for source-destination pair (7,3)
- Action 4: Path [7,1,2,3]
- Action 5: Path [7,6,3]
- Actions 6-7: Paths P7-P8 for source-destination pair (7,4)
- Action 6: Path [7,1,5,4]
- Action 7: Path [7,6,3,4]
Block Action (8):
- Explicit decision to block the current request
Action Validation: Actions are validated based on the current request's source-destination pair. Invalid actions (e.g., selecting a (0,3) path for a (7,4) request) result in automatic blocking with penalty.
The reward function provides clear learning signals:
-
+1.0: Successful lightpath allocation
- Valid path selected and wavelength available on all links
- Lightpath successfully established
-
-1.0: Request blocked
- No available wavelength on selected path, or
- Explicit block action (action 8)
-
-2.0: Invalid action penalty
- Action does not correspond to current request's source-destination pair
- Encourages agent to learn valid action selection
This sparse reward structure directly aligns with the optimization objective of minimizing blocking rate.
The implementation enforces three critical optical network constraints:
-
Wavelength Continuity Constraint
- Same wavelength must be used across all links in the selected path
- Enforced by
find_available_wavelength()checking all path links - No wavelength conversion allowed
-
Link Capacity Constraint
- Total active lightpaths per link cannot exceed link capacity
- Enforced by
wavelengthsarray size and availability checking - Prevents oversubscription of network resources
-
Wavelength Conflict Constraint
- No two lightpaths can use the same wavelength on the same link
- Enforced by
lightpathsdictionary and wavelength marking - Ensures proper resource isolation
The DQN agent is trained using the stable-baselines3 library with the following setup:
Environment Wrapper: MultiFileEnvWrapper cycles through 10,000 training request files to ensure diverse traffic patterns and prevent overfitting to specific request sequences.
Learning Algorithm: Deep Q-Network (DQN) with:
- Experience replay buffer for breaking temporal correlations
- Target network for stable Q-value targets
- Epsilon-greedy exploration strategy
Training Duration: 1,000,000 timesteps per model (~10,000 episodes)
Hyperparameters were systematically tuned through experimentation. The final configuration is implemented in dqn_runner.py:
# Tuned hyperparameters
learning_rate = 1e-4
buffer_size = 50000
learning_starts = 1000
batch_size = 64
tau = 1.0
gamma = 0.99
target_update_interval = 1000
exploration_fraction = 0.3
exploration_initial_eps = 1.0
exploration_final_eps = 0.05Tuning Process:
- Started with default stable-baselines3 parameters
- Learning Rate: Reduced from 1e-3 to 1e-4 to prevent training instability and oscillations
- Buffer Size: Increased from 10,000 to 50,000 to store more diverse experiences and break correlations
- Exploration: Extended exploration_fraction from 0.1 to 0.3 for more thorough policy discovery
- Target Update: Set to 1000 steps to balance stability and responsiveness
The tuning focused on stability and exploration, critical for the sequential decision-making nature of the RSA problem.
The implementation includes integrated Optuna hyperparameter optimization accessible via command-line flags:
Search Space:
learning_rate: 1e-5 to 1e-2 (log scale)buffer_size: [10000, 25000, 50000, 100000]learning_starts: 500 to 2000batch_size: [32, 64, 128]gamma: 0.95 to 0.999target_update_interval: [500, 1000, 2000]exploration_fraction: 0.1 to 0.5exploration_final_eps: 0.01 to 0.1
Optimization Features:
- TPE Sampler: Tree-structured Parzen Estimator for efficient search
- Median Pruning: Stops unpromising trials early to save time
- Objective: Maximize (1 - blocking_rate) on validation subset
- Fast Search: Uses reduced training time (100K timesteps) and subset of files per trial
- Automatic Integration: Best hyperparameters automatically used for full training
Usage Examples:
# Standard training with manual hyperparameters
python dqn_runner.py
# With optimization (recommended)
python dqn_runner.py --optimize --trials 20
# Quick test with 5 trials
python dqn_runner.py --optimize --trials 5
# Extensive search with 50 trials
python dqn_runner.py --optimize --trials 50Output Files:
- Standard training:
dqn_rsa_capacity20.zip,dqn_rsa_capacity10.zip - With optimization:
dqn_rsa_capacity20_optimized.zip,dqn_rsa_capacity10_optimized.zip - Optimization study:
optuna_study.pkl(for analysis)
Training Learning Curve: Shows the learning progression over 10,000 episodes. The agent starts with poor performance due to random policy and gradually learns effective routing strategies, achieving strong performance by training completion.
Training Objective: The objective function (1 - blocking rate) shows the learning progression during training. The model successfully learns to minimize blocking, achieving excellent performance.
Evaluation Performance: Evaluated on 1000 unseen episodes, achieving 99.80% objective (0.20% final blocking rate, 0.50% average). The model demonstrates exceptional generalization with median performance of 0.00% blocking rate across diverse request patterns.
Training Learning Curve: Demonstrates more challenging learning due to limited resources. The agent shows slower initial progress but successfully learns effective strategies for the resource-constrained environment.
Training Objective: Shows gradual improvement throughout training as the agent learns to handle the more challenging resource-constrained environment. The model successfully adapts to limited capacity.
Evaluation Performance: Evaluated on 1000 episodes, achieving 96.90% objective (3.10% final blocking rate, 4.88% average). Despite 50% reduced capacity, the model maintains excellent performance with median blocking rate of only 0.50%, demonstrating superior resource utilization.
Actual Training Results:
| Metric | Capacity=20 | Capacity=10 |
|---|---|---|
| Final Training Reward | 86.30 | 85.90 |
| Final Training Objective | 95.30% | 94.50% |
| Evaluation Blocking Rate (Final 10) | 0.20% | 3.10% |
| Average Evaluation Blocking Rate | 0.50% | 4.88% |
| Evaluation Objective (Final 10) | 99.80% | 96.90% |
| Training Episodes | 10,000 | 10,000 |
| Training Timesteps | 1,000,000 | 1,000,000 |
Key Findings:
- Exceptional Performance: Both models achieve outstanding results with very low blocking rates
- Initial vs Final Training: Rewards improved from -121.80 to 86.30 (Capacity=20) and -129.10 to 85.90 (Capacity=10)
- Capacity=20 Superior Performance: Achieves exceptional 0.20% final blocking rate vs 3.10% for Capacity=10
- Learning Milestones: Capacity=20 achieved first positive reward at episode 1,274, Capacity=10 at episode 1,351
Learning Characteristics:
- Capacity=20: Exceptional performance with 75% of episodes achieving ≤1.00% blocking
- Capacity=10: Strong performance despite resource constraints, with 50% achieving ≤0.50% blocking
Learning Milestones:
- Capacity=20: First positive reward achieved at episode 1,274
- Capacity=10: First positive reward achieved at episode 1,351
- Both models required ~1,300 episodes to transition from random exploration to effective policy
Training Improvement:
- Capacity=20: Improved from -121.80 initial reward to +86.30 final reward (208.10 point improvement)
- Capacity=10: Improved from -129.10 initial reward to +85.90 final reward (215.00 point improvement)
All plots show rolling 10-episode averages to smooth out noise and highlight learning trends. The results demonstrate that DQN successfully learns highly effective routing and spectrum allocation policies, with the capacity=20 model achieving exceptional performance (0.20% blocking) and the capacity=10 model maintaining excellent results despite resource constraints.
├── dqn_runner.py # Main training script with integrated Optuna optimization
├── rsaenv.py # Custom Gymnasium environment for RSA problem
├── nwutil.py # Network utilities and LinkState implementation
├── requirements.txt # Python dependencies
├── README.md # This documentation
├── data/ # Training and evaluation datasets
│ ├── train/ # 10,000 training request files
│ └── eval/ # 1,000 evaluation request files
└── plots/ # Generated training and evaluation plots
├── training_capacity20_learning_curve.png
├── training_capacity20_objective.png
├── eval_capacity20_objective.png
├── training_capacity10_learning_curve.png
├── training_capacity10_objective.png
└── eval_capacity10_objective.png
dqn_runner.py - Main training script featuring:
- Integrated Optuna hyperparameter optimization
- Command-line interface with
--optimizeand--trialsflags - Training pipeline for both capacity configurations
- Automatic model evaluation and plot generation
- Support for both manual and optimized hyperparameters
rsaenv.py - Custom Gymnasium environment implementing:
- 15-dimensional state space (12 link utilizations + 3 request features)
- 9-action discrete space (8 paths + 1 block action)
- Reward function (+1.0, -1.0, -2.0) aligned with blocking minimization
- Proper constraint enforcement and state transitions
nwutil.py - Network utilities providing:
- Extended
LinkStateclass with wavelength tracking - First-fit wavelength allocation algorithm
- Lightpath management and expiry handling
- Network topology and pre-defined path definitions
-
Install dependencies:
pip install -r requirements.txt
-
Run with optimization (recommended):
python dqn_runner.py --optimize --trials 20
-
Or run with default hyperparameters:
python dqn_runner.py
The implementation is designed to be self-contained and easy to run, with all optimization features integrated into the main training script.
Based on actual training runs with 1,000,000 timesteps each:
- Capacity=20 Model: 99.80% final objective (0.20% blocking rate) - Exceptional performance
- Capacity=10 Model: 96.90% final objective (3.10% blocking rate) - Excellent performance
- Training Duration: ~10,000 episodes per model
- Evaluation: 1,000 episodes each on unseen data
- Key Insight: Both models achieve outstanding results, with capacity=20 reaching near-perfect performance
Both models successfully learned highly effective RSA policies, with detailed metrics saved to results.txt for reproducibility.





