A cloud auto-scaling framework that combines machine learning prediction, cost optimization, container-VM hybrid management, anomaly detection, and adaptive resource tuning. In the simulated experiments reported below, it records zero SLA violations at over 91% lower average cost than each baseline policy.
Cloud applications face unpredictable traffic patterns — from gradual growth to sudden 10x spikes. Traditional auto-scalers either react too slowly (causing dropped requests) or over-provision resources (wasting money). EAGLE-Scale solves both problems simultaneously.
We identified 5 critical vulnerabilities in existing load-aware predictive auto-scaling approaches and built a dedicated engine to address each one:
| # | Vulnerability | EAGLE-Scale Engine | What It Does |
|---|---|---|---|
| 1 | Single-model prediction fails on diverse patterns | Hybrid Predictor | Combines LSTM neural network + polynomial regression with adaptive weighting |
| 2 | No cost optimization (all on-demand pricing) | Cost Optimizer | Automatically mixes Spot ($0.01/hr), Reserved ($0.03/hr), and On-Demand ($0.10/hr) instances |
| 3 | VM-only scaling (60s boot time) | Container-VM Manager | Deploys containers in 3 seconds for burst traffic, migrates to VMs for sustained load |
| 4 | No anomaly handling | Anomaly Detector | Multi-layer detection (Z-score + rate-of-change + Isolation Forest) with emergency mode |
| 5 | Fixed CPU utilization target | Adaptive U_ref | Dynamically adjusts target CPU utilization (0.45–0.95) based on workload trends |
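To make the Cost Optimizer row concrete, here is a minimal Python sketch of how a blended hourly cost could be computed from the three prices in the table. The allocation rule (Reserved for a steady baseline, Spot for as much of the remainder as is assumed available, On-Demand for the rest) and the names `PRICES`, `plan_mix`, `hourly_cost`, and `spot_available_fraction` are illustrative assumptions, not the logic in `CostOptimizer.java`.

```python
import math

# Hourly prices from the table above (USD per instance-hour).
PRICES = {"spot": 0.01, "reserved": 0.03, "on_demand": 0.10}


def plan_mix(needed, baseline, spot_available_fraction=1.0):
    """Split `needed` instances across pricing tiers (hypothetical rule).

    - Reserved covers the steady baseline.
    - Spot covers as much of the remainder as is assumed available.
    - On-Demand fills whatever is left.
    """
    reserved = min(needed, baseline)
    remainder = needed - reserved
    spot = math.floor(remainder * spot_available_fraction)
    on_demand = remainder - spot
    return {"spot": spot, "reserved": reserved, "on_demand": on_demand}


def hourly_cost(mix):
    """Total cost per hour for an instance mix."""
    return sum(PRICES[tier] * count for tier, count in mix.items())


mix = plan_mix(needed=10, baseline=4, spot_available_fraction=0.5)
all_on_demand = {"spot": 0, "reserved": 0, "on_demand": 10}
print(mix, round(hourly_cost(mix), 2))
print(all_on_demand, round(hourly_cost(all_on_demand), 2))
```

With these made-up inputs, the blended mix costs less than half of an all On-Demand fleet of the same size, which is the effect the engine is described as exploiting.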
We tested EAGLE-Scale and 3 baseline approaches across 6 diverse workload patterns (4 policies × 6 workloads = 24 experiments total):
| Metric | Avg Load (Reactive) | Load-Aware (Base Paper) | K8s HPA | EAGLE-Scale |
|---|---|---|---|---|
| Mean Response Time | 10.07 ms | 44.71 ms | 24.96 ms | 8.99 ms |
| SLA Violations | 0.51% | 11.85% | 5.39% | 0.00% |
| Average Cost | $56.23 | $35.82 | $39.19 | $2.99 |
| Dropped Requests | 88,863 | 5,128,522 | 4,038,862 | 0 |
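EAGLE-Scale records the best value on every metric in the table. Relative to the Load-Aware base paper, mean response time is about 5× lower (8.99 ms vs 44.71 ms) and average cost is 91.7% lower ($2.99 vs $35.82), with zero SLA violations and zero dropped requests.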
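The headline ratios can be recomputed directly from the table. The short check below does this against the Load-Aware (Base Paper) column; treating that column as the reference is an assumption, made because its values reproduce the quoted figures.

```python
# Values copied from the results table above.
load_aware = {"mean_rt_ms": 44.71, "avg_cost_usd": 35.82}
eagle_scale = {"mean_rt_ms": 8.99, "avg_cost_usd": 2.99}

# How many times lower the mean response time is.
rt_ratio = load_aware["mean_rt_ms"] / eagle_scale["mean_rt_ms"]

# Percentage reduction in average cost.
cost_reduction_pct = 100 * (1 - eagle_scale["avg_cost_usd"] / load_aware["avg_cost_usd"])

print(f"response time: {rt_ratio:.2f}x lower")       # 4.97x lower
print(f"cost reduction: {cost_reduction_pct:.1f}%")  # 91.7%
```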
EAGLE-Scale/
├── eagle-scale-sim/ # CloudSim Plus simulation framework
│ ├── pom.xml # Maven project config (Java 21)
│ ├── run_experiments.sh # Run all 24 experiments
│ │
│ ├── src/main/java/com/eaglescale/
│ │ ├── EagleScaleSimulation.java # Main entry point
│ │ │
│ │ ├── engines/ # Core EAGLE-Scale engines
│ │ │ ├── HybridPredictor.java # Engine 1: LSTM + Polynomial ensemble
│ │ │ ├── CostOptimizer.java # Engine 2: Spot/Reserved/On-Demand mix
│ │ │ ├── ContainerVMManager.java # Engine 3: Container-VM hybrid scaling
│ │ │ ├── AnomalyDetector.java # Engine 4: Multi-layer anomaly detection
│ │ │ ├── AdaptiveUref.java # Engine 5: Dynamic utilization target
│ │ │ ├── PolynomialPredictor.java # Baseline predictor (base paper method)
│ │ │ └── LSTMBridge.java # Java-Python LSTM bridge
│ │ │
│ │ ├── policies/ # Scaling decision policies
│ │ │ ├── EagleScalePolicy.java # Our complete framework (all 5 engines)
│ │ │ ├── LoadAwarePolicy.java # Base paper implementation
│ │ │ ├── AverageLoadPolicy.java # Reactive baseline
│ │ │ └── KubernetesHPAPolicy.java # Kubernetes HPA baseline
│ │ │
│ │ ├── workloads/ # Traffic pattern generators
│ │ │ ├── FlashCrowdWorkload.java # Sudden 10x spike
│ │ │ ├── FlightBookingWorkload.java # 24-hour dual-peak pattern
│ │ │ ├── MultiSeasonWorkload.java # 7-day daily+weekly cycles
│ │ │ ├── PeriodicWorkload.java # Sine wave pattern
│ │ │ ├── IncreasingWorkload.java # Linear growth
│ │ │ └── UnpredictableWorkload.java # Random walk with bursts
│ │ │
│ │ └── utils/ # Utility classes
│ │   ├── MetricsCollector.java # Performance metrics and CSV export
│ │   └── PythonBridge.java # Java-to-Python IPC bridge
│ │
│ ├── python_models/ # Machine learning models
│ │ ├── lstm_predictor.py # LSTM time-series predictor
│ │ ├── isolation_forest.py # Anomaly detection model
│ │ ├── generate_graphs.py # Publication-quality graph generator
│ │ └── requirements.txt # Python dependencies
│ │
│ └── results/ # Experiment output (CSV + PNG graphs)
│   ├── *_summary.csv # Per-experiment summary metrics
│   ├── *_timeseries.csv # Per-step time-series data
│   ├── sla_violations.png # SLA comparison chart
│   ├── cost_comparison.png # Cost comparison chart
│   ├── rt_flashcrowd.png # Response time during flash crowd
│   └── ... # More graphs
│
└── docs/ # Research documentation
  ├── experiment_guide.md # Detailed experiment design
  └── results_report.md # Full experiment results analysis
- Java 21 (OpenJDK)
- Maven 3.x
- Python 3.10+ with virtual environment
- 8GB+ RAM recommended
git clone https://github.com/Om-Rohilla/EAGLE-Scale.git
cd EAGLE-Scale
# Set up Python virtual environment
python3 -m venv venv
source venv/bin/activate
pip install -r eagle-scale-sim/python_models/requirements.txt
cd eagle-scale-sim
mvn compile
source ../venv/bin/activate # Activate Python env (needed for LSTM/ML bridge)
mvn -q exec:java \
-Dexec.mainClass="com.eaglescale.EagleScaleSimulation" \
-Dexec.args="flashcrowd eagle_scale"
Available workloads: increasing, periodic, unpredictable, flight, flashcrowd, multiseason
Available policies: average_load, load_aware, kubernetes_hpa, eagle_scale
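The 24-experiment count follows from pairing every workload with every policy. The sketch below only builds the corresponding Maven command lines so the matrix is visible; it does not run them, and it is not a reproduction of `run_experiments.sh`. The helper name `build_command` is hypothetical.

```python
import shlex
from itertools import product

WORKLOADS = ["increasing", "periodic", "unpredictable", "flight", "flashcrowd", "multiseason"]
POLICIES = ["average_load", "load_aware", "kubernetes_hpa", "eagle_scale"]


def build_command(workload, policy):
    """Return the argument list for one workload/policy run."""
    return [
        "mvn", "-q", "exec:java",
        "-Dexec.mainClass=com.eaglescale.EagleScaleSimulation",
        f"-Dexec.args={workload} {policy}",
    ]


commands = [build_command(w, p) for w, p in product(WORKLOADS, POLICIES)]

print(len(commands))  # 6 workloads x 4 policies
for argv in commands[:2]:
    # shlex.join adds the quoting a shell would need for the space in exec.args.
    print(shlex.join(argv))
```

Each list could be passed to `subprocess.run` from inside `eagle-scale-sim/` with the virtual environment active, but a full sweep takes hours, so the sketch stops at printing.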
chmod +x run_experiments.sh
./run_experiments.sh
This runs all 6 workloads × 4 policies. Expected time: 2-4 hours depending on hardware.
python3 python_models/generate_graphs.py
Outputs PNG graphs and a summary comparison table in the results/ directory.
For each 15-second interval:
1. Workload Generator → produces current traffic (req/s)
2. Anomaly Detector → checks for unusual patterns
3. Adaptive U_ref → adjusts CPU target based on conditions
4. Hybrid Predictor → forecasts next-minute traffic
5. Cost Optimizer → selects cheapest instance mix
6. Container/VM Mgr → deploys resources (containers first, then VMs)
7. Metrics Collector → records response time, cost, utilization
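Steps 3 to 6 amount to turning a traffic forecast and a utilization target into an instance count. The following is a minimal sketch of that sizing step, assuming each instance serves a fixed `per_instance_rps` (a hypothetical parameter) and that the target is clamped to the 0.45–0.95 range quoted for Adaptive U_ref. The actual engines may size capacity differently.

```python
import math

U_REF_MIN, U_REF_MAX = 0.45, 0.95  # range quoted for the Adaptive U_ref engine


def clamp_u_ref(u_ref):
    """Keep the utilization target inside the documented range."""
    return max(U_REF_MIN, min(U_REF_MAX, u_ref))


def instances_needed(predicted_rps, per_instance_rps, u_ref):
    """Instances required so average utilization stays at or below the target."""
    target = clamp_u_ref(u_ref)
    return max(1, math.ceil(predicted_rps / (per_instance_rps * target)))


# A lower target leaves more headroom and therefore needs more instances.
print(instances_needed(predicted_rps=900, per_instance_rps=100, u_ref=0.50))  # 18
print(instances_needed(predicted_rps=900, per_instance_rps=100, u_ref=0.75))  # 12
```

The comparison shows why a fixed target is a trade-off: 0.50 buys headroom for spikes at the cost of six extra instances in this made-up case, while 0.75 is cheaper but closer to saturation.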
We use an M/M/1 queueing model to simulate realistic response times:
RT = BASE_RT + (1 / (capacity × (1 - utilization))) × 1000ms
When utilization exceeds 95%, a significant penalty is applied. When load exceeds 1.5× capacity, requests are dropped.
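The formula is the standard M/M/1 mean time in system, 1 / (μ(1 − ρ)), converted to milliseconds and added to a fixed base latency. A minimal sketch follows, assuming `capacity_rps` is the service rate in requests per second. The `base_rt_ms` value of 5.0 is a placeholder, and the penalty above 95% utilization is left out because its size is not stated here.

```python
def response_time_ms(base_rt_ms, capacity_rps, utilization):
    """Mean response time under the M/M/1 formula quoted above.

    capacity_rps is assumed to be the service rate in requests per second,
    so 1 / (capacity * (1 - utilization)) is in seconds.
    """
    if not 0 <= utilization < 1:
        raise ValueError("the M/M/1 formula is only defined for 0 <= utilization < 1")
    queueing_s = 1.0 / (capacity_rps * (1.0 - utilization))
    return base_rt_ms + queueing_s * 1000.0


def is_dropping(load_rps, capacity_rps):
    """Drop rule from the text: load above 1.5x capacity."""
    return load_rps > 1.5 * capacity_rps


# Response time grows sharply as utilization approaches 1.
for u in (0.5, 0.9, 0.99):
    rt = response_time_ms(base_rt_ms=5.0, capacity_rps=100.0, utilization=u)
    print(f"utilization={u:.2f} -> {rt:.1f} ms")
```

The steep growth near full utilization is what makes the choice of CPU target matter: small increases in load at high utilization translate into large response-time increases.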
When the anomaly detector confirms an anomaly (Z-score > 4.0 AND rate-of-change > 100%), emergency mode activates:
- Immediately deploys 3× current capacity using containers
- Containers boot in 3 seconds (vs 60s for VMs)
- Maintains emergency scaling for 2+ minutes until traffic stabilizes
- Automatically deactivates when conditions normalize
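The confirmation rule can be sketched as two threshold checks over a window of recent traffic. This is a simplified illustration: the window contents, the use of the sample standard deviation, and the function names are assumptions, and the Isolation Forest layer is omitted entirely.

```python
import statistics


def is_confirmed_anomaly(history, current, z_threshold=4.0, roc_threshold=1.0):
    """Apply both tests from the text: Z-score > 4.0 AND rate-of-change > 100%."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation (assumed)
    previous = history[-1]
    if stdev == 0 or previous == 0:
        return False  # avoid division by zero; a real detector needs a policy here
    z_score = (current - mean) / stdev
    rate_of_change = (current - previous) / previous  # 1.0 means +100%
    return z_score > z_threshold and rate_of_change > roc_threshold


def emergency_capacity(current_capacity, multiplier=3):
    """Emergency mode targets 3x the current capacity."""
    return current_capacity * multiplier


recent = [100, 102, 98, 101, 99]  # req/s, made-up sample window
print(is_confirmed_anomaly(recent, 1000))  # 10x spike
print(is_confirmed_anomaly(recent, 105))   # ordinary fluctuation
print(emergency_capacity(8))
```

Requiring both conditions means a value that is statistically unusual but only slightly above the previous sample does not trigger emergency scaling.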
| Metric | EAGLE-Scale | Base Paper | Improvement |
|---|---|---|---|
| Max Response Time | 14.62 ms | 411.94 ms | 28× better |
| SLA Violations | 0.00% | 13.75% | 100% fewer |
| Dropped Requests | 0 | 345,645 | 100% fewer |
| Recovery Time | ~3 sec | ~120 sec | 97.5% faster |
| Metric | EAGLE-Scale | Base Paper | Improvement |
|---|---|---|---|
| Mean RT | 8.80 ms | 103.98 ms | 11.8× better |
| SLA Violations | 0.00% | 32.29% | 100% fewer |
| Dropped Requests | 0 | 4,018,718 | 100% fewer |
| Component | Technology | Version |
|---|---|---|
| Simulation Engine | CloudSim Plus | 8.0.0 |
| Language (Simulation) | Java | 21 |
| Language (ML) | Python | 3.12 |
| ML Framework | TensorFlow (CPU) | 2.16.2 |
| Anomaly Detection | scikit-learn | 1.4.0 |
| Build Tool | Maven | 3.x |
| Math Library | Apache Commons Math | 3.6.1 |
| JSON Bridge | Gson | 2.10.1 |
This work extends and improves upon load-aware predictive auto-scaling for cloud environments. The base approach uses polynomial regression to predict future workload and pre-scale virtual machines. EAGLE-Scale addresses five identified limitations through dedicated engines, achieving superior performance across all evaluated metrics.
The full experiment suite validates claims through 24 experiments covering diverse real-world traffic patterns including flash crowds, periodic loads, linearly increasing traffic, random bursts, flight booking patterns, and multi-day seasonal cycles.
This project is part of academic research. Please cite appropriately if used in your work.
Om Rohilla — GitHub