Om-Rohilla/EAGLE-Scale

EAGLE-Scale: Enhanced Adaptive GPU-Less Elastic Scaling

A cloud auto-scaling framework that combines machine learning prediction, cost optimization, container-VM hybrid management, anomaly detection, and adaptive resource tuning. In simulation, it achieves zero SLA violations at roughly 91% lower cost than the load-aware baseline it extends.

What is EAGLE-Scale?

Cloud applications face unpredictable traffic patterns, from gradual growth to sudden 10x spikes. Traditional auto-scalers either react too slowly (causing dropped requests) or over-provision resources (wasting money). EAGLE-Scale targets both problems at once.

We identified 5 critical vulnerabilities in existing load-aware predictive auto-scaling approaches and built a dedicated engine to address each one:

| # | Vulnerability | EAGLE-Scale Engine | What It Does |
|---|---------------|--------------------|--------------|
| 1 | Single-model prediction fails on diverse patterns | Hybrid Predictor | Combines LSTM neural network + polynomial regression with adaptive weighting |
| 2 | No cost optimization (all on-demand pricing) | Cost Optimizer | Automatically mixes Spot ($0.01/hr), Reserved ($0.03/hr), and On-Demand ($0.10/hr) instances |
| 3 | VM-only scaling (60s boot time) | Container-VM Manager | Deploys containers in 3 seconds for burst traffic, migrates to VMs for sustained load |
| 4 | No anomaly handling | Anomaly Detector | Multi-layer detection (Z-score + rate-of-change + Isolation Forest) with emergency mode |
| 5 | Fixed CPU utilization target | Adaptive U_ref | Dynamically adjusts target CPU utilization (0.45–0.95) based on workload trends |
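
As an illustration of what "adaptive weighting" in the Hybrid Predictor row could look like, here is a minimal sketch that blends two forecasts by the inverse of each model's recent error. The class name `AdaptiveEnsemble`, the smoothing factor `alpha`, and the inverse-error rule itself are assumptions made for this example; the README does not state which weighting scheme `HybridPredictor.java` actually uses.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveEnsemble:
    """Blend two forecasts, favouring the model that has been more accurate lately.

    Hypothetical sketch: not taken from HybridPredictor.java.
    """

    alpha: float = 0.3     # assumed smoothing factor for the running errors
    err_lstm: float = 1.0  # running absolute error of the LSTM forecast
    err_poly: float = 1.0  # running absolute error of the polynomial forecast

    def weights(self) -> tuple[float, float]:
        # Inverse-error weighting: a smaller recent error earns a larger share.
        inv_lstm = 1.0 / (self.err_lstm + 1e-9)
        inv_poly = 1.0 / (self.err_poly + 1e-9)
        total = inv_lstm + inv_poly
        return inv_lstm / total, inv_poly / total

    def predict(self, lstm_forecast: float, poly_forecast: float) -> float:
        w_lstm, w_poly = self.weights()
        return w_lstm * lstm_forecast + w_poly * poly_forecast

    def update(self, lstm_forecast: float, poly_forecast: float, actual: float) -> None:
        # Exponentially weighted update of each model's absolute error.
        self.err_lstm = (1 - self.alpha) * self.err_lstm + self.alpha * abs(actual - lstm_forecast)
        self.err_poly = (1 - self.alpha) * self.err_poly + self.alpha * abs(actual - poly_forecast)
```

With equal starting errors the two forecasts are averaged. After each observed value, `update` shifts weight toward whichever model was closer, which is one way a single predictor could follow both smooth and bursty workloads.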

Key Results

We tested EAGLE-Scale against 3 baseline approaches across 6 diverse workload patterns (24 experiments total):

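| Metric | Avg Load (Reactive) | Load-Aware (Base Paper) | K8s HPA | EAGLE-Scale |
|--------|---------------------|-------------------------|---------|-------------|
| Mean Response Time | 10.07 ms | 44.71 ms | 24.96 ms | 8.99 ms |
| SLA Violations | 0.51% | 11.85% | 5.39% | 0.00% |
| Average Cost | $56.23 | $35.82 | $39.19 | $2.99 |
| Dropped Requests | 88,863 | 5,128,522 | 4,038,862 | 0 |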

EAGLE-Scale achieves the best result on every metric. Relative to the Load-Aware base paper, mean response time is about 5x lower (44.71 ms vs 8.99 ms) and average cost is 91.7% lower ($35.82 vs $2.99), with zero SLA violations and zero dropped requests.
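
The headline ratios can be recomputed from the table. The short script below only redoes that arithmetic on the published values; the dictionary keys are labels chosen for this example.

```python
# Mean values copied from the Key Results table.
mean_rt_ms = {"avg_load": 10.07, "load_aware": 44.71, "k8s_hpa": 24.96, "eagle_scale": 8.99}
avg_cost_usd = {"avg_load": 56.23, "load_aware": 35.82, "k8s_hpa": 39.19, "eagle_scale": 2.99}


def speedup(baseline: str) -> float:
    """How many times lower EAGLE-Scale's mean response time is than the baseline's."""
    return mean_rt_ms[baseline] / mean_rt_ms["eagle_scale"]


def cost_reduction_pct(baseline: str) -> float:
    """Percentage by which EAGLE-Scale's average cost is below the baseline's."""
    return 100.0 * (1.0 - avg_cost_usd["eagle_scale"] / avg_cost_usd[baseline])


for name in ("avg_load", "load_aware", "k8s_hpa"):
    print(f"{name}: {speedup(name):.2f}x faster, {cost_reduction_pct(name):.1f}% cheaper")
```

The 5x and 91.7% figures correspond to the `load_aware` row (4.97x and 91.7%). Against the other two baselines the response-time gap is smaller and the cost gap is larger.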

Project Structure

EAGLE-Scale/
├── eagle-scale-sim/                    # CloudSim Plus simulation framework
│   ├── pom.xml                         # Maven project config (Java 21)
│   ├── run_experiments.sh              # Run all 24 experiments
│   │
│   ├── src/main/java/com/eaglescale/
│   │   ├── EagleScaleSimulation.java   # Main entry point
│   │   │
│   │   ├── engines/                    # Core EAGLE-Scale engines
│   │   │   ├── HybridPredictor.java    # Engine 1: LSTM + Polynomial ensemble
│   │   │   ├── CostOptimizer.java      # Engine 2: Spot/Reserved/On-Demand mix
│   │   │   ├── ContainerVMManager.java # Engine 3: Container-VM hybrid scaling
│   │   │   ├── AnomalyDetector.java    # Engine 4: Multi-layer anomaly detection
│   │   │   ├── AdaptiveUref.java       # Engine 5: Dynamic utilization target
│   │   │   ├── PolynomialPredictor.java # Baseline predictor (base paper method)
│   │   │   └── LSTMBridge.java         # Java-Python LSTM bridge
│   │   │
│   │   ├── policies/                   # Scaling decision policies
│   │   │   ├── EagleScalePolicy.java   # Our complete framework (all 5 engines)
│   │   │   ├── LoadAwarePolicy.java    # Base paper implementation
│   │   │   ├── AverageLoadPolicy.java  # Reactive baseline
│   │   │   └── KubernetesHPAPolicy.java # Kubernetes HPA baseline
│   │   │
│   │   ├── workloads/                  # Traffic pattern generators
│   │   │   ├── FlashCrowdWorkload.java # Sudden 10x spike
│   │   │   ├── FlightBookingWorkload.java # 24-hour dual-peak pattern
│   │   │   ├── MultiSeasonWorkload.java # 7-day daily+weekly cycles
│   │   │   ├── PeriodicWorkload.java   # Sine wave pattern
│   │   │   ├── IncreasingWorkload.java # Linear growth
│   │   │   └── UnpredictableWorkload.java # Random walk with bursts
│   │   │
│   │   └── utils/                      # Utility classes
│   │       ├── MetricsCollector.java   # Performance metrics and CSV export
│   │       └── PythonBridge.java       # Java-to-Python IPC bridge
│   │
│   ├── python_models/                  # Machine learning models
│   │   ├── lstm_predictor.py           # LSTM time-series predictor
│   │   ├── isolation_forest.py         # Anomaly detection model
│   │   ├── generate_graphs.py          # Publication-quality graph generator
│   │   └── requirements.txt           # Python dependencies
│   │
│   └── results/                        # Experiment output (CSV + PNG graphs)
│       ├── *_summary.csv               # Per-experiment summary metrics
│       ├── *_timeseries.csv            # Per-step time-series data
│       ├── sla_violations.png          # SLA comparison chart
│       ├── cost_comparison.png         # Cost comparison chart
│       ├── rt_flashcrowd.png           # Response time during flash crowd
│       └── ...                         # More graphs
│
└── docs/                               # Research documentation
    ├── experiment_guide.md             # Detailed experiment design
    └── results_report.md               # Full experiment results analysis

Prerequisites

  • Java 21 (OpenJDK)
  • Maven 3.x
  • Python 3.10+ with virtual environment
  • 8GB+ RAM recommended

Quick Start

1. Clone and set up environment

git clone https://github.com/Om-Rohilla/EAGLE-Scale.git
cd EAGLE-Scale

# Set up Python virtual environment
python3 -m venv venv
source venv/bin/activate
pip install -r eagle-scale-sim/python_models/requirements.txt

2. Compile the Java simulation

cd eagle-scale-sim
mvn compile

3. Run a single experiment

source ../venv/bin/activate   # Activate Python env (needed for LSTM/ML bridge)
mvn -q exec:java \
  -Dexec.mainClass="com.eaglescale.EagleScaleSimulation" \
  -Dexec.args="flashcrowd eagle_scale"

Available workloads: increasing, periodic, unpredictable, flight, flashcrowd, multiseason

Available policies: average_load, load_aware, kubernetes_hpa, eagle_scale
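
To run a chosen subset of the workload and policy combinations, the Maven invocation above can be generated programmatically. The following is a small sketch, not the contents of `run_experiments.sh`; the helper names `build_command` and `experiment_matrix` are made up for this example.

```python
import itertools

WORKLOADS = ["increasing", "periodic", "unpredictable", "flight", "flashcrowd", "multiseason"]
POLICIES = ["average_load", "load_aware", "kubernetes_hpa", "eagle_scale"]


def build_command(workload: str, policy: str) -> list[str]:
    """Return the argv list for one experiment, mirroring the mvn command shown above."""
    if workload not in WORKLOADS:
        raise ValueError(f"unknown workload: {workload}")
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    return [
        "mvn", "-q", "exec:java",
        "-Dexec.mainClass=com.eaglescale.EagleScaleSimulation",
        # One argv element, so no shell quoting is needed around the space.
        f"-Dexec.args={workload} {policy}",
    ]


def experiment_matrix() -> list[list[str]]:
    """All 6 workloads x 4 policies = 24 commands."""
    return [build_command(w, p) for w, p in itertools.product(WORKLOADS, POLICIES)]
```

Each command can then be passed to `subprocess.run(cmd, check=True, cwd="eagle-scale-sim")` from the repository root, with the Python virtual environment activated first.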

4. Run all 24 experiments

chmod +x run_experiments.sh
./run_experiments.sh

This runs all 6 workloads × 4 policies. Expected time: 2-4 hours depending on hardware.

5. Generate comparison graphs

python3 python_models/generate_graphs.py

Outputs PNG graphs and a summary comparison table in the results/ directory.

How It Works

Simulation Flow

For each 15-second interval:
  1. Workload Generator → produces current traffic (req/s)
  2. Anomaly Detector   → checks for unusual patterns
  3. Adaptive U_ref     → adjusts CPU target based on conditions
  4. Hybrid Predictor   → forecasts next-minute traffic
  5. Cost Optimizer     → selects cheapest instance mix
  6. Container/VM Mgr   → deploys resources (containers first, then VMs)
  7. Metrics Collector  → records response time, cost, utilization
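
To make step 5 concrete, the sketch below prices one possible instance mix using the hourly rates listed for the Cost Optimizer. The selection rule is an assumption for illustration: it fills a capped share with Spot, then uses a fixed Reserved pool, then falls back to On-Demand. The parameters `reserved_pool` and `max_spot_fraction` are hypothetical; the README does not describe the rule `CostOptimizer.java` applies.

```python
# USD per instance-hour, as listed for the Cost Optimizer engine.
HOURLY_PRICE = {"spot": 0.01, "reserved": 0.03, "on_demand": 0.10}


def choose_mix(needed: int, reserved_pool: int, max_spot_fraction: float = 0.5) -> dict[str, int]:
    """Greedy mix: cheapest tier first, subject to assumed availability limits."""
    spot = int(needed * max_spot_fraction)          # cap Spot, since it can be reclaimed
    reserved = min(reserved_pool, needed - spot)    # then use pre-purchased Reserved capacity
    on_demand = needed - spot - reserved            # the remainder runs On-Demand
    return {"spot": spot, "reserved": reserved, "on_demand": on_demand}


def hourly_cost(mix: dict[str, int]) -> float:
    """Total hourly cost in USD for a given instance mix."""
    return sum(HOURLY_PRICE[tier] * count for tier, count in mix.items())


mix = choose_mix(needed=10, reserved_pool=3)
print(mix, round(hourly_cost(mix), 2))
```

For 10 instances this yields 5 Spot, 3 Reserved, and 2 On-Demand at $0.34 per hour, compared with $1.00 per hour if all 10 ran On-Demand.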

Response Time Model

We use an M/M/1 queueing model, in which the mean time in the system is 1/(μ(1 − ρ)) for service rate μ and utilization ρ, to simulate realistic response times:

RT = BASE_RT + (1 / (capacity × (1 - utilization))) × 1000ms

When utilization exceeds 95%, a significant penalty is applied. When load exceeds 1.5× capacity, requests are dropped.
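
The formula and the drop threshold can be written out as below. This is a sketch under stated assumptions: utilization is taken to be load divided by capacity, `base_rt_ms` is a caller-supplied parameter because the README does not give a value for BASE_RT, and the penalty above 95% utilization is left out because its size is not specified.

```python
def response_time_ms(load_rps: float, capacity_rps: float, base_rt_ms: float) -> float:
    """RT = BASE_RT + (1 / (capacity * (1 - utilization))) * 1000 ms."""
    utilization = load_rps / capacity_rps  # assumed definition of utilization
    if utilization >= 1.0:
        # The M/M/1 term diverges here; behaviour in this range is not documented.
        raise ValueError("formula is undefined at or above 100% utilization")
    return base_rt_ms + 1000.0 / (capacity_rps * (1.0 - utilization))


def is_dropping(load_rps: float, capacity_rps: float) -> bool:
    """Requests are dropped once load exceeds 1.5x capacity."""
    return load_rps > 1.5 * capacity_rps


# Example with a hypothetical 5 ms base latency and 1000 req/s of capacity.
print(response_time_ms(load_rps=500, capacity_rps=1000, base_rt_ms=5.0))
```

At 50% utilization the queueing term adds 2 ms, and at 90% it adds about 10 ms, which is why holding a lower utilization target keeps response times flat.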

Emergency Mode

When the anomaly detector confirms an anomaly (Z-score > 4.0 AND rate-of-change > 100%), emergency mode activates:

  • Immediately deploys 3× current capacity using containers
  • Containers boot in 3 seconds (vs 60s for VMs)
  • Maintains emergency scaling for 2+ minutes until traffic stabilizes
  • Automatically deactivates when conditions normalize
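
The trigger condition can be sketched as a two-part check. The assumptions here are that the Z-score is computed against a recent history window using the population standard deviation, and that rate of change is measured against the previous sample. The function name `is_emergency` is hypothetical, and the Isolation Forest layer is not modelled.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 4.0     # Z-score must exceed 4.0
RATE_THRESHOLD = 1.0  # rate of change must exceed 100%


def is_emergency(history: list[float], current: float) -> bool:
    """Return True only when both the Z-score and rate-of-change tests fire."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = pstdev(history)
    previous = history[-1]
    if sigma == 0 or previous <= 0:
        return False  # avoid division by zero on flat or empty traffic
    z_score = (current - mu) / sigma
    rate_of_change = (current - previous) / previous
    return z_score > Z_THRESHOLD and rate_of_change > RATE_THRESHOLD


recent = [100, 102, 98, 101, 99]   # req/s over the last few intervals
print(is_emergency(recent, 1000))  # 10x spike: both tests fire
print(is_emergency(recent, 110))   # high Z-score but only ~11% growth
```

Requiring both conditions keeps a statistically unusual but small rise on a very stable signal (the second call) from triggering a 3x capacity deployment.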

Experiment Results Highlights

Flash Crowd (10× Traffic Spike)

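| Metric | EAGLE-Scale | Base Paper | Improvement |
|--------|-------------|------------|-------------|
| Max Response Time | 14.62 ms | 411.94 ms | 28× better |
| SLA Violations | 0.00% | 13.75% | 100% fewer |
| Dropped Requests | 0 | 345,645 | 100% fewer |
| Recovery Time | ~3 sec | ~120 sec | 97.5% faster |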

Unpredictable Traffic (Worst Case)

| Metric | EAGLE-Scale | Base Paper | Improvement |
|--------|-------------|------------|-------------|
| Mean RT | 8.80 ms | 103.98 ms | 11.8× better |
| SLA Violations | 0.00% | 32.29% | 100% fewer |
| Dropped Requests | 0 | 4,018,718 | 100% fewer |
Technology Stack

| Component | Technology | Version |
|-----------|------------|---------|
| Simulation Engine | CloudSim Plus | 8.0.0 |
| Language (Simulation) | Java | 21 |
| Language (ML) | Python | 3.12 |
| ML Framework | TensorFlow (CPU) | 2.16.2 |
| Anomaly Detection | scikit-learn | 1.4.0 |
| Build Tool | Maven | 3.x |
| Math Library | Apache Commons Math | 3.6.1 |
| JSON Bridge | Gson | 2.10.1 |

Research Context

This work extends and improves upon load-aware predictive auto-scaling for cloud environments. The base approach uses polynomial regression to predict future workload and pre-scale virtual machines. EAGLE-Scale addresses five identified limitations through dedicated engines, achieving superior performance across all evaluated metrics.

The full experiment suite validates claims through 24 experiments covering diverse real-world traffic patterns including flash crowds, periodic loads, linearly increasing traffic, random bursts, flight booking patterns, and multi-day seasonal cycles.

License

This project is part of academic research. Please cite appropriately if used in your work.

Author

Om Rohilla (GitHub)
