Federated Learning allows distributed devices to collaboratively train a model without sharing raw data, but the dominant architecture — a central aggregation server — introduces a structural fragility that is widely acknowledged yet rarely demonstrated on real infrastructure. This thesis builds both a centralized and a decentralized peer-to-peer gossip FL system from scratch, deploys them on 8 AWS EC2 instances using raw TCP sockets, and subjects each to controlled failure injection under identical hardware and data conditions. Beyond confirming fault tolerance differences, the experiments uncover a cascading synchronization failure in the ring gossip protocol — an emergent behavior where a single node's death eventually isolates its neighbors from the entire ring through TCP timeout inflation — a dynamic that existing theoretical FL models do not predict.
This project implements and empirically compares two Federated Learning architectures deployed on 8 AWS EC2 instances:
-
Centralized FL (FedAvg) — McMahan et al. 2017. A central server aggregates models from all clients each round. Standard star topology.
-
Decentralized FL (Gossip) — No server. Nodes train locally and exchange model weights peer-to-peer in a ring or fully-connected topology. Synchronous equal-weight FedAvg blend.
Both architectures were implemented from scratch in Python using raw TCP sockets — no FL framework (Flower, PySyft, etc.) for Experiments 1–6. The same CNN model, dataset (CIFAR-10), hyperparameters, and hardware were used for all experiments so that topology is the only variable that changes between centralized and decentralized comparisons.
Experiment 7 deploys the same ring topology via the Fedstellar (p2pfl) framework (gRPC-based gossip) to directly compare a framework-based implementation against the hand-rolled synchronous TCP implementation under identical conditions.
Federated Learning surveys and theory papers consistently argue that decentralized FL eliminates the Single Point of Failure (SPOF) of centralized architectures. But this argument is always made structurally — "there is no central server, therefore no SPOF."
No prior work:
- Empirically demonstrates SPOF in centralized FL through live server failure injection on real distributed infrastructure (timestamped logs, real network conditions, real retries)
- Characterizes the secondary failure dynamics that emerge when a ring node fails under synchronous TCP coupling — specifically, how a single node's death propagates through timeout inflation to eventually isolate its neighbors from the rest of the ring
- Quantifies the accuracy-fault tolerance tradeoff under controlled, identical hardware conditions: same model, same dataset, same instance type, only topology changes
This thesis fills those gaps through empirical systems research: build it, run it under failure conditions, measure what theory does not predict.
- Does decentralized synchronous gossip FL converge to comparable accuracy as centralized FedAvg over 50 rounds, under IID and Non-IID data distributions?
- What is the per-round communication overhead of ring gossip vs centralized FedAvg on real cloud infrastructure — in data volume and wall-clock time?
- When the central server fails, what happens? When one ring node fails, what happens? (Empirical proof via live failure injection — not theoretical argument)
- Does data heterogeneity (Non-IID, Dirichlet α=0.5) affect convergence differently in centralized vs decentralized architectures, and why?
- What secondary behavioral effects emerge in ring gossip under node failure — effects not predicted by existing theoretical models?
┌─────────┐
┌──────► SERVER ◄──────┐
│ │ Node 0 │ │
│ └────┬────┘ │
│ broadcast │
upload │ upload
│ ▼ │
Client 1 Client 2 ... Client 7
Round structure: train locally → upload to server → server FedAvg → broadcast back
All 7 clients must upload before aggregation begins. If the server goes down, every client stalls and cannot recover — there is no fallback.
0 ── 1 ── 2 ── 3 ── 4 ── 5 ── 6 ── 7 ── (wraps to 0)
Round structure: train locally → push to left + right neighbor → receive from neighbors → blend → evaluate
No coordinator. Each node blends 3 models per round (self + 2 neighbors). If one node goes down, its two direct neighbors detect the failure and continue with 1 neighbor instead of 2. All other nodes are unaffected.
Every node connects to all 7 other nodes each round.
Round structure: train locally → push to all 7 peers → receive from all 7 peers → blend 8 models → evaluate
Theoretical upper bound for decentralized connectivity. Higher accuracy per round than ring, but 7× the communication volume.
Same ring topology as Experiments 3 and 5-A, but using p2pfl 0.4.4 (gRPC-based semi-synchronous gossip) instead of hand-rolled synchronous TCP. Same model, same data split, same number of rounds — only the communication framework changes.
Decentralized_FL/
│
├── config.env.example ← fill in your IPs, save as config.env (gitignored)
├── requirements.txt ← pip dependencies
├── setup_nodes.sh ← one-time node environment setup
│
├── src/ ← all Python source (run with PYTHONPATH=src)
│ ├── server.py ← centralized FL server (Exps 1, 2, 5-B)
│ ├── client.py ← centralized FL client (Exps 1, 2, 5-B)
│ ├── node.py ← decentralized ring gossip node (Exps 3, 4, 5-A)
│ ├── node_fc.py ← decentralized fully-connected node (Exps 6-A/B/C)
│ └── shared/
│ ├── data.py ← CIFAR-10 IID / Non-IID (Dirichlet) splits
│ ├── model.py ← CNNCifar architecture (~62,006 parameters)
│ ├── net.py ← length-prefixed TCP socket protocol
│ ├── log.py ← stdout tee to terminal + log file
│ └── train.py ← local SGD training, evaluation, FedAvg
│
├── experiments/
│ ├── run_experiment.sh ← single launcher for all 9 experiments (reads config.env)
│ └── fedstellar/ ← p2pfl-based Fedstellar experiments
│ ├── fedstellar_node.py
│ ├── run_fedstellar.sh
│ ├── setup_fedstellar.sh
│ └── shared/ ← Lightning model wrapper, CIFAR-10 for p2pfl
│
├── tools/
│ ├── parse_comm_bottleneck.py ← parses [COMM_SUMMARY] log lines → markdown table
│ └── generate_round_delay_graphs.py ← generates PNG round-delay graphs from logs
│
└── docs/
├── Implementation_Approach.md ← research design rationale and defense guide
└── LOG_FORMAT.md ← [COMM_SUMMARY] line format specification
Setup and operational instructions: See setup_nodes.sh and docs/Implementation_Approach.md.
Experiment launch commands and start order: See experiments/run_experiment.sh.
Log format and analysis commands: See docs/LOG_FORMAT.md.
CNNCifar — lightweight CNN for CIFAR-10. Identical architecture across all experiments.
Input (3 × 32 × 32)
→ Conv2d(3→6, k=5) → ReLU → MaxPool2d(2×2)
→ Conv2d(6→16, k=5) → ReLU → MaxPool2d(2×2)
→ Flatten → Linear(400→120) → ReLU
→ Linear(120→84) → ReLU → Linear(84→10)
Parameters : ~62,006
Model size : ~245 KB (serialized)
Optimizer : SGD lr=0.01 momentum=0.5
All experiments: 8 AWS EC2 t3.micro instances (us-east-1), 50 FL rounds, 5 local epochs per round, CIFAR-10 (6,250 samples per node), seed=42. Experiments 1–4 replicated 3× (May 7, 8, 9) — results consistent within 0.33 pp.
| Experiment | Topology | Distribution | Peak Accuracy | Final Accuracy (R50) |
|---|---|---|---|---|
| Exp 1 | Centralized | IID | 61.69% (R28) | 60.02% |
| Exp 2 | Centralized | Non-IID | 57.60% (R26) | 56.58% |
| Exp 3 | Ring Gossip | IID | 57.48% (R21) | 55.98% |
| Exp 4 | Ring Gossip | Non-IID | 53.44% (R47) | 49.49% |
| Exp 6-A | Fully Connected | IID | 62.40% | 61.17% |
| Exp 6-B | Fully Connected | Non-IID | — | 57.99% |
IID accuracy gap (centralized − ring gossip): −4.21 pp Non-IID accuracy gap: −6.49 pp (gossip degrades more due to smaller aggregation pool — 2 neighbors vs 7 clients)
| Metric | Centralized (Exp 1) | Ring Gossip (Exp 3) | FC Gossip (Exp 6-A) |
|---|---|---|---|
| Avg comm time / round | 3.9 s | 25.9 s | 3.4 s |
| Avg round duration | 16.1 s | 37.5 s | 15.4 s |
| Data / round / node | 490.6 KB | 981.2 KB | 1717.1 KB |
| Total data / node (50R) | 24.5 MB | 49.1 MB | 167.7 MB |
Bottleneck analysis:
- Centralized: bottleneck is the server serializing N models (O(N) scaling). Comm time ~3.9s.
- Ring gossip: bottleneck is waiting for the slower neighbor's training to complete — not bandwidth. Comm time ~25.9s (6.6× longer than centralized), but would halve on homogeneous hardware with faster CPUs.
- Fully connected: same per-round time as centralized despite 7× data volume, because all 7 pushes happen in parallel threads.
| Centralized SPOF (Exp 5-B) | Ring Fault (Exp 5-A) | FC Fault (Exp 6-C) | |
|---|---|---|---|
| Failure event | Server exits after R9 | Node 3 exits at R10 | Node 3 exits at R10 |
| Nodes detecting failure | All 7 clients | 2 (direct ring neighbors) | 7 (all survivors) |
| System continues? | No — permanent halt | Yes | Yes |
| Rounds completed | 9 / 50 (0 of 7 clients complete) | 50 / 50 (7 of 8 nodes) | 50 / 50 (7 of 8 nodes) |
| Recovery mechanism | None — no fallback server | Self-healing: blends 1 model instead of 2 | Self-healing: blends 6 models instead of 7 |
When Node 3 fails in the ring experiment (Exp 5-A), the 120-second TCP receive timeout inflates the round duration of direct neighbors (Nodes 2 and 4) from ~37s to ~130s per round. This timing inflation then cascades: the remaining neighbor of Node 2 and Node 4 cannot synchronize with them, eventually causing Node 2 (round 38) and Node 4 (round 41) to lose ALL connectivity — not just to Node 3, but to every other node in the ring.
This cascading isolation effect was not designed into the experiment. It emerged from real distributed system behavior and is not predicted by theoretical gossip FL models.
Distance-dependent communication time inflation after Node 3 failure:
| Ring distance from Node 3 | Avg comm time R11–50 |
|---|---|
| Direct neighbors (Nodes 2, 4) | ~96.5 s / round |
| 1-hop (Nodes 1, 5) | ~65 s / round |
| 2-hops (Nodes 0, 6) | ~51 s / round |
| 3-hops (Node 7) | ~48.6 s / round |
These limitations are acknowledged as deliberate scope boundaries, not oversights.
- Small scale (8 nodes). Ring gossip convergence is O(N/2) hops — at 100 nodes, full model diffusion takes 50 rounds alone. Results are valid as a proof-of-concept; scaling behavior at larger N is left as future work.
- Homogeneous hardware. All nodes are identical t3.micro instances running synchronously. Heterogeneous devices with straggler nodes (different training speeds) represent a separate research problem — async FL — and are intentionally out of scope for a clean architectural comparison.
- Fixed 120s receive timeout. The cascading isolation effect (Nodes 2 and 4 losing all connectivity by rounds 38 and 41) is a direct consequence of this fixed timeout. Adaptive timeout backoff or skip-and-continue logic would prevent secondary isolation and is the most actionable direction for follow-up work.
- Clean failure only. Experiments inject honest node death — a node stops responding. Byzantine behavior (a malicious node sending corrupted weights) is a separate research track and is not tested here.
- Synchronous protocol only. Both architectures use synchronous round execution by design, to keep topology as the sole variable. Asynchronous gossip (e.g. Fedstellar) introduces staleness as a second variable and would make the centralized vs decentralized comparison uninterpretable.
This work was carried out under the guidance of Dr. Anshu S. Anand, at the Indian Institute of Information Technology Allahabad as part of the M.Tech programme in Information Technology.
The experiments were conducted on AWS EC2 instances. CIFAR-10 dataset was used solely for academic research purposes.