Fine-tune LLMs on encrypted data. The server computes on ciphertexts and never sees your plaintext.
You have sensitive data (medical records, financial docs, legal contracts) and want to fine-tune a language model on it. But you can't send plaintext to a cloud GPU — regulations forbid it, or your threat model doesn't trust the server.
Split the transformer into linear ops (server, encrypted) and non-linear ops (client, plaintext). The server multiplies weight matrices by encrypted vectors homomorphically. The client does everything else (softmax, SiLU, RoPE, LoRA) in the clear.
The server only ever sees ciphertexts. Without the secret key it cannot decrypt them: the guarantee rests on the hardness of the LWE problem, not on an access policy.
graph LR
subgraph Client ["Client (Hospital)"]
A[Sensitive Data] --> B[Tokenize + Embed]
B --> C[Encrypt Hidden States]
G[Decrypt Results] --> H[Non-linear Ops]
H --> I[Train LoRA Locally]
end
subgraph Server ["Server (Cloud)"]
D[Base Weights W]
E["W @ Enc(x)<br/>Homomorphic Matmul"]
end
C -- "LWE Ciphertexts" --> E
E -- "Enc(W @ x)" --> G
D --> E
style Client fill:#e8f5e9,stroke:#2e7d32
style Server fill:#e3f2fd,stroke:#1565c0
Each transformer layer requires two round-trips between client and server. This is the core of how the system works:
sequenceDiagram
participant C as Client
participant S as Server
Note over C,S: Pass 1 — Attention + MLP Projections
C->>C: Quantize hidden states → int8
C->>C: Encrypt with LWE
C->>S: Send Enc(x)
S->>S: Compute W_q @ Enc(x)
S->>S: Compute W_k @ Enc(x)
S->>S: Compute W_v @ Enc(x)
S->>S: Compute W_gate @ Enc(x)
S->>S: Compute W_up @ Enc(x)
S->>C: Return encrypted projections
Note over C: Client-side (plaintext)
C->>C: Decrypt all projections
C->>C: RoPE positional encoding
C->>C: Attention: softmax(QK^T/√d) @ V
C->>C: SiLU activation + gate
C->>C: Add LoRA: α(U @ D @ x)
Note over C,S: Pass 2 — Output Projections
C->>C: Encrypt attention output + MLP hidden
C->>S: Send Enc(attn), Enc(mlp)
S->>S: Compute W_o @ Enc(attn)
S->>S: Compute W_down @ Enc(mlp)
S->>C: Return encrypted outputs
C->>C: Decrypt → residual add → next layer
Why two passes? The server can only do linear operations (matrix multiply) on encrypted data. Softmax and SiLU are non-linear and need the plaintext values. RoPE is a position-dependent rotation, linear in its input, but this design also applies it on the client after decryption. So the client must decrypt between the attention projection step and the output projection step.
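The two-pass split can be sketched in plain NumPy. This is a toy illustration and not the project's code: `server_round_trip` is a hypothetical stand-in that works on plaintext where the real server would work on ciphertexts, the weight names mirror the diagram, and RoPE, normalization and LoRA are left out to keep the flow visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, T = 8, 16, 4  # toy hidden size, MLP width, sequence length

# Base weights held by the server (names follow the sequence diagram).
shapes = {"q": (d, d), "k": (d, d), "v": (d, d), "o": (d, d),
          "gate": (d_ff, d), "up": (d_ff, d), "down": (d, d_ff)}
W = {name: rng.normal(0.0, 0.3, size=shape) for name, shape in shapes.items()}
stats = {"round_trips": 0}

def server_round_trip(requests):
    """Hypothetical server call: one message carrying (weight_name, input) pairs.
    Only matrix multiplies happen here; in the real protocol the inputs and
    outputs would be ciphertexts."""
    stats["round_trips"] += 1
    return [x @ W[name].T for name, x in requests]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def silu(z):
    return z / (1.0 + np.exp(-z))

def layer(x):
    # Pass 1: five linear projections of the same input.
    q, k, v, gate, up = server_round_trip(
        [(name, x) for name in ("q", "k", "v", "gate", "up")])
    # Client side: the non-linear steps need plaintext values.
    attn = softmax(q @ k.T / np.sqrt(d)) @ v
    mlp_hidden = silu(gate) * up
    # Pass 2: the two output projections are linear again.
    o, down = server_round_trip([("o", attn), ("down", mlp_hidden)])
    return x + o + down  # residual add, then on to the next layer

x = rng.normal(size=(T, d))
y = layer(x)
print(y.shape, stats["round_trips"])
```

The counter makes the cost model explicit: however many matrices are applied, one layer costs two messages to the server because the non-linear steps sit between the two groups of matmuls.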
This project layers four independent privacy mechanisms. Each protects a different attack surface:
graph TB
subgraph stack ["Privacy Stack"]
direction TB
FHE["FHE (Homomorphic Encryption)<br/>Server never sees hidden states"]
DP["DP-SGD (Differential Privacy)<br/>Individual records can't be extracted from model"]
FED["Federated Learning<br/>Raw data never leaves each hospital"]
CIPHER["Token Cipher<br/>Token IDs scrambled before encryption"]
end
FHE --> DP
DP --> FED
FED --> CIPHER
ATK1["Server inspects activations"] -.->|"Blocked by"| FHE
ATK2["Model memorization attack"] -.->|"Blocked by"| DP
ATK3["Data centralization"] -.->|"Blocked by"| FED
ATK4["Frequency analysis on tokens"] -.->|"Blocked by"| CIPHER
style stack fill:#fff3e0,stroke:#e65100
style ATK1 fill:#ffcdd2,stroke:#c62828
style ATK2 fill:#ffcdd2,stroke:#c62828
style ATK3 fill:#ffcdd2,stroke:#c62828
style ATK4 fill:#ffcdd2,stroke:#c62828
| Layer | What it protects | Guarantee | Overhead |
|---|---|---|---|
| FHE | Hidden states in transit | Cryptographic (LWE hardness) | ~178x |
| DP-SGD | Individual records in trained model | Statistical (ε,δ)-DP | ~1.5x |
| Federation | Raw data locality | Organizational (data never leaves) | ~1x per client |
| Token cipher | Token frequency patterns | Substitution cipher | ~1x (negligible) |
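As a rough illustration of the token cipher row, a simple substitution cipher over token IDs can be written as a secret permutation of the vocabulary. The function name, vocabulary size and seed below are assumptions for the example, and NumPy's generator is not a cryptographically secure source of randomness.

```python
import numpy as np

def make_token_cipher(vocab_size, seed):
    """Build a secret permutation of token IDs and its inverse.
    Illustrative only: a real key should come from a cryptographic RNG."""
    rng = np.random.default_rng(seed)
    forward = rng.permutation(vocab_size)
    inverse = np.empty_like(forward)
    inverse[forward] = np.arange(vocab_size)
    return forward, inverse

forward, inverse = make_token_cipher(vocab_size=32000, seed=1234)

token_ids = np.array([1, 450, 16500, 338, 450, 2])  # made-up token IDs
scrambled = forward[token_ids]   # what leaves the tokenizer
restored = inverse[scrambled]    # client-side inverse mapping

print(scrambled)
print(restored)
```

Note that a plain permutation maps equal tokens to equal outputs (the two occurrences of 450 above scramble to the same ID), so on its own it hides which token is which but not how often each one occurs.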
LWE (Learning With Errors) encryption: a plaintext value m becomes (a, b) where b = a·s + m + noise. The secret key s stays on the client.
Because LWE is additively homomorphic, the server can compute W @ Enc(x) and get Enc(W @ x) — without ever knowing x or s.
graph LR
subgraph Encrypt
M["m (plaintext)"] --> ENC["(a, b = a·s + m + e)"]
end
subgraph "Homomorphic Matmul"
ENC --> HOM["W @ (a, b)"]
HOM --> RES["(W·a, W·b) = Enc(W·m)"]
end
subgraph Decrypt
RES --> DEC["b' - a'·s = W·m + noise"]
end
style Encrypt fill:#e8f5e9,stroke:#2e7d32
style Decrypt fill:#e8f5e9,stroke:#2e7d32
| Parameter | Value | Why |
|---|---|---|
| LWE dimension | 1024 | ~128-bit security (HE Standard) |
| Noise | 2^(-25) | Balance between accuracy and security margin |
| Modulus | 2^32 | Implicit int32 arithmetic |
| Post-quantum | Yes | LWE is not broken by Shor's algorithm |
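The encrypt, homomorphic matmul, decrypt chain can be checked end to end with a small NumPy sketch using the parameters from the table (dimension 1024, modulus 2^32, noise 2^-25 of the modulus). The binary secret key and the message scaling factor `DELTA` are assumptions made for this example, and none of the function names come from the project.

```python
import numpy as np

Q = 2**32               # ciphertext modulus
N = 1024                # LWE dimension
SIGMA = 2.0**-25 * Q    # noise standard deviation on the integer scale (128)
DELTA = 2**16           # hypothetical scaling that lifts messages above the noise

rng = np.random.default_rng(0)

def keygen():
    # Binary secret key (an assumption; other key distributions exist).
    return rng.integers(0, 2, size=N, dtype=np.int64)

def encrypt(x, s):
    """Encrypt each integer of x as its own scalar LWE ciphertext (a, b)."""
    a = rng.integers(0, Q, size=(len(x), N), dtype=np.int64)
    e = np.rint(rng.normal(0.0, SIGMA, size=len(x))).astype(np.int64)
    b = (a @ s + DELTA * x + e) % Q
    return a, b

def server_matmul(W, ct):
    """Server side: apply the plaintext integer matrix W to both components.
    No secret key is involved."""
    a, b = ct
    return (W @ a) % Q, (W @ b) % Q

def decrypt(ct, s):
    a, b = ct
    phase = (b - a @ s) % Q
    phase = np.where(phase >= Q // 2, phase - Q, phase)  # centre around zero
    return np.rint(phase / DELTA).astype(np.int64)       # rounding removes the noise

s = keygen()
x = rng.integers(-127, 128, size=8, dtype=np.int64)    # int8-range activations
W = rng.integers(-3, 4, size=(4, 8), dtype=np.int64)   # small integer weights

result = decrypt(server_matmul(W, encrypt(x, s)), s)
print(result, W @ x)
```

The sketch works because `DELTA * (W @ x)` stays well below `Q / 2` and the accumulated noise `W @ e` stays well below `DELTA / 2`; larger weights or wider rows shrink both margins, which is why input quantization ranges matter for accuracy.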
LoRA adds small adapter matrices to each layer: y = W @ x + α(U @ D @ x). Only U and D are trained, and they stay on the client.
graph TB
subgraph forward ["Forward Pass"]
X[Input x] --> ENC2[Encrypt]
ENC2 --> SERVER["Server: W @ Enc(x)"]
SERVER --> DEC2[Decrypt → W·x]
X --> LORA["LoRA: α(U @ D @ x)"]
DEC2 --> ADD["y = W·x + LoRA"]
LORA --> ADD
end
subgraph backward ["Backward Pass (Client-Only)"]
LOSS[Loss] --> GRAD["∇L projected through lm_head"]
GRAD --> GU["∇U = grad @ (D @ x)^T"]
GRAD --> GD["∇D = U^T @ grad @ x^T"]
GU --> UPDATE["Adam update U, D"]
GD --> UPDATE
end
ADD --> LOSS
style forward fill:#e3f2fd,stroke:#1565c0
style backward fill:#fce4ec,stroke:#c62828
The backward pass computes analytical gradients (no autograd). One limitation: each layer gets the same top-level gradient rather than proper chain-rule backprop through layers. The client doesn't have W in production mode, so inter-layer gradients can't be computed. Training still converges — the Zama paper has the same constraint.
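For a single layer, the analytical LoRA gradients from the diagram can be verified against finite differences. This is a minimal sketch under a few assumptions: a squared-error loss stands in for the real loss, the scaling α is carried through explicitly (the diagram's formulas omit it), and `W @ x` is computed directly where the real client would receive it decrypted from the server.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 6, 2, 0.5
W = rng.normal(size=(d, d))        # frozen base weight (server side)
U = rng.normal(size=(d, r)) * 0.1  # LoRA up-projection, trained on the client
D = rng.normal(size=(r, d)) * 0.1  # LoRA down-projection, trained on the client
x = rng.normal(size=(d, 1))
target = rng.normal(size=(d, 1))

def forward(U, D):
    return W @ x + alpha * (U @ (D @ x))

def loss(U, D):
    # Stand-in loss so that dL/dy is simply (y - target).
    return 0.5 * float(np.sum((forward(U, D) - target) ** 2))

# Analytical gradients, no autograd.
grad_y = forward(U, D) - target
grad_U = alpha * grad_y @ (D @ x).T
grad_D = alpha * U.T @ grad_y @ x.T

def numeric_grad(f, M, eps=1e-6):
    """Central finite differences of f() with respect to the entries of M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        old = M[idx]
        M[idx] = old + eps
        hi = f()
        M[idx] = old - eps
        lo = f()
        M[idx] = old
        G[idx] = (hi - lo) / (2 * eps)
    return G

num_U = numeric_grad(lambda: loss(U, D), U)
num_D = numeric_grad(lambda: loss(U, D), D)
print(np.abs(grad_U - num_U).max(), np.abs(grad_D - num_D).max())
```

Neither gradient needs `W` itself, only the incoming gradient, `U`, `D` and `x`, which is what lets the adapter update run entirely on the client.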
Additional privacy during training:
- DP-SGD: Gaussian noise on gradients with RDP accounting (formal ε/δ bounds)
- DP-Forward: Embedding noise injection (SeqLDP guarantee)
- FFA-LoRA: Freeze D matrix — only train U to reduce DP noise amplification
- Federated learning: Multiple clients train locally, aggregate via FedAvg
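Two of the mechanisms in this list are short enough to sketch directly: the clip-and-noise step at the heart of DP-SGD, and FedAvg's weighted parameter average. The function names and arguments are illustrative assumptions, and the privacy accounting (RDP) that turns the noise multiplier into an ε/δ bound is not shown.

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to L2 norm clip_norm, sum, add Gaussian
    noise with std noise_multiplier * clip_norm, and average over the batch."""
    g = np.asarray(per_example_grads, dtype=np.float64)  # (batch, n_params)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = g * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=g.shape[1])
    return (clipped.sum(axis=0) + noise) / g.shape[0]

def fedavg(client_params, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    weights = np.asarray(client_sizes, dtype=np.float64)
    return np.average(np.stack(client_params), axis=0, weights=weights)

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0],    # norm 5.0, scaled down to norm 1.0
                  [0.3, 0.4]])   # norm 0.5, left unchanged
no_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)

merged = fedavg([np.array([0.0, 0.0]), np.array([4.0, 8.0])], client_sizes=[1, 3])
print(no_noise, noisy, merged)
```

Clipping bounds how much any single record can move the update, which is what makes the added noise meaningful; freezing one LoRA matrix, as FFA-LoRA does, reduces the number of noisy parameters that interact.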
git clone https://github.com/jeffelin/engima-fhe.git
cd engima-fhe
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
source .venv/bin/activate
# Run tests (919 tests, ~7 min on CPU)
FHE_BACKEND=numpy PYTHONPATH=src python -m pytest tests/ -v --tb=short
# Demo (plaintext, FHE, blind mode, training, MedQA eval)
PYTHONPATH=src python demo.py --mode all
# Training experiments (6 controlled experiments)
PYTHONPATH=src python train_medical.py
PYTHONPATH=src python train_medical.py --fhe # with FHE comparison
# Benchmarks
PYTHONPATH=src python benchmarks/run_benchmarks.py
# Web UI (http://localhost:8000)
PYTHONPATH=src python web/server.py
Or bash run_demo.sh to run everything.
The default backend is NumPy (CPU). For GPU acceleration:
NVIDIA (CuPy)
pip install -e ".[gpu-cuda]"
FHE_BACKEND=cupy PYTHONPATH=src python demo.py --mode all
Apple Silicon (MLX)
pip install -e ".[gpu-mlx]"
FHE_BACKEND=mlx PYTHONPATH=src python demo.py --mode all
GPU backends are experimental. The FHE pipeline is validated on NumPy. CuPy and MLX dispatch through src/fhe/device.py but have not been tested end-to-end.
Single container (web UI + API):
docker build -t engima-fhe .
docker run -p 8000:8000 engima-fhe
GPU container (NVIDIA):
docker build -f Dockerfile.gpu -t engima-fhe-gpu .
docker compose -f docker-compose.gpu.yml up
Split deployment (separate client and server containers):
docker compose --profile split up
This starts two containers:
- fhe-server: runs BlindFHEServerApp on port 8001 (no secret key)
- fhe-client: holds the secret key, connects to the server
graph LR
subgraph client-container ["fhe-client container"]
CL[RealFHEClient<br/>Secret key here]
end
subgraph server-container ["fhe-server container"]
SV[BlindFHEServerApp<br/>No secret key]
end
CL -- "POST /compute<br/>encrypted bytes" --> SV
SV -- "encrypted result" --> CL
style client-container fill:#e8f5e9,stroke:#2e7d32
style server-container fill:#e3f2fd,stroke:#1565c0
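The "encrypted bytes" exchanged between the two containers need some wire format. The sketch below shows one hypothetical layout (a small header followed by raw little-endian uint32 arrays); it is not the project's actual serialization, and the function names are made up for illustration.

```python
import struct
import numpy as np

# Hypothetical header: number of ciphertexts, LWE dimension (both uint32).
HEADER = struct.Struct("<II")

def pack_ciphertexts(a, b):
    """Serialize scalar LWE ciphertexts: a has shape (count, dim), b has shape (count,)."""
    a = np.ascontiguousarray(a, dtype="<u4")
    b = np.ascontiguousarray(b, dtype="<u4")
    count, dim = a.shape
    if b.shape != (count,):
        raise ValueError("b must hold one value per ciphertext")
    return HEADER.pack(count, dim) + a.tobytes() + b.tobytes()

def unpack_ciphertexts(payload):
    count, dim = HEADER.unpack_from(payload, 0)
    expected = HEADER.size + 4 * count * (dim + 1)
    if len(payload) != expected:
        raise ValueError("payload length does not match header")
    a = np.frombuffer(payload, dtype="<u4", count=count * dim,
                      offset=HEADER.size).reshape(count, dim)
    b = np.frombuffer(payload, dtype="<u4", count=count,
                      offset=HEADER.size + 4 * count * dim)
    return a, b

rng = np.random.default_rng(0)
a = rng.integers(0, 2**32, size=(64, 1024), dtype=np.uint32)
b = rng.integers(0, 2**32, size=64, dtype=np.uint32)

payload = pack_ciphertexts(a, b)
a2, b2 = unpack_ciphertexts(payload)
print(len(payload))
```

Nothing in the payload depends on the secret key, so the server container can parse it, compute on it, and send back a payload of the same shape without being able to read the values.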
All numbers from this implementation: pure NumPy, scalar LWE, single-threaded CPU, dim=32, lwe_dim=1024.
| Metric | Value |
|---|---|
| FHE single-layer correlation (random weights) | 0.86 |
| FHE single-layer correlation (real Ollama weights + safe_qmax) | 0.999 |
| FHE latency per layer | ~28 ms |
| Plaintext latency | ~0.2 ms |
| Overhead | ~178x |
| MedQA accuracy (random weights) | 25% |
| Training convergence (200 steps) | Loss 6.03 → 5.99 |
There are two correlation numbers because they measure different conditions. Random weights in [-2, 2] have high L1 row norms that cause more quantization clipping, giving 0.86. Real TinyLlama weights (Q4_0) are sparser and better-conditioned: with safe_qmax auto-scaling, correlation reaches 0.999. Both are reproducible via benchmarks/run_benchmarks.py.
Training convergence is real but marginal (0.7% over 200 steps). The model is small (hidden_size=64, 1 layer) and the backward pass approximation limits learning speed. The web UI shows ~11% loss drops in some runs with higher learning rates.
Zama published the paper this project is based on.
| | This project | Zama (Concrete ML) |
|---|---|---|
| Language | Python / NumPy | Rust (tfhe-rs) + Python |
| Ciphertext packing | Scalar LWE (1 value per ct) | RLWE SIMD (~1000 values per ct) |
| Hardware | CPU (+ experimental GPU) | GPU (CUDA), multi-threaded CPU |
| Model size | 64-dim, 1 layer | Full GPT-2 / Llama layers |
| Backward pass | Same approximation | Same approximation |
| Throughput | Educational | ~216 sec/token on RTX 4060 |
The performance gap is large: scalar LWE encrypts each value separately (64-dim = 64 ciphertexts), while RLWE packing fits the same vector in 1 ciphertext. The 178x overhead measured here would be an estimated 10-50x in a production RLWE system.
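The cost of scalar LWE can be made concrete with some back-of-the-envelope arithmetic, assuming each ciphertext is stored as 1024 mask words plus one body word of 4 bytes each (which follows from lwe_dim=1024 and the 2^32 modulus). These are raw size and operation counts, not latency predictions.

```python
LWE_DIM = 1024    # mask length per ciphertext
WORD_BYTES = 4    # one value modulo 2^32
HIDDEN = 64       # vector length used in the comparison above

ct_bytes = (LWE_DIM + 1) * WORD_BYTES   # 4100 bytes per encrypted value
vector_bytes = HIDDEN * ct_bytes        # 262400 bytes for a 64-dim vector
plain_bytes = HIDDEN * 1                # 64 bytes for the same vector as int8

# A 64x64 matmul touches every word of every input ciphertext for each output row.
server_macs = HIDDEN * HIDDEN * (LWE_DIM + 1)   # 4198400 multiply-adds
plain_macs = HIDDEN * HIDDEN                    # 4096 multiply-adds

print(ct_bytes, vector_bytes, vector_bytes // plain_bytes)
print(server_macs, server_macs // plain_macs)
```

Both ratios are tied directly to the LWE dimension, which is why packing many values into one ciphertext changes the picture so much.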
| Approach | Guarantee | Overhead | Maturity |
|---|---|---|---|
| FHE (this, Zama) | Cryptographic — server can't see data | 10-200x | Research |
| DP-SGD | Statistical — individual records protected | 1-3x | Production |
| Secure enclaves (TEE) | Hardware — trusted execution environment | ~1x | Production |
| Federated learning | Data never leaves client | ~1x per client | Production |
FHE gives the strongest guarantee but pays the most in performance. This project stacks FHE + DP-SGD + federation because they're complementary.
- Tiny model. 64-dim, 1 layer is far from a real LLM. Scalar LWE is impractically slow at full Llama dimensions (2048+).
- Marginal training. The gradient approximation converges but isn't competitive with standard backprop. This is a fundamental privacy tradeoff.
- Simplified security estimate. The 128-bit claim uses HE Standard tables. A real audit would use the lattice-estimator tool.
- GPU backends untested. CuPy/MLX interfaces exist but haven't been validated end-to-end.
- MedQA = random chance. 25% accuracy measures the evaluation pipeline, not model quality (random weights).
src/ 20,600+ lines across 52 files
fhe/ TFHE crypto: LWE, RLWE, GSW, bootstrap, NTT, SIMD packing
models/ FHELlamaForCausalLM, LoRA layers, kernel attention, RoPE
client/ Training orchestrator, LoRA manager, FHE client
server/ FHEServerCallback (sim), BlindFHEServerApp (production)
privacy/ DP-SGD, DP-Forward (embedding noise), RDP accounting
federation/ FedAvg federated trainer
cipher/ Token substitution cipher (simple + homophonic)
anonymization/ HIPAA PII removal
network/ Ciphertext binary serialization
core/ Config, training state, LR scheduler
tests/ 919 tests across 43 files
benchmarks/ MedQA eval, FHE overhead, training convergence
scripts/ Split deployment entry points, MedQA download
web/ Browser UI with training wizard and playground
data/ 20 medical training notes, 20 MedQA questions
- Chillotti et al., "TFHE: Fast Fully Homomorphic Encryption over the Torus", J. Cryptology 2020
- Frery et al., "Private LoRA Fine-tuning of Open-Source LLMs with Homomorphic Encryption", arXiv:2505.07329
- Regev, "On Lattices, Learning with Errors, Random Linear Codes, and Cryptography", STOC 2005
- Gentry, Sahai, Waters, "Homomorphic Encryption from Learning with Errors", Crypto 2013
This is a research/educational implementation. For production FHE, use TFHE-rs, OpenFHE, or Microsoft SEAL.
Apache 2.0