jeffelin/engima-fhe

Engima FHE

Fine-tune LLMs on encrypted data. The server computes on ciphertexts and never sees your plaintext.

Requires Python 3.9+.

The Problem

You have sensitive data (medical records, financial docs, legal contracts) and want to fine-tune a language model on it. But you can't send plaintext to a cloud GPU — regulations forbid it, or your threat model doesn't trust the server.

The Solution

Split the transformer into linear ops (server, encrypted) and non-linear ops (client, plaintext). The server multiplies weight matrices by encrypted vectors homomorphically. The client does everything else (softmax, SiLU, RoPE, LoRA) in the clear.

The server only ever sees ciphertexts. Without the secret key it cannot decrypt them; that protection rests on the hardness of the LWE problem, not on an access policy.

graph LR
    subgraph Client ["Client (Hospital)"]
        A[Sensitive Data] --> B[Tokenize + Embed]
        B --> C[Encrypt Hidden States]
        G[Decrypt Results] --> H[Non-linear Ops]
        H --> I[Train LoRA Locally]
    end

    subgraph Server ["Server (Cloud)"]
        D[Base Weights W]
        E["W @ Enc(x)<br/>Homomorphic Matmul"]
    end

    C -- "LWE Ciphertexts" --> E
    E -- "Enc(W @ x)" --> G
    D --> E

    style Client fill:#e8f5e9,stroke:#2e7d32
    style Server fill:#e3f2fd,stroke:#1565c0

Two-Pass Protocol

Each transformer layer requires two round-trips between client and server. This is the core of how the system works:

sequenceDiagram
    participant C as Client
    participant S as Server

    Note over C,S: Pass 1 — Attention + MLP Projections

    C->>C: Quantize hidden states → int8
    C->>C: Encrypt with LWE
    C->>S: Send Enc(x)
    S->>S: Compute W_q @ Enc(x)
    S->>S: Compute W_k @ Enc(x)
    S->>S: Compute W_v @ Enc(x)
    S->>S: Compute W_gate @ Enc(x)
    S->>S: Compute W_up @ Enc(x)
    S->>C: Return encrypted projections

    Note over C: Client-side (plaintext)
    C->>C: Decrypt all projections
    C->>C: RoPE positional encoding
    C->>C: Attention: softmax(QK^T/√d) @ V
    C->>C: SiLU activation + gate
    C->>C: Add LoRA: α(U @ D @ x)

    Note over C,S: Pass 2 — Output Projections

    C->>C: Encrypt attention output + MLP hidden
    C->>S: Send Enc(attn), Enc(mlp)
    S->>S: Compute W_o @ Enc(attn)
    S->>S: Compute W_down @ Enc(mlp)
    S->>C: Return encrypted outputs

    C->>C: Decrypt → residual add → next layer

Why two passes? The server can only do linear operations (matrix multiply) on encrypted data. Softmax and SiLU are non-linear, so they need the plaintext values; RoPE is a position-dependent rotation that this design also applies on the client. The client must therefore decrypt between the attention projection step and the output projection step.
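The control flow of one layer can be sketched in NumPy as follows. This is a minimal illustration, not the repository's code: `enc` and `dec` are identity placeholders standing in for LWE encryption and decryption, the toy sizes and the `server` function are assumptions, and RoPE, LoRA, normalization, and causal masking are left out to keep the two round-trips visible.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_ff = 4, 8, 16   # toy sequence length and widths (assumptions)

# Base weights live on the server; the client never reads them in this flow.
shapes = {"q": (d, d), "k": (d, d), "v": (d, d), "o": (d, d),
          "gate": (d_ff, d), "up": (d_ff, d), "down": (d, d_ff)}
W = {name: rng.normal(0.0, 0.3, size=shape) for name, shape in shapes.items()}

def enc(x):
    return x   # placeholder for client-side encryption

def dec(c):
    return c   # placeholder for client-side decryption

round_trips = 0

def server(requests):
    """Linear-only server: maps {weight_name: ciphertext} to weight @ ciphertext."""
    global round_trips
    round_trips += 1
    return {name: ct @ W[name].T for name, ct in requests.items()}

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def silu(z):
    return z / (1.0 + np.exp(-z))

def layer(x):
    # Pass 1: all projections that consume the layer input.
    cx = enc(x)
    reply = server({name: cx for name in ("q", "k", "v", "gate", "up")})
    p = {name: dec(ct) for name, ct in reply.items()}

    # Client-side non-linear work on plaintext.
    attn = softmax(p["q"] @ p["k"].T / np.sqrt(d)) @ p["v"]
    mlp_hidden = silu(p["gate"]) * p["up"]

    # Pass 2: output projections, then the residual add.
    reply = server({"o": enc(attn), "down": enc(mlp_hidden)})
    return x + dec(reply["o"]) + dec(reply["down"])

x = rng.normal(size=(T, d))
y = layer(x)
```

Every call to `server` involves only matrix products, which is what makes it compatible with an additively homomorphic scheme.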

Privacy Stack

This project layers four independent privacy mechanisms. Each protects a different attack surface:

graph TB
    subgraph stack ["Privacy Stack"]
        direction TB
        FHE["FHE (Homomorphic Encryption)<br/>Server never sees hidden states"]
        DP["DP-SGD (Differential Privacy)<br/>Individual records can't be extracted from model"]
        FED["Federated Learning<br/>Raw data never leaves each hospital"]
        CIPHER["Token Cipher<br/>Token IDs scrambled before encryption"]
    end

    FHE --> DP
    DP --> FED
    FED --> CIPHER

    ATK1["Server inspects activations"] -.->|"Blocked by"| FHE
    ATK2["Model memorization attack"] -.->|"Blocked by"| DP
    ATK3["Data centralization"] -.->|"Blocked by"| FED
    ATK4["Frequency analysis on tokens"] -.->|"Blocked by"| CIPHER

    style stack fill:#fff3e0,stroke:#e65100
    style ATK1 fill:#ffcdd2,stroke:#c62828
    style ATK2 fill:#ffcdd2,stroke:#c62828
    style ATK3 fill:#ffcdd2,stroke:#c62828
    style ATK4 fill:#ffcdd2,stroke:#c62828

| Layer | What it protects | Guarantee | Overhead |
|---|---|---|---|
| FHE | Hidden states in transit | Cryptographic (LWE hardness) | ~178x |
| DP-SGD | Individual records in trained model | Statistical (ε,δ)-DP | ~1.5x |
| Federation | Raw data locality | Organizational (data never leaves) | ~1x per client |
| Token cipher | Token frequency patterns | Substitution cipher | ~0x |
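As an illustration of the token cipher layer, the sketch below scrambles token IDs with a secret permutation of the vocabulary. It covers only the simple substitution idea; the function name, seed handling, and vocabulary size are assumptions, not the repository's API. Note that a plain permutation maps each token to one fixed replacement and so preserves token frequencies; the homophonic variant listed in the project layout is not shown here.

```python
import numpy as np

def make_cipher(vocab_size, seed):
    """Return (perm, inv): a secret permutation of token IDs and its inverse."""
    perm = np.random.default_rng(seed).permutation(vocab_size)
    inv = np.empty_like(perm)
    inv[perm] = np.arange(vocab_size)
    return perm, inv

# Hypothetical vocabulary size and seed, chosen only for the example.
perm, inv = make_cipher(vocab_size=32000, seed=1234)

tokens = np.array([1, 15043, 3186, 2])
scrambled = perm[tokens]     # what would leave the client
restored = inv[scrambled]    # client-side inverse mapping
```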

How FHE Works Here

LWE (Learning With Errors) encryption: a plaintext value m becomes a pair (a, b), where a is a random vector and b = a·s + m + noise, with all arithmetic modulo the ciphertext modulus. The secret key s stays on the client.

Because LWE is additively homomorphic, the server can compute W @ Enc(x) and get Enc(W @ x) — without ever knowing x or s.
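A toy version of this encrypt, multiply, decrypt cycle can be written in a few lines of NumPy. It is a sketch under stated assumptions rather than the repository's implementation: the binary secret key, the scaling factor `DELTA`, the small integer weights, and all function names are choices made for this example. The dimension, modulus, and noise level follow the parameter table.

```python
import numpy as np

N = 1024                      # LWE dimension
Q = 1 << 32                   # ciphertext modulus
DELTA = 1 << 18               # message scaling factor (assumption of this sketch)
NOISE_STD = Q * 2.0 ** -25    # 2^-25 relative to q, i.e. 128 in integer units

rng = np.random.default_rng(0)

def keygen():
    """Binary secret key s; it never leaves the client."""
    return rng.integers(0, 2, size=N, dtype=np.int64)

def encrypt(x, s):
    """Encrypt each integer in x as its own scalar LWE ciphertext (a_j, b_j)."""
    A = rng.integers(0, Q, size=(x.shape[0], N), dtype=np.int64)
    e = np.rint(rng.normal(0.0, NOISE_STD, size=x.shape[0])).astype(np.int64)
    b = (A @ s + DELTA * x + e) % Q
    return A, b

def server_matvec(W, A, b):
    """Server side: integer linear combinations of ciphertexts, no key needed."""
    return (W @ A) % Q, (W @ b) % Q

def decrypt(A, b, s):
    phase = (b - A @ s) % Q                               # = DELTA*m + noise (mod q)
    phase = np.where(phase >= Q // 2, phase - Q, phase)   # centre into [-q/2, q/2)
    return (phase + DELTA // 2) // DELTA                  # rounding removes the noise

s = keygen()
x = rng.integers(-127, 128, size=8)    # int8-range activations
W = rng.integers(-3, 4, size=(4, 8))   # small integer weights (assumption)

A, b = encrypt(x, s)
A_out, b_out = server_matvec(W, A, b)
result = decrypt(A_out, b_out, s)
```

Decryption recovers `W @ x` exactly here because the accumulated noise `W @ e` stays far below `DELTA / 2` and `DELTA * (W @ x)` stays inside the signed range of the modulus. Larger weights or inputs shrink both margins.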

graph LR
    subgraph Encrypt
        M["m (plaintext)"] --> ENC["(a, b = a·s + m + e)"]
    end

    subgraph "Homomorphic Matmul"
        ENC --> HOM["W @ (a, b)"]
        HOM --> RES["(W·a, W·b) = Enc(W·m)"]
    end

    subgraph Decrypt
        RES --> DEC["b' - a'·s = W·m + noise"]
    end

    style Encrypt fill:#e8f5e9,stroke:#2e7d32
    style Decrypt fill:#e8f5e9,stroke:#2e7d32

| Parameter | Value | Why |
|---|---|---|
| LWE dimension | 1024 | ~128-bit security (HE Standard) |
| Noise | 2^(-25) | Balance between accuracy and security margin |
| Modulus | 2^32 | Implicit int32 arithmetic |
| Post-quantum | Yes | LWE is not broken by Shor's algorithm |

Training

LoRA adds small adapter matrices to each layer: y = W @ x + α(U @ D @ x). Only U and D are trained, and they stay on the client.

graph TB
    subgraph forward ["Forward Pass"]
        X[Input x] --> ENC2[Encrypt]
        ENC2 --> SERVER["Server: W @ Enc(x)"]
        SERVER --> DEC2[Decrypt → W·x]
        X --> LORA["LoRA: α(U @ D @ x)"]
        DEC2 --> ADD["y = W·x + LoRA"]
        LORA --> ADD
    end

    subgraph backward ["Backward Pass (Client-Only)"]
        LOSS[Loss] --> GRAD["∇L projected through lm_head"]
        GRAD --> GU["∇U = grad @ (D @ x)^T"]
        GRAD --> GD["∇D = U^T @ grad @ x^T"]
        GU --> UPDATE["Adam update U, D"]
        GD --> UPDATE
    end

    ADD --> LOSS

    style forward fill:#e3f2fd,stroke:#1565c0
    style backward fill:#fce4ec,stroke:#c62828

The backward pass computes analytical gradients (no autograd). One limitation: each layer gets the same top-level gradient rather than proper chain-rule backprop through layers. The client doesn't have W in production mode, so inter-layer gradients can't be computed. Training still converges — the Zama paper has the same constraint.
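The per-layer gradient formulas from the diagram can be checked against finite differences. The sketch below assumes a squared-error loss on a single layer and includes the scaling factor α explicitly (the diagram omits it); the shapes and names are illustrative and not taken from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 6, 5, 2, 0.5

W = rng.normal(size=(d_out, d_in))   # frozen base weight
U = rng.normal(size=(d_out, r))      # LoRA up-projection (trained)
D = rng.normal(size=(r, d_in))       # LoRA down-projection (trained)
x = rng.normal(size=(d_in, 1))
target = rng.normal(size=(d_out, 1))

def loss(U_, D_):
    y_ = W @ x + alpha * (U_ @ D_ @ x)
    return 0.5 * float(np.sum((y_ - target) ** 2))

# Analytical gradients, with grad = dL/dy for the assumed loss.
y = W @ x + alpha * (U @ D @ x)
grad = y - target
grad_U = alpha * grad @ (D @ x).T    # shape (d_out, r)
grad_D = alpha * U.T @ grad @ x.T    # shape (r, d_in)

def numeric_grad(f, M, eps=1e-6):
    """Central finite differences, one matrix entry at a time."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        plus, minus = M.copy(), M.copy()
        plus[idx] += eps
        minus[idx] -= eps
        G[idx] = (f(plus) - f(minus)) / (2 * eps)
    return G

num_U = numeric_grad(lambda M: loss(M, D), U)
num_D = numeric_grad(lambda M: loss(U, M), D)
```

Neither formula involves W, which is why the client can run this step without the base weights.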

Additional privacy during training:

  • DP-SGD: Gaussian noise on gradients with RDP accounting (formal ε/δ bounds)
  • DP-Forward: Embedding noise injection (SeqLDP guarantee)
  • FFA-LoRA: Freeze D matrix — only train U to reduce DP noise amplification
  • Federated learning: Multiple clients train locally, aggregate via FedAvg
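Two of the mechanisms above are easy to sketch: the clip-and-noise step of DP-SGD and the weighted average of FedAvg. The function names, signatures, and example values below are assumptions made for illustration, and the RDP privacy accounting is not reproduced.

```python
import numpy as np

def dp_sgd_aggregate(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to clip_norm, sum, add Gaussian noise, average."""
    g = np.asarray(per_example_grads, dtype=float)
    flat = g.reshape(g.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = flat * scale[:, None]
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=flat.shape[1])
    return ((clipped.sum(axis=0) + noise) / g.shape[0]).reshape(g.shape[1:])

def fedavg(client_params, client_sizes):
    """Average client parameters, weighted by local dataset size."""
    w = np.asarray(client_sizes, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, client_params))

rng = np.random.default_rng(0)

# 16 per-example gradients for a 3x4 adapter matrix (toy shapes).
grads = rng.normal(size=(16, 3, 4)) * 10.0
noisy_update = dp_sgd_aggregate(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
clipped_only = dp_sgd_aggregate(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)

# Two clients holding 1 and 3 examples respectively.
global_U = fedavg([np.ones((3, 4)), 3.0 * np.ones((3, 4))], client_sizes=[1, 3])
```

With FFA-LoRA, only the U gradients would pass through `dp_sgd_aggregate`, since D stays frozen.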

Setup

git clone https://github.com/jeffelin/engima-fhe.git
cd engima-fhe
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Usage

source .venv/bin/activate

# Run tests (919 tests, ~7 min on CPU)
FHE_BACKEND=numpy PYTHONPATH=src python -m pytest tests/ -v --tb=short

# Demo (plaintext, FHE, blind mode, training, MedQA eval)
PYTHONPATH=src python demo.py --mode all

# Training experiments (6 controlled experiments)
PYTHONPATH=src python train_medical.py
PYTHONPATH=src python train_medical.py --fhe           # with FHE comparison

# Benchmarks
PYTHONPATH=src python benchmarks/run_benchmarks.py

# Web UI (http://localhost:8000)
PYTHONPATH=src python web/server.py

Or run bash run_demo.sh to execute everything.

GPU Setup

The default backend is NumPy (CPU). For GPU acceleration:

NVIDIA (CuPy)

pip install -e ".[gpu-cuda]"
FHE_BACKEND=cupy PYTHONPATH=src python demo.py --mode all

Apple Silicon (MLX)

pip install -e ".[gpu-mlx]"
FHE_BACKEND=mlx PYTHONPATH=src python demo.py --mode all

GPU backends are experimental. The FHE pipeline is validated on NumPy. CuPy and MLX dispatch through src/fhe/device.py but have not been tested end-to-end.

Docker

Single container (web UI + API):

docker build -t engima-fhe .
docker run -p 8000:8000 engima-fhe

GPU container (NVIDIA):

docker build -f Dockerfile.gpu -t engima-fhe-gpu .
docker compose -f docker-compose.gpu.yml up

Split deployment (separate client and server containers):

docker compose --profile split up

This starts two containers:

  • fhe-server — runs BlindFHEServerApp on port 8001 (no secret key)
  • fhe-client — holds the secret key, connects to the server

graph LR
    subgraph client-container ["fhe-client container"]
        CL[RealFHEClient<br/>Secret key here]
    end

    subgraph server-container ["fhe-server container"]
        SV[BlindFHEServerApp<br/>No secret key]
    end

    CL -- "POST /compute<br/>encrypted bytes" --> SV
    SV -- "encrypted result" --> CL

    style client-container fill:#e8f5e9,stroke:#2e7d32
    style server-container fill:#e3f2fd,stroke:#1565c0

Results

All numbers from this implementation: pure NumPy, scalar LWE, single-threaded CPU, dim=32, lwe_dim=1024.

| Metric | Value |
|---|---|
| FHE single-layer correlation (random weights) | 0.86 |
| FHE single-layer correlation (real Ollama weights + safe_qmax) | 0.999 |
| FHE latency per layer | ~28 ms |
| Plaintext latency | ~0.2 ms |
| Overhead | ~178x |
| MedQA accuracy (random weights) | 25% |
| Training convergence (200 steps) | Loss 6.03 → 5.99 |
Two correlation numbers because they measure different conditions. Random weights in [-2, 2] have high L1 row norms that cause more quantization clipping, giving 0.86. Real TinyLlama weights (Q4_0) are sparser and better-conditioned — with safe_qmax auto-scaling, correlation reaches 0.999. Both are reproducible via benchmarks/run_benchmarks.py.
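One way to see why the L1 row norm matters: for integer weights and quantized activations bounded by qmax, every output satisfies |W @ x| ≤ (row L1 norm) × qmax, and that product has to fit in the plaintext range of the ciphertext. The helper below is a hypothetical illustration of auto-scaling under that constraint. It is not the repository's `safe_qmax`, whose exact definition is not given in this README, and `half_range` is an assumed plaintext bound.

```python
import numpy as np

def pick_qmax(W_int, half_range, cap=127):
    """Largest activation bound that keeps |W_int @ x_q| <= half_range for every row."""
    max_l1 = int(np.abs(W_int).sum(axis=1).max())
    return max(1, min(cap, half_range // max(max_l1, 1)))

def quantize(x, qmax):
    """Symmetric quantization of x to integers in [-qmax, qmax]."""
    scale = max(float(np.abs(x).max()), 1e-12) / qmax
    x_q = np.clip(np.rint(x / scale), -qmax, qmax).astype(np.int64)
    return x_q, scale

rng = np.random.default_rng(0)
half_range = (1 << 13) - 1                      # assumed plaintext bound

W_light = rng.integers(-3, 4, size=(8, 8))      # low L1 row norms
W_heavy = rng.integers(-200, 201, size=(8, 8))  # high L1 row norms

x = rng.normal(size=8)
qmax_light = pick_qmax(W_light, half_range)
qmax_heavy = pick_qmax(W_heavy, half_range)
xq_light, _ = quantize(x, qmax_light)
xq_heavy, _ = quantize(x, qmax_heavy)
```

Heavier rows force a smaller `qmax`, so activations are represented with fewer levels. That is consistent with the lower correlation reported for dense random weights, though the README does not spell out the mechanism.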

Training convergence is real but marginal (0.7% over 200 steps). The model is small (hidden_size=64, 1 layer) and the backward pass approximation limits learning speed. The web UI shows ~11% loss drops in some runs with higher learning rates.

Compared to Zama (arXiv:2505.07329)

Zama published the paper this project is based on.

| | This project | Zama (Concrete ML) |
|---|---|---|
| Language | Python / NumPy | Rust (tfhe-rs) + Python |
| Ciphertext packing | Scalar LWE (1 value per ct) | RLWE SIMD (~1000 values per ct) |
| Hardware | CPU (+ experimental GPU) | GPU (CUDA), multi-threaded CPU |
| Model size | 64-dim, 1 layer | Full GPT-2 / Llama layers |
| Backward pass | Same approximation | Same approximation |
| Throughput | Educational | ~216 sec/token on RTX 4060 |

The performance gap is large: scalar LWE encrypts each value separately (a 64-dim vector becomes 64 ciphertexts), while RLWE packing fits the same vector in one ciphertext. With RLWE packing, the 178x overhead measured here is expected to fall to roughly 10-50x in a production system.
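The cost of scalar encryption shows up in ciphertext size as well as latency. The arithmetic below assumes each ciphertext is stored as lwe_dim + 1 uncompressed 32-bit words (the vector a plus the scalar b); the repository's actual serialization format may differ.

```python
LWE_DIM = 1024     # from the parameter table
WORD_BYTES = 4     # one value modulo 2^32

def scalar_lwe_bytes(num_values, lwe_dim=LWE_DIM, word_bytes=WORD_BYTES):
    """Bytes needed when every plaintext value gets its own (a, b) ciphertext."""
    return num_values * (lwe_dim + 1) * word_bytes

per_value = scalar_lwe_bytes(1)     # 4100 bytes for a single value
vector_64 = scalar_lwe_bytes(64)    # 262400 bytes for a 64-dim hidden state
expansion = vector_64 // 64         # 4100x relative to 64 int8 plaintext bytes

print(per_value, vector_64, expansion)
```

Under these assumptions, one 64-dim hidden state costs about 256 KiB per transfer.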

Compared to Other Privacy Approaches

| Approach | Guarantee | Overhead | Maturity |
|---|---|---|---|
| FHE (this, Zama) | Cryptographic: server can't see data | 10-200x | Research |
| DP-SGD | Statistical: individual records protected | 1-3x | Production |
| Secure enclaves (TEE) | Hardware: trusted execution environment | ~1x | Production |
| Federated learning | Data never leaves client | ~1x per client | Production |

FHE gives the strongest guarantee but pays the most in performance. This project stacks FHE + DP-SGD + federation because they're complementary.

Honest Limitations

  • Tiny model. 64-dim, 1 layer is far from a real LLM. Scalar LWE is impractically slow at full Llama dimensions (2048+).
  • Marginal training. The gradient approximation converges but isn't competitive with standard backprop. This is a fundamental privacy tradeoff.
  • Simplified security estimate. The 128-bit claim uses HE Standard tables. A real audit would use the lattice-estimator tool.
  • GPU backends untested. CuPy/MLX interfaces exist but haven't been validated end-to-end.
  • MedQA = random chance. 25% accuracy measures the evaluation pipeline, not model quality (random weights).

Project Layout

src/                        20,600+ lines across 52 files
  fhe/                      TFHE crypto: LWE, RLWE, GSW, bootstrap, NTT, SIMD packing
  models/                   FHELlamaForCausalLM, LoRA layers, kernel attention, RoPE
  client/                   Training orchestrator, LoRA manager, FHE client
  server/                   FHEServerCallback (sim), BlindFHEServerApp (production)
  privacy/                  DP-SGD, DP-Forward (embedding noise), RDP accounting
  federation/               FedAvg federated trainer
  cipher/                   Token substitution cipher (simple + homophonic)
  anonymization/            HIPAA PII removal
  network/                  Ciphertext binary serialization
  core/                     Config, training state, LR scheduler

tests/                      919 tests across 43 files
benchmarks/                 MedQA eval, FHE overhead, training convergence
scripts/                    Split deployment entry points, MedQA download
web/                        Browser UI with training wizard and playground
data/                       20 medical training notes, 20 MedQA questions

References

  1. Chillotti et al., "TFHE: Fast Fully Homomorphic Encryption over the Torus", J. Cryptology 2020
  2. Frery et al., "Private LoRA Fine-tuning of Open-Source LLMs with Homomorphic Encryption", arXiv:2505.07329
  3. Regev, "On Lattices, Learning with Errors, Random Linear Codes, and Cryptography", STOC 2005
  4. Gentry, Sahai, Waters, "Homomorphic Encryption from Learning with Errors", Crypto 2013

Not for Production

This is a research/educational implementation. For production FHE, use TFHE-rs, OpenFHE, or Microsoft SEAL.

License

Apache 2.0
