Engima FHE

Fine-tune LLMs on encrypted data. The server computes on ciphertexts and never sees your plaintext.

The Problem

You have sensitive data (medical records, financial docs, legal contracts) and want to fine-tune a language model on it. But you can't send plaintext to a cloud GPU — regulations forbid it, or your threat model doesn't trust the server.

The Solution

Split the transformer into linear ops (server, encrypted) and non-linear ops (client, plaintext). The server multiplies weight matrices by encrypted vectors homomorphically. The client does everything else (softmax, SiLU, RoPE, LoRA) in the clear.

The server only ever sees ciphertexts. Without the secret key it cannot decrypt them: that rests on the hardness of the LWE problem, a cryptographic guarantee rather than a policy.

graph LR
    subgraph Client ["Client (Hospital)"]
        A[Sensitive Data] --> B[Tokenize + Embed]
        B --> C[Encrypt Hidden States]
        G[Decrypt Results] --> H[Non-linear Ops]
        H --> I[Train LoRA Locally]
    end

    subgraph Server ["Server (Cloud)"]
        D[Base Weights W]
        E["W @ Enc(x)<br/>Homomorphic Matmul"]
    end

    C -- "LWE Ciphertexts" --> E
    E -- "Enc(W @ x)" --> G
    D --> E

    style Client fill:#e8f5e9,stroke:#2e7d32
    style Server fill:#e3f2fd,stroke:#1565c0

Two-Pass Protocol

Each transformer layer requires two round-trips between client and server. This is the core of how the system works:

sequenceDiagram
    participant C as Client
    participant S as Server

    Note over C,S: Pass 1 — Attention + MLP Projections

    C->>C: Quantize hidden states → int8
    C->>C: Encrypt with LWE
    C->>S: Send Enc(x)
    S->>S: Compute W_q @ Enc(x)
    S->>S: Compute W_k @ Enc(x)
    S->>S: Compute W_v @ Enc(x)
    S->>S: Compute W_gate @ Enc(x)
    S->>S: Compute W_up @ Enc(x)
    S->>C: Return encrypted projections

    Note over C: Client-side (plaintext)
    C->>C: Decrypt all projections
    C->>C: RoPE positional encoding
    C->>C: Attention: softmax(QK^T/√d) @ V
    C->>C: SiLU activation + gate
    C->>C: Add LoRA: α(U @ D @ x)

    Note over C,S: Pass 2 — Output Projections

    C->>C: Encrypt attention output + MLP hidden
    C->>S: Send Enc(attn), Enc(mlp)
    S->>S: Compute W_o @ Enc(attn)
    S->>S: Compute W_down @ Enc(mlp)
    S->>C: Return encrypted outputs

    C->>C: Decrypt → residual add → next layer

Why two passes? The server can only do linear operations (matrix multiply) on encrypted data. Softmax and SiLU are non-linear and need plaintext values. RoPE is a position-dependent rotation, linear in its input, but this design also applies it on the client between the two passes. So the client must decrypt between the attention projection step and the output projection step.
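The control flow of one layer can be sketched in plain NumPy. This is a simplified illustration, not the project's code: `server_linear` is a hypothetical stand-in for the encrypt, homomorphic matmul, decrypt round-trip and runs in plaintext here, and RoPE, masking, normalization and LoRA are left out to keep it short.

```python
import numpy as np

def server_linear(W, x):
    # Hypothetical stand-in for: client encrypts x, server computes W @ Enc(x),
    # client decrypts. It runs in plaintext so the control flow is easy to follow.
    return x @ W.T

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def silu(z):
    return z / (1.0 + np.exp(-z))           # z * sigmoid(z)

def layer_two_pass(x, w):
    """One simplified layer. x is (seq_len, d); w maps names to base weights."""
    # Pass 1: five projections of the same input.
    q, k, v = (server_linear(w[n], x) for n in ("q", "k", "v"))
    gate, up = server_linear(w["gate"], x), server_linear(w["up"], x)

    # Client-side plaintext ops between the two passes.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    mlp_hidden = silu(gate) * up

    # Pass 2: output projections, then the residual add on the client.
    return x + server_linear(w["o"], attn) + server_linear(w["down"], mlp_hidden)

rng = np.random.default_rng(0)
d, hidden, seq_len = 8, 16, 4
w = {n: rng.normal(0, 0.1, (d, d)) for n in ("q", "k", "v", "o")}
w["gate"] = rng.normal(0, 0.1, (hidden, d))
w["up"] = rng.normal(0, 0.1, (hidden, d))
w["down"] = rng.normal(0, 0.1, (d, hidden))
out = layer_two_pass(rng.normal(size=(seq_len, d)), w)
print(out.shape)  # (4, 8)
```

Every call to `server_linear` in pass 1 takes the same input, so one upload can serve all five projections. Pass 2 needs a second upload because its inputs only exist after the client-side non-linear steps.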

Privacy Stack

This project layers four independent privacy mechanisms. Each protects a different attack surface:

graph TB
    subgraph stack ["Privacy Stack"]
        direction TB
        FHE["FHE (Homomorphic Encryption)<br/>Server never sees hidden states"]
        DP["DP-SGD (Differential Privacy)<br/>Individual records can't be extracted from model"]
        FED["Federated Learning<br/>Raw data never leaves each hospital"]
        CIPHER["Token Cipher<br/>Token IDs scrambled before encryption"]
    end

    FHE --> DP
    DP --> FED
    FED --> CIPHER

    ATK1["Server inspects activations"] -.->|"Blocked by"| FHE
    ATK2["Model memorization attack"] -.->|"Blocked by"| DP
    ATK3["Data centralization"] -.->|"Blocked by"| FED
    ATK4["Frequency analysis on tokens"] -.->|"Blocked by"| CIPHER

    style stack fill:#fff3e0,stroke:#e65100
    style ATK1 fill:#ffcdd2,stroke:#c62828
    style ATK2 fill:#ffcdd2,stroke:#c62828
    style ATK3 fill:#ffcdd2,stroke:#c62828
    style ATK4 fill:#ffcdd2,stroke:#c62828

Layer	What it protects	Guarantee	Overhead
FHE	Hidden states in transit	Cryptographic (LWE hardness)	~178x
DP-SGD	Individual records in trained model	Statistical (ε,δ)-DP	~1.5x
Federation	Raw data locality	Organizational (data never leaves)	~1x per client
Token cipher	Token frequency patterns	Substitution cipher	~1x (negligible)
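As a rough illustration of the token cipher row, a simple substitution cipher over token IDs is a secret permutation of the vocabulary. The sketch below is an assumption about the general technique, not the project's cipher/ module: `make_token_cipher`, the vocabulary size and the seed are all hypothetical, and NumPy's generator is not a cryptographic PRNG.

```python
import numpy as np

def make_token_cipher(vocab_size, seed):
    """Return (encode, decode) lookup tables for a keyed permutation of token IDs."""
    rng = np.random.default_rng(seed)   # the seed plays the role of the secret key here
    encode = rng.permutation(vocab_size)
    decode = np.argsort(encode)         # inverse permutation: decode[encode[t]] == t
    return encode, decode

encode, decode = make_token_cipher(vocab_size=32000, seed=1234)
tokens = np.array([1, 15043, 3186, 2])  # hypothetical token IDs
scrambled = encode[tokens]
restored = decode[scrambled]
print(np.array_equal(restored, tokens))  # True
```

A plain permutation maps each token to exactly one substitute, so per-token frequencies survive. That is presumably why the project layout also lists a homophonic variant, which spreads one token over several substitutes.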

How FHE Works Here

LWE (Learning With Errors) encryption: a plaintext value m becomes (a, b) where b = a·s + m + noise. The secret key s stays on the client.

Because LWE is additively homomorphic, the server can compute W @ Enc(x) and get Enc(W @ x) — without ever knowing x or s.

graph LR
    subgraph Encrypt
        M["m (plaintext)"] --> ENC["(a, b = a·s + m + e)"]
    end

    subgraph "Homomorphic Matmul"
        ENC --> HOM["W @ (a, b)"]
        HOM --> RES["(W·a, W·b) = Enc(W·m)"]
    end

    subgraph Decrypt
        RES --> DEC["b' - a'·s = W·m + noise"]
    end

    style Encrypt fill:#e8f5e9,stroke:#2e7d32
    style Decrypt fill:#e8f5e9,stroke:#2e7d32

Parameter	Value	Why
LWE dimension	1024	~128-bit security (HE Standard)
Noise	2^(-25)	Balance between accuracy and security margin
Modulus	2^32	Implicit int32 arithmetic
Post-quantum	Yes	LWE is not broken by Shor's algorithm
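The scheme above fits in a few lines of NumPy. This is a toy sketch under stated assumptions, not the code in src/fhe/: the scaling factor `DELTA`, the noise level `NOISE_STD` and the binary secret key are choices made for the example, and only the dimension and modulus come from the table.

```python
import numpy as np

Q = 1 << 32          # ciphertext modulus 2^32
LWE_DIM = 1024       # secret-key length, as in the parameter table
DELTA = 1 << 16      # assumed message scaling factor (not taken from the project)
NOISE_STD = 4.0      # assumed toy noise in integer units (not the project's setting)

rng = np.random.default_rng(0)

def keygen():
    """Sample a binary secret key s; it stays on the client."""
    return rng.integers(0, 2, size=LWE_DIM, dtype=np.int64)

def encrypt(s, m):
    """Encrypt each integer in m as one scalar LWE ciphertext (a_i, b_i)."""
    m = np.asarray(m, dtype=np.int64)
    a = rng.integers(0, Q, size=(m.size, LWE_DIM), dtype=np.int64)
    e = np.rint(rng.normal(0.0, NOISE_STD, size=m.size)).astype(np.int64)
    b = (a @ s + DELTA * m + e) % Q
    return a, b

def homomorphic_matmul(W, a, b):
    """Server side: apply integer weights W to ciphertexts without the key."""
    W = np.asarray(W, dtype=np.int64)
    return (W @ a) % Q, (W @ b) % Q

def decrypt(s, a, b):
    """Client side: remove the mask a.s, recenter, and round away the noise."""
    phase = (b - a @ s) % Q
    phase = (phase + Q // 2) % Q - Q // 2   # map to the signed range [-Q/2, Q/2)
    return np.rint(phase / DELTA).astype(np.int64)

s = keygen()
x = rng.integers(-127, 128, size=16)        # int8-range activations
W = rng.integers(-8, 9, size=(4, 16))       # small integer weights
a, b = encrypt(s, x)
y = decrypt(s, *homomorphic_matmul(W, a, b))
print(np.array_equal(y, W @ x))  # True
```

Decryption works because b' - a'·s = DELTA·(W·x) + W·e (mod Q). The result is exact as long as the accumulated noise W·e stays below DELTA/2 and DELTA·(W·x) stays inside the signed range of the modulus, which is why weight magnitudes and quantization levels matter.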

Training

LoRA adds small adapter matrices to each layer: y = W @ x + α(U @ D @ x). Only U and D are trained, and they stay on the client.

graph TB
    subgraph forward ["Forward Pass"]
        X[Input x] --> ENC2[Encrypt]
        ENC2 --> SERVER["Server: W @ Enc(x)"]
        SERVER --> DEC2[Decrypt → W·x]
        X --> LORA["LoRA: α(U @ D @ x)"]
        DEC2 --> ADD["y = W·x + LoRA"]
        LORA --> ADD
    end

    subgraph backward ["Backward Pass (Client-Only)"]
        LOSS[Loss] --> GRAD["∇L projected through lm_head"]
        GRAD --> GU["∇U = α · grad @ (D @ x)^T"]
        GRAD --> GD["∇D = α · U^T @ grad @ x^T"]
        GU --> UPDATE["Adam update U, D"]
        GD --> UPDATE
    end

    ADD --> LOSS

    style forward fill:#e3f2fd,stroke:#1565c0
    style backward fill:#fce4ec,stroke:#c62828

The backward pass computes analytical gradients (no autograd). One limitation: each layer gets the same top-level gradient rather than proper chain-rule backprop through layers. The client doesn't have W in production mode, so inter-layer gradients can't be computed. Training still converges — the Zama paper has the same constraint.
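For a single LoRA layer, the analytical gradients can be checked against finite differences. The sketch below is illustrative only: the squared-error loss, the sizes and the local copy of W are assumptions made so the example is self-contained, and the α factor comes from differentiating y = W @ x + α(U @ D @ x).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 0.5
W = rng.normal(size=(d, d))          # frozen base weight (held locally only for this check)
U = rng.normal(size=(d, r)) * 0.1    # trainable adapter, client side
D = rng.normal(size=(r, d)) * 0.1    # trainable adapter, client side
x = rng.normal(size=(d, 1))
target = rng.normal(size=(d, 1))

def forward(U, D):
    return W @ x + alpha * (U @ (D @ x))

def loss(U, D):
    return 0.5 * np.sum((forward(U, D) - target) ** 2)

# Analytic gradients; grad is dL/dy for the squared-error loss above.
grad = forward(U, D) - target
grad_U = alpha * grad @ (D @ x).T     # shape (d, r)
grad_D = alpha * U.T @ grad @ x.T     # shape (r, d)

# Finite-difference check on one entry of each matrix.
eps = 1e-6
U_p = U.copy(); U_p[0, 0] += eps
D_p = D.copy(); D_p[0, 0] += eps
fd_U = (loss(U_p, D) - loss(U, D)) / eps
fd_D = (loss(U, D_p) - loss(U, D)) / eps
print(abs(fd_U - grad_U[0, 0]) < 1e-4, abs(fd_D - grad_D[0, 0]) < 1e-4)
```

Neither gradient needs W once dL/dy is known, so the adapter update can stay on the client. W would only be needed to push the gradient back to the layer's input, which is the step the approximation described above skips.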

Additional privacy during training:

DP-SGD: Gaussian noise on gradients with RDP accounting (formal ε/δ bounds)
DP-Forward: Embedding noise injection (SeqLDP guarantee)
FFA-LoRA: Freeze D matrix — only train U to reduce DP noise amplification
Federated learning: Multiple clients train locally, aggregate via FedAvg
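Two of the items above reduce to short formulas. The following is a minimal sketch of the standard DP-SGD step (per-example clipping plus Gaussian noise) and of FedAvg weighted averaging; the function names and arguments are hypothetical, and the RDP accounting that turns the noise multiplier into an (ε, δ) bound is not shown.

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to clip_norm, sum, add Gaussian noise, average."""
    g = np.asarray(per_example_grads, dtype=np.float64)     # shape (batch, n_params)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = g * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=g.shape[1])
    return (clipped.sum(axis=0) + noise) / g.shape[0]

def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    w = np.asarray(client_weights, dtype=np.float64)
    n = np.asarray(client_sizes, dtype=np.float64)
    return (w * n[:, None]).sum(axis=0) / n.sum()

rng = np.random.default_rng(0)
grads = [[3.0, 4.0], [0.3, 0.4]]     # norms 5.0 and 0.5
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
print(fedavg([[1.0, 1.0], [3.0, 3.0]], client_sizes=[1, 3]))
```

Clipping bounds how much any single record can move the update, and the noise is scaled to that bound. Freezing D, as FFA-LoRA does, means only U's gradient goes through this step.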

Setup

git clone https://github.com/jeffelin/engima-fhe.git
cd engima-fhe
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Usage

source .venv/bin/activate

# Run tests (919 tests, ~7 min on CPU)
FHE_BACKEND=numpy PYTHONPATH=src python -m pytest tests/ -v --tb=short

# Demo (plaintext, FHE, blind mode, training, MedQA eval)
PYTHONPATH=src python demo.py --mode all

# Training experiments (6 controlled experiments)
PYTHONPATH=src python train_medical.py
PYTHONPATH=src python train_medical.py --fhe           # with FHE comparison

# Benchmarks
PYTHONPATH=src python benchmarks/run_benchmarks.py

# Web UI (http://localhost:8000)
PYTHONPATH=src python web/server.py

Or run bash run_demo.sh to execute everything.

GPU Setup

The default backend is NumPy (CPU). For GPU acceleration:

NVIDIA (CuPy)

pip install -e ".[gpu-cuda]"
FHE_BACKEND=cupy PYTHONPATH=src python demo.py --mode all

Apple Silicon (MLX)

pip install -e ".[gpu-mlx]"
FHE_BACKEND=mlx PYTHONPATH=src python demo.py --mode all

GPU backends are experimental. The FHE pipeline is validated on NumPy. CuPy and MLX dispatch through src/fhe/device.py but have not been tested end-to-end.

Docker

Single container (web UI + API):

docker build -t engima-fhe .
docker run -p 8000:8000 engima-fhe

GPU container (NVIDIA):

docker build -f Dockerfile.gpu -t engima-fhe-gpu .
docker compose -f docker-compose.gpu.yml up

Split deployment (separate client and server containers):

docker compose --profile split up

This starts two containers:

fhe-server — runs BlindFHEServerApp on port 8001 (no secret key)
fhe-client — holds the secret key, connects to the server

graph LR
    subgraph client-container ["fhe-client container"]
        CL[RealFHEClient<br/>Secret key here]
    end

    subgraph server-container ["fhe-server container"]
        SV[BlindFHEServerApp<br/>No secret key]
    end

    CL -- "POST /compute<br/>encrypted bytes" --> SV
    SV -- "encrypted result" --> CL

    style client-container fill:#e8f5e9,stroke:#2e7d32
    style server-container fill:#e3f2fd,stroke:#1565c0
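The "encrypted bytes" on the wire have to come from some binary encoding of the ciphertext arrays. The sketch below shows one possible layout, purely as an assumption: the header fields, byte order and function names are hypothetical and are not the format used by the project's network/ module.

```python
import struct
import numpy as np

HEADER = struct.Struct("<II")   # hypothetical header: ciphertext count, LWE dimension

def pack_ciphertexts(a, b):
    """Serialize masks a (n, dim) and bodies b (n,) as little-endian uint32."""
    a = np.ascontiguousarray(a, dtype="<u4")
    b = np.ascontiguousarray(b, dtype="<u4")
    n, dim = a.shape
    return HEADER.pack(n, dim) + a.tobytes() + b.tobytes()

def unpack_ciphertexts(payload):
    """Inverse of pack_ciphertexts; returns read-only views into the payload."""
    n, dim = HEADER.unpack_from(payload, 0)
    off = HEADER.size
    a = np.frombuffer(payload, dtype="<u4", count=n * dim, offset=off).reshape(n, dim)
    b = np.frombuffer(payload, dtype="<u4", count=n, offset=off + 4 * n * dim)
    return a, b

rng = np.random.default_rng(0)
a = rng.integers(0, 1 << 32, size=(3, 8), dtype=np.int64)
b = rng.integers(0, 1 << 32, size=3, dtype=np.int64)
payload = pack_ciphertexts(a, b)
a2, b2 = unpack_ciphertexts(payload)
print(len(payload), np.array_equal(a, a2), np.array_equal(b, b2))
```

Nothing in the payload depends on the secret key, so the server container can parse, compute on and re-serialize ciphertexts without ever holding it.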

Results

All numbers come from this implementation: pure NumPy, scalar LWE, single-threaded CPU, dim=32, lwe_dim=1024.

Metric	Value
FHE single-layer correlation (random weights)	0.86
FHE single-layer correlation (real Ollama weights + safe_qmax)	0.999
FHE latency per layer	~28 ms
Plaintext latency	~0.2 ms
Overhead	~178x
MedQA accuracy (random weights)	25%
Training convergence (200 steps)	Loss 6.03 → 5.99

There are two correlation numbers because they measure different conditions. Random weights in [-2, 2] have high L1 row norms that cause more quantization clipping, giving 0.86. Real TinyLlama weights (Q4_0) are sparser and better-conditioned; with safe_qmax auto-scaling, correlation reaches 0.999. Both are reproducible via benchmarks/run_benchmarks.py.
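The link between L1 row norms and clipping can be made concrete. For integer weights W and activations in [-q, q], each output satisfies |row · x| <= ||row||_1 · q, so the largest safe q shrinks as the row norms grow. The sketch below is one way such auto-scaling could work, not necessarily what the project's safe_qmax does: `pick_qmax`, `quantize` and the plaintext bit width are hypothetical.

```python
import numpy as np

def pick_qmax(W_int, plaintext_bits, cap=127):
    """Largest activation level q such that every |row . x| fits the signed plaintext range."""
    limit = (1 << (plaintext_bits - 1)) - 1            # largest representable signed result
    worst_row = int(np.abs(W_int).sum(axis=1).max())   # max L1 row norm
    return max(1, min(cap, limit // max(worst_row, 1)))

def quantize(x, qmax):
    """Symmetric quantization of a float vector to integers in [-qmax, qmax]."""
    scale = max(float(np.abs(x).max()), 1e-12) / qmax
    return np.rint(x / scale).astype(np.int64), scale

W_int = np.array([[100, -100], [1, 1]])     # first row has L1 norm 200
q_wide = pick_qmax(W_int, plaintext_bits=16)
q_tight = pick_qmax(W_int, plaintext_bits=12)
xq, scale = quantize(np.array([0.5, -1.0]), q_tight)
print(q_wide, q_tight, xq, np.abs(W_int @ xq).max())
```

With a tight plaintext budget the heavy row forces a much coarser activation grid, which costs precision but avoids overflow. Weights with smaller row norms leave room for a finer grid.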

Training convergence is real but marginal (0.7% over 200 steps). The model is small (hidden_size=64, 1 layer) and the backward pass approximation limits learning speed. The web UI shows ~11% loss drops in some runs with higher learning rates.

Compared to Zama (arXiv:2505.07329)

Zama published the paper this project is based on.

	This project	Zama (Concrete ML)
Language	Python / NumPy	Rust (tfhe-rs) + Python
Ciphertext packing	Scalar LWE (1 value per ct)	RLWE SIMD (~1000 values per ct)
Hardware	CPU (+ experimental GPU)	GPU (CUDA), multi-threaded CPU
Model size	64-dim, 1 layer	Full GPT-2 / Llama layers
Backward pass	Same approximation	Same approximation
Throughput	Educational	~216 sec/token on RTX 4060

The performance gap is large — scalar LWE encrypts each value separately (64-dim = 64 ciphertexts), while RLWE packing fits the same vector in 1 ciphertext. The 178x overhead here would be 10-50x in a production RLWE system.
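The size side of that gap is simple arithmetic. The sketch below uses the LWE dimension and modulus from the parameter table; the RLWE ring degree is an assumption picked only to make the comparison concrete, not a Zama setting.

```python
LWE_DIM = 1024        # from the parameter table
BYTES_PER_WORD = 4    # modulus 2^32, so one 32-bit word per coefficient
HIDDEN = 64           # model width quoted above

scalar_ct_bytes = (LWE_DIM + 1) * BYTES_PER_WORD   # mask a plus body b: 4100 bytes
scalar_total = HIDDEN * scalar_ct_bytes            # one ciphertext per value: 262400 bytes

RING_DEGREE = 1024    # assumed RLWE ring degree, for comparison only
rlwe_ct_bytes = 2 * RING_DEGREE * BYTES_PER_WORD   # two polynomials: 8192 bytes

print(scalar_ct_bytes, scalar_total, rlwe_ct_bytes)
print(round(scalar_total / rlwe_ct_bytes, 1))      # about 32x more bytes for scalar LWE
```

Under these assumptions a 64-dim hidden state costs about 256 KiB as scalar LWE and 8 KiB as one packed RLWE ciphertext, and the packed ciphertext still has unused slots.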

Compared to Other Privacy Approaches

Approach	Guarantee	Overhead	Maturity
FHE (this, Zama)	Cryptographic — server can't see data	10-200x	Research
DP-SGD	Statistical — individual records protected	1-3x	Production
Secure enclaves (TEE)	Hardware — trusted execution environment	~1x	Production
Federated learning	Data never leaves client	~1x per client	Production

FHE gives the strongest guarantee but pays the most in performance. This project stacks FHE + DP-SGD + federation because they're complementary.

Honest Limitations

Tiny model. 64-dim, 1 layer is far from a real LLM. Scalar LWE is impractically slow at full Llama dimensions (2048+).
Marginal training. The gradient approximation converges but isn't competitive with standard backprop. This is a fundamental privacy tradeoff.
Simplified security estimate. The 128-bit claim uses HE Standard tables. A real audit would use the lattice-estimator tool.
GPU backends untested. CuPy/MLX interfaces exist but haven't been validated end-to-end.
MedQA = random chance. 25% accuracy measures the evaluation pipeline, not model quality (random weights).

Project Layout

src/                        20,600+ lines across 52 files
  fhe/                      TFHE crypto: LWE, RLWE, GSW, bootstrap, NTT, SIMD packing
  models/                   FHELlamaForCausalLM, LoRA layers, kernel attention, RoPE
  client/                   Training orchestrator, LoRA manager, FHE client
  server/                   FHEServerCallback (sim), BlindFHEServerApp (production)
  privacy/                  DP-SGD, DP-Forward (embedding noise), RDP accounting
  federation/               FedAvg federated trainer
  cipher/                   Token substitution cipher (simple + homophonic)
  anonymization/            HIPAA PII removal
  network/                  Ciphertext binary serialization
  core/                     Config, training state, LR scheduler

tests/                      919 tests across 43 files
benchmarks/                 MedQA eval, FHE overhead, training convergence
scripts/                    Split deployment entry points, MedQA download
web/                        Browser UI with training wizard and playground
data/                       20 medical training notes, 20 MedQA questions

References

Chillotti et al., "TFHE: Fast Fully Homomorphic Encryption over the Torus", J. Cryptology 2020
Frery et al., "Private LoRA Fine-tuning of Open-Source LLMs with Homomorphic Encryption", arXiv:2505.07329
Regev, "On Lattices, Learning with Errors, Random Linear Codes, and Cryptography", STOC 2005
Gentry, Sahai, Waters, "Homomorphic Encryption from Learning with Errors", Crypto 2013

Not for Production

This is a research/educational implementation. For production FHE, use TFHE-rs, OpenFHE, or Microsoft SEAL.

License

Apache 2.0
