FormalSLT is a compact Lean 4 library for the finite-sample statistical learning theory route from empirical risk minimization to VC-style generalization bounds. Recent extensions add contraction, linear predictors, finite sub-Gaussian chaining, algorithmic stability, finite PAC-Bayes confidence bounds, and a first total-bounded finite-net bridge for the Dudley lane.

45 Lean modules. Zero `sorry`. Zero `admit`. Zero custom axioms.

Axioms used by the public theorem spine: `[propext, Classical.choice, Quot.sound]`.
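This axiom claim can be reproduced locally with Lean's built-in axiom audit. The sketch below uses a placeholder theorem name (`FormalSLT.massart_bound` is illustrative, not necessarily the library's actual identifier):

```lean
-- `#print axioms` is a built-in Lean 4 command. Substitute any public
-- theorem name from the library; the placeholder below is hypothetical.
#print axioms FormalSLT.massart_bound
-- A clean result lists only: propext, Classical.choice, Quot.sound
```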
The core path runs from ERM through Rademacher symmetrization, high-probability Rademacher bounds, Massart, Sauer-Shelah, the binary VC bridge, finite contraction, linear predictors, finite sub-Gaussian chaining, finite Dudley entropy-budget wrappers, finite algorithmic stability, finite localized-Rademacher scaffolding, finite PAC-Bayes KL/DV/MGF and bounded-loss confidence bounds, and total-bounded finite-net adapters for the next Dudley steps.
- For ML readers: start with How to read the proofs, then Intuition.
- For proof structure: see Diagrams.
- For exact theorem names: use Theorem map.
- For scope and assumptions: read Assumptions and current boundaries.
- For contributors: read Contributing, then Good first issues.
- For related Lean projects: see Related work. FormalSLT is complementary to YuanheZ/lean-stat-learning-theory, which builds deeper empirical-process infrastructure; this repo focuses on a readable finite-class theorem spine.
Finite-sample proofs in statistical learning theory carry assumptions that are easy to blur in prose. FormalSLT turns a readable finite-sample route into machine-checked Lean statements, so each bound carries its hypotheses, constants, and finite-class scope in the theorem signature. The motivations:
- Reproducibility. Every constant (the `8B²` exponent, the `2 * E[Rad]` factor, the `(en/d)^d` Sauer-Shelah closed form) is a Lean term that the kernel re-checks on every build. There is no implicit "without loss of generality" or "up to constants".
- Curriculum. A graduate student reading the README can click through to the exact Lean statement for any bound they have seen on a chalkboard, with the assumptions made explicit in the type signature.
- Frontier ML. As learning theory increasingly feeds back into modern deep-learning analyses (PAC-Bayes generalization, stability, chaining), having a checked finite scaffold is the first step toward formal guarantees on the methods themselves.
| Family | Main modules | Representative result | Status |
|---|---|---|---|
| Finite-class ERM | `Risk`, `ERM`, `Rademacher.ERMGeneralization` | Excess risk controlled by uniform deviation | Verified |
| Rademacher symmetrization | `GhostSample`, `Rademacher.Symmetrization` | `E[genGap] ≤ 2 * E[Rad]` | Verified |
| High-probability Rademacher | `Azuma.*`, `Rademacher.HighProbability` | `P(genGap ≥ 2 * E[Rad] + ε) ≤ exp(-ε² n / (8B²))` | Verified |
| Massart finite-class bound | `Rademacher.Massart` | `Rad(H,S) ≤ B * sqrt(2 * log card(H) / n)` | Verified |
| Binary VC route | `VC.SauerShelah`, `VC.BinaryVCBridge`, `VC.SampleComplexity` | VC-style ERM excess-risk tail and closed sample-complexity form via effective classes | Verified |
| Finite contraction | `Rademacher.Contraction` | `Rad_S(φ ∘ F) ≤ L * Rad_S(F)` for scalar finite samples/classes | Verified |
| Linear predictors | `Rademacher.LinearPredictor` | `Rad ≤ R * n⁻¹ * sqrt(∑ k, ‖z k‖²)` and `Rad ≤ R * B / sqrt n` | Verified |
| Localized Rademacher scaffold | `Rademacher.Localized` | Bernstein excess-risk localization embeds into second-moment localized empirical Rademacher complexity | Verified finite scaffold |
| Finite covering and two-scale chaining | `Covering.Rademacher`, `Covering.DudleyChaining` | ε-net peeling and two-scale finite chaining | Verified |
| Finite sub-Gaussian chaining foundation | `Covering.FiniteSubGaussianChaining` | Finite-max entropy bounds and finite Dudley-style entropy-budget sums | Verified finite infrastructure |
| Total-bounded Dudley bridge | `Covering.TotalBoundedDudley` | Totally bounded metric spaces yield dyadic finite-net schedules, projected finite-net wrappers, truncated interval-integral entropy comparisons, and supplied-supremum / finite-skeleton / pathwise-modulus / epsilonized boundary adapters | Verified bridge |
| Algorithmic stability | `AlgorithmicStability`, `Stability.BousquetElisseeff` | Bounded-differences constants, finite and product-measure expected-gap wrappers with bound β, bounded-loss measurability adapters, and bounded-loss Azuma-constant concentration wrappers | Verified finite scaffold |
| PAC-Bayes finite confidence layer | `PACBayesKL`, `PACBayesMcAllester`, `PACBayesFiniteProductMGF`, `PACBayesBoundedLoss` | Finite KL/DV change-of-measure, bounded-loss Catoni-style bound, closed PAC-Bayes good-event payoff, fixed-budget McAllester corollary, and finite-grid McAllester peeling wrapper | Verified finite layer |
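To illustrate the "assumptions in the type signature" style of these results, here is a schematic of the Massart finite-class bound as a Lean statement shape. All identifiers (`empiricalRademacher` in particular) are placeholders, not the library's actual API:

```lean
-- Schematic statement shape only; names are illustrative, not real
-- FormalSLT identifiers. Note that finiteness of the class `H`, the
-- uniform bound `B`, and the sample size `n` all appear explicitly,
-- and the constants 2, log (card H), and n are literal in the bound.
example {H : Type} [Fintype H] {n : ℕ} (B : ℝ)
    (f : H → Fin n → ℝ) (hB : ∀ h i, |f h i| ≤ B) : Prop :=
  empiricalRademacher f ≤ B * Real.sqrt (2 * Real.log (Fintype.card H) / n)
```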
The main generalization theorems are intentionally finite and explicit.
| Scope item | Current state |
|---|---|
| Hypothesis classes | Finite index types unless a theorem states a separate finite net/family |
| Samples | Finite iid samples through product measures |
| Losses/processes | Scalar real-valued, with boundedness or finite sub-Gaussian MGF assumptions |
| Constants | High-probability Rademacher bounds use the Azuma 8B² exponent |
| Chaining | Finite nets/images, finite support/outcome spaces, finite entropy sums |
| Public axiom target | `[propext, Classical.choice, Quot.sound]` only |
Short version:
- The main Rademacher and VC results are finite-class / finite-sample theorems.
- High-probability Rademacher bounds use the Azuma `8B²` exponent; the sharper McDiarmid constant is future work.
- The chaining layer proves finite entropy-budget infrastructure and a first total-bounded finite-net extraction bridge, not the continuous Dudley integral.
- PAC-Bayes includes a finite `[0,1]` bounded-loss Catoni-style confidence bound, a closed high-confidence good-event theorem, a fixed-budget McAllester-style square-root corollary, and a finite-grid peeling wrapper for posterior-dependent penalties. Exact all-real-`λ`, infinite-hypothesis, and continuous-posterior variants remain future work.
- Algorithmic stability includes finite iid and measure-theoretic iid expected-gap wrappers, plus bounded-loss high-probability wrappers for finite measurable hypothesis interfaces.
For the full scope statement, see Assumptions and current boundaries.
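As a quick sanity check on the `8B²` exponent (a standard textbook derivation, not a restatement of the library's exact lemma): Azuma-Hoeffding for a martingale with increments bounded by $c_i$ gives

$$
\Pr\left(M_n - M_0 \ge \varepsilon\right) \;\le\; \exp\left(-\frac{\varepsilon^2}{2\sum_{i=1}^{n} c_i^2}\right),
$$

and exposing one sample of a loss bounded by $B$ in absolute value moves the empirical mean by at most $c_i = 2B/n$, so $2\sum_i c_i^2 = 8B^2/n$ and the tail is $\exp(-\varepsilon^2 n / (8B^2))$. The McDiarmid route with the same bounded differences would give $\exp(-2\varepsilon^2 / \sum_i c_i^2) = \exp(-\varepsilon^2 n / (2B^2))$, which is why the sharper constant is listed as future work.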
This project requires elan, the Lean toolchain manager. elan will read `lean-toolchain` and fetch the pinned Lean version automatically.

```
git clone https://github.com/Robby955/lean-statistical-learning.git
cd lean-statistical-learning
lake exe cache get    # download pre-built Mathlib oleans
lake build FormalSLT  # build the library
```

If `lake` is not on your shell path, use the elan binary directly:

```
~/.elan/bin/lake exe cache get
~/.elan/bin/lake build FormalSLT
```

The first build takes a few minutes (downloading the Mathlib cache); after that, builds are incremental.
Run these before treating a branch as a showcase candidate:

```
lake exe cache get
lake build FormalSLT
lake env lean examples/CheckShowcaseTheorems.lean
```

Use the release-candidate checks above, then run the proof-debt and whitespace audits:

```
rg -n --pcre2 '^\s*(?:by\s+)?(?:sorry|admit)\b|:=\s*(?:by\s+)?(?:sorry|admit)\b' FormalSLT examples
rg -n --pcre2 '^\s*(?:axiom|constant)\s+[A-Za-z_]' FormalSLT examples
git diff --check
```

The expected result is:

- `lake build FormalSLT` exits successfully;
- `examples/CheckShowcaseTheorems.lean` prints standard Lean/Mathlib axioms for selected public theorems;
- the `rg` commands find no executable `sorry`, no executable `admit`, and no custom axioms/constants in `FormalSLT` or `examples`;
- `git diff --check` reports no whitespace errors.
| Layer | Modules |
|---|---|
| Core definitions | `Risk`, `ERM`, `UniformConvergence`, `GhostSample` |
| Probability utilities | `Probability.Concentration`, `Probability.FiniteUnionBound`, `Probability.FiniteExpectation` |
| Rademacher route | `Rademacher.FiniteSample`, `Rademacher.FiniteSampleSymmetrization`, `Rademacher.ProbabilityBridge`, `Rademacher.Decoupling`, `Rademacher.Symmetrization`, `Rademacher.Massart`, `Rademacher.HighProbability`, `Rademacher.FiniteClassHighProb`, `Rademacher.UniformDeviation`, `Rademacher.ERMGeneralization`, `Rademacher.Contraction`, `Rademacher.LinearPredictor`, `Rademacher.Localized` |
| Azuma infrastructure | `Azuma.ExposureMartingale`, `Azuma.BoundedDifferences`, `Azuma.BoundedDiffMartingale`, `Azuma.BoundedDiffsAzumaInput`, `Azuma.BoundedIncrementBound`, `Azuma.HasBoundedDifferences`, `Azuma.ExposureIncrementHoeffding`, `Azuma.ExposureIncrementCondMGF`, `Azuma.GenGapTail` |
| VC route | `VC.Dimension`, `VC.PACBridge`, `VC.SauerShelah`, `VC.Rademacher`, `VC.SampleComplexity`, `VC.BinaryVCBridge` |
| Covering and chaining | `Covering.Rademacher`, `Covering.DudleyChaining`, `Covering.FiniteSubGaussianChaining`, `Covering.TotalBoundedDudley` |
| Stability and PAC-Bayes foundations | `AlgorithmicStability`, `Stability.BousquetElisseeff`, `PACBayesKL`, `PACBayesMcAllester`, `PACBayesFiniteProductMGF`, `PACBayesBoundedLoss` |
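Assuming the on-disk module layout mirrors the names above under a `FormalSLT` root (the import paths below are illustrative; check the repository's source tree for the exact layout), a downstream file would import only the layers it needs:

```lean
-- Import paths are assumptions about the directory layout, inferred
-- from the module table above; verify against the FormalSLT/ directory.
import FormalSLT.Rademacher.Massart
import FormalSLT.VC.SauerShelah

-- Downstream proofs can then invoke the finite-class bounds directly.
```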
- Finite-sample Rademacher definitions
- Rademacher symmetrization
- Massart finite-class bound
- Azuma-Hoeffding genGap tail
- High-probability Rademacher bound
- Sauer-Shelah polynomial bound
- VC-style pointwise Rademacher
- VC uniform deviation and ERM excess-risk tail
- VC closed-form ERM sample-complexity theorem
- Binary-class VC to effective loss-pattern bridge
- Finite-sample scalar contraction
- Finite-dimensional linear predictor Rademacher bound
- Finite localized Rademacher/Bernstein variance-localization scaffold
- Covering number peeling and two-scale chaining
- Finite sub-Gaussian max and finite chaining entropy budgets
- Algorithmic stability bounded-differences scaffold
- Finite algorithmic stability expected-gap adapter under coordinate-swap identity
- Finite iid coordinate-swap identity and literal expected-generalization-gap specialization
- Finite iid two-sided algorithmic stability expected-gap wrappers
- PAC-Bayes KL divergence and Donsker-Varadhan variational inequality
- PAC-Bayes finite iid product MGF bridge for empirical-risk deviations
- PAC-Bayes bounded-loss MGF, Markov confidence, and finite Catoni-style bound
- PAC-Bayes closed high-confidence generalization payoff theorem
- PAC-Bayes fixed-budget McAllester-style square-root corollary
- PAC-Bayes finite-grid McAllester peeling and optimized finite-grid wrapper
- Finite Dudley discrete entropy-bound refinements: annulus, integral-budget, and prefix-envelope wrappers
- Total-bounded finite-net extraction bridge for the continuous Dudley lane
- Projected-sup total-bounded dyadic Dudley wrapper
- Projected finite-net total-bounded dyadic Dudley wrapper without finite ambient index type
- Finite dyadic-budget to entropy-at-radius upper-sum comparison
- Shifted finite-annulus entropy budget collapsed to one truncated interval integral
- Supplied-supremum boundary adapter with explicit terminal approximation error
- Finite-skeleton/dense-net boundary adapter with explicit separability and terminal projection errors
- Pathwise-modulus and finite-skeleton witness lemmas that discharge the continuous-boundary adapter hypotheses
- Epsilonized finite-choice boundary adapter: every positive error budget can be discharged by a finite skeleton and terminal dyadic scale certificate
- Finite-cover/pathwise-modulus certificate for the epsilonized total-bounded Dudley boundary layer
- Dudley boundary epsilon elimination under a uniform global finite-budget hypothesis
- Separable-terminal Dudley boundary adapter under explicit finite-skeleton and terminal-projection hypotheses
- Pathwise terminal-modulus constructor for separable-terminal Dudley boundary certificates
- Finite-cover/pathwise-modulus bridge into the separable-terminal Dudley boundary interface
- Finite-terminal total-bounded dyadic Dudley wrapper
- Continuous Dudley entropy-integral theorem over total-bounded classes
- Measure-theoretic iid algorithmic stability expected bound, with explicit integrability assumptions
- Bounded-loss measurability adapters for the measure-theoretic stability expected-gap theorem
- Bounded-loss high-probability stability wrappers for finite measurable hypothesis interfaces
- PAC-Bayes all-real-`λ` or continuous-posterior extensions
- Sharp McDiarmid/product-kernel decomposition
- Continuous Dudley-style entropy integral
Contributions are welcome. Please read CONTRIBUTING.md
before opening a PR. The short version: one theorem per PR, no `sorry` / no
`admit`, only the standard `[propext, Classical.choice, Quot.sound]` axioms,
and assumptions stated in the type signature rather than buried inside the
proof.
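"Assumptions stated in the type signature" means the preferred contribution style looks like the schematic below (placeholder names, not library code): every hypothesis the bound needs is a named argument, so a reader can audit the theorem's scope without opening the proof term.

```lean
-- Schematic contributing style only (hypothetical names). The positivity
-- of `n`, the bound `B`, and the gap hypothesis are all visible in the
-- signature; nothing is introduced mid-proof.
theorem gap_le_budget {n : ℕ} (hn : 0 < n) (B gap : ℝ) (hB : 0 ≤ B)
    (h : gap ≤ 2 * B / Real.sqrt n) :
    gap ≤ 2 * B / Real.sqrt n :=
  h
```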
For ideas, see the open-formalization-problems list, good first issues, and the unchecked items in the roadmap above. Before a public release, maintainers can use the public release checklist.
If you use FormalSLT in academic work, please cite:
```bibtex
@software{formal_slt,
  title  = {FormalSLT: Formal Statistical Learning Theory in Lean 4},
  author = {Sneiderman, Robert},
  year   = {2026},
  url    = {https://github.com/Robby955/FormalSLT},
  note   = {Lean 4 formalization of finite-sample SLT bounds.}
}
```

This project is released under the MIT License.