Formal Statistical Learning Theory in Lean 4


FormalSLT is a compact Lean 4 library formalizing the finite-sample statistical learning theory route from empirical risk minimization to VC-style generalization bounds. Recent extensions cover contraction, linear predictors, finite sub-Gaussian chaining, algorithmic stability, finite PAC-Bayes confidence bounds, and a first total-bounded finite-net bridge toward the continuous Dudley entropy integral.

45 Lean modules, 19,521 lines. Zero sorry. Zero admit. Zero custom axioms.

Axioms used by the public theorem spine: [propext, Classical.choice, Quot.sound].

FormalSLT theorem chain

The core path runs from ERM through Rademacher symmetrization, high-probability Rademacher bounds, Massart's finite-class bound, Sauer-Shelah, the binary VC bridge, finite contraction, linear predictors, finite sub-Gaussian chaining, finite Dudley entropy-budget wrappers, finite algorithmic stability, finite localized-Rademacher scaffolding, finite PAC-Bayes KL/DV/MGF and bounded-loss confidence bounds, and total-bounded finite-net adapters toward the continuous Dudley integral.
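
For orientation, these are the central inequalities along the chain, stated informally in standard notation (the Lean statements carry the precise finiteness, boundedness, and measurability hypotheses):

  • Symmetrization: $\mathbb{E}[\mathrm{genGap}] \le 2\,\mathbb{E}[\mathrm{Rad}_n(H)]$
  • High-probability tail: $\Pr\big(\mathrm{genGap} \ge 2\,\mathbb{E}[\mathrm{Rad}_n(H)] + \varepsilon\big) \le \exp\big(-\varepsilon^2 n / (8B^2)\big)$
  • Massart: $\mathrm{Rad}(H,S) \le B\sqrt{2\log|H|/n}$
  • Sauer–Shelah: $\Pi_H(n) \le (en/d)^d$ for $n \ge d = \mathrm{VCdim}(H)$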

Where to start

The Theorem families table below lists entry-point modules and a representative result for each strand; the Module map gives the full file layout, and Installation covers building the library locally.

Why this matters

Finite-sample proofs in statistical learning theory carry assumptions that are easy to blur in prose. FormalSLT turns a readable finite-sample route into machine-checked Lean statements, so each bound carries its hypotheses, constants, and finite-class scope in the theorem signature. The motivations:

  • Reproducibility. Every constant — the 8B² exponent, the 2 * E[Rad] factor, the (en/d)^d Sauer-Shelah closed form — is a Lean term that the kernel re-checks on every build. There is no implicit "without loss of generality" or "up to constants".
  • Curriculum. A graduate student reading the README can click through to the exact Lean statement for any bound they have seen on a chalkboard, with the assumptions made explicit in the type signature.
  • Frontier ML. As learning theory increasingly feeds back into modern deep-learning analyses (PAC-Bayes generalization, stability, chaining), having a checked finite scaffold is the first step toward formal guarantees on the methods themselves.

Theorem families

| Family | Main modules | Representative result | Status |
| --- | --- | --- | --- |
| Finite-class ERM | `Risk`, `ERM`, `Rademacher.ERMGeneralization` | Excess risk controlled by uniform deviation | Verified |
| Rademacher symmetrization | `GhostSample`, `Rademacher.Symmetrization` | `E[genGap] ≤ 2 * E[Rad]` | Verified |
| High-probability Rademacher | `Azuma.*`, `Rademacher.HighProbability` | `P(genGap ≥ 2 * E[Rad] + ε) ≤ exp(-ε² n / (8B²))` | Verified |
| Massart finite-class bound | `Rademacher.Massart` | `Rad(H,S) ≤ B * sqrt(2 * log card(H) / n)` | Verified |
| Binary VC route | `VC.SauerShelah`, `VC.BinaryVCBridge`, `VC.SampleComplexity` | VC-style ERM excess-risk tail and closed sample-complexity form via effective classes | Verified |
| Finite contraction | `Rademacher.Contraction` | `Rad_S(φ ∘ F) ≤ L * Rad_S(F)` for scalar finite samples/classes | Verified |
| Linear predictors | `Rademacher.LinearPredictor` | `Rad ≤ R * n⁻¹ * sqrt(∑ k, ‖z k‖²)` and `Rad ≤ R * B / sqrt n` | Verified |
| Localized Rademacher scaffold | `Rademacher.Localized` | Bernstein excess-risk localization embeds into second-moment localized empirical Rademacher complexity | Verified finite scaffold |
| Finite covering and two-scale chaining | `Covering.Rademacher`, `Covering.DudleyChaining` | ε-net peeling and two-scale finite chaining | Verified |
| Finite sub-Gaussian chaining foundation | `Covering.FiniteSubGaussianChaining` | Finite-max entropy bounds and finite Dudley-style entropy-budget sums | Verified finite infrastructure |
| Total-bounded Dudley bridge | `Covering.TotalBoundedDudley` | Dyadic finite-net schedules from totally bounded metric spaces, projected finite-net wrappers, truncated interval-integral entropy comparisons, and supplied-supremum / finite-skeleton / pathwise-modulus / epsilonized boundary adapters | Verified bridge |
| Algorithmic stability | `AlgorithmicStability`, `Stability.BousquetElisseeff` | Bounded-differences constants, finite and product-measure expected-gap wrappers with bound β, bounded-loss measurability adapters, and bounded-loss Azuma-constant concentration wrappers | Verified finite scaffold |
| PAC-Bayes finite confidence layer | `PACBayesKL`, `PACBayesMcAllester`, `PACBayesFiniteProductMGF`, `PACBayesBoundedLoss` | Finite KL/DV change-of-measure, bounded-loss Catoni-style bound, closed PAC-Bayes good-event payoff, fixed-budget McAllester corollary, and finite-grid McAllester peeling wrapper | Verified finite layer |
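
To make the Rademacher rows concrete, here is one standard way to phrase empirical Rademacher complexity for a finite class on a finite sample. This is an illustrative sketch only; the actual definitions live in the `Rademacher.FiniteSample` modules and may differ in form.

```lean
import Mathlib

/-- Illustrative sketch (not FormalSLT's literal definition): empirical
Rademacher complexity of a finite class `f : Fin m → Fin n → ℝ`, averaging
the supremum of sign-weighted empirical means over all `2^n` sign vectors. -/
noncomputable def empRadSketch (n m : ℕ) (f : Fin m → Fin n → ℝ) : ℝ :=
  (∑ σ : Fin n → Bool,
      ⨆ i : Fin m, (n : ℝ)⁻¹ * ∑ k, (if σ k then (1 : ℝ) else -1) * f i k)
    / 2 ^ n
```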

Scope and assumptions

The main generalization theorems are intentionally finite and explicit.

| Scope item | Current state |
| --- | --- |
| Hypothesis classes | Finite index types unless a theorem states a separate finite net/family |
| Samples | Finite iid samples through product measures |
| Losses/processes | Scalar real-valued, with boundedness or finite sub-Gaussian MGF assumptions |
| Constants | High-probability Rademacher bounds use the Azuma 8B² exponent |
| Chaining | Finite nets/images, finite support/outcome spaces, finite entropy sums |
| Public axiom target | `[propext, Classical.choice, Quot.sound]` only |
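
The "finite iid samples through product measures" row corresponds, schematically, to Mathlib's product-measure construction. The sketch below shows the assumed shape, not FormalSLT's literal definition:

```lean
import Mathlib
open MeasureTheory

/-- Illustrative sketch (assumed shape, not FormalSLT's literal definition):
`n` iid draws from a measure `μ`, modeled as the product measure on
`Fin n → Ω` via `Measure.pi`. -/
noncomputable def iidSampleSketch {Ω : Type*} [MeasurableSpace Ω]
    (μ : Measure Ω) (n : ℕ) : Measure (Fin n → Ω) :=
  Measure.pi fun _ => μ
```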

Current boundaries

Short version:

  • The main Rademacher and VC results are finite-class / finite-sample theorems.
  • High-probability Rademacher bounds use the Azuma 8B² exponent; the sharper McDiarmid constant is future work.
  • The chaining layer proves finite entropy-budget infrastructure and a first total-bounded finite-net extraction bridge, not the continuous Dudley integral.
  • PAC-Bayes includes a finite [0,1] bounded-loss Catoni-style confidence bound, a closed high-confidence good-event theorem, a fixed-budget McAllester-style square-root corollary, and a finite-grid peeling wrapper for posterior-dependent penalties. Exact all-real-λ, infinite-hypothesis, and continuous-posterior variants remain future work (the change-of-measure step behind these bounds is sketched below).
  • Algorithmic stability includes finite iid and measure-theoretic iid expected-gap wrappers, plus bounded-loss high-probability wrappers for finite measurable hypothesis interfaces.

For the full scope statement, see Assumptions and current boundaries.
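
For readers new to the PAC-Bayes layer, the KL/DV change-of-measure step named above is, informally, the Donsker–Varadhan inequality: for distributions $Q, P$ over hypotheses with $Q$ absolutely continuous with respect to $P$, and any bounded measurable $g$,

$$\mathbb{E}_{h \sim Q}[g(h)] \;\le\; \mathrm{KL}(Q \,\|\, P) + \log \mathbb{E}_{h \sim P}\!\left[e^{g(h)}\right].$$

Taking $g = \lambda \cdot (\text{risk deviation})$, bounding the moment-generating-function term, and applying Markov's inequality yields the confidence bounds; the Lean versions are stated over finite hypothesis types with all hypotheses explicit.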

Installation

This project requires elan, the Lean toolchain manager. `elan` reads `lean-toolchain` and fetches the pinned Lean version automatically.

```bash
git clone https://github.com/Robby955/lean-statistical-learning.git
cd lean-statistical-learning
lake exe cache get      # download pre-built Mathlib oleans
lake build FormalSLT    # build the library
```

If lake is not on your shell path, use the elan binary directly:

```bash
~/.elan/bin/lake exe cache get
~/.elan/bin/lake build FormalSLT
```

The first build takes a few minutes (downloading the Mathlib cache); after that, builds are incremental.
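
To depend on FormalSLT from another Lean project, a lakefile entry along the following lines should work. This is a hypothetical sketch: the package name and branch are assumptions, so check the repository's own lakefile for the canonical name and pin a specific revision.

```lean
import Lake
open Lake DSL

package demo

-- Hypothetical dependency entry; the package name `FormalSLT` and the
-- `main` branch are assumptions, not verified against the repo's lakefile.
require FormalSLT from git
  "https://github.com/Robby955/lean-statistical-learning" @ "main"

lean_lib Demo
```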

Release-candidate checks

Run these before treating a branch as a showcase candidate:

```bash
lake exe cache get
lake build FormalSLT
lake env lean examples/CheckShowcaseTheorems.lean
```

Audit commands

Use the release-candidate checks above, then run the proof-debt and whitespace audits:

```bash
rg -n --pcre2 '^\s*(?:by\s+)?(?:sorry|admit)\b|:=\s*(?:by\s+)?(?:sorry|admit)\b' FormalSLT examples
rg -n --pcre2 '^\s*(?:axiom|constant)\s+[A-Za-z_]' FormalSLT examples
git diff --check
```

The expected results are:

  • lake build FormalSLT exits successfully;
  • examples/CheckShowcaseTheorems.lean prints standard Lean/Mathlib axioms for selected public theorems (an illustration follows after this list);
  • the rg commands find no executable sorry, no executable admit, and no custom axioms/constants in FormalSLT or examples;
  • git diff --check reports no whitespace errors.
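
As a standalone illustration of the axiom audit, Lean's `#print axioms` command reports exactly which axioms a proof ultimately depends on; `Classical.em` is a stock example whose footprint is exactly the standard three:

```lean
import Mathlib

-- Ask the kernel which axioms this proof ultimately depends on.
-- Expected output:
--   'Classical.em' depends on axioms: [propext, Classical.choice, Quot.sound]
#print axioms Classical.em
```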

Module map

| Layer | Modules |
| --- | --- |
| Core definitions | `Risk`, `ERM`, `UniformConvergence`, `GhostSample` |
| Probability utilities | `Probability.Concentration`, `Probability.FiniteUnionBound`, `Probability.FiniteExpectation` |
| Rademacher route | `Rademacher.FiniteSample`, `Rademacher.FiniteSampleSymmetrization`, `Rademacher.ProbabilityBridge`, `Rademacher.Decoupling`, `Rademacher.Symmetrization`, `Rademacher.Massart`, `Rademacher.HighProbability`, `Rademacher.FiniteClassHighProb`, `Rademacher.UniformDeviation`, `Rademacher.ERMGeneralization`, `Rademacher.Contraction`, `Rademacher.LinearPredictor`, `Rademacher.Localized` |
| Azuma infrastructure | `Azuma.ExposureMartingale`, `Azuma.BoundedDifferences`, `Azuma.BoundedDiffMartingale`, `Azuma.BoundedDiffsAzumaInput`, `Azuma.BoundedIncrementBound`, `Azuma.HasBoundedDifferences`, `Azuma.ExposureIncrementHoeffding`, `Azuma.ExposureIncrementCondMGF`, `Azuma.GenGapTail` |
| VC route | `VC.Dimension`, `VC.PACBridge`, `VC.SauerShelah`, `VC.Rademacher`, `VC.SampleComplexity`, `VC.BinaryVCBridge` |
| Covering and chaining | `Covering.Rademacher`, `Covering.DudleyChaining`, `Covering.FiniteSubGaussianChaining`, `Covering.TotalBoundedDudley` |
| Stability and PAC-Bayes foundations | `AlgorithmicStability`, `Stability.BousquetElisseeff`, `PACBayesKL`, `PACBayesMcAllester`, `PACBayesFiniteProductMGF`, `PACBayesBoundedLoss` |

Roadmap

  • Finite-sample Rademacher definitions
  • Rademacher symmetrization
  • Massart finite-class bound
  • Azuma-Hoeffding genGap tail
  • High-probability Rademacher bound
  • Sauer-Shelah polynomial bound
  • VC-style pointwise Rademacher
  • VC uniform deviation and ERM excess-risk tail
  • VC closed-form ERM sample-complexity theorem
  • Binary-class VC to effective loss-pattern bridge
  • Finite-sample scalar contraction
  • Finite-dimensional linear predictor Rademacher bound
  • Finite localized Rademacher/Bernstein variance-localization scaffold
  • Covering number peeling and two-scale chaining
  • Finite sub-Gaussian max and finite chaining entropy budgets
  • Algorithmic stability bounded-differences scaffold
  • Finite algorithmic stability expected-gap adapter under coordinate-swap identity
  • Finite iid coordinate-swap identity and literal expected-generalization-gap specialization
  • Finite iid two-sided algorithmic stability expected-gap wrappers
  • PAC-Bayes KL divergence and Donsker-Varadhan variational inequality
  • PAC-Bayes finite iid product MGF bridge for empirical-risk deviations
  • PAC-Bayes bounded-loss MGF, Markov confidence, and finite Catoni-style bound
  • PAC-Bayes closed high-confidence generalization payoff theorem
  • PAC-Bayes fixed-budget McAllester-style square-root corollary
  • PAC-Bayes finite-grid McAllester peeling and optimized finite-grid wrapper
  • Finite Dudley discrete entropy-bound refinements: annulus, integral-budget, and prefix-envelope wrappers
  • Total-bounded finite-net extraction bridge for the continuous Dudley lane
  • Projected-sup total-bounded dyadic Dudley wrapper
  • Projected finite-net total-bounded dyadic Dudley wrapper without finite ambient index type
  • Finite dyadic-budget to entropy-at-radius upper-sum comparison
  • Shifted finite-annulus entropy budget collapsed to one truncated interval integral
  • Supplied-supremum boundary adapter with explicit terminal approximation error
  • Finite-skeleton/dense-net boundary adapter with explicit separability and terminal projection errors
  • Pathwise-modulus and finite-skeleton witness lemmas that discharge the continuous-boundary adapter hypotheses
  • Epsilonized finite-choice boundary adapter: every positive error budget can be discharged by a finite skeleton and terminal dyadic scale certificate
  • Finite-cover/pathwise-modulus certificate for the epsilonized total-bounded Dudley boundary layer
  • Dudley boundary epsilon elimination under a uniform global finite-budget hypothesis
  • Separable-terminal Dudley boundary adapter under explicit finite-skeleton and terminal-projection hypotheses
  • Pathwise terminal-modulus constructor for separable-terminal Dudley boundary certificates
  • Finite-cover/pathwise-modulus bridge into the separable-terminal Dudley boundary interface
  • Finite-terminal total-bounded dyadic Dudley wrapper
  • Continuous Dudley entropy-integral theorem over total-bounded classes
  • Measure-theoretic iid algorithmic stability expected bound, with explicit integrability assumptions
  • Bounded-loss measurability adapters for the measure-theoretic stability expected-gap theorem
  • Bounded-loss high-probability stability wrappers for finite measurable hypothesis interfaces
  • PAC-Bayes all-real-λ or continuous-posterior extensions
  • Sharp McDiarmid/product-kernel decomposition (the target inequality is sketched after this list)
  • Continuous Dudley-style entropy integral
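
On the sharp McDiarmid roadmap item: the bounded-differences inequality it targets is, in its classical form,

$$\Pr\big(f(X_1,\dots,X_n) - \mathbb{E}[f] \ge \varepsilon\big) \;\le\; \exp\!\left(-\frac{2\varepsilon^2}{\sum_{k=1}^{n} c_k^2}\right),$$

where $c_k$ bounds the change in $f$ when the $k$-th coordinate is resampled. Schematically, if each coordinate moves the gap by at most $c_k = 2B/n$, this gives an $\exp(-\varepsilon^2 n / (2B^2))$ tail, a factor-of-4 improvement in the exponent over the Azuma-route $8B^2$ constant used today.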

Dependencies

Lean 4 (pinned in `lean-toolchain`; `elan` fetches the pinned toolchain automatically) and Mathlib (pre-built oleans via `lake exe cache get`). See Installation above.

Contributing

Contributions are welcome. Please read CONTRIBUTING.md before opening a PR. The short version: one theorem per PR, no sorry / no admit, only the standard [propext, Classical.choice, Quot.sound] axioms, and all assumptions stated explicitly in the theorem's type signature.

For ideas, see the open-formalization-problems list, good first issues, and the future-work items at the end of the roadmap above. Before a public release, maintainers can use the public release checklist.

Citation

If you use FormalSLT in academic work, please cite:

```bibtex
@software{formal_slt,
  title  = {FormalSLT: Formal Statistical Learning Theory in Lean 4},
  author = {Sneiderman, Robert},
  year   = {2026},
  url    = {https://github.com/Robby955/FormalSLT},
  note   = {Lean 4 formalization of finite-sample SLT bounds.}
}
```

License

This project is released under the MIT License.
