JamesCarzon/lfi-benchmark

lfi-benchmark

This repo is for pedagogical demonstration. It was prepared for a seminar talk titled "Tools for Scalable and Reproducible Research Pipelines," given 24 September 2025 as part of the StatBytes Statistical Computing Seminar for the Department of Statistics and Data Science at Carnegie Mellon University.

In this repo, we demonstrate some research-level statistical computing concepts that the literature typically calls simulation-based inference (SBI) or likelihood-free inference (LFI). The scientific content of the demo is meant to be complex enough to be relevant to practicing computational statisticians, so the content itself may be of separate interest. We use the NeurIPS 2024 HiggsML Uncertainty Challenge benchmark data set from FAIR Universe, whose original ingestion repo has been slightly modified to suit the demo.

Some of the code used in this demonstration was generated by the Claude large language model.

Structure of this repo

The main branch of this demo repo represents some good practices for scalable and reproducible research software. The structure of this repo mimics typical difficulty-scaling with methodological development for any data analysis-heavy paper:

  1. In Parameterize, we start small by just writing one script -- it may very well start with a minimal Jupyter notebook, say -- that completes a minimal reproducible proof-of-concept result with a synthetic toy model.
  2. In Parallelize, we take the minimal example and push it in directions that help us explore its richness, so that the phenomena we see can inform our intuition for what will happen when we transfer the methodology to a more realistic case study.
  3. In Modularize, we tackle a realistic case study, drawing on code from internal and external "modules" that help keep our scripts as short and readable as they were in the toy examples.

1. Parameterize

Refer to the script/gaussian/dev subdirectory. This folder is self-contained -- if you activate the appropriate conda environment, say, then the one_np.py script can be run from the command line without drawing on code written in other subdirectories of this repo. Furthermore, it features a typical but pared-down use of the click library's command line interface (CLI). This means that you can change how the script runs by passing options when you invoke it from the command line.

Example:

cd lfi-benchmark/script/gaussian/dev
conda activate <CONDA_ENV>
python one_np.py
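To illustrate what a pared-down click interface can look like, here is a minimal sketch. The option names --n-samples and --seed are hypothetical and are not necessarily those defined in one_np.py; only the click decorators themselves are standard.

```python
import click


@click.command()
@click.option("--n-samples", default=1000, show_default=True, type=int,
              help="Number of simulated draws (hypothetical option).")
@click.option("--seed", default=0, show_default=True, type=int,
              help="Random seed (hypothetical option).")
def main(n_samples, seed):
    """Run a toy experiment with settings supplied on the command line."""
    # A real script would run the analysis here; this sketch only echoes its settings.
    click.echo(f"n_samples={n_samples} seed={seed}")


if __name__ == "__main__":
    main()
```

Saved as a script, this could be invoked with, for example, python my_script.py --n-samples 5000 --seed 1, and any option left out falls back to its default.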

Remark: At this stage, I like to open a GNU screen in my terminal with screen -R run_script, say. With the screen open, I can run my script as needed, and then I can close the screen and resume using my terminal window without it being occupied by the standard output of the script.

2. Parallelize

Refer to the script/gaussian/batch subdirectory. In this folder, we have retained one_np.py as identical to the one in ../dev, but this time, there are a few new files:

  • A shell script, one_np.sh, which lists a sequence of commands that the shell runs line by line, as though a user had typed that exact sequence into the terminal; and
  • A text file, array_params.txt, each line of which parameterizes a desired iterate of one_np.py.

Example:

cd lfi-benchmark/script/gaussian/batch
sbatch one_np.sh
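The pairing of a parameter file with a batch job can be sketched from the Python side as follows. This assumes the job is submitted as a Slurm array, in which case Slurm sets the SLURM_ARRAY_TASK_ID environment variable for each task; the actual one_np.sh may select its line differently (for instance, in the shell itself), and the helper names and flags below are hypothetical.

```python
import os
import shlex


def current_task_id(default=1):
    """Return the Slurm array task index, or `default` when run outside Slurm."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


def params_for_task(path, task_id):
    """Return the arguments on the `task_id`-th non-blank line (1-indexed) of `path`."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    # shlex.split respects shell-style quoting, e.g. --label "run a".
    return shlex.split(lines[task_id - 1])
```

With a hypothetical array_params.txt whose second line reads --seed 2 --n-samples 200, the call params_for_task("array_params.txt", 2) returns those four tokens, ready to be passed on to the script.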

Remark: In general, it is very important to keep Slurm output files like those in the err/ and out/ directories out of the git repo by adding the relevant lines to the .gitignore file. I've only refrained from doing so here for illustrative purposes.

3. Modularize

Finally, we assemble a few topics simultaneously in script/higgs, namely imports from

  • The src directory
  • The FAIR_Universe_dataset submodule

Remark: A substantial effort was put into writing src/lfibm/simulator/higgs.py so that it was meaningful, efficient, and concise. Conceptually, the HiggsSimulator object defined therein replaces the many chunks of code that would otherwise be needed in the preamble of a Jupyter notebook conducting the same analysis. By my estimation, a substantial effort should be anticipated for developing any such source code. Good data generation is crucial for LFI and statistical computing in general. This part of the process should be respected!

Example:

cd lfi-benchmark/script/higgs
sbatch two_params.sh

Modifying FAIR_Universe_dataset to be installable

To make this submodule immediately installable (and thus easy to invoke in my source and script code), my main contribution was to add a bare-bones setup.py file:

# setup.py
from setuptools import setup, find_packages

setup(
    name="hep_challenge",
    version="0.1",
    packages=find_packages(where="."),
    package_dir={"": "."},
)

This file is enough to use pip for installation now. Note that the name passed to setup() is the distribution name that pip records; the name I import from is that of the package directory that find_packages() discovers, which matches here, e.g. from hep_challenge.datasets import Data.

Research task

In brief, we study methods similar to those used in practice for the discovery of the Higgs boson and other such experiments. Although the likelihood is intractable in experiments encoded only by a simulator, we use what is often known as the "likelihood ratio trick" to estimate the likelihood up to a factor that does not depend on $\theta$. Let $p(x\vert\theta)$ denote the true likelihood for observed data $x$ and given parameter $\theta$. In our Gaussian toy model, this likelihood is a known bivariate Gaussian. In our Higgs example, the likelihood is represented by a large (theoretically assumed-to-be-representative) data set that is provided by the FAIR Universe's API.

The likelihood ratio trick cleverly bypasses the limitations of a finite simulated data set by leveraging the identity,

$$\frac{p(x\vert\theta)}{p(x)} \approx \frac{p(y=1\vert x,\theta)}{1-p(y=1\vert x,\theta)},$$

where $y$ is derived in data generation to indicate whether pairs $(x,\theta)$ are matched in the sense that $x\sim p(x\vert\theta)$ or mismatched. For a uniform proposal $\pi$ on $\theta$, we

  • Generate matched training triples $(x, \theta, y=1)$ with $\theta\sim\pi$ and $x\sim p(x\vert\theta)$, and mismatched triples $(x, \theta', y=0)$ with $x\sim p(x\vert\theta)$ paired with a separately drawn $\theta'\sim\pi$, $\theta'\ne \theta$;
  • Train a probabilistic classifier $h$ for $y\vert x,\theta$ using a multilayer perceptron; and
  • Output $\hat{p}(x\vert\theta) \propto h(x, \theta) / (1-h(x, \theta))$.
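The data-generation step and the final odds transform above can be sketched as follows. This is a minimal illustration under stated assumptions: the simulator is a one-dimensional Gaussian $x\sim N(\theta, 1)$ rather than the repo's bivariate toy model, all function names are hypothetical, and the classifier training step is omitted.

```python
import random


def simulate(theta, rng):
    """Toy simulator (assumption): one draw x ~ N(theta, 1)."""
    return rng.gauss(theta, 1.0)


def make_training_set(n, low, high, rng):
    """Return (x, theta, y) triples: n matched (y=1) and n mismatched (y=0)."""
    triples = []
    for _ in range(n):
        theta = rng.uniform(low, high)        # theta ~ uniform proposal pi
        x = simulate(theta, rng)              # x ~ p(x | theta)
        triples.append((x, theta, 1))         # matched pair
        theta_prime = rng.uniform(low, high)  # independent draw theta' ~ pi
        triples.append((x, theta_prime, 0))   # same x paired with theta'
    return triples


def ratio_from_classifier(h):
    """Turn a classifier output h = p(y=1 | x, theta) into the odds h / (1 - h)."""
    return h / (1.0 - h)
```

Because the two classes are generated in equal numbers, a well-calibrated classifier's odds approximate $p(x\vert\theta)/p(x)$ without any further reweighting.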

Inference

Define the likelihood ratio test statistic

$$t(x, \theta) = -2\log \left( \frac{\hat{p}(x\vert\theta)}{\hat{p}(x\vert\hat{\theta}_{\text{MLE}})} \right).$$

With the likelihood ratio estimator, we construct 95% confidence intervals by invoking Wilks' theorem on the sampling distribution of the statistic,

$$C_{0.95}(x) = \{ \theta\in\Theta : t(x, \theta) < \chi^2_{0.95}(df=\text{dim}(\Theta)) \}.$$

As a diagnostic check, we also plot the sampling distribution of $t$ to see if it is indeed approximately $\chi^2(df=\text{dim}(\Theta))$-distributed.
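As a rough sketch of how such a confidence set could be computed on a parameter grid, consider the following. It assumes $\text{dim}(\Theta)=2$, for which the $\chi^2$ quantile has a closed form, and it uses an exact Gaussian log-likelihood as a stand-in for the classifier-based estimate; the function names and the observation are hypothetical.

```python
import math


def chi2_quantile_df2(level):
    """Quantile of the chi-squared law with 2 degrees of freedom (CDF 1 - exp(-q/2))."""
    return -2.0 * math.log(1.0 - level)


def confidence_set(log_lik, grid, level=0.95):
    """Grid points theta with t = -2 * (log_lik(theta) - max log_lik) below the cutoff."""
    values = [log_lik(theta) for theta in grid]
    best = max(values)  # grid maximum stands in for the MLE
    cutoff = chi2_quantile_df2(level)
    return [theta for theta, v in zip(grid, values) if -2.0 * (v - best) < cutoff]


# Stand-in example: one observation from a bivariate Gaussian with identity covariance.
x_obs = (0.3, -0.2)


def gaussian_log_lik(theta):
    """Log-likelihood up to an additive constant, which cancels in the statistic."""
    return -0.5 * ((x_obs[0] - theta[0]) ** 2 + (x_obs[1] - theta[1]) ** 2)


grid = [(i / 10, j / 10) for i in range(-30, 31) for j in range(-30, 31)]
region = confidence_set(gaussian_log_lik, grid)
```

In this stand-in case the region is the set of grid points within a disc centered at x_obs; with an estimated likelihood the shape is whatever the classifier implies.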

Getting started

Cloning with the FAIR_Universe_dataset submodule

This repo has an example of a git submodule. Upon cloning, your local version of this repo will have a reference to a particular commit of the remote version of the submodule repo. In order to access the content of the submodule in your subsequent scripts, be sure to run

git clone git@github.com:JamesCarzon/lfi-benchmark.git
cd lfi-benchmark
git submodule init
git submodule update

to collect the content.

Virtual environment with development-mode installs

When you are simultaneously developing two bodies of code, it helps to compartmentalize them. For this demo, I've split the code by role: files that I run to actually perform analyses go in script, and code that only serves to support those analyses goes in the present project's src directory. Then, src is set up to be pip-installable using

  • A pyproject.toml file in root; and
  • Appropriately formatted __init__.py files throughout src.

I'm choosing to work with a conda environment that will manage all of my needed dependencies, even though they are all pip-installable. This is because conda is very general and useful, and so I don't need to re-teach myself too much about virtual environment management each time I start a new project. Set-up and installation proceed as follows:

conda create --name lfi_benchmark python=3.9
conda activate lfi_benchmark
# navigate to root directory of present project
pip install -e . # to install present project in editable mode in virtual env
cd submodule/FAIR_Universe_dataset
pip install -e . # to install submodule in editable mode in virtual env

Remark: Why pip-install when we've already updated the submodules with git? Updating the git submodules populates the relevant directories in your machine's file system, but your computing environment (e.g. the conda environment for your script's context, or the Jupyter kernel for your notebook) needs the pip-install so that the names of the objects defined in each package can be found and resolved to those directories. In other words, this step allows you to import from the submodule and from the src directory at the top of a script, e.g.

from hep_challenge.datasets import Data
from lfibm.simulator.higgs import HiggsSimulator
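A quick way to confirm that both editable installs are visible to the active environment is to ask Python's import machinery directly. This is a small sketch using only the standard library; the helper name is hypothetical.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that the current environment cannot import."""
    return [name for name in names if find_spec(name) is None]
```

Calling missing_packages(["lfibm", "hep_challenge"]) inside the activated environment should return an empty list once both pip install -e steps have succeeded.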

Configuring asset directories

One more touch is needed for the Higgs example: data loading. The hep_challenge repo contains methods for downloading a very sizable reference data set with millions of rows, too big to be appropriate to store in a GitHub repo itself. Instead, we store this file somewhere on our local machine (perhaps having downloaded it using hep_challenge.datasets.Data) and then refer to that location privately with the configparser library. I've written a file called config.ini for myself to do just that, assigning paths to directories that I'd like to use consistently in my scripts between jobs but whose content is too large to save on GitHub.

Remark: I've also kept this .ini file out of version control, as I don't want to publish my local machine's file directory structure -- this information is possibly security-sensitive!

My config.ini file is written as follows:

[DEFAULT]
DataDir = <INSERT PATH>/public_data
OutputDir = <INSERT PATH>/results

Note that configparser will allow you to parse additional sections besides "DEFAULT" if you name further sections in this file.
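Reading these paths back in a script might look like the sketch below. The function name load_dirs is hypothetical and the repo's own scripts may read the file differently; the configparser calls themselves are standard library.

```python
import configparser
from pathlib import Path


def load_dirs(config_path="config.ini"):
    """Return (data_dir, output_dir) as Path objects read from the DEFAULT section."""
    parser = configparser.ConfigParser()
    # parser.read returns the list of files it managed to parse.
    if not parser.read(config_path):
        raise FileNotFoundError(f"could not read {config_path}")
    section = parser["DEFAULT"]
    # Option names are matched case-insensitively by configparser.
    return Path(section["DataDir"]), Path(section["OutputDir"])
```

A script can then build file paths from the returned directories instead of hard-coding machine-specific locations.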
