This repo is for pedagogical demonstration. It was prepared for a seminar talk titled "Tools for Scalable and Reproducible Research Pipelines," given 24 September 2025 as part of the StatBytes Statistical Computing Seminar for the Department of Statistics and Data Science at Carnegie Mellon University.
In this repo, we demonstrate some research-level statistical computing concepts to which the literature typically refers as simulator-based inference (SBI) or likelihood-free inference (LFI). The scientific content of the demo is chiefly meant to be sufficiently complex for relevance to practicing computational statisticians, and so the content itself may be of separate interest. We use the NeurIPS 2024 HiggsML Uncertainty Challenge benchmark data set from FAIR Universe, the original ingestion repo of which has been slightly modified to suit the demo.
Some of the code used in this demonstration was generated by the Claude large language model.
The main branch of this demo repo represents some good practices for scalable and reproducible research software. The structure of this repo mimics typical difficulty-scaling with methodological development for any data analysis-heavy paper:
- In Parameterize, we start small by just writing one script -- it may very well start with a minimal Jupyter notebook, say -- that completes a minimal reproducible proof-of-concept result with a synthetic toy model.
- In Parallelize, we take the minimal example and start pushing and prodding it, so to speak, in directions that help us explore its richness to the extent that phenomena we see may inform our intuition for what will happen when transferring our methodology to a more realistic case study.
- In Modularize, we tackle a realistic case study, drawing on code from internal and external "modules" that help us keep scripts as short and readable as they were in the toy examples.
Refer to the script/gaussian/dev subdirectory. This folder is self-contained -- if you activate the appropriate conda environment, say, then the one_np.py script can be run from the command line without drawing on code written in other subdirectories of this repo. Furthermore, it features a typical but pared-down use of the click library's command line interface (CLI). This means that you can change how the script runs by passing options when you invoke it from the command line.
Example:
cd lfi-benchmark/script/gaussian/dev
conda activate <CONDA_ENV>
python one_np.py
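For readers who have not used click before, the following is a minimal sketch of what a pared-down CLI can look like. The option names (--n-samples, --seed) and the function body are hypothetical and are not taken from one_np.py; click's CliRunner is used here only so that the example can be exercised without a shell.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--n-samples", default=1000, show_default=True, help="Number of simulated draws.")
@click.option("--seed", default=0, show_default=True, help="Random seed.")
def main(n_samples, seed):
    """Hypothetical stand-in for a script like one_np.py: report the chosen settings."""
    click.echo(f"n_samples={n_samples} seed={seed}")


# Invoke the command in-process, as if the options had been typed at the command line.
runner = CliRunner()
result = runner.invoke(main, ["--n-samples", "50", "--seed", "7"])
print(result.output)
```

In a real script, the last three lines would be replaced by a call to main() under an if __name__ == "__main__" guard, and the same options would be passed as python one_np.py --n-samples 50 --seed 7 (again, with whatever option names the script actually defines).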
Remark: At this stage, I like to open a GNU screen in my terminal with screen -R run_script, say. With the screen open, I can run my script as needed, and then I can detach from the screen and resume using my terminal window without it being occupied by the standard output of the script.
Refer to the script/gaussian/batch subdirectory. In this folder, we have retained one_np.py as identical to the one in ../dev, but this time, there are a few new files:
- A shell script, one_np.sh, which lists a sequence of commands that the shell runs line by line, as though a user had typed them at the terminal; and
- A text file, array_params.txt, each line of which parameterizes one desired run of one_np.py.
Example:
cd lfi-benchmark/script/gaussian/batch
sbatch one_np.sh
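To illustrate how one line of a parameter file can be matched to one task of a SLURM array job, here is a minimal sketch. It assumes a 0-based array (e.g. --array=0-2) and a hypothetical two-column file layout of seed and sample size; the actual layout of array_params.txt and the way one_np.sh selects a line may differ.

```python
import os


def params_for_task(lines, task_id):
    """Return the whitespace-separated fields of the task_id-th (0-based) non-empty line."""
    rows = [line.split() for line in lines if line.strip()]
    return rows[task_id]


# Hypothetical file contents: one "seed n_samples" pair per line.
example_lines = ["0 1000\n", "1 1000\n", "2 5000\n"]

# SLURM sets SLURM_ARRAY_TASK_ID for each task of an array job.
# Fall back to task 0 when the script is run by hand outside of SLURM.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
seed, n_samples = map(int, params_for_task(example_lines, task_id))
print(f"task {task_id}: seed={seed}, n_samples={n_samples}")
```

With a real file, example_lines would be replaced by the lines read from array_params.txt.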
Remark: It is very important in general to conceal slurm output files like those in the err/ and out/ directories from the git repo by adding relevant lines to the .gitignore file. I've only refrained from doing so here for illustrative purposes.
Finally, we assemble a few topics simultaneously in script/higgs, namely imports from
- The src directory
- The FAIR_Universe_dataset submodule
Remark: A substantial effort was put into writing src/lfibm/simulator/higgs.py so that it was meaningful, efficient, and concise. Conceptually, the role in the analysis pipeline that is served by the HiggsSimulator object defined therein may be understood as substituting many chunks of code that would have been needed in the preamble of a Jupyter notebook meant for conducting the same analysis performed in a script. By my estimation, a substantial effort should be anticipated for developing any such source code. Good data generation is crucial for LFI and statistical computing in general. This part of the process should be respected!
Example:
cd lfi-benchmark/script/higgs
sbatch two_params.sh
To make this submodule immediately installable (and thus easy to invoke in my source and script code), my main contribution was to add a bare-bones setup.py file:
# setup.py
from setuptools import setup, find_packages

setup(
    name="hep_challenge",
    version="0.1",
    packages=find_packages(where="."),
    package_dir={"": "."},
)
This file is enough to use pip for installation now. Note that name sets the distribution name that pip reports (e.g. in pip list); the name I import from comes from the package directory that find_packages discovers, which here is also hep_challenge, e.g. from hep_challenge.datasets import Data.
In brief, we study methods similar to those used in practice for the discovery of the Higgs boson and other such experiments. Although the likelihood is intractable in experiments encoded only by a simulator, we use what is often known as the "likelihood ratio trick" to estimate the likelihood. Let
The likelihood ratio trick cleverly bypasses the limitations of a finite simulated data set by leveraging the identity,
where
- Generate some training pairs $(x, \theta, y)\sim p(x\vert\theta)\pi(\theta)\delta(1)$ and $(x, \theta, y)\sim p(x\vert\theta)\pi(\theta')\delta(0)$, $\theta'\ne \theta$;
- Train a probabilistic classifier $h$ for $y\vert x,\theta$ using a multilayer perceptron; and
- Output $\hat{p}(x\vert\theta) \propto h(x, \theta) / (1-h(x, \theta))$.
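The three steps above can be sketched on a toy problem where the answer is known in closed form. Everything in this sketch is an assumption made for illustration: the simulator is $\theta\sim N(0,1)$, $x\vert\theta\sim N(\theta,1)$ rather than the Higgs simulator, and the multilayer perceptron is swapped for a logistic regression on quadratic features, which is enough here because the exact log-odds, $\tfrac{1}{2}\log 2 - x^2/4 + x\theta - \theta^2/2$, is linear in those features.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Toy simulator (assumption): theta ~ N(0, 1), x | theta ~ N(theta, 1).
theta = rng.normal(size=n)
x = theta + rng.normal(size=n)

# y = 1: dependent pairs (x, theta). y = 0: theta is shuffled, which breaks the dependence.
x_all = np.concatenate([x, x])
theta_all = np.concatenate([theta, rng.permutation(theta)])
y_all = np.concatenate([np.ones(n), np.zeros(n)])


def features(x, theta):
    """Quadratic features; the exact log-odds is linear in these for this toy model."""
    return np.column_stack([np.ones_like(x), x**2, x * theta, theta**2])


def fit_logistic(X, y, n_iter=25):
    """Logistic regression fit by Newton's method (a stand-in for the MLP classifier)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        w -= np.linalg.solve(hess, grad)
    return w


w = fit_logistic(features(x_all, theta_all), y_all)


def log_ratio(x, theta):
    """Estimated log[h / (1 - h)] for equal-length arrays x and theta."""
    return features(x, theta) @ w


print("fitted weights:", np.round(w, 3))
print("exact weights: ", np.round([0.5 * np.log(2.0), -0.25, 1.0, -0.5], 3))
```

The fitted weights should land near the exact ones, which is a quick sanity check that the classifier's odds recover the likelihood up to a factor that does not depend on $\theta$.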
Define the likelihood ratio test statistic
With the likelihood ratio estimator, we construct 95% confidence intervals by invoking Wilks' theorem on the sampling distribution of the statistic,
As a diagnostic check, we also plot the sampling distribution of
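As a rough illustration of how a Wilks-based 95% interval can be read off a grid of log-likelihood values, consider the sketch below. It assumes a single parameter of interest (one degree of freedom) and uses a tractable Gaussian likelihood in place of the classifier-based estimate; the function name and the data are hypothetical.

```python
from statistics import NormalDist


def wilks_interval(theta_grid, log_lik, level=0.95):
    """Range of grid points where -2 * (log_lik - max log_lik) is below the chi-square(1) quantile."""
    # With one degree of freedom, the chi-square quantile is the square of a normal quantile.
    cutoff = NormalDist().inv_cdf(0.5 + level / 2) ** 2
    best = max(log_lik)
    accepted = [t for t, ll in zip(theta_grid, log_lik) if -2.0 * (ll - best) <= cutoff]
    return min(accepted), max(accepted)


# Toy check: x_i ~ N(theta, 1), so the interval should be close to mean(x) +/- 1.96 / sqrt(n).
data = [0.3, -1.1, 0.8, 1.9, 0.4, -0.2, 1.3, 0.7]
grid = [i / 1000 for i in range(-3000, 3001)]
log_lik = [-0.5 * sum((x - t) ** 2 for x in data) for t in grid]
lo, hi = wilks_interval(grid, log_lik)
print(f"95% interval: [{lo:.3f}, {hi:.3f}]")
```

In the simulator-based setting, log_lik would instead hold the estimated log-likelihood (up to an additive constant) evaluated over the grid, and the quality of the interval depends on how well the asymptotic chi-square approximation holds.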
This repo has an example of a git submodule. Upon cloning, your local version of this repo will have a reference to a particular commit of the remote version of the submodule repo. In order to access the content of the submodule in your subsequent scripts, be sure to run
git clone git@github.com:JamesCarzon/lfi-benchmark.git
cd lfi-benchmark
git submodule init
git submodule update
to collect the content.
When you are simultaneously developing two bodies of code, it helps to compartmentalize them. For this demo, I've opted to put certain methods into the present project's src directory and others in script based on which files I want to run to actually perform analyses -- those go in script -- and which only serve to support analysis -- in src. Then, src is set up to be pip-installable using
- A pyproject.toml file in root; and
- Appropriately formatted __init__.py files throughout src.
I'm choosing to work with a conda environment that will manage all of my needed dependencies, even though they are all pip-installable. This is because conda is very general and useful, and so I don't need to re-teach myself too much about virtual environment management each time I start a new project. Set-up and installation proceed as follows:
conda create --name lfi_benchmark python=3.9
conda activate lfi_benchmark
# navigate to root directory of present project
pip install -e . # to install present project in editable mode in virtual env
cd submodule/FAIR_Universe_dataset
pip install -e . # to install submodule in editable mode in virtual env
Remark: Why pip-install when we've already updated the submodules with git? Updating the git submodules populates the relevant directories in your machine's file system, but your computing environment (e.g. the conda environment for your script's context, or the Jupyter kernel for your notebook) requires the pip-install so that the names of objects defined in that module can be resolved to those directories. In other words, this step allows you to import from the src directory and the submodule at the top of a script, e.g.
from hep_challenge.datasets import Data
from lfibm.simulator.higgs import HiggsSimulator
One more touch is needed for the Higgs example: data loading. The hep_challenge repo contains methods for downloading a very sizable reference data set with millions of rows, too big to be appropriate to store in a GitHub repo itself. Instead, we store this file somewhere on our local machine (perhaps by having downloaded it using hep_challenge.datasets.Data) and then refer to that location privately with the configparser library. I've written a file called config.ini for myself to do just that, assigning paths to directories that I'd like to consistently use in my script between jobs but whose content is too large to save on GitHub.
Remark: I've concealed this .ini file from the git repo, too, as I don't want to publish my local machine's file directory structure -- this information is possibly security-sensitive!
My config.ini file is written as follows:
[DEFAULT]
DataDir = <INSERT PATH>/public_data
OutputDir = <INSERT PATH>/results
Note that configparser will allow you to parse additional categories besides "DEFAULT" if you name further sections in this file.
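A minimal sketch of reading such a file with configparser follows. The paths and the extra [higgs] section are hypothetical, and the file is written to a temporary directory only so that the example is self-contained.

```python
import configparser
import tempfile
from pathlib import Path

# Hypothetical paths standing in for the <INSERT PATH> placeholders.
CONFIG_TEXT = """\
[DEFAULT]
DataDir = /tmp/example/public_data
OutputDir = /tmp/example/results

[higgs]
OutputDir = /tmp/example/results/higgs
"""


def load_config(path):
    """Parse an .ini file and return the populated ConfigParser."""
    parser = configparser.ConfigParser()
    with open(path) as f:
        parser.read_file(f)
    return parser


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.ini"
    config_path.write_text(CONFIG_TEXT)
    config = load_config(config_path)

data_dir = config["DEFAULT"]["DataDir"]
higgs_output = config["higgs"]["OutputDir"]  # overridden in the [higgs] section
higgs_data = config["higgs"]["DataDir"]      # not set in [higgs], so it falls back to DEFAULT
print(data_dir, higgs_output, higgs_data)
```

Values in the DEFAULT section act as fallbacks for every other section, which is convenient when most scripts share the same data directory but write results to different places.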