A command-line toolkit for running privacy–utility experiments on pangenome graphs.
This pipeline supports experiment initialization, optimization, stacking, VCF merging, and a wide set of downstream analyses.
The toolkit is designed to run in a cluster environment and currently supports execution via Slurm using sbatch.
This repository accompanies the PanMixer paper. The main branch reflects the publication-ready state of the toolkit.
The toolkit organizes workflows around experiments identified by an --exp number.
If not specified, the latest experiment is automatically selected.
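The fallback to the latest experiment can be pictured with a minimal sketch. It assumes that each experiment lives in a subdirectory named after its number and that "latest" means the highest number; the function names latest_experiment and resolve_experiment are hypothetical and not part of the toolkit.

```python
import re
from pathlib import Path
from typing import Optional


def latest_experiment(experiments_dir: Path) -> Optional[int]:
    """Return the highest experiment number found, or None if there is none.

    Assumption: each experiment is a subdirectory whose name is its number.
    """
    numbers = [
        int(p.name)
        for p in experiments_dir.iterdir()
        if p.is_dir() and re.fullmatch(r"\d+", p.name)
    ]
    return max(numbers) if numbers else None


def resolve_experiment(experiments_dir: Path, exp: Optional[int] = None) -> int:
    """Use an explicit --exp value when given, otherwise fall back to the latest."""
    if exp is not None:
        return exp
    latest = latest_experiment(experiments_dir)
    if latest is None:
        raise FileNotFoundError("No experiment found; run experiment_starter first.")
    return latest
```

Comparing the directory names as integers rather than strings keeps experiment 10 ahead of experiment 9.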
Typical flow:
- experiment_starter
- optimize
- stacker
- convert_2_vcf
- combine_vcfs
- Downstream analyses (gap_score, af_loss, ld_loss, beagle, etc.)
- gather_results
# Create and activate an environment
conda create -y -n panmixer python=3.11
conda activate panmixer
# or use venv:
# python -m venv .venv && source .venv/bin/activate
# Install Python dependencies
pip install -r requirements.txt
# Ensure external tools (bcftools, tabix, plink) are available in PATH
Defined in config.yaml, these constants need to be set before running any scripts:
- python_env: path to python environment (ex: ./envs/panmixer-env)
- base_path: base path to root of this directory (ex: ./PanMixer)
After cloning this repository, we suggest copying the config (cp config.yaml config.local.yaml) and changing the settings in config.local.yaml.
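A quick sanity check on the config can catch unset constants before any job is submitted. The sketch below is an illustration only: the helper names pick_config and missing_keys are hypothetical, whether the toolkit itself prefers config.local.yaml over config.yaml is an assumption, and the scan is a naive line-based check rather than a real YAML parser.

```python
from pathlib import Path

# The two constants the README says must be set.
REQUIRED_KEYS = ("python_env", "base_path")


def pick_config(repo_root: Path) -> Path:
    """Prefer config.local.yaml and fall back to config.yaml (assumed precedence)."""
    for name in ("config.local.yaml", "config.yaml"):
        candidate = repo_root / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"No config.local.yaml or config.yaml in {repo_root}")


def missing_keys(config_path: Path) -> list[str]:
    """Report required top-level keys that are absent or have an empty value.

    Naive scan of 'key: value' lines; it does not understand nested YAML.
    """
    found = set()
    for line in config_path.read_text().splitlines():
        key, sep, value = line.partition(":")
        is_top_level = bool(line) and not line[0].isspace()
        if sep and is_top_level and value.split("#")[0].strip():
            found.add(key.strip())
    return [k for k in REQUIRED_KEYS if k not in found]
```

Calling missing_keys(pick_config(Path("."))) from the repository root returns an empty list when both constants have values.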
Note: Large data assets (pangenomes, read fastqs, 1000 Genomes datasets) are not bundled in this repository. You must download them yourself. We provide scripts to help you :)
cd into the starting-data directory ./starting_data/. Run each command in get_data.sh one at a time, waiting for the previous command to finish.
WARNING: External download links may break over time. If a script fails, check the source URL inside it before retrying.
Commands:
- ./scripts/get_pangenomes.sh: Downloads the PGGB draft human pangenome
- ./scripts/get_pangenie_alignments.sh: Downloads the Pangenie callset
- ./scripts/get_1000g_phased.sh: Downloads the 1000 Genomes Project phased panel
- bcftools index pangenome.vcf.gz: Indexes the downloaded pangenome
- ./scripts/remove_X.sh: Removes any X chromosome data
- ./scripts/remove_chm13.sh: Removes the chm13 backbone
- sbatch scripts/split_data.sbatch: Splits the data by chromosome
- sbatch scripts/get_num_alleles.sbatch: Identifies the unique alleles and variants
- sbatch scripts/get_blocks.sbatch: Computes the LD blocks
- sbatch scripts/convert_2_npy.sbatch: Converts the VCF files into NumPy files for easier IO
- sbatch scripts/get_mappings.sbatch: Computes variant mappings
- sbatch scripts/segment_blocks.sbatch: Refines segmented blocks produced by plink
- sbatch scripts/get_pmi_utility.sbatch: Computes the PMI and utility loss for each obfuscation move
These commands take about 1–2 days to run end-to-end. Please be patient.
To make the sampling step reproducible during preprocessing, pass a base seed to the SLURM job:
sbatch --export=SEED=123 scripts/get_pmi_utility.sbatch
Each array task derives its own seed from this base seed, so different chromosome/subject tasks do not reuse the same random stream.
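One way to derive a distinct seed per array task is to hash the base seed together with the task index. This is a sketch of the idea, not the scheme the scripts actually use, which is not documented here. SLURM_ARRAY_TASK_ID is the standard Slurm variable for the array index; the function names are hypothetical.

```python
import hashlib
import os
import random
from typing import Mapping


def derive_task_seed(base_seed: int, task_id: int) -> int:
    """Mix the base seed and the task index into a 32-bit seed."""
    digest = hashlib.sha256(f"{base_seed}:{task_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def task_seed_from_env(base_seed: int, env: Mapping[str, str] = os.environ) -> int:
    """Read the array index from the environment (0 when run outside an array job)."""
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    return derive_task_seed(base_seed, task_id)


# Each task builds its own generator from its derived seed.
rng = random.Random(task_seed_from_env(123))
```

Hashing avoids the correlated streams that simple schemes such as base_seed + task_id can produce when several base seeds are used side by side.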
# 1) Start a new experiment
python3 main.py experiment_starter
# 2) Run optimizer
python3 main.py --exp 0 optimize --fixed_param utility
# Optional: make random optimizer baselines reproducible
python3 main.py --exp 0 --seed 123 optimize --fixed_param random
# 3) Stack edits using a strategy
python3 main.py --exp 0 stacker --strategy to_best
# 4) Merge per-chromosome VCFs into one
python3 main.py --exp 0 convert_2_vcf
python3 main.py --exp 0 combine_vcfs
# 5) Run downstream analyses
python3 main.py --exp 0 gap_score_all
python3 main.py --exp 0 af_loss
python3 main.py --exp 0 ld_loss
python3 main.py --exp 0 vg_prep
python3 main.py --exp 0 quick_align
- --exp INT: experiment number (defaults to latest if not set).
- --overwrite: allow overwriting existing results.
- --seed INT: base random seed for commands that sample. Currently used by the random optimizer baseline; per-SLURM-task seeds are derived from this base seed.
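The global options are written before the subcommand name in every example above. The argparse sketch below shows one layout that produces this ordering. It is an assumption about how main.py could be structured, not its actual code, and it registers only two subcommands for brevity.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Global options on the top-level parser, per-command options on subparsers."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--exp", type=int, default=None,
                        help="experiment number (None means: use the latest)")
    parser.add_argument("--overwrite", action="store_true",
                        help="allow overwriting existing results")
    parser.add_argument("--seed", type=int, default=None,
                        help="base random seed for commands that sample")

    commands = parser.add_subparsers(dest="command", required=True)

    optimize = commands.add_parser("optimize")
    # Values taken from the examples in this README.
    optimize.add_argument("--fixed_param", choices=["privacy", "utility", "random"])

    stacker = commands.add_parser("stacker")
    stacker.add_argument("--strategy", choices=["to_best", "to_empty", "to_unedited"])
    return parser


args = build_parser().parse_args(
    ["--exp", "0", "--seed", "123", "optimize", "--fixed_param", "random"]
)
```

With this layout, options placed after the subcommand name are matched against that subcommand only, which is why --exp and --seed come first.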
The HMM and allele-frequency resampling paths support explicit seeds. Without a seed, NumPy uses its default process-level randomness.
For the standalone obfuscation tool:
python3 tools/obfuscate.py starting_data HG00438 0.1 21 obfuscation_output --seed 123
For the obfuscation SLURM wrapper:
sbatch --export=DATA_DIR=/path/to/starting_data,SUBJECT=HG00438,CAPACITY=0.1,OUTPUT_DIR=/path/to/output,SEED=123 tools/obfuscate.sbatch
For starting-data PMI/utility precomputation:
sbatch --export=SEED=123 starting_data/scripts/get_pmi_utility.sbatch
Initialize an experiment.
python3 main.py experiment_starter \
--capacity_file path/to/capacities.txt \
--subjects_file path/to/subjects.txt \
[--baseline_unedited] [--baseline_empty] [--baseline_unique]
Parameters:
- --capacity_file: a newline-delimited file listing all the target capacities (target privacy risk and utility loss)
- --subjects_file: a list of target individuals to obfuscate
- --baseline_unedited: runs an experiment with the original pangenome graphs (no edits) as a baseline
- --baseline_empty: runs an experiment with the subject removed
- --baseline_unique: runs a baseline experiment where unique-to-subject variants are removed
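The two input files can be validated before starting an experiment. The sketch below assumes one numeric capacity per line (such as the 0.1 used in the obfuscation example) and one subject ID per line (such as HG00438). The exact file formats are not spelled out here, so treat both readers, and their names, as hypothetical.

```python
from pathlib import Path


def read_capacities(path: Path) -> list[float]:
    """Parse a newline-delimited capacity file, skipping blank lines.

    Assumption: each non-blank line holds a single number.
    """
    return [float(line) for line in path.read_text().splitlines() if line.strip()]


def read_subjects(path: Path) -> list[str]:
    """Parse a subjects file, assuming one sample ID per line."""
    subjects = [line.strip() for line in path.read_text().splitlines() if line.strip()]
    if len(set(subjects)) != len(subjects):
        raise ValueError(f"Duplicate subject IDs in {path}")
    return subjects
```

A malformed capacity line raises ValueError from float(), which surfaces the problem before any Slurm job is queued.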
Run linear optimizer.
python3 main.py optimize --fixed_param utility
Parameters:
- --fixed_param: either privacy or utility; optimizes for a fixed capacity limit on that parameter, as set by the capacity file. The quick start also passes random to run the random optimizer baseline.
- --baseline_unique: optimizes for the baseline with unique variants removed (use with experiments started with --baseline_unique)
Apply a stacking strategy, which applies the obfuscation moves chosen in the optimizer step.
python3 main.py stacker --strategy to_best
Parameters:
- --strategy: one of the following
- to_best: the strategy described in the paper
- to_empty: removes the individual; used with experiments set up using the --baseline_empty flag
- to_unedited: ignores all obfuscation moves; used with --baseline_unedited
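Because each stacker strategy pairs with a particular experiment_starter flag, the pairing can be written down as a lookup. This sketch only encodes the pairings stated above. The helper name stacker_args is hypothetical, and the strategy to use after --baseline_unique is not stated here, so it is deliberately left out.

```python
from typing import Optional

# experiment_starter baseline flag -> stacker strategy, as described above.
STRATEGY_FOR_BASELINE = {
    None: "to_best",                      # regular (non-baseline) experiment
    "--baseline_empty": "to_empty",
    "--baseline_unedited": "to_unedited",
}


def stacker_args(exp: int, baseline_flag: Optional[str] = None) -> list[str]:
    """Build the stacker command line for an experiment started with the given flag."""
    if baseline_flag not in STRATEGY_FOR_BASELINE:
        raise ValueError(f"No documented stacker strategy for {baseline_flag!r}")
    strategy = STRATEGY_FOR_BASELINE[baseline_flag]
    return ["python3", "main.py", "--exp", str(exp), "stacker", "--strategy", strategy]
```

The returned list can be passed to subprocess.run as-is or joined with spaces for logging.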
Convert the output of stacker to a variant-centric representation of the graph.
python3 main.py convert_2_vcf
Merge per-chromosome VCFs.
python3 main.py combine_vcfs
Compute haploid/diploid gap scores.
python3 main.py gap_score
python3 main.py gap_score_all
- Run gap_score for one attack
- Run gap_score_all for all attacks
Compute allele frequency loss and linkage disequilibrium loss.
python3 main.py af_loss
python3 main.py ld_loss
Parameters (both commands):
- --dont_replace: compute metrics without replacing existing results
Compute the Beagle reconstruction of genotypes.
python3 main.py beagle
Get accuracy statistics of Beagle refinements.
python3 main.py accuracy_stats
Prepare vg giraffe indices and run read mapping for all reads in /read_fastqs/.
python3 main.py vg_prep
python3 main.py quick_align
Verify genotype consistency across all samples for an experiment (runs per-chromosome in parallel via Slurm).
python3 main.py --exp 0 verify
Convert per-subject VCF files to NumPy arrays in parallel across chromosomes via Slurm. Useful for preprocessing new VCF datasets into the internal NumPy format used by the pipeline.
python3 main.py --exp 0 VCFtoNP_parallel
Create multi-target VCFs by merging obfuscation results from a source experiment into the format expected by a target experiment. Useful for composing experiments that cover multiple subjects.
python3 main.py --exp 0 create_multitarget_vcfs --target_exp 1
Parameters:
- --target_exp: the experiment number to use as the target layout
Analyze gap score privacy thresholds across a range of pangenome sizes. Useful for understanding how privacy guarantees scale with pangenome size.
python3 main.py --exp 0 gap_threshold --pangenome_sizes 50,100,250,500,1000,2000
Parameters:
- --pangenome_sizes: comma-separated list of pangenome sizes to test (default: 50,100,250,500,1000,1500,2000,2500,3000)
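A comma-separated size list like the one above can be parsed and validated in a few lines. This is an illustrative sketch: the function name parse_sizes is hypothetical, and the sorting and de-duplication are choices made here, not documented behavior of gap_threshold.

```python
# Default taken from the parameter description above.
DEFAULT_SIZES = "50,100,250,500,1000,1500,2000,2500,3000"


def parse_sizes(text: str = DEFAULT_SIZES) -> list[int]:
    """Turn '50,100,250' into a sorted list of unique positive integers."""
    sizes = [int(token) for token in text.split(",") if token.strip()]
    if not sizes:
        raise ValueError("At least one pangenome size is required")
    if any(size <= 0 for size in sizes):
        raise ValueError("Pangenome sizes must be positive")
    return sorted(set(sizes))
```

For example, parse_sizes("50,100,250,500,1000,2000") yields the six sizes from the command shown above.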
Aggregate outputs across experiments.
python3 main.py [--overwrite] gather_results \
[--optimizer] [--reindex] [--gap_score] [--stacker] \
[--af_loss] [--ld_loss] [--pangenie_stats] \
[--accuracy_stats] [--giraffe]
Run without flags to gather all results, or pass one or more flags to gather only those result types.
Parameters:
- --overwrite: overwrites the results if the flag is present (reset)
- --optimizer: gather optimizer results
- --reindex: re-index the experiment
- --gap_score: gather gap score results
- --stacker: gather stacker results
- --af_loss: gather allele frequency loss results
- --ld_loss: gather LD loss results
- --pangenie_stats: gather Pangenie stats
- --accuracy_stats: gather Beagle accuracy stats
- --giraffe: gather Giraffe alignment results
[ experiment_starter ]
↓
[ optimize ]
↓
[ stacker ]
↓
[ convert_2_vcf ]
↓
[ combine_vcfs ]
↓
[ downstream analyses ]
├─ gap_score / gap_score_all
├─ gap_threshold
├─ af_loss / ld_loss
├─ beagle / accuracy_stats
├─ vg_prep / quick_align
├─ verify
└─ gather_results
[ Utilities ]
├─ VCFtoNP_parallel (preprocess new VCF data)
└─ create_multitarget_vcfs (compose multi-subject experiments)
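The ordering in the diagram can be encoded as data, which makes it easy to ask what must have run before a given stage. The sketch below follows the diagram only: it treats every downstream analysis as depending on all five core stages, and it ignores finer dependencies between analyses (for instance, accuracy_stats presumably needs beagle output). The prerequisites helper is hypothetical.

```python
# Core stages, in the order shown in the diagram.
PIPELINE = ["experiment_starter", "optimize", "stacker", "convert_2_vcf", "combine_vcfs"]

# Analyses listed under "downstream analyses" in the diagram.
DOWNSTREAM = {
    "gap_score", "gap_score_all", "gap_threshold", "af_loss", "ld_loss",
    "beagle", "accuracy_stats", "vg_prep", "quick_align", "verify", "gather_results",
}


def prerequisites(stage: str) -> list[str]:
    """List the core stages that come before the given stage in the diagram."""
    if stage in PIPELINE:
        return PIPELINE[: PIPELINE.index(stage)]
    if stage in DOWNSTREAM:
        return list(PIPELINE)
    raise ValueError(f"Unknown stage: {stage}")
```

Since the stages submit Slurm jobs, a driver built on this list would still need to wait for each stage's jobs to finish before launching the next.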
- Experiments: stored in directories per experiment number in /experiments/.
- VCFs: per-chromosome and merged VCFs.
- Metrics: AF/LD loss, gap scores.
- Alignments: vg-prepped indexes, alignment results.
- Aggregated results: via gather_results.

- No experiment found: Run experiment_starter first.
- Missing input files: Verify STARTING_DATA_PATH, --subjects_file, and --capacity_file.
- VCF issues: Ensure inputs are bgzipped (.vcf.gz) and indexed (.tbi).
- External tool errors: Ensure bcftools, tabix, plink are installed and on PATH.
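The last two checks can be automated with a small preflight script. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the index check looks only for a .tbi file next to the VCF, as in the troubleshooting item above.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, Optional

REQUIRED_TOOLS = ("bcftools", "tabix", "plink")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def vcf_problems(vcf: Path) -> list[str]:
    """Check the file extension and the presence of a sibling .tbi index."""
    problems = []
    if not vcf.name.endswith(".vcf.gz"):
        problems.append(f"{vcf.name}: expected a bgzipped .vcf.gz file")
    if not vcf.is_file():
        problems.append(f"{vcf.name}: file not found")
    if not vcf.with_name(vcf.name + ".tbi").is_file():
        problems.append(f"{vcf.name}: missing .tbi index")
    return problems
```

The extension check cannot tell bgzip from plain gzip; running tabix on the file is the reliable test for that.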