PanMixer Toolkit

A command-line toolkit for running privacy–utility experiments on pangenome graphs.
This pipeline supports experiment initialization, optimization, stacking, VCF merging, and a wide set of downstream analyses.

This toolkit is designed to run in a cluster environment and currently supports execution via Slurm using sbatch.

This repository accompanies the PanMixer paper. The main branch reflects the publication-ready state of the toolkit.

Overview

The toolkit organizes workflows around experiments identified by an --exp number.
If not specified, the latest experiment is automatically selected.
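The "latest experiment" default can be pictured as follows. This is a minimal sketch, assuming each experiment lives in a subdirectory named after its number (as suggested under Outputs); the function names latest_experiment and resolve_experiment are hypothetical and not taken from the toolkit's source.

```python
from pathlib import Path
from typing import Optional


def latest_experiment(experiments_dir: Path) -> Optional[int]:
    """Return the highest experiment number found, or None if there is none.

    Assumption: each experiment is a subdirectory whose name is an integer.
    """
    if not experiments_dir.is_dir():
        return None
    numbers = [
        int(entry.name)
        for entry in experiments_dir.iterdir()
        if entry.is_dir() and entry.name.isdigit()
    ]
    return max(numbers) if numbers else None


def resolve_experiment(exp: Optional[int], experiments_dir: Path) -> int:
    """Use the explicit --exp value if given, otherwise fall back to the latest."""
    if exp is not None:
        return exp
    latest = latest_experiment(experiments_dir)
    if latest is None:
        raise FileNotFoundError("No experiment found: run experiment_starter first.")
    return latest
```

Comparing the names as integers rather than strings matters once there are ten or more experiments, since "10" sorts before "2" as text.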

Typical flow:

experiment_starter
optimize
stacker
convert_2_vcf
combine_vcfs
Downstream analyses (gap_score, af_loss, ld_loss, beagle, etc.)
gather_results

Installation

# Create and activate an environment
conda create -y -n panmixer python=3.11
conda activate panmixer
# or use venv:
# python -m venv .venv && source .venv/bin/activate

# Install Python dependencies
pip install -r requirements.txt

# Ensure external tools (bcftools, tabix, plink) are available in PATH
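A quick way to confirm that the external tools are reachable before launching any jobs is sketched below. It only uses the Python standard library; the helper name missing_tools is hypothetical, and the tool list simply mirrors the three tools named above.

```python
import shutil

# Tools the README asks to have on PATH.
REQUIRED_TOOLS = ("bcftools", "tabix", "plink")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

Calling missing_tools() returns an empty list when everything is installed; any names it returns are the ones to install or add to PATH. On a cluster, run it in the same environment (modules loaded, conda env activated) that the sbatch jobs will use.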

Data & Paths

The following constants are defined in config.yaml and must be set before running any scripts:

python_env – path to the Python environment (e.g., ./envs/panmixer-env)
base_path – path to the root of this repository (e.g., ./PanMixer)

After cloning this repository, we suggest copying the config (cp config.yaml config.local.yaml) and making your changes in config.local.yaml.

Note: Large data assets (pangenomes, read fastqs, 1000 Genomes datasets) are not bundled in this repository. You must download them yourself. We provide scripts to help you :)

Dataset installation and preprocessing

Change into the starting-data directory (./starting_data/) and run each command in get_data.sh one at a time, waiting for each command to finish before starting the next.

WARNING: External download links may break over time. If a script fails, check the source URL inside it before retrying.

Commands:

./scripts/get_pangenomes.sh: Downloads the PGGB draft human pangenome
./scripts/get_pangenie_alignments.sh: Downloads the PanGenie callset
./scripts/get_1000g_phased.sh: Downloads the 1000 Genomes Project phased panel
bcftools index pangenome.vcf.gz: Indexes the downloaded pangenome
./scripts/remove_X.sh: Removes any X chromosome data
./scripts/remove_chm13.sh: Removes the chm13 backbone
sbatch scripts/split_data.sbatch: Splits the data by chromosome
sbatch scripts/get_num_alleles.sbatch: Identifies the unique alleles and variants
sbatch scripts/get_blocks.sbatch: Computes the LD blocks
sbatch scripts/convert_2_npy.sbatch: Converts the VCF files into NumPy files for easier I/O
sbatch scripts/get_mappings.sbatch: Computes variant mappings
sbatch scripts/segment_blocks.sbatch: Refines segmented blocks produced by plink
sbatch scripts/get_pmi_utility.sbatch: Computes the PMI and utility loss for each obfuscation move

These commands take about 1–2 days to run end-to-end. Please be patient.

To make the sampling step reproducible during preprocessing, pass a base seed to the SLURM job:

sbatch --export=SEED=123 scripts/get_pmi_utility.sbatch

Each array task derives its own seed from this base seed, so different chromosome/subject tasks do not reuse the same random stream.
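One way such a per-task derivation can work is sketched below. This is not the toolkit's actual scheme, which is not described here: it is a hash-based illustration, and the names derive_task_seed and task_rng are hypothetical. The task identifier is assumed to come from something like Slurm's SLURM_ARRAY_TASK_ID environment variable.

```python
import hashlib
import random


def derive_task_seed(base_seed: int, task_id: int) -> int:
    """Map (base_seed, task_id) to a 32-bit seed that is stable across runs."""
    digest = hashlib.sha256(f"{base_seed}:{task_id}".encode("utf-8")).digest()
    # Keep the first four bytes so the result fits in an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big")


def task_rng(base_seed: int, task_id: int) -> random.Random:
    """Build a separate random generator for one array task."""
    return random.Random(derive_task_seed(base_seed, task_id))
```

Hashing the pair, rather than adding the task index to the base seed, avoids the situation where base seed 123 with task 1 and base seed 124 with task 0 end up sharing a stream. For NumPy-based code, np.random.SeedSequence provides a comparable facility through its spawn method.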

Quick Start

# 1) Start a new experiment
python3 main.py experiment_starter
# 2) Run optimizer
python3 main.py --exp 0 optimize --fixed_param utility

# Optional: make random optimizer baselines reproducible
python3 main.py --exp 0 --seed 123 optimize --fixed_param random

# 3) Stack edits using a strategy
python3 main.py --exp 0 stacker --strategy to_best

# 4) Convert the stacker output to VCF, then merge per-chromosome VCFs into one
python3 main.py --exp 0 convert_2_vcf
python3 main.py --exp 0 combine_vcfs

# 5) Run downstream analyses
python3 main.py --exp 0 gap_score_all
python3 main.py --exp 0 af_loss
python3 main.py --exp 0 ld_loss
python3 main.py --exp 0 vg_prep
python3 main.py --exp 0 quick_align

Command Reference

Global Flags

--exp INT – experiment number (defaults to latest if not set).
--overwrite – allow overwriting existing results.
--seed INT – base random seed for commands that sample. Currently used by the random optimizer baseline; per-SLURM-task seeds are derived from this base seed.

Reproducibility

The HMM and allele-frequency resampling paths support explicit seeds. Without a seed, NumPy seeds itself from operating-system entropy, so results will differ from run to run.

For the standalone obfuscation tool:

python3 tools/obfuscate.py starting_data HG00438 0.1 21 obfuscation_output --seed 123

For the obfuscation SLURM wrapper:

sbatch --export=DATA_DIR=/path/to/starting_data,SUBJECT=HG00438,CAPACITY=0.1,OUTPUT_DIR=/path/to/output,SEED=123 tools/obfuscate.sbatch

For starting-data PMI/utility precomputation:

sbatch --export=SEED=123 starting_data/scripts/get_pmi_utility.sbatch

Subcommands

experiment_starter

Initialize an experiment.

python3 main.py experiment_starter \
  --capacity_file path/to/capacities.txt \
  --subjects_file path/to/subjects.txt \
  [--baseline_unedited] [--baseline_empty] [--baseline_unique]

Parameters:

--capacity_file supplies a newline-delimited file listing all target capacities (target privacy risk and utility loss)
--subjects_file supplies a list of target individuals to obfuscate
--baseline_unedited runs an experiment with the original pangenome graphs (no edits) as a baseline
--baseline_empty runs an experiment with the subject removed
--baseline_unique runs a baseline experiment where unique-to-subject variants are removed
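To illustrate the two input files, here is a minimal sketch of how they might be read. It assumes that capacities are plain decimal numbers, one per line (the value 0.1 appears as a capacity elsewhere in this README), and that the subjects file holds one subject ID per line; the exact format the toolkit accepts may differ, and both function names are hypothetical.

```python
from pathlib import Path
from typing import List


def read_capacities(path: Path) -> List[float]:
    """Read one capacity per line, skipping blank lines."""
    values = []
    for line_number, raw in enumerate(path.read_text().splitlines(), start=1):
        text = raw.strip()
        if not text:
            continue
        try:
            values.append(float(text))
        except ValueError as err:
            # Report the offending line so a malformed file is easy to fix.
            raise ValueError(f"{path}:{line_number}: not a number: {text!r}") from err
    return values


def read_subjects(path: Path) -> List[str]:
    """Read one subject ID per line, skipping blank lines."""
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]
```

Validating both files this way before calling experiment_starter catches stray characters or empty files early, before any cluster jobs are queued.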

optimize

Run the linear optimizer.

python3 main.py optimize --fixed_param utility

Parameters:

--fixed_param is either privacy or utility; the optimizer holds that parameter at the capacity limit set in the capacity file. The value random selects the random optimizer baseline (see --seed).
--baseline_unique optimize for the baseline with removed unique variants (use with experiments started with --baseline_unique)

stacker

Apply a stacking strategy, which applies the obfuscation moves chosen in the optimizer step.

python3 main.py stacker --strategy to_best

Parameters:

--strategy selects the stacking strategy:
to_best – the strategy described in the paper
to_empty – removes the individual; use with experiments set up using --baseline_empty
to_unedited – ignores all obfuscation moves; use with experiments set up using --baseline_unedited

convert_2_vcf

Converts the output of stacker to a variant-centric representation of the graph.

python3 main.py convert_2_vcf

combine_vcfs

Merge per-chromosome VCFs.

python3 main.py combine_vcfs

gap_score / gap_score_all

Compute haploid/diploid gap scores.

python3 main.py gap_score
python3 main.py gap_score_all

Run gap_score for one attack.
Run gap_score_all for all attacks.

af_loss / ld_loss

Compute allele frequency loss and linkage disequilibrium loss.

python3 main.py af_loss
python3 main.py ld_loss

Parameters (both commands):

--dont_replace compute metrics without replacing existing results

Beagle

Computes the Beagle reconstruction of genotypes.

python3 main.py beagle

accuracy_stats

Get accuracy statistics of Beagle refinements.

python3 main.py accuracy_stats

vg_prep / quick_align

Prepare vg giraffe indices and run read mapping for all reads in /read_fastqs/.

python3 main.py vg_prep
python3 main.py quick_align

verify

Verify genotype consistency across all samples for an experiment (runs per-chromosome in parallel via Slurm).

python3 main.py --exp 0 verify

VCFtoNP_parallel

Convert per-subject VCF files to NumPy arrays in parallel across chromosomes via Slurm. Useful for preprocessing new VCF datasets into the internal numpy format used by the pipeline.

python3 main.py --exp 0 VCFtoNP_parallel

create_multitarget_vcfs

Create multi-target VCFs by merging obfuscation results from a source experiment into the format expected by a target experiment. Useful for composing experiments that cover multiple subjects.

python3 main.py --exp 0 create_multitarget_vcfs --target_exp 1

Parameters:

--target_exp the experiment number to use as the target layout

gap_threshold

Analyze gap score privacy thresholds across a range of pangenome sizes. Useful for understanding how privacy guarantees scale with pangenome size.

python3 main.py --exp 0 gap_threshold --pangenome_sizes 50,100,250,500,1000,2000

Parameters:

--pangenome_sizes comma-separated list of pangenome sizes to test (default: 50,100,250,500,1000,1500,2000,2500,3000)
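The comma-separated value can be turned into a list of integers along these lines. This is a sketch only: the function name parse_pangenome_sizes is hypothetical, the default string is copied from the parameter description above, and the rejection of non-positive sizes is an assumption about what counts as valid input.

```python
from typing import List

# Default taken from the --pangenome_sizes description.
DEFAULT_SIZES = "50,100,250,500,1000,1500,2000,2500,3000"


def parse_pangenome_sizes(value: str = DEFAULT_SIZES) -> List[int]:
    """Parse a comma-separated list such as "50,100,250" into integers."""
    sizes = [int(part) for part in value.split(",") if part.strip()]
    if not sizes:
        raise ValueError("at least one pangenome size is required")
    if any(size <= 0 for size in sizes):
        raise ValueError("pangenome sizes must be positive integers")
    return sizes
```

Note that the list must be passed without spaces on the command line (or quoted), since the shell would otherwise split it into separate arguments.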

gather_results

Aggregate outputs across experiments.

python3 main.py [--overwrite] gather_results \
  [--optimizer] [--reindex] [--gap_score] [--stacker] \
  [--af_loss] [--ld_loss] [--pangenie_stats] \
  [--accuracy_stats] [--giraffe]

Run without flags to gather all results, or pass one or more flags to gather only those result types.

Parameters:

--overwrite overwrite previously gathered results (reset)
--optimizer gather optimizer results
--reindex re-index the experiment
--gap_score gather gap score results
--stacker gather stacker results
--af_loss gather allele frequency loss results
--ld_loss gather LD loss results
--pangenie_stats gather PanGenie stats
--accuracy_stats gather Beagle accuracy stats
--giraffe gather Giraffe alignment results
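The "no flags means everything" behavior can be expressed with argparse as sketched below. This illustrates the selection logic only and is not the toolkit's parser; in particular, treating --reindex the same as the other flags when nothing is passed is an assumption, and selected_result_types is a hypothetical name.

```python
import argparse
from typing import List, Sequence

# Flag names as listed in the gather_results parameters.
RESULT_TYPES = (
    "optimizer", "reindex", "gap_score", "stacker", "af_loss",
    "ld_loss", "pangenie_stats", "accuracy_stats", "giraffe",
)


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with one boolean flag per result type."""
    parser = argparse.ArgumentParser(prog="gather_results")
    for name in RESULT_TYPES:
        parser.add_argument(f"--{name}", action="store_true")
    return parser


def selected_result_types(argv: Sequence[str]) -> List[str]:
    """Return the requested result types, or all of them if none was requested."""
    args = build_parser().parse_args(list(argv))
    chosen = [name for name in RESULT_TYPES if getattr(args, name)]
    return chosen or list(RESULT_TYPES)
```

For example, selected_result_types(["--af_loss", "--ld_loss"]) yields only those two types, while an empty argument list yields every entry in RESULT_TYPES.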

Workflow Diagram

[ experiment_starter ]
        ↓
     [ optimize ]
        ↓
      [ stacker ]
        ↓
    [ convert_2_vcf ]
        ↓
   [ combine_vcfs ]
        ↓
[ downstream analyses ]
   ├─ gap_score / gap_score_all
   ├─ gap_threshold
   ├─ af_loss / ld_loss
   ├─ beagle / accuracy_stats
   ├─ vg_prep / quick_align
   ├─ verify
   └─ gather_results

[ Utilities ]
   ├─ VCFtoNP_parallel  (preprocess new VCF data)
   └─ create_multitarget_vcfs  (compose multi-subject experiments)

Outputs

Experiments: stored in directories per experiment number in /experiments/.
VCFs: per-chromosome and merged VCFs.
Metrics: AF/LD loss, gap scores.
Alignments: vg-prepped indexes, alignment results.
Aggregated results: via gather_results.

Troubleshooting

No experiment found: Run experiment_starter first.
Missing input files: Verify STARTING_DATA_PATH, --subjects_file, and --capacity_file.
VCF issues: Ensure inputs are bgzipped (.vcf.gz) and indexed (.tbi).
External tool errors: Ensure bcftools, tabix, plink are installed and on PATH.
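The VCF check above can be partly automated. The sketch below, with the hypothetical helper check_vcf, looks at the file name, the two-byte gzip signature, and the presence of a sibling .tbi file. It cannot tell bgzip output from ordinary gzip output, since both start with the same signature, so a file that passes may still need to be recompressed with bgzip.

```python
from pathlib import Path
from typing import List

# First two bytes of any gzip stream (bgzip files are gzip-compatible).
GZIP_MAGIC = b"\x1f\x8b"


def check_vcf(path: Path) -> List[str]:
    """Return a list of problems found with a VCF input; empty means it looks fine."""
    problems = []
    if not path.name.endswith(".vcf.gz"):
        problems.append("file name does not end in .vcf.gz")
    if not path.is_file():
        problems.append("file does not exist")
        return problems
    with path.open("rb") as handle:
        if handle.read(2) != GZIP_MAGIC:
            problems.append("file is not gzip/bgzip compressed")
    # The tabix index sits next to the VCF with .tbi appended to the full name.
    if not path.with_name(path.name + ".tbi").is_file():
        problems.append("missing tabix index (.tbi)")
    return problems
```

Running check_vcf over every input before submitting jobs gives a readable list of what to fix, rather than a failure partway through a long Slurm run.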
