Version 0.0.1: univariate SV enrichment analysis; integration with older MiXeR features (e.g., bivariate analysis) is planned for a future release.
- Overview
- Installation
- LD Reference Data
- Summary Statistics
- Usage
- CLI Arguments
- Output Files
- Citation
MiXeR-SV is a statistical genetics tool for quantifying the enrichment of structural variants (SVs) in trait heritability. It builds on the MiXeR framework and uses PyTorch for efficient gradient-based optimization. Key features include:
- Univariate SV enrichment analysis via LD-score regression and a Gaussian mixture heritability model.
- Annotation-based enrichment estimation with fold-enrichment statistics and standard errors.
- QC of GWAS summary statistics (MAF filtering, MHC exclusion, ambiguous SNP removal, allele alignment).
- Chromosome-level data processing for memory efficiency.
- Bivariate (cross-trait) analysis planned for a future release.
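As a rough illustration of two of the QC steps listed above, the sketch below flags strand-ambiguous SNPs and applies a MAF filter. It is not MiXeR-SV's internal code: the function names and the record layout are hypothetical, and the 0.05 cutoff simply mirrors the documented default of --maf-threshold.

```python
# Illustrative QC helpers; NOT the code MiXeR-SV runs internally.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1: str, a2: str) -> bool:
    """True for A/T and C/G pairs, whose strand cannot be inferred from the alleles."""
    return COMPLEMENT.get(a1.upper()) == a2.upper()


def passes_maf(freq: float, threshold: float = 0.05) -> bool:
    """Keep variants whose minor allele frequency is at least the threshold (assumed rule)."""
    maf = min(freq, 1.0 - freq)
    return maf >= threshold


def qc_filter(records, threshold: float = 0.05):
    """Filter (snp_id, a1, a2, allele_freq) tuples; this layout is hypothetical."""
    return [
        rec for rec in records
        if not is_strand_ambiguous(rec[1], rec[2]) and passes_maf(rec[3], threshold)
    ]
```

Whether the real filter uses a strict or non-strict inequality at the threshold is not stated in this README, so treat the boundary behaviour here as an assumption.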
The recommended way to set up the environment is via conda:
# Create a new conda environment with Python 3.13
conda create -n mixer2 python=3.13 -y
# Activate the environment
conda activate mixer2
# Install all required packages
pip install -r requirements.txt
Key dependencies (see requirements.txt for pinned versions):
| Package | Purpose |
|---|---|
| torch | PyTorch optimization backend |
| numpy / scipy | Numerical computing |
| pandas | Data handling |
| scikit-learn | Utility functions |
| mat73 | Loading MATLAB v7.3 LD matrix files |
MiXeR-SV requires pre-computed LD matrices. We provide LD reference data derived from the 1000 Genomes Project (GRCh38) using both WGS 30X and ONT long-read sequencing:
Download link:
https://drive.google.com/drive/folders/15sg6P0rmHxRBQNglpFkDjoUSf1C6S-Q8?usp=sharing
You can download the two tar.gz files for the EAS and EUR populations, then extract them:
## EAS
tar -xzf 1kg_combined_plink_eas_AC3.tar.gz
## EUR
tar -xzf 1kg_combined_plink_eur_AC3.tar.gz
The directory structure should look like:
1kg_combined_plink_eur_AC3/
├── SV_variants.txt # List of SV variant IDs
├── annot_mat.txt # Annotation matrix (tab-separated, SNP-indexed)
├── chr1.ldmat/
├── chr2.ldmat/
│ ...
└── chr22.ldmat/
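Before launching a run, it can help to confirm that the extracted directory matches the layout shown above. The following is a minimal sketch; the helper name `missing_ld_entries` is hypothetical and not part of MiXeR-SV, and it checks only that the entries exist, not their contents.

```python
from pathlib import Path


def missing_ld_entries(ld_dir):
    """Return the expected entries that are absent from an LD reference directory."""
    root = Path(ld_dir)
    expected_files = ["SV_variants.txt", "annot_mat.txt"]
    expected_dirs = [f"chr{c}.ldmat" for c in range(1, 23)]  # chr1 .. chr22
    missing = [name for name in expected_files if not (root / name).is_file()]
    missing += [name for name in expected_dirs if not (root / name).is_dir()]
    return missing
```

For example, `missing_ld_entries("1kg_combined_plink_eur_AC3")` returns an empty list when every expected file and per-chromosome subdirectory is present.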
A permanent Zenodo DOI will be provided upon completion of peer review.
MiXeR-SV reads GWAS summary statistics in the standard format used by LDSC. A curated collection of 107 independent GWAS summary statistics is available from:
https://zenodo.org/records/10515792/files/sumstats_indep107.tgz?download=1
A few example files are included in the sumstats/ directory for quick testing.
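A quick header check can catch a mis-formatted input file early. The sketch below assumes the columns written by LDSC's munge_sumstats.py (SNP, A1, A2, Z, N); whether MiXeR-SV requires all of them is an assumption, and the helper name is hypothetical.

```python
import gzip

# Columns produced by LDSC's munge_sumstats.py. That MiXeR-SV needs exactly
# this set is an assumption, not something this README states.
LDSC_COLUMNS = {"SNP", "A1", "A2", "Z", "N"}


def missing_sumstats_columns(path):
    """Return the LDSC columns absent from the header of a .sumstats.gz file."""
    with gzip.open(path, "rt") as handle:
        header = handle.readline().split()  # header is whitespace/tab separated
    return sorted(LDSC_COLUMNS - set(header))
```

An empty return value means the header contains every column in `LDSC_COLUMNS`.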
| Hardware | Estimated runtime per trait |
|---|---|
| Apple M4 Max (36 GB RAM) | < 10 minutes |
| Intel CPU (32 GB+ RAM) | ~30 minutes to 1 hour |
At least 32 GB of RAM is recommended for genome-wide analysis.
# Activate the conda environment
conda activate mixer2
# Create output directory
mkdir -p test_results
# Set paths to LD reference data (replace with your actual path)
LD_DIR="1kg_combined_plink_eur_AC3"
# Other annotation files; these should be in the same directory as the provided LD matrices
SNP_FILE="$LD_DIR/SV_variants.txt"
ANNOT="$LD_DIR/annot_mat.txt"
# Run analysis for all traits in sumstats/
for TRAIT in sumstats/*.sumstats.gz; do
BASENAME=$(basename "$TRAIT" .sumstats.gz)
OUT="test_results/univar_${BASENAME}.txt"
OUT_LIST="test_results/univar_${BASENAME}_list.txt"
# Skip if output already exists
if [[ -f "$OUT_LIST" && -s "$OUT_LIST" ]]; then
echo "Skipping (already exists): $OUT_LIST"
continue
fi
echo "Processing: $BASENAME"
python mixer2.py univar \
--annot "$ANNOT" \
--ld-mat1 "$LD_DIR" \
--trait1 "$TRAIT" \
--snp-file "$SNP_FILE" \
--output "$OUT" \
--seed 42
done
The constrained model fixes the within-region heritability coefficient to zero, effectively testing the null hypothesis that SVs contribute no additional heritability beyond the genome-wide baseline. This is useful as a null model for comparison against the unconstrained fit.
Add --constrain-roi-estimates-to-zero to any univar call:
# Activate the conda environment
conda activate mixer2
# Create output directory
mkdir -p test_results_constrained
# Set paths to LD reference data
LD_DIR="1kg_combined_plink_eur_AC3"
SNP_FILE="$LD_DIR/SV_variants.txt"
ANNOT="$LD_DIR/annot_mat.txt"
# Run constrained analysis for all traits in sumstats/
for TRAIT in sumstats/*.sumstats.gz; do
BASENAME=$(basename "$TRAIT" .sumstats.gz)
OUT="test_results_constrained/univar_${BASENAME}.txt"
OUT_LIST="test_results_constrained/univar_${BASENAME}_list.txt"
# Skip if output already exists
if [[ -f "$OUT_LIST" && -s "$OUT_LIST" ]]; then
echo "Skipping (already exists): $OUT_LIST"
continue
fi
echo "Processing (constrained): $BASENAME"
python mixer2.py univar \
--annot "$ANNOT" \
--ld-mat1 "$LD_DIR" \
--trait1 "$TRAIT" \
--snp-file "$SNP_FILE" \
--output "$OUT" \
--constrain-roi-estimates-to-zero \
--seed 42
done
All arguments below apply to the univar subcommand. Run python mixer2.py univar --help for the full help message.
| Argument | Description |
|---|---|
| --annot | Path to the annotation matrix file (tab-separated, SNP-indexed) |
| --ld-mat1 | Path to the LD reference directory containing chr1.ldmat/ through chr22.ldmat/ subdirectories |
| --trait1 | Path to the GWAS summary statistics file (.sumstats.gz) |
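For scripted runs, the same call can be assembled in Python rather than in a shell loop. This is a minimal sketch: the helper `build_univar_command` is hypothetical, and it uses only flags that appear in this README.

```python
import shlex


def build_univar_command(annot, ld_dir, trait, output=None, snp_file=None,
                         seed=None, constrain=False):
    """Assemble the argument list for `python mixer2.py univar`."""
    cmd = ["python", "mixer2.py", "univar",
           "--annot", annot, "--ld-mat1", ld_dir, "--trait1", trait]
    if snp_file is not None:
        cmd += ["--snp-file", snp_file]
    if output is not None:
        cmd += ["--output", output]
    if constrain:
        cmd.append("--constrain-roi-estimates-to-zero")
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


if __name__ == "__main__":
    ld_dir = "1kg_combined_plink_eur_AC3"
    cmd = build_univar_command(f"{ld_dir}/annot_mat.txt", ld_dir,
                               "sumstats/example.sumstats.gz", seed=42)
    print(shlex.join(cmd))  # inspect the command before running it
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell-quoting problems with paths that contain spaces. The file name `sumstats/example.sumstats.gz` is a placeholder.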
| Argument | Default | Description |
|---|---|---|
| --snp-file | None | File listing SNP IDs that define the genomic region of interest (e.g., SV loci) |
| --output, -o | output.txt | Base path for output files (_list.txt, _table.txt, .log suffixes are added automatically) |
| --maf-threshold | 0.05 | Minor allele frequency threshold for variant filtering |
| --seed | None | Random seed for reproducibility |
| --s-value | -0.25 | Heritability model parameter S (variants contribute via H^S) |
| --pytorch-epochs | 500 | Number of PyTorch optimization epochs |
| --pytorch-lr | 0.001 | PyTorch optimizer learning rate |
| --only-base | False | Use only the "base" (intercept) annotation column |
| --disable-inverse-ld-score-weights | False | Disable inverse-LD-score weighting |
| --save-null-model | False | Cache the null (baseline) model to a .sig2_beta_i.mat file for reuse across runs |
| --constrain-roi-estimates-to-zero | False | Constrain the within-region (SV) heritability coefficient to zero; useful as a null/constrained model for comparison against the unconstrained fit |
| --verbose, -v | False | Enable verbose logging |
| --debug | False | Enable debug-level logging |
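To give a feel for the --s-value parameter, the sketch below evaluates the H^S factor at a few allele frequencies. It assumes H denotes heterozygosity, H = 2p(1 - p), which this README does not state explicitly; the function names are hypothetical and the code is not taken from MiXeR-SV.

```python
def heterozygosity(freq: float) -> float:
    """H = 2p(1 - p) for allele frequency p (assumed meaning of H)."""
    return 2.0 * freq * (1.0 - freq)


def hs_factor(freq: float, s: float = -0.25) -> float:
    """Relative per-variant weight H^S; s defaults to the documented -0.25."""
    return heterozygosity(freq) ** s


if __name__ == "__main__":
    for p in (0.05, 0.20, 0.50):
        print(f"p={p:.2f}  H={heterozygosity(p):.3f}  H^S={hs_factor(p):.3f}")
```

Under this assumption, a negative S gives rarer variants a larger factor than common ones, and S = 0 makes the factor 1 for every variant.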
For a given --output results/univar_trait.txt, MiXeR-SV produces three files:
| File | Description |
|---|---|
| results/univar_trait_list.txt | Key-value pairs of all result metrics, one per line |
| results/univar_trait_table.txt | Tab-separated table with column headers (suitable for downstream aggregation across traits) |
| results/univar_trait.log | Full run log |
Result metrics include fold-enrichment estimates, standard errors, statistics, and per-annotation heritability contributions.
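The naming scheme and the cross-trait aggregation mentioned above can be sketched as follows. Both helpers are hypothetical. `output_paths` assumes the extension of --output is stripped before the suffixes are added, which matches the results/univar_trait.txt example but is otherwise unverified. `combine_tables` makes no assumption about column names beyond a tab-separated header row.

```python
import csv
from pathlib import Path


def output_paths(output):
    """Map an --output value to the three files it is expected to produce."""
    stem = Path(output).with_suffix("")  # results/univar_trait.txt -> results/univar_trait
    return {
        "list": Path(f"{stem}_list.txt"),
        "table": Path(f"{stem}_table.txt"),
        "log": Path(f"{stem}.log"),
    }


def combine_tables(results_dir):
    """Read every *_table.txt in a directory into one list of row dictionaries."""
    rows = []
    for path in sorted(Path(results_dir).glob("*_table.txt")):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                row["source_file"] = path.name  # remember which trait the row came from
                rows.append(row)
    return rows
```

For instance, `combine_tables("test_results")` collects the per-trait tables written by the loop in the usage example into a single list that can be written back out with `csv.DictWriter` or loaded into pandas.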
If you use MiXeR-SV in your research, please cite:
Citation details will be added upon publication.