GenoLoader is a Python package for derived allele polarization from VCF files and genetic load estimation, designed to work with both standard- and low-coverage genomic sequencing data.
Given a Freebayes- or GATK-generated VCF annotated with functional effects (e.g. via SnpEff), GenoLoader main function polarizes each SNP as ancestral or derived and recodes individual genotypes as derived-allele dosage values (0, 1, 2) into a custom format text file. It supports multiple polarization strategies — outgroup-based, major allele in one or multiple populations — and implements a read-resampling approach for low-coverage (i.e. ancient) DNA samples. All samples present in the VCF are converted regardless of their population assignment.
Alongside the Python package provided as a Jupyter Lab/Notebook, GenoLoader main function (write_gt in the Python package) is also provided as an AI-translated c++ scritp. While the Pyhton code allows for easier understanding of the script, implementation within existing pipelines and customization, the c++ version provides the exact same features with a 10X faster execution speed. The latter is recommended for processing very large files.
Genetic load estimates and plotting options are provided as additional functions in the Python package only as they do not require long execution time.
- Features
- Requirements
- Installation
- Input files
- Usage
- Parameters
- Polarization strategies
- Polarization flags
- Output files
- Low-coverage resampling
- Example
- Citation
- Polarizes ancestral/derived alleles at biallelic SNP positions using flexible, user-defined strategies
- Supports outgroup-based unfolded polarization (
POP_OUT) as well as ingroup-only (POP1,POP2, orPOP1POP2) and reference-based options - Recodes genotypes as derived-allele dosage:
0(homozygous ancestral),1(heterozygous),2(homozygous derived),nan(missing) - Parses SnpEff
ANN=annotations to classify variants by functional category (missense, synonymous, intergenic, intronic, upstream, downstream) and putative fitness impact (HIGH, MODERATE, LOW, MODIFIER) - Assigns a diagnostic polarization flag to every locus for full traceability of the supporting data at that site
- Implements a one-read-per-genotype random resampling mode for low-coverage (ancient) DNA data
Install dependencies via conda or pip:
conda install scipy numpy pandas tqdm matplotlib seaborn
# or
pip install scipy numpy pandas tqdm matplotlib seabornClone this repository:
git clone https://github.com/emitruc/genoloader.git
cd genoloaderNo compilation or package installation is required. The script GenoLoader.v3.2.ipynb can be run interactively in a Jupyter Lab/Notebook or imported as a module.
In the bash terminal, compile the cpp file using g++:
g++ -O2 -std=c++17 -o genoloader GenoLoader.v3.2.cppChange permissions:
chmod u+x genoloaderRun the script
./genoloader sample.ann.vcf --p1 pop1.txt --p2 pop2.txt --p0 outgroup.txt \
--m1 5 --m2 5 --m0 2 --polX POP_OUT --low_cov YESDisclaimer: While the content of the c++ script has not been checked, its output (gt file) was identical with the gt file produced by the Python script on a 2,296,974-variant input vcf as checked with bash sha256sum.
A biallelic SNP VCF file generated by Freebayes or GATK, annotated with functional effects using SnpEff (the ANN= tag must be present in the INFO field). Both phased and unphased genotype notation are accepted.
The FORMAT field must include:
GT— genotype (required for all modes)AD— allelic depth (required only whenlow_cov='YES')
Note: Only loci carrying a SnpEff
ANN=annotation are written to the output. Unannotated loci are silently skipped.
Each population is defined by a plain-text file containing one sample name per line, matching the sample names in the VCF header exactly. For example:
sample_A
sample_B
sample_C
- Pop1 (
p1): target population 1 (ingroup 1) - Pop2 (
p2): target population 2 (ingroup 2), or closely related outgroup, used in combined polarization modes - Pop_out (
p0): the outgroup population/species, used for unfolded polarization
Population files are used only to determine the ancestral allele. All individuals present in the VCF are recoded and written to the output, including those not assigned to any population group. IMPORTANT: The same individual may appear in more than one population file. You may want to play with individual assignment to refine your polarization.
Import and call the write_gt function directly in Python:
from GiNOLOADER_v3_2 import write_gt
write_gt(
vcf_file = 'my_samples.ann.vcf',
p1 = 'pop1_samples.txt',
p2 = 'pop2_samples.txt',
p0 = 'outgroup_samples.txt',
m1 = 'min_ind_p1',
m2 = 'min_ind_p2',
m0 = 'min_ind_p0',
polX = 'polarization strategy',
low_cov = 'one_read_resampling'
)| Parameter | Type | Default | Description |
|---|---|---|---|
vcf_file |
str |
required | Path to the SnpEff-annotated VCF file |
p1 |
str |
None |
Path to Pop1 sample list file |
p2 |
str |
None |
Path to Pop2 sample list file |
p0 |
str |
None |
Path to outgroup sample list file |
m1 |
int |
0 |
Minimum number of non-missing individuals required in Pop1 to retain a locus |
m2 |
int |
0 |
Minimum number of non-missing individuals required in Pop2 to retain a locus |
m0 |
int |
0 |
Minimum number of non-missing individuals required in the outgroup to retain a locus |
polX |
str |
'REF' |
Polarization strategy (see Polarization strategies) |
low_cov |
str |
'NO' |
Enable low-coverage read resampling: 'YES' or 'NO' |
Tip: For genetic load estimation, set
m1andm2(or onlym1depending on your polarization strategy), equal to the total number of individuals in the respective population files to exclude any locus with missing data. No missing data is good practice when estimating genetic load.
The polX parameter controls how the ancestral allele is determined at each locus.
polX value |
Description |
|---|---|
'REF' |
No repolarization. The VCF reference allele is treated as ancestral. |
'POP_OUT' |
Outgroup-based unfolded polarization. The allele present and fixed in the outgroup (p0) is inferred as ancestral. If not fixed, alternative behaviour is recorded |
'POP1' |
The major allele in Pop1 is used as the ancestral state. |
'POP2' |
The major allele in Pop2 is used as the ancestral state. |
'POP1POP2' |
The major allele across Pop1 and Pop2 combined is used as the ancestral state. |
Note: If a population file is not provided but the strategy requires it, the function falls back gracefully to using whatever information is available, as described in Polarization flags.
Every locus in the output is assigned a diagnostic flag describing the allelic configuration used to determine the ancestral allele. These flags are written to the flag column of the .gt output file and allow full traceability of polarization decisions.
| Flag | Description |
|---|---|
reference |
No repolarization; VCF reference allele used as ancestral (polX='REF') |
allFix |
All groups (outgroup and ingroups) monomorphic for the same allele |
allMiss |
All alleles missing across all groups; locus assigned missing |
inMiss |
Both ingroup populations missing; ancestral allele inferred from outgroup modal allele |
InvarOutMiss |
Outgroup missing; both ingroups monomorphic for the same allele |
altFix |
Outgroup missing; Pop1 and Pop2 fixed for different alleles; Pop1 allele used (conservative) |
in1FoldAnc |
Outgroup and Pop2 missing; Pop1 monomorphic; Pop1 allele used as ancestral |
in1FoldSeg |
Outgroup and Pop2 missing; Pop1 polymorphic; major Pop1 allele used |
in2FoldAnc |
Outgroup and Pop1 missing; Pop2 monomorphic; Pop2 allele used as ancestral |
in2FoldSeg |
Outgroup and Pop1 missing; Pop2 polymorphic; major Pop2 allele used |
unfoldFix1outMiss |
Outgroup missing; Pop1 monomorphic, Pop2 polymorphic; Pop1 allele used |
unfoldFix2outMiss |
Outgroup missing; Pop2 monomorphic, Pop1 polymorphic; Pop2 allele used |
inFold |
Outgroup missing; both ingroups polymorphic; major ingroup allele used (folded fallback) |
unfolded2Miss1 |
Outgroup monomorphic; Pop1 missing, Pop2 present; outgroup allele used |
unfolded1Miss2 |
Outgroup monomorphic; Pop2 missing, Pop1 present; outgroup allele used |
unfoldedAltFix |
Outgroup monomorphic; both ingroups monomorphic for different alleles |
unfoldedILS |
Outgroup monomorphic; both ingroups polymorphic (possible incomplete lineage sorting) |
unfoldedFixAnc1 |
Outgroup monomorphic; Pop1 fixed for ancestral allele, Pop2 polymorphic |
unfoldedFixDer1 |
Outgroup monomorphic; Pop1 fixed for derived allele, Pop2 polymorphic |
unfoldedFixAnc2 |
Outgroup monomorphic; Pop2 fixed for ancestral allele, Pop1 polymorphic |
unfoldedFixDer2 |
Outgroup monomorphic; Pop2 fixed for derived allele, Pop1 polymorphic |
InFixAnc |
Outgroup polymorphic; ingroups monomorphic; ingroup allele used (conservative) |
allFold |
Both outgroup and ingroups polymorphic; modal allele across all samples used (folded fallback) |
triall |
Non-biallelic locus detected; all genotypes set to missing |
A tab-delimited file written for every run. Each row corresponds to one annotated SNP locus. The file contains the following columns:
| Column | Description |
|---|---|
scaffold |
Scaffold or chromosome identifier (from VCF CHROM field) |
position |
Genomic position (from VCF POS field) |
effect |
Predicted fitness effect from SnpEff (e.g. HIGH) |
vartype |
Simplified variant category: missense, synonymous, intergenic, intron, downstream, upstream, or else |
flag |
Polarization flag (see Polarization flags) |
ref |
Inferred ancestral allele (VCF REF allele if not repolarized; VCF ALT allele if repolarized) |
<sample_1> ... <sample_N> |
Derived-allele dosage per individual: 0 (homozygous ancestral), 1 (heterozygous), 2 (homozygous derived), nan (missing) |
The
refcolumn always reflects the inferred ancestral allele, regardless of what is listed as REF in the original VCF. If the ancestral allele was the VCF ALT allele, all genotypes at that locus are inverted and the ALT allele is written toref.
scaffold position effect vartype flag ref ind1 ind2 ind3 ind4
scaffold_1 10452 HIGH missense allFix A 0 0 0 0
scaffold_1 18763 MODERATE synonymous unfoldedFixAnc1 G 0 1 0 2
scaffold_1 25001 LOW missense allFold T 1 0 nan 2
scaffold_2 3301 MODIFIER intron reference C 2 2 1 0
Generated only when low_cov='YES'. This file has an identical structure to the primary .gt file, but instead of diploid dosage values it contains a single randomly sampled allele per individual per locus:
| Value | Meaning |
|---|---|
0 |
Ancestral allele sampled |
1 |
Derived allele sampled |
. |
Missing (zero read depth or missing AD field) |
This pseudo-haploid output is suitable for downstream analyses that require single-read data, such as those commonly applied to ancient or low-coverage genomes.
When low_cov='YES', GenoLoader reads the allelic depth (AD) subfield from the FORMAT column of the VCF for each individual. At each locus, a pool of alleles is constructed by listing the ancestral allele as many times as its read depth, and the derived allele as many times as its read depth. One allele is then drawn at random from this pool (equivalent to sampling a single sequencing read). Individuals with a missing or zero-depth AD field receive a missing value (.).
This approach emulates pseudo-haploid genotype calling, a standard strategy for ancient DNA and low-coverage data where reliable diploid genotype inference is not possible.
Requirement: The VCF FORMAT field must include the
ADsubfield (allelic depth), as produced by Freebayes or GATK. TheADvalues must appear as the third colon-delimited subfield in the FORMAT column (index 2, 0-based).
After loading the functions in your Jupyter Lab notebook:
# Outgroup-based polarization with missingness filter and low-coverage resampling
write_gt(
vcf_file = 'population_data.ann.vcf',
p1 = 'focal_pop.txt', # ingroup population of interest
p2 = 'sister_pop.txt', # closely related population
p0 = 'outgroup.txt', # outgroup for unfolded polarization
m1 = 10, # require all 10 focal pop individuals
m2 = 8, # require all 8 sister pop individuals
m0 = 2, # require at least 2 outgroup individuals
polX = 'POP_OUT',
low_cov = 'YES'
)This will produce:
population_data.POP_OUT.gt— diploid derived-allele dosage matrixpopulation_data.gt.covRandom1— pseudo-haploid resampled allele matrix
Upon completion, the script reports the number of loci passing missingness filters and re-polarized from the original VCF file:
125430 loci recorded to file, of which 48201 repolarized to ancestral and 3102 not passing missingness filter
If you use GenoLoader in your research, please cite:
[Trucchi], [2026]. Genoloader: a Python package for derived allele polarization and genetic load estimation for high and low-coverage data. [Journal, DOI — to be updated upon publication]
