GenoLoader

GenoLoader is a Python package for derived allele polarization from VCF files and genetic load estimation, designed to work with both standard- and low-coverage genomic sequencing data.

Given a Freebayes- or GATK-generated VCF annotated with functional effects (e.g. via SnpEff), GenoLoader main function polarizes each SNP as ancestral or derived and recodes individual genotypes as derived-allele dosage values (0, 1, 2) into a custom format text file. It supports multiple polarization strategies — outgroup-based, major allele in one or multiple populations — and implements a read-resampling approach for low-coverage (i.e. ancient) DNA samples. All samples present in the VCF are converted regardless of their population assignment.

Alongside the Python package provided as a Jupyter Lab/Notebook, GenoLoader main function (write_gt in the Python package) is also provided as an AI-translated c++ scritp. While the Pyhton code allows for easier understanding of the script, implementation within existing pipelines and customization, the c++ version provides the exact same features with a 10X faster execution speed. The latter is recommended for processing very large files.

Genetic load estimates and plotting options are provided as additional functions in the Python package only as they do not require long execution time.

Features

Polarizes ancestral/derived alleles at biallelic SNP positions using flexible, user-defined strategies
Supports outgroup-based unfolded polarization (POP_OUT) as well as ingroup-only (POP1, POP2, or POP1POP2) and reference-based options
Recodes genotypes as derived-allele dosage: 0 (homozygous ancestral), 1 (heterozygous), 2 (homozygous derived), nan (missing)
Parses SnpEff ANN= annotations to classify variants by functional category (missense, synonymous, intergenic, intronic, upstream, downstream) and putative fitness impact (HIGH, MODERATE, LOW, MODIFIER)
Assigns a diagnostic polarization flag to every locus for full traceability of the supporting data at that site
Implements a one-read-per-genotype random resampling mode for low-coverage (ancient) DNA data

Requirements

Python ≥ 3.7
scipy
numpy
pandas
tqdm
matplotlib
seaborn

Install dependencies via conda or pip:

conda install scipy numpy pandas tqdm matplotlib seaborn
# or
pip install scipy numpy pandas tqdm matplotlib seaborn

Installation

Clone this repository:

git clone https://github.com/emitruc/genoloader.git
cd genoloader

Python script

No compilation or package installation is required. The script GenoLoader.v3.2.ipynb can be run interactively in a Jupyter Lab/Notebook or imported as a module.

C++ script (only for the main function write_gt fromt the python package)

In the bash terminal, compile the cpp file using g++:

g++ -O2 -std=c++17 -o genoloader GenoLoader.v3.2.cpp

Change permissions:

chmod u+x genoloader

Run the script

./genoloader sample.ann.vcf --p1 pop1.txt --p2 pop2.txt --p0 outgroup.txt \
             --m1 5 --m2 5 --m0 2 --polX POP_OUT --low_cov YES

Disclaimer: While the content of the c++ script has not been checked, its output (gt file) was identical with the gt file produced by the Python script on a 2,296,974-variant input vcf as checked with bash sha256sum.

Input files

1. VCF file (required)

A biallelic SNP VCF file generated by Freebayes or GATK, annotated with functional effects using SnpEff (the ANN= tag must be present in the INFO field). Both phased and unphased genotype notation are accepted.

The FORMAT field must include:

GT — genotype (required for all modes)
AD — allelic depth (required only when low_cov='YES')

Note: Only loci carrying a SnpEff ANN= annotation are written to the output. Unannotated loci are silently skipped.

2. Population list files (optional, plain text)

Each population is defined by a plain-text file containing one sample name per line, matching the sample names in the VCF header exactly. For example:

sample_A
sample_B
sample_C

Pop1 (p1): target population 1 (ingroup 1)
Pop2 (p2): target population 2 (ingroup 2), or closely related outgroup, used in combined polarization modes
Pop_out (p0): the outgroup population/species, used for unfolded polarization

Population files are used only to determine the ancestral allele. All individuals present in the VCF are recoded and written to the output, including those not assigned to any population group. IMPORTANT: The same individual may appear in more than one population file. You may want to play with individual assignment to refine your polarization.

Usage

Import and call the write_gt function directly in Python:

from GiNOLOADER_v3_2 import write_gt

write_gt(
    vcf_file = 'my_samples.ann.vcf',
    p1       = 'pop1_samples.txt',
    p2       = 'pop2_samples.txt',
    p0       = 'outgroup_samples.txt',
    m1       = 'min_ind_p1',
    m2       = 'min_ind_p2',
    m0       = 'min_ind_p0',
    polX     = 'polarization strategy',
    low_cov  = 'one_read_resampling'
)

Parameters

Parameter	Type	Default	Description
`vcf_file`	`str`	required	Path to the SnpEff-annotated VCF file
`p1`	`str`	`None`	Path to Pop1 sample list file
`p2`	`str`	`None`	Path to Pop2 sample list file
`p0`	`str`	`None`	Path to outgroup sample list file
`m1`	`int`	`0`	Minimum number of non-missing individuals required in Pop1 to retain a locus
`m2`	`int`	`0`	Minimum number of non-missing individuals required in Pop2 to retain a locus
`m0`	`int`	`0`	Minimum number of non-missing individuals required in the outgroup to retain a locus
`polX`	`str`	`'REF'`	Polarization strategy (see Polarization strategies)
`low_cov`	`str`	`'NO'`	Enable low-coverage read resampling: `'YES'` or `'NO'`

Tip: For genetic load estimation, set m1 and m2 (or only m1 depending on your polarization strategy), equal to the total number of individuals in the respective population files to exclude any locus with missing data. No missing data is good practice when estimating genetic load.

Polarization strategies

The polX parameter controls how the ancestral allele is determined at each locus.

`polX` value	Description
`'REF'`	No repolarization. The VCF reference allele is treated as ancestral.
`'POP_OUT'`	Outgroup-based unfolded polarization. The allele present and fixed in the outgroup (`p0`) is inferred as ancestral. If not fixed, alternative behaviour is recorded
`'POP1'`	The major allele in Pop1 is used as the ancestral state.
`'POP2'`	The major allele in Pop2 is used as the ancestral state.
`'POP1POP2'`	The major allele across Pop1 and Pop2 combined is used as the ancestral state.

Note: If a population file is not provided but the strategy requires it, the function falls back gracefully to using whatever information is available, as described in Polarization flags.

Polarization flags

Every locus in the output is assigned a diagnostic flag describing the allelic configuration used to determine the ancestral allele. These flags are written to the flag column of the .gt output file and allow full traceability of polarization decisions.

Flag	Description
`reference`	No repolarization; VCF reference allele used as ancestral (`polX='REF'`)
`allFix`	All groups (outgroup and ingroups) monomorphic for the same allele
`allMiss`	All alleles missing across all groups; locus assigned missing
`inMiss`	Both ingroup populations missing; ancestral allele inferred from outgroup modal allele
`InvarOutMiss`	Outgroup missing; both ingroups monomorphic for the same allele
`altFix`	Outgroup missing; Pop1 and Pop2 fixed for different alleles; Pop1 allele used (conservative)
`in1FoldAnc`	Outgroup and Pop2 missing; Pop1 monomorphic; Pop1 allele used as ancestral
`in1FoldSeg`	Outgroup and Pop2 missing; Pop1 polymorphic; major Pop1 allele used
`in2FoldAnc`	Outgroup and Pop1 missing; Pop2 monomorphic; Pop2 allele used as ancestral
`in2FoldSeg`	Outgroup and Pop1 missing; Pop2 polymorphic; major Pop2 allele used
`unfoldFix1outMiss`	Outgroup missing; Pop1 monomorphic, Pop2 polymorphic; Pop1 allele used
`unfoldFix2outMiss`	Outgroup missing; Pop2 monomorphic, Pop1 polymorphic; Pop2 allele used
`inFold`	Outgroup missing; both ingroups polymorphic; major ingroup allele used (folded fallback)
`unfolded2Miss1`	Outgroup monomorphic; Pop1 missing, Pop2 present; outgroup allele used
`unfolded1Miss2`	Outgroup monomorphic; Pop2 missing, Pop1 present; outgroup allele used
`unfoldedAltFix`	Outgroup monomorphic; both ingroups monomorphic for different alleles
`unfoldedILS`	Outgroup monomorphic; both ingroups polymorphic (possible incomplete lineage sorting)
`unfoldedFixAnc1`	Outgroup monomorphic; Pop1 fixed for ancestral allele, Pop2 polymorphic
`unfoldedFixDer1`	Outgroup monomorphic; Pop1 fixed for derived allele, Pop2 polymorphic
`unfoldedFixAnc2`	Outgroup monomorphic; Pop2 fixed for ancestral allele, Pop1 polymorphic
`unfoldedFixDer2`	Outgroup monomorphic; Pop2 fixed for derived allele, Pop1 polymorphic
`InFixAnc`	Outgroup polymorphic; ingroups monomorphic; ingroup allele used (conservative)
`allFold`	Both outgroup and ingroups polymorphic; modal allele across all samples used (folded fallback)
`triall`	Non-biallelic locus detected; all genotypes set to missing

Output files

Primary output: `<vcf_file>.<polX>.gt`

A tab-delimited file written for every run. Each row corresponds to one annotated SNP locus. The file contains the following columns:

Column	Description
`scaffold`	Scaffold or chromosome identifier (from VCF CHROM field)
`position`	Genomic position (from VCF POS field)
`effect`	Predicted fitness effect from SnpEff (e.g. `HIGH`)
`vartype`	Simplified variant category: `missense`, `synonymous`, `intergenic`, `intron`, `downstream`, `upstream`, or `else`
`flag`	Polarization flag (see Polarization flags)
`ref`	Inferred ancestral allele (VCF REF allele if not repolarized; VCF ALT allele if repolarized)
`<sample_1>` ... `<sample_N>`	Derived-allele dosage per individual: `0` (homozygous ancestral), `1` (heterozygous), `2` (homozygous derived), `nan` (missing)

The ref column always reflects the inferred ancestral allele, regardless of what is listed as REF in the original VCF. If the ancestral allele was the VCF ALT allele, all genotypes at that locus are inverted and the ALT allele is written to ref.

Example output (primary `.gt` file)

scaffold    position  effect            vartype     flag          ref  ind1  ind2  ind3  ind4
scaffold_1  10452     HIGH  missense    allFix        A    0     0     0     0
scaffold_1  18763     MODERATE synonymous unfoldedFixAnc1 G  0     1     0     2
scaffold_1  25001     LOW  missense    allFold       T    1     0     nan   2
scaffold_2  3301      MODIFIER    intron      reference     C    2     2     1     0

Low-coverage output: `<vcf_file>.gt.covRandom1`

Generated only when low_cov='YES'. This file has an identical structure to the primary .gt file, but instead of diploid dosage values it contains a single randomly sampled allele per individual per locus:

Value	Meaning
`0`	Ancestral allele sampled
`1`	Derived allele sampled
`.`	Missing (zero read depth or missing AD field)

This pseudo-haploid output is suitable for downstream analyses that require single-read data, such as those commonly applied to ancient or low-coverage genomes.

Low-coverage resampling

When low_cov='YES', GenoLoader reads the allelic depth (AD) subfield from the FORMAT column of the VCF for each individual. At each locus, a pool of alleles is constructed by listing the ancestral allele as many times as its read depth, and the derived allele as many times as its read depth. One allele is then drawn at random from this pool (equivalent to sampling a single sequencing read). Individuals with a missing or zero-depth AD field receive a missing value (.).

This approach emulates pseudo-haploid genotype calling, a standard strategy for ancient DNA and low-coverage data where reliable diploid genotype inference is not possible.

Requirement: The VCF FORMAT field must include the AD subfield (allelic depth), as produced by Freebayes or GATK. The AD values must appear as the third colon-delimited subfield in the FORMAT column (index 2, 0-based).

Example

After loading the functions in your Jupyter Lab notebook:

# Outgroup-based polarization with missingness filter and low-coverage resampling
write_gt(
    vcf_file = 'population_data.ann.vcf',
    p1       = 'focal_pop.txt',       # ingroup population of interest
    p2       = 'sister_pop.txt',      # closely related population
    p0       = 'outgroup.txt',        # outgroup for unfolded polarization
    m1       = 10,                    # require all 10 focal pop individuals
    m2       = 8,                     # require all 8 sister pop individuals
    m0       = 2,                     # require at least 2 outgroup individuals
    polX     = 'POP_OUT',
    low_cov  = 'YES'
)

This will produce:

population_data.POP_OUT.gt — diploid derived-allele dosage matrix
population_data.gt.covRandom1 — pseudo-haploid resampled allele matrix

Upon completion, the script reports the number of loci passing missingness filters and re-polarized from the original VCF file:

125430 loci recorded to file, of which 48201 repolarized to ancestral and 3102 not passing missingness filter

Citation

If you use GenoLoader in your research, please cite:

[Trucchi], [2026]. Genoloader: a Python package for derived allele polarization and genetic load estimation for high and low-coverage data. [Journal, DOI — to be updated upon publication]

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
WPSG2025		WPSG2025
images		images
GenoLoader.v3.2.cpp		GenoLoader.v3.2.cpp
GenoLoader.v3.2.ipynb		GenoLoader.v3.2.ipynb
GenoLoader_counting_plotting_functions.ipynb		GenoLoader_counting_plotting_functions.ipynb
LICENSE		LICENSE
README.md		README.md
counting_plotting_functions.md		counting_plotting_functions.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GenoLoader

Table of Contents

Features

Requirements

Installation

Python script

C++ script (only for the main function write_gt fromt the python package)

Input files

1. VCF file (required)

2. Population list files (optional, plain text)

Usage

Parameters

Polarization strategies

Polarization flags

Output files

Primary output: `<vcf_file>.<polX>.gt`

Example output (primary `.gt` file)

Low-coverage output: `<vcf_file>.gt.covRandom1`

Low-coverage resampling

Example

Citation

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GenoLoader

Table of Contents

Features

Requirements

Installation

Python script

C++ script (only for the main function write_gt fromt the python package)

Input files

1. VCF file (required)

2. Population list files (optional, plain text)

Usage

Parameters

Polarization strategies

Polarization flags

Output files

Primary output: <vcf_file>.<polX>.gt

Example output (primary .gt file)

Low-coverage output: <vcf_file>.gt.covRandom1

Low-coverage resampling

Example

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Primary output: `<vcf_file>.<polX>.gt`

Example output (primary `.gt` file)

Low-coverage output: `<vcf_file>.gt.covRandom1`

Packages