Skip to content

emitruc/genoloader

Repository files navigation

GenoLoader

GenoLoader is a Python package for derived allele polarization from VCF files and genetic load estimation, designed to work with both standard- and low-coverage genomic sequencing data.

Given a Freebayes- or GATK-generated VCF annotated with functional effects (e.g. via SnpEff), GenoLoader main function polarizes each SNP as ancestral or derived and recodes individual genotypes as derived-allele dosage values (0, 1, 2) into a custom format text file. It supports multiple polarization strategies — outgroup-based, major allele in one or multiple populations — and implements a read-resampling approach for low-coverage (i.e. ancient) DNA samples. All samples present in the VCF are converted regardless of their population assignment.

Alongside the Python package provided as a Jupyter Lab/Notebook, GenoLoader main function (write_gt in the Python package) is also provided as an AI-translated c++ scritp. While the Pyhton code allows for easier understanding of the script, implementation within existing pipelines and customization, the c++ version provides the exact same features with a 10X faster execution speed. The latter is recommended for processing very large files.

Genetic load estimates and plotting options are provided as additional functions in the Python package only as they do not require long execution time.


Table of Contents


Features

  • Polarizes ancestral/derived alleles at biallelic SNP positions using flexible, user-defined strategies
  • Supports outgroup-based unfolded polarization (POP_OUT) as well as ingroup-only (POP1, POP2, or POP1POP2) and reference-based options
  • Recodes genotypes as derived-allele dosage: 0 (homozygous ancestral), 1 (heterozygous), 2 (homozygous derived), nan (missing)
  • Parses SnpEff ANN= annotations to classify variants by functional category (missense, synonymous, intergenic, intronic, upstream, downstream) and putative fitness impact (HIGH, MODERATE, LOW, MODIFIER)
  • Assigns a diagnostic polarization flag to every locus for full traceability of the supporting data at that site
  • Implements a one-read-per-genotype random resampling mode for low-coverage (ancient) DNA data

Requirements

Install dependencies via conda or pip:

conda install scipy numpy pandas tqdm matplotlib seaborn
# or
pip install scipy numpy pandas tqdm matplotlib seaborn

Installation

Clone this repository:

git clone https://github.com/emitruc/genoloader.git
cd genoloader

Python script

No compilation or package installation is required. The script GenoLoader.v3.2.ipynb can be run interactively in a Jupyter Lab/Notebook or imported as a module.

C++ script (only for the main function write_gt fromt the python package)

In the bash terminal, compile the cpp file using g++:

g++ -O2 -std=c++17 -o genoloader GenoLoader.v3.2.cpp

Change permissions:

chmod u+x genoloader

Run the script

./genoloader sample.ann.vcf --p1 pop1.txt --p2 pop2.txt --p0 outgroup.txt \
             --m1 5 --m2 5 --m0 2 --polX POP_OUT --low_cov YES

Disclaimer: While the content of the c++ script has not been checked, its output (gt file) was identical with the gt file produced by the Python script on a 2,296,974-variant input vcf as checked with bash sha256sum.


Input files

1. VCF file (required)

A biallelic SNP VCF file generated by Freebayes or GATK, annotated with functional effects using SnpEff (the ANN= tag must be present in the INFO field). Both phased and unphased genotype notation are accepted.

The FORMAT field must include:

  • GT — genotype (required for all modes)
  • AD — allelic depth (required only when low_cov='YES')

Note: Only loci carrying a SnpEff ANN= annotation are written to the output. Unannotated loci are silently skipped.

2. Population list files (optional, plain text)

Each population is defined by a plain-text file containing one sample name per line, matching the sample names in the VCF header exactly. For example:

sample_A
sample_B
sample_C
  • Pop1 (p1): target population 1 (ingroup 1)
  • Pop2 (p2): target population 2 (ingroup 2), or closely related outgroup, used in combined polarization modes
  • Pop_out (p0): the outgroup population/species, used for unfolded polarization

Population files are used only to determine the ancestral allele. All individuals present in the VCF are recoded and written to the output, including those not assigned to any population group. IMPORTANT: The same individual may appear in more than one population file. You may want to play with individual assignment to refine your polarization.


Usage

Import and call the write_gt function directly in Python:

from GiNOLOADER_v3_2 import write_gt

write_gt(
    vcf_file = 'my_samples.ann.vcf',
    p1       = 'pop1_samples.txt',
    p2       = 'pop2_samples.txt',
    p0       = 'outgroup_samples.txt',
    m1       = 'min_ind_p1',
    m2       = 'min_ind_p2',
    m0       = 'min_ind_p0',
    polX     = 'polarization strategy',
    low_cov  = 'one_read_resampling'
)

Parameters

Parameter Type Default Description
vcf_file str required Path to the SnpEff-annotated VCF file
p1 str None Path to Pop1 sample list file
p2 str None Path to Pop2 sample list file
p0 str None Path to outgroup sample list file
m1 int 0 Minimum number of non-missing individuals required in Pop1 to retain a locus
m2 int 0 Minimum number of non-missing individuals required in Pop2 to retain a locus
m0 int 0 Minimum number of non-missing individuals required in the outgroup to retain a locus
polX str 'REF' Polarization strategy (see Polarization strategies)
low_cov str 'NO' Enable low-coverage read resampling: 'YES' or 'NO'

Tip: For genetic load estimation, set m1 and m2 (or only m1 depending on your polarization strategy), equal to the total number of individuals in the respective population files to exclude any locus with missing data. No missing data is good practice when estimating genetic load.


Polarization strategies

The polX parameter controls how the ancestral allele is determined at each locus.

polX value Description
'REF' No repolarization. The VCF reference allele is treated as ancestral.
'POP_OUT' Outgroup-based unfolded polarization. The allele present and fixed in the outgroup (p0) is inferred as ancestral. If not fixed, alternative behaviour is recorded
'POP1' The major allele in Pop1 is used as the ancestral state.
'POP2' The major allele in Pop2 is used as the ancestral state.
'POP1POP2' The major allele across Pop1 and Pop2 combined is used as the ancestral state.

Note: If a population file is not provided but the strategy requires it, the function falls back gracefully to using whatever information is available, as described in Polarization flags.


Polarization flags

Every locus in the output is assigned a diagnostic flag describing the allelic configuration used to determine the ancestral allele. These flags are written to the flag column of the .gt output file and allow full traceability of polarization decisions.

Polarization flags based on data configuration in target and outgroup samples

Flag Description
reference No repolarization; VCF reference allele used as ancestral (polX='REF')
allFix All groups (outgroup and ingroups) monomorphic for the same allele
allMiss All alleles missing across all groups; locus assigned missing
inMiss Both ingroup populations missing; ancestral allele inferred from outgroup modal allele
InvarOutMiss Outgroup missing; both ingroups monomorphic for the same allele
altFix Outgroup missing; Pop1 and Pop2 fixed for different alleles; Pop1 allele used (conservative)
in1FoldAnc Outgroup and Pop2 missing; Pop1 monomorphic; Pop1 allele used as ancestral
in1FoldSeg Outgroup and Pop2 missing; Pop1 polymorphic; major Pop1 allele used
in2FoldAnc Outgroup and Pop1 missing; Pop2 monomorphic; Pop2 allele used as ancestral
in2FoldSeg Outgroup and Pop1 missing; Pop2 polymorphic; major Pop2 allele used
unfoldFix1outMiss Outgroup missing; Pop1 monomorphic, Pop2 polymorphic; Pop1 allele used
unfoldFix2outMiss Outgroup missing; Pop2 monomorphic, Pop1 polymorphic; Pop2 allele used
inFold Outgroup missing; both ingroups polymorphic; major ingroup allele used (folded fallback)
unfolded2Miss1 Outgroup monomorphic; Pop1 missing, Pop2 present; outgroup allele used
unfolded1Miss2 Outgroup monomorphic; Pop2 missing, Pop1 present; outgroup allele used
unfoldedAltFix Outgroup monomorphic; both ingroups monomorphic for different alleles
unfoldedILS Outgroup monomorphic; both ingroups polymorphic (possible incomplete lineage sorting)
unfoldedFixAnc1 Outgroup monomorphic; Pop1 fixed for ancestral allele, Pop2 polymorphic
unfoldedFixDer1 Outgroup monomorphic; Pop1 fixed for derived allele, Pop2 polymorphic
unfoldedFixAnc2 Outgroup monomorphic; Pop2 fixed for ancestral allele, Pop1 polymorphic
unfoldedFixDer2 Outgroup monomorphic; Pop2 fixed for derived allele, Pop1 polymorphic
InFixAnc Outgroup polymorphic; ingroups monomorphic; ingroup allele used (conservative)
allFold Both outgroup and ingroups polymorphic; modal allele across all samples used (folded fallback)
triall Non-biallelic locus detected; all genotypes set to missing

Output files

Primary output: <vcf_file>.<polX>.gt

A tab-delimited file written for every run. Each row corresponds to one annotated SNP locus. The file contains the following columns:

Column Description
scaffold Scaffold or chromosome identifier (from VCF CHROM field)
position Genomic position (from VCF POS field)
effect Predicted fitness effect from SnpEff (e.g. HIGH)
vartype Simplified variant category: missense, synonymous, intergenic, intron, downstream, upstream, or else
flag Polarization flag (see Polarization flags)
ref Inferred ancestral allele (VCF REF allele if not repolarized; VCF ALT allele if repolarized)
<sample_1> ... <sample_N> Derived-allele dosage per individual: 0 (homozygous ancestral), 1 (heterozygous), 2 (homozygous derived), nan (missing)

The ref column always reflects the inferred ancestral allele, regardless of what is listed as REF in the original VCF. If the ancestral allele was the VCF ALT allele, all genotypes at that locus are inverted and the ALT allele is written to ref.

Example output (primary .gt file)

scaffold    position  effect            vartype     flag          ref  ind1  ind2  ind3  ind4
scaffold_1  10452     HIGH  missense    allFix        A    0     0     0     0
scaffold_1  18763     MODERATE synonymous unfoldedFixAnc1 G  0     1     0     2
scaffold_1  25001     LOW  missense    allFold       T    1     0     nan   2
scaffold_2  3301      MODIFIER    intron      reference     C    2     2     1     0

Low-coverage output: <vcf_file>.gt.covRandom1

Generated only when low_cov='YES'. This file has an identical structure to the primary .gt file, but instead of diploid dosage values it contains a single randomly sampled allele per individual per locus:

Value Meaning
0 Ancestral allele sampled
1 Derived allele sampled
. Missing (zero read depth or missing AD field)

This pseudo-haploid output is suitable for downstream analyses that require single-read data, such as those commonly applied to ancient or low-coverage genomes.


Low-coverage resampling

When low_cov='YES', GenoLoader reads the allelic depth (AD) subfield from the FORMAT column of the VCF for each individual. At each locus, a pool of alleles is constructed by listing the ancestral allele as many times as its read depth, and the derived allele as many times as its read depth. One allele is then drawn at random from this pool (equivalent to sampling a single sequencing read). Individuals with a missing or zero-depth AD field receive a missing value (.).

This approach emulates pseudo-haploid genotype calling, a standard strategy for ancient DNA and low-coverage data where reliable diploid genotype inference is not possible.

Requirement: The VCF FORMAT field must include the AD subfield (allelic depth), as produced by Freebayes or GATK. The AD values must appear as the third colon-delimited subfield in the FORMAT column (index 2, 0-based).


Example

After loading the functions in your Jupyter Lab notebook:

# Outgroup-based polarization with missingness filter and low-coverage resampling
write_gt(
    vcf_file = 'population_data.ann.vcf',
    p1       = 'focal_pop.txt',       # ingroup population of interest
    p2       = 'sister_pop.txt',      # closely related population
    p0       = 'outgroup.txt',        # outgroup for unfolded polarization
    m1       = 10,                    # require all 10 focal pop individuals
    m2       = 8,                     # require all 8 sister pop individuals
    m0       = 2,                     # require at least 2 outgroup individuals
    polX     = 'POP_OUT',
    low_cov  = 'YES'
)

This will produce:

  • population_data.POP_OUT.gt — diploid derived-allele dosage matrix
  • population_data.gt.covRandom1 — pseudo-haploid resampled allele matrix

Upon completion, the script reports the number of loci passing missingness filters and re-polarized from the original VCF file:

125430 loci recorded to file, of which 48201 repolarized to ancestral and 3102 not passing missingness filter

Citation

If you use GenoLoader in your research, please cite:

[Trucchi], [2026]. Genoloader: a Python package for derived allele polarization and genetic load estimation for high and low-coverage data. [Journal, DOI — to be updated upon publication]

About

genoloader: derived allele polarization and genetic load estimates with high and low coverage data

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors