Spike protein sequences and tree of different SARS-CoV-2 Pango lineages

This repository extracts the spike protein sequences and builds protein-based phylogenetic trees for different SARS-CoV-2 Pango lineages. Repository maintained by Jesse Bloom.

Quick links to results

Interactive Nextstrain phylogenetic trees of the protein sequences with branches in lengths of substitutions / deletions per site, and shared via Nextstrain Community builds. See https://nextstrain.org/community/jbloomlab/SARS2-spike-pango for a list of all the builds. Specifically:
TSV files giving lineages, their spike sequences, and maximum monthly frequencies are in https://github.com/jbloomlab/SARS2-spike-pango/tree/master/results/lineages. Note that aligned_spike is the spike aligned to Wuhan-Hu-1 with indels stripped, and complete spike is the full spike with indels. Key files:
Additional summary text files and timestamps for file downloads are in ./results/.

Overview

This repository aims to reconstruct the spike protein sequences and monthly frequencies of different SARS-CoV-2 Pango lineages. It draws data from three sources (see config.yaml for details on URLs used to get these data):

Spike protein sequences for lineages taken from LANL's COVID-19 Viral Genome Analysis Pipeline
Frequencies of different lineages over time taken from CoVSpectrum
Frequencies and lineage spike protein sequences reconstructed from Genbank by downloaded data and looking at UShER, Nextclade, and Genbank lineage annotations.

Structure of repository and running pipeline

The repository contains a snakemake pipeline. The pipeline is in Snakefile, which reads its configuration from config.yaml. All input data are in ./data/ and results are placed in ./results, only some of which are tracked in this repo (see .gitignore).

The configuration for the pipeline is in config.yaml.

The input data are:

./data/site_numbering_map.tsv: maps sequential numbering of alignment to spike numbers, assuming all is Wuhan-Hu-1 numbering.
./data/Wuhan-Hu-1_spike.fa: Wuhan-Hu-1 spike sequence used as outgroup.
./data/library_lineages.tsv: lineages selected for inclusion in experimental library; generated by manually running ./data/get_library_lineages.py on 2026-04-10.

To run the pipeline, build the conda environment in environment.yaml with:

conda env create -f environment.yaml

then activate it and run the pipeline:

conda activate SARS2-spike-pango
snakemake --software-deployment-method conda -j <n_cpus>

To run via slurm on the Hutch cluster, do

sbatch run_Hutch_cluster.bash

Steps in pipeline

Genbank data on lineage spike haplotypes and lineage frequencies

Get SARS-CoV-2 spike protein sequences from Genbank

Use ncbi-datasets to download all complete SARS-CoV-2 spike (S) protein sequences from Genbank. The downloaded files are saved to ./results/genbank_seqs/:

genomic.fna.gz: genome nucleotide sequences
data_report.jsonl.gz: raw metadata from NCBI
timestamp.txt: timestamp of the download

This step of downloading all of the spike sequences may take several hours.

Parse GenBank metadata

Parse the raw JSONL metadata from NCBI into a clean TSV (parsed_metadata.tsv.gz) with columns: accession, date, geographic_region, geographic_location, and pango lineage. The script cross-checks accessions between the metadata and genomic FASTA (errors if they are not the same sets), ordering output rows to match the FASTA. A summary report (parsed_metadata_summary.txt) is also produced with statistics on lineage counts, date ranges/percentiles, and accession cross-check results.

Get UShER metadata with Pango lineage annotations

Download the public UShER metadata file, which contains Pango lineage assignments for SARS-CoV-2 sequences. The URL is configured in config.yaml. The downloaded file is saved to ./results/usher_metadata/:

public-latest.metadata.tsv.gz: the metadata file with Pango lineage annotations
timestamp.txt: timestamp of the download

Classify sequences with nextclade

Run nextclade on all downloaded genomic sequences to assign Nextstrain clades and Pango lineages, and extract aligned spike protein sequences. The nextclade dataset is configured in config.yaml. Results are saved to ./results/nextclade/:

nextclade.tsv: classification results with columns for clade, Pango lineage, coverage, alignment score, QC metrics, and errors. All input sequences appear in this file; sequences that fail classification will have empty clade/lineage fields and a populated errors column.
nextclade.cds_translation.S.fasta.gz: aligned spike (S) protein sequences. Sequences that completely fail alignment may be absent from this file.

This step may take 12+ hours.

Parse nextclade results

Parse the raw nextclade TSV and aligned spike FASTA into a clean output TSV (parsed_nextclade_results.tsv.gz). Rows with errors (e.g., sequences too short to align) or failed S CDS translation are dropped. For each retained entry, the spike amino-acid substitutions, deletions, insertions, and missing sites are validated to be S-gene specific (the S: prefix is stripped), and the aligned spike protein sequence from the FASTA is joined in. The output columns are: accession, Nextstrain_clade, pango, qc.overallScore, substitutions, deletions, insertions, missing_sites, n_missing_sites, aligned_spike. A summary report (parsed_nextclade_results_summary.txt) is also produced with counts of total/dropped/parsed rows and top lineages/clades.

Merge metadata and spike sequences

Merge GenBank metadata, parsed nextclade results, and UShER metadata into a single per-sequence TSV (per_sequence_metadata_and_spikeseqs.tsv.gz) in ./results/genbank_merged_data/. Script: scripts/merge_genbank_metadata_and_spikeseqs.py. Each row has an accession with its collection date (from GenBank and UShER), geographic region, Pango lineage assignments from all three sources (GenBank, nextclade, UShER), nextclade QC score, spike mutations, and the aligned spike protein sequence. A summary report (per_sequence_metadata_and_spikeseqs_summary.txt) is also produced.

Compute lineage spike haplotypes

A marimo notebook (notebooks/get_genbank_lineage_haplotypes.py) that filters sequences by quality, assigns a consensus Pango lineage from the three assignment sources, and determines the consensus spike haplotype for each lineage. The consensus is computed by position-by-position majority vote from aligned spike sequences, excluding missing positions (X). Insertions are resolved by majority vote among sequences where the insertion site is not missing. A minor_mutations column lists non-consensus mutations found in at least minor_mutations_cutoff (configured in config.yaml) fraction of informative sequences, with their frequencies. Parameters (QC thresholds, date range, lineage priority order, reference spike, minor mutations cutoff) are configured in config.yaml and passed to the notebook via CLI args. Results are saved to ./results/genbank_lineages/:

spike_haplotypes.tsv: consensus spike haplotype per Pango lineage with metadata (accession counts, mutation counts, sequences), sorted by lineage
freqs_by_month.tsv: monthly frequency of each lineage with columns: lineage, month, lineage_counts, total_counts, monthly_frequency
freqs_by_month.html: interactive Altair chart of lineage frequencies over time
get_lineage_haplotypes.html: full notebook output exported as HTML (code hidden)
get_lineage_haplotypes_summary.txt: text-only summary of the notebook output

LANL data on lineage spike haplotypes

Download LANL consensus spike haplotypes

Download the LANL consensus spike haplotype archive from cov.lanl.gov. The tar.gz contains SPIKE.short.all.Wuhan.txt with the most common form (consensus) of each Pango lineage and its mutations relative to Wuhan-Hu-1. The URL is configured in config.yaml. Downloaded to ./results/lanl_downloads/.

Parse LANL haplotypes

Parse the LANL consensus file to extract the (consensus) line for each lineage, validate mutations against the Wuhan-Hu-1 reference spike, and reconstruct full spike sequences. Lineages with invalid names (e.g., "None") are skipped with a warning. Mutations are parsed as substitutions (e.g., T19I), deletions (e.g., H69-), and insertions (e.g., 214:EPE). The reference amino acid at each mutation position is validated against the Wuhan-Hu-1 spike. Script: scripts/parse_lanl_haplotypes.py. Output: ./results/lanl_lineages/spike_haplotypes.tsv (sorted by lineage) and ./results/lanl_lineages/get_lineage_haplotypes_summary.txt.

CovSpectrum data on lineage frequencies

Download CovSpectrum aggregated counts

Download aggregated lineage counts by date from the CovSpectrum LAPIS API (open data from GenBank/ENA/DDBJ). The API returns a TSV with columns date, pangoLineage, count. The URL is configured in config.yaml. Downloaded to ./results/covspectrum_downloads/.

Parse CovSpectrum frequencies

Aggregate daily counts into monthly lineage frequencies using the same day-rounding convention as the Genbank pipeline (day >= 16 rounds up to next month). Rows with missing date or lineage are dropped. Script: scripts/parse_covspectrum_freqs.py. Output: ./results/covspectrum_lineages/freqs_by_month.tsv (columns: lineage, month, lineage_counts, total_counts, monthly_frequency; sorted by month ascending then frequency descending) and ./results/covspectrum_lineages/parse_covspectrum_freqs_summary.txt.

Integrated spike haplotypes

Integrate Genbank and LANL spike haplotypes

Outer join Genbank and LANL spike_haplotypes.tsv on lineage and classify concordance:

equal: both sources present, identical complete_spike
differ: both present, different complete_spike (mutation-level diff recorded in lanl_genbank_differences)
genbank_only / lanl_only: lineage in only one source

When sources differ, LANL values are used for the merged mutation and sequence columns.

Annotations: External TSV files configured under add_annotations in config.yaml are used to add boolean annotation columns (e.g., in_library). Each entry maps an annotation name to a TSV file with a lineage column; lineages present in the TSV are labeled "yes", others "no". The script errors if the TSV contains lineages not found in the merged data.

Script: scripts/integrate_spike_haplotypes.py. Output: ./results/lineages/spike_haplotypes.tsv (columns: lineage, lanl_genbank_concordance, total_substitutions, total_mutations, substitutions, deletions, insertions, lanl_genbank_differences, ambiguous_sites_or_premature_stop, annotation columns from add_annotations, aligned_spike, complete_spike; sorted by lineage) and ./results/lineages/integrate_spike_haplotypes_summary.txt.

Integrated frequencies

Integrate Genbank and CovSpectrum frequencies

Full outer join of Genbank and CovSpectrum freqs_by_month.tsv on (lineage, month). For months where a lineage has no data in one source, lineage_counts is filled with 0 and total_counts is looked up from the source's monthly totals. Script: scripts/integrate_freqs_by_month.py. Output: ./results/lineages/freqs_by_month.tsv (columns: lineage, month, lineage_counts_covspectrum, lineage_counts_genbank, total_counts_covspectrum, total_counts_genbank, monthly_frequency_covspectrum, monthly_frequency_genbank; sorted by month ascending then max frequency descending) and ./results/lineages/integrate_freqs_by_month_summary.txt.

Combined haplotypes with frequency summaries

Integrate haplotypes and frequencies

Combine the integrated spike haplotypes with per-lineage frequency summary statistics from the integrated frequency data. Lineages with no frequency counts in either Genbank or CovSpectrum are dropped (these are LANL-only lineages absent from both sequence databases). Produces two outputs:

Uncollapsed (spike_haplotypes_w_freqs_uncollapsed.tsv): one row per lineage with max_monthly_frequency, median_month, total_counts, and source-specific variants of each (CovSpectrum and Genbank), plus all haplotype columns. The median_month is the month containing the median sequence when sequences are sorted chronologically; for the combined metric, per-month counts are the max of CovSpectrum and Genbank counts to avoid double-counting.
Collapsed (spike_haplotypes_w_freqs_collapsed.tsv): lineages with identical complete_spike are grouped. The representative lineage is the one with the highest max_monthly_frequency. Other lineages in the group are listed in equivalent_lineages. Frequency stats are re-aggregated from the raw monthly data for the group.

Both files are sorted by median_month descending, then max_monthly_frequency descending, then lineage ascending. Script: scripts/integrate_haplotypes_and_freqs.py. Output: ./results/lineages/spike_haplotypes_w_freqs_uncollapsed.tsv, ./results/lineages/spike_haplotypes_w_freqs_collapsed.tsv, and ./results/lineages/integrate_haplotypes_and_freqs_summary.txt.

Build protein-based phylogenetic trees

For each tree defined under trees: in config.yaml, filter lineages from the collapsed haplotypes TSV using a pandas query_str (or include all lineages when query_str is null), then build a Nextstrain Auspice JSON tree using the nextstrain-prot-titers-tree submodule. The submodule uses IQ-TREE (Poisson model) for phylogenetic inference and TreeTime for ancestral reconstruction. Each tree's nextstrain_prot_titers_tree_config block is passed to the submodule. Intermediate results are saved to ./results/trees/{tree_name}/ and final Auspice JSONs are placed in ./auspice/. Trees can be viewed by uploading to auspice.us.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spike protein sequences and tree of different SARS-CoV-2 Pango lineages

Quick links to results

Overview

Structure of repository and running pipeline

Steps in pipeline

Genbank data on lineage spike haplotypes and lineage frequencies

Get SARS-CoV-2 spike protein sequences from Genbank

Parse GenBank metadata

Get UShER metadata with Pango lineage annotations

Classify sequences with nextclade

Parse nextclade results

Merge metadata and spike sequences

Compute lineage spike haplotypes

LANL data on lineage spike haplotypes

Download LANL consensus spike haplotypes

Parse LANL haplotypes

CovSpectrum data on lineage frequencies

Download CovSpectrum aggregated counts

Parse CovSpectrum frequencies

Integrated spike haplotypes

Integrate Genbank and LANL spike haplotypes

Integrated frequencies

Integrate Genbank and CovSpectrum frequencies

Combined haplotypes with frequency summaries

Integrate haplotypes and frequencies

Build protein-based phylogenetic trees

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
auspice		auspice
data		data
nextstrain-prot-titers-tree @ a706151		nextstrain-prot-titers-tree @ a706151
notebooks		notebooks
results		results
scripts		scripts
.gitignore		.gitignore
.gitmodules		.gitmodules
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
config.yaml		config.yaml
environment.yaml		environment.yaml
run_Hutch_cluster.bash		run_Hutch_cluster.bash

Folders and files

Latest commit

History

Repository files navigation

Spike protein sequences and tree of different SARS-CoV-2 Pango lineages

Quick links to results

Overview

Structure of repository and running pipeline

Steps in pipeline

Genbank data on lineage spike haplotypes and lineage frequencies

Get SARS-CoV-2 spike protein sequences from Genbank

Parse GenBank metadata

Get UShER metadata with Pango lineage annotations

Classify sequences with nextclade

Parse nextclade results

Merge metadata and spike sequences

Compute lineage spike haplotypes

LANL data on lineage spike haplotypes

Download LANL consensus spike haplotypes

Parse LANL haplotypes

CovSpectrum data on lineage frequencies

Download CovSpectrum aggregated counts

Parse CovSpectrum frequencies

Integrated spike haplotypes

Integrate Genbank and LANL spike haplotypes

Integrated frequencies

Integrate Genbank and CovSpectrum frequencies

Combined haplotypes with frequency summaries

Integrate haplotypes and frequencies

Build protein-based phylogenetic trees

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages