This repository extracts the spike protein sequences and builds protein-based phylogenetic trees for different SARS-CoV-2 Pango lineages. Repository maintained by Jesse Bloom.
-
Interactive Nextstrain phylogenetic trees of the protein sequences with branches in lengths of substitutions / deletions per site, and shared via Nextstrain Community builds. See https://nextstrain.org/community/jbloomlab/SARS2-spike-pango for a list of all the builds. Specifically:
-
TSV files giving lineages, their spike sequences, and maximum monthly frequencies are in https://github.com/jbloomlab/SARS2-spike-pango/tree/master/results/lineages. Note that aligned_spike is the spike aligned to Wuhan-Hu-1 with indels stripped, and complete spike is the full spike with indels. Key files:
-
Additional summary text files and timestamps for file downloads are in ./results/.
This repository aims to reconstruct the spike protein sequences and monthly frequencies of different SARS-CoV-2 Pango lineages. It draws data from three sources (see config.yaml for details on URLs used to get these data):
- Spike protein sequences for lineages taken from LANL's COVID-19 Viral Genome Analysis Pipeline
- Frequencies of different lineages over time taken from CoVSpectrum
- Frequencies and lineage spike protein sequences reconstructed from Genbank by downloaded data and looking at UShER, Nextclade, and Genbank lineage annotations.
The repository contains a snakemake pipeline.
The pipeline is in Snakefile, which reads its configuration from config.yaml.
All input data are in ./data/ and results are placed in ./results, only some of which are tracked in this repo (see .gitignore).
The configuration for the pipeline is in config.yaml.
The input data are:
- ./data/site_numbering_map.tsv: maps sequential numbering of alignment to spike numbers, assuming all is Wuhan-Hu-1 numbering.
- ./data/Wuhan-Hu-1_spike.fa: Wuhan-Hu-1 spike sequence used as outgroup.
- ./data/library_lineages.tsv: lineages selected for inclusion in experimental library; generated by manually running ./data/get_library_lineages.py on 2026-04-10.
To run the pipeline, build the conda environment in environment.yaml with:
conda env create -f environment.yamlthen activate it and run the pipeline:
conda activate SARS2-spike-pango
snakemake --software-deployment-method conda -j <n_cpus>To run via slurm on the Hutch cluster, do
sbatch run_Hutch_cluster.bashUse ncbi-datasets to download all complete SARS-CoV-2 spike (S) protein sequences from Genbank.
The downloaded files are saved to ./results/genbank_seqs/:
genomic.fna.gz: genome nucleotide sequencesdata_report.jsonl.gz: raw metadata from NCBItimestamp.txt: timestamp of the download
This step of downloading all of the spike sequences may take several hours.
Parse the raw JSONL metadata from NCBI into a clean TSV (parsed_metadata.tsv.gz) with columns: accession, date, geographic_region, geographic_location, and pango lineage.
The script cross-checks accessions between the metadata and genomic FASTA (errors if they are not the same sets), ordering output rows to match the FASTA.
A summary report (parsed_metadata_summary.txt) is also produced with statistics on lineage counts, date ranges/percentiles, and accession cross-check results.
Download the public UShER metadata file, which contains Pango lineage assignments for SARS-CoV-2 sequences.
The URL is configured in config.yaml.
The downloaded file is saved to ./results/usher_metadata/:
public-latest.metadata.tsv.gz: the metadata file with Pango lineage annotationstimestamp.txt: timestamp of the download
Run nextclade on all downloaded genomic sequences to assign Nextstrain clades and Pango lineages, and extract aligned spike protein sequences.
The nextclade dataset is configured in config.yaml.
Results are saved to ./results/nextclade/:
nextclade.tsv: classification results with columns for clade, Pango lineage, coverage, alignment score, QC metrics, and errors. All input sequences appear in this file; sequences that fail classification will have empty clade/lineage fields and a populated errors column.nextclade.cds_translation.S.fasta.gz: aligned spike (S) protein sequences. Sequences that completely fail alignment may be absent from this file.
This step may take 12+ hours.
Parse the raw nextclade TSV and aligned spike FASTA into a clean output TSV (parsed_nextclade_results.tsv.gz).
Rows with errors (e.g., sequences too short to align) or failed S CDS translation are dropped.
For each retained entry, the spike amino-acid substitutions, deletions, insertions, and missing sites are validated to be S-gene specific (the S: prefix is stripped), and the aligned spike protein sequence from the FASTA is joined in.
The output columns are: accession, Nextstrain_clade, pango, qc.overallScore, substitutions, deletions, insertions, missing_sites, n_missing_sites, aligned_spike.
A summary report (parsed_nextclade_results_summary.txt) is also produced with counts of total/dropped/parsed rows and top lineages/clades.
Merge GenBank metadata, parsed nextclade results, and UShER metadata into a single per-sequence TSV (per_sequence_metadata_and_spikeseqs.tsv.gz) in ./results/genbank_merged_data/. Script: scripts/merge_genbank_metadata_and_spikeseqs.py.
Each row has an accession with its collection date (from GenBank and UShER), geographic region, Pango lineage assignments from all three sources (GenBank, nextclade, UShER), nextclade QC score, spike mutations, and the aligned spike protein sequence.
A summary report (per_sequence_metadata_and_spikeseqs_summary.txt) is also produced.
A marimo notebook (notebooks/get_genbank_lineage_haplotypes.py) that filters sequences by quality, assigns a consensus Pango lineage from the three assignment sources, and determines the consensus spike haplotype for each lineage.
The consensus is computed by position-by-position majority vote from aligned spike sequences, excluding missing positions (X).
Insertions are resolved by majority vote among sequences where the insertion site is not missing.
A minor_mutations column lists non-consensus mutations found in at least minor_mutations_cutoff (configured in config.yaml) fraction of informative sequences, with their frequencies.
Parameters (QC thresholds, date range, lineage priority order, reference spike, minor mutations cutoff) are configured in config.yaml and passed to the notebook via CLI args.
Results are saved to ./results/genbank_lineages/:
spike_haplotypes.tsv: consensus spike haplotype per Pango lineage with metadata (accession counts, mutation counts, sequences), sorted by lineagefreqs_by_month.tsv: monthly frequency of each lineage with columns: lineage, month, lineage_counts, total_counts, monthly_frequencyfreqs_by_month.html: interactive Altair chart of lineage frequencies over timeget_lineage_haplotypes.html: full notebook output exported as HTML (code hidden)get_lineage_haplotypes_summary.txt: text-only summary of the notebook output
Download the LANL consensus spike haplotype archive from cov.lanl.gov.
The tar.gz contains SPIKE.short.all.Wuhan.txt with the most common form (consensus) of each Pango lineage and its mutations relative to Wuhan-Hu-1.
The URL is configured in config.yaml.
Downloaded to ./results/lanl_downloads/.
Parse the LANL consensus file to extract the (consensus) line for each lineage, validate mutations against the Wuhan-Hu-1 reference spike, and reconstruct full spike sequences.
Lineages with invalid names (e.g., "None") are skipped with a warning.
Mutations are parsed as substitutions (e.g., T19I), deletions (e.g., H69-), and insertions (e.g., 214:EPE).
The reference amino acid at each mutation position is validated against the Wuhan-Hu-1 spike.
Script: scripts/parse_lanl_haplotypes.py.
Output: ./results/lanl_lineages/spike_haplotypes.tsv (sorted by lineage) and ./results/lanl_lineages/get_lineage_haplotypes_summary.txt.
Download aggregated lineage counts by date from the CovSpectrum LAPIS API (open data from GenBank/ENA/DDBJ).
The API returns a TSV with columns date, pangoLineage, count.
The URL is configured in config.yaml.
Downloaded to ./results/covspectrum_downloads/.
Aggregate daily counts into monthly lineage frequencies using the same day-rounding convention as the Genbank pipeline (day >= 16 rounds up to next month).
Rows with missing date or lineage are dropped.
Script: scripts/parse_covspectrum_freqs.py.
Output: ./results/covspectrum_lineages/freqs_by_month.tsv (columns: lineage, month, lineage_counts, total_counts, monthly_frequency; sorted by month ascending then frequency descending) and ./results/covspectrum_lineages/parse_covspectrum_freqs_summary.txt.
Outer join Genbank and LANL spike_haplotypes.tsv on lineage and classify concordance:
equal: both sources present, identicalcomplete_spikediffer: both present, differentcomplete_spike(mutation-level diff recorded inlanl_genbank_differences)genbank_only/lanl_only: lineage in only one source
When sources differ, LANL values are used for the merged mutation and sequence columns.
Annotations: External TSV files configured under add_annotations in config.yaml are used to add boolean annotation columns (e.g., in_library). Each entry maps an annotation name to a TSV file with a lineage column; lineages present in the TSV are labeled "yes", others "no". The script errors if the TSV contains lineages not found in the merged data.
Script: scripts/integrate_spike_haplotypes.py.
Output: ./results/lineages/spike_haplotypes.tsv (columns: lineage, lanl_genbank_concordance, total_substitutions, total_mutations, substitutions, deletions, insertions, lanl_genbank_differences, ambiguous_sites_or_premature_stop, annotation columns from add_annotations, aligned_spike, complete_spike; sorted by lineage) and ./results/lineages/integrate_spike_haplotypes_summary.txt.
Full outer join of Genbank and CovSpectrum freqs_by_month.tsv on (lineage, month).
For months where a lineage has no data in one source, lineage_counts is filled with 0 and total_counts is looked up from the source's monthly totals.
Script: scripts/integrate_freqs_by_month.py.
Output: ./results/lineages/freqs_by_month.tsv (columns: lineage, month, lineage_counts_covspectrum, lineage_counts_genbank, total_counts_covspectrum, total_counts_genbank, monthly_frequency_covspectrum, monthly_frequency_genbank; sorted by month ascending then max frequency descending) and ./results/lineages/integrate_freqs_by_month_summary.txt.
Combine the integrated spike haplotypes with per-lineage frequency summary statistics from the integrated frequency data. Lineages with no frequency counts in either Genbank or CovSpectrum are dropped (these are LANL-only lineages absent from both sequence databases). Produces two outputs:
- Uncollapsed (
spike_haplotypes_w_freqs_uncollapsed.tsv): one row per lineage withmax_monthly_frequency,median_month,total_counts, and source-specific variants of each (CovSpectrum and Genbank), plus all haplotype columns. Themedian_monthis the month containing the median sequence when sequences are sorted chronologically; for the combined metric, per-month counts are the max of CovSpectrum and Genbank counts to avoid double-counting. - Collapsed (
spike_haplotypes_w_freqs_collapsed.tsv): lineages with identicalcomplete_spikeare grouped. The representative lineage is the one with the highestmax_monthly_frequency. Other lineages in the group are listed inequivalent_lineages. Frequency stats are re-aggregated from the raw monthly data for the group.
Both files are sorted by median_month descending, then max_monthly_frequency descending, then lineage ascending.
Script: scripts/integrate_haplotypes_and_freqs.py.
Output: ./results/lineages/spike_haplotypes_w_freqs_uncollapsed.tsv, ./results/lineages/spike_haplotypes_w_freqs_collapsed.tsv, and ./results/lineages/integrate_haplotypes_and_freqs_summary.txt.
For each tree defined under trees: in config.yaml, filter lineages from the collapsed haplotypes TSV using a pandas query_str (or include all lineages when query_str is null), then build a Nextstrain Auspice JSON tree using the nextstrain-prot-titers-tree submodule.
The submodule uses IQ-TREE (Poisson model) for phylogenetic inference and TreeTime for ancestral reconstruction.
Each tree's nextstrain_prot_titers_tree_config block is passed to the submodule.
Intermediate results are saved to ./results/trees/{tree_name}/ and final Auspice JSONs are placed in ./auspice/.
Trees can be viewed by uploading to auspice.us.