Add T2T-CHM13v2.0 reference genome support (CHM13-T2T) #250

Open
ljwharbers wants to merge 4 commits into SigProfilerSuite:master from ljwharbers:worktree-chm13-support

@ljwharbers ljwharbers commented Jun 11, 2026

This PR adds support for CHM13. It was written with assistance from Claude, and I have verified the changes personally. Would it be possible to host the .tar.gz alongside your other hosted genomes?

Summary

  • Registers CHM13-T2T (T2T-CHM13v2.0, nuclear chromosomes 1–22, X, Y) as a first-class supported genome
  • Adds per-chromosome MD5 checksums to CHECKSUMS in reference_genome_manager.py so the standard SigProfilerMatrixGenerator install CHM13-T2T command works once the tarball is on the FTP server
  • Adds CHM13-T2T to all chrom_orders dictionaries in SigProfilerMatrixGeneratorFunc.py (9 locations) and MutationMatrixGenerator.py to prevent KeyError on any code path
  • Updates CLI help strings, README supported-genomes list, and CHANGELOG
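
As a rough illustration of the per-chromosome checksum idea in the summary above, here is a minimal sketch. It does not reproduce the actual `CHECKSUMS` layout in `reference_genome_manager.py`: the `expected` mapping, the `.txt` file suffix, and the helper names `md5_of_file` and `find_checksum_mismatches` are all assumptions made for this example.

```python
import hashlib
from pathlib import Path

# Nuclear chromosomes registered for CHM13-T2T: 1-22, X, Y (no mitochondrion).
CHM13_CHROMS = [str(i) for i in range(1, 23)] + ["X", "Y"]


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_checksum_mismatches(genome_dir, expected, suffix=".txt"):
    """Return chromosomes whose file is missing or whose MD5 differs.

    `expected` maps chromosome name -> hex MD5. Both this layout and the
    `<chrom><suffix>` file naming are hypothetical, chosen for illustration.
    """
    bad = []
    for chrom, want in expected.items():
        path = Path(genome_dir) / f"{chrom}{suffix}"
        if not path.is_file() or md5_of_file(path) != want:
            bad.append(chrom)
    return bad
```

A check like this would report every chromosome that needs to be re-downloaded or rebuilt, instead of stopping at the first mismatch.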

ljwharbers and others added 2 commits June 11, 2026 18:32
Register CHM13-T2T (nuclear chromosomes 1-22, X, Y; no mitochondrion)
as a first-class supported genome. Adds CHECKSUMS entry, extends all
chrom_orders dicts, updates CLI help, README, and CHANGELOG.

FTP upload of CHM13-T2T.tar.gz is required before install works end-to-end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Matrix generation reads transcript_path from the package directory
(references/chromosomes/transcripts/<genome>/), not the volume, so the
per-chromosome transcript files must ship in the repo like every other
supported genome. These were missing for CHM13-T2T, causing
SigProfilerMatrixGeneratorFunc to fail with FileNotFoundError.

Adds the 24 transcript files (chr 1-22, X, Y) so end-to-end matrix
generation works for CHM13-T2T.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
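
The FileNotFoundError described in the commit above could be caught before matrix generation with a pre-flight check along these lines. This is only a sketch: the directory layout follows the path quoted in the commit message, but the `{chrom}_transcripts.txt` file-name pattern and the helper name `missing_transcript_files` are assumptions, not confirmed against the repository.

```python
from pathlib import Path

# Nuclear chromosomes expected for CHM13-T2T: 1-22, X, Y (24 files in total).
CHM13_CHROMS = [str(i) for i in range(1, 23)] + ["X", "Y"]


def missing_transcript_files(package_dir, genome="CHM13-T2T",
                             pattern="{chrom}_transcripts.txt"):
    """List chromosomes that have no transcript file in the package directory.

    The `pattern` default is a hypothetical naming scheme; adjust it to
    match the files that actually ship in the repo.
    """
    transcript_dir = (Path(package_dir) / "references" / "chromosomes"
                      / "transcripts" / genome)
    return [
        chrom for chrom in CHM13_CHROMS
        if not (transcript_dir / pattern.format(chrom=chrom)).is_file()
    ]
```

An empty return value means all 24 per-chromosome files are present for the genome.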
@ljwharbers ljwharbers force-pushed the worktree-chm13-support branch from 15d42fd to 87c3d22 Compare June 11, 2026 18:48
The previous TSB was built from chm13v2.0_maskedY_rCRS.fa, which
hard-masks chrY PAR1 (positions 1-2,394,410) with Ns. This caused
"reference base does not match" errors for any VCF called against
the iGenomes UCSC CHM13 reference (unmasked PAR1).

Rebuilt all 24 TSB files from the iGenomes CHM13 genome.fa
(s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/CHM13/). Only chrY
checksum changed; chr1-22 and X are identical between the two
references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
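
To illustrate why a hard-masked PAR1 produces "reference base does not match" errors, the sketch below compares a VCF REF allele against a reference sequence and measures the fraction of Ns in an interval. The PAR1 end coordinate comes from the commit message above; the function names and the in-memory string representation of the chromosome are assumptions for this example and do not reflect how SigProfilerMatrixGenerator stores its reference.

```python
# End of chrY PAR1 in the masked reference, per the commit message (1-based).
PAR1_END = 2_394_410


def n_fraction(sequence, start, end):
    """Fraction of N bases in the 1-based, inclusive interval [start, end]."""
    region = sequence[start - 1:end].upper()
    if not region:
        raise ValueError("interval is empty or lies outside the sequence")
    return region.count("N") / len(region)


def ref_base_matches(sequence, pos, ref):
    """True if the VCF REF allele equals the reference at 1-based `pos`."""
    observed = sequence[pos - 1:pos - 1 + len(ref)]
    return observed.upper() == ref.upper()


# Toy sequences standing in for the start of chrY in each reference build.
masked = "NNNNACGT"
unmasked = "ACGTACGT"
print(ref_base_matches(unmasked, 2, "C"))  # True
print(ref_base_matches(masked, 2, "C"))    # False: the base is masked to N
```

On a real chrY sequence, a value of `n_fraction(seq, 1, PAR1_END)` close to 1.0 would suggest the reference was built from the masked FASTA.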
@ljwharbers ljwharbers marked this pull request as ready for review June 12, 2026 10:51
@ljwharbers

I'm happy to share the .tar of the build for you to host if this looks good to you. Just let me know your preferred sharing method.

@ljwharbers ljwharbers mentioned this pull request Jun 19, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>