Fix #394: in-frame stop-codon deletion with empty 3' UTR is a Deletion #395
Merged
An in-frame deletion that removes the stop codon of a transcript with no
3' UTR sequence (e.g. MAPK3-006 / ENST00000395199, whose
three_prime_utr_sequence is "") raised:
ValueError: If no amino acids added by StopLoss then it should be Silent
translate_in_frame_mutation sets using_three_prime_utr=True whenever the
mutation runs past the reference stop codon, even when there is no UTR
sequence to translate into. predict_in_frame_coding_effect then took the
StopLoss branch with an empty aa_alt, which the StopLoss constructor
rejects.
Only emit StopLoss when readthrough actually adds residues (n_aa_alt > 0);
otherwise fall through to the existing n_aa_alt == 0 branch, which reports
a C-terminal Deletion. This is the honest classification: with no UTR
sequence we cannot predict an extended protein.
The earlier #246 fix only covered transcripts with a non-empty 3' UTR.
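The branch order described above can be sketched as follows. This is a minimal illustration, not the library's code: `ToyEffect` and `classify_past_stop` are hypothetical names, the function covers only the two branches discussed here, and the readthrough residues in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyEffect:
    """Hypothetical stand-in for an effect object (not a real library class)."""
    kind: str
    aa_ref: str
    aa_alt: str


def classify_past_stop(aa_ref: str, aa_alt: str,
                       using_three_prime_utr: bool) -> ToyEffect:
    """Classify an in-frame change that may run past the reference stop codon.

    StopLoss is emitted only when readthrough actually added residues;
    an empty aa_alt falls through to the C-terminal Deletion branch.
    """
    n_aa_alt = len(aa_alt)
    if using_three_prime_utr and n_aa_alt > 0:
        # Readthrough into the 3' UTR produced new amino acids.
        return ToyEffect("StopLoss", aa_ref, aa_alt)
    if n_aa_alt == 0:
        # Nothing was added (e.g. the UTR sequence is empty): report a deletion.
        return ToyEffect("Deletion", aa_ref, aa_alt)
    # All other cases are outside the scope of this sketch.
    return ToyEffect("Other", aa_ref, aa_alt)


# Empty UTR: the flag is set but no residues were added.
no_utr = classify_past_stop("GGT", "", using_three_prime_utr=True)
# Non-empty UTR: readthrough adds (made-up) residues.
with_utr = classify_past_stop("GGT", "GGTRSQ", using_three_prime_utr=True)
print(no_utr.kind, with_utr.kind)
```

Checking `n_aa_alt > 0` before taking the StopLoss branch means the StopLoss constructor is never handed an empty `aa_alt`, which is the condition that raised the ValueError.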
Tests: the reported MAPK3 variant exercises both sides from one input —
ENST00000395199 (no UTR) -> Deletion, ENST00000403394 (804nt UTR) ->
StopLoss readthrough — plus a splice-free unit test of the branch.
Claude-Session: https://claude.ai/code/session_0149VWj5Rm1rYFf9azu4ry62
02538af to 6858f3e
The default annotator (protein_diff -> classify_from_protein_diff)
independently mislabeled the same empty-3'UTR stop deletion as a
PrematureStop ("p.G355*") on a clean in-exon variant. That's wrong: the
mutant CDS contains no stop codon at all — the protein simply runs off
the end of the available sequence. A PrematureStop requires a stop codon
to exist.
Guard the PrematureStop branch so it only fires when the mutant
transcript actually terminated at a stop codon (i.e. translation did not
consume every available codon). When translation ran off the end, fall
through to the C-terminal Deletion classification, matching the in-frame
predictor. Genuine premature stops (a real stop codon in the mutant CDS)
are unaffected.
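The guard can be illustrated with a toy codon scan. This is a sketch under stated assumptions: `scan_for_stop` and `classify_truncation` are hypothetical helpers, not functions of the annotator, and the sequences in the usage lines are invented. The idea is to record whether translation ended at a stop codon or simply ran out of codons, and to report a PrematureStop only in the first case.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def scan_for_stop(seq: str) -> Tuple[int, bool]:
    """Return (sense codons read, True if a stop codon ended translation)."""
    n_complete = len(seq) // 3
    for i in range(n_complete):
        codon = seq[3 * i:3 * i + 3].upper()
        if codon in STOP_CODONS:
            return i, True
    # Every available codon was consumed without finding a stop.
    return n_complete, False


def classify_truncation(mutant_seq: str, ref_protein_length: int) -> str:
    """Classify a shortened protein; hypothetical, covers only this branch."""
    n_aa, hit_stop = scan_for_stop(mutant_seq)
    if n_aa < ref_protein_length:
        # A PrematureStop requires that a stop codon actually exists.
        return "PrematureStop" if hit_stop else "Deletion"
    return "Other"


# Toy reference: ATG GGT GGT ACC TAA, a 4-residue protein.
ref_len = scan_for_stop("ATGGGTGGTACCTAA")[0]
# In-frame deletion removing ACC and the stop codon, with nothing after it.
print(classify_truncation("ATGGGTGGT", ref_len))
# A genuine stop codon introduced early in the mutant sequence.
print(classify_truncation("ATGTAAGGTACCTAA", ref_len))
```

Returning the "hit a stop" flag alongside the residue count keeps the two situations distinguishable even though both produce a protein shorter than the reference.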
Now both annotators agree this scenario is a Deletion, honoring the #394
decision that an empty-UTR stop deletion is a C-terminal deletion rather
than a stop-loss with no predictable readthrough.
Tests: clean in-exon deletion via the default annotator -> Deletion;
genuine premature-stop insertion (BRCA1) -> still PrematureStop.
Bump version to 6.0.1 and date the CHANGELOG entry.
Claude-Session: https://claude.ai/code/session_0149VWj5Rm1rYFf9azu4ry62
Problem
Closes #394.
An in-frame deletion that removes the stop codon of a transcript with no 3' UTR sequence crashes:
ValueError: If no amino acids added by StopLoss then it should be Silent
(The reporter's original start=30128150 is off by one against the GRCh37 reference; 30128151 is the equivalent ref-matching coordinate. Same MAPK3 region, same crash.)
Root cause
translate_in_frame_mutation sets using_three_prime_utr=True whenever the mutation runs past the reference stop codon, even when the transcript has no 3' UTR sequence to translate into (transcript.three_prime_utr_sequence == "", e.g. MAPK3-006 / ENST00000395199). predict_in_frame_coding_effect then takes the StopLoss branch with an empty aa_alt, which the StopLoss constructor rejects.
Fix
Only emit StopLoss when readthrough actually adds residues (n_aa_alt > 0); otherwise fall through to the existing n_aa_alt == 0 branch, which reports a C-terminal Deletion. With no UTR sequence we cannot predict an extended protein, so a Deletion is the honest classification, matching the suggestion in #394.
The earlier #246 fix only handled transcripts with a non-empty 3' UTR.
Tests
The reported MAPK3 variant conveniently exercises both sides of the fix from a single input:
- ENST00000395199 (no 3' UTR) → Deletion p.GGT354del: the fixed crash
- ENST00000403394 (804 nt 3' UTR) → StopLoss readthrough: proves the Deletion fallback wasn't over-broadened
- v.effects() across all MAPK3 transcripts no longer raises
- splice-free unit test of predict_in_frame_coding_effect for the empty-UTR branch
Added to tests/test_stop_codon_classification_bugs.py. The full suite passes except for one unrelated failure, test_filter_effects_by_transcript_expression, which is pre-existing on main and is sensitive to the locally installed Ensembl release.
https://claude.ai/code/session_0149VWj5Rm1rYFf9azu4ry62