fix: predictCoding on empty ranges returns AAStringSet for REFAA/VARAA (#86) #92

Open
jmg421 wants to merge 2 commits into Bioconductor:devel from jmg421:fix/issue-86-predictCoding-empty-AAStringSet
@jmg421 jmg421 commented Jun 12, 2026

Problem

When predictCoding() is called with a query that does not overlap any CDS (e.g. a non-coding variant), .localCoordinates() returns a zero-length GRanges. The early-exit guard for that case returned txlocal directly, before the REFAA/VARAA columns were added to mcols(). As a result, both columns were NULL, and downstream operations on them such as reverse() or subseq() threw errors.

Reproducer from #86:

library(VariantAnnotation)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")[123]  # non-coding variant
seqlevels(vcf) <- "chr22"
coding <- predictCoding(vcf, TxDb.Hsapiens.UCSC.hg19.knownGene, seqSource=Hsapiens)

coding$REFAA  # NULL — should be AAStringSet()
coding$VARAA  # NULL — should be AAStringSet()
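
To illustrate why the missing columns matter downstream, the sketch below contrasts NULL with a zero-length AAStringSet. It assumes only that Biostrings is installed; `missing_col` and `empty_col` are illustrative names, not objects from the package.

```r
library(Biostrings)

missing_col <- NULL            # what coding$REFAA evaluated to in the reproducer
empty_col   <- AAStringSet()   # what it is expected to be

# Sequence operations are defined for an empty AAStringSet and
# return another empty AAStringSet.
rev_empty <- reverse(empty_col)
stopifnot(is(rev_empty, "AAStringSet"), length(rev_empty) == 0L)

# On NULL the same call is expected to fail; capture the condition
# instead of letting it abort the script.
res <- tryCatch(reverse(missing_col), error = function(e) e)
inherits(res, "error")
```

Code that post-processes predictCoding() output can therefore stay free of special cases if the columns are always present, even when empty.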

Fix

Two changes in R/methods-predictCoding.R:

  1. Remove the early return on length(txlocal) == 0 — let execution fall through to the full mcols()-building block, which naturally produces zero-length AAStringSet columns via AAStringSet(rep("", length(txlocal))).

  2. Fix scalar GENEID — change GENEID=NA_character_ to GENEID=rep(NA_character_, length(txlocal)) so DataFrame() construction is valid at zero length.
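
The length mismatch behind the second change can be sketched with base R alone. This uses base `data.frame()` as a stand-in (an assumption; `S4Vectors::DataFrame()` may report the problem differently), and `n` plays the role of `length(txlocal)`.

```r
n <- 0L  # stands in for length(txlocal) when nothing overlaps a CDS

# A scalar NA has length 1 regardless of n, while rep() tracks n.
scalar_id   <- NA_character_
repeated_id <- rep(NA_character_, n)
stopifnot(length(scalar_id) == 1L, length(repeated_id) == 0L)

# Mixing a zero-length column with a length-1 column is rejected by
# base data.frame(); capture the error rather than stopping.
bad <- tryCatch(
    data.frame(REF = character(n), GENEID = scalar_id),
    error = function(e) e
)

# With rep(), every column has length n and construction succeeds.
ok <- data.frame(REF = character(n), GENEID = repeated_id)
stopifnot(inherits(bad, "error"), nrow(ok) == 0L, ncol(ok) == 2L)
```

The same `rep(x, length(txlocal))` pattern also behaves as before when `n` is greater than zero, since it then just repeats the value once per row.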

Test

Extended test_predictCoding_empty in inst/unitTests/test_predictCoding-methods.R to assert:

  • mcols(result)$REFAA is an AAStringSet
  • mcols(result)$VARAA is an AAStringSet
  • Both have length == 0L
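
The assertions listed above could look roughly like the following. This is a sketch, not the test added in the PR: `check_empty_aa_columns` is a hypothetical helper, plain `stopifnot()` is used in place of the package's unit-test framework, and a hand-built empty GRanges stands in for real predictCoding() output.

```r
library(GenomicRanges)
library(Biostrings)

# Hypothetical helper: verify that REFAA and VARAA are present
# as zero-length AAStringSet columns.
check_empty_aa_columns <- function(result) {
    for (col in c("REFAA", "VARAA")) {
        x <- mcols(result)[[col]]
        stopifnot(is(x, "AAStringSet"), length(x) == 0L)
    }
    invisible(TRUE)
}

# Stand-in for the result of predictCoding() on a non-coding query.
empty <- GRanges()
mcols(empty)$REFAA <- AAStringSet()
mcols(empty)$VARAA <- AAStringSet()

passed <- check_empty_aa_columns(empty)
```

Checking the class as well as the length is what distinguishes the fixed behavior from the old one, since `length(NULL)` is also 0.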

Fixes #86.

jmg421 added 2 commits June 12, 2026 10:19
Bioconductor#86)

When query has no overlap with the CDS, .localCoordinates() returns a
zero-length GRanges. Previously an early return on length(txlocal)==0
caused REFAA and VARAA to be absent from mcols(), returning NULL instead
of empty AAStringSet objects. This breaks downstream operations like
reverse() and subseq() on the result columns.

Fix:
- Remove early return so the full mcols-building code runs even when
  txlocal is empty, naturally producing zero-length AAStringSet columns
- Fix GENEID=NA_character_ -> rep(NA_character_, length(txlocal)) so
  DataFrame() construction works correctly at zero length

Test: extend test_predictCoding_empty to assert REFAA and VARAA are
AAStringSet with length 0.
…lassification

Multi-nucleotide variants (MNVs/DBS) can produce VARAA strings like 'P*'
or '*W' where %in% '*' fails to match. Switch to grepl('\*', ..., fixed=TRUE)
so any VARAA containing a stop codon is correctly classified as 'nonsense'
rather than 'nonsynonymous'.

Fixes Bioconductor#86. Adds unit test test_predictCoding_nonsense_DBS covering
a DBS that introduces a stop at a codon boundary.
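
The difference between the two matching strategies in the second commit can be seen in base R. The sketch below uses made-up VARAA values and a simplified two-way classification (the real function distinguishes more consequence types). It passes a bare "*" as the pattern, on the assumption that a literal match is intended, since `fixed = TRUE` matches the pattern as-is.

```r
# Illustrative translated sequences: one clean codon, a lone stop,
# and two multi-codon results that contain a stop.
varaa <- c("P", "*", "P*", "*W")

# %in% compares whole strings, so only the lone "*" matches.
exact <- varaa %in% "*"

# grepl() with fixed = TRUE looks for a literal "*" anywhere in the string.
contains_stop <- grepl("*", varaa, fixed = TRUE)

# Simplified classification, for illustration only.
consequence <- ifelse(contains_stop, "nonsense", "nonsynonymous")

stopifnot(
    identical(exact, c(FALSE, TRUE, FALSE, FALSE)),
    identical(contains_stop, c(FALSE, TRUE, TRUE, TRUE))
)
```

With whole-string matching, "P*" and "*W" would both fall through to 'nonsynonymous', which is the misclassification the commit describes.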
Development

Successfully merging this pull request may close these issues.

predictCoding on empty ranges drops REFAA/VARAA, should be empty AAStringSet instead?