Preserve FASTA-header versions in SequenceData (closes #351)#355
Conversation
iskandr
left a comment
There was a problem hiding this comment.
Two requested changes before landing — bundling rather than spinning a follow-up:
1. Drop the pickle filename bump
The FASTA_PICKLE_SCHEMA_VERSION = "v2" rename is overcautious. v1 pickles are forward-compatible with this code.
Walking through what happens when v2.10.0 loads an old (bare-keyed) v1 pickle:
_add_to_fasta_dictionaryiterates the loaded dict.- For each
(identifier, sequence): identifier is bare (ENSP00000123456), so_split_ens_versionreturns(id, None). - The
if version is not None:guard skips both_stripped_indexand_versionswrites. - Result: bare-keyed
_fasta_dictionary, empty_stripped_index, empty_versions.
That state still resolves every realistic lookup correctly:
| Caller passes | Dict has (v1 cache) | Result |
|---|---|---|
bare ENSP00000123456 |
bare key | direct hit ✓ |
versioned ENSP00000123456.3 (GENCODE GTF) |
bare key | step-2 strip-and-retry hits ✓ |
| bare against versioned-FASTA (the new GENCODE→bare path) | n/a — only happens after a fresh parse, which is fine |
The only capability lost on a stale v1 cache is fasta_version(id) returning a real int — but that information was never captured by the v1 parser, so returning None is honest, not wrong.
Suggested patch: drop FASTA_PICKLE_SCHEMA_VERSION and revert the filename change in SequenceData.__init__:
self.fasta_dictionary_filenames = [
filename + ".pickle" for filename in self.fasta_filenames
]Plus a regression test that constructs an old-style bare-keyed pickle on disk and asserts SequenceData loads it without erroring, and that lookup_sequence_with_version_fallback still resolves both bare and versioned callers against it.
2. Surface fasta_version on the user-facing view objects
SequenceData.fasta_version(id) is exposed but nothing in the public API plumbs it up. Easy add now, harder to retrofit later if downstream tools start caching the wrong protein version.
Transcript.fasta_version — straightforward, Transcript already has self.genome:
@property
def fasta_version(self):
"""Version embedded in this transcript's cDNA FASTA header, or None."""
if self.genome.transcript_sequences is None:
return None
return self.genome.transcript_sequences.fasta_version(self.transcript_id)Protein.fasta_version — needs a genome reference on the Protein view, which it doesn't carry today. Smallest change: thread genome= through Protein.__init__ and the Transcript.protein constructor:
# protein.py
class Protein(Serializable):
def __init__(self, protein_id, protein_version=None, genome=None):
self.protein_id = protein_id
self.protein_version = protein_version
self.genome = genome
@property
def fasta_version(self):
if self.genome is None or self.genome.protein_sequences is None:
return None
return self.genome.protein_sequences.fasta_version(self.protein_id)
# transcript.py — Transcript.protein
return Protein(
protein_id=self.protein_id,
protein_version=protein_version,
genome=self.genome,
)Tests to add:
transcript.fasta_versionreturns the int when the cDNA FASTA carried a version; None when it didn't.transcript.protein.fasta_versionreturns the int when the protein FASTA carried a version.- A consistency-check pattern:
transcript.protein.protein_version(from GTF) vstranscript.protein.fasta_version(from FASTA) — useful when a user mixes a fresh GTF with a stale FASTA.
This matches the rationale I gave in #351 for keeping the version in the first place: the FASTA-header version is the authoritative source-of-truth for sequence identity, and downstream consumers want to be able to see it.
Both changes are small. Reverting the schema bump is ~4 lines + a test. Adding fasta_version on the views is ~15 lines + tests. Together they make the PR fully close out the original issue rather than leaving the user-facing accessor as a follow-up.
`_parse_header_id` no longer strips ENS .N version suffixes — the
versioned form is preserved as the canonical FASTA identifier, with
the version treated as authoritative source-of-truth for the
sequence's identity. Also added pipe-delimited (GENCODE) header
handling: the leading field before `|` is now the identifier rather
than the whole pipe-packed blob with everything-after-the-first-dot
silently dropped.
`SequenceData` now maintains:
- `_fasta_dictionary` keyed on whichever form the header carried
(versioned when available, bare otherwise)
- `_stripped_index: bare_ens_id -> versioned_id`, so bare-ID
lookups still resolve against a versioned FASTA (GENCODE case)
- `_versions: id -> int`, exposed via the new `fasta_version(id)`
accessor for downstream sanity-checking of FASTA / GTF alignment
`lookup_sequence_with_version_fallback` resolves both directions —
versioned caller against bare FASTA (existing Ensembl path) and
bare caller against versioned FASTA (new GENCODE path) — without
the runtime `.rsplit` on every miss the old helper did. Defensive
fallback for callers that hold a `SequenceData` without the new
attributes (e.g. unpickled old object).
Bumped the on-disk pickle filename to `<fasta>.v2.pickle` so stale
caches from the previous bare-key layout get ignored instead of
loaded.
Closes #351. Companion to gtfparse#67 (attribute_aliases +
cast_version_columns), which together close out the GENCODE
compatibility regression originally reported in #335.
* Reverted FASTA_PICKLE_SCHEMA_VERSION + the .v2.pickle filename suffix. v1 (bare-keyed) pickles load cleanly under the new code path - the rebuilt _stripped_index stays empty, but lookups still resolve via the version-stripped retry path so Ensembl-style callers see no regression. New test test_old_bare_keyed_pickle_still_loads_under_new_code pins this. * Added Transcript.fasta_version - returns the int version that the cDNA FASTA header carried for this transcript, or None when the genome has no transcript FASTA / the header was bare. * Added Protein.fasta_version - same shape but for the protein FASTA. Protein now carries an optional genome= reference so the view object can consult genome.protein_sequences.fasta_version(). Default is None for back-compat with callers constructing Protein outside of Transcript. * Opted out of gtfparse 2.7.0's cast_version_columns=True in database.py's read_gtf call. When the *_version columns are missing on some rows (e.g. start_codon rows don't carry transcript_version), pandas' nullable Int64 routes through float on the sqlite write path and the value lands as '7.0' text, which broke pyensembl's int(result[0]) parse. Keeping strings and doing the int conversion on the property side as before. * New tests for both fasta_version accessors, including a "GTF claims version 5, FASTA claims version 3" disagreement case so downstream tools can detect mismatched GTF/FASTA pairs.
01018f9 to
0fa7d94
Compare
Summary
_parse_header_idstops stripping ENS.Nversion suffixes — the versioned form becomes the canonical FASTA identifier. Also handles GENCODE pipe-delimited headers properly (leadingENSP00000493376.2instead of the whole pipe-packed blob with everything after the first dot lopped off).SequenceDatagains two parse-time data structures:_stripped_index: bare_ens_id -> versioned_idso bare-ID lookups still resolve against a versioned FASTA (the GENCODE case)_versions: id -> intexposed via a newfasta_version(id)accessor for downstream sanity-checking of FASTA / GTF alignmentlookup_sequence_with_version_fallbackresolves both directions (versioned caller -> bare FASTA, bare caller -> versioned FASTA) and drops the per-miss.rsplitworkaround. Falls back gracefully when handed an oldSequenceData(e.g. from an unpickled object) that lacks_stripped_index.<fasta>.pickleto<fasta>.v2.pickle(constantFASTA_PICKLE_SCHEMA_VERSION) so caches from the v1 bare-key layout get rebuilt instead of silently loaded.Companion to openvax/gtfparse#67 (attribute_aliases + cast_version_columns). Together these two PRs close out the GENCODE compatibility regression originally surfaced in #335 — the user's example (
Variant(...).effects()returning NoncodingTranscript for every transcript) becomes pure data-flow once both are released.Test plan
./lint.shclean./test.sh— 199 passed (22 new), no regressions; existingtest_versioned_protein_fasta.py(ID version handling and GENCODE compatibility #335 part-2 tests) still green.tests/test_fasta_versions.pycovers:_parse_header_idkeeps versioned ENS ID; bare ENS unchanged; splits on space (Ensembl) and pipe (GENCODE); doesn't touch TAIRAT1G01010.1_split_ens_versionextracts the int version, returns None for bare or non-ENS, defensive for unparseable suffixSequenceDatakeys versioned IDs verbatim;_stripped_indexmaps bare ENS → versioned;fasta_version()accepts either form_stripped_indexFASTA_PICKLE_SCHEMA_VERSION; pickle round-trip rebuilds_stripped_indexfrom disklookup_sequence_with_version_fallbackresolves direct hits, versioned-caller→bare-FASTA, bare-caller→versioned-FASTA, and returns None for unknown/empty inputs.1isoform is NOT stripped by the fallback (would be semantically wrong)SequenceDatawithout_stripped_indexstill resolves via literal dict get