Preserve FASTA-header versions in SequenceData (closes #351) by iskandr · Pull Request #355 · openvax/pyensembl

iskandr · 2026-05-13T15:11:25Z

Summary

_parse_header_id stops stripping ENS .N version suffixes — the versioned form becomes the canonical FASTA identifier. Also handles GENCODE pipe-delimited headers properly (leading ENSP00000493376.2 instead of the whole pipe-packed blob with everything after the first dot lopped off).
SequenceData gains two parse-time data structures:
- _stripped_index: bare_ens_id -> versioned_id so bare-ID lookups still resolve against a versioned FASTA (the GENCODE case)
- _versions: id -> int exposed via a new fasta_version(id) accessor for downstream sanity-checking of FASTA / GTF alignment
lookup_sequence_with_version_fallback resolves both directions (versioned caller -> bare FASTA, bare caller -> versioned FASTA) and drops the per-miss .rsplit workaround. Falls back gracefully when handed an old SequenceData (e.g. from an unpickled object) that lacks _stripped_index.
Pickle filename bumped from <fasta>.pickle to <fasta>.v2.pickle (constant FASTA_PICKLE_SCHEMA_VERSION) so caches from the v1 bare-key layout get rebuilt instead of silently loaded.
Version bumped to 2.10.0.
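As a rough illustration of the header handling described above, the sketch below keeps the version suffix and takes the leading field of either a space-delimited (Ensembl) or pipe-delimited (GENCODE) header. The names `parse_header_id` and `split_ens_version`, and the ENS regular expression, are stand-ins chosen for this example; they are not copied from pyensembl and the real helpers may differ.

```python
import re

# Assumption for this sketch: Ensembl stable IDs look like ENS + optional
# species/feature letters + digits (e.g. ENSP00000493376, ENSMUST00000000001).
_ENS_BARE = re.compile(r"^ENS[A-Z]*[0-9]+$")


def parse_header_id(header_line):
    """Return the identifier from a FASTA header line, keeping any .N suffix."""
    text = header_line.lstrip(">").strip()
    if not text:
        return ""
    # Ensembl separates the ID from metadata with whitespace;
    # GENCODE packs several fields together with '|'.
    first_token = text.split()[0]
    return first_token.split("|")[0]


def split_ens_version(identifier):
    """Return (bare_id, version); version is None for bare or non-ENS IDs."""
    bare, dot, suffix = identifier.rpartition(".")
    if dot and _ENS_BARE.match(bare) and re.fullmatch(r"[0-9]+", suffix):
        return bare, int(suffix)
    # Bare ENS IDs, unparseable suffixes, and non-ENS IDs such as the
    # TAIR isoform AT1G01010.1 are returned untouched.
    return identifier, None
```

Restricting the split to ENS-shaped IDs is what keeps a TAIR `.1` isoform suffix from being misread as a version.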

Companion to openvax/gtfparse#67 (attribute_aliases + cast_version_columns). Together these two PRs close out the GENCODE compatibility regression originally surfaced in #335 — the user's example (Variant(...).effects() returning NoncodingTranscript for every transcript) becomes pure data-flow once both are released.

Test plan

./lint.sh clean
./test.sh — 199 passed (22 new), no regressions; existing test_versioned_protein_fasta.py (ID version handling and GENCODE compatibility #335 part-2 tests) still green.
tests/test_fasta_versions.py covers:
- _parse_header_id keeps versioned ENS ID; bare ENS unchanged; splits on space (Ensembl) and pipe (GENCODE); doesn't touch TAIR AT1G01010.1
- _split_ens_version extracts the int version, returns None for bare or non-ENS, defensive for unparseable suffix
- SequenceData keys versioned IDs verbatim; _stripped_index maps bare ENS → versioned; fasta_version() accepts either form
- bare-key fallback when header carries no version
- TAIR isoform suffix is keyed as-is and bare lookup returns None
- conflict resolution: when two versioned forms of the same bare ENS coexist, the higher version wins in _stripped_index
- pickle filename contains FASTA_PICKLE_SCHEMA_VERSION; pickle round-trip rebuilds _stripped_index from disk
- lookup_sequence_with_version_fallback resolves direct hits, versioned-caller→bare-FASTA, bare-caller→versioned-FASTA, and returns None for unknown/empty inputs
- TAIR .1 isoform is NOT stripped by the fallback (would be semantically wrong)
- legacy SequenceData without _stripped_index still resolves via literal dict get
- pickled FASTA dictionary on disk uses versioned keys
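The index behaviour these tests pin down (bare ENS to versioned mapping, the higher version winning on conflict, and `fasta_version` accepting either form) can be sketched roughly as follows. `build_version_indexes` and the free-standing `fasta_version` function are hypothetical names for this example, written under the assumption that only ENS-shaped IDs carry a version; the real `SequenceData` keeps this state on the instance.

```python
import re

# Assumption: a versioned Ensembl ID is an ENS-style stable ID plus ".<int>".
_ENS_VERSIONED = re.compile(r"^(ENS[A-Z]*[0-9]+)\.([0-9]+)$")


def build_version_indexes(identifiers):
    """Return (stripped_index, versions) for an iterable of FASTA keys."""
    stripped_index = {}  # bare ENS ID -> versioned ID
    versions = {}        # versioned ID -> int version
    for identifier in identifiers:
        match = _ENS_VERSIONED.match(identifier)
        if match is None:
            # Bare ENS IDs and non-ENS IDs (e.g. TAIR AT1G01010.1) get no entry.
            continue
        bare, version = match.group(1), int(match.group(2))
        versions[identifier] = version
        current = stripped_index.get(bare)
        # Conflict resolution: keep the highest version seen for a bare ID.
        if current is None or versions[current] < version:
            stripped_index[bare] = identifier
    return stripped_index, versions


def fasta_version(stripped_index, versions, identifier):
    """Look up the version by versioned ID, or by bare ID via the index."""
    if identifier in versions:
        return versions[identifier]
    return versions.get(stripped_index.get(identifier))
```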

coveralls · 2026-05-13T15:20:57Z

coverage: 85.426% (+0.5%) from 84.971% — feature/preserve-fasta-versions into main

iskandr

Two requested changes before landing — bundling rather than spinning a follow-up:

1. Drop the pickle filename bump

The FASTA_PICKLE_SCHEMA_VERSION = "v2" rename is overcautious. v1 pickles are forward-compatible with this code.

Walking through what happens when v2.10.0 loads an old (bare-keyed) v1 pickle:

_add_to_fasta_dictionary iterates the loaded dict.
For each (identifier, sequence): identifier is bare (ENSP00000123456), so _split_ens_version returns (id, None).
The if version is not None: guard skips both _stripped_index and _versions writes.
Result: bare-keyed _fasta_dictionary, empty _stripped_index, empty _versions.

That state still resolves every realistic lookup correctly:

| Caller passes | Dict has (v1 cache) | Result |
|---|---|---|
| bare `ENSP00000123456` | bare key | direct hit ✓ |
| versioned `ENSP00000123456.3` (GENCODE GTF) | bare key | step-2 strip-and-retry hits ✓ |
| bare against versioned-FASTA (the new GENCODE→bare path) | n/a: only happens after a fresh parse, which is fine | |

The only capability lost on a stale v1 cache is fasta_version(id) returning a real int — but that information was never captured by the v1 parser, so returning None is honest, not wrong.
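To make the walk-through concrete, here is a minimal sketch of a three-step lookup run against a bare-keyed dictionary with an empty index, which is the state a v1 cache would produce. The function name `lookup_with_version_fallback`, its signature, and the regular expression are assumptions for this example rather than the actual pyensembl helper.

```python
import re

# Assumption: a versioned Ensembl ID is an ENS-style stable ID plus ".<int>".
_ENS_VERSIONED = re.compile(r"^(ENS[A-Z]*[0-9]+)\.([0-9]+)$")


def lookup_with_version_fallback(fasta_dictionary, stripped_index, identifier):
    """Resolve `identifier` to a sequence, trying both version directions."""
    if not identifier:
        return None
    # Step 1: direct hit on whatever form the FASTA header carried.
    if identifier in fasta_dictionary:
        return fasta_dictionary[identifier]
    # Step 2: versioned caller against a bare-keyed FASTA (strip and retry).
    match = _ENS_VERSIONED.match(identifier)
    if match is not None and match.group(1) in fasta_dictionary:
        return fasta_dictionary[match.group(1)]
    # Step 3: bare caller against a versioned FASTA, via the parse-time index.
    versioned = stripped_index.get(identifier)
    if versioned is not None:
        return fasta_dictionary.get(versioned)
    return None


# State after loading a hypothetical v1 cache: bare keys, empty index.
v1_dictionary = {"ENSP00000123456": "MKT"}
v1_stripped_index = {}
```

With the empty index, steps 1 and 2 cover the first two rows of the table; step 3 only matters once a fresh parse has populated the index.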

Suggested patch: drop FASTA_PICKLE_SCHEMA_VERSION and revert the filename change in SequenceData.__init__:

self.fasta_dictionary_filenames = [
    filename + ".pickle" for filename in self.fasta_filenames
]

Plus a regression test that constructs an old-style bare-keyed pickle on disk and asserts SequenceData loads it without erroring, and that lookup_sequence_with_version_fallback still resolves both bare and versioned callers against it.

2. Surface `fasta_version` on the user-facing view objects

SequenceData.fasta_version(id) is exposed but nothing in the public API plumbs it up. Easy add now, harder to retrofit later if downstream tools start caching the wrong protein version.

Transcript.fasta_version — straightforward, Transcript already has self.genome:

@property
def fasta_version(self):
    """Version embedded in this transcript's cDNA FASTA header, or None."""
    if self.genome.transcript_sequences is None:
        return None
    return self.genome.transcript_sequences.fasta_version(self.transcript_id)

Protein.fasta_version — needs a genome reference on the Protein view, which it doesn't carry today. Smallest change: thread genome= through Protein.__init__ and the Transcript.protein constructor:

# protein.py
class Protein(Serializable):
    def __init__(self, protein_id, protein_version=None, genome=None):
        self.protein_id = protein_id
        self.protein_version = protein_version
        self.genome = genome

    @property
    def fasta_version(self):
        if self.genome is None or self.genome.protein_sequences is None:
            return None
        return self.genome.protein_sequences.fasta_version(self.protein_id)

# transcript.py — Transcript.protein
return Protein(
    protein_id=self.protein_id,
    protein_version=protein_version,
    genome=self.genome,
)

Tests to add:

transcript.fasta_version returns the int when the cDNA FASTA carried a version; None when it didn't.
transcript.protein.fasta_version returns the int when the protein FASTA carried a version.
A consistency-check pattern: transcript.protein.protein_version (from GTF) vs transcript.protein.fasta_version (from FASTA) — useful when a user mixes a fresh GTF with a stale FASTA.

This matches the rationale I gave in #351 for keeping the version in the first place: the FASTA-header version is the authoritative source-of-truth for sequence identity, and downstream consumers want to be able to see it.
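The consistency-check pattern mentioned in the test list could look something like the sketch below. `version_agreement` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for a `Protein` view exposing `protein_version` and `fasta_version`; nothing here calls pyensembl itself.

```python
from types import SimpleNamespace


def version_agreement(protein):
    """Compare the GTF-derived version with the FASTA-header version.

    Returns "match", "mismatch", or "unknown" when either side is missing.
    """
    gtf_version = protein.protein_version
    fasta_version = protein.fasta_version
    if gtf_version is None or fasta_version is None:
        return "unknown"
    return "match" if int(gtf_version) == int(fasta_version) else "mismatch"


# Stand-ins for Protein views; a fresh GTF paired with a stale FASTA
# would show up as the second case.
consistent = SimpleNamespace(protein_version=3, fasta_version=3)
stale_fasta = SimpleNamespace(protein_version=5, fasta_version=3)
bare_header = SimpleNamespace(protein_version=5, fasta_version=None)
```

Returning "unknown" rather than "mismatch" for a missing version keeps bare-header FASTA files from being flagged as inconsistent.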

Both changes are small. Reverting the schema bump is ~4 lines + a test. Adding fasta_version on the views is ~15 lines + tests. Together they make the PR fully close out the original issue rather than leaving the user-facing accessor as a follow-up.

`_parse_header_id` no longer strips ENS .N version suffixes: the versioned form is preserved as the canonical FASTA identifier, with the version treated as authoritative source-of-truth for the sequence's identity. Also added pipe-delimited (GENCODE) header handling: the leading field before `|` is now the identifier rather than the whole pipe-packed blob with everything-after-the-first-dot silently dropped.

`SequenceData` now maintains:
- `_fasta_dictionary` keyed on whichever form the header carried (versioned when available, bare otherwise)
- `_stripped_index: bare_ens_id -> versioned_id`, so bare-ID lookups still resolve against a versioned FASTA (GENCODE case)
- `_versions: id -> int`, exposed via the new `fasta_version(id)` accessor for downstream sanity-checking of FASTA / GTF alignment

`lookup_sequence_with_version_fallback` resolves both directions, versioned caller against bare FASTA (existing Ensembl path) and bare caller against versioned FASTA (new GENCODE path), without the runtime `.rsplit` on every miss the old helper did. Defensive fallback for callers that hold a `SequenceData` without the new attributes (e.g. unpickled old object).

Bumped the on-disk pickle filename to `<fasta>.v2.pickle` so stale caches from the previous bare-key layout get ignored instead of loaded.

Closes #351. Companion to gtfparse#67 (attribute_aliases + cast_version_columns), which together close out the GENCODE compatibility regression originally reported in #335.

* Reverted FASTA_PICKLE_SCHEMA_VERSION + the .v2.pickle filename suffix. v1 (bare-keyed) pickles load cleanly under the new code path - the rebuilt _stripped_index stays empty, but lookups still resolve via the version-stripped retry path so Ensembl-style callers see no regression. New test test_old_bare_keyed_pickle_still_loads_under_new_code pins this.
* Added Transcript.fasta_version - returns the int version that the cDNA FASTA header carried for this transcript, or None when the genome has no transcript FASTA / the header was bare.
* Added Protein.fasta_version - same shape but for the protein FASTA. Protein now carries an optional genome= reference so the view object can consult genome.protein_sequences.fasta_version(). Default is None for back-compat with callers constructing Protein outside of Transcript.
* Opted out of gtfparse 2.7.0's cast_version_columns=True in database.py's read_gtf call. When the *_version columns are missing on some rows (e.g. start_codon rows don't carry transcript_version), pandas' nullable Int64 routes through float on the sqlite write path and the value lands as '7.0' text, which broke pyensembl's int(result[0]) parse. Keeping strings and doing the int conversion on the property side as before.
* New tests for both fasta_version accessors, including a "GTF claims version 5, FASTA claims version 3" disagreement case so downstream tools can detect mismatched GTF/FASTA pairs.
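The `'7.0'` failure mentioned in the commit message comes down to a plain Python property: `int()` rejects a string containing a decimal point. The sketch below reproduces only that last step. It does not involve pandas, sqlite, or gtfparse, and the helper names are hypothetical.

```python
def version_text_after_float_roundtrip(version):
    """Text an integer version becomes if it passes through a float first."""
    return str(float(version))


def parse_version_text(text):
    """Strict integer parse, in the spirit of int(result[0])."""
    return int(text)


def strict_parse_fails(text):
    """Return True when the strict parse rejects `text`."""
    try:
        parse_version_text(text)
    except ValueError:
        return True
    return False
```

Here `version_text_after_float_roundtrip(7)` yields `"7.0"`, which `parse_version_text` rejects, while the string `"7"` parses cleanly. That is consistent with the choice to keep the version columns as strings and convert on the property side.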

This was referenced May 13, 2026

Fix #335 (part 1): wire GENCODE_BIOTYPE_ALIASES into read_gtf call #356

Merged

ID version handling and GENCODE compatibility #335

Closed

iskandr added 2 commits May 13, 2026 13:40

iskandr force-pushed the feature/preserve-fasta-versions branch from 01018f9 to 0fa7d94 Compare May 13, 2026 17:46

iskandr merged commit d25332b into main May 13, 2026
10 checks passed

iskandr deleted the feature/preserve-fasta-versions branch May 13, 2026 17:53

iskandr mentioned this pull request May 13, 2026

Cannot Access the coding sequence in specific transcript #267

Closed
