Skip to content

Preserve FASTA-header versions in SequenceData (closes #351)#355

Merged
iskandr merged 2 commits into
mainfrom
feature/preserve-fasta-versions
May 13, 2026
Merged

Preserve FASTA-header versions in SequenceData (closes #351)#355
iskandr merged 2 commits into
mainfrom
feature/preserve-fasta-versions

Conversation

@iskandr

@iskandr iskandr commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

  • _parse_header_id stops stripping ENS .N version suffixes — the versioned form becomes the canonical FASTA identifier. Also handles GENCODE pipe-delimited headers properly (leading ENSP00000493376.2 instead of the whole pipe-packed blob with everything after the first dot lopped off).
  • SequenceData gains two parse-time data structures:
    • _stripped_index: bare_ens_id -> versioned_id so bare-ID lookups still resolve against a versioned FASTA (the GENCODE case)
    • _versions: id -> int exposed via a new fasta_version(id) accessor for downstream sanity-checking of FASTA / GTF alignment
  • lookup_sequence_with_version_fallback resolves both directions (versioned caller -> bare FASTA, bare caller -> versioned FASTA) and drops the per-miss .rsplit workaround. Falls back gracefully when handed an old SequenceData (e.g. from an unpickled object) that lacks _stripped_index.
  • Pickle filename bumped from <fasta>.pickle to <fasta>.v2.pickle (constant FASTA_PICKLE_SCHEMA_VERSION) so caches from the v1 bare-key layout get rebuilt instead of silently loaded.
  • Version bumped to 2.10.0.

Companion to openvax/gtfparse#67 (attribute_aliases + cast_version_columns). Together these two PRs close out the GENCODE compatibility regression originally surfaced in #335 — the user's example (Variant(...).effects() returning NoncodingTranscript for every transcript) becomes pure data-flow once both are released.

Test plan

  • ./lint.sh clean
  • ./test.sh — 199 passed (22 new), no regressions; existing test_versioned_protein_fasta.py (ID version handling and GENCODE compatibility #335 part-2 tests) still green.
  • tests/test_fasta_versions.py covers:
    • _parse_header_id keeps versioned ENS ID; bare ENS unchanged; splits on space (Ensembl) and pipe (GENCODE); doesn't touch TAIR AT1G01010.1
    • _split_ens_version extracts the int version, returns None for bare or non-ENS, defensive for unparseable suffix
    • SequenceData keys versioned IDs verbatim; _stripped_index maps bare ENS → versioned; fasta_version() accepts either form
    • bare-key fallback when header carries no version
    • TAIR isoform suffix is keyed as-is and bare lookup returns None
    • conflict resolution: when two versioned forms of the same bare ENS coexist, the higher version wins in _stripped_index
    • pickle filename contains FASTA_PICKLE_SCHEMA_VERSION; pickle round-trip rebuilds _stripped_index from disk
    • lookup_sequence_with_version_fallback resolves direct hits, versioned-caller→bare-FASTA, bare-caller→versioned-FASTA, and returns None for unknown/empty inputs
    • TAIR .1 isoform is NOT stripped by the fallback (would be semantically wrong)
    • legacy SequenceData without _stripped_index still resolves via literal dict get
    • pickled FASTA dictionary on disk uses versioned keys

@coveralls

coveralls commented May 13, 2026

Copy link
Copy Markdown

Coverage Status

coverage: 85.426% (+0.5%) from 84.971% — feature/preserve-fasta-versions into main

@iskandr iskandr left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two requested changes before landing — bundling rather than spinning a follow-up:

1. Drop the pickle filename bump

The FASTA_PICKLE_SCHEMA_VERSION = "v2" rename is overcautious. v1 pickles are forward-compatible with this code.

Walking through what happens when v2.10.0 loads an old (bare-keyed) v1 pickle:

  • _add_to_fasta_dictionary iterates the loaded dict.
  • For each (identifier, sequence): identifier is bare (ENSP00000123456), so _split_ens_version returns (id, None).
  • The if version is not None: guard skips both _stripped_index and _versions writes.
  • Result: bare-keyed _fasta_dictionary, empty _stripped_index, empty _versions.

That state still resolves every realistic lookup correctly:

Caller passes Dict has (v1 cache) Result
bare ENSP00000123456 bare key direct hit ✓
versioned ENSP00000123456.3 (GENCODE GTF) bare key step-2 strip-and-retry hits ✓
bare against versioned-FASTA (the new GENCODE→bare path) n/a — only happens after a fresh parse, which is fine

The only capability lost on a stale v1 cache is fasta_version(id) returning a real int — but that information was never captured by the v1 parser, so returning None is honest, not wrong.

Suggested patch: drop FASTA_PICKLE_SCHEMA_VERSION and revert the filename change in SequenceData.__init__:

self.fasta_dictionary_filenames = [
    filename + ".pickle" for filename in self.fasta_filenames
]

Plus a regression test that constructs an old-style bare-keyed pickle on disk and asserts SequenceData loads it without erroring, and that lookup_sequence_with_version_fallback still resolves both bare and versioned callers against it.

2. Surface fasta_version on the user-facing view objects

SequenceData.fasta_version(id) is exposed but nothing in the public API plumbs it up. Easy add now, harder to retrofit later if downstream tools start caching the wrong protein version.

Transcript.fasta_version — straightforward, Transcript already has self.genome:

@property
def fasta_version(self):
    """Version embedded in this transcript's cDNA FASTA header, or None."""
    if self.genome.transcript_sequences is None:
        return None
    return self.genome.transcript_sequences.fasta_version(self.transcript_id)

Protein.fasta_version — needs a genome reference on the Protein view, which it doesn't carry today. Smallest change: thread genome= through Protein.__init__ and the Transcript.protein constructor:

# protein.py
class Protein(Serializable):
    def __init__(self, protein_id, protein_version=None, genome=None):
        self.protein_id = protein_id
        self.protein_version = protein_version
        self.genome = genome

    @property
    def fasta_version(self):
        if self.genome is None or self.genome.protein_sequences is None:
            return None
        return self.genome.protein_sequences.fasta_version(self.protein_id)

# transcript.py — Transcript.protein
return Protein(
    protein_id=self.protein_id,
    protein_version=protein_version,
    genome=self.genome,
)

Tests to add:

  • transcript.fasta_version returns the int when the cDNA FASTA carried a version; None when it didn't.
  • transcript.protein.fasta_version returns the int when the protein FASTA carried a version.
  • A consistency-check pattern: transcript.protein.protein_version (from GTF) vs transcript.protein.fasta_version (from FASTA) — useful when a user mixes a fresh GTF with a stale FASTA.

This matches the rationale I gave in #351 for keeping the version in the first place: the FASTA-header version is the authoritative source-of-truth for sequence identity, and downstream consumers want to be able to see it.


Both changes are small. Reverting the schema bump is ~4 lines + a test. Adding fasta_version on the views is ~15 lines + tests. Together they make the PR fully close out the original issue rather than leaving the user-facing accessor as a follow-up.

iskandr added 2 commits May 13, 2026 13:40
`_parse_header_id` no longer strips ENS .N version suffixes — the
versioned form is preserved as the canonical FASTA identifier, with
the version treated as authoritative source-of-truth for the
sequence's identity. Also added pipe-delimited (GENCODE) header
handling: the leading field before `|` is now the identifier rather
than the whole pipe-packed blob with everything-after-the-first-dot
silently dropped.

`SequenceData` now maintains:
  - `_fasta_dictionary` keyed on whichever form the header carried
    (versioned when available, bare otherwise)
  - `_stripped_index: bare_ens_id -> versioned_id`, so bare-ID
    lookups still resolve against a versioned FASTA (GENCODE case)
  - `_versions: id -> int`, exposed via the new `fasta_version(id)`
    accessor for downstream sanity-checking of FASTA / GTF alignment

`lookup_sequence_with_version_fallback` resolves both directions —
versioned caller against bare FASTA (existing Ensembl path) and
bare caller against versioned FASTA (new GENCODE path) — without
the runtime `.rsplit` on every miss the old helper did. Defensive
fallback for callers that hold a `SequenceData` without the new
attributes (e.g. unpickled old object).

Bumped the on-disk pickle filename to `<fasta>.v2.pickle` so stale
caches from the previous bare-key layout get ignored instead of
loaded.

Closes #351. Companion to gtfparse#67 (attribute_aliases +
cast_version_columns), which together close out the GENCODE
compatibility regression originally reported in #335.
* Reverted FASTA_PICKLE_SCHEMA_VERSION + the .v2.pickle filename suffix.
  v1 (bare-keyed) pickles load cleanly under the new code path - the
  rebuilt _stripped_index stays empty, but lookups still resolve via
  the version-stripped retry path so Ensembl-style callers see no
  regression. New test test_old_bare_keyed_pickle_still_loads_under_new_code
  pins this.

* Added Transcript.fasta_version - returns the int version that the
  cDNA FASTA header carried for this transcript, or None when the
  genome has no transcript FASTA / the header was bare.

* Added Protein.fasta_version - same shape but for the protein FASTA.
  Protein now carries an optional genome= reference so the view object
  can consult genome.protein_sequences.fasta_version(). Default is
  None for back-compat with callers constructing Protein outside of
  Transcript.

* Opted out of gtfparse 2.7.0's cast_version_columns=True in
  database.py's read_gtf call. When the *_version columns are missing
  on some rows (e.g. start_codon rows don't carry transcript_version),
  pandas' nullable Int64 routes through float on the sqlite write path
  and the value lands as '7.0' text, which broke pyensembl's
  int(result[0]) parse. Keeping strings and doing the int conversion
  on the property side as before.

* New tests for both fasta_version accessors, including a "GTF claims
  version 5, FASTA claims version 3" disagreement case so downstream
  tools can detect mismatched GTF/FASTA pairs.
@iskandr iskandr force-pushed the feature/preserve-fasta-versions branch from 01018f9 to 0fa7d94 Compare May 13, 2026 17:46
@iskandr iskandr merged commit d25332b into main May 13, 2026
10 checks passed
@iskandr iskandr deleted the feature/preserve-fasta-versions branch May 13, 2026 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants