Merged
61 changes: 57 additions & 4 deletions CHANGELOG.md
@@ -1,5 +1,58 @@
# Changelog

## 5.16.2

**Combine separate predictor runs (#170):**

`topiary.combine_predictions([a, b, ...])` combines separate
predictor outputs into the same long-form shape produced by running
those predictors together. It accepts `TopiaryResult` or fresh
`TopiaryPredictor` DataFrame outputs, supports both split-by-predictor
and split-by-allele/peptide-length runs, rejects duplicate
`(prediction_method_name, kind, identity)` predictions, and by default
requires every emitted `(prediction_method_name, kind)` group to cover
the same identity grid. Use `coverage="partial"` only for deliberate
sparse unions.

Fresh `TopiaryPredictor` DataFrames now carry lightweight
`DataFrame.attrs` model-version metadata (`topiary_models`) so this
helper can preserve model provenance without changing the public return
type. The emitted rows remain the source of truth for which predictor
produced which quantities: `prediction_method_name`, `predictor_version`,
`kind`, and the value/rank columns are not duplicated into separate
`kind_support` metadata.

`TopiaryPredictor(name=...)` now optionally records per-run provenance
in a `prediction_run_name` column. This is intended for split predictor
grids such as one NetMHCpan run per allele/peptide length: the logical
method remains `prediction_method_name="netmhcpan"`, while
`prediction_run_name` records the shard. `combine_predictions`
and `to_wide()` treat the run name as provenance, not as a separate
prediction identity, so disjoint shards combine cleanly and overlapping
shards still fail as duplicate predictions.

`combine_predictions` also treats `sample_name` as part of the
implicit row identity when present, matching `to_wide()` grouping for
multi-sample predictor outputs.

The combine docs now spell out the recommended allele-grid strategy:
split NetMHCpan-style per-allele predictors can be combined under
`coverage="complete"`, while intentionally sparse grids such as
MHCflurry haplotype-mode presentation should use `coverage="partial"`
and the ranking DSL's `best_*_allele` accessors for allele attribution.

`TopiaryResult` now treats long/wide representation as an internal,
cached view concern. Results expose `long_df` and `wide_df` on demand,
`to_long()` / `to_wide()` return results whose active `df` view is the
requested form, and `topiary.stack_results()` normalizes mixed-form
`TopiaryResult` inputs internally rather than requiring callers to
pre-convert them.

Result merging now has user-facing names for the two distinct operations:
use `stack_results` / `result.stack_with(...)` when inputs are independent
result sets (files, samples, cohorts), and use `combine_predictions` /
`result.combine_predictions(...)` when inputs are complementary predictor
outputs for the same logical identity grid.

## 5.16.1

**pirlygenes 5.1.0 integration:**
@@ -49,23 +102,23 @@ selector won't find them — callers wanting per-algorithm DSL access
should melt them out themselves or re-predict via `TopiaryPredictor`.

Multiple files (MHC-I + MHC-II, or a mix of flavors) compose through
-`topiary.concat([read_pvacseq(p1), read_pvacseq(p2)])`; no dedicated
+`topiary.stack_results([read_pvacseq(p1), read_pvacseq(p2)])`; no dedicated
multi-file entry point is exposed.

Loader-derived columns aligned with `TopiaryPredictor` output so
downstream consumers (vaxrank, etc.) don't have to special-case the
loader source:

- `mhc_class` (`"I"` / `"II"`) — derived from the allele string;
-  lets concat-ed MHC-I + MHC-II results be filtered or split by class.
+  lets stacked MHC-I + MHC-II results be filtered or split by class.
- `contains_mutant_residues` (boolean) — true iff the row's mutation
position falls inside the candidate peptide; false for flanking-only
peptides that pVACseq scored but where the mutation lies outside.
- `mutation_start_in_peptide` / `mutation_end_in_peptide` (Int64,
0-based half-open) — derived from pVACseq's 1-based Pos / Mutation
Position.
- `source` — per-row provenance label, matching `read_tsv`
-  convention; keeps multi-file concats distinguishable without rooting
+  convention; keeps multi-file stacks distinguishable without rooting
through `Metadata.sources`.

`Metadata.extra["kind_support"]` mirrors `TopiaryPredictor.kind_support`
@@ -858,7 +911,7 @@ changes, just internal cleanup.
`iterrows`, etc.) so most existing DataFrame-style code continues to
work. Provides `to_wide()`, `to_long()`, `to_tsv()`, `to_csv()`,
`filter_by()`, `sort_by()`.
-- `topiary.concat([r1, r2, ...])` merges `TopiaryResult`s, unioning
+- `topiary.stack_results([r1, r2, ...])` merges `TopiaryResult`s, unioning
models (warns on version conflicts), concatenating sources, and
preserving filter/sort history only if all inputs agree.
- `read_tsv` / `read_csv` accept a `tag=` kwarg to label the source of
34 changes: 34 additions & 0 deletions docs/api.md
@@ -26,6 +26,40 @@
| `predict_from_variants(variants)` | VariantCollection | Variant pipeline (builds `ProteinFragment`s internally and delegates). |
| `predict_from_mutation_effects(effects)` | EffectCollection | Same as `predict_from_variants` but starting from pre-computed effects. |

## TopiaryResult

`TopiaryResult` is the semantic result object for Topiary prediction tables.
It can ingest either long or wide prediction tables, keeps the active `df`
view for pandas-style access, and materializes cached `long_df` and `wide_df`
views on demand. The public `form` value describes the active `df` view; it is
derived from the stored views rather than acting as separate result state. Use
`result.to_long()` / `result.to_wide()` when you want a new `TopiaryResult`
whose active `df` is that form.

Topiary has two result-merging operations:

| Operation | Meaning | Use when |
| --- | --- | --- |
| `topiary.stack_results([a, b])` / `a.stack_with(b)` | More result sets | Inputs are separate files, samples, cohorts, or independent result sets. |
| `topiary.combine_predictions([a, b])` / `a.combine_predictions(b)` | More predictions for the same logical identity grid | Inputs are separate predictors or predictor shards that should behave like one run. |

`stack_results` is a row-union operation. The inputs do not need to describe
the same peptides, alleles, samples, sources, predictors, or score kinds; they
are just more Topiary result rows with merged provenance.

`combine_predictions` is a prediction-union operation. The inputs are pieces
of one logical prediction table: separate predictors over the same candidates,
or one predictor split into disjoint allele/peptide-length shards. It rejects
duplicate `(prediction_method_name, kind, identity)` rows and, by default,
requires every emitted `(prediction_method_name, kind)` group to cover the
same identity grid. Use `coverage="partial"` only for intentionally sparse
prediction unions.
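
The duplicate rule can be illustrated with a plain-Python sketch. This is not Topiary's implementation: the helper names `row` and `find_duplicate_predictions` are hypothetical, identity is assumed to be just `(peptide, allele)`, and the `kind` string and values are made up.

```python
from collections import Counter


def row(method, kind, peptide, allele, value):
    """Build one long-form prediction row as a plain dict (illustrative only)."""
    return {
        "prediction_method_name": method,
        "kind": kind,
        "peptide": peptide,
        "allele": allele,
        "value": value,
    }


def find_duplicate_predictions(rows):
    """Return (method, kind, peptide, allele) keys that occur more than once."""
    counts = Counter(
        (r["prediction_method_name"], r["kind"], r["peptide"], r["allele"])
        for r in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Two disjoint allele shards of the same logical predictor.
shard_a = [row("netmhcpan", "affinity", "SIINFEKL", "HLA-A*02:01", 850.0)]
shard_b = [row("netmhcpan", "affinity", "SIINFEKL", "HLA-B*07:02", 42.0)]

disjoint = find_duplicate_predictions(shard_a + shard_b)      # no duplicates
overlapping = find_duplicate_predictions(shard_a + shard_a)   # same key twice
```

Under these assumptions, disjoint shards produce no duplicate keys, while passing the same shard twice reports its key, which is the situation `combine_predictions` is described as rejecting.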

Both operations accept mixed long/wide `TopiaryResult` inputs and normalize
internally. Fresh `TopiaryPredictor` DataFrame outputs may also be passed to
`combine_predictions`; they are first wrapped as `TopiaryResult` objects and
then validated with the same rules.

## CachedPredictor

Drop-in replacement for a live predictor that serves scores from a
12 changes: 6 additions & 6 deletions docs/pvacseq.md
@@ -52,10 +52,10 @@ Loader-derived columns aligned with `TopiaryPredictor` output so downstream code

| Column | Type | What it carries |
|--------|------|-----------------|
-| `mhc_class` | `"I"` / `"II"` / `pd.NA` | Per-row class derived from the allele. Lets concat-ed multi-class results be filtered or split by class without re-parsing alleles. |
+| `mhc_class` | `"I"` / `"II"` / `pd.NA` | Per-row class derived from the allele. Lets stacked multi-class results be filtered or split by class without re-parsing alleles. |
| `contains_mutant_residues` | `boolean` (nullable) | True iff the row's mutation position falls inside the candidate peptide. False for flanking-only peptides where pVACseq scored a 9-mer adjacent to the mutation but the mutation lies outside. |
| `mutation_start_in_peptide` / `mutation_end_in_peptide` | `Int64` | 0-based half-open mutation interval within the peptide. Derived from pVACseq's 1-based Pos (aggregated) or Mutation Position (all_epitopes). Single-residue semantics — multi-residue mutations collapse to a representative position. |
-| `source` | `str` | Per-row provenance label, matching `read_tsv` convention so multi-file concats stay distinguishable without rooting through `Metadata.sources`. |
+| `source` | `str` | Per-row provenance label, matching `read_tsv` convention so multi-file stacks stay distinguishable without rooting through `Metadata.sources`. |

### Annotation passthroughs

@@ -76,12 +76,12 @@ print(ranked.head())

### MHC-I + MHC-II combined

-`read_pvacseq()` doesn't expose a multi-file entry point — composition is just `topiary.concat`:
+`read_pvacseq()` doesn't expose a multi-file entry point — composition is just `topiary.stack_results`:

```python
-from topiary import read_pvacseq, concat
+from topiary import read_pvacseq, stack_results

-combined = concat([
+combined = stack_results([
     read_pvacseq("HCC1395.MHC_I.all_epitopes.aggregated.tsv"),
     read_pvacseq("HCC1395.MHC_II.all_epitopes.aggregated.tsv"),
])
@@ -232,7 +232,7 @@ r.extra["kind_support"]
apply_filter(r.df, my_filter, kind_support=r.extra["kind_support"])
```

-`pvacseq_format` is `"aggregated"` or `"all_epitopes"` (or a comma-joined string after melting / concat-ing).
+`pvacseq_format` is `"aggregated"` or `"all_epitopes"` (or a comma-joined string after melting / stacking).

## Caveats and known limitations

103 changes: 103 additions & 0 deletions docs/ranking.md
@@ -356,6 +356,109 @@ should use the recommended forms above.
default is `auto`: raw affinity values and percentile ranks sort ascending,
while all other sort expressions sort descending.

## Combining Separate Predictor Runs

Run predictors together when that is convenient:

```python
from mhctools import NetMHCpan, MHCflurry
from topiary import TopiaryPredictor

combined = TopiaryPredictor(
    models=[NetMHCpan, MHCflurry],
    alleles=["HLA-A*02:01", "HLA-B*07:02"],
).predict_from_named_peptides(peptides)
```

When predictors need to run separately, use `combine_predictions` to
turn their complementary prediction rows back into the same long-form
shape:

```python
from mhctools import NetMHCpan, MHCflurry
from topiary import TopiaryPredictor, combine_predictions

netmhcpan_rows = TopiaryPredictor(
    models=NetMHCpan,
    alleles=["HLA-A*02:01", "HLA-B*07:02"],
).predict_from_named_peptides(peptides)

mhcflurry_rows = TopiaryPredictor(
    models=MHCflurry,
    alleles=["HLA-A*02:01", "HLA-B*07:02"],
).predict_from_named_peptides(peptides)

combined = combine_predictions([netmhcpan_rows, mhcflurry_rows])
```

`TopiaryResult` owns the long/wide representation. Loaders may naturally
produce wide results (for example LENS) or long results (for example pVACseq
and fresh predictor outputs), but callers can use `result.long_df`,
`result.wide_df`, `result.to_long()`, or `result.to_wide()` on demand. Topiary
merge functions normalize those forms internally instead of making callers
choose a representation before combining results.

You can also shard the same predictor over allele or peptide-length batches and
combine the shards. Use `TopiaryPredictor(name=...)` when you want to keep
track of which batch produced each row:

```python
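from mhctools import NetMHCpan
from topiary import TopiaryPredictor, combine_predictions

shards = []
for allele in ["HLA-A*02:01", "HLA-B*07:02"]:
    for length in [8, 9, 10, 11]:
        # Keep only the peptides of the current length for this shard.
        length_peptides = {
            name: peptide
            for name, peptide in peptides.items()
            if len(peptide) == length
        }
        shards.append(
            TopiaryPredictor(
                models=NetMHCpan,
                alleles=[allele],
                name=f"netmhcpan_{allele}_len{length}",
            ).predict_from_named_peptides(length_peptides)
        )

# Disjoint allele/length shards combine into one logical NetMHCpan result.
combined = combine_predictions(shards)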
```

`prediction_method_name` is still the logical predictor name (`netmhcpan` in
the example above). The optional `prediction_run_name` column is only
provenance for a particular run or shard. That distinction lets disjoint
NetMHCpan allele/length shards combine into one logical NetMHCpan result,
while overlapping shards with the same `(prediction_method_name, kind,
peptide, allele, sample/source context)` still fail as duplicates.
`to_wide()` drops `prediction_run_name` from the grouping keys, so a named
split run has the same wide shape as a single unsplit run.
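
To make the "run name is provenance, not identity" point concrete, here is a small plain-Python sketch of a long-to-wide pivot. It is only an illustration, not Topiary's `to_wide()`: the function `to_wide_sketch`, the `method_kind` column naming, the `kind` string, and the values are all assumptions.

```python
def to_wide_sketch(rows):
    """Pivot long rows into one dict per (peptide, allele) identity.

    `prediction_run_name` is deliberately left out of the grouping key,
    so it cannot split one logical prediction into several wide rows.
    """
    wide = {}
    for r in rows:
        identity = (r["peptide"], r["allele"])
        column = f'{r["prediction_method_name"]}_{r["kind"]}'  # hypothetical naming
        wide.setdefault(identity, {})[column] = r["value"]
    return wide


# One unsplit run: no run name recorded.
single_run = [
    {"prediction_method_name": "netmhcpan", "kind": "affinity",
     "peptide": "SIINFEKL", "allele": "HLA-A*02:01", "value": 850.0},
    {"prediction_method_name": "netmhcpan", "kind": "affinity",
     "peptide": "SIINFEKL", "allele": "HLA-B*07:02", "value": 42.0},
]

# The same predictions produced by two named shards.
sharded_run = [
    dict(r, prediction_run_name=f'netmhcpan_{r["allele"]}')
    for r in single_run
]

wide_single = to_wide_sketch(single_run)
wide_sharded = to_wide_sketch(sharded_run)
```

Because the run name never enters the grouping key, `wide_single` and `wide_sharded` come out identical in this sketch.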

The helper is intentionally strict. It rejects duplicate
`(prediction_method_name, kind, identity)` rows, and by default requires every
emitted `(prediction_method_name, kind)` group to cover the same peptide/allele
identity grid. This catches incomplete split runs before `to_wide()` can
produce half-populated rows. If you intentionally want a sparse union, pass
`coverage="partial"`; duplicate predictions are still rejected.
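
The coverage rule can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the library's code: `check_coverage` is a hypothetical helper, identity is reduced to `(peptide, allele)`, and the rows are invented.

```python
from collections import defaultdict


def check_coverage(rows, coverage="complete"):
    """Compare the identity grid of every (method, kind) group.

    Returns a dict mapping each group to the identities it is missing.
    With coverage="complete", any gap raises ValueError instead.
    """
    grids = defaultdict(set)
    for r in rows:
        group = (r["prediction_method_name"], r["kind"])
        grids[group].add((r["peptide"], r["allele"]))

    # The full grid is the union of identities seen across all groups.
    full_grid = set().union(*grids.values())
    missing = {
        group: full_grid - ids
        for group, ids in grids.items()
        if full_grid - ids
    }
    if missing and coverage == "complete":
        raise ValueError(f"incomplete identity grid: {missing}")
    return missing


rows = [
    {"prediction_method_name": "netmhcpan", "kind": "affinity",
     "peptide": "SIINFEKL", "allele": "HLA-A*02:01"},
    {"prediction_method_name": "netmhcpan", "kind": "affinity",
     "peptide": "SIINFEKL", "allele": "HLA-B*07:02"},
    # mhcflurry was only run for one allele, so its grid is incomplete.
    {"prediction_method_name": "mhcflurry", "kind": "affinity",
     "peptide": "SIINFEKL", "allele": "HLA-A*02:01"},
]

gaps = check_coverage(rows, coverage="partial")
```

In this sketch, `coverage="partial"` reports the gap without failing, whereas the default `"complete"` mode raises, mirroring the strict-by-default behavior described above.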

The combined result preserves the original rows: use each row's
`prediction_method_name`, `predictor_version`, `kind`, and value/rank columns
to inspect which predictor produced which quantity. Use
`prediction_run_name` only to audit the batch that produced a row, not as a DSL
selector.

Allele aggregation remains part of the ranking DSL: for example,
`Affinity["netmhcpan"].best_value_allele` and
`Presentation["netmhcpan"].best_score_allele` report the allele associated
with the best BA or EL value across the combined allele grid. For predictors
that emit one row per allele, such as NetMHCpan or MHCflurry in single-allele
mode, this is the best per-allele row after all shards are combined. For
MHCflurry presentation in haplotype mode, MHCflurry itself sees the allele set
together and may emit one deconvolved best-allele row; combining independent
single-allele MHCflurry shards is therefore not the same calculation as a
direct haplotype-mode MHCflurry run. If you intentionally combine haplotype
presentation rows with per-allele rows, use `coverage="partial"` because those
kinds have different identity grids by construction. Processing-only
quantities that do not depend on allele should be read directly rather than
through `best_*`.
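
As a rough picture of per-allele aggregation after shards are combined, the sketch below picks the allele with the best value for one peptide. It is not the ranking DSL: `best_allele` is a hypothetical helper, the `kind` strings and numbers are invented, and the direction convention (lower is better for raw affinity, higher is better for presentation scores) is taken from the sort defaults described earlier in this document.

```python
def best_allele(rows, peptide, method, kind, lower_is_better):
    """Return the allele whose value is best for one peptide/method/kind."""
    candidates = [
        r for r in rows
        if r["peptide"] == peptide
        and r["prediction_method_name"] == method
        and r["kind"] == kind
    ]
    if not candidates:
        return None
    pick = min if lower_is_better else max
    return pick(candidates, key=lambda r: r["value"])["allele"]


# Combined per-allele rows (values invented for illustration).
combined_rows = [
    {"prediction_method_name": "netmhcpan", "kind": "affinity",
     "peptide": "SIINFEKL", "allele": "HLA-A*02:01", "value": 850.0},
    {"prediction_method_name": "netmhcpan", "kind": "affinity",
     "peptide": "SIINFEKL", "allele": "HLA-B*07:02", "value": 42.0},
    {"prediction_method_name": "netmhcpan", "kind": "presentation",
     "peptide": "SIINFEKL", "allele": "HLA-A*02:01", "value": 0.91},
    {"prediction_method_name": "netmhcpan", "kind": "presentation",
     "peptide": "SIINFEKL", "allele": "HLA-B*07:02", "value": 0.35},
]

affinity_allele = best_allele(
    combined_rows, "SIINFEKL", "netmhcpan", "affinity", lower_is_better=True
)
presentation_allele = best_allele(
    combined_rows, "SIINFEKL", "netmhcpan", "presentation", lower_is_better=False
)
```

Note that this only models the one-row-per-allele case; a haplotype-mode row that already carries a deconvolved best allele would not go through this kind of per-allele selection.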

## Putting it together

```python