22 commits
355cd3f
chore(docs): add EAT annotation-transfer design spec + backend implem…
tsenoner Jun 11, 2026
70881d7
chore(protlabel): scaffold EAT engine package + scipy dep
tsenoner Jun 11, 2026
ee482ba
feat(protlabel): goPredSim reliability index transform
tsenoner Jun 11, 2026
4e99e8d
feat(protlabel): chunked brute-force kNN backend
tsenoner Jun 11, 2026
d494242
fix(protlabel): bound kNN per-chunk memory adaptively; guard k>=1
tsenoner Jun 11, 2026
c07aef5
feat(protlabel): kNN label transfer with reliability index
tsenoner Jun 11, 2026
4b39cb8
test(protlabel): document RI tie-break and cover nearest-source selec…
tsenoner Jun 11, 2026
796e5b1
feat(protlabel): persistable Lookup sidecar + public API
tsenoner Jun 11, 2026
ae7fcc2
feat: query/reference classifier for annotation transfer
tsenoner Jun 11, 2026
bc8837e
test: cover neither-match exclusion and multi-prefix OR in classifier
tsenoner Jun 11, 2026
94b4f0f
feat: build per-cell prediction overlay columns
tsenoner Jun 11, 2026
05194bf
test: cover empty-predictions and unknown-id overlay edge cases
tsenoner Jun 11, 2026
5093f66
feat: replace annotations part of a parquetbundle in place
tsenoner Jun 11, 2026
c9cae3f
feat: add 'protspace transfer' annotation-transfer subcommand
tsenoner Jun 11, 2026
c708f90
fix(transfer): handle protein_id id column in real bundles; clearer e…
tsenoner Jun 11, 2026
0ee1354
docs: document protspace transfer + prediction overlay columns
tsenoner Jun 11, 2026
21d508c
docs: correct transfer --metric options (euclidean, cosine only)
tsenoner Jun 11, 2026
a05e977
feat(transfer): warn on zero transfers; validate --metric/--k early
tsenoner Jun 11, 2026
98b42f6
chore(docs): remove EAT build plan + superseded draft; keep design spec
tsenoner Jun 12, 2026
9da7f4d
fix(transfer): address review findings — atomicity, precision, securi…
tsenoner Jun 16, 2026
f7186f5
refactor(transfer): drop __pred_source overlay column; keep numeric c…
tsenoner Jun 16, 2026
72fa7b7
Merge branch 'main' into feat/eat-transfer-backend
tsenoner Jun 16, 2026
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -49,6 +49,7 @@ Single entry point: `protspace = protspace.cli.app:app`
| `protspace bundle` | Combine projections + annotations → .parquetbundle |
| `protspace serve` | Launch Dash web frontend |
| `protspace style` | Add annotation colors/styles |
| `protspace transfer` | Fill missing annotations from nearest reference embeddings (EAT) |

### protspace prepare Usage

19 changes: 19 additions & 0 deletions docs/annotations.md
@@ -183,3 +183,22 @@ Per-protein predictions from the [Biocentral API](https://biocentral.rostlab.org
| Pfam clans | `~/.cache/protspace/pfam_clans/` | 30 days | Pfam family → clan mapping |

The `default` group only requires the UniProt REST API (+ ExPASy for EC names). For `--keep-tmp` annotation caching, see [CLI Reference](cli.md#annotation-caching---keep-tmp).

## Prediction Overlay Columns (EAT Transfer)

Running `protspace transfer` appends two new columns to the bundle's annotations table for each requested column `COL`. The curated `COL` column is never modified.

| Column | Type | Meaning |
| --- | --- | --- |
| `COL__pred_value` | string | The transferred label from the nearest annotated reference protein |
| `COL__pred_confidence` | float | Reliability index in [0, 1] — 1 = identical embeddings (formula depends on `--metric`/`--k`, see below) |

A protein is considered "predicted" for `COL` when `COL` is empty but `COL__pred_value` is present. Use `COL__pred_confidence` to threshold low-reliability transfers.
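
The "predicted" rule above can be sketched in a few lines of Python. This is only an illustration, not part of protspace: it assumes the annotations table has already been loaded as a list of dictionaries, that an empty curated value is stored as `None` or an empty string, and the helper name `select_predictions`, the example IDs, and the 0.5 cutoff are all hypothetical.

```python
def select_predictions(rows, col, min_confidence):
    """Return rows where `col` is empty but a transferred label meets the cutoff."""
    value_key = f"{col}__pred_value"
    conf_key = f"{col}__pred_confidence"
    selected = []
    for row in rows:
        curated = row.get(col)
        predicted = row.get(value_key)
        confidence = row.get(conf_key)
        # "Predicted" means: curated value empty, transferred value present.
        is_predicted = not curated and bool(predicted)
        if is_predicted and confidence is not None and confidence >= min_confidence:
            selected.append(row)
    return selected


# Hypothetical rows; how empties are encoded on disk is an assumption.
rows = [
    {"protein_id": "P1", "protein_category": "neurotoxin",
     "protein_category__pred_value": None,
     "protein_category__pred_confidence": None},
    {"protein_id": "TRINITY_1", "protein_category": "",
     "protein_category__pred_value": "neurotoxin",
     "protein_category__pred_confidence": 0.83},
    {"protein_id": "TRINITY_2", "protein_category": None,
     "protein_category__pred_value": "enzyme",
     "protein_category__pred_confidence": 0.31},
]

kept = select_predictions(rows, "protein_category", min_confidence=0.5)
kept_ids = [row["protein_id"] for row in kept]
print(kept_ids)
```

A suitable cutoff depends on the metric and `--k` used for the transfer, so the value here should not be read as a recommendation.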

The reliability index depends on the `--metric` and `--k` used during transfer:

- **Default (`--metric euclidean`, `--k 1`):** `0.5 / (0.5 + distance)`.

[Review comment from a collaborator on the line above] default should be cosine.

- **`--metric cosine` (`--k 1`):** `clamp(1 - cosine_distance, 0, 1)`, where `cosine_distance` is in [0, 2].
- **`--k > 1`:** the goPredSim mean reliability, `(1/m) · Σ s(d)`: the sum of the per-neighbour similarity `s(d)` (the euclidean or cosine form above) over the `k` nearest neighbours that carry the chosen label, divided by `m = min(k, number of references)`. Because of this normalization, values are **not** comparable across different `--k` settings.
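
As a quick check of the two `--k 1` forms listed above, they can be written as plain Python functions. This is a minimal sketch of the formulas as documented here, not the protspace implementation; the function names are hypothetical.

```python
def euclidean_reliability(distance):
    """0.5 / (0.5 + distance), the documented euclidean form."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 0.5 / (0.5 + distance)


def cosine_reliability(cosine_distance):
    """clamp(1 - cosine_distance, 0, 1) for cosine_distance in [0, 2]."""
    return min(1.0, max(0.0, 1.0 - cosine_distance))


# Identical embeddings give 1.0 under both forms.
print(euclidean_reliability(0.0), cosine_reliability(0.0))
# Euclidean distance 0.5 gives 0.5; cosine distance >= 1 is clamped to 0.
print(euclidean_reliability(0.5), cosine_reliability(1.5))
```

Note that the two forms are on different scales: the euclidean form only approaches 0 as the distance grows, while the cosine form reaches 0 at a cosine distance of 1.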

See [`protspace transfer`](cli.md#protspace-transfer) for usage and option details.
38 changes: 38 additions & 0 deletions docs/cli.md
@@ -9,6 +9,7 @@
| `protspace bundle` | Combine projections + annotations into .parquetbundle |
| `protspace serve` | Launch interactive Dash web frontend |
| `protspace style` | Add/inspect annotation styles in existing files |
| `protspace transfer` | Fill missing annotations from nearest reference embeddings (EAT) |

Run `protspace <command> -h` for detailed help.

@@ -183,6 +184,43 @@ protspace style input.parquetbundle output.parquetbundle --annotation-styles sty
```bash
protspace style data.parquetbundle --dump-settings
```

## `protspace transfer`

Embedding Annotation Transfer (EAT) fills missing annotation values for query proteins by transferring the annotation of the nearest annotated reference protein in pLM embedding space.

For each query protein that lacks a value in the requested annotation column, the command finds the closest reference and assigns that reference's label along with a reliability index adapted from goPredSim, a score in [0, 1] where 1 means identical embeddings. Distance is measured in the original high-dimensional embedding space (Euclidean by default, or cosine via `--metric`), not in the 2-D/3-D projection.

The curated source column (`COL`) is left untouched; results are written as two new columns: `COL__pred_value` (string) and `COL__pred_confidence` (float).

The method is a direct application of the approach introduced by Littmann et al., Sci Rep 2021 ([DOI 10.1038/s41598-020-80786-0](https://doi.org/10.1038/s41598-020-80786-0)) and extended by Heinzinger et al., NAR Genom Bioinform 2022 ([DOI 10.1093/nargab/lqac043](https://doi.org/10.1093/nargab/lqac043)).

**Reliability index (`COL__pred_confidence`).** The exact form depends on `--metric` and `--k`:

- **Default (`--metric euclidean`, `--k 1`):** `confidence = 0.5 / (0.5 + distance)` (1 at distance 0, 0.5 at distance 0.5, → 0 as distance → ∞).
- **`--metric cosine` (`--k 1`):** `confidence = clamp(1 - cosine_distance, 0, 1)`, where `cosine_distance` is in [0, 2].
- **`--k > 1`:** the value is the goPredSim mean reliability — `(1/m) · Σ s(d)`, the sum of the per-neighbour similarity `s(d)` (the euclidean or cosine form above) over the `k` nearest neighbours that carry the chosen label, divided by `m = min(k, number of references)`. Because of this normalization, confidence values are **not** comparable across different `--k` settings.
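
The `--k > 1` normalization is easier to see with numbers. The sketch below follows the formula as written in the list above; it is not the protspace code. Two points are assumptions made for illustration: the "chosen label" is taken to be the one with the highest score, and ties fall to whichever label appears first among the neighbours. The function names are hypothetical.

```python
from collections import defaultdict


def euclidean_similarity(distance):
    """Per-neighbour similarity s(d) in the documented euclidean form."""
    return 0.5 / (0.5 + distance)


def mean_reliability(neighbours, k, n_references, similarity):
    """Score labels by (1/m) * sum of s(d) over the k nearest neighbours.

    `neighbours` is a list of (label, distance) pairs sorted nearest first.
    Picking the highest-scoring label is an assumption of this sketch.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    m = min(k, n_references)
    sums = defaultdict(float)
    for label, distance in neighbours[:k]:
        sums[label] += similarity(distance)
    scores = {label: total / m for label, total in sums.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical neighbours: label "A" at distances 0.0 and 1.5, "B" at 0.5.
neighbours = [("A", 0.0), ("B", 0.5), ("A", 1.5)]
label_k3, conf_k3 = mean_reliability(neighbours, 3, 10, euclidean_similarity)
label_k1, conf_k1 = mean_reliability(neighbours, 1, 10, euclidean_similarity)
print(label_k3, conf_k3)
print(label_k1, conf_k1)
```

With `k = 3` the score for "A" is (1.0 + 0.25) / 3, while with `k = 1` the same query scores 1.0, which illustrates why values from different `--k` settings should not be compared directly.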

```bash
protspace transfer \
-b results.parquetbundle \
-e embeddings.h5:prot_t5 \
-t protein_category \
-o results.parquetbundle \
--query-id-prefix TRINITY_ \
--reference-where 'protein_category~neurotoxin'
```

**Key options:**

| Flag | Description | Default |
| ---- | ----------- | ------- |
| `-b, --bundle` | Input `.parquetbundle` file | — |
| `-e, --embeddings` | HDF5 embeddings file (use `:name` suffix for external files) | — |
| `-t, --transfer` | Annotation column to transfer (repeatable) | — |
| `-o, --output` | Output `.parquetbundle` (may overwrite input) | — |
| `--query-id-prefix` | Restrict query proteins to IDs starting with this prefix | — |
| `--query-where` | Filter query proteins by annotation value (`col~substr`) | — |
| `--reference-id-prefix` | Restrict reference proteins to IDs starting with this prefix | — |
| `--reference-where` | Filter reference proteins by annotation value (`col~substr`) | — |
| `--k` | Number of nearest neighbours | `1` |
| `--metric` | Distance metric (`euclidean`, `cosine`); see the reliability-index forms above | `euclidean` |

Distances are computed in the original embedding space (HDF5), not in the 2-D/3-D projection. The `--metric` choice also changes how `COL__pred_confidence` is computed: euclidean uses `0.5 / (0.5 + distance)`, while cosine uses `clamp(1 - cosine_distance, 0, 1)` (see the reliability-index note above).
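
To illustrate why the metric choice matters, here is a small brute-force nearest-reference search in pure Python. It is a sketch only: the actual backend in this PR is described as a chunked brute-force kNN, and nothing here reflects its real API. The function names, reference IDs, and toy 2-D vectors are hypothetical.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # Assumes neither vector has zero norm (division by zero otherwise).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_reference(query, references, distance):
    """Return (reference_id, distance) of the closest reference vector."""
    candidates = ((ref_id, distance(query, vec)) for ref_id, vec in references.items())
    return min(candidates, key=lambda pair: pair[1])


# Toy vectors: ref_a points the same way as the query but is much longer;
# ref_b is close in space but points in a different direction.
query = [1.0, 0.0]
references = {"ref_a": [10.0, 0.0], "ref_b": [1.0, 1.0]}

by_euclidean = nearest_reference(query, references, euclidean_distance)
by_cosine = nearest_reference(query, references, cosine_distance)
print(by_euclidean)
print(by_cosine)
```

In this toy case the euclidean metric picks `ref_b` and the cosine metric picks `ref_a`, so the transferred label as well as the confidence can change with `--metric`.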

## Combining Multiple Inputs (`-i`)

When multiple `-i` inputs are provided, behavior depends on whether they share the same embedding name: