Improvements for Unique Peptide Finder support#101

Draft
pverscha wants to merge 22 commits into develop from feature/unique-peptide-changes

@pverscha

I've recently been working on a new prototype of a Unique Peptide Finder tool. This tool is still under active development, but will be a good addition to the Unipept Web application when it's ready.

In order for this new tool to work efficiently, I've had to implement a few new endpoints for the API and had to make some changes to pre-existing endpoints. In addition to new functionality, I was also able to drastically speed up the pept2taxa and pept2lca endpoints. This PR bundles all these changes.

Important

This PR depends on changes to the unipept-index package, which are published in a separate PR. Make sure to merge that PR first, before deploying this one. Note that, once it is merged, we also need to remove the branch = key from the index package dependency in the Cargo.toml file.

New endpoints added

private_api/taxa/unique_peptides

  • Adds a new GET/POST endpoint at /private_api/taxa/unique_peptides (and .json) that computes taxon-unique peptides for a given strain or species.
  • Adds get_proteins_for_taxon() to the database crate: an exact term-query on taxon_id with search_after pagination to correctly handle taxa with more than 10,000 proteins.
  • Adds fancy-regex dependency to support lookahead assertions in user-supplied cleavage patterns (the standard regex crate does not support lookaheads).
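The search_after pagination used by get_proteins_for_taxon() can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `fetch_all` and the `fetch_page` callback are hypothetical names, and the sort key is assumed to be a plain `u64`. The callback stands in for one query against the protein store that returns hits sorted by key, starting strictly after the given cursor.

```rust
/// Collects every hit by repeatedly asking for the page that follows the
/// last sort key seen so far (the idea behind search_after pagination).
///
/// `fetch_page(after, size)` is a hypothetical callback: it must return at
/// most `size` hits, sorted by key, whose key is greater than `after`.
fn fetch_all<T, F>(page_size: usize, mut fetch_page: F) -> Vec<T>
where
    F: FnMut(Option<u64>, usize) -> Vec<(u64, T)>,
{
    let mut all = Vec::new();
    let mut cursor: Option<u64> = None;
    loop {
        let page = fetch_page(cursor, page_size);
        let fetched = page.len();
        // Remember the sort key of the last hit; it becomes the next cursor.
        if let Some((last_key, _)) = page.last() {
            cursor = Some(*last_key);
        }
        all.extend(page.into_iter().map(|(_, item)| item));
        // An empty or short page means there is nothing left to fetch.
        if fetched == 0 || fetched < page_size {
            break;
        }
    }
    all
}
```

Because each request continues from the last key instead of using an offset, the loop is not bound by a fixed result-window size, which is what makes taxa with more than 10,000 proteins manageable.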

Request parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| taxon_id | u32 | yes | | NCBI taxon ID; must be at species or strain rank |
| cleavage_regex | string | no | `[KR](?!P)` | Regex matching cleavage sites; split occurs after each match (standard tryptic convention) |
| min_length | usize | no | 5 | Minimum peptide length in amino acids |
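To make the digestion semantics concrete, here is a small std-only sketch of the default behaviour. It hard-codes the default pattern `[KR](?!P)` (split after K or R unless the next residue is P) instead of going through fancy-regex, so it does not cover user-supplied patterns. The function names `tryptic_digest` and `digest_proteins` are illustrative and are not the helpers in api/src/helpers/digestion.rs.

```rust
use std::collections::BTreeSet;

/// Splits `protein` after every K or R that is not followed by P,
/// i.e. the default `[KR](?!P)` rule written out by hand.
fn tryptic_digest(protein: &str) -> Vec<&str> {
    let bytes = protein.as_bytes();
    let mut peptides = Vec::new();
    let mut start = 0;
    for (i, &b) in bytes.iter().enumerate() {
        let is_site = b == b'K' || b == b'R';
        let next_is_proline = bytes.get(i + 1) == Some(&b'P');
        if is_site && !next_is_proline {
            // The cleavage residue stays at the end of the peptide.
            peptides.push(&protein[start..=i]);
            start = i + 1;
        }
    }
    // Whatever follows the last cleavage site is the final peptide.
    if start < protein.len() {
        peptides.push(&protein[start..]);
    }
    peptides
}

/// Digests all proteins, drops peptides shorter than `min_length`,
/// and deduplicates the remainder.
fn digest_proteins(proteins: &[&str], min_length: usize) -> BTreeSet<String> {
    proteins
        .iter()
        .flat_map(|p| tryptic_digest(p))
        .filter(|pep| pep.len() >= min_length)
        .map(str::to_owned)
        .collect()
}
```

Under these assumptions, the size of the set returned by `digest_proteins` corresponds to what the response reports as total_peptides.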

Example response

{
  "unique_peptides": ["AAFEDLQSLQDK", "NLFVAKNLR"],
  "total_peptides": 4821,
  "total_unique_peptides": 312
}

total_peptides is the count of deduplicated peptides after digestion and length filtering. total_unique_peptides equals unique_peptides.length.

Error cases:

  • 400 — invalid cleavage_regex
  • 400 — taxon_id not found in the taxon store
  • 400 — taxon_id is not at species or strain rank (message includes the actual rank)
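The rank check behind the last error case could look something like the sketch below. It assumes ranks are available as lowercase strings such as "species" and "strain"; the function name `check_rank` and the exact error wording are made up for illustration and may differ from what the endpoint returns.

```rust
/// Accepts only species or strain rank; otherwise returns an error
/// message that includes the actual rank (hypothetical wording).
fn check_rank(taxon_id: u32, rank: &str) -> Result<(), String> {
    match rank {
        "species" | "strain" => Ok(()),
        other => Err(format!(
            "taxon {} has rank '{}', expected species or strain",
            taxon_id, other
        )),
    }
}
```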

private_api/taxa/shared_peptides

  • Adds a new GET/POST endpoint at /private_api/taxa/shared_peptides (and .json) that computes peptides shared across all provided taxa.
  • Reuses get_proteins_for_taxon() from the database crate (added for the unique_peptides endpoint) and the shared digestion helpers in api/src/helpers/digestion.rs.

Request parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| taxon_ids | u32[] | yes | | List of NCBI taxon IDs; each must be at species or strain rank |
| cleavage_regex | string | no | `[KR](?!P)` | Regex matching cleavage sites; split occurs after each match (standard tryptic convention) |
| min_length | usize | no | 5 | Minimum peptide length in amino acids |

Example response

{
  "shared_peptides": ["AAFEDLQSLQDK", "NLFVAKNLR"]
}

shared_peptides contains deduplicated peptides (after digestion and length filtering) that are present in at least one protein of every provided taxon. The list is sorted lexicographically. If taxon_ids is empty, an empty list is returned.
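The semantics described above amount to a set intersection over the per-taxon peptide sets. A minimal sketch, assuming each taxon has already been digested into a deduplicated set (the function name and signature are illustrative, not the endpoint's actual code):

```rust
use std::collections::BTreeSet;

/// Returns the peptides present in every taxon's set, sorted
/// lexicographically. An empty input yields an empty list.
fn shared_peptides(per_taxon: &[BTreeSet<String>]) -> Vec<String> {
    let mut sets = per_taxon.iter();
    let first = match sets.next() {
        Some(set) => set.clone(),
        None => return Vec::new(),
    };
    // Narrow the candidate set down with each additional taxon.
    let shared = sets.fold(first, |acc, set| {
        acc.intersection(set).cloned().collect()
    });
    // A BTreeSet iterates in sorted order, so no extra sort is needed.
    shared.into_iter().collect()
}
```

Using a BTreeSet keeps the result ordered for free; a HashSet followed by an explicit sort would work equally well.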

Error cases:

  • 400 — invalid cleavage_regex
  • 400 — any taxon_id not found in the taxon store
  • 400 — any taxon_id is not at species or strain rank (message includes the actual rank)

Changes made to existing endpoints

private_api/taxa

Adds a new parameter report_protein_count that reports the number of proteins associated with each of the provided taxa. If a taxon is provided at a rank above species or strain, its protein count equals the sum of the protein counts of all its descendant taxa.

Example request:

{
    "taxids": [4751],
    "report_protein_count": true
}

Example response:

[
    {
        "id": 4751,
        "name": "Fungi",
        "rank": "kingdom",
        "lineage": [
            2759,
            null,
            4751,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null,
            null
        ],
        "protein_count": 38050
    }
]
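The aggregation behind report_protein_count can be pictured as a walk over the taxon tree. The sketch below assumes two hypothetical lookup tables (child taxa per taxon, and proteins annotated directly to a taxon) and also counts proteins attached to the queried taxon itself, which is an assumption; it is not how the counts are actually stored or computed in this PR.

```rust
use std::collections::HashMap;

/// Sums the direct protein counts of `taxon` and all of its descendants.
/// `children` and `direct` are hypothetical lookup tables; the taxonomy
/// is assumed to be a tree (no cycles).
fn protein_count(
    taxon: u32,
    children: &HashMap<u32, Vec<u32>>,
    direct: &HashMap<u32, u64>,
) -> u64 {
    let mut total = 0;
    let mut stack = vec![taxon];
    // Iterative depth-first walk, so deep lineages cannot overflow the stack.
    while let Some(id) = stack.pop() {
        total += direct.get(&id).copied().unwrap_or(0);
        if let Some(kids) = children.get(&id) {
            stack.extend(kids.iter().copied());
        }
    }
    total
}
```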

api/v2/pept2taxa

Improved performance by switching from the analyse index function to the more lightweight analyse_taxa (see PR: unipept/unipept-index#36).

api/v2/pept2lca

Improved performance by switching from the analyse index function to the more lightweight analyse_taxa (see PR: unipept/unipept-index#36).
