Improvements for Unique Peptide Finder support #101
Draft
pverscha wants to merge 22 commits into
This was referenced Jun 10, 2026
I've recently been working on a new prototype of a Unique Peptide Finder tool. This tool is still under active development, but will be a good addition to the Unipept Web application when it's ready.
For this new tool to work efficiently, I had to implement a few new API endpoints and change some pre-existing ones. In addition to the new functionality, I was also able to drastically speed up the `pept2taxa` and `pept2lca` endpoints. This PR bundles all these changes.

> [!IMPORTANT]
> This PR also required changes to the unipept-index package, which have been published in a separate PR. Make sure to merge that PR first, before deploying this one. Note that we also need to remove the `branch =` reference from the index package reference in the `Cargo.toml` file.

## New endpoints added
### `private_api/taxa/unique_peptides`

- Adds a `GET`/`POST` endpoint at `/private_api/taxa/unique_peptides` (and `.json`) that computes taxon-unique peptides for a given strain or species.
- Adds `get_proteins_for_taxon()` to the `database` crate: an exact `term` query on `taxon_id` with `search_after` pagination to correctly handle taxa with more than 10,000 proteins.
- Adds the `fancy-regex` dependency to support lookahead assertions in user-supplied cleavage patterns (the standard `regex` crate does not support lookaheads).

#### Request parameters

| Parameter | Type | Default |
| --- | --- | --- |
| `taxon_id` | `u32` | |
| `cleavage_regex` | `string` | `[KR](?!P)` |
| `min_length` | `usize` | `5` |

#### Example response

```json
{
  "unique_peptides": ["AAFEDLQSLQDK", "NLFVAKNLR"],
  "total_peptides": 4821,
  "total_unique_peptides": 312
}
```

`total_peptides` is the count of deduplicated peptides after digestion and length filtering. `total_unique_peptides` equals `unique_peptides.length`, so the two-element list in the example should be read as abbreviated.

Error cases:
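To make the digestion and length-filtering step behind `total_peptides` concrete, here is a minimal sketch. It is not the code from this PR: the function names `digest` and `peptides_for_proteins` are hypothetical, the cleavage rule is hard-coded to mirror the default pattern `[KR](?!P)` instead of using `fancy-regex`, and it assumes that `min_length` is an inclusive lower bound.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: splits `protein` after every K or R that is not
/// followed by P, mirroring the default pattern `[KR](?!P)` without a regex.
fn digest(protein: &str) -> Vec<&str> {
    let bytes = protein.as_bytes();
    let mut peptides = Vec::new();
    let mut start = 0;
    for i in 0..bytes.len() {
        let is_site = bytes[i] == b'K' || bytes[i] == b'R';
        let next_is_p = bytes.get(i + 1) == Some(&b'P');
        if is_site && !next_is_p {
            peptides.push(&protein[start..=i]);
            start = i + 1;
        }
    }
    // Keep the trailing fragment that does not end in a cleavage site.
    if start < bytes.len() {
        peptides.push(&protein[start..]);
    }
    peptides
}

/// Hypothetical helper: digests all proteins, drops peptides shorter than
/// `min_length` (assumed inclusive), and deduplicates the result.
fn peptides_for_proteins(proteins: &[&str], min_length: usize) -> BTreeSet<String> {
    proteins
        .iter()
        .flat_map(|p| digest(*p))
        .filter(|pep| pep.len() >= min_length)
        .map(|pep| pep.to_string())
        .collect()
}

fn main() {
    // Made-up sequences, used only for illustration.
    let proteins = ["MKPAAFEDLQSLQDKNLR", "AAFEDLQSLQDKGGGGGR", "GGGGGRTTT"];
    let peptides = peptides_for_proteins(&proteins, 5);
    for peptide in &peptides {
        println!("{peptide}");
    }
    println!("total_peptides = {}", peptides.len());
}
```

A set is used here so that a peptide occurring in several proteins is counted once, which matches the "deduplicated" wording above.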
### `private_api/taxa/shared_peptides`

- Adds a `GET`/`POST` endpoint at `/private_api/taxa/shared_peptides` (and `.json`) that computes peptides shared across all provided taxa.
- Uses `get_proteins_for_taxon()` from the `database` crate (added for the `unique_peptides` endpoint) and the shared digestion helpers in `api/src/helpers/digestion.rs`.

#### Request parameters

| Parameter | Type | Default |
| --- | --- | --- |
| `taxon_ids` | `u32[]` | |
| `cleavage_regex` | `string` | `[KR](?!P)` |
| `min_length` | `usize` | `5` |

#### Example response

```json
{
  "shared_peptides": ["AAFEDLQSLQDK", "NLFVAKNLR"]
}
```

`shared_peptides` contains deduplicated peptides (after digestion and length filtering) that are present in at least one protein of every provided taxon. The list is sorted lexicographically. If `taxon_ids` is empty, an empty list is returned.

Error cases:
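The intersection semantics described for `shared_peptides` can be sketched as follows. This is an illustration under assumptions, not the implementation in this PR: the function name and its input (one already digested and filtered peptide set per taxon) are hypothetical.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns the peptides present in every per-taxon set.
/// The result is sorted lexicographically because a BTreeSet iterates in order.
/// An empty input yields an empty list.
fn shared_peptides(per_taxon: &[BTreeSet<String>]) -> Vec<String> {
    let mut sets = per_taxon.iter();
    let mut shared = match sets.next() {
        Some(first) => first.clone(),
        None => return Vec::new(),
    };
    for set in sets {
        // Keep only the peptides that also occur in this taxon.
        shared.retain(|peptide| set.contains(peptide));
    }
    shared.into_iter().collect()
}

fn main() {
    // Made-up peptide sets for two taxa.
    let taxon_a: BTreeSet<String> =
        ["NLFVAK", "AAFEDLQSLQDK", "GGGGGR"].iter().map(|s| s.to_string()).collect();
    let taxon_b: BTreeSet<String> =
        ["AAFEDLQSLQDK", "TTTTTK", "NLFVAK"].iter().map(|s| s.to_string()).collect();
    println!("{:?}", shared_peptides(&[taxon_a, taxon_b]));
}
```

Starting from the first set and shrinking it keeps memory bounded by the size of one taxon's peptide set, at the cost of cloning that set once.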
## Changes made to existing endpoints

### `private_api/taxa`

Adds a new parameter `report_protein_count` that reports the number of proteins associated with the provided taxa. If a taxon is provided at a rank higher than species or strain, the protein count equals the sum of the protein counts of all its descendant taxa.

Example request:

```json
{ "taxids": [4751], "report_protein_count": true }
```

Example response:

```json
[
  {
    "id": 4751,
    "name": "Fungi",
    "rank": "kingdom",
    "lineage": [2759, null, 4751, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null],
    "protein_count": 38050
  }
]
```

### `api/v2/pept2taxa`

Improved performance by switching to the more lightweight `analyse_taxa` index function instead of `analyse` (see PR: unipept/unipept-index#36).

### `api/v2/pept2lca`

Improved performance by switching to the more lightweight `analyse_taxa` index function instead of `analyse` (see PR: unipept/unipept-index#36).
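Returning to the `report_protein_count` parameter of `private_api/taxa`: the aggregation rule for higher-ranked taxa can be illustrated with a small sketch. The `Taxonomy` struct, its fields, and the taxon IDs below are hypothetical and only stand in for whatever storage the API actually uses. The sketch also assumes that proteins annotated directly to the requested taxon are included in the sum.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory taxonomy used only for this illustration.
struct Taxonomy {
    /// Maps a taxon ID to the IDs of its direct children.
    children: HashMap<u32, Vec<u32>>,
    /// Maps a taxon ID to the number of proteins annotated directly to it.
    direct_counts: HashMap<u32, u64>,
}

impl Taxonomy {
    /// Sums the direct protein counts of `taxon_id` and all of its descendants.
    fn protein_count(&self, taxon_id: u32) -> u64 {
        let mut total = 0;
        let mut stack = vec![taxon_id];
        // Iterative depth-first walk, so deep lineages cannot overflow the call stack.
        while let Some(id) = stack.pop() {
            total += self.direct_counts.get(&id).copied().unwrap_or(0);
            if let Some(kids) = self.children.get(&id) {
                stack.extend(kids.iter().copied());
            }
        }
        total
    }
}

fn main() {
    // Made-up tree: 1 -> {2, 3}, 2 -> {4}.
    let taxonomy = Taxonomy {
        children: HashMap::from([(1, vec![2, 3]), (2, vec![4])]),
        direct_counts: HashMap::from([(2, 1), (3, 10), (4, 5)]),
    };
    println!("protein_count(1) = {}", taxonomy.protein_count(1));
    println!("protein_count(4) = {}", taxonomy.protein_count(4));
}
```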