This repository contains the API used to manage the PanGBank database, which stores collections of pangenomes built with PPanGGOLiN.
The API is built with FastAPI and uses SQLModel as its ORM.
It provides a RESTful interface for querying and exploring pangenome collections. Alongside the API, a command-line tool pangbank_db is included to manage the database.
PanGBank-api is organized into two main components:
- Core package: Database models, CRUD operations, and CLI tools (pangbank_db)
- API server: FastAPI-based REST API (optional)
For database management and CLI tools without the API server:

    pip install pangbank-api

This installs:
- Database models (pangbank_api.models)
- Database utilities (pangbank_api.database, pangbank_api.config)
- CRUD operations (pangbank_api.crud)
- CLI tool pangbank_db for database management
For running the REST API server:

    pip install pangbank-api[fastapi]

This additionally installs:
- FastAPI framework
- API routers (pangbank_api.routers)
- API server (pangbank_api.main)
- Clone the repository:

      git clone https://github.com/labgem/PanGBank-api.git
      cd PanGBank-api

- Create a virtual environment and install with FastAPI:

      python -m venv venv
      source venv/bin/activate
      pip install .[fastapi]

- Run the API in development mode:

      export PANGBANK_DB_PATH="<path/to/database.sqlite>"
      export PANGBANK_DATA_DIR="<path/to/pangenome_directory>"
      fastapi dev pangbank_api/main.py

PANGBANK_DB_PATH is the path to your SQLite database file. PANGBANK_DATA_DIR is the root directory containing your pangenome data and mash files.
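
As an illustration of how the two environment variables might be consumed, here is a minimal sketch that reads them and fails early when one is missing. The helper name load_pangbank_paths is hypothetical and is not part of pangbank_api; the package's own config module may handle this differently.

```python
import os
from pathlib import Path


def load_pangbank_paths(env=None):
    """Return (db_path, data_dir) read from the two PanGBank environment variables.

    Hypothetical helper for illustration only; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    required = ("PANGBANK_DB_PATH", "PANGBANK_DATA_DIR")
    # Collect every missing or empty variable so the error names them all at once.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return Path(env["PANGBANK_DB_PATH"]), Path(env["PANGBANK_DATA_DIR"])
```

Passing a plain dictionary instead of os.environ makes the check easy to exercise without touching the real environment.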
All CLI commands require the PANGBANK_DB_PATH environment variable to be set.

    export PANGBANK_DB_PATH="<path/to/database.sqlite>"

To add a new collection of pangenomes to the database, use:

    pangbank_db add-collection-release <collection_release.json>

Note: This command requires two environment variables:

    export PANGBANK_DB_PATH="<path/to/database.sqlite>"
    export PANGBANK_DATA_DIR="<root/path/serving/pangenomes>"

JSON Schema Example: an example JSON file is given at the end of this document.
The genome_metadata field (optional) allows you to load genome metadata and quality metrics into the Genome table during collection import. The TSV file should have:
- First column: genomes (genome names matching those in your genome_sources)
- Other columns: Any of the following supported quality metrics:
  - strain: Strain identifier
  - organism_name: Organism name
  - ncbi_genome_category: NCBI genome category
  - genome_category: Custom genome category
  - checkm2_completeness: CheckM2 completeness (%)
  - checkm2_contamination: CheckM2 contamination (%)
  - checkm2_model: CheckM2 model used
  - checkm_completeness: CheckM completeness (%)
  - checkm_contamination: CheckM contamination (%)
  - checkm_strain_heterogeneity: CheckM strain heterogeneity (%)
  - gc_count: GC base count
  - gc_percentage: GC percentage
  - genome_size: Total genome size (bp)
  - l50_contigs: L50 contigs statistic
  - n50_contigs: N50 contigs statistic
The system automatically handles type conversion (str, int, float) based on the Genome model field types. Only optional fields are updated - required fields like name are protected from modification.
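
The type conversion described above can be pictured with a minimal sketch. FIELD_TYPES below is an assumed, partial stand-in for the real Genome model's optional fields, and parse_genome_metadata is a hypothetical function, not the package's loader.

```python
import csv
import io

# Assumed subset of optional Genome fields and their Python types (illustrative only).
FIELD_TYPES = {
    "strain": str,
    "checkm2_completeness": float,
    "checkm2_contamination": float,
    "genome_size": int,
    "gc_percentage": float,
}


def parse_genome_metadata(tsv_text):
    """Map genome name -> {field: typed value}, ignoring unknown columns and empty cells."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    metadata = {}
    for row in reader:
        values = {}
        for column, raw in row.items():
            # Only known optional fields are kept; "genomes" is the key, not a metric.
            if column in FIELD_TYPES and raw not in (None, ""):
                values[column] = FIELD_TYPES[column](raw)
        metadata[row["genomes"]] = values
    return metadata
```

Because required fields such as name are simply absent from the type map, they can never be overwritten by this kind of loader.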
Important: Quality metrics are immutable by default once set. If an import supplies a different value for an existing quality metric:
- During add-collection-release: New values are accepted with a warning (initial import allows overwrites)
- During add-quality-metrics: Command fails with an error unless the --force flag is used
This ensures data integrity and prevents accidental corruption of quality metric data.
- Paths for pangenomes_directory and mash_sketch must be relative to PANGBANK_DATA_DIR.
- Paths for taxonomy.file, genome_sources[*].file, genome_metadata.file, and genome_statuses[*].file must be absolute file paths.
- genome_metadata and genome_statuses are optional.
- Each genome status file should contain one genome name per line.
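
The path rules above could be checked before running an import with something like the following sketch. check_release_paths is a hypothetical helper, not a pangbank_db feature, and it assumes POSIX-style paths and a configuration already loaded into a dictionary.

```python
from pathlib import PurePosixPath


def check_release_paths(config):
    """Return a list of path problems found in a collection release dict (hypothetical check)."""
    problems = []

    # These two entries must be relative to PANGBANK_DATA_DIR.
    release = config.get("release", {})
    for key in ("pangenomes_directory", "mash_sketch"):
        value = release.get(key)
        if value and PurePosixPath(value).is_absolute():
            problems.append(f"release.{key} must be relative to PANGBANK_DATA_DIR")

    # Every other file entry must be an absolute path.
    absolute_entries = [("taxonomy.file", config.get("taxonomy", {}).get("file"))]
    for i, source in enumerate(config.get("genome_sources", [])):
        absolute_entries.append((f"genome_sources[{i}].file", source.get("file")))
    if "genome_metadata" in config:  # optional section
        absolute_entries.append(("genome_metadata.file", config["genome_metadata"].get("file")))
    for i, status in enumerate(config.get("genome_statuses", [])):  # optional section
        absolute_entries.append((f"genome_statuses[{i}].file", status.get("file")))

    for label, value in absolute_entries:
        if not value or not PurePosixPath(value).is_absolute():
            problems.append(f"{label} must be an absolute path")
    return problems
```

An empty list means the configuration satisfies the relative and absolute path rules; it does not confirm that the files exist.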
To list the collections in the database:

    pangbank_db list-collections

To delete a collection release:

    pangbank_db delete-collection <collection_name> --release-version <version>

Add genome status information (representative, reference, type strain, etc.) to an existing collection release without re-importing the entire collection:

    pangbank_db add-genome-statuses \
        --collection-name <collection_name> \
        --release-version <release_version> \
        --status-type <status_type> \
        --origin <origin> \
        --file <file>

Example:

    pangbank_db add-genome-statuses \
        --collection-name "GTDB_all_sampled" \
        --release-version "1.0.0" \
        --status-type "representative" \
        --origin "GTDB" \
        --file /path/to/gtdb_representatives.txt

This command is useful for:
- Adding genome statuses to releases that were imported without them
- Updating status information when new representative/reference genomes are announced
- Adding multiple status types incrementally (e.g., first representatives, then type strains)
The file should contain one genome name per line. Duplicate statuses are automatically skipped.
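
To make the file format and the duplicate handling concrete, here is a small sketch. Both function names are hypothetical, and the way existing statuses are looked up in the database is not shown.

```python
def read_genome_names(lines):
    """Return unique, non-empty genome names in first-seen order."""
    seen = set()
    names = []
    for line in lines:
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


def new_statuses(names, already_recorded):
    """Keep only genomes that do not already carry this status (duplicates are skipped)."""
    return [name for name in names if name not in already_recorded]
```

An open file object can be passed directly as `lines`, since iterating over it yields one line at a time.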
Add or update genome quality metrics (CheckM completeness, contamination, genome size, etc.) for genomes already in the database:

    pangbank_db add-quality-metrics <genome_metadata.tsv>
    # Force overwrite existing values (with warnings)
    pangbank_db add-quality-metrics <genome_metadata.tsv> --force

Example:

    # Add new quality metrics (fails if trying to change existing values)
    pangbank_db add-quality-metrics /path/to/gtdb_genome_metadata.tsv
    # Intentionally overwrite existing metrics (logs warnings)
    pangbank_db add-quality-metrics /path/to/updated_metrics.tsv --force

The TSV file should have:
- A genomes column with genome names
- Quality metric columns matching the Genome model fields (e.g., checkm2_completeness, checkm2_contamination, genome_size, gc_percentage)
Important Notes:
- Only columns that match optional Genome fields will be imported; unknown columns are automatically filtered out
- Quality metrics are immutable by default - attempting to change existing values raises an error
- Use the --force flag to intentionally overwrite existing values (warnings will be logged for each change)
- If a genome already has a value for a field:
  - Identical values are skipped (idempotent operation)
  - Different values raise an error unless --force is used
- Only fields with None (no existing value) are updated without restriction
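
The update rules in these notes can be summarized as a short sketch. merge_metrics and MetricConflictError are hypothetical names used for illustration; the actual CRUD code in pangbank_api may be organized differently.

```python
class MetricConflictError(ValueError):
    """Hypothetical error raised when an existing metric would be changed without force."""


def merge_metrics(existing, incoming, force=False):
    """Return (updated copy of `existing`, list of warning messages)."""
    updated = dict(existing)
    warnings = []
    for field, new_value in incoming.items():
        old_value = updated.get(field)
        if old_value is None:
            # No existing value: the field is filled without restriction.
            updated[field] = new_value
        elif old_value == new_value:
            # Identical value: nothing to do, so re-running an import is harmless.
            continue
        elif force:
            # Different value with force: overwrite and record a warning.
            warnings.append(f"{field}: {old_value!r} overwritten with {new_value!r}")
            updated[field] = new_value
        else:
            raise MetricConflictError(
                f"{field}: existing value {old_value!r} differs from {new_value!r}"
            )
    return updated, warnings
```

Returning a copy rather than mutating the input keeps a failed merge from leaving a genome record half updated.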
This command is useful for:
- Adding quality metrics after initial data import
- Importing metrics for newly added genomes
- Safely re-running imports without data corruption risk
Example TSV format:

    genomes    checkm2_completeness    checkm2_contamination    genome_size    gc_percentage
    GenomeA    98.5                    0.2                      5000000        45.5
    GenomeB    95.0                    1.5                      4500000        42.0

We use Alembic to manage schema changes in the PanGBank database.
Generate a migration after updating your SQLModel models (e.g., adding or changing columns):

    alembic revision --autogenerate -m "Describe your change here"

Apply all pending migrations:

    alembic upgrade head

If something went wrong, you can revert the last migration:

    alembic downgrade -1

Or go back to the base (empty schema):

    alembic downgrade base

Note
- The SQLite database path is defined in config.py via the pangbank_db_path setting (PANGBANK_DB_PATH env var).
- Alembic is configured to read this dynamically, so there is no need to change alembic.ini.
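
For readers curious how a database URL can be derived from the environment instead of being hard-coded in alembic.ini, here is a minimal sketch. sqlite_url_from_env is a hypothetical function; the repository's own Alembic env.py and config.py may do this differently.

```python
import os


def sqlite_url_from_env(env=None):
    """Build a SQLAlchemy-style SQLite URL from PANGBANK_DB_PATH (illustrative helper)."""
    env = os.environ if env is None else env
    db_path = env.get("PANGBANK_DB_PATH")
    if not db_path:
        raise RuntimeError("PANGBANK_DB_PATH is not set")
    # "sqlite:///" followed by an absolute path yields four slashes in total.
    return f"sqlite:///{db_path}"
```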
- Fork the repository.
- Create a feature branch (git checkout -b feature-name).
- Commit your changes (git commit -m 'Add new feature').
- Push to the branch (git push origin feature-name).
- Open a pull request.
For any inquiries or issues, open an issue on the GitHub repository.
Example collection release JSON file (the // comments are explanatory only and are not valid JSON):

    {
      "collection": {
        "name": "GTDB_all_sampled",
        "description": "GTDB all is a collection of pangenomes made of GTDB species that have at least 15 genomes."
      },
      "release": {
        "version": "1.0.0",
        "ppanggolin_version": "2.2.4",
        "pangbank_wf_version": "0.0.2",
        "pangenomes_directory": "GTDB_refseq/release_v1.0.0/data/pangenomes/", // relative to PANGBANK_DATA_DIR
        "release_note": "",
        "date": "2025-07-10",
        "mash_sketch": "GTDB_refseq/release_v1.0.0/data/mash_sketch/families_persistent_all.msh", // relative to PANGBANK_DATA_DIR
        "mash_version": "2.3"
      },
      "taxonomy": {
        "name": "GTDB",
        "version": "10-RS226",
        "ranks": "Domain; Phylum; Class; Order; Family; Genus; Species",
        "file": "/absolute/path/to/taxonomy.tsv"
      },
      "genome_sources": [
        {
          "name": "RefSeq",
          "file": "/absolute/path/to/genomes.tsv",
          "version": "",
          "description": "",
          "source": "",
          "url": ""
        }
      ],
      "genome_metadata": {
        "file": "/absolute/path/to/genome_metadata.tsv"
      },
      "genome_statuses": [
        {
          "status_type": "representative",
          "origin": "GTDB",
          "file": "/absolute/path/to/gtdb_representatives.txt"
        },
        {
          "status_type": "reference",
          "origin": "NCBI_RefSeq",
          "file": "/absolute/path/to/ncbi_references.txt"
        }
      ]
    }