major refactoring and update for all the providers. by ypriverol · Pull Request #105 · bigbio/pgatk

ypriverol · 2026-05-22T17:05:11Z

Summary by CodeRabbit

New Features
- New CLI commands for GENCODE and gnomAD; options to generate transcripts/CDS and select COSMIC products. Per-study cBioPortal and scripted COSMIC download modes. Benchmarking script added. Parallel-workers supported across translation commands.
Documentation
- Major CLI docs, validations, use-case and workflow updates reflecting GFF3/transcript generation and new command flags.
Improvements
- Migration to GFF3 for ClinVar flows, richer HGVS/translation handling, GRCh37 support, safer archive extraction, expanded ignore/config defaults.
Tests
- Many new and updated unit/integration tests and test fixtures for downloaders, translation pipelines, and parallel runs.

The click commands accept --protein_prefix; --var_prefix no longer exists. Example commands in docs/use-cases.md and docs/pgatk-cli.md were stale and failed when copy-pasted. Closes #100

Implementation plans and design scratch notes are kept locally as working artifacts only; they don't belong in the published repo. Files remain on disk and are now gitignored.

Sanger retired the legacy /cosmic/file_download/... endpoint. The new flow is a two-step API: GET /api/mono/products/v1/downloads/scripted returns a signed S3 URL (1-hour TTL), which is then fetched without auth. The downloader is rewritten around this contract. - Config: replace the v94 per-file URL/filename pairs with a single `products` list of `path=` values copy-pasteable from the COSMIC scripted-download UI. Defaults cover v103 / GRCh38: genome-screens mutations, targeted-screens mutations, gene FASTA, transcripts, cell-line mutations, cell-line CNA. - Downloader: build_api_url() constructs the new endpoint URL; the two-step download parses JSON, follows the signed URL, and extracts the resulting .tar archive (replaces the old .tar.gz handling). - Downloader: clear error message when the API returns 200 but the body is not the expected JSON-with-url - the failure mode reported in the bug. Distinguish API errors from S3 errors. - Downloader: tqdm progress bar for the S3 download (large files were silent before). - CLI: add -P/--products flag (repeatable) so users can fetch an arbitrary product without editing YAML, e.g. pgatk cosmic-downloader -P grch37/cancer_mutation_census/v100/... - Tests: unit tests for URL construction and base64 auth-token format (no network). End-to-end verified with real Sanger credentials: three of the six default paths (mutations, gene FASTA, cell-line mutations) download and extract cleanly. The three added afterwards (targeted-screens, transcripts, CNA) share the same path pattern as the verified ones and the filenames were taken from the COSMIC UI verbatim. Closes #98

The default list contained two entries that no longer match modern Ensembl GTFs: `polymorphic_pseudogene` was deprecated and folded into other biotypes, and `mRNA` is not an Ensembl biotype at all (it's a generic GFF feature type used by some non-Ensembl annotations). Three protein-producing biotypes that exist in current releases were missing. Removed: - polymorphic_pseudogene (deprecated) - mRNA (not an Ensembl biotype) Added: - protein_coding_CDS_not_defined (~26k transcripts; coding but CDS not perfectly annotated, still produces protein) - protein_coding_LoF (predicted loss-of-function coding transcripts; still produces a truncated protein) - translated_processed_pseudogene (pseudogenes that ARE translated; the "translated_" prefix distinguishes them from regular pseudo- genes) Net default list: 14 -> 15 biotypes. YAML config and the in-code fallback in ensembl.py kept in sync. Closes #101

Issue #99 reports vcf-to-proteindb taking 12+ hours on chr22-scale VCFs. Root cause: get_features() makes two SQLite queries against the gffutils GTF database per (variant, transcript) pair, with no caching. For typical CSQ-annotated VCFs that's tens of millions of queries per chromosome. Six semantically preserving optimisations, stacked: 1. Memoize get_features() results per transcript id for the duration of one vcf_to_proteindb call. Dominant gain. Micro-benchmark on a real ENSEMBL chr22 transcript shows ~475x speedup on this op (0.094 ms/call SQLite -> 0.0002 ms/call cache hit). For 50M get_features calls per chr22 that's ~78 minutes -> ~10 seconds of pure get_features time. 2. Apply biotype / consequence filters BEFORE the (now-cached) get_features call so rejected transcripts skip the work entirely. 3. Iterate the VCF DataFrame with itertuples(index=False) instead of iterrows(); ~5-10x faster on the outer loop. 4. Parse record.INFO into a dict once per variant instead of two list-comprehensions that each scan the whole INFO string. 5. Cache transcripts_dict[id] (SeqIO.index) lookups per transcript id; avoids repeated fseek+parse on the FASTA for variants that share a transcript. Uses a _MISSING sentinel so known-missing ids skip the disk lookup too. 6. Lazy log formatting in the hot loop -- wrap debug() calls with isEnabledFor(DEBUG) and use %s placeholders so the message string isn't built when DEBUG isn't enabled. Verified end-to-end against the three in-repo pytest tests (test_vcf_to_proteindb, test_vcf_to_proteindb_notannotated, test_vcf_gnomad_to_proteindb). End-to-end run on a 10K-variant 1000 Genomes chr22 VCF + ENSEMBL release 113 chr22 GTF + full cDNA FASTA completes in ~30 seconds (mostly gffutils DB build + bedtools intersect, both one-time costs unrelated to the fix); 43 sequences produced. Behaviour change worth noting: the "feature IDs from VCF not found in FASTA" counter now counts each missing transcript once (cached), not per-variant. The old counter over-counted on repeated lookups of the same missing id. Behaviour, not output, change. Add scripts/benchmark_vcf_to_proteindb.py: one-shot wall-clock timer for re-validating perf on representative data. Not used by CI. Closes #99

cosmic_downloader.py: tarfile.extractall on COSMIC archives is now wrapped in _safe_extract_tar(), which validates each member against the destination directory before extraction. Rejects entries that would escape the target via absolute path, .. traversal, or unsafe symlink/hardlink targets (CVE-2007-4559 family). COSMIC archives are trusted in practice but defence-in-depth costs nothing and silences the static-analysis warning. ensembl.py: add blank line before _FeatureCache docstring (D203). D213 ("multi-line docstring summary on second line") not addressed - it conflicts with D212 ("on the first line"), which the codebase already follows throughout. Standard modern Python style.

…yle) Codacy's default pydocstyle config has both D203 ("one blank line before class docstring") and D211 ("no blank lines before class docstring") enabled. These rules contradict each other; the codebase follows D203. Likewise, D213 ("multi-line summary on second line") conflicts with D212 ("on the first line"), which is the codebase's established convention. Add `add-ignore = ["D211", "D213"]` to silence the false positives.

The pydocstyle settings in pyproject.toml in 2b467a5 didn't help - Codacy runs an older pydocstyle that only reads its native config files (setup.cfg, .pydocstyle, .pydocstyle.ini, tox.ini), not pyproject.toml's `[tool.pydocstyle]` section. Add `.pydocstyle` with the same `add-ignore = D211,D213` and a `.codacy.yaml` that sets the same patterns at the Codacy engine level. Also tighten _safe_extract_tar to pass the validated member list to extractall() explicitly (members=safe_members). Functionally identical, but makes the contract obvious to readers (and the static analyser): only validated members are extracted. The `# nosec B202` comment marks the deliberate suppression for any Bandit-style scanner that still flags extractall calls. No source-code behaviour change.

Real issues fixed: - test_cosmic_downloader.py: remove unused `import os` (F401). - cosmic_downloader.py: rewrite _safe_extract_tar docstring with proper summary/body split (single-period summary line + blank line before body), fixes D205 and D415. - cosmic_downloader.py: add explicit type hint `safe_members: list[tarfile.TarInfo]` on the validated-members list, and add a clear comment at the extractall() call site explaining what was validated. Helps reviewers (and static analysers that can follow type hints) confirm the safety contract. Codacy/Bandit config: - setup.cfg [bandit]: skip B101 entirely (B101 fires on every pytest `assert`, which is a false positive because pytest depends on assert rewriting). Also document the B202 (tarfile.extractall) suppression at the call site. - setup.cfg [pydocstyle]: same `add-ignore = D211,D213` as the existing .pydocstyle + pyproject.toml. setup.cfg is the most universally-read pydocstyle config location across versions. Remaining D211 / D213 noise on the PR cannot be fixed in code - they are mutually-exclusive pydocstyle rules (D211 vs D203, D213 vs D212) and the codebase has to follow one of each pair. The Codacy engine enables both halves of each pair by default and appears to ignore local config files. Disable D211 and D213 in the Codacy repository settings UI (Repository -> Settings -> Code patterns -> pydocstyle) to silence them globally.

- Drop "_FeatureCache.put" eviction comment that incorrectly claimed random eviction (the loop actually drops the first 1/5 of insertion order). One-line loop is self-explanatory without commentary. - Drop "see Task 5" dangling reference in seq_cache declaration; the task lived in a now-gitignored implementation plan and means nothing to a future reader. - Collapse two-line "Parse INFO once" comment to one line; keep the WHY (avoids repeated scans) and drop the WHAT (variable name and loop already say it). - Collapse the two filter-block comments into a single rationale line explaining why the filters run before get_features().

Codebase-wide cleanup pass: drop comments that restate well-named code, stale task/PR/issue references that have rotted, orphan commented-out code, narrative step-trackers, and mid-file author stamps. Preserved all docstrings, workaround notes, domain-knowledge comments, and substantive WHY rationale. No behaviour change.

Five issues from the Copilot review, fixed: 1. seq_cache type annotation: changed from `dict[str, tuple]` to `dict[str, Optional[tuple]]` to reflect that we store None as the "known-missing transcript" marker alongside (ref_seq, desc) tuples. Added a comment explaining the _MISSING sentinel pattern. 2. cosmic_downloader.py: redact the presigned S3 URL when logging. The download URL embeds a short-lived AWS signature/credentials in its query string; even with a 1-hour TTL it can leak into log aggregators. Log only the path component. 3. commands/cosmic_downloader.py: rewrite the `--url_file` help text. The flag used to write a direct download URL; after the v103 API migration it writes the SCRIPTED-DOWNLOAD API URL (step 1, not the signed S3 URL). The help now describes the actual TSV format and the indirection. 4. tests/test_cosmic_downloader.py: rewrite module docstring to describe what the tests actually check (one URL pattern + the token format), not the stale "four product API URLs" claim. 5. cosmic_downloader.py: fix the `output_directory` config-lookup bug. `get_configuration_default_params` only searched under `cosmic_data.cosmic_server.<var>`, so the `output_directory` field at `cosmic_data.<var>` (top level) in cosmic_config.yaml was silently ignored - the service always fell back to the hardcoded `./database_cosmic/`. Now check the top level too, so the YAML value takes effect. Pre-existing bug, surfaced by the Copilot review.

- removed .gz from file names in the docs since ensembl download uncopmress files after downloading: `.gtf.gz` to `.gtf` and `.vcf.gz` to `.vcf` in multiple commands for consistency. - Updated the `include_biotypes` for pseudogenes adding: `unitary_pseudogene` and `rRNA_pseudogene`. - Renamed references from `lincRNA` to `lncRNA` throughout the documentation for accuracy. - update canonical biotypes to only keep those that have CDS similar to the reference.

…r organization and clarity. - Added a new section for 'Putative proteins (three-frame)' with example commands. - Updated input commands to include the new putative proteins in relevant workflows. - Enhanced descriptions for condition-specific databases and cancer-type specific databases to improve understanding.

…ates - Improved handling of deletion-insertion (delins) mutations with better validation and making sure to skip intronic and UTR offsets. - Added support for tandem duplication (dup) mutations, allowing for accurate sequence reconstruction. - Updated mutation type checks to include additional HGVS terms for missense, nonsense, and in-frame insertions/deletions. - Refined the required columns in the CosmicMutantExport input to align with updated naming conventions. Additionally, added a new 'Validations' section in the documentation for clarity on module correctness, currently only covers COSMIC mutation translation based on results in tests/test_variant_types_hgvs.py.

- Updated input file names and parameters in the use cases to reflect the new COSMIC v103 format. - Introduced the `--clinical_sample_file` option to allow mapping of `COSMIC_PHENOTYPE_ID` to `PRIMARY_SITE` for filtering. - Enhanced the `CancerGenomesService` to utilize the classification file for phenotype mapping, improving mutation filtering based on tissue type. - Adjusted documentation to clarify the new data model and integration tests for COSMIC v103. - Updated test cases to align with the new schema and ensure accurate output generation.

Adds --profile-out PATH and --print-top N flags. When given, the run is wrapped in cProfile, the .prof file is written to PATH, and the top N functions are printed sorted by both cumulative time and own time. Used to identify the remaining hot path after the issue-#99 fix before designing a parallelization strategy.

Profile of 50K gnomAD chr22 variants showed ~36s one-time setup and ~14s per-variant work, with Biopython translate the dominant CPU cost. Implements per-chromosome multiprocessing.Pool to fan out the per-variant loop across cores. - New `--workers N` CLI flag (default 1, sequential, backward-compatible). - Internally: pre-annotate VCF in the main process (avoids bedtools-cwd race), split the VCF into per-chromosome temp files, dispatch to a spawn-context Pool, concatenate per-chunk outputs. - Existing per-variant loop moved verbatim into `_vcf_to_proteindb_chunk`; the public `vcf_to_proteindb` now dispatches sequential or parallel. - Workers re-construct EnsemblDataService fresh per process (no shared gffutils SQLite handles or SeqIO indices); each pays its own setup but the per-variant CPU work parallelises near-linearly. - Equivalence test verifies workers=2 produces the same set of output sequences as workers=1 on the existing test.vcf fixture.

…apping Three focused improvements stacking on the per-chromosome multiprocessing PR: 1. Replace SeqIO.index with SeqIO.index_db. SeqIO.index rebuilds an in-memory FASTA offset index every invocation (~14 s for the 434 MB Ensembl cDNA FASTA); SeqIO.index_db persists the index in a SQLite .idx file next to the FASTA, reused across runs and shared across workers. Pre-built once in the main process before fan-out so the first chunk pays the build cost and the rest just open the same .idx. Stale detection: .idx older than FASTA triggers a rebuild. 2. _split_vcf_by_chrom now streams the input VCF and writes per-chrom chunks directly to disk in a single pass instead of buffering all data lines in memory. Constant memory regardless of input size — makes the whole-genome path actually feasible. 3. transcript_id_mapping (versionless -> versioned dict of 207k FASTA keys) is now built lazily on the first KeyError fallback. Most real workloads never need it; saves ~200ms per worker startup. No public API change. All existing tests pass; equivalence test still confirms sequential == parallel output.

The previous commit (254e35c) inadvertently captured a re-ordering of proteindb_from_custom_VCF.fa from a single test run. The file is the OUTPUT of test_vcf_to_proteindb_notannotated (which only asserts exit_code == 0, not content); the ordering shifts non-deterministically between runs because annoate_vcf uses Python set iteration with hash randomization. Restore the prior content to keep the working tree stable across consecutive test invocations. Also gitignore *.fa.idx / *.fasta.idx: BioPython SeqIO.index_db now materialises a SQLite index next to each FASTA. These are build artifacts, rebuilt automatically when the source FASTA changes.

perf(vcf-to-proteindb): per-chromosome multiprocessing (~3.4× on 4-chrom test)

- performance improvement in vcf-to-proteindb by taking the variants in batches plus other caching strategies and pre-filtering - added tests for ensembl variant translation - added tests to gnomad and updated docs to use Gencode files

…nostics New features: - Add gnomad-vcf-downloader command (parallel per-chromosome download from GCS) - Add gencode-downloader command (GENCODE GTF + genome FASTA + gffread transcripts) - Add pgatk.gnomad.data_downloader module backing both new commands ClinVar pipeline (GTF → GFF3 migration): - Replace --gtf/gtf_file with --gff/gff_file throughout; NCBI RefSeq GTF leaves transcript_id empty so gffread cannot build CDS chains — GFF3 ID=/Parent= linkage is required - Replace synthetic mini testdata with real NCBI GFF3 + transcript sequences for OR4F5 (NM_001005484.2) and EPCAM (NM_002354.3); delete mini_refseq.gtf - Add isg15 testdata for mutation-type coverage in test_clinvar_translation.py - Label variants with no CLNSIG field as "not_provided" in output headers - Suppress BioPython partial-codon warning inside get_orfs_vcf (expected for indel frameshifts, not a data error) COSMIC pipeline: - Pre-filter p.? mutations before SNP construction (UTR/intronic variants cannot produce a mutant protein; avoids spurious warnings) - Split unparseable counter into ref_mismatch_count and unsupported_count; emit two separate summary warnings at end of run - get_mut_pro_seq returns None for substitution REF mismatch (sentinel) vs "" for unsupported HGVS notation so callers can distinguish the two - Strip leading N (masked genomic base) from FASTA entries before position lookup; fixes off-by-one for ~37 COSMIC genes with leading-N artifacts - Add _substitution_mismatch_detail for detailed DEBUG logging (expected REF vs found base at the mismatch position) Test data: - Replace synthetic gnomAD testdata with real gnomAD v4.1.1 + GENCODE v39 data for OR11H1 (ENST00000643195.1, chr22); CDS=1-948 header enables 1-frame translation in all gnomAD tests - Add test_ensembl_translation.py and test_gnomad_translation.py covering get_altseq + get_orfs_vcf end-to-end for real variant positions Housekeeping: - Untrack 25 test-generated output files (proteindb_from_*.fa, per-tissue COSMIC FAs, *.db indexes) that were committed before gitignore rules existed - Add pgatk/testdata/*.db to .gitignore; fix stale --gtf refs in use-cases.md

changes in data sources. cBioPortal (cgenomes_proteindb / cbioportal_to_proteindb): - Replace Ensembl CDS FASTA input with NCBI RefSeq transcript FASTA (CDS= / gene_biotype= headers from ncbi-downloader --generate-transcripts) - Add GFF3-coordinate-based mutation application as fallback when HGVSc is absent; GFF DB is built once in the main process via gffutils - Add parallel worker pool (_cbio_worker_init / _cbio_translate_batch) with per-worker FASTA + GFF state to avoid re-pickling on each batch - Expose new CLI flags: --include_biotypes, --exclude_biotypes, --skip_including_all_cds, --include/exclude_variant_classifications, --gff, --workers - Map NCBI NC_ accessions and MAF chr aliases to canonical chromosome names - Replace old synthetic test fixtures with real GRCh37/GRCh38 NCBI GFF + transcript FASTA files; add comprehensive test_cbioportal_translation.py (14 tests covering missense, nonsense, lncRNA, biotype filtering, parallel equivalence, and sample-split) - Refactor cbioportal_to_proteindb into 3 phases: serial MAF parsing, parallel batch translation (multiprocessing.Pool with spawn context), serial dedup-and-write - Add CBIO_BATCH_SIZE config key (default 2000 rows per batch) real GRCh37 and GRCh38 TCGA BRCA fixtures including GFF and transcript FASTA - Add TestSplitByFilterColumn (4 tests), TestParallelEquivalence (1 test), TestRealDataHgvsVerification (2 tests) in new test_cbioportal_translation.py ClinVar (clinvar_service / data_downloader): - Add _FeatureCache (LRU-style, 50k entries) to memoize _get_features() lookups per worker - Split large VCF inputs into fixed-size batches (_split_vcf_into_batches) before dispatching to the multiprocessing pool - Add SQLite SeqIO index (_ensure_fasta_index) for O(1) transcript lookup replacing linear FASTA scans - Add _fasta_key_fn to strip rna- prefix and version from NCBI FASTA headers General / toolbox: - toolbox/general.py: add open_vcf helper and related utilities - Ensembl/gnomAD downloaders: minor fixes and GENCODE support - Docs (pgatk-cli.md, use-cases.md): updated to reflect new CLI interface and NCBI/GENCODE workflow - Remove stale test fixtures (output_decoy.fa, paac_jhu_2014.tar, test_blast_psms_out.tsv, old cBioPortal MSKCC files)

Generate testdata/proteindb_from_ENSEMBL_VCF.fa via vcf-to-proteindb inside the test when missing. Previously the test read this file as input, relying on it being committed; commit 87af784 untracked it as a generated output, but unittest runs test_check_ensembl_database before test_vcf_to_proteindb (alphabetical), so on a fresh checkout the input was absent and the CLI exited with code 1.

- Drop genuinely unused imports: sqlite3 in cgenomes_proteindb, gzip in ensembl/ensembl. - Annotate subprocess imports and gffread calls with `# nosec B404/B603`; the binary is hardcoded, resolved via shutil.which, and invoked with list-form argv (never shell=True), so the bandit notices are false positives. - Consolidate exclude_paths from .codacy.yml into .codacy.yaml so the test directory is actually skipped by Codacy (pytest's `assert` usage was being flagged because only the engines file was being read).

- cgenomes_proteindb: replace two bare `except Exception: pass` blocks with narrowed exception classes (AttributeError/IndexError/ValueError in the mismatch-detail helper; sqlite3.Error + FeatureNotFoundError in the versioned-ID transcript fallback) and emit DEBUG logs so suppressed failures are not silent. - toolbox/general.download_file: annotate urlretrieve with `# nosec B310`; scheme is already restricted to http/https/ftp via ToolBoxException earlier in the function.

Bug-fix iteration: closes #98, #99, #100, #101

qodo-code-review · 2026-05-22T17:05:16Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

coderabbitai · 2026-05-22T17:05:25Z

📝 Walkthrough

Walkthrough

Single PR updating CLI, downloaders, multi-process translation pipelines (ClinVar/Ensembl/CancerGenomes), generate-transcripts helpers, COSMIC/cBioPortal REST flows, many test fixtures/tests, docs, and pydocstyle/setup tooling to v0.0.27.

Changes

Main functional cohort

Layer / File(s)	Summary
Full change set `*`	Comprehensive PR: new/updated CLI subcommands and options; new downloaders (gencode, gnomad), refactored COSMIC and cBioPortal downloaders; parallel/chunked translation support in Ensembl/ClinVar/CancerGenomes services; generate-transcripts helpers; toolbox utility changes (open_vcf, download progress); decoy generation refactor; many testdata additions/updates and extensive new test modules; docs and pydocstyle/setup.cfg updates; package version bump.

Estimated code review effort:
🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs:

bigbio/pgatk#92: related mutation-to-protein translation changes in cgenomes_proteindb.
bigbio/pgatk#94: related docs/use-cases and index updates.
bigbio/pgatk#89: prior CLI wiring changes connected to added subcommands.

Suggested labels:
Review effort 4/5

"A rabbit hops through code and tests,
nibbling doc crumbs and fixes at rest.
New downloaders hum, translators sprint,
tests bloom like carrots — a dependable feast.
🐇✨"

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dev

coderabbitai

Actionable comments posted: 18

🧹 Nitpick comments (5)

setup.cfg (1)

12-15: ⚡ Quick win

Consider relocating the B202 comment for clarity.

The B202 comment (lines 12-14) appears immediately before the skips line but B202 is not actually skipped in the configuration. While the comment correctly states that B202 is handled with # nosec at the call site, placing this explanation here may mislead readers into thinking B202 should be in the skips list. Consider moving this explanatory comment outside the bandit section or to a separate documentation block.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@setup.cfg` around lines 12 - 15, The explanatory comment about B202 is placed
immediately above the "skips = B101" setting and may imply B202 is being
skipped; move the B202 explanation out of the Bandit "skips" block—either
relocate it to a separate documentation block or above the Bandit section header
so it's clearly informational, and leave the "skips" entry unchanged; look for
the "skips" setting and the current B202 comment text to relocate.

pgatk/tests/test_cosmic_downloader.py (1)

33-46: ⚡ Quick win

Consider testing public behavior instead of the private _cosmic_token attribute.

Line 46 directly asserts on the private attribute _cosmic_token. Testing private implementation details makes tests brittle and tightly couples them to internal structure. If the authentication mechanism changes (e.g., the token is computed on-demand), this test breaks even if the public API works correctly.

Consider refactoring to test observable behavior through public methods or indirectly verify the token is correctly formatted by testing actual authentication success scenarios (if feasible with mocking).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pgatk/tests/test_cosmic_downloader.py` around lines 33 - 46, The test is
asserting a private attribute _cosmic_token; change it to exercise public
behavior on CosmicDownloadService instead—either call an existing public method
that uses the token (e.g., the method that initiates the FTP/HTTP request or a
public auth header accessor) or add a small public helper (e.g.,
get_cosmic_auth_header or format_cosmic_token) and assert its return value;
alternatively, mock the network call used by CosmicDownloadService and assert
the outgoing Authorization/header or credentials include the
base64(user:password) value rather than reading _cosmic_token directly. Ensure
the test references CosmicDownloadService and the public method or mocked
network call so it no longer depends on the private attribute.

pgatk/clinvar/clinvar_service.py (1)

345-351: 💤 Low value

Replace EN DASH with standard hyphen in log message.

Line 347 contains an EN DASH character (–) which should be a standard HYPHEN-MINUS (-) for consistency and terminal compatibility.

✏️ Suggested fix

             logger.warning(
-                "Building gffutils database from %s — this takes 10–30 minutes "
+                "Building gffutils database from %s - this takes 10-30 minutes "
                 "on first run and is cached as %s for subsequent runs.",
                 gtf_file,
                 db_file,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pgatk/clinvar/clinvar_service.py` around lines 345 - 351, The logger.warning
message in clinvar_service.py that references gtf_file and db_file contains an
EN DASH character (in the string passed to logger.warning); update that string
to use a standard hyphen-minus ("-") instead of the EN DASH so terminals and
tooling handle it consistently (locate the warning call that mentions "this
takes 10–30 minutes" and replace the "–" with "-").

pgatk/ensembl/ensembl.py (2)

920-924: 💤 Low value

Move imports to module level.

These imports (multiprocessing, shutil, tempfile) are standard library modules and should be at the top of the file for consistency and to follow Python conventions.

✏️ Suggested refactor

Add these imports at the top of the file (around line 5-8) with the other standard library imports:

import multiprocessing
import shutil
import tempfile

Then remove lines 922-924.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pgatk/ensembl/ensembl.py` around lines 920 - 924, The three standard-library
imports currently inside the "Parallel — split into fixed-size variant-batch
VCFs, fan out to a Pool." block (multiprocessing, shutil, tempfile) should be
moved to the module-level import section with the other stdlib imports and the
in-place import statements removed; locate the in-file strings
'multiprocessing', 'shutil', and 'tempfile' (the inline import lines) and add
them to the top-of-file imports (near the other standard library imports) and
delete the local imports at that block so the module follows normal Python
import conventions.

562-569: 💤 Low value

Prefer next(...) over single element slice.

Using next() with a generator expression is more efficient than building a list and taking [0], and handles the empty case more explicitly.

✏️ Suggested refactor

         annotation_cols: list[str] = []
         try:
-            raw_cols = \
-                [x for x in metadata if x.startswith('##INFO=<ID={}'.format(self._annotation_field_name))][
-                    0].upper().split('FORMAT')[1].strip(' ').split(':')[-1].split('=')[-1].strip(' ').split('|')
+            info_line = next(
+                (x for x in metadata if x.startswith('##INFO=<ID={}'.format(self._annotation_field_name))),
+                None
+            )
+            if info_line is None:
+                raise IndexError
+            raw_cols = info_line.upper().split('FORMAT')[1].strip(' ').split(':')[-1].split('=')[-1].strip(' ').split('|')
             annotation_cols = [c.strip(' ">\n\r') for c in raw_cols]
         except IndexError:
             pass

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pgatk/ensembl/ensembl.py` around lines 562 - 569, The current code builds a
list and indexes [0] to get the INFO line for self._annotation_field_name which
is inefficient and brittle; replace the list-and-index with next((x for x in
metadata if x.startswith(f'##INFO=<ID={self._annotation_field_name}')), None) to
get the first matching entry (or None) and then only run the
.upper().split('FORMAT')... parsing when the result is not None, otherwise leave
annotation_cols as an empty list; maintain the same trimming logic (strip('
">\n\r')) and keep the existing exception/silent-fail behavior by removing the
IndexError try/except and guarding on the next() result instead.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/use-cases.md`:
- Around line 111-113: Fix visible typos in the workflow text: replace
"annoated" with "annotated", "Cancer-speific" with "Cancer-specific", "speific"
with "specific", and "numbre" with "number" wherever they appear (notably the
occurrence shown and also at the other listed locations). Update the Markdown
content in docs/use-cases.md to correct those words so user-facing text reads
clearly and consistently.
- Around line 1194-1203: Step 3 lists putative.fa for concatenation but Step 2
never produces putative.fa, breaking the workflow; either remove putative.fa
from the list in the Step 3 concatenation (the cat command) or add an explicit
Step 2 command that generates putative.fa (e.g., the same pattern used to create
lncRNA.fa or pseudogene.fa) so that putative.fa exists before Step 3 runs—update
the docs to reference the filename putative.fa and the Step 2 generation command
or remove the filename from the Step 3 file list accordingly.

In `@docs/validations.md`:
- Line 153: Several fenced code blocks in the document open with plain triple
backticks ("```") and lack language identifiers; update each opening fence to
include an explicit language (for example change "```" to "```bash",
"```python", or "```text" as appropriate) for the blocks referenced so
markdownlint MD040 is satisfied and rendering improves; locate the offending
fences by searching for the bare "```" sequences within the file and pick the
correct language based on the block contents (shell commands => bash, code
snippets => python/javascript/etc., plain output => text), then save the file.

In `@pgatk/cgenomes/cbioportal_downloader.py`:
- Around line 410-411: download_one_study() currently writes to the shared
url_file from worker threads (url_file.write(...)), which is not thread-safe;
change download_one_study() to stop writing to url_file and instead return the
out_path (or None on failure) to the caller, then in the caller that submits
futures (the code that uses concurrent.futures and collects results) wait for
futures to complete and write each returned out_path to url_file serially (or
filter Nones) so all file writes happen in the main thread; alternatively, if
you must write from threads, protect url_file.write with a threading.Lock, but
prefer returning out_path from download_one_study() and performing the writes
after futures complete.

In `@pgatk/cgenomes/cgenomes_proteindb.py`:
- Around line 157-166: The DEL branch incorrectly treats HGVS deletions without
explicit bases (e.g., "c.104_124del") because del_dna becomes empty and nothing
is removed; update the logic in the DEL handling (the block using vartype ==
"DEL", variables cdna_pos, pos, del_dna, enst_pos, seq_mut) to detect empty
del_dna and parse the deletion range (from cdna_pos or pos, e.g., extract start
and end positions via regex), compute the number of bases to delete, and remove
the corresponding slice from seq (using enst_pos as the start index and length =
end - start + 1) before assigning seq_mut so that deletions specified by ranges
are correctly applied.

In `@pgatk/cgenomes/cosmic_downloader.py`:
- Around line 159-160: The error message currently includes the full presigned
download_url (msg = "COSMIC S3 download failed: HTTP {} for
{}".format(data_response.status_code, download_url)), which can leak temporary
query credentials; change the message to log a redacted URL or only the
scheme+netloc+path by stripping query parameters (e.g., parse download_url and
use parsed.scheme://parsed.netloc + parsed.path) or otherwise replace the query
string with "<redacted>" before formatting into msg so only non-sensitive parts
are logged; keep the data_response.status_code in the message unchanged.

In `@pgatk/clinvar/data_downloader.py`:
- Around line 26-32: The current decompression writes directly to plain which
can leave a corrupt file if interrupted; change the decompression in the block
that opens genome_fna (the gzip.open(...) and shutil.copyfileobj(...) section)
to write to a temporary file in the same directory (e.g., plain + ".tmp" or use
tempfile.NamedTemporaryFile(dir=... , delete=False)), flush and fsync the temp
file, close it, then atomically replace the final path with
os.replace(temp_path, plain); ensure any exception removes the temp file and
preserve the same return values (return plain, plain) and logging using the
existing symbols genome_fna, plain, gzip.open, shutil.copyfileobj, and
logger.info.

In `@pgatk/commands/cbioportal_to_proteindb.py`:
- Around line 38-40: The --workers click option currently allows zero or
negative integers but is only used when "workers > 1", causing invalid inputs to
be ignored; in the CLI handler (the function receiving the "workers" parameter,
referenced by the option decorator and the runtime check "if workers > 1:"), add
an explicit validation that workers is an integer >= 1 and raise a
click.BadParameter (or click.UsageError) with a clear message when workers <= 0
so the tool fails fast instead of silently ignoring the value; keep the rest of
the logic (the existing "if workers > 1:" branch) unchanged.

In `@pgatk/commands/clinvar_to_proteindb.py`:
- Around line 16-21: Validate the CLI `--workers` value is a positive integer
before passing it to the service: in the command handler that receives the
`workers` option (the click command using the `@click.option`("-w", "--workers",
...) in clinvar_to_proteindb.py), check that workers is not None and workers > 0
and raise a click.BadParameter (or call click.echo + sys.exit) when it is 0 or
negative; perform this check early in the command function before any
multiprocessing or service invocation to fail fast and provide a clear error
message.

In `@pgatk/commands/ensembl_downloader.py`:
- Around line 128-134: The loop that assigns genome_fna currently falls back to
next(iter(genome_map.values())) which can silently pair a GTF (prefix derived
from gtf_path) to the wrong genome; change the logic in the block handling
gtf_path/prefix so that you try genome_map.get(prefix) and if it is None then:
if len(genome_map) == 1 use the sole genome file as a deliberate fallback,
otherwise raise an explicit error (ValueError or custom) indicating the missing
prefix and listing available genome_map keys so the user can fix inputs; update
any downstream users of genome_fna (e.g., output_fasta generation) to rely on
this validated assignment.

In `@pgatk/commands/gnomad_vcf_downloader.py`:
- Around line 30-34: The CLI should reject non-positive worker counts at parse
time so ThreadPoolExecutor isn't given max_workers <= 0; update the "--workers"
option to validate inputs (e.g., use click.IntRange(min=1) or a custom callback
that raises a BadParameter for values < 1) and ensure the option's help/default
behavior remains; this will prevent GnomadVcfDownloader.download_all() from
being called with self._workers <= 0.

In `@pgatk/commands/ncbi_downloader.py`:
- Around line 91-93: The except blocks in generate_transcripts and
generate_cds_fasta are too broad and drop original tracebacks; narrow them to
catch FileNotFoundError (from _require_gffread) and
subprocess.CalledProcessError (from subprocess.run(..., check=True)), print the
error with click.echo, and re-raise a SystemExit preserving the original
exception context by using "raise SystemExit(1) from exc" so the original
traceback is retained; update both exception handlers around
generate_transcripts, generate_cds_fasta and any similar blocks.

In `@pgatk/commands/vcf_to_proteindb.py`:
- Around line 41-42: Reject non-positive worker counts at the CLI boundary by
validating the '--workers' option: change the Click option from type=int to
type=click.IntRange(1, None) or add an explicit check after parsing that raises
click.BadParameter if workers is None? no, should reject <=0 (e.g., if workers
is not None and workers <= 0 raise click.BadParameter("workers must be a
positive integer")). Apply the same validation for the other workers option
occurrence referenced (lines 110-114) so no zero or negative values are
forwarded downstream.
- Line 42: The help string for the --workers argparse option is incorrect: it
says "default: 1" but the argument default is None; update the help text for the
--workers option in vcf_to_proteindb.py (the parser.add_argument call that
defines "--workers") to accurately state the real default (e.g., "default: None"
or "default: auto/None (sequential)") or change the argument default to 1 if you
intend the default behavior to be 1; ensure the help text and the default value
on the --workers option remain consistent.

In `@pgatk/db/map_peptide2genome.py`:
- Line 90: The current check uses if line[0] != "#" which raises IndexError on
empty lines; update the loop in map_peptide2genome.py to guard empties first by
skipping lines that are blank (e.g., if not line or line.strip() == "" then
continue) before evaluating the comment check on line[0], then proceed with the
existing logic that handles non-comment lines.

In `@pgatk/ensembl/data_downloader.py`:
- Around line 508-515: The code currently reuses an existing plain FASTA sibling
(plain) which may be a partial decompression; update the logic around genome_fna
/ genome_for_gffread so decompression is always written to a temporary file in
the same directory and then atomically moved into place (use
tempfile.NamedTemporaryFile or tempfile.mkstemp with the same dir, write the
decompressed contents from gzip.open into that temp, close it, then
os.replace(temp, plain)); set genome_for_gffread = plain and _tmp = plain after
the atomic replace. Ensure the temp file is created in the same directory as
plain so os.replace is atomic on the filesystem.

In `@pgatk/ensembl/ensembl.py`:
- Around line 138-152: The function _vcf_to_proteindb_worker is annotated as
returning None but actually returns stats (a dict) from
svc._vcf_to_proteindb_chunk, causing type-checking and caller aggregation
errors; change the return annotation from -> None to an appropriate mapping type
such as -> Dict[str, Any] (or -> Any/Mapping if more appropriate), and update
any imports (e.g., typing.Dict, Any) if needed so the signature accurately
reflects that _vcf_to_proteindb_worker returns the stats dict obtained via
_worker_state['svc']._vcf_to_proteindb_chunk.

In `@pgatk/tests/test_vcf_to_proteindb_parallel.py`:
- Around line 11-24: The helper _sequence_set currently returns a set which
loses duplicate counts and can mask parallelization bugs; change it to return a
multiset (e.g., use collections.Counter on seqs) and update the test assertions
to compare Counters (or other multiset-aware comparison) instead of sets;
replace all uses of _sequence_set (including the other occurrence) so tests
assert equality of multisets rather than sets.

---

Nitpick comments:
In `@pgatk/clinvar/clinvar_service.py`:
- Around line 345-351: The logger.warning message in clinvar_service.py that
references gtf_file and db_file contains an EN DASH character (in the string
passed to logger.warning); update that string to use a standard hyphen-minus
("-") instead of the EN DASH so terminals and tooling handle it consistently
(locate the warning call that mentions "this takes 10–30 minutes" and replace
the "–" with "-").

In `@pgatk/ensembl/ensembl.py`:
- Around line 920-924: The three standard-library imports currently inside the
"Parallel — split into fixed-size variant-batch VCFs, fan out to a Pool." block
(multiprocessing, shutil, tempfile) should be moved to the module-level import
section with the other stdlib imports and the in-place import statements
removed; locate the in-file strings 'multiprocessing', 'shutil', and 'tempfile'
(the inline import lines) and add them to the top-of-file imports (near the
other standard library imports) and delete the local imports at that block so
the module follows normal Python import conventions.
- Around line 562-569: The current code builds a list and indexes [0] to get the
INFO line for self._annotation_field_name which is inefficient and brittle;
replace the list-and-index with next((x for x in metadata if
x.startswith(f'##INFO=<ID={self._annotation_field_name}')), None) to get the
first matching entry (or None) and then only run the .upper().split('FORMAT')...
parsing when the result is not None, otherwise leave annotation_cols as an empty
list; maintain the same trimming logic (strip(' ">\n\r')) and keep the existing
exception/silent-fail behavior by removing the IndexError try/except and
guarding on the next() result instead.

In `@pgatk/tests/test_cosmic_downloader.py`:
- Around line 33-46: The test is asserting a private attribute _cosmic_token;
change it to exercise public behavior on CosmicDownloadService instead—either
call an existing public method that uses the token (e.g., the method that
initiates the FTP/HTTP request or a public auth header accessor) or add a small
public helper (e.g., get_cosmic_auth_header or format_cosmic_token) and assert
its return value; alternatively, mock the network call used by
CosmicDownloadService and assert the outgoing Authorization/header or
credentials include the base64(user:password) value rather than reading
_cosmic_token directly. Ensure the test references CosmicDownloadService and the
public method or mocked network call so it no longer depends on the private
attribute.

In `@setup.cfg`:
- Around line 12-15: The explanatory comment about B202 is placed immediately
above the "skips = B101" setting and may imply B202 is being skipped; move the
B202 explanation out of the Bandit "skips" block—either relocate it to a
separate documentation block or above the Bandit section header so it's clearly
informational, and leave the "skips" entry unchanged; look for the "skips"
setting and the current B202 comment text to relocate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 805323b5-19b5-4a6a-a46f-4ecaa4b79c00

📥 Commits

Reviewing files that changed from the base of the PR and between b9f8d17 and 5f92d21.

⛔ Files ignored due to path filters (10)

pgatk/testdata/paac_jhu_2014.tar is excluded by !**/*.tar
pgatk/testdata/psms_with_single_missmatch.tsv is excluded by !**/*.tsv
pgatk/testdata/test.db is excluded by !**/*.db
pgatk/testdata/test_blast_psms_out.tsv is excluded by !**/*.tsv
pgatk/testdata/test_blast_validate_psms_out.tsv is excluded by !**/*.tsv
pgatk/testdata/test_cosmic_classification.tsv is excluded by !**/*.tsv
pgatk/testdata/test_cosmic_mutations.tsv is excluded by !**/*.tsv
pgatk/testdata/test_gencode.db is excluded by !**/*.db
pgatk/testdata/test_gnomad_bgz.vcf.gz is excluded by !**/*.gz
pgatk/testdata/test_validate_psms_out.tsv is excluded by !**/*.tsv

📒 Files selected for processing (98)

.codacy.yaml
.codacy.yml
.gitignore
.pydocstyle
docs/index.md
docs/pgatk-cli.md
docs/plans/2026-03-01-pgatk-graph-engine-design.md
docs/plans/2026-03-03-protein-accession-design.md
docs/use-cases.md
docs/validations.md
pgatk/cgenomes/cbioportal_downloader.py
pgatk/cgenomes/cgenomes_proteindb.py
pgatk/cgenomes/cosmic_downloader.py
pgatk/cli.py
pgatk/clinvar/clinvar_service.py
pgatk/clinvar/data_downloader.py
pgatk/commands/cbioportal_to_proteindb.py
pgatk/commands/clinvar_to_proteindb.py
pgatk/commands/cosmic_downloader.py
pgatk/commands/cosmic_to_proteindb.py
pgatk/commands/ensembl_downloader.py
pgatk/commands/gencode_downloader.py
pgatk/commands/gnomad_vcf_downloader.py
pgatk/commands/ncbi_downloader.py
pgatk/commands/vcf_to_proteindb.py
pgatk/config/cbioportal_config.yaml
pgatk/config/clinvar_config.yaml
pgatk/config/cosmic_config.yaml
pgatk/config/ensembl_config.yaml
pgatk/db/digest_mutant_protein.py
pgatk/db/map_peptide2genome.py
pgatk/ensembl/data_downloader.py
pgatk/ensembl/ensembl.py
pgatk/gnomad/__init__.py
pgatk/gnomad/data_downloader.py
pgatk/proteogenomics/spectrumai.py
pgatk/proteomics/db/protein_database_decoy.py
pgatk/testdata/clinvar/clinvar_isg15_assembly_report.txt
pgatk/testdata/clinvar/clinvar_isg15_cds.fna
pgatk/testdata/clinvar/clinvar_isg15_cds.gff
pgatk/testdata/clinvar/clinvar_isg15_mutation_types.vcf
pgatk/testdata/clinvar/mini_clinvar.vcf
pgatk/testdata/clinvar/mini_refseq.gff
pgatk/testdata/clinvar/mini_refseq.gtf
pgatk/testdata/clinvar/mini_refseq_protein.faa
pgatk/testdata/output_decoy.fa
pgatk/testdata/proteindb_from_CDSs_DNAseq.fa
pgatk/testdata/proteindb_from_ENSEMBL_VCF-clean.fa
pgatk/testdata/proteindb_from_ENSEMBL_VCF.fa
pgatk/testdata/proteindb_from_altORFs_DNAseq.fa
pgatk/testdata/proteindb_from_custom_VCF.fa
pgatk/testdata/proteindb_from_gnomad_VCF.fa
pgatk/testdata/proteindb_from_lncRNAs_DNAseq.fa
pgatk/testdata/proteindb_from_ncRNAs_DNAseq.fa
pgatk/testdata/proteindb_from_pseudogenes_DNAseq.fa
pgatk/testdata/proteindb_from_sncRNAs_DNAseq.fa
pgatk/testdata/proteindb_minus_strand_test.fa
pgatk/testdata/test_cbioportal_data_clinical_sample.txt
pgatk/testdata/test_cbioportal_data_mutations_mskcc.txt
pgatk/testdata/test_cbioportal_data_mutations_mskcc_proteindb.fa
pgatk/testdata/test_cbioportal_data_mutations_mskcc_proteindb_BladderCancer.fa
pgatk/testdata/test_cbioportal_genes.fa
pgatk/testdata/test_cbioportal_grch37_clinical_sample.txt
pgatk/testdata/test_cbioportal_grch37_mutations.txt
pgatk/testdata/test_cbioportal_grch38_mutations.txt
pgatk/testdata/test_cbioportal_ncbi_grch37.gff
pgatk/testdata/test_cbioportal_ncbi_grch37_transcripts.fa
pgatk/testdata/test_cbioportal_ncbi_grch38.gff
pgatk/testdata/test_cbioportal_ncbi_grch38_transcripts.fa
pgatk/testdata/test_cosmic_mutations_proteindb.fa
pgatk/testdata/test_cosmic_mutations_proteindb_bone.fa
pgatk/testdata/test_cosmic_mutations_proteindb_liver.fa
pgatk/testdata/test_cosmic_mutations_proteindb_skin.fa
pgatk/testdata/test_cosmic_mutations_proteindb_thyroid.fa
pgatk/testdata/test_cosmic_mutations_proteindb_upperaerodigestivetract.fa
pgatk/testdata/test_ensembl_v2p.fa
pgatk/testdata/test_ensembl_v2p.gtf
pgatk/testdata/test_ensembl_v2p.vcf
pgatk/testdata/test_gnomad_gencode.fa
pgatk/testdata/test_gnomad_gencode.gtf
pgatk/tests/pgatk_tests.py
pgatk/tests/test_cbioportal_translation.py
pgatk/tests/test_clinvar/test_clinvar_integration.py
pgatk/tests/test_clinvar/test_clinvar_service.py
pgatk/tests/test_clinvar/test_clinvar_translation.py
pgatk/tests/test_clinvar/test_data_downloader.py
pgatk/tests/test_cosmic_core.py
pgatk/tests/test_cosmic_downloader.py
pgatk/tests/test_ensembl_core.py
pgatk/tests/test_ensembl_translation.py
pgatk/tests/test_gnomad_translation.py
pgatk/tests/test_variant_types_hgvs.py
pgatk/tests/test_vcf_to_proteindb_parallel.py
pgatk/toolbox/general.py
pgatk/toolbox/vcf_utils.py
pyproject.toml
scripts/benchmark_vcf_to_proteindb.py
setup.cfg

💤 Files with no reviewable changes (29)

pgatk/testdata/test_cosmic_mutations_proteindb_bone.fa
pgatk/testdata/test_cosmic_mutations_proteindb_upperaerodigestivetract.fa
docs/plans/2026-03-03-protein-accession-design.md
pgatk/testdata/proteindb_from_pseudogenes_DNAseq.fa
pgatk/testdata/proteindb_from_sncRNAs_DNAseq.fa
pgatk/testdata/proteindb_from_lncRNAs_DNAseq.fa
docs/plans/2026-03-01-pgatk-graph-engine-design.md
pgatk/testdata/proteindb_minus_strand_test.fa
pgatk/testdata/test_cbioportal_data_mutations_mskcc_proteindb_BladderCancer.fa
pgatk/testdata/test_cbioportal_data_clinical_sample.txt
pgatk/testdata/test_cosmic_mutations_proteindb_thyroid.fa
pgatk/testdata/proteindb_from_ENSEMBL_VCF.fa
pgatk/testdata/output_decoy.fa
pgatk/testdata/proteindb_from_CDSs_DNAseq.fa
pgatk/testdata/test_cosmic_mutations_proteindb.fa
pgatk/testdata/proteindb_from_ncRNAs_DNAseq.fa
pgatk/testdata/proteindb_from_gnomad_VCF.fa
pgatk/testdata/test_cosmic_mutations_proteindb_skin.fa
pgatk/testdata/proteindb_from_custom_VCF.fa
.codacy.yml
pgatk/testdata/test_cbioportal_genes.fa
pgatk/testdata/clinvar/mini_refseq.gtf
pgatk/testdata/proteindb_from_ENSEMBL_VCF-clean.fa
pgatk/testdata/test_cbioportal_data_mutations_mskcc_proteindb.fa
pgatk/testdata/proteindb_from_altORFs_DNAseq.fa
pgatk/testdata/test_cosmic_mutations_proteindb_liver.fa
pgatk/proteogenomics/spectrumai.py
pgatk/testdata/test_cbioportal_data_mutations_mskcc.txt
pgatk/proteomics/db/protein_database_decoy.py

coderabbitai · 2026-05-22T17:18:19Z

+
+**Tissue-type data architecture** — `PRIMARY_SITE` (tissue of origin) is no longer a column in the mutation export. It lives in a separate **Classification file** (`Cosmic_Classification_Tsv_v103_GRCh38.tar`) and is joined via `COSMIC_PHENOTYPE_ID`:
+
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced code blocks (MD040).

Several fenced blocks open with plain triple backticks at the flagged lines. Add explicit languages (e.g., text, bash, python) to satisfy markdownlint and improve rendering.

Also applies to: 179-179, 193-193, 244-244, 279-279, 289-289, 336-336, 341-341, 345-345, 373-373

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 153-153: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/validations.md` at line 153, Several fenced code blocks in the document open with plain triple backticks ("```") and lack language identifiers; update each opening fence to include an explicit language (for example change "```" to "```bash", "```python", or "```text" as appropriate) for the blocks referenced so markdownlint MD040 is satisfied and rendering improves; locate the offending fences by searching for the bare "```" sequences within the file and pick the correct language based on the block contents (shell commands => bash, code snippets => python/javascript/etc., plain output => text), then save the file.

coderabbitai · 2026-05-22T17:18:20Z

+def _sequence_set(fasta_path) -> set:
+    seqs = []
+    current = []
+    with open(fasta_path, 'r', encoding='utf-8') as f:
+        for line in f:
+            if line.startswith('>'):
+                if current:
+                    seqs.append(''.join(current))
+                    current = []
+            else:
+                current.append(line.strip())
+        if current:
+            seqs.append(''.join(current))
+    return set(seqs)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve sequence multiplicity in the equivalence assertion.

Using set hides duplicate-count differences, so a parallelization bug that changes copy counts can still pass. Compare multisets instead.

Suggested fix

+from collections import Counter @@ -def _sequence_set(fasta_path) -> set: +def _sequence_multiset(fasta_path) -> Counter: @@ - return set(seqs) + return Counter(seqs) @@ - assert _sequence_set(out_seq) == _sequence_set(out_par) + assert _sequence_multiset(out_seq) == _sequence_multiset(out_par)

Also applies to: 64-64

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pgatk/tests/test_vcf_to_proteindb_parallel.py` around lines 11 - 24, The helper _sequence_set currently returns a set which loses duplicate counts and can mask parallelization bugs; change it to return a multiset (e.g., use collections.Counter on seqs) and update the test assertions to compare Counters (or other multiset-aware comparison) instead of sets; replace all uses of _sequence_set (including the other occurrence) so tests assert equality of multisets rather than sets.

Copilot

Pull request overview

This PR refactors multiple provider pipelines (Ensembl/VCF, ClinVar/NCBI, COSMIC, cBioPortal, gnomAD/GENCODE) and updates CLI/config/testdata to support new download workflows (including transcript/CDS generation via gffread) and optional parallelism.

Changes:

Added new CLI download commands (gencode-downloader, gnomad-vcf-downloader) and enhanced existing downloaders (NCBI/Ensembl/COSMIC) with transcript/CDS generation support.
Introduced/extended parallel execution options (e.g., --workers) and added tests for parallel equivalence.
Updated ClinVar pipeline to use RefSeq GFF3 (instead of GTF), refreshed fixtures, and adjusted multiple tests/configs accordingly.

Reviewed changes

Copilot reviewed 100 out of 108 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
setup.cfg	Adds pydocstyle/bandit configuration.
scripts/benchmark_vcf_to_proteindb.py	Adds local benchmark/profiling helper for `vcf-to-proteindb`.
pyproject.toml	Bumps version and adds tool config for pydocstyle.
pgatk/toolbox/vcf_utils.py	Suppresses Biopython “Partial codon” warnings during ORF translation.
pgatk/toolbox/general.py	Adds `open_vcf()` and enhances `download_file()` with progress reporting and scheme validation.
pgatk/tests/test_vcf_to_proteindb_parallel.py	Adds equivalence test for sequential vs parallel vcf-to-proteindb output.
pgatk/tests/test_cosmic_downloader.py	Adds unit tests for COSMIC API URL/token logic.
pgatk/tests/test_cosmic_core.py	Updates mismatch sentinel behavior and adds leading-`N` alignment test.
pgatk/tests/test_clinvar/test_data_downloader.py	Updates expected NCBI RefSeq URLs (GFF3 + genomic FASTA).
pgatk/tests/test_clinvar/test_clinvar_service.py	Switches tests to GFF3 fixtures and record-based overlap mapping.
pgatk/tests/test_clinvar/test_clinvar_integration.py	Updates integration tests to use `--gff` and new fixtures.
pgatk/tests/pgatk_tests.py	Updates cbioportal + cosmic tests and hardens ensembl-check fixture generation.
pgatk/testdata/test_validate_psms_out.tsv	Removes committed generated test output.
pgatk/testdata/test_gnomad_gencode.gtf	Adds gnomAD/GENCODE GTF fixture.
pgatk/testdata/test_gnomad_gencode.fa	Adds gnomAD/GENCODE transcript FASTA fixture.
pgatk/testdata/test_ensembl_v2p.vcf	Adds Ensembl VEP-annotated VCF fixture.
pgatk/testdata/test_ensembl_v2p.gtf	Adds Ensembl GTF fixture.
pgatk/testdata/test_ensembl_v2p.fa	Adds transcript FASTA fixture for Ensembl VCF translation.
pgatk/testdata/test_cosmic_mutations.tsv	Updates COSMIC fixture format/columns.
pgatk/testdata/test_cosmic_mutations_proteindb.fa	Removes committed generated output FASTA.
pgatk/testdata/test_cosmic_mutations_proteindb_upperaerodigestivetract.fa	Removes committed generated output FASTA.
pgatk/testdata/test_cosmic_mutations_proteindb_thyroid.fa	Removes committed generated output FASTA.
pgatk/testdata/test_cosmic_mutations_proteindb_skin.fa	Removes committed generated output FASTA.
pgatk/testdata/test_cosmic_mutations_proteindb_liver.fa	Removes committed generated output FASTA.
pgatk/testdata/test_cosmic_mutations_proteindb_bone.fa	Removes committed generated output FASTA.
pgatk/testdata/test_cosmic_classification.tsv	Adds COSMIC classification fixture.
pgatk/testdata/test_cbioportal_grch38_mutations.txt	Adds GRCh38 cBioPortal mutation fixture.
pgatk/testdata/test_cbioportal_grch37_mutations.txt	Adds GRCh37 cBioPortal mutation fixture.
pgatk/testdata/test_cbioportal_grch37_clinical_sample.txt	Adds GRCh37 cBioPortal clinical fixture.
pgatk/testdata/test_blast_validate_psms_out.tsv	Removes committed generated output TSV.
pgatk/testdata/test_blast_psms_out.tsv	Removes committed generated output TSV.
pgatk/testdata/psms_with_single_missmatch.tsv	Removes committed generated output TSV.
pgatk/testdata/proteindb_minus_strand_test.fa	Removes committed generated output FASTA.
pgatk/testdata/proteindb_from_sncRNAs_DNAseq.fa	Removes committed generated output FASTA.
pgatk/testdata/proteindb_from_pseudogenes_DNAseq.fa	Removes committed generated output FASTA.
pgatk/testdata/proteindb_from_ncRNAs_DNAseq.fa	Removes committed generated output FASTA.
pgatk/testdata/proteindb_from_lncRNAs_DNAseq.fa	Removes committed generated output FASTA.
pgatk/testdata/proteindb_from_gnomad_VCF.fa	Removes committed generated output FASTA.
pgatk/testdata/proteindb_from_ENSEMBL_VCF.fa	Removes committed generated output FASTA.
pgatk/testdata/proteindb_from_custom_VCF.fa	Removes committed generated output FASTA.
pgatk/testdata/proteindb_from_CDSs_DNAseq.fa	Removes committed generated output FASTA.
pgatk/testdata/output_decoy.fa	Removes committed generated decoy output FASTA.
pgatk/testdata/clinvar/mini_refseq.gtf	Removes old ClinVar GTF fixture.
pgatk/testdata/clinvar/mini_refseq.gff	Adds new ClinVar GFF3 fixture.
pgatk/testdata/clinvar/mini_refseq_protein.faa	Updates ClinVar transcript FASTA fixture content.
pgatk/testdata/clinvar/mini_clinvar.vcf	Updates ClinVar mini VCF coordinates/records.
pgatk/testdata/clinvar/clinvar_isg15_mutation_types.vcf	Adds ClinVar mutation-type coverage fixture.
pgatk/testdata/clinvar/clinvar_isg15_cds.gff	Adds ISG15 CDS-only GFF3 fixture.
pgatk/testdata/clinvar/clinvar_isg15_cds.fna	Adds ISG15 CDS nucleotide FASTA fixture.
pgatk/testdata/clinvar/clinvar_isg15_assembly_report.txt	Adds minimal assembly report fixture.
pgatk/proteomics/db/protein_database_decoy.py	Cleans up decoy generation implementation/comments.
pgatk/proteogenomics/spectrumai.py	Minor cleanup in ions_tolerance handling comments.
pgatk/gnomad/init.py	Adds/keeps gnomad package init.
pgatk/ensembl/data_downloader.py	Adds `generate_transcripts()` helper via gffread and imports.
pgatk/db/map_peptide2genome.py	Minor comment cleanup.
pgatk/db/digest_mutant_protein.py	Hardens header parsing and removes invalid close calls.
pgatk/config/ensembl_config.yaml	Updates default included biotypes list.
pgatk/config/cosmic_config.yaml	Switches to scripted-download products list and updates filter_column.
pgatk/config/clinvar_config.yaml	Adds `workers` default.
pgatk/config/cbioportal_config.yaml	Adds biotype/classification filtering defaults and translation settings.
pgatk/commands/vcf_to_proteindb.py	Adds `--workers` option and forwards to EnsemblDataService.
pgatk/commands/ncbi_downloader.py	Adds GRCh37 support + gffread transcript/CDS generation options.
pgatk/commands/gnomad_vcf_downloader.py	New command to download gnomAD VCF/TBI files in parallel.
pgatk/commands/gencode_downloader.py	New command to download GENCODE and optionally generate transcripts FASTA.
pgatk/commands/ensembl_downloader.py	Adds optional transcript generation after download.
pgatk/commands/cosmic_to_proteindb.py	Adds clinical annotation file option + default filter column change.
pgatk/commands/cosmic_downloader.py	Adds `--products` override and enhanced `--url_file` semantics.
pgatk/commands/clinvar_to_proteindb.py	Switches CLI to GFF3 and adds optional parallel workers.
pgatk/commands/cbioportal_to_proteindb.py	Updates CLI inputs/options and adds workers + coordinate translation support.
pgatk/clinvar/data_downloader.py	Adds GRCh37 support and gffread transcript/CDS extraction utilities.
pgatk/cli.py	Registers new CLI commands.
pgatk/cgenomes/cbioportal_downloader.py	Refactors cBioPortal download to REST API (MAF + clinical TSV output).
docs/plans/2026-03-03-protein-accession-design.md	Removes internal plan doc from repo.
docs/plans/2026-03-01-pgatk-graph-engine-design.md	Removes internal plan doc from repo.
docs/index.md	Adds documentation link for validations.
.pydocstyle	Adds pydocstyle ignore config file.
.gitignore	Expands ignores for generated outputs, db files, and internal docs.
.codacy.yml	Removes old Codacy config name.
.codacy.yaml	Adds updated Codacy config with pydocstyle settings and excludes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Cross-checked every documented command/option against pgatk/commands/*.py. Fixes: - generate-decoy: rewrite the help block - it listed mostly invented short flags (-i, -o, -l, -n, -x, -w, -t, -b, -m) and two options (--cleavage_sites, --anti_cleavage_sites) that do not exist on the click command; correct flags are -in/-out/-e/-d/-s and the new --min_peptide_length/--max_peptide_length/--keep_target_hits options. - ensembl-downloader: remove -l/--list_taxonomies (no such option) and the example using it; add the real -en/--ensembl_name, --grch37, and --url_file options. - cosmic-downloader: add -P/--products and --url_file options. - cbioportal-downloader: add -th/--multithreading and --url_file options. - ncbi-downloader: add --grch37, --generate-transcripts, --generate-cds. - vcf-to-proteindb: -c/--config_file is optional (was listed Required); --exclude_consequences default includes regulatory_region_variant; the "parameters to change for non-ENSEMBL sources" note referred to --transcript_index/--consequence_index which do not exist - corrected to --transcript_str/--consequence_str/--biotype_str; -w default is cpu_count(), not 1. - dnaseq-to-proteindb: -c/--config_file is optional; --num_orfs default is 3 (matches click help string and ensembl_config.yaml). - cosmic-to-proteindb cell-line example: --filter_column 'Sample name' is the v2 column name; v103 uses SAMPLE_NAME. - Fix broken intra-page anchor #downloading-ncbi--clinvar-data -> #downloading-ncbi-clinvar-data (mkdocs slug normalizes slash to nothing). mkdocs build now produces zero warnings from user-facing pages (remaining warnings are in docs/plans/, which is historical content).

docs(cli): fix stale options, defaults, and examples in pgatk-cli.md

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/pgatk-cli.md`:
- Line 576: The in-page anchor "`#downloading-ncbi-clinvar-data`" doesn't match
the actual heading "### Downloading NCBI / ClinVar Data"; locate the sentence
that mentions `--generate-transcripts` (the line linking to the downloader docs)
and either update the fragment to the correct slug for that heading (e.g.,
"`#downloading-ncbi--clinvar-data`") or change the heading text to a slug that
matches the existing fragment (e.g., remove or replace the "/" so the heading
slug becomes "downloading-ncbi-clinvar-data"); ensure the link and heading use
the same slug so the anchor resolves.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8121c2f5-e8cb-4c00-9216-26f84c2af46d

📥 Commits

Reviewing files that changed from the base of the PR and between 5f92d21 and 824ea58.

📒 Files selected for processing (1)

docs/pgatk-cli.md

coderabbitai · 2026-05-22T21:12:44Z

 ```

-The input files can be downloaded using the [ncbi-downloader](#downloading-ncbi--clinvar-data) command.
+The input files are produced by the [ncbi-downloader](#downloading-ncbi-clinvar-data) command. Use `--generate-transcripts` with `ncbi-downloader` to produce `transcripts.fa` (with `CDS=` headers) from the GFF3 annotation.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix broken in-page anchor target.

Line 576 links to #downloading-ncbi-clinvar-data, but that fragment does not match the heading slug for ### Downloading NCBI / ClinVar Data, so the link may not resolve in rendered Markdown.

Proposed fix

-The input files are produced by the [ncbi-downloader](`#downloading-ncbi-clinvar-data`) command. Use `--generate-transcripts` with `ncbi-downloader` to produce `transcripts.fa` (with `CDS=` headers) from the GFF3 annotation. +The input files are produced by the [ncbi-downloader](`#downloading-ncbi--clinvar-data`) command. Use `--generate-transcripts` with `ncbi-downloader` to produce `transcripts.fa` (with `CDS=` headers) from the GFF3 annotation.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

The input files are produced by the [ncbi-downloader](#downloading-ncbi-clinvar-data) command. Use `--generate-transcripts` with `ncbi-downloader` to produce `transcripts.fa` (with `CDS=` headers) from the GFF3 annotation.

The input files are produced by the [ncbi-downloader](`#downloading-ncbi--clinvar-data`) command. Use `--generate-transcripts` with `ncbi-downloader` to produce `transcripts.fa` (with `CDS=` headers) from the GFF3 annotation.

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 576-576: Link fragments should be valid

(MD051, link-fragments)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/pgatk-cli.md` at line 576, The in-page anchor "`#downloading-ncbi-clinvar-data`" doesn't match the actual heading "### Downloading NCBI / ClinVar Data"; locate the sentence that mentions `--generate-transcripts` (the line linking to the downloader docs) and either update the fragment to the correct slug for that heading (e.g., "`#downloading-ncbi--clinvar-data`") or change the heading text to a slug that matches the existing fragment (e.g., remove or replace the "/" so the heading slug becomes "downloading-ncbi-clinvar-data"); ensure the link and heading use the same slug so the anchor resolves.

- docs/use-cases.md: replace non-portable `grep -oP` with a portable `sed -E` extraction so the per-chromosome ENSEMBL example works on systems without GNU grep (e.g. default macOS). - gencode_downloader: also catch `subprocess.CalledProcessError` when running gffread (matches ensembl_downloader and ncbi_downloader). - ensembl_downloader: when no genome matches the GTF prefix, emit a warning naming the missing prefix and the fallback assembly before using it — previously a GRCh38 GTF could silently pair with a GRCh37 genome (or vice versa) with no indication to the user. - mini_clinvar.vcf: document that rs00002 and rs00004 are intentionally at the same 1:69550 G>A position so the two filtering paths (CLNSIG-based and MC-based) are independently exercised. Skipped: CodeRabbit also flagged the `workers > 1` guard in cbioportal_to_proteindb. The click option defaults to 1 and the service also defaults to 1 when WORKERS is unset, so `--workers 1` and "no flag" produce identical behaviour; the guard is not a user-visible bug.

chore: address CodeRabbit follow-up findings

Each item verified against current dev before fixing. Critical: - cgenomes_proteindb._cbio_translate_batch DEL path: `pos.split("del")[1]` returns "" for HGVS range-only deletions (e.g. `c.104_124del`); the `del_dna == ""` comparison always matched the empty slice so nothing was removed and the deletion was silently ignored. Now derive the deletion length from the range positions in cdna_pos when del_dna is empty, with a bounds check. Major: - cbioportal_downloader.download_one_study: when `--multithreading` is enabled the function was writing to the shared `url_file` handle from up to 10 worker threads, which can interleave/corrupt lines. The thread pool now passes `url_file=None` to workers and the main thread serializes the writes after `as_completed`. Single-threaded call sites still pass the handle so behaviour is unchanged there. - cosmic_downloader: on S3 download failure, only log the path of the presigned URL — not the full URL — so short-lived AWS query credentials don't leak into log aggregators. (The success-path debug log was already redacted; this matches it.) - clinvar/, ensembl/, gnomad/ data_downloader._resolve_genome_fna / generate_transcripts: stage `.gz` -> plain FASTA decompression through a sibling `.tmp` file and atomically `os.replace()` it, removing the temp on any exception. Previously an interrupted decompression left a partial sibling that later runs silently reused, producing corrupt gffread output. - db/map_peptide2genome.parse_gtf: `line[0] != "#"` raises IndexError on empty lines. Skip blank lines before the comment check; also guard against short rows in the CDS path. - docs/use-cases.md Step 3 cats `putative.fa` but Step 2 never produced it. Added the missing 3-frame putative-protein generation step. Minor: - vcf-to-proteindb `--workers` help text said "default: 1, sequential. Per-chromosome split.", but the click default is None (config-driven, defaults to 1) and the actual strategy is fixed-size variant batches (~50k) that may span chromosomes. Updated the help text to match. - ensembl._vcf_to_proteindb_worker: signature said `-> None` but the function returns a stats dict that the caller aggregates. Corrected the annotation. - docs/use-cases.md typos: annoated -> annotated, Cancer-speific -> Cancer-specific (x2), numbre -> number. Skipped (with reason): - Worker validation (`--workers` >= 1) across cbioportal/clinvar/gnomad CLI commands — separate concern; could be a follow-up using click.IntRange(min=1). - ncbi_downloader broad except — already catches the right exceptions per earlier review (#107). - Various nitpicks (B202 comment placement, en-dash, private-attr test, import location, `next()` over `[0]`) — low-value style items. Test plan: - python -m pytest pgatk/tests/ -q -> 316 passed - python -m unittest discover -s pgatk/tests -p "*_tests.py" -> 20/20

chore: address CodeRabbit + Copilot findings from PR #105

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/use-cases.md`:
- Around line 1194-1201: The current command three-frame-translates all
protein_coding transcripts (exploding search space); instead narrow putative.fa
to non-canonical transcripts by filtering out canonical transcripts before
translation: either run dnaseq-to-proteindb on a FASTA that excludes sequences
listed in canonical.fa or add an exclusion flag (e.g., --exclude_fasta
canonical.fa or --exclude_transcripts) if supported; specifically update the
pgatk dnaseq-to-proteindb invocation (flags: --input_fasta, --output_proteindb,
--protein_prefix, --include_biotypes, --num_orfs, --skip_including_all_cds) so
that only non-canonical transcripts are translated and concatenated with
canonical.fa as intended.

In `@pgatk/cgenomes/cbioportal_downloader.py`:
- Around line 414-421: download_one_study() currently proceeds to return
out_path even if _fetch_clinical_data(...) failed/returned None, leaving
data_clinical_sample.txt missing; update download_one_study() to check the
return value of _fetch_clinical_data() (or its success flag) and treat a
None/failure as fatal: log an error and raise an exception or return a
non-success indicator so the study download fails instead of returning out_path.
Ensure the url_file.write(out_path + "\n") only runs when both mutation file and
clinical fetch succeeded, and reference the functions/variables
_fetch_clinical_data, download_one_study, url_file, and out_path when locating
the change.

In `@pgatk/gnomad/data_downloader.py`:
- Around line 40-51: The code creates a shared temp path via tmp = plain +
".tmp" which can be clobbered by concurrent runs; change the decompression to
create a unique temp file (e.g., use tempfile.mkstemp or
tempfile.NamedTemporaryFile(delete=False) to obtain a unique pathname), write to
that unique temp (instead of tmp), call os.replace(unique_tmp, plain) for atomic
swap, and on exception only remove the unique temp you created (so adjust the
cleanup branch to reference the unique temp variable rather than plain +
".tmp").

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 58a35505-9f9c-40e7-b742-3e9a8505fb1a

📥 Commits

Reviewing files that changed from the base of the PR and between dac9ff6 and 55ae29a.

📒 Files selected for processing (10)

docs/use-cases.md
pgatk/cgenomes/cbioportal_downloader.py
pgatk/cgenomes/cgenomes_proteindb.py
pgatk/cgenomes/cosmic_downloader.py
pgatk/clinvar/data_downloader.py
pgatk/commands/vcf_to_proteindb.py
pgatk/db/map_peptide2genome.py
pgatk/ensembl/data_downloader.py
pgatk/ensembl/ensembl.py
pgatk/gnomad/data_downloader.py

coderabbitai · 2026-05-23T08:17:52Z

+# Putative proteins (three-frame translation of all protein-coding transcripts)
+pgatk dnaseq-to-proteindb \
+    --input_fasta ensembl_data/transcripts.fa \
+    --output_proteindb putative.fa \
+    --protein_prefix putative_ \
+    --include_biotypes protein_coding \
+    --num_orfs 3 \
+    --skip_including_all_cds


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Narrow putative.fa back to the non-canonical subset.

This recipe now three-frame-translates all protein_coding transcripts, then concatenates that output with canonical.fa. That materially changes the workflow from the earlier “putative” definition in this doc and will explode the search space with highly redundant coding-transcript translations.

Suggested doc fix

-# Putative proteins (three-frame translation of all protein-coding transcripts) +# Putative proteins (three-frame) pgatk dnaseq-to-proteindb \ --input_fasta ensembl_data/transcripts.fa \ --output_proteindb putative.fa \ --protein_prefix putative_ \ - --include_biotypes protein_coding \ + --include_biotypes protein_coding_CDS_not_defined,TEC,translated_processed_pseudogene \ --num_orfs 3 \ --skip_including_all_cds

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Putative proteins (three-frame translation of all protein-coding transcripts)

pgatk dnaseq-to-proteindb \

--input_fasta ensembl_data/transcripts.fa \

--output_proteindb putative.fa \

--protein_prefix putative_ \

--include_biotypes protein_coding \

--num_orfs 3 \

--skip_including_all_cds

# Putative proteins (three-frame)

pgatk dnaseq-to-proteindb \

--input_fasta ensembl_data/transcripts.fa \

--output_proteindb putative.fa \

--protein_prefix putative_ \

--include_biotypes protein_coding_CDS_not_defined,TEC,translated_processed_pseudogene \

--num_orfs 3 \

--skip_including_all_cds

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/use-cases.md` around lines 1194 - 1201, The current command three-frame-translates all protein_coding transcripts (exploding search space); instead narrow putative.fa to non-canonical transcripts by filtering out canonical transcripts before translation: either run dnaseq-to-proteindb on a FASTA that excludes sequences listed in canonical.fa or add an exclusion flag (e.g., --exclude_fasta canonical.fa or --exclude_transcripts) if supported; specifically update the pgatk dnaseq-to-proteindb invocation (flags: --input_fasta, --output_proteindb, --protein_prefix, --include_biotypes, --num_orfs, --skip_including_all_cds) so that only non-canonical transcripts are translated and concatenated with canonical.fa as intended.

coderabbitai · 2026-05-23T08:17:52Z

+        _fetch_clinical_data(self._cbioportal_base_url, download_study, study_dir, log)
+
+        # Note: url_file writes intentionally happen in download_study() (single-threaded
+        # main thread) after futures complete; the parameter is retained for the
+        # single-study code path and to keep the legacy serial call sites unchanged.
+        if url_file is not None:
+            url_file.write(out_path + "\n")
+        return out_path


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the study download when the clinical TSV cannot be produced.

download_one_study() still returns data_mutations.txt even when _fetch_clinical_data() returns None, so the new per-study layout can be reported as successful while data_clinical_sample.txt is missing. That breaks downstream --clinical_sample_file workflows.

Suggested fix

- _fetch_clinical_data(self._cbioportal_base_url, download_study, study_dir, log) + clinical_path = _fetch_clinical_data( + self._cbioportal_base_url, download_study, study_dir, log + ) + if clinical_path is None: + log.error("Failed to fetch clinical data for study '%s'", download_study) + return None # Note: url_file writes intentionally happen in download_study() (single-threaded # main thread) after futures complete; the parameter is retained for the # single-study code path and to keep the legacy serial call sites unchanged. if url_file is not None:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

_fetch_clinical_data(self._cbioportal_base_url, download_study, study_dir, log)

# Note: url_file writes intentionally happen in download_study() (single-threaded

# main thread) after futures complete; the parameter is retained for the

# single-study code path and to keep the legacy serial call sites unchanged.

if url_file is not None:

url_file.write(out_path + "\n")

return out_path

clinical_path = _fetch_clinical_data(

self._cbioportal_base_url, download_study, study_dir, log

)

if clinical_path is None:

log.error("Failed to fetch clinical data for study '%s'", download_study)

return None

# Note: url_file writes intentionally happen in download_study() (single-threaded

# main thread) after futures complete; the parameter is retained for the

# single-study code path and to keep the legacy serial call sites unchanged.

if url_file is not None:

url_file.write(out_path + "\n")

return out_path

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pgatk/cgenomes/cbioportal_downloader.py` around lines 414 - 421, download_one_study() currently proceeds to return out_path even if _fetch_clinical_data(...) failed/returned None, leaving data_clinical_sample.txt missing; update download_one_study() to check the return value of _fetch_clinical_data() (or its success flag) and treat a None/failure as fatal: log an error and raise an exception or return a non-success indicator so the study download fails instead of returning out_path. Ensure the url_file.write(out_path + "\n") only runs when both mutation file and clinical fetch succeeded, and reference the functions/variables _fetch_clinical_data, download_one_study, url_file, and out_path when locating the change.

coderabbitai · 2026-05-23T08:17:52Z

+    tmp = plain + ".tmp"
+    try:
+        with gzip.open(genome_fna, "rb") as fi, open(tmp, "wb") as fo:
+            shutil.copyfileobj(fi, fo)
+        os.replace(tmp, plain)
+    except BaseException:
+        if os.path.exists(tmp):
+            try:
+                os.remove(tmp)
+            except OSError:
+                pass
+        raise


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a unique temporary filename to avoid concurrent decompression races.

Line 40 uses a shared plain + ".tmp" path. Parallel runs against the same input can clobber/remove each other’s temp file.

Proposed fix

@@ import shutil import subprocess # nosec B404 - used only with hardcoded gffread argv lists, never shell=True +import tempfile @@ - tmp = plain + ".tmp" + tmp_fd, tmp = tempfile.mkstemp( + prefix=os.path.basename(plain) + ".", + suffix=".tmp", + dir=os.path.dirname(plain) or ".", + ) + os.close(tmp_fd) try: with gzip.open(genome_fna, "rb") as fi, open(tmp, "wb") as fo: shutil.copyfileobj(fi, fo) os.replace(tmp, plain) - except BaseException: + except BaseException: if os.path.exists(tmp): try: os.remove(tmp) except OSError: pass raise

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

tmp = plain + ".tmp"

try:

with gzip.open(genome_fna, "rb") as fi, open(tmp, "wb") as fo:

shutil.copyfileobj(fi, fo)

os.replace(tmp, plain)

except BaseException:

if os.path.exists(tmp):

try:

os.remove(tmp)

except OSError:

pass

raise

tmp_fd, tmp = tempfile.mkstemp(

prefix=os.path.basename(plain) + ".",

suffix=".tmp",

dir=os.path.dirname(plain) or ".",

)

os.close(tmp_fd)

try:

with gzip.open(genome_fna, "rb") as fi, open(tmp, "wb") as fo:

shutil.copyfileobj(fi, fo)

os.replace(tmp, plain)

except BaseException:

if os.path.exists(tmp):

try:

os.remove(tmp)

except OSError:

pass

raise

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pgatk/gnomad/data_downloader.py` around lines 40 - 51, The code creates a shared temp path via tmp = plain + ".tmp" which can be clobbered by concurrent runs; change the decompression to create a unique temp file (e.g., use tempfile.mkstemp or tempfile.NamedTemporaryFile(delete=False) to obtain a unique pathname), write to that unique temp (instead of tmp), call os.replace(unique_tmp, plain) for atomic swap, and on exception only remove the unique temp you created (so adjust the cleanup branch to reference the unique temp variable rather than plain + ".tmp").

ypriverol and others added 30 commits May 11, 2026 09:48

docs: rename --var_prefix to --protein_prefix in CLI examples

b670242

The click commands accept --protein_prefix; --var_prefix no longer exists. Example commands in docs/use-cases.md and docs/pgatk-cli.md were stale and failed when copy-pasted. Closes #100

Stop tracking docs/plans/ (internal working docs)

bb4edb6

Implementation plans and design scratch notes are kept locally as working artifacts only; they don't belong in the published repo. Files remain on disk and are now gitignored.

Bump version to 0.0.27

e6889e4

minor command fixes in use-cases file

311fafb

minor fix to cosmic gzip file reading

0b9c179

Merge pull request #104 from bigbio/perf/vcf-to-proteindb-parallel

6c02214

perf(vcf-to-proteindb): per-chromosome multiprocessing (~3.4× on 4-chrom test)

Improved and fixed issues in variant translation

9f59cd3

- performance improvement in vcf-to-proteindb by taking the variants in batches plus other caching strategies and pre-filtering - added tests for ensembl variant translation - added tests to gnomad and updated docs to use Gencode files

fixed test data file for clinvar

71fbe99

missed in the prev commit

6c81a0d

ypriverol added 3 commits May 22, 2026 15:22

Merge pull request #102 from bigbio/feat/bug-fixed-iteration

5f92d21

Bug-fix iteration: closes #98, #99, #100, #101

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

ypriverol requested a review from Copilot May 22, 2026 17:19

Copilot started reviewing on behalf of ypriverol May 22, 2026 17:20 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread pgatk/cgenomes/cbioportal_downloader.py

Comment thread pgatk/commands/vcf_to_proteindb.py Outdated

ypriverol added 2 commits May 22, 2026 18:29

Merge pull request #106 from bigbio/docs/cli-options-fix

824ea58

docs(cli): fix stale options, defaults, and examples in pgatk-cli.md

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

ypriverol added 3 commits May 22, 2026 22:36

Merge pull request #107 from bigbio/chore/coderabbit-followups

dac9ff6

chore: address CodeRabbit follow-up findings

ypriverol mentioned this pull request May 23, 2026

chore: address CodeRabbit + Copilot findings from PR #105 #108

Merged

3 tasks

Merge pull request #108 from bigbio/chore/pr105-review-followups

55ae29a

chore: address CodeRabbit + Copilot findings from PR #105

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

ypriverol merged commit cdedf90 into master May 23, 2026
4 checks passed


		Tissue-type data architecture — `PRIMARY_SITE` (tissue of origin) is no longer a column in the mutation export. It lives in a separate Classification file (`Cosmic_Classification_Tsv_v103_GRCh38.tar`) and is joined via `COSMIC_PHENOTYPE_ID`:

		```

	The input files are produced by the [ncbi-downloader](#downloading-ncbi-clinvar-data) command. Use `--generate-transcripts` with `ncbi-downloader` to produce `transcripts.fa` (with `CDS=` headers) from the GFF3 annotation.
	The input files are produced by the [ncbi-downloader](`#downloading-ncbi--clinvar-data`) command. Use `--generate-transcripts` with `ncbi-downloader` to produce `transcripts.fa` (with `CDS=` headers) from the GFF3 annotation.

Conversation

ypriverol commented May 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

qodo-code-review Bot commented May 22, 2026

Qodo reviews are paused for this user.

Uh oh!

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ypriverol commented May 22, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 22, 2026 •

edited

Loading