Skip to content

chore: address CodeRabbit + Copilot findings from PR #105#108

Merged
ypriverol merged 1 commit into
devfrom
chore/pr105-review-followups
May 23, 2026
Merged

chore: address CodeRabbit + Copilot findings from PR #105#108
ypriverol merged 1 commit into
devfrom
chore/pr105-review-followups

Conversation

@ypriverol

Copy link
Copy Markdown
Member

Summary

Acts on the actionable findings from CodeRabbit (19 inline comments) and Copilot (2 inline comments) on #105. Each item was verified against the current state of dev before fixing; some prior reviews had already been resolved by #107 (ensembl_downloader silent assembly fallback) and #106 (broken NCBI/ClinVar anchor).

Critical

  • pgatk/cgenomes/cgenomes_proteindb.py (_cbio_translate_batch DEL path) — `pos.split("del")[1]` returns `""` for HGVS range-only deletions like `c.104_124del`, and the `del_dna == seq[...]` comparison then matched the empty slice — so the deletion was silently ignored and the reference sequence was emitted unchanged. Now: when `del_dna` is empty, derive the deletion length from the range in `cdna_pos` and apply the deletion by length, with a bounds check.

Major

  • pgatk/cgenomes/cbioportal_downloader.py — `download_one_study` writes to the shared `url_file` handle. Under `--multithreading` (up to 10 threads via `ThreadPoolExecutor`) those writes can interleave. The pool now passes `url_file=None` to workers and the main thread writes returned paths serially as futures complete. Serial code paths are unchanged. (Also flagged by Copilot.)
  • pgatk/cgenomes/cosmic_downloader.py — on S3 download failure the error log included the full presigned URL including short-lived AWS query credentials. Strip the query string before formatting the error message (matches the success-path debug log).
  • pgatk/clinvar/data_downloader.py, pgatk/ensembl/data_downloader.py, pgatk/gnomad/data_downloader.py — `.gz → plain` FASTA decompression wrote directly to the final path. An interrupted run left a partial sibling that later invocations silently reused, producing corrupt `gffread` output. Now staged through a `.tmp` file and atomically `os.replace()`d; the temp file is removed on any exception (including `KeyboardInterrupt`).
  • pgatk/db/map_peptide2genome.parse_gtf — `if line[0] != "#"` raises `IndexError` on empty lines. Skip blank lines before the comment check; also guard the CDS path against short rows.
  • docs/use-cases.md — Step 3 `cat` includes `putative.fa` but Step 2 never produced it, so the recipe didn't run end-to-end. Added the missing three-frame putative-protein generation step.

Minor

  • pgatk/commands/vcf_to_proteindb.py — `--workers` help said `default: 1, sequential. Per-chromosome split.`, but click default is `None` (config-driven, defaults to 1) and the actual strategy is fixed-size variant batches (~50k records) that may span chromosomes. Help text updated. (Also flagged by Copilot.)
  • pgatk/ensembl/ensembl.py — `_vcf_to_proteindb_worker` annotated `-> None` but returns a stats dict that the caller aggregates. Annotation corrected to `-> dict`.
  • docs/use-cases.md typos: `annoated → annotated`, `Cancer-speific → Cancer-specific` (x2), `numbre → number`.

Skipped (with reason)

  • Worker validation (`--workers >= 1`) across cbioportal/clinvar/gnomad CLIs — fixable in one pass with `click.IntRange(min=1)`; left for a separate "CLI input validation" PR since it touches multiple unrelated files and changes the parser surface.
  • `ncbi_downloader` overly broad except — already narrowed to FileNotFoundError + CalledProcessError in chore: address CodeRabbit follow-up findings #107.
  • Style nitpicks — B202 comment placement, en-dash → hyphen in a log line, testing a private `_cosmic_token` attribute, moving stdlib imports out of a worker block, replacing `[x for ...][0]` with `next(...)`. Low-value style; not addressed.

Test plan

  • `python -m pytest pgatk/tests/ -q` — 316 passed
  • `python -m unittest discover -s pgatk/tests -p "*_tests.py"` — 20 / 20
  • Consider adding a targeted test for HGVS range-only deletions on the `_cbio_translate_batch` path (the existing COSMIC `get_mut_pro_seq` test covers the parallel function but not the cBioPortal path that was broken). Out of scope for this fix-only PR.

Each item verified against current dev before fixing.

Critical:
- cgenomes_proteindb._cbio_translate_batch DEL path: `pos.split("del")[1]`
  returns "" for HGVS range-only deletions (e.g. `c.104_124del`); the
  `del_dna == ""` comparison always matched the empty slice so nothing
  was removed and the deletion was silently ignored. Now derive the
  deletion length from the range positions in cdna_pos when del_dna is
  empty, with a bounds check.

Major:
- cbioportal_downloader.download_one_study: when `--multithreading` is
  enabled the function was writing to the shared `url_file` handle from
  up to 10 worker threads, which can interleave/corrupt lines. The
  thread pool now passes `url_file=None` to workers and the main thread
  serializes the writes after `as_completed`. Single-threaded call sites
  still pass the handle so behaviour is unchanged there.
- cosmic_downloader: on S3 download failure, only log the path of the
  presigned URL — not the full URL — so short-lived AWS query
  credentials don't leak into log aggregators. (The success-path debug
  log was already redacted; this matches it.)
- clinvar/, ensembl/, gnomad/ data_downloader._resolve_genome_fna /
  generate_transcripts: stage `.gz` -> plain FASTA decompression
  through a sibling `.tmp` file and atomically `os.replace()` it,
  removing the temp on any exception. Previously an interrupted
  decompression left a partial sibling that later runs silently reused,
  producing corrupt gffread output.
- db/map_peptide2genome.parse_gtf: `line[0] != "#"` raises IndexError on
  empty lines. Skip blank lines before the comment check; also guard
  against short rows in the CDS path.
- docs/use-cases.md Step 3 cats `putative.fa` but Step 2 never produced
  it. Added the missing 3-frame putative-protein generation step.

Minor:
- vcf-to-proteindb `--workers` help text said "default: 1, sequential.
  Per-chromosome split.", but the click default is None (config-driven,
  defaults to 1) and the actual strategy is fixed-size variant batches
  (~50k) that may span chromosomes. Updated the help text to match.
- ensembl._vcf_to_proteindb_worker: signature said `-> None` but the
  function returns a stats dict that the caller aggregates. Corrected
  the annotation.
- docs/use-cases.md typos: annoated -> annotated, Cancer-speific ->
  Cancer-specific (x2), numbre -> number.

Skipped (with reason):
- Worker validation (`--workers` >= 1) across cbioportal/clinvar/gnomad
  CLI commands — separate concern; could be a follow-up using
  click.IntRange(min=1).
- ncbi_downloader broad except — already catches the right exceptions
  per earlier review (#107).
- Various nitpicks (B202 comment placement, en-dash, private-attr test,
  import location, `next()` over `[0]`) — low-value style items.

Test plan:
- python -m pytest pgatk/tests/ -q -> 316 passed
- python -m unittest discover -s pgatk/tests -p "*_tests.py" -> 20/20
@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai

coderabbitai Bot commented May 23, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8c6f0619-0477-472e-a702-bc9f10d9f9e3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chore/pr105-review-followups

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ypriverol ypriverol merged commit 55ae29a into dev May 23, 2026
6 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant