chore: address CodeRabbit + Copilot findings from PR #105#108
Conversation
Each item verified against current dev before fixing.
Critical:
- cgenomes_proteindb._cbio_translate_batch DEL path: `pos.split("del")[1]`
returns "" for HGVS range-only deletions (e.g. `c.104_124del`); the
`del_dna == ""` comparison always matched the empty slice so nothing
was removed and the deletion was silently ignored. Now derive the
deletion length from the range positions in cdna_pos when del_dna is
empty, with a bounds check.
Major:
- cbioportal_downloader.download_one_study: when `--multithreading` is
enabled the function was writing to the shared `url_file` handle from
up to 10 worker threads, which can interleave/corrupt lines. The
thread pool now passes `url_file=None` to workers and the main thread
serializes the writes after `as_completed`. Single-threaded call sites
still pass the handle so behaviour is unchanged there.
- cosmic_downloader: on S3 download failure, only log the path of the
presigned URL — not the full URL — so short-lived AWS query
credentials don't leak into log aggregators. (The success-path debug
log was already redacted; this matches it.)
- clinvar/, ensembl/, gnomad/ data_downloader._resolve_genome_fna /
generate_transcripts: stage `.gz` -> plain FASTA decompression
through a sibling `.tmp` file and atomically `os.replace()` it,
removing the temp on any exception. Previously an interrupted
decompression left a partial sibling that later runs silently reused,
producing corrupt gffread output.
- db/map_peptide2genome.parse_gtf: `line[0] != "#"` raises IndexError on
empty lines. Skip blank lines before the comment check; also guard
against short rows in the CDS path.
- docs/use-cases.md Step 3 cats `putative.fa` but Step 2 never produced
it. Added the missing 3-frame putative-protein generation step.
Minor:
- vcf-to-proteindb `--workers` help text said "default: 1, sequential.
Per-chromosome split.", but the click default is None (config-driven,
defaults to 1) and the actual strategy is fixed-size variant batches
(~50k) that may span chromosomes. Updated the help text to match.
- ensembl._vcf_to_proteindb_worker: signature said `-> None` but the
function returns a stats dict that the caller aggregates. Corrected
the annotation.
- docs/use-cases.md typos: annoated -> annotated, Cancer-speific ->
Cancer-specific (x2), numbre -> number.
Skipped (with reason):
- Worker validation (`--workers` >= 1) across cbioportal/clinvar/gnomad
CLI commands — separate concern; could be a follow-up using
click.IntRange(min=1).
- ncbi_downloader broad except — already catches the right exceptions
per earlier review (#107).
- Various nitpicks (B202 comment placement, en-dash, private-attr test,
import location, `next()` over `[0]`) — low-value style items.
Test plan:
- python -m pytest pgatk/tests/ -q -> 316 passed
- python -m unittest discover -s pgatk/tests -p "*_tests.py" -> 20/20
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
Acts on the actionable findings from CodeRabbit (19 inline comments) and Copilot (2 inline comments) on #105. Each item was verified against the current state of
devbefore fixing; some prior reviews had already been resolved by #107 (ensembl_downloadersilent assembly fallback) and #106 (broken NCBI/ClinVar anchor).Critical
pgatk/cgenomes/cgenomes_proteindb.py(_cbio_translate_batchDEL path) — `pos.split("del")[1]` returns `""` for HGVS range-only deletions like `c.104_124del`, and the `del_dna == seq[...]` comparison then matched the empty slice — so the deletion was silently ignored and the reference sequence was emitted unchanged. Now: when `del_dna` is empty, derive the deletion length from the range in `cdna_pos` and apply the deletion by length, with a bounds check.Major
pgatk/cgenomes/cbioportal_downloader.py— `download_one_study` writes to the shared `url_file` handle. Under `--multithreading` (up to 10 threads via `ThreadPoolExecutor`) those writes can interleave. The pool now passes `url_file=None` to workers and the main thread writes returned paths serially as futures complete. Serial code paths are unchanged. (Also flagged by Copilot.)pgatk/cgenomes/cosmic_downloader.py— on S3 download failure the error log included the full presigned URL including short-lived AWS query credentials. Strip the query string before formatting the error message (matches the success-path debug log).pgatk/clinvar/data_downloader.py,pgatk/ensembl/data_downloader.py,pgatk/gnomad/data_downloader.py— `.gz → plain` FASTA decompression wrote directly to the final path. An interrupted run left a partial sibling that later invocations silently reused, producing corrupt `gffread` output. Now staged through a `.tmp` file and atomically `os.replace()`d; the temp file is removed on any exception (including `KeyboardInterrupt`).pgatk/db/map_peptide2genome.parse_gtf— `if line[0] != "#"` raises `IndexError` on empty lines. Skip blank lines before the comment check; also guard the CDS path against short rows.docs/use-cases.md— Step 3 `cat` includes `putative.fa` but Step 2 never produced it, so the recipe didn't run end-to-end. Added the missing three-frame putative-protein generation step.Minor
pgatk/commands/vcf_to_proteindb.py— `--workers` help said `default: 1, sequential. Per-chromosome split.`, but click default is `None` (config-driven, defaults to 1) and the actual strategy is fixed-size variant batches (~50k records) that may span chromosomes. Help text updated. (Also flagged by Copilot.)pgatk/ensembl/ensembl.py— `_vcf_to_proteindb_worker` annotated `-> None` but returns a stats dict that the caller aggregates. Annotation corrected to `-> dict`.docs/use-cases.mdtypos: `annoated → annotated`, `Cancer-speific → Cancer-specific` (x2), `numbre → number`.Skipped (with reason)
Test plan