segregated the python scripts from workflows#46
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (5)
📝 WalkthroughWalkthroughCentralizes import-path handling via _path_guard, modernizes snapshot I/O with pathlib and atomic writes, adds a workflow CLI ( ChangesPipeline Infrastructure Refactor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/add_fossils.py (1)
319-331:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix clone/fetch control flow; clone path is currently unreachable.
Line 320 creates
temp_dirbefore Line 323 checks existence, so the clone branch never runs. On fresh runs this attemptsgit fetchin a non-repo directory and fails downstream processing.Proposed fix
- temp_dir = Path(f"./temp_fossil_repos_{repo_name}") - temp_dir.mkdir(exist_ok=True) - local_repo = temp_dir - - if not local_repo.exists(): + local_repo = Path(f"./temp_fossil_repos_{repo_name}") + if not local_repo.exists(): logger.info(" Cloning %s...", repo_url) run_command(["git", "clone", repo_url, str(local_repo)]) else: logger.info(" Repo already cloned — fetching latest...") try: run_command(["git", "fetch", "--all"], cwd=str(local_repo)) except RuntimeError as e: logger.warning(" Fetch failed (continuing with local): %s", e) @@ - if temp_dir.exists(): - remove_path(str(temp_dir)) + if local_repo.exists(): + remove_path(str(local_repo))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/add_fossils.py` around lines 319 - 331, The code creates temp_dir before checking existence, so the clone branch is never taken; fix by separating the base temp directory from the per-repo path and only mkdir the base, then check the per-repo path (or remove the premature mkdir). Concretely: create a base directory (e.g., base_temp = Path("./temp_fossil_repos") and base_temp.mkdir(exist_ok=True)), set local_repo = base_temp / repo_name, then if not local_repo.exists(): run_command(["git", "clone", repo_url, str(local_repo)]) else run_command(["git", "fetch", "--all"], cwd=str(local_repo)) using the existing run_command, repo_url and repo_name symbols.
🧹 Nitpick comments (2)
scripts/workflow.py (2)
55-55: ⚡ Quick winAdd explicit encoding to
write_text()for consistency.The read operations use
encoding="utf-8"butwrite_text()on line 55 relies on platform default encoding, which may differ.♻️ Proposed fix
- Path(out_file).write_text(body) + Path(out_file).write_text(body, encoding="utf-8")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/workflow.py` at line 55, Change the call to Path(out_file).write_text(body) to explicitly specify UTF-8 encoding to match reads; in scripts/workflow.py update the write_text invocation that writes the variable body (where out_file and body are used) to include encoding="utf-8" so the file write is deterministic across platforms.
70-70: ⚡ Quick winAdd explicit encoding to
read_text()for consistency.Same encoding consistency issue as
write_text().♻️ Proposed fix
- data = json.loads(f.read_text()) + data = json.loads(f.read_text(encoding="utf-8"))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/workflow.py` at line 70, The call to f.read_text() in scripts/workflow.py (where data = json.loads(f.read_text())) lacks an explicit encoding; update that call to pass a consistent encoding (e.g., encoding="utf-8") so it matches the write_text() usage and avoids platform-dependent behavior—locate the f.read_text() invocation and change it to explicitly specify the encoding.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/theseus-engine.yml:
- Around line 166-167: After force-pushing with "git push origin
HEAD:chore/monthly-data-update --force", update the local ref before checking
out to avoid a stale branch: fetch or reset the local branch to the remote
(e.g., run "git fetch origin chore/monthly-data-update" then "git checkout
chore/monthly-data-update" or replace checkout with "git checkout -B
chore/monthly-data-update origin/chore/monthly-data-update" or "git branch -f
chore/monthly-data-update origin/chore/monthly-data-update" to force the local
branch to match the pushed commit); apply this change around the current git
push / git checkout commands in the workflow.
In @.github/workflows/unit-tests.yml:
- Around line 15-21: Replace the tag-pinned GitHub Actions with immutable SHA
pins and disable checkout credential persistence: update the actions/checkout@v4
reference to its corresponding commit SHA and add persist-credentials: false to
that checkout step, and replace actions/setup-python@v5 with its commit SHA
(keeping with: python-version: "3.12" and cache: pip) so both actions are
SHA-pinned and the checkout step no longer exposes credentials.
In `@scripts/__init__.py`:
- Around line 12-14: The current change removed/altered the sys.path mutation in
scripts/__init__.py (_SCRIPTS_DIR and sys.path.insert) but those scripts rely on
that behavior (see scripts/_path_guard.py and imports in scripts/_blame.py,
add_fossils.py, cleanup_data.py); before converting to relative imports,
reproduce and capture the exact pylint E0401 output by running pylint on those
files to confirm whether the error is a linter-only resolution or a runtime
import failure, then either (a) if it’s only pylint, revert the broad import
refactor and fix linter resolution via pylintrc or per-import disable for the
specific imports, or (b) if runtime fails when executing scripts directly,
restore a single canonical sys.path guard (keep _path_guard.py as the single
source of truth and remove duplicate mutations) or add a safe direct-exec
fallback (try/except import patterns) in the scripts that import
_utils/_data_io/_blame so direct python scripts/*.py execution continues to
work.
In `@scripts/add_fossils.py`:
- Around line 404-422: The two overly long logger.info calls inside
_update_survivor_one exceed the 120-char lint limit; break their format string
and arguments across multiple concatenated/continued strings or use multiple
logger.info calls so each line stays under 120 chars (e.g., split the message
and the tuple of arguments across lines or log OLD and NEW in separate short
calls), referencing the logger.info(...) calls that print " ✓ Survivor
unchanged..." and the " OLD: ..."/" NEW: ..." lines to keep the same
content and argument order but wrapped to satisfy pylint C0301.
In `@scripts/analyse_repository.py`:
- Around line 323-329: prev_file_data is being seeded from the newest
historical_snapshots unconditionally which can make reprocessing use a future
baseline; change the seeding logic to pick the most recent snapshot whose
snapshot_date is strictly earlier than the target period (i.e., find the latest
historical_snapshots entry with snapshot_date < target_period and set
prev_file_data = (commit_hash, file_compositions) from that entry). Also change
the code path that currently appends reprocessed results (the append that writes
new snapshot entries for a snapshot_date) to instead check for an existing entry
with the same snapshot_date and replace that entry's data (commit_hash and
file_compositions) rather than appending, to avoid duplicate snapshot_date
periods.
In `@scripts/run_pipeline.py`:
- Around line 33-36: The sibling bare imports (_path_guard, _utils,
cleanup_data) cause static import resolution errors; change them to
package-qualified imports so lint/runtime agree: import the module names as from
scripts._path_guard import ... (or simply import scripts._path_guard to preserve
side-effects), from scripts._utils import load_config, and from
scripts.cleanup_data import cleanup_data as run_cleanup, and update any
invocation docs/tests to run the module in package mode (python -m
scripts.run_pipeline) so the imports resolve consistently; keep any required
noqa/pylint comments only if still necessary after switching.
In `@scripts/workflow.py`:
- Line 14: Change the top-level import to a package import (replace "import
_path_guard" with "from scripts import _path_guard") in scripts/workflow.py and
apply the same change to any other modules that currently import _path_guard as
a top-level module so pylint E0401 is resolved; update the call to
Path(out_file).write_text(body) to pass encoding="utf-8" for consistent file
writes; and in validate_graph_files replace any assert statements with explicit
validations that raise appropriate exceptions (ValueError or RuntimeError) so
checks cannot be skipped under python -O (look for the validate_graph_files
function to locate these asserts).
---
Outside diff comments:
In `@scripts/add_fossils.py`:
- Around line 319-331: The code creates temp_dir before checking existence, so
the clone branch is never taken; fix by separating the base temp directory from
the per-repo path and only mkdir the base, then check the per-repo path (or
remove the premature mkdir). Concretely: create a base directory (e.g.,
base_temp = Path("./temp_fossil_repos") and base_temp.mkdir(exist_ok=True)), set
local_repo = base_temp / repo_name, then if not local_repo.exists():
run_command(["git", "clone", repo_url, str(local_repo)]) else
run_command(["git", "fetch", "--all"], cwd=str(local_repo)) using the existing
run_command, repo_url and repo_name symbols.
---
Nitpick comments:
In `@scripts/workflow.py`:
- Line 55: Change the call to Path(out_file).write_text(body) to explicitly
specify UTF-8 encoding to match reads; in scripts/workflow.py update the
write_text invocation that writes the variable body (where out_file and body are
used) to include encoding="utf-8" so the file write is deterministic across
platforms.
- Line 70: The call to f.read_text() in scripts/workflow.py (where data =
json.loads(f.read_text())) lacks an explicit encoding; update that call to pass
a consistent encoding (e.g., encoding="utf-8") so it matches the write_text()
usage and avoids platform-dependent behavior—locate the f.read_text() invocation
and change it to explicitly specify the encoding.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 13eefcd2-7014-47e8-8723-54176fb4024b
📒 Files selected for processing (12)
.github/workflows/theseus-engine.yml.github/workflows/unit-tests.ymlscripts/__init__.pyscripts/_blame.pyscripts/_data_io.pyscripts/_path_guard.pyscripts/_utils.pyscripts/add_fossils.pyscripts/analyse_repository.pyscripts/cleanup_data.pyscripts/run_pipeline.pyscripts/workflow.py
| import _path_guard # noqa: F401 # pylint: disable=unused-import | ||
|
|
||
| from _utils import load_config | ||
| from cleanup_data import cleanup_data as run_cleanup |
There was a problem hiding this comment.
Sibling bare imports are breaking CI import resolution.
Line 33–36 currently depend on _path_guard side effects, but pylint resolves imports statically and is failing with E0401 for these modules. Please standardize to package-qualified imports (scripts.*) and align invocation to module mode (python -m scripts.run_pipeline) so runtime and lint contexts use the same contract.
🧰 Tools
🪛 GitHub Actions: Unit Tests / 0_Run Unit Tests.txt
[error] 33-33: Pylint import-error (E0401): Unable to import '_path_guard'.
[error] 35-35: Pylint import-error (E0401): Unable to import '_utils'.
[error] 36-36: Pylint import-error (E0401): Unable to import 'cleanup_data'.
🪛 GitHub Actions: Unit Tests / Run Unit Tests
[error] 33-33: pylint (import-error E0401): Unable to import '_path_guard'.
[error] 35-35: pylint (import-error E0401): Unable to import '_utils'.
[error] 36-36: pylint (import-error E0401): Unable to import 'cleanup_data'.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/run_pipeline.py` around lines 33 - 36, The sibling bare imports
(_path_guard, _utils, cleanup_data) cause static import resolution errors;
change them to package-qualified imports so lint/runtime agree: import the
module names as from scripts._path_guard import ... (or simply import
scripts._path_guard to preserve side-effects), from scripts._utils import
load_config, and from scripts.cleanup_data import cleanup_data as run_cleanup,
and update any invocation docs/tests to run the module in package mode (python
-m scripts.run_pipeline) so the imports resolve consistently; keep any required
noqa/pylint comments only if still necessary after switching.
|
@coderabbitai review |
✅ Action performedReview finished.
|
Summary by CodeRabbit
New Features
Bug Fixes
Chores