segregated the python scripts from workflows by Asifdotexe · Pull Request #46 · Asifdotexe/Theseus

Asifdotexe · 2026-06-05T06:52:52Z

Summary by CodeRabbit

New Features
- Added a workflow CLI for repo discovery, PR-body generation, and graph validation.
Bug Fixes
- Added a step to repair branch ancestry and ensure monthly-data-update can be synchronized safely.
Chores
- Refactored CI workflows for clearer configuration and safer cleanup.
- Simplified test workflow setup.
- Improved file I/O with atomic writes and modern path handling.
- Standardized script import behavior for consistent execution.

coderabbitai · 2026-06-05T06:53:05Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d3d3d717-9a7d-450b-836c-c1e0a7447f81

📥 Commits

Reviewing files that changed from the base of the PR and between 710f816 and d25a6ea.

📒 Files selected for processing (5)

.github/workflows/theseus-engine.yml
.github/workflows/unit-tests.yml
scripts/add_fossils.py
scripts/analyse_repository.py
scripts/workflow.py

🚧 Files skipped from review as they are similar to previous changes (5)

.github/workflows/unit-tests.yml
scripts/workflow.py
.github/workflows/theseus-engine.yml
scripts/add_fossils.py
scripts/analyse_repository.py

📝 Walkthrough

Walkthrough

Centralizes import-path handling via _path_guard, modernizes snapshot I/O with pathlib and atomic writes, adds a workflow CLI (scripts/workflow.py), refactors fossil and repository processing into helper/callbacks, and updates GitHub Actions to call the new CLI and simplify test setup.

Changes

Pipeline Infrastructure Refactor

Layer / File(s)	Summary
Import path guarding foundation `scripts/_path_guard.py`, `scripts/__init__.py`, `scripts/_blame.py`, `scripts/add_fossils.py`, `scripts/analyse_repository.py`, `scripts/cleanup_data.py`, `scripts/run_pipeline.py`	New `_path_guard` module prepends `scripts/` to `sys.path` at import time; scripts now import `_path_guard` instead of inlining sys.path manipulation.
File I/O modernization and atomic writes `scripts/_data_io.py`	`load_snapshot_data` and `save_snapshot_data` accept `str
Utility deletion/error handling `scripts/_utils.py`	`remove_path` subprocess deletion calls set `check=False`; fallback retry uses `_handle_remove_readonly` and `shutil.rmtree(..., onexc=...)`.
Fossil processing callback refactor `scripts/add_fossils.py`	Introduces `_process_each_repo` shared helper; rewires `backfill_fossils` and `update_survivor_fossils` to callback-based `_backfill_one` / `_update_survivor_one`, merging and persisting fossil results.
Repository analysis helper functions `scripts/analyse_repository.py`	Adds `_ensure_repo_ready`, `_find_baseline`, and `_process_snapshots_by_year` for incremental per-year snapshot processing; `process_repository` delegates to these helpers.
Workflow CLI utility `scripts/workflow.py`	Adds `discover-repos`, `build-pr-body`, and `validate-graph-files` subcommands and a `main()` dispatcher used by GitHub workflows.
GitHub Actions workflow integration `.github/workflows/theseus-engine.yml`	Repo matrix sourced from `scripts/workflow.py discover-repos`; added "Fix shared branch ancestry" step that preserves `data/` and force-pushes a recreated shared branch when needed; inline Python PR-body/validation replaced with `scripts/workflow.py` calls; `analyze` job sets `permissions.contents: write`.
Unit test workflow simplification `.github/workflows/unit-tests.yml`	Composite action replaced with explicit checkout, `actions/setup-python@v5` for Python 3.12, Poetry installed via `pipx`, and `poetry install --with dev`.
Pylint init-hook `pyproject.toml`	Adds `[tool.pylint]` init-hook to prepend `scripts` to `sys.path`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Asifdotexe/Theseus#45: Overlapping workflow changes around shared-branch ancestry handling and rebase/force-push steps.
Asifdotexe/Theseus#7: Related refactor of scripts/analyse_repository.py and incremental snapshot logic.
Asifdotexe/Theseus#20: Related add_fossils.py survivor-update logic and incremental fossil handling.

Suggested labels

enhancement

Poem

🐰 I hopped through paths to clear the way,
Guarded imports so scripts can play,
Atomic writes and helpers spun,
Workflows call the CLI now—job done!
Fossils tidy, pipelines light, hooray!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'segregated the python scripts from workflows' accurately describes the primary change across multiple files, refactoring Python scripts to use a shared _path_guard module and new workflow.py CLI utility instead of inline logic.
Docstring Coverage	✅ Passed	Docstring coverage is 90.48% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch 42-fix-action-failure-issue

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

scripts/add_fossils.py (1)

319-331: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix clone/fetch control flow; clone path is currently unreachable.

Line 320 creates temp_dir before Line 323 checks existence, so the clone branch never runs. On fresh runs this attempts git fetch in a non-repo directory and fails downstream processing.

Proposed fix

-        temp_dir = Path(f"./temp_fossil_repos_{repo_name}")
-        temp_dir.mkdir(exist_ok=True)
-        local_repo = temp_dir
-
-        if not local_repo.exists():
+        local_repo = Path(f"./temp_fossil_repos_{repo_name}")
+        if not local_repo.exists():
             logger.info("  Cloning %s...", repo_url)
             run_command(["git", "clone", repo_url, str(local_repo)])
         else:
             logger.info("  Repo already cloned — fetching latest...")
             try:
                 run_command(["git", "fetch", "--all"], cwd=str(local_repo))
             except RuntimeError as e:
                 logger.warning("  Fetch failed (continuing with local): %s", e)
@@
-        if temp_dir.exists():
-            remove_path(str(temp_dir))
+        if local_repo.exists():
+            remove_path(str(local_repo))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/add_fossils.py` around lines 319 - 331, The code creates temp_dir
before checking existence, so the clone branch is never taken; fix by separating
the base temp directory from the per-repo path and only mkdir the base, then
check the per-repo path (or remove the premature mkdir). Concretely: create a
base directory (e.g., base_temp = Path("./temp_fossil_repos") and
base_temp.mkdir(exist_ok=True)), set local_repo = base_temp / repo_name, then if
not local_repo.exists(): run_command(["git", "clone", repo_url,
str(local_repo)]) else run_command(["git", "fetch", "--all"],
cwd=str(local_repo)) using the existing run_command, repo_url and repo_name
symbols.

🧹 Nitpick comments (2)

scripts/workflow.py (2)

55-55: ⚡ Quick win

Add explicit encoding to write_text() for consistency.

The read operations use encoding="utf-8" but write_text() on line 55 relies on platform default encoding, which may differ.

♻️ Proposed fix

-    Path(out_file).write_text(body)
+    Path(out_file).write_text(body, encoding="utf-8")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/workflow.py` at line 55, Change the call to
Path(out_file).write_text(body) to explicitly specify UTF-8 encoding to match
reads; in scripts/workflow.py update the write_text invocation that writes the
variable body (where out_file and body are used) to include encoding="utf-8" so
the file write is deterministic across platforms.

70-70: ⚡ Quick win

Add explicit encoding to read_text() for consistency.

Same encoding consistency issue as write_text().

♻️ Proposed fix

-            data = json.loads(f.read_text())
+            data = json.loads(f.read_text(encoding="utf-8"))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/workflow.py` at line 70, The call to f.read_text() in
scripts/workflow.py (where data = json.loads(f.read_text())) lacks an explicit
encoding; update that call to pass a consistent encoding (e.g.,
encoding="utf-8") so it matches the write_text() usage and avoids
platform-dependent behavior—locate the f.read_text() invocation and change it to
explicitly specify the encoding.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/theseus-engine.yml:
- Around line 166-167: After force-pushing with "git push origin
HEAD:chore/monthly-data-update --force", update the local ref before checking
out to avoid a stale branch: fetch or reset the local branch to the remote
(e.g., run "git fetch origin chore/monthly-data-update" then "git checkout
chore/monthly-data-update" or replace checkout with "git checkout -B
chore/monthly-data-update origin/chore/monthly-data-update" or "git branch -f
chore/monthly-data-update origin/chore/monthly-data-update" to force the local
branch to match the pushed commit); apply this change around the current git
push / git checkout commands in the workflow.

In @.github/workflows/unit-tests.yml:
- Around line 15-21: Replace the tag-pinned GitHub Actions with immutable SHA
pins and disable checkout credential persistence: update the actions/checkout@v4
reference to its corresponding commit SHA and add persist-credentials: false to
that checkout step, and replace actions/setup-python@v5 with its commit SHA
(keeping with: python-version: "3.12" and cache: pip) so both actions are
SHA-pinned and the checkout step no longer exposes credentials.

In `@scripts/__init__.py`:
- Around line 12-14: The current change removed/altered the sys.path mutation in
scripts/__init__.py (_SCRIPTS_DIR and sys.path.insert) but those scripts rely on
that behavior (see scripts/_path_guard.py and imports in scripts/_blame.py,
add_fossils.py, cleanup_data.py); before converting to relative imports,
reproduce and capture the exact pylint E0401 output by running pylint on those
files to confirm whether the error is a linter-only resolution or a runtime
import failure, then either (a) if it’s only pylint, revert the broad import
refactor and fix linter resolution via pylintrc or per-import disable for the
specific imports, or (b) if runtime fails when executing scripts directly,
restore a single canonical sys.path guard (keep _path_guard.py as the single
source of truth and remove duplicate mutations) or add a safe direct-exec
fallback (try/except import patterns) in the scripts that import
_utils/_data_io/_blame so direct python scripts/*.py execution continues to
work.

In `@scripts/add_fossils.py`:
- Around line 404-422: The two overly long logger.info calls inside
_update_survivor_one exceed the 120-char lint limit; break their format string
and arguments across multiple concatenated/continued strings or use multiple
logger.info calls so each line stays under 120 chars (e.g., split the message
and the tuple of arguments across lines or log OLD and NEW in separate short
calls), referencing the logger.info(...) calls that print "  ✓ Survivor
unchanged..." and the "    OLD: ..."/"    NEW: ..." lines to keep the same
content and argument order but wrapped to satisfy pylint C0301.

In `@scripts/analyse_repository.py`:
- Around line 323-329: prev_file_data is being seeded from the newest
historical_snapshots unconditionally which can make reprocessing use a future
baseline; change the seeding logic to pick the most recent snapshot whose
snapshot_date is strictly earlier than the target period (i.e., find the latest
historical_snapshots entry with snapshot_date < target_period and set
prev_file_data = (commit_hash, file_compositions) from that entry). Also change
the code path that currently appends reprocessed results (the append that writes
new snapshot entries for a snapshot_date) to instead check for an existing entry
with the same snapshot_date and replace that entry's data (commit_hash and
file_compositions) rather than appending, to avoid duplicate snapshot_date
periods.

In `@scripts/run_pipeline.py`:
- Around line 33-36: The sibling bare imports (_path_guard, _utils,
cleanup_data) cause static import resolution errors; change them to
package-qualified imports so lint/runtime agree: import the module names as from
scripts._path_guard import ... (or simply import scripts._path_guard to preserve
side-effects), from scripts._utils import load_config, and from
scripts.cleanup_data import cleanup_data as run_cleanup, and update any
invocation docs/tests to run the module in package mode (python -m
scripts.run_pipeline) so the imports resolve consistently; keep any required
noqa/pylint comments only if still necessary after switching.

In `@scripts/workflow.py`:
- Line 14: Change the top-level import to a package import (replace "import
_path_guard" with "from scripts import _path_guard") in scripts/workflow.py and
apply the same change to any other modules that currently import _path_guard as
a top-level module so pylint E0401 is resolved; update the call to
Path(out_file).write_text(body) to pass encoding="utf-8" for consistent file
writes; and in validate_graph_files replace any assert statements with explicit
validations that raise appropriate exceptions (ValueError or RuntimeError) so
checks cannot be skipped under python -O (look for the validate_graph_files
function to locate these asserts).

---

Outside diff comments:
In `@scripts/add_fossils.py`:
- Around line 319-331: The code creates temp_dir before checking existence, so
the clone branch is never taken; fix by separating the base temp directory from
the per-repo path and only mkdir the base, then check the per-repo path (or
remove the premature mkdir). Concretely: create a base directory (e.g.,
base_temp = Path("./temp_fossil_repos") and base_temp.mkdir(exist_ok=True)), set
local_repo = base_temp / repo_name, then if not local_repo.exists():
run_command(["git", "clone", repo_url, str(local_repo)]) else
run_command(["git", "fetch", "--all"], cwd=str(local_repo)) using the existing
run_command, repo_url and repo_name symbols.

---

Nitpick comments:
In `@scripts/workflow.py`:
- Line 55: Change the call to Path(out_file).write_text(body) to explicitly
specify UTF-8 encoding to match reads; in scripts/workflow.py update the
write_text invocation that writes the variable body (where out_file and body are
used) to include encoding="utf-8" so the file write is deterministic across
platforms.
- Line 70: The call to f.read_text() in scripts/workflow.py (where data =
json.loads(f.read_text())) lacks an explicit encoding; update that call to pass
a consistent encoding (e.g., encoding="utf-8") so it matches the write_text()
usage and avoids platform-dependent behavior—locate the f.read_text() invocation
and change it to explicitly specify the encoding.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 13eefcd2-7014-47e8-8723-54176fb4024b

📥 Commits

Reviewing files that changed from the base of the PR and between 6f04970 and 467e14b.

📒 Files selected for processing (12)

.github/workflows/theseus-engine.yml
.github/workflows/unit-tests.yml
scripts/__init__.py
scripts/_blame.py
scripts/_data_io.py
scripts/_path_guard.py
scripts/_utils.py
scripts/add_fossils.py
scripts/analyse_repository.py
scripts/cleanup_data.py
scripts/run_pipeline.py
scripts/workflow.py

coderabbitai · 2026-06-05T07:05:25Z

+import _path_guard  # noqa: F401  # pylint: disable=unused-import

 from _utils import load_config
 from cleanup_data import cleanup_data as run_cleanup


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Sibling bare imports are breaking CI import resolution.

Line 33–36 currently depend on _path_guard side effects, but pylint resolves imports statically and is failing with E0401 for these modules. Please standardize to package-qualified imports (scripts.*) and align invocation to module mode (python -m scripts.run_pipeline) so runtime and lint contexts use the same contract.

🧰 Tools

🪛 GitHub Actions: Unit Tests / 0_Run Unit Tests.txt

[error] 33-33: Pylint import-error (E0401): Unable to import '_path_guard'.

[error] 35-35: Pylint import-error (E0401): Unable to import '_utils'.

[error] 36-36: Pylint import-error (E0401): Unable to import 'cleanup_data'.

🪛 GitHub Actions: Unit Tests / Run Unit Tests

[error] 33-33: pylint (import-error E0401): Unable to import '_path_guard'.

[error] 35-35: pylint (import-error E0401): Unable to import '_utils'.

[error] 36-36: pylint (import-error E0401): Unable to import 'cleanup_data'.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/run_pipeline.py` around lines 33 - 36, The sibling bare imports (_path_guard, _utils, cleanup_data) cause static import resolution errors; change them to package-qualified imports so lint/runtime agree: import the module names as from scripts._path_guard import ... (or simply import scripts._path_guard to preserve side-effects), from scripts._utils import load_config, and from scripts.cleanup_data import cleanup_data as run_cleanup, and update any invocation docs/tests to run the module in package mode (python -m scripts.run_pipeline) so the imports resolve consistently; keep any required noqa/pylint comments only if still necessary after switching.

…tion

…ding, line length

Asifdotexe · 2026-06-05T13:57:33Z

@coderabbitai review

coderabbitai · 2026-06-05T13:58:35Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Asifdotexe added 2 commits June 4, 2026 22:34

#42 segregated the python scripts from workflows

aecceef

#42 optimised the python code

467e14b

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Asifdotexe added 2 commits June 5, 2026 15:03

resolve pylint E0401, add explicit encoding, replace assert in valida…

710f816

…tion

fix: stale branch ref, SHA-pin actions, premature mkdir, baseline see…

d25a6ea

…ding, line length

Asifdotexe merged commit 98803a2 into main Jun 5, 2026
2 checks passed

This was referenced Jun 6, 2026

update the doc, script and workflow #47

Merged

Chore/monthly data update #48

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

segregated the python scripts from workflows#46

segregated the python scripts from workflows#46
Asifdotexe merged 4 commits into
mainfrom
42-fix-action-failure-issue

Asifdotexe commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 5, 2026

Uh oh!

Uh oh!

Asifdotexe commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Asifdotexe commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Asifdotexe commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Asifdotexe commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading