segregated the python scripts from workflows #46

Merged
Asifdotexe merged 4 commits into main from 42-fix-action-failure-issue
Jun 5, 2026

@Asifdotexe Asifdotexe commented Jun 5, 2026

Owner

Summary by CodeRabbit

  • New Features

    • Added a workflow CLI for repo discovery, PR-body generation, and graph validation.
  • Bug Fixes

    • Added a step to repair branch ancestry and ensure monthly-data-update can be synchronized safely.
  • Chores

    • Refactored CI workflows for clearer configuration and safer cleanup.
    • Simplified test workflow setup.
    • Improved file I/O with atomic writes and modern path handling.
    • Standardized script import behavior for consistent execution.

coderabbitai Bot commented Jun 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d3d3d717-9a7d-450b-836c-c1e0a7447f81

📥 Commits

Reviewing files that changed from the base of the PR and between 710f816 and d25a6ea.

📒 Files selected for processing (5)
  • .github/workflows/theseus-engine.yml
  • .github/workflows/unit-tests.yml
  • scripts/add_fossils.py
  • scripts/analyse_repository.py
  • scripts/workflow.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • .github/workflows/unit-tests.yml
  • scripts/workflow.py
  • .github/workflows/theseus-engine.yml
  • scripts/add_fossils.py
  • scripts/analyse_repository.py

📝 Walkthrough

Centralizes import-path handling via _path_guard, modernizes snapshot I/O with pathlib and atomic writes, adds a workflow CLI (scripts/workflow.py), refactors fossil and repository processing into helper/callbacks, and updates GitHub Actions to call the new CLI and simplify test setup.
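
The walkthrough describes the path guard only in prose. As a rough illustration, an import-time guard of this kind could look like the sketch below. The contents of scripts/_path_guard.py are not shown on this page, so the function name `ensure_on_path` and the idempotency check are assumptions, not the PR's actual code.

```python
"""Hypothetical sketch of an import-time path guard (not the PR's actual file)."""
import sys
from pathlib import Path


def ensure_on_path(directory: Path) -> bool:
    """Prepend `directory` to sys.path once; return True if it was added."""
    entry = str(directory.resolve())
    if entry in sys.path:
        return False  # already present, so repeated imports stay harmless
    sys.path.insert(0, entry)
    return True


# A guard module would typically run this at import time, for example:
#     ensure_on_path(Path(__file__).parent)
# so that a bare `import _path_guard` makes sibling modules importable.
```

Keeping the mutation in one module means every script gets the same behavior from a single import line, which is the consolidation the walkthrough refers to.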

Changes

Pipeline Infrastructure Refactor

| Layer | File(s) | Summary |
|---|---|---|
| Import path guarding foundation | scripts/_path_guard.py, scripts/__init__.py, scripts/_blame.py, scripts/add_fossils.py, scripts/analyse_repository.py, scripts/cleanup_data.py, scripts/run_pipeline.py | New _path_guard module prepends scripts/ to sys.path at import time; scripts now import _path_guard instead of inlining sys.path manipulation. |
| File I/O modernization and atomic writes | scripts/_data_io.py | load_snapshot_data and save_snapshot_data accept `str |
| Utility deletion/error handling | scripts/_utils.py | remove_path subprocess deletion calls set check=False; fallback retry uses _handle_remove_readonly and shutil.rmtree(..., onexc=...). |
| Fossil processing callback refactor | scripts/add_fossils.py | Introduces _process_each_repo shared helper; rewires backfill_fossils and update_survivor_fossils to callback-based _backfill_one / _update_survivor_one, merging and persisting fossil results. |
| Repository analysis helper functions | scripts/analyse_repository.py | Adds _ensure_repo_ready, _find_baseline, and _process_snapshots_by_year for incremental per-year snapshot processing; process_repository delegates to these helpers. |
| Workflow CLI utility | scripts/workflow.py | Adds discover-repos, build-pr-body, and validate-graph-files subcommands and a main() dispatcher used by GitHub workflows. |
| GitHub Actions workflow integration | .github/workflows/theseus-engine.yml | Repo matrix sourced from scripts/workflow.py discover-repos; added "Fix shared branch ancestry" step that preserves data/ and force-pushes a recreated shared branch when needed; inline Python PR-body/validation replaced with scripts/workflow.py calls; analyze job sets permissions.contents: write. |
| Unit test workflow simplification | .github/workflows/unit-tests.yml | Composite action replaced with explicit checkout, actions/setup-python@v5 for Python 3.12, Poetry installed via pipx, and poetry install --with dev. |
| Pylint init-hook | pyproject.toml | Adds [tool.pylint] init-hook to prepend scripts to sys.path. |
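
The summary mentions atomic writes for snapshot data without showing the mechanism. One common way to get that property is to write to a temporary file in the same directory and then swap it into place with `os.replace`. The sketch below assumes that approach; the helper name `atomic_write_json` is hypothetical and the real save_snapshot_data in scripts/_data_io.py may differ.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Union


def atomic_write_json(path: Union[str, Path], data: object) -> None:
    """Write JSON to `path` so readers never observe a half-written file."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the swap
        os.replace(tmp_name, target)  # swaps the new file in as a single step
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # do not leave stray temp files behind
        raise
```

If the process dies mid-write, the previous snapshot file is left untouched, because only the temporary file was being modified.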

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Asifdotexe/Theseus#45: Overlapping workflow changes around shared-branch ancestry handling and rebase/force-push steps.
  • Asifdotexe/Theseus#7: Related refactor of scripts/analyse_repository.py and incremental snapshot logic.
  • Asifdotexe/Theseus#20: Related add_fossils.py survivor-update logic and incremental fossil handling.

Suggested labels

enhancement

Poem

🐰 I hopped through paths to clear the way,
Guarded imports so scripts can play,
Atomic writes and helpers spun,
Workflows call the CLI now—job done!
Fossils tidy, pipelines light, hooray!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'segregated the python scripts from workflows' accurately describes the primary change across multiple files, refactoring Python scripts to use a shared _path_guard module and new workflow.py CLI utility instead of inline logic. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.48% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/add_fossils.py (1)

319-331: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix clone/fetch control flow; clone path is currently unreachable.

Line 320 creates temp_dir before Line 323 checks existence, so the clone branch never runs. On fresh runs this attempts git fetch in a non-repo directory and fails downstream processing.

Proposed fix
-        temp_dir = Path(f"./temp_fossil_repos_{repo_name}")
-        temp_dir.mkdir(exist_ok=True)
-        local_repo = temp_dir
-
-        if not local_repo.exists():
+        local_repo = Path(f"./temp_fossil_repos_{repo_name}")
+        if not local_repo.exists():
             logger.info("  Cloning %s...", repo_url)
             run_command(["git", "clone", repo_url, str(local_repo)])
         else:
             logger.info("  Repo already cloned — fetching latest...")
             try:
                 run_command(["git", "fetch", "--all"], cwd=str(local_repo))
             except RuntimeError as e:
                 logger.warning("  Fetch failed (continuing with local): %s", e)
@@
-        if temp_dir.exists():
-            remove_path(str(temp_dir))
+        if local_repo.exists():
+            remove_path(str(local_repo))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/add_fossils.py` around lines 319 - 331, The code creates temp_dir
before checking existence, so the clone branch is never taken; fix by separating
the base temp directory from the per-repo path and only mkdir the base, then
check the per-repo path (or remove the premature mkdir). Concretely: create a
base directory (e.g., base_temp = Path("./temp_fossil_repos") and
base_temp.mkdir(exist_ok=True)), set local_repo = base_temp / repo_name, then if
not local_repo.exists(): run_command(["git", "clone", repo_url,
str(local_repo)]) else run_command(["git", "fetch", "--all"],
cwd=str(local_repo)) using the existing run_command, repo_url and repo_name
symbols.
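
The base-directory variant described in the prompt above can be sketched in isolation. This is a minimal illustration with the git calls left out; `plan_repo_action` and the "example-repo" name are hypothetical and only show why creating the shared parent, but not the per-repo path, keeps the clone branch reachable.

```python
import tempfile
from pathlib import Path


def plan_repo_action(base_temp: Path, repo_name: str) -> tuple[Path, str]:
    """Return the per-repo path and whether it needs a 'clone' or a 'fetch'."""
    base_temp.mkdir(parents=True, exist_ok=True)  # only the shared parent is created
    local_repo = base_temp / repo_name            # the per-repo path is NOT created here
    action = "fetch" if local_repo.exists() else "clone"
    return local_repo, action


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "temp_fossil_repos"
    repo, first = plan_repo_action(base, "example-repo")
    repo.mkdir()  # stand-in for the directory `git clone` would create
    _, second = plan_repo_action(base, "example-repo")
    print(first, second)  # clone fetch
```

In the original code the `mkdir` ran on the per-repo path itself, so the existence check could never be false.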
🧹 Nitpick comments (2)
scripts/workflow.py (2)

55-55: ⚡ Quick win

Add explicit encoding to write_text() for consistency.

The read operations use encoding="utf-8" but write_text() on line 55 relies on platform default encoding, which may differ.

♻️ Proposed fix
-    Path(out_file).write_text(body)
+    Path(out_file).write_text(body, encoding="utf-8")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/workflow.py` at line 55, Change the call to
Path(out_file).write_text(body) to explicitly specify UTF-8 encoding to match
reads; in scripts/workflow.py update the write_text invocation that writes the
variable body (where out_file and body are used) to include encoding="utf-8" so
the file write is deterministic across platforms.

70-70: ⚡ Quick win

Add explicit encoding to read_text() for consistency.

Same encoding consistency issue as write_text().

♻️ Proposed fix
-            data = json.loads(f.read_text())
+            data = json.loads(f.read_text(encoding="utf-8"))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/workflow.py` at line 70, The call to f.read_text() in
scripts/workflow.py (where data = json.loads(f.read_text())) lacks an explicit
encoding; update that call to pass a consistent encoding (e.g.,
encoding="utf-8") so it matches the write_text() usage and avoids
platform-dependent behavior—locate the f.read_text() invocation and change it to
explicitly specify the encoding.
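
A small round trip shows what the two nitpicks above are guarding against. The file name and payload below are made up for illustration; the point is only that both the write and the read name the same encoding, so the result does not depend on the platform default.

```python
import json
import tempfile
from pathlib import Path

payload = {"status": "✓ Survivor unchanged"}  # non-ASCII on purpose

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "graph.json"  # hypothetical file name
    # Both sides state UTF-8 explicitly instead of relying on the locale.
    out_file.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    raw = out_file.read_bytes()
    data = json.loads(out_file.read_text(encoding="utf-8"))
```

Without the explicit argument, a runner whose default encoding is not UTF-8 could fail on the check mark or write different bytes.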
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/theseus-engine.yml:
- Around line 166-167: After force-pushing with "git push origin
HEAD:chore/monthly-data-update --force", update the local ref before checking
out to avoid a stale branch: fetch or reset the local branch to the remote
(e.g., run "git fetch origin chore/monthly-data-update" then "git checkout
chore/monthly-data-update" or replace checkout with "git checkout -B
chore/monthly-data-update origin/chore/monthly-data-update" or "git branch -f
chore/monthly-data-update origin/chore/monthly-data-update" to force the local
branch to match the pushed commit); apply this change around the current git
push / git checkout commands in the workflow.

In @.github/workflows/unit-tests.yml:
- Around line 15-21: Replace the tag-pinned GitHub Actions with immutable SHA
pins and disable checkout credential persistence: update the actions/checkout@v4
reference to its corresponding commit SHA and add persist-credentials: false to
that checkout step, and replace actions/setup-python@v5 with its commit SHA
(keeping with: python-version: "3.12" and cache: pip) so both actions are
SHA-pinned and the checkout step no longer exposes credentials.

In `@scripts/__init__.py`:
- Around line 12-14: The current change removed/altered the sys.path mutation in
scripts/__init__.py (_SCRIPTS_DIR and sys.path.insert) but those scripts rely on
that behavior (see scripts/_path_guard.py and imports in scripts/_blame.py,
add_fossils.py, cleanup_data.py); before converting to relative imports,
reproduce and capture the exact pylint E0401 output by running pylint on those
files to confirm whether the error is a linter-only resolution or a runtime
import failure, then either (a) if it’s only pylint, revert the broad import
refactor and fix linter resolution via pylintrc or per-import disable for the
specific imports, or (b) if runtime fails when executing scripts directly,
restore a single canonical sys.path guard (keep _path_guard.py as the single
source of truth and remove duplicate mutations) or add a safe direct-exec
fallback (try/except import patterns) in the scripts that import
_utils/_data_io/_blame so direct python scripts/*.py execution continues to
work.

In `@scripts/add_fossils.py`:
- Around line 404-422: The two overly long logger.info calls inside
_update_survivor_one exceed the 120-char lint limit; break their format string
and arguments across multiple concatenated/continued strings or use multiple
logger.info calls so each line stays under 120 chars (e.g., split the message
and the tuple of arguments across lines or log OLD and NEW in separate short
calls), referencing the logger.info(...) calls that print "  ✓ Survivor
unchanged..." and the "    OLD: ..."/"    NEW: ..." lines to keep the same
content and argument order but wrapped to satisfy pylint C0301.

In `@scripts/analyse_repository.py`:
- Around line 323-329: prev_file_data is being seeded from the newest
historical_snapshots unconditionally which can make reprocessing use a future
baseline; change the seeding logic to pick the most recent snapshot whose
snapshot_date is strictly earlier than the target period (i.e., find the latest
historical_snapshots entry with snapshot_date < target_period and set
prev_file_data = (commit_hash, file_compositions) from that entry). Also change
the code path that currently appends reprocessed results (the append that writes
new snapshot entries for a snapshot_date) to instead check for an existing entry
with the same snapshot_date and replace that entry's data (commit_hash and
file_compositions) rather than appending, to avoid duplicate snapshot_date
periods.
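
The two behaviors requested for analyse_repository.py can be sketched on plain data. The entry shape used here (dicts with `snapshot_date`, `commit_hash`, `file_compositions`, and ISO-formatted date strings) is an assumption for illustration, as are the helper names `find_baseline` and `upsert_snapshot`.

```python
from typing import Optional


def find_baseline(snapshots: list[dict], target_date: str) -> Optional[dict]:
    """Return the latest snapshot strictly earlier than target_date, or None."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    earlier = [s for s in snapshots if s["snapshot_date"] < target_date]
    return max(earlier, key=lambda s: s["snapshot_date"], default=None)


def upsert_snapshot(snapshots: list[dict], new_entry: dict) -> None:
    """Replace the entry with the same snapshot_date, or append if absent."""
    for index, existing in enumerate(snapshots):
        if existing["snapshot_date"] == new_entry["snapshot_date"]:
            snapshots[index] = new_entry
            return
    snapshots.append(new_entry)


history = [
    {"snapshot_date": "2023-01-01", "commit_hash": "aaa", "file_compositions": {}},
    {"snapshot_date": "2024-01-01", "commit_hash": "bbb", "file_compositions": {}},
    {"snapshot_date": "2025-01-01", "commit_hash": "ccc", "file_compositions": {}},
]
baseline = find_baseline(history, "2024-01-01")  # picks the 2023 entry, not 2025
upsert_snapshot(
    history,
    {"snapshot_date": "2024-01-01", "commit_hash": "ddd", "file_compositions": {}},
)
```

Reprocessing 2024 therefore diffs against 2023 rather than the newest snapshot, and the list still holds exactly one 2024 entry afterwards.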

In `@scripts/run_pipeline.py`:
- Around line 33-36: The sibling bare imports (_path_guard, _utils,
cleanup_data) cause static import resolution errors; change them to
package-qualified imports so lint/runtime agree: import the module names as from
scripts._path_guard import ... (or simply import scripts._path_guard to preserve
side-effects), from scripts._utils import load_config, and from
scripts.cleanup_data import cleanup_data as run_cleanup, and update any
invocation docs/tests to run the module in package mode (python -m
scripts.run_pipeline) so the imports resolve consistently; keep any required
noqa/pylint comments only if still necessary after switching.

In `@scripts/workflow.py`:
- Line 14: Change the top-level import to a package import (replace "import
_path_guard" with "from scripts import _path_guard") in scripts/workflow.py and
apply the same change to any other modules that currently import _path_guard as
a top-level module so pylint E0401 is resolved; update the call to
Path(out_file).write_text(body) to pass encoding="utf-8" for consistent file
writes; and in validate_graph_files replace any assert statements with explicit
validations that raise appropriate exceptions (ValueError or RuntimeError) so
checks cannot be skipped under python -O (look for the validate_graph_files
function to locate these asserts).
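
For the last point, `assert` statements are stripped when Python runs with `-O`, so validation built on them silently disappears. A minimal sketch of the explicit-exception style follows; the actual checks in validate_graph_files are not shown on this page, so `validate_graph_file` and the required key `snapshots` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEY = "snapshots"  # hypothetical; the real checks are not shown here


def validate_graph_file(path: Path) -> dict:
    """Load one JSON file and raise ValueError if it fails basic checks."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"{path.name}: expected a JSON object, got {type(data).__name__}")
    if not data.get(REQUIRED_KEY):
        raise ValueError(f"{path.name}: missing or empty '{REQUIRED_KEY}'")
    return data


with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / "bad.json"
    bad.write_text("[]", encoding="utf-8")
    try:
        validate_graph_file(bad)
        outcome = "accepted"
    except ValueError as exc:
        outcome = f"rejected: {exc}"
```

A CLI wrapper can catch the `ValueError`, print the message, and exit non-zero, which keeps the workflow step failing regardless of interpreter flags.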

---

Outside diff comments:
In `@scripts/add_fossils.py`:
- Around line 319-331: The code creates temp_dir before checking existence, so
the clone branch is never taken; fix by separating the base temp directory from
the per-repo path and only mkdir the base, then check the per-repo path (or
remove the premature mkdir). Concretely: create a base directory (e.g.,
base_temp = Path("./temp_fossil_repos") and base_temp.mkdir(exist_ok=True)), set
local_repo = base_temp / repo_name, then if not local_repo.exists():
run_command(["git", "clone", repo_url, str(local_repo)]) else
run_command(["git", "fetch", "--all"], cwd=str(local_repo)) using the existing
run_command, repo_url and repo_name symbols.

---

Nitpick comments:
In `@scripts/workflow.py`:
- Line 55: Change the call to Path(out_file).write_text(body) to explicitly
specify UTF-8 encoding to match reads; in scripts/workflow.py update the
write_text invocation that writes the variable body (where out_file and body are
used) to include encoding="utf-8" so the file write is deterministic across
platforms.
- Line 70: The call to f.read_text() in scripts/workflow.py (where data =
json.loads(f.read_text())) lacks an explicit encoding; update that call to pass
a consistent encoding (e.g., encoding="utf-8") so it matches the write_text()
usage and avoids platform-dependent behavior—locate the f.read_text() invocation
and change it to explicitly specify the encoding.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 13eefcd2-7014-47e8-8723-54176fb4024b

📥 Commits

Reviewing files that changed from the base of the PR and between 6f04970 and 467e14b.

📒 Files selected for processing (12)
  • .github/workflows/theseus-engine.yml
  • .github/workflows/unit-tests.yml
  • scripts/__init__.py
  • scripts/_blame.py
  • scripts/_data_io.py
  • scripts/_path_guard.py
  • scripts/_utils.py
  • scripts/add_fossils.py
  • scripts/analyse_repository.py
  • scripts/cleanup_data.py
  • scripts/run_pipeline.py
  • scripts/workflow.py

Comment thread .github/workflows/theseus-engine.yml Outdated
Comment thread .github/workflows/unit-tests.yml Outdated
Comment thread scripts/__init__.py
Comment thread scripts/add_fossils.py Outdated
Comment thread scripts/analyse_repository.py Outdated
Comment thread scripts/run_pipeline.py
Comment on lines +33 to 36
import _path_guard # noqa: F401 # pylint: disable=unused-import

from _utils import load_config
from cleanup_data import cleanup_data as run_cleanup

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Sibling bare imports are breaking CI import resolution.

Lines 33–36 currently depend on _path_guard side effects, but pylint resolves imports statically and fails with E0401 for these modules. Please standardize on package-qualified imports (scripts.*) and switch invocation to module mode (python -m scripts.run_pipeline) so that runtime and lint contexts use the same contract.

🧰 Tools
🪛 GitHub Actions: Unit Tests / 0_Run Unit Tests.txt

[error] 33-33: Pylint import-error (E0401): Unable to import '_path_guard'.


[error] 35-35: Pylint import-error (E0401): Unable to import '_utils'.


[error] 36-36: Pylint import-error (E0401): Unable to import 'cleanup_data'.

🪛 GitHub Actions: Unit Tests / Run Unit Tests

[error] 33-33: pylint (import-error E0401): Unable to import '_path_guard'.


[error] 35-35: pylint (import-error E0401): Unable to import '_utils'.


[error] 36-36: pylint (import-error E0401): Unable to import 'cleanup_data'.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/run_pipeline.py` around lines 33 - 36, The sibling bare imports
(_path_guard, _utils, cleanup_data) cause static import resolution errors;
change them to package-qualified imports so lint/runtime agree: import the
module names as from scripts._path_guard import ... (or simply import
scripts._path_guard to preserve side-effects), from scripts._utils import
load_config, and from scripts.cleanup_data import cleanup_data as run_cleanup,
and update any invocation docs/tests to run the module in package mode (python
-m scripts.run_pipeline) so the imports resolve consistently; keep any required
noqa/pylint comments only if still necessary after switching.

Comment thread scripts/workflow.py
@Asifdotexe

Owner Author

@coderabbitai review

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@Asifdotexe Asifdotexe merged commit 98803a2 into main Jun 5, 2026
2 checks passed