39 optimizing the workflow by Asifdotexe · Pull Request #41 · Asifdotexe/Theseus

Asifdotexe · 2026-06-02T08:14:21Z

Summary by CodeRabbit

New Features
- Per-repository pipeline producing separate raw and processed data and creating a single aggregated pull request
- Reusable CI step to set up Python and Poetry
Bug Fixes
- Improved incremental snapshot analysis with verification to reduce mismatches
Documentation
- Updated configuration guide and repository onboarding instructions
Chores
- Performance and processing improvements; tests updated to use processed outputs; .gitignore now ignores presentation/

coderabbitai · 2026-06-02T08:14:37Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 389359fa-b472-48ce-bc4c-bf7bec9904e8

📥 Commits

Reviewing files that changed from the base of the PR and between c356128 and 676bd40.

📒 Files selected for processing (6)

.github/workflows/theseus-engine.yml
data/processed/claude-code_graph.json
data/processed/langchain_graph.json
data/processed/numpy_graph.json
data/processed/react_graph.json
data/processed/zed_graph.json

🚧 Files skipped from review as they are similar to previous changes (1)

.github/workflows/theseus-engine.yml

📝 Walkthrough

Walkthrough

Refactors the Theseus pipeline to use incremental blame across snapshots, centralizes parallel blame in a BlameRunner, stages outputs into data/raw and data/processed (with graph generation), updates scripts to use the new staging, and converts CI to a repo-discovery matrix using a reusable Python/Poetry composite action.
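The incremental idea in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/analyse_repository.py`: the function names (`carry_forward`, `aggregate`, `analyze_incremental`), the `blame_file` and `blame_all` callbacks, and the `path -> {year: line_count}` data shape are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical shape for per-file results: path -> {year: line_count}
FileCompositions = Dict[str, Dict[int, int]]


def carry_forward(
    prev_files: FileCompositions,
    changed_paths: Iterable[str],
    blame_file: Callable[[str], Optional[Dict[int, int]]],
) -> FileCompositions:
    """Re-blame only the changed paths and reuse prior results for the rest."""
    changed = set(changed_paths)
    updated = {p: c for p, c in prev_files.items() if p not in changed}
    for path in sorted(changed):
        composition = blame_file(path)  # None: file deleted or not blameable
        if composition:
            updated[path] = composition
    return updated


def aggregate(files: FileCompositions) -> Dict[int, int]:
    """Sum the per-file year counts into one age distribution."""
    totals: Counter = Counter()
    for composition in files.values():
        totals.update(composition)
    return dict(totals)


def analyze_incremental(
    prev_files: FileCompositions,
    changed_paths: Iterable[str],
    blame_file: Callable[[str], Optional[Dict[int, int]]],
    expected_lines: int,
    blame_all: Callable[[], FileCompositions],
) -> Tuple[Dict[int, int], FileCompositions]:
    """Incremental pass with a line-count guard that falls back to a full blame."""
    files = carry_forward(prev_files, changed_paths, blame_file)
    if sum(aggregate(files).values()) != expected_lines:
        files = blame_all()  # totals disagree, so distrust the carried data
    return aggregate(files), files
```

The guard is the important design choice here: carried-forward data is cheap but can drift (renames, deletions, missed paths), so comparing the aggregated total with an independently counted line total gives a cheap signal for when to pay for a full blame.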

Changes

Incremental Blame & Data Architecture Refactoring

Layer / File(s)	Summary
CI Foundation: Composite Action and Workflow Setup `.github/actions/setup-python-poetry/action.yml`, `.github/workflows/unit-tests.yml`	New composite action consolidates Python/Poetry setup with configurable `python-version` and `poetry-install-args`; unit-tests workflow now calls the action with `poetry-install-args: --with dev`.
Blame Parsing Optimization and BlameRunner Class `scripts/_blame.py`	Adds `_is_hash()` for fast hex checks; rewrites `parse_blame_year_counts` and `find_oldest_fossil_in_blame` to parse author-time first; adds `blame_single_file_year_counts()` and `BlameRunner` for parallelized blame aggregation and fossil selection.
Incremental Analysis Utilities `scripts/_utils.py`	Adds `get_changed_files(repo_path, from_commit, to_commit)` to list changed paths between commits and `count_repo_lines(repo_path)` to compute total repo line count for verification.
Data Directory Structure and Two-Phase Cleanup `scripts/run_pipeline.py`, `scripts/cleanup_data.py`	Pipeline ensures `raw/` and `processed/` exist; `cleanup_raw()` minifies raw snapshots (removes internal fields/invalid future-year composition entries); `generate_graph_data()` emits `{repo}_graph.json` into `data/processed/` containing filtered snapshots and fossils; `cleanup_data()` orchestrates both phases.
Incremental Snapshot Analysis with Blame Baseline `scripts/analyse_repository.py`	`analyze_single_snapshot()` now accepts optional `prev_file_data` and returns `(age_distribution, file_compositions)`; adds `_blame_full_snapshot`, `_blame_incremental_snapshot`, `_aggregate_file_compositions`, and `_verify_line_count_guard()` to support incremental blaming (blame only changed files, carry forward prior per-file results, verify totals and fallback to full blame); outputs written to `data/raw/{repo}_data.json` and snapshot records include `commit_hash` and `file_compositions`.
Fossil Computation via BlameRunner `scripts/add_fossils.py`	Genesis and Survivor fossil selection now use `BlameRunner(...).blame_oldest_fossil()`; backfill and survivor loops read `data/raw/` and use per-repo temp clone directories removed after processing.
CI Matrix-Driven Discovery and Artifact Orchestration `.github/workflows/theseus-engine.yml`	Replaces the monolithic analyze job with `discover-repos` (parses config), an `analyze` matrix (runs the per-repo pipeline and uploads `data/raw/{repo}_data.json` and `data/processed/{repo}_graph.json`), and `create-pr`, which downloads the artifacts and opens a single aggregated PR (guarded with `if: success()`).
Frontend Data Contract and Test Integration `app.js`, `tests/test_data_integrity.py`	`TheseusVisualizer.loadRepo()` now loads `data/processed/${repoInfo.name}_graph.json`; tests updated to read fixtures from `data/processed`.
Configuration, Documentation, and Repository Setup `docs/CONFIGURATION.md`, `theseus.config.json`, `.gitignore`	Docs updated (example config, new "Adding a New Repository" guide, kebab-case `name` requirement); `theseus.config.json` removes `file` fields for several repos and adds `tensorflow`; `.gitignore` updated to ignore `presentation/`.
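To make the blame-parsing row above more concrete, here is a small sketch of counting lines per author year from `git blame --porcelain` output. It is an illustration only: `is_hash` and `year_counts_from_porcelain` are hypothetical names, and the real `_is_hash()` and `parse_blame_year_counts` in `scripts/_blame.py` may differ in signature and behavior.

```python
import string
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Optional

_HEX = set(string.hexdigits)


def is_hash(token: str) -> bool:
    """Cheap check for a full object id (40 hex chars for SHA-1, 64 for SHA-256)."""
    return len(token) in (40, 64) and all(ch in _HEX for ch in token)


def year_counts_from_porcelain(blame_output: str) -> Dict[int, int]:
    """Count blamed lines per author year.

    In porcelain output, a commit's metadata (including author-time) appears
    the first time the commit is seen; later lines from the same commit carry
    only the hash header, so the year is remembered per commit.
    """
    counts: Counter = Counter()
    year_by_commit: Dict[str, int] = {}
    current: Optional[str] = None
    for line in blame_output.splitlines():
        if line.startswith("author-time "):
            timestamp = int(line.split(" ", 1)[1])
            if current is not None:
                year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
                year_by_commit[current] = year
        elif line.startswith("\t"):
            # Content lines are prefixed with a tab; each one is a blamed line.
            if current in year_by_commit:
                counts[year_by_commit[current]] += 1
        else:
            head = line.split(" ", 1)[0]
            if is_hash(head):
                current = head
    return dict(counts)
```

The same function also accepts `--line-porcelain` output, where the metadata is repeated for every line, since repeated `author-time` entries simply overwrite the remembered year with the same value.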

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Asifdotexe/Theseus#7: Introduced the original snapshot analysis pipeline that this PR refactors to incremental blame.
Asifdotexe/Theseus#21: Changes to cleanup/graph generation overlap with this PR’s cleanup_data refactor.
Asifdotexe/Theseus#37: Related frontend loader changes affecting app.js loadRepo behavior and AbortController usage.

"From a curious rabbit's den, I write with nimble paws,
Raw turns to processed in the moonlight's laws,
Blame hops only where changes tread tonight,
Threads chase fossils hidden from plain sight,
CI bundles gardens into one tidy pr." 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title "39 optimizing the workflow" is vague and does not clearly convey the specific changes made in the PR, which span workflow restructuring, composite actions, script refactoring, and data processing improvements.	Revise the title to be more specific and descriptive of the main changes, such as 'Refactor analysis workflow with composite action and incremental blame strategy' or similar, to clearly communicate the primary objectives.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

scripts/add_fossils.py (2)

313-326: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Bug: mkdir() before existence check causes clone to never execute.

Line 314 creates temp_dir unconditionally, so line 317's existence check will always be True. The clone branch (lines 318-319) will never execute, and git fetch will fail on an empty directory.

🐛 Proposed fix

         temp_dir = Path(f"./temp_fossil_repos_{repo_name}")
-        temp_dir.mkdir(exist_ok=True)
         local_repo = temp_dir

         if not local_repo.exists():
             logger.info("  Cloning %s...", repo_url)
             run_command(["git", "clone", repo_url, str(local_repo)])
         else:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/add_fossils.py` around lines 313 - 326, The code creates temp_dir
with temp_dir.mkdir(exist_ok=True) before checking existence, so
local_repo.exists() is always true and cloning never runs; change the logic in
the block that sets up temp_dir/local_repo (variables temp_dir, local_repo) to
first check if temp_dir/local_repo exists and only call mkdir() when it does not
exist, then run git clone via run_command(["git", "clone", repo_url,
str(local_repo)]) when the repo directory is absent, else run git fetch as
currently implemented and keep the existing exception handling for run_command.

415-428: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Same bug: mkdir() before existence check in survivor update path.

Same issue as in backfill_fossils — the mkdir() at line 416 makes the existence check at line 419 always true.

🐛 Proposed fix

         temp_dir = Path(f"./temp_fossil_repos_{repo_name}")
-        temp_dir.mkdir(exist_ok=True)
         local_repo = temp_dir

         if not local_repo.exists():
             logger.info("  Cloning %s...", repo_url)
             run_command(["git", "clone", repo_url, str(local_repo)])
         else:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/add_fossils.py` around lines 415 - 428, The code creates temp_dir
(temp_fossil_repos_{repo_name}) with temp_dir.mkdir(...) before checking
existence, so the subsequent if not local_repo.exists() is always false; remove
or move that mkdir call and only create the parent/temp directory when you
actually need to clone: keep temp_dir and local_repo as-is, delete the early
temp_dir.mkdir(exist_ok=True), and in the branch that clones (inside the if not
local_repo.exists()), ensure the parent directory exists (e.g.,
mkdir(parents=True, exist_ok=True)) before calling run_command(["git", "clone",
repo_url, str(local_repo)]); leave the fetch branch unchanged and continue using
run_command for git fetch.
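Both comments above describe the same clone-or-fetch pattern, which could be sketched as a single helper. This is an assumption-laden illustration rather than the project's code: `ensure_local_repo` is a hypothetical name, `run_command` is passed in as a callable so the sketch stays self-contained, and the `git -C <path> fetch --all` invocation stands in for whatever fetch command the script currently runs.

```python
from pathlib import Path
from typing import Callable, List


def ensure_local_repo(
    repo_url: str,
    local_repo: Path,
    run_command: Callable[[List[str]], None],
) -> str:
    """Clone when the directory is absent, otherwise fetch.

    Returns "clone" or "fetch" so callers (and tests) can see which path ran.
    """
    if not local_repo.exists():
        # Create only the parent directory; git clone creates local_repo
        # itself, so the existence check above stays meaningful.
        local_repo.parent.mkdir(parents=True, exist_ok=True)
        run_command(["git", "clone", repo_url, str(local_repo)])
        return "clone"
    run_command(["git", "-C", str(local_repo), "fetch", "--all"])
    return "fetch"
```

Checking existence before any `mkdir()` is what keeps the clone branch reachable; creating the target directory first is exactly what made the original check always evaluate the same way.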

🧹 Nitpick comments (1)

.github/actions/setup-python-poetry/action.yml (1)

17-28: ⚡ Quick win

Use Poetry caching here instead of pip caching.

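This action installs project dependencies with poetry install, so cache: pip mostly helps the pipx install poetry step while the actual project environment still restores cold on each run. That undercuts the workflow optimization this PR is aiming for. Note that cache: poetry in actions/setup-python expects Poetry to already be installed when the setup step runs, so the Poetry install step may need to come before it.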

Suggested change

     - name: Set up Python
       uses: actions/setup-python@v5
       with:
         python-version: ${{ inputs.python-version }}
-        cache: pip
+        cache: poetry
+        cache-dependency-path: poetry.lock

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/setup-python-poetry/action.yml around lines 17 - 28, The
workflow currently sets cache: pip in the "Set up Python" step, which only
benefits the pipx install of Poetry; change the Python setup to use Poetry
caching so that project dependencies are restored from cache when running the
"Install dependencies" step. Update the "Set up Python" step
(actions/setup-python@v5) to use cache: poetry instead of cache: pip, and keep
the "Install Poetry" (pipx install poetry) and "Install dependencies" (poetry
install ${{ inputs.poetry-install-args }}) steps unchanged so Poetry is still
installed but dependency installs reuse the Poetry cache.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/theseus-engine.yml:
- Around line 29-31: The workflow lets a single repo failure cancel sibling
matrix jobs and still run the create-pr job (because create-pr uses if:
always()), producing a partial PR; turn off fail-fast and gate the PR job on the
matrix job succeeding: set strategy.fail-fast: false on the matrix that defines
repo (strategy.matrix.repo) and change the create-pr job's condition to require
that the analyze job succeeded (replace if: always() with a check such as
needs.analyze.result == 'success') so create-pr only runs when the full matrix
completes successfully.
- Around line 16-24: The multiline inline python command under the
discover-repos step is malformed YAML; replace the current python -c "..."
invocation with a proper run block that uses a here-doc style multiline script
(e.g., run: | followed by an indented Python script) so the lines starting with
import/with are treated as script content, and ensure the environment variable
REPOS is echoed to GITHUB_OUTPUT as before (refer to the REPOS variable and the
python script in this block). Also change the create-pr job's condition from if:
always() to a success gate (e.g., if: success() or check that required
artifacts/data-* exist via needs.<job>.outputs) so PR creation is skipped when
upstream steps fail, using the create-pr job name to locate the conditional.

In `@docs/CONFIGURATION.md`:
- Around line 33-34: The docs currently state a single output file
`data/{name}_data.json`; update the `output-path` / repository output docs to
reflect the new layout by replacing that single path with both
`data/raw/{name}_data.json` (raw data) and `data/processed/{name}_graph.json`
(processed graph), adjust any example commands and descriptive text to point
readers to the correct raw vs processed files, and make the same change in the
other occurrence of the same example further down the file (the second block
that mentions `data/{name}_data.json`).

In `@tests/test_data_integrity.py`:
- Around line 16-17: The test currently sets data_dir = Path("./data/processed")
which relies on the current working directory; change data_dir to be resolved
relative to the test file (because the test lives in tests/ and the fixtures in
data/processed at the repository root, use Path(__file__).resolve().parent.parent
and then join "data/processed") so the fixtures are located from the test file
location, and leave the json_files = list(data_dir.glob("*.json")) line
unchanged; update the variable data_dir in tests/test_data_integrity.py
accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 513705d0-bbe5-4f6f-9c92-db0477640d87

📥 Commits

Reviewing files that changed from the base of the PR and between 121c0bb and 7f6a419.

📒 Files selected for processing (19)

.github/actions/setup-python-poetry/action.yml
.github/workflows/theseus-engine.yml
.github/workflows/unit-tests.yml
.gitignore
app.js
data/processed/claude-code_graph.json
data/processed/langchain_graph.json
data/processed/numpy_graph.json
data/processed/react_graph.json
data/processed/zed_graph.json
docs/CONFIGURATION.md
scripts/_blame.py
scripts/_utils.py
scripts/add_fossils.py
scripts/analyse_repository.py
scripts/cleanup_data.py
scripts/run_pipeline.py
tests/test_data_integrity.py
theseus.config.json

Asifdotexe · 2026-06-02T09:45:47Z

@coderabbitai review

coderabbitai · 2026-06-02T09:45:53Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/theseus-engine.yml:
- Around line 17-24: The heredoc used to populate REPOS (the python << 'EOF'
block) is dedented and therefore terminates the surrounding YAML block scalar;
indent the entire heredoc body, the trailing EOF marker, and the closing
parenthesis to the same column as the run: | block content so YAML treats the
heredoc as part of the block scalar (YAML will strip common indentation and the
bash heredoc will still receive EOF at column 0); update the lines around the
REPOS assignment and the python << 'EOF' invocation accordingly.
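One way to sidestep heredoc indentation issues altogether is to keep the discovery logic in a small standalone Python script that the workflow step calls. The sketch below is hypothetical: the function names are invented, the output key `repos` is an assumption, and so is the config schema (a top-level `repositories` list whose entries have a `name`), which may not match the real `theseus.config.json`.

```python
import json
from pathlib import Path
from typing import List


def discover_repo_names(config_path: Path) -> List[str]:
    """Read repository names from the config file.

    Assumed schema: {"repositories": [{"name": "numpy", ...}, ...]}
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return [entry["name"] for entry in config.get("repositories", [])]


def write_matrix_output(names: List[str], output_file: Path) -> None:
    """Append a name=value line in the format GitHub Actions reads from the
    file referenced by the GITHUB_OUTPUT environment variable."""
    with output_file.open("a", encoding="utf-8") as handle:
        handle.write(f"repos={json.dumps(names)}\n")


# In a workflow step this could be driven as:
#   names = discover_repo_names(Path("theseus.config.json"))
#   write_matrix_output(names, Path(os.environ["GITHUB_OUTPUT"]))
```

The matrix job would then consume the JSON array through `fromJSON` on the step output; keeping the script in its own file means YAML block-scalar indentation no longer interacts with the shell heredoc terminator.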

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5866a1c7-b363-459f-a4cb-e42ae69a2d7c

📥 Commits

Reviewing files that changed from the base of the PR and between 7f6a419 and c356128.

📒 Files selected for processing (3)

.github/workflows/theseus-engine.yml
docs/CONFIGURATION.md
tests/test_data_integrity.py

🚧 Files skipped from review as they are similar to previous changes (1)

tests/test_data_integrity.py

Asifdotexe added 6 commits May 31, 2026 21:07

#39 performed performance optimization using git diff

6531592

#39 abstracted blamerunner

54ac614

#39 added documentation and added tensorflow as config

7991852

#39 restructured the data directory

893f054

#39 updated the github actions to ensure parallel processing

5b910f3

#39 performance optimizations

7f6a419

Asifdotexe self-assigned this Jun 2, 2026

Asifdotexe added the enhancement New feature or request label Jun 2, 2026

Asifdotexe linked an issue Jun 2, 2026 that may be closed by this pull request

Optimizing the workflow #39

Closed

coderabbitai Bot reviewed Jun 2, 2026

Comment thread .github/workflows/theseus-engine.yml

Comment thread .github/workflows/theseus-engine.yml

Comment thread docs/CONFIGURATION.md Outdated

Comment thread tests/test_data_integrity.py Outdated

#39 made some changes

c356128

coderabbitai Bot reviewed Jun 2, 2026

Comment thread .github/workflows/theseus-engine.yml Outdated

Asifdotexe added 2 commits June 2, 2026 15:30

#39 fixed the scalar content to be indented

dea4b6c

Merge branch 'main' of https://github.com/Asifdotexe/Theseus into 39-…

676bd40

…optimizing-the-workflow

Asifdotexe merged commit be35901 into main Jun 2, 2026
2 checks passed

coderabbitai Bot mentioned this pull request Jun 3, 2026

#42 updated the github workflow and threshold #43

Merged

Asifdotexe deleted the 39-optimizing-the-workflow branch June 4, 2026 16:41

coderabbitai Bot mentioned this pull request Jun 6, 2026

update the doc, script and workflow #47

Merged

39 optimizing the workflow #41
Asifdotexe merged 9 commits into main from 39-optimizing-the-workflow
