39 optimizing the workflow #41

Merged
Asifdotexe merged 9 commits into main from 39-optimizing-the-workflow
Jun 2, 2026

Asifdotexe (Owner) commented Jun 2, 2026

Summary by CodeRabbit

  • New Features

    • Per-repository pipeline producing separate raw and processed data and creating a single aggregated pull request
    • Reusable CI step to set up Python and Poetry
  • Bug Fixes

    • Improved incremental snapshot analysis with verification to reduce mismatches
  • Documentation

    • Updated configuration guide and repository onboarding instructions
  • Chores

    • Performance and processing improvements; tests updated to use processed outputs; .gitignore now ignores presentation/

@Asifdotexe self-assigned this Jun 2, 2026
@Asifdotexe added the enhancement label Jun 2, 2026
@Asifdotexe linked an issue Jun 2, 2026 that may be closed by this pull request

coderabbitai Bot commented Jun 2, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 389359fa-b472-48ce-bc4c-bf7bec9904e8

📥 Commits

Reviewing files that changed from the base of the PR and between c356128 and 676bd40.

📒 Files selected for processing (6)
  • .github/workflows/theseus-engine.yml
  • data/processed/claude-code_graph.json
  • data/processed/langchain_graph.json
  • data/processed/numpy_graph.json
  • data/processed/react_graph.json
  • data/processed/zed_graph.json
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/theseus-engine.yml

📝 Walkthrough

Refactors the Theseus pipeline to use incremental blame across snapshots, centralizes parallel blame in a BlameRunner, stages outputs into data/raw and data/processed (with graph generation), updates scripts to use the new staging, and converts CI to a repo-discovery matrix using a reusable Python/Poetry composite action.
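
The incremental strategy in this walkthrough can be sketched roughly as follows. This is a minimal illustration, assuming per-file results are dictionaries mapping a year to a line count. The names `blame_incremental`, `blame_file`, and `aggregate` are hypothetical stand-ins, not the functions in scripts/analyse_repository.py.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional

# Assumed shape of per-file results: path -> {year: line count}
FileYears = Dict[str, Dict[int, int]]


def aggregate(file_data: FileYears) -> Dict[int, int]:
    """Sum per-file year counts into one age distribution."""
    total: Counter = Counter()
    for years in file_data.values():
        total.update(years)
    return dict(total)


def blame_incremental(
    all_files: Iterable[str],
    changed_files: Iterable[str],
    prev_file_data: Optional[FileYears],
    blame_file: Callable[[str], Dict[int, int]],
    expected_total_lines: int,
) -> FileYears:
    """Blame only changed files and carry forward the rest.

    `blame_file` is a hypothetical callback that blames one path.
    """
    files = list(all_files)
    if prev_file_data is None:
        # No baseline yet: blame every file.
        return {path: blame_file(path) for path in files}

    changed = set(changed_files)
    result: FileYears = {}
    for path in files:
        if path in changed or path not in prev_file_data:
            result[path] = blame_file(path)
        else:
            # Unchanged since the previous snapshot: reuse the old counts.
            result[path] = prev_file_data[path]

    # Line-count guard: if totals disagree, fall back to a full blame.
    if sum(aggregate(result).values()) != expected_total_lines:
        result = {path: blame_file(path) for path in files}
    return result
```

Files that no longer exist in the new snapshot drop out on their own, because only paths in `all_files` are carried into the result.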

Changes

Incremental Blame & Data Architecture Refactoring

| Layer | File(s) | Summary |
|---|---|---|
| CI Foundation: Composite Action and Workflow Setup | .github/actions/setup-python-poetry/action.yml, .github/workflows/unit-tests.yml | New composite action consolidates Python/Poetry setup with configurable python-version and poetry-install-args; unit-tests workflow now calls the action with poetry-install-args: --with dev. |
| Blame Parsing Optimization and BlameRunner Class | scripts/_blame.py | Adds _is_hash() for fast hex checks; rewrites parse_blame_year_counts and find_oldest_fossil_in_blame to parse author-time first; adds blame_single_file_year_counts() and BlameRunner for parallelized blame aggregation and fossil selection. |
| Incremental Analysis Utilities | scripts/_utils.py | Adds get_changed_files(repo_path, from_commit, to_commit) to list changed paths between commits and count_repo_lines(repo_path) to compute total repo line count for verification. |
| Data Directory Structure and Two-Phase Cleanup | scripts/run_pipeline.py, scripts/cleanup_data.py | Pipeline ensures raw/ and processed/ exist; cleanup_raw() minifies raw snapshots (removes internal fields/invalid future-year composition entries); generate_graph_data() emits {repo}_graph.json into data/processed/ containing filtered snapshots and fossils; cleanup_data() orchestrates both phases. |
| Incremental Snapshot Analysis with Blame Baseline | scripts/analyse_repository.py | analyze_single_snapshot() now accepts optional prev_file_data and returns (age_distribution, file_compositions); adds _blame_full_snapshot, _blame_incremental_snapshot, _aggregate_file_compositions, and _verify_line_count_guard() to support incremental blaming (blame only changed files, carry forward prior per-file results, verify totals and fallback to full blame); outputs written to data/raw/{repo}_data.json and snapshot records include commit_hash and file_compositions. |
| Fossil Computation via BlameRunner | scripts/add_fossils.py | Genesis and Survivor fossil selection now use BlameRunner(...).blame_oldest_fossil(); backfill and survivor loops read data/raw/ and use per-repo temp clone directories removed after processing. |
| CI Matrix-Driven Discovery and Artifact Orchestration | .github/workflows/theseus-engine.yml | Replaces monolithic analyze job with discover-repos (parses config), analyze matrix (runs per-repo pipeline and uploads data/raw/*_data.json and data/processed/*_graph.json), and create-pr that downloads artifacts and opens a single aggregated PR (guarded with if: success()). |
| Frontend Data Contract and Test Integration | app.js, tests/test_data_integrity.py | TheseusVisualizer.loadRepo() now loads data/processed/${repoInfo.name}_graph.json; tests updated to read fixtures from data/processed. |
| Configuration, Documentation, and Repository Setup | docs/CONFIGURATION.md, theseus.config.json, .gitignore | Docs updated (example config, new "Adding a New Repository" guide, kebab-case name requirement); theseus.config.json removes file fields for several repos and adds tensorflow; .gitignore updated to ignore presentation/. |
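
For the blame-parsing row, here is a rough sketch of how year counts and the oldest line might be pulled from `git blame --line-porcelain` output. It relies on two properties of that format: every commit header line starts with the object id, and every content line starts with a TAB. The function names are hypothetical and do not reproduce scripts/_blame.py. The 40-character check assumes SHA-1 repositories.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

_HEX = frozenset("0123456789abcdef")


def is_hash(token: str) -> bool:
    """Return True for a 40-character lowercase hex string (assumed SHA-1)."""
    return len(token) == 40 and all(ch in _HEX for ch in token)


def year_counts(porcelain: str) -> Dict[int, int]:
    """Count blamed lines per author year (UTC)."""
    counts: Counter = Counter()
    for line in porcelain.splitlines():
        if line.startswith("\t"):
            continue  # file content; may look like a header, so skip it
        if line.startswith("author-time "):
            ts = int(line.split(" ", 1)[1])
            counts[datetime.fromtimestamp(ts, tz=timezone.utc).year] += 1
    return dict(counts)


def oldest_line(porcelain: str) -> Optional[Tuple[int, str, str]]:
    """Return (author_time, commit, content) for the oldest blamed line."""
    best: Optional[Tuple[int, str, str]] = None
    sha: Optional[str] = None
    ts: Optional[int] = None
    for line in porcelain.splitlines():
        if line.startswith("\t"):
            # Content line closes the current entry.
            if sha is not None and ts is not None:
                if best is None or ts < best[0]:
                    best = (ts, sha, line[1:])
        elif line.startswith("author-time "):
            ts = int(line.split(" ", 1)[1])
        else:
            first = line.split(" ", 1)[0]
            if is_hash(first):
                sha, ts = first, None  # a new entry begins
    return best
```

Checking the TAB prefix before anything else keeps source lines that happen to contain the text `author-time` from being miscounted.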

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Asifdotexe/Theseus#7: Introduced the original snapshot analysis pipeline that this PR refactors to incremental blame.
  • Asifdotexe/Theseus#21: Changes to cleanup/graph generation overlap with this PR’s cleanup_data refactor.
  • Asifdotexe/Theseus#37: Related frontend loader changes affecting app.js loadRepo behavior and AbortController usage.

"From a curious rabbit's den, I write with nimble paws,
Raw turns to processed in the moonlight's laws,
Blame hops only where changes tread tonight,
Threads chase fossils hidden from plain sight,
CI bundles gardens into one tidy pr." 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title "39 optimizing the workflow" is vague and does not clearly convey the specific changes made in the PR, which span workflow restructuring, composite actions, script refactoring, and data processing improvements. | Revise the title to be more specific and descriptive of the main changes, such as 'Refactor analysis workflow with composite action and incremental blame strategy' or similar, to clearly communicate the primary objectives. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
scripts/add_fossils.py (2)

313-326: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Bug: mkdir() before existence check causes clone to never execute.

Line 314 creates temp_dir unconditionally, so line 317's existence check will always be True. The clone branch (lines 318-319) will never execute, and git fetch will fail on an empty directory.

🐛 Proposed fix
         temp_dir = Path(f"./temp_fossil_repos_{repo_name}")
-        temp_dir.mkdir(exist_ok=True)
         local_repo = temp_dir

         if not local_repo.exists():
             logger.info("  Cloning %s...", repo_url)
             run_command(["git", "clone", repo_url, str(local_repo)])
         else:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/add_fossils.py` around lines 313 - 326, The code creates temp_dir
with temp_dir.mkdir(exist_ok=True) before checking existence, so
local_repo.exists() is always true and cloning never runs; change the logic in
the block that sets up temp_dir/local_repo (variables temp_dir, local_repo) to
first check if temp_dir/local_repo exists and only call mkdir() when it does not
exist, then run git clone via run_command(["git", "clone", repo_url,
str(local_repo)]) when the repo directory is absent, else run git fetch as
currently implemented and keep the existing exception handling for run_command.
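
The corrected clone-or-fetch flow might look like the sketch below. It is illustrative only: `ensure_local_repo` is a hypothetical helper, `run_command` is passed in as a callable so the logic can be exercised without Git, and the exact `git fetch` arguments are an assumption because the review excerpt does not show them.

```python
from pathlib import Path
from typing import Callable, List


def ensure_local_repo(
    repo_url: str,
    local_repo: Path,
    run_command: Callable[[List[str]], object],
) -> str:
    """Clone when the directory is absent, otherwise fetch.

    Returns "clone" or "fetch" to show which branch ran.
    """
    if not local_repo.exists():
        # Create only the parent; git clone creates the target itself.
        local_repo.parent.mkdir(parents=True, exist_ok=True)
        run_command(["git", "clone", repo_url, str(local_repo)])
        return "clone"

    # Assumed fetch invocation; the real arguments are not shown above.
    run_command(["git", "-C", str(local_repo), "fetch"])
    return "fetch"
```

The point is the ordering: the existence check runs before any directory is created, so a fresh run still reaches the clone branch.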

415-428: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Same bug: mkdir() before existence check in survivor update path.

Same issue as in backfill_fossils — the mkdir() at line 416 makes the existence check at line 419 always true.

🐛 Proposed fix
         temp_dir = Path(f"./temp_fossil_repos_{repo_name}")
-        temp_dir.mkdir(exist_ok=True)
         local_repo = temp_dir

         if not local_repo.exists():
             logger.info("  Cloning %s...", repo_url)
             run_command(["git", "clone", repo_url, str(local_repo)])
         else:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/add_fossils.py` around lines 415 - 428, The code creates temp_dir
(temp_fossil_repos_{repo_name}) with temp_dir.mkdir(...) before checking
existence, so the subsequent if not local_repo.exists() is always false; remove
or move that mkdir call and only create the parent/temp directory when you
actually need to clone: keep temp_dir and local_repo as-is, delete the early
temp_dir.mkdir(exist_ok=True), and in the branch that clones (inside the if not
local_repo.exists()), ensure the parent directory exists (e.g.,
mkdir(parents=True, exist_ok=True)) before calling run_command(["git", "clone",
repo_url, str(local_repo)]); leave the fetch branch unchanged and continue using
run_command for git fetch.
🧹 Nitpick comments (1)
.github/actions/setup-python-poetry/action.yml (1)

17-28: ⚡ Quick win

Use Poetry caching here instead of pip caching.

This action installs project dependencies with poetry install, so cache: pip mostly helps the pipx install poetry step while the actual project environment still restores cold on each run. That undercuts the workflow optimization this PR is aiming for.

Suggested change
     - name: Set up Python
       uses: actions/setup-python@v5
       with:
         python-version: ${{ inputs.python-version }}
-        cache: pip
+        cache: poetry
+        cache-dependency-path: poetry.lock
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/setup-python-poetry/action.yml around lines 17 - 28, The
workflow currently sets cache: pip in the "Set up Python" step, which only
benefits the pipx install of Poetry; change the Python setup to use Poetry
caching so that project dependencies are restored from cache when running the
"Install dependencies" step. Update the "Set up Python" step
(actions/setup-python@v5) to use cache: poetry instead of cache: pip, and keep
the "Install Poetry" (pipx install poetry) and "Install dependencies" (poetry
install ${{ inputs.poetry-install-args }}) steps unchanged so Poetry is still
installed but dependency installs reuse the Poetry cache.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/theseus-engine.yml:
- Around line 29-31: A single repo failure cancels the sibling matrix jobs, yet
the create-pr job still runs (because create-pr uses if: always()), producing a
partial PR. Turn off fail-fast and gate the PR job on the matrix job(s)
succeeding: set strategy.fail-fast: false on the matrix that defines repo
(strategy.matrix.repo), and change the create-pr job's condition to require the
analyze job(s) to succeed (replace if: always() with a check like
needs.analyze.result == 'success') so create-pr only runs when the full matrix
completes successfully.
- Around line 16-24: The multiline inline python command under the
discover-repos step is malformed YAML; replace the current python -c "..."
invocation with a proper run block that uses a here-doc style multiline script
(e.g., run: | followed by an indented Python script) so the lines starting with
import/with are treated as script content, and ensure the environment variable
REPOS is echoed to GITHUB_OUTPUT as before (refer to the REPOS variable and the
python script in this block). Also change the create-pr job's condition from if:
always() to a success gate (e.g., if: success() or check that required
artifacts/data-* exist via needs.<job>.outputs) so PR creation is skipped when
upstream steps fail, using the create-pr job name to locate the conditional.
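
One way to avoid the inline-script quoting problem is to keep the discovery logic in a small Python file and have the step call it. The sketch below is illustrative only. The config schema (a top-level `repositories` list whose entries have a `name` key) and the output name `repos` are assumptions, since the actual layout of theseus.config.json is not shown in this review. Writing `name=value` lines to the file named by the `GITHUB_OUTPUT` environment variable is the standard GitHub Actions mechanism for step outputs.

```python
import json
from pathlib import Path
from typing import List


def discover_repos(config_path: Path) -> List[str]:
    """Read repository names from the config file.

    Assumed schema: {"repositories": [{"name": "numpy"}, ...]}
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return [entry["name"] for entry in config["repositories"]]


def write_matrix_output(names: List[str], output_file: Path) -> None:
    """Append a single-line JSON array as the step output `repos`."""
    with output_file.open("a", encoding="utf-8") as fh:
        fh.write(f"repos={json.dumps(names)}\n")


# In a workflow step this would be called roughly as:
#   import os
#   names = discover_repos(Path("theseus.config.json"))
#   write_matrix_output(names, Path(os.environ["GITHUB_OUTPUT"]))
```

A later job could then feed that output to `fromJSON` to build its matrix.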

In `@docs/CONFIGURATION.md`:
- Around line 33-34: The docs currently state a single output file
`data/{name}_data.json`; update the `output-path` / repository output docs to
reflect the new layout by replacing that single path with both
`data/raw/{name}_data.json` (raw data) and `data/processed/{name}_graph.json`
(processed graph), adjust any example commands and descriptive text to point
readers to the correct raw vs processed files, and make the same change in the
other occurrence of the same example further down the file (the second block
that mentions `data/{name}_data.json`).
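
As a quick illustration of the layout this comment describes, a hypothetical helper (not part of the repository) could derive both paths from a repository name:

```python
from pathlib import Path
from typing import Tuple


def output_paths(name: str, data_dir: Path = Path("data")) -> Tuple[Path, Path]:
    """Return (raw, processed) output paths for one repository name."""
    raw = data_dir / "raw" / f"{name}_data.json"
    processed = data_dir / "processed" / f"{name}_graph.json"
    return raw, processed
```

For a repository named `numpy`, this gives `data/raw/numpy_data.json` for the raw data and `data/processed/numpy_graph.json` for the processed graph.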

In `@tests/test_data_integrity.py`:
- Around line 16-17: The test currently sets data_dir = Path("./data/processed")
which relies on the current working directory; change data_dir to be resolved
relative to the test file (use Path(__file__).resolve().parent and then join
"data/processed") so the fixtures are located from the test file location, and
leave the json_files = list(data_dir.glob("*.json")) line unchanged; update the
variable data_dir in tests/test_data_integrity.py accordingly.
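
A minimal sketch of resolving the fixture directory from the test file's location rather than the working directory. It assumes the test file sits in tests/, one level below the repository root, with data/ at the root, as the file list in this PR suggests. If the fixtures lived next to the test file, a single `.parent` would apply instead. The helper name is hypothetical.

```python
from pathlib import Path


def processed_dir_for(test_file: Path) -> Path:
    """Locate data/processed relative to a test file in <root>/tests/."""
    repo_root = test_file.resolve().parent.parent  # tests/ -> repository root
    return repo_root / "data" / "processed"


# Inside tests/test_data_integrity.py this would be used roughly as:
#   data_dir = processed_dir_for(Path(__file__))
#   json_files = list(data_dir.glob("*.json"))
```

With this in place, running pytest from the repository root or from inside tests/ finds the same fixtures.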


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 513705d0-bbe5-4f6f-9c92-db0477640d87

📥 Commits

Reviewing files that changed from the base of the PR and between 121c0bb and 7f6a419.

📒 Files selected for processing (19)
  • .github/actions/setup-python-poetry/action.yml
  • .github/workflows/theseus-engine.yml
  • .github/workflows/unit-tests.yml
  • .gitignore
  • app.js
  • data/processed/claude-code_graph.json
  • data/processed/langchain_graph.json
  • data/processed/numpy_graph.json
  • data/processed/react_graph.json
  • data/processed/zed_graph.json
  • docs/CONFIGURATION.md
  • scripts/_blame.py
  • scripts/_utils.py
  • scripts/add_fossils.py
  • scripts/analyse_repository.py
  • scripts/cleanup_data.py
  • scripts/run_pipeline.py
  • tests/test_data_integrity.py
  • theseus.config.json

@Asifdotexe

Owner Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/theseus-engine.yml:
- Around line 17-24: The heredoc used to populate REPOS (the python << 'EOF'
block) is dedented and therefore terminates the surrounding YAML block scalar;
indent the entire heredoc body, the trailing EOF marker, and the closing
parenthesis to the same column as the run: | block content so YAML treats the
heredoc as part of the block scalar (YAML will strip common indentation and the
bash heredoc will still receive EOF at column 0); update the lines around the
REPOS assignment and the python << 'EOF' invocation accordingly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5866a1c7-b363-459f-a4cb-e42ae69a2d7c

📥 Commits

Reviewing files that changed from the base of the PR and between 7f6a419 and c356128.

📒 Files selected for processing (3)
  • .github/workflows/theseus-engine.yml
  • docs/CONFIGURATION.md
  • tests/test_data_integrity.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_data_integrity.py

@Asifdotexe Asifdotexe merged commit be35901 into main Jun 2, 2026
2 checks passed
@Asifdotexe Asifdotexe deleted the 39-optimizing-the-workflow branch June 4, 2026 16:41