33 python code auditing #34

Merged
Asifdotexe merged 9 commits into main from 33-python-code-auditing
May 31, 2026

Conversation

@Asifdotexe Asifdotexe commented May 31, 2026

Owner

Summary by CodeRabbit

  • New Features

    • Added shared pipeline utilities, a unified pipeline runner, and modularized data IO and blame helpers to streamline snapshot/fossil processing.
    • Updated multiple repository datasets with refreshed monthly composition metrics and fossil metadata.
  • Refactor

    • Consolidated scripts to use shared helpers and improved concurrency, error handling, and atomic data writes.
    • CI workflows updated to run the new pipeline and adjust test/job setup.
  • Documentation & Configuration

    • Revised milestone dates and added new project milestones.
  • Tests

    • Added tests covering snapshot filtering and blame-year parsing.

@Asifdotexe Asifdotexe self-assigned this May 31, 2026
@Asifdotexe Asifdotexe linked an issue May 31, 2026 that may be closed by this pull request
coderabbitai Bot commented May 31, 2026


Warning

Review limit reached

@Asifdotexe, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 46 minutes and 19 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d468cae7-1fcf-4fd0-a73b-81bda950811e

📥 Commits

Reviewing files that changed from the base of the PR and between 0f264ac and 0489db6.

📒 Files selected for processing (5)
  • .github/workflows/theseus-engine.yml
  • pyproject.toml
  • scripts/_data_io.py
  • scripts/_utils.py
  • theseus.config.json
📝 Walkthrough

Walkthrough

Extract shared helpers and IO into scripts/_utils.py and scripts/_data_io.py, centralize git-blame logic in scripts/_blame.py, add scripts/run_pipeline.py to orchestrate analysis → fossils → cleanup, and refactor add_fossils.py, analyse_repository.py, and cleanup_data.py to use the shared modules; update tests, dataset JSON, config milestones, and CI workflows.

Changes

Shared Pipeline Infrastructure and Fossil/Analysis Refactoring

| Layer / File(s) | Summary |
| --- | --- |
| Shared utilities and IO: scripts/_utils.py, scripts/_data_io.py | Add run_command, load_config, get_default_branch, get_tracked_files, remove_path, plus load_snapshot_data/save_snapshot_data for normalized atomic JSON IO. |
| Blame parsing and fossil selection: scripts/_blame.py | New blame helpers: blame_single_file, parse_blame_year_counts, find_oldest_fossil_in_blame, and concurrent blame_files_* utilities to aggregate year counts or find oldest fossil. |
| Fossil generation/refactor: scripts/add_fossils.py | Migrate to shared helpers and blame module; genesis now blames only files added in each commit with stale_limit early-exit; survivor uses detected default branch and (file, commit) identity; backfill/update use typed signatures, shared snapshot IO, and remove_path cleanup. |
| Repository analysis and snapshot filtering: scripts/analyse_repository.py | Switch git calls to run_command, add _resolve_worker_count and _filter_snapshots(reprocess), use load_snapshot_data/save_snapshot_data, detect default branch via utils, and add concurrent main() with --repo/--reprocess. |
| Unified pipeline runner: scripts/run_pipeline.py | New orchestration running three ordered stages: snapshot analysis, fossils (backfill or survivor-only), and cleanup with CLI flags --repo, --reprocess, --update-survivor. |
| Data cleanup refactor: scripts/cleanup_data.py | Use shared snapshot IO and config loader; remove legacy total_lines and prune future-year composition keys per snapshot_date; persist minified JSON. |
| Tests for snapshot filtering: tests/test_analyse_repository.py | Update imports to new helpers and add TestFilterSnapshots suite validating processed-period exclusion and reprocess override behavior. |
| Dataset JSON and config milestone updates: data/claude-code_data.json, data/langchain_data.json, data/numpy_data.json, data/react_data.json, data/zed_data.json, theseus.config.json | Replace dataset JSON payloads with refreshed snapshots composition numbers and updated fossils.genesis metadata; update theseus.config.json milestones (React date, new NumPy and Zed milestones) and ensure trailing JSON correctness. |
| CI/workflow updates: .github/workflows/integration-tests.yml, .github/workflows/theseus-engine.yml | Rename integration workflow to unit tests and adjust Poetry install; change engine workflow to install Poetry via pipx and run scripts/run_pipeline.py --update-survivor, gating PR creation on prior success and tweak PR body. |
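The stage ordering described in the runner row above can be sketched roughly as follows. This is not the code in scripts/run_pipeline.py: only the three flags and the stage order come from the walkthrough, while build_parser, plan_stages, run_stages, and the stage labels are hypothetical names, and the sketch assumes all three stages always run.

```python
import argparse
from typing import Callable, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the three flags named in the walkthrough."""
    parser = argparse.ArgumentParser(description="Run analysis, fossils, then cleanup.")
    parser.add_argument("--repo", help="Limit the run to a single repository.")
    parser.add_argument("--reprocess", metavar="YYYY-MM",
                        help="Re-run a period that was already processed.")
    parser.add_argument("--update-survivor", action="store_true",
                        help="Refresh only the survivor fossil instead of a full backfill.")
    return parser


def plan_stages(args: argparse.Namespace) -> list[str]:
    """Return stage labels in execution order (labels are illustrative only)."""
    fossils = "fossils:survivor-only" if args.update_survivor else "fossils:backfill"
    return ["analysis", fossils, "cleanup"]


def run_stages(stages: Sequence[tuple[str, Callable[[], bool]]]) -> bool:
    """Run stages in order and stop at the first one that reports failure."""
    for name, stage in stages:
        if not stage():
            print(f"stage failed: {name}")
            return False
    return True


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--repo", "numpy", "--update-survivor"])
planned = plan_stages(args)
```

Stopping at the first failed stage keeps cleanup from running over data that a failed analysis or fossil step may have left half-written; whether the real runner behaves this way is an assumption.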

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I hopped through scripts and tidy trails,
Shared utils stitched the fraying rails.
Fossils found where old lines hide,
Snapshots cleaned and pipelines tied.
A carrot for each helper shared—hooray! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title '33 python code auditing' is vague and generic, using non-descriptive language that doesn't convey meaningful information about the actual changes in the pull request. | Replace with a specific title that describes the main change, such as 'Refactor scripts to use shared utilities and unified pipeline orchestration' or 'Consolidate pipeline scripts with shared utility modules'. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 93.88% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
scripts/add_fossils.py (1)

481-491: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Docstring incorrectly describes return value.

The docstring states "Returns the number of repos where the survivor was updated" but the function actually returns had_failures: bool. The updated_count variable is only used for logging.

📝 Proposed fix
     """
     Refresh only the Survivor (Living) fossil for each repo.
     Skips writing to disk if the fossil's file:line:commit hasn't changed.

     This is designed to be fast and run on every monthly cron tick so that
     the living fossil stays current even when no new snapshots are being added.

-    Returns the number of repos where the survivor was updated.
+    Returns True if any errors occurred, False otherwise.
     """
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/add_fossils.py` around lines 481 - 491, The docstring for
update_survivor_fossils incorrectly says "Returns the number of repos where the
survivor was updated" while the function actually returns had_failures: bool
(and updated_count is only used for logging); update the docstring's Returns
section to describe the actual return value (e.g., "Returns True if any failures
occurred while updating survivors, False otherwise") and mention that
updated_count is used only for logging, or alternatively change the function to
return updated_count if you intend an integer—reference update_survivor_fossils,
had_failures, and updated_count when making the fix.
scripts/analyse_repository.py (1)

329-336: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Reprocessing creates duplicate snapshot entries.

When --reprocess YYYY-MM targets an already-processed period, _filter_snapshots correctly includes it for reprocessing, but line 332 concatenates historical_snapshots (which still contains the old entry) with total_new_data (containing the new entry). The sort on line 333 does not deduplicate, resulting in two entries for the same period in the output JSON.

🐛 Proposed fix to remove old entry when reprocessing
         total_new_data.extend(year_data)
         year_elapsed = time.perf_counter() - year_start

-        final_snapshots = historical_snapshots + total_new_data
+        # When reprocessing, filter out the old entry to avoid duplicates
+        if reprocess:
+            base_snapshots = [
+                s for s in historical_snapshots if s["snapshot_date"] != reprocess
+            ]
+        else:
+            base_snapshots = historical_snapshots
+        final_snapshots = base_snapshots + total_new_data
         final_snapshots.sort(key=lambda x: x["snapshot_date"])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/analyse_repository.py` around lines 329 - 336, The code concatenates
historical_snapshots and total_new_data then sorts, which leaves duplicate
entries when reprocessing a period; update the logic before calling
_atomic_write_json to remove any historical_snapshots entries that match
reprocessed snapshot dates (use the same key used in _filter_snapshots, e.g.,
"snapshot_date") so final_snapshots = (historical_snapshots filtered to exclude
dates present in total_new_data) + total_new_data, then sort and pass to
_atomic_write_json; refer to functions/variables _filter_snapshots,
historical_snapshots, total_new_data, final_snapshots, and _atomic_write_json to
locate and implement the change.
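The date-based variant described in the prompt above (drop every historical entry whose date reappears in the new data) could look roughly like this. merge_snapshots is a hypothetical helper, not a function in scripts/analyse_repository.py, and the sample composition numbers are made up for illustration.

```python
from typing import Any

Snapshot = dict[str, Any]


def merge_snapshots(historical: list[Snapshot], new: list[Snapshot]) -> list[Snapshot]:
    """Merge new snapshots into historical ones; new entries win on date clashes."""
    new_dates = {s["snapshot_date"] for s in new}
    # Keep only historical entries that were not regenerated in this run.
    kept = [s for s in historical if s["snapshot_date"] not in new_dates]
    return sorted(kept + new, key=lambda s: s["snapshot_date"])


# Illustrative data only: the 2024-02 period is reprocessed.
historical_snapshots = [
    {"snapshot_date": "2024-01", "composition": {"2023": 10}},
    {"snapshot_date": "2024-02", "composition": {"2023": 9, "2024": 3}},
]
total_new_data = [{"snapshot_date": "2024-02", "composition": {"2023": 8, "2024": 5}}]
final_snapshots = merge_snapshots(historical_snapshots, total_new_data)
```

Compared with filtering on the single reprocess value, keying on the dates actually present in the new data also covers the case where several periods are regenerated in one run.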
theseus.config.json (2)

62-65: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Adjust or justify the “Python 2 & 3 unified” milestone date (theseus.config.json:62-65)

Existing sources place the introduction of Python 3 support in NumPy at version 1.5.0 (released in 2010), so the “Python 2 & 3 unified” milestone dated 2013-04 needs an authoritative rationale, or the date should be updated to match when unified Py2+Py3 support actually began.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@theseus.config.json` around lines 62 - 65, The "Python 2 & 3 unified"
milestone in theseus.config.json currently has date "2013-04" which disagrees
with sources about when NumPy added Py3 support; update the milestone by either
setting the "date" to the authoritative release date for NumPy's Python 3
support (e.g., the NumPy 1.5.0 release date) or keep the existing date but add a
brief justification/citation in the "description" field referencing the
authoritative source (mention the NumPy 1.5.0 release note) so the "title",
"date", and "description" stay consistent.

80-83: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the Zed milestone date (and consider splitting the event)

  • In theseus.config.json (repo zed milestone at date 2023-12), update the date to 2024-01: Zed was open-sourced on 2024-01-24, and the GPUI2 transition hit the preview channel in early Jan 2024.
  • The current milestone title (“GPUI2 Transition and Open Sourced”) combines two distinct events; splitting it into separate milestones for GPUI2 transition vs open-source release improves clarity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@theseus.config.json` around lines 80 - 83, Update the milestone object that
currently has "date": "2023-12" and "title": "GPUI2 Transition and Open Sourced"
in theseus.config.json: change the "date" value to "2024-01" to reflect the
2024-01-24 open-source release, and split this combined milestone into two
separate milestone objects (one for the GPUI2 transition and one for the Zed
open-source release) with distinct "title", "date" and "description" fields so
each event is represented separately (locate the existing milestone by the
"title" and "date" keys to modify and duplicate/split into two entries).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@data/langchain_data.json`:
- Line 1: The snapshots in data/langchain_data.json include future-year "2026"
keys that violate the cleanup logic; fix by running or re-running the existing
cleanup that uses max_year = int(snapshot_date[:4]) and deletes composition keys
with int(year) > max_year (the del composition[key] step in
scripts/cleanup_data.py), or adjust your data generation to never emit
future-year entries (remove/justify all "2026" composition keys across snapshots
such as the 2023-06 and 2025-12 entries) so the file is consistent with the
Filter future years rule.

In `@data/react_data.json`:
- Line 1: The snapshot 2013-12 in react_data.json contains a future-year entry
("2014":6) that should have been removed by the cleanup logic; update the
cleanup logic in cleanup_data.py (the filtering block around lines 38-45) or
re-run the cleanup to ensure snapshot compositions drop any year > snapshot year
(e.g. remove the "2014" key from snapshot_date "2013-12"); specifically, ensure
the function that processes snapshots enforces year <= snapshot_year for each
composition entry and write the cleaned output back to data/react_data.json,
then re-run the pipeline verification.

In `@data/zed_data.json`:
- Line 1: The zed_data.json snapshots before 2026 incorrectly include a "2026"
composition key, so trace the code that builds snapshot compositions (functions
like mapTimestampToYear / buildComposition / generateSnapshots or whatever
computes composition years from fossil.timestamp or git blame timestamps) and
confirm whether it uses commit/timestamp time (possibly with clock skew/UTC
conversion) instead of the snapshot_date year; fix by either using the
snapshot_date's year as the authoritative bucket or clamping computed years so
they are <= snapshot_date year (or otherwise deriving year from the
snapshot_date string), and add a unit/CI check to reject future-year keys for
past snapshot_date entries.
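The rule shared by the three data-file comments above (no composition key may be later than the snapshot's own year) can be sketched as a pruning step plus a check suitable for a unit test or CI. Both function names are hypothetical and this is not the code in scripts/cleanup_data.py; it assumes composition keys are year strings and that snapshot_date starts with a four-digit year.

```python
from typing import Any


def future_year_keys(snapshot: dict[str, Any]) -> list[str]:
    """Return composition keys whose year is later than the snapshot's year."""
    max_year = int(snapshot["snapshot_date"][:4])
    composition = snapshot.get("composition", {})
    # Non-numeric keys are ignored rather than treated as errors.
    return sorted(k for k in composition if k.isdigit() and int(k) > max_year)


def prune_future_years(snapshot: dict[str, Any]) -> dict[str, Any]:
    """Delete future-year composition keys in place and return the snapshot."""
    for key in future_year_keys(snapshot):
        del snapshot["composition"][key]
    return snapshot


# Illustrative snapshot: only the "2014": 6 entry mirrors the review comment;
# the other numbers are made up.
example = {"snapshot_date": "2013-12", "composition": {"2012": 4, "2013": 90, "2014": 6}}
offending = future_year_keys(example)
```

A CI check could then assert that future_year_keys returns an empty list for every snapshot in each data file, which would reject future-year keys before they are committed.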

In `@scripts/_utils.py`:
- Around line 147-171: The code references stat.S_IWUSR, stat.S_IWGRP, and
stat.S_IWOTH inside the handle_remove_readonly callback used by shutil.rmtree
but the stat module is not imported, causing a NameError; fix by adding an
import for the stat module (import stat) near the top-level imports so
handle_remove_readonly (and its use in
shutil.rmtree(onexc=handle_remove_readonly)) can access stat constants.
- Around line 148-166: The retry loop currently calls shutil.rmtree(path,
ignore_errors=False) before the onerror handler is defined so the handler is
never used; move the handle_remove_readonly definition above the first
shutil.rmtree call and invoke shutil.rmtree(path,
onerror=handle_remove_readonly) inside the loop (dropping the duplicated rmtree
call) so each retry uses the handler to fix permissions and retry deletion;
reference the existing shutil.rmtree and handle_remove_readonly names and the
for attempt in range(3) loop when making the change.
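Taken together, the two _utils.py findings above amount to: import stat, define the handler before the loop, and pass it to every rmtree call. A minimal sketch under those assumptions follows; remove_tree and its parameters are hypothetical, not the actual remove_path implementation. The handler follows the read-only recipe from the shutil documentation, and since onexc only exists from Python 3.12 (where onerror is deprecated), the keyword is chosen by version.

```python
import os
import shutil
import stat
import sys
import time


def _handle_remove_readonly(func, path, _exc):
    """Add the owner write bit, then retry the operation that failed."""
    os.chmod(path, os.stat(path).st_mode | stat.S_IWUSR)
    func(path)


def remove_tree(path: str, attempts: int = 3, delay: float = 0.1) -> None:
    """Delete a directory tree, retrying a few times on transient OS errors."""
    for attempt in range(attempts):
        if not os.path.exists(path):
            return
        try:
            if sys.version_info >= (3, 12):
                shutil.rmtree(path, onexc=_handle_remove_readonly)
            else:
                shutil.rmtree(path, onerror=_handle_remove_readonly)
            return
        except OSError:
            # Give up after the last attempt; otherwise wait briefly and retry.
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

Because the handler is defined once at module level and passed on every attempt, each retry benefits from the permission fix, which is the behaviour the second finding asks for.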

In `@theseus.config.json`:
- Line 44: Update the React 19 milestone date in theseus.config.json by changing
the "date" field for the React 19 milestone entry from "2024-06" (or "2024-04")
to "2024-12"; locate the React 19 milestone object (the entry containing the
"date" property shown in the diff) and set its "date" value to "2024-12" to
reflect the stable release on 2024-12-05.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a56ec0d-3779-462c-89ca-ec4e2230e163

📥 Commits

Reviewing files that changed from the base of the PR and between c46e99f and 1384e16.

📒 Files selected for processing (12)
  • data/claude-code_data.json
  • data/langchain_data.json
  • data/numpy_data.json
  • data/react_data.json
  • data/zed_data.json
  • scripts/__init__.py
  • scripts/_utils.py
  • scripts/add_fossils.py
  • scripts/analyse_repository.py
  • scripts/cleanup_data.py
  • tests/test_analyse_repository.py
  • theseus.config.json

Comment thread data/langchain_data.json Outdated
Comment thread data/react_data.json Outdated
Comment thread data/zed_data.json Outdated
Comment thread scripts/_utils.py Outdated
Comment thread scripts/_utils.py
Comment thread theseus.config.json Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/theseus-engine.yml:
- Line 41: Replace the mutable tag reference "uses:
peter-evans/create-pull-request@v6" with a fixed commit SHA (for example the
suggested
"peter-evans/create-pull-request@b1ddad2c994a25fbc81a28b3ec0e368bb2021c50") to
eliminate supply-chain risk; update the workflow entry that specifies uses:
peter-evans/create-pull-request@v6 and keep the existing token: ${{
secrets.GITHUB_TOKEN }} and job permissions (contents: write / pull-requests:
write) unchanged while switching the action reference to the chosen immutable
commit SHA.

In `@scripts/_data_io.py`:
- Around line 65-73: The JSON loader currently returns raw dicts or other JSON
values, which can lack the expected "snapshots" (list) and "fossils" (dict) keys
and cause KeyError in backfill_fossils and update_survivor_fossils; after
json.load(...) in the existing try block, normalize the result so you always
return a dict with "snapshots" (ensure it's a list, otherwise set to []) and
"fossils" (ensure it's a dict, otherwise set to {}), and if the JSON is neither
a list nor dict, return {"snapshots": [], "fossils": {}}; update the branch that
currently returns data to perform this normalization before returning.

In `@scripts/add_fossils.py`:
- Around line 305-309: The code directly indexes data["snapshots"] after calling
load_snapshot_data which can raise KeyError for malformed JSON; update the
load_snapshot_data implementation to normalize its return value so it always
includes both keys (at least "snapshots" as an empty list and the other expected
key(s) as sensible defaults) and ensure callers (e.g., where load_snapshot_data
is used) can safely index "snapshots" without a try/except; keep the call site
(load_snapshot_data(...)) unchanged and instead make load_snapshot_data return a
dict that always contains "snapshots" to prevent uncaught KeyError.
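Both findings above point at the same fix: normalize once inside the loader so callers can index "snapshots" and "fossils" without guarding. A rough sketch follows; normalize_snapshot_data and load_snapshot_file are hypothetical names, not the actual load_snapshot_data in scripts/_data_io.py. Treating a bare JSON list as the snapshots list is an assumption, as is falling back to an empty structure for a missing or unparsable file.

```python
import json
from typing import Any


def normalize_snapshot_data(raw: Any) -> dict[str, Any]:
    """Coerce parsed JSON into a dict that always has 'snapshots' and 'fossils'."""
    if isinstance(raw, list):
        # Assumption: a bare list is a legacy file holding only snapshots.
        return {"snapshots": raw, "fossils": {}}
    if isinstance(raw, dict):
        data = dict(raw)  # shallow copy so the caller's object is untouched
        if not isinstance(data.get("snapshots"), list):
            data["snapshots"] = []
        if not isinstance(data.get("fossils"), dict):
            data["fossils"] = {}
        return data
    return {"snapshots": [], "fossils": {}}


def load_snapshot_file(path: str) -> dict[str, Any]:
    """Read a JSON file and return normalized data, even if it is missing or invalid."""
    try:
        with open(path, encoding="utf-8") as fh:
            raw = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        raw = None
    return normalize_snapshot_data(raw)
```

With this shape guaranteed at the loader, call sites such as data["snapshots"] need no try/except, which matches the "keep the call site unchanged" suggestion in the second finding.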

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 98575af6-aa45-4344-9782-c966c9a3e874

📥 Commits

Reviewing files that changed from the base of the PR and between 1384e16 and 0f264ac.

📒 Files selected for processing (14)
  • .github/workflows/integration-tests.yml
  • .github/workflows/theseus-engine.yml
  • data/langchain_data.json
  • data/numpy_data.json
  • data/react_data.json
  • data/zed_data.json
  • scripts/_blame.py
  • scripts/_data_io.py
  • scripts/_utils.py
  • scripts/add_fossils.py
  • scripts/analyse_repository.py
  • scripts/cleanup_data.py
  • scripts/run_pipeline.py
  • tests/test_analyse_repository.py
✅ Files skipped from review due to trivial changes (1)
  • data/langchain_data.json
🚧 Files skipped from review as they are similar to previous changes (4)
  • data/react_data.json
  • data/zed_data.json
  • data/numpy_data.json
  • scripts/analyse_repository.py

Comment thread .github/workflows/theseus-engine.yml Outdated
Comment thread scripts/_data_io.py
Comment thread scripts/add_fossils.py
@Asifdotexe Asifdotexe merged commit 1fdeb51 into main May 31, 2026
2 checks passed
@Asifdotexe Asifdotexe deleted the 33-python-code-auditing branch May 31, 2026 12:27

Development

Successfully merging this pull request may close these issues.

Python code auditing

1 participant