Skip to content

1 fetch calculate and store#7

Merged
Asifdotexe merged 22 commits into
mainfrom
1-fetch-calculate-and-store
Apr 4, 2026
Merged

1 fetch calculate and store#7
Asifdotexe merged 22 commits into
mainfrom
1-fetch-calculate-and-store

Conversation

@Asifdotexe

@Asifdotexe Asifdotexe commented Apr 4, 2026

Copy link
Copy Markdown
Owner

Summary

  • New Features

    • Added repository analysis capability to track code metrics and composition over time
  • Chores

    • Configured development tooling with code quality checks, formatting, and linting
    • Initialized project configuration and dependencies

Look at the optimization result at #1 (comment)

Summary by CodeRabbit

  • New Features

    • Added repository analysis that snapshots monthly commits and reports code composition and line-count trends over time.
  • Chores

    • Added project configuration for Python packaging, build system, and pinned development tooling (formatters, linter, test runner) plus pre-commit hooks to enforce code style and quality.
  • Data

    • Included initial monthly snapshot dataset capturing total lines and per-year composition for the project.

@Asifdotexe Asifdotexe self-assigned this Apr 4, 2026
@Asifdotexe Asifdotexe linked an issue Apr 4, 2026 that may be closed by this pull request
4 tasks
@coderabbitai

coderabbitai Bot commented Apr 4, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@Asifdotexe has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 19 minutes and 37 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 19 minutes and 37 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 79025aff-7cd2-4308-90c1-5cf0bde323cc

📥 Commits

Reviewing files that changed from the base of the PR and between 02de6e4 and 1a7a08c.

📒 Files selected for processing (3)
  • .github/workflows/integration-tests.yml
  • scripts/analyse_repository.py
  • tests/test_analyse_repository.py
📝 Walkthrough

Walkthrough

Adds project tooling and metadata (pre-commit configuration and pyproject), a new repository analysis script that snapshots monthly commits and computes line-count composition by author year, and initial monthly JSON data capturing those snapshot results.

Changes

Cohort / File(s) Summary
Project Configuration
\.pre-commit-config\.yaml, pyproject\.toml
Adds pre-commit hooks (trailing-whitespace, end-of-file-fixer, check-yaml, check-added-large-files, plus local hooks invoking black, isort, pylint via poetry run with language: system and types: [python]; pylint uses --rcfile=pyproject.toml). Adds pyproject.toml with project metadata, dev dependency group, Pylint config (max-line-length 120, disables fixme), and build-system backend.
Repository Analysis Script
scripts/analyse_repository.py
New script that clones or refreshes target repos, derives one snapshot commit per month, checks out each snapshot, lists tracked files, runs git blame --line-porcelain concurrently, parses blame to attribute lines by author-year, aggregates per-snapshot composition, and persists chronological JSON state. Includes timer decorator and subprocess wrapper.
Data
data/claude-code_data.json
New JSON dataset with monthly snapshot records (2025-02 through 2026-04) listing snapshot_date, total_lines, and composition mapping years to line counts.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Main as "analyse_repository.py"
    participant Git as "git (clone/checkout/blame)"
    participant FS as "File System"
    participant Exec as "ThreadPoolExecutor"
    participant Parser as "Blame Parser"
    participant Store as "JSON Storage"

    User->>Main: process_repository(repo_slug)
    Main->>Git: clone_repository(repo_slug)
    Git-->>Main: repo cloned
    Main->>Git: get_monthly_snapshots()
    Git-->>Main: monthly commit list

    loop for each snapshot
        Main->>Git: checkout(commit)
        Git-->>Main: checked out
        Main->>Git: git ls-files
        Git-->>Main: file list

        loop for each file (concurrent)
            Main->>Exec: _blame_single_file(file)
            Exec->>Git: git blame --line-porcelain file
            Git-->>Exec: blame output
            Exec->>Parser: _parse_blame_output(output)
            Parser-->>Exec: year -> line counts
            Exec-->>Main: file composition
        end

        Main->>Main: aggregate compositions
        Main->>Store: load_existing_state(json)
        Store-->>Main: existing data
        Main->>Store: append & write JSON
        Store-->>Main: persisted
    end

    Main->>FS: cleanup temp dir
    FS-->>Main: removed
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through commits, sniffed each line,

Collected months and counted ink so fine.
Hooks in place, the code sleeps neat,
Snapshots saved, the tale complete.
A little rabbit cheers — data sweet!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title '1 fetch calculate and store' is vague and uses non-descriptive phrasing that does not clearly convey what the changeset accomplishes, making it difficult for reviewers scanning history to understand the primary change. Revise the title to be more descriptive of the main feature, such as 'Add repository analysis pipeline to track code metrics' or 'Implement monthly snapshot analysis for code composition tracking'.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 1-fetch-calculate-and-store

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (2)
scripts/analyse_repository.py (2)

83-85: Use f-string conversion flag instead of str().

Per static analysis (RUF010), replace {str(e)} with {e!s} for cleaner f-string formatting.

-        print(f"Failed to list files in repository: {str(e)}")
+        print(f"Failed to list files in repository: {e!s}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/analyse_repository.py` around lines 83 - 85, Replace the f-string
conversion in the exception handler that prints the RuntimeError: locate the
except block "except RuntimeError as e:" (the print and subsequent return {}),
and change the interpolation from using str(e) to the f-string conversion flag
by using {e!s} so the message uses the f-string conversion form instead of
calling str().

52-55: Consider adding defensive parsing for malformed log lines.

If a log line doesn't contain the expected | separator (unlikely but possible with corrupted git output), the unpacking on line 55 will raise a ValueError. Consider wrapping in a try-except or validating the split result:

🛡️ Defensive parsing suggestion
     for line in log_output.splitlines():
         if not line:
             continue
-        commit_hash, commit_date = line.split("|")
+        parts = line.split("|")
+        if len(parts) != 2:
+            continue
+        commit_hash, commit_date = parts
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/analyse_repository.py` around lines 52 - 55, The loop that parses git
log lines (iterating over log_output.splitlines()) assumes every non-empty line
contains a '|' and unpacks into commit_hash and commit_date, which will raise on
malformed lines; update the parsing in that loop to defensively handle bad lines
by either validating the result of line.split("|") (check length == 2) or
wrapping the unpack in a try/except catching ValueError, skip or continue on
malformed lines, and emit a warning or debug log that includes the offending
line for troubleshooting (refer to the variables/loop using log_output, line,
commit_hash, commit_date).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.pre-commit-config.yaml:
- Around line 10-29: The pre-commit hook versions in .pre-commit-config.yaml
(black 25.9.0, isort 6.1.0, pylint v3.3.8) mismatch the versions declared in
pyproject.toml; update the file so behavior matches Poetry-managed tools by
either (A) setting each hook's rev to the same versions specified in
pyproject.toml for black, isort and pylint, or (B) switch the pylint hook (and
any hooks that run via Poetry like black/isort) to use repo: local and keep
entry: poetry run <tool> so the system will use the pyproject.toml versions
(adjust/remove the rev fields when using repo: local); target symbols: the black
hook, isort hook, and the pylint hook with entry: poetry run pylint.

In `@pyproject.toml`:
- Around line 29-31: Update the build-system requires constraint to ensure
poetry-core is new enough to support [dependency-groups]; change the value of
the requires entry under the [build-system] table so that
"poetry-core>=2.2.0,<3.0.0" is used instead of "poetry-core>=2.0.0,<3.0.0" (edit
the requires line that references poetry-core in the [build-system] block).
- Around line 14-21: The dependency specifiers in the dependency-groups block
are using non-standard parenthesized syntax (e.g., "pytest (==9.0.2)"); update
each package specifier to valid PEP 508 form by removing the parentheses and
extra space and using standard operators (e.g., change "pytest (==9.0.2)" to
"pytest==9.0.2", "pre-commit (>=4.5.1,<5.0.0)" to "pre-commit>=4.5.1,<5.0.0",
etc.) so the dependency-groups entries conform to PEP 735/508 parsing.

In `@scripts/analyse_repository.py`:
- Around line 1-4: Update the module docstring in scripts/analyse_repository.py:
fix the typo "reposible" → "responsible" and reword the second sentence to
accurately describe behavior (e.g., state that the script processes monthly
snapshots incrementally rather than implying exactly "12 blames per year") so
the docstring for the module (top-of-file docstring) correctly describes the
script's function.
- Around line 93-105: Fix the minor typos in the comments around the git blame
invocation: change "doesn;t" to "doesn't" and "flas" to "flags" in the comment
near the _run_command(...) call that produces blame_output, and change "has" to
"as" in the comment describing the porcelain parsing for
commit_to_year/current_commit so the comments read correctly and clearly.
- Around line 205-215: The __main__ block currently hardcodes TARGET_REPO_PATH
to a user-specific Windows path; replace this with a CLI argument or
environment-variable backed value: use argparse to add a --target-repo (fallback
to os.environ.get('TARGET_REPO_PATH') or a neutral placeholder like
"./path/to/repo") and a --output JSON flag for OUTPUT_JSON_FNAME, then call
generate_theseus_data(target_repo, output_json) as before and keep the
os.path.exists check; ensure you remove the username-specific default so no
personal paths are committed and keep the existing OUTPUT_JSON_FNAME behavior as
a configurable default.
- Around line 194-195: Remove the unnecessary repository switch by deleting the
call to _run_command(["git", "checkout", "-"], cwd=target_repo_path) in
scripts/analyse_repository.py; the analysis only runs git blame and git ls-tree
on specified commits and never changes the working tree, so remove this line
(referencing the _run_command call and target_repo_path) to avoid flipping the
repo to a previous branch.

---

Nitpick comments:
In `@scripts/analyse_repository.py`:
- Around line 83-85: Replace the f-string conversion in the exception handler
that prints the RuntimeError: locate the except block "except RuntimeError as
e:" (the print and subsequent return {}), and change the interpolation from
using str(e) to the f-string conversion flag by using {e!s} so the message uses
the f-string conversion form instead of calling str().
- Around line 52-55: The loop that parses git log lines (iterating over
log_output.splitlines()) assumes every non-empty line contains a '|' and unpacks
into commit_hash and commit_date, which will raise on malformed lines; update
the parsing in that loop to defensively handle bad lines by either validating
the result of line.split("|") (check length == 2) or wrapping the unpack in a
try/except catching ValueError, skip or continue on malformed lines, and emit a
warning or debug log that includes the offending line for troubleshooting (refer
to the variables/loop using log_output, line, commit_hash, commit_date).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec766b0a-a981-4f97-9b21-c67780ef5bc5

📥 Commits

Reviewing files that changed from the base of the PR and between 3d2a641 and 8affcc4.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • .pre-commit-config.yaml
  • data/theseus_data.json
  • pyproject.toml
  • scripts/analyse_repository.py

Comment thread .pre-commit-config.yaml Outdated
Comment thread pyproject.toml
Comment thread pyproject.toml
Comment thread scripts/analyse_repository.py
Comment thread scripts/analyse_repository.py Outdated
Comment thread scripts/analyse_repository.py Outdated
Comment thread scripts/analyse_repository.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
scripts/analyse_repository.py (3)

223-225: Minor typo in comment.

Line 224: "it's" → "its" (possessive), and "120 month" → "120-month" (compound adjective).

📝 Suggested fix
-    # Let's say we have a 10-year old repository. Running git blame on every file for every single month
-    # of it's 120 month long history would take hours and it would blow past the GitHub Action's free tier limit.
+    # Let's say we have a 10-year old repository. Running git blame on every file for every single month
+    # of its 120-month-long history would take hours and it would blow past the GitHub Action's free tier limit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/analyse_repository.py` around lines 223 - 225, Fix the typo in the
comment about repository history: change "it's" to the possessive "its" and make
"120 month" a compound adjective "120-month" so the sentence reads correctly
(the comment describing running git blame on every file for every single month
of its 120-month long history).

286-301: Consider CLI arguments for flexibility; minor grammar fix.

The hardcoded TARGETS list works for the current use case. For broader adoption, consider adding argparse to accept repo slugs as arguments.

Line 290: "these one" → "these ones" or "this one".

📝 Suggested fix for grammar
-    # The Case Studies: Start with these one to benchmark.
+    # The Case Studies: Start with this one to benchmark.
     TARGETS = [
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/analyse_repository.py` around lines 286 - 301, Replace the hardcoded
TARGETS usage in the __main__ block with argparse so callers can pass one or
more repo slugs (e.g., --targets or positional) while preserving the existing
TARGETS as the default; update calls to process_repository to iterate over
args.targets (or fallback to TARGETS) and optionally expose DATA_OUTPUT_DIR via
an argument; also fix the inline comment string "# The Case Studies: Start with
these one to benchmark." to correct grammar (e.g., "Start with these ones to
benchmark." or "Start with this one to benchmark."). Ensure changes touch the
__main__ block, the TARGETS variable, and the process_repository invocation.

35-58: Minor typo in docstring and note on subprocess usage.

Line 37: "it's" → "its" (possessive form).

Regarding static analysis hint S603: The subprocess calls are currently safe since TARGETS is hardcoded. If this script is later extended to accept external input (e.g., via CLI args or API), ensure proper validation of repo_slug to prevent command injection.

📝 Suggested fix for typo
 def _run_command(cmd: list[str], cwd: str | None = None) -> str:
     """
-    Execute a shell command and return it's standard output
+    Execute a shell command and return its standard output
 
     :param cmd: List of arguments forming the command.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/analyse_repository.py` around lines 35 - 58, Fix the docstring typo
in _run_command by changing "it's" to "its" and update the comment to note
subprocess security: keep using subprocess.run without shell=True for current
hardcoded TARGETS, and add a TODO that if external input (e.g., repo_slug) is
ever accepted, validate/sanitize it (or use shlex.quote/explicit argument lists)
to prevent command injection (reference function _run_command and any callers
that pass repo_slug).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/analyse_repository.py`:
- Around line 233-236: The code assumes "main" as the default branch when
refreshing an existing repo; change it to detect the remote default branch
(e.g., run git symbolic-ref --short refs/remotes/origin/HEAD or git rev-parse
--abbrev-ref origin/HEAD and parse the result) and fall back to "main" then
"master" if detection fails, then use that branch name instead of the hardcoded
"main" in the _run_command calls; update the checkout/pull sequence around
repo_name and temp_repo_path to compute branch_name first and call
_run_command(["git","checkout", branch_name], cwd=temp_repo_path) and
_run_command(["git","pull"], cwd=temp_repo_path).
- Around line 128-131: The year is being derived with
datetime.fromtimestamp(timestamp) which uses the local timezone; change it to
use UTC so year attribution is consistent across environments—replace the call
in scripts/analyse_repository.py where you set year (the block that parses
parts[1] and assigns commit_to_year[current_commit]) to use UTC (e.g.,
datetime.utcfromtimestamp(...) or datetime.fromtimestamp(..., timezone.utc)) and
add any necessary import for timezone handling.

---

Nitpick comments:
In `@scripts/analyse_repository.py`:
- Around line 223-225: Fix the typo in the comment about repository history:
change "it's" to the possessive "its" and make "120 month" a compound adjective
"120-month" so the sentence reads correctly (the comment describing running git
blame on every file for every single month of its 120-month long history).
- Around line 286-301: Replace the hardcoded TARGETS usage in the __main__ block
with argparse so callers can pass one or more repo slugs (e.g., --targets or
positional) while preserving the existing TARGETS as the default; update calls
to process_repository to iterate over args.targets (or fallback to TARGETS) and
optionally expose DATA_OUTPUT_DIR via an argument; also fix the inline comment
string "# The Case Studies: Start with these one to benchmark." to correct
grammar (e.g., "Start with these ones to benchmark." or "Start with this one to
benchmark."). Ensure changes touch the __main__ block, the TARGETS variable, and
the process_repository invocation.
- Around line 35-58: Fix the docstring typo in _run_command by changing "it's"
to "its" and update the comment to note subprocess security: keep using
subprocess.run without shell=True for current hardcoded TARGETS, and add a TODO
that if external input (e.g., repo_slug) is ever accepted, validate/sanitize it
(or use shlex.quote/explicit argument lists) to prevent command injection
(reference function _run_command and any callers that pass repo_slug).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9d3a55e2-bc4a-4768-a879-f37f59f684f4

📥 Commits

Reviewing files that changed from the base of the PR and between 8affcc4 and d7d7aa4.

📒 Files selected for processing (2)
  • data/claude-code_data.json
  • scripts/analyse_repository.py
✅ Files skipped from review due to trivial changes (1)
  • data/claude-code_data.json

Comment thread scripts/analyse_repository.py
Comment thread scripts/analyse_repository.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
scripts/analyse_repository.py (3)

181-184: Consider handling exceptions from future.result().

If _blame_single_file raises an unexpected exception (not RuntimeError), calling future.result() will re-raise it and crash the entire analysis. While _blame_single_file catches RuntimeError, other exceptions (e.g., KeyboardInterrupt, MemoryError) would propagate.

♻️ Suggested defensive handling
         for future in concurrent.futures.as_completed(future_to_file):
-            file_dist = future.result()
-            for year, count in file_dist.items():
-                age_distribution[year] += count
+            try:
+                file_dist = future.result()
+                for year, count in file_dist.items():
+                    age_distribution[year] += count
+            except Exception as e:
+                file = future_to_file[future]
+                print(f"Warning: Failed to process {file}: {e}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/analyse_repository.py` around lines 181 - 184, The loop calling
future.result() can re-raise unexpected exceptions from _blame_single_file and
crash the run; wrap the future.result() call in a try/except that catches
Exception (or more specific exceptions you want to handle), log the error with
context (include the file identifier from future_to_file and the exception), and
skip that future so age_distribution aggregation only runs for successful
results; ensure you still re-raise critical exceptions if desired (e.g.,
KeyboardInterrupt) or handle them separately.

86-92: Consider using maxsplit for defensive parsing.

While the current git log format %H|%cI shouldn't produce extra | characters, using split("|", 1) would be more defensive against unexpected output formats.

♻️ Suggested improvement
     for line in log_output.splitlines():
         if not line:
             continue
-        commit_hash, commit_date = line.split("|")
+        commit_hash, commit_date = line.split("|", 1)
 
         # We slice the first 7 characters of the ISO to get the 'YYYY-MM' period
         period = commit_date[:7]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/analyse_repository.py` around lines 86 - 92, The parsing loop over
log_output should defensively handle unexpected '|' characters: replace the
naive split in the loop that assigns commit_hash, commit_date with a
single-split using maxsplit (e.g., split("|", 1)) so commit_hash is the first
field and commit_date is the remainder, then continue to compute period =
commit_date[:7]; update the code that processes log_output.splitlines()
accordingly to use this single-split approach.

61-70: Advisory: Validate repository slug if accepting user input in the future.

The static analysis hint (S603) flags subprocess calls with potentially untrusted input. Currently safe since TARGETS is hardcoded, but if you implement CLI arguments (per the TODO at line 206), consider validating repo_slug format (e.g., matching ^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$) to prevent command injection via malformed repository names.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/analyse_repository.py` around lines 61 - 70, The clone_repository
function currently builds a repo URL from repo_slug and calls _run_command; to
address the S603 risk when repo_slug may come from user input, validate and
sanitize repo_slug before use: in clone_repository (and any caller) enforce a
strict regex like ^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$ (or equivalent) and raise
ValueError on mismatch, or explicitly parse and reject dangerous characters,
then construct repo_url only from the validated slug and call _run_command;
referencing clone_repository and the _run_command call ensures the validation
happens right before the git clone is executed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/analyse_repository.py`:
- Around line 297-300: The comment above the TARGETS list has a grammar mistake:
replace "Start with these one" with "Start with this one" in the comment that
precedes the TARGETS variable to read "The Case Studies: Start with this one to
benchmark."; locate the comment text near the TARGETS =
["anthropics/claude-code"] declaration and update it accordingly.
- Around line 35-42: The docstring for the function _run_command contains a
grammatical typo: change "it's" to the possessive "its" in the first line of the
docstring so it reads "return its standard output"; update the string in the
_run_command docstring accordingly to fix the typo.
- Around line 237-243: The branch-checkout loop that calls _run_command(["git",
"checkout", branch"], cwd=temp_repo_path) should handle the case where neither
"main" nor "master" exists: detect the repository's default branch (e.g.,
resolve origin/HEAD via git rev-parse --abbrev-ref origin/HEAD or git
symbolic-ref refs/remotes/origin/HEAD) and attempt to checkout that, and if
detection fails emit a clear warning/error and avoid blindly running
_run_command(["git", "pull"], cwd=temp_repo_path); update the logic around the
for loop and subsequent _run_command calls in scripts/analyse_repository.py to
try the detected default branch first, fall back to the existing "main"/"master"
attempts, and log or raise when no suitable branch is found.

---

Nitpick comments:
In `@scripts/analyse_repository.py`:
- Around line 181-184: The loop calling future.result() can re-raise unexpected
exceptions from _blame_single_file and crash the run; wrap the future.result()
call in a try/except that catches Exception (or more specific exceptions you
want to handle), log the error with context (include the file identifier from
future_to_file and the exception), and skip that future so age_distribution
aggregation only runs for successful results; ensure you still re-raise critical
exceptions if desired (e.g., KeyboardInterrupt) or handle them separately.
- Around line 86-92: The parsing loop over log_output should defensively handle
unexpected '|' characters: replace the naive split in the loop that assigns
commit_hash, commit_date with a single-split using maxsplit (e.g., split("|",
1)) so commit_hash is the first field and commit_date is the remainder, then
continue to compute period = commit_date[:7]; update the code that processes
log_output.splitlines() accordingly to use this single-split approach.
- Around line 61-70: The clone_repository function currently builds a repo URL
from repo_slug and calls _run_command; to address the S603 risk when repo_slug
may come from user input, validate and sanitize repo_slug before use: in
clone_repository (and any caller) enforce a strict regex like
^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$ (or equivalent) and raise ValueError on
mismatch, or explicitly parse and reject dangerous characters, then construct
repo_url only from the validated slug and call _run_command; referencing
clone_repository and the _run_command call ensures the validation happens right
before the git clone is executed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 71d2a7dd-19ed-4109-9dda-6ab888dd370f

📥 Commits

Reviewing files that changed from the base of the PR and between d7d7aa4 and 02de6e4.

📒 Files selected for processing (3)
  • .pre-commit-config.yaml
  • pyproject.toml
  • scripts/analyse_repository.py
✅ Files skipped from review due to trivial changes (2)
  • .pre-commit-config.yaml
  • pyproject.toml

Comment thread scripts/analyse_repository.py
Comment thread scripts/analyse_repository.py
Comment thread scripts/analyse_repository.py Outdated
@Asifdotexe Asifdotexe merged commit acaa870 into main Apr 4, 2026
1 check passed
@Asifdotexe Asifdotexe deleted the 1-fetch-calculate-and-store branch April 4, 2026 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fetch, calculate and store

1 participant