commit-risk-scorer

A shift-left engineering intelligence agent that predicts PR risk, recommends reviewer / test / gate actions, and closes the loop through CI, telemetry, and DORA-style engineering metrics. Built on NVIDIA's open AI stack with a hybrid predictive pipeline (fine-tuned classifier + LLM judge).

📚 New here? See docs/README.md — a documentation index organized by audience (hiring manager / engineer / adopter) and by time available (5 / 15 / 30 / 60 min). It's the recommended entry point for anyone going deeper than this README.

What it does

Input: PR diff + metadata (author, files, target branch) + build/test history + ownership signals.

Output:

Risk score (0–100) and risk level (Low / Medium / High / Critical)
Top risk factors — evidence-backed (file-ownership gaps, weak test coverage, historically failing areas, deployment blast radius)
Recommended actions — reviewer assignment, test suite to run, gate decision (not just a numeric signal)
DORA-style impact telemetry — cycle time, change failure rate, MTTR, adoption, FP/FN feedback. (OSS deployment uses a DORA-aligned eval harness with replayed / simulated data — real impact numbers require deployment in a real org. Disclosed in docs/limitations.md §9 and docs/metrics.md §Estimation honesty.)

The risk score is not the product; the action is. The score feeds a policy decision surface:

Risk → Action mapping

Score	Level	Action
0–20	Low	Fast-track / normal review
21–50	Medium	Add code-owner reviewer + targeted tests
51–80	High	Require SME review + extended CI
81–100	Critical	Block merge / manual gate
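
The banding above can be sketched as a small lookup. This is a minimal illustration, not the repository's implementation: `RiskBand` and `map_risk_to_action` are hypothetical names, and assigning fractional scores that fall between bands (for example 20.5) to the higher band is an assumption the table does not settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskBand:
    upper: int   # inclusive upper bound of the band on the 0-100 scale
    level: str
    action: str


# Bands copied from the Risk -> Action table.
BANDS = (
    RiskBand(20, "Low", "Fast-track / normal review"),
    RiskBand(50, "Medium", "Add code-owner reviewer + targeted tests"),
    RiskBand(80, "High", "Require SME review + extended CI"),
    RiskBand(100, "Critical", "Block merge / manual gate"),
)


def map_risk_to_action(score: float) -> RiskBand:
    """Return the first band whose inclusive upper bound covers `score`."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in [0, 100], got {score}")
    for band in BANDS:
        if score <= band.upper:
            return band
    raise AssertionError("unreachable: bands cover 0-100")


print(map_risk_to_action(72).level)  # High
```

Keeping the bands in one ordered tuple means a threshold change is a one-line edit, and the range check rejects out-of-scale inputs instead of silently mapping them to a band.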

Example output

{
  "riskScore": 72,
  "riskLevel": "High",
  "topRiskFactors": [
    "Touches auth middleware (high-incident area)",
    "No test coverage for the modified branch",
    "Similar historical PRs caused CI failures"
  ],
  "recommendedActions": [
    "Add security / code-owner reviewer",
    "Run extended integration test suite",
    "Block auto-merge until reviewer approval"
  ],
  "confidence": 0.81,
  "evidence": [
    "Changed file: src/auth/token_validator.py — owned by @security-team",
    "Historical match: PR #1842 failed `test_auth_session_refresh`",
    "Test impact: 0 of 12 covering tests modified"
  ]
}
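
A consumer of this payload may want to parse it into a typed object and reject out-of-range values. The sketch below assumes the field names shown in the example; `RiskReport` and `parse_risk_report` are hypothetical names, and the 0 to 1 range for `confidence` is an assumption inferred from the sample value 0.81.

```python
import json
from dataclasses import dataclass

LEVELS = ("Low", "Medium", "High", "Critical")


@dataclass(frozen=True)
class RiskReport:
    risk_score: int
    risk_level: str
    top_risk_factors: list[str]
    recommended_actions: list[str]
    confidence: float
    evidence: list[str]


def parse_risk_report(payload: str) -> RiskReport:
    """Parse the JSON output and reject values outside the documented ranges."""
    raw = json.loads(payload)
    report = RiskReport(
        risk_score=raw["riskScore"],
        risk_level=raw["riskLevel"],
        top_risk_factors=list(raw["topRiskFactors"]),
        recommended_actions=list(raw["recommendedActions"]),
        confidence=raw["confidence"],
        evidence=list(raw["evidence"]),
    )
    if not 0 <= report.risk_score <= 100:
        raise ValueError("riskScore must be within 0-100")
    if report.risk_level not in LEVELS:
        raise ValueError(f"unknown riskLevel: {report.risk_level!r}")
    if not 0.0 <= report.confidence <= 1.0:
        raise ValueError("confidence must be within 0-1 (assumed range)")
    return report


# Abbreviated version of the example payload (lists left empty for brevity).
sample = json.dumps({
    "riskScore": 72, "riskLevel": "High", "topRiskFactors": [],
    "recommendedActions": [], "confidence": 0.81, "evidence": [],
})
report = parse_risk_report(sample)
print(report.risk_level, report.confidence)  # High 0.81
```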

Designed to run as a CI check on every PR — providing pre-merge predictive signal that drives policy decisions, not just numeric scores. See docs/evaluation.md for how each layer is measured and docs/enterprise-safety.md for the production-safety controls.

Demo — three scenarios, end-to-end

The repo ships a runnable demo that walks three realistic PRs through the full pipeline (5 sub-agents → policy gatekeeper → markdown PR comment). The captured output lives in demo/output.md; regenerate it with `python -m demo.run_demo > demo/output.md`.

Scenario	What it is	Risk score (0–1 scale)	Risk level	Action
A	README typo fix by a regular contributor	0.00	🟢 Low	`fast_track`
B	4-file refactor inside `src/auth/` (sensitive path), CODEOWNERS provided	0.46	🟡 Medium	`owner_review`
C	Bot-authored 8-file mechanical refactor; PR description claims paths absent from the diff (prompt-vs-diff drift)	1.00	🔴 Critical	`block_merge`

The output that's posted on the PR is real markdown — see demo/output.md for the verbatim agent comments for each scenario, including evidence and sub-agent reports.
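
Scenario C hinges on prompt-vs-diff drift: the PR description names paths that the diff never touches. The README does not describe how the agent-pr-auditor detects this, so the following is only a heuristic sketch. The regexes, the function names, and the sample paths are all assumptions, and paths containing spaces are not handled.

```python
import re

# Matches "diff --git a/<path> b/<path>" headers in a unified git diff.
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)
# Heuristic: a path-like token has at least one "/" and a file extension.
PATH_TOKEN = re.compile(r"[\w.\-]+(?:/[\w.\-]+)+\.\w+")


def changed_paths(diff: str) -> set[str]:
    """Collect the post-change paths named in the diff headers."""
    return {match.group(2) for match in DIFF_HEADER.finditer(diff)}


def claimed_paths(description: str) -> set[str]:
    """Collect path-like tokens mentioned in the PR description."""
    return set(PATH_TOKEN.findall(description))


def drifted_paths(description: str, diff: str) -> set[str]:
    """Paths the description mentions that the diff never touches."""
    return claimed_paths(description) - changed_paths(diff)


# Hypothetical inputs for illustration only.
diff = (
    "diff --git a/src/utils/strings.py b/src/utils/strings.py\n"
    "--- a/src/utils/strings.py\n"
    "+++ b/src/utils/strings.py\n"
)
description = "Renames helpers in src/utils/strings.py and src/auth/session.py"
print(sorted(drifted_paths(description, diff)))  # ['src/auth/session.py']
```

A non-empty result is a signal worth surfacing as evidence, not proof of a problem: descriptions legitimately mention files they do not change, so a real auditor would likely weigh this alongside other signals.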

Why this project exists

Motivation. Built as both a working tool and a public showcase of how I approach enterprise AI tooling. The artifact requires hands-on engagement with NVIDIA's open AI stack (NeMo, Triton, NIM, Garak, NeMo Guardrails) and the operating discipline of internal platform teams (eval-gated CI, runbooks, postmortems, partner-team onboarding). Honest framing in docs/notes/why-this-project.md.

Market positioning. Existing solutions occupy one corner of the design space:

Tool	Approach	Limitation
PR-Agent (Codium)	Generative LLM review	No predictive scoring or calibration
CodeBERT / Devign	Trained classifier	No reasoning or agent integration
NVIDIA Garak	LLM red-teaming	Not specialized for code-review agents

This project is the integration that unifies all three. Adjacent big-tech tools cover individual quadrants — Google Tricorder for static analysis, Meta Sapienz / Getafix for test generation and automated repair, Microsoft TestImpact for test selection, Microsoft CloudBuild for traditional-ML build-failure prediction, Amazon CodeGuru and GitHub Copilot Code Review for generative review — but none combines predictive scoring + LLM agent orchestration + commit-history RAG in a single open-source artifact. Academically the predictive side is well established (DeepJIT, CC2Vec, JITLine, PROMISE benchmark), but no open-source project integrates it with the rest.

The landscape mapped to four quadrants:

                          Predictive
                              ▲
                              │
           ┌──────────────────┼──────────────────┐
           │  MS CloudBuild   │  THIS PROJECT    │
           │  failure-pred    │  commit-risk-    │
           │  MS TestImpact   │  scorer          │
           │  (classical ML)  │  (LLM + RAG +    │
           │                  │   agent)         │
           ├──────────────────┼──────────────────┤
           │  Google          │  PR-Agent        │
           │  Tricorder       │  GH Copilot      │
           │  (rule-based     │  Code Review     │
           │   static)        │  AWS CodeGuru    │
           │                  │  (generative)    │
           └──────────────────┼──────────────────┘
                              ▼
                   Reactive / Generative

          ◄─── Rules / Static          LLM / RAG ───►

The top-right quadrant — predictive + LLM/RAG/agent-driven — is the unoccupied space this project fills. See docs/design-doc.md §Why This Gap Exists for the full prior-art breakdown and honest caveats.

Where this fits — Agentic SDLC System

This repo is Node #1 of a 5-agent Agentic SDLC System: a multi-agent platform that uses LLMs and agentic AI to automate end-to-end software-engineering workflows and measure their impact in DORA terms. Node #1 (Pre-Merge Risk Workflow) ships here in production shape. Nodes #2–#5 are on the roadmap with concrete interfaces and are deliberately not implemented yet, so that one node can be built in depth instead of five nodes being built shallowly.

         ┌─ ★ Node 1: Pre-Merge Risk Workflow  ← THIS REPO (shipped)
         │
SDLC ────┼─   Node 2: Build Failure Triage     ← roadmap (NVIDIA MTTR lever)
Workflow │    Node 3: Smart Test Selection     ← roadmap
Agent    │    Node 4: Release Readiness        ← roadmap
         │    Node 5: Cross-Team Dependency    ← roadmap (HW-SW codesign)
         │
         └─ shared: Orchestrator · Heterogeneous RAG (A/B/C) · Model Gateway
                    FeedbackLog · Audit Store · Action Surface · DORA loop

Full vision, per-node interfaces, NVIDIA-IPP mapping, and prioritization in docs/agentic-sdlc-architecture.md.

Architecture (in brief)

                git diff + PR metadata (+ codeowners, history)
                          |
                          v
              +----------------------------+
              | Multi-Agent Harness        |
              |   (Claude Agent SDK)       |
              |     - diff-analyzer        |  <- structural shape of the diff
              |     - ownership-mapper     |  <- reviewers + bus-factor risk
              |     - agent-pr-auditor     |  <- detects AI-authored PRs + agent-specific risk
              |     - test-impact-scout    |  <- which tests cover the change
              |     - historical-context   |  <- RAG over similar past PRs
              +-------------+--------------+
                            |
                            v
              +----------------------------+
              | Multi-Vendor Model Gateway |
              |     - Claude (judge)       |
              |     - NVIDIA NIM           |
              |     - Triton-served NeMo   |
              |     - Azure OpenAI         |
              +-------------+--------------+
                            |
                            v
              +----------------------------+
              | Policy Gatekeeper          |  <- score -> action (4 bands)
              |   + Explanation Writer     |  <- markdown PR comment
              +-------------+--------------+
                            |
                            v
                Risk Score + Reasoning + Action
                + Audit Log (Mongo/MySQL/ES)
                + DORA Impact Dashboard
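
The first stage of the diagram (sub-agents producing reports that are combined into one result) can be sketched as follows. This is not the Claude Agent SDK harness from the repo: the types, the stub agents, their scoring constants, and the max-based aggregation are all placeholders, since the README does not specify how sub-agent signals are combined.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SubAgentReport:
    agent: str
    risk: float          # assumed 0-100 contribution from this sub-agent
    evidence: list[str]


# A sub-agent is modelled as a plain function from PR input to a report.
SubAgent = Callable[[str, dict], SubAgentReport]


def run_harness(diff: str, metadata: dict, agents: list[SubAgent]) -> dict:
    """Run every sub-agent, then combine their reports into one result."""
    reports = [agent(diff, metadata) for agent in agents]
    # Placeholder aggregation: the highest sub-agent risk wins.
    score = max((r.risk for r in reports), default=0.0)
    evidence = [line for r in reports for line in r.evidence]
    return {"riskScore": score, "evidence": evidence,
            "reports": [r.agent for r in reports]}


def diff_analyzer(diff: str, metadata: dict) -> SubAgentReport:
    """Stub: scores 10 points per changed file, capped at 100."""
    files = diff.count("diff --git ")
    return SubAgentReport("diff-analyzer", min(100.0, 10.0 * files),
                          [f"{files} file(s) changed"])


def ownership_mapper(diff: str, metadata: dict) -> SubAgentReport:
    """Stub: flags a fixed risk when no CODEOWNERS mapping is supplied."""
    has_owners = bool(metadata.get("codeowners"))
    return SubAgentReport("ownership-mapper", 0.0 if has_owners else 40.0,
                          ["CODEOWNERS provided" if has_owners else "no CODEOWNERS"])


result = run_harness("diff --git a/x.py b/x.py\n", {},
                     [diff_analyzer, ownership_mapper])
print(result["riskScore"])  # 40.0
```

Modelling each sub-agent as a function with a shared report type keeps the harness independent of what any individual agent does, which is one way to let agents be added or swapped without touching the policy stage.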

Tech stack

Agent harness: Claude Agent SDK, MCP tool federation (NVIDIA-native alternative: AIQ Toolkit + NeMo Retriever — see docs/design-doc.md)
Fine-tuning: NVIDIA NeMo + LoRA (Mistral-7B-v0.3 base)
Classical-ML baseline: NVIDIA RAPIDS cuML (GBDT on engineered features; sklearn CPU fallback for dev)
Inference optimization: NVIDIA TensorRT-LLM (engine compilation for the Mistral adapter)
Serving: NVIDIA Triton Inference Server, NVIDIA NIM
Evaluation: pytest, NVIDIA Garak (red-teaming)
Safety: NVIDIA NeMo Guardrails
Backend: Python, FastAPI
Dashboard: Streamlit
Storage: MongoDB / MySQL / Elasticsearch (multi-backend audit log + RAG index — see src/storage/audit_store.py)
CI: GitHub Actions (eval-gated deploys)

Try it locally

git clone https://github.com/mingdongt/commit-risk-scorer
cd commit-risk-scorer
pip install -e .                  # installs deps declared in pyproject.toml

# Library demo — full sub-agents -> policy -> markdown PR comment pipeline
python -m src.agent.harness

# HTTP service — same pipeline behind a FastAPI endpoint
uvicorn src.serving.api:app --port 8000
#   GET  /health
#   POST /score  { "diff": "...", "metadata": { "codeowners": {...} } }

# DORA impact dashboard — Streamlit (v0.1 simulated data; v0.2 reads audit-store)
streamlit run src/metrics/dora_dashboard.py

# Build the GitHub PR / CI training dataset (requires $GITHUB_TOKEN for useful volume)
python -m src.data.scrape_github_prs --repos kubernetes/kubernetes django/django \
    --max-prs-per-repo 100 --output data/raw/github_prs.jsonl

# Tests — 86 across agent / policy / explainer / API / red-team / audit-store /
#         scraper / dashboard / fine-tune
pytest tests/
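
Once the HTTP service is running, the POST /score endpoint can be called from Python with only the standard library. The sketch below uses the request body shape shown in the comments above; the function names, the default base URL, the assumption that the response is JSON, and the shape of the `codeowners` mapping in the usage comment are illustrative guesses, not the documented API.

```python
import json
import urllib.request


def build_score_request(diff: str, codeowners: dict,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /score request using the body shape shown above."""
    body = {"diff": diff, "metadata": {"codeowners": codeowners}}
    return urllib.request.Request(
        url=f"{base_url}/score",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score_pr(diff: str, codeowners: dict) -> dict:
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_score_request(diff, codeowners)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Usage (requires the uvicorn server from the step above to be running):
#   report = score_pr(open("change.diff").read(), {"src/auth/": ["@security-team"]})
```

Splitting request construction from sending keeps the payload shape checkable without a live server.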

Repository structure

.
├── README.md                       <- you are here
├── LICENSE                          <- Apache 2.0
├── docs/
│   ├── design-doc.md                <- motivation, prior art, architecture
│   ├── onboarding.md                <- adoption guide for teams using this
│   ├── runbook.md                   <- what to do when the agent misfires
│   ├── postmortem-template.md
│   └── metrics.md                   <- DORA metric definitions
├── src/
│   ├── agent/                       <- Claude Agent SDK harness
│   ├── models/                      <- model gateway, NeMo fine-tune
│   ├── eval/                        <- pytest eval suite, Garak probes
│   ├── serving/                     <- FastAPI + Triton client
│   └── metrics/                     <- DORA dashboard
├── tests/                           <- pytest suite (regression-gated CI)
├── data/                            <- labeled commits (gitignored)
├── notebooks/                       <- exploration, baselines, fine-tune logs
└── .github/workflows/               <- GitHub Actions: eval.yml runs pytest on push + PR

Initial Results — pipeline validation across baselines

Two pipelines have been validated end-to-end on subsamples of CodeXGLUE Devign. This section validates the pipeline and surfaces trade-offs; it makes no capability claims. The meaningful comparison will come from the production-target column in the table below, which is still pending.

Metric	DistilBERT + LoRA (HF PEFT smoke)	cuML GBDT baseline (sklearn-fallback)	Mistral-7B-v0.3 + LoRA via NeMo (production target)
Status	✅ Validated, CPU smoke	✅ Validated, CPU fallback	⏳ Pending CUDA + base-model conversion
F1	0.383	0.436	—
Precision	0.368	0.494	—
Recall	0.398	0.390	—
Accuracy	0.473	0.570	—
AUC-ROC	0.466	0.545	—

DistilBERT LoRA: 300 examples/split, 1 epoch, ~740 K trainable params (rank-8). CPU only.
GBDT baseline: 500 examples/split, 10 engineered features (LOC, alloc/free, pointer arithmetic, branch/loop counts, etc.). CPU sklearn fallback (the script auto-detects RAPIDS cuML on a CUDA box).
Mistral-7B-v0.3 production target: full Devign + ~1 k self-labeled GitHub PR/CI scrapes, followed by TensorRT-LLM engine compilation for Triton serving. Code in src/models/finetune/train_nemo.py; blocked on CUDA environment.

Findings:

On this subsample size, the engineered-feature GBDT beats the LoRA-tuned DistilBERT on F1, precision, accuracy, and AUC-ROC; recall is effectively tied (0.390 vs 0.398). The baseline coming out ahead isn't a defect; it's the point: simple features go a long way on small data, and DistilBERT is the wrong base (not pre-trained on code, and ~300 samples are insufficient).
DistilBERT's AUC-ROC of 0.466 is below the 0.5 chance level; the GBDT's 0.545 is only slightly above chance, but it is the first sign of discriminative signal.
Implication for production: don't use DistilBERT. The Mistral-7B-v0.3 path via NeMo (full dataset + code-pretrained base) is the right next step. The GBDT remains useful as a fast classifier ensemble component — and is the always-on T1 gate in the Tiered Router.

Raw metrics: data/models/smoke/smoke_metrics.json · data/models/baselines/cuml_gbdt_metrics.json.
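
As a quick sanity check, the F1 values in the table can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The helper below is a sketch written for this README, not code from the repo; because the inputs are already rounded to three decimals, a small tolerance is allowed.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the results table.
reported = {
    "DistilBERT + LoRA": (0.368, 0.398, 0.383),
    "GBDT baseline": (0.494, 0.390, 0.436),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    # Inputs are rounded to three decimals, so allow a small tolerance.
    assert abs(recomputed - f1) < 0.002, name
    print(f"{name}: recomputed F1 = {recomputed:.3f}, reported = {f1:.3f}")
```

Both rows agree with the table to within rounding, which suggests the reported metrics are internally consistent.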

Status

Active development. Public technical artifact. See docs/design-doc.md for current scope and open questions.

License

Apache License 2.0 — see LICENSE.

License intentionally aligned with NVIDIA's open-source AI ecosystem (NeMo, Triton, Garak, NeMo Guardrails) for ecosystem coherence and contributor friendliness.

Author

Built by Mingdong (Eric) Tan. github.com/mingdongt · linkedin.com/in/mingdongt · mingdongtan6@gmail.com
