A shift-left engineering intelligence agent that predicts PR risk, recommends reviewer / test / gate actions, and closes the loop through CI, telemetry, and DORA-style engineering metrics. Built on NVIDIA's open AI stack with a hybrid predictive pipeline (FT classifier + LLM judge).
📚 New here? See docs/README.md, a documentation index organized by audience (hiring manager / engineer / adopter) and by time available (5 / 15 / 30 / 60 min). It's the recommended entry point for anyone going deeper than this README.
Input: PR diff + metadata (author, files, target branch) + build/test history + ownership signals.
Output:
- Risk score (0–100) and risk level (Low / Medium / High / Critical)
- Top risk factors — evidence-backed (file-ownership gaps, weak test coverage, historically failing areas, deployment blast radius)
- Recommended actions — reviewer assignment, test suite to run, gate decision (not just a numeric signal)
- DORA-style impact telemetry — cycle time, change failure rate, MTTR, adoption, FP/FN feedback. (OSS deployment uses a DORA-aligned eval harness with replayed / simulated data — real impact numbers require deployment in a real org. Disclosed in
docs/limitations.md §9 and docs/metrics.md §Estimation honesty.)
The risk score is not the product; the action is. The score feeds into a policy decision surface:
| Score | Level | Action |
|---|---|---|
| 0–20 | Low | Fast-track / normal review |
| 21–50 | Medium | Add code-owner reviewer + targeted tests |
| 51–80 | High | Require SME review + extended CI |
| 81–100 | Critical | Block merge / manual gate |
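
The band table above can be read as a small lookup. The following is a minimal sketch of that mapping, not the repo's actual policy gatekeeper: the names `PolicyDecision` and `decide` are hypothetical, and it assumes that a fractional score falling between two bands (for example 20.5) belongs to the higher band.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyDecision:
    level: str
    action: str


# (inclusive upper bound, level, action), copied from the band table.
_BANDS = [
    (20, "Low", "Fast-track / normal review"),
    (50, "Medium", "Add code-owner reviewer + targeted tests"),
    (80, "High", "Require SME review + extended CI"),
    (100, "Critical", "Block merge / manual gate"),
]


def decide(score: float) -> PolicyDecision:
    """Map a 0-100 risk score to its policy band (hypothetical helper)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in [0, 100], got {score}")
    for upper, level, action in _BANDS:
        # Assumption: anything above one band's upper bound falls into the next band.
        if score <= upper:
            return PolicyDecision(level, action)
    raise AssertionError("unreachable: the bands cover 0-100")
```

With these bands, `decide(72)` yields the High band and its "Require SME review + extended CI" action.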
{
"riskScore": 72,
"riskLevel": "High",
"topRiskFactors": [
"Touches auth middleware (high-incident area)",
"No test coverage for the modified branch",
"Similar historical PRs caused CI failures"
],
"recommendedActions": [
"Add security / code-owner reviewer",
"Run extended integration test suite",
"Block auto-merge until reviewer approval"
],
"confidence": 0.81,
"evidence": [
"Changed file: src/auth/token_validator.py — owned by @security-team",
"Historical match: PR #1842 failed `test_auth_session_refresh`",
"Test impact: 0 of 12 covering tests modified"
]
}

Designed to run as a CI check on every PR, providing pre-merge predictive signal that drives policy decisions, not just numeric scores. See docs/evaluation.md for how each layer is measured and docs/enterprise-safety.md for the production-safety controls.
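
As a rough illustration of how a CI step might turn the JSON payload above into a PR comment, the sketch below renders the documented fields as markdown. The function name `render_pr_comment` and the comment layout are assumptions made for this example; the verbatim format the agent actually posts is in demo/output.md.

```python
import json


def render_pr_comment(result: dict) -> str:
    """Render a risk payload as a markdown comment (hypothetical layout)."""
    lines = [
        f"### Risk: {result['riskLevel']} ({result['riskScore']}/100)",
        f"Confidence: {result['confidence']:.2f}",
        "",
        "**Top risk factors**",
    ]
    lines += [f"- {item}" for item in result["topRiskFactors"]]
    lines += ["", "**Recommended actions**"]
    lines += [f"- {item}" for item in result["recommendedActions"]]
    lines += ["", "**Evidence**"]
    lines += [f"- {item}" for item in result["evidence"]]
    return "\n".join(lines)


# Trimmed copy of the example payload, using the same field names.
payload = json.loads("""
{
  "riskScore": 72,
  "riskLevel": "High",
  "topRiskFactors": ["No test coverage for the modified branch"],
  "recommendedActions": ["Run extended integration test suite"],
  "confidence": 0.81,
  "evidence": ["Test impact: 0 of 12 covering tests modified"]
}
""")

comment = render_pr_comment(payload)
```

Keeping the renderer a pure function of the payload means the same JSON can feed a PR comment, an audit log entry, or a dashboard without re-running the scoring pipeline.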
The repo ships a runnable demo that walks three real-shaped PRs through the
full pipeline (5 sub-agents → policy gatekeeper → markdown PR comment). The
captured output lives in demo/output.md; regenerate it
with python -m demo.run_demo > demo/output.md.
| Scenario | What it is | Risk score | Risk level | Action |
|---|---|---|---|---|
| A | README typo fix by a regular contributor | 0.00 | 🟢 Low | fast_track |
| B | 4-file refactor inside src/auth/ (sensitive path), CODEOWNERS provided | 0.46 | 🟡 Medium | owner_review |
| C | Bot-authored 8-file mechanical refactor; PR description claims paths absent from the diff (prompt-vs-diff drift) | 1.00 | 🔴 Critical | block_merge |
The output that's posted on the PR is real markdown — see
demo/output.md for the verbatim agent comments for each
scenario, including evidence and sub-agent reports.
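
Scenario C hinges on prompt-vs-diff drift: the PR description mentions paths that the diff never touches. One simple way to approximate that check is sketched below. It is an illustration only, not the agent-pr-auditor's real logic; the function name and the regular expression for path-like tokens are assumptions.

```python
import re

# Path-like token: at least one "/" and a file extension, e.g. src/auth/session.py
_PATH_RE = re.compile(r"\b[\w.-]+(?:/[\w.-]+)+\.\w+\b")


def claimed_paths_missing_from_diff(description: str, changed_files: list[str]) -> list[str]:
    """Return paths named in the PR description that are not among the changed files."""
    claimed = set(_PATH_RE.findall(description))
    return sorted(claimed - set(changed_files))


# Hypothetical PR: the description claims a billing change the diff does not contain.
description = "Refactors src/auth/session.py and updates src/billing/invoice.py."
changed_files = ["src/auth/session.py", "src/auth/token_validator.py"]
drift = claimed_paths_missing_from_diff(description, changed_files)
```

A non-empty result is only a signal, not proof of a problem: descriptions legitimately mention files for context, so a real auditor would likely treat this as one piece of evidence among several.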
Motivation. Built as both a working tool and a public showcase of how I
approach enterprise AI tooling. The artifact requires hands-on engagement with
NVIDIA's open AI stack (NeMo, Triton, NIM, Garak, NeMo Guardrails) and the
operating discipline of internal platform teams (eval-gated CI, runbooks,
postmortems, partner-team onboarding). Honest framing in
docs/notes/why-this-project.md.
Market positioning. Existing solutions occupy one corner of the design space:
| Tool | Approach | Limitation |
|---|---|---|
| PR-Agent (Codium) | Generative LLM review | No predictive scoring or calibration |
| CodeBERT / Devign | Trained classifier | No reasoning or agent integration |
| NVIDIA Garak | LLM red-teaming | Not specialized for code-review agents |
This project is the integration that unifies all three. Adjacent big-tech tools cover individual quadrants — Google Tricorder for static analysis, Meta Sapienz / Getafix for test generation and automated repair, Microsoft TestImpact for test selection, Microsoft CloudBuild for traditional-ML build-failure prediction, Amazon CodeGuru and GitHub Copilot Code Review for generative review — but none combines predictive scoring + LLM agent orchestration + commit-history RAG in a single open-source artifact. Academically the predictive side is well established (DeepJIT, CC2Vec, JITLine, PROMISE benchmark), but no open-source project integrates it with the rest.
The landscape mapped to four quadrants:
Predictive
▲
│
┌──────────────────┼──────────────────┐
│ MS CloudBuild │ THIS PROJECT │
│ failure-pred │ commit-risk- │
│ MS TestImpact │ scorer │
│ (classical ML) │ (LLM + RAG + │
│ │ agent) │
├──────────────────┼──────────────────┤
│ Google │ PR-Agent │
│ Tricorder │ GH Copilot │
│ (rule-based │ Code Review │
│ static) │ AWS CodeGuru │
│ │ (generative) │
└──────────────────┼──────────────────┘
▼
Reactive / Generative
◄─── Rules / Static LLM / RAG ───►
The top-right quadrant — predictive + LLM/RAG/agent-driven — is the unoccupied space this project fills. See docs/design-doc.md §Why This Gap Exists for the full prior-art breakdown and honest caveats.
This repo is Node #1 of a 5-agent Agentic SDLC System, a multi-agent platform that uses LLMs and agentic AI to automate end-to-end software-engineering workflows and measure their impact in DORA terms. Node #1 (Pre-Merge Risk Workflow) ships here in production shape; Nodes #2–#5 are scoped as roadmap items with concrete interfaces and are deliberately not implemented yet, so that this one node stays deep instead of the whole system staying shallow.
┌─ ★ Node 1: Pre-Merge Risk Workflow ← THIS REPO (shipped)
│
SDLC ────┼─ Node 2: Build Failure Triage ← roadmap (NVIDIA MTTR lever)
Workflow │ Node 3: Smart Test Selection ← roadmap
Agent │ Node 4: Release Readiness ← roadmap
│ Node 5: Cross-Team Dependency ← roadmap (HW-SW codesign)
│
└─ shared: Orchestrator · Heterogeneous RAG (A/B/C) · Model Gateway
FeedbackLog · Audit Store · Action Surface · DORA loop
Full vision, per-node interfaces, NVIDIA-IPP mapping, and prioritization
in docs/agentic-sdlc-architecture.md.
git diff + PR metadata (+ codeowners, history)
|
v
+----------------------------+
| Multi-Agent Harness |
| (Claude Agent SDK) |
| - diff-analyzer | <- structural shape of the diff
| - ownership-mapper | <- reviewers + bus-factor risk
| - agent-pr-auditor | <- detects AI-authored PRs + agent-specific risk
| - test-impact-scout | <- which tests cover the change
| - historical-context | <- RAG over similar past PRs
+-------------+--------------+
|
v
+----------------------------+
| Multi-Vendor Model Gateway |
| - Claude (judge) |
| - NVIDIA NIM |
| - Triton-served NeMo |
| - Azure OpenAI |
+-------------+--------------+
|
v
+----------------------------+
| Policy Gatekeeper | <- score -> action (4 bands)
| + Explanation Writer | <- markdown PR comment
+-------------+--------------+
|
v
Risk Score + Reasoning + Action
+ Audit Log (Mongo/MySQL/ES)
+ DORA Impact Dashboard
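
The fan-out stage at the top of the diagram can be sketched in plain Python as a list of sub-agents that each return a report, followed by an aggregation step. This is a simplified illustration and does not use the Claude Agent SDK: the dataclasses, the two toy heuristics, and the unweighted-mean aggregation are all assumptions made for the example, and only the sub-agent names come from the diagram.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PullRequest:
    diff: str
    metadata: dict = field(default_factory=dict)


@dataclass
class SubAgentReport:
    name: str
    risk: float            # 0.0 (no concern) to 1.0 (maximum concern)
    evidence: list[str]


SubAgent = Callable[[PullRequest], SubAgentReport]


def diff_analyzer(pr: PullRequest) -> SubAgentReport:
    # Toy heuristic: count added/removed lines, ignoring the +++/--- file headers.
    changed = sum(
        1 for line in pr.diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    return SubAgentReport("diff-analyzer", min(changed / 200, 1.0),
                          [f"{changed} changed lines"])


def ownership_mapper(pr: PullRequest) -> SubAgentReport:
    # Toy heuristic: missing CODEOWNERS data counts as a moderate risk.
    owners = pr.metadata.get("codeowners") or {}
    return SubAgentReport("ownership-mapper", 0.0 if owners else 0.5,
                          [f"{len(owners)} owner rules supplied"])


def run_pipeline(pr: PullRequest, agents: list[SubAgent]) -> dict:
    """Run every sub-agent and combine the reports into one 0-100 score."""
    if not agents:
        raise ValueError("at least one sub-agent is required")
    reports = [agent(pr) for agent in agents]
    # Assumption: unweighted mean of sub-agent risks, scaled to 0-100.
    score = round(100 * sum(r.risk for r in reports) / len(reports))
    return {
        "riskScore": score,
        "evidence": [item for r in reports for item in r.evidence],
        "reports": reports,
    }
```

Treating each sub-agent as a plain callable keeps them independently testable, and the remaining three (agent-pr-auditor, test-impact-scout, historical-context) would slot into the same list.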
- Agent harness: Claude Agent SDK, MCP tool federation (NVIDIA-native alternative: AIQ Toolkit + NeMo Retriever; see docs/design-doc.md)
- Fine-tuning: NVIDIA NeMo + LoRA (Mistral-7B-v0.3 base)
- Classical-ML baseline: NVIDIA RAPIDS cuML (GBDT on engineered features; sklearn CPU fallback for dev)
- Inference optimization: NVIDIA TensorRT-LLM (engine compilation for the Mistral adapter)
- Serving: NVIDIA Triton Inference Server, NVIDIA NIM
- Evaluation: pytest, NVIDIA Garak (red-teaming)
- Safety: NVIDIA NeMo Guardrails
- Backend: Python, FastAPI
- Dashboard: Streamlit
- Storage: MongoDB / MySQL / Elasticsearch (multi-backend audit log + RAG index; see src/storage/audit_store.py)
- CI: GitHub Actions (eval-gated deploys)
git clone https://github.com/mingdongt/commit-risk-scorer
cd commit-risk-scorer
pip install -e . # installs deps declared in pyproject.toml
# Library demo — full sub-agents -> policy -> markdown PR comment pipeline
python -m src.agent.harness
# HTTP service — same pipeline behind a FastAPI endpoint
uvicorn src.serving.api:app --port 8000
# GET /health
# POST /score { "diff": "...", "metadata": { "codeowners": {...} } }
# DORA impact dashboard — Streamlit (v0.1 simulated data; v0.2 reads audit-store)
streamlit run src/metrics/dora_dashboard.py
# Build the GitHub PR / CI training dataset (requires $GITHUB_TOKEN for useful volume)
python -m src.data.scrape_github_prs --repos kubernetes/kubernetes django/django \
--max-prs-per-repo 100 --output data/raw/github_prs.jsonl
# Tests — 86 across agent / policy / explainer / API / red-team / audit-store /
# scraper / dashboard / fine-tune
pytest tests/
.
├── README.md <- you are here
├── LICENSE <- Apache 2.0
├── docs/
│ ├── design-doc.md <- motivation, prior art, architecture
│ ├── onboarding.md <- adoption guide for teams using this
│ ├── runbook.md <- what to do when the agent misfires
│ ├── postmortem-template.md
│ └── metrics.md <- DORA metric definitions
├── src/
│ ├── agent/ <- Claude Agent SDK harness
│ ├── models/ <- model gateway, NeMo fine-tune
│ ├── eval/ <- pytest eval suite, Garak probes
│ ├── serving/ <- FastAPI + Triton client
│ └── metrics/ <- DORA dashboard
├── tests/ <- pytest suite (regression-gated CI)
├── data/ <- labeled commits (gitignored)
├── notebooks/ <- exploration, baselines, fine-tune logs
└── .github/workflows/ <- GitHub Actions: eval.yml runs pytest on push + PR
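
For callers of the HTTP service started with uvicorn in the quick-start commands, a standard-library client for `POST /score` might look like the sketch below. The helper names are hypothetical, the request body follows the `{ "diff": ..., "metadata": { "codeowners": ... } }` shape shown in the quick start, and both the inner structure of `codeowners` and the response schema are assumptions, since this README does not spell them out.

```python
import json
import urllib.request


def build_score_request(diff: str, codeowners: dict,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST /score request."""
    body = {"diff": diff, "metadata": {"codeowners": codeowners}}
    return urllib.request.Request(
        url=f"{base_url}/score",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score_pr(diff: str, codeowners: dict,
             base_url: str = "http://localhost:8000", timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (response schema assumed)."""
    request = build_score_request(diff, codeowners, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting request construction from sending makes the payload easy to unit-test without a running server; `score_pr` needs the service listening on port 8000.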
Two pipelines have been validated end-to-end on subsamples of CodeXGLUE Devign. The point of this section is pipeline validation and trade-off surfacing, not capability claims — see Production target below for the meaningful comparison.
| | DistilBERT + LoRA (HF PEFT smoke) | cuML GBDT baseline (sklearn-fallback) | Mistral-7B-v0.3 + LoRA via NeMo (production target) |
|---|---|---|---|
| Status | ✅ Validated, CPU smoke | ✅ Validated, CPU fallback | ⏳ Pending CUDA + base-model conversion |
| F1 | 0.383 | 0.436 | — |
| Precision | 0.368 | 0.494 | — |
| Recall | 0.398 | 0.390 | — |
| Accuracy | 0.473 | 0.570 | — |
| AUC-ROC | 0.466 | 0.545 | — |
- DistilBERT LoRA: 300 examples/split, 1 epoch, ~740 K trainable params (rank-8). CPU only.
- GBDT baseline: 500 examples/split, 10 engineered features (LOC, alloc/free, pointer arithmetic, branch/loop counts, etc.). CPU sklearn fallback (the script auto-detects RAPIDS cuML on a CUDA box).
- Mistral-7B-v0.3 production target: full Devign + ~1 k self-labeled GitHub PR/CI scrapes, followed by TensorRT-LLM engine compilation for Triton serving. Code in src/models/finetune/train_nemo.py; blocked on CUDA environment.
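
The F1 row in the table can be cross-checked against the precision and recall rows, since F1 is their harmonic mean. The snippet below is a small sanity check written for this README, not part of the repo's eval suite; the inputs are the rounded values from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the results table.
reported = {
    "DistilBERT + LoRA": (0.368, 0.398, 0.383),
    "GBDT baseline": (0.494, 0.390, 0.436),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: recomputed F1 = {recomputed:.3f}, reported = {f1:.3f}")
```

Both recomputed values land within about 0.001 of the reported ones, which is the agreement to expect when the inputs are themselves rounded to three decimals.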
Findings:
- On this subsample size, the engineered-feature GBDT beats the LoRA-tuned DistilBERT on F1, precision, accuracy, and AUC-ROC (DistilBERT keeps a marginal edge on recall, 0.398 vs 0.390). That the simple baseline wins isn't a defect; it's the point: simple features go a long way on small data, and DistilBERT is the wrong base (it is not pre-trained on code, and ~300 samples are insufficient).
- DistilBERT's AUC-ROC of 0.466 is below the 0.5 chance level; the GBDT's 0.545 is the first sign of real discriminative signal, though a weak one at this sample size.
- Implication for production: don't use DistilBERT. The Mistral-7B-v0.3 path via NeMo (full dataset + code-pretrained base) is the right next step. The GBDT remains useful as a fast classifier ensemble component — and is the always-on T1 gate in the Tiered Router.
Raw metrics: data/models/smoke/smoke_metrics.json · data/models/baselines/cuml_gbdt_metrics.json.
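
The idea of the GBDT as an always-on T1 gate can be illustrated with a two-tier router: a cheap scorer runs on every PR, and a heavier scorer is consulted only when the cheap one is not confident. This is a hedged sketch of that general pattern, not the repo's Tiered Router; the `tiered_score` name, the `Scorer` signature, and the 0.2 / 0.8 thresholds are all hypothetical.

```python
from typing import Callable

# A scorer maps engineered features to a probability that the change is risky.
Scorer = Callable[[dict], float]


def tiered_score(features: dict, t1: Scorer, t2: Scorer,
                 low: float = 0.2, high: float = 0.8) -> tuple[float, str]:
    """Return (probability, tier used). T1 always runs; T2 only on uncertain cases."""
    p1 = t1(features)
    # Confident either way: accept the cheap answer and skip the expensive tier.
    if p1 <= low or p1 >= high:
        return p1, "T1"
    # Uncertain middle band: escalate to the heavier scorer.
    return t2(features), "T2"
```

The design trade-off is cost versus coverage: widening the `low`/`high` band sends more PRs to the expensive tier, while narrowing it leans harder on the baseline's calibration.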
Active development. Public technical artifact. See docs/design-doc.md for current scope and open questions.
Apache License 2.0 — see LICENSE.
License intentionally aligned with NVIDIA's open-source AI ecosystem (NeMo, Triton, Garak, NeMo Guardrails) for ecosystem coherence and contributor friendliness.
Built by Mingdong (Eric) Tan. github.com/mingdongt · linkedin.com/in/mingdongt · mingdongtan6@gmail.com