Skip to content

Automate corpora testing in CI#4927

Open
mustansir14 wants to merge 45 commits intomainfrom
hackathon/detector-tests-in-ci
Open

Automate corpora testing in CI#4927
mustansir14 wants to merge 45 commits intomainfrom
hackathon/detector-tests-in-ci

Conversation

@mustansir14
Copy link
Copy Markdown
Contributor

@mustansir14 mustansir14 commented Apr 29, 2026

Motivation

When adding or modifying a detector, the key question is: how much noise will this regex produce against real-world code? Too many false positives means alert fatigue; a regex that's too tight misses real secrets.

Previously this was a fully manual process — download a large corpus locally, run the pipeline, inspect the DuckDB output. There was no enforcement in CI, so it was easy to skip or forget, especially under time pressure.

This PR automates it. Every PR that touches detector code now gets a comment showing exactly how many unique matches the changed detector produces, compared to the main baseline. The comment updates in place on every push, so the PR timeline stays clean.

What it tells you

The bench scans a corpus of real-world public code (S3 datasets, ~35 GB in total compressed) using only the detectors changed in the PR, with verification disabled. It reports unique match counts for the PR build vs. the main baseline:

  • New matches — the regex picks up strings main didn't; could be intentional (broader coverage) or noise (too loose)
  • Removed matches — the regex narrowed; could be intentional (tighter pattern) or a regression (missing real secrets)

PRs that add a new detector will see a 🆕 row with an absolute match count and no baseline comparison.

Example output

image

What changed

  • .github/workflows/detector-corpora-test.yml — new workflow; triggers on PRs touching pkg/detectors/**; detects which detectors changed, builds PR and main binaries in parallel, runs both scans against the corpus concurrently, posts a sticky comment with the diff
  • scripts/test/detect_changed_detectors.sh — resolves changed detector directories to their proto enum names for --include-detectors scoping; skips detectors whose diff doesn't touch regex patterns or Keywords() so PRs that only change verification or redaction logic don't trigger a bench run
  • scripts/test/detector_corpora_test.sh — streams corpus files (S3 or local), runs trufflehog with --no-verification, outputs JSONL
  • scripts/test/diff_corpora_results.py — diffs two JSONL result sets and renders the Markdown report posted to the PR

Performance

The naive implementation was slow. Two full scans (PR + main) each streamed the entire corpus independently from S3 — on a 35 GB dataset that meant two downloads, two decompressions, two jq passes, and two trufflehog runs, serialized per dataset file. First runs took ~50 minutes.

Three optimizations brought first runs down to ~38 minutes and subsequent pushes to the same PR down to ~30 minutes:

Main scan caching — The main binary scans the same commit (the merge base) on every push to a PR, as long as no rebase happens. We cache /tmp/results-main.jsonl in GitHub Actions keyed by merge-base SHA + scoped detector set. On subsequent pushes without a rebase, the entire main side (worktree checkout, go build, S3 download, scan) is skipped entirely.

Single S3 stream with tee — When the main scan does need to run (cache miss), both PR and main binaries now consume the same S3 stream via tee >(main_binary stdin). S3 is downloaded and decompressed once; both scans process the content in parallel at the OS pipe level.

Scoped scanning — Only the detectors changed in the PR are passed via --include-detectors. Scanning the full detector set against a large corpus is most of the work; scoping to 1–3 changed detectors cuts runtime proportionally.

Fork PRs

Excluded for fork PRs — S3 credentials are not available to fork-originated workflows. Maintainers need to run this manually for forked PRs.

Running locally

./scripts/test/detector_corpora_test.sh /path/to/contents.jsonl.zstd

Results land in /tmp/corpora_results.jsonl with a DuckDB summary table printed to stdout.


Note

Medium Risk
Adds a new CI workflow that pulls large corpora from S3 using repository secrets and posts sticky PR comments, so failures/mis-scoping or credential issues can impact PR checks and leak operational signals (though no product runtime paths change).

Overview
Adds an automated detector corpora regression benchmark in CI that runs on PRs touching detector code and posts a sticky <!-- detector-bench --> PR comment with unique-match deltas (PR vs merge-base main) for the impacted detectors.

Introduces scripts to (1) detect which detectors actually changed matching behavior (regex/Keywords()), (2) stream one or more zstd JSONL corpora files (local or s3://) through trufflehog with --include-detectors and --no-verification, and (3) diff the two JSONL outputs into a Markdown report (including handling for new detectors with no baseline).

Optimizes CI runtime by caching the baseline scan keyed by merge-base SHA + detector set, building main and PR binaries in parallel, and teeing a single corpus stream to both binaries when a baseline scan is needed; also updates .gitignore to ignore Python bytecode artifacts.

Reviewed by Cursor Bugbot for commit 6c3bbae. Bugbot is set up for automated code reviews on this repo. Configure here.

@mustansir14 mustansir14 requested a review from a team April 29, 2026 08:22
Comment thread .github/workflows/detector-corpora-test.yml Outdated
Comment thread .github/workflows/detector-corpora-test.yml Outdated
@trufflesecurity trufflesecurity deleted a comment from github-actions Bot Apr 29, 2026
Comment thread .github/workflows/detector-corpora-test.yml Outdated
Comment thread .github/workflows/detector-corpora-test.yml Outdated
Comment thread scripts/test/detector_corpora_test.sh
@shahzadhaider1 shahzadhaider1 requested a review from a team as a code owner April 29, 2026 13:16
Comment thread pkg/detectors/stripe/stripe.go Outdated
The bench uses --no-verification, so the engine's overlap-path dedup
(which exists to protect verifiers from duplicate calls) adds noise
without value here — it causes shifts in unrelated detectors when only
one detector's regex changes. Pair --allow-verification-overlap with
--no-verification so each detector's regex behavior is measured
independently.

Also fix the false 'no diff vs main' claim that triggered when
NEW/REMOVED were zero but total counts differed.
Comment thread pkg/detectors/jdbc/jdbc.go Outdated
Comment thread scripts/diff_corpora_results.py Outdated
Comment thread scripts/test/detector_corpora_test.sh
Comment thread scripts/test/detect_changed_detectors.sh
Comment thread pkg/detectors/acmevault/eraser.go Outdated
Comment thread pkg/detectors/acmevault/eraser.go Outdated
@mustansir14 mustansir14 marked this pull request as draft April 30, 2026 08:15
shahzadhaider1 and others added 2 commits May 2, 2026 00:27
awk's END block doesn't run when trufflehog exits before draining stdin
(SIGPIPE kills awk first), leaving the bytes file empty and breaking the
step with a `$((TOTAL_BYTES + ))` syntax error. Read the file with a
default of 0 and validate it's an integer before arithmetic. Also fold
unzstd/jq stderr into STDERR_FILE so benign Broken pipe notices stay
out of CI logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pection

Static AST parse of a detector package to extract the strings returned by
its Keywords() method. Used by the upcoming keyword-corpus builder to fan
out per-detector GitHub Code Search queries during the corpora bench.

AST-first because each detector lives in its own package; importing them
dynamically would require codegen or `plugin`. Falls back to a regex over
the function body, then a directory-wide grep, when AST resolution can't
statically resolve the return value (helper calls, build-tagged variants).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread pkg/engine/defaults/defaults.go
Comment thread scripts/test/diff_corpora_results.py
Comment thread scripts/test/detect_changed_detectors.sh
@trufflesecurity trufflesecurity deleted a comment from github-actions Bot May 4, 2026
Comment thread .github/workflows/detector-corpora-test.yml
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Corpora Test Results

No detector regex or keyword changes in this PR. Bench skipped.

Comment thread scripts/test/detector_corpora_test.sh
Comment thread scripts/test/diff_corpora_results.py
Comment thread scripts/test/diff_corpora_results.py
@mustansir14 mustansir14 marked this pull request as ready for review May 4, 2026 15:08
Comment thread scripts/test/detector_corpora_test.sh
@shahzadhaider1 shahzadhaider1 requested a review from bradlarsen May 4, 2026 16:00
Copy link
Copy Markdown
Contributor

@bradlarsen bradlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks useful!

Comment thread scripts/test/detector_corpora_test.sh Outdated
Comment on lines +126 to +128
SUM(CASE WHEN Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) verified,
SUM(CASE WHEN NOT Verified AND VerificationError IS NULL THEN 1 ELSE 0 END) unverified,
SUM(CASE WHEN VerificationError IS NOT NULL THEN 1 ELSE 0 END) \"unknown\"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running with --no-verification above makes these values not meaningful, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Thanks for catching this. Will remove


for CORPORA_FILE in "$@"; do
if [[ "$CORPORA_FILE" == s3://* ]]; then
aws s3 cp "$CORPORA_FILE" - | scan /dev/stdin
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: though stdin is likely fine for this kind of testing, the newer json-enumerator input source in TruffleHog might make this a bit more straightforward. That input source expects NDJSON where each value looks like this:

{"data": "utf8 string to scan", "metadata": <arbitrary JSON value>}

or this:

{"data_b64": "base64-encoded bytestring to scan", "metadata": <arbitrary JSON value>}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, json-enumerator would give better chunk isolation and carry path metadata through. But in this context we're running with --no-verification and the metadata isn't used downstream, so the practical difference is minimal. Keeping stdin for now to avoid any unexpected behavioral changes, but happy to revisit if there's a specific benefit you had in mind.

Comment on lines +64 to +106
local rc=0
if [[ -n "${TRUFFLEHOG_BIN_MAIN:-}" ]]; then
# Single S3 download teed to both binaries simultaneously.
unzstd -c "$input" 2>> "$STDERR_FILE" \
| jq -r .content 2>> "$STDERR_FILE" \
| tee >(
"${TRUFFLEHOG_BIN_MAIN}" \
--no-update \
--no-verification \
--allow-verification-overlap \
--log-level=3 \
--concurrency=8 \
--json \
"${main_include_flag[@]}" \
stdin >> "${OUTPUT_JSONL_MAIN}" 2>> "$STDERR_FILE"
) \
| "$TRUFFLEHOG_BIN" \
--no-update \
--no-verification \
--allow-verification-overlap \
--log-level=3 \
--concurrency=8 \
--json \
--print-avg-detector-time \
"${INCLUDE_FLAG[@]}" \
stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
rc=$?
wait
else
unzstd -c "$input" 2>> "$STDERR_FILE" \
| jq -r .content 2>> "$STDERR_FILE" \
| "$TRUFFLEHOG_BIN" \
--no-update \
--no-verification \
--allow-verification-overlap \
--log-level=3 \
--concurrency=8 \
--json \
--print-avg-detector-time \
"${INCLUDE_FLAG[@]}" \
stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
rc=$?
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I last used trufflehog stdin like this, I found that it would timeout after 1 minute, leaving you with an incomplete scan when dealing with large inputs. The workaround was, confusingly, to specify --archive-timeout=6h (or some similarly large value).

You might want to check that the scan is not being terminated early! This is another reason why you might prefer the json-enumerator input source, which doesn't have this timeout gotcha.

Copy link
Copy Markdown
Contributor Author

@mustansir14 mustansir14 May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is an issue in here. The tests consistently go 30+ minutes and successfully complete without truncation. I think it's because we decompress the corpus ourselves with unzstd before piping to stdin, so TruffleHog never sees a compressed archive and the archive timeout never applies.

s3://trufflehog-corpora-datasets/contents.2025-11-04.jsonl.zstd
s3://trufflehog-corpora-datasets/contents.jsonl.zstd

jobs:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, some of the actions used here are old versions. Also, you might consider pinning the action versions used here to reduce risk of possible supply-chain attacks.

zizmor is helpful: https://docs.zizmor.sh/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. Really helpful! I'll do the needful

Comment on lines +5 to +12
pull_request:
paths:
- 'pkg/detectors/**'
- 'pkg/engine/defaults/defaults.go'
- '.github/workflows/detector-corpora-test.yml'
- 'scripts/test/detector_corpora_test.sh'
- 'scripts/test/diff_corpora_results.py'
- 'scripts/test/detect_changed_detectors.sh'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the CPU work required by a single run of this workflow is pretty expensive (30+ minutes), is this something we want running automatically on pull requests (as it looks like it does here), or only as an opt-in workflow?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The workflow is scoped to only trigger when a PR modifies regex patterns or Keywords() in a detector. Purely structural changes (verification logic, redaction, comments, etc.) are filtered out and skip the bench entirely. In practice this means it only runs on PRs that actually affect match behavior.

We also don't currently merge a detector without running this test manually, so automating it in CI seems like the right call. it ensures the check never gets skipped and gives reviewers the data they need without having to ask for it.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 6c3bbae. Configure here.

"${INCLUDE_FLAG[@]}" \
stdin >> "$OUTPUT_JSONL" 2>> "$STDERR_FILE"
rc=$?
wait
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main binary failure in process substitution silently ignored

Low Severity

In dual-binary mode, the main binary runs inside a tee >(...) process substitution whose exit code is never checked. If it crashes or exits early, wait runs under set +e so the failure is silently swallowed. Worse, if tee receives SIGPIPE from the broken process substitution pipe, it may also die, truncating input to the PR binary — producing silently incomplete results on both sides with no error surfaced.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6c3bbae. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants