feat: security ci by ShivamBhasin2002 · Pull Request #1 · headout/apache-superset

ShivamBhasin2002 · 2026-05-15T16:37:03Z

Summary

Adds a signature-driven supply-chain malware scanner that runs on every PR.

Walks every commit in the PR (not just final state) via git log + git diff-tree
Two bundled signatures:
- PolinRider RAT (HIGH) — known IOCs: rmcej, numeric constants, obfuscated global[]= assignments; covers JS configs + font binaries
- Generic Config Injection (MEDIUM) — catches child_process, eval, new Function, env-var exfil, base64+eval chains, outbound HTTP, SSH key refs, AWS keys, high-entropy obfuscation
Adding a new attack = drop one YAML file, zero other changes needed
Pure-Python strings fallback so no hard dependency on binutils

File structure

.github/
  workflows/security-scan.yml          ← thin orchestrator
  security/
    scanner.py                          ← Python engine
    signatures/
      pollinrider.yml                   ← PolinRider RAT IOCs (HIGH)
      generic-config-injection.yml      ← broad suspicious patterns (MEDIUM)

Automated rollout — scanner source lives in headout/zapdos#822

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vulcanho

📋 Summary

This PR introduces a signature-driven supply-chain malware scanner with four new files: a GitHub Actions workflow (security-scan.yml), a Python scanner engine (scanner.py), and two YAML signature packs (pollinrider.yml, generic-config-injection.yml). The design — loading signatures from YAML, walking per-commit diffs in Phase 1, and doing a final-state scan in Phase 2 — is sound. However, there are several correctness and security issues: the most critical is that any unhandled Python exception in the scanner causes the workflow to silently pass (continue-on-error absorbs the crash before found is ever written), meaning a scanner failure is indistinguishable from a clean scan. Separately, Phase 1 silently drops all renamed files due to incorrect tab-split parsing of git diff-tree --name-status output. The infrastructure itself also has supply-chain exposure: all three GitHub Actions are pinned to mutable semver tags, and pyyaml is installed without a version pin.

💬 General Feedback

security: The eval_call pattern in generic-config-injection.yml (\beval\s*\() will false-positive on Webpack configs that reference eval in comments explaining devtool source-map options, and on any config file that calls a function ending in eval (e.g., evalSourceMap()). The word boundary \b does not prevent matches inside identifiers where the word ends in eval. Consider tightening to (?<![a-zA-Z])eval\s*\( or accepting that this pattern will require manual triage. Similarly, outbound_http pattern will match on any variable named request, got, or fetch used in Jest/Rollup configs that import those as test utilities — this will produce alerts that erode team trust in the scanner.
code_quality: In _scan_file() (scanner.py line 280), get_text() is called once per matching (sig, target) pair without caching. With two signatures that share overlapping file globs (e.g., both pollinrider.yml and generic-config-injection.yml match eslint.config.js), Phase 1 spawns two separate git show subprocesses for the same commit:path. Memoize the result of the first call per (commit, file_path) pair to avoid redundant subprocess invocations as the signature library grows.

🔀 Architecture

flowchart TD
    PR[PR opened / synchronized] --> WF[security-scan.yml triggered]
    WF --> CO[Checkout full history\nfetch-depth=0]
    CO --> SETUP[Setup Python 3.11\npip install pyyaml]
    SETUP --> SCAN[scanner.py]
    SCAN --> SIGS[Load .github/security/signatures/*.yml\npollinrider.yml + generic-config-injection.yml]
    SIGS --> P1[Phase 1: git log BASE_SHA..HEAD_SHA\nper-commit diff-tree scan]
    P1 --> P2[Phase 2: git diff --name-only\nfinal working-tree scan]
    P2 --> DEDUP[Deduplicate findings by\nsig_id+file_path+pattern_id]
    DEDUP --> DECISION{blocking\nfindings?}
    DECISION -->|found=true| COMMENT[Post PR comment\nactions/github-script]
    COMMENT --> BLOCK[exit 1 — Block merge]
    DECISION -->|found=false| PASS[Workflow passes]
    SCAN -.->|unhandled exception| ABSORB[continue-on-error absorbs crash]
    ABSORB -.->|found never written| PASS

vulcanho · 2026-05-15T18:04:10Z

+        continue-on-error: true
+        run: python .github/security/scanner.py


Scanner crash silently passes all PRs — found is never written when the scanner throws

The combination of continue-on-error: true and checking only steps.scan.outputs.found == 'true' creates a fail-open security gap. If the scanner exits non-zero due to an unhandled exception (YAML parse error, unexpected git output, OOM, etc.), found is never written to $GITHUB_OUTPUT. GitHub Actions leaves the output as an empty string. Both the comment step (line 63) and the block step (line 117) evaluate '' == 'true' → false, so neither runs. The workflow completes green and the PR is not blocked.

Fix both if conditions on lines 63 and 117 to also trigger when the scanner step itself failed:

if: steps.scan.outputs.found == 'true' || steps.scan.outcome == 'failure'

This ensures a crashed scanner blocks the PR (fail-closed) rather than silently approving it.

Suggested change

continue-on-error: true

run: python .github/security/scanner.py

continue-on-error: true

run: python .github/security/scanner.py

Prompt for agents

In `.github/workflows/security-scan.yml`, the scanner step at line 56 uses `continue-on-error: true`. Both the 'Post PR comment' step (line 63) and the 'Block merge' step (line 117) have the condition `if: steps.scan.outputs.found == 'true'`. This is fail-open: if the Python scanner crashes before writing `found=true` to GITHUB_OUTPUT, both steps are skipped and the PR is not blocked. Fix both `if` conditions to read `if: steps.scan.outputs.found == 'true' || steps.scan.outcome == 'failure'` so that a crashing scanner also blocks the PR.

vulcanho · 2026-05-15T18:04:11Z

+def commit_changed_files(commit: str) -> list[tuple[str, str]]:
+    """Return [(status, path)] for files changed in *commit*."""
+    out = _git("diff-tree", "--no-commit-id", "-r", "--name-status", commit)
+    pairs: list[tuple[str, str]] = []
+    for line in out.splitlines():
+        parts = line.strip().split("\t", 1)
+        if len(parts) == 2:
+            pairs.append((parts[0], parts[1]))
+    return pairs


Renamed files silently skipped in Phase 1 — split('\t', 1) produces an invalid path for R/C status lines

When git diff-tree --name-status reports a rename (R100) or copy (C100), the line has three tab-separated fields: R100<TAB>old_path<TAB>new_path. The current split("\t", 1) splits only on the first tab and yields ("R100", "old_path\tnew_path"). This path is invalid; the subsequent git show commit:old_path\tnew_path call fails silently and returns empty bytes. The file is never scanned.

An attacker who squashes an intermediate commit that renames an innocuous file to eslint.config.js and inserts a RAT payload evades Phase 1 entirely. Phase 2 provides a safety net only for the final state — squashed intermediate commits are not re-examined.

Fix by handling multi-field rename/copy lines:

def commit_changed_files(commit: str) -> list[tuple[str, str]]: """Return [(status, path)] for files changed in *commit*.""" out = _git("diff-tree", "--no-commit-id", "-r", "--name-status", commit) pairs: list[tuple[str, str]] = [] for line in out.splitlines(): parts = line.strip().split("\t") if len(parts) < 2: continue status = parts[0] # Renames and copies: R100\told_path\tnew_path — scan the destination if status[:1] in ("R", "C") and len(parts) >= 3: pairs.append((status, parts[2])) else: pairs.append((status, parts[1])) return pairs

Suggested change

def commit_changed_files(commit: str) -> list[tuple[str, str]]:

"""Return [(status, path)] for files changed in *commit*."""

out = _git("diff-tree", "--no-commit-id", "-r", "--name-status", commit)

pairs: list[tuple[str, str]] = []

for line in out.splitlines():

parts = line.strip().split("\t", 1)

if len(parts) == 2:

pairs.append((parts[0], parts[1]))

return pairs

def commit_changed_files(commit: str) -> list[tuple[str, str]]:

"""Return [(status, path)] for files changed in *commit*."""

out = _git("diff-tree", "--no-commit-id", "-r", "--name-status", commit)

pairs: list[tuple[str, str]] = []

for line in out.splitlines():

parts = line.strip().split("\t")

if len(parts) < 2:

continue

status = parts[0]

# Renames and copies: R100\told_path\tnew_path — scan the destination

if status[:1] in ("R", "C") and len(parts) >= 3:

pairs.append((status, parts[2]))

else:

pairs.append((status, parts[1]))

return pairs

Prompt for agents

In `.github/security/scanner.py` at `commit_changed_files()` (lines 198-206), the function parses `git diff-tree --name-status` output using `split('\t', 1)`. This breaks for renamed/copied files, which have three tab-separated fields (`R100\told_path\tnew_path`). The current split produces an invalid path `old_path\tnew_path`, causing the `git show` call to fail silently and the file to be unscanned. Replace the parsing logic to detect when `parts[0]` starts with 'R' or 'C' and use `parts[2]` (the destination path) instead of `parts[1]`.

vulcanho · 2026-05-15T18:04:11Z

+def _git(*args: str) -> str:
+    result = subprocess.run(["git", *args], capture_output=True, text=True)
+    return result.stdout
+
+
+def _git_bytes(*args: str) -> bytes:
+    result = subprocess.run(["git", *args], capture_output=True)
+    return result.stdout


_git() and _git_bytes() ignore subprocess exit codes — silent failure causes unscanned PRs

Both _git() and _git_bytes() discard result.returncode and result.stderr. A failed git invocation returns an empty string or empty bytes indistinguishably from a file that genuinely has empty content. Callers like pr_commits() and pr_all_files() treat empty output as "no commits" or "no files", silently bypassing both scan phases. At minimum, log a warning to stderr so operators notice the scan was hollow:

def _git(*args: str) -> str: result = subprocess.run(["git", *args], capture_output=True, text=True) if result.returncode != 0: print( f" [warn] git {args[0]} exited {result.returncode}: " f"{result.stderr.strip()[:200]}", file=sys.stderr, ) return result.stdout def _git_bytes(*args: str) -> bytes: result = subprocess.run(["git", *args], capture_output=True) if result.returncode != 0: print( f" [warn] git {args[0]} exited {result.returncode}: " f"{result.stderr.decode('utf-8', errors='replace').strip()[:200]}", file=sys.stderr, ) return result.stdout

Consider also having run_scan() return an error/incomplete status when pr_commits() returns [] despite BASE_SHA being non-empty.

Suggested change

def _git(*args: str) -> str:

result = subprocess.run(["git", *args], capture_output=True, text=True)

return result.stdout

def _git_bytes(*args: str) -> bytes:

result = subprocess.run(["git", *args], capture_output=True)

return result.stdout

def _git(*args: str) -> str:

result = subprocess.run(["git", *args], capture_output=True, text=True)

if result.returncode != 0:

print(

f" [warn] git {args[0]} exited {result.returncode}: "

f"{result.stderr.strip()[:200]}",

file=sys.stderr,

)

return result.stdout

def _git_bytes(*args: str) -> bytes:

result = subprocess.run(["git", *args], capture_output=True)

if result.returncode != 0:

print(

f" [warn] git {args[0]} exited {result.returncode}: "

f"{result.stderr.decode('utf-8', errors='replace').strip()[:200]}",

file=sys.stderr,

)

return result.stdout

Prompt for agents

In `.github/security/scanner.py`, the `_git()` function (line 180) and `_git_bytes()` function (line 185) do not check `result.returncode`. A failing git command (invalid SHA, missing object, permission error) returns empty string/bytes silently, causing `pr_commits()` to return an empty list and the entire Phase 1 scan to be skipped with no warning. Add a check of `result.returncode != 0` in both functions and print a warning to stderr that includes the return code and stderr text.

vulcanho · 2026-05-15T18:04:11Z

+def _write_output(key: str, value: str) -> None:
+    if not GITHUB_OUTPUT:
+        return
+    with open(GITHUB_OUTPUT, "a", encoding="utf-8") as fh:
+        if "\n" in value:
+            delimiter = "__SCAN_EOF__"
+            fh.write(f"{key}<<{delimiter}\n{value}\n{delimiter}\n")
+        else:
+            fh.write(f"{key}={value}\n")


Hardcoded __SCAN_EOF__ multiline delimiter can be injected via attacker-controlled file paths

GitHub Actions multiline output uses a heredoc-style delimiter. GitHub's own documentation explicitly warns that the delimiter must not appear in the value. Using a fixed string __SCAN_EOF__ is unsafe because report contains attacker-controlled data (file paths from git diff). A POSIX filename can legally contain a newline, allowing an attacker to name a file so that the rendered table row contains a bare __SCAN_EOF__ line.

Use a random per-run delimiter as GitHub's own documentation recommends:

import secrets def _write_output(key: str, value: str) -> None: if not GITHUB_OUTPUT: return with open(GITHUB_OUTPUT, "a", encoding="utf-8") as fh: if "\n" in value: delimiter = f"SCAN_EOF_{secrets.token_hex(16)}" fh.write(f"{key}<<{delimiter}\n{value}\n{delimiter}\n") else: fh.write(f"{key}={value}\n")

Suggested change

def _write_output(key: str, value: str) -> None:

if not GITHUB_OUTPUT:

return

with open(GITHUB_OUTPUT, "a", encoding="utf-8") as fh:

if "\n" in value:

delimiter = "__SCAN_EOF__"

fh.write(f"{key}<<{delimiter}\n{value}\n{delimiter}\n")

else:

fh.write(f"{key}={value}\n")

def _write_output(key: str, value: str) -> None:

if not GITHUB_OUTPUT:

return

with open(GITHUB_OUTPUT, "a", encoding="utf-8") as fh:

if "\n" in value:

delimiter = f"SCAN_EOF_{secrets.token_hex(16)}"

fh.write(f"{key}<<{delimiter}\n{value}\n{delimiter}\n")

else:

fh.write(f"{key}={value}\n")

Prompt for agents

In `.github/security/scanner.py`, the `_write_output` function at line 400 uses a hardcoded delimiter `__SCAN_EOF__` for multiline GitHub Actions output. If the report value contains a line that exactly equals `__SCAN_EOF__`, the output file is corrupted. Fix this by importing `secrets` at the top of the file and replacing the static `delimiter = "__SCAN_EOF__"` with `delimiter = f"SCAN_EOF_{secrets.token_hex(16)}"` — a different random delimiter for each write call. This matches the pattern recommended in GitHub Actions documentation.

vulcanho · 2026-05-15T18:04:11Z

+
+    steps:
+      - name: Checkout with full history
+        uses: actions/checkout@v4


GitHub Actions pinned to mutable semver tags — supply chain risk in the security scanner itself

All three actions (actions/checkout@v4, actions/setup-python@v5, actions/github-script@v7) use mutable version tags. These tags can be silently redirected to a different (potentially malicious) commit at any time. A security scanning workflow is a high-value target for this attack.

Pin each action to its full commit SHA:

- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 - uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0 - uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1

Verify the current SHA for each tag at https://github.com/actions//releases before pinning.

Also applies to: .github/workflows/security-scan.yml:31, .github/workflows/security-scan.yml:64

Prompt for agents

In `.github/workflows/security-scan.yml`, all three GitHub Actions are pinned to mutable semver tags: `actions/checkout@v4` (line 26), `actions/setup-python@v5` (line 31), `actions/github-script@v7` (line 64). Pin each to its full commit SHA hash instead. Look up the current SHA for each tag at https://github.com/actions/checkout/releases, https://github.com/actions/setup-python/releases, and https://github.com/actions/github-script/releases respectively, and replace the tag with the full 40-character commit hash, adding the tag as a comment.

vulcanho · 2026-05-15T18:04:11Z

+          python-version: "3.11"
+
+      - name: Install Python dependencies
+        run: pip install pyyaml


pip install pyyaml without a pinned version — scanner bootstrap is itself a supply chain vector

Installing pyyaml without a pinned version means the runner silently upgrades (or downgrades) on every job run, and a compromised or typosquatted package update would execute before the scanner runs. For security-critical infrastructure, pin the exact version and ideally verify its hash:

- name: Install Python dependencies run: pip install pyyaml==6.0.2

For stronger guarantees, commit a requirements.txt with pip hash verification:

pyyaml==6.0.2 --hash=sha256:<hash>

and run pip install --require-hashes -r .github/security/requirements.txt.

Suggested change

run: pip install pyyaml

- name: Install Python dependencies

run: pip install pyyaml==6.0.2

Prompt for agents

In `.github/workflows/security-scan.yml` at line 36, `pip install pyyaml` installs without a version pin. For a security scanner, this is a supply chain risk — a compromised package update would execute before the scan. Pin the version: change to `pip install pyyaml==6.0.2` (verify the current stable release). Even better, create a `.github/security/requirements.txt` file with `pyyaml==6.0.2 --hash=sha256:<hash>` and use `pip install --require-hashes -r .github/security/requirements.txt`.

vulcanho · 2026-05-15T18:04:11Z

+  # Override at the repo level via Settings → Secrets and variables → Actions.
+  # Valid values: CRITICAL | HIGH | MEDIUM | LOW | INFO
+  FAIL_SEVERITY: MEDIUM


FAIL_SEVERITY override comment is wrong — env.FAIL_SEVERITY cannot be overridden via repo variables

The comment on line 15 tells operators they can override FAIL_SEVERITY via "Settings → Secrets and variables → Actions" (i.e., repository-level Action variables). This is incorrect. ${{ env.FAIL_SEVERITY }} reads from the workflow-level env: block, not from ${{ vars.FAIL_SEVERITY }}. Repository variables have no effect here.

Either update the comment to reflect reality, or switch to actually reading the repository variable:

env: # Minimum severity that fails the build. # Override per-repo: Settings → Secrets and variables → Actions → Variables → FAIL_SEVERITY # Valid values: CRITICAL | HIGH | MEDIUM | LOW | INFO FAIL_SEVERITY: ${{ vars.FAIL_SEVERITY || 'MEDIUM' }}

Then remove the inner FAIL_SEVERITY: ${{ env.FAIL_SEVERITY }} from the scan step's env: block, since the runner already inherits workflow-level env vars.

Suggested change

# Override at the repo level via Settings → Secrets and variables → Actions.

# Valid values: CRITICAL | HIGH | MEDIUM | LOW | INFO

FAIL_SEVERITY: MEDIUM

env:

# Minimum severity that fails the build.

# Override per-repo: Settings → Secrets and variables → Actions → Variables → FAIL_SEVERITY

# Valid values: CRITICAL | HIGH | MEDIUM | LOW | INFO

FAIL_SEVERITY: ${{ vars.FAIL_SEVERITY || 'MEDIUM' }}

Prompt for agents

In `.github/workflows/security-scan.yml`, line 15 says `FAIL_SEVERITY` can be overridden via repository-level Action variables (Settings → Secrets and variables → Actions). This is wrong: line 17 hardcodes `FAIL_SEVERITY: MEDIUM` and line 52 reads it back as `${{ env.FAIL_SEVERITY }}` which only reads the workflow-level `env:` block, not repository variables. Fix by changing line 17 to `FAIL_SEVERITY: ${{ vars.FAIL_SEVERITY || 'MEDIUM' }}` so the repository variable actually takes effect, then remove the duplicate `FAIL_SEVERITY: ${{ env.FAIL_SEVERITY }}` from the scan step's env block.

feat: add scalable signature-driven supply-chain malware scanner

b37042a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions Bot added the github_actions label May 15, 2026

vulcanho Bot reviewed May 15, 2026

View reviewed changes

ShivamBhasin2002 closed this May 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: security ci#1

feat: security ci#1
ShivamBhasin2002 wants to merge 1 commit into
masterfrom
ft-security-ci

ShivamBhasin2002 commented May 15, 2026

Uh oh!

vulcanho Bot left a comment

Uh oh!

vulcanho Bot May 15, 2026

Uh oh!

vulcanho Bot May 15, 2026

Uh oh!

vulcanho Bot May 15, 2026

Uh oh!

vulcanho Bot May 15, 2026

Uh oh!

vulcanho Bot May 15, 2026

Uh oh!

vulcanho Bot May 15, 2026

Uh oh!

vulcanho Bot May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		continue-on-error: true
		run: python .github/security/scanner.py

	run: pip install pyyaml
	- name: Install Python dependencies
	run: pip install pyyaml==6.0.2

Conversation

ShivamBhasin2002 commented May 15, 2026

Summary

File structure

Uh oh!

vulcanho Bot left a comment

Choose a reason for hiding this comment

📋 Summary

💬 General Feedback

🔀 Architecture

Uh oh!

vulcanho Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

vulcanho Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

vulcanho Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

vulcanho Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

vulcanho Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

vulcanho Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

vulcanho Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant