fix: strip markdown code fences from ADF output before JSON parse#9

Merged
adalton merged 5 commits into main from fix/strip-code-fences-from-adf
Jun 24, 2026
Conversation

@adalton

@adalton adalton commented Jun 23, 2026

Contributor

Summary

  • On OSAC-1628, the bot posted raw ADF JSON as a plain text blob. The AI's
    output looked like valid JSON, but contained invisible Unicode characters
    (likely BOM U+FEFF or zero-width non-joiners) that made json.Unmarshal
    fail. The fallback path then wrapped the entire JSON string in TextToADF()
    and posted it as a single text paragraph.
  • Adds trimInvisible() to strip BOM, zero-width spaces/joiners, NBSP, and
    other invisible characters from assessment boundaries before JSON parsing.
  • Adds stripCodeFences() to handle a second common LLM failure mode:
    wrapping JSON output in markdown code fences despite explicit instructions.
  • Fixes pre-existing gosec G304 lint warnings in main.go/main_test.go
    that were failing CI on main.
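
For readers without the diff at hand, boundary trimming of this kind can be sketched as below. This is a minimal illustration, not the code in triage/processor.go: the function body, the helper name isInvisible, and the exact character set are assumptions based only on the characters listed in this description.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isInvisible reports whether r is one of the characters named in the PR
// description (BOM, zero-width space/non-joiner/joiner, word joiner, NBSP)
// or ordinary Unicode whitespace. The exact set is an assumption.
func isInvisible(r rune) bool {
	switch r {
	case '\uFEFF', '\u200B', '\u200C', '\u200D', '\u2060', '\u00A0':
		return true
	}
	return unicode.IsSpace(r)
}

// trimInvisible removes invisible characters from both ends of s only.
// Characters inside the payload are left untouched.
func trimInvisible(s string) string {
	return strings.TrimFunc(s, isInvisible)
}

func main() {
	in := "\uFEFF \u200C{\"type\":\"doc\"}\u200B\n"
	fmt.Printf("%q\n", trimInvisible(in))
}
```

Trimming only at the boundaries keeps any zero-width characters that legitimately appear inside JSON string values.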

Test plan

  • TestTrimInvisible — 11 table-driven cases: BOM, ZWNJ, ZWJ, ZWS,
    NBSP, word joiner, mixed, combined with whitespace
  • TestBuildADFComment_BOM — integration test through buildADFComment
    with BOM-prefixed input
  • TestStripCodeFences — 9 table-driven cases: no fences, bare fences,
    language tags, whitespace, no closing fence, embedded backticks, CRLF
  • TestBuildADFComment_Fenced — integration test with fenced input
  • All existing tests pass (go test -race ./...)
  • Lint clean (make lint — 0 issues, including pre-existing gosec fixes)

Assisted-by: Claude noreply@anthropic.com

Summary

Hardened the triage bot’s ADF comment generation against common LLM output formatting issues. buildADFComment now preprocesses the model’s assessment text before json.Unmarshal by running a normalization pipeline: trimInvisible → stripCodeFences → trimInvisible. This strips invisible Unicode/control characters (e.g., BOM U+FEFF, zero-width space/non-joiner/joiner, word joiner, non-breaking space) and removes an outer Markdown triple-backtick code fence (optionally with a language tag) when present. If ADF JSON parsing still fails, the existing plain-text fallback behavior remains unchanged.

Additionally, in the DRY RUN branch, logging was tightened to avoid printing full parsed ADF/plain-text comment content; it now logs only whether ADF parsing succeeded (e.g., format: "adf" vs "plain text") plus issue/action metadata.
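
The parse-or-fall-back flow and the reduced DRY RUN log line can be sketched roughly as follows. Everything here is illustrative: normalize is a compressed stand-in for the trimInvisible → stripCodeFences → trimInvisible pipeline, textToADF is a stand-in for TextToADF, and parseADF, the log format, and the issue key OSAC-0000 are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// fence is three backticks, built at runtime to keep this listing readable.
var fence = strings.Repeat("`", 3)

// normalize is a compressed stand-in for the three-step pipeline:
// trim invisible characters, drop one outer fence, trim again.
func normalize(s string) string {
	const invisible = "\uFEFF\u200B\u200C\u200D\u2060\u00A0 \t\r\n"
	s = strings.Trim(s, invisible)
	if strings.HasPrefix(s, fence) {
		if nl := strings.IndexByte(s, '\n'); nl >= 0 {
			s = strings.TrimSuffix(strings.Trim(s[nl+1:], invisible), fence)
		}
	}
	return strings.Trim(s, invisible)
}

// textToADF wraps plain text in a single-paragraph ADF document
// (hypothetical stand-in for the fallback helper).
func textToADF(text string) map[string]any {
	return map[string]any{
		"type": "doc", "version": 1,
		"content": []any{map[string]any{
			"type":    "paragraph",
			"content": []any{map[string]any{"type": "text", "text": text}},
		}},
	}
}

// parseADF returns the comment body and a format label for logging.
func parseADF(assessment string) (map[string]any, string) {
	var doc map[string]any
	if err := json.Unmarshal([]byte(normalize(assessment)), &doc); err == nil {
		return doc, "adf"
	}
	return textToADF(assessment), "plain text"
}

func main() {
	raw := "\uFEFF" + fence + "json\n{\"type\":\"doc\",\"version\":1,\"content\":[]}\n" + fence
	_, format := parseADF(raw)
	// Log metadata only; the comment body is never printed.
	fmt.Printf("DRY RUN issue=%s action=%s format=%q\n", "OSAC-0000", "comment", format)
}
```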

Packages Affected

  • triage/ (primary)

    • triage/processor.go: updated ADF JSON preprocessing in buildADFComment; added trimInvisible() and stripCodeFences() helpers.
    • triage/processor_test.go: added table-driven unit tests for invisible trimming and code-fence stripping, plus integration-style tests for BOM-prefixed and fenced JSON inputs.
  • (root application/tests) (supporting)

    • main.go: fixed gosec G304 by reading the Claude Code config using filepath.Clean(configPath).
    • main_test.go: updated config read paths in relevant MCP config-writing tests to use filepath.Clean(configPath).
  • jira/, scanner/, server/, workflow/, config/: not touched.

Control Plane Impact

Affects the AI output → comment formatting path for ADF comment state generation only (the step that converts an AI-generated assessment into an ADF-formatted comment). No changes to polling/webhooks orchestration or the broader control-plane state machine.

AI Invocation Path

No changes to how the model is invoked (executor/prompt/template/metadata). Changes are limited to post-processing of the model output before JSON parsing and comment construction.

Configuration & Deployment

No Helm chart or deployment changes; only minor config-path cleaning in main.go and corresponding test reads to address lint failures.

LLMs commonly wrap JSON output in markdown code fences despite explicit
instructions not to. When buildADFComment failed to parse the fenced
JSON, the fallback path posted the raw ADF JSON as plain text via
TextToADF, resulting in unreadable comments (observed on OSAC-1628).

Add stripCodeFences() to remove a single layer of fences before
json.Unmarshal. No-op on clean input; preserves the existing plain-text
fallback if the content still isn't valid JSON after stripping.

Assisted-by: Claude Opus 4.6 (1M) <noreply@anthropic.com>
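
A single-layer fence stripper along the lines this commit message describes might look like the sketch below. It is an assumption-laden illustration rather than the merged code: the body is written from the behaviours listed in the test plan (language tags, CRLF, missing closing fence, embedded backticks), and the fence variable is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// fence is three backticks, built at runtime to keep this listing readable.
var fence = strings.Repeat("`", 3)

// stripCodeFences removes one outer layer of triple-backtick fences.
// Input without an opening fence is returned unchanged.
func stripCodeFences(s string) string {
	trimmed := strings.TrimSpace(s)
	if !strings.HasPrefix(trimmed, fence) {
		return s // no opening fence: no-op
	}
	// Drop the opening fence line, including any language tag such as "json".
	nl := strings.IndexByte(trimmed, '\n')
	if nl < 0 {
		return s // a lone fence line with no body: nothing to extract
	}
	body := strings.TrimRight(trimmed[nl+1:], " \t\r\n")
	// Drop the closing fence if present; a missing one is tolerated.
	body = strings.TrimSuffix(body, fence)
	return strings.TrimSpace(body)
}

func main() {
	in := fence + "json\n{\"type\":\"doc\"}\n" + fence
	fmt.Println(stripCodeFences(in))
}
```

Only the outermost fence is removed, so triple backticks that appear inside a JSON string value survive.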
@adalton adalton self-assigned this Jun 23, 2026
@coderabbitai

coderabbitai Bot commented Jun 23, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 67a29a2a-0d35-4a3a-9013-ef2c8ff720f5

📥 Commits

Reviewing files that changed from the base of the PR and between 9266b8a and 847e0d6.

📒 Files selected for processing (2)
  • triage/processor.go
  • triage/processor_test.go

Walkthrough

Two independent features are added: (1) ADF JSON preprocessing in triage/processor.go gains trimInvisible to strip BOM and invisible characters, and stripCodeFences to extract JSON from Markdown code fences. buildADFComment chains both before json.Unmarshal, with DRY RUN logging simplified to report format type only. Comprehensive tests cover edge cases (embedded backticks, CRLF, missing closing fence) and integrated parsing. (2) Config file reads in main.go and test helpers normalize paths with filepath.Clean for consistent handling.

Changes

ADF JSON preprocessing for LLM output

| Layer / File(s) | Summary |
|---|---|
| Invisible character and fence stripping helpers (triage/processor.go, triage/processor_test.go) | Adds trimInvisible to strip BOM (byte-order mark), zero-width joiner/non-joiner, word joiner, and control whitespace; stripCodeFences extracts JSON from triple-backtick fences with optional language tags, CRLF-safe, handles missing closing fence. Both are integrated into buildADFComment and applied in sequence before json.Unmarshal. TestTrimInvisible covers isolated behavior with table-driven cases; TestStripCodeFences exercises 7 edge cases including embedded backticks and CRLF. |
| Integrated ADF parsing and DRY RUN logging (triage/processor.go, triage/processor_test.go) | Integration tests verify buildADFComment correctly parses BOM-prefixed JSON, code-fenced JSON, and combined BOM-before-fence JSON, each producing valid ADF documents with type == "doc". DRY RUN logging in postComment is simplified to report format type ("adf" or "plain text") based on ADF parse success, without emitting parsed body or fallback comment content. |

Config path normalization

| Layer / File(s) | Summary |
|---|---|
| Config path normalization in main and tests (main.go, main_test.go) | Applies filepath.Clean to config path reads in writeMCPConfig and updates three test assertions (TestWriteMCPConfig_NewFile, TestWriteMCPConfig_MergesExistingKeys, TestWriteMCPConfig_ExplicitEnvWins) to read generated config via cleaned paths. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~13 minutes

Poem

BOM and fences melt away,
Three backticks stripped without delay.
Zero-width ghosts now cleared from sight,
JSON parsing finally right—
LLM output cleaned up bright! ✨

🚥 Pre-merge checks | ✅ 12 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (12 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title directly and accurately summarizes the primary change: sanitizing LLM assessment output by stripping markdown code fences before JSON parsing. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| No-Hardcoded-Secrets | ✅ Passed | No hardcoded secrets found. Production code adds sanitization functions with only logic and string constants. Test code uses obvious fake tokens like "tok123". |
| No-Weak-Crypto | ✅ Passed | PR uses only SHA-256 for non-security description hashing; no weak algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB) or custom crypto implementations detected. |
| No-Injection-Vectors | ✅ Passed | No injection vectors detected. Changes safely sanitize JSON input with trimInvisible/stripCodeFences and improve security via filepath.Clean gosec G304 fix. |
| Container-Privileges | ✅ Passed | PR contains no modifications to container/K8s manifests. Existing configurations use appropriate security controls: runAsNonRoot, allowPrivilegeEscalation disabled, capabilities dropped, and non-ro... |
| No-Sensitive-Data-In-Logs | ✅ Passed | No sensitive data exposed in logs. DRY RUN logging was improved to exclude assessment content; only logs issue key, action, and format. Credentials and API tokens are never logged. |
| Resource-Leaks | ✅ Passed | PR introduces no resource leaks: only string processing (trimInvisible, stripCodeFences), no unclosed files/HTTP/DB connections, no unmanaged goroutines or contexts. |
| Unchecked-Errors | ✅ Passed | New code in processor.go properly captures, checks, logs, or returns all errors. No unchecked errors assigned to blank identifiers without justification were introduced. |
| Ai-Attribution | ✅ Passed | PR discloses Claude AI usage with proper "Assisted-by" attribution trailer in both PR description and commit message; no Co-Authored-By misuse detected. |

adalton added 2 commits June 23, 2026 14:26
The OSAC-1628 incident showed the AI's ADF JSON was valid when visible
but contained invisible characters (likely BOM U+FEFF or zero-width
non-joiners) that broke json.Unmarshal, causing the fallback path to
post raw JSON as plain text.

Add trimInvisible() to strip BOM, zero-width spaces, ZWNJ, ZWJ, word
joiners, and NBSP from the boundaries of the assessment before parsing.
Applied after stripCodeFences() in the buildADFComment pipeline.

Assisted-by: Claude Opus 4.6 (1M) <noreply@anthropic.com>
Wrap os.ReadFile calls with filepath.Clean to satisfy gosec's G304
(potential file inclusion via variable) check. The paths are constructed
from os.UserHomeDir() and t.TempDir() so they're already safe, but the
linter can't prove that statically.

Assisted-by: Claude Opus 4.6 (1M) <noreply@anthropic.com>
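
The read-side change described in this commit amounts to wrapping the path before it reaches os.ReadFile. The sketch below shows the pattern on a temporary file; readConfig and demo are hypothetical names, and the claim that this wrapping satisfies gosec G304 comes from the commit message rather than from anything verified here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// readConfig reads the file at path after lexically normalizing it.
// filepath.Clean collapses "." and ".." elements and duplicate separators;
// it does not touch the filesystem or resolve symlinks.
func readConfig(path string) ([]byte, error) {
	return os.ReadFile(filepath.Clean(path))
}

// demo writes a small file into a temp dir and reads it back through a
// deliberately untidy path.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "cfg")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	if err := os.WriteFile(filepath.Join(dir, "config.json"), []byte("{}"), 0o600); err != nil {
		return "", err
	}
	data, err := readConfig(dir + "/./config.json")
	return string(data), err
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```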

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
main.go (1)

158-199: 🗄️ Data Integrity & Integration | 🔴 Critical | ⚡ Quick win

Critical: Asymmetric path normalization between read and write.

The function cleans the path only on read (line 169) but not on write (line 195). If configPath contains .. or ./:

  • os.ReadFile(filepath.Clean(configPath)) reads from the normalized path
  • os.WriteFile(configPath, ...) writes to the uncleaned path

This creates a data integrity risk: the function could read config from one location and write to another, potentially losing data or enabling traversal attacks.

Both read and write must use canonical paths consistently.

🔒 Proposed fix: normalize write path
 	if err := os.WriteFile(configPath, data, 0o600); err != nil {
 		return err
 	}
-	return os.Chmod(configPath, 0o600)
+	return os.Chmod(filepath.Clean(configPath), 0o600)

Better: clean configPath once at entry:

 func writeMCPConfig(cfg *config.Config, configPath string) error {
+	configPath = filepath.Clean(configPath)
+	
 	env := make(map[string]string)

Then remove filepath.Clean() from the read call (line 169) to avoid double-cleaning.
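
The "clean once at entry" variant can be sketched as a small read-modify-write helper. mergeConfig and demo are hypothetical stand-ins for illustration only; the real writeMCPConfig takes a *config.Config and its body is not shown in this review.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// mergeConfig reads existing JSON settings (if any), sets one key, and
// writes the result back. The path is normalized once, so the read, the
// write, and the chmod all operate on the same string.
func mergeConfig(configPath, key, value string) error {
	configPath = filepath.Clean(configPath)

	settings := map[string]any{}
	data, err := os.ReadFile(configPath)
	switch {
	case err == nil:
		if err := json.Unmarshal(data, &settings); err != nil {
			return err
		}
	case !errors.Is(err, fs.ErrNotExist):
		return err // a missing file is fine; anything else is not
	}
	settings[key] = value

	out, err := json.MarshalIndent(settings, "", "  ")
	if err != nil {
		return err
	}
	if err := os.WriteFile(configPath, out, 0o600); err != nil {
		return err
	}
	return os.Chmod(configPath, 0o600)
}

// demo merges two keys through an untidy path and returns the file content.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "mcp")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := dir + "/./config.json"
	if err := mergeConfig(path, "a", "1"); err != nil {
		return "", err
	}
	if err := mergeConfig(path, "b", "2"); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(dir, "config.json"))
	return string(data), err
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

Reassigning the parameter at the top keeps later call sites from drifting apart if more file operations are added.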

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@main.go` around lines 158 - 199, The writeMCPConfig function has inconsistent
path normalization: the read operation uses filepath.Clean(configPath) but the
write operation uses the raw configPath. This can cause the function to read
from one location and write to another, creating data integrity and security
risks. Fix this by normalizing configPath once at the beginning of the
writeMCPConfig function using filepath.Clean, then use the normalized path
consistently in both the os.ReadFile call and the os.WriteFile call. Remove the
filepath.Clean call from the read operation since the path will already be
normalized.

Source: Path instructions

triage/processor_test.go (1)

228-285: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Add regression test for BOM-prefixed fenced JSON.

Current tests cover BOM and fenced inputs separately, but not the combined case (\uFEFF followed by ```json ... ```). Add one integration case in the buildADFComment tests to lock in the expected parse path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@triage/processor_test.go` around lines 228 - 285, The test suite covers
BOM-prefixed JSON separately via TestBuildADFComment_BOM and fenced JSON
separately via TestBuildADFComment_Fenced, but lacks a test case for the
combined scenario of BOM-prefixed fenced JSON. Add a new test function (e.g.,
TestBuildADFComment_BOM_Fenced) that calls buildADFComment with input combining
both a BOM prefix (\uFEFF) and code fences (```json ... ```), following the same
assertion pattern as the existing buildADFComment tests to ensure the function
correctly handles this combined case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@triage/processor.go`:
- Line 327: In the json.Unmarshal call at line 327, the order of function calls
for assessment needs to be reversed. Currently stripCodeFences is called before
trimInvisible, but this can miss fenced payloads when invisible characters like
BOM or zero-width characters prefix the opening fence. Change the order so that
trimInvisible is called first on the assessment, and then stripCodeFences is
applied to the result of that operation, ensuring invisible characters are
removed before fence detection occurs.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a5c301d5-a872-4234-9fb1-f9bf9a26d2e2

📥 Commits

Reviewing files that changed from the base of the PR and between cad54ac and 98c6dac.

📒 Files selected for processing (4)
  • main.go
  • main_test.go
  • triage/processor.go
  • triage/processor_test.go

Comment thread triage/processor.go Outdated
adalton added 2 commits June 23, 2026 14:32
The dry-run log lines dumped the entire AI assessment (ADF body or plain
text) into container logs. Replace with issue key, action, and format
only — sufficient for verifying the bot's behavior without leaking
potentially sensitive issue content.

Assisted-by: Claude Opus 4.6 (1M) <noreply@anthropic.com>
A BOM or zero-width char prefixing the opening fence would prevent
stripCodeFences from detecting it. Reorder the pipeline to:
trimInvisible -> stripCodeFences -> trimInvisible, so invisible chars
are stripped before fence detection, and again after fence removal.

Assisted-by: Claude Opus 4.6 (1M) <noreply@anthropic.com>
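
Why the order matters can be seen with a very small experiment. The helpers below are simplified stand-ins (hasOpeningFence for the first check a fence stripper makes, trimBOM for trimInvisible), not the functions from triage/processor.go. The key fact is that U+FEFF is not Unicode whitespace, so strings.TrimSpace leaves it in place.

```go
package main

import (
	"fmt"
	"strings"
)

// fence is three backticks, built at runtime to keep this listing readable.
var fence = strings.Repeat("`", 3)

// hasOpeningFence mimics fence detection that only tolerates ordinary
// leading whitespace.
func hasOpeningFence(s string) bool {
	return strings.HasPrefix(strings.TrimSpace(s), fence)
}

// trimBOM strips a handful of invisible, non-whitespace characters from
// both ends of s.
func trimBOM(s string) string {
	return strings.Trim(s, "\uFEFF\u200B\u200C\u200D\u2060")
}

func main() {
	in := "\uFEFF" + fence + "json\n{}\n" + fence
	fmt.Println(hasOpeningFence(in))          // fence hidden behind the BOM
	fmt.Println(hasOpeningFence(trimBOM(in))) // fence visible after trimming
}
```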
@adalton adalton requested a review from amir-yogev-gh June 23, 2026 18:41
@adalton adalton merged commit 8e3fd91 into main Jun 24, 2026
8 checks passed
@adalton adalton deleted the fix/strip-code-fences-from-adf branch June 24, 2026 17:01