Skip to content

Add Unicode escape sequence validation (Issue #3)#19

Merged
mprpic merged 1 commit into
mprpic:mainfrom
jgamblin:issue/3-unicode-escape-sequences
May 21, 2026
Merged

Add Unicode escape sequence validation (Issue #3)#19
mprpic merged 1 commit into
mprpic:mainfrom
jgamblin:issue/3-unicode-escape-sequences

Conversation

@jgamblin
Copy link
Copy Markdown
Contributor

Summary

This PR adds validation to detect Unicode escape sequences in CVE descriptions, which should not be present since the CVE schema expects UTF-8 encoded data. Unicode characters should be used directly instead of their escape sequence representations.

Impact Analysis

Testing Results (Full CVE dataset: 340,652 files):

Metric Finding
CVEs with Unicode Escapes 0 (not found in current dataset)
Status No violations detected in production data

While no Unicode escape sequences were found in the current CVE dataset, this validation is important for:

  • Future data quality assurance
  • Catching legacy data issues if re-imported
  • Maintaining compliance with UTF-8 requirement

Changes

New Validation Rule: E011

  • Name: check-unicode-escape-sequences
  • Description: Descriptions do not contain Unicode escape sequences; UTF-8 characters should be used instead
  • Scope: Checks both PUBLISHED record descriptions and REJECTED record rejection reasons

Features

  • Detects both 4-digit (\uXXXX) and 8-digit (\uXXXXXXXX) Unicode escape sequences
  • Clear error messages distinguishing descriptions from rejection reasons
  • Proper handling of both published and rejected records

Testing

  • 5 comprehensive unit tests
  • Tests for published records, rejected records, multiple escapes
  • Tests verify UTF-8 characters don't trigger false positives

References

@jgamblin jgamblin force-pushed the issue/3-unicode-escape-sequences branch from 903af23 to 3262f3b Compare May 14, 2026 19:32
@jgamblin
Copy link
Copy Markdown
Contributor Author

Update: I have rebased this PR on top of #23 and pushed the latest changes. I also ran QC checks (go test ./..., go test -race ./..., go vet ./..., gofmt -l .) and everything is passing on this branch.

@mprpic
Copy link
Copy Markdown
Owner

mprpic commented May 18, 2026

I merged 23, can you rebase on top of it now that it's on main? Thanks!

@jgamblin jgamblin force-pushed the issue/3-unicode-escape-sequences branch from 0bbe4ac to 981c419 Compare May 19, 2026 23:59
@jgamblin
Copy link
Copy Markdown
Contributor Author

Rebased and tested

This PR has been rebased on the latest main branch and all tests pass.

Test Results:

  • Build: ✅ PASS
  • Unit tests: ✅ PASS
  • Integration tests: ✅ PASS

Ready for review and merge.

@mprpic
Copy link
Copy Markdown
Owner

mprpic commented May 20, 2026

@jgamblin I think you merged main with this branch so it pulled in the commits into this PR instead of rebasing onto main. I think to fix this, just create a temp branch from main, cherry-pick your commit that was specific to the change here, remove your old branch, and recreate it from the temp branch (that you can then also delete). That will make for cleaner history and a much more reviewable diff. Looking at this, I'm not sure what changes are from main and which from your unicode escape change. Thanks!

Add E011 validation rule to check for Unicode escape sequences in CVE descriptions and rejection reasons. The CVE schema expects UTF-8 encoded data, so Unicode escape sequences like \uXXXX should not be used.

Includes:
- CheckUnicodeEscapeSequences function to detect escape sequences
- Comprehensive test coverage for both published and rejected records
- Support for checking descriptions and rejection reasons
- Proper regex validation for 4 and 8-digit escape sequences

Fixes mprpic#3
@jgamblin jgamblin force-pushed the issue/3-unicode-escape-sequences branch from a63a201 to afe3908 Compare May 20, 2026 13:20
@jgamblin
Copy link
Copy Markdown
Contributor Author

Fixed - Clean rebase applied

Thanks for catching that! You were right - I had merged main instead of properly rebasing. I've now fixed it using the cherry-pick approach:

What I did:

  1. Created a fresh temp branch from main
  2. Cherry-picked only the Unicode escape sequence validation commit (afe3908)
  3. Deleted the old branch and recreated it from the temp branch
  4. Verified all tests pass

Result:

  • Clean history with only 1 commit specific to this PR
  • All tests passing ✅
  • Ready for review

The PR now has a much cleaner diff showing only the Unicode escape sequence validation changes.

Copy link
Copy Markdown
Owner

@mprpic mprpic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mprpic mprpic merged commit 2f56af3 into mprpic:main May 21, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Check for Unicode escape sequences in CVE descriptions and/or anywhere else relevant

2 participants