feat: implement structured extraction checkpoint B3 #921
Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>
Summary by CodeRabbit
Walkthrough
Cheatsheet extraction now tracks fallback usage per call instead of using a module constant.
Changes
Fallback handling and extraction robustness
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
application/utils/external_project_parsers/parsers/cheatsheet_extractor.py (1)
12-13: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Empty `##` headings are currently dropped despite the "can be empty" contract.
Line 99 states empty headings are allowed, but `_HEADING_RE` requires at least one character (`.+`), so `##` / `## ` headings are excluded instead of being preserved as empty strings.
Suggested fix
-_HEADING_RE = re.compile(r"^##\s+(?P<heading>.+)$", re.MULTILINE)
+_HEADING_RE = re.compile(r"^##(?:\s+(?P<heading>.*))?$", re.MULTILINE)
-    headings = [m.group("heading").strip() for m in _HEADING_RE.finditer(markdown)]
+    headings = [(m.group("heading") or "").strip() for m in _HEADING_RE.finditer(markdown)]
Also applies to: 99-100
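To make the finding concrete, here is a minimal sketch of heading extraction that keeps empty `##` headings. It is not the project's code: `HEADING_RE`, `ORIGINAL_RE`, and `extract_headings` are hypothetical names used only for illustration. One caveat worth checking against the real extractor: `\s` also matches newlines, so a pattern of the form `^##\s+...` can run past a bare `##` line and capture the next line as the heading text. The sketch uses `[ \t]` as the separator to avoid that.

```python
import re

# Hypothetical pattern (not the project's _HEADING_RE): the heading text is
# optional, and [ \t] is used instead of \s so a match never crosses a newline.
HEADING_RE = re.compile(r"^##(?:[ \t]+(?P<heading>.*?))?[ \t]*$", re.MULTILINE)

# The pattern quoted as the "before" line of the suggested fix, for comparison.
ORIGINAL_RE = re.compile(r"^##\s+(?P<heading>.+)$", re.MULTILINE)


def extract_headings(markdown: str) -> list[str]:
    """Return the text of every level-2 heading, using "" for empty ones."""
    return [(m.group("heading") or "").strip() for m in HEADING_RE.finditer(markdown)]


sample = "## Intro\n##\nbody text\n## \n### Deeper\n"

# Empty headings are preserved as "" and the level-3 heading is ignored.
print(extract_headings(sample))

# With \s+, the bare "##" line swallows the following line as its heading.
print([m.group("heading") for m in ORIGINAL_RE.finditer(sample)])
```

Whether the real extractor should treat `##Intro` (no space) as a heading is a separate decision; this sketch rejects it, which is an assumption rather than something the review states.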
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/external_project_parsers/parsers/cheatsheet_extractor.py` around lines 12 - 13, The `_HEADING_RE` and `_ANY_HEADING_RE` currently require at least one non-space character (using `.+`) which drops empty `##` headings; change their patterns to allow empty heading text (replace `.+` with `.*`) while keeping the `(?P<heading>)` group name intact so empty headings capture as an empty/whitespace string, and ensure downstream code that uses `heading` (e.g., any trimming or truthiness checks) preserves empty headings per the comment at line 99 (trim whitespace but treat the result as "" rather than dropping the heading).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 05aa9804-df31-406d-af28-37a40b53f9af
📒 Files selected for processing (1)
application/utils/external_project_parsers/parsers/cheatsheet_extractor.py
Summary
Implements Workstream B Checkpoint B3 from the OWASP Cheat Sheet to CRE Mapping pipeline RFC.
This PR is built on top of the existing Workstream B1/B2 extraction implementation (PR #912).
Changes
cheatsheet_extractor.py
Added deterministic fallback handling for malformed or irregular markdown inputs.
Implemented:
- `fallback_used`
No fallback handling was added for headings, as empty headings are explicitly allowed by the RFC.
No fallback handling was required for:
- `source_id`
- `hyperlink`
- `raw_markdown_path`
since these are deterministically derived from the source path provided by Workstream A.
This PR specifically focuses on graceful fallback behavior for malformed markdown inputs while still returning a valid `CheatsheetRecord`.
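The per-call fallback tracking described here could look roughly like the sketch below. It is illustrative only and not taken from the PR: the `title` field, the rule of falling back to the file stem when no `#` heading is present, the `H1_RE` pattern, and the `extract_record` function are all assumptions. Only the names `CheatsheetRecord`, `source_id`, `raw_markdown_path`, and `fallback_used` come from the description above.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical pattern for a level-1 title line such as "# Some Title".
H1_RE = re.compile(r"^#[ \t]+(?P<title>.+?)[ \t]*$", re.MULTILINE)


@dataclass
class CheatsheetRecord:
    # The field set is an assumption; the real record likely has more fields.
    source_id: str
    raw_markdown_path: str
    title: str
    fallback_used: bool = False


def extract_record(markdown: str, source_path: str) -> CheatsheetRecord:
    """Build a record, flagging the fallback per call rather than globally."""
    # Derived deterministically from the path, so no fallback is needed here.
    source_id = PurePosixPath(source_path).stem

    fallback_used = False  # local to this call, never shared module state
    match = H1_RE.search(markdown)
    if match:
        title = match.group("title")
    else:
        # Assumed fallback rule: derive a title from the file name.
        title = source_id.replace("_", " ")
        fallback_used = True

    return CheatsheetRecord(source_id, source_path, title, fallback_used)


path = "cheatsheets/XSS_Prevention_Cheat_Sheet.md"
good = extract_record("# XSS Prevention\n\n## Intro\n", path)
bad = extract_record("no headings here\n", path)
print(good.fallback_used, bad.fallback_used)
```

Keeping the flag as a local variable means one malformed input cannot affect the record returned for the next one, which a module-level constant cannot guarantee.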