feat: implement structured extraction checkpoint B3 #921

Open
Abhijeet2409 wants to merge 6 commits into OWASP:main from Abhijeet2409:feature/structured-extraction-b3

Conversation

@Abhijeet2409

Contributor

Summary

Implements Workstream B Checkpoint B3 from the OWASP Cheat Sheet to CRE Mapping pipeline RFC.

This PR is built on top of the existing Workstream B1/B2 extraction implementation (PR #912).

Changes

cheatsheet_extractor.py

Added deterministic fallback handling for malformed or irregular markdown inputs.

Implemented:

  • fallback title extraction
  • fallback summary extraction
  • consistent fallback metadata handling through fallback_used

No fallback handling was added for headings, as empty headings are explicitly allowed by the RFC.

No fallback handling was required for:

  • source_id
  • hyperlink
  • raw_markdown_path

since these are deterministically derived from the source path provided by Workstream A.

This PR specifically focuses on graceful fallback behavior for malformed markdown inputs while still returning a valid CheatsheetRecord.
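
As a rough illustration of the fallback idea described above, the sketch below derives a title from the file name and a summary from the first non-heading line. The function names, the `"Untitled"` default, and the 200-character cap are assumptions made for this example; they are not taken from the PR's `_fallback_title` / `_fallback_summary` helpers, whose bodies are not shown in this conversation.

```python
from pathlib import Path


def fallback_title(raw_markdown_path: str) -> str:
    """Derive a readable title from the file name when no usable H1 exists."""
    stem = Path(raw_markdown_path).stem
    # Hypothetical convention: underscores and hyphens become spaces.
    return stem.replace("_", " ").replace("-", " ").strip() or "Untitled"


def fallback_summary(markdown: str, max_chars: int = 200) -> str:
    """Return the first non-blank, non-heading line, or "" if there is none."""
    for line in markdown.splitlines():
        text = line.strip()
        if text and not text.startswith("#"):
            return text[:max_chars]
    return ""
```

Both helpers are deterministic for a given input, which matches the goal of always returning a valid record for malformed markdown.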

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor


Summary by CodeRabbit

  • Bug Fixes

    • Improved cheatsheet extraction with more robust fallback handling for titles and summaries.
    • Enhanced error resilience when extracting introduction sections from cheatsheets.
  • Refactor

    • Changed fallback tracking from global to per-operation level for better accuracy.

Walkthrough

Cheatsheet extraction now tracks fallback usage per-call instead of using a module constant. The _extract_summary function searches explicitly for "Introduction" sections and raises errors when unsuitable. The main extraction function wraps these helpers in try/except handlers, logs failures, applies fallback values, and sets a local flag that flows to metadata.
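
The flow described in this walkthrough can be sketched roughly as follows. This is a simplified stand-in, not the PR's code: `Record`, `extract_title`, `extract_summary`, `extract_record`, and the default values are hypothetical names chosen for the example. The point it illustrates is that `fallback_used` is a local variable, so one malformed input cannot affect the metadata of a later call.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class Record:  # hypothetical stand-in for CheatsheetRecord
    title: str
    summary: str
    metadata: dict = field(default_factory=dict)


def extract_title(markdown: str) -> str:
    """Return the first non-empty H1, or raise ValueError."""
    for line in markdown.splitlines():
        if line.startswith("# ") and line[2:].strip():
            return line[2:].strip()
    raise ValueError("no H1 title found")


def extract_summary(markdown: str) -> str:
    """Return the first body line under '## Introduction', or raise ValueError."""
    lines = markdown.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == "## introduction":
            for body in lines[i + 1:]:
                if body.startswith("#"):
                    break  # next heading reached before any body text
                if body.strip():
                    return body.strip()
            break
    raise ValueError("no usable Introduction section")


def extract_record(markdown: str) -> Record:
    fallback_used = False  # local to this call, not module-level state
    try:
        title = extract_title(markdown)
    except ValueError as exc:
        logger.warning("title extraction failed: %s", exc)
        title, fallback_used = "Untitled", True
    try:
        summary = extract_summary(markdown)
    except ValueError as exc:
        logger.warning("summary extraction failed: %s", exc)
        summary, fallback_used = "", True
    return Record(title, summary, {"fallback_used": fallback_used})
```

Catching only `ValueError` keeps unexpected failures visible instead of silently converting them into fallback values.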

Changes

Fallback handling and extraction robustness

| Layer / File(s) | Summary |
| --- | --- |
| Extraction function enhancements and fallback helpers<br>application/utils/external_project_parsers/parsers/cheatsheet_extractor.py | _extract_summary refactored to search for "Introduction" heading and raise ValueError when not found; new _fallback_title and _fallback_summary helper functions provide default values on extraction failure. |
| Error handling and fallback orchestration<br>application/utils/external_project_parsers/parsers/cheatsheet_extractor.py | extract_cheatsheet_record wraps title and summary extraction in try/except blocks, logs errors, applies fallback helpers on ValueError, sets per-call fallback_used flag, and propagates that flag to CheatsheetRecord.metadata instead of using the removed module constant. |
| Module constants migration<br>application/utils/external_project_parsers/parsers/cheatsheet_extractor.py | Module-level FALLBACK_USED constant removed; only PARSER_VERSION remains as module-level indicator. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • OWASP/OpenCRE#912: Overlapping changes to cheatsheet_extractor.py's title/summary extraction, fallback behavior, and FALLBACK_USED constant handling.

Suggested reviewers

  • Pa04rth

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: implement structured extraction checkpoint B3' is specific and directly related to the changeset, which implements Workstream B Checkpoint B3 with fallback handling for cheatsheet extraction. |
| Description check | ✅ Passed | The description clearly explains the PR's purpose (implementing Workstream B Checkpoint B3), the specific changes to cheatsheet_extractor.py, and the fallback handling implemented, all of which align with the actual code changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
application/utils/external_project_parsers/parsers/cheatsheet_extractor.py (1)

12-13: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Empty ## headings are currently dropped despite the “can be empty” contract.

Line 99 states that empty headings are allowed, but _HEADING_RE requires at least one character (.+), so bare ## headings (with or without trailing whitespace) are excluded instead of being preserved as empty strings.

Suggested fix
-_HEADING_RE = re.compile(r"^##\s+(?P<heading>.+)$", re.MULTILINE)
+_HEADING_RE = re.compile(r"^##(?:\s+(?P<heading>.*))?$", re.MULTILINE)
-    headings = [m.group("heading").strip() for m in _HEADING_RE.finditer(markdown)]
+    headings = [(m.group("heading") or "").strip() for m in _HEADING_RE.finditer(markdown)]

Also applies to: 99-100
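
One detail worth checking before applying a pattern like the one suggested above: under `re.MULTILINE`, `\s` also matches newlines, so a bare `##` followed directly by a line of text could capture that next line as the heading. The sketch below is one possible variant that restricts the separator to spaces and tabs. `HEADING_RE` and `extract_headings` are illustrative names, not the module's actual identifiers.

```python
import re

# [ \t] instead of \s so the separator cannot cross a line break.
HEADING_RE = re.compile(r"^##(?:[ \t]+(?P<heading>.*))?$", re.MULTILINE)


def extract_headings(markdown: str) -> list[str]:
    """Return every level-2 heading, keeping empty ones as ""."""
    return [
        (m.group("heading") or "").strip()
        for m in HEADING_RE.finditer(markdown)
    ]
```

With this pattern, `##` and `## ` both yield an empty string, while `###` headings are still ignored because the character after `##` must be a space, a tab, or the end of the line.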

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/external_project_parsers/parsers/cheatsheet_extractor.py`
around lines 12 - 13, The `_HEADING_RE` and `_ANY_HEADING_RE` currently require
at least one non-space character (using `.+`) which drops empty `##` headings;
change their patterns to allow empty heading text (replace `.+` with `.*`) while
keeping the `(?P<heading>)` group name intact so empty headings capture as an
empty/whitespace string, and ensure downstream code that uses `heading` (e.g.,
any trimming or truthiness checks) preserves empty headings per the comment at
line 99 (trim whitespace but treat the result as "" rather than dropping the
heading).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 05aa9804-df31-406d-af28-37a40b53f9af

📥 Commits

Reviewing files that changed from the base of the PR and between b637225 and a76be61.

📒 Files selected for processing (1)
  • application/utils/external_project_parsers/parsers/cheatsheet_extractor.py
