feat: implement structured extraction checkpoint B3 by Abhijeet2409 · Pull Request #921 · OWASP/OpenCRE

Abhijeet2409 · 2026-06-09T05:39:51Z

Summary

Implements Workstream B Checkpoint B3 from the OWASP Cheat Sheet to CRE Mapping pipeline RFC.

This PR is built on top of the existing Workstream B1/B2 extraction implementation (PR #912).

Changes

`cheatsheet_extractor.py`

Added deterministic fallback handling for malformed or irregular markdown inputs.

Implemented:

- fallback title extraction
- fallback summary extraction
- consistent fallback metadata handling through `fallback_used`

No fallback handling was added for headings, as empty headings are explicitly allowed by the RFC.

No fallback handling was required for:

- `source_id`
- `hyperlink`
- `raw_markdown_path`

since these are deterministically derived from the source path provided by Workstream A.

This PR specifically focuses on graceful fallback behavior for malformed markdown inputs while still returning a valid CheatsheetRecord.

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

coderabbitai · 2026-06-09T05:39:59Z

Summary by CodeRabbit

Bug Fixes
- Improved cheatsheet extraction with more robust fallback handling for titles and summaries.
- Enhanced error resilience when extracting introduction sections from cheatsheets.
Refactor
- Changed fallback tracking from global to per-operation level for better accuracy.

Walkthrough

Cheatsheet extraction now tracks fallback usage per call instead of through a module constant. The `_extract_summary` function searches explicitly for an "Introduction" section and raises `ValueError` when none is found. The main extraction function wraps the title and summary helpers in try/except handlers, logs failures, applies fallback values, and sets a local flag that flows into the record metadata.
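The summary behaviour described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the name `extract_summary_sketch`, both regular expressions, and the choice to return the first paragraph of the section are assumptions made for the example.

```python
import re

# Hypothetical pattern: a level-2 "Introduction" heading on its own line.
_INTRO_RE = re.compile(r"^##[ \t]+Introduction[ \t]*$", re.MULTILINE | re.IGNORECASE)
# Hypothetical pattern: the start of any heading that follows the section.
_NEXT_HEADING_RE = re.compile(r"^#{1,6}[ \t]", re.MULTILINE)


def extract_summary_sketch(markdown: str) -> str:
    """Return the first paragraph under '## Introduction', or raise ValueError."""
    intro = _INTRO_RE.search(markdown)
    if intro is None:
        raise ValueError("no Introduction section found")

    # Limit the search to the text between this heading and the next one.
    rest = markdown[intro.end():]
    next_heading = _NEXT_HEADING_RE.search(rest)
    body = rest[: next_heading.start()] if next_heading else rest

    # Take the first non-empty paragraph and collapse its internal whitespace.
    for paragraph in body.split("\n\n"):
        text = " ".join(paragraph.split())
        if text:
            return text
    raise ValueError("Introduction section is empty")
```

Raising `ValueError` for both the missing and the empty case lets a caller handle every unusable input with a single `except` clause.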

Changes

Fallback handling and extraction robustness

| Layer / File(s) | Summary |
|---|---|
| Extraction function enhancements and fallback helpers `application/utils/external_project_parsers/parsers/cheatsheet_extractor.py` | `_extract_summary` refactored to search for "Introduction" heading and raise `ValueError` when not found; new `_fallback_title` and `_fallback_summary` helper functions provide default values on extraction failure. |
| Error handling and fallback orchestration `application/utils/external_project_parsers/parsers/cheatsheet_extractor.py` | `extract_cheatsheet_record` wraps title and summary extraction in try/except blocks, logs errors, applies fallback helpers on `ValueError`, sets per-call `fallback_used` flag, and propagates that flag to `CheatsheetRecord.metadata` instead of using the removed module constant. |
| Module constants migration `application/utils/external_project_parsers/parsers/cheatsheet_extractor.py` | Module-level `FALLBACK_USED` constant removed; only `PARSER_VERSION` remains as module-level indicator. |
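As a rough illustration of the orchestration summarized in the table, the sketch below shows a per-call `fallback_used` flag flowing into record metadata. The `CheatsheetRecord` stand-in, the helper bodies, the fallback values (file stem for the title, empty string for the summary), and the `PARSER_VERSION` value are all hypothetical. Only the function names and the try/except-on-`ValueError` shape are taken from the summary.

```python
import logging
from dataclasses import dataclass, field
from pathlib import Path

logger = logging.getLogger(__name__)
PARSER_VERSION = "0.0-sketch"  # placeholder value, not the project's


@dataclass
class CheatsheetRecord:
    # Simplified stand-in; the real record carries more fields.
    title: str
    summary: str
    metadata: dict = field(default_factory=dict)


def _extract_title(markdown: str) -> str:
    # Simplified stand-in: first non-empty level-1 heading.
    for line in markdown.splitlines():
        if line.startswith("# ") and line[2:].strip():
            return line[2:].strip()
    raise ValueError("no level-1 title found")


def _extract_summary(markdown: str) -> str:
    # Simplified stand-in: first paragraph after "## Introduction".
    marker = "## Introduction"
    if marker not in markdown:
        raise ValueError("no Introduction section found")
    body = markdown.split(marker, 1)[1].strip()
    if not body:
        raise ValueError("Introduction section is empty")
    return body.split("\n\n", 1)[0]


def _fallback_title(source_path: str) -> str:
    # Assumed fallback: derive a title from the file name.
    return Path(source_path).stem.replace("_", " ")


def _fallback_summary() -> str:
    # Assumed fallback: an empty summary.
    return ""


def extract_cheatsheet_record(markdown: str, source_path: str) -> CheatsheetRecord:
    fallback_used = False  # local to this call, not module state

    try:
        title = _extract_title(markdown)
    except ValueError as exc:
        logger.warning("title extraction failed for %s: %s", source_path, exc)
        title = _fallback_title(source_path)
        fallback_used = True

    try:
        summary = _extract_summary(markdown)
    except ValueError as exc:
        logger.warning("summary extraction failed for %s: %s", source_path, exc)
        summary = _fallback_summary()
        fallback_used = True

    metadata = {"parser_version": PARSER_VERSION, "fallback_used": fallback_used}
    return CheatsheetRecord(title=title, summary=summary, metadata=metadata)
```

Because the flag is a local variable, a malformed input in one call cannot mark a later, well-formed record as having used a fallback.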

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

OWASP/OpenCRE#912: Overlapping changes to cheatsheet_extractor.py's title/summary extraction, fallback behavior, and FALLBACK_USED constant handling.

Suggested reviewers

Pa04rth

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: implement structured extraction checkpoint B3' is specific and directly related to the changeset, which implements Workstream B Checkpoint B3 with fallback handling for cheatsheet extraction. |
| Description check | ✅ Passed | The description clearly explains the PR's purpose (implementing Workstream B Checkpoint B3), the specific changes to cheatsheet_extractor.py, and the fallback handling implemented, all of which align with the actual code changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

application/utils/external_project_parsers/parsers/cheatsheet_extractor.py (1)

12-13: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Empty `##` headings are currently dropped despite the “can be empty” contract.

Line 99 states empty headings are allowed, but `_HEADING_RE` requires at least one character (`.+`), so `##` / `## ` headings are excluded instead of being preserved as empty strings.

Suggested fix

-_HEADING_RE = re.compile(r"^##\s+(?P<heading>.+)$", re.MULTILINE)
+_HEADING_RE = re.compile(r"^##(?:\s+(?P<heading>.*))?$", re.MULTILINE)

-    headings = [m.group("heading").strip() for m in _HEADING_RE.finditer(markdown)]
+    headings = [(m.group("heading") or "").strip() for m in _HEADING_RE.finditer(markdown)]

Also applies to: 99-100
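The effect of the suggested pattern can be checked with a small standalone sketch. The sample input and the helper name `extract_headings` are made up for illustration. One detail worth noting: in Python's `re`, `\s` also matches `\n`, so the sketch additionally tries a variant that uses `[ \t]` to keep each match on a single line. That variant is an assumption of this example, not part of the suggested fix.

```python
import re

# Pattern as proposed in the suggested fix above.
SUGGESTED_RE = re.compile(r"^##(?:\s+(?P<heading>.*))?$", re.MULTILINE)

# Assumed variant for this sketch: [ \t] instead of \s, so the separator
# cannot consume a newline and pull the following line into the heading.
SAME_LINE_RE = re.compile(r"^##(?:[ \t]+(?P<heading>.*))?$", re.MULTILINE)


def extract_headings(markdown: str, pattern: re.Pattern) -> list:
    """Collect level-2 headings, keeping empty ones as ""."""
    return [(m.group("heading") or "").strip() for m in pattern.finditer(markdown)]


sample = "## Introduction\ntext\n##\nbody line\n### Deeper\n"

print(extract_headings(sample, SAME_LINE_RE))
print(extract_headings(sample, SUGGESTED_RE))
```

On this sample the single-line variant keeps the bare `##` as an empty string, while the `\s+` form captures the text line after the bare `##` as if it were the heading. Whether that matters depends on how often bare `##` lines occur in the real cheat sheets.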

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/external_project_parsers/parsers/cheatsheet_extractor.py`
around lines 12 - 13, The `_HEADING_RE` and `_ANY_HEADING_RE` currently require
at least one non-space character (using `.+`) which drops empty `##` headings;
change their patterns to allow empty heading text (replace `.+` with `.*`) while
keeping the `(?P<heading>)` group name intact so empty headings capture as an
empty/whitespace string, and ensure downstream code that uses `heading` (e.g.,
any trimming or truthiness checks) preserves empty headings per the comment at
line 99 (trim whitespace but treat the result as "" rather than dropping the
heading).


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 05aa9804-df31-406d-af28-37a40b53f9af

📥 Commits

Reviewing files that changed from the base of the PR and between b637225 and a76be61.

📒 Files selected for processing (1)

application/utils/external_project_parsers/parsers/cheatsheet_extractor.py

Abhijeet2409 added 5 commits June 9, 2026 11:00

feat: implement structured extraction checkpoints B1 and B2

ade7c18

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

docs: add docstrings

a9e54a3

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

fix: validate normalized string field values correctly

26d7e92

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

fix: validate normalized string field values correctly

dc9d2d0

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

feat: implement structured extraction checkpoint B3

a76be61

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

Abhijeet2409 mentioned this pull request Jun 9, 2026

feat: implement structured extraction checkpoint B3 Abhijeet2409/OpenCRE#1

Closed

Abhijeet2409 marked this pull request as ready for review June 9, 2026 05:44

coderabbitai Bot reviewed Jun 9, 2026


Merge branch 'main' into feature/structured-extraction-b3

2ba1f4b

feat: implement structured extraction checkpoint B3 #921
Abhijeet2409 wants to merge 6 commits into OWASP:main from Abhijeet2409:feature/structured-extraction-b3
