Fix OpenCRE map analysis 503 on Heroku (#915)#918
Conversation
Fix #292 Co-authored-by: Spyros <northdpole@users.noreply.github.com>
Co-authored-by: Spyros <northdpole@users.noreply.github.com>
Co-authored-by: Spyros <northdpole@users.noreply.github.com>
* Update BodyText.tsx on OpenCRE Chat * Chatbot disclaimer improvement and SIG mentioning * Improved layout of chatbot bottom text --------- Co-authored-by: Spyros <northdpole@users.noreply.github.com>
Serve precomputed OpenCRE GA from cache on Heroku instead of computing on the web dyno, expand backfill to include automatic CRE links, and harden PCI DSS / Secure Headers imports with better linking and parser fixes. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Wondering what really moved? Review this PR in Change Stack to inspect semantic changes, definitions, and references. No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
🚧 Files skipped from review as they are similar to previous changes (6)
Summary by CodeRabbit
WalkthroughThis PR adds OpenCRE-directed gap-analysis overlap computation and caching, enhances PCI DSS and Secure Headers parsers with embeddings-based CRE resolution and fallbacks, wires CLI/Make backfill support, updates the map-analysis endpoint to use cached results and fast-fail on Heroku, and adds tests plus utilities for PCI mapping and syncing gap-analysis caches. ChangesOpenCRE Gap Analysis Backfill and Parser Integration
🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
application/tests/secure_headers_parser_test.py (1)
47-69:⚠️ Potential issue | 🟠 MajorFix dict equality assertion in
secure_headers_parser_test
assertCountEqual()can pass even when dict values differ (it doesn’t enforce full dict/value equality). Sinceexpected.todict()andnodes[0].todict()are dict payloads, useassertEqual()instead.Suggested fix
- self.assertCountEqual(expected.todict(), nodes[0].todict()) + self.assertEqual(expected.todict(), nodes[0].todict())🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/tests/secure_headers_parser_test.py` around lines 47 - 69, The test currently uses assertCountEqual(expected.todict(), nodes[0].todict()) which can miss value mismatches; replace that call with self.assertEqual(expected.todict(), nodes[0].todict()) in the secure_headers_parser_test so the dict payloads from expected and nodes[0] (built from SecureHeaders().name / entries.results) are compared for exact equality.
🧹 Nitpick comments (2)
application/utils/gap_analysis.py (2)
131-152: ⚡ Quick winConsider documenting the deduplication behavior.
The early return at line 147-148 combined with the link sort order in the caller ensures that when a node has both
LinkedToandAutomaticallyLinkedTorelationships to the same CRE, only the manual link is kept. This is correct behavior (manual links are more authoritative), but it's subtle and would benefit from a comment explaining the preference.📝 Suggested comment
def _add_direct_link_result( grouped_paths: Dict[str, Dict[str, Any]], start_document: defs.Document, end_document: defs.Document, *, ltype: defs.LinkTypes = defs.LinkTypes.LinkedTo, ) -> None: + # Deduplicates paths by end_document.id. Combined with caller's link sort + # (manual before automatic), this ensures manual LinkedTo relationships + # take precedence over AutomaticallyLinkedTo when both exist to the same CRE. shared_paths = grouped_paths.setdefault(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/gap_analysis.py` around lines 131 - 152, Add a brief inline comment in _add_direct_link_result explaining the deduplication: note that the early return when path_key already exists, together with the caller's link sort order, ensures that when both defs.LinkTypes.LinkedTo and defs.LinkTypes.AutomaticallyLinkedTo exist for the same end_document only the manual (LinkedTo) link is retained; mention that this is intentional because manual links are preferred over automatically generated ones, so the function intentionally skips overwriting an existing path.
230-254: ⚖️ Poor tradeoffOptional: Cache the remaining-pairs count instead of recomputing.
Line 251 calls
missing_opencre_direct_pairs(collection)again for logging, which re-queries the cache for all pairs. Consider storing the initial missing count and decrementing it, or only computing the final count when logging is actually needed.♻️ Optional optimization
+ initial_missing = len(todo) written = 0 for pair in todo: cache_key = make_resources_key(pair) if build_direct_cre_overlap_map_analysis(pair, cache_key, collection): written += 1 logger.info( "OpenCRE direct GA backfill: wrote=%s remaining=%s", written, - len(missing_opencre_direct_pairs(collection)), + initial_missing - written, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/gap_analysis.py` around lines 230 - 254, The logging recomputes missing_opencre_direct_pairs(collection) at the end; change backfill_opencre_direct_pairs to capture the initial todo list/count (e.g. missing_count = len(todo) when todo comes from missing_opencre_direct_pairs) and then compute remaining as missing_count - written (or decrement a remaining counter inside the loop) so you avoid calling missing_opencre_direct_pairs(collection) again; keep existing behavior for the refresh path (use len(todo) there) and still use cache_key/make_resources_key and build_direct_cre_overlap_map_analysis as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@application/tests/pci_dss_parser_test.py`:
- Around line 42-55: The test test_resolve_cre_falls_back_to_bridge_standard is
brittle because it assumes a fixed number of bridge standards; make the test
deterministic by patching the module-level bridge-standards constant before
calling resolve_cre_for_pci_control: use patch.object (or monkeypatch) to set
pci_mod.PCI_BRIDGE_STANDARDS (or PCI_DSS_BRIDGE_STANDARDS if that’s the actual
name in the module) to a fixed tuple (e.g. ("std1", "std2")) inside the test,
then run the existing assertions against best_cre_via_bridge_standard and
resolve_cre_for_pci_control so the bridge_mock.call_count assertion no longer
depends on external env configuration.
In `@application/tests/secure_headers_parser_test.py`:
- Around line 97-103: The current assertions on nodes (from
entries.results[secure_headers.SecureHeaders().name]) only check sets and allow
swapped associations; update the test to assert the exact mapping from
node.section to node.links[0].document.id by building a dict or mapping from
{node.section: node.links[0].document.id} and comparing it to the expected
mapping {"First": "636-347", "Second": "743-110"} so the association between
section and CRE id is verified (use the existing nodes variable and
node.section/node.links[0].document.id to locate values).
In `@application/utils/external_project_parsers/parsers/pci_dss.py`:
- Around line 20-37: PCI_DSS_CRE_SIMILARITY_THRESHOLDS, PCI_BRIDGE_STANDARDS and
PCI_BRIDGE_MIN_SIMILARITY are parsed directly from env vars and can raise
ValueError at import time; change them to robust parsing: for
PCI_DSS_CRE_SIMILARITY_THRESHOLDS split the env string, try to parse each part
to float, filter out invalid entries and fall back to the default tuple
(0.55,0.45,0.35) if none valid; for PCI_BRIDGE_STANDARDS split and strip as now
but ensure empty results are ignored and fall back to the default list ("NIST
800-53 v5","ISO 27001","ASVS","CWE") if none valid; for
PCI_BRIDGE_MIN_SIMILARITY read the env, attempt float conversion in a try/except
and use 0.4 on failure or if the value is not finite; optionally emit a
warning/log when an env value is ignored but do not let exceptions propagate at
import.
In `@scripts/sync_gap_analysis_table.py`:
- Around line 22-26: The helper _pg_host_is_loopback incorrectly treats
"0.0.0.0" as a loopback host; update the function so it no longer classifies
"0.0.0.0" as loopback by removing "0.0.0.0" from the tuple checked in
_pg_host_is_loopback (keep checks for "127.0.0.1", "localhost", "::1" and the
existing empty-host check), so destination safety checks won't treat 0.0.0.0 as
local.
- Around line 28-33: In _fetch_sqlite_rows, avoid coercing ga_object NULLs into
the string "None": preserve SQL NULLs by returning None for v when it's NULL and
only str() non-null values; update the return type from List[Tuple[str, str]] to
List[Tuple[str, Optional[str]]] and add Optional to the imports. Concretely,
change the list comprehension to something like [(str(k), None if v is None else
str(v)) for k, v in cur.fetchall()] and adjust the function signature/type hints
accordingly.
---
Outside diff comments:
In `@application/tests/secure_headers_parser_test.py`:
- Around line 47-69: The test currently uses assertCountEqual(expected.todict(),
nodes[0].todict()) which can miss value mismatches; replace that call with
self.assertEqual(expected.todict(), nodes[0].todict()) in the
secure_headers_parser_test so the dict payloads from expected and nodes[0]
(built from SecureHeaders().name / entries.results) are compared for exact
equality.
---
Nitpick comments:
In `@application/utils/gap_analysis.py`:
- Around line 131-152: Add a brief inline comment in _add_direct_link_result
explaining the deduplication: note that the early return when path_key already
exists, together with the caller's link sort order, ensures that when both
defs.LinkTypes.LinkedTo and defs.LinkTypes.AutomaticallyLinkedTo exist for the
same end_document only the manual (LinkedTo) link is retained; mention that this
is intentional because manual links are preferred over automatically generated
ones, so the function intentionally skips overwriting an existing path.
- Around line 230-254: The logging recomputes
missing_opencre_direct_pairs(collection) at the end; change
backfill_opencre_direct_pairs to capture the initial todo list/count (e.g.
missing_count = len(todo) when todo comes from missing_opencre_direct_pairs) and
then compute remaining as missing_count - written (or decrement a remaining
counter inside the loop) so you avoid calling
missing_opencre_direct_pairs(collection) again; keep existing behavior for the
refresh path (use len(todo) there) and still use cache_key/make_resources_key
and build_direct_cre_overlap_map_analysis as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 838fd5b3-87b0-4163-91ce-5da53c971c5c
📒 Files selected for processing (13)
Makefileapplication/cmd/cre_main.pyapplication/tests/opencre_gap_analysis_test.pyapplication/tests/pci_dss_parser_test.pyapplication/tests/secure_headers_parser_test.pyapplication/tests/web_main_test.pyapplication/utils/external_project_parsers/parsers/pci_dss.pyapplication/utils/external_project_parsers/parsers/secure_headers.pyapplication/utils/gap_analysis.pyapplication/web/web_main.pycre.pyscripts/compute_pci_dss_cre_mappings.pyscripts/sync_gap_analysis_table.py
Harden PCI env parsing, tighten sync script safety checks, make bridge fallback tests deterministic, and format files flagged by CI black. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
AutomaticallyLinkedToCRE links and addmake backfill-opencre-ga/--ga_backfill_opencre_direct.Thanks to @5anjeev for reporting #915 with a clear repro — that made it straightforward to track down the Heroku timeout on OpenCRE map analysis.
Prod validation
opencreorg(Heroku v937).gap_analysis_resultscache to production Postgres.Test plan
python -m unittest application.tests.web_main_test application.tests.opencre_gap_analysis_test application.tests.pci_dss_parser_test application.tests.secure_headers_parser_testFixes #915