fix(web): restore Heroku cache-only map_analysis (503 regression)#924

Open
northdpole wants to merge 4 commits into main from fix/heroku-map-analysis-cache-only

@northdpole

Collaborator

Summary

  • Restores Heroku/read-only cache-only behavior for GET /rest/v1/map_analysis. PR #823 ("Add curated CWE fallback mappings and coverage for issue #472", d796ff53) reintroduced the Redis/Neo4j fallback on cache miss, which causes a 503 when Neo4j DNS fails on opencreorg.
  • On Heroku (DYNO or HEROKU): serve material SQL cache hits; return 404 on cache miss (no Redis queue, no Neo4j on web dyno).
  • OpenCRE fast path on Heroku also checks SQL cache before live computation.
  • Adds scripts/monitor_ga_health.py + make monitor-ga-health-prod to detect incomplete GA pairs and flag HTTP 503 regressions.
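
The cache-only rule described in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in application/web/web_main.py: `is_heroku_deploy`, `lookup_map_analysis`, the dict-based cache, and the cache key format are hypothetical stand-ins. Only the DYNO/HEROKU variable names and the 404-on-miss behavior come from this description.

```python
from typing import Mapping, Optional, Tuple


def is_heroku_deploy(environ: Mapping[str, str]) -> bool:
    """Treat the process as a Heroku web dyno if DYNO or HEROKU is set."""
    return bool(environ.get("DYNO") or environ.get("HEROKU"))


def lookup_map_analysis(
    cache: Mapping[str, dict],
    cache_key: str,
    environ: Mapping[str, str],
) -> Tuple[str, Optional[dict]]:
    """Return a (decision, body) pair for one map_analysis lookup.

    Decisions (hypothetical labels):
      "hit"      -> serve the cached body (HTTP 200)
      "miss_404" -> cache-only mode, respond with HTTP 404
      "fallback" -> not on Heroku, caller may queue or compute
    """
    cached = cache.get(cache_key)
    # A hit only counts when the cached result is non-empty ("material").
    if cached and cached.get("result"):
        return "hit", cached
    if is_heroku_deploy(environ):
        # Cache-only: no Redis queue and no Neo4j on the web dyno.
        return "miss_404", None
    return "fallback", None
```

With a cache such as `{"ASVS>>ENISA": {"result": {"a": 1}}}`, a lookup under `{"DYNO": "web.1"}` yields `"hit"` for that key and `"miss_404"` for any other key, while the same miss with an empty environment yields `"fallback"`.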

Root cause

PR #823 removed `if os.environ.get("HEROKU"): abort(404)` and replaced the Redis-unavailable fallback with direct `db.gap_analysis()` (Neo4j). Production was force-updated from the prior #915 fix back to d796ff53.

Heroku logs for failing pairs (e.g. ENISA→ISO 27001):

  1. Non-material/empty SQL cache row → cache miss
  2. Redis/RQ unavailable (localhost:6379 connection refused)
  3. Neo4j DNS failure → 503

Closes #923

Test plan

  • python -m unittest application.tests.web_main_test.TestMain.test_gap_analysis_heroku_cache_miss_returns_404
  • python -m unittest application.tests.web_main_test.TestMain.test_gap_analysis_dyno_cache_miss_returns_404
  • python -m unittest application.tests.web_main_test.TestMain.test_map_analysis_opencre_heroku_cache_miss_returns_404
  • python -m unittest application.tests.web_main_test.TestMain.test_gap_analysis_fallback_backend_failure_returns_503
  • python -m unittest application.tests.monitor_ga_health_test
  • Deploy to opencreorg and verify:
    • ASVS + ENISA → 200 (material cache)
    • ENISA + ISO 27001 → 404 (not 503)
    • make monitor-ga-health-prod reports 0 http_503_regression

Follow-up (data)

Production DB audit: 118/462 directed pairs have material cache; 305 have empty placeholder rows; 39 missing. Those pairs need offline GA backfill (Neo4j workers), separate from this 503 regression fix.
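
The audit figures can be cross-checked with a few lines of arithmetic. Note that the count of 22 standards below is an inference (22 × 21 = 462 ordered pairs), not a number stated in this PR.

```python
from itertools import permutations

# Counts reported in the production DB audit above.
material, empty_placeholder, missing = 118, 305, 39
total_pairs = material + empty_placeholder + missing

# Assumption: "directed pairs" means ordered pairs of distinct standards,
# so n standards yield n * (n - 1) pairs. n = 22 is inferred, not sourced.
n_standards = 22
directed_pairs = sum(1 for _ in permutations(range(n_standards), 2))

# Pairs that still need the offline GA backfill.
needs_backfill = empty_placeholder + missing

print(total_pairs, directed_pairs, needs_backfill)
```

The three audit buckets sum to the stated total, and 344 of the 462 pairs are either empty placeholders or missing.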

Made with Cursor

…nitor

PR #823 reintroduced Neo4j/Redis fallback on Heroku cache misses, causing 503s
when Neo4j DNS fails. Serve precomputed GA from Postgres only on Heroku and
return 404 on cache miss. Add monitor_ga_health.py for production regression
alerting (especially HTTP 503).

Fixes #923

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Contributor

Review Change Stack

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: cbaa6426-8a45-4cef-a3a6-a5b5efbfd631

📥 Commits

Reviewing files that changed from the base of the PR and between a49b7d9 and c7cc9cc.

📒 Files selected for processing (7)
  • .gitignore
  • AGENTS.md
  • application/database/db.py
  • application/tests/db_test.py
  • application/tests/gap_analysis_pair_job_test.py
  • application/utils/gap_analysis.py
  • scripts/monitor_ga_health.py

Walkthrough

This PR adds Heroku environment detection to prevent Neo4j fallback on cache misses, refactors Redis error handling, and introduces a CLI monitoring tool to detect incomplete gap-analysis pairs and HTTP 503 regressions. The changes address a production issue where map_analysis returned 503 errors on read-only Heroku dynos due to failed fallback attempts.

Changes

Heroku Cache-Only Protection and Gap-Analysis Monitoring

| Layer / File(s) | Summary |
| --- | --- |
| Heroku deployment detection and cache-miss guards (application/web/web_main.py) | Add _is_heroku_deploy() helper to detect Heroku/read-only dynos from DYNO and HEROKU environment variables. Apply the helper to the OpenCRE fast path and upstream cached-result path in map_analysis to abort with 404 on cache misses when Heroku is detected, preventing fallback attempts. Remove unused HTTPException import. |
| Redis unavailable error handling refactor (application/web/web_main.py) | Replace inline Neo4j fallback logic with call to _compute_ga_without_redis(database, standards) helper. On exception, log and return 503 with "Gap analysis unavailable" message instead of attempting production database queries on Heroku. |
| Heroku behavior integration tests (application/tests/web_main_test.py) | Add environment-isolated tests: backend-failure test clears DYNO/HEROKU env to isolate 503 condition; DYNO cache-miss test validates 404 response under DYNO=web.1 with cache miss; new HEROKU=True test exercises OpenCRE path with cache-miss expecting 404. |
| Gap-analysis health monitoring script (scripts/monitor_ga_health.py) | Implement CLI that fetches all standards, iterates all directed pairs, tests completeness via /map_analysis, buckets outcomes by HTTP status (including 503 regression), JSON validity, and result materiality. Aggregates counts, samples incomplete pairs, optionally writes JSON report, and posts webhook on failures. Exit codes: 0 (success), 1 (incomplete pairs), 2 (configuration/fetch failures). |
| Monitoring script unit tests (application/tests/monitor_ga_health_test.py) | Unit test module covering monitoring helpers: test_material_result_detection validates _http_gap_result_is_material for empty/non-empty dicts and None; test_503_bucket_is_regression_marker verifies bucket constant. |
| Production monitoring automation (Makefile) | Add monitor-ga-health-prod target that activates the virtualenv and runs the monitoring script against https://opencre.org, writing the JSON report to tmp/prod-ga-health.json. |
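
The bucketing and exit-code behavior summarized for the monitoring script might look roughly like the sketch below. Only the bucket name `http_503_regression` and the exit-code meanings (0, 1, 2) are taken from this PR. The function names and every other bucket label are hypothetical, and the real scripts/monitor_ga_health.py may be structured differently.

```python
import json
from collections import Counter
from typing import Iterable, Optional


def bucket_for(status: int, body: Optional[str]) -> str:
    """Classify one /map_analysis response into a named bucket."""
    if status == 503:
        return "http_503_regression"  # bucket name taken from the PR
    if status != 200:
        return f"http_{status}"  # hypothetical naming for other statuses
    try:
        payload = json.loads(body or "")
    except json.JSONDecodeError:
        return "invalid_json"  # hypothetical label
    result = payload.get("result") if isinstance(payload, dict) else None
    # An empty or absent result counts as non-material.
    return "ok" if result else "empty_result"  # hypothetical labels


def exit_code(buckets: Iterable[str], fetch_failed: bool = False) -> int:
    """0 = all pairs complete, 1 = incomplete pairs, 2 = config/fetch failure."""
    if fetch_failed:
        return 2
    counts = Counter(buckets)
    incomplete = sum(n for name, n in counts.items() if name != "ok")
    return 1 if incomplete else 0
```

Keeping the classification in a pure function like this makes the 503 case testable without any network access.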

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: restoring Heroku cache-only behavior for map_analysis and fixing the 503 regression caused by Neo4j fallback. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining root cause, expected behavior, test plan, and deployment verification steps. |
| Linked Issues check | ✅ Passed | The PR successfully addresses all objectives from issue #923: restores cache-only Heroku behavior [web_main.py, web_main_test.py], prevents 503s on cache miss [web_main.py returns 404], adds regression tests [web_main_test.py, monitor_ga_health_test.py], and provides monitoring utility [scripts/monitor_ga_health.py, Makefile]. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the 503 regression: Heroku detection and cache-only logic [web_main.py], comprehensive tests [web_main_test.py, monitor_ga_health_test.py], and monitoring tooling [scripts/monitor_ga_health.py, Makefile]. |

northdpole and others added 3 commits June 10, 2026 08:56
Cloudflare blocks anonymous urllib requests to ga_standards on production.

Co-authored-by: Cursor <cursoragent@cursor.com>
Allow AGENTS.md through the *.md gitignore exception and document that
Heroku/opencreorg gap analysis is cache-only (no compute on production).

Co-authored-by: Cursor <cursoragent@cursor.com>
Guard add_gap_analysis_result so non-material {"result":{}} primary rows
are not inserted and cannot overwrite material cache; subresource keys unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
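
The guard described in the last commit message can be pictured with the following minimal sketch. It uses an in-memory dict instead of the SQL cache, and `is_material` and `add_primary_result_guarded` are hypothetical names. The real add_gap_analysis_result in application/database/db.py has its own signature and also handles subresource keys, which this sketch leaves out.

```python
from typing import Dict


def is_material(payload: dict) -> bool:
    """A payload is material when its "result" entry is non-empty."""
    return bool(payload.get("result"))


def add_primary_result_guarded(
    store: Dict[str, dict], cache_key: str, payload: dict
) -> bool:
    """Write a primary cache row unless the payload is non-material.

    Returns True when the row was written, False when it was skipped.
    """
    if not is_material(payload):
        # Skip {"result": {}} so it is neither inserted as a new row
        # nor allowed to replace an existing material row.
        return False
    store[cache_key] = payload
    return True
```

Because the check happens before any write, an empty payload arriving after a good one leaves the good row untouched.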

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
application/web/web_main.py (1)

15-20: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove duplicate import.

Lines 15 and 20 are identical:

from application.utils import oscal_utils, redis
Proposed fix
-from application.utils import oscal_utils, redis
-
-from rq import job, exceptions
-from rq import Queue
-
 from application.utils import oscal_utils, redis
+from rq import job, exceptions
+from rq import Queue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/web/web_main.py` around lines 15 - 20, The file contains a
duplicated import statement for oscal_utils and redis; remove the redundant line
so only one "from application.utils import oscal_utils, redis" remains (keep the
import that aligns with the other rq imports). Locate the duplicate import of
oscal_utils and redis and delete the second occurrence to avoid redundant
imports.
🧹 Nitpick comments (2)
application/tests/monitor_ga_health_test.py (1)

7-10: ⚡ Quick win

Add test coverage for list cases.

The implementation of _http_gap_result_is_material handles list inputs (lines 32-33 in monitor_ga_health.py), but the test only covers dict and None cases. Add test cases for both empty and non-empty lists to ensure complete coverage.

🧪 Proposed test additions
     def test_material_result_detection(self) -> None:
         self.assertTrue(monitor._http_gap_result_is_material({"a": 1}))
         self.assertFalse(monitor._http_gap_result_is_material({}))
         self.assertFalse(monitor._http_gap_result_is_material(None))
+        self.assertTrue(monitor._http_gap_result_is_material([1, 2]))
+        self.assertFalse(monitor._http_gap_result_is_material([]))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/tests/monitor_ga_health_test.py` around lines 7 - 10, Update the
test_material_result_detection test to cover list cases for
monitor._http_gap_result_is_material: add assertions that a non-empty list
(e.g., [1] or ["x"]) returns True and an empty list ([]) returns False so both
list branches in _http_gap_result_is_material are exercised; keep the existing
dict and None assertions and place the new list assertions alongside them in the
same test method.
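
For context, a materiality check that treats dicts and lists alike could be as small as the sketch below. `gap_result_is_material` is a hypothetical stand-in. The actual `_http_gap_result_is_material` in scripts/monitor_ga_health.py is not shown in this review and may differ.

```python
from typing import Any


def gap_result_is_material(result: Any) -> bool:
    """True for non-empty dicts and lists; False for None, empties, other types."""
    if isinstance(result, (dict, list)):
        return len(result) > 0
    return False
```

Under this sketch, the five assertions in the proposed test (non-empty dict, empty dict, None, non-empty list, empty list) each exercise a distinct outcome.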
scripts/monitor_ga_health.py (1)

153-155: ⚡ Quick win

Dead code: HTTP error check is unreachable.

urllib.request.urlopen raises urllib.error.HTTPError for HTTP error status codes (4xx, 5xx) before returning a response object. Therefore, the check on line 154 is unreachable—if resp.status >= 400, an HTTPError exception would have been raised before entering the with block.

The error is correctly caught by the try/except in main() (lines 228-233), so functionality is not affected. You can safely remove lines 154-155.

♻️ Proposed fix to remove dead code
     with urllib.request.urlopen(req, timeout=30) as resp:
-        if resp.status >= 400:
-            raise RuntimeError(f"webhook returned HTTP {resp.status}")
+        pass  # HTTPError is raised automatically by urlopen for status >= 400

Or simply:

-    with urllib.request.urlopen(req, timeout=30) as resp:
-        if resp.status >= 400:
-            raise RuntimeError(f"webhook returned HTTP {resp.status}")
+    with urllib.request.urlopen(req, timeout=30):
+        pass  # HTTPError is raised automatically by urlopen for status >= 400
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/monitor_ga_health.py` around lines 153 - 155, Remove the unreachable
HTTP status check inside the urllib.request.urlopen context: the call to urlopen
raises urllib.error.HTTPError for 4xx/5xx responses, so delete the "if
resp.status >= 400: raise RuntimeError(...)" block that references resp.status
(the variables req and resp are in the same scope) and rely on the existing
try/except in main() to handle HTTPError instead.
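
The behavior this comment relies on can be checked locally with a throwaway server that always answers 503. The sketch below is a demonstration only. The `Always503` handler and `post_and_report` helper are hypothetical, and the request goes through an explicit opener with proxies disabled (so a configured proxy cannot intercept the localhost call), which uses the same default error handling as `urllib.request.urlopen`.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Always503(BaseHTTPRequestHandler):
    """Throwaway handler that answers every POST with 503."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # drain the request body
        self.send_response(503)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args) -> None:
        pass  # keep the demo quiet


def post_and_report(url: str) -> str:
    """POST to url and describe how the outcome surfaced."""
    req = urllib.request.Request(url, data=b"{}", method="POST")
    # Empty ProxyHandler disables environment proxies for this call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(req, timeout=5) as resp:
            return f"returned {resp.status}"
    except urllib.error.HTTPError as exc:
        exc.close()
        return f"raised HTTPError {exc.code}"


# Bind to port 0 so the OS picks a free port.
server = HTTPServer(("127.0.0.1", 0), Always503)
threading.Thread(target=server.serve_forever, daemon=True).start()
outcome = post_and_report(f"http://127.0.0.1:{server.server_port}/hook")
server.shutdown()
server.server_close()
print(outcome)
```

The `with` body is never entered for the 503 response, which is why a status check placed inside it cannot fire.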
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 7612f55f-196e-4736-b05f-ad5498e8d5ab

📥 Commits

Reviewing files that changed from the base of the PR and between d796ff5 and a49b7d9.

📒 Files selected for processing (5)
  • Makefile
  • application/tests/monitor_ga_health_test.py
  • application/tests/web_main_test.py
  • application/web/web_main.py
  • scripts/monitor_ga_health.py

Comment on lines +12 to +14
    def test_503_bucket_is_regression_marker(self) -> None:
        bucket = "http_503_regression"
        self.assertEqual(bucket, "http_503_regression")

Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Test is ineffective: validates only a local variable, not the production code.

This test assigns a local variable and asserts it equals itself, providing no validation of the monitor_ga_health module. Given that detecting HTTP 503 regressions is a core objective of this PR, the test should verify that _check_pair actually returns "http_503_regression" as the bucket when a 503 status is encountered.

🧪 Proposed fix to test the actual code
-    def test_503_bucket_is_regression_marker(self) -> None:
-        bucket = "http_503_regression"
-        self.assertEqual(bucket, "http_503_regression")
+    def test_503_bucket_is_regression_marker(self) -> None:
+        """Verify that HTTP 503 responses are bucketed as regression markers."""
+        # Mock a 503 response by testing _check_pair's classification logic
+        # This would require mocking urllib or testing the bucket assignment directly
+        # At minimum, verify the constant used in the code:
+        from unittest.mock import patch, Mock
+        
+        mock_resp = Mock()
+        mock_resp.status = 503
+        mock_resp.read.return_value = b'{"error": "Service Unavailable"}'
+        
+        with patch('urllib.request.urlopen', return_value=mock_resp):
+            result = monitor._check_pair("https://test.example.com/rest/v1", "ASVS", "NIST", 10)
+            self.assertIsNotNone(result)
+            self.assertEqual(result["bucket"], "http_503_regression")
+            self.assertEqual(result["status_code"], 503)

Alternative minimal fix (if mocking is too complex for this test suite):

-    def test_503_bucket_is_regression_marker(self) -> None:
-        bucket = "http_503_regression"
-        self.assertEqual(bucket, "http_503_regression")
+    def test_503_bucket_is_regression_marker(self) -> None:
+        """Verify the 503 regression bucket constant is correctly defined."""
+        # Test that the expected bucket name matches what's used in _check_pair
+        # by inspecting the module's source or a reference constant
+        import inspect
+        source = inspect.getsource(monitor._check_pair)
+        self.assertIn('"http_503_regression"', source, 
+                      "503 responses must be bucketed as http_503_regression")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/tests/monitor_ga_health_test.py` around lines 12 - 14, The test
currently asserts a local variable and should instead invoke
monitor_ga_health._check_pair with a simulated input that represents a 503
response and assert the returned bucket equals "http_503_regression"; update
test_503_bucket_is_regression_marker to construct the same Pair/record shape
used by monitor_ga_health (or mock the dependency that provides the status) so
_check_pair sees status 503, call monitor_ga_health._check_pair(...), and assert
the result is "http_503_regression".
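
One way to exercise the 503 path without network access is to make the patched `urlopen` raise `HTTPError`, since `urlopen` raises for 5xx responses rather than returning them. The sketch below is self-contained and makes several assumptions. `check_pair` is a hypothetical stand-in for the monitor's `_check_pair`, whose real signature and return shape are not shown in this review, and the `standard=` query parameters and the `http_<code>` bucket naming are assumed for illustration.

```python
import email.message
import unittest
import urllib.error
import urllib.parse
import urllib.request
from unittest.mock import patch


def check_pair(base_url: str, source: str, target: str, timeout: int) -> dict:
    """Hypothetical stand-in for the monitor's per-pair check."""
    query = urllib.parse.urlencode([("standard", source), ("standard", target)])
    url = f"{base_url}/map_analysis?{query}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"bucket": "ok", "status_code": resp.status}
    except urllib.error.HTTPError as exc:
        # 503 gets the dedicated regression bucket; other codes are generic.
        bucket = "http_503_regression" if exc.code == 503 else f"http_{exc.code}"
        return {"bucket": bucket, "status_code": exc.code}


class Check503Test(unittest.TestCase):
    def test_503_is_bucketed_as_regression(self) -> None:
        error = urllib.error.HTTPError(
            "https://test.example.com",
            503,
            "Service Unavailable",
            email.message.Message(),
            None,
        )
        # The mock raises instead of returning a response, matching urlopen.
        with patch("urllib.request.urlopen", side_effect=error):
            result = check_pair(
                "https://test.example.com/rest/v1", "ASVS", "NIST", 10
            )
        self.assertEqual(result["bucket"], "http_503_regression")
        self.assertEqual(result["status_code"], 503)
```

A test of this shape fails if the 503 branch is renamed or removed, which the original self-comparing assertion cannot detect.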

Development

Successfully merging this pull request may close these issues.

Production map_analysis 503 regression: Neo4j fallback on Heroku cache miss