week_1: Module C (The Librarian) — RFC contracts + eval harness + golden dataset#922
week_1: Module C (The Librarian) — RFC contracts + eval harness + golden dataset#922PRAteek-singHWY wants to merge 2 commits into
Conversation
dataset Contracts + regression ruler before any pipeline code, per the OIE RFC's 'test before the code' directive. RFC OWASP#734 envelopes (KnowledgeItem in, LinkProposal/ReviewItem out) as Pydantic v2, drift-guarded against the vendored owasp-graph schemas; TRACT hub-firewall + multi-link scoring; 319-row golden dataset derived from standards_cache.sqlite with --check drift detection. One prod edit: pydantic>=2,<3 pin.
Summary by CodeRabbit
WalkthroughThis PR implements Module C (The Librarian): RFC-aligned JSON schemas and Pydantic models, environment-backed frozen configuration loader, hub-firewall and scoring utilities, deterministic golden dataset builder and fixtures, an evaluation harness, and comprehensive unit/schema tests. ChangesModule C — The Librarian
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
application/utils/librarian/scoring.py (1)
21-22: 💤 Low valueOptional: Remove unreachable defensive check.
Since line 19 returns early when both sets are empty, the union on line 20 cannot be empty, making lines 21-22 unreachable. You may remove this check for clarity.
♻️ Simplify by removing unreachable code
def jaccard(a: Sequence[str], b: Sequence[str]) -> float: sa, sb = set(a), set(b) if not sa and not sb: return 1.0 union = sa | sb - if not union: - return 0.0 return len(sa & sb) / len(union)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/scoring.py` around lines 21 - 22, Remove the unreachable defensive check that returns 0.0 when "union" is falsy: because there is already an early return when both input sets are empty (the return at line 19), "union" cannot be empty here—delete the if not union: return 0.0 block in the scoring calculation (the lines referencing the variable "union") so the function's flow is clearer and avoids dead code.application/tests/librarian/config_loader_test.py (1)
48-53: ⚡ Quick winAdd boundary tests for numeric domain constraints.
Once loader invariants are enforced, this suite should assert failures for out-of-range numeric env values (e.g.,
LINK_THRESHOLD=1.2, negativeTOP_K,TOP_K_RERANK > TOP_K_RETRIEVAL) to lock behavior and prevent regressions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/tests/librarian/config_loader_test.py` around lines 48 - 53, Extend the test suite around load_config by adding boundary tests that assert it raises ValueError for out-of-range numeric env vars: use mock.patch.dict to set LINK_THRESHOLD="1.2" (above 1.0) and assertRaises(ValueError); set CRE_LIBRARIAN_TOP_K_RETRIEVAL to a negative value (e.g., "-1") and assertRaises(ValueError); and set CRE_LIBRARIAN_TOP_K_RERANK greater than CRE_LIBRARIAN_TOP_K_RETRIEVAL (e.g., "5" vs "3") and assertRaises(ValueError). Keep the new tests alongside test_bad_int_env_raises and call load_config() inside each assertRaises block so loader invariants for LINK_THRESHOLD, CRE_LIBRARIAN_TOP_K_RETRIEVAL, and CRE_LIBRARIAN_TOP_K_RERANK are enforced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@application/tests/librarian/fixtures/golden_dataset.schema.json`:
- Around line 23-24: The JSON Schema currently allows empty strings/arrays for
fields that GoldenDatasetRow validation requires to be non-empty; update the
schema entries for "explicit_cre_ref" and "prior_text" to enforce non-empty
strings (add "minLength": 1) and add "minItems": 1 to any array fields that must
be non-empty (notably the "expected.cre_ids" schema for decision="linked"); scan
the other occurrences mentioned (around the other schema blocks at the indicated
ranges) and apply the same constraints so the JSON Schema matches
GoldenDatasetRow's truthy checks.
In `@application/utils/librarian/_rfc_schemas/source-ref.json`:
- Around line 20-25: The schema currently only requires "repo" and "commit_sha"
when type is "github" but does not require "url" for other types; update the
JSON Schema in the "allOf" array so that when "type" is "url" or "rss" (or when
type is not "github") the schema's "then" adds "required": ["url"]; locate the
existing conditional block using "allOf" / "if" / "then" in source-ref.json (the
block that checks { "properties": { "type": { "const": "github" } } }) and add
an additional conditional (or replace with an "if" using
"properties":{"type":{"enum":["url","rss"]}}) whose "then" enforces the "url"
required property for those source types.
In `@application/utils/librarian/config_loader.py`:
- Around line 22-33: The load_config function returns a LibrarianConfig without
validating numeric invariants; update load_config to validate parsed values
(from os.getenv) and raise clear ValueError messages when invalid: ensure
top_k_retrieval and top_k_rerank are positive ints, top_k_rerank <=
top_k_retrieval, batch_size > 0, link_threshold is between 0.0 and 1.0
inclusive, and ece_target and conformal_alpha are between 0.0 and 1.0 inclusive;
perform these checks immediately after converting the env values and before
constructing LibrarianConfig so callers get immediate, descriptive failures
(include the offending variable name and its value in each error).
In `@scripts/build_golden_dataset.py`:
- Around line 325-327: The code currently silently skips curated ASVS rows when
_fetch_asvs_cre(conn, entry["asvs_section_id"]) returns None; change this to
fail fast by raising a clear exception (or calling sys.exit with an error
message) that includes the missing asvs_section_id and any relevant entry
identifiers so the pipeline stops and the issue is obvious; update both
occurrences (the block around cre = _fetch_asvs_cre(...) at the shown spot and
the similar block at the other occurrence around lines 356-358) to perform the
fail-fast behavior instead of continue.
- Around line 445-447: When running in --check mode (the block that inspects
Path(args.out).read_text()), handle a missing output file explicitly instead of
letting read_text() raise: first test Path(args.out).exists() and if it does not
exist print or log a clear error and exit non-zero (e.g. sys.exit(1)) with a
descriptive message, otherwise continue to read_text() and compare contents;
update the code in the args.check branch (referencing args.check,
Path(args.out), and read_text()) to perform the existence check and clean exit.
In `@scripts/evaluate_librarian.py`:
- Around line 67-68: The slice currently uses a truthy check on args.limit so
passing --limit 0 is treated as no limit; change the condition around the rows
slicing to explicitly check for None (e.g., if args.limit is not None) so that
args.limit == 0 correctly results in rows = rows[:0] and yields zero evaluated
rows; update the block that references args.limit and rows to use this explicit
None check.
---
Nitpick comments:
In `@application/tests/librarian/config_loader_test.py`:
- Around line 48-53: Extend the test suite around load_config by adding boundary
tests that assert it raises ValueError for out-of-range numeric env vars: use
mock.patch.dict to set LINK_THRESHOLD="1.2" (above 1.0) and
assertRaises(ValueError); set CRE_LIBRARIAN_TOP_K_RETRIEVAL to a negative value
(e.g., "-1") and assertRaises(ValueError); and set CRE_LIBRARIAN_TOP_K_RERANK
greater than CRE_LIBRARIAN_TOP_K_RETRIEVAL (e.g., "5" vs "3") and
assertRaises(ValueError). Keep the new tests alongside test_bad_int_env_raises
and call load_config() inside each assertRaises block so loader invariants for
LINK_THRESHOLD, CRE_LIBRARIAN_TOP_K_RETRIEVAL, and CRE_LIBRARIAN_TOP_K_RERANK
are enforced.
In `@application/utils/librarian/scoring.py`:
- Around line 21-22: Remove the unreachable defensive check that returns 0.0
when "union" is falsy: because there is already an early return when both input
sets are empty (the return at line 19), "union" cannot be empty here—delete the
if not union: return 0.0 block in the scoring calculation (the lines referencing
the variable "union") so the function's flow is clearer and avoids dead code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 90fc550c-1b08-4bf8-aa33-273c1b81a94c
📒 Files selected for processing (24)
application/tests/librarian/__init__.pyapplication/tests/librarian/config_loader_test.pyapplication/tests/librarian/dataset_test.pyapplication/tests/librarian/fixtures/golden_dataset.jsonapplication/tests/librarian/fixtures/golden_dataset.schema.jsonapplication/tests/librarian/fixtures/sample_knowledge_queue.jsonlapplication/tests/librarian/hub_firewall_test.pyapplication/tests/librarian/schemas_test.pyapplication/tests/librarian/scoring_test.pyapplication/utils/librarian/__init__.pyapplication/utils/librarian/_rfc_schemas/knowledge-item.jsonapplication/utils/librarian/_rfc_schemas/link-proposal.jsonapplication/utils/librarian/_rfc_schemas/locator.jsonapplication/utils/librarian/_rfc_schemas/proposed-link.jsonapplication/utils/librarian/_rfc_schemas/review-item.jsonapplication/utils/librarian/_rfc_schemas/source-ref.jsonapplication/utils/librarian/config_loader.pyapplication/utils/librarian/hub_firewall.pyapplication/utils/librarian/knowledge_source.pyapplication/utils/librarian/schemas.pyapplication/utils/librarian/scoring.pyrequirements.txtscripts/build_golden_dataset.pyscripts/evaluate_librarian.py
…nts, fail-fast build, edge cases
There was a problem hiding this comment.
🧹 Nitpick comments (1)
application/tests/librarian/config_loader_test.py (1)
20-23: ⚡ Quick winPrefer
AttributeErrorover genericException.Frozen dataclasses raise
AttributeError(orFrozenInstanceError, which subclasses it in Python 3.10+) when attributes are mutated. Using the genericExceptionis too permissive and could mask other unexpected errors.♻️ Proposed refinement
def test_config_is_frozen(self): cfg = load_config() - with self.assertRaises(Exception): + with self.assertRaises(AttributeError): cfg.link_threshold = 0.5 # type: ignore[misc]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/tests/librarian/config_loader_test.py` around lines 20 - 23, The test test_config_is_frozen currently uses self.assertRaises(Exception) which is too broad; change it to expect AttributeError (or FrozenInstanceError if you prefer Python-3.10-specific) when attempting to mutate the frozen dataclass returned by load_config(); update the assertion in test_config_is_frozen so the with self.assertRaises(...) targets AttributeError and keep the mutation line (cfg.link_threshold = 0.5) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@application/tests/librarian/config_loader_test.py`:
- Around line 20-23: The test test_config_is_frozen currently uses
self.assertRaises(Exception) which is too broad; change it to expect
AttributeError (or FrozenInstanceError if you prefer Python-3.10-specific) when
attempting to mutate the frozen dataclass returned by load_config(); update the
assertion in test_config_is_frozen so the with self.assertRaises(...) targets
AttributeError and keep the mutation line (cfg.link_threshold = 0.5) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 17638fe7-c137-4db7-a644-ccc816fcfa61
📒 Files selected for processing (6)
application/tests/librarian/config_loader_test.pyapplication/tests/librarian/fixtures/golden_dataset.schema.jsonapplication/utils/librarian/config_loader.pyapplication/utils/librarian/scoring.pyscripts/build_golden_dataset.pyscripts/evaluate_librarian.py
💤 Files with no reviewable changes (1)
- application/utils/librarian/scoring.py
🚧 Files skipped from review as they are similar to previous changes (4)
- application/utils/librarian/config_loader.py
- application/tests/librarian/fixtures/golden_dataset.schema.json
- scripts/evaluate_librarian.py
- scripts/build_golden_dataset.py
week_1: Module C (The Librarian) — contracts + eval harness + golden dataset
Overview
This is the Week 1 deliverable for Module C (The Librarian). Following the OIE RFC's "build the test before the code" principle, this PR establishes the data contracts and the evaluation harness for Module C — before any linking logic. There is no retriever, no embeddings, and no decision logic here; those land in later weeks and will be measured against what this PR provides.
Module C consumes a
KnowledgeItemfrom Module B and produces one of two outputs: aLinkProposal(auto-link to the OpenCRE graph) or aReviewItem(route to human review in Module D). This PR makes those contracts concrete, guards them against drift from the RFC, and builds an evaluation harness plus a real 319-row dataset so that every future change to Module C is verifiable in CI.Scope: 24 files. The only change to existing code is a one-line
pydanticversion pin. No frontend, no migrations, no pipeline wiring.What changed
schemas.pyKnowledgeItem(input),LinkProposal/ReviewItem(output) — plus shared sub-models. The RFC's conditional rules are enforced as validators._rfc_schemas/*.json(6) +schemas_test.pyconfig_loader.pyCRE_LIBRARIAN_*settings (thresholds, model name, top-k) loaded into one frozen config.knowledge_source.pyKnowledgeSourceinterface with a fixture-backed stub. The real database-backed reader arrives later and yields the same shape.evaluate_librarian.pyhub_firewall.pyscoring.pybuild_golden_dataset.py+fixtures/golden_dataset.json(and its schema)standards_cache.sqlite. A--checkmode fails CI if the committed file drifts from the database.*_test.pyfiles + fixturesrequirements.txtpydantic→pydantic>=2,<3(the contracts use Pydantic v2).How the pieces connect
flowchart TB subgraph RFC["RFC #734 schemas (vendored, pinned)"] rfc["_rfc_schemas/*.json"] end subgraph CONTRACTS["Contracts and config"] schemas["schemas.py<br/>KnowledgeItem · LinkProposal · ReviewItem"] config["config_loader.py"] ksource["knowledge_source.py"] end subgraph DATA["Golden dataset"] db[("standards_cache.sqlite")] build["build_golden_dataset.py<br/>derive and --check"] gold["golden_dataset.json<br/>319 rows"] end subgraph HARNESS["Eval harness"] eval["evaluate_librarian.py"] firewall["hub_firewall.py"] scoring["scoring.py"] end subgraph TESTS["Tests (48 passing)"] t1["schemas_test.py (drift guard)"] t2["dataset_test.py"] t3["scoring_test.py"] t4["hub_firewall_test.py"] t5["config_loader_test.py"] end rfc -->|"validated against"| schemas db --> build --> gold schemas --> eval config --> eval firewall --> eval scoring --> eval gold --> eval schemas --> ksource rfc --> t1 schemas --> t1 gold --> t2 build -. "--check" .-> t2 scoring --> t3 firewall --> t4 config --> t5There are three flows. First, the vendored RFC schemas are what
schemas.pyis validated against — this is the drift guard. Second, the dataset is derived from the database intogolden_dataset.json, with--checkguarding against drift. Third, the harness loads the dataset and composes the contracts, config, firewall, and scorer.The dataset, by slice
What is intentionally not here
Retriever, embeddings, cross-encoder, decision engine, CLI wiring, and database models/migrations are all later weeks. Week 1 is the contracts and the evaluation harness only.
How to verify locally
Note on the dataset design
The dataset is derived from the database but committed as a static file, so CI can load the JSON directly without needing the database;
--checkis what proves the committed file still matches the database. From here on, every Module C PR is scored against these 319 real rows.