
week_1: Module C (The Librarian) — RFC contracts + eval harness + golden dataset#922

Open
PRAteek-singHWY wants to merge 2 commits into OWASP:main from PRAteek-singHWY:gsocmodule_C_week_1

@PRAteek-singHWY PRAteek-singHWY commented Jun 9, 2026

Contributor

week_1: Module C (The Librarian) — contracts + eval harness + golden dataset

Overview

This is the Week 1 deliverable for Module C (The Librarian). Following the OIE RFC's "build the test before the code" principle, this PR establishes the data contracts and the evaluation harness for Module C — before any linking logic. There is no retriever, no embeddings, and no decision logic here; those land in later weeks and will be measured against what this PR provides.

Module C consumes a KnowledgeItem from Module B and produces one of two outputs: a LinkProposal (auto-link to the OpenCRE graph) or a ReviewItem (route to human review in Module D). This PR makes those contracts concrete, guards them against drift from the RFC, and builds an evaluation harness plus a real 319-row dataset so that every future change to Module C is verifiable in CI.

Scope: 24 files. The only change to existing code is a one-line pydantic version pin. No frontend, no migrations, no pipeline wiring.

What changed

| Area | Files | Description |
| --- | --- | --- |
| Contracts | `schemas.py` | The three RFC envelopes as Pydantic v2 models: KnowledgeItem (input) and LinkProposal / ReviewItem (output), plus shared sub-models. The RFC's conditional rules are enforced as validators. |
| Drift guard | `_rfc_schemas/*.json` (6) + `schemas_test.py` | The canonical RFC schemas, vendored and pinned to a specific upstream commit. A test validates every model against them, so CI fails if our contracts diverge from the RFC. |
| Config | `config_loader.py` | The seven `CRE_LIBRARIAN_*` settings (thresholds, model name, top-k) loaded into one frozen config. |
| Input interface | `knowledge_source.py` | An abstract KnowledgeSource interface with a fixture-backed stub. The real database-backed reader arrives later and yields the same shape. |
| Eval harness | `evaluate_librarian.py` | Loads the dataset, applies the hub-firewall and scorer, and prints per-slice results. This is the baseline that every later week is measured against. |
| Hub-firewall | `hub_firewall.py` | Strips each test row's own text out of the CRE hub before scoring, so retrieval cannot echo the answer back and inflate accuracy. On by default. |
| Scoring | `scoring.py` | The multi-link correctness rule: a prediction is correct only if Jaccard(expected, predicted) ≥ 0.5 and the top-1 prediction is in the expected set. |
| Golden dataset | `build_golden_dataset.py` + `fixtures/golden_dataset.json` (and its schema) | 319 real rows derived from `standards_cache.sqlite`. A `--check` mode fails CI if the committed file drifts from the database. |
| Tests | five `*_test.py` files + fixtures | 48 tests covering the contracts, firewall, scoring, and dataset integrity. |
| Production pin | `requirements.txt` | A single line: `pydantic` → `pydantic>=2,<3` (the contracts use Pydantic v2). |
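
To make the hub-firewall row concrete, here is a minimal sketch of the idea, not the code in `hub_firewall.py`. The names `HubRep`, `leaks`, and `firewall` come from the review summary, but the field layout, the exact-match comparison after whitespace/case normalization, and the CRE id `000-001` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HubRep:
    """One CRE hub entry: a CRE id and the texts representing it (fields assumed)."""
    cre_id: str
    texts: Tuple[str, ...]


def _normalize(text: str) -> str:
    # Collapse whitespace and ignore case so reformatting cannot hide a leak.
    return " ".join(text.split()).lower()


def leaks(hub: List[HubRep], row_text: str) -> bool:
    """True if the row's own text still appears in the hub after normalization."""
    needle = _normalize(row_text)
    return any(_normalize(t) == needle for rep in hub for t in rep.texts)


def firewall(hub: List[HubRep], row_text: str) -> List[HubRep]:
    """Return a copy of the hub with every text equal to the row's text removed."""
    needle = _normalize(row_text)
    return [
        HubRep(rep.cre_id, tuple(t for t in rep.texts if _normalize(t) != needle))
        for rep in hub
    ]


# Hypothetical hub entry; "000-001" is a placeholder id.
hub = [HubRep("000-001", ("Verify that passwords are hashed.", "Store credentials securely."))]
row = "verify   that passwords are HASHED."
leaked_before = leaks(hub, row)
clean_hub = firewall(hub, row)
leaked_after = leaks(clean_hub, row)
```

Filtering per row, rather than once for the whole dataset, keeps the remaining hub texts available as retrieval targets for every other row.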

How the pieces connect

flowchart TB
    subgraph RFC["RFC #734 schemas (vendored, pinned)"]
        rfc["_rfc_schemas/*.json"]
    end

    subgraph CONTRACTS["Contracts and config"]
        schemas["schemas.py<br/>KnowledgeItem · LinkProposal · ReviewItem"]
        config["config_loader.py"]
        ksource["knowledge_source.py"]
    end

    subgraph DATA["Golden dataset"]
        db[("standards_cache.sqlite")]
        build["build_golden_dataset.py<br/>derive and --check"]
        gold["golden_dataset.json<br/>319 rows"]
    end

    subgraph HARNESS["Eval harness"]
        eval["evaluate_librarian.py"]
        firewall["hub_firewall.py"]
        scoring["scoring.py"]
    end

    subgraph TESTS["Tests (48 passing)"]
        t1["schemas_test.py (drift guard)"]
        t2["dataset_test.py"]
        t3["scoring_test.py"]
        t4["hub_firewall_test.py"]
        t5["config_loader_test.py"]
    end

    rfc -->|"validated against"| schemas
    db --> build --> gold

    schemas --> eval
    config --> eval
    firewall --> eval
    scoring --> eval
    gold --> eval
    schemas --> ksource

    rfc --> t1
    schemas --> t1
    gold --> t2
    build -. "--check" .-> t2
    scoring --> t3
    firewall --> t4
    config --> t5

There are three flows. First, the vendored RFC schemas are what schemas.py is validated against — this is the drift guard. Second, the dataset is derived from the database into golden_dataset.json, with --check guarding against drift. Third, the harness loads the dataset and composes the contracts, config, firewall, and scorer.

The dataset, by slice

| Slice | Rows | What it tests |
| --- | --- | --- |
| positive (1:1) | 277 | ASVS requirement mapped to its CRE (database ground truth) |
| positive (multi-link) | 15 | nodes with 2–4 CREs, to exercise the multi-link rule |
| hard_negative | 12 | negation phrasing ("do not", "shall not", and similar) |
| explicit | 5 | text that literally cites a CRE id |
| update | 5 | before/after rewordings of a requirement |
| ambiguous | 5 | broad statements that should route to review |
| Total | 319 | |
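
The multi-link rule that the multi-link slice exercises can be sketched as below. This is an illustration of the stated rule (Jaccard ≥ 0.5 and top-1 in the expected set), not the code in `scoring.py`. Treating an empty prediction as correct only when no links are expected is an assumption, and the ids `A`, `B`, `C`, `X` are placeholders.

```python
from typing import Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty sets are treated as identical
    return len(sa & sb) / len(sa | sb)


def score_case(expected: Sequence[str], predicted: Sequence[str],
               threshold: float = 0.5) -> bool:
    """Correct only if the overlap is high enough AND the top-1 prediction is expected."""
    if not predicted:
        # Assumption: an empty prediction is right only when nothing was expected.
        return not expected
    return jaccard(expected, predicted) >= threshold and predicted[0] in expected


expected = ["A", "B", "C"]
good = score_case(expected, ["A", "B"])            # overlap 2/3, top-1 "A" is expected
wrong_top1 = score_case(expected, ["X", "A", "B"]) # overlap 2/4, but top-1 "X" is not expected
low_overlap = score_case(expected, ["A"])          # overlap 1/3 is below the threshold
```

The top-1 condition is what separates the second case from the first: the overlap alone would pass at exactly 0.5, but the highest-ranked prediction is wrong.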

What is intentionally not here

Retriever, embeddings, cross-encoder, decision engine, CLI wiring, and database models/migrations are all later weeks. Week 1 is the contracts and the evaluation harness only.

How to verify locally

# 48 tests
python3 -m unittest discover -s application/tests/librarian -p '*_test.py' -t .

# dataset matches the database (no drift)
python3 scripts/build_golden_dataset.py --check

# harness end-to-end
python3 scripts/evaluate_librarian.py --dataset application/tests/librarian/fixtures/golden_dataset.json

Note on the dataset design

The dataset is derived from the database but committed as a static file, so CI can load the JSON directly without needing the database; --check is what proves the committed file still matches the database. From here on, every Module C PR is scored against these 319 real rows.
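
A minimal sketch of how such a `--check` comparison could work, assuming the builder re-derives rows deterministically and compares the rendered text with the committed file. The table `asvs_links`, its columns, and the function names are hypothetical and not taken from `build_golden_dataset.py`.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def derive_rows(conn: sqlite3.Connection) -> list:
    """Re-derive dataset rows; ORDER BY keeps the output deterministic."""
    cur = conn.execute(
        "SELECT section_id, cre_id FROM asvs_links ORDER BY section_id, cre_id"
    )
    return [{"section_id": s, "cre_id": c} for s, c in cur.fetchall()]


def render(rows: list) -> str:
    """Serialize with a stable key order so equal data gives equal text."""
    return json.dumps(rows, indent=2, sort_keys=True) + "\n"


def check(conn: sqlite3.Connection, out_path: Path) -> int:
    """Return 0 if the committed file matches the database, 1 otherwise."""
    if not out_path.exists():
        print(f"error: {out_path} is missing; generate it before running --check")
        return 1
    if out_path.read_text(encoding="utf-8") != render(derive_rows(conn)):
        print(f"error: {out_path} has drifted from the database")
        return 1
    return 0


# Demo against an in-memory database and a temporary "committed" file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asvs_links (section_id TEXT, cre_id TEXT)")
conn.execute("INSERT INTO asvs_links VALUES ('V2.1.1', '000-001')")
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "golden_dataset.json"
    missing_status = check(conn, out)   # nothing committed yet
    out.write_text(render(derive_rows(conn)), encoding="utf-8")
    clean_status = check(conn, out)     # file matches the database
    conn.execute("INSERT INTO asvs_links VALUES ('V2.1.2', '000-002')")
    drift_status = check(conn, out)     # database changed underneath the file
```

The explicit existence test gives a clean non-zero exit when the output file is absent, which is the behavior one of the review comments on this PR asks for.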

  dataset

  Contracts + regression ruler before any pipeline code, per the OIE RFC's
  'test before the code' directive. RFC OWASP#734 envelopes (KnowledgeItem in,
  LinkProposal/ReviewItem out) as Pydantic v2, drift-guarded against the
  vendored owasp-graph schemas; TRACT hub-firewall + multi-link scoring;
  319-row golden dataset derived from standards_cache.sqlite with --check
  drift detection. One prod edit: pydantic>=2,<3 pin.
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor


Summary by CodeRabbit

  • New Features

    • Added configuration system supporting environment variable overrides with validation.
    • Added data contract validation through JSON Schemas for knowledge items, proposals, and reviews.
    • Added evaluation harness with golden dataset support for testing the Librarian module.
    • Added hub firewall to prevent knowledge item text leakage during processing.
  • Tests

    • Added comprehensive test suite covering configuration loading, dataset validation, data model contracts, schema compliance, and scoring utilities.
  • Chores

    • Updated Pydantic to v2.

Walkthrough

This PR implements Module C (The Librarian): RFC-aligned JSON schemas and Pydantic models, environment-backed frozen configuration loader, hub-firewall and scoring utilities, deterministic golden dataset builder and fixtures, an evaluation harness, and comprehensive unit/schema tests.

Changes

Module C — The Librarian

| Layer | File(s) | Summary |
| --- | --- | --- |
| RFC Contract Schemas and Pydantic Models | `application/utils/librarian/_rfc_schemas/*`, `application/utils/librarian/schemas.py` | Defines KnowledgeItem, LinkProposal, and ReviewItem RFC envelopes via draft 2020-12 JSON schemas and Pydantic v2 models with conditional field validation (e.g., status-dependent content/rejection, kind-dependent path/url). Adds SourceRef, Locator, ProposedLink, and KnowledgeSnapshot sub-contracts plus internal KnowledgeQueueItem and golden-harness models (GoldenDatasetRow). |
| Configuration and Utility Helpers | `application/utils/librarian/config_loader.py`, `application/utils/librarian/hub_firewall.py`, `application/utils/librarian/knowledge_source.py`, `application/utils/librarian/scoring.py`, `application/tests/librarian/config_loader_test.py`, `application/tests/librarian/hub_firewall_test.py`, `application/tests/librarian/scoring_test.py` | Implements frozen LibrarianConfig dataclass and load_config() for environment-driven configuration, HubRep dataclass and leaks/firewall helpers for text leakage detection/filtering with whitespace/case normalization, abstract KnowledgeSource interface with FixtureKnowledgeSource fixture reader, and Jaccard-based score_case correctness predicate with threshold enforcement. Includes unit tests covering defaults, overrides, immutability, hub firewall semantics, and scoring boundary cases. |
| RFC Contract Validation Tests | `application/tests/librarian/schemas_test.py` | Validates Pydantic models and RFC JSON schemas via canonical round-trip assertions using referencing.Registry for $ref resolution. Includes fixture construction for shared structures, conditional-field constraint tests for SourceRef/Locator/KnowledgeItem/LinkProposal/ReviewItem, and documentation example re-validation. |
| Golden Dataset Definition and Building | `application/tests/librarian/fixtures/golden_dataset.schema.json`, `application/tests/librarian/fixtures/sample_knowledge_queue.jsonl`, `scripts/build_golden_dataset.py`, `application/tests/librarian/dataset_test.py` | Defines GoldenDatasetRow JSON schema with slice- and decision-dependent conditional validation. Implements builder script that synthesizes six dataset slices (explicit, positive ASVS, positive multi-link, hard negative, update, ambiguous) from standards_cache.sqlite with deterministic ordering and supports --check mode for re-derivation validation. Includes dataset sanity tests verifying row counts, uniqueness, schema compliance, and determinism. |
| Evaluation Harness | `scripts/evaluate_librarian.py` | Week 1 skeleton harness that loads golden dataset, builds a stub hub seeded from row inputs, optionally applies firewall filtering, runs per-row leak detection and scoring (currently stubbed to always predict empty lists), and prints summary statistics by slice. |
| Module Documentation and Dependencies | `application/utils/librarian/__init__.py`, `requirements.txt` | Adds module docstring documenting Librarian responsibilities, inter-module contract flow, RFC schema vendoring/pinning details, and resync instructions. Constrains Pydantic to v2 major version (>=2,<3). |
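
As an illustration of the conditional field validation mentioned for SourceRef, the sketch below uses a standard-library dataclass as a stand-in for the Pydantic v2 model, so it runs without third-party packages. The `github` rule (requires `repo` and `commit_sha`) is described in the review of `source-ref.json`; the `url`/`rss` rule is the reviewer's proposal and is assumed here, as are the field defaults and the sample values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRef:
    """Stand-in for the Pydantic model; only the conditional rules are sketched."""
    type: str
    repo: Optional[str] = None
    commit_sha: Optional[str] = None
    url: Optional[str] = None

    def __post_init__(self) -> None:
        if self.type == "github":
            # github sources must say which repository and commit they came from.
            missing = [n for n in ("repo", "commit_sha") if not getattr(self, n)]
        elif self.type in ("url", "rss"):
            # Assumed rule, following the reviewer's suggestion.
            missing = [] if self.url else ["url"]
        else:
            raise ValueError(f"unknown source type: {self.type!r}")
        if missing:
            raise ValueError(f"type={self.type!r} requires: {', '.join(missing)}")


def is_valid(**fields) -> bool:
    """Helper for the demo: True if the fields build a valid SourceRef."""
    try:
        SourceRef(**fields)
    except ValueError:
        return False
    return True


github_ok = is_valid(type="github", repo="example/repo", commit_sha="abc123")
github_incomplete = is_valid(type="github", repo="example/repo")
url_missing = is_valid(type="url")
```

In the Pydantic version the same checks would typically live in a model-level validator, with the JSON Schema expressing them as `if`/`then` blocks so the drift guard can compare the two.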

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.78% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the PR's main deliverable: Week 1 of Module C (The Librarian), implementing RFC contracts, an evaluation harness, and a golden dataset. |
| Description check | ✅ Passed | The description is highly detailed and directly related to the changeset, explaining the scope, components, connections, and verification steps for the Week 1 Module C implementation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
application/utils/librarian/scoring.py (1)

21-22: 💤 Low value

Optional: Remove unreachable defensive check.

Since line 19 returns early when both sets are empty, the union on line 20 cannot be empty, making lines 21-22 unreachable. You may remove this check for clarity.

♻️ Simplify by removing unreachable code
 def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
     sa, sb = set(a), set(b)
     if not sa and not sb:
         return 1.0
     union = sa | sb
-    if not union:
-        return 0.0
     return len(sa & sb) / len(union)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/librarian/scoring.py` around lines 21 - 22, Remove the
unreachable defensive check that returns 0.0 when "union" is falsy: because
there is already an early return when both input sets are empty (the return at
line 19), "union" cannot be empty here—delete the if not union: return 0.0 block
in the scoring calculation (the lines referencing the variable "union") so the
function's flow is clearer and avoids dead code.
application/tests/librarian/config_loader_test.py (1)

48-53: ⚡ Quick win

Add boundary tests for numeric domain constraints.

Once loader invariants are enforced, this suite should assert failures for out-of-range numeric env values (e.g., LINK_THRESHOLD=1.2, negative TOP_K, TOP_K_RERANK > TOP_K_RETRIEVAL) to lock behavior and prevent regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/tests/librarian/config_loader_test.py` around lines 48 - 53,
Extend the test suite around load_config by adding boundary tests that assert it
raises ValueError for out-of-range numeric env vars: use mock.patch.dict to set
LINK_THRESHOLD="1.2" (above 1.0) and assertRaises(ValueError); set
CRE_LIBRARIAN_TOP_K_RETRIEVAL to a negative value (e.g., "-1") and
assertRaises(ValueError); and set CRE_LIBRARIAN_TOP_K_RERANK greater than
CRE_LIBRARIAN_TOP_K_RETRIEVAL (e.g., "5" vs "3") and assertRaises(ValueError).
Keep the new tests alongside test_bad_int_env_raises and call load_config()
inside each assertRaises block so loader invariants for LINK_THRESHOLD,
CRE_LIBRARIAN_TOP_K_RETRIEVAL, and CRE_LIBRARIAN_TOP_K_RERANK are enforced.
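
The boundary tests described above presuppose a loader that rejects out-of-range values. A minimal sketch of such a loader follows, reduced to three of the seven settings. The variable name `CRE_LIBRARIAN_LINK_THRESHOLD`, the default values, and the optional `env` parameter are assumptions made so the example is self-contained; the real `load_config()` may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LibrarianConfig:
    link_threshold: float
    top_k_retrieval: int
    top_k_rerank: int


def load_config(env: Optional[Mapping[str, str]] = None) -> LibrarianConfig:
    """Parse settings, then validate them before building the frozen config."""
    env = os.environ if env is None else env
    # Defaults are placeholders, not the project's real values.
    link_threshold = float(env.get("CRE_LIBRARIAN_LINK_THRESHOLD", "0.8"))
    top_k_retrieval = int(env.get("CRE_LIBRARIAN_TOP_K_RETRIEVAL", "20"))
    top_k_rerank = int(env.get("CRE_LIBRARIAN_TOP_K_RERANK", "5"))

    if not 0.0 <= link_threshold <= 1.0:
        raise ValueError(
            f"CRE_LIBRARIAN_LINK_THRESHOLD must be in [0, 1], got {link_threshold}")
    if top_k_retrieval <= 0:
        raise ValueError(
            f"CRE_LIBRARIAN_TOP_K_RETRIEVAL must be positive, got {top_k_retrieval}")
    if top_k_rerank <= 0:
        raise ValueError(
            f"CRE_LIBRARIAN_TOP_K_RERANK must be positive, got {top_k_rerank}")
    if top_k_rerank > top_k_retrieval:
        raise ValueError(
            f"CRE_LIBRARIAN_TOP_K_RERANK ({top_k_rerank}) must not exceed "
            f"CRE_LIBRARIAN_TOP_K_RETRIEVAL ({top_k_retrieval})")
    return LibrarianConfig(link_threshold, top_k_retrieval, top_k_rerank)


def rejects(env: Mapping[str, str]) -> bool:
    """True if load_config raises ValueError for this environment."""
    try:
        load_config(env)
    except ValueError:
        return True
    return False


threshold_too_high = rejects({"CRE_LIBRARIAN_LINK_THRESHOLD": "1.2"})
negative_top_k = rejects({"CRE_LIBRARIAN_TOP_K_RETRIEVAL": "-1"})
rerank_exceeds = rejects({"CRE_LIBRARIAN_TOP_K_RERANK": "5",
                          "CRE_LIBRARIAN_TOP_K_RETRIEVAL": "3"})
```

Passing a mapping keeps the sketch independent of the process environment; a test suite could equally call `load_config()` with no argument inside `mock.patch.dict(os.environ, {...})`, as the nitpick suggests.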
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@application/tests/librarian/fixtures/golden_dataset.schema.json`:
- Around line 23-24: The JSON Schema currently allows empty strings/arrays for
fields that GoldenDatasetRow validation requires to be non-empty; update the
schema entries for "explicit_cre_ref" and "prior_text" to enforce non-empty
strings (add "minLength": 1) and add "minItems": 1 to any array fields that must
be non-empty (notably the "expected.cre_ids" schema for decision="linked"); scan
the other occurrences mentioned (around the other schema blocks at the indicated
ranges) and apply the same constraints so the JSON Schema matches
GoldenDatasetRow's truthy checks.

In `@application/utils/librarian/_rfc_schemas/source-ref.json`:
- Around line 20-25: The schema currently only requires "repo" and "commit_sha"
when type is "github" but does not require "url" for other types; update the
JSON Schema in the "allOf" array so that when "type" is "url" or "rss" (or when
type is not "github") the schema's "then" adds "required": ["url"]; locate the
existing conditional block using "allOf" / "if" / "then" in source-ref.json (the
block that checks { "properties": { "type": { "const": "github" } } }) and add
an additional conditional (or replace with an "if" using
"properties":{"type":{"enum":["url","rss"]}}) whose "then" enforces the "url"
required property for those source types.

In `@application/utils/librarian/config_loader.py`:
- Around line 22-33: The load_config function returns a LibrarianConfig without
validating numeric invariants; update load_config to validate parsed values
(from os.getenv) and raise clear ValueError messages when invalid: ensure
top_k_retrieval and top_k_rerank are positive ints, top_k_rerank <=
top_k_retrieval, batch_size > 0, link_threshold is between 0.0 and 1.0
inclusive, and ece_target and conformal_alpha are between 0.0 and 1.0 inclusive;
perform these checks immediately after converting the env values and before
constructing LibrarianConfig so callers get immediate, descriptive failures
(include the offending variable name and its value in each error).

In `@scripts/build_golden_dataset.py`:
- Around line 325-327: The code currently silently skips curated ASVS rows when
_fetch_asvs_cre(conn, entry["asvs_section_id"]) returns None; change this to
fail fast by raising a clear exception (or calling sys.exit with an error
message) that includes the missing asvs_section_id and any relevant entry
identifiers so the pipeline stops and the issue is obvious; update both
occurrences (the block around cre = _fetch_asvs_cre(...) at the shown spot and
the similar block at the other occurrence around lines 356-358) to perform the
fail-fast behavior instead of continue.
- Around line 445-447: When running in --check mode (the block that inspects
Path(args.out).read_text()), handle a missing output file explicitly instead of
letting read_text() raise: first test Path(args.out).exists() and if it does not
exist print or log a clear error and exit non-zero (e.g. sys.exit(1)) with a
descriptive message, otherwise continue to read_text() and compare contents;
update the code in the args.check branch (referencing args.check,
Path(args.out), and read_text()) to perform the existence check and clean exit.

In `@scripts/evaluate_librarian.py`:
- Around line 67-68: The slice currently uses a truthy check on args.limit so
passing --limit 0 is treated as no limit; change the condition around the rows
slicing to explicitly check for None (e.g., if args.limit is not None) so that
args.limit == 0 correctly results in rows = rows[:0] and yields zero evaluated
rows; update the block that references args.limit and rows to use this explicit
None check.
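
The `--limit` finding is easy to reproduce in isolation. The sketch below contrasts the truthy check with the explicit `None` check; the helper names and the parser setup are illustrative assumptions, not the harness's actual code.

```python
import argparse


def select_rows_truthy(rows: list, limit) -> list:
    # Buggy variant: 0 is falsy, so --limit 0 behaves like "no limit".
    if limit:
        return rows[:limit]
    return rows


def select_rows(rows: list, limit) -> list:
    # Fixed variant: only a missing flag (None) means "no limit".
    if limit is not None:
        return rows[:limit]
    return rows


parser = argparse.ArgumentParser()
parser.add_argument("--limit", type=int, default=None)
args = parser.parse_args(["--limit", "0"])

rows = ["row-1", "row-2", "row-3"]
buggy = select_rows_truthy(rows, args.limit)  # all three rows are still evaluated
fixed = select_rows(rows, args.limit)         # zero rows, as requested
```

Neither variant guards against negative values, which would slice from the end of the list; rejecting them at argument-parsing time would be a separate decision.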


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 90fc550c-1b08-4bf8-aa33-273c1b81a94c

📥 Commits

Reviewing files that changed from the base of the PR and between d796ff5 and 2fc65e9.

📒 Files selected for processing (24)
  • application/tests/librarian/__init__.py
  • application/tests/librarian/config_loader_test.py
  • application/tests/librarian/dataset_test.py
  • application/tests/librarian/fixtures/golden_dataset.json
  • application/tests/librarian/fixtures/golden_dataset.schema.json
  • application/tests/librarian/fixtures/sample_knowledge_queue.jsonl
  • application/tests/librarian/hub_firewall_test.py
  • application/tests/librarian/schemas_test.py
  • application/tests/librarian/scoring_test.py
  • application/utils/librarian/__init__.py
  • application/utils/librarian/_rfc_schemas/knowledge-item.json
  • application/utils/librarian/_rfc_schemas/link-proposal.json
  • application/utils/librarian/_rfc_schemas/locator.json
  • application/utils/librarian/_rfc_schemas/proposed-link.json
  • application/utils/librarian/_rfc_schemas/review-item.json
  • application/utils/librarian/_rfc_schemas/source-ref.json
  • application/utils/librarian/config_loader.py
  • application/utils/librarian/hub_firewall.py
  • application/utils/librarian/knowledge_source.py
  • application/utils/librarian/schemas.py
  • application/utils/librarian/scoring.py
  • requirements.txt
  • scripts/build_golden_dataset.py
  • scripts/evaluate_librarian.py

Comment thread application/tests/librarian/fixtures/golden_dataset.schema.json
Comment thread application/utils/librarian/_rfc_schemas/source-ref.json
Comment thread application/utils/librarian/config_loader.py
Comment thread scripts/build_golden_dataset.py Outdated
Comment thread scripts/build_golden_dataset.py
Comment thread scripts/evaluate_librarian.py Outdated

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
application/tests/librarian/config_loader_test.py (1)

20-23: ⚡ Quick win

Prefer AttributeError over generic Exception.

Frozen dataclasses raise dataclasses.FrozenInstanceError, which is a subclass of AttributeError, when attributes are mutated. Using the generic Exception is too permissive and could mask other unexpected errors.

♻️ Proposed refinement
     def test_config_is_frozen(self):
         cfg = load_config()
-        with self.assertRaises(Exception):
+        with self.assertRaises(AttributeError):
             cfg.link_threshold = 0.5  # type: ignore[misc]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/tests/librarian/config_loader_test.py` around lines 20 - 23, The
test test_config_is_frozen currently uses self.assertRaises(Exception) which is
too broad; change it to expect AttributeError (or its subclass
dataclasses.FrozenInstanceError for a stricter check) when attempting to mutate
the frozen dataclass returned by load_config(); update the assertion in
test_config_is_frozen so the with self.assertRaises(...) targets AttributeError
and keep the mutation line (cfg.link_threshold = 0.5) unchanged.
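
A short, self-contained check of the exception hierarchy this nitpick relies on. The `Config` class and its default value are stand-ins for the PR's LibrarianConfig.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class Config:
    link_threshold: float = 0.8  # placeholder default


cfg = Config()
caught = None
try:
    cfg.link_threshold = 0.5  # mutation of a frozen dataclass instance
except AttributeError as exc:
    # FrozenInstanceError is caught here because it subclasses AttributeError.
    caught = exc

is_frozen_error = isinstance(caught, dataclasses.FrozenInstanceError)
is_attribute_error_subclass = issubclass(dataclasses.FrozenInstanceError, AttributeError)
```

Because of the subclass relationship, `assertRaises(AttributeError)` passes for a frozen dataclass while still failing on unrelated errors such as a `TypeError` raised by the loader.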

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 17638fe7-c137-4db7-a644-ccc816fcfa61

📥 Commits

Reviewing files that changed from the base of the PR and between 2fc65e9 and e097016.

📒 Files selected for processing (6)
  • application/tests/librarian/config_loader_test.py
  • application/tests/librarian/fixtures/golden_dataset.schema.json
  • application/utils/librarian/config_loader.py
  • application/utils/librarian/scoring.py
  • scripts/build_golden_dataset.py
  • scripts/evaluate_librarian.py
💤 Files with no reviewable changes (1)
  • application/utils/librarian/scoring.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • application/utils/librarian/config_loader.py
  • application/tests/librarian/fixtures/golden_dataset.schema.json
  • scripts/evaluate_librarian.py
  • scripts/build_golden_dataset.py
