Skip to content

Adding Evals for date extraction#450

Open
rajeee wants to merge 19 commits into
mainfrom
date_parsing_clean
Open

Adding Evals for date extraction#450
rajeee wants to merge 19 commits into
mainfrom
date_parsing_clean

Conversation

@rajeee
Copy link
Copy Markdown
Collaborator

@rajeee rajeee commented May 27, 2026

Adding evals suite. Starting with date_extraction but the idea is to support other pieces of pipeline including whole tech.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 27, 2026

Codecov Report

❌ Patch coverage is 45.45455% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.14%. Comparing base (eab9df2) to head (74b66c6).
⚠️ Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
compass/scripts/process.py 14.28% 6 Missing ⚠️

❌ Your patch status has failed because the patch coverage (45.45%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #450      +/-   ##
==========================================
+ Coverage   56.11%   56.14%   +0.02%     
==========================================
  Files          63       63              
  Lines        6080     6086       +6     
  Branches      591      591              
==========================================
+ Hits         3412     3417       +5     
- Misses       2598     2599       +1     
  Partials       70       70              
Flag Coverage Δ
unittests 56.14% <45.45%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@rajeee rajeee mentioned this pull request May 27, 2026
@rajeee rajeee changed the title Date parsing clean Adding Evals for date parsing May 27, 2026
@rajeee rajeee changed the title Adding Evals for date parsing Adding Evals for date extraction May 27, 2026
@rajeee rajeee marked this pull request as ready for review May 27, 2026 17:54
@rajeee rajeee requested review from castelao and ppinchuk as code owners May 27, 2026 17:54
Copilot AI review requested due to automatic review settings May 27, 2026 17:54
@rajeee rajeee marked this pull request as draft May 27, 2026 17:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an evals/ test suite intended to run live, billable LLM-based quality evaluations (starting with date extraction) alongside the existing unit/integration test structure, with baseline artifacts committed for regression gating.

Changes:

  • Introduces an eval harness (evals/utilities/*) to compute per-feature metrics + write breakdown/metrics artifacts, plus a first suite for extract_date.
  • Updates test configuration to deselect evals by default and adds a Pixi task to run eval suites explicitly.
  • Extends cost utilities/tests to compute aggregate prompt/response token totals alongside cost.

Reviewed changes

Copilot reviewed 24 out of 90 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/python/unit/utilities/test_utilities_costs.py Adds unit tests for new token/cost aggregation helpers.
compass/utilities/costs.py Adds helpers to sum tokens and return {cost, prompt_tokens, response_tokens} from usage totals.
compass/utilities/init.py Re-exports the new cost/token aggregation helpers.
pyproject.toml Adds statsmodels to test extras; configures pytest to deselect evals by default; registers the evals marker and includes evals/ in testpaths.
pixi.toml Adds a pixi run evals <name> task; adds git-lfs and statsmodels to test env; adds openpyxl to dev env.
pixi.lock Locks new/updated dependencies (incl. statsmodels, git-lfs, openpyxl and transitive deps).
evals/utilities/base.py Defines shared Result schema, SUCCESS/FAILURE constants, and document loading helpers for eval suites.
evals/utilities/metrics.py Adds metric computation (accuracy/precision/recall/F1 + Wilson CIs) for eval results.
evals/utilities/reports.py Adds report writer (CSV/JSON), baseline comparison helpers, and terminal summary output.
evals/utilities/init.py Introduces the eval utilities package.
evals/conftest.py Registers the --held-out flag for eval suites.
evals/test_run_date_extraction_evals.py Implements the date extraction eval suite (dev vs held-out dataset selection, result capture, regression gate).
evals/results/dev/date_extraction_evals.json Adds committed dev baseline metrics snapshot for gating.
evals/results/dev/date_extraction_evals_breakdown.csv Adds committed dev baseline per-case breakdown for regression detection.
evals/results/held_out/date_extraction_evals.json Adds committed held-out baseline metrics snapshot (no breakdown committed).
evals/README.md Documents how to run evals, the dev vs held-out policy, and the eval suite structure.
evals/data/dev/solar/manifest.json5 Adds dev dataset manifest for date extraction evals.
evals/data/dev/solar/Town_of_Pendleton_Niagara_County_New_York.txt Adds dev dataset document via Git LFS pointer.
evals/data/dev/solar/Town_of_Livonia_Livingston_County_New_York.txt Adds dev dataset document via Git LFS pointer.
evals/data/dev/solar/Okanogan_County_Washington.txt Adds dev dataset document via Git LFS pointer.
evals/data/dev/solar/Medina_County_Ohio.txt Adds dev dataset document via Git LFS pointer.
evals/data/held-out/solar/manifest.json5 Adds held-out dataset manifest for date extraction evals.
evals/data/held-out/solar/Town_of_Walton_Delaware_County_New_York.txt Adds held-out dataset document via Git LFS pointer.
evals/data/held-out/solar/Town_of_Niagara_Niagara_County_New_York.txt Adds held-out dataset document via Git LFS pointer.
evals/data/held-out/solar/City_of_Agawam_Town_Hampden_County_Massachusetts.txt Adds held-out dataset document via Git LFS pointer.
.gitattributes Configures Git LFS tracking for evals/data/**/*.pdf and evals/data/**/*.txt.

Comment thread compass/utilities/costs.py
Comment thread evals/utilities/base.py
Comment thread pixi.toml
@rajeee rajeee marked this pull request as ready for review May 27, 2026 19:57
Copy link
Copy Markdown
Collaborator

@ppinchuk ppinchuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! I am excited to discuss this with you and Gui tomorrow.

A few requests here and there that I hope are not too arduous. Feel free to let me kmow if anything seems out of scope

Comment thread evals/data/dev/solar/Medina_County_Ohio.txt
"total_prompt_tokens": 32185,
"total_response_tokens": 4358,
"total_time_taken_s": 142.48,
"total_cost_usd": 0.0729
Copy link
Copy Markdown
Collaborator

@ppinchuk ppinchuk May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only 7 cents to run the entire thing? How is it so cheap?

I see that the token count is super low so maybe this number is believable, but it might be worth double checking. Also, we should add the LLM used to this dictionary. In fact, maybe more metadata about the run could be useful here. I'm thinking date of run, COMPASS version, etc.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably because most of the parsing is happening via url parsing - I am exploring adding a "extraction_method" column so we can surface how the data was extracted.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding adding more metadata - I agree - that would be good to do. I have some questions about the level of details - especially for methods that might use multiple models. Let's discuss.

Comment thread evals/results/held_out/date_extraction_evals.json Outdated
Comment thread evals/utilities/base.py Outdated
Comment on lines +60 to +68
def load_doc(fp, *, source=None):
"""Load a local file as an elm Document, dispatching on suffix"""
fp = Path(fp)
attrs = {"source": source} if source else {}
if fp.suffix.casefold() == ".pdf":
pages = read_pdf(fp.read_bytes(), verbose=False)
return PDFDocument(pages, attrs=attrs)
text = fp.read_text(encoding="utf-8", errors="ignore")
return HTMLDocument([text], attrs=attrs)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK if it's within the scope of this PR but there are already existing utilities in this repo to read local files (specifically this:

COMPASSLocalFileLoader = AsyncLocalFileLoader
)

We should try to re-use as much as possible in order to avoid re-writing code and having to maintain it in multiple places

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. I switched to using production loader that uncovered hidden issue - the load_doc was returning empty doc if the PDF needs OCR.

Comment thread evals/utilities/metrics.py Outdated
Comment thread evals/utilities/reports.py
Comment thread evals/utilities/reports.py Outdated
Comment thread pixi.toml
Comment thread evals/test_run_date_extraction_evals.py Outdated
Comment on lines +43 to +59
def pytest_generate_tests(metafunc):
"""Parametrize ``case`` with the dataset chosen by ``--held-out``

Each case gets its resolved document path stamped on as ``case["fp"]``
so the test body doesn't need to know which dataset it came from.
"""
if "case" not in metafunc.fixturenames:
return
dataset_dir = (
_HELD_OUT_DATASET_DIR
if metafunc.config.getoption("--held-out")
else _DEV_DATASET_DIR
)
cases = load_config(dataset_dir / "manifest.json5")
for c in cases:
c["fp"] = dataset_dir / c["file"]
metafunc.parametrize("case", cases, ids=[c["file"] for c in cases])
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's cool to see how this can be done bespoke, but I wonder if you considered using pytest-cases, which is seemingly designed to do exactly this kind of thing in a more readable way? We already have it as a dependency in our test suite, so it's just a matter of importing it and wiring it up.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know about pytest-cases - looks like an awesome library. For this particular instance though, don't think it will help. The main issue here is that we need to load a file to create the test cases - which means the file loading needs to happen before test cases are generated (either natively by pytest, or through pytest-cases). pytest_generate_tests is one of the available standard hook for doing so. I did simplify this to be strictly about case generation and put the case modification logic to a separate fixture that runs during the test.


@pytest.fixture(scope="module")
def _model_config():
model = "compassop-gpt-5.4"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be read from an environment var so that users don't have to re-hard code it? I think we even have some designated environment variables for LLM model names in our tests, so it may be worth re-using those for consistency

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this got me thinking. It seems like evals run should probably take in model config argument allowing defining different models for different tasks? With your work on splitting document extraction and document parsing as separate entry point - maybe the whole eval suite should probably just use the compass cli to do the extraction - making use of the full config.json5. Let's talk more.

@castelao
Copy link
Copy Markdown
Member

What about managing those PDFs with pooch? Let's discuss that in the meeting later today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants