@W-22516537 robust test harness for evaling claude code with tmcp using benchmark california schools dataset by joeconstantino · Pull Request #354 · tableau/tableau-mcp

joeconstantino · 2026-05-15T22:28:47Z

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Any change needs to be discussed before proceeding. Failure to do so may result in the rejection of
the pull request.

Pull Request Template

Description

Adds a Claude Code eval harness for the BIRD California Schools benchmark. It includes 30 text-to-SQL questions that can be run against a corresponding published Tableau data source using the Tableau MCP server, and it grades answers on numeric and semantic correctness. It also routes traces to LangSmith.
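
The grader itself is not shown in this description. As a rough illustration of what the numeric side of the grading could look like, here is a minimal sketch. The names `extract_numbers` and `grade_numeric` and the relative tolerance are assumptions made for this example, not the harness's actual API.

```python
import math
import re

# Matches integers and decimals, with optional thousands separators (e.g. "1,234.5").
_NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> list[float]:
    """Pull every numeric literal out of a free-text agent answer."""
    return [float(match.replace(",", "")) for match in _NUMBER.findall(text)]


def grade_numeric(answer: str, expected: float, rel_tol: float = 1e-3) -> bool:
    """Hypothetical check: pass if any number in the answer is close to the expected value."""
    return any(
        math.isclose(value, expected, rel_tol=rel_tol)
        for value in extract_numbers(answer)
    )
```

A check like this only covers answers that reduce to a single number. Questions whose answers are names or lists would still need the semantic grading mentioned above.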

Motivation and Context

Evaluating Tableau MCP with a state-of-the-art agent has been difficult. Now, using Claude Code with hooks and LangSmith integration, we can evaluate the impact of tmcp tool additions, refactors, and skills, as well as of different agent harnesses and models.

Type of Change

How Has This Been Tested?

n/a

Checklist

I have updated the version in the package.json file by using npm run version. For example,
use npm run version:patch for a patch version bump.
I have made any necessary changes to the documentation
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have documented any breaking changes in the PR description. For example, renaming a config
environment variable or changing its default value.

Contributor Agreement

By submitting this pull request, I confirm that:

I have read the CONTRIBUTING guidelines for this project and followed
its Contribution Checklist.

mattcfilbert · 2026-06-17T19:36:18Z

@joeconstantino — 🤖 MattGPT (Matt Filbert's agent)

Well-architected eval harness — hooks capture timing accurately, grading has both numeric and semantic signals, and eval artifacts are gitignored / opt-in.

Two longevity questions before merge:

The 728-line committed suite JSON couples answers to a specific dataset snapshot — worth documenting the regeneration path.
The bird_mini submodule pulls an external BIRD dataset — verify the submodule URL is stable/public and the license permits redistribution here.

(Inline notes on both.)
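
One way to make the coupling between the suite's answers and the dataset snapshot visible is to store a fingerprint of the snapshot next to the answers and compare it before a run. The sketch below is only an illustration of that idea. The `dataset_sha256` field and both function names are hypothetical and do not come from the PR.

```python
import hashlib
import json


def fingerprint(rows: list[dict]) -> str:
    """Hash the rows in a canonical JSON form so that key order does not matter."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def suite_is_stale(suite: dict, current_rows: list[dict]) -> bool:
    """Hypothetical check: the suite is stale if its recorded hash differs from the data."""
    return suite.get("dataset_sha256") != fingerprint(current_rows)
```

If the check reports a stale suite, the expected answers would need to be regenerated from the current snapshot before the results are trusted.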

joeconstantino added 3 commits May 15, 2026 14:51

robust test harness for evluating tmcp + claude code with the califor…

3d9a805

…nia schools data and official test cases

version bump

f25b8db

docs update

1544a53

joeconstantino changed the title ~~robust test harness for evaling claude code with tmcp using benchmark california schools dataset~~ @W-22516537 robust test harness for evaling claude code with tmcp using benchmark california schools dataset May 15, 2026

linting fixes

18137bf

joeconstantino wants to merge 4 commits into main from joecon/eval
