@W-22516537 robust test harness for evaling claude code with tmcp using benchmark california schools dataset#354

Open
joeconstantino wants to merge 4 commits into main from joecon/eval

joeconstantino commented May 15, 2026

Contributor

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Any change needs to be discussed before proceeding. Failure to do so may result in the rejection of
the pull request.

Pull Request Template

Description

Adds a Claude Code eval harness for the BIRD California Schools benchmark. It includes 30 text-to-SQL questions that can be run against a corresponding Tableau published data source using the Tableau MCP server, and it grades answers on numeric and semantic correctness. It also routes traces to LangSmith.
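
As a rough illustration of the numeric side of grading, a check along the following lines would pull numbers out of the agent's free-text answer and compare them to the expected value within a tolerance. The names `extract_numbers` and `grade_numeric`, the regex-based extraction, and the 1% relative tolerance are all assumptions made for this sketch; the harness in this PR may grade differently.

```python
import math
import re

# Hypothetical helpers for illustration only; not taken from the PR.
# Matches integers and decimals, with optional thousands separators.
_NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> list[float]:
    """Return every numeric literal in a free-text answer, dropping thousands separators."""
    return [float(match.replace(",", "")) for match in _NUMBER.findall(text)]


def grade_numeric(answer: str, expected: float, rel_tol: float = 0.01) -> bool:
    """Pass if any number in the answer is within rel_tol of the expected value."""
    return any(
        math.isclose(value, expected, rel_tol=rel_tol, abs_tol=1e-9)
        for value in extract_numbers(answer)
    )
```

Accepting any matching number in the answer is deliberately lenient. A stricter grader could require the match to be the final number stated, and the semantic check would still need to confirm that the number answers the question asked.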

Motivation and Context

Evaluating Tableau MCP with a state-of-the-art agent has been difficult. Now, using Claude Code with hooks and a LangSmith integration, we can evaluate the impact of Tableau MCP (tmcp) tool additions, refactors, and skills, as well as different agent harnesses and models.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Other (please describe):

How Has This Been Tested?

n/a

Checklist

  • I have updated the version in the package.json file by using npm run version. For example,
    use npm run version:patch for a patch version bump.
  • I have made any necessary changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have documented any breaking changes in the PR description. For example, renaming a config
    environment variable or changing its default value.

Contributor Agreement

By submitting this pull request, I confirm that:

joeconstantino changed the title from "robust test harness for evaling claude code with tmcp using benchmark california schools dataset" to "@W-22516537 robust test harness for evaling claude code with tmcp using benchmark california schools dataset" on May 15, 2026
mattcfilbert commented Jun 17, 2026

Contributor

@joeconstantino — 🤖 MattGPT (Matt Filbert's agent)

Well-architected eval harness — hooks capture timing accurately, grading has both numeric and semantic signals, and eval artifacts are gitignored / opt-in.

Two longevity questions before merge:

  • The 728-line committed suite JSON couples answers to a specific dataset snapshot — worth documenting the regeneration path.
  • The bird_mini submodule pulls an external BIRD dataset — verify the submodule URL is stable/public and the license permits redistribution here.

(Inline notes on both.)
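
One way to make the snapshot coupling from the first point explicit, sketched under assumptions: store a content hash of the dataset file in the suite JSON and fail fast when the two diverge. The `dataset_sha256` key and both function names are hypothetical and do not come from the PR.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large dataset files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def suite_matches_snapshot(suite_path: Path, dataset_path: Path) -> bool:
    """Return True when the hash recorded in the suite equals the dataset's current hash.

    Assumes the suite JSON is an object with a hypothetical "dataset_sha256" key.
    """
    suite = json.loads(suite_path.read_text(encoding="utf-8"))
    return suite.get("dataset_sha256") == file_sha256(dataset_path)
```

A mismatch would then signal that the expected answers need regenerating before results are compared across runs.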
