@W-22516537 robust test harness for evaling claude code with tmcp using benchmark california schools dataset #354
Open
joeconstantino wants to merge 4 commits into
…nia schools data and official test cases
Contributor
@joeconstantino — 🤖 MattGPT (Matt Filbert's agent) Well-architected eval harness — hooks capture timing accurately, grading has both numeric and semantic signals, and eval artifacts are gitignored / opt-in. Two longevity questions before merge:
(Inline notes on both.)
IMPORTANT: Please do not create a Pull Request without creating an issue first.
Any change needs to be discussed before proceeding. Failure to do so may result in the rejection of
the pull request.
Pull Request Template
Description
Adds a Claude Code eval harness for the BIRD California Schools benchmark. It includes 30 text-to-SQL questions that can be run against a corresponding Tableau published data source through the Tableau MCP server, and it grades answers on numeric and semantic correctness. It also routes traces to LangSmith.
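As a rough illustration of what a numeric correctness check might look like, here is a minimal sketch. It is not the grader in this PR: the names `extract_numbers` and `numeric_match`, the regular expression, and the default tolerance are all assumptions made for the example. The idea is to pull every number out of the agent's free-text answer and accept the answer if any of them is within a relative tolerance of the expected value.

```python
import math
import re

# Hypothetical helpers for illustration only; not taken from the PR.
# Matches integers and decimals, with optional thousands separators.
NUMBER_RE = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> list[float]:
    """Return every number found in the text, with commas stripped."""
    return [float(m.replace(",", "")) for m in NUMBER_RE.findall(text)]


def numeric_match(answer: str, expected: float, rel_tol: float = 1e-3) -> bool:
    """True if any number in the answer is close to the expected value.

    rel_tol is an assumed default; a real harness would tune it per question.
    """
    return any(
        math.isclose(value, expected, rel_tol=rel_tol, abs_tol=1e-9)
        for value in extract_numbers(answer)
    )


if __name__ == "__main__":
    print(numeric_match("The average is 1,234.50 students", 1234.5))
    print(numeric_match("About 42 schools", 43.0))
```

A check like this only covers the numeric signal. An answer that states the right number for the wrong entity would still pass, which is one reason to pair it with a semantic check.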
Motivation and Context
Evaluating Tableau MCP with a state-of-the-art agent has been difficult. Now, using Claude Code with hooks and the LangSmith integration, we can evaluate the impact of tmcp tool additions, refactors, and skills, as well as different agent harnesses and models.
Type of Change
How Has This Been Tested?
n/a
Checklist
npm run version. For example, use npm run version:patch for a patch version bump.
environment variable or changing its default value.
Contributor Agreement
By submitting this pull request, I confirm that:
its Contribution Checklist.