Add ToolCorrectness metric with full Python deepeval parity by holsee · Pull Request #1 · holsee/deep_eval_ex

holsee · 2026-03-10T17:34:19Z

Summary

- Ports deepeval's ToolCorrectnessMetric to Elixir with full feature parity
- Deterministic tool calling score with three modes: greedy matching, exact positional matching, and weighted LCS for ordering
- LLM-based tool selection scoring when available_tools are provided (final score = min of both)
- Supports evaluation_params for comparing :input_parameters and :output fields beyond tool name
- Adds strict_mode, include_reason, and threshold configuration
- Comprehensive test suite covering all scoring modes, edge cases, and LLM integration
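For readers unfamiliar with the three deterministic modes, here is a minimal sketch that compares tool names only. It is not the code in this PR: the module `ToolNameMatchSketch` and its function names are hypothetical, every match carries a weight of 1 (so the LCS is unweighted here, whereas the metric presumably weights matches by parameter and output similarity), and empty-list handling is deliberately left out.

```elixir
defmodule ToolNameMatchSketch do
  @moduledoc false
  # Hypothetical illustration; names-only simplification of the three modes.

  # Exact positional matching: same length and same name at every position.
  def exact(called, expected) do
    if called == expected, do: 1.0, else: 0.0
  end

  # Greedy (non-exact) matching: each expected name consumes at most one
  # called name; order is ignored.
  def non_exact(called, [_ | _] = expected) do
    {matched, _remaining} =
      Enum.reduce(expected, {0, called}, fn name, {count, remaining} ->
        if name in remaining do
          {count + 1, List.delete(remaining, name)}
        else
          {count, remaining}
        end
      end)

    matched / length(expected)
  end

  # Ordering: length of the longest common subsequence over expected length.
  def ordered(called, [_ | _] = expected) do
    lcs_length(called, expected) / length(expected)
  end

  # Naive recursive LCS; adequate for the short lists typical of tool calls.
  defp lcs_length([], _), do: 0
  defp lcs_length(_, []), do: 0
  defp lcs_length([h | t1], [h | t2]), do: 1 + lcs_length(t1, t2)

  defp lcs_length([_ | t1] = a, [_ | t2] = b) do
    max(lcs_length(t1, b), lcs_length(a, t2))
  end
end
```

With `called = ["search", "fetch"]` and `expected = ["fetch", "search"]`, this sketch gives `0.0` for `exact/2`, `1.0` for `non_exact/2`, and `0.5` for `ordered/2`, which shows why the three modes need separate test coverage.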

Test plan

- All existing tests pass
- New tool_correctness_test.exs covers: non-exact match, exact match, ordering (weighted LCS), parameter evaluation, strict mode, reason generation, tool selection (LLM), and validation edge cases
- mix format passes
- mix credo --strict passes
- mix dialyzer passes
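The strict mode and tool selection cases depend on how the two scores are combined. The sketch below illustrates the rule stated in the summary (final score = min of both) under some assumptions: `ScoreSketch.final_score/3` is a hypothetical helper rather than this PR's API, the default threshold of 0.5 is assumed, and strict mode is assumed to raise the threshold to 1.0 and zero out any score below it, as the Python metric is understood to do.

```elixir
defmodule ScoreSketch do
  @moduledoc false
  # Hypothetical helper; option names follow the PR summary, behaviour is assumed.

  def final_score(calling_score, selection_score \\ nil, opts \\ []) do
    strict? = Keyword.get(opts, :strict_mode, false)

    # Assumption: strict mode forces the threshold to 1.0.
    threshold = if strict?, do: 1.0, else: Keyword.get(opts, :threshold, 0.5)

    # The tool selection score only exists when available_tools were given;
    # when present, the weaker of the two scores wins.
    combined =
      case selection_score do
        nil -> calling_score
        score -> min(calling_score, score)
      end

    # Assumption: in strict mode anything below the threshold becomes 0.0.
    score = if strict? and combined < threshold, do: 0.0, else: combined

    %{score: score, success: score >= threshold}
  end
end
```

For example, `ScoreSketch.final_score(1.0, 0.75)` yields a score of `0.75` and passes the assumed default threshold, while the same inputs with `strict_mode: true` yield `0.0` and fail.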

Port of deepeval's ToolCorrectnessMetric with all features:
- Deterministic tool calling score (exact match, non-exact, weighted LCS)
- LLM-based tool selection scoring when available_tools provided
- strict_mode, include_reason, evaluation_params options
- Empty list handling matching Python behaviour (scores, not errors)
- Prompt template and schema modules following existing patterns

holsee added 2 commits March 10, 2026 17:42

Update CI versions to match .tool-versions (Elixir 1.19.5, OTP 28.4)

e4388a5

holsee force-pushed the feat/tool-correctness-metric branch from d579604 to e4388a5 Compare March 10, 2026 17:49

holsee merged commit 9614f5b into main Mar 10, 2026
1 check passed

holsee deleted the feat/tool-correctness-metric branch March 10, 2026 17:57

holsee merged 2 commits into main from feat/tool-correctness-metric
