Add ToolCorrectness metric with full Python deepeval parity #1

Merged: holsee merged 2 commits into main from feat/tool-correctness-metric on Mar 10, 2026

holsee (Owner) commented Mar 10, 2026

Summary

  • Ports deepeval's ToolCorrectnessMetric to Elixir with full feature parity
  • Deterministic tool-calling score with three modes: greedy matching, exact positional matching, and weighted LCS for ordering
  • LLM-based tool selection scoring when available_tools are provided (the final score is the minimum of the deterministic and LLM-based scores)
  • Supports evaluation_params for comparing :input_parameters and :output fields beyond tool name
  • Adds strict_mode, include_reason, and threshold configuration
  • Comprehensive test suite covering all scoring modes, edge cases, and LLM integration
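
To make the scoring modes in the summary concrete, here is a minimal Python sketch of how such scores could be computed over tool names only. It is not the code in this PR or in deepeval: every function name is hypothetical, a plain (unweighted) LCS stands in for the weighted variant, and the empty-list return values are an assumption about what "scores, not errors" means.

```python
from typing import List, Optional


def exact_match_score(called: List[str], expected: List[str]) -> float:
    """Exact positional matching: all-or-nothing on identical sequences."""
    return 1.0 if called == expected else 0.0


def greedy_match_score(called: List[str], expected: List[str]) -> float:
    """Greedy matching: order is ignored, each called tool is consumed once."""
    if not expected:
        # Assumption: empty expectations yield a score instead of an error.
        return 1.0 if not called else 0.0
    remaining = list(called)
    matched = 0
    for name in expected:
        if name in remaining:
            remaining.remove(name)  # a called tool can satisfy one expectation
            matched += 1
    return matched / len(expected)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (standard dynamic programme)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def ordering_score(called: List[str], expected: List[str]) -> float:
    """Ordering-aware score: unweighted LCS length over the expected length."""
    if not expected:
        # Assumption: same empty-list convention as greedy_match_score.
        return 1.0 if not called else 0.0
    return lcs_length(called, expected) / len(expected)


def final_score(tool_calling: float, tool_selection: Optional[float] = None) -> float:
    """Take the minimum of both scores when an LLM selection score exists."""
    if tool_selection is None:
        return tool_calling
    return min(tool_calling, tool_selection)


if __name__ == "__main__":
    called = ["search", "calculator"]
    expected = ["calculator", "search"]
    print(exact_match_score(called, expected))
    print(greedy_match_score(called, expected))
    print(ordering_score(called, expected))
```

With the two tools called in the wrong order, the exact mode rejects the run, the greedy mode accepts it fully, and the ordering mode gives partial credit, which is why the mode choice matters for agents whose tool order is significant.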

Test plan

  • All existing tests pass
  • New tool_correctness_test.exs covers: non-exact match, exact match, ordering (weighted LCS), parameter evaluation, strict mode, reason generation, tool selection (LLM), and validation edge cases
  • mix format passes
  • mix credo --strict passes
  • mix dialyzer passes

holsee added 2 commits March 10, 2026 17:42
Port of deepeval's ToolCorrectnessMetric with all features:
- Deterministic tool calling score (exact match, non-exact, weighted LCS)
- LLM-based tool selection scoring when available_tools provided
- strict_mode, include_reason, evaluation_params options
- Empty list handling matching Python behaviour (scores, not errors)
- Prompt template and schema modules following existing patterns
holsee force-pushed the feat/tool-correctness-metric branch from d579604 to e4388a5 on March 10, 2026 17:49
holsee merged commit 9614f5b into main on Mar 10, 2026
1 check passed
holsee deleted the feat/tool-correctness-metric branch on March 10, 2026 17:57