Enable access to `reasoning_text` and `tool_calls` in post-hoc LLM judges via `flexeval_file`. by junya-takayama · Pull Request #285 · sbintuitions/flexeval

junya-takayama · 2026-04-30T16:11:14Z

flexeval_lm passes LMOutput objects to Metric.evaluate(), giving metrics access to the full generation output including reasoning_text and tool_calls. However, flexeval_file, used for post-hoc evaluation of saved results, only passed the plain lm_output string, discarding all other fields. This made it impossible to apply LLM-judge metrics to reasoning content or tool call data after the fact.

This PR closes that gap: flexeval_file now reconstructs LMOutput from the saved fields, making post-hoc evaluation with LLM judges behave consistently with online evaluation via flexeval_lm.

Implementation summary

- `evaluate_from_data()` now reconstructs `LMOutput` objects from flat eval data dicts, picking up the `raw_lm_output`, `reasoning_text`, `finish_reason`, `tool_calls`, and `tool_call_validation_result` fields.
- LLM judge metrics (`ChatLLMScore`, `ChatLLMGEvalScore`, `ChatLLMLabel`) now receive `LMOutput` objects directly instead of plain strings, enabling Jinja2 templates to access fields like `{{ lm_output.reasoning_text }}`.
- Added `LMOutput.__str__` so that existing templates using `{{ lm_output }}` continue to render the text field without modification (backward compatibility).
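
The three points above can be illustrated with a minimal sketch. It is not the code from this PR: the `LMOutput` class below is a stand-in dataclass, its attribute names (`text`, `raw_text`) and the helper `rebuild_lm_output` are assumptions made for illustration, and only the saved-row keys are taken from the description. Plain f-strings stand in for Jinja2 templates, which likewise convert an object to text with `str()` when rendering `{{ lm_output }}`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class LMOutput:
    """Stand-in for flexeval's LMOutput; attribute names are assumed."""

    text: Optional[str] = None
    raw_text: Optional[str] = None
    reasoning_text: Optional[str] = None
    finish_reason: Optional[str] = None
    tool_calls: Optional[List[Dict[str, Any]]] = None
    tool_call_validation_result: Optional[str] = None

    def __str__(self) -> str:
        # Rendering the object as a string yields only the text field,
        # so templates written for plain strings keep working.
        return self.text or ""


def rebuild_lm_output(row: Dict[str, Any]) -> LMOutput:
    """Hypothetical helper: map one saved result row back to an LMOutput."""
    return LMOutput(
        text=row.get("lm_output"),
        raw_text=row.get("raw_lm_output"),  # assumed key-to-attribute mapping
        reasoning_text=row.get("reasoning_text"),
        finish_reason=row.get("finish_reason"),
        tool_calls=row.get("tool_calls"),
        tool_call_validation_result=row.get("tool_call_validation_result"),
    )


# One saved result row, as a flat dict (values invented for the example).
row = {
    "lm_output": "4",
    "reasoning_text": "2 + 2 = 4",
    "finish_reason": "stop",
}
lm_output = rebuild_lm_output(row)

# An existing prompt that interpolates the whole object still gets the text.
legacy_prompt = f"Response: {lm_output}"

# A new prompt can additionally reach into the reconstructed fields.
new_prompt = f"Reasoning: {lm_output.reasoning_text}\nResponse: {lm_output}"
```

Fields missing from a saved row simply come back as `None` in this sketch, so older result files without reasoning or tool-call data would still load.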

junya-takayama added 6 commits May 1, 2026 00:56

Merge branch 'main' of github.com:sbintuitions/flexeval

3a37193

add reasoning_content to chat messages

d4d254d

load as LMOutput

b4dd613

lint

94d7b57

add tests

575f2ad

fix

cb1bb09

junya-takayama changed the title ~~[WIP] Enable access to reasoning_text and tool_calls in post-hoc LLM judges via flexeval_file.~~ Enable access to reasoning_text and tool_calls in post-hoc LLM judges via flexeval_file. May 11, 2026

junya-takayama marked this pull request as ready for review May 11, 2026 12:31

junya-takayama marked this pull request as draft May 11, 2026 12:46

fix test_evaluate_chat_response.py

652bb79

junya-takayama marked this pull request as ready for review May 11, 2026 19:32

add tests

d2f5b4e

yuma-hirakawa approved these changes May 13, 2026


junya-takayama merged commit 4ed5363 into main May 13, 2026
8 checks passed

junya-takayama deleted the load_lmoutput branch May 13, 2026 06:55

junya-takayama merged 8 commits into main from load_lmoutput
