Enable access to reasoning_text and tool_calls in post-hoc LLM judges via flexeval_file. #285

Merged
junya-takayama merged 8 commits into main from load_lmoutput on May 13, 2026
Collaborator

@junya-takayama junya-takayama commented Apr 30, 2026

flexeval_lm passes LMOutput objects to Metric.evaluate(), giving metrics access to the full generation output including reasoning_text and tool_calls. However, flexeval_file, used for post-hoc evaluation of saved results, only passed the plain lm_output string, discarding all other fields. This made it impossible to apply LLM-judge metrics to reasoning content or tool call data after the fact.

This PR closes that gap: flexeval_file now reconstructs LMOutput from the saved fields, making post-hoc evaluation with LLM judges behave consistently with online evaluation via flexeval_lm.

Implementation summary

  • evaluate_from_data() now reconstructs LMOutput objects from flat eval data dicts, picking up raw_lm_output, reasoning_text, finish_reason, tool_calls, and tool_call_validation_result fields
  • LLM judge metrics (ChatLLMScore, ChatLLMGEvalScore, ChatLLMLabel) now receive LMOutput objects directly instead of plain strings, enabling Jinja2 templates to access fields like {{ lm_output.reasoning_text }}
  • Added LMOutput.__str__ so that existing templates using {{ lm_output }} continue to render the text field without modification (backward compatibility)
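The three points above can be illustrated with a minimal, self-contained sketch. It does not use flexeval itself: the `LMOutput` dataclass below is a hypothetical stand-in whose field names follow this description, `lm_output_from_row` is a hypothetical helper, and the mapping of the saved `raw_lm_output` key to a `raw_text` attribute is an assumption. Python's `str.format` stands in for Jinja2 here, since both support attribute access on the passed object.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LMOutput:
    """Hypothetical stand-in for flexeval's LMOutput (not the real class)."""

    text: Optional[str] = None
    raw_text: Optional[str] = None  # assumed target of the saved raw_lm_output key
    reasoning_text: Optional[str] = None
    finish_reason: Optional[str] = None
    tool_calls: Optional[list] = None
    tool_call_validation_result: Optional[str] = None

    def __str__(self) -> str:
        # Rendering the object directly yields the plain text field,
        # so templates written for string inputs keep working.
        return self.text or ""


def lm_output_from_row(row: dict[str, Any]) -> LMOutput:
    """Rebuild an LMOutput from one flat saved-result dict (hypothetical helper)."""
    return LMOutput(
        text=row.get("lm_output"),
        raw_text=row.get("raw_lm_output"),
        reasoning_text=row.get("reasoning_text"),
        finish_reason=row.get("finish_reason"),
        tool_calls=row.get("tool_calls"),
        tool_call_validation_result=row.get("tool_call_validation_result"),
    )


# One saved row, as flexeval_file might read it back (illustrative values).
row = {
    "lm_output": "The answer is 4.",
    "reasoning_text": "2 + 2 = 4",
    "finish_reason": "stop",
}
lm_output = lm_output_from_row(row)

# Existing-style template: renders via __str__, as {{ lm_output }} would.
legacy_prompt = "Response: {lm_output}".format(lm_output=lm_output)

# New-style template: reaches into a field, as {{ lm_output.reasoning_text }} would.
new_prompt = "Reasoning: {lm_output.reasoning_text}".format(lm_output=lm_output)

print(legacy_prompt)
print(new_prompt)
```

Fields missing from a saved row simply stay `None` in this sketch, so older result files without `reasoning_text` or `tool_calls` would still load.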

@junya-takayama junya-takayama changed the title from "[WIP] Enable access to reasoning_text and tool_calls in post-hoc LLM judges via flexeval_file." to "Enable access to reasoning_text and tool_calls in post-hoc LLM judges via flexeval_file." on May 11, 2026
@junya-takayama junya-takayama marked this pull request as ready for review May 11, 2026 12:31
@junya-takayama junya-takayama marked this pull request as draft May 11, 2026 12:46
@junya-takayama junya-takayama marked this pull request as ready for review May 11, 2026 19:32
@junya-takayama junya-takayama merged commit 4ed5363 into main May 13, 2026
8 checks passed
@junya-takayama junya-takayama deleted the load_lmoutput branch May 13, 2026 06:55

2 participants