Hi, thanks for releasing DeepEyesV2 and the evaluation code.
I am trying to reproduce the reported results on several benchmarks: VStarBench, HRBench4K, HRBench8K, OCRBench, TreeBench, and SEEDBench2_Plus. My reproduced scores differ somewhat from the reported results, so I would like to clarify the expected evaluation setup.
Since DeepEyesV2 supports both code execution and web search tools, my main question is whether web search is intended to be used on these perception/OCR-style benchmarks, or only on search-oriented benchmarks (e.g., knowledge- and information-seeking ones such as MMSearch and SimpleVQA). Intuitively, the benchmarks I listed above seem to be primarily visual perception, fine-grained recognition, and OCR tasks, where I would expect the code sandbox to be the dominant tool and web search to be unnecessary or even harmful. But I want to confirm that this matches the official evaluation protocol.
In my local setup, the code sandbox is working, but the web search service is not configured, so search results are effectively unavailable. From the released evaluation/VLMEvalKit code, agent mode seems to mainly use the Python/code sandbox, and the search-related logic appears to live mostly in the inference demo and RL environment.
Could you please advise on:
1. For the benchmarks listed above (VStarBench, HRBench4K, HRBench8K, OCRBench, TreeBench, SEEDBench2_Plus), is web search enabled during official evaluation, or is it reserved only for search-oriented benchmarks?
2. If web search is in fact used for any of them, what is the recommended entrypoint or configuration in evaluation/VLMEvalKit to enable it?
3. If web search is not used for these benchmarks, are there other config differences (max tool turns, sandbox timeout, prompt template, sampling params, etc.) that could explain a gap from the reported numbers?
Thanks in advance!