Hello,
I am currently working on reproducing the results presented in Table 2 of the DriveBench paper. I have two specific questions regarding the experimental setup:
- Data Specification for Inference
Could you clarify whether the inference results for all models were obtained using drivebench-test.json or drivebench-test-final.json?
Additionally, I would appreciate it if you could explain the motivation behind adding the test-final version, particularly for handling single-image cases.
- Evaluation Prompt Consistency
I noticed a potential discrepancy between the PERCEPTION_VQA_PROMPT in the repository and the version shown in Figure 23 of the paper. Could you please confirm which of the two is correct?
Since the paper mentions various prompt types (e.g., rubric-aware, context-aware), could you specify which evaluation prompt was used to generate the results in Table 2?
Thank you for your time and for sharing this valuable research.