Hi guys,
When a task completes, I see a result.json like this in the eval/ folder:
{
  "task_id": 0,
  "task_name": "SystemBluetoothTurnOn",
  "task_idx": 0,
  "task_description": "Turn bluetooth on.",
  "max_steps": 15,
  "success": 0.0,
  "agent_success": false,
  "steps_taken": 1,
  "execution_time": 12.689261,
  "reasoning": false,
  "final_thought": "To turn on Bluetooth, we first need to open the Settings app, but it is not currently open. Please navigate to the Settings app first.",
  "logs": [],
  "timestamp": "2025-06-26T16:18:20.153066",
  "error": null,
  "trajectory": [],
  "trajectory_stats": {
    "total_steps": 0,
    "planning_steps": 0,
    "execution_steps": 0
  },
  "device": "emulator-5554"
}
In this instance, the result.json correctly identifies that the task failed (I watched it, and can confirm).
However, I have noticed that result.json often marks the agent's run as a success even though the task actually failed. For example, when I ran the add contact task, 'success' is the float 0.0 (I imagine this is the reward you get back from the Android Env environment?), yet 'agent_success' is the boolean true:
{
  "task_id": 0,
  "task_name": "ContactsAddContact",
  "task_idx": 0,
  "task_description": "Create a new contact for Emilia Gonzalez. Their number is +14240925675.",
  "max_steps": 18,
  "success": 0.0,
  "agent_success": true,
  "steps_taken": 7,
  "execution_time": 77.803622,
  "reasoning": false,
  "final_thought": "Successfully created new contact for Emilia Gonzalez",
  "logs": [],
  "timestamp": "2025-06-26T16:01:55.911568",
  "error": null,
  "trajectory": [],
  "trajectory_stats": {
    "total_steps": 0,
    "planning_steps": 0,
    "execution_steps": 0
  },
  "device": "emulator-5554"
}
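For reference, this is roughly how I am spotting these cases. It is only a sketch: the helper names `load_results` and `find_mismatches` are my own, the assumption that every run writes a file called result.json somewhere under eval/ comes from what I see locally, and treating `success >= 1.0` as "passed" is my guess at how the float should be read.

```python
import json
from pathlib import Path


def load_results(eval_dir):
    """Load every result.json found under eval_dir (directory layout is assumed)."""
    return [json.loads(p.read_text()) for p in Path(eval_dir).rglob("result.json")]


def find_mismatches(results):
    """Return results where 'agent_success' disagrees with the float 'success'.

    Assumption: a task counts as passed when success >= 1.0.
    """
    mismatched = []
    for r in results:
        env_passed = float(r.get("success", 0.0)) >= 1.0
        agent_passed = bool(r.get("agent_success", False))
        if env_passed != agent_passed:
            mismatched.append(r)
    return mismatched


if __name__ == "__main__":
    for r in find_mismatches(load_results("eval")):
        print(r["task_name"], r["success"], r["agent_success"])
```

On the two files above, this flags ContactsAddContact and leaves SystemBluetoothTurnOn alone.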
Could you please confirm that the 63% success rate you report on the AndroidWorld benchmark is based on the float 'success' value, and not the boolean 'agent_success' value?
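To make the question concrete, here is a small sketch of the two ways the headline number could be computed from a set of result.json contents. The function `success_rates` is hypothetical (not from your repo), and averaging the float 'success' across tasks is my assumption about how a benchmark score would be derived.

```python
def success_rates(results):
    """Compute the success rate two ways from a list of parsed result.json dicts.

    'env'   averages the float 'success' field (assumed to be the environment reward).
    'agent' is the fraction of tasks where the agent reported 'agent_success' as true.
    """
    n = len(results)
    if n == 0:
        return {"env": 0.0, "agent": 0.0}
    env = sum(float(r.get("success", 0.0)) for r in results) / n
    agent = sum(1 for r in results if r.get("agent_success")) / n
    return {"env": env, "agent": agent}
```

If the two rates differ a lot over a full run, it matters a great deal which one the reported figure refers to.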
Also, would you mind publishing the code / config needed to reproduce your results on the benchmark? I have tried a few tasks here and there with your framework and most of them fail, so perhaps I am using the wrong settings or using it incorrectly!
Thanks for your work on this!