
How are you grading yourself for your AndroidWorld results? #1

@rossamurphy

Hi guys,

When a task completes, I see a result.json like this in the eval/ folder:

{
  "task_id": 0,
  "task_name": "SystemBluetoothTurnOn",
  "task_idx": 0,
  "task_description": "Turn bluetooth on.",
  "max_steps": 15,
  "success": 0.0,
  "agent_success": false,
  "steps_taken": 1,
  "execution_time": 12.689261,
  "reasoning": false,
  "final_thought": "To turn on Bluetooth, we first need to open the Settings app, but it is not currently open. Please navigate to the Settings app first.",
  "logs": [],
  "timestamp": "2025-06-26T16:18:20.153066",
  "error": null,
  "trajectory": [],
  "trajectory_stats": {
    "total_steps": 0,
    "planning_steps": 0,
    "execution_steps": 0
  },
  "device": "emulator-5554"
}

In this instance, the result.json correctly identifies that the task failed (I watched it, and can confirm).

However, I have noticed that your result.json often marks the agent's run as a 'success' even though the task actually failed. For example, when I ran the add contact task, 'success' is the float 0.0 (I imagine this is the reward you get back from the Android Env environment?), yet 'agent_success' is the boolean 'true':

{
  "task_id": 0,
  "task_name": "ContactsAddContact",
  "task_idx": 0,
  "task_description": "Create a new contact for Emilia Gonzalez. Their number is +14240925675.",
  "max_steps": 18,
  "success": 0.0,
  "agent_success": true,
  "steps_taken": 7,
  "execution_time": 77.803622,
  "reasoning": false,
  "final_thought": "Successfully created new contact for Emilia Gonzalez",
  "logs": [],
  "timestamp": "2025-06-26T16:01:55.911568",
  "error": null,
  "trajectory": [],
  "trajectory_stats": {
    "total_steps": 0,
    "planning_steps": 0,
    "execution_steps": 0
  },
  "device": "emulator-5554"
}

Could you please confirm that the 63% success rate you report on the AndroidWorld benchmark is based on the float 'success' value, and not on the boolean 'agent_success' value?
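To make the discrepancy easier to measure across a whole run, here is a rough sketch of how I would compare the two fields. It assumes every task writes a file named result.json somewhere under the eval/ folder, and that a task only counts as passed when 'success' is at least 1.0; both are guesses on my part, not something I have confirmed in your code. The helper names `load_results` and `summarize` are my own.

```python
import json
from pathlib import Path


def load_results(eval_dir):
    """Load every result.json found under eval_dir (searched recursively).

    Assumption: each task run writes a file literally named result.json.
    """
    results = []
    for path in sorted(Path(eval_dir).rglob("result.json")):
        with path.open(encoding="utf-8") as f:
            results.append(json.load(f))
    return results


def summarize(results, threshold=1.0):
    """Compare the float 'success' field with the boolean 'agent_success'.

    Assumption: a task passes only when success >= threshold (1.0 here).
    """
    n = len(results)
    env_ok = [float(r.get("success", 0.0)) >= threshold for r in results]
    agent_ok = [bool(r.get("agent_success", False)) for r in results]
    # Tasks where the agent's self-report disagrees with the 'success' value.
    mismatches = [
        r.get("task_name")
        for r, e, a in zip(results, env_ok, agent_ok)
        if e != a
    ]
    return {
        "n": n,
        "env_success_rate": sum(env_ok) / n if n else 0.0,
        "agent_success_rate": sum(agent_ok) / n if n else 0.0,
        "mismatches": mismatches,
    }
```

Calling `summarize(load_results("eval"))` on my two runs above would list ContactsAddContact as a mismatch, since the agent reports success there while 'success' is 0.0.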

Also, would you mind publishing the code / config needed to reproduce your results on the benchmark? I have been trying a few tasks here and there with your framework, but most of them are failing, so perhaps I am using the wrong settings or using it incorrectly!

Thanks for your work on this!
