Add ClawBench (arXiv:2604.08523) to Benchmark by reacher-z · Pull Request #1 · SafeRL-Lab/agentic-web

reacher-z · 2026-05-21T00:05:24Z

Adds ClawBench to the Benchmark section.

ClawBench evaluates browser agents on live production websites (Uber Eats, Indeed, Craigslist, etc.) using two-stage scoring: deterministic interception of the agent's HTTP requests, then an LLM judge applied to the intercepted payload.
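For readers unfamiliar with this style of scoring, here is a minimal sketch of how a two-stage check of this kind might be wired together. It is not ClawBench's actual code: `InterceptedRequest`, `schema_check`, `score_task`, the field names, and the example URL are all hypothetical, and the LLM judge is replaced by a plain callable so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class InterceptedRequest:
    """Hypothetical record of one HTTP request captured from the agent's browser."""
    method: str
    url: str
    payload: dict


def schema_check(req: InterceptedRequest, expected_method: str,
                 url_substring: str, required_fields: Dict[str, type]) -> bool:
    """Stage 1 (deterministic): verify method, target URL, and payload field types."""
    if req.method.upper() != expected_method.upper():
        return False
    if url_substring not in req.url:
        return False
    return all(
        name in req.payload and isinstance(req.payload[name], expected_type)
        for name, expected_type in required_fields.items()
    )


def score_task(req: InterceptedRequest, expected_method: str, url_substring: str,
               required_fields: Dict[str, type],
               judge: Callable[[dict], bool]) -> bool:
    """Run stage 1, then pass the payload to the judge (stage 2) only if stage 1 passes."""
    if not schema_check(req, expected_method, url_substring, required_fields):
        return False
    return judge(req.payload)


def stub_judge(payload: dict) -> bool:
    """Stand-in for an LLM judge (assumption): accepts any order of at least one item."""
    return payload.get("quantity", 0) >= 1


request = InterceptedRequest(
    method="POST",
    url="https://example.com/api/orders",  # placeholder URL, not a real platform
    payload={"item_id": "abc", "quantity": 2},
)
print(score_task(request, "POST", "/api/orders",
                 {"item_id": str, "quantity": int}, stub_judge))
```

Running the deterministic check first means the judge only ever sees payloads that already have the expected shape, which keeps the non-deterministic stage narrow.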

arXiv: https://arxiv.org/abs/2604.08523 (May 2026)
283 tasks (V1 153 + V2 130) across 163 live platforms · 15 life categories
Code: https://github.com/reacher-z/ClawBench · Live: https://claw-bench.com

This work is in direct conversation with the "An Illusion of Progress?" Online-Mind2Web paper (already listed) on live-web evaluation methodology: we add deterministic per-request schema checks on top of the live-execution paradigm.

Affiliation: I'm one of the maintainers.

Add ClawBench (arXiv:2604.08523) to Benchmark

b89ffe4

reacher-z wants to merge 1 commit into SafeRL-Lab:main from reacher-z:add-clawbench
