Add ClawBench (arXiv:2604.08523) to Benchmark #1

Open
reacher-z wants to merge 1 commit into SafeRL-Lab:main from reacher-z:add-clawbench

Conversation

@reacher-z

Adds ClawBench to the Benchmark section.

ClawBench evaluates browser agents on live production websites (Uber Eats, Indeed, Craigslist, etc.) using two-stage scoring: deterministic HTTP-request interception, then an LLM judge on the intercepted payload.
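
To make the two-stage idea concrete, here is a minimal Python sketch. It is not ClawBench's actual code: `InterceptedRequest`, `RequestSchema`, `passes_schema`, `score_task`, and the `judge` callable are all hypothetical names, and the schema format (method, URL prefix, required typed fields) is an assumption chosen for illustration. The `judge` argument stands in for an LLM call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Sequence


@dataclass(frozen=True)
class InterceptedRequest:
    """One HTTP request captured during the agent's run (hypothetical shape)."""
    method: str
    url: str
    payload: Mapping[str, Any]


@dataclass(frozen=True)
class RequestSchema:
    """Hypothetical schema: expected method, URL prefix, and required typed fields."""
    method: str
    url_prefix: str
    required_fields: Mapping[str, type]


def passes_schema(req: InterceptedRequest, schema: RequestSchema) -> bool:
    """Stage 1: deterministic check, no model involved."""
    if req.method.upper() != schema.method.upper():
        return False
    if not req.url.startswith(schema.url_prefix):
        return False
    return all(
        name in req.payload and isinstance(req.payload[name], expected)
        for name, expected in schema.required_fields.items()
    )


def score_task(
    requests: Sequence[InterceptedRequest],
    schema: RequestSchema,
    judge: Callable[[Mapping[str, Any]], bool],
) -> bool:
    """Stage 2 runs only on payloads that survived stage 1."""
    candidates = [r for r in requests if passes_schema(r, schema)]
    if not candidates:
        return False  # fail deterministically, the judge is never called
    return any(judge(r.payload) for r in candidates)
```

In this sketch the judge only ever sees payloads that already match the expected request shape, so a run that never issues the right request fails without any model call.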

ClawBench is in direct conversation with the "An Illusion of Progress?" Online-Mind2Web paper (already listed) on live-web evaluation methodology: we add deterministic per-request schema checks on top of the live-execution paradigm.

Affiliation: I'm one of the maintainers.

1 participant