Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision
TL;DR β CaptchaMind is the first training-based CAPTCHA solver. We build CaptchaBench, a 16,000-sample benchmark with process-level annotations, and train a Qwen2.5-VL-7B model with an RL framework that explicitly supervises the model's reasoning process by rewarding correct grounding of task-relevant visual regions at intermediate steps. CaptchaMind reaches 82.9% average success across eight tasks and 71.0% on real-world CAPTCHAs β surpassing all closed-source-API and agentic baselines.
The eight CAPTCHA task types in CaptchaBench.
- π’ News
- π Resources
- π Overview
- π Project Structure
- βοΈ Installation
- π¦ Dataset
- π€ Model & Inference Server
- π Quick Start / Evaluation
- π Results
- πΊοΈ Roadmap
- π Citation
- π License
- π Acknowledgements
- [2026-06-05] We release the CaptchaBench dataset and CaptchaMind-7B model weights. π
- [2026-05-20] Our paper is available on arXiv, along with the evaluation code.
| Resource | Link |
|---|---|
| π Paper | https://arxiv.org/abs/2605.19538 |
| π€ Model (CaptchaMind-7B) | https://huggingface.co/AIDC-AI/CaptchaMind-7B |
| ποΈ Dataset (CaptchaBench) | https://huggingface.co/datasets/AIDC-AI/Captcha |
CaptchaMind is the first training-based CAPTCHA solver. We release CaptchaBench β 16,000 programmatically generated samples across 8 task types (2,000 train + 200 test each) with region- and process-level annotations β and a CaptchaMind-7B model (Qwen2.5-VL-7B), trained in two stages (SFT followed by GRPO-based RL with explicit supervision of intermediate region grounding). See the paper for method details.
This repository provides the CaptchaBench environments and evaluation harness.
| Category | Tasks |
|---|---|
| Image-switching | connect_icon, coordinates, dart_count, rotation_match |
| Multi-step interaction | click_order, image_select |
| Single-step decision | dice_count, slide_puzzle |
Our task taxonomy is inspired by OpenCaptchaWorld (Luo et al., 2025); all CaptchaBench samples are independently and programmatically generated by us.
captcha-mind/
βββ captcha/
β βββ run.py # Entry point: per-task batch evaluation functions
β βββ data_types.py # Action / EnvResponse / SolveResult dataclasses
β βββ agents/
β β βββ base.py # Abstract Agent interface
β β βββ react_agent.py # ReAct agent: <think> + <tool_call> loop & parsing
β βββ env/ # One POMDP environment per task
β β βββ base.py # Env base class (reset / step)
β β βββ connect_icon/ # each task: env.py + data/ (images + data.jsonl)
β β βββ coordinates/
β β βββ dart_count/
β β βββ dice_count/
β β βββ rotation_match/
β β βββ image_recognition/ # image_select task
β β βββ patch_select/ # image_select task
β β βββ click_order/
β β βββ slide_puzzle/
β βββ utils/
β β βββ model_utils.py # OpenAI-compatible model endpoint config
β β βββ image_utils.py # Image composition / annotation helpers
β βββ test_results/ # Saved evaluation reports (JSON)
βββ setup.py
βββ LICENSE
βββ README.md
Each environment follows the same reset() / step(action) contract (see captcha/env/base.py), exposing the CAPTCHA as a POMDP with deterministic, reproducible transitions.
git clone https://github.com/AlibabaResearch/captcha-mind
cd captcha-mind
pip install -e .The harness additionally requires requests and Pillow (installed transitively). Python 3.9+ is recommended.
The full CaptchaBench dataset is hosted on Hugging Face: AIDC-AI/Captcha.
pip install -U huggingface_hub
huggingface-cli download AIDC-AI/Captcha --repo-type dataset --local-dir ./captcha_dataEach task lives under captcha/env/<task>/data/, containing the task images plus a data.jsonl file with one sample per line. For example, a dice_count sample records the image, the answer, and the interactive regions:
{"image": "image.png", "ground_truth": 77, "input_box": [30, 690, 610, 755], "submit_box": [700, 665, 925, 750]}Place the downloaded files into the matching captcha/env/<task>/data/ directories. A small set of demo samples ships with the repo so you can smoke-test the harness before downloading the full set.
The released checkpoint is AIDC-AI/CaptchaMind-7B (built on Qwen2.5-VL-7B). The harness talks to the model over an OpenAI-compatible chat-completions endpoint, so you serve the weights yourself and point the harness at them.
1. Serve the model with vLLM (recommended):
pip install vllm
vllm serve AIDC-AI/CaptchaMind-7B \
--served-model-name CaptchaMind-7B \
--port 8000 \
--limit-mm-per-prompt image=8This exposes http://localhost:8000/v1/chat/completions.
2. Point the harness at your endpoint. Edit captcha/utils/model_utils.py:
AIDC_URL = "http://localhost:8000/v1/chat/completions" # your vLLM endpoint
AIDC_TOKEN = "EMPTY" # any non-empty string for local vLLM
AIDC_MODEL_MAPPING = {
...
"qwen-7b-sft": "CaptchaMind-7B", # map the harness alias to your served-model-name
}The harness sends standard OpenAI chat messages (text + base64 images) and reads choices[0].message.content and usage. Any server implementing that contract (vLLM, SGLang, or a hosted API) will work.
The entry point is captcha/run.py, which provides one batch-evaluation function per task.
1. Pick the model alias at the top of run.py:
MODEL_NAME = "qwen-7b-sft" # must match a key in AIDC_MODEL_MAPPING
MAX_SAMPLES = 50 # set to None to evaluate all test samples2. Select the task to run inside main() (uncomment the one you want):
def main():
# test_connect_icon_batch()
# test_dart_count_batch()
# test_click_order_batch()
test_coordinates_batch()
# test_dice_count_batch()
# test_rotation_match()
# test_slide_puzzle_batch()
# test_image_recognition_batch()
# test_patch_select_batch()| Task | Function |
|---|---|
connect_icon |
test_connect_icon_batch() |
coordinates |
test_coordinates_batch() |
dart_count |
test_dart_count_batch() |
dice_count |
test_dice_count_batch() |
rotation_match |
test_rotation_match() |
click_order |
test_click_order_batch() |
image_select |
test_image_recognition_batch() / test_patch_select_batch() |
slide_puzzle |
test_slide_puzzle_batch() |
3. Run it:
cd captcha
python run.pyEach run prints a running success rate and writes a detailed JSON report (success rate, average reward, token usage, per-sample results) to captcha/test_results/<task>_<timestamp>.json.
Task success rate (%) on the CaptchaBench test set (200 samples per task).
| Method | Avg. | connect_icon | coordinates | dart_count | dice_count | rotation_match | image_select | click_order | slide_puzzle |
|---|---|---|---|---|---|---|---|---|---|
| Closed-source VLMs | |||||||||
| Claude-4-Sonnet | 47.4 | 7.5 | 93.5 | 68.5 | 51.0 | 20.0 | 43.0 | 71.5 | 24.5 |
| Claude-3.7-Sonnet | 43.9 | 19.0 | 83.5 | 51.0 | 75.5 | 10.0 | 38.5 | 30.0 | 44.0 |
| Gemini-3-Pro | 45.8 | 47.5 | 62.0 | 88.5 | 63.0 | 50.0 | 52.0 | 0.0 | 3.0 |
| OpenAI-o3 | 46.9 | 37.0 | 69.0 | 51.0 | 72.0 | 4.0 | 46.0 | 62.0 | 34.0 |
| GPT-5 | 33.6 | 15.0 | 35.5 | 37.0 | 42.0 | 3.5 | 32.5 | 67.0 | 36.0 |
| Qwen-VL-Max | 38.6 | 14.5 | 55.5 | 40.0 | 42.5 | 5.5 | 39.5 | 72.5 | 39.0 |
| Agentic methods | |||||||||
| Oedipus | 51.9 | 35.0 | 68.0 | 76.0 | 44.5 | 31.5 | 53.0 | 68.5 | 38.5 |
| Halligan | 54.7 | 34.0 | 76.5 | 69.5 | 43.0 | 36.0 | 66.0 | 71.0 | 41.5 |
| GUI agent | |||||||||
| GUI_R1_7b | 6.5 | 0.0 | 13.5 | 14.0 | 0.5 | 3.5 | 18.0 | 2.0 | 0.0 |
| Training-based | |||||||||
| SFT-only | 68.1 | 49.0 | 79.5 | 77.0 | 76.0 | 88.0 | 14.0 | 70.5 | 91.0 |
| CaptchaMind (Ours) | 82.9 | 71.0 | 91.0 | 93.0 | 72.5 | 86.0 | 71.0 | 87.5 | 91.5 |
On real-world live CAPTCHAs (GeeTest, hCaptcha), CaptchaMind achieves 71.0% success, demonstrating effective sim-to-real generalization.
- CaptchaBench environments & evaluation harness
- CaptchaMind-7B model weights (Hugging Face)
- CaptchaBench dataset (Hugging Face)
- SFT training code (behavior cloning + CoT)
- RL training code (GRPO with process-level reward)
- Data generation pipeline
- Real-world evaluation set & scripts
Contributions and issues are welcome.
If you find CaptchaMind or CaptchaBench useful, please cite:
@misc{wang2026captchamindtrainingcaptchasolvers,
title={CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision},
author={Pengcheng Wang and Haoxiang Liu and Yang Dai and Xiangxiang Zeng and Guanhua Chen and Baotian Hu and Longyue Wang and Weihua Luo},
year={2026},
eprint={2605.19538},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.19538},
}This project is released under the MIT License.
CaptchaMind is built on Qwen2.5-VL, and our RL training is based on the verl framework (with our own modifications) using the GRPO algorithm. Our task taxonomy is inspired by OpenCaptchaWorld (Luo et al., 2025). We thank the open-source community for the tooling that made this work possible.