VERA - Verification-Enabled Research Agent

The dashboard is the main debugging surface: live metrics, last tool call, verifier score, event log, notes, and ledgers are visible from the same session view.

Local research agent with Docker sandbox tools, evidence logs, and test-time verification loops.

A proof-of-concept “Verification-Enabled Research Agent” (VERA) that runs locally, with full root permissions in a Linux Docker sandbox, can browse the public internet, read/write mounted files, run Linux commands + Python, and uses a test-time verification loop to reduce hallucinations and enforce evidence-grounded outputs.

When an LLM is given real I/O (files, network, shell) and real tools, it can exhibit emergent problem-solving behaviors. In one experiment, a small 30B coding-optimized model independently converged on a strict numerical error tolerance while searching atomic mass data, installed required libraries at runtime, and produced chemical structure diagrams - without any explicit human guidance.

This repo is explicitly motivated by two complementary research threads:

LLM-in-Sandbox: giving an LLM a real “virtual computer” (terminal + files + internet) can elicit general agentic capabilities without additional training, and can reduce long-context token costs by offloading context to files. See [1].
Inference-time scaling of verification (DeepVerifier): correctness can often be improved by iteratively verifying and repairing outputs using rubric-guided, decomposed checks rather than “one-shot” answers. See [2].

Note: citations are included as a References section below; the two papers above are the main “science behind” this project.

Why This Exists

Agentic systems fail in predictable ways:

Wrong sources / low-quality sources, especially on legal/technical claims.
Mis-extraction: wrong number, wrong section, wrong quote.
Tool misuse: a command fails but the agent proceeds anyway.
Overconfident synthesis: conclusions not supported by evidence.
Long-horizon drift: continued synthesis without epistemic progress.

This project exists to make a local agent behave more like an auditable system:

every tool action is logged,
claims are forced to carry “evidence hooks” (URLs + snippets, or file paths + commands),
a verifier loop checks risky claims and provides targeted corrective instructions.

What You Get

A Docker sandbox with /input (read-only) + /work (read-write).
A minimal tool protocol: model outputs a single-line JSON tool call.
A DeepVerifier-style verifier loop (decompose -> verify -> judge) with stop-early logic and a configurable retry budget.
Live “digging” monitoring via a local dashboard:
- SSE event stream from trace.jsonl
- Prometheus-style /metrics
- Session picker + “New session” + “Start run” UI
A non-terminal epistemic state model: missing evidence does not cause failure; tasks remain UNRESOLVED until new evidence is produced or search is exhausted.

Repository Layout

run.py – CLI entrypoint (build, run, dashboard)
agent/
- loop.py – main agent loop + trace logging + verifier integration
- tools.py – Docker sandbox + shell tool (shell-only interface)
- verifier.py – DeepVerifier-style verifier modules in one file
- model_client.py – OpenAI-compatible /chat/completions client (+ latency & usage capture)
assets/
- docker/Dockerfile – sandbox image
- system_prompt.en.txt – system prompt for sandbox interaction
dashboard/
- server.py – local dashboard UI + /events + /metrics

Architecture

User Task
   |
   v
Agent Model  --(shell JSON)-->  Sandbox (Docker)
   ^                                 |
   |                                 v
Verifier (decompose/check/judge)  trace.jsonl + notes.md
   |                                 |
   +---------- feedback -------------+
                     |
                     v
        Dashboard (/events, /metrics)

Runtime Flow (Current Control Logic)

User task
  |
  v
Context builder
(System + PRIMARY TASK + pinned notes + action tail)
  |
  v
Model output
THOUGHT + ACTION + EVIDENCE_USED + STATUS_UPDATE
  |
  v
Policy gates (structural)
- Notes append-only
- Query mutation / vector shift
- Domain shift (negative-claim)
- Brave rate limit / circuit breaker
- Source budget / stagnation controls
  |
  v
Tool execution
- shell (curl/wget/python)
- brave_search / brave_news (if key set)
  |
  v
Evidence ledger + notes append
evidence.jsonl + notes.md + trace.jsonl
  |
  v
Verifier (clean-room auditor)
- scores + constraints
- updates epistemic state
  |
  v
Termination
- VERIFIED
- or UNRESOLVED(reason) with constraints + evidence summary

Design Decisions (and Why)

1) “Linux box” interface with file mounts

We follow the “LLM-in-Sandbox” idea: a computer is a universal substrate (files + shell + internet) and can generalize beyond coding when the model is encouraged to explore and use the environment. See [1].

Implementation:

host mounts --work-dir -> /work (rw)
optionally mount --input-dir -> /input (ro)

2) Runtime tool acquisition (venv-first, plus OS packages if needed)

We want “install at runtime” behavior. We bootstrap /work/.venv and put it on PATH so pip install ... works in an isolated, writable environment under the mounted work directory. In lab mode the container runs privileged, so OS-level installs (e.g., apt-get) are also possible.

2.1) Lab-mode privilege (maximum freedom)

The sandbox currently runs privileged to enable unrestricted experimentation (including OS package installs and low-level network changes). This is intentional for the “emergent behavior” lab setting, but it removes most isolation guardrails. Use only on trusted, local machines.

3) Persistent-ish shell ergonomics

LLM-in-Sandbox emphasizes a terminal session where state persists (cwd, environment). See [1].

We simulate the most important parts:

cd ... persists across shell calls
simple export KEY=VALUE persists across shell calls

(The underlying container exec is still per-call; this is a pragmatic approximation.)

4) Verification scaling (decompose -> verify -> judge)

Instead of a single “judge the entire answer” prompt, we decompose verification into a few yes/no checks and verify those with tools (asymmetry of verification). This follows the DeepVerifier direction: verification is more reliable when broken into targeted, source-bound questions. See [2].

The verifier:

proposes ≤3 checks
runs a tiny tool-using loop per check to gather evidence
returns a score (1–4) and ≤3 concrete corrective instructions
stops early when score ≥ 3; caps verification rounds to avoid diminishing returns

5) SCOUT gating (Scope -> Candidates -> Outcomes)

Some tasks are structurally easy to answer overconfidently (especially negative claims like “none / no one / never”). To reduce this failure mode, the verifier applies SCOUT-style gating:

Scope: make the scope explicit (what entities count, time window, success criteria).
Candidates: if the question implies a complete set, enumerate the candidate set from a cited source before concluding.
Outcomes: verify the predicate for the candidates (or use a source that asserts it collectively).

The verifier caps scores (≤2) when any load-bearing check is unknown, when it cannot establish coverage for a negative claim, or when it cannot produce two independent citations (distinct domains).

6) Observability-first

Long-horizon agents are hard to debug without visibility. We log a structured event stream (trace.jsonl) and expose:

/events (SSE) for live dashboards
/metrics (Prometheus text format) for graphs and alerting

We also log internal verifier activity (decomposition/judge model calls, per-check tool calls, and verifier->agent feedback injection), so you can audit the end-to-end “digging” process. Additional observability hooks:

Raw model request/response snapshots are captured as model_io events and shown in the dashboard (Model I/O panel).
Per-session container logs and Docker event stream are written to /work/container.log and /work/container_events.log and shown in the dashboard (Container Logs panel).

7) Web access is “just the shell”

There is no special web-browsing tool. If the agent needs the internet, it must use standard CLI tooling from the sandbox (typically curl/wget). This makes the environment feel like a real Linux box and keeps all network actions auditable as shell commands in trace.jsonl.

Requirements

Python 3.11+
Docker (Docker Desktop / Docker Engine) running locally
A model server exposing an OpenAI-compatible endpoint:
- POST {BASE_URL}/v1/chat/completions (if you pass http://127.0.0.1:1234, this project auto-normalizes to /v1)

Python deps:

see requirements.txt

Quickstart

Install Python deps:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Build sandbox image:

python3 run.py build

Run the dashboard (in one terminal):

python3 run.py dashboard --base-dir . --host 127.0.0.1 --port 8844

Run a task (in another terminal):

python3 run.py run \
  --task "Find the official source for X and quote it." \
  --work-dir ./work/example \
  --input-dir ./input \
  --model-base-url http://127.0.0.1:1234

Open the dashboard:

http://127.0.0.1:8844/?work_dir=./work/example

Artifacts produced in --work-dir:

notes.md – human-readable notes
evidence.jsonl – durable tool-output evidence ledger (one JSON per tool call)
move_ledger.jsonl – structured search-move ledger (move typing + outcomes)
query_ledger.jsonl – query mutation ledger (normalized query families + outcomes)
trace.jsonl – structured event log (tools, model calls, verifier decisions)
container.log – sandbox container stdout/stderr (per session)
container_events.log – Docker events for the sandbox container (per session)
run.log – stdout/stderr of the agent process when started from the dashboard
run.pid – PID of the agent process when started from the dashboard
session.log – per-session dashboard control log (start_run/new_session events)

Dashboard sessions

Each work_dir is a “session”. In the dashboard you can:

open an existing session (from ./work/*)
create a new session (auto-creates ./work/ui-run-...)
start a run once per session (prompt locks after start)

First Test Run (Scripts)

Start the “backend/observability stack” (dashboard + Docker sandbox sanity check):

bash start_stack.sh

The script prompts for a step budget:

1 = unlimited (MAX_STEPS=0)
2 = set an integer limit

Stop the dashboard:

bash stop_stack.sh

Metrics You’ll Likely Watch

Decision score tracking:
- dra_verifier_scores_total{score="1|2|3|4"}
- dra_verifier_last_score
Verification cost:
- dra_verifier_duration_seconds_sum / _count
- dra_verifier_model_tokens_total
- dra_model_tokens_total{scope="agent"}
Instruction drift / concreteness proxies:
- dra_verifier_instruction_chars_sum
- dra_verifier_instruction_has_url_total
- dra_verifier_instruction_has_path_total
- dra_verifier_instruction_has_cmd_total
Diagnosis / paralysis indicators:
- dra_verifier_before_tools_total
- dra_model_finish_reason_length_total
- dra_policy_pre_tool_nudge_total
- dra_policy_length_nudge_total
- dra_policy_reminder_total
- dra_policy_choice_total
- dra_policy_choice_matched_total
- dra_policy_stagnation_total
- dra_policy_query_vector_total
- dra_policy_domain_shift_total
- dra_policy_conclusion_ready_total
- dra_policy_source_budget_total
- dra_policy_brave_budget_total
- dra_policy_brave_circuit_total
- dra_verifier_gradient_total

Safety Notes (POC)

/input is mounted read-only; /work is writable.
The container currently runs as root (lab-mode / exploration). This enables apt-get and broader tool acquisition, but it can also create root-owned files under your mounted --work-dir.
Shell commands are not allowlisted (to preserve exploration), but obvious destructive patterns are blocked (see agent/config.py).

This is a research prototype: review deny patterns and add policy/allowlists before using on untrusted inputs.

Important behavior notes

Tool execution still requires a parseable JSON tool call, but the parser now accepts either {"tool":"shell","args":{"cmd":"..."}} or {"tool":"shell","command":"..."}, and can recover JSON from fenced blocks.
Optional Brave Search tool: the agent can call brave_search / brave_news directly (no shell) when BRAVE_API_KEY is set. Example: {"tool":"brave_search","args":{"q":"windows 12","count":12,"freshness":"month","country":"us","search_lang":"en"}}.
Brave discovery-first (when enabled): for web-facing tasks, the agent must do one Brave search (brave_search/brave_news) before any curl/wget. This seeds a URL set, then the agent is expected to fetch those URLs with curl/wget for evidence.
Brave URL fallback: if a Brave-provided URL returns empty/low-signal content via curl/wget, the agent should assume the URL is valid and try client-side rendering (e.g., Playwright) before abandoning the source.
If a model talks about using tools but does not call them, you will see model/verifier events but few tool events in trace.jsonl.
A lightweight policy layer nudges the agent before the first tool call and when the model hits finish_reason="length" repeatedly; these show up as policy_* metrics and events.
Context is now deterministic: each step reassembles the prompt from System + PRIMARY TASK + pinned notes.md + a short action tail (no FIFO clipping of the middle).
- Context assembly (fixed order):
```
[System Prompt]
     |
[PRIMARY TASK]
     |
[PINNED notes.md]
     |
[Action Tail]
```
Every N steps (default 3), the agent is required to update notes.md before it can run further tools (see NOTES_UPDATE_INTERVAL in loop.py).
Notes are append-only; overwrite/delete attempts are blocked.
- Notes gate (append-only):
```
step % N == 0
    |
must write notes.md
    |
tool call allowed
```
Output format is strict: THOUGHT line + single JSON Action line + EVIDENCE_USED + STATUS_UPDATE. Missing required lines results in a format error and no tool execution.
Tool outputs are normalized into evidence.jsonl; evidence IDs (ev_0001…) are appended to notes.md and must be cited via EVIDENCE_USED.
Search intuition scaffolding: tool calls are typed into move_ledger.jsonl and query families recorded in query_ledger.jsonl; repeated query families trigger a query-mutation requirement before retrying.
- Search pressure (query + move typing):
```
same query family (xN)
    |
require mutation
    |
new query family
```
The verifier runs in a clean-room “Auditor” persona and does not reuse the worker’s chat history (only task + notes + evidence).
Stagnation detector: if UNRESOLVED repeats without new evidence for STAGNATION_LIMIT turns, the agent is forced to run a tool; repeated failure types prompt escalation (FAILURE_ESCALATION_LIMIT).
Query-vector gating (negative-claim tasks): repeated intent class before QUERY_VECTOR_MIN vectors are used forces a switch in search intent (policy_query_vector).
Source budget: after SOURCE_BUDGET distinct non-search domains are checked without verification, the run exits as UNRESOLVED with a source-exhaustion summary (policy_source_budget).
Negative-claim protocol (e.g., “has X launched yet?”):
- Minimum source coverage is enforced before any “no official announcement found” conclusion:
  - ≥2 official domains (vendor-owned)
  - ≥1 independent domain (non‑vendor)
- Domain‑shift guard prevents hammering the same domain when minimums aren’t met (policy_domain_shift).
- Once the negative‑claim budget is exhausted and source minimums are satisfied, the run transitions to UNRESOLVED(reason) with a concrete “no official announcement found in sources checked” summary (policy_conclusion_ready).
- Negative‑claim constraints are injected as OPEN CONSTRAINTS so the model treats them as hard requirements (no explicit denial required).
To prevent “final deliverables” tool-call loops, the runner stops after repeated finalization-style file writes (see policy_finalization_stop in trace.jsonl).
Defaults were relaxed for exploration: --max-steps now defaults to 80 and tool timeouts are longer (MAX_TOOL_SECONDS=900). You can set unlimited steps with --max-steps 0 or MAX_STEPS=0.
Model request timeout defaults to 150s and can be overridden with MODEL_TIMEOUT env var.
Tooling caches are routed into /work/.cache (pip/npm/playwright) to make runtime installs more reliable across steps.
Brave API token: set BRAVE_API_KEY (or pass --brave-api-key to run.py). Optional BRAVE_API_VERSION sets the API version header. Use brave_news for news-only queries. Defaults: BRAVE_DEFAULT_COUNT=12. Discovery-first policy can be toggled with BRAVE_DISCOVERY_REQUIRED=0.
Brave Search API throttling (optional): BRAVE_MIN_INTERVAL, BRAVE_BACKOFF_MAX, BRAVE_COOLDOWN_S, BRAVE_MAX_CONSEC_429, and BRAVE_MAX_CALLS enforce rate limits, exponential backoff, and a circuit breaker.

Roadmap (Likely Next)

Evidence enforcement: reject verifier answers that lack URL/path+command evidence.
File-based output contract (/work/output/... + explicit submit) and stricter formatting.
Better “persistent session” (true interactive shell or tmux-like session).
Optional indexing (SQLite FTS5) for large GDPR exports and deterministic excerpt citations.

References

[1] Daixuan Cheng et al. (2026). LLM-in-Sandbox Elicits General Agentic Intelligence. arXiv:2601.16206. Paper page: https://arxiv.org/abs/2601.16206

[2] Yuxuan Wan et al. (2026). Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification. arXiv:2601.15808. Repo: https://github.com/yxwan123/DeepVerifier

[3] Shunyu Yao et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629. Paper page: https://arxiv.org/abs/2210.03629.

[4] Charlie Snell et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314. Paper page: https://arxiv.org/abs/2408.03314

Maintained by Damyan Deshev - local-first software, deterministic data paths, retrieval, evaluation, and practical product systems.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VERA - Verification-Enabled Research Agent

Why This Exists

What You Get

Repository Layout

Architecture

Runtime Flow (Current Control Logic)

Design Decisions (and Why)

1) “Linux box” interface with file mounts

2) Runtime tool acquisition (venv-first, plus OS packages if needed)

2.1) Lab-mode privilege (maximum freedom)

3) Persistent-ish shell ergonomics

4) Verification scaling (decompose -> verify -> judge)

5) SCOUT gating (Scope -> Candidates -> Outcomes)

6) Observability-first

7) Web access is “just the shell”

Requirements

Quickstart

Dashboard sessions

First Test Run (Scripts)

Metrics You’ll Likely Watch

Safety Notes (POC)

Important behavior notes

Roadmap (Likely Next)

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
agent		agent
assets		assets
dashboard		dashboard
docs/screenshots		docs/screenshots
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
run.py		run.py
start_stack.sh		start_stack.sh
start_stack_iquest.sh		start_stack_iquest.sh
stop_stack.sh		stop_stack.sh

Folders and files

Latest commit

History

Repository files navigation

VERA - Verification-Enabled Research Agent

Why This Exists

What You Get

Repository Layout

Architecture

Runtime Flow (Current Control Logic)

Design Decisions (and Why)

1) “Linux box” interface with file mounts

2) Runtime tool acquisition (venv-first, plus OS packages if needed)

2.1) Lab-mode privilege (maximum freedom)

3) Persistent-ish shell ergonomics

4) Verification scaling (decompose -> verify -> judge)

5) SCOUT gating (Scope -> Candidates -> Outcomes)

6) Observability-first

7) Web access is “just the shell”

Requirements

Quickstart

Dashboard sessions

First Test Run (Scripts)

Metrics You’ll Likely Watch

Safety Notes (POC)

Important behavior notes

Roadmap (Likely Next)

References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages