sola-st/software-analysis-agent

Analysis Agent

An LLM-based agent that autonomously applies software analysis tools to target projects. Given an analysis tool (e.g. Clang, Infer, KLEE) and a target repository, the agent sets up the environment, installs the tool, prepares the project, and drives the analysis to completion, retrying and learning across attempts.

Installation

python -m venv .venv && source .venv/bin/activate
pip install -e .                 # from source

This installs the analysis-agent and analysis-agent-experiments console scripts. The browser-based replay viewer needs the optional web extra:

pip install -e ".[web]"          # adds fastapi + uvicorn for replay_web

Requires Python ≥ 3.10 and (optionally) Docker for isolated execution.

Quick Start

# Set your LLM API key
export OPENAI_API_KEY="sk-..."   # or ANTHROPIC_API_KEY, etc.

# Run on a single tool/target pair
analysis-agent \
  --tool-name   "clang" \
  --tool-url    "https://github.com/llvm/llvm-project" \
  --target-name "curl" \
  --target-url  "https://github.com/curl/curl"

# Interactive wizard
analysis-agent --interactive

The console scripts are thin wrappers; the equivalent module forms (python -m analysis_agent.main ...) work identically.

Batch and Parallel Execution

# Batch from a JSONL file
python -m analysis_agent.main --instances-file instances.jsonl

# Parallel execution
python -m analysis_agent.run_parallel \
  --instances-file instances.jsonl \
  --workers 4

Instance file format (JSONL):

{"tool_name": "clang", "tool_url": "...", "target_name": "curl", "target_url": "..."}
{"tool_name": "infer", "tool_url": "...", "target_name": "checkstyle", "target_url": "..."}
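Before launching a batch, it can help to check that every line of the instance file carries the four keys shown above. The following is a minimal sketch, not part of the project: `parse_instances` and `REQUIRED_KEYS` are hypothetical names, and the agent's own loader may accept additional fields.

```python
import json

# The four keys shown in the instance file format above.
REQUIRED_KEYS = ("tool_name", "tool_url", "target_name", "target_url")


def parse_instances(text):
    """Parse JSONL text into a list of dicts, checking the required keys."""
    instances = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing keys {missing}")
        instances.append(record)
    return instances


# One record, reusing the tool/target pair from the Quick Start.
sample = json.dumps({
    "tool_name": "clang",
    "tool_url": "https://github.com/llvm/llvm-project",
    "target_name": "curl",
    "target_url": "https://github.com/curl/curl",
})

if __name__ == "__main__":
    for instance in parse_instances(sample):
        print(instance["tool_name"], "->", instance["target_name"])
```

To validate a file on disk, pass the file's text to the function, for example `parse_instances(open("instances.jsonl").read())`.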

Experiment Manager

A queue-based runner for managing multiple experiments with metrics tracking:

analysis-agent-experiments add exp.yaml      # queue an experiment from a YAML config
analysis-agent-experiments list              # list queued/running/completed experiments
analysis-agent-experiments start             # process the queue
analysis-agent-experiments status            # queue + per-experiment status
analysis-agent-experiments metrics <id>      # success rates, timing, and LLM cost

State is persisted under .experiment_manager/ in the working directory.

Large-Scale Orchestrated Runs

For large benchmark experiments, the recommended approach is a custom orchestrator with live monitoring and an emergency kill switch. Key design principles:

  • Global concurrency cap + per-tool sub-limit for disk-hungry tools (fuzzers, call-graph analyzers)
  • Resource-based launch pause (RAM/swap/disk thresholds)
  • Background thread monitoring Docker container writable-layer size (docker inspect --size)
  • Per-instance hard timeout; status file for monitor integration

See docs/experiment_management_guide.md for the full design and reference implementation.
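As a rough illustration of the first two principles (a global concurrency cap with a per-tool sub-limit, and a resource-based launch pause), here is a minimal sketch using only the Python standard library. It is not the reference implementation from the guide: the limits, the `AFLplusplus` sub-limit, and the names `run_instance`, `run_all`, and `disk_ok` are all assumptions made for this example, and it checks only free disk space, not RAM or swap.

```python
import shutil
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# All limits below are illustrative assumptions, not values from the project.
GLOBAL_CAP = 4                      # maximum instances running at once
PER_TOOL_CAP = {"AFLplusplus": 1}   # tighter cap for a disk-hungry tool
MIN_FREE_DISK_FRACTION = 0.10       # pause launches below 10% free disk

global_slots = threading.BoundedSemaphore(GLOBAL_CAP)
tool_slots = {tool: threading.BoundedSemaphore(n) for tool, n in PER_TOOL_CAP.items()}


def disk_ok(path="."):
    """Return True while the free-disk fraction stays above the threshold."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= MIN_FREE_DISK_FRACTION


def run_instance(instance, job, resources_ok=disk_ok, poll_seconds=5.0):
    """Run job(instance) once resources allow and both caps have a free slot."""
    # Resource-based launch pause: hold off new launches while resources are low.
    while not resources_ok():
        time.sleep(poll_seconds)
    tool_sem = tool_slots.get(instance["tool_name"])
    # Take the per-tool slot first so a queued disk-hungry job does not
    # occupy a global slot while it waits.
    if tool_sem is not None:
        tool_sem.acquire()
    try:
        with global_slots:
            return job(instance)
    finally:
        if tool_sem is not None:
            tool_sem.release()


def run_all(instances, job, **kwargs):
    """Submit every instance; the semaphores, not the pool size, bound concurrency."""
    with ThreadPoolExecutor(max_workers=max(1, len(instances))) as pool:
        futures = [pool.submit(run_instance, inst, job, **kwargs) for inst in instances]
        return [future.result() for future in futures]
```

Here `job` stands in for whatever launches one agent run (for example a subprocess call with its own timeout). Per-instance hard timeouts, the status file, and container size monitoring are left out of this sketch.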

Evaluated Tools and Targets

AnalysisAgent is tool- and project-agnostic — it works with any analysis tool and any target repository; you only provide their names and URLs. The combinations below are simply those used in our evaluation (the included instances.json covers 35 tool/target pairs):

| Category | Tools | Targets |
| --- | --- | --- |
| C/C++ analysis | AFLplusplus, Clang, cflow, KLEE | curl, fastfetch, ImageMagick, masscan, radare2 |
| Java analysis | Infer, jvm-tools, WALA | checkstyle, closure-compiler, jmh, saxon-he, tika |

The augmented_doc/ directory contains LLM-generated runbooks for every evaluated tool/target pair. New pairs can be added with generate_brief_doc.py.

Key Features

  • Multi-stage workflow: docker_setup → tool_setup → project_setup → analysis_run
  • Multi-attempt retry with cross-attempt learning
  • Exit attempt: on full failure, generates a recovery Dockerfile and replay script
  • Execution environments: local or Docker
  • LLM-agnostic: uses LiteLLM, so it works with OpenAI, Anthropic, DeepSeek, Gemini, and any compatible provider
  • Experiment manager: queue-based batch runner with metrics tracking
  • Replay web interface: browser-based viewer for inspecting and replaying past agent runs

Evaluation Results

We evaluate AnalysisAgent against three baselines — RAG-Agent, Mini-SWE-Agent, and ExecutionAgent — across 35 benchmark tasks and 4 LLM backends, with 3 repetitions per configuration (560 total runs). All agents share per-task limits of 120 cycles, $2 API cost, and 5 h wall-clock time.

Headline Numbers

| Agent | Avg. Verified Success |
| --- | --- |
| RAG-Agent | 10% |
| Mini-SWE-Agent | 37% |
| ExecutionAgent | 57% |
| AnalysisAgent | 79% |

AnalysisAgent achieves 94% verified success with both Gemini-3-Flash and DeepSeek-V3.2 (33/35 tasks), demonstrating that purpose-built scaffolding reliably handles multi-step tool installation, project building, and analysis evidence production.

Success Rates by Agent and LLM Backend

| Agent | GPT-5-nano | GPT-5-mini | DeepSeek-V3.2 | Gemini-3-Flash |
| --- | --- | --- | --- | --- |
| RAG-Agent | 9% | 6% | 3% | 23% |
| Mini-SWE-Agent | 9% | 20% | 57% | 63% |
| ExecutionAgent | 40% | 54% | 57% | 77% |
| AnalysisAgent | 54% | 75% | 94% | 94% |

Verified success rates come from manual review. Self-validated rates (mean ± std over n = 3 runs) are available in the paper.
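The headline averages are consistent with an unweighted mean of the four per-backend rates. The short check below assumes that this is how they were computed (the README does not say so explicitly) and uses the rounded percentages from the per-backend table.

```python
# Per-backend verified success rates (%), in the order
# GPT-5-nano, GPT-5-mini, DeepSeek-V3.2, Gemini-3-Flash.
rates = {
    "RAG-Agent": [9, 6, 3, 23],
    "Mini-SWE-Agent": [9, 20, 57, 63],
    "ExecutionAgent": [40, 54, 57, 77],
    "AnalysisAgent": [54, 75, 94, 94],
}

# Unweighted mean across the four backends (an assumption about the method).
averages = {agent: sum(values) / len(values) for agent, values in rates.items()}

if __name__ == "__main__":
    for agent, avg in averages.items():
        print(f"{agent}: {avg:.0f}%")
    # RAG-Agent: 10%
    # Mini-SWE-Agent: 37%
    # ExecutionAgent: 57%
    # AnalysisAgent: 79%
```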

Statistical Significance

AnalysisAgent's advantage is large, consistent across all LLM backends, and statistically significant (Fisher's exact test, Holm-Bonferroni corrected):

| Comparison | Odds Ratio (95% CI) | Cohen's h | p_adj |
| --- | --- | --- | --- |
| vs. RAG-Agent | 34.5 [17.3, 68.5] | 1.55 | < 0.001 |
| vs. Mini-SWE-Agent | 8.1 [4.0, 16.2] | 0.92 | < 0.001 |
| vs. ExecutionAgent | 2.7 [1.6, 4.6] | 0.45 | < 0.001 |

The Cochran-Mantel-Haenszel test confirms the advantage holds across all LLM backends (p < 10⁻⁴ for all comparisons).
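For readers unfamiliar with the effect sizes in the table, the sketch below shows the standard definitions of Cohen's h and the sample odds ratio for two proportions. It takes the rounded headline success rates as inputs, which is an assumption: the reported values were presumably computed from raw success counts that this README does not list, so the results here only approximate the table.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def odds_ratio(p1, p2):
    """Sample odds ratio of success between two proportions."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))


if __name__ == "__main__":
    analysis_agent = 0.79  # rounded headline rate, not the raw counts
    baselines = {"RAG-Agent": 0.10, "Mini-SWE-Agent": 0.37, "ExecutionAgent": 0.57}
    for name, rate in baselines.items():
        h = cohens_h(analysis_agent, rate)
        ratio = odds_ratio(analysis_agent, rate)
        print(f"vs. {name}: h = {h:.2f}, odds ratio = {ratio:.1f}")
```

The significance tests themselves (Fisher's exact test with Holm-Bonferroni correction, and Cochran-Mantel-Haenszel) need the per-run success counts and are not reproduced here.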

Tool and Ecosystem Patterns

| Tool | Avg. Success | Ecosystem |
| --- | --- | --- |
| cflow | 64% | C/C++ |
| CSA (Clang Static Analyzer) | 55% | C/C++ |
| AFL++ | 52% | C/C++ |
| KLEE | 40% | C/C++ |
| Infer | 35% | Java |
| SJK (jvm-tools) | 35% | Java |
| WALA | 31% | Java |
Java tasks account for 62% of all failures across agents, reflecting the complexity of Java toolchains (classpaths, bytecode generation, JVM attachment) and heavyweight whole-program analyses.

Efficiency Highlights

  • Stronger models are cheaper overall: weaker models (GPT-5-nano) require more cycles and wall-clock time despite lower per-token prices.
  • Failed runs are expensive: 2.8x more cycles, 4.1x longer duration, and 1.3x higher cost than successful runs.
  • Best throughput: Gemini-3-Flash achieves the lowest mean time per task (36 min) while tied for the highest success rate.

Failure Taxonomy

Analysis of 182 failing trajectories reveals distinct failure profiles per agent:

Failure Mode RAG-Agent Mini-SWE-Agent ExecutionAgent AnalysisAgent
Docker/Build Failure 60% 11% 20% 7%
Analysis Tool Misuse 7% 44% 23% 50%
Malformed LLM Output 20% 50%
Budget/Time Exhausted 19% 29%
Incorrect Analysis Result 11% 48%

AnalysisAgent has largely solved environment setup (only 7% Docker/build failures) and primarily fails during tool invocation — a qualitatively harder problem that represents the next frontier.


Configuration

Key environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| EXEC_AGENT_MODEL | gpt-5-nano | LLM model to use |
| EXEC_AGENT_IMAGE | (unset) | Docker image; if unset, the agent runs locally |
| EXEC_AGENT_MODE | auto | auto (runs continuously) or step (pauses after each cycle) |
| EXEC_AGENT_LLM_USAGE_JSONL | (unset) | Path to log per-call LLM usage |

See docs/AGENT_LAUNCH_GUIDE.md for the full CLI reference.
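A wrapper script that launches the agent might resolve these variables as sketched below. This only illustrates the defaults listed above; `AgentSettings` and `load_settings` are hypothetical names, and the agent's own configuration code may behave differently (for instance in how it treats an empty string).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AgentSettings:
    model: str
    image: Optional[str]      # None means local execution
    mode: str                 # "auto" or "step"
    usage_log: Optional[str]  # None means no per-call usage log

    @property
    def use_docker(self) -> bool:
        return self.image is not None


def load_settings(env: Mapping[str, str] = os.environ) -> AgentSettings:
    """Read the documented variables, falling back to the documented defaults."""
    mode = env.get("EXEC_AGENT_MODE", "auto")
    if mode not in ("auto", "step"):
        raise ValueError(f"EXEC_AGENT_MODE must be 'auto' or 'step', got {mode!r}")
    return AgentSettings(
        model=env.get("EXEC_AGENT_MODEL", "gpt-5-nano"),
        # Assumption: an empty value is treated the same as unset.
        image=env.get("EXEC_AGENT_IMAGE") or None,
        mode=mode,
        usage_log=env.get("EXEC_AGENT_LLM_USAGE_JSONL") or None,
    )


if __name__ == "__main__":
    settings = load_settings()
    where = "Docker" if settings.use_docker else "local"
    print(f"model={settings.model} mode={settings.mode} execution={where}")
```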

Replay Web Interface

Inspect and replay past agent runs in a browser:

python -m analysis_agent.replay_web --port 8080
# Then open http://localhost:8080

Options: --host, --port, --logs-dir /path/to/logs.

Running Tests

pip install -e ".[dev]"

# Project test suite (one file per module under tests/analysis_agent/)
pytest tests/analysis_agent -m "not slow"          # fast, no Docker needed
pytest tests/analysis_agent -m "slow"              # Docker-backed tests (needs a daemon)

# Coverage
pytest tests/analysis_agent -m "not slow" --cov=analysis_agent --cov-report=term-missing

The suite mirrors the source layout (tests/analysis_agent/test_<module>.py, plus tests/analysis_agent/experiment_manager/). Tests marked xfail document known-open issues. tests/upstream/ holds the vendored mini-swe-agent tests and is not part of the project gate.

Citation

If you use this code in your research, please cite:

@article{TODO,
  title   = {TODO},
  author  = {TODO},
  year    = {TODO},
}

License

MIT
