nlink-jp/shell-agent-v2

shell-agent-v2

macOS local-first chat & agent tool with interactive data analysis.

Successor to shell-agent v0.7.x, redesigned with session-scoped analysis, an Idle/Busy agent execution model, and a hybrid LLM backend (Local + Vertex AI).

Features

  • Interactive data analysis — dialogue-driven exploration with embedded DuckDB. Every analysis tool (load_data, query_sql, describe_data, save_query, analyze_data, etc.) is exposed to the LLM every round so the model can plan multi-step workflows up front instead of discovering tools round-by-round. See agent-tool-visibility.md. Set tools.hide_analysis_tools_until_data_loaded: true in config.json to restore the pre-v0.1.21 hide-until-load behaviour (opt-in for weaker local backends).
  • Session-scoped analysis — each session owns its own database, no cross-session state leakage
  • Agent execution model — Idle/Busy states with UI lockout during processing
  • Hybrid LLM backend — Local LLM (LM Studio) and Vertex AI (Gemini), switchable at runtime via /model
  • Multi-profile LLM (v0.12.0) — define multiple named (Local, VertexAI) profiles for billing attribution / GCP project isolation. Each session references one profile (persisted in session.json). Edit profiles in Settings → LLM Profiles (live-apply on blur, no Save button); switch a session via the status-bar pill popover or /profile <name> chat command. See ADR-0016.
  • Per-backend context budgets — ContextBudget configured separately for Local and Vertex inside each profile (Settings → LLM Profiles → pick profile → expand the Local / Vertex AI sections).
  • Memory model (v0.2.0 rewrite) — four facilities work together. Records (immutable conversation history) live in chat.json. Session Memory auto-extracts fact / context per session. Findings are session-scoped data-analysis discoveries surfaced in a dedicated chat-pane panel. Global Memory holds preference / decision across sessions. Auto-extraction routes by category; "Pin to Global Memory" is the explicit user action that promotes a Session Memory entry or a Finding into the cross-session pool. Context-budget enforcement is non-destructive (internal/contextbuild summary cache). See memory-model.md.
  • Container sandbox (opt-in) — eight sandbox-* tools that execute shell or Python in a per-session podman/docker container with /work mounted from the session's data dir, MITL-gated, network-off by default. Includes sandbox_load_into_analysis (CSV/JSON in /work → DuckDB) and sandbox_export_sql (SQL query → CSV in /work) so query results flow between analysis and Python without round-tripping through chat. See sandbox-execution.md for the macOS setup guide.
  • Findings panel — chat-pane disclosure with severity filter, free-text search, bulk delete, real-time refresh, and a Pin-to-Global-Memory star button per row.
  • Shell script Tool Calling — register scripts as tools with MITL approval for write/execute. Per-tool @timeout: N header (seconds) overrides the 30-second default for legitimately long-running tools — see agent-tool-visibility.md and tool-execution-timeout.md. Scripts can write to $SHELL_AGENT_WORK_DIR (the same physical directory the sandbox bind-mounts at /work); use the built-in register_object tool to surface the artefact in chat as object:<ID> — see work-dir-shell-bridge.md.
  • MITL approval, end-to-end — every tool source (analysis / shell / sandbox / MCP) routes through one gate. Destructive analysis tools (load_data, reset_analysis, promote_finding) and SQL/analyze prompts are MITL-by-default; metadata reads (describe_data, list_tables, etc.) are not. Override per-tool from Settings → Tools — the toggle reflects the actual dispatcher default. See security-hardening-2.md.
  • Bundled scripts — file_info, preview_file, list_files, weather, get_location, write_note. Auto-installed on first launch via go:embed; user customizations are preserved. Optional shell-tool examples wrapping companion CLIs (not auto-installed — copy from examples/shell_tools/ into your tool dir): web-search (gem-search), generate-image (gem-image), search-kb-gem (gem-rag — Vertex AI Gemini RAG), search-kb-lite (lite-rag — local LLM RAG). KB examples require the corresponding CLI installed and a pre-indexed corpus.
  • Tool-call timeline — every tool start/end appears inline in the chat as a transient pill, in addition to the existing status-bar indicator. The pill is restored on session reload as a compact tool-name + status (success / error) bubble; live argument and result text remain ephemeral. See tool-event-restore.md.
  • Background task visibility — when the agent kicks off post-response work (title generation, memory extraction), a small badge appears in the input-status-bar naming what's running. The input field stays disabled until those tasks finish, so the next user message can't race them and lose extracted facts. See background-task-indicator.md.
  • MCP support — via mcp-guardian stdio proxy
  • Multimodal — image input via drag & drop, paste, or file picker
  • Per-session Data panel — collapsible disclosure at the top of the chat pane showing the current session's objects (images / reports / blobs as cards with thumbnails), DuckDB tables (click for a 20-row preview), and sandbox /work files. Click an image for the lightbox, a report for the markdown viewer, or a CSV / text blob for an in-app preview — CSV / TSV render as an HTML table, other text MIMEs (JSON, plain text, HTML, etc.) drop to a fixed-width pre. Bulk-select and delete with separate Yes / No confirmation.
  • Bulk select / delete — Findings, Global Memory, and Session Memory entries can be checked individually or all-at-once, with two-click confirm.
  • Private sessions (v0.3.0) — + New Private Chat in the sidebar bottom-nav opts a session out of cross-session Global Memory promotion. preference / decision facts are dropped at the extraction layer; Pin to Global Memory is hidden in the UI and rejected server-side. A 🔒 indicator appears on the sidebar row and as a chat-pane banner. The privacy flag is fixed at session creation and persisted in chat.json (omitempty keeps legacy sessions loading as non-private). See privacy-controls.md.
  • Log privacy controls (v0.3.0) — app.log defaults to info level so prompt / response / tool-argument bodies stop leaking to disk. Settings → Privacy → Log verbosity flips to debug for diagnosis. Audit log entries (session created/loaded/exported/imported/deleted) are content-free.
  • Session import / export (v0.4.0) — package a complete session (chat, session memory, findings, summaries, sandbox work/, analysis DuckDB, and every objstore object the session owns) into a single .shellagent ZIP bundle and re-import it on the same or a different machine. Per-row Export icon in the sidebar, Import Chat button in the bottom-nav, /export and /import slash commands. Privacy flag preserved across the round-trip; object IDs are always regenerated on import with bounded reference rewriting in chat.json and summaries.json. See session-import-export.md.
  • In-place tool progress (v0.4.1) — long-running tools (currently analyze_data) update a single chat-pane bubble in place via the tool_progress activity event, rather than spawning a fresh "running" pill per progress tick. The bubble matches by tool_call_id, so future parallel-tool work won't cross-contaminate. See tool-progress-events.md.
  • Session delete safeguards (v0.4.2) — the row's ✕ button arms a 6-second Confirm state (red-emphasis text matching the existing bulk-delete pattern) before the destructive call fires; while the delete is in flight the row greys with a ↻ Deleting… spinner. The agent state machine holds Busy for the duration so concurrent Send / Load / Export / Import return ErrBusy instead of racing the half-deleted session directory. See session-delete-ux.md.
  • Sandbox UID mapping fix (v0.4.3) — on podman the container is now started with --userns=keep-id:uid=1000,gid=1000 + --user 1000:1000 instead of --user $(id -u). Large host UIDs (e.g., the 200M+ values produced by Active-Directory / LDAP-mapped corporate macOS accounts) used to fall outside the rootless subuid range and crash crun with setresuid: Invalid argument at container start; the keep-id remap pulls them into a small in-namespace UID while preserving /work file ownership. Docker path is unchanged. See sandbox-uid-mapping.md.
  • analyze_data row cap fix (v0.4.4) — the sliding-window summarizer used to inherit the interactive 10,000-row chat-output cap and refused to start on tables larger than that, defeating the feature in the only regime where it was interesting. A new Engine.QuerySQLForAnalyze path (backed by MaxAnalyzeRows = 1_000_000) keeps the chat-output cap intact for query_sql / query_preview / quick_summary while letting analyze_data walk through tables that are large enough for the sliding window to actually matter. The new error message (only reachable past the much higher cap) suggests pre-aggregation rather than LIMIT, since LIMIT would silently truncate the analysis to the first N rows and defeat the feature. See analyze_data-row-cap.md.
  • Session rename persistence fix (v0.4.5) — bindings.RenameSession used to call memory.RenameSession directly, which read chat.json from disk, mutated the title, and wrote it back. The agent's in-memory a.session.Title was untouched, so any subsequent a.session.Save() (after a Send / tool / generateTitleIfNeeded) silently overwrote the rename with the stale title and on next launch the user saw the original name. The fix routes rename through a new agent.RenameSession method that updates the in-memory title under a.mu before the disk save, mirroring the v0.4.0+ pattern where every per-session-state operation goes through the agent layer (parallels Export / Import / Delete).
  • Markdown attachments (v0.5.0) — drag-drop / paste .md / .txt files into the chat input alongside images. Each attachment is stored as TypeMarkdown in objstore with auto-computed Lines and Tokens metadata. Three new tools — analyze_text (sliding-window summarisation), grep_text (RE2 regex search with context), get_text (verbatim line-range read) — operate on markdown attachments AND existing create_report outputs, enabling "report on report" follow-up analysis. The LLM discovers attachments via list_objects (now showing Lines / Tokens columns for text-bearing types) and sees a Document (object ID: …): anchor at the top of any user message that carried an attachment — symmetric with the existing Image (object ID: …): convention but text-only. System prompt teaches the LLM the provenance distinction (TypeReport = agent-generated, TypeMarkdown = user-attached) so citations in follow-up reports calibrate appropriately. PDF / DOCX deferred to v0.6 (external converter contract is its own design problem). See markdown-attachments.md.
  • System Rules (v0.7.0) — user-authored standing instructions injected near the top of the system prompt at every turn. The AGENTS.md / CLAUDE.md analogue for shell-agent-v2: write durable rules ("always respond in Japanese", "default to creating reports for long answers", "never propose rm -rf without confirmation") once, and the agent follows them across every session. Edit from Settings → System Rules (textarea with live char / token counter + context-budget advisory) or directly in any text editor at ~/Library/Application Support/shell-agent-v2/system_rules.md. Hot-reloaded on Save — no restart needed. Separate from the four memory facilities; System Rules is configuration, not learned state. See system-rules.md.
  • Filtered analysis via save_query (v0.8.0) — analyze_data previously ran sliding-window analysis over the whole table only. New save_query tool materialises a SELECT result as a fresh derived base table; pass that table's name to analyze_data to deep-analyse just the filtered subset (last 24 h, errors only, one customer's events, …). Derived tables appear in list_tables alongside loaded ones, travel inside analysis.duckdb for export/import, and are wiped by reset_analysis like everything else. No engine schema changes, no bundle-format version bump. See saved-query-tables.md.
  • Temporal context — enriched date/time injection + resolve_date system tool
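The session rename persistence fix in the list above comes down to one ordering rule: update the in-memory title under the lock, then save. The following is a minimal Go sketch of that rule, not the project's actual code. The `Agent`, `Session`, and `Store` types and the `AfterSend` method are hypothetical stand-ins, and the map replaces the real chat.json write.

```go
package main

import (
	"fmt"
	"sync"
)

// Session is a hypothetical stand-in for the agent's in-memory session state.
type Session struct {
	ID    string
	Title string
}

// Store is a hypothetical persistence layer: session ID -> persisted title.
// In the real app this role is played by chat.json on disk.
type Store map[string]string

// Agent owns the in-memory session; every mutation goes through it.
type Agent struct {
	mu      sync.Mutex
	session *Session
	store   Store
}

// save persists the in-memory session. The caller must hold a.mu.
func (a *Agent) save() { a.store[a.session.ID] = a.session.Title }

// RenameSession updates the in-memory title first and then persists it,
// so a later save cannot resurrect the stale title.
func (a *Agent) RenameSession(title string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.session.Title = title
	a.save()
}

// AfterSend models any later code path that saves the session again.
func (a *Agent) AfterSend() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.save()
}

func main() {
	a := &Agent{session: &Session{ID: "s1", Title: "Untitled"}, store: Store{}}
	a.RenameSession("Q3 log review")
	a.AfterSend() // the rename survives the follow-up save
	fmt.Println(a.store["s1"])
}
```

Because the rename and every later save read the same in-memory struct, there is no second copy of the title that can go stale.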

Installation

cd app
make build
# Output: dist/shell-agent-v2.app

Configuration

Settings stored at ~/Library/Application Support/shell-agent-v2/config.json.
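As a rough illustration of how the dotted JSON paths quoted in this README might map onto config.json, the Go sketch below decodes a small sample. The nesting is inferred from the dotted paths only; the real file may contain more keys or a different layout, and the `Config` type and `parseConfig` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config mirrors a few documented JSON paths. The nesting is an assumption
// inferred from dotted paths such as agent.max_tool_rounds.
type Config struct {
	Agent struct {
		MaxToolRounds int `json:"max_tool_rounds"`
	} `json:"agent"`
	Tools struct {
		HideAnalysisToolsUntilDataLoaded bool `json:"hide_analysis_tools_until_data_loaded"`
	} `json:"tools"`
	Sandbox struct {
		Enabled        bool   `json:"enabled"`
		Engine         string `json:"engine"`
		Network        bool   `json:"network"`
		TimeoutSeconds int    `json:"timeout_seconds"`
	} `json:"sandbox"`
}

// sample is an illustrative fragment, not a complete config.json.
const sample = `{
  "agent":   { "max_tool_rounds": 10 },
  "tools":   { "hide_analysis_tools_until_data_loaded": false },
  "sandbox": { "enabled": true, "engine": "auto", "network": false, "timeout_seconds": 60 }
}`

// parseConfig decodes a JSON document into Config.
func parseConfig(raw string) (Config, error) {
	var c Config
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

func main() {
	c, err := parseConfig(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Agent.MaxToolRounds, c.Sandbox.Engine, c.Sandbox.TimeoutSeconds)
}
```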

LLM Backend

# In chat:
/model           # Show current engine
/model local     # Switch to local LLM (within the current profile)
/model vertex    # Switch to Vertex AI (within the current profile)
/profile         # List profiles, mark the current with ●
/profile <name>  # Switch this session's profile binding

Or use the status-bar pill ([Profile / Local|Vertex]) — clicking opens the Session Control Popover for one-click profile / backend switching.
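For readers curious how the chat commands above could be told apart from ordinary messages, here is a small Go sketch of a parser for /model and /profile. It is an assumption-laden illustration: the `Command` type and `parseCommand` function are hypothetical, and the real dispatcher may accept other commands or validate arguments differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Command is a hypothetical parsed slash command.
type Command struct {
	Name string // "model" or "profile"
	Arg  string // "" means show / list
}

// parseCommand recognises /model [local|vertex] and /profile [name].
// It returns ok=false for ordinary chat text and for anything it does not know.
func parseCommand(input string) (cmd Command, ok bool) {
	fields := strings.Fields(input)
	if len(fields) == 0 || !strings.HasPrefix(fields[0], "/") {
		return Command{}, false
	}
	name := strings.TrimPrefix(fields[0], "/")
	arg := strings.Join(fields[1:], " ")
	switch name {
	case "model":
		if arg == "" || arg == "local" || arg == "vertex" {
			return Command{Name: name, Arg: arg}, true
		}
	case "profile":
		return Command{Name: name, Arg: arg}, true
	}
	return Command{}, false
}

func main() {
	for _, in := range []string{"/model", "/model vertex", "/profile work", "hello"} {
		cmd, ok := parseCommand(in)
		fmt.Printf("%q -> %+v %v\n", in, cmd, ok)
	}
}
```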

Vertex AI Setup

gcloud auth application-default login
# Requires roles/aiplatform.user

Settings reference

Knobs exposed in the Settings dialog. Per-backend values override the legacy top-level fallbacks in config.json.

Agent loop

| Setting | JSON path | Default | Notes |
|---|---|---|---|
| Max tool rounds per message | agent.max_tool_rounds | 10 | Hard cap on tool-call rounds for one user message. The loop-detection ring buffer (Feature 1, v0.1.16) catches stuck same-error stretches early, so raising this is reasonably safe when a long analysis legitimately needs more rounds. |
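To make the interplay between the hard round cap and loop detection concrete, the Go sketch below stops a tool loop either at the cap or as soon as a fixed-size ring buffer holds nothing but the same error. This is one plausible reading of "stuck same-error stretches", not the project's implementation; the `ring` type, `runRounds` helper, and the window size are all hypothetical.

```go
package main

import "fmt"

// ring is a fixed-size buffer of the most recent tool-round error strings.
// The size must be greater than zero.
type ring struct {
	buf  []string
	next int
	n    int
}

func newRing(size int) *ring { return &ring{buf: make([]string, size)} }

// push records the error text of one tool round ("" means the round succeeded).
func (r *ring) push(errText string) {
	r.buf[r.next] = errText
	r.next = (r.next + 1) % len(r.buf)
	if r.n < len(r.buf) {
		r.n++
	}
}

// stuck reports whether the buffer is full and every entry is the same
// non-empty error.
func (r *ring) stuck() bool {
	if r.n < len(r.buf) || r.buf[0] == "" {
		return false
	}
	for _, e := range r.buf[1:] {
		if e != r.buf[0] {
			return false
		}
	}
	return true
}

// runRounds replays a sequence of per-round errors and returns how many
// rounds ran before the cap or the loop detector stopped the loop.
func runRounds(errs []string, maxRounds, window int) int {
	r := newRing(window)
	for i, e := range errs {
		if i >= maxRounds {
			return i
		}
		r.push(e)
		if r.stuck() {
			return i + 1
		}
	}
	return len(errs)
}

func main() {
	same := []string{"timeout", "timeout", "timeout", "timeout", "timeout"}
	fmt.Println(runRounds(same, 10, 3)) // stops once three identical errors are seen
}
```

With detection in place, a higher cap mostly costs rounds on genuinely long analyses rather than on repeated failures.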

Per-backend context budget (Local / Vertex AI)

| Setting | JSON path | Default (Local) | Default (Vertex) | Notes |
|---|---|---|---|---|
| Hot Token Limit | llm.{local,vertex_ai}.hot_token_limit | 4096 | 65536 | Compaction trigger. When the total token count of the Hot tier exceeds this, the oldest Hot records are summarised into Warm. |
| Max Context Tokens | llm.{local,vertex_ai}.context_budget.max_context_tokens | 16384 | 524288 | Total token budget sent to the model per call. 0 = unlimited. |
| Max Warm Summary Tokens | llm.{local,vertex_ai}.context_budget.max_warm_tokens | 1024 | 16384 | Cap for the warm-summary block. Older summaries are dropped past this. |
| Max Tool-Result Tokens | llm.{local,vertex_ai}.context_budget.max_tool_result_tokens | 2048 | 32768 | Per-tool-result truncation before insertion into the LLM message list. |
| Output Reserve | llm.{local,vertex_ai}.context_budget.output_reserve | 4096 | 4096 | Tokens reserved for the model's reply. Subtracted from max_context_tokens before context packing, so the request stays under the model's window. |
| Per-request timeout (s) | llm.{local,vertex_ai}.request_timeout_seconds | 300 | 180 | Per-attempt cap inside the retry layer. |
| Retry max attempts | llm.{local,vertex_ai}.retry_max_attempts | 3 | 3 | Total LLM call attempts including the first (1 = no retries). Settings → Local LLM / Vertex AI surface this. |
| Retry backoff base (s) | llm.{local,vertex_ai}.retry_backoff_base_seconds | 5 | 5 | Initial backoff between retries. Doubles on each subsequent retry, capped at the max below. Config-only — not in Settings UI. |
| Retry backoff max (s) | llm.{local,vertex_ai}.retry_backoff_max_seconds | 120 | 120 | Cap on the per-retry wait. Config-only. |
| Retry jitter (s) | llm.{local,vertex_ai}.retry_jitter_seconds | 1 | 1 | Uniform ±jitter randomisation around each backoff. Config-only. |
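Two rows of the table above describe arithmetic that is easy to misread: the output reserve is subtracted from the context budget before packing, and the retry wait doubles from the base up to the cap with uniform jitter. The Go sketch below works both through with the documented defaults. The function names are hypothetical, and applying jitter after the cap is an assumption; the actual retry layer may order these steps differently.

```go
package main

import (
	"fmt"
	"math/rand"
)

// promptBudget returns the tokens left for context packing after the output
// reserve is subtracted. limited is false when maxContext is 0 (unlimited).
func promptBudget(maxContext, outputReserve int) (budget int, limited bool) {
	if maxContext == 0 {
		return 0, false
	}
	budget = maxContext - outputReserve
	if budget < 0 {
		budget = 0
	}
	return budget, true
}

// backoffSeconds returns the wait before retry number `retry` (1 = first retry).
// The wait starts at base, doubles per retry, and is capped at max.
// rnd must return a value in [0, 1); it spreads the wait uniformly over ±jitter.
// Assumption: jitter is applied after the cap.
func backoffSeconds(retry int, base, max, jitter float64, rnd func() float64) float64 {
	d := base
	for i := 1; i < retry && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	if jitter > 0 && rnd != nil {
		d += (2*rnd() - 1) * jitter
	}
	if d < 0 {
		d = 0
	}
	return d
}

func main() {
	local, _ := promptBudget(16384, 4096)
	vertex, _ := promptBudget(524288, 4096)
	fmt.Println(local, vertex) // 12288 520192

	// retry_max_attempts = 3 means the first call plus two retries.
	for retry := 1; retry <= 2; retry++ {
		fmt.Printf("retry %d waits about %.1fs\n", retry, backoffSeconds(retry, 5, 120, 1, rand.Float64))
	}
}
```

With the defaults, the two retries wait roughly 5 s and 10 s; the 120 s cap only comes into play if retry_max_attempts is raised well above 3.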

Sandbox (sandbox.*)

| Setting | JSON path | Default | Notes |
|---|---|---|---|
| Enabled | sandbox.enabled | false | Master toggle. When off, the eight sandbox-* tools are not registered. |
| Engine | sandbox.engine | auto | auto picks podman then docker from PATH. |
| Image | sandbox.image | (empty until you Build) | Active container image. Locally-built images (shell-agent-v2-sandbox:<sha>) and @sha256:-pinned references are treated as safe; mutable upstream tags (e.g. python:3.12-slim) trigger an advisory banner in the Settings → Sandbox tab. |
| Max output bytes | sandbox.max_output_bytes | 8388608 (8 MiB) | Per-exec cap on each of stdout / stderr. Excess is dropped with a [output truncated at N bytes] marker — defends against an LLM-issued cat /dev/zero etc. OOMing the app. Config-only; no UI surface. |
| Network | sandbox.network | false | Egress; default off. |
| CPU limit | sandbox.cpu_limit | 2 | Passed to --cpus. |
| Memory limit | sandbox.memory_limit | 1g | Passed to --memory. |
| Per-call timeout (s) | sandbox.timeout_seconds | 60 | Per-exec cap. |
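The max-output-bytes cap can be pictured as a writer that keeps the first N bytes and silently discards the rest, so a runaway command cannot grow the buffer without bound. The Go sketch below shows that idea under stated assumptions: `cappedBuffer` is a hypothetical type, and the exact placement of the truncation marker in the real app may differ.

```go
package main

import (
	"bytes"
	"fmt"
)

// cappedBuffer keeps at most limit bytes and remembers whether anything
// was dropped. One instance would be used per stream (stdout, stderr).
type cappedBuffer struct {
	buf       bytes.Buffer
	limit     int
	truncated bool
}

// Write implements io.Writer. It always reports the full length as consumed
// so the producing process is not interrupted by a short write.
func (c *cappedBuffer) Write(p []byte) (int, error) {
	room := c.limit - c.buf.Len()
	if room < len(p) {
		c.truncated = true
		if room > 0 {
			c.buf.Write(p[:room])
		}
		return len(p), nil
	}
	c.buf.Write(p)
	return len(p), nil
}

// String returns the captured output, followed by a marker if bytes were dropped.
func (c *cappedBuffer) String() string {
	if c.truncated {
		return c.buf.String() + fmt.Sprintf("\n[output truncated at %d bytes]", c.limit)
	}
	return c.buf.String()
}

func main() {
	out := &cappedBuffer{limit: 8}
	fmt.Fprint(out, "0123456789abcdef")
	fmt.Println(out.String())
}
```

Memory use stays bounded by the limit regardless of how much the command prints, which is the property the setting exists to guarantee.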

Cross-session memory trust

shell-agent-v2 auto-extracts important facts from each conversation. Cross-session entries (Global Memory) are re-injected into every future session's system prompt as authoritative context — which means anything that ever appears in an assistant turn (a quoted CSV cell, an MCP response, OCR'd image text, a fetched web page) can structurally end up steering future sessions. Each entry carries a provenance tag:

  • user-stated — came from a user turn, a manual pin, or an explicit "Pin to Global Memory" promotion. Treated as authoritative.
  • derived — extracted from an assistant turn, or a finding the LLM promoted via promote_finding. Lower trust because the content traces back through the LLM and may carry attacker-influenced bytes.

The sidebar (Global / Session Memory) and the chat-pane Findings panel show the badge inline. If a fact starts driving weird behaviour (the recoverable case being the THINK leak that prompted this hardening), open the relevant list, select the offending entries, and bulk-delete them. See docs/en/history/memory-injection-hardening.md for the full threat model and docs/en/memory-model.md for the v0.2.0 4-facility design.
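The two provenance tags can be read as a simple classification over where an entry came from. The Go sketch below encodes the rules stated in this section; the `Source` enum and `classify` function are hypothetical names, and defaulting unknown sources to the lower-trust tag is a design assumption rather than documented behaviour.

```go
package main

import "fmt"

// Provenance is the trust tag carried by a memory entry.
type Provenance string

const (
	UserStated Provenance = "user-stated"
	Derived    Provenance = "derived"
)

// Source is a hypothetical enum describing where an entry originated.
type Source int

const (
	FromUserTurn       Source = iota // extracted from a user message
	FromManualPin                    // manual pin or explicit "Pin to Global Memory"
	FromAssistantTurn                // extracted from an assistant message
	FromPromoteFinding               // finding promoted by the LLM via promote_finding
)

// classify maps a source to its provenance tag. Anything not explicitly
// user-originated falls back to the lower-trust tag.
func classify(s Source) Provenance {
	switch s {
	case FromUserTurn, FromManualPin:
		return UserStated
	default:
		return Derived
	}
}

func main() {
	fmt.Println(classify(FromUserTurn), classify(FromAssistantTurn))
}
```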

Requirements

  • macOS 10.15+
  • LM Studio (for local backend) — Apple Silicon M1/M2 Pro+ recommended
  • GCP project with billing enabled (for Vertex AI backend)

Building

cd app
make build      # Build .app bundle
make dev        # Development with hot reload
make test       # Run tests

Documentation

Older pre-v0.2.0 audit-trail material lives under docs/en/history/ (Japanese: docs/ja/history/). Most of it has been superseded by the current reference docs — consult only when you need the historical "why."

License

MIT

About

macOS local-first chat & agent tool with interactive data analysis (Wails v2 + React)
