RFC: VTuber adapter Tier-2 — WebSocket agent-state push #1235

@canyugs

Depends on: Tier-1 RFC #1233 / PR #1234
Discord discussion: https://discord.com/channels/1491295327620169908/1520790210320011274

Problem

Tier-1 gives character skins a real OAB agent via OpenAI-compatible /v1/chat/completions SSE. But OpenAI's pull model has two gaps:

  1. No agent state visibility. The skin can't see what the agent is doing between request and reply — thinking, running tools, spawning subagents, hitting an error. The avatar just sits idle during a 10-second tool-use chain. clawd-on-desk proves this is solvable: its 13-state vocabulary drives pixel-pet animation that shows the agent working in real time.

  2. No server-initiated push. The agent can't proactively reach the user — "your CI finished," "I need permission to delete this file" — without the skin polling. Every existing VTuber protocol (Open-LLM-VTuber, VTube Studio) is also client-initiated; proactive ambient notification is an open gap in the ecosystem.

Proposal in one line

Add a WebSocket side-channel (/v1/vtuber/ws) to the vtuber adapter that pushes structured agent-state events, tool-call visibility, and ambient notifications to connected skins in real time.

At a Glance

Skin (AniCompanion / Open-LLM-VTuber / desktop pet)
  │
  ├── POST /v1/chat/completions (SSE)          ← Tier-1 (text + inline [emotion])
  │
  └── WS /v1/vtuber/ws (Bearer auth)           ← Tier-2 (this RFC)
        ↕ JSON frames:
        ← agent_state:  idle → thinking → working → attention
        ← tool_status:  {name: "Bash", status: "running"}
        ← emotion:      {tag: "excited", intensity: 0.8}
        ← notification:  {text: "CI passed", urgency: "low"}
        → subscribe:    {events: ["agent_state", "tool_status"]}
        → ping/pong

The WS channel is optional — Tier-1 works standalone. Skins that connect get richer animation; skins that don't still get full chat.

Prior Art (14 projects surveyed)

Agent State Push (only 2 exist)

clawd-on-desk (HTTP hook → Electron, 13 states) — Maps Claude Code lifecycle hooks to 13 named animation states: idle, thinking, working, juggling (subagents, tiered by count), error, attention, notification, sweeping, carrying, sleeping, plus mini-mode variants. Multi-session priority merging. The most complete agent-state vocabulary in the ecosystem.

Flawed Avatar (OpenClaw plugin, WS event-driven) — Electron overlay subscribing to OpenClaw gateway lifecycle events over WS (protocol v3, JSON text frames). 4-state FSM: idle/thinking/speaking/working. Proves event-driven WS push model works for avatar animation. Gateway stays avatar-agnostic — pushes raw lifecycle events; the avatar plugin maps to its own states.

WS Protocols (avatar control)

VTube Studio API (ws://localhost:8001) — Plugin API for Live2D. Expression activation, hotkey triggering (23 types), continuous parameter injection (≥1 Hz). Token auth. Expression/hotkey names are model-specific — must be discovered at runtime.

nizima LIVE Plugin API (ws://localhost:22022) — Official Live2D WS protocol. Envelope: {nLPlugin, Timestamp, Id, Type: Request|Response|Event|Error, Method, Data}. Token auth. SetLiveParameterValues, TriggerModelHotkey, event subscriptions.

Warudo (ws://localhost:19190) — 3D VTuber studio. WS + Blueprint node "On WebSocket Message Received". No formal public schema. Plugin SDK with WebSocketService base class.

Veadotube — WS API with state stack model (push/pop/toggle/set). Simple but documented.

WS Protocols (chat/LLM)

Open-LLM-VTuber (ws://<host>/client-ws) — Full conversation protocol. State implicit via control messages. tool_call_status messages with running/completed/error. Emotions as integer indices via emotionMap. Proactive speech is frontend-triggered only.

aituber-kit (v2, ws://localhost:8000/ws) — Structured envelope with versioning, session management, streaming deltas, heartbeat, ack, cancel. Emotion as simple string field in payload. No agent state push.

Non-WS / OSC-Based

VRChat — OSC over UDP only (port 9000). Custom avatar parameters writable via /avatar/parameters/{name}. AI integration via WS → OSC relay (proven by vrchat-mcp-osc). Expression control = int/float params → Unity Animator.

VMC Protocol — OSC over UDP for mocap. Wrong abstraction for agent state — no semantic state concept.

Inline Tag Systems (no WS)

ChatdollKit (Unity, 1.2k stars) — [face:X]/[anim:X]/[pause:N] inline tags. HTTP/REST + raw TCP socket. No agent state, no WS.

Hermes Agent — Custom SSE events (hermes.tool.progress) inline in Tier-1 stream. Avoids second connection but standard OpenAI clients ignore unknown events.

Unreal Engine

UE Remote Control API (WS + JSON on port 30020), Pixel Streaming (emitUIInteraction over WebRTC), Live Link (UDP). Commercial: Convai (gRPC, emotion events), MetaSoul (12 emotion channels), NVIDIA A2F (52 ARKit blend shapes).

Dead / No External API

w-AI-fu (repo deleted, low adoption), SillyTavern (internal browser JS only), VSeeFace (VMC only), KalidoKit (deprecated), Moemate (shut down 2025-01).

Comparison Matrix

| Feature | clawd-on-desk | Flawed Avatar | VTube Studio | nizima LIVE | Open-LLM-VTuber | aituber-kit | VRChat | Warudo |
|---|---|---|---|---|---|---|---|---|
| Transport | HTTP POST | WebSocket | WebSocket | WebSocket | WebSocket | WS (v2) | OSC/UDP | WebSocket |
| Agent state push | Yes (13) | Yes (4) | No | No | Implicit (4) | No | No | No |
| Emotion | N/A | Per-state | Expr/param | Parameter | Inline [tag] | String field | OSC param | JSON |
| Tool visibility | tool_name | session.tool | N/A | N/A | tool_call_status | N/A | N/A | N/A |
| Auth | None | Challenge | Token+popup | Token | None | None | N/A | None |

Key Takeaways

  1. Only clawd-on-desk and Flawed Avatar do agent state push — this is a wide-open gap across 14 surveyed projects
  2. No platform does server-initiated ambient notification — our notification event is novel
  3. JSON-over-WS with a type discriminator is the common protocol shape among the WS-based projects (VTube Studio, nizima LIVE, OpenClaw, aituber-kit v2)
  4. Our 7-state vocabulary (idle, thinking, working, juggling, error, attention, notification) sits between Flawed Avatar's 4 and clawd's 13 — the sweet spot
  5. Expression control is two primitives everywhere: trigger named expression, or set parameter by ID + value. Our emotion event with tag + intensity maps to both
  6. VRChat/VMC integration needs a WS → OSC relay; our protocol can serve as the upstream source
  7. Warudo (WS:19190) and UE Remote Control (WS:30020) can consume our events directly via plugins
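
To make takeaway 6 concrete, here is a minimal sketch of the encoding step a WS → OSC relay might perform for an emotion event. It follows the OSC 1.0 message layout (null-padded address, type tag string, big-endian float32). The parameter name passed in is hypothetical, since avatar parameters are defined per avatar, and using intensity as the float value is an assumption rather than part of this RFC.

```python
import struct

def _osc_string(s: str) -> bytes:
    """Encode an OSC string: ASCII, null-terminated, padded to a multiple of 4 bytes."""
    data = s.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)

def emotion_to_osc(event: dict, param_name: str) -> bytes:
    """Build one OSC message that sets a float avatar parameter.

    param_name is a hypothetical avatar parameter chosen by the relay.
    A missing intensity defaults to 1.0 (assumption).
    """
    address = f"/avatar/parameters/{param_name}"
    value = float(event.get("intensity", 1.0))
    # Layout: address pattern, type tag string ",f", then one big-endian float32.
    return _osc_string(address) + _osc_string(",f") + struct.pack(">f", value)
```

A relay would send the returned bytes as a single UDP datagram to the VRChat OSC port (9000 in the survey above).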

Full prior art report: TIER2-PRIOR-ART.md (14 projects, comparison matrix, 14 design implications)

Proposed Solution

Connection

  • Endpoint: GET /v1/vtuber/ws → WebSocket upgrade
  • Auth: Same Authorization: Bearer <VTUBER_AUTH_KEY> as Tier-1 (validated on upgrade)
  • Lifecycle: Long-lived. Skin connects once, receives events for all its Tier-1 chat sessions. Reconnect at will — state is not stored server-side.
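
As a rough illustration of validating the Bearer key on upgrade, the check could look like the sketch below. It assumes the upgrade request headers are available as a plain dict and that the expected key comes from configuration; is_authorized is a hypothetical helper, not an existing gateway function.

```python
import hmac

def is_authorized(headers: dict, expected_key: str) -> bool:
    """Return True if the upgrade request carries the expected Bearer key."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalized = {k.lower(): v for k, v in headers.items()}
    scheme, _, token = normalized.get("authorization", "").partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(token.encode(), expected_key.encode())
```

Rejecting the request before the WebSocket handshake completes keeps unauthenticated clients from ever holding an open socket.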

Message Envelope

{"type": "<event-type>", "ts": 1719600000, ...payload}

All messages are JSON objects with a type discriminator and a Unix timestamp ts (in seconds, as in the examples). Server → client messages are events; client → server messages are commands.
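
A minimal sketch of parsing this envelope follows. Frame and parse_frame are hypothetical names. The sketch tolerates a missing ts, because the client command examples in this RFC omit it; that leniency is an assumption, not a decided rule.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class Frame:
    type: str
    ts: Optional[int]
    payload: dict

def parse_frame(raw: str) -> Frame:
    """Split a JSON text frame into its type, ts, and remaining payload fields."""
    obj: Any = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("frame must be a JSON object")
    frame_type = obj.get("type")
    if not isinstance(frame_type, str) or not frame_type:
        raise ValueError("frame needs a non-empty string 'type'")
    ts = obj.get("ts")
    # bool is a subclass of int in Python, so reject it explicitly.
    if ts is not None and (isinstance(ts, bool) or not isinstance(ts, int)):
        raise ValueError("'ts' must be an integer Unix timestamp")
    # Everything except the envelope fields is treated as the payload.
    payload = {k: v for k, v in obj.items() if k not in ("type", "ts")}
    return Frame(frame_type, ts, payload)
```

Dispatching on Frame.type then becomes a plain dictionary lookup, and unknown types can be ignored for forward compatibility.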

Server → Client Events

agent_state

{"type": "agent_state", "ts": 1719600000,
 "state": "working",
 "session_id": "vtb_abc123",
 "detail": {"tool_name": "Bash", "subagent_count": 0}}

State vocabulary (based on clawd, mapped to OAB ACP events):

| State | OAB trigger | Skin behavior |
|---|---|---|
| idle | No active request, session start | Default pose, eye tracking |
| thinking | Event dispatched, waiting for agent response | Thinking animation |
| working | Agent executing a tool (PreToolUse) | Typing/working animation |
| juggling | Subagent spawned | Multi-tasking animation; detail.subagent_count for tiering |
| error | Tool failure, API error | Error expression |
| attention | Agent reply complete | Alert/wave, then fade to idle |
| notification | Permission request, elicitation | Notification bubble |
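
On the skin side, consuming this vocabulary might look like the sketch below. The animation names and the tiering threshold of 3 subagents are hypothetical. Falling back to idle for unrecognised states is an assumption that would let the vocabulary grow without breaking older skins.

```python
STATES = ("idle", "thinking", "working", "juggling",
          "error", "attention", "notification")

def pick_animation(event: dict) -> str:
    """Map an agent_state event to an animation name (names are illustrative)."""
    state = event.get("state")
    if state not in STATES:
        # Unknown or future state: fall back to the default pose.
        return "idle"
    if state == "juggling":
        # Tier the animation by subagent count; the threshold is an assumption.
        count = (event.get("detail") or {}).get("subagent_count", 1)
        return "juggling_many" if count >= 3 else "juggling_few"
    return state
```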

tool_status

{"type": "tool_status", "ts": 1719600000,
 "session_id": "vtb_abc123",
 "tool_id": "tu_xyz", "tool_name": "Bash",
 "status": "running", "content": "npm test"}

status: running | completed | error. Enables "watch the agent work" UX.
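
One way a skin could turn these events into a "currently running" display is sketched here. ToolTracker is a hypothetical helper, and keying on the (session_id, tool_id) pair assumes tool IDs are only unique within a session.

```python
class ToolTracker:
    """Tracks which tools are currently running, based on tool_status events."""

    def __init__(self):
        self.running = {}  # (session_id, tool_id) -> display info

    def apply(self, event: dict) -> None:
        key = (event.get("session_id"), event["tool_id"])
        status = event["status"]
        if status == "running":
            self.running[key] = {
                "tool_name": event["tool_name"],
                "content": event.get("content", ""),
            }
        elif status in ("completed", "error"):
            # Either terminal status removes the tool from the active set.
            self.running.pop(key, None)
        else:
            raise ValueError(f"unknown tool status: {status!r}")

    def labels(self) -> list:
        """Human-readable lines such as 'Bash: npm test' for each running tool."""
        return [f'{t["tool_name"]}: {t["content"]}' for t in self.running.values()]
```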

emotion

{"type": "emotion", "ts": 1719600000,
 "session_id": "vtb_abc123",
 "tag": "excited", "intensity": 0.8}

Structured complement to Tier-1's inline [tag] text passthrough. Skins that want richer control (e.g. VTube Studio parameter injection with intensity blending) use it.
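
The sketch below shows how a skin might translate one emotion event into the two expression primitives noted in the prior art (trigger a named expression, or set a parameter by ID and value). The command dictionaries are illustrative and not any real avatar API; the maps, the 0.5 threshold, and the 0..1 intensity range are assumptions.

```python
def to_avatar_commands(event: dict, expression_map: dict, param_map: dict,
                       trigger_threshold: float = 0.5) -> list:
    """Translate an emotion event into generic avatar commands.

    expression_map: tag -> model-specific expression name (hypothetical).
    param_map:      tag -> model-specific parameter ID (hypothetical).
    """
    tag = event["tag"]
    # Clamp intensity to 0..1; a missing intensity defaults to 1.0 (assumption).
    intensity = min(1.0, max(0.0, float(event.get("intensity", 1.0))))
    commands = []
    if tag in param_map:
        commands.append({"kind": "set_parameter",
                         "id": param_map[tag], "value": intensity})
    if tag in expression_map and intensity >= trigger_threshold:
        commands.append({"kind": "trigger_expression",
                         "name": expression_map[tag]})
    return commands
```

A skin that only supports named expressions can pass an empty param_map, and one that only blends parameters can pass an empty expression_map.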

notification

{"type": "notification", "ts": 1719600000,
 "text": "CI run #452 passed",
 "urgency": "low", "action_url": "https://..."}

Server-initiated ambient push — novel in the ecosystem. urgency: low | normal | high.
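
For illustration, a gateway-side helper that builds a notification frame could look like this. make_notification is a hypothetical name, the "normal" default urgency is an assumption, and the now parameter exists only to make the sketch testable. How the agent actually emits a notification is still an open question in this RFC.

```python
import time

URGENCY_LEVELS = ("low", "normal", "high")

def make_notification(text: str, urgency: str = "normal",
                      action_url=None, now=None) -> dict:
    """Build a notification frame; raises ValueError on an unknown urgency."""
    if urgency not in URGENCY_LEVELS:
        raise ValueError(f"urgency must be one of {URGENCY_LEVELS}")
    frame = {
        "type": "notification",
        "ts": int(now if now is not None else time.time()),
        "text": text,
        "urgency": urgency,
    }
    # action_url is optional, so it is only included when provided.
    if action_url:
        frame["action_url"] = action_url
    return frame
```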

Client → Server Commands

  • subscribe: {"type": "subscribe", "events": ["agent_state", "tool_status"]} — opt-in to event categories. Default = all.
  • ping: {"type": "ping"} → server replies {"type": "pong", "ts": ...}.
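
A per-connection sketch of these two commands follows. Connection is a hypothetical class, and silently ignoring unknown event names and unknown command types is an assumption; the RFC does not yet specify error handling.

```python
import json
import time
from typing import Optional

ALL_EVENTS = frozenset({"agent_state", "tool_status", "emotion", "notification"})

class Connection:
    """Holds the subscription state for one connected skin."""

    def __init__(self):
        # Default = all event categories.
        self.subscribed = set(ALL_EVENTS)

    def handle_command(self, raw: str, now=None) -> Optional[dict]:
        """Apply a client command; return a reply frame if one is due."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "subscribe":
            # Unknown event names are dropped (assumption).
            self.subscribed = set(msg.get("events", [])) & ALL_EVENTS
            return None
        if kind == "ping":
            return {"type": "pong",
                    "ts": int(now if now is not None else time.time())}
        return None  # unknown commands are ignored (assumption)

    def wants(self, event: dict) -> bool:
        """True if this connection should receive the given event."""
        return event.get("type") in self.subscribed
```

The server would call wants() before serialising each event, so filtered events cost nothing on the wire.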

Why this approach

  • Additive, not breaking — Tier-1 is unmodified. The WS channel is optional.
  • Proven state vocabulary — clawd's 13 states are battle-tested. We take the core 7 and leave room to extend.
  • Ecosystem-aligned — JSON-over-WS with type discriminator matches Open-LLM-VTuber, VTube Studio, nizima LIVE, and OpenClaw's protocol shape.
  • Novel where it matters — server-initiated notification is a gap in every existing VTuber protocol (confirmed across 14 projects).

Alternatives Considered

  • Inline custom SSE events in Tier-1 (Hermes style) — breaks zero-change Tier-1 compatibility.
  • Polling endpoint — can't keep up with bursty tool-call events.
  • Full Open-LLM-VTuber protocol adoption — too broad (audio, history, config). Cherry-pick tool_call_status.
  • VTube Studio parameter injection as primary — designed for continuous face tracking, not discrete agent events.

Scope

In scope: WS endpoint with Bearer auth, agent_state / tool_status / emotion / notification events, subscribe filtering, gateway-side event derivation from GatewayReply.

Out of scope: Audio/TTS over WS, lip sync, capability negotiation, bidirectional commands beyond subscribe and ping, multi-session priority merging.

Open Questions

  1. Path: /v1/vtuber/ws alongside /v1/chat/completions — or a separate port?
  2. Session association: WS tied to a specific session, or all sessions under the same Bearer key?
  3. Notification source: How does the agent emit a proactive notification? New GatewayReply command?
  4. Event derivation: Gateway sees add_reaction/remove_reaction/edit_message. Is the mapping to agent_state reliable, or should OAB-core send explicit state events?
  5. Backpressure: Drop or buffer when WS client is slow? Agent state is idempotent (latest wins).
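
For question 5, one possible "latest wins" policy is sketched below: agent_state frames are coalesced per session, while other frames go into a bounded queue that drops the oldest entry when full. OutboundBuffer, the queue size of 64, and the drop-oldest choice are all assumptions offered for discussion, not a decided design.

```python
from collections import deque

class OutboundBuffer:
    """Per-connection send buffer for a slow WebSocket client."""

    def __init__(self, max_other: int = 64):
        self._latest_state = {}                 # session_id -> newest agent_state
        self._other = deque(maxlen=max_other)   # oldest frame dropped when full

    def push(self, frame: dict) -> None:
        if frame.get("type") == "agent_state":
            key = frame.get("session_id")
            # Re-insert so dict order reflects the most recent update.
            self._latest_state.pop(key, None)
            self._latest_state[key] = frame
        else:
            self._other.append(frame)

    def drain(self) -> list:
        """Return pending frames and clear the buffer.

        Queued frames come first, then one agent_state per session. Relative
        ordering between the two groups is not preserved in this sketch.
        """
        frames = list(self._other) + list(self._latest_state.values())
        self._other.clear()
        self._latest_state.clear()
        return frames
```

With this policy a client that stalls during a burst of state changes wakes up to the current state per session rather than a backlog of stale transitions.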
