Depends on: Tier-1 RFC #1233 / PR #1234
Discord discussion: https://discord.com/channels/1491295327620169908/1520790210320011274
Problem
Tier-1 gives character skins a real OAB agent via OpenAI-compatible /v1/chat/completions SSE. But OpenAI's pull model has two gaps:
- No agent state visibility. The skin can't see what the agent is doing between request and reply: thinking, running tools, spawning subagents, hitting an error. The avatar just sits idle during a 10-second tool-use chain. clawd-on-desk proves this is solvable: its 13-state vocabulary drives pixel-pet animation that shows the agent working in real time.
- No server-initiated push. The agent can't proactively reach the user ("your CI finished," "I need permission to delete this file") without the skin polling. Every existing VTuber protocol (Open-LLM-VTuber, VTube Studio) is also client-initiated; proactive ambient notification is an open gap in the ecosystem.
Proposal in one line
Add a WebSocket side-channel (/v1/vtuber/ws) to the vtuber adapter that pushes structured agent-state events, tool-call visibility, and ambient notifications to connected skins in real time.
At a Glance
Skin (AniCompanion / Open-LLM-VTuber / desktop pet)
│
├── POST /v1/chat/completions (SSE)   ← Tier-1 (text + inline [emotion])
│
└── WS /v1/vtuber/ws (Bearer auth)    ← Tier-2 (this RFC)
      ↕ JSON frames:
      ← agent_state: idle → thinking → working → attention
      ← tool_status: {tool_name: "Bash", status: "running"}
      ← emotion: {tag: "excited", intensity: 0.8}
      ← notification: {text: "CI passed", urgency: "low"}
      → subscribe: {events: ["agent_state", "tool_status"]}
      → ping / ← pong
The WS channel is optional — Tier-1 works standalone. Skins that connect get richer animation; skins that don't still get full chat.
Prior Art (14 projects surveyed)
Agent State Push (only 2 exist)
clawd-on-desk (HTTP hook → Electron, 13 states) — Maps Claude Code lifecycle hooks to 13 named animation states: idle, thinking, working, juggling (subagents, tiered by count), error, attention, notification, sweeping, carrying, sleeping, plus mini-mode variants. Multi-session priority merging. The most complete agent-state vocabulary in the ecosystem.
Flawed Avatar (OpenClaw plugin, WS event-driven) — Electron overlay subscribing to OpenClaw gateway lifecycle events over WS (protocol v3, JSON text frames). 4-state FSM: idle/thinking/speaking/working. Proves event-driven WS push model works for avatar animation. Gateway stays avatar-agnostic — pushes raw lifecycle events; the avatar plugin maps to its own states.
WS Protocols (avatar control)
VTube Studio API (ws://localhost:8001) — Plugin API for Live2D. Expression activation, hotkey triggering (23 types), continuous parameter injection (≥1 Hz). Token auth. Expression/hotkey names are model-specific — must be discovered at runtime.
nizima LIVE Plugin API (ws://localhost:22022) — Official Live2D WS protocol. Envelope: {nLPlugin, Timestamp, Id, Type: Request|Response|Event|Error, Method, Data}. Token auth. SetLiveParameterValues, TriggerModelHotkey, event subscriptions.
Warudo (ws://localhost:19190) — 3D VTuber studio. WS + Blueprint node "On WebSocket Message Received". No formal public schema. Plugin SDK with WebSocketService base class.
Veadotube — WS API with state stack model (push/pop/toggle/set). Simple but documented.
WS Protocols (chat/LLM)
Open-LLM-VTuber (ws://<host>/client-ws) — Full conversation protocol. State implicit via control messages. tool_call_status messages with running/completed/error. Emotions as integer indices via emotionMap. Proactive speech is frontend-triggered only.
aituber-kit (v2, ws://localhost:8000/ws) — Structured envelope with versioning, session management, streaming deltas, heartbeat, ack, cancel. Emotion as simple string field in payload. No agent state push.
Non-WS / OSC-Based
VRChat — OSC over UDP only (port 9000). Custom avatar parameters writable via /avatar/parameters/{name}. AI integration via WS → OSC relay (proven by vrchat-mcp-osc). Expression control = int/float params → Unity Animator.
VMC Protocol — OSC over UDP for mocap. Wrong abstraction for agent state — no semantic state concept.
Inline Tag Systems (no WS)
ChatdollKit (Unity, 1.2k stars) — [face:X]/[anim:X]/[pause:N] inline tags. HTTP/REST + raw TCP socket. No agent state, no WS.
Hermes Agent — Custom SSE events (hermes.tool.progress) inline in Tier-1 stream. Avoids second connection but standard OpenAI clients ignore unknown events.
Unreal Engine
UE Remote Control API (WS + JSON on port 30020), Pixel Streaming (emitUIInteraction over WebRTC), Live Link (UDP). Commercial: Convai (gRPC, emotion events), MetaSoul (12 emotion channels), NVIDIA A2F (52 ARKit blend shapes).
Dead / No External API
w-AI-fu (repo deleted, low adoption), SillyTavern (internal browser JS only), VSeeFace (VMC only), KalidoKit (deprecated), Moemate (shut down 2025-01).
Comparison Matrix
| Feature | clawd-on-desk | Flawed Avatar | VTube Studio | nizima LIVE | Open-LLM-VTuber | aituber-kit | VRChat | Warudo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Transport | HTTP POST | WebSocket | WebSocket | WebSocket | WebSocket | WS (v2) | OSC/UDP | WebSocket |
| Agent state push | Yes (13) | Yes (4) | No | No | Implicit (4) | No | No | No |
| Emotion | N/A | Per-state | Expr/param | Parameter | Inline [tag] | String field | OSC param | JSON |
| Tool visibility | tool_name | session.tool | N/A | N/A | tool_call_status | N/A | N/A | N/A |
| Auth | None | Challenge | Token+popup | Token | None | None | N/A | None |
Key Takeaways
- Only clawd-on-desk and Flawed Avatar do agent state push, so this is a wide-open gap across the 14 surveyed projects
- None of the surveyed platforms does server-initiated ambient notification; our notification event is novel
- JSON-over-WS with a type discriminator is the prevailing protocol shape (VTube Studio, nizima LIVE, OpenClaw, aituber-kit v2)
- Our 7-state vocabulary (idle, thinking, working, juggling, error, attention, notification) sits between Flawed Avatar's 4 and clawd's 13: enough to distinguish tool use, subagents, and errors without the pet-specific states (sweeping, carrying, sleeping)
- Expression control is two primitives everywhere: trigger a named expression, or set a parameter by ID + value. Our emotion event with tag + intensity maps to both
- VRChat/VMC integration needs a WS → OSC relay; our protocol can serve as the upstream source
- Warudo (WS:19190) and UE Remote Control (WS:30020) can consume our events directly via plugins
Full prior art report: TIER2-PRIOR-ART.md (14 projects, comparison matrix, 14 design implications)
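The WS → OSC relay mentioned in the takeaways can be sketched as follows. This is a minimal illustration, assuming a relay that turns one emotion event into a single OSC float message; the parameter name `Excited` and the function names are hypothetical, and only the OSC 1.0 string padding and big-endian float encoding are taken as given.

```python
import struct


def _osc_string(text: str) -> bytes:
    """Encode an OSC string: ASCII, null-terminated, padded to a 4-byte boundary."""
    data = text.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def osc_float_message(address: str, value: float) -> bytes:
    """Build an OSC message carrying one float32 argument (type tag ',f')."""
    return _osc_string(address) + _osc_string(",f") + struct.pack(">f", value)


def emotion_to_osc(event: dict) -> bytes:
    """Map an emotion event to an avatar parameter write.

    Assumption: the avatar exposes a float parameter named after the
    capitalized tag. Real avatars define their own parameter names.
    """
    address = "/avatar/parameters/" + event["tag"].capitalize()
    return osc_float_message(address, float(event["intensity"]))
```

A relay would send the returned bytes as a UDP datagram to the VRChat OSC port (9000, per the survey above).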
Proposed Solution
Connection
- Endpoint: GET /v1/vtuber/ws → WebSocket upgrade
- Auth: Same Authorization: Bearer <VTUBER_AUTH_KEY> as Tier-1 (validated on upgrade)
- Lifecycle: Long-lived. Skin connects once, receives events for all its Tier-1 chat sessions. Reconnect at will; state is not stored server-side.
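Validating the Bearer key on upgrade could look roughly like the sketch below. It assumes the adapter has the upgrade request's headers available as a plain dict and the configured key as a string; `is_authorized` is a hypothetical helper, not an existing OAB function.

```python
import hmac


def is_authorized(headers: dict, expected_key: str) -> bool:
    """Return True if the upgrade request carries the expected Bearer key."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    scheme, _, token = normalized.get("authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(token.strip().encode(), expected_key.encode())
```

A server would call this before completing the handshake and answer 401 instead of upgrading when it returns False.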
Message Envelope
{"type": "<event-type>", "ts": 1719600000, ...payload}
All messages are JSON objects with a type discriminator and a Unix timestamp ts (in seconds, as in the examples). Server → client messages are events; client → server messages are commands.
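A skin-side parser for this envelope might look like the sketch below. The `Envelope` class and `parse_frame` function are illustrative names. Because the ping command example in this RFC carries no ts, the sketch treats ts as optional, which is an assumption rather than a settled rule.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Envelope:
    type: str
    ts: Optional[int]
    payload: Dict[str, Any]


def parse_frame(raw: str) -> Envelope:
    """Split a JSON text frame into discriminator, timestamp, and payload."""
    msg = json.loads(raw)
    if not isinstance(msg, dict):
        raise ValueError("frame must be a JSON object")
    if not isinstance(msg.get("type"), str):
        raise ValueError("missing string 'type' discriminator")
    ts = msg.get("ts")
    # bool is a subclass of int in Python, so reject it explicitly.
    if ts is not None and (isinstance(ts, bool) or not isinstance(ts, int)):
        raise ValueError("'ts' must be an integer Unix timestamp")
    payload = {k: v for k, v in msg.items() if k not in ("type", "ts")}
    return Envelope(msg["type"], ts, payload)
```

Keeping the payload flat next to type and ts mirrors the `...payload` spread in the envelope; a skin dispatches on `Envelope.type` and ignores types it does not know.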
Server → Client Events
agent_state
{"type": "agent_state", "ts": 1719600000,
"state": "working",
"session_id": "vtb_abc123",
"detail": {"tool_name": "Bash", "subagent_count": 0}}
State vocabulary (based on clawd, mapped to OAB ACP events):
| State | OAB trigger | Skin behavior |
| --- | --- | --- |
| idle | No active request, session start | Default pose, eye tracking |
| thinking | Event dispatched, waiting for agent response | Thinking animation |
| working | Agent executing a tool (PreToolUse) | Typing/working animation |
| juggling | Subagent spawned | Multi-tasking animation; detail.subagent_count for tiering |
| error | Tool failure, API error | Error expression |
| attention | Agent reply complete | Alert/wave, then fade to idle |
| notification | Permission request, elicitation | Notification bubble |
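
On the skin side, the table above reduces to a lookup from state to animation. The sketch below is one way to do it; the clip names and the juggling tier threshold are hypothetical, and falling back to idle for unknown states is an assumed policy that would let older skins tolerate a later, larger vocabulary.

```python
from enum import Enum


class AgentState(str, Enum):
    IDLE = "idle"
    THINKING = "thinking"
    WORKING = "working"
    JUGGLING = "juggling"
    ERROR = "error"
    ATTENTION = "attention"
    NOTIFICATION = "notification"


# Hypothetical clip names; a real skin would use its own assets.
ANIMATIONS = {
    AgentState.IDLE: "default_pose",
    AgentState.THINKING: "thinking",
    AgentState.WORKING: "typing",
    AgentState.ERROR: "error_face",
    AgentState.ATTENTION: "wave",
    AgentState.NOTIFICATION: "bubble",
}


def animation_for(event: dict) -> str:
    """Pick an animation clip for an agent_state event."""
    try:
        state = AgentState(event.get("state"))
    except ValueError:
        # Unknown or missing state: fall back to idle (assumed policy).
        return ANIMATIONS[AgentState.IDLE]
    if state is AgentState.JUGGLING:
        count = (event.get("detail") or {}).get("subagent_count", 1)
        # The threshold of 3 is an illustrative tier boundary, not from the RFC.
        return "juggling_many" if count >= 3 else "juggling_few"
    return ANIMATIONS[state]
```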
tool_status
{"type": "tool_status", "ts": 1719600000,
"session_id": "vtb_abc123",
"tool_id": "tu_xyz", "tool_name": "Bash",
"status": "running", "content": "npm test"}
status: running | completed | error. Enables "watch the agent work" UX.
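A skin that wants to show which tools are in flight could track them by tool_id, roughly as sketched below. `ToolTracker` is a hypothetical helper; it assumes every running event is eventually followed by a completed or error event for the same tool_id.

```python
from typing import Dict


class ToolTracker:
    """Tracks tools currently running, keyed by tool_id."""

    def __init__(self) -> None:
        self.running: Dict[str, str] = {}  # tool_id -> tool_name

    def apply(self, event: dict) -> None:
        """Update the running set from one tool_status event."""
        tool_id = event["tool_id"]
        if event["status"] == "running":
            self.running[tool_id] = event["tool_name"]
        else:
            # "completed" and "error" both end the tool's lifetime.
            self.running.pop(tool_id, None)
```

For example, a speech bubble could list `tracker.running.values()` while the avatar is in the working state.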
emotion
{"type": "emotion", "ts": 1719600000,
"session_id": "vtb_abc123",
"tag": "excited", "intensity": 0.8}
Structured complement to Tier-1's inline [tag] text passthrough. Skins that want richer control (e.g. VTube Studio parameter injection with intensity blending) use it.
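To illustrate how tag + intensity can feed either expression primitive, the sketch below converts one emotion event into a named-expression trigger or a parameter write. The parameter IDs, the 0.5 threshold, and the function names are assumptions for illustration; they are not part of any avatar API.

```python
from typing import Optional, Tuple

# Hypothetical tag -> parameter ID mapping; real models define their own IDs.
PARAM_FOR_TAG = {"excited": "ParamExcited", "sad": "ParamSad"}


def clamp01(value: float) -> float:
    """Clamp intensity into the [0, 1] range."""
    return max(0.0, min(1.0, float(value)))


def as_trigger(event: dict, threshold: float = 0.5) -> Optional[str]:
    """Primitive 1: fire a named expression if the emotion is strong enough."""
    intensity = clamp01(event.get("intensity", 1.0))
    return event["tag"] if intensity >= threshold else None


def as_parameter(event: dict) -> Optional[Tuple[str, float]]:
    """Primitive 2: return (parameter ID, value) for continuous blending."""
    param = PARAM_FOR_TAG.get(event["tag"])
    if param is None:
        return None
    return (param, clamp01(event.get("intensity", 1.0)))
```

A skin with only discrete expressions would use `as_trigger`; one that supports parameter injection would use `as_parameter` and blend over time.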
notification
{"type": "notification", "ts": 1719600000,
"text": "CI run #452 passed",
"urgency": "low", "action_url": "https://..."}
Server-initiated ambient push — novel in the ecosystem. urgency: low | normal | high.
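One way a skin might use urgency is to show higher-urgency notifications first while preserving arrival order within a level. The sketch below assumes that policy; `NotificationQueue` is a hypothetical name, and treating a missing or unknown urgency as normal is likewise an assumption.

```python
import heapq
import itertools
from typing import Optional

URGENCY_RANK = {"high": 0, "normal": 1, "low": 2}


class NotificationQueue:
    """Orders pending notifications by urgency, then by arrival."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per level

    def push(self, event: dict) -> None:
        rank = URGENCY_RANK.get(event.get("urgency", "normal"), 1)
        heapq.heappush(self._heap, (rank, next(self._seq), event))

    def pop(self) -> Optional[dict]:
        """Return the most urgent pending notification, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```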
Client → Server Commands
subscribe: {"type": "subscribe", "events": ["agent_state", "tool_status"]} — opt-in to event categories. Default = all.
ping: {"type": "ping"} → server replies {"type": "pong", "ts": ...}.
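On the server side, the two commands could be handled per connection as in the sketch below. `Connection` is a hypothetical class, and silently ignoring unknown event names and unknown command types is an assumed policy that the RFC does not specify.

```python
import json
import time
from typing import Optional

ALL_EVENTS = frozenset({"agent_state", "tool_status", "emotion", "notification"})


class Connection:
    """Per-socket subscription state and command handling."""

    def __init__(self) -> None:
        self.subscribed = set(ALL_EVENTS)  # default = all

    def handle_command(self, raw: str) -> Optional[str]:
        """Process one client frame; return a reply frame if one is due."""
        msg = json.loads(raw)
        if msg.get("type") == "subscribe":
            # Unknown event names are dropped (assumed policy).
            self.subscribed = set(msg.get("events", [])) & ALL_EVENTS
            return None
        if msg.get("type") == "ping":
            return json.dumps({"type": "pong", "ts": int(time.time())})
        return None

    def wants(self, event: dict) -> bool:
        """True if this connection subscribed to the event's type."""
        return event.get("type") in self.subscribed
```

The gateway would call `wants` before writing each event to a socket, so a skin that only animates states never receives emotion or notification frames.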
Why this approach
- Additive, not breaking: Tier-1 is unmodified. The WS channel is optional.
- Proven state vocabulary: clawd's 13 states are battle-tested. We take the core 7 and leave room to extend.
- Ecosystem-aligned: JSON-over-WS with a type discriminator matches Open-LLM-VTuber, VTube Studio, nizima LIVE, and OpenClaw's protocol shape.
- Novel where it matters: server-initiated notification is a gap in every existing VTuber protocol (confirmed across 14 projects).
Alternatives Considered
- Inline custom SSE events in Tier-1 (Hermes style): breaks zero-change Tier-1 compatibility.
- Polling endpoint: can't keep up with bursty tool-call events.
- Full Open-LLM-VTuber protocol adoption: too broad (audio, history, config). Cherry-pick tool_call_status.
- VTube Studio parameter injection as primary: designed for continuous face tracking, not discrete agent events.
Scope
In scope: WS endpoint with Bearer auth, agent_state / tool_status / emotion / notification events, subscribe filtering, gateway-side event derivation from GatewayReply.
Out of scope: Audio/TTS over WS, lip sync, capability negotiation, client → server commands beyond subscribe and ping, multi-session priority merging.
Open Questions
- Path: /v1/vtuber/ws alongside /v1/chat/completions, or a separate port?
- Session association: WS tied to a specific session, or all sessions under the same Bearer key?
- Notification source: How does the agent emit a proactive notification? New GatewayReply command?
- Event derivation: Gateway sees add_reaction/remove_reaction/edit_message. Is the mapping to agent_state reliable, or should OAB-core send explicit state events?
- Backpressure: Drop or buffer when WS client is slow? Agent state is idempotent (latest wins).
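To make the backpressure question concrete, here is one possible policy, sketched under assumptions rather than proposed as the answer: coalesce agent_state per session so only the latest survives, and keep other events in a bounded FIFO that drops the oldest entry when full. `OutboundBuffer` and `max_other` are hypothetical names.

```python
from collections import OrderedDict, deque
from typing import List


class OutboundBuffer:
    """Per-connection buffer for a slow client: latest-wins state, bounded rest."""

    def __init__(self, max_other: int = 64) -> None:
        self._latest_state = OrderedDict()  # session_id -> newest agent_state
        self._other = deque(maxlen=max_other)  # deque drops oldest when full

    def put(self, event: dict) -> None:
        if event.get("type") == "agent_state":
            sid = event.get("session_id")
            # Replace any older state for this session and move it to the end.
            self._latest_state.pop(sid, None)
            self._latest_state[sid] = event
        else:
            self._other.append(event)

    def drain(self) -> List[dict]:
        """Return everything pending and reset the buffer."""
        out = list(self._other) + list(self._latest_state.values())
        self._other.clear()
        self._latest_state.clear()
        return out
```

The sketch gives up strict ordering between state and non-state events, and dropping a tool_status completed frame could leave a skin showing a tool as still running. Both trade-offs would need a decision before anything like this is adopted.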