RFC: VTuber adapter Tier-2 — WebSocket agent-state push

> Depends on: Tier-1 RFC #1233 / PR #1234
> Discord discussion: https://discord.com/channels/1491295327620169908/1520790210320011274

## Problem

Tier-1 gives character skins a real OAB agent via OpenAI-compatible `/v1/chat/completions` SSE. But that request/response (pull) model leaves two gaps:

1. **No agent state visibility.** The skin can't see what the agent is doing between request and reply — thinking, running tools, spawning subagents, hitting an error. The avatar just sits idle during a 10-second tool-use chain. clawd-on-desk proves this is solvable: its 13-state vocabulary drives pixel-pet animation that shows the agent working in real time.

2. **No server-initiated push.** The agent can't proactively reach the user ("your CI finished," "I need permission to delete this file") without the skin polling. Every VTuber protocol surveyed (e.g. Open-LLM-VTuber, VTube Studio) is also client-initiated; proactive ambient notification is an open gap in the ecosystem.

## Proposal in one line

Add a **WebSocket side-channel** (`/v1/vtuber/ws`) to the vtuber adapter that pushes structured agent-state events, tool-call visibility, and ambient notifications to connected skins in real time.

## At a Glance

```
Skin (AniCompanion / Open-LLM-VTuber / desktop pet)
  │
  ├── POST /v1/chat/completions (SSE)          ← Tier-1 (text + inline [emotion])
  │
  └── WS /v1/vtuber/ws (Bearer auth)           ← Tier-2 (this RFC)
        ↕ JSON frames:
        ← agent_state:  idle → thinking → working → attention
        ← tool_status:  {name: "Bash", status: "running"}
        ← emotion:      {tag: "excited", intensity: 0.8}
        ← notification:  {text: "CI passed", urgency: "low"}
        → subscribe:    {events: ["agent_state", "tool_status"]}
        → ping/pong
```

The WS channel is **optional** — Tier-1 works standalone. Skins that connect get richer animation; skins that don't still get full chat.

## Prior Art (14 projects surveyed)

### Agent State Push (only 2 exist)

**clawd-on-desk** (HTTP hook → Electron, 13 states) — Maps Claude Code lifecycle hooks to 13 named animation states: `idle`, `thinking`, `working`, `juggling` (subagents, tiered by count), `error`, `attention`, `notification`, `sweeping`, `carrying`, `sleeping`, plus mini-mode variants. Multi-session priority merging. The most complete agent-state vocabulary in the ecosystem.

**Flawed Avatar** (OpenClaw plugin, WS event-driven) — Electron overlay subscribing to OpenClaw gateway lifecycle events over WS (protocol v3, JSON text frames). 4-state FSM: idle/thinking/speaking/working. Proves event-driven WS push model works for avatar animation. Gateway stays avatar-agnostic — pushes raw lifecycle events; the avatar plugin maps to its own states.

### WS Protocols (avatar control)

**VTube Studio API** (`ws://localhost:8001`) — Plugin API for Live2D. Expression activation, hotkey triggering (23 types), continuous parameter injection (≥1 Hz). Token auth. Expression/hotkey names are model-specific — must be discovered at runtime.

**nizima LIVE Plugin API** (`ws://localhost:22022`) — Official Live2D WS protocol. Envelope: `{nLPlugin, Timestamp, Id, Type: Request|Response|Event|Error, Method, Data}`. Token auth. `SetLiveParameterValues`, `TriggerModelHotkey`, event subscriptions.

**Warudo** (`ws://localhost:19190`) — 3D VTuber studio. WS + Blueprint node "On WebSocket Message Received". No formal public schema. Plugin SDK with `WebSocketService` base class.

**Veadotube** — WS API with state stack model (push/pop/toggle/set). Simple but documented.

### WS Protocols (chat/LLM)

**Open-LLM-VTuber** (`ws://<host>/client-ws`) — Full conversation protocol. State implicit via `control` messages. `tool_call_status` messages with running/completed/error. Emotions as integer indices via `emotionMap`. Proactive speech is frontend-triggered only.

**aituber-kit** (v2, `ws://localhost:8000/ws`) — Structured envelope with versioning, session management, streaming deltas, heartbeat, ack, cancel. Emotion as simple string field in payload. No agent state push.

### Non-WS / OSC-Based

**VRChat** — OSC over UDP only (port 9000). Custom avatar parameters writable via `/avatar/parameters/{name}`. AI integration via `WS → OSC` relay (proven by vrchat-mcp-osc). Expression control = int/float params → Unity Animator.
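
To make the `WS → OSC` relay idea concrete, here is a minimal sketch of the OSC side, using only the standard library. It encodes a single float message the way OSC 1.0 lays it out (null-terminated strings padded to 4 bytes, big-endian float32). The `Emotion_` parameter prefix and the helper names are hypothetical; a real avatar defines its own parameter names.

```python
import struct


def _osc_string(text: str) -> bytes:
    """Encode an OSC string: ASCII, null-terminated, padded to a multiple of 4."""
    data = text.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def osc_float_message(address: str, value: float) -> bytes:
    """Build an OSC message carrying a single float32 argument."""
    return _osc_string(address) + _osc_string(",f") + struct.pack(">f", value)


def emotion_to_osc(event: dict, param_prefix: str = "Emotion_") -> bytes:
    """Map an `emotion` event to a VRChat avatar-parameter write.

    `param_prefix` is an assumption for illustration only.
    """
    address = f"/avatar/parameters/{param_prefix}{event['tag']}"
    return osc_float_message(address, float(event["intensity"]))
```

A relay would send the resulting bytes with `socket.sendto(...)` on a UDP socket aimed at VRChat's OSC port (9000, as noted above).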

**VMC Protocol** — OSC over UDP for mocap. Wrong abstraction for agent state — no semantic state concept.

### Inline Tag Systems (no WS)

**ChatdollKit** (Unity, 1.2k stars) — `[face:X]`/`[anim:X]`/`[pause:N]` inline tags. HTTP/REST + raw TCP socket. No agent state, no WS.

**Hermes Agent** — Custom SSE events (`hermes.tool.progress`) inline in Tier-1 stream. Avoids second connection but standard OpenAI clients ignore unknown events.

### Unreal Engine

UE Remote Control API (WS + JSON on port 30020), Pixel Streaming (`emitUIInteraction` over WebRTC), Live Link (UDP). Commercial: Convai (gRPC, emotion events), MetaSoul (12 emotion channels), NVIDIA A2F (52 ARKit blend shapes).

### Dead / No External API

w-AI-fu (repo deleted, low adoption), SillyTavern (internal browser JS only), VSeeFace (VMC only), KalidoKit (deprecated), Moemate (shut down 2025-01).

### Comparison Matrix

| Feature | clawd-on-desk | Flawed Avatar | VTube Studio | nizima LIVE | Open-LLM-VTuber | aituber-kit | VRChat | Warudo |
|---|---|---|---|---|---|---|---|---|
| **Transport** | HTTP POST | WebSocket | WebSocket | WebSocket | WebSocket | WS (v2) | OSC/UDP | WebSocket |
| **Agent state push** | Yes (13) | Yes (4) | No | No | Implicit (4) | No | No | No |
| **Emotion** | N/A | Per-state | Expr/param | Parameter | Inline `[tag]` | String field | OSC param | JSON |
| **Tool visibility** | tool_name | session.tool | N/A | N/A | tool_call_status | N/A | N/A | N/A |
| **Auth** | None | Challenge | Token+popup | Token | None | None | N/A | None |

### Key Takeaways

1. **Only clawd-on-desk and Flawed Avatar do agent state push** — this is a wide-open gap across 14 surveyed projects
2. **No platform does server-initiated ambient notification** — our `notification` event is novel
3. **JSON-over-WS with `type` discriminator** is the universal protocol shape (VTube Studio, nizima LIVE, OpenClaw, aituber-kit v2)
4. **Our 7-state vocabulary** (idle, thinking, working, juggling, error, attention, notification) sits between Flawed Avatar's 4 and clawd's 13: it keeps clawd's core states and drops `sweeping`, `carrying`, `sleeping`, and the mini-mode variants
5. **Expression control is two primitives everywhere**: trigger named expression, or set parameter by ID + value. Our `emotion` event with `tag` + `intensity` maps to both
6. **VRChat/VMC integration** needs a `WS → OSC` relay; our protocol can serve as the upstream source
7. **Warudo** (WS:19190) and **UE Remote Control** (WS:30020) can consume our events directly via plugins

Full prior art report: [TIER2-PRIOR-ART.md](https://github.com/openabdev/openab/blob/feat/vtuber-adapter/TIER2-PRIOR-ART.md) (14 projects, comparison matrix, 14 design implications)

## Proposed Solution

### Connection

- **Endpoint:** `GET /v1/vtuber/ws` → WebSocket upgrade
- **Auth:** Same `Authorization: Bearer <VTUBER_AUTH_KEY>` as Tier-1 (validated on upgrade)
- **Lifecycle:** Long-lived. Skin connects once, receives events for all its Tier-1 chat sessions. Reconnect at will — state is not stored server-side.
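
As an illustration of the connection parameters above, the following sketch builds the URL and upgrade headers a skin would hand to its WebSocket client library. It does not open a socket. The host, the key value, and the `build_ws_request` helper are hypothetical.

```python
WS_PATH = "/v1/vtuber/ws"  # endpoint path proposed in this RFC


def build_ws_request(host: str, auth_key: str, tls: bool = False) -> tuple:
    """Return (url, headers) for the WebSocket upgrade request.

    The same Bearer key as Tier-1 is sent on the upgrade, per the RFC.
    """
    if not auth_key:
        raise ValueError("auth key must not be empty")
    scheme = "wss" if tls else "ws"
    url = f"{scheme}://{host}{WS_PATH}"
    headers = {"Authorization": f"Bearer {auth_key}"}
    return url, headers


# Hypothetical values, for illustration only.
url, headers = build_ws_request("localhost:8080", "example-key")
```

Because the server keeps no per-connection state, a skin can call this again and reconnect after any drop without a resume handshake.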

### Message Envelope

```json
{"type": "<event-type>", "ts": 1719600000, ...payload}
```

All messages are JSON objects with a `type` discriminator. Server → client messages are events and also carry `ts`, a Unix timestamp in seconds; client → server messages are commands and, as in the examples below, may omit `ts`.
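
A skin-side decoder for this envelope could look like the sketch below. Silently ignoring unknown `type` values is an assumption, not something the RFC specifies; it is shown because it lets the event set grow without breaking older skins.

```python
import json
from typing import Optional

KNOWN_TYPES = {"agent_state", "tool_status", "emotion", "notification", "pong"}


def parse_event(raw: str) -> Optional[dict]:
    """Decode one server frame; return None for frames the skin should skip."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(msg, dict):
        return None
    event_type = msg.get("type")
    if not isinstance(event_type, str) or event_type not in KNOWN_TYPES:
        return None  # assumption: unknown types are ignored, not treated as errors
    if not isinstance(msg.get("ts"), (int, float)):
        return None  # server frames are expected to carry a Unix timestamp
    return msg
```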

### Server → Client Events

#### `agent_state`

```json
{"type": "agent_state", "ts": 1719600000,
 "state": "working",
 "session_id": "vtb_abc123",
 "detail": {"tool_name": "Bash", "subagent_count": 0}}
```

**State vocabulary** (based on clawd, mapped to OAB ACP events):

| State | OAB trigger | Skin behavior |
|---|---|---|
| `idle` | No active request, session start | Default pose, eye tracking |
| `thinking` | Event dispatched, waiting for agent response | Thinking animation |
| `working` | Agent executing a tool (`PreToolUse`) | Typing/working animation |
| `juggling` | Subagent spawned | Multi-tasking animation; `detail.subagent_count` for tiering |
| `error` | Tool failure, API error | Error expression |
| `attention` | Agent reply complete | Alert/wave, then fade to idle |
| `notification` | Permission request, elicitation | Notification bubble |
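
The table above can be read as a lookup from state to animation. The sketch below shows one way a skin might implement it. The clip names and the `juggling` tier thresholds are hypothetical; the RFC only says that `detail.subagent_count` is available for tiering.

```python
# Hypothetical clip names; each skin supplies its own.
ANIMATIONS = {
    "idle": "idle_loop",
    "thinking": "think",
    "working": "type",
    "error": "error",
    "attention": "wave",
    "notification": "bubble",
}


def juggling_tier(subagent_count: int) -> int:
    """Assumed tiering: 1 subagent -> 1, 2-3 -> 2, 4 or more -> 3."""
    if subagent_count >= 4:
        return 3
    if subagent_count >= 2:
        return 2
    return 1


def animation_for(event: dict) -> str:
    """Pick an animation clip for an `agent_state` event."""
    state = event.get("state")
    if state == "juggling":
        detail = event.get("detail") or {}
        return f"juggle_{juggling_tier(detail.get('subagent_count', 1))}"
    # States this skin does not know fall back to the idle clip.
    return ANIMATIONS.get(state, "idle_loop")
```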

#### `tool_status`

```json
{"type": "tool_status", "ts": 1719600000,
 "session_id": "vtb_abc123",
 "tool_id": "tu_xyz", "tool_name": "Bash",
 "status": "running", "content": "npm test"}
```

`status`: `running` | `completed` | `error`. Enables "watch the agent work" UX.
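
For the "watch the agent work" UX, a skin needs to know which tool calls are still in flight. A minimal sketch, assuming `tool_id` uniquely identifies a call and that `completed` and `error` both end it (the `ToolTracker` class is hypothetical):

```python
class ToolTracker:
    """Tracks in-flight tool calls from `tool_status` events."""

    def __init__(self) -> None:
        self.running = {}  # tool_id -> (tool_name, content)

    def apply(self, event: dict) -> list:
        """Update state from one event; return names of tools still running."""
        tool_id = event["tool_id"]
        if event["status"] == "running":
            self.running[tool_id] = (event["tool_name"], event.get("content", ""))
        else:
            # "completed" and "error" both finish the call.
            self.running.pop(tool_id, None)
        return [name for name, _ in self.running.values()]
```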

#### `emotion`

```json
{"type": "emotion", "ts": 1719600000,
 "session_id": "vtb_abc123",
 "tag": "excited", "intensity": 0.8}
```

Structured complement to Tier-1's inline `[tag]` text passthrough. Skins that want richer control (e.g. VTube Studio parameter injection with intensity blending) use it.
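
The sketch below shows how `tag` and `intensity` could map onto the two expression primitives found across the surveyed tools: setting a parameter to a value, or triggering a named expression. The `PARAMETER_FOR_TAG` table and its parameter IDs are hypothetical, and so is the returned command shape; a real skin would translate it into its own backend's calls.

```python
# Hypothetical mapping from emotion tag to a model parameter ID.
PARAMETER_FOR_TAG = {
    "excited": "SmileAmount",
    "sad": "BrowDown",
}


def clamp01(value: float) -> float:
    """Clamp a number into the range [0.0, 1.0]."""
    return max(0.0, min(1.0, float(value)))


def emotion_to_command(event: dict) -> dict:
    """Translate an `emotion` event into a skin-level command."""
    tag = event["tag"]
    intensity = clamp01(event.get("intensity", 1.0))
    param = PARAMETER_FOR_TAG.get(tag)
    if param is not None:
        # Parameter-style control: intensity blends smoothly.
        return {"kind": "set_parameter", "id": param, "value": intensity}
    # No parameter known for this tag: fall back to a named expression.
    return {"kind": "trigger_expression", "name": tag}
```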

#### `notification`

```json
{"type": "notification", "ts": 1719600000,
 "text": "CI run #452 passed",
 "urgency": "low", "action_url": "https://..."}
```

**Server-initiated ambient push** — novel in the ecosystem. `urgency`: `low` | `normal` | `high`.
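
How a skin presents notifications is up to the skin. One option, sketched here as an assumption rather than part of the protocol, is to queue them so that higher `urgency` is shown first and equal urgency keeps arrival order.

```python
import heapq
import itertools

URGENCY_RANK = {"high": 0, "normal": 1, "low": 2}


class NotificationQueue:
    """Orders `notification` events by urgency, then by arrival."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, event: dict) -> None:
        # Assumption: a missing or unrecognised urgency is treated as "normal".
        rank = URGENCY_RANK.get(event.get("urgency"), URGENCY_RANK["normal"])
        heapq.heappush(self._heap, (rank, next(self._seq), event))

    def pop(self):
        """Return the next event to show, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```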

### Client → Server Commands

- **`subscribe`**: `{"type": "subscribe", "events": ["agent_state", "tool_status"]}`. Opts in to the listed event types; a client that never sends it receives all events.
- **`ping`**: `{"type": "ping"}` → server replies `{"type": "pong", "ts": ...}`.
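
On the gateway side, the two commands reduce to a small amount of per-connection state. The sketch below assumes that a `subscribe` replaces the previous subscription rather than adding to it, and that unknown event names are ignored; the RFC does not settle either point. The `Connection` class is hypothetical.

```python
import json
import time

ALL_EVENTS = frozenset({"agent_state", "tool_status", "emotion", "notification"})


class Connection:
    """Per-connection command handling for one WS client."""

    def __init__(self) -> None:
        self.subscribed = set(ALL_EVENTS)  # default: all event types

    def handle_command(self, raw: str, now=None):
        """Process one client frame; return a reply frame or None."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "subscribe":
            names = {e for e in msg.get("events", []) if isinstance(e, str)}
            self.subscribed = set(ALL_EVENTS & names)  # assumption: replaces, not merges
            return None
        if kind == "ping":
            ts = int(time.time()) if now is None else now
            return json.dumps({"type": "pong", "ts": ts})
        return None

    def wants(self, event: dict) -> bool:
        """True if this client should receive the given event."""
        return event.get("type") in self.subscribed
```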

## Why this approach

- **Additive, not breaking** — Tier-1 is unmodified. The WS channel is optional.
- **Proven state vocabulary** — clawd's 13 states are battle-tested. We take the core 7 and leave room to extend.
- **Ecosystem-aligned** — JSON-over-WS with `type` discriminator matches Open-LLM-VTuber, VTube Studio, nizima LIVE, and OpenClaw's protocol shape.
- **Novel where it matters** — server-initiated `notification` is a gap in every existing VTuber protocol (confirmed across 14 projects).

## Alternatives Considered

- **Inline custom SSE events in Tier-1** (Hermes style) — breaks zero-change Tier-1 compatibility.
- **Polling endpoint** — can't keep up with bursty tool-call events.
- **Full Open-LLM-VTuber protocol adoption** — too broad (audio, history, config). Cherry-pick `tool_call_status`.
- **VTube Studio parameter injection as primary** — designed for continuous face tracking, not discrete agent events.

## Scope

**In scope:** WS endpoint with Bearer auth, `agent_state` / `tool_status` / `emotion` / `notification` events, `subscribe` filtering, gateway-side event derivation from `GatewayReply`.

**Out of scope:** Audio/TTS over WS, lip sync, capability negotiation, client → server commands beyond `subscribe` and `ping`, multi-session priority merging.

## Open Questions

1. **Path:** `/v1/vtuber/ws` alongside `/v1/chat/completions` — or a separate port?
2. **Session association:** WS tied to a specific session, or all sessions under the same Bearer key?
3. **Notification source:** How does the agent emit a proactive notification? New `GatewayReply` command?
4. **Event derivation:** Gateway sees `add_reaction`/`remove_reaction`/`edit_message`. Is the mapping to `agent_state` reliable, or should OAB-core send explicit state events?
5. **Backpressure:** Drop or buffer when the WS client is slow? `agent_state` events supersede one another (latest wins), so older ones could be coalesced.
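
To ground the backpressure question, here is one possible policy, offered as a sketch and not as a decision: keep only the newest `agent_state` per session, and keep other events in a bounded queue that drops the oldest entry when full. The `OutboundBuffer` class and the queue size are hypothetical.

```python
from collections import deque


class OutboundBuffer:
    """Per-client buffer used while the WebSocket cannot accept writes."""

    def __init__(self, max_queued: int = 64) -> None:
        self._latest_state = {}  # session_id -> newest agent_state event
        self._queue = deque(maxlen=max_queued)  # oldest entry is dropped when full

    def offer(self, event: dict) -> None:
        if event["type"] == "agent_state":
            session_id = event.get("session_id")
            # Latest wins: replace any older state for the same session.
            self._latest_state.pop(session_id, None)
            self._latest_state[session_id] = event
        else:
            self._queue.append(event)

    def drain(self) -> list:
        """Return buffered events (queued events first, then states) and reset."""
        out = list(self._queue) + list(self._latest_state.values())
        self._queue.clear()
        self._latest_state.clear()
        return out
```

Note that this policy gives up the relative ordering between state events and other events, which may or may not be acceptable depending on how the answer to question 4 turns out.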
