Decision update:
Confirmed v0 export set:
Adoption/DAU export additions:
Privacy guardrails remain unchanged for public trust:
ReverbCode currently captures zero user analytics. We want to fix that, but want to do it in a way that's defensible from a privacy standpoint and consistent with the rest of the architecture (loopback-only daemon, durable facts + derived reads, ports/adapters, CDC via DB triggers). This post lays out a shape based on how four similar tools handle the same problem.
Why this exists
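We want to understand:

- Where the user is going
- Where the user is getting stuck
- What crashed and when
- What a user does
- How to replay it back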
Today the backend has zero user analytics. It does have plenty of system observation: `backend/internal/observe/` is the SCM/tracker poll loop, the `change_log` table already records every durable domain mutation via DB triggers, and the CDC poller broadcasts those to in-process subscribers. None of that surfaces user behavior.

Reference designs
Four similar tools were read end-to-end before writing this. The relevant mechanisms are summarised below; cited URLs at the bottom go to the exact sections.

- Codex: `[analytics]` (anonymous, opt-out) + `[otel]` (rich, opt-in, `log_user_prompt = false` default). Dot-namespaced events (`codex.api_request`, `codex.tool.call`). Counter + histogram pairs per event. Project-local config cannot override telemetry keys.
- Conductor: `workspace created`, `model selected`, `message sent` (metadata only), provider errors with message, unexpected errors with message + stack. Explicit "no session recordings". `enterprise_data_privacy` toggle in `.conductor/settings.toml`.
- `share` setting: `"manual"` / `"auto"` / `"disabled"` controls cloud-sync of conversations. MDM-deployable managed configs on macOS (`.mobileconfig` via Jamf/Kandji/FleetDM) sit at top priority.
- Superset (`superset/utils/log.py`): `AbstractEventLogger` interface; `EVENT_LOGGER` config swaps backend at boot (`DBEventLogger`, `StdOutEventLogger`, statsd). Decorator-based instrumentation. Curated payload allowlist: only allowlisted keys reach the sink.

What we already have to build on

- `change_log` (DB-triggered CDC, in `backend/internal/storage/sqlite/migrations`): every durable domain mutation is already captured chronologically with a monotonic sequence. This is half of "replay" already, for free.
- `backend/internal/cdc` poller + in-process broadcaster: a tested fan-out primitive we can reuse to deliver telemetry events to multiple sinks.
- `slog` everywhere, with request IDs threaded through the chi router (in `backend/internal/httpd/api.go`).

Architectural constraints we must not violate
These are restated from `AGENTS.md` and `docs/architecture.md` because each rules out a "convenient" design that we'd otherwise reach for:

- Loopback only: `127.0.0.1`.
- `change_log`: do not bypass the trigger mechanism to write a parallel telemetry stream from store methods. New telemetry tables get their own triggers or a separate insert path that doesn't touch domain tables.
- `context.Context` first for I/O. Sinks must be context-cancellable so shutdown is bounded.
- `sqlite/gen/*`: any new tables go through `migrations/` + `queries/` + `npm run sqlc`.

Proposal
Mental model: a fourth lane
The current model is OBSERVE → UPDATE → DERIVE / ACT. Telemetry adds a fourth lane:
The lane is parallel to "ACT" and reads from the same sources (lifecycle/PR managers, CLI command runners, HTTP handlers). It never writes back to the domain.
One sink interface, several backends
Add a single port in `backend/internal/ports/telemetry.go`.

Implementations live under `backend/internal/adapters/telemetry/`:

- `noop`: default. Discards. Zero cost.
- `localsqlite`: appends to a new `telemetry_event` table behind the single-writer pool. Bounded retention (rolling N days, hard cap by row count). Read-only HTTP surface for the CLI and a future debug dashboard.
- `otlp`: OTLP/HTTP exporter, batched and async. Modeled on Codex's `[otel]` shape. Mapped fields: events become OTel logs, durations become histograms.
- `posthog`: optional, only if we decide to take the Conductor route. Mapped 1:1 from `Event` → PostHog `capture()`. Strict allowlist; no PII.
- `fanout`: composes multiple sinks; used by the daemon wiring to fan out to both `localsqlite` (always, when telemetry is enabled at all) and the user's chosen remote sink.

This is the Superset/Codex pattern: behaviour and policy in the wiring layer, not in the call sites. Call sites only see `EventSink`.

Two-tier user control (Codex pattern, adapted)
Codex split telemetry into two settings because the privacy posture is fundamentally different between anonymous counters and rich event traces. We should do the same. The defaults below are the privacy-first reading; the opposite is a viable position — see "Open decisions".
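A sketch of how the two tiers and the remote switch could be read from the environment and turned into a sink selection. Only `AO_TELEMETRY_REMOTE` appears in this proposal; `AO_TELEMETRY_METRICS`, `AO_TELEMETRY_EVENTS`, and `AO_TELEMETRY_ENDPOINT` are hypothetical names following the `AO_TELEMETRY_*` prefix, and the everything-off default assumes the privacy-first reading.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// TelemetryConfig mirrors the two tiers plus the remote switch.
type TelemetryConfig struct {
	Metrics  bool
	Events   bool
	Remote   string // "" or an exporter name such as "otlp"
	Endpoint string
}

// LoadTelemetryConfig reads settings through a lookup function so it can be
// tested without touching the process environment. Every variable name except
// AO_TELEMETRY_REMOTE is hypothetical.
func LoadTelemetryConfig(getenv func(string) string) TelemetryConfig {
	parseBool := func(key string) bool {
		v, err := strconv.ParseBool(getenv(key))
		return err == nil && v // unset or malformed means off
	}
	return TelemetryConfig{
		Metrics:  parseBool("AO_TELEMETRY_METRICS"),
		Events:   parseBool("AO_TELEMETRY_EVENTS"),
		Remote:   getenv("AO_TELEMETRY_REMOTE"),
		Endpoint: getenv("AO_TELEMETRY_ENDPOINT"),
	}
}

// SinkNames returns which adapters the wiring layer would compose:
// nothing enabled -> noop; any tier enabled -> localsqlite; a remote is added
// only when both an exporter and a non-empty endpoint are configured.
func (c TelemetryConfig) SinkNames() []string {
	if !c.Metrics && !c.Events {
		return []string{"noop"}
	}
	names := []string{"localsqlite"}
	if c.Remote != "" && c.Endpoint != "" {
		names = append(names, c.Remote)
	}
	return names
}

func main() {
	cfg := LoadTelemetryConfig(os.Getenv)
	fmt.Println(cfg.SinkNames())
}
```

Keeping the decision in one function means the opposite default can be tried by changing a single line, without touching any call site.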

- `metrics`: `localsqlite` only unless remote is also enabled
- `events`: `localsqlite` only unless remote is also enabled
- `remote`: `metrics` and/or `events` to a configured exporter

Configuration lives in the existing env-only config layer.

Per existing convention (no `AO_HOST`, etc.) these are env vars on the daemon, not flags. They can be inspected via `ao doctor` so the user can see what is actually configured.

Curated payload allowlist (Superset pattern)
Every event is a typed Go struct, not a `map[string]any`. The payload schema is the surface area we audit. A new event = a new struct + a new entry in the event-name constants. Free-form `extra` fields are not permitted at the call site.

The hashing is the same trick PostHog uses for "distinct_id": a stable opaque identifier the daemon can compute locally without the raw value ever leaving the machine. The backend stores the raw value alongside the hash so it can join for local debugging, but only the hash leaves the daemon.
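A minimal sketch of the typed-struct and hashing idea above, using the `ao.session.spawned` fields from the taxonomy. The keyed HMAC-SHA-256 construction and the per-install salt are assumptions; the proposal does not specify how the hash is derived.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// HashID derives a stable opaque identifier from a raw ID. Keying the hash
// with a per-install salt is an assumption made for this sketch; it keeps
// the same raw ID from hashing identically on two different machines.
func HashID(salt []byte, raw string) string {
	mac := hmac.New(sha256.New, salt)
	mac.Write([]byte(raw))
	return hex.EncodeToString(mac.Sum(nil))
}

// EventSessionSpawned is the entry in the event-name constants.
const EventSessionSpawned = "ao.session.spawned"

// SessionSpawned is the typed payload. There is deliberately no free-form
// map field, so nothing outside this schema can reach a sink.
type SessionSpawned struct {
	SessionIDHash string
	AgentKind     string
	RuntimeKind   string
	FromPRBranch  bool
}

// NewSessionSpawned is the only place the raw session ID is seen;
// the payload carries the hash alone.
func NewSessionSpawned(salt []byte, sessionID, agentKind, runtimeKind string, fromPR bool) SessionSpawned {
	return SessionSpawned{
		SessionIDHash: HashID(salt, sessionID),
		AgentKind:     agentKind,
		RuntimeKind:   runtimeKind,
		FromPRBranch:  fromPR,
	}
}

func main() {
	salt := []byte("per-install-salt") // hypothetical; would be generated once and stored locally
	ev := NewSessionSpawned(salt, "session-123", "example-agent", "example-runtime", false)
	fmt.Println(EventSessionSpawned, ev.SessionIDHash[:12], ev.AgentKind)
}
```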
Trust boundary for telemetry config
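The trust boundary described in this section can be sketched as a merge step that drops telemetry keys arriving from project scope. The flat `Settings` map and the `AO_EXAMPLE_SETTING` key are hypothetical; no project-level settings file exists yet.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Settings is a hypothetical flat key/value view of configuration.
type Settings map[string]string

// telemetryPrefix marks keys that only user-scope sources may set.
const telemetryPrefix = "AO_TELEMETRY_"

// MergeSettings overlays project-scope settings on user-scope settings but
// refuses any project-scope telemetry key. It returns the rejected keys
// (sorted) so a doctor-style command could surface them to the user.
func MergeSettings(user, project Settings) (merged Settings, rejected []string) {
	merged = make(Settings, len(user)+len(project))
	for k, v := range user {
		merged[k] = v
	}
	for k, v := range project {
		if strings.HasPrefix(k, telemetryPrefix) {
			rejected = append(rejected, k)
			continue
		}
		merged[k] = v
	}
	sort.Strings(rejected)
	return merged, rejected
}

func main() {
	user := Settings{"AO_TELEMETRY_REMOTE": ""}
	project := Settings{
		"AO_TELEMETRY_REMOTE": "otlp",  // a hostile repo trying to enable export
		"AO_EXAMPLE_SETTING":  "value", // hypothetical non-telemetry key
	}
	merged, rejected := MergeSettings(user, project)
	fmt.Println(merged["AO_TELEMETRY_REMOTE"] == "", rejected)
}
```

Rejecting by prefix rather than by an explicit key list means a telemetry setting added later is protected without anyone remembering to update the filter.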
Following Codex: telemetry settings (`AO_TELEMETRY_*` and any future managed-config equivalent) are user-scope only. A project's checked-in `.ao/settings` file (if/when we add one) cannot turn on remote export or change the endpoint. This prevents a hostile repo from leaking events from anyone who clones it.

Crash bundles instead of always-on crash reporting
Conductor auto-uploads crash logs to PostHog with stack traces. That requires us to ship a stable identity, an upload endpoint, and a retention policy on day one. We can defer all of that and still solve the "what crashed and when" question with a CLI command, `ao bug-report`.

This produces a single `.zip` in the cwd. The contents:

- `telemetry_event` rows in the window
- `change_log` rows in the window (already durable, already redacted of content)
- `running.json` and the daemon version

The user attaches the zip to a GitHub issue. We never auto-upload anything without an explicit opt-in.
Replay means event playback, not screen recording
Conductor is explicit: "we don't capture or store any session recordings." That is the right line for us too. The replay capability is event playback, not terminal/UI capture, for these reasons:

- `change_log` + the new `telemetry_event` table together already give us a chronological, durable, replayable history of what the user did and what the system observed.

The replay tool is a separate, small thing: `ao replay <bug-report.zip>` runs against an isolated daemon instance with fake adapters.
This is achievable because everything in the backend already flows through the ports/adapters boundary, so injecting fakes for the runtime/workspace/agent adapters is the existing test pattern.
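To make the fake-adapter idea concrete, here is a sketch that merges the two histories into one timeline and drives a recording fake with it. The `Record` shape, the `Player` port, and the event names in `main` are hypothetical; the real replay would drive the existing runtime/workspace/agent ports.

```go
package main

import (
	"fmt"
	"sort"
)

// Record is a hypothetical unified row read from a bug-report bundle.
type Record struct {
	AtUnixMs int64
	Source   string // "change_log" or "telemetry_event"
	Name     string
}

// Timeline merges both sources into one chronological slice. The sort is
// stable and keyed on timestamp only, so rows keep their original
// per-source order and change_log rows come first on a timestamp tie.
func Timeline(changeLog, telemetry []Record) []Record {
	out := make([]Record, 0, len(changeLog)+len(telemetry))
	out = append(out, changeLog...)
	out = append(out, telemetry...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].AtUnixMs < out[j].AtUnixMs
	})
	return out
}

// Player is the port a replay would drive.
type Player interface {
	Apply(r Record) error
}

// FakePlayer records what it was asked to do instead of touching a real
// runtime, workspace, or agent.
type FakePlayer struct {
	Applied []string
}

func (f *FakePlayer) Apply(r Record) error {
	f.Applied = append(f.Applied, r.Source+":"+r.Name)
	return nil
}

// Replay feeds the timeline to the player and stops at the first error.
func Replay(p Player, timeline []Record) error {
	for i, r := range timeline {
		if err := p.Apply(r); err != nil {
			return fmt.Errorf("replay step %d (%s): %w", i, r.Name, err)
		}
	}
	return nil
}

func main() {
	changeLog := []Record{{100, "change_log", "session.inserted"}, {300, "change_log", "session.updated"}}
	telemetry := []Record{{200, "telemetry_event", "ao.session.spawned"}}
	fake := &FakePlayer{}
	if err := Replay(fake, Timeline(changeLog, telemetry)); err != nil {
		fmt.Println("replay failed:", err)
	}
	fmt.Println(fake.Applied)
}
```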
Event taxonomy (mapped to the five questions)
The names below are the initial set. Each has a typed struct; each is a distinct line item we can debate. All event names are dot-namespaced under `ao.<domain>.<verb>`.

"Where the user is going" (navigation + funnel)

- `ao.daemon.started`
- `ao.cli.invoked`
- `ao.onboarding.first_project_added`
- `ao.onboarding.first_session_spawned`
- `ao.onboarding.first_pr_observed`
- `ao.onboarding.first_merge`

These are exactly the lifecycle waypoints the docs already call out. Aggregated, they answer "how far do new users get."
"Where the user is getting stuck"

- `ao.cli.exit_2`
- `ao.cli.repeated_failure`
- `ao.daemon.error_envelope`
- `ao.spawn.failed`
- `ao.adapter.unavailable`
- `ao.lifecycle.session_terminated_unexpected`

The pattern matches Conductor's "provider returned an error" + "unexpected error" shape.
"What crashed and when"

- `ao.daemon.panic`: `Recoverer` middleware or in any tracked goroutine
- `ao.daemon.shutdown_unclean`
- `ao.adapter.panic`

Stack traces are included only when the `events` tier is on, and only for daemon code (never user/agent code).

"What a user does"
The CLI verbs are the natural unit. One event per verb. The payload is a typed struct with allowlisted fields.

| Event | Payload |
| --- | --- |
| `ao.project.added` | `project_id_hash`, `has_git_remote`, `duration_ms` |
| `ao.session.spawned` | `session_id_hash`, `agent_kind`, `runtime_kind`, `from_pr_branch` (bool) |
| `ao.session.killed` | `session_id_hash`, `reason ∈ {user,reaper,merged}` |
| `ao.session.restored` | `session_id_hash` |
| `ao.send` | `session_id_hash`, `body_len_chars` (length only, never text) |
| `ao.terminal.opened` | `session_id_hash` |
| `ao.doctor.run` | `failing_checks_count`, `os`, `arch`, `daemon_version` |

"Replay it back"
Covered by `ao bug-report` + `ao replay` above. No additional events.

What we will not capture
Stating these explicitly so they don't quietly creep in via PR review:
`AO_TELEMETRY_REDACT_BRANCH_NAMES=true`). An enterprise user who wants names for self-hosted dashboards can opt back in for their own collector, but that is not the default.

Phasing (each step is a separately reviewable PR)

1. Plumbing only, default off. Add `ports.EventSink`, the `noop` and `localsqlite` adapters, the `telemetry_event` table + sqlc queries, the `[telemetry]` env config, and the new fourth lane wired into the daemon composition root. Instrument exactly two paths as a smoke test: daemon start/stop and `ao.cli.invoked`. No remote sinks yet. No CLI surface yet. This is the smallest "real but not load-bearing" PR.
2. Bug-report bundle. Implement `ao bug-report` over the daemon HTTP surface (new read endpoint that streams a zip). No upload, just the download. This is immediately useful for our own support workflow even if no events are wired beyond the smoke set.
3. Full event taxonomy + funnel events. Wire every event listed above through the existing services (session_manager, lifecycle, pr, doctor) at the points where the durable fact is already being written. Add tests that assert the event fires exactly once per fact (mirrors the change_log test style).
4. Remote sinks behind explicit opt-in. Add the `otlp` adapter, gated by `AO_TELEMETRY_REMOTE=otlp` + a non-empty endpoint. Optionally add `posthog` if the answer to Open Decision #2 below is PostHog.
5. Replay command. `ao replay <bug-report.zip>` against an isolated daemon instance with fake adapters. Useful for our own regression work; ship it later, no rush.

Open decisions (these are where input is most useful)

1. Default state. Should `metrics` default to on (Codex / Conductor: more data, harder enterprise sell) or off (privacy-first: slower product feedback loop)? Current lean is off until we have a published privacy notice; Codex's hybrid (metrics on, events off) is a reasonable middle ground if we can stand up a notice quickly.
2. Remote sink: OTLP vs PostHog. OTLP is vendor-neutral and matches the self-hosted-friendly posture of the project, but we get nothing for free: we have to stand up a collector and a dashboard. PostHog is turnkey and is what Conductor uses, but it's a vendor relationship with an attached privacy policy we'd have to publish. We can support both behind the same sink interface; the question is which one we wire as "blessed."
3. Scope: backend-only or renderer too? The frontend is still a placeholder. Backend-only is the lowest-risk first slice. Adding a renderer-side `analytics.ts` later is independent and can reuse the same event names over an existing daemon route.
4. Replay scope. Confirm we are explicitly choosing event playback only and not full terminal/UI capture. Conductor went the same way and called it out as a feature, not a limitation. Current lean: same.
5. Crash auto-upload. Bug-report bundles cover the "user files an issue" case. Do we additionally want the daemon to auto-upload `daemon.panic` events when remote is enabled? Codex does (under `[otel]`); Conductor does (under PostHog). Worth a separate decision because the answer changes whether we can ever drop the manual bug-report path.

References:
- `AbstractEventLogger`: https://github.com/apache/superset/blob/master/superset/utils/log.py