From 078f239ac2df29f56f1d908682a153e5e830ff64 Mon Sep 17 00:00:00 2001
From: Tale Agent
Date: Tue, 23 Jun 2026 10:51:01 +0000
Subject: [PATCH 1/3] feat(platform): track and alert on response-time SLAs (#1924)

Add a single source of truth for the dialog (~1s mean) and
long-operation (~40s mean) response-time SLAs in
services/platform/sla-targets.ts, and wire the verification layer on
top of the existing measurement primitives:

- Expose each target as a tale_sla_target_seconds gauge on /metrics so
  a Grafana panel can draw the budget line straight from Prometheus.
- Generate the Prometheus recording + alerting rules from the same
  targets, served read-only at /metrics/sla-rules (added to the
  bearer-gated metrics paths in the proxy Caddyfile).
- Document the SLAs, the rules, and the reconciliation of the ~1s
  dialog mean with the looser ~3s warm TTFT ceiling in the
  observability operator guide (en/de/fr) and the manual performance
  plan.

Tests: unit coverage for the targets, the rule renderer (mean +
percentile, empty + duplicate-id edge/error cases), the gauge
registration, and the /metrics SLA-gauge regression. i18n messages N/A
(no UI strings); migrations N/A.
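
For orientation, a minimal sketch of what the generated mean rule
evaluates, `rate(<metric>_sum[w]) / rate(<metric>_count[w])`, reduced to
two cumulative samples. The `HistogramSample` type and both helper names
are hypothetical and not part of this patch, and the sketch assumes no
counter resets and ignores Prometheus's window extrapolation:

```typescript
// Hypothetical illustration only; none of these names exist in the patch.
interface HistogramSample {
  /** Cumulative sum of observed seconds (the `_sum` series). */
  sum: number;
  /** Cumulative number of observations (the `_count` series). */
  count: number;
}

/**
 * Mean observation between two samples of a cumulative histogram.
 * Returns NaN when no observations landed in the window, mirroring the
 * 0/0 that the PromQL ratio produces in that case.
 */
function windowedMeanSeconds(
  start: HistogramSample,
  end: HistogramSample,
): number {
  const deltaSum = end.sum - start.sum;
  const deltaCount = end.count - start.count;
  return deltaCount === 0 ? NaN : deltaSum / deltaCount;
}

/** True when the windowed mean is above the budget; NaN never breaches. */
function breachesBudget(meanSeconds: number, targetSeconds: number): boolean {
  return meanSeconds > targetSeconds;
}

// 60 s of total latency spread over 40 dialog turns inside the window.
const windowStart: HistogramSample = { sum: 120, count: 100 };
const windowEnd: HistogramSample = { sum: 180, count: 140 };
const mean = windowedMeanSeconds(windowStart, windowEnd);
console.log(mean, breachesBudget(mean, 1));
```

With these inputs the mean is 1.5 s, which sits above the 1 s dialog
budget. An empty window yields NaN and never breaches, which is also how
a `> target` comparison behaves in PromQL.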
---
 .../configuration/observability-config.md     |  13 +-
 .../operate/observability/operations.md       |  47 +++++
 .../configuration/observability-config.md     |  13 +-
 .../operate/observability/operations.md       |  47 +++++
 .../configuration/observability-config.md     |  13 +-
 .../operate/observability/operations.md       |  47 +++++
 services/platform/server.ts                   |   7 +
 services/platform/sla-targets.test.ts         | 134 ++++++++++++
 services/platform/sla-targets.ts              | 193 ++++++++++++++++++
 services/platform/telemetry.test.ts           |   8 +
 services/platform/telemetry.ts                |   8 +-
 services/platform/tests/manual/performance.md |  30 ++-
 services/proxy/Caddyfile                      |   5 +-
 13 files changed, 535 insertions(+), 30 deletions(-)
 create mode 100644 services/platform/sla-targets.test.ts
 create mode 100644 services/platform/sla-targets.ts

diff --git a/docs/de/self-hosted/configuration/observability-config.md b/docs/de/self-hosted/configuration/observability-config.md
index c0bf1cae0f..501cec4cd9 100644
--- a/docs/de/self-hosted/configuration/observability-config.md
+++ b/docs/de/self-hosted/configuration/observability-config.md
@@ -19,14 +19,15 @@ Tale bringt keinen Log-Shipper mit. Der Driver-Tausch ist der unterstützte Inte
 
 ## Metriken
 
-Der Caddy-Proxy exponiert zwei Metric-Pfade, gegated von einem einzigen Bearer-Token:
+Der Caddy-Proxy exponiert drei Metric-Pfade, gegated von einem einzigen Bearer-Token:
 
-| Pfad | Quelle | Was drinsteckt |
-| ------------------- | --------------- | --------------------------------------------------------------- |
-| `/metrics/platform` | `tale-platform` | HTTP-Latenz, Route-Counter, Node-Prozessmetriken |
-| `/metrics/convex` | `tale-convex` | 261 eingebaute Convex-Metriken, plus die RAG- und Crawl-Timings |
+| Pfad | Quelle | Was drinsteckt |
+| -------------------- | --------------- | ----------------------------------------------------------------------------- |
+| `/metrics/platform` | `tale-platform` | HTTP-Latenz, Route-Counter, Node-Prozessmetriken, Antwortzeit-SLA-Ziel-Gauges |
+| `/metrics/convex` | `tale-convex` | 261 eingebaute Convex-Metriken, plus die RAG- und Crawl-Timings |
+| `/metrics/sla-rules` | `tale-platform` | Generierte Prometheus-Recording- + Alerting-Rules für die Antwortzeit-SLAs |
 
-Wissens-Arbeit (RAG-Suche, Dokument-Ingestion, Web-Crawling) läuft jetzt im Convex-Backend, also reiten ihre Timings auf der `/metrics/convex`-Reihe statt auf einem separaten Endpoint. Setze `METRICS_BEARER_TOKEN` in `.env`, um die zwei Endpoints zu aktivieren; lass es unset, damit sie jeder Anfrage 401 zurückgeben. Alles ausser den zwei gelisteten Pfaden gibt ebenfalls 401 zurück, damit ein fehlgerouteter Scraper die internen Health-Endpoints der Plattform nicht versehentlich sieht.
+Wissens-Arbeit (RAG-Suche, Dokument-Ingestion, Web-Crawling) läuft jetzt im Convex-Backend, also reiten ihre Timings auf der `/metrics/convex`-Reihe statt auf einem separaten Endpoint. Setze `METRICS_BEARER_TOKEN` in `.env`, um diese Endpoints zu aktivieren; lass es unset, damit sie jeder Anfrage 401 zurückgeben. Der `/metrics/sla-rules`-Pfad ist eine schreibgeschützte YAML-Rules-Datei, die du in Prometheus lädst, kein Scrape-Target — die Schwellen darin sind in [Operations](/de/self-hosted/operate/observability/operations) dokumentiert. Alles ausser den gelisteten Pfaden gibt ebenfalls 401 zurück, damit ein fehlgerouteter Scraper die internen Health-Endpoints der Plattform nicht versehentlich sieht.
 
 Eine funktionierende Prometheus-Scrape-Stanza:
 
diff --git a/docs/de/self-hosted/operate/observability/operations.md b/docs/de/self-hosted/operate/observability/operations.md
index 9cbc3905dc..323f5310bd 100644
--- a/docs/de/self-hosted/operate/observability/operations.md
+++ b/docs/de/self-hosted/operate/observability/operations.md
@@ -49,6 +49,53 @@ Wenn eine Page landet, folgen die ersten fünf Minuten jedes Mal derselben Form.
 
 Ein `tale-knowledge-db`-Ausfall ist ein warn, kein page. Der Web-Crawl-Plan absorbiert Stunden von Downtime ohne Benutzerwirkung, und die Dokument-Ingestion versucht es erneut, statt Arbeit zu verwerfen — Uploads sitzen in „indexing", bis die Korpus-Datenbank zurück ist. Die Wissens-Suche liefert in der Zwischenzeit leer, aber Chats, die kein Wissen abrufen, arbeiten weiter. Fang das im warn-Band und fix es zu Geschäftszeiten.
 
+## Antwortzeit-SLAs
+
+Zwei Antwortzeit-Budgets werden als erstklassige Signale verfolgt: interaktive Dialog-Eingabe und langlaufende Operationen wie Evaluierungen. Beide werden als **Mittelwert** über ein gleitendes Fenster verifiziert — die vertragliche Zahl ist ein Durchschnitt, keine Obergrenze pro Anfrage — und beide sind so verdrahtet, dass Prometheus alarmiert, sobald der Durchschnitt für die Dauer des `for:`-Fensters über dem Budget bleibt.
+
+| Budget | Statistik | Ziel | Fenster | Zugrundeliegende Serie |
+| --------------- | ---------- | ----- | ------- | ----------------------------- |
+| Dialog-Eingabe | Mittelwert | ~1 s | 30 Min | `tale_dialog_ttft_seconds` |
+| Lange Operation | Mittelwert | ~40 s | 6 Std | `tale_long_operation_seconds` |
+
+Jedes Ziel reitet zudem auf dem Plattform-Metrik-Endpoint als `tale_sla_target_seconds{sla,statistic}`, sodass ein Grafana-Panel die Budget-Linie direkt aus Prometheus zeichnet, statt sie fest zu verdrahten. Die zugrundeliegenden Latenz-Serien sind die Convex-Funktions-Ausführungs-Histogramme auf `/metrics/convex`; relabel oder record sie auf die Namen oben, damit die Rules auflösen. Die Plattform liefert die fertigen Recording- und Alerting-Rules unter `/metrics/sla-rules` (hinter demselben Bearer-Token wie die anderen Metrik-Pfade) — hole sie einmal und referenziere die Datei unter `rule_files:`, oder füge das Äquivalent ein:
+
+```yaml
+groups:
+  - name: tale-sla-recording
+    rules:
+      - record: tale_sla_dialog_ttft:mean30m
+        expr: rate(tale_dialog_ttft_seconds_sum[30m]) / rate(tale_dialog_ttft_seconds_count[30m])
+        labels:
+          sla: dialog_ttft
+      - record: tale_sla_long_operation:mean6h
+        expr: rate(tale_long_operation_seconds_sum[6h]) / rate(tale_long_operation_seconds_count[6h])
+        labels:
+          sla: long_operation
+  - name: tale-sla-alerts
+    rules:
+      - alert: TaleSlaDialogTtftBreached
+        expr: tale_sla_dialog_ttft:mean30m > 1
+        for: 15m
+        labels:
+          severity: warn
+          sla: dialog_ttft
+        annotations:
+          summary: 'Dialog input response time: mean response time over 30m exceeds the 1s SLA'
+          description: Mean time-to-first-token for an interactive chat / dialog turn.
+      - alert: TaleSlaLongOperationBreached
+        expr: tale_sla_long_operation:mean6h > 40
+        for: 30m
+        labels:
+          severity: warn
+          sla: long_operation
+        annotations:
+          summary: 'Long operation response time: mean response time over 6h exceeds the 40s SLA'
+          description: Mean end-to-end time for long-running operations such as evaluations.
+```
+
+Ein Breach hier ist ein **warn**, kein page: ein driftender Durchschnitt ist eine Degradation, die zu Geschäftszeiten zu verfolgen ist, und die `for:`-Fenster warten bewusst eine kurze Spitze aus, bevor sie feuern. Das ~1-s-Dialog-Budget versöhnt sich mit dem lockereren ~3-s-Warm-Time-to-First-Token im manuellen Performance-Plan — jene ~3 s sind eine Obergrenze pro Anfrage für ein einzelnes kaltes, Auto-geroutetes erstes Token inklusive Modell- und Netzwerk-Zeit, während die ~1 s hier der Steady-State-Mittelwert über Dialog-Turns ist, sodass gelegentliche erste Tokens, die die Obergrenze erreichen, mit einem Sub-Sekunden-Mittelwert vereinbar sind. Den 1-s-Mittelwert auf Live-Anbietern zu halten, kann noch die Backend-Overhead-Optimierung brauchen, die im Feature-Issue verfolgt wird; dieser Alert bestätigt, ob das Ziel erreicht ist.
+
 ## Wo das hingehört
 
 Die Signale oben sind die proaktive Seite des Betreibens einer Tale-Instanz; die reaktive Seite ist [Troubleshooting](/de/self-hosted/operate/observability/troubleshooting), und die Konfiguration, die die Metriken in Prometheus bekommt, ist [Observability-Konfiguration](/de/self-hosted/configuration/observability-config). Hast du `METRICS_BEARER_TOKEN` noch nicht gesetzt, ist jede Schwelle oben unbeobachtet — fang dort an.
diff --git a/docs/en/self-hosted/configuration/observability-config.md b/docs/en/self-hosted/configuration/observability-config.md
index 0866e57737..d5bf716bf7 100644
--- a/docs/en/self-hosted/configuration/observability-config.md
+++ b/docs/en/self-hosted/configuration/observability-config.md
@@ -19,14 +19,15 @@ Tale does not ship a log shipper. The driver swap is the supported integration p
 
 ## Metrics
 
-The Caddy proxy exposes two metrics paths gated by a single bearer token:
+The Caddy proxy exposes three metrics paths gated by a single bearer token:
 
-| Path | Source | What's inside |
-| ------------------- | --------------- | ----------------------------------------------------------- |
-| `/metrics/platform` | `tale-platform` | HTTP latency, route counters, Node process metrics |
-| `/metrics/convex` | `tale-convex` | 261 built-in Convex metrics, plus the RAG and crawl timings |
+| Path | Source | What's inside |
+| -------------------- | --------------- | ----------------------------------------------------------------------------------- |
+| `/metrics/platform` | `tale-platform` | HTTP latency, route counters, Node process metrics, response-time SLA target gauges |
+| `/metrics/convex` | `tale-convex` | 261 built-in Convex metrics, plus the RAG and crawl timings |
+| `/metrics/sla-rules` | `tale-platform` | Generated Prometheus recording + alerting rules for the response-time SLAs |
 
-Knowledge work (RAG search, document ingestion, web crawling) runs inside the Convex backend now, so its timings ride the `/metrics/convex` series rather than a separate endpoint. Set `METRICS_BEARER_TOKEN` in `.env` to enable the two endpoints; leave it unset to keep them returning 401 to every request. Anything other than the two listed paths returns 401 too, so a misrouted scraper does not accidentally see the platform's internal health endpoints.
+Knowledge work (RAG search, document ingestion, web crawling) runs inside the Convex backend now, so its timings ride the `/metrics/convex` series rather than a separate endpoint. Set `METRICS_BEARER_TOKEN` in `.env` to enable these endpoints; leave it unset to keep them returning 401 to every request. The `/metrics/sla-rules` path is a read-only YAML rules file you load into Prometheus, not a scrape target — the thresholds it carries are documented in [Operations](/self-hosted/operate/observability/operations). Anything other than the listed paths returns 401 too, so a misrouted scraper does not accidentally see the platform's internal health endpoints.
 
 A working Prometheus scrape stanza:
 
diff --git a/docs/en/self-hosted/operate/observability/operations.md b/docs/en/self-hosted/operate/observability/operations.md
index 2c81910c3a..77712bd54f 100644
--- a/docs/en/self-hosted/operate/observability/operations.md
+++ b/docs/en/self-hosted/operate/observability/operations.md
@@ -49,6 +49,53 @@ When a page lands, the first five minutes follow the same shape every time.
 
 A `tale-knowledge-db` outage is a warn, not a page. The web-crawl schedule absorbs hours of downtime without user impact, and document ingestion retries rather than dropping work — uploads sit in "indexing" until the corpus database is back. Knowledge search returns empty in the meantime, but chats that do not retrieve knowledge keep working. Catch this in the warn band and fix it in business hours.
 
+## Response-time SLAs
+
+Two response-time budgets are tracked as first-class signals: interactive dialog input and long-running operations such as evaluations. Both are verified as a **mean** over a rolling window — the contractual figure is an average, not a per-request ceiling — and both are wired so Prometheus alerts once the average has stayed past budget for the rule's `for:` duration.
+
+| Budget | Statistic | Target | Window | Underlying series |
+| -------------- | --------- | ------ | ------ | ----------------------------- |
+| Dialog input | mean | ~1 s | 30 m | `tale_dialog_ttft_seconds` |
+| Long operation | mean | ~40 s | 6 h | `tale_long_operation_seconds` |
+
+Each target also rides the platform metrics endpoint as `tale_sla_target_seconds{sla,statistic}`, so a Grafana panel draws the budget line straight from Prometheus instead of hard-coding it. The underlying latency series are the Convex function-execution histograms on `/metrics/convex`; relabel or record them to the names above so the rules resolve. The platform serves the ready-made recording and alerting rules at `/metrics/sla-rules` (behind the same bearer token as the other metrics paths) — fetch it once and reference the file under `rule_files:`, or paste the equivalent:
+
+```yaml
+groups:
+  - name: tale-sla-recording
+    rules:
+      - record: tale_sla_dialog_ttft:mean30m
+        expr: rate(tale_dialog_ttft_seconds_sum[30m]) / rate(tale_dialog_ttft_seconds_count[30m])
+        labels:
+          sla: dialog_ttft
+      - record: tale_sla_long_operation:mean6h
+        expr: rate(tale_long_operation_seconds_sum[6h]) / rate(tale_long_operation_seconds_count[6h])
+        labels:
+          sla: long_operation
+  - name: tale-sla-alerts
+    rules:
+      - alert: TaleSlaDialogTtftBreached
+        expr: tale_sla_dialog_ttft:mean30m > 1
+        for: 15m
+        labels:
+          severity: warn
+          sla: dialog_ttft
+        annotations:
+          summary: 'Dialog input response time: mean response time over 30m exceeds the 1s SLA'
+          description: Mean time-to-first-token for an interactive chat / dialog turn.
+      - alert: TaleSlaLongOperationBreached
+        expr: tale_sla_long_operation:mean6h > 40
+        for: 30m
+        labels:
+          severity: warn
+          sla: long_operation
+        annotations:
+          summary: 'Long operation response time: mean response time over 6h exceeds the 40s SLA'
+          description: Mean end-to-end time for long-running operations such as evaluations.
+```
+
+A breach here is a **warn**, not a page: a drifting average is a degradation to chase in business hours, and the `for:` windows deliberately wait out a short spike before firing. The ~1 s dialog budget reconciles with the looser ~3 s warm time-to-first-token in the manual performance plan — that ~3 s is a per-request ceiling for a single cold, Auto-routed first token including model and network time, whereas the ~1 s here is the steady-state mean across dialog turns, so occasional first tokens reaching the ceiling are consistent with a sub-second mean. Holding the 1 s mean on live providers may still need the backend-overhead optimization tracked on the feature issue; this alert is what confirms whether the target is met.
+
 ## Where this fits
 
 The signals above are the proactive side of operating a Tale instance; the reactive side is [Troubleshooting](/self-hosted/operate/observability/troubleshooting), and the configuration that gets the metrics into Prometheus is [Observability config](/self-hosted/configuration/observability-config). If you have not yet set `METRICS_BEARER_TOKEN`, every threshold above is unmonitored — start there.
diff --git a/docs/fr/self-hosted/configuration/observability-config.md b/docs/fr/self-hosted/configuration/observability-config.md
index f9a527337f..f3beed0f68 100644
--- a/docs/fr/self-hosted/configuration/observability-config.md
+++ b/docs/fr/self-hosted/configuration/observability-config.md
@@ -19,14 +19,15 @@ Tale ne ship pas de log shipper. L'échange de driver est le point d'intégratio
 
 ## Métriques
 
-Le proxy Caddy expose deux chemins de métriques derrière un seul bearer token :
+Le proxy Caddy expose trois chemins de métriques derrière un seul bearer token :
 
-| Chemin | Source | Ce qui est dedans |
-| ------------------- | --------------- | ---------------------------------------------------------------- |
-| `/metrics/platform` | `tale-platform` | Latence HTTP, compteurs de routes, métriques de processus Node |
-| `/metrics/convex` | `tale-convex` | 261 métriques Convex intégrées, plus les timings RAG et de crawl |
+| Chemin | Source | Ce qui est dedans |
+| -------------------- | --------------- | ------------------------------------------------------------------------------------------------------- |
+| `/metrics/platform` | `tale-platform` | Latence HTTP, compteurs de routes, métriques de processus Node, gauges de cible SLA de temps de réponse |
+| `/metrics/convex` | `tale-convex` | 261 métriques Convex intégrées, plus les timings RAG et de crawl |
+| `/metrics/sla-rules` | `tale-platform` | Rules Prometheus de recording + alerting générées pour les SLA de temps de réponse |
 
-Le travail de connaissances (recherche RAG, ingestion de documents, crawling web) tourne désormais dans le backend Convex, donc ses timings empruntent la série `/metrics/convex` plutôt qu'un endpoint séparé. Mets `METRICS_BEARER_TOKEN` dans `.env` pour activer les deux endpoints ; laisse-le non défini pour qu'ils retournent 401 à chaque requête. Tout sauf les deux chemins listés retourne aussi 401, donc un scraper mal routé ne voit pas accidentellement les endpoints de santé internes de la plateforme.
+Le travail de connaissances (recherche RAG, ingestion de documents, crawling web) tourne désormais dans le backend Convex, donc ses timings empruntent la série `/metrics/convex` plutôt qu'un endpoint séparé. Mets `METRICS_BEARER_TOKEN` dans `.env` pour activer ces endpoints ; laisse-le non défini pour qu'ils retournent 401 à chaque requête. Le chemin `/metrics/sla-rules` est un fichier YAML de rules en lecture seule que tu charges dans Prometheus, pas une cible de scrape — les seuils qu'il porte sont documentés dans [Opérations](/fr/self-hosted/operate/observability/operations). Tout sauf les chemins listés retourne aussi 401, donc un scraper mal routé ne voit pas accidentellement les endpoints de santé internes de la plateforme.
 
 Une stanza de scrape Prometheus qui marche :
 
diff --git a/docs/fr/self-hosted/operate/observability/operations.md b/docs/fr/self-hosted/operate/observability/operations.md
index 2cf11883bc..de43c461df 100644
--- a/docs/fr/self-hosted/operate/observability/operations.md
+++ b/docs/fr/self-hosted/operate/observability/operations.md
@@ -49,6 +49,53 @@ Quand une page atterrit, les cinq premières minutes suivent la même forme à c
 
 Une panne de `tale-knowledge-db` est un warn, pas un page. Le planning du crawl web absorbe des heures de downtime sans impact utilisateur, et l'ingestion de documents retente plutôt que de jeter le travail — les téléversements restent en « indexation » jusqu'au retour de la base du corpus. La recherche de connaissances renvoie vide entre-temps, mais les chats qui ne récupèrent pas de connaissances continuent de marcher. Attrape ça dans la bande warn et corrige-le pendant les heures de bureau.
 
+## SLA de temps de réponse
+
+Deux budgets de temps de réponse sont suivis comme signaux de premier ordre : la saisie de dialogue interactive et les opérations longues comme les évaluations. Les deux sont vérifiés comme une **moyenne** sur une fenêtre glissante — le chiffre contractuel est une moyenne, pas un plafond par requête — et les deux sont câblés pour que Prometheus alerte dès que la moyenne reste au-delà du budget pendant toute la durée du `for:`.
+
+| Budget | Statistique | Cible | Fenêtre | Série sous-jacente |
+| ---------------- | ----------- | ----- | ------- | ----------------------------- |
+| Saisie dialogue | moyenne | ~1 s | 30 min | `tale_dialog_ttft_seconds` |
+| Opération longue | moyenne | ~40 s | 6 h | `tale_long_operation_seconds` |
+
+Chaque cible chevauche aussi l'endpoint de métriques de la plateforme sous `tale_sla_target_seconds{sla,statistic}`, pour qu'un panel Grafana trace la ligne de budget directement depuis Prometheus au lieu de la coder en dur. Les séries de latence sous-jacentes sont les histogrammes d'exécution de fonction Convex sur `/metrics/convex` ; relabel ou record-les vers les noms ci-dessus pour que les rules se résolvent. La plateforme sert les rules de recording et d'alerting prêtes à l'emploi sous `/metrics/sla-rules` (derrière le même bearer token que les autres chemins de métriques) — récupère-le une fois et référence le fichier sous `rule_files:`, ou colle l'équivalent :
+
+```yaml
+groups:
+  - name: tale-sla-recording
+    rules:
+      - record: tale_sla_dialog_ttft:mean30m
+        expr: rate(tale_dialog_ttft_seconds_sum[30m]) / rate(tale_dialog_ttft_seconds_count[30m])
+        labels:
+          sla: dialog_ttft
+      - record: tale_sla_long_operation:mean6h
+        expr: rate(tale_long_operation_seconds_sum[6h]) / rate(tale_long_operation_seconds_count[6h])
+        labels:
+          sla: long_operation
+  - name: tale-sla-alerts
+    rules:
+      - alert: TaleSlaDialogTtftBreached
+        expr: tale_sla_dialog_ttft:mean30m > 1
+        for: 15m
+        labels:
+          severity: warn
+          sla: dialog_ttft
+        annotations:
+          summary: 'Dialog input response time: mean response time over 30m exceeds the 1s SLA'
+          description: Mean time-to-first-token for an interactive chat / dialog turn.
+      - alert: TaleSlaLongOperationBreached
+        expr: tale_sla_long_operation:mean6h > 40
+        for: 30m
+        labels:
+          severity: warn
+          sla: long_operation
+        annotations:
+          summary: 'Long operation response time: mean response time over 6h exceeds the 40s SLA'
+          description: Mean end-to-end time for long-running operations such as evaluations.
+```
+
+Un breach ici est un **warn**, pas un page : une moyenne qui dérive est une dégradation à traiter pendant les heures de bureau, et les fenêtres `for:` attendent délibérément qu'un pic court s'estompe avant de déclencher. Le budget dialogue de ~1 s se réconcilie avec le time-to-first-token chaud plus lâche de ~3 s du plan de performance manuel — ces ~3 s sont un plafond par requête pour un seul premier token froid, routé en Auto, temps modèle et réseau inclus, alors que les ~1 s ici sont la moyenne en régime permanent sur les tours de dialogue, donc des premiers tokens atteignant occasionnellement le plafond restent compatibles avec une moyenne sous la seconde. Tenir la moyenne de 1 s sur des fournisseurs live peut encore exiger l'optimisation du surcoût backend suivie sur l'issue de fonctionnalité ; cette alerte est ce qui confirme si la cible est atteinte.
+
 ## Où cela s'inscrit
 
 Les signaux ci-dessus sont le côté proactif d'opérer une instance Tale ; le côté réactif est [Dépannage](/fr/self-hosted/operate/observability/troubleshooting), et la configuration qui fait passer les métriques dans Prometheus est [Configuration de l'observabilité](/fr/self-hosted/configuration/observability-config). Si tu n'as pas encore réglé `METRICS_BEARER_TOKEN`, chaque seuil ci-dessus est non surveillé — commence par là.
diff --git a/services/platform/server.ts b/services/platform/server.ts
index 196909e450..dc822899c8 100644
--- a/services/platform/server.ts
+++ b/services/platform/server.ts
@@ -25,6 +25,7 @@ import {
   WEBDAV_HMAC_KEY_MIN_LENGTH,
 } from './lib/webdav/hmac-key';
 import { WEBDAV_METHODS } from './lib/webdav/types';
+import { slaRulesResponse } from './sla-targets';
 import {
   buildStatusFeed,
   probeServices,
@@ -598,6 +599,12 @@ export function createApp(env: EnvConfig = getEnvConfig()): Hono {
     convexMetricsResponse(c.req.query('format') ?? null),
   );
 
+  // Generated Prometheus recording + alerting rules for the response-time
+  // SLAs, derived from the canonical targets in `sla-targets.ts`. Operators
+  // load these instead of hand-copying thresholds; the rule expressions track
+  // the same budgets exposed as `tale_sla_target_seconds` on `/metrics`.
+  app.get('/metrics/sla-rules', () => slaRulesResponse());
+
   // Branding images. Defense-in-depth: filename is already locked
   // down (no `/`, no `..`), but the prefix check uses `path.sep` so a
   // future sibling dir like `imagesXYZ/` cannot prefix-match via
diff --git a/services/platform/sla-targets.test.ts b/services/platform/sla-targets.test.ts
new file mode 100644
index 0000000000..deb4561bff
--- /dev/null
+++ b/services/platform/sla-targets.test.ts
@@ -0,0 +1,134 @@
+import * as client from 'prom-client';
+import { afterEach, describe, expect, test } from 'vitest';
+import { parse } from 'yaml';
+
+import {
+  registerSlaTargetMetrics,
+  renderSlaPrometheusRules,
+  slaAlertName,
+  slaRecordingRuleName,
+  slaRulesResponse,
+  SLA_TARGETS,
+  type SlaTarget,
+} from './sla-targets';
+
+afterEach(() => {
+  client.register.clear();
+});
+
+describe('SLA_TARGETS', () => {
+  test('covers the dialog (1s) and long-operation (40s) budgets', () => {
+    const byId = new Map(SLA_TARGETS.map((t) => [t.id, t]));
+    expect(byId.get('dialog_ttft')?.targetSeconds).toBe(1);
+    expect(byId.get('long_operation')?.targetSeconds).toBe(40);
+  });
+
+  test('every target is well-formed', () => {
+    for (const t of SLA_TARGETS) {
+      expect(t.targetSeconds).toBeGreaterThan(0);
+      expect(t.metric).toMatch(/^[a-z][a-z0-9_]*$/);
+      expect(['mean', 'p95', 'p99']).toContain(t.statistic);
+    }
+  });
+
+  test('ids are unique', () => {
+    const ids = SLA_TARGETS.map((t) => t.id);
+    expect(new Set(ids).size).toBe(ids.length);
+  });
+});
+
+describe('renderSlaPrometheusRules', () => {
+  test('emits a parseable rules document with recording + alert groups', () => {
+    const doc = parse(renderSlaPrometheusRules()) as {
+      groups: { name: string; rules: Record<string, unknown>[] }[];
+    };
+    const names = doc.groups.map((g) => g.name);
+    expect(names).toContain('tale-sla-recording');
+    expect(names).toContain('tale-sla-alerts');
+  });
+
+  test('mean targets aggregate via rate(sum)/rate(count) and alert on the budget', () => {
+    const doc = parse(renderSlaPrometheusRules()) as {
+      groups: { name: string; rules: Record<string, unknown>[] }[];
+    };
+    const recording = doc.groups.find((g) => g.name === 'tale-sla-recording');
+    const alerts = doc.groups.find((g) => g.name === 'tale-sla-alerts');
+
+    const dialog = SLA_TARGETS.find((t) => t.id === 'dialog_ttft') as SlaTarget;
+    const recordRule = recording?.rules.find(
+      (r) => r.record === slaRecordingRuleName(dialog),
+    );
+    expect(recordRule?.expr).toBe(
+      'rate(tale_dialog_ttft_seconds_sum[30m]) / rate(tale_dialog_ttft_seconds_count[30m])',
+    );
+
+    const alertRule = alerts?.rules.find(
+      (r) => r.alert === slaAlertName(dialog),
+    );
+    expect(alertRule?.alert).toBe('TaleSlaDialogTtftBreached');
+    expect(alertRule?.expr).toBe(`${slaRecordingRuleName(dialog)} > 1`);
+    expect(alertRule?.for).toBe('15m');
+  });
+
+  test('percentile targets aggregate via histogram_quantile (edge case)', () => {
+    const p95: SlaTarget = {
+      id: 'p95_case',
+      title: 'P95 case',
+      description: 'A percentile budget.',
+      statistic: 'p95',
+      targetSeconds: 2,
+      window: '5m',
+      alertFor: '10m',
+      severity: 'page',
+      metric: 'tale_demo_seconds',
+    };
+    const doc = parse(renderSlaPrometheusRules([p95])) as {
+      groups: { name: string; rules: Record<string, unknown>[] }[];
+    };
+    const recordRule = doc.groups[0].rules[0];
+    expect(recordRule.expr).toBe(
+      'histogram_quantile(0.95, sum by (le) (rate(tale_demo_seconds_bucket[5m])))',
+    );
+  });
+
+  test('empty target list yields an empty document (edge case)', () => {
+    const doc = parse(renderSlaPrometheusRules([])) as { groups: unknown[] };
+    expect(doc.groups).toEqual([]);
+  });
+
+  test('duplicate ids are rejected (error case)', () => {
+    const dup = SLA_TARGETS[0];
+    expect(() => renderSlaPrometheusRules([dup, dup])).toThrow(/Duplicate/);
+  });
+});
+
+describe('registerSlaTargetMetrics', () => {
+  test('exposes tale_sla_target_seconds with the budget values', async () => {
+    registerSlaTargetMetrics();
+    const body = await client.register.metrics();
+    expect(body).toContain('tale_sla_target_seconds');
+    expect(body).toMatch(
+      /tale_sla_target_seconds\{sla="dialog_ttft",statistic="mean"\} 1\b/,
+    );
+    expect(body).toMatch(
+      /tale_sla_target_seconds\{sla="long_operation",statistic="mean"\} 40\b/,
+    );
+  });
+
+  test('rejects duplicate ids (error case)', () => {
+    const dup = SLA_TARGETS[0];
+    expect(() => registerSlaTargetMetrics(client.register, [dup, dup])).toThrow(
+      /Duplicate/,
+    );
+  });
+});
+
+describe('slaRulesResponse', () => {
+  test('serves the rules as YAML', async () => {
+    const res = slaRulesResponse();
+    expect(res.status).toBe(200);
+    expect(res.headers.get('Content-Type')).toContain('yaml');
+    const body = await res.text();
+    expect(body).toContain('tale-sla-alerts');
+  });
+});
diff --git a/services/platform/sla-targets.ts b/services/platform/sla-targets.ts
new file mode 100644
index 0000000000..753767a979
--- /dev/null
+++ b/services/platform/sla-targets.ts
@@ -0,0 +1,193 @@
+/**
+ * Response-time SLA targets — the single source of truth.
+ *
+ * Tale carries two contractual response-time budgets:
+ *
+ *   dialog_ttft      interactive chat / dialog input   — mean ~1 s
+ *   long_operation   long-running work (evaluations)   — mean ~40 s
+ *
+ * The measurement primitives already exist (Convex function-execution
+ * histograms on `/metrics/convex`, TTFT metadata, cold-load tracing). What was
+ * missing — and what this module adds — is the SLA layer on top of them:
+ *
+ *   1. The targets themselves, defined once here so code, dashboards, alert
+ *      rules and docs cannot drift.
+ *   2. `registerSlaTargetMetrics` — exposes each target as a
+ *      `tale_sla_target_seconds` gauge on `/metrics` (via `/metrics/platform`),
+ *      so a Grafana panel can draw the budget line straight from Prometheus
+ *      instead of hard-coding it.
+ *   3. `renderSlaPrometheusRules` / `slaRulesResponse` — generate the
+ *      recording + alerting rules that aggregate the underlying latency
+ *      histogram into the chosen statistic and page/warn when it breaches the
+ *      budget. Served read-only at `/metrics/sla-rules` so operators load the
+ *      ready-made rules instead of hand-copying thresholds.
+ *
+ * The `metric` field names the latency histogram each budget is measured
+ * against. It defaults to a `tale_*_seconds` series the operator produces from
+ * their backend's Convex function-execution histogram (a relabel/recording
+ * step documented in the observability guide), so the SLA aggregation stays
+ * correct regardless of the exact built-in series a given Convex version emits.
+ */
+
+import * as client from 'prom-client';
+import { stringify } from 'yaml';
+
+/** Aggregation a target is verified against. */
+export type SlaStatistic = 'mean' | 'p95' | 'p99';
+
+export interface SlaTarget {
+  /** Stable id; the `sla` metric label and the alert-name stem. */
+  id: string;
+  /** Human-readable title for dashboards and docs. */
+  title: string;
+  /** What the budget covers. */
+  description: string;
+  /** The statistic the target is measured against. */
+  statistic: SlaStatistic;
+  /** Budget in seconds — the statistic must stay at or below this. */
+  targetSeconds: number;
+  /** Rolling window the statistic is computed over (PromQL duration). */
+  window: string;
+  /** How long a breach must persist before the alert fires (PromQL duration). */
+  alertFor: string;
+  /** Alert severity label. */
+  severity: 'page' | 'warn';
+  /**
+   * Latency histogram base name (without `_bucket`/`_sum`/`_count`) carrying
+   * this operation's timings. Operators map it to their backend series per the
+   * observability docs.
+   */
+  metric: string;
+}
+
+export const SLA_TARGETS: readonly SlaTarget[] = [
+  {
+    id: 'dialog_ttft',
+    title: 'Dialog input response time',
+    description:
+      'Mean time-to-first-token for an interactive chat / dialog turn.',
+    statistic: 'mean',
+    targetSeconds: 1,
+    window: '30m',
+    alertFor: '15m',
+    severity: 'warn',
+    metric: 'tale_dialog_ttft_seconds',
+  },
+  {
+    id: 'long_operation',
+    title: 'Long operation response time',
+    description:
+      'Mean end-to-end time for long-running operations such as evaluations.',
+    statistic: 'mean',
+    targetSeconds: 40,
+    window: '6h',
+    alertFor: '30m',
+    severity: 'warn',
+    metric: 'tale_long_operation_seconds',
+  },
+];
+
+const SLA_TARGET_METRIC = 'tale_sla_target_seconds';
+
+/** Throw on duplicate ids so two budgets can never share a series/alert name. */
+function assertUniqueIds(targets: readonly SlaTarget[]): void {
+  const seen = new Set();
+  for (const t of targets) {
+    if (seen.has(t.id)) {
+      throw new Error(`Duplicate SLA target id: ${t.id}`);
+    }
+    seen.add(t.id);
+  }
+}
+
+/** Recording-rule name for a target's aggregated statistic. */
+export function slaRecordingRuleName(target: SlaTarget): string {
+  return `tale_sla_${target.id}:${target.statistic}${target.window}`;
+}
+
+/** Alert name for a target, e.g. `TaleSlaDialogTtftBreached`. */
+export function slaAlertName(target: SlaTarget): string {
+  const camel = target.id
+    .split('_')
+    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
+    .join('');
+  return `TaleSla${camel}Breached`;
+}
+
+/** PromQL that aggregates a target's histogram into its chosen statistic. */
+function statisticExpr(target: SlaTarget): string {
+  const { metric, window, statistic } = target;
+  if (statistic === 'mean') {
+    return `rate(${metric}_sum[${window}]) / rate(${metric}_count[${window}])`;
+  }
+  const quantile = statistic === 'p95' ? 0.95 : 0.99;
+  return `histogram_quantile(${quantile}, sum by (le) (rate(${metric}_bucket[${window}])))`;
+}
+
+/**
+ * Render the Prometheus recording + alerting rules for the given targets as a
+ * `rule_files`-ready YAML document. A recording rule materialises each budget's
+ * statistic; an alert fires when it stays above the target for `alertFor`.
+ */
+export function renderSlaPrometheusRules(
+  targets: readonly SlaTarget[] = SLA_TARGETS,
+): string {
+  assertUniqueIds(targets);
+
+  const recordingRules = targets.map((t) => ({
+    record: slaRecordingRuleName(t),
+    expr: statisticExpr(t),
+    labels: { sla: t.id },
+  }));
+
+  const alertRules = targets.map((t) => ({
+    alert: slaAlertName(t),
+    expr: `${slaRecordingRuleName(t)} > ${t.targetSeconds}`,
+    for: t.alertFor,
+    labels: { severity: t.severity, sla: t.id },
+    annotations: {
+      summary: `${t.title}: ${t.statistic} response time over ${t.window} exceeds the ${t.targetSeconds}s SLA`,
+      description: t.description,
+    },
+  }));
+
+  const groups: Record<string, unknown>[] = [];
+  if (recordingRules.length > 0) {
+    groups.push({ name: 'tale-sla-recording', rules: recordingRules });
+    groups.push({ name: 'tale-sla-alerts', rules: alertRules });
+  }
+
+  // `lineWidth: 0` keeps PromQL expressions on a single line — folded scalars
+  // are valid YAML but harder to read and copy.
+  return stringify({ groups }, { lineWidth: 0 });
+}
+
+/**
+ * Register a `tale_sla_target_seconds` gauge — one sample per target — so the
+ * contractual budget is queryable in Prometheus and drawable as a threshold
+ * line on latency dashboards.
+ */
+export function registerSlaTargetMetrics(
+  registry: client.Registry = client.register,
+  targets: readonly SlaTarget[] = SLA_TARGETS,
+): void {
+  assertUniqueIds(targets);
+
+  const gauge = new client.Gauge({
+    name: SLA_TARGET_METRIC,
+    help: 'Response-time SLA target in seconds, by operation and statistic.',
+    labelNames: ['sla', 'statistic'],
+    registers: [registry],
+  });
+
+  for (const t of targets) {
+    gauge.set({ sla: t.id, statistic: t.statistic }, t.targetSeconds);
+  }
+}
+
+/** Serve the generated SLA rules as YAML at `/metrics/sla-rules`. */
+export function slaRulesResponse(): Response {
+  return new Response(renderSlaPrometheusRules(), {
+    headers: { 'Content-Type': 'application/yaml; charset=utf-8' },
+  });
+}
diff --git a/services/platform/telemetry.test.ts b/services/platform/telemetry.test.ts
index 75eff99af1..f935d26ca5 100644
--- a/services/platform/telemetry.test.ts
+++ b/services/platform/telemetry.test.ts
@@ -39,6 +39,14 @@ describe('metricsResponse', () => {
       body.includes('nodejs_');
     expect(hasProcessMetrics).toBe(true);
   });
+
+  test('body exposes the response-time SLA target gauges', async () => {
+    initTelemetry();
+    const response = await metricsResponse();
+    const body = await response.text();
+
+    expect(body).toContain('tale_sla_target_seconds');
+  });
 });
 
 describe('shutdownTelemetry', () => {
diff --git a/services/platform/telemetry.ts b/services/platform/telemetry.ts
index 54505ad7ef..eb7accb0c8 100644
--- a/services/platform/telemetry.ts
+++ b/services/platform/telemetry.ts
@@ -1,8 +1,9 @@
 /**
  * Prometheus metrics for Tale Platform (Bun static server).
  *
- * Collects process-level metrics (CPU, memory, event loop, GC)
- * and exposes them at GET /metrics in Prometheus text format.
+ * Collects process-level metrics (CPU, memory, event loop, GC) plus the
+ * response-time SLA target gauges, and exposes them at GET /metrics in
+ * Prometheus text format.
  *
  * HTTP request metrics are not included because this server
  * only serves static files — the real backend is Convex.
@@ -10,11 +11,14 @@
 
 import * as client from 'prom-client';
 
+import { registerSlaTargetMetrics } from './sla-targets';
+
 let initialized = false;
 
 export function initTelemetry() {
   if (initialized) return;
   client.collectDefaultMetrics();
+  registerSlaTargetMetrics();
   initialized = true;
 }
 
diff --git a/services/platform/tests/manual/performance.md b/services/platform/tests/manual/performance.md
index aaad161b6d..9de8c84452 100644
--- a/services/platform/tests/manual/performance.md
+++ b/services/platform/tests/manual/performance.md
@@ -28,15 +28,27 @@ and module cache are primed.
 
 ## Checks
 
-| ID | Metric | How | Target |
-| --- | ----------------------- | --------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
-| P1 | Cold load → first paint | Clear cache, hard-reload `/dashboard/{org}` | Shell paints during the auth handshake; usable < 3 s warm provider |
-| P2 | Chat TTFT | Send a simple prompt, time to first token | < 3 s warm (live); ~150 ms (mock). Note: real Auto routing is slower locally |
-| P3 | Thread switch | Open another history thread | Renders < 1 s |
-| P4 | Warm transition | Hover a nav target, then click | Near-instant (row-hover + loader prefetch primes it) |
-| P5 | List pagination | Page through a DataTable | Each page loads < 1 s; no full-page reflow |
-| P6 | Settings save | Save a field, await persistence | Round-trip < 2 s |
-| P7 | Auth recovery | Force a transient backend hiccup during boot (e.g. restart Convex mid-load) | The WS recovers and authenticates; no endless skeletons / manual reload required |
+| ID | Metric | How | Target |
+| --- | ----------------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
+| P1 | Cold load → first paint | Clear cache, hard-reload `/dashboard/{org}` | Shell paints during the auth handshake; usable < 3 s warm provider |
+| P2 | Chat TTFT | Send a simple prompt, time to first token | Per-request ceiling < 3 s warm (live); ~150 ms (mock). Steady-state mean tracked against the ~1 s dialog SLA — see note below |
+| P3 | Thread switch | Open another history thread | Renders < 1 s |
+| P4 | Warm transition | Hover a nav target, then click | Near-instant (row-hover + loader prefetch primes it) |
+| P5 | List pagination | Page through a DataTable | Each page loads < 1 s; no full-page reflow |
+| P6 | Settings save | Save a field, await persistence | Round-trip < 2 s |
+| P7 | Auth recovery | Force a transient backend hiccup during boot (e.g. restart Convex mid-load) | The WS recovers and authenticates; no endless skeletons / manual reload required |
+
+## Response-time SLAs
+
+P2 above is the per-request **ceiling** a single warm first token should stay
+under; the contractual budget is a **mean** over many turns. Two SLAs are tracked
+continuously in Prometheus rather than by this manual pass — dialog input at a
+~1 s mean and long operations (e.g. evaluations) at a ~40 s mean. The targets,
+the recording/alerting rules, and the reconciliation of the ~1 s mean with this
+~3 s ceiling live in the operator guide (`docs/*/self-hosted/operate/observability/operations.md`,
+"Response-time SLAs") and are defined once in `services/platform/sla-targets.ts`.
+When tuning P2 here, confirm the change moves the mean the SLA tracks, not just a
+single warm sample.
 
 ## Boundary & error tests
 
diff --git a/services/proxy/Caddyfile b/services/proxy/Caddyfile
index 61eeac184c..4f2e6d29b8 100644
--- a/services/proxy/Caddyfile
+++ b/services/proxy/Caddyfile
@@ -187,7 +187,7 @@
 	# ============================================================================
 
 	@metricsAuth {
-		path /metrics/platform /metrics/convex
+		path /metrics/platform /metrics/convex /metrics/sla-rules
 		expression `"{$METRICS_BEARER_TOKEN:}" != ""`
 		header Authorization "Bearer {$METRICS_BEARER_TOKEN:}"
 	}
@@ -200,6 +200,9 @@
 			rewrite * /metrics/convex
 			reverse_proxy platform:3000
 		}
+		handle /metrics/sla-rules {
+			reverse_proxy platform:3000
+		}
 	}
 
 	# Block all other /metrics requests (no token, wrong token, or unknown service)
From 5d39556d8e6eecc2fe5164483d34a266d14a463e Mon Sep 17 00:00:00 2001
From: tale-agent
Date: Tue, 23 Jun 2026 11:27:32 +0000
Subject: [PATCH 2/3] fix(platform): copy sla-targets.ts into the platform runtime image (#1924)

---
 services/platform/Dockerfile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/services/platform/Dockerfile b/services/platform/Dockerfile
index cad4c882e3..6c1f8f3e12 100644
--- a/services/platform/Dockerfile
+++ b/services/platform/Dockerfile
@@ -111,6 +111,7 @@ WORKDIR /app
 COPY services/platform/server.ts \
     services/platform/telemetry.ts \
     services/platform/convex-metrics.ts \
+    services/platform/sla-targets.ts \
     ./services/platform/
 
 # ============================================================================
From 07608ef252cb3069273ae495a356be86668bc6f7 Mon Sep 17 00:00:00 2001
From: tale-agent
Date: Tue, 23 Jun 2026 11:49:10 +0000
Subject: [PATCH 3/3] fix(platform): copy sla-targets.ts into the runtime stage too (#1924)

---
 services/platform/Dockerfile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/services/platform/Dockerfile b/services/platform/Dockerfile
index 6c1f8f3e12..35054a3bbf 100644
--- a/services/platform/Dockerfile
+++ b/services/platform/Dockerfile
@@ -255,6 +255,7 @@ COPY --from=pruner --chown=app:app \
     /app/services/platform/server.ts \
     /app/services/platform/telemetry.ts \
     /app/services/platform/convex-metrics.ts \
+    /app/services/platform/sla-targets.ts \
     /app/services/platform/status-probe.ts \
     ./
 COPY --from=pruner --chown=app:app /app/services/platform/convex ./convex
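
Review note (illustrative, not part of the patch): a minimal standalone sketch of the names and PromQL that the rule generator in `services/platform/sla-targets.ts` appears to produce for the default `dialog_ttft` target. The `Budget` type and the `recordingRuleName` / `alertName` helpers are hypothetical stand-ins that mirror `slaRecordingRuleName`, `slaAlertName` and `statisticExpr` from the patch; they do not import the module, so treat the output as an expectation to check against the real `/metrics/sla-rules` response.

```typescript
// Hypothetical stand-in for the subset of SlaTarget that drives rule text.
type Statistic = 'mean' | 'p95' | 'p99';

interface Budget {
  id: string;
  statistic: Statistic;
  window: string; // PromQL duration, e.g. '30m'
  metric: string; // histogram base name without _bucket/_sum/_count
}

// Mirrors slaRecordingRuleName: tale_sla_<id>:<statistic><window>.
function recordingRuleName(b: Budget): string {
  return `tale_sla_${b.id}:${b.statistic}${b.window}`;
}

// Mirrors slaAlertName: snake_case id becomes a PascalCase stem.
function alertName(id: string): string {
  const camel = id
    .split('_')
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join('');
  return `TaleSla${camel}Breached`;
}

// Mirrors statisticExpr: mean from _sum/_count, percentiles from _bucket.
function statisticExpr(b: Budget): string {
  if (b.statistic === 'mean') {
    return `rate(${b.metric}_sum[${b.window}]) / rate(${b.metric}_count[${b.window}])`;
  }
  const quantile = b.statistic === 'p95' ? 0.95 : 0.99;
  return `histogram_quantile(${quantile}, sum by (le) (rate(${b.metric}_bucket[${b.window}])))`;
}

const dialog: Budget = {
  id: 'dialog_ttft',
  statistic: 'mean',
  window: '30m',
  metric: 'tale_dialog_ttft_seconds',
};

console.log(recordingRuleName(dialog));
// tale_sla_dialog_ttft:mean30m
console.log(alertName(dialog.id));
// TaleSlaDialogTtftBreached
console.log(statisticExpr(dialog));
// rate(tale_dialog_ttft_seconds_sum[30m]) / rate(tale_dialog_ttft_seconds_count[30m])
console.log(statisticExpr({ ...dialog, statistic: 'p95' }));
// histogram_quantile(0.95, sum by (le) (rate(tale_dialog_ttft_seconds_bucket[30m])))
```

The percentile branch only works when the operator-produced series is a real histogram with `_bucket` samples, whereas the mean branch needs just `_sum` and `_count`. That may be worth stating in the operator guide next to the relabel/recording step.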