tale-project · larryro · Jun 23, 2026 · Jun 23, 2026 · Jun 23, 2026 · coderabbitai
diff --git a/docs/de/self-hosted/configuration/observability-config.md b/docs/de/self-hosted/configuration/observability-config.md
@@ -19,14 +19,15 @@ Tale bringt keinen Log-Shipper mit. Der Driver-Tausch ist der unterstützte Inte
 
 ## Metriken
 
-Der Caddy-Proxy exponiert zwei Metric-Pfade, gegated von einem einzigen Bearer-Token:
+Der Caddy-Proxy exponiert drei Metric-Pfade, gegated von einem einzigen Bearer-Token:
 
-| Pfad                | Quelle          | Was drinsteckt                                                  |
-| ------------------- | --------------- | --------------------------------------------------------------- |
-| `/metrics/platform` | `tale-platform` | HTTP-Latenz, Route-Counter, Node-Prozessmetriken                |
-| `/metrics/convex`   | `tale-convex`   | 261 eingebaute Convex-Metriken, plus die RAG- und Crawl-Timings |
+| Pfad                 | Quelle          | Was drinsteckt                                                                |
+| -------------------- | --------------- | ----------------------------------------------------------------------------- |
+| `/metrics/platform`  | `tale-platform` | HTTP-Latenz, Route-Counter, Node-Prozessmetriken, Antwortzeit-SLA-Ziel-Gauges |
+| `/metrics/convex`    | `tale-convex`   | 261 eingebaute Convex-Metriken, plus die RAG- und Crawl-Timings               |
+| `/metrics/sla-rules` | `tale-platform` | Generierte Prometheus-Recording- + Alerting-Rules für die Antwortzeit-SLAs    |
 
-Wissens-Arbeit (RAG-Suche, Dokument-Ingestion, Web-Crawling) läuft jetzt im Convex-Backend, also reiten ihre Timings auf der `/metrics/convex`-Reihe statt auf einem separaten Endpoint. Setze `METRICS_BEARER_TOKEN` in `.env`, um die zwei Endpoints zu aktivieren; lass es unset, damit sie jeder Anfrage 401 zurückgeben. Alles ausser den zwei gelisteten Pfaden gibt ebenfalls 401 zurück, damit ein fehlgerouteter Scraper die internen Health-Endpoints der Plattform nicht versehentlich sieht.
+Wissens-Arbeit (RAG-Suche, Dokument-Ingestion, Web-Crawling) läuft jetzt im Convex-Backend, also reiten ihre Timings auf der `/metrics/convex`-Reihe statt auf einem separaten Endpoint. Setze `METRICS_BEARER_TOKEN` in `.env`, um diese Endpoints zu aktivieren; lass es unset, damit sie jeder Anfrage 401 zurückgeben. Der `/metrics/sla-rules`-Pfad ist eine schreibgeschützte YAML-Rules-Datei, die du in Prometheus lädst, kein Scrape-Target — die Schwellen darin sind in [Operations](/de/self-hosted/operate/observability/operations) dokumentiert. Alles ausser den gelisteten Pfaden gibt ebenfalls 401 zurück, damit ein fehlgerouteter Scraper die internen Health-Endpoints der Plattform nicht versehentlich sieht.
 
 Eine funktionierende Prometheus-Scrape-Stanza:
 

diff --git a/docs/de/self-hosted/operate/observability/operations.md b/docs/de/self-hosted/operate/observability/operations.md
@@ -49,6 +49,53 @@ Wenn eine Page landet, folgen die ersten fünf Minuten jedes Mal derselben Form.
 
 Ein `tale-knowledge-db`-Ausfall ist ein warn, kein page. Der Web-Crawl-Plan absorbiert Stunden von Downtime ohne Benutzerwirkung, und die Dokument-Ingestion versucht es erneut, statt Arbeit zu verwerfen — Uploads sitzen in „indexing", bis die Korpus-Datenbank zurück ist. Die Wissens-Suche liefert in der Zwischenzeit leer, aber Chats, die kein Wissen abrufen, arbeiten weiter. Fang das im warn-Band und fix es zu Geschäftszeiten.
 
+## Antwortzeit-SLAs
+
+Zwei Antwortzeit-Budgets werden als erstklassige Signale verfolgt: interaktive Dialog-Eingabe und langlaufende Operationen wie Evaluierungen. Beide werden als **Mittelwert** über ein gleitendes Fenster verifiziert (die vertragliche Zahl ist ein Durchschnitt, keine Obergrenze pro Anfrage), und beide sind so verdrahtet, dass Prometheus alarmiert, sobald der Durchschnitt für die `for:`-Dauer der Rule über dem Budget geblieben ist.
+
+| Budget          | Statistik  | Ziel  | Fenster | Zugrundeliegende Serie        |
+| --------------- | ---------- | ----- | ------- | ----------------------------- |
+| Dialog-Eingabe  | Mittelwert | ~1 s  | 30 Min  | `tale_dialog_ttft_seconds`    |
+| Lange Operation | Mittelwert | ~40 s | 6 Std   | `tale_long_operation_seconds` |
+
+Jedes Ziel reitet zudem auf dem Plattform-Metrik-Endpoint als `tale_sla_target_seconds{sla,statistic}`, sodass ein Grafana-Panel die Budget-Linie direkt aus Prometheus zeichnet, statt sie fest zu verdrahten. Die zugrundeliegenden Latenz-Serien sind die Convex-Funktions-Ausführungs-Histogramme auf `/metrics/convex`; relabel oder record sie auf die Namen oben, damit die Rules auflösen. Die Plattform liefert die fertigen Recording- und Alerting-Rules unter `/metrics/sla-rules` (hinter demselben Bearer-Token wie die anderen Metrik-Pfade) — hole sie einmal und referenziere die Datei unter `rule_files:`, oder füge das Äquivalent ein:
+
+```yaml
+groups:
+  - name: tale-sla-recording
+    rules:
+      - record: tale_sla_dialog_ttft:mean30m
+        expr: rate(tale_dialog_ttft_seconds_sum[30m]) / rate(tale_dialog_ttft_seconds_count[30m])
+        labels:
+          sla: dialog_ttft
+      - record: tale_sla_long_operation:mean6h
+        expr: rate(tale_long_operation_seconds_sum[6h]) / rate(tale_long_operation_seconds_count[6h])
+        labels:
+          sla: long_operation
+  - name: tale-sla-alerts
+    rules:
+      - alert: TaleSlaDialogTtftBreached
+        expr: tale_sla_dialog_ttft:mean30m > 1
+        for: 15m
+        labels:
+          severity: warn
+          sla: dialog_ttft
+        annotations:
+          summary: 'Dialog input response time: mean response time over 30m exceeds the 1s SLA'
+          description: Mean time-to-first-token for an interactive chat / dialog turn.
+      - alert: TaleSlaLongOperationBreached
+        expr: tale_sla_long_operation:mean6h > 40
+        for: 30m
+        labels:
+          severity: warn
+          sla: long_operation
+        annotations:
+          summary: 'Long operation response time: mean response time over 6h exceeds the 40s SLA'
+          description: Mean end-to-end time for long-running operations such as evaluations.
+```
+
+Ein Breach hier ist ein **warn**, kein page: ein driftender Durchschnitt ist eine Degradation, die zu Geschäftszeiten zu verfolgen ist, und die `for:`-Fenster warten bewusst eine kurze Spitze aus, bevor sie feuern. Das ~1-s-Dialog-Budget versöhnt sich mit dem lockereren ~3-s-Warm-Time-to-First-Token im manuellen Performance-Plan — jene ~3 s sind eine Obergrenze pro Anfrage für ein einzelnes kaltes, Auto-geroutetes erstes Token inklusive Modell- und Netzwerk-Zeit, während die ~1 s hier der Steady-State-Mittelwert über Dialog-Turns ist, sodass gelegentliche erste Tokens, die die Obergrenze erreichen, mit einem Sub-Sekunden-Mittelwert vereinbar sind. Den 1-s-Mittelwert auf Live-Anbietern zu halten, kann noch die Backend-Overhead-Optimierung brauchen, die im Feature-Issue verfolgt wird; dieser Alert bestätigt, ob das Ziel erreicht ist.
+
 ## Wo das hingehört
 
 Die Signale oben sind die proaktive Seite des Betreibens einer Tale-Instanz; die reaktive Seite ist [Troubleshooting](/de/self-hosted/operate/observability/troubleshooting), und die Konfiguration, die die Metriken in Prometheus bekommt, ist [Observability-Konfiguration](/de/self-hosted/configuration/observability-config). Hast du `METRICS_BEARER_TOKEN` noch nicht gesetzt, ist jede Schwelle oben unbeobachtet — fang dort an.
diff --git a/docs/en/self-hosted/configuration/observability-config.md b/docs/en/self-hosted/configuration/observability-config.md
@@ -19,14 +19,15 @@ Tale does not ship a log shipper. The driver swap is the supported integration p
 
 ## Metrics
 
-The Caddy proxy exposes two metrics paths gated by a single bearer token:
+The Caddy proxy exposes three metrics paths gated by a single bearer token:
 
-| Path                | Source          | What's inside                                               |
-| ------------------- | --------------- | ----------------------------------------------------------- |
-| `/metrics/platform` | `tale-platform` | HTTP latency, route counters, Node process metrics          |
-| `/metrics/convex`   | `tale-convex`   | 261 built-in Convex metrics, plus the RAG and crawl timings |
+| Path                 | Source          | What's inside                                                                       |
+| -------------------- | --------------- | ----------------------------------------------------------------------------------- |
+| `/metrics/platform`  | `tale-platform` | HTTP latency, route counters, Node process metrics, response-time SLA target gauges |
+| `/metrics/convex`    | `tale-convex`   | 261 built-in Convex metrics, plus the RAG and crawl timings                         |
+| `/metrics/sla-rules` | `tale-platform` | Generated Prometheus recording + alerting rules for the response-time SLAs          |
 
-Knowledge work (RAG search, document ingestion, web crawling) runs inside the Convex backend now, so its timings ride the `/metrics/convex` series rather than a separate endpoint. Set `METRICS_BEARER_TOKEN` in `.env` to enable the two endpoints; leave it unset to keep them returning 401 to every request. Anything other than the two listed paths returns 401 too, so a misrouted scraper does not accidentally see the platform's internal health endpoints.
+Knowledge work (RAG search, document ingestion, web crawling) runs inside the Convex backend now, so its timings ride the `/metrics/convex` series rather than a separate endpoint. Set `METRICS_BEARER_TOKEN` in `.env` to enable these endpoints; leave it unset to keep them returning 401 to every request. The `/metrics/sla-rules` path is a read-only YAML rules file you load into Prometheus, not a scrape target — the thresholds it carries are documented in [Operations](/self-hosted/operate/observability/operations). Anything other than the listed paths returns 401 too, so a misrouted scraper does not accidentally see the platform's internal health endpoints.
 
 A working Prometheus scrape stanza:
 

diff --git a/docs/en/self-hosted/operate/observability/operations.md b/docs/en/self-hosted/operate/observability/operations.md
@@ -49,6 +49,53 @@ When a page lands, the first five minutes follow the same shape every time.
 
 A `tale-knowledge-db` outage is a warn, not a page. The web-crawl schedule absorbs hours of downtime without user impact, and document ingestion retries rather than dropping work — uploads sit in "indexing" until the corpus database is back. Knowledge search returns empty in the meantime, but chats that do not retrieve knowledge keep working. Catch this in the warn band and fix it in business hours.
 
+## Response-time SLAs
+
+Two response-time budgets are tracked as first-class signals: interactive dialog input and long-running operations such as evaluations. Both are verified as a **mean** over a rolling window (the contractual figure is an average, not a per-request ceiling), and both are wired so Prometheus alerts once the average has stayed past budget for the rule's `for:` duration.
+
+| Budget         | Statistic | Target | Window | Underlying series             |
+| -------------- | --------- | ------ | ------ | ----------------------------- |
+| Dialog input   | mean      | ~1 s   | 30 m   | `tale_dialog_ttft_seconds`    |
+| Long operation | mean      | ~40 s  | 6 h    | `tale_long_operation_seconds` |
+
+Each target also rides the platform metrics endpoint as `tale_sla_target_seconds{sla,statistic}`, so a Grafana panel draws the budget line straight from Prometheus instead of hard-coding it. The underlying latency series are the Convex function-execution histograms on `/metrics/convex`; relabel or record them to the names above so the rules resolve. The platform serves the ready-made recording and alerting rules at `/metrics/sla-rules` (behind the same bearer token as the other metrics paths) — fetch it once and reference the file under `rule_files:`, or paste the equivalent:
+
+```yaml
+groups:
+  - name: tale-sla-recording
+    rules:
+      - record: tale_sla_dialog_ttft:mean30m
+        expr: rate(tale_dialog_ttft_seconds_sum[30m]) / rate(tale_dialog_ttft_seconds_count[30m])
+        labels:
+          sla: dialog_ttft
+      - record: tale_sla_long_operation:mean6h
+        expr: rate(tale_long_operation_seconds_sum[6h]) / rate(tale_long_operation_seconds_count[6h])
+        labels:
+          sla: long_operation
+  - name: tale-sla-alerts
+    rules:
+      - alert: TaleSlaDialogTtftBreached
+        expr: tale_sla_dialog_ttft:mean30m > 1
+        for: 15m
+        labels:
+          severity: warn
+          sla: dialog_ttft
+        annotations:
+          summary: 'Dialog input response time: mean response time over 30m exceeds the 1s SLA'
+          description: Mean time-to-first-token for an interactive chat / dialog turn.
+      - alert: TaleSlaLongOperationBreached
+        expr: tale_sla_long_operation:mean6h > 40
+        for: 30m
+        labels:
+          severity: warn
+          sla: long_operation
+        annotations:
+          summary: 'Long operation response time: mean response time over 6h exceeds the 40s SLA'
+          description: Mean end-to-end time for long-running operations such as evaluations.
+```
+
+A breach here is a **warn**, not a page: a drifting average is a degradation to chase in business hours, and the `for:` windows deliberately wait out a short spike before firing. The ~1 s dialog budget reconciles with the looser ~3 s warm time-to-first-token in the manual performance plan — that ~3 s is a per-request ceiling for a single cold, Auto-routed first token including model and network time, whereas the ~1 s here is the steady-state mean across dialog turns, so occasional first tokens reaching the ceiling are consistent with a sub-second mean. Holding the 1 s mean on live providers may still need the backend-overhead optimization tracked on the feature issue; this alert is what confirms whether the target is met.
+
 ## Where this fits
 
 The signals above are the proactive side of operating a Tale instance; the reactive side is [Troubleshooting](/self-hosted/operate/observability/troubleshooting), and the configuration that gets the metrics into Prometheus is [Observability config](/self-hosted/configuration/observability-config). If you have not yet set `METRICS_BEARER_TOKEN`, every threshold above is unmonitored — start there.
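
A minimal sketch of the `rule_files:` wiring that the Response-time SLAs section describes, assuming the generated rules were first saved to a local file. The host `tale.example.com` and the file path are illustrative assumptions, not values from this PR; only the `/metrics/sla-rules` path and `METRICS_BEARER_TOKEN` come from the docs above.

```yaml
# prometheus.yml (fragment)
# Assumed one-time fetch, run on the Prometheus host (host and output path are hypothetical):
#   curl -fsS -H "Authorization: Bearer $METRICS_BEARER_TOKEN" \
#     https://tale.example.com/metrics/sla-rules \
#     -o /etc/prometheus/rules/tale-sla.yml
rule_files:
  # Hypothetical location of the fetched rules file.
  - /etc/prometheus/rules/tale-sla.yml
```

Running `promtool check rules /etc/prometheus/rules/tale-sla.yml` before reloading Prometheus should catch a truncated or malformed download.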
diff --git a/docs/fr/self-hosted/configuration/observability-config.md b/docs/fr/self-hosted/configuration/observability-config.md
@@ -19,14 +19,15 @@ Tale ne ship pas de log shipper. L'échange de driver est le point d'intégratio
 
 ## Métriques
 
-Le proxy Caddy expose deux chemins de métriques derrière un seul bearer token :
+Le proxy Caddy expose trois chemins de métriques derrière un seul bearer token :
 
-| Chemin              | Source          | Ce qui est dedans                                                |
-| ------------------- | --------------- | ---------------------------------------------------------------- |
-| `/metrics/platform` | `tale-platform` | Latence HTTP, compteurs de routes, métriques de processus Node   |
-| `/metrics/convex`   | `tale-convex`   | 261 métriques Convex intégrées, plus les timings RAG et de crawl |
+| Chemin               | Source          | Ce qui est dedans                                                                                       |
+| -------------------- | --------------- | ------------------------------------------------------------------------------------------------------- |
+| `/metrics/platform`  | `tale-platform` | Latence HTTP, compteurs de routes, métriques de processus Node, gauges de cible SLA de temps de réponse |
+| `/metrics/convex`    | `tale-convex`   | 261 métriques Convex intégrées, plus les timings RAG et de crawl                                        |
+| `/metrics/sla-rules` | `tale-platform` | Rules Prometheus de recording + alerting générées pour les SLA de temps de réponse                      |
 
-Le travail de connaissances (recherche RAG, ingestion de documents, crawling web) tourne désormais dans le backend Convex, donc ses timings empruntent la série `/metrics/convex` plutôt qu'un endpoint séparé. Mets `METRICS_BEARER_TOKEN` dans `.env` pour activer les deux endpoints ; laisse-le non défini pour qu'ils retournent 401 à chaque requête. Tout sauf les deux chemins listés retourne aussi 401, donc un scraper mal routé ne voit pas accidentellement les endpoints de santé internes de la plateforme.
+Le travail de connaissances (recherche RAG, ingestion de documents, crawling web) tourne désormais dans le backend Convex, donc ses timings empruntent la série `/metrics/convex` plutôt qu'un endpoint séparé. Mets `METRICS_BEARER_TOKEN` dans `.env` pour activer ces endpoints ; laisse-le non défini pour qu'ils retournent 401 à chaque requête. Le chemin `/metrics/sla-rules` est un fichier YAML de rules en lecture seule que tu charges dans Prometheus, pas une cible de scrape — les seuils qu'il porte sont documentés dans [Opérations](/fr/self-hosted/operate/observability/operations). Tout sauf les chemins listés retourne aussi 401, donc un scraper mal routé ne voit pas accidentellement les endpoints de santé internes de la plateforme.
 
 Une stanza de scrape Prometheus qui marche :