feat(platform): track and alert on response-time SLAs (#1924)#1939
feat(platform): track and alert on response-time SLAs (#1924)#1939larryro wants to merge 3 commits into
Conversation
Add a single source of truth for the dialog (~1s mean) and long-operation (~40s mean) response-time SLAs in services/platform/sla-targets.ts, and wire the verification layer on top of the existing measurement primitives: - Expose each target as a tale_sla_target_seconds gauge on /metrics so a Grafana panel can draw the budget line straight from Prometheus. - Generate the Prometheus recording + alerting rules from the same targets, served read-only at /metrics/sla-rules (added to the bearer-gated metrics paths in the proxy Caddyfile). - Document the SLAs, the rules, and the reconciliation of the ~1s dialog mean with the looser ~3s warm TTFT ceiling in the observability operator guide (en/de/fr) and the manual performance plan. Tests: unit coverage for the targets, the rule renderer (mean + percentile, empty + duplicate-id edge/error cases), the gauge registration, and the /metrics SLA-gauge regression. i18n messages N/A (no UI strings); migrations N/A.
📝 WalkthroughWalkthroughA new Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Billing warning: we have not been able to collect payment for this subscription for more than 72 hours. Please update the payment method or pay any pending invoices in Billing to avoid service interruption. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/de/self-hosted/configuration/observability-config.md`:
- Line 30: The statement about 401 responses is too broad and implies a
site-wide authentication rule. Clarify the wording of the sentence starting with
"Alles ausser den gelisteten Pfaden gibt ebenfalls 401 zurück" to specifically
indicate that this 401 behavior applies only to unknown `/metrics/*` paths based
on the Caddy proxy matcher in services/proxy/Caddyfile, not to all non-metrics
routes. Ensure the documentation makes clear that non-metrics routes continue to
use their normal handlers without being affected by this authentication rule.
In `@docs/en/self-hosted/configuration/observability-config.md`:
- Line 30: The documentation statement about returning 401 for unlisted paths is
too broad and misleading. Clarify that the 401 response only applies to unknown
paths under the `/metrics/*` endpoint, not to non-metrics routes across the
entire site. Revise the sentence that currently reads "Anything other than the
listed paths returns 401 too..." to explicitly specify that this behavior is
limited to `/metrics` paths, and that non-metrics routes continue to use their
normal handlers as determined by the Caddy proxy matcher.
In `@docs/fr/self-hosted/configuration/observability-config.md`:
- Line 30: The current documentation states that all paths except those listed
return 401, which reads like a site-wide authentication rule. Revise this
sentence to clarify that 401 responses apply only to `/metrics` paths, not the
entire site. Tighten the wording to explicitly state that unknown `/metrics/*`
URLs return 401 while non-metrics routes continue to use their normal handlers,
making it clear this auth restriction is scoped only to the metrics endpoints as
implemented in the Caddy proxy configuration.
In `@services/platform/sla-targets.test.ts`:
- Around line 110-115: The regex patterns in the expect statements for
tale_sla_target_seconds metrics assume a fixed label order (sla before
statistic), which makes the test fragile if the metrics serialization order
changes. Modify these regex patterns to be label-order agnostic by using
patterns that can match the labels regardless of their position. Instead of
expecting a specific label sequence within the curly braces, use regex
alternatives or lookahead assertions that match the metric name and required
labels without depending on their order, ensuring the test passes whether labels
appear as {sla="...",statistic="..."} or {statistic="...",sla="..."}.
- Around line 42-44: The unsafe cast `as { groups: { name: string; rules:
Record<string, unknown>[] }[] }` at the parse boundary bypasses TypeScript
safety and hides malformed YAML shapes from the type system. Create a Zod schema
defining the expected YAML structure with groups containing name and rules
properties, parse the result to unknown first, then validate it using Zod's
parse or safeParse method, and use the validated typed result instead of the
unsafe as cast. Apply this Zod validation pattern to the parse calls at lines
42-44, 51-52, 85-86, and 95, while leaving the internal cast at line 57 as-is
since it operates on already-validated data within the module.
In `@services/platform/sla-targets.ts`:
- Around line 154-158: Remove the explicit type annotation `Record<string,
unknown>[]` from the groups variable declaration on line 154 and allow
TypeScript to infer the concrete type from the object literals being pushed into
the array (those with name and rules properties). This eliminates the use of the
forbidden unknown type and lets the type system automatically determine the
correct shape based on the actual objects you're adding.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: df117a3c-e14f-4b63-b212-9d5976418999
📒 Files selected for processing (14)
docs/de/self-hosted/configuration/observability-config.mddocs/de/self-hosted/operate/observability/operations.mddocs/en/self-hosted/configuration/observability-config.mddocs/en/self-hosted/operate/observability/operations.mddocs/fr/self-hosted/configuration/observability-config.mddocs/fr/self-hosted/operate/observability/operations.mdservices/platform/Dockerfileservices/platform/server.tsservices/platform/sla-targets.test.tsservices/platform/sla-targets.tsservices/platform/telemetry.test.tsservices/platform/telemetry.tsservices/platform/tests/manual/performance.mdservices/proxy/Caddyfile
| | `/metrics/sla-rules` | `tale-platform` | Generierte Prometheus-Recording- + Alerting-Rules für die Antwortzeit-SLAs | | ||
|
|
||
| Wissens-Arbeit (RAG-Suche, Dokument-Ingestion, Web-Crawling) läuft jetzt im Convex-Backend, also reiten ihre Timings auf der `/metrics/convex`-Reihe statt auf einem separaten Endpoint. Setze `METRICS_BEARER_TOKEN` in `.env`, um die zwei Endpoints zu aktivieren; lass es unset, damit sie jeder Anfrage 401 zurückgeben. Alles ausser den zwei gelisteten Pfaden gibt ebenfalls 401 zurück, damit ein fehlgerouteter Scraper die internen Health-Endpoints der Plattform nicht versehentlich sieht. | ||
| Wissens-Arbeit (RAG-Suche, Dokument-Ingestion, Web-Crawling) läuft jetzt im Convex-Backend, also reiten ihre Timings auf der `/metrics/convex`-Reihe statt auf einem separaten Endpoint. Setze `METRICS_BEARER_TOKEN` in `.env`, um diese Endpoints zu aktivieren; lass es unset, damit sie jeder Anfrage 401 zurückgeben. Der `/metrics/sla-rules`-Pfad ist eine schreibgeschützte YAML-Rules-Datei, die du in Prometheus lädst, kein Scrape-Target — die Schwellen darin sind in [Operations](/de/self-hosted/operate/observability/operations) dokumentiert. Alles ausser den gelisteten Pfaden gibt ebenfalls 401 zurück, damit ein fehlgerouteter Scraper die internen Health-Endpoints der Plattform nicht versehentlich sieht. |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Tighten the 401 scope here.
The Caddy proxy only returns 401 for unknown /metrics/* URLs; non-metrics routes still use their normal handlers. Please narrow this wording so it doesn't read like a site-wide auth rule. Based on the Caddy matcher in services/proxy/Caddyfile, this only applies to /metrics paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/de/self-hosted/configuration/observability-config.md` at line 30, The
statement about 401 responses is too broad and implies a site-wide
authentication rule. Clarify the wording of the sentence starting with "Alles
ausser den gelisteten Pfaden gibt ebenfalls 401 zurück" to specifically indicate
that this 401 behavior applies only to unknown `/metrics/*` paths based on the
Caddy proxy matcher in services/proxy/Caddyfile, not to all non-metrics routes.
Ensure the documentation makes clear that non-metrics routes continue to use
their normal handlers without being affected by this authentication rule.
| | `/metrics/sla-rules` | `tale-platform` | Generated Prometheus recording + alerting rules for the response-time SLAs | | ||
|
|
||
| Knowledge work (RAG search, document ingestion, web crawling) runs inside the Convex backend now, so its timings ride the `/metrics/convex` series rather than a separate endpoint. Set `METRICS_BEARER_TOKEN` in `.env` to enable the two endpoints; leave it unset to keep them returning 401 to every request. Anything other than the two listed paths returns 401 too, so a misrouted scraper does not accidentally see the platform's internal health endpoints. | ||
| Knowledge work (RAG search, document ingestion, web crawling) runs inside the Convex backend now, so its timings ride the `/metrics/convex` series rather than a separate endpoint. Set `METRICS_BEARER_TOKEN` in `.env` to enable these endpoints; leave it unset to keep them returning 401 to every request. The `/metrics/sla-rules` path is a read-only YAML rules file you load into Prometheus, not a scrape target — the thresholds it carries are documented in [Operations](/self-hosted/operate/observability/operations). Anything other than the listed paths returns 401 too, so a misrouted scraper does not accidentally see the platform's internal health endpoints. |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Tighten the 401 scope here.
The Caddy proxy only returns 401 for unknown /metrics/* URLs; non-metrics routes still use their normal handlers. Please narrow this wording so it doesn't read like a site-wide auth rule. Based on the Caddy matcher in services/proxy/Caddyfile, this only applies to /metrics paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/en/self-hosted/configuration/observability-config.md` at line 30, The
documentation statement about returning 401 for unlisted paths is too broad and
misleading. Clarify that the 401 response only applies to unknown paths under
the `/metrics/*` endpoint, not to non-metrics routes across the entire site.
Revise the sentence that currently reads "Anything other than the listed paths
returns 401 too..." to explicitly specify that this behavior is limited to
`/metrics` paths, and that non-metrics routes continue to use their normal
handlers as determined by the Caddy proxy matcher.
| | `/metrics/sla-rules` | `tale-platform` | Rules Prometheus de recording + alerting générées pour les SLA de temps de réponse | | ||
|
|
||
| Le travail de connaissances (recherche RAG, ingestion de documents, crawling web) tourne désormais dans le backend Convex, donc ses timings empruntent la série `/metrics/convex` plutôt qu'un endpoint séparé. Mets `METRICS_BEARER_TOKEN` dans `.env` pour activer les deux endpoints ; laisse-le non défini pour qu'ils retournent 401 à chaque requête. Tout sauf les deux chemins listés retourne aussi 401, donc un scraper mal routé ne voit pas accidentellement les endpoints de santé internes de la plateforme. | ||
| Le travail de connaissances (recherche RAG, ingestion de documents, crawling web) tourne désormais dans le backend Convex, donc ses timings empruntent la série `/metrics/convex` plutôt qu'un endpoint séparé. Mets `METRICS_BEARER_TOKEN` dans `.env` pour activer ces endpoints ; laisse-le non défini pour qu'ils retournent 401 à chaque requête. Le chemin `/metrics/sla-rules` est un fichier YAML de rules en lecture seule que tu charges dans Prometheus, pas une cible de scrape — les seuils qu'il porte sont documentés dans [Opérations](/fr/self-hosted/operate/observability/operations). Tout sauf les chemins listés retourne aussi 401, donc un scraper mal routé ne voit pas accidentellement les endpoints de santé internes de la plateforme. |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Tighten the 401 scope here.
The Caddy proxy only returns 401 for unknown /metrics/* URLs; non-metrics routes still use their normal handlers. Please narrow this wording so it doesn't read like a site-wide auth rule. Based on the Caddy matcher in services/proxy/Caddyfile, this only applies to /metrics paths.
🧰 Tools
🪛 LanguageTool
[typographical] ~30-~30: Il manque une espace après le point.
Context: ...éparé. Mets METRICS_BEARER_TOKEN dans .env pour activer ces endpoints ; laisse-le...
(ESPACE_APRES_POINT)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/fr/self-hosted/configuration/observability-config.md` at line 30, The
current documentation states that all paths except those listed return 401,
which reads like a site-wide authentication rule. Revise this sentence to
clarify that 401 responses apply only to `/metrics` paths, not the entire site.
Tighten the wording to explicitly state that unknown `/metrics/*` URLs return
401 while non-metrics routes continue to use their normal handlers, making it
clear this auth restriction is scoped only to the metrics endpoints as
implemented in the Caddy proxy configuration.
| const doc = parse(renderSlaPrometheusRules()) as { | ||
| groups: { name: string; rules: Record<string, unknown>[] }[]; | ||
| }; |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify remaining banned assertion/type patterns in this file.
rg -n --type=ts '\sas\s|unknown' services/platform/sla-targets.test.tsRepository: tale-project/tale
Length of output: 664
🏁 Script executed:
# Check if Zod is already used in services/platform/
rg -l 'from.*zod|import.*zod' services/platform/ --type ts --type tsx | head -20Repository: tale-project/tale
Length of output: 183
🏁 Script executed:
# Read the full test file to understand the context
wc -l services/platform/sla-targets.test.tsRepository: tale-project/tale
Length of output: 197
🏁 Script executed:
# Look at the test file structure around the flagged lines
sed -n '1,120p' services/platform/sla-targets.test.ts | cat -nRepository: tale-project/tale
Length of output: 5139
🏁 Script executed:
# Check for SlaTarget type definition
rg -n 'type SlaTarget|interface SlaTarget' services/platform/ --type ts --type tsxRepository: tale-project/tale
Length of output: 183
🏁 Script executed:
# Check for Zod imports in services/platform without --type flag
rg 'from.*zod|import.*zod' services/platform/ | head -10Repository: tale-project/tale
Length of output: 1046
🏁 Script executed:
# Look for shared schemas directory
find services/platform/lib/shared/schemas -type f 2>/dev/null | head -20Repository: tale-project/tale
Length of output: 1194
🏁 Script executed:
# Check the sla-targets source file to understand the types
head -50 services/platform/sla-targets.tsRepository: tale-project/tale
Length of output: 2453
🏁 Script executed:
# Continue reading the sla-targets.ts file to see SlaTarget interface and renderSlaPrometheusRules
sed -n '30,100p' services/platform/sla-targets.ts | cat -nRepository: tale-project/tale
Length of output: 2786
🏁 Script executed:
# Check if an SLA schema exists in shared schemas
ls -la services/platform/lib/shared/schemas/ | grep -i slaRepository: tale-project/tale
Length of output: 155
🏁 Script executed:
# Look for any YAML parsing or validation patterns in the test suite
rg 'parse.*yaml|Zod' services/platform/ --type ts | grep -E 'test\.ts|spec\.ts' | head -5Repository: tale-project/tale
Length of output: 599
🏁 Script executed:
# Check if the test file uses any validation at boundaries currently
grep -n 'Zod\|z\.' services/platform/sla-targets.test.tsRepository: tale-project/tale
Length of output: 155
🏁 Script executed:
# Check what renderSlaPrometheusRules returns
rg -A 20 'function renderSlaPrometheusRules|export.*renderSlaPrometheusRules' services/platform/sla-targets.ts | head -40Repository: tale-project/tale
Length of output: 813
🏁 Script executed:
# Look at how the parsed doc is actually used in the tests to confirm the structure
sed -n '40,105p' services/platform/sla-targets.test.ts | cat -nRepository: tale-project/tale
Length of output: 3095
🏁 Script executed:
# Check if oxlint is configured and enforces these rules
find . -name '.eslintrc*' -o -name 'oxlint.toml' -o -name 'eslint.config.*' 2>/dev/null | head -5Repository: tale-project/tale
Length of output: 155
🏁 Script executed:
# Verify the exact structure of what parse() returns from YAML
sed -n '95,134p' services/platform/sla-targets.test.ts | cat -nRepository: tale-project/tale
Length of output: 1736
🏁 Script executed:
# Check if SlaTarget type could be moved to shared schemas
head -35 services/platform/sla-targets.ts | tail -25Repository: tale-project/tale
Length of output: 1483
Replace unsafe casts with Zod-validated parsing.
parse(...) as ... and unknown usages bypass TypeScript safety rules and hide malformed YAML shapes from the type system. Parse to unknown, validate with Zod at the boundary, then use the validated typed value.
Proposed direction
+import { z } from 'zod';
import { parse } from 'yaml';
+const RuleSchema = z.record(z.string(), z.unknown());
+const GroupSchema = z.object({
+ name: z.string(),
+ rules: z.array(RuleSchema),
+});
+const RulesDocSchema = z.object({
+ groups: z.array(GroupSchema),
+});
+
test('emits a parseable rules document with recording + alert groups', () => {
- const doc = parse(renderSlaPrometheusRules()) as {
- groups: { name: string; rules: Record<string, unknown>[] }[];
- };
+ const doc = RulesDocSchema.parse(parse(renderSlaPrometheusRules()));
const names = doc.groups.map((g) => g.name);
expect(names).toContain('tale-sla-recording');
expect(names).toContain('tale-sla-alerts');
});Also applies to lines 51–52, 85–86, and 95. Line 57 (as SlaTarget) can remain as the cast is internal to the module—only the parsed YAML boundary needs validation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/platform/sla-targets.test.ts` around lines 42 - 44, The unsafe cast
`as { groups: { name: string; rules: Record<string, unknown>[] }[] }` at the
parse boundary bypasses TypeScript safety and hides malformed YAML shapes from
the type system. Create a Zod schema defining the expected YAML structure with
groups containing name and rules properties, parse the result to unknown first,
then validate it using Zod's parse or safeParse method, and use the validated
typed result instead of the unsafe as cast. Apply this Zod validation pattern to
the parse calls at lines 42-44, 51-52, 85-86, and 95, while leaving the internal
cast at line 57 as-is since it operates on already-validated data within the
module.
Source: Coding guidelines
| expect(body).toMatch( | ||
| /tale_sla_target_seconds\{sla="dialog_ttft",statistic="mean"\} 1\b/, | ||
| ); | ||
| expect(body).toMatch( | ||
| /tale_sla_target_seconds\{sla="long_operation",statistic="mean"\} 40\b/, | ||
| ); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Make metric assertions label-order agnostic.
These regexes assume a fixed label order in exposition text, which can make the test flaky if serialization order changes. Prefer order-insensitive matching.
Suggested adjustment
expect(body).toMatch(
- /tale_sla_target_seconds\{sla="dialog_ttft",statistic="mean"\} 1\b/,
+ /tale_sla_target_seconds\{[^}]*sla="dialog_ttft"[^}]*statistic="mean"[^}]*\}\s+1\b/,
);
expect(body).toMatch(
- /tale_sla_target_seconds\{sla="long_operation",statistic="mean"\} 40\b/,
+ /tale_sla_target_seconds\{[^}]*sla="long_operation"[^}]*statistic="mean"[^}]*\}\s+40\b/,
);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| expect(body).toMatch( | |
| /tale_sla_target_seconds\{sla="dialog_ttft",statistic="mean"\} 1\b/, | |
| ); | |
| expect(body).toMatch( | |
| /tale_sla_target_seconds\{sla="long_operation",statistic="mean"\} 40\b/, | |
| ); | |
| expect(body).toMatch( | |
| /tale_sla_target_seconds\{[^}]*sla="dialog_ttft"[^}]*statistic="mean"[^}]*\}\s+1\b/, | |
| ); | |
| expect(body).toMatch( | |
| /tale_sla_target_seconds\{[^}]*sla="long_operation"[^}]*statistic="mean"[^}]*\}\s+40\b/, | |
| ); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/platform/sla-targets.test.ts` around lines 110 - 115, The regex
patterns in the expect statements for tale_sla_target_seconds metrics assume a
fixed label order (sla before statistic), which makes the test fragile if the
metrics serialization order changes. Modify these regex patterns to be
label-order agnostic by using patterns that can match the labels regardless of
their position. Instead of expecting a specific label sequence within the curly
braces, use regex alternatives or lookahead assertions that match the metric
name and required labels without depending on their order, ensuring the test
passes whether labels appear as {sla="...",statistic="..."} or
{statistic="...",sla="..."}.
| const groups: Record<string, unknown>[] = []; | ||
| if (recordingRules.length > 0) { | ||
| groups.push({ name: 'tale-sla-recording', rules: recordingRules }); | ||
| groups.push({ name: 'tale-sla-alerts', rules: alertRules }); | ||
| } |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n --fixed-strings 'Record<string, unknown>' services/platform/sla-targets.tsRepository: tale-project/tale
Length of output: 207
🏁 Script executed:
ast-grep outline services/platform/sla-targets.ts --view expandedRepository: tale-project/tale
Length of output: 1018
🏁 Script executed:
sed -n '130,170p' services/platform/sla-targets.tsRepository: tale-project/tale
Length of output: 1537
🏁 Script executed:
cat -n services/platform/sla-targets.ts | grep -E 'unknown|any' | head -20Repository: tale-project/tale
Length of output: 210
🏁 Script executed:
sed -n '1,50p' services/platform/sla-targets.ts | head -20Repository: tale-project/tale
Length of output: 1154
🏁 Script executed:
sed -n '154,169p' services/platform/sla-targets.tsRepository: tale-project/tale
Length of output: 756
Replace unknown in rule-group typing.
Line 154 uses Record<string, unknown>[], violating the repo's TypeScript rules (forbidden: any, unknown, as). Let TypeScript infer the concrete type from the object literals instead.
Suggested fix
- const groups: Record<string, unknown>[] = [];
- if (recordingRules.length > 0) {
- groups.push({ name: 'tale-sla-recording', rules: recordingRules });
- groups.push({ name: 'tale-sla-alerts', rules: alertRules });
- }
+ const groups =
+ recordingRules.length === 0
+ ? []
+ : [
+ { name: 'tale-sla-recording', rules: recordingRules },
+ { name: 'tale-sla-alerts', rules: alertRules },
+ ];🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@services/platform/sla-targets.ts` around lines 154 - 158, Remove the explicit
type annotation `Record<string, unknown>[]` from the groups variable declaration
on line 154 and allow TypeScript to infer the concrete type from the object
literals being pushed into the array (those with name and rules properties).
This eliminates the use of the forbidden unknown type and lets the type system
automatically determine the correct shape based on the actual objects you're
adding.
Source: Coding guidelines
Desk review — Response-time SLA tracking (#1924)Verdict: NOT READY — changes required. The change is clean, well-tested at the unit level, and CI is fully green, but as shipped the SLA does not measure or alert on anything — the central ask of #1924 ("continuously confirm we meet the targets or alert when we don't") is not met. Details below. Environment
BLOCKING — the alert/recording rules reference series that are emitted nowhere
Consequence: out of the box The docs (
An operator following the docs literally would relabel an unrelated histogram to the SLA name and get a green/false SLA, which is worse than an inert one. Required: bridge the rules to a real measured series. Either Non-blocking (worth addressing, not gating on their own)
What is correct (verified)
Bottom line: the wiring, tests, docs structure, and CI are all in good shape, but the rules point at latency series that don't exist, so the SLA neither tracks nor alerts on real response time. Fix the measurement bridge (option a) and this is close. |
larryro
left a comment
There was a problem hiding this comment.
Desk review — Response-time SLA tracking (#1924)
Verdict: READY TO MERGE.
Reviewed the whole implementation (not just the diff) across correctness, robustness, elegance, tests, and issue-resolution, with independent re-verification of every candidate finding.
CI
All 40+ checks green on this PR — including Build platform, Docs container test, Lint, Format, Knip, Typecheck (Analyze), and all 16 Playwright (platform) shards (which boot the real built server, exercising the new telemetry.ts → sla-targets.ts → yaml import chain).
Tests (run locally on the branch)
bunx vitest --run --project server→ 502 files, 6137 tests, 0 failures (203s).- Targeted
sla-targets+telemetry→ 18 passed.
What's correct
statisticExprPromQL is right: mean viarate(_sum)/rate(_count), percentiles viahistogram_quantile(q, sum by (le) rate(_bucket)); alert fires onrecording_rule > targetSeconds. Recording/alert names match the docs verbatim.- Caddyfile:
/metrics/sla-rulesis added to@metricsAuthand proxied without a rewrite — correct, becauseserver.tsserves that literal path (unlike/metrics/platform→/metrics). Falls through to the@metricsBlock401 when no bearer token. Handle ordering is sound. registerSlaTargetMetricsdouble-registration is prevented by theinitializedguard ininitTelemetry; both test files reset the globalprom-clientregistry inafterEach.yaml@2.8.3/prom-client@15.1.3are real deps andsla-targets.tsis copied into both the builder and runtime Docker stages (the two follow-up commits fixed the runtime-stage copy).- Edge/error branches covered: empty target list →
{groups: []}, duplicate-id rejection, percentile aggregation. - Issue fully addressed: 1s dialog + 40s long-op means defined once; gauge
tale_sla_target_secondsfor the dashboard budget line; recording+alert rules served at/metrics/sla-rules; and the ~1s-mean vs ~3s-warm-TTFT-ceiling reconciliation is documented inoperations.mdandperformance.md(logically sound: per-request ceiling vs steady-state mean).
Non-blocking notes (optional, for a follow-up — do NOT block merge)
- The SLA rules resolve only once an operator relabels/records the Convex histograms to
tale_dialog_ttft_seconds/tale_long_operation_seconds. If that step is skipped the recording rules return no data and the alert silently never fires. This is honestly documented inoperations.mdand is acceptable scope, but afor-based "no data" guard or an absent-series alert would harden it. operations.mdembeds a hand-copied YAML of the generated rules; it's marked as a fallback to/metrics/sla-rules, so drift risk is low but real.- The
/metrics/sla-rulesHono route is only a thin delegation to the unit-testedslaRulesResponse; an HTTP-level integration test and a p99/alert-label assertion would raise coverage but aren't required.
None of these gate the merge.
What
Resolves #1924. Adds the SLA-tracking and verification layer that was missing on top of Tale's existing measurement primitives — defining the two response-time budgets once and wiring them into metrics, alert rules, and the operator docs.
services/platform/sla-targets.tsis the single source of truth: dialog input at a ~1 s mean and long operations (e.g. evaluations) at a ~40 s mean, each with its statistic, rolling window, alert duration, and underlying latency series.registerSlaTargetMetrics()exposes each budget as atale_sla_target_seconds{sla,statistic}gauge on/metrics(via/metrics/platform), so a Grafana panel draws the budget line straight from Prometheus.renderSlaPrometheusRules()generates the Prometheus recording + alerting rules from the same targets, served read-only at/metrics/sla-rules(added to the bearer-gated metrics paths in the proxyCaddyfile).Verification
vitest --project server—sla-targets.test.ts(targets, rule renderer incl. mean/percentile + empty + duplicate-id cases, gauge registration, YAML response) andtelemetry.test.ts(SLA-gauge regression): 18 passed.@tale/docs test: 142 passed (locale outline, frontmatter, links, code-fence, closing-recap guards all green across en/de/fr).tsc --noEmit,oxlint --type-aware,oxfmt --check,prettier --check, andbun run lint:sast(Opengrep): clean, 0 findings.Definition of Done
messages/{en,de,fr}.json— N/A (no UI strings).Summary by CodeRabbit
New Features
/metrics/sla-rulesendpoint exposes Prometheus recording and alerting rules for SLA tracking.Documentation