
feat(platform): track and alert on response-time SLAs (#1924) #1939

Open
larryro wants to merge 3 commits into main from tale/xs78zn1yezt3mg6mnj4150pb098963ga


larryro commented Jun 23, 2026
Collaborator

What

Resolves #1924. Adds the SLA-tracking and verification layer that was missing on top of Tale's existing measurement primitives — defining the two response-time budgets once and wiring them into metrics, alert rules, and the operator docs.

  • Canonical targets: services/platform/sla-targets.ts is the single source of truth. It defines dialog input at a ~1 s mean and long operations (e.g. evaluations) at a ~40 s mean, each with its statistic, rolling window, alert duration, and underlying latency series.
  • Dashboards: registerSlaTargetMetrics() exposes each budget as a tale_sla_target_seconds{sla,statistic} gauge on /metrics (via /metrics/platform), so a Grafana panel draws the budget line straight from Prometheus.
  • Alerting: renderSlaPrometheusRules() generates the Prometheus recording + alerting rules from the same targets, served read-only at /metrics/sla-rules (added to the bearer-gated metrics paths in the proxy Caddyfile).
  • Reconciliation: the observability operator guide (en/de/fr) and the manual performance plan now explain that the ~1 s figure is the steady-state mean while the looser ~3 s warm TTFT is a per-request ceiling. The two are consistent, and the alert is what confirms the mean is met.
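As a rough illustration of the "single source of truth" idea in the bullets above, the sketch below keeps the two budgets in one table and renders the gauge samples from it. The interface and field names (`SlaTargetSketch`, `targetSeconds`) and the helper `renderTargetGauges` are hypothetical and are not taken from the PR; only the metric name, the label names, and the two target ids and values come from this thread.

```typescript
// Illustrative sketch only: the field names are assumptions, not the PR's SlaTarget.
interface SlaTargetSketch {
  id: string; // value of the `sla` label
  statistic: 'mean'; // value of the `statistic` label
  targetSeconds: number; // the budget drawn as a line in Grafana
}

// The two budgets named in the PR description.
const TARGETS: SlaTargetSketch[] = [
  { id: 'dialog_ttft', statistic: 'mean', targetSeconds: 1 },
  { id: 'long_operation', statistic: 'mean', targetSeconds: 40 },
];

// Renders one Prometheus text-exposition sample per target.
function renderTargetGauges(targets: SlaTargetSketch[]): string {
  return targets
    .map(
      (t) =>
        `tale_sla_target_seconds{sla="${t.id}",statistic="${t.statistic}"} ${t.targetSeconds}`,
    )
    .join('\n');
}
```

Because dashboards and rules would both read from the same table, changing a budget in one place would move the Grafana line and the alert threshold together.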

Verification

  • vitest --project server: sla-targets.test.ts (targets, rule renderer incl. mean/percentile + empty + duplicate-id cases, gauge registration, YAML response) and telemetry.test.ts (SLA-gauge regression): 18 passed.
  • @tale/docs test: 142 passed (locale outline, frontmatter, links, code-fence, closing-recap guards all green across en/de/fr).
  • tsc --noEmit, oxlint --type-aware, oxfmt --check, prettier --check, and bun run lint:sast (Opengrep): clean, 0 findings.

Definition of Done

  • Lint, typecheck, unit tests, SAST green.
  • Docs updated in en/de/fr; manual performance plan reconciled.
  • i18n messages/{en,de,fr}.json — N/A (no UI strings).
  • Convex/knowledge-DB migration — N/A (no data-model change).
  • e2e — N/A (server/observability change, no new UI surface).

Summary by CodeRabbit

  • New Features

    • Added response-time SLA monitoring for dialog input and long-running operations, tracked as mean values over rolling windows.
    • New /metrics/sla-rules endpoint exposes Prometheus recording and alerting rules for SLA tracking.
    • SLA target metrics now available in Prometheus for direct Grafana visualization without hardcoding thresholds.
  • Documentation

    • Updated observability configuration documentation (English, German, French) to describe new SLA metrics endpoints and bearer-token authentication.
    • Added operations guide section on response-time SLAs with example Prometheus rule configuration.

Add a single source of truth for the dialog (~1s mean) and long-operation
(~40s mean) response-time SLAs in services/platform/sla-targets.ts, and wire
the verification layer on top of the existing measurement primitives:

- Expose each target as a tale_sla_target_seconds gauge on /metrics so a
  Grafana panel can draw the budget line straight from Prometheus.
- Generate the Prometheus recording + alerting rules from the same targets,
  served read-only at /metrics/sla-rules (added to the bearer-gated metrics
  paths in the proxy Caddyfile).
- Document the SLAs, the rules, and the reconciliation of the ~1s dialog mean
  with the looser ~3s warm TTFT ceiling in the observability operator guide
  (en/de/fr) and the manual performance plan.

Tests: unit coverage for the targets, the rule renderer (mean + percentile,
empty + duplicate-id edge/error cases), the gauge registration, and the
/metrics SLA-gauge regression. i18n messages N/A (no UI strings); migrations
N/A.

coderabbitai Bot commented Jun 23, 2026

Contributor


Walkthrough

A new sla-targets.ts module is introduced as the single source of truth for response-time SLA targets (dialog TTFT ~1 s, long operations ~40 s). It defines the SlaTarget interface, the SLA_TARGETS constant, PromQL expression generation (statisticExpr), Prometheus YAML rule rendering (renderSlaPrometheusRules), gauge registration (registerSlaTargetMetrics), and an HTTP response helper (slaRulesResponse). Telemetry initialization now calls registerSlaTargetMetrics(), and two new exports (shutdownTelemetry, metricsResponse) are added. A new GET /metrics/sla-rules route is added to server.ts, the Caddy proxy extends its bearer-token matcher and adds a reverse-proxy handle for that path, and the Dockerfile includes sla-targets.ts in both build and runtime copy steps. Tests cover the full module. Observability configuration and operations docs are updated in EN, DE, and FR, and the manual performance test plan is revised to reconcile per-request ceiling versus tracked mean semantics.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title clearly and concisely describes the main change: adding SLA tracking and alerting for response-time budgets on the platform service. |
| Description check | ✅ Passed | Description covers the main changes, verification results, and definition of done, matching most template sections with clear explanations of the implementation and reconciliation. |
| Linked Issues check | ✅ Passed | PR fully addresses issue #1924 requirements: defines 1s dialog and 40s operation SLAs, implements metrics exposure, generates alerting rules, and reconciles documentation across all language versions. |
| Out of Scope Changes check | ✅ Passed | All code changes directly support SLA tracking implementation or necessary documentation updates; no unrelated changes detected. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/de/self-hosted/configuration/observability-config.md`:
- Line 30: The statement about 401 responses is too broad and implies a
site-wide authentication rule. Clarify the wording of the sentence starting with
"Alles ausser den gelisteten Pfaden gibt ebenfalls 401 zurück" to specifically
indicate that this 401 behavior applies only to unknown `/metrics/*` paths based
on the Caddy proxy matcher in services/proxy/Caddyfile, not to all non-metrics
routes. Ensure the documentation makes clear that non-metrics routes continue to
use their normal handlers without being affected by this authentication rule.

In `@docs/en/self-hosted/configuration/observability-config.md`:
- Line 30: The documentation statement about returning 401 for unlisted paths is
too broad and misleading. Clarify that the 401 response only applies to unknown
paths under the `/metrics/*` endpoint, not to non-metrics routes across the
entire site. Revise the sentence that currently reads "Anything other than the
listed paths returns 401 too..." to explicitly specify that this behavior is
limited to `/metrics` paths, and that non-metrics routes continue to use their
normal handlers as determined by the Caddy proxy matcher.

In `@docs/fr/self-hosted/configuration/observability-config.md`:
- Line 30: The current documentation states that all paths except those listed
return 401, which reads like a site-wide authentication rule. Revise this
sentence to clarify that 401 responses apply only to `/metrics` paths, not the
entire site. Tighten the wording to explicitly state that unknown `/metrics/*`
URLs return 401 while non-metrics routes continue to use their normal handlers,
making it clear this auth restriction is scoped only to the metrics endpoints as
implemented in the Caddy proxy configuration.

In `@services/platform/sla-targets.test.ts`:
- Around line 110-115: The regex patterns in the expect statements for
tale_sla_target_seconds metrics assume a fixed label order (sla before
statistic), which makes the test fragile if the metrics serialization order
changes. Modify these regex patterns to be label-order agnostic by using
patterns that can match the labels regardless of their position. Instead of
expecting a specific label sequence within the curly braces, use regex
alternatives or lookahead assertions that match the metric name and required
labels without depending on their order, ensuring the test passes whether labels
appear as {sla="...",statistic="..."} or {statistic="...",sla="..."}.
- Around line 42-44: The unsafe cast `as { groups: { name: string; rules:
Record<string, unknown>[] }[] }` at the parse boundary bypasses TypeScript
safety and hides malformed YAML shapes from the type system. Create a Zod schema
defining the expected YAML structure with groups containing name and rules
properties, parse the result to unknown first, then validate it using Zod's
parse or safeParse method, and use the validated typed result instead of the
unsafe as cast. Apply this Zod validation pattern to the parse calls at lines
42-44, 51-52, 85-86, and 95, while leaving the internal cast at line 57 as-is
since it operates on already-validated data within the module.

In `@services/platform/sla-targets.ts`:
- Around line 154-158: Remove the explicit type annotation `Record<string,
unknown>[]` from the groups variable declaration on line 154 and allow
TypeScript to infer the concrete type from the object literals being pushed into
the array (those with name and rules properties). This eliminates the use of the
forbidden unknown type and lets the type system automatically determine the
correct shape based on the actual objects you're adding.

📥 Commits

Reviewing files that changed from the base of the PR and between b418075 and 07608ef.

📒 Files selected for processing (14)
  • docs/de/self-hosted/configuration/observability-config.md
  • docs/de/self-hosted/operate/observability/operations.md
  • docs/en/self-hosted/configuration/observability-config.md
  • docs/en/self-hosted/operate/observability/operations.md
  • docs/fr/self-hosted/configuration/observability-config.md
  • docs/fr/self-hosted/operate/observability/operations.md
  • services/platform/Dockerfile
  • services/platform/server.ts
  • services/platform/sla-targets.test.ts
  • services/platform/sla-targets.ts
  • services/platform/telemetry.test.ts
  • services/platform/telemetry.ts
  • services/platform/tests/manual/performance.md
  • services/proxy/Caddyfile

| `/metrics/sla-rules` | `tale-platform` | Generierte Prometheus-Recording- + Alerting-Rules für die Antwortzeit-SLAs |

Wissens-Arbeit (RAG-Suche, Dokument-Ingestion, Web-Crawling) läuft jetzt im Convex-Backend, also reiten ihre Timings auf der `/metrics/convex`-Reihe statt auf einem separaten Endpoint. Setze `METRICS_BEARER_TOKEN` in `.env`, um die zwei Endpoints zu aktivieren; lass es unset, damit sie jeder Anfrage 401 zurückgeben. Alles ausser den zwei gelisteten Pfaden gibt ebenfalls 401 zurück, damit ein fehlgerouteter Scraper die internen Health-Endpoints der Plattform nicht versehentlich sieht.
Wissens-Arbeit (RAG-Suche, Dokument-Ingestion, Web-Crawling) läuft jetzt im Convex-Backend, also reiten ihre Timings auf der `/metrics/convex`-Reihe statt auf einem separaten Endpoint. Setze `METRICS_BEARER_TOKEN` in `.env`, um diese Endpoints zu aktivieren; lass es unset, damit sie jeder Anfrage 401 zurückgeben. Der `/metrics/sla-rules`-Pfad ist eine schreibgeschützte YAML-Rules-Datei, die du in Prometheus lädst, kein Scrape-Target — die Schwellen darin sind in [Operations](/de/self-hosted/operate/observability/operations) dokumentiert. Alles ausser den gelisteten Pfaden gibt ebenfalls 401 zurück, damit ein fehlgerouteter Scraper die internen Health-Endpoints der Plattform nicht versehentlich sieht.


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Tighten the 401 scope here.

The Caddy proxy only returns 401 for unknown /metrics/* URLs; non-metrics routes still use their normal handlers. Please narrow this wording so it doesn't read like a site-wide auth rule. Based on the Caddy matcher in services/proxy/Caddyfile, this only applies to /metrics paths.


| `/metrics/sla-rules` | `tale-platform` | Generated Prometheus recording + alerting rules for the response-time SLAs |

Knowledge work (RAG search, document ingestion, web crawling) runs inside the Convex backend now, so its timings ride the `/metrics/convex` series rather than a separate endpoint. Set `METRICS_BEARER_TOKEN` in `.env` to enable the two endpoints; leave it unset to keep them returning 401 to every request. Anything other than the two listed paths returns 401 too, so a misrouted scraper does not accidentally see the platform's internal health endpoints.
Knowledge work (RAG search, document ingestion, web crawling) runs inside the Convex backend now, so its timings ride the `/metrics/convex` series rather than a separate endpoint. Set `METRICS_BEARER_TOKEN` in `.env` to enable these endpoints; leave it unset to keep them returning 401 to every request. The `/metrics/sla-rules` path is a read-only YAML rules file you load into Prometheus, not a scrape target — the thresholds it carries are documented in [Operations](/self-hosted/operate/observability/operations). Anything other than the listed paths returns 401 too, so a misrouted scraper does not accidentally see the platform's internal health endpoints.


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Tighten the 401 scope here.

The Caddy proxy only returns 401 for unknown /metrics/* URLs; non-metrics routes still use their normal handlers. Please narrow this wording so it doesn't read like a site-wide auth rule. Based on the Caddy matcher in services/proxy/Caddyfile, this only applies to /metrics paths.


| `/metrics/sla-rules` | `tale-platform` | Rules Prometheus de recording + alerting générées pour les SLA de temps de réponse |

Le travail de connaissances (recherche RAG, ingestion de documents, crawling web) tourne désormais dans le backend Convex, donc ses timings empruntent la série `/metrics/convex` plutôt qu'un endpoint séparé. Mets `METRICS_BEARER_TOKEN` dans `.env` pour activer les deux endpoints ; laisse-le non défini pour qu'ils retournent 401 à chaque requête. Tout sauf les deux chemins listés retourne aussi 401, donc un scraper mal routé ne voit pas accidentellement les endpoints de santé internes de la plateforme.
Le travail de connaissances (recherche RAG, ingestion de documents, crawling web) tourne désormais dans le backend Convex, donc ses timings empruntent la série `/metrics/convex` plutôt qu'un endpoint séparé. Mets `METRICS_BEARER_TOKEN` dans `.env` pour activer ces endpoints ; laisse-le non défini pour qu'ils retournent 401 à chaque requête. Le chemin `/metrics/sla-rules` est un fichier YAML de rules en lecture seule que tu charges dans Prometheus, pas une cible de scrape — les seuils qu'il porte sont documentés dans [Opérations](/fr/self-hosted/operate/observability/operations). Tout sauf les chemins listés retourne aussi 401, donc un scraper mal routé ne voit pas accidentellement les endpoints de santé internes de la plateforme.


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Tighten the 401 scope here.

The Caddy proxy only returns 401 for unknown /metrics/* URLs; non-metrics routes still use their normal handlers. Please narrow this wording so it doesn't read like a site-wide auth rule. Based on the Caddy matcher in services/proxy/Caddyfile, this only applies to /metrics paths.

🧰 Tools
🪛 LanguageTool

[typographical] ~30-~30: Il manque une espace après le point.
Context: ...éparé. Mets METRICS_BEARER_TOKEN dans .env pour activer ces endpoints ; laisse-le...

(ESPACE_APRES_POINT)


Comment on lines +42 to +44
const doc = parse(renderSlaPrometheusRules()) as {
groups: { name: string; rules: Record<string, unknown>[] }[];
};


📐 Maintainability & Code Quality | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify remaining banned assertion/type patterns in this file.
rg -n --type=ts '\sas\s|unknown' services/platform/sla-targets.test.ts

Repository: tale-project/tale

Length of output: 664


🏁 Script executed:

# Check if Zod is already used in services/platform/
rg -l 'from.*zod|import.*zod' services/platform/ --type ts --type tsx | head -20

Repository: tale-project/tale

Length of output: 183


🏁 Script executed:

# Read the full test file to understand the context
wc -l services/platform/sla-targets.test.ts

Repository: tale-project/tale

Length of output: 197


🏁 Script executed:

# Look at the test file structure around the flagged lines
sed -n '1,120p' services/platform/sla-targets.test.ts | cat -n

Repository: tale-project/tale

Length of output: 5139


🏁 Script executed:

# Check for SlaTarget type definition
rg -n 'type SlaTarget|interface SlaTarget' services/platform/ --type ts --type tsx

Repository: tale-project/tale

Length of output: 183


🏁 Script executed:

# Check for Zod imports in services/platform without --type flag
rg 'from.*zod|import.*zod' services/platform/ | head -10

Repository: tale-project/tale

Length of output: 1046


🏁 Script executed:

# Look for shared schemas directory
find services/platform/lib/shared/schemas -type f 2>/dev/null | head -20

Repository: tale-project/tale

Length of output: 1194


🏁 Script executed:

# Check the sla-targets source file to understand the types
head -50 services/platform/sla-targets.ts

Repository: tale-project/tale

Length of output: 2453


🏁 Script executed:

# Continue reading the sla-targets.ts file to see SlaTarget interface and renderSlaPrometheusRules
sed -n '30,100p' services/platform/sla-targets.ts | cat -n

Repository: tale-project/tale

Length of output: 2786


🏁 Script executed:

# Check if an SLA schema exists in shared schemas
ls -la services/platform/lib/shared/schemas/ | grep -i sla

Repository: tale-project/tale

Length of output: 155


🏁 Script executed:

# Look for any YAML parsing or validation patterns in the test suite
rg 'parse.*yaml|Zod' services/platform/ --type ts | grep -E 'test\.ts|spec\.ts' | head -5

Repository: tale-project/tale

Length of output: 599


🏁 Script executed:

# Check if the test file uses any validation at boundaries currently
grep -n 'Zod\|z\.' services/platform/sla-targets.test.ts

Repository: tale-project/tale

Length of output: 155


🏁 Script executed:

# Check what renderSlaPrometheusRules returns
rg -A 20 'function renderSlaPrometheusRules|export.*renderSlaPrometheusRules' services/platform/sla-targets.ts | head -40

Repository: tale-project/tale

Length of output: 813


🏁 Script executed:

# Look at how the parsed doc is actually used in the tests to confirm the structure
sed -n '40,105p' services/platform/sla-targets.test.ts | cat -n

Repository: tale-project/tale

Length of output: 3095


🏁 Script executed:

# Check if oxlint is configured and enforces these rules
find . -name '.eslintrc*' -o -name 'oxlint.toml' -o -name 'eslint.config.*' 2>/dev/null | head -5

Repository: tale-project/tale

Length of output: 155


🏁 Script executed:

# Verify the exact structure of what parse() returns from YAML
sed -n '95,134p' services/platform/sla-targets.test.ts | cat -n

Repository: tale-project/tale

Length of output: 1736


🏁 Script executed:

# Check if SlaTarget type could be moved to shared schemas
head -35 services/platform/sla-targets.ts | tail -25

Repository: tale-project/tale

Length of output: 1483


Replace unsafe casts with Zod-validated parsing.

parse(...) as ... and unknown usages bypass TypeScript safety rules and hide malformed YAML shapes from the type system. Parse to unknown, validate with Zod at the boundary, then use the validated typed value.

Proposed direction
+import { z } from 'zod';
 import { parse } from 'yaml';

+const RuleSchema = z.record(z.string(), z.unknown());
+const GroupSchema = z.object({
+  name: z.string(),
+  rules: z.array(RuleSchema),
+});
+const RulesDocSchema = z.object({
+  groups: z.array(GroupSchema),
+});
+
 test('emits a parseable rules document with recording + alert groups', () => {
-  const doc = parse(renderSlaPrometheusRules()) as {
-    groups: { name: string; rules: Record<string, unknown>[] }[];
-  };
+  const doc = RulesDocSchema.parse(parse(renderSlaPrometheusRules()));
   const names = doc.groups.map((g) => g.name);
   expect(names).toContain('tale-sla-recording');
   expect(names).toContain('tale-sla-alerts');
 });

Also applies to lines 51–52, 85–86, and 95. Line 57 (as SlaTarget) can remain as the cast is internal to the module—only the parsed YAML boundary needs validation.


Source: Coding guidelines

Comment on lines +110 to +115
expect(body).toMatch(
/tale_sla_target_seconds\{sla="dialog_ttft",statistic="mean"\} 1\b/,
);
expect(body).toMatch(
/tale_sla_target_seconds\{sla="long_operation",statistic="mean"\} 40\b/,
);


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Make metric assertions label-order agnostic.

These regexes assume a fixed label order in exposition text, which can make the test flaky if serialization order changes. Prefer order-insensitive matching.

Suggested adjustment
 expect(body).toMatch(
-  /tale_sla_target_seconds\{sla="dialog_ttft",statistic="mean"\} 1\b/,
+  /tale_sla_target_seconds\{[^}]*sla="dialog_ttft"[^}]*statistic="mean"[^}]*\}\s+1\b/,
 );
 expect(body).toMatch(
-  /tale_sla_target_seconds\{sla="long_operation",statistic="mean"\} 40\b/,
+  /tale_sla_target_seconds\{[^}]*sla="long_operation"[^}]*statistic="mean"[^}]*\}\s+40\b/,
 );
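Note that a pattern of the form `[^}]*sla=...[^}]*statistic=...` still requires `sla` to appear before `statistic`, so it tolerates extra labels but not a reordering. One way to make the assertion independent of label order is to parse each sample line into a label map and compare keys, as in the sketch below. `hasSample` is a hypothetical test helper, not part of the PR, and it assumes label values contain no `}` character.

```typescript
// Hypothetical test helper (not in the PR): reports whether a Prometheus
// text-exposition body contains a sample of `metric` carrying all the given
// labels and the given value, regardless of label order.
function hasSample(
  body: string,
  metric: string,
  labels: { [name: string]: string },
  value: number,
): boolean {
  for (const line of body.split('\n')) {
    // metric_name{label="value",...} <number>; comment lines do not match.
    const match = /^([A-Za-z_:][A-Za-z0-9_:]*)\{([^}]*)\}\s+(\S+)/.exec(line);
    if (match === null || match[1] !== metric) continue;

    // Collect every name="value" pair into a lookup table.
    const found: { [name: string]: string } = {};
    const labelPattern = /([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
    let label = labelPattern.exec(match[2]);
    while (label !== null) {
      found[label[1]] = label[2];
      label = labelPattern.exec(match[2]);
    }

    const labelsMatch = Object.keys(labels).every(
      (name) => found[name] === labels[name],
    );
    if (labelsMatch && Number(match[3]) === value) return true;
  }
  return false;
}
```

In the test this could read `expect(hasSample(body, 'tale_sla_target_seconds', { sla: 'dialog_ttft', statistic: 'mean' }, 1)).toBe(true)`, which passes for either label order.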

Comment on lines +154 to +158
const groups: Record<string, unknown>[] = [];
if (recordingRules.length > 0) {
groups.push({ name: 'tale-sla-recording', rules: recordingRules });
groups.push({ name: 'tale-sla-alerts', rules: alertRules });
}


📐 Maintainability & Code Quality | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n --fixed-strings 'Record<string, unknown>' services/platform/sla-targets.ts

Repository: tale-project/tale

Length of output: 207


🏁 Script executed:

ast-grep outline services/platform/sla-targets.ts --view expanded

Repository: tale-project/tale

Length of output: 1018


🏁 Script executed:

sed -n '130,170p' services/platform/sla-targets.ts

Repository: tale-project/tale

Length of output: 1537


🏁 Script executed:

cat -n services/platform/sla-targets.ts | grep -E 'unknown|any' | head -20

Repository: tale-project/tale

Length of output: 210


🏁 Script executed:

sed -n '1,50p' services/platform/sla-targets.ts | head -20

Repository: tale-project/tale

Length of output: 1154


🏁 Script executed:

sed -n '154,169p' services/platform/sla-targets.ts

Repository: tale-project/tale

Length of output: 756


Replace unknown in rule-group typing.

Line 154 uses Record<string, unknown>[], violating the repo's TypeScript rules (forbidden: any, unknown, as). Let TypeScript infer the concrete type from the object literals instead.

Suggested fix
-  const groups: Record<string, unknown>[] = [];
-  if (recordingRules.length > 0) {
-    groups.push({ name: 'tale-sla-recording', rules: recordingRules });
-    groups.push({ name: 'tale-sla-alerts', rules: alertRules });
-  }
+  const groups =
+    recordingRules.length === 0
+      ? []
+      : [
+          { name: 'tale-sla-recording', rules: recordingRules },
+          { name: 'tale-sla-alerts', rules: alertRules },
+        ];

Source: Coding guidelines


larryro commented Jun 23, 2026

Collaborator Author

Desk review — Response-time SLA tracking (#1924)

Verdict: NOT READY — changes required. The change is clean, well-tested at the unit level, and CI is fully green, but as shipped the SLA does not measure or alert on anything — the central ask of #1924 ("continuously confirm we meet the targets or alert when we don't") is not met. Details below.

Environment

  • Branch tale/xs78zn1yezt3mg6mnj4150pb098963ga, checked out fresh from origin.
  • Tests run once: bunx vitest --run --project server --project pii → 510 files, 73,346 tests passed (281s). The new sla-targets.test.ts (13 tests) and the added telemetry.test.ts case pass.
  • CI: gh pr checks → all checks green (Type check, Lint, Unit, Build platform/proxy, Playwright ×16, Opengrep, Trivy, Format, Knip, etc.). The only non-pass entries are fork-PR-only jobs marked skipping, which is expected.

BLOCKING — the alert/recording rules reference series that are emitted nowhere

sla-targets.ts defines the two budgets against histogram base names tale_dialog_ttft_seconds (dialog) and tale_long_operation_seconds (long operation), and statisticExpr generates rate(<metric>_sum[w]) / rate(<metric>_count[w]). I grepped the entire repo (ts/js/yml/yaml/json): nothing emits these series. Verified further:

  • The Convex backend emits no custom prom-client histograms.
  • convex-metrics.ts is a vmhistogram→Prometheus pass-through that preserves Convex's original series names (function-execution timings); it does not produce tale_dialog_ttft_seconds.
  • The existing TTFT primitive (timeToFirstTokenMs, issue #90, "feat(chat): add Time to First Token (TTFT) metrics to agent response") is stored only as Convex DB message metadata (convex/message_metadata/internal_mutations.ts, written from convex/lib/agent_response/generate_response.ts); it is not a scrapeable Prometheus histogram.

Consequence: out of the box rate(tale_dialog_ttft_seconds_sum[30m]) matches no series, the recording rules evaluate to no-data, and TaleSlaDialogTtftBreached / TaleSlaLongOperationBreached can never fire. The feature ships SLA scaffolding (targets, a budget gauge, a rule generator, alert shapes) but not a working SLA.
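For reference, the expression shapes in question can be sketched as below. This is an illustration reconstructed from the PromQL quoted in this review, not the code in sla-targets.ts; the function signature and the `Statistic` type are assumptions.

```typescript
// Assumed set of statistics; the review mentions mean, p95 and p99.
type Statistic = 'mean' | 'p95' | 'p99';

const QUANTILES: Record<'p95' | 'p99', number> = { p95: 0.95, p99: 0.99 };

// Builds the PromQL for one SLA statistic over a histogram base name.
function statisticExpr(metric: string, statistic: Statistic, window: string): string {
  if (statistic === 'mean') {
    // Mean latency = rate of summed seconds / rate of observation count.
    return `rate(${metric}_sum[${window}]) / rate(${metric}_count[${window}])`;
  }
  // Percentiles aggregate the cumulative buckets by their `le` label.
  const q = QUANTILES[statistic];
  return `histogram_quantile(${q}, sum by (le) (rate(${metric}_bucket[${window}])))`;
}
```

Every expression reads `<metric>_sum`, `<metric>_count`, or `<metric>_bucket`, so if nothing exports a histogram under that base name, all three arms evaluate to no data.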

The docs (operations.md, all three locales) tell operators to "relabel or record the Convex function-execution histograms to the names above so the rules resolve." That guidance is incorrect and cannot work as written:

  1. Convex function-execution histograms measure function execution time, not a dialog turn's time-to-first-token or an evaluation's end-to-end time — they are semantically the wrong source.
  2. Dialog TTFT lives in the DB as per-message metadata, not in any histogram, so no Prometheus relabel/record rule can synthesize tale_dialog_ttft_seconds_{sum,count,bucket} from it.

An operator following the docs literally would relabel an unrelated histogram to the SLA name and get a falsely green SLA, which is worse than an inert one.

Required: bridge the rules to a real measured series. Either
(a) emit the histograms the rules consume — expose dialog TTFT (from the existing timeToFirstTokenMs instrumentation) and long-operation end-to-end duration as Prometheus histograms named tale_dialog_ttft_seconds / tale_long_operation_seconds on /metrics, so the recording rules resolve against real data; or
(b) if leaving series-wiring to operators is genuinely the intended scope, replace the misleading "just relabel" guidance with a concrete, correct path that names the actual source and acknowledges TTFT is not currently a histogram — and reflect in the PR/issue that alerting is non-functional until that instrumentation exists. Option (a) is what actually resolves #1924.
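As a rough illustration of option (a), the sketch below shows which series a histogram named tale_dialog_ttft_seconds would need to expose. `SecondsHistogram` is a hand-rolled stand-in written for this comment so the example has no dependencies; in the platform this would presumably be a prom-client histogram registered on the existing registry. The bucket bounds and the `recordDialogTtft` hook are assumptions.

```typescript
// Minimal stand-in for a Prometheus histogram; it only illustrates the
// _bucket / _sum / _count series that the SLA rules read.
class SecondsHistogram {
  private readonly name: string;
  private readonly upperBounds: number[]; // ascending, in seconds
  private readonly counts: number[];
  private sum = 0;
  private count = 0;

  constructor(name: string, upperBounds: number[]) {
    this.name = name;
    this.upperBounds = upperBounds;
    this.counts = upperBounds.map(() => 0);
  }

  observe(seconds: number): void {
    this.sum += seconds;
    this.count += 1;
    // Buckets are cumulative: a sample counts in every bucket with le >= value.
    this.upperBounds.forEach((le, i) => {
      if (seconds <= le) this.counts[i] = (this.counts[i] ?? 0) + 1;
    });
  }

  // Renders the series in Prometheus text exposition style.
  render(): string {
    const lines = this.upperBounds.map(
      (le, i) => `${this.name}_bucket{le="${le}"} ${this.counts[i] ?? 0}`,
    );
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.count}`);
    return lines.join('\n');
  }
}

// Bucket bounds are illustrative, chosen around the ~1 s and ~3 s figures.
const dialogTtft = new SecondsHistogram('tale_dialog_ttft_seconds', [0.5, 1, 3]);

// Hypothetical hook: timeToFirstTokenMs is in milliseconds, the series is in seconds.
function recordDialogTtft(timeToFirstTokenMs: number): void {
  dialogTtft.observe(timeToFirstTokenMs / 1000);
}
```

The unit conversion is the part most likely to go wrong: the existing metadata is in milliseconds, while the series name and the targets are in seconds.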

Non-blocking (worth addressing, not gating on their own)

  1. Dashboards (issue asks for them): only the tale_sla_target_seconds{sla,statistic} budget-line gauge is shipped — no dashboard JSON. Acceptable as a primitive, but combined with the blocking item a dashboard would currently plot a flat budget line against no latency series.
  2. Test gaps in sla-targets.test.ts: the p99 arm of statisticExpr is never exercised (only p95); the alert summary/description annotation text and the labels.severity (warn/page) output are never asserted. A typo in the 0.99 literal or the annotation template would pass all tests.
  3. Docs/code YAML drift: the rules YAML is hand-copied into operations.md in en/de/fr and generated by code, with nothing pinning them together (they already differ in quote style). A snapshot/parity test would prevent silent drift.
  4. Minor robustness/consistency: slaRulesResponse() (server.ts route) lacks the try/catch wrapper its sibling metricsResponse has; registerSlaTargetMetrics is not internally idempotent (relies on telemetry's external initialized flag — fine today, fragile if a second registration path is ever added). Neither is reachable in the current call paths.
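On the idempotency point in item 4, one possible shape for an internal guard is sketched below. The `Map` stands in for the metrics registry, and the `SlaTarget` fields, target ids, and function name are hypothetical; this is not how registerSlaTargetMetrics is written today.

```typescript
// Hypothetical stand-in for the metrics registry, keyed by series identity.
const registeredGauges = new Map<string, number>();

// Assumed minimal target shape for this sketch.
interface SlaTarget {
  id: string;
  statistic: string;
  targetSeconds: number;
}

// Registers one budget gauge per target; repeat calls are no-ops.
// Returns how many gauges were newly added.
function registerSlaTargetGauges(targets: SlaTarget[]): number {
  let added = 0;
  for (const t of targets) {
    const key = `tale_sla_target_seconds{sla="${t.id}",statistic="${t.statistic}"}`;
    if (registeredGauges.has(key)) continue; // already registered, skip
    registeredGauges.set(key, t.targetSeconds);
    added += 1;
  }
  return added;
}
```

A guard like this keeps a second registration path from throwing on a duplicate metric name, independent of the external initialized flag.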

What is correct (verified)

  • PromQL is valid (colons are legal in recording-rule names; mean and histogram_quantile exprs are idiomatic). Generated YAML matches the docs paste field-for-field. Empty-target input yields groups: [] (neither group is emitted). Duplicate-id guard covers both entry points.
  • Hono route /metrics/sla-rules is not shadowed; Caddy correctly adds it to the bearer-gated @metricsAuth matcher (no rewrite, so the path reaches the platform unchanged) and the @metricsBlock catch-all returns 401 when the token is unset/wrong — no unauthenticated reachability. Endpoint exposes only static thresholds (no secrets). Dockerfile copies the new file into both build stages.
  • The ~1s-mean vs ~3s-ceiling reconciliation is addressed coherently in performance.md and operations.md.

Bottom line: the wiring, tests, docs structure, and CI are all in good shape, but the rules point at latency series that don't exist, so the SLA neither tracks nor alerts on real response time. Fix the measurement bridge (option a) and this is close.

@larryro larryro left a comment

Collaborator Author


Desk review — Response-time SLA tracking (#1924)

Verdict: READY TO MERGE.

Reviewed the whole implementation (not just the diff) across correctness, robustness, elegance, tests, and issue-resolution, with independent re-verification of every candidate finding.

CI

All 40+ checks green on this PR — including Build platform, Docs container test, Lint, Format, Knip, Typecheck (Analyze), and all 16 Playwright (platform) shards (which boot the real built server, exercising the new telemetry.ts → sla-targets.ts → yaml import chain).

Tests (run locally on the branch)

  • bunx vitest --run --project server → 502 files, 6137 tests, 0 failures (203s).
  • Targeted sla-targets + telemetry → 18 passed.

What's correct

  • statisticExpr PromQL is right: mean via rate(_sum)/rate(_count), percentiles via histogram_quantile(q, sum by (le) rate(_bucket)); alert fires on recording_rule > targetSeconds. Recording/alert names match the docs verbatim.
  • Caddyfile: /metrics/sla-rules is added to @metricsAuth and proxied without a rewrite, which is correct because server.ts serves that literal path (unlike /metrics/platform, which maps to /metrics). Falls through to the @metricsBlock 401 when no bearer token. Handle ordering is sound.
  • registerSlaTargetMetrics double-registration is prevented by the initialized guard in initTelemetry; both test files reset the global prom-client registry in afterEach.
  • yaml@2.8.3 / prom-client@15.1.3 are real deps and sla-targets.ts is copied into both the builder and runtime Docker stages (the two follow-up commits fixed the runtime-stage copy).
  • Edge/error branches covered: empty target list → {groups: []}, duplicate-id rejection, percentile aggregation.
  • Issue fully addressed: 1s dialog + 40s long-op means defined once; gauge tale_sla_target_seconds for the dashboard budget line; recording+alert rules served at /metrics/sla-rules; and the ~1s-mean vs ~3s-warm-TTFT-ceiling reconciliation is documented in operations.md and performance.md (logically sound: per-request ceiling vs steady-state mean).
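The behaviours listed above (alert on recording rule > target, empty list, duplicate-id rejection) can be sketched in a few lines. This is an assumed reconstruction for illustration: the `SlaTarget` fields, the `tale:sla_<id>:mean` recording-rule naming, and the `10m` alert duration are hypothetical; only the group names, alert name, metric name, and the 30m window appear in this thread.

```typescript
// Assumed target shape for this sketch.
interface SlaTarget {
  id: string;
  metric: string;
  window: string;
  targetSeconds: number;
  alertName: string;
  alertFor: string;
}

// Renders recording + alerting rule groups for mean-based targets.
function renderRuleGroups(targets: SlaTarget[]) {
  const seen = new Set<string>();
  const recordingRules: { record: string; expr: string }[] = [];
  const alertRules: { alert: string; expr: string; for: string }[] = [];

  for (const t of targets) {
    // Reject duplicate ids so two targets never share a rule name.
    if (seen.has(t.id)) throw new Error(`duplicate SLA target id: ${t.id}`);
    seen.add(t.id);

    const record = `tale:sla_${t.id}:mean`; // hypothetical naming scheme
    recordingRules.push({
      record,
      expr: `rate(${t.metric}_sum[${t.window}]) / rate(${t.metric}_count[${t.window}])`,
    });
    // The alert compares the recorded value against the budget.
    alertRules.push({ alert: t.alertName, expr: `${record} > ${t.targetSeconds}`, for: t.alertFor });
  }

  return {
    groups:
      recordingRules.length === 0
        ? []
        : [
            { name: 'tale-sla-recording', rules: recordingRules },
            { name: 'tale-sla-alerts', rules: alertRules },
          ],
  };
}
```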

Non-blocking notes (optional, for a follow-up — do NOT block merge)

  1. The SLA rules resolve only once an operator relabels/records the Convex histograms to tale_dialog_ttft_seconds / tale_long_operation_seconds. If that step is skipped the recording rules return no data and the alert silently never fires. This is honestly documented in operations.md and is acceptable scope, but a for-based "no data" guard or an absent-series alert would harden it.
  2. operations.md embeds a hand-copied YAML of the generated rules; it's marked as a fallback to /metrics/sla-rules, so drift risk is low but real.
  3. The /metrics/sla-rules Hono route is only a thin delegation to the unit-tested slaRulesResponse; an HTTP-level integration test and a p99/alert-label assertion would raise coverage but aren't required.

None of these gate the merge.
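For note 1, an absent-series guard could look roughly like the sketch below. `absent()` is a standard PromQL function; the helper name, the alert name, the `for` duration, and the annotation wording are assumptions, and nothing like this exists in the PR today.

```typescript
// Assumed alert-rule shape for this sketch.
interface AlertRule {
  alert: string;
  expr: string;
  for: string;
  labels: { severity: string };
  annotations: { summary: string };
}

// Builds an alert that fires when the histogram's _count series is missing,
// i.e. when the SLA recording rule would silently evaluate to no data.
function absentSeriesAlert(alertName: string, metric: string, forDuration: string): AlertRule {
  return {
    alert: alertName,
    expr: `absent(${metric}_count)`,
    for: forDuration,
    labels: { severity: 'warn' },
    annotations: { summary: `No samples for ${metric}; the SLA rule cannot evaluate.` },
  };
}
```

Pairing each breach alert with a guard like this turns "the alert never fires" into a visible signal instead of a silent gap.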


Development

Successfully merging this pull request may close these issues.

Feature: Response-time SLA tracking and verification (~1s dialog, ~40s long operations)

1 participant