Skip to content

Changes made#690

Open
Godfrey-Delight wants to merge 2 commits into
NOVUS-X:mainfrom
Godfrey-Delight:main
Open

Changes made#690
Godfrey-Delight wants to merge 2 commits into
NOVUS-X:mainfrom
Godfrey-Delight:main

Conversation

@Godfrey-Delight

@Godfrey-Delight Godfrey-Delight commented Apr 26, 2026

Copy link
Copy Markdown

I successfully implemented comprehensive Prometheus and Grafana metrics integration for the XLMate backend, adding real-time monitoring for active games, WebSocket connections, database query latency, authentication events, game lifecycle events, and AI requests. The implementation includes a centralized metrics module with 7 custom Prometheus metrics, automatic HTTP request metrics via actix-web-prom middleware, fully instrumented endpoints across games/auth/AI/WebSocket modules, pre-configured Prometheus and Grafana services in Docker Compose with auto-provisioned dashboards, and comprehensive unit tests ensuring thread-safe metric collection and proper Prometheus format encoding.
Closed #540

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive monitoring and observability with Prometheus metrics collection
    • Integrated Grafana dashboard displaying active games, WebSocket connections, database query performance, and request rates
    • Added metrics tracking for AI requests, authentication events, and game lifecycle events
  • Documentation

    • Added monitoring setup guide and metrics endpoint documentation
  • Tests

    • Added metrics validation test suite
  • Chores

    • Added Prometheus and Grafana services to Docker Compose stack

I successfully implemented comprehensive Prometheus and Grafana metrics integration for the XLMate backend, adding real-time monitoring for active games, WebSocket connections, database query latency, authentication events, game lifecycle events, and AI requests. The implementation includes a centralized metrics module with 7 custom Prometheus metrics, automatic HTTP request metrics via actix-web-prom middleware, fully instrumented endpoints across games/auth/AI/WebSocket modules, pre-configured Prometheus and Grafana services in Docker Compose with auto-provisioned dashboards, and comprehensive unit tests ensuring thread-safe metric collection and proper Prometheus format encoding.
Closed NOVUS-X#540
@drips-wave

drips-wave Bot commented Apr 26, 2026

Copy link
Copy Markdown

@Godfrey-Delight Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits

@coderabbitai

coderabbitai Bot commented Apr 26, 2026

Copy link
Copy Markdown
Contributor

Warning

Rate limit exceeded

@Godfrey-Delight has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 31 minutes and 32 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 31 minutes and 32 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5492c853-0214-450d-bf76-798f62365da0

📥 Commits

Reviewing files that changed from the base of the PR and between e994f9e and 2177e49.

📒 Files selected for processing (7)
  • backend/.env.example
  • backend/README.md
  • backend/modules/api/src/metrics.rs
  • backend/modules/api/src/metrics_tests.rs
  • backend/modules/api/src/server.rs
  • backend/monitoring/grafana/provisioning/datasources/prometheus.yml
  • docker-compose.yml
📝 Walkthrough

Walkthrough

The pull request adds comprehensive Prometheus and Grafana monitoring to the backend by introducing a metrics module that instruments application handlers and WebSocket connections, configuring Prometheus scraping and Grafana visualization, and extending Docker Compose with the monitoring stack services.

Changes

Cohort / File(s) Summary
Configuration & Documentation
backend/.env.example, backend/README.md
Added environment variables for metrics configuration and extensive monitoring documentation covering setup, endpoints, metrics types, PromQL examples, and test commands.
Metrics Module & Infrastructure
backend/modules/api/Cargo.toml, backend/modules/api/src/lib.rs, backend/modules/api/src/metrics.rs
Introduced Prometheus dependencies and new metrics module with global Registry, gauges for active games/WebSocket connections/matchmaking queue, histogram for database query duration, counters for AI/auth/game events, and a /metrics endpoint handler.
Metrics Testing
backend/modules/api/src/metrics_tests.rs
Added comprehensive test suite validating gauge/counter/histogram operations, concurrent metric updates, registry exposure, and Prometheus text encoding.
API Handler Instrumentation
backend/modules/api/src/ai.rs, backend/modules/api/src/auth.rs, backend/modules/api/src/games.rs, backend/modules/api/src/ws.rs, backend/modules/api/src/server.rs
Instrumented API handlers and WebSocket lifecycle to increment/decrement metrics counters and gauges; integrated Prometheus middleware into Actix server and exposed /metrics endpoint.
Prometheus Monitoring Stack
backend/monitoring/prometheus.yml
Created Prometheus configuration scraping /metrics and /metrics/http endpoints from the backend at 10s intervals.
Grafana Dashboard Configuration
backend/monitoring/grafana/provisioning/dashboards/dashboard.yml, backend/monitoring/grafana/provisioning/dashboards/xlmate-dashboard.json, backend/monitoring/grafana/provisioning/datasources/prometheus.yml
Added Grafana provisioning files defining a dashboard provider, a pre-built monitoring dashboard with stat panels and timeseries charts for active games, WebSocket connections, database latency, HTTP request rates, game events, and AI request rates, plus Prometheus datasource configuration.
Docker Integration
docker-compose.yml
Extended Docker Compose with Prometheus and Grafana services, persistent volumes, port mappings, and provisioning file mounts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰✨ Metrics gathered with care,
Prometheus watches everywhere,
Grafana charts the tale so bright,
WebSockets counted day and night!
Observability hops into view—
Data dreams now all come true! 🎯📊

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title "Changes made" is vague and generic, failing to convey the specific nature or scope of the changeset. Revise the title to be more descriptive, e.g., "Add Prometheus and Grafana metrics integration for backend monitoring."
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed The PR implements Prometheus/Grafana metrics across backend endpoints with unit tests, though frontend analysis was not performed as frontend code is not modified.
Out of Scope Changes check ✅ Passed All changes directly support metrics integration objectives; Docker Compose updates for monitoring infrastructure are appropriately scoped.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/modules/api/src/games.rs (1)

607-609: ⚠️ Potential issue | 🔴 Critical

Stray closing brace — file will not compile.

complete_game already closes at Line 608 (function body opened at Line 493, match closed at Line 607). The extra } on Line 609 is a syntax error.

🐛 Proposed fix
         }
     }
 }
-}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/modules/api/src/games.rs` around lines 607 - 609, The file contains
an extra closing brace that causes a compile error; remove the stray `}` that
follows the end of the complete_game function so the function (complete_game)
and surrounding module braces are balanced — locate the end of the complete_game
function (opened at Line ~493, closed at the match end around Line ~607) and
delete the extra closing brace after that match to restore correct brace
pairing.
docker-compose.yml (1)

37-75: ⚠️ Potential issue | 🔴 Critical

Critical: prometheus and grafana are nested under volumes:, not services: — they will not start.

Top-level volumes: opens at Line 37. Lines 42–45 correctly add prometheus_data / grafana_data as named volumes, but Lines 47–75 keep the same 2-space indentation, so docker-compose interprets prometheus: and grafana: as additional named-volume declarations (with bogus image/ports/command keys). The monitoring stack will silently fail to come up, and Grafana's datasource (which references http://prometheus:9090) will not resolve.

🐛 Proposed fix — split volumes from services
 volumes:
   redis_data:
     driver: local
   postgres_data:
     driver: local
   prometheus_data:
     driver: local
   grafana_data:
     driver: local
-
-  prometheus:
-    image: prom/prometheus:latest
-    container_name: xlmate-prometheus
-    ports:
-      - "9090:9090"
-    volumes:
-      - ./backend/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
-      - prometheus_data:/prometheus
-    command:
-      - '--config.file=/etc/prometheus/prometheus.yml'
-      - '--storage.tsdb.path=/prometheus'
-      - '--web.console.libraries=/etc/prometheus/console_libraries'
-      - '--web.console.templates=/etc/prometheus/consoles'
-    restart: unless-stopped
-
-  grafana:
-    image: grafana/grafana:latest
-    container_name: xlmate-grafana
-    ports:
-      - "3000:3000"
-    environment:
-      - GF_SECURITY_ADMIN_PASSWORD=admin
-      - GF_USERS_ALLOW_SIGN_UP=false
-    volumes:
-      - grafana_data:/var/lib/grafana
-      - ./backend/monitoring/grafana/provisioning:/etc/grafana/provisioning
-    restart: unless-stopped
-    depends_on:
-      - prometheus

Then add the service blocks under the existing services: key (Line 3) with proper indentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker-compose.yml` around lines 37 - 75, The compose file places the
prometheus and grafana service blocks under the top-level volumes: section
instead of services:, so Docker Compose treats them as volume entries; move the
prometheus and grafana blocks out of the volumes: mapping and place them as
entries under the existing services: key (preserving their image,
container_name, ports, environment, volumes, command, restart, depends_on,
etc.), ensure named volumes prometheus_data and grafana_data remain defined
under volumes:, and fix indentation so prometheus and grafana are sibling
entries of other services (not nested under volumes:).
backend/modules/api/src/auth.rs (1)

28-107: ⚠️ Potential issue | 🟡 Minor

Auth failure tracking is incomplete.

increment_auth_events(..., false) is only called when access-token generation fails. Other failure paths silently skip the metric, so the success/failure ratio will be misleading once you split by label (assuming the underlying counter is fixed to use labels):

  • register validation failure (Line 28–33) — never recorded.
  • login validation failure (Line 68–73) — never recorded.
  • login refresh-token generation failure (Line 100–107) — never recorded.
🔧 Suggested additions
     if let Err(errors) = payload.validate() {
+        increment_auth_events("register", false);
         return HttpResponse::BadRequest().json(ErrorResponse { ... });
     }
     if let Err(errors) = payload.validate() {
+        increment_auth_events("login", false);
         return HttpResponse::BadRequest().json(ErrorResponse { ... });
     }
         Err(e) => {
             log::error!("Failed to generate refresh token: {}", e);
+            increment_auth_events("login", false);
             return HttpResponse::InternalServerError().json(ErrorResponse { ... });
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/modules/api/src/auth.rs` around lines 28 - 107, Validation and
token-generation failures are not being recorded by the auth metrics, so update
the handlers to call increment_auth_events with false on all failure paths: in
the register function, call increment_auth_events("register", false) inside the
validation error branch (where payload.validate() fails); in the login function,
call increment_auth_events("login", false) inside the login validation error
branch and inside the refresh-token failure branch (the Err(e) arm of
TokenService::generate_refresh_token), and ensure you still call
increment_auth_events("login", true) after successful token creation (after
jwt_service.generate_token and refresh token succeed) so both success and all
failure paths are tracked.
🧹 Nitpick comments (6)
backend/modules/api/src/ws.rs (1)

13-13: Drop unused Metrics import.

Only increment_ws_connections and decrement_ws_connections are referenced in this file; Metrics is imported but never used and will produce an unused_imports warning.

♻️ Proposed change
-use crate::metrics::{increment_ws_connections, decrement_ws_connections, Metrics};
+use crate::metrics::{increment_ws_connections, decrement_ws_connections};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/modules/api/src/ws.rs` at line 13, The import statement in ws.rs
includes an unused symbol `Metrics`; remove `Metrics` from the use declaration
so only `increment_ws_connections` and `decrement_ws_connections` are imported
(i.e., update the use crate::metrics line to import just those two functions) to
eliminate the unused_imports warning.
backend/monitoring/prometheus.yml (1)

1-3: Optional: align global and per-job intervals.

Global scrape_interval is 15s but both jobs override to 10s, which makes the global setting effectively unused. Consider lowering the global to 10s (or dropping the per-job overrides) to keep configuration intent obvious.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/monitoring/prometheus.yml` around lines 1 - 3, The global
scrape_interval (global.scrape_interval = 15s) is being overridden by job-level
scrape_interval values (per-job scrape_interval = 10s), so update the
configuration to make intent clear: either set global.scrape_interval to 10s
(and keep evaluation_interval consistent) or remove the per-job scrape_interval
overrides in the job blocks so they inherit the global 15s; locate the global
"scrape_interval" and any "scrape_interval" entries inside job definitions and
apply one consistent choice.
backend/modules/api/src/ai.rs (1)

31-34: Counter increments before validation — confirm this is intended.

increment_ai_requests("suggestion"|"analysis") runs before payload.0.validate(), so malformed requests rejected with 400 are still counted as AI requests. If the dashboard intent is "engine invocations" or "successful AI work", this will over-report; if it's "total inbound AI requests", it's fine. Consider either documenting the semantics or moving the increment after validation (or adding an outcome label like status="ok"|"validation_failed"|"engine_error").

Also applies to: 96-99

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/modules/api/src/ai.rs` around lines 31 - 34, The counter is
incremented before request validation, causing malformed requests to be counted;
move the call to increment_ai_requests("suggestion") so it runs after
payload.0.validate() succeeds (and do the same for the analysis handler around
lines 96-99), or alternatively augment increment_ai_requests to accept a status
label (e.g., increment_ai_requests("suggestion", "ok" | "validation_failed" |
"engine_error")) and call it with the appropriate outcome; locate the call sites
in get_ai_suggestion and the corresponding get_ai_analysis handler and update
them accordingly.
backend/README.md (1)

346-352: Recommend warning users to change the default Grafana password.

The docs publish admin / admin as the production-ready credentials. Add a short callout that this is dev-only and must be overridden via GF_SECURITY_ADMIN_PASSWORD (and the corresponding env-driven update in docker-compose.yml) before any non-local deployment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/README.md` around lines 346 - 352, Add a short security callout to
the "Accessing Dashboards" section warning that the Grafana default credentials
shown (admin/admin) are for development only and must be changed before
non-local deployments; instruct the maintainer to set GF_SECURITY_ADMIN_PASSWORD
to a secure value and ensure the corresponding environment override is applied
in docker-compose.yml (update the Grafana service env block) so production
instances do not use the default password.
backend/modules/api/src/games.rs (1)

65-82: Active-games gauge can drift on non-graceful exits.

active_games is incremented in create_game and decremented only via abandon_game/complete_game. Games that end through other paths — stale waiting games, inactivity timeouts, server restarts mid-game, future cleanup jobs — will leave the gauge permanently inflated. Consider:

  • Periodically reconciling the gauge from the DB (SELECT COUNT(*) WHERE status IN ('waiting','in_progress')) and using set() instead of relying solely on inc/dec, or
  • Centralizing increment/decrement in GameService so any future status-change path stays in sync.

Also applies to: 349-372, 554-573

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/modules/api/src/games.rs` around lines 65 - 82, The active-games
gauge is being incremented directly in the handler after
GameService::create_game via increment_active_games() but only decremented in
abandon_game/complete_game, causing drift on non-graceful exits; fix by moving
all gauge updates into GameService (e.g., inside GameService::create_game,
GameService::abandon_game, GameService::complete_game and any other
status-change methods) so status transitions always update the metric in one
place, and implement a periodic reconciliation function that queries the DB
count of games with status IN ('waiting','in_progress') and calls the
gauge.set(...) to correct drift instead of relying solely on inc/dec.
backend/modules/api/src/metrics.rs (1)

4-4: Unused import web.

actix_web::web is imported but never used in this file — only HttpResponse is referenced (in metrics_handler). Will trigger an unused_imports warning.

-use actix_web::{HttpResponse, web};
+use actix_web::HttpResponse;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/modules/api/src/metrics.rs` at line 4, The import line brings in
actix_web::web but it's unused (only HttpResponse is used by metrics_handler);
remove the unused symbol by changing the use statement to import only
HttpResponse (i.e., drop `web`) or otherwise use the `web` symbol where
intended—update the `use actix_web::{HttpResponse, web};` to `use
actix_web::HttpResponse;` so the unused_imports warning is resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/.env.example`:
- Around line 41-47: The .env.example exposes METRICS_ENABLED, METRICS_PATH, and
HTTP_METRICS_PATH but those vars are unused; either remove them from
.env.example or wire them into the metrics setup: read METRICS_ENABLED (parse
bool) and only register metrics when true, use HTTP_METRICS_PATH to set
PrometheusMetricsBuilder::new("xlmate_http").endpoint(...) instead of the
hardcoded "/metrics/http", and use METRICS_PATH to register the route path
instead of the hardcoded .route("/metrics", web::get().to(metrics_handler)). Use
std::env::var (or your config loader) to fetch these vars, provide sensible
defaults if missing, and ensure conditional registration logic and endpoint
strings are updated in the server metrics initialization and route registration
code.

In `@backend/modules/api/src/metrics_tests.rs`:
- Around line 1-186: init_metrics() is not idempotent causing duplicate
Prometheus registrations and OnceCell set errors; change init_metrics (and
Metrics::new if needed) to only register metrics once and return a shared
instance via GLOBAL_METRICS.get_or_init/OnceCell::get_or_init (e.g., store
Arc<Metrics>), handle duplicate registration errors by ignoring
AlreadyRegistered when reusing the global registry, and update
test_concurrent_metric_updates to snapshot initial values (active_games.get(),
ai_requests_total.get()) before spawning threads and assert that final - initial
== 1000.0 instead of asserting absolute equals; reference symbols: init_metrics,
Metrics::new, REGISTRY, GLOBAL_METRICS, test_concurrent_metric_updates.

In `@backend/modules/api/src/metrics.rs`:
- Around line 84-114: The init_metrics flow currently calls Metrics::new() and
GLOBAL_METRICS.set(...), causing panics on a second call because
REGISTRY.register rejects duplicate metrics and OnceCell::set panics if already
set; change init_metrics to use GLOBAL_METRICS.get_or_init(||
Arc::new(Metrics::new())) and return a clone of the stored Arc so repeated calls
are no-ops, and ensure Metrics::new remains the single place that registers
metrics with REGISTRY (so registration only happens once via the get_or_init
path). Reference symbols: init_metrics, Metrics::new, GLOBAL_METRICS, and
REGISTRY.register.
- Around line 62-81: The three plain Counter instances (ai_requests_total,
auth_events_total, game_events_total) must become labeled CounterVecs so the
helper functions that accept request_type: &str, event_type: &str, and success:
bool actually emit labels; replace Counter::with_opts(...) with
CounterVec::new(Opts::new(...), &["request_type","success"]) for
ai_requests_total (or &["request_type"]/&["request_type","success"] as
appropriate), &["event_type","success"] for auth_events_total, and
&["event_type"] for game_events_total, then update the helper functions that
currently call .inc() to call .with_label_values(&[...]) using the corresponding
stringified success ("true"/"false") and type args before .inc(), ensuring the
metric names (ai_requests_total, auth_events_total, game_events_total) and Opts
descriptions remain the same.

In `@backend/modules/api/src/server.rs`:
- Line 179: The /metrics route is currently exposed publicly via
.route("/metrics", web::get().to(metrics_handler)); to fix this either (A)
register the metrics route on a separate internal listener instead of the public
App, (B) wrap the metrics handler with authentication middleware (e.g.,
JwtAuthMiddleware or a basic-auth middleware) before registering it so requests
to metrics_handler require valid credentials, or (C) add request-source
filtering middleware that checks the remote IP allowlist and returns 403 if not
allowed; update server startup docs to document the chosen restriction method
and ensure Prometheus is configured to use the corresponding credentials/network
path.
- Around line 118-124: PrometheusMetricsBuilder::build() is being called inside
the per-worker app_factory, creating a separate Prometheus registry per worker;
move the build() call so the PrometheusMetrics instance is constructed once
before HttpServer::new(), store it in a local variable named prometheus, then
clone that prometheus into the app_factory closure (use prometheus.clone()
inside the closure) and wrap the App with that cloned instance so all workers
share the same registry and metrics are aggregated across workers.

In `@backend/monitoring/grafana/provisioning/dashboards/xlmate-dashboard.json`:
- Around line 388-389: The dashboard legend is empty because
xlmate_game_events_total is a plain Counter without labels; update metrics.rs to
use a CounterVec for xlmate_game_events_total with an "event_type" label (and do
the same for xlmate_ai_requests_total if you want per-type breakdown), register
the CounterVec, and replace all increments of xlmate_game_events_total.inc()
with the label-aware calls (e.g., using with_label_values or with_label to
increment the appropriate "created"/"completed"/"abandoned" value); ensure the
CounterVec is initialized before use and that existing metric names remain
unchanged so Prometheus picks up the labelged metric.
- Around line 12-14: The dashboard panels currently reference the datasource
object with "uid": "default" which won't match the provisioned Prometheus
instance; either add uid: "default" to the Prometheus provisioning YAML (so the
provisioned datasource UID is stable) or change the panels in
xlmate-dashboard.json to reference the datasource by name ("Prometheus") instead
of the object with uid; update all occurrences of the datasource object in the
dashboard (look for "datasource": { "type": "prometheus", "uid": "default" }) or
add the uid: default entry to the prometheus provisioning config so the
references resolve.

In `@backend/monitoring/grafana/provisioning/datasources/prometheus.yml`:
- Around line 4-9: Add an explicit uid to the Grafana Prometheus datasource so
dashboard references using UID "default" resolve correctly: update the
datasource configuration (the block with name: Prometheus, type: prometheus,
access: proxy, url: http://prometheus:9090, isDefault: true, editable: true) to
include uid: default so the provisioned datasource UID matches the dashboards'
references.

In `@backend/monitoring/prometheus.yml`:
- Line 8: Prometheus uses host.docker.internal in its scrape targets which fails
on Linux; update the Prometheus service definition in docker-compose.yml (the
prometheus service) to include an extra_hosts entry mapping host.docker.internal
to host-gateway: add extra_hosts: - "host.docker.internal:host-gateway" so the
container can resolve that hostname on Linux, and keep the targets in
prometheus.yml unchanged for Docker Desktop compatibility.

In `@docker-compose.yml`:
- Around line 67-69: The docker-compose environment currently hardcodes
GF_SECURITY_ADMIN_PASSWORD=admin; change it to require an external value (e.g.
use environment variable substitution like ${GRAFANA_ADMIN_PASSWORD:?must be
set} for GF_SECURITY_ADMIN_PASSWORD) so operators must supply a password, and
add a clear comment next to GF_SECURITY_ADMIN_PASSWORD explaining it must be
overridden in any non-local environment (and update README or docs to instruct
setting GRAFANA_ADMIN_PASSWORD); keep GF_USERS_ALLOW_SIGN_UP as-is unless you
want different default behavior.

---

Outside diff comments:
In `@backend/modules/api/src/auth.rs`:
- Around line 28-107: Validation and token-generation failures are not being
recorded by the auth metrics, so update the handlers to call
increment_auth_events with false on all failure paths: in the register function,
call increment_auth_events("register", false) inside the validation error branch
(where payload.validate() fails); in the login function, call
increment_auth_events("login", false) inside the login validation error branch
and inside the refresh-token failure branch (the Err(e) arm of
TokenService::generate_refresh_token), and ensure you still call
increment_auth_events("login", true) after successful token creation (after
jwt_service.generate_token and refresh token succeed) so both success and all
failure paths are tracked.

In `@backend/modules/api/src/games.rs`:
- Around line 607-609: The file contains an extra closing brace that causes a
compile error; remove the stray `}` that follows the end of the complete_game
function so the function (complete_game) and surrounding module braces are
balanced — locate the end of the complete_game function (opened at Line ~493,
closed at the match end around Line ~607) and delete the extra closing brace
after that match to restore correct brace pairing.

In `@docker-compose.yml`:
- Around line 37-75: The compose file places the prometheus and grafana service
blocks under the top-level volumes: section instead of services:, so Docker
Compose treats them as volume entries; move the prometheus and grafana blocks
out of the volumes: mapping and place them as entries under the existing
services: key (preserving their image, container_name, ports, environment,
volumes, command, restart, depends_on, etc.), ensure named volumes
prometheus_data and grafana_data remain defined under volumes:, and fix
indentation so prometheus and grafana are sibling entries of other services (not
nested under volumes:).

---

Nitpick comments:
In `@backend/modules/api/src/ai.rs`:
- Around line 31-34: The counter is incremented before request validation,
causing malformed requests to be counted; move the call to
increment_ai_requests("suggestion") so it runs after payload.0.validate()
succeeds (and do the same for the analysis handler around lines 96-99), or
alternatively augment increment_ai_requests to accept a status label (e.g.,
increment_ai_requests("suggestion", "ok" | "validation_failed" |
"engine_error")) and call it with the appropriate outcome; locate the call sites
in get_ai_suggestion and the corresponding get_ai_analysis handler and update
them accordingly.

In `@backend/modules/api/src/games.rs`:
- Around line 65-82: The active-games gauge is being incremented directly in the
handler after GameService::create_game via increment_active_games() but only
decremented in abandon_game/complete_game, causing drift on non-graceful exits;
fix by moving all gauge updates into GameService (e.g., inside
GameService::create_game, GameService::abandon_game, GameService::complete_game
and any other status-change methods) so status transitions always update the
metric in one place, and implement a periodic reconciliation function that
queries the DB count of games with status IN ('waiting','in_progress') and calls
the gauge.set(...) to correct drift instead of relying solely on inc/dec.

In `@backend/modules/api/src/metrics.rs`:
- Line 4: The import line brings in actix_web::web but it's unused (only
HttpResponse is used by metrics_handler); remove the unused symbol by changing
the use statement to import only HttpResponse (i.e., drop `web`) or otherwise
use the `web` symbol where intended—update the `use actix_web::{HttpResponse,
web};` to `use actix_web::HttpResponse;` so the unused_imports warning is
resolved.

In `@backend/modules/api/src/ws.rs`:
- Line 13: The import statement in ws.rs includes an unused symbol `Metrics`;
remove `Metrics` from the use declaration so only `increment_ws_connections` and
`decrement_ws_connections` are imported (i.e., update the use crate::metrics
line to import just those two functions) to eliminate the unused_imports
warning.

In `@backend/monitoring/prometheus.yml`:
- Around line 1-3: The global scrape_interval (global.scrape_interval = 15s) is
being overridden by job-level scrape_interval values (per-job scrape_interval =
10s), so update the configuration to make intent clear: either set
global.scrape_interval to 10s (and keep evaluation_interval consistent) or
remove the per-job scrape_interval overrides in the job blocks so they inherit
the global 15s; locate the global "scrape_interval" and any "scrape_interval"
entries inside job definitions and apply one consistent choice.

In `@backend/README.md`:
- Around line 346-352: Add a short security callout to the "Accessing
Dashboards" section warning that the Grafana default credentials shown
(admin/admin) are for development only and must be changed before non-local
deployments; instruct the maintainer to set GF_SECURITY_ADMIN_PASSWORD to a
secure value and ensure the corresponding environment override is applied in
docker-compose.yml (update the Grafana service env block) so production
instances do not use the default password.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e339f223-5e1f-400f-abb3-8c123887ebb8

📥 Commits

Reviewing files that changed from the base of the PR and between 382408d and e994f9e.

📒 Files selected for processing (16)
  • backend/.env.example
  • backend/README.md
  • backend/modules/api/Cargo.toml
  • backend/modules/api/src/ai.rs
  • backend/modules/api/src/auth.rs
  • backend/modules/api/src/games.rs
  • backend/modules/api/src/lib.rs
  • backend/modules/api/src/metrics.rs
  • backend/modules/api/src/metrics_tests.rs
  • backend/modules/api/src/server.rs
  • backend/modules/api/src/ws.rs
  • backend/monitoring/grafana/provisioning/dashboards/dashboard.yml
  • backend/monitoring/grafana/provisioning/dashboards/xlmate-dashboard.json
  • backend/monitoring/grafana/provisioning/datasources/prometheus.yml
  • backend/monitoring/prometheus.yml
  • docker-compose.yml

Comment thread backend/.env.example
Comment thread backend/modules/api/src/metrics_tests.rs
Comment thread backend/modules/api/src/metrics.rs Outdated
Comment thread backend/modules/api/src/metrics.rs Outdated
Comment thread backend/modules/api/src/server.rs Outdated
Comment thread backend/monitoring/grafana/provisioning/datasources/prometheus.yml
Comment thread backend/monitoring/prometheus.yml
Comment thread docker-compose.yml
@gabito1451

Copy link
Copy Markdown
Collaborator

@Godfrey-Delight please look in to ci

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Backend: Prometheus and Grafana Metrics Integration

2 participants