
feat: replace Prometheus monitoring with Netdata + ntfy.sh alerts #1

Merged
mpasternak merged 33 commits into main from feat/netdata-monitoring on May 31, 2026

Conversation

@mpasternak
Member

Summary

  • Replace prometheus + node-exporter + postgres-exporter (3 containers, ~700MB RAM, zero preconfigured alerts) with a single netdata agent (1 container, ~200MB RAM, hundreds of built-in alerts, 1s resolution)
  • Alerts push directly to phone via ntfy.sh (random per-deployment topic stored in .env as NTFY_TOPIC — no Slack, no email, no PagerDuty)
  • Keep loki + alloy + grafana untouched as the log search stack (180d retention for nginx access log, 90d for app, etc.)

What's added

  • defaults/netdata/{netdata.conf, go.d/postgres.conf, health_alarm_notify.conf, health.d/}
  • Nginx location /netdata/ behind existing authserver (regex with named capture — handles subpath proxying correctly)
  • init-configs.sh migration: auto-generates NTFY_TOPIC (random 32-hex) for existing deployments, prints subscribe URL once
  • mk/monitoring.mk with make ntfy-test, health-netdata, logs-netdata, netdata-shell, grant-pg-monitor
  • scripts/grant-pg-monitor.sh — auto-detects internal vs external dbserver mode

What's removed

  • Three monitoring services + prometheus_data volume
  • 4 Grafana dashboards that depended on Prometheus (disk-usage, http-performance, errors, postgresql-health) — Netdata has equivalent built-ins
  • Prometheus datasource in Grafana provisioning (Loki promoted to default)
  • DJANGO_BPP_ENABLE_PROMETHEUS default flipped to false (django-prometheus middleware was pure overhead)
  • macOS local_overrides.yml (only purpose was disabling node-exporter)

defaults/prometheus/ directory kept as historical artifact (delete in follow-up if no rollback needed).

Backwards compatibility

  • Old .env files without NTFY_TOPIC parse cleanly (${NTFY_TOPIC:-} default in compose)
  • make init-configs migrates existing deployments (idempotent — won't regenerate topic and break phone subscriptions)
  • Stale PROMETHEUS_* / NODE_EXPORTER_* / PG_EXPORTER_* env vars are harmless (Compose ignores unreferenced vars)
  • prometheus_data Docker volume becomes orphan after deploy — cleaned by make prune-orphan-volumes

Test plan

  • Checkout branch on test host: git checkout feat/netdata-monitoring
  • make init-configs — verify NTFY_TOPIC appears in .env, subscribe URL printed
  • Install ntfy app on phone, subscribe to the printed URL
  • make refresh — verify netdata pulls and starts; prometheus/exporter containers gone
  • make ntfy-test — confirm push notification arrives on phone
  • make grant-pg-monitor — confirm idempotent (run twice)
  • Open https://<host>/netdata/ — verify dashboard loads through authserver
  • make health-netdata — confirm agent reports healthy
  • Verify Postgres collector shows pg_stat_* data (no permission errors)
  • Verify Loki + Grafana log search still works (LogQL via /grafana/)
  • 24h monitoring: check for OOM events, alert flapping, missed metrics
  • After 24h: make prune-orphan-volumes to remove prometheus_data

Plan & history

Full implementation plan: docs/superpowers/plans/2026-05-31-netdata-monitoring.md (in branch)

Initial 13 commits (of the 33 eventually merged), organized as: plan → Phase 1 (additive, 8 commits) → Phase 2 (removal, 2 commits) → docs polish (2 commits).

🤖 Generated with Claude Code

mpasternak and others added 30 commits May 31, 2026 11:06
10-task phased plan (Phase 1: additive, Phase 2: removal). Replaces
prometheus + node-exporter + postgres-exporter with one netdata agent;
keeps Loki + Alloy + Grafana for logs. Alerts go to public ntfy.sh
with random topic stored in .env.

Phase 1 leaves the deployment in a working dual-stack state so the
user can validate Netdata for 24h before Phase 2 removes the old
Prometheus services.
Adds netdata agent (v1.99.0) with full host visibility, Docker socket
for container auto-discovery, named volumes for persistent state and
resource limits (256m/1.0 default). Service is added but not yet
started in this commit - configs come in subsequent tasks.
No other service in the project sets container_name explicitly - all
rely on Compose's default ${COMPOSE_PROJECT_NAME}_<service>_1 naming.
Forcing 'container_name: netdata' would break multi-stack hosts (dev +
prod on one machine) with 'container name already in use' errors.

Netdata's node-label-in-dashboard is set by 'hostname:' not
'container_name:' - that line stays.
netdata.conf disables registry + binds 0.0.0.0:19999 (reverse-proxied
via nginx /netdata/ - not exposed on host). postgres.conf builds DSN
from ${PG_*} env vars (works for both internal and external DB modes).
health_alarm_notify.conf is shell-sourced override that enables only
ntfy channel and routes all roles to ${NTFY_SERVER}/${NTFY_TOPIC}.
ensure-config-files.sh now recursively copies defaults/netdata/ to
BPP_CONFIGS_DIR/netdata/ (copy_if_missing - non-destructive).
init-configs.sh generates random NTFY_TOPIC (openssl rand -hex 16)
for existing deployments missing the var, and ensures
DJANGO_BPP_NTFY_SERVER defaults to https://ntfy.sh. Topic is a
secret (anyone with the URL reads alerts), so it's logged once
during the migration with a 'subscribe in app' hint.
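
A minimal sketch of the migration step described above, assuming a plain KEY=VALUE .env file. The function name `ensure_ntfy_topic` and the `od` fallback are illustrative assumptions, not the actual init-configs.sh code; only `openssl rand -hex 16` and the ntfy.sh default come from the commit message.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): add NTFY_TOPIC to an .env file
# only when the key is absent, so re-running never changes an existing topic.
ensure_ntfy_topic() {
    env_file=$1
    touch "$env_file" || return 1
    # Key already present: leave it alone (idempotent).
    if grep -q '^NTFY_TOPIC=' "$env_file"; then
        return 0
    fi
    # 16 random bytes -> 32 hex characters. The od fallback is an assumption
    # for hosts without openssl.
    topic=$(openssl rand -hex 16 2>/dev/null || od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
    [ -n "$topic" ] || return 1
    # Start on a fresh line if the file does not end with a newline.
    [ -z "$(tail -c 1 "$env_file")" ] || printf '\n' >> "$env_file"
    printf 'NTFY_TOPIC=%s\n' "$topic" >> "$env_file"
    # Shown once, at generation time only: the topic is effectively a secret.
    printf 'Subscribe in the ntfy app: https://ntfy.sh/%s\n' "$topic"
}
```

Calling `ensure_ntfy_topic .env` a second time prints nothing and leaves the file unchanged, which is the property that keeps existing phone subscriptions working.
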
Same pattern as /grafana/ - auth_request to /_bpp_superuser_auth gates
access, trailing-slash proxy_pass strips the /netdata/ prefix.
WebSocket headers enabled (Netdata uses WS for live charts), buffering
disabled for stream-style data.
make ntfy-test       - sends test push to NTFY_TOPIC from .env
                       (confirms phone subscription works)
make health-netdata  - curl /api/v1/info via nginx and direct
make logs-netdata    - tail netdata container logs
make netdata-shell   - exec bash in netdata container
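
A rough sketch of what a target like `make ntfy-test` could run, assuming NTFY_SERVER and NTFY_TOPIC have already been read from .env into the environment. The function name `ntfy_send` and the notification title are hypothetical; the only ntfy behaviour relied on is that a POST body sent to `<server>/<topic>` becomes the notification text.

```shell
#!/bin/sh
# Hypothetical helper (name and title text are assumptions): push one test
# message to the configured ntfy topic without ever printing the topic.
ntfy_send() {
    message=$1
    if [ -z "${NTFY_TOPIC:-}" ]; then
        echo "NTFY_TOPIC is not set" >&2
        return 1
    fi
    server=${NTFY_SERVER:-https://ntfy.sh}
    # -f: fail on HTTP errors; -sS: quiet, but still show errors.
    curl -fsS -o /dev/null \
        -H "Title: Monitoring test" \
        -d "$message" \
        "$server/$NTFY_TOPIC" || return 1
    # Deliberately no topic in the output: shell history, CI logs and tee
    # would otherwise leak the secret.
    echo "Test notification sent"
}
```

Example: `NTFY_TOPIC=... ntfy_send "test alert from $(hostname)"`.
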
scripts/grant-pg-monitor.sh detects internal vs external DB mode.
Internal: execs psql in dbserver and runs GRANT pg_monitor.
External: prints the SQL for the DBA to run manually.
Idempotent - GRANT can be re-run safely.
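
The internal/external split could look roughly like this. It is a sketch under stated assumptions, not the real scripts/grant-pg-monitor.sh: external mode is assumed to be recognisable from a compose file name containing "external" (such as docker-compose.database.external.yml), and the superuser is assumed to be `postgres`.

```shell
#!/bin/sh
# Hypothetical sketch: choose how to apply the grant based on the database
# compose file in use. The detection rule and superuser name are assumptions.
grant_pg_monitor() {
    compose_file=$1
    role=$2
    sql="GRANT pg_monitor TO \"$role\";"
    case $compose_file in
        *external*)
            # External DB: we cannot exec into the server, so print the SQL.
            printf 'Run as a superuser on the external database:\n%s\n' "$sql"
            ;;
        *)
            # Internal DB: run psql inside the dbserver container.
            docker compose exec -T dbserver \
                psql -U postgres -v ON_ERROR_STOP=1 -c "$sql"
            ;;
    esac
}
```

Re-running is safe in both branches because granting a role membership that already exists is not an error in PostgreSQL.
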
Three issues caught by cross-task Phase 1 review:

- nginx /netdata/ used trailing-slash proxy_pass, stripping the URI
  prefix - Netdata then generated /static/* asset URLs that browser
  resolved to root (Django 404). Switch to regex location with named
  capture, preserve prefix, add X-Forwarded-Url for autodetection.
  Also add /netdata -> /netdata/ redirect for typed URLs without slash.

- make health-netdata curled via nginx, hit auth_request, got 302
  redirect, displayed as 'HTTP 302' looking like failure. Drop the
  nginx hop - container-direct check is the meaningful signal.

- ensure-config-files.sh recursive copy was catching .gitkeep and
  leaking it to user's configs dir. Exclude with -not -name.
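
For illustration, a non-destructive recursive seed that skips .gitkeep might be written as below. The function name `seed_configs` is hypothetical and the destination is assumed to be an absolute path; only the `-not -name` exclusion and the copy-if-missing behaviour come from the commit messages.

```shell
#!/bin/sh
# Hypothetical sketch (function name is an assumption): copy every file from a
# defaults tree into a config tree unless the target already exists, and never
# copy .gitkeep placeholders. DST must be an absolute path.
seed_configs() {
    src=$1
    dst=$2
    (
        # Subshell so the cd does not leak into the caller.
        cd "$src" || exit 1
        find . -type f -not -name .gitkeep | while IFS= read -r rel; do
            target=$dst/${rel#./}
            [ -e "$target" ] && continue    # existing user file wins
            mkdir -p "$(dirname "$target")"
            cp "$rel" "$target"
        done
    )
}
```
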
ntfy-test, health-netdata, logs-netdata, netdata-shell,
grant-pg-monitor - all referenced in CLAUDE.md as the make-target
source of truth but missing from 'make help' until now.
…porter

Netdata (added in Phase 1) replaces all three: host metrics
(node-exporter), Postgres stats (postgres-exporter via go.d/postgres),
and time-series storage (prometheus). One container, ~200MB RAM,
preconfigured alerts, push to ntfy.sh - vs three containers and
zero alerts.

Changes:
- docker-compose.monitoring.yml: remove 3 services + prometheus_data volume
- defaults/grafana/provisioning/datasources/datasources.yaml.tpl:
  remove Prometheus datasource, promote Loki to default
- defaults/grafana/provisioning/dashboards/: delete disk-usage.json,
  http-performance.json, errors.json (all 3 referenced Prometheus -
  Netdata has equivalent built-in dashboards)
- scripts/ensure-config-files.sh: drop prometheus seeding
- scripts/configure-resources.sh: drop prometheus tunable, add netdata
- scripts/upgrade-postgres.sh: drop postgres-exporter stop/restart lines
- scripts/init-configs.sh + defaults/docker-compose.local_overrides.yml:
  drop node-exporter macOS override (entire file - no other content),
  also drop the include in docker-compose.yml and the .gitignore entry
- docker-compose.database.external.yml: drop postgres-exporter from
  explanatory comments
- docker-compose.yml: update volume-list comment

defaults/prometheus/ kept as historical artifact - delete later if
no rollback path needed.

Existing deployments: prometheus_data volume becomes orphan, will be
cleaned by 'make prune-orphan-volumes'. PROMETHEUS_*, NODE_EXPORTER_*,
PG_EXPORTER_* env vars in user .env files are harmless (Docker Compose
ignores unreferenced vars). No migration needed.
- Delete defaults/grafana/provisioning/dashboards/postgresql-health.json
  (referenced the removed Prometheus datasource in 64 places; Netdata
  Postgres collector dashboards cover the same metrics at /netdata/).
- Flip DJANGO_BPP_ENABLE_PROMETHEUS default from true to false in
  docker-compose.application.yml. Nothing scrapes /metrics anymore,
  django-prometheus middleware is pure overhead. Existing deployments
  that set the var in .env keep their value (backwards compat).
Reflects the architectural change: netdata replaces prometheus +
node-exporter + postgres-exporter (Phase 2 commit 16ae9c1 / 8cf921f).
Loki + Alloy + Grafana stay for logs.

Updated sections:
- Architecture Overview > Services > Monitoring
- Architecture Overview > Data Flow (metrics path)
- Monitoring Access (URLs + new ntfy info)
- Logging (drop Prometheus retention, add Netdata tiered retention)
- Make Targets (new ntfy/netdata commands)
- Resource Limits (drop prometheus/exporters, add netdata)
tests/test_makefile.sh asserted presence of prometheus dir + config,
which Phase 2 (commit 16ae9c1) removed. Switched assertions to the
new netdata structure: netdata.conf, go.d/postgres.conf,
health_alarm_notify.conf, plus health.d/ directory.

CI test_init_configs_creates_structure, test_init_configs_copies_templates,
and test_init_configs_no_overwrite now reflect post-migration state.
v1.99.0 did not exist on Docker Hub (planning oversight - I picked
a placeholder version without verifying). Netdata jumped from v1.47
straight to v2.0 - no v1.99 release line.

v2 split [global] into [global] + [db], so the dbengine directives
move:
- memory mode             -> [db] mode
- page cache size         -> [db] dbengine page cache size
- dbengine multihost disk space -> [db] dbengine tier 0 retention size

v2 still accepts the v1 directive names (it logs deprecation
warnings), but it is cleaner to use the current names upfront.
Sizing kept identical: 512MB tier 0 retention, 32MB page cache,
1s update interval.
NETDATA_DISABLE_CLOUD=1 turns off the agent-side Cloud client
entirely. Without it, Netdata v2's first-time dashboard pops a
'please connect your agent / docker exec netdata...' dialog even
though the user is already authed via authserver+nginx.

DO_NOT_TRACK and DISABLE_TELEMETRY (already set) only suppress
anonymous-stats phone-home, not the Cloud claim prompt - those are
separate code paths.
tests/test_makefile.sh Test 14e ran nginx container that created
letsencrypt cert files as root (container default user). Teardown's
bare rm failed on Ubuntu GHA (runner user != root) with 'Permission
denied' - the test itself passed all assertions, only cleanup
exit code was non-zero.

macOS Docker Desktop uses a user-mapped VM, so files appear as
runner-owned and rm works there. Ubuntu runs native Docker with
no user mapping. This pre-existing failure has been red on main
for every commit since the LE test was added.

Fix: try sudo rm first (GHA has passwordless sudo), fall back to
plain rm (no-op since files already gone, or harmless error
message on dev machine without sudo). Applied to all rm sites in
Test 14 (LE certs) and Test 15 (runtime ssl/) that touch dirs
populated by docker run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
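
The cleanup fallback can be sketched as follows. The helper name `cleanup_dir` and the use of `sudo -n` (fail instead of prompting) are assumptions; the commit only states "try sudo rm first, fall back to plain rm".

```shell
#!/bin/sh
# Hypothetical teardown helper (name and -n flag are assumptions): remove a
# directory that a container may have populated as root.
cleanup_dir() {
    dir=$1
    [ -n "$dir" ] || return 1    # never run rm -rf on an empty path
    # sudo -n fails immediately when a password would be required (or when
    # sudo is missing), in which case we fall back to an unprivileged rm.
    sudo -n rm -rf -- "$dir" 2>/dev/null || rm -rf -- "$dir"
}
```
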
Two bugs caught on the deployment host:

1) scripts/grant-pg-monitor.sh sourced .env as bash, breaking on
   shell-unfriendly values like 'EMAIL=Name <addr@domain>' (the < is
   parsed as redirect). Switched to grep-based extraction matching
   the get_env_var helper pattern from init-configs.sh.

2) defaults/netdata/go.d/postgres.conf used ${PG_*} env var
   placeholders, assuming go.d.plugin would substitute them. It does
   not (verified on v2.10.3): the literal string went through to
   the URL parser which choked on ':${PG_PORT}'. Reworked as a
   template (.tpl) rendered host-side by ensure-config-files.sh
   on every make up/refresh - so password changes in .env propagate
   on next deploy. Removed now-useless PG_* env vars from compose
   (NTFY_* stay - those ARE used because health_alarm_notify.conf
   is bash-sourced).

Auto-generated file lives at $BPP_CONFIGS_DIR/netdata/go.d/postgres.conf
with a clear DO-NOT-EDIT header.
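
To illustrate the first fix: reading a single key with grep never evaluates the value, so shell metacharacters are harmless. The body below is a guess at what a `get_env_var`-style helper could look like, not a copy of the one in init-configs.sh.

```shell
#!/bin/sh
# Hypothetical re-implementation (the real helper may differ): read one KEY
# from an .env file as plain text instead of sourcing the file as shell code.
get_env_var() {
    key=$1
    env_file=$2
    # Last KEY=... line wins; cut keeps everything after the first "=".
    # The value is never evaluated, so < > $ and spaces are safe.
    grep -E "^${key}=" "$env_file" | tail -n 1 | cut -d= -f2-
}
```

With a line such as `EMAIL=Name <addr@domain>` in .env, `get_env_var EMAIL .env` prints the value verbatim, whereas `. ./.env` would try to open `addr@domain` as an input redirect. Note that this sketch does not strip surrounding quotes.
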
Test 4 (init-configs copies templates) was asserting that
$CONFIG_DIR/netdata/go.d/postgres.conf exists after init-configs.
After the .tpl rendering refactor, that file is generated from
the .tpl by ensure-config-files.sh - but only when .env exists,
and init-configs.sh invoked ensure-config-files BEFORE creating
.env (intentionally - to seed the directory layout first).

Fix: invoke ensure-config-files.sh a second time at the end of
init-configs, after .env is fully populated. Idempotent - just
re-renders postgres.conf (and any other .env-dependent configs
we add later).
…_log to Loki

Adds nginx monitoring in Netdata and access_log in Grafana/Loki.

nginx (stub_status) -> netdata:
- default.conf.template: internal server { listen 8090; /stub_status }
  (port not published in compose, reachable only as netdata->webserver:8090)
- defaults/netdata/go.d/nginx.conf: collector for live connection metrics

access_log -> Loki + web_log:
- 00-log-format.conf: bpp_access format (combined + request_time/
  upstream_response_time/request_length), loaded in the http context
- vhost.conf.template: two access_log sinks in bpp_access format:
  /dev/stdout (-> Alloy -> Loki) AND a file on the nginx_access_log volume
- defaults/netdata/go.d/web_log.conf: collector for metrics from the access log
  (HTTP codes, latency) + 5xx/latency alerts; log_type auto + escape hatch
- infrastructure.yml: mount 00-log-format.conf, nginx_access_log volume,
  rotation script + Ofelia label (04:10)
- monitoring.yml: nginx_access_log mounted read-only into netdata
- scripts/nginx-access-log-rotate.sh: mv .1 + nginx -s reopen (the Docker log
  driver does not rotate files, only stdout/stderr)

Cleanup after the Prometheus->Netdata migration:
- datasources.yaml.tpl: deleteDatasources Prometheus (removes the dead
  datasource from grafana_data on upgraded installations)

Tests + docs:
- test_makefile.sh: assertions for nginx.conf/web_log.conf
- CLAUDE.md: sections on go.d collectors, nginx access log, data flow

Verified: nginx -t (real nginx:1.29.7 container), docker compose
config (merge of 7 files, volume resolves cross-file), make init-configs
(new go.d files are copied).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
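
The rotation step ("mv .1 + nginx -s reopen") can be sketched as below. The function name and the default log path are assumptions; the real scripts/nginx-access-log-rotate.sh may differ.

```shell
#!/bin/sh
# Hypothetical sketch (name and default path are assumptions): keep one
# previous generation of the access log and make nginx reopen its log files.
rotate_access_log() {
    log=${1:-/var/log/nginx-shared/access.log}
    [ -f "$log" ] || return 0    # nothing to rotate yet
    mv -f "$log" "$log.1"        # any previous .1 is overwritten
    # Until it reopens its logs, nginx keeps writing to the renamed file
    # through the descriptor it already holds.
    nginx -s reopen
}
```

The rename must come first: after `nginx -s reopen`, nginx creates a fresh file at the original path and the .1 file stops growing.
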
Two complementary channels for tracking slow PostgreSQL queries,
both reusing existing infrastructure (Loki + Grafana + PostgreSQL
datasource):

- log_min_duration_statement=1000: every query >1s logged to dbserver
  log -> Alloy -> Loki (90d retention). Grafana dashboard 'Slow
  queries (log)' renders via LogQL with regex extraction of duration
  and query text. Natural time-windowing via UI time picker.

- pg_stat_statements: aggregated stats per normalized query (calls,
  mean/total/stddev exec time, rows). Grafana dashboard 'Top 100
  queries (pg_stat_statements)' via existing PostgreSQL datasource.
  Manual pg_stat_statements_reset() for rolling time windows.

Bootstrap: make pg-monitoring-setup
- ALTER SYSTEM SET log_min_duration_statement = 1000 + reload
- Append pg_stat_statements to shared_preload_libraries (preserving
  existing libs), restart dbserver, CREATE EXTENSION
- Idempotent, detects external DB mode (prints SQL for DBA)
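
Appending to shared_preload_libraries while preserving existing entries comes down to careful list handling. A small sketch, with a hypothetical helper name; how the current value is read and written back (for example via SHOW and ALTER SYSTEM) is left out.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): add one library to a
# comma-separated shared_preload_libraries value, keeping existing entries
# and avoiding duplicates.
add_preload_lib() {
    current=$1
    lib=$2
    # Normalise "a, b" to "a,b" so membership checks are reliable.
    current=$(printf '%s' "$current" | tr -d ' ')
    case ",$current," in
        *",$lib,"*) printf '%s\n' "$current" ;;       # already present
        ",,")       printf '%s\n' "$lib" ;;           # list was empty
        *)          printf '%s\n' "$current,$lib" ;;  # append
    esac
}
```

Because an already-present library yields the unchanged list, the bootstrap can compare old and new values and skip the dbserver restart when nothing changed.
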
Commit 566b146 added defaults/webserver/00-log-format.conf (log_format
bpp_access ...) and wired it into production via
docker-compose.infrastructure.yml. The test helper _run_nginx_t builds
its own nginx container with explicit mounts and didn't propagate the
new file - nginx -t failed with 'unknown log format bpp_access' in
6 different test 14 / 15 variants.

Fix: mount the same file in the test container, matching the production
configuration. Also create and mount /var/log/nginx-shared/ so the
access_log file destination in vhost.conf.template can be opened.
Pure test plumbing - no production behavior change.
…erride

- Mount host root (/:/host/root:ro,rslave) so diskspace.plugin reports
  used/avail/% for ALL host partitions (df), not just container fs.
  No NETDATA_HOST_PREFIX needed — image knows the /host/root convention.
- Remove custom healthcheck: it called `wget --spider`, but the netdata
  image ships no wget (only curl/nc), so the container was ALWAYS
  reported unhealthy despite a working agent. Image's built-in
  HEALTHCHECK /usr/sbin/health.sh is correct and maintained upstream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…boards

Error Monitoring:
- data link on "Error Count Over Time": clicking a service's series sets
  var-service and filters the "Error Logs" panel
- "Error Logs" made taller (h 16 -> 24) + enableInfiniteScrolling

Top 100 queries (pg_stat_statements):
- companion bar chart "Top 15 by mean execution time"; clicking a bar
  sets the qid variable and narrows the table to that queryid
- the table honors $qid (empty = all 100); qid field at the top for resetting
- pg_stat_statements has no time axis, so the filter is by queryid (snapshot)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dashboard "PostgreSQL: Storage & tables" (uid postgresql-storage):
database size, largest tables/indexes (top 20), dead tuples & autovacuum,
estimated table and index bloat. Datasource grafana-postgresql-datasource.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ensure-config-files.sh: Grafana dashboards (grafana/provisioning/dashboards/*)
are now force-synced from defaults/ on every make up/refresh/run
(copy_always, overwriting only when the content differs). Previously
copy_if_missing skipped existing files, so an updated dashboard never reached
a live deployment without a manual cp. User-tunable configs
(loki/netdata/alloy) stay copy_if_missing.

Docs: CLAUDE.md + README describe the force-sync and the full set of dashboards
(Error Monitoring with the service cross-filter, companion bar chart +
click-to-filter on pg_stat_statements, Storage & tables).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
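
A force-sync helper in the spirit of copy_always could be as small as this. The body is an illustrative guess (only the name and the "overwrite only when content differs" behaviour come from the commit message).

```shell
#!/bin/sh
# Hypothetical sketch of a copy_always-style helper: overwrite the target only
# when its content differs from the source, so unchanged files keep their mtime.
copy_always() {
    src=$1
    dst=$2
    if [ -f "$dst" ] && cmp -s "$src" "$dst"; then
        return 0    # identical content: nothing to do
    fi
    mkdir -p "$(dirname "$dst")"
    cp "$src" "$dst" || return 1
    echo "updated $dst"
}
```

Skipping identical files keeps the deploy log quiet and avoids touching files that a watcher might otherwise reload.
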
…Host)

The @bpp_login redirect built its URL from $http_host. Under HTTP/3 (QUIC)
there is no Host: header, only the :authority pseudo-header, so $http_host is
EMPTY and the browser received a 302 to https:///__external_auth/login/?next=https:///...
(no domain). Firefox switched to h3 after Alt-Svc and hit the bug;
Safari (still on h2) worked. $host takes its value from :authority/Host/server_name,
so it is correct under every protocol. We enable h3 in vhost.conf.template
(listen 443 quic + Alt-Svc), so this is a real regression for every /grafana
/netdata /dozzle /flower request with an expired session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ce table

The dashboard also covers INFO/WARN, not just errors -> "Error Monitoring"
renamed to "Log Monitoring" (uid stays error-monitoring so as not to
orphan the provisioned dashboard or break bookmarks).

Top chart "Log volume by level over time": split by detected_level
(stacked bars, colors per level: error=red, warn=orange,
info=green, debug=blue). Clicking a level series sets var-level.

New table panel "By service (click to filter)": number of lines per service
in the time range; clicking a row sets var-service. The table is not filtered
by $service (it stays a full menu for switching) but respects
container/level. Bottom panel renamed to "Logs".

Result: filtering by service (click in the table) AND by level (click
on a chart series), plus visual distinction of levels on the chart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ighten netdata ACL

- ntfy-test: do not print the secret NTFY_TOPIC to stdout (history/CI/tee)
- health-netdata: curl instead of wget (the netdata image has no wget -> it always failed)
- netdata.conf: allow badges/streaming from = Docker networks + localhost instead of *
  (single agent, no parent/child; * let any container inject metrics)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/Netdata

The Grafana datasource connected as the APPLICATION user (RW on production), and
GF_USERS_AUTO_ASSIGN_ORG_ROLE=Admin + the SQL panel meant any logged-in user
could run arbitrary DML/DDL. Now there is a separate read-only role bpp_monitor
(pg_monitor + pg_read_all_data, no DDL/DML).

- scripts/create-monitoring-user.sh (NEW): idempotent CREATE/ALTER ROLE +
  grants. Internal: psql via docker exec as superuser, PGPASSWORD passed via -e
  (not in argv), ON_ERROR_STOP=1. External: prints the SQL. --soft: does not
  block make up when the DB is not up yet. Password validated against
  [A-Za-z0-9] (SQL literal).
- datasources.yaml.tpl + postgres.conf.tpl: connect as bpp_monitor (NO
  fallback to the Django user - the role must exist).
- ensure-config-files.sh: self-heals secrets (DJANGO_BPP_PG_MONITOR_PASSWORD,
  NTFY_TOPIC) append-only -> git pull && make up on an old .env works without
  manual steps. _esc now escapes backslashes. postgres.conf rendered
  atomically (tmp+mv) + chmod 600 (password in the DSN).
- pg-monitoring-setup.sh + grant-pg-monitor.sh: external mode detected via
  BPP_DATABASE_COMPOSE (not by the presence of the service - the sentinel is
  also named dbserver). PGPASSWORD + ON_ERROR_STOP. shared_preload_libraries
  validated before ALTER. grant-pg-monitor -> alias for create-monitoring-user.
- up/refresh: call create-monitoring-user.sh --soft (the role must exist).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
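
A sketch of the two ideas that matter most in a script like create-monitoring-user.sh: validate the password before it is placed in a SQL literal, and emit SQL that is safe to run repeatedly. Function names are hypothetical and the SQL is one possible formulation, not the project's. It grants only pg_monitor, matching the review fix in the next commit.

```shell
#!/bin/sh
# Hypothetical sketch; names and exact SQL are assumptions.

# Accept only non-empty [A-Za-z0-9] passwords, so the value can be embedded
# in a single-quoted SQL literal without escaping.
valid_password() {
    case $1 in
        '' | *[!A-Za-z0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Print idempotent SQL: create the role if missing, then (re)set the password
# and the grant. Suitable for "external" mode, where a DBA runs it manually.
monitor_role_sql() {
    role=$1
    password=$2
    valid_password "$password" || {
        echo "password must match [A-Za-z0-9]+" >&2
        return 1
    }
    cat <<EOF
DO \$\$
BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = '$role') THEN
    CREATE ROLE $role LOGIN;
  END IF;
END
\$\$;
ALTER ROLE $role WITH LOGIN PASSWORD '$password';
GRANT pg_monitor TO $role;
EOF
}
```

In internal mode the same SQL could be piped to psql with `-v ON_ERROR_STOP=1`, so a failed statement produces a non-zero exit instead of being silently skipped.
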
mpasternak and others added 3 commits May 31, 2026 22:14
…tpl + review fixes

Continues the Prometheus->Netdata migration; bundles in-flight datasource
work with code-review fixes on PR #1.

bpp_monitor (security):
- Drop pg_read_all_data; keep only pg_monitor. Grafana auto-promotes every
  authenticated user to Admin and exposes an ad-hoc SQL panel, so a data-read
  grant would let any Grafana user read employee PII. All shipped dashboards
  query stat-views / catalog / size functions only (verified) - pg_monitor
  suffices; the Netdata postgres collector needs only it too.
- pg-monitoring-setup external mode now also emits the bpp_monitor role SQL
  (was: printed slow-query SQL and exited before creating the monitor user).
- ensure-config-files: warn loudly when the postgres.conf render is skipped,
  so a stale pre-migration DSN (app superuser) cannot silently persist.

datasource / config rendering (in-flight):
- Force-sync datasources.yaml.tpl (copy_always) so upgrades pick up the
  bpp_monitor switch + deleteDatasources: Prometheus cleanup.
- Extract generate-grafana-datasources.sh (reads .env from disk, atomic render).
- _ensure_secret treats empty 'VAR=' as missing; default PG port 5432.
- NTFY_SERVER: $(or $(strip ...)) fallback for old .env; qid filter uses
  ${qid:sqlstring} (no ::bigint crash on non-numeric input).

cleanup:
- Remove `make health-netdata`: Netdata has a built-in image HEALTHCHECK, and
  the wrapper masked curl failure through the head pipe (always exited 0).
- Remove prometheus.yml + stale health-netdata / grant-pg-monitor doc refs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
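
The masking problem noted under "cleanup" is generic: in `cmd | head`, the pipeline's exit status is that of `head`, so a failing fetch still exits 0. One way to avoid it without relying on `pipefail` is to capture the output first. The helper below is a hypothetical illustration, not the removed make target.

```shell
#!/bin/sh
# Hypothetical sketch (name is an assumption): run a fetch command, fail if it
# fails, and only then show the first lines of its output.
check_health() {
    # "$@" is the fetch command, e.g.:
    #   curl -fsS http://localhost:19999/api/v1/info
    body=$("$@") || {
        echo "health check failed" >&2
        return 1
    }
    printf '%s\n' "$body" | head -n 5
}
```
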
README still described the old Prometheus stack. Sync it to what ships now:
- add /netdata/ to the monitoring access paths
- config-dir tree: drop prometheus/, add loki/ + netdata/ (go.d, health.d, ntfy)
- "Monitoring i logi": add `make logs-netdata` + `make ntfy-test`
- configure-resources high-risk list: prometheus -> netdata
- services table: replace prometheus row with netdata (metrics + ntfy push)
- server-move section: prometheus_data -> netdata_lib + netdata_cache volumes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ewhere

Test 7 wrote a custom marker into netdata.conf and asserted it survived
re-init (copy_if_missing). netdata.conf is now force-synced (rendered from
netdata.conf.tpl for the registry-announce URL), so the marker is overwritten
by design and the assertion failed in CI. Test preservation on
health_alarm_notify.conf instead, which stays copy_if_missing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mpasternak mpasternak merged commit 28b3515 into main May 31, 2026
5 checks passed
@mpasternak mpasternak deleted the feat/netdata-monitoring branch May 31, 2026 21:14

1 participant