feat(otel): OpenTelemetry traces, logs, drop counters, and OTEL metrics (Phase 1 + 2) by matthyx · Pull Request #818 · kubescape/node-agent

matthyx · 2026-05-19T19:23:37Z

Summary

Phase 1 — Traces, logs, drop counters

New `pkg/otelsetup` package: `InitProviders` wires up TracerProvider, LoggerProvider, and MeterProvider over OTLP gRPC; injects ARMO `X-API-Key` / `X-Customer-GUID` auth headers when the endpoint matches `otel.armosec.io`; returns no-op providers when no endpoint is configured
Container profile lifecycle tracing: `ProfileLifecycleTracker` maintains one long-running span per container learning period (bounded at 10k entries with LRU eviction), recording `profile.entry.saved`, `learning.completed`, `learning.terminated`, and eviction events
Alert log records: `EmitAlertLogRecord` emits structured OTEL log records for every fired rule and malware detection; includes 60s/1000-entry dedup LRU to avoid flooding on hot rules
eBPF drop counters: `node_agent.ebpf.events_dropped.total` incremented in container watcher and event handler factory drop paths, labelled by `reason`
Slow-eval spans: rule evaluations exceeding `OTEL_SLOW_EVAL_THRESHOLD_MS` emit a `rule.evaluate` span
Ring-buffer log processor: 7500-entry ring buffer retains recent log records; flush endpoint activates automatically when `KS_LOGGER_LEVEL=debug`
sbommanager: attaches `otelgrpc.NewClientHandler()` for automatic trace propagation

Phase 2 — Replace Prometheus metrics with OTEL SDK

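New `pkg/metricsmanager/otel/`: full `MetricsManager` interface backed by OTEL SDK; attribute-set caching on all hot paths (about 2× faster and 10× less memory than Prometheus on the histogram path; the `ReportRuleEvaluationTime` A/B benchmark reports ~95 ns/op and 32 B/op vs ~200 ns/op and 336 B/op)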
Collapsed eBPF counters: 17 individual per-event-type counters → single `node_agent.ebpf.events.total{event_type}`
Prometheus scrape mode: `OTEL_METRICS_EXPORTER=prometheus` installs an OTEL→Prometheus bridge and starts `:8080/metrics` listener
`rule.ID` standardisation: all metric call sites now use the stable rule ID (e.g. `R1001`) instead of the display name; malware alerts use constant `"malware"` to bound cardinality
`docs/metrics-migration.md`: full mapping of old Prometheus names → new OTEL names with dashboard update checklist
A/B benchmarks: hard gate passes — OTEL allocs/op ≤ Prometheus allocs/op, ns/op ≤ 1.1× Prometheus on `BenchmarkReportRuleEvaluationTime`
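
To illustrate the attribute-set caching idea, here is a rough standard-library sketch: the label set for a rule ID is built once and reused on later calls. It does not use the OTEL SDK; `attrSet`, `attrCache`, `forRule`, and the `rule_id` label name are hypothetical. Because call sites use stable rule IDs (and the constant `"malware"`), the number of cached entries stays bounded.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// attrSet is a hypothetical stand-in for a precomputed attribute set.
type attrSet struct {
	pairs [][2]string
}

// attrCache memoises one attrSet per rule ID.
type attrCache struct {
	sets   sync.Map     // rule ID (string) -> *attrSet
	builds atomic.Int64 // how many sets were constructed, for illustration only
}

// forRule returns the cached set for ruleID, building it on first use.
func (c *attrCache) forRule(ruleID string) *attrSet {
	if v, ok := c.sets.Load(ruleID); ok {
		return v.(*attrSet) // reuse the set built earlier
	}
	c.builds.Add(1)
	fresh := &attrSet{pairs: [][2]string{{"rule_id", ruleID}}}
	// LoadOrStore keeps a single winner if two goroutines race on first use.
	actual, _ := c.sets.LoadOrStore(ruleID, fresh)
	return actual.(*attrSet)
}

func main() {
	cache := &attrCache{}
	first := cache.forRule("R1001")
	second := cache.forRule("R1001")
	fmt.Println(first == second, cache.builds.Load()) // true 1
}
```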

New env vars

Variable	Default	Purpose
`OTEL_EXPORTER_OTLP_ENDPOINT`	—	Base OTLP gRPC endpoint
`OTEL_METRICS_EXPORTER`	—	Set to `prometheus` to enable scrape endpoint on `:8080/metrics`
`OTEL_SLOW_EVAL_THRESHOLD_MS`	0 (disabled)	Threshold for slow-eval spans
`OTEL_DEBUG_PORT`	6060	Debug listener port

`OTEL_COLLECTOR_SVC` is now deprecated (superseded by `OTEL_EXPORTER_OTLP_ENDPOINT`).

Breaking change

Metric names changed. See `docs/metrics-migration.md` for the full mapping and dashboard update checklist.

Test plan

`go build ./...` — passes
`go test ./pkg/otelsetup/... ./pkg/metricsmanager/...` — all pass
A/B benchmark: OTEL `ReportRuleEvaluationTime` ~95 ns/op / 32 B / 2 allocs vs Prometheus ~200 ns/op / 336 B / 2 allocs — gate passes
`ProfileLifecycleTracker` and `RingBufferLogProcessor` unit tests pass
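
The benchmark gate stated above (OTEL allocs/op ≤ Prometheus allocs/op, and OTEL ns/op ≤ 1.1× Prometheus ns/op) can be checked by hand against the reported numbers. The sketch below encodes that rule; `benchResult` and `gatePasses` are hypothetical names, and how the PR actually enforces the gate is not shown in this thread.

```go
package main

import "fmt"

// benchResult holds the two figures the gate compares.
type benchResult struct {
	nsPerOp     float64
	allocsPerOp int64
}

// gatePasses applies the rule from the description: the candidate must not
// allocate more than the baseline and may be at most 10% slower.
func gatePasses(candidate, baseline benchResult) bool {
	return candidate.allocsPerOp <= baseline.allocsPerOp &&
		candidate.nsPerOp <= 1.1*baseline.nsPerOp
}

func main() {
	otel := benchResult{nsPerOp: 95, allocsPerOp: 2}  // figures from the test plan
	prom := benchResult{nsPerOp: 200, allocsPerOp: 2} // figures from the test plan
	fmt.Println(gatePasses(otel, prom))               // true: 2 <= 2 and 95 <= 220
}
```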

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Provider-based OpenTelemetry init, OTEL-backed metrics manager replacing prior Prometheus path; expanded metrics (events, rules, SBOM, alerts), gRPC instrumentation, profile lifecycle spans, alert deduplication and suppression reporting
Documentation
- Expanded OTEL configuration reference, runtime notes, and Prometheus→OTEL migration guide
Tests
- New unit tests and benchmarks for OTEL, lifecycle tracking, thresholds, and metrics

coderabbitai · 2026-05-19T19:23:56Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the `reviews.auto_review.auto_pause_after_reviewed_commits` setting.

Use the following commands to manage reviews:

`@coderabbitai resume` to resume automatic reviews.
`@coderabbitai review` to trigger a single review.

Walkthrough

Replaces legacy OTEL wiring with provider-based InitProviders, adds an OTEL-backed metrics manager, implements ProfileLifecycleTracker, instruments rules/profiles/ebpf/malware/SBOM/gRPC, renames a metrics config flag, updates docs/go.mod, and adds tests and benchmarks.

Changes

OpenTelemetry Setup and Integration

Layer / File(s)	Summary
Provider init, wrapper, and alert logging `pkg/otelsetup/setup.go`, `pkg/otelsetup/otelsetup_test.go`	Adds ProviderConfig alias, SlowEvalThreshold backing, global Tracer/Logger/Meter accessors, InitProviders (including Prometheus meter mode), and EmitAlertLogRecord plus test scaffolding.
ProfileLifecycleTracker `pkg/otelsetup/lifecycle.go`, `pkg/otelsetup/otelsetup_test.go`	Implements ProfileLifecycleTracker with per-container learning spans, LearningSpanID/Traceparent accessors, OnEntrySaved snapshot throttling, OnLearningEnded, and cap eviction with unit tests.
Main startup and metrics wiring `cmd/main.go`, `cmd/sbom-scanner/main.go`	Replaces old OTEL init with `otelsetup.InitProviders`, builds ProviderConfig from config/cluster/credentials, defers shutdown with 5s timeout, creates `otelmetrics.NewOTELMetricsManager`, and threads metrics manager into components.
OTEL metrics manager and benchmarks `pkg/metricsmanager/otel/otel_metrics_manager.go`, `pkg/metricsmanager/otel/bench_test.go`, `pkg/metricsmanager/prometheus/bench_test.go`	Adds OTEL-backed `OTELMetricsManager` (instruments, caches, reporting API, Start/Destroy), SBOM and alert-suppression metrics, and benchmark suites for OTEL and Prometheus metric paths.
Container profile lifecycle wiring `pkg/containerprofilemanager/v1/containerprofile_manager.go`, `pkg/containerprofilemanager/v1/monitoring.go`	Adds lifecycleTracker field, initializes it, calls OnLearningStarted/OnEntrySaved/OnLearningEnded at lifecycle checkpoints and error paths, and annotates saved profiles with OTEL trace/span metadata.
eBPF dropped-event metrics `pkg/containerwatcher/v2/container_watcher.go`, `pkg/containerwatcher/v2/event_handler_factory.go`	Adds `node_agent.ebpf.events_dropped.total` Int64Counter via `otelsetup.Meter()` to ContainerWatcher and EventHandlerFactory, increments on dropped events with `event_type` and `reason` attributes.
RuleManager spans and alert dedupe `pkg/rulemanager/rule_manager.go`	Adds an expirable LRU + mutex to deduplicate SecurityAlert log records per (rule,container), emits `rule.evaluate` spans when evaluation exceeds SlowEvalThreshold, and switches some metrics to use `rule.ID`.
Malware alert OTEL log emission `pkg/malwaremanager/v1/malware_manager.go`	Emits structured SecurityAlert OTEL log records for malware alerts with rule/runtime/malware attributes and uses `malware` constant for metrics.
SBOM gRPC client/server instrumentation `pkg/sbommanager/v1/sbom_manager.go`, `cmd/sbom-scanner/main.go`, `pkg/sbomscanner/v1/client.go`, `pkg/sbomscanner/v1/server.go`	Adds `otelgrpc` client/server handlers with health-method filtering, injects MetricsManager into SbomManager, wraps SBOM creation in `sbom.scan` spans and records heap alloc attrs, and conditionally starts Go runtime metrics.
Exporters and HTTP exporter metrics injection `pkg/exporters/exporters_bus.go`, `pkg/exporters/http_exporter.go`, `pkg/exporters/alert_bulk_manager.go`, `pkg/exporters/alert_manager.go`	InitExporters now accepts a MetricsManager, HTTPExporter stores & uses metrics client (noop fallback), reports alert suppression and uses context-aware logging across exporters and bulk manager.
Metrics API, mocks, noop, and Prometheus impl updates `pkg/metricsmanager/*`	Adds context-aware `ReportRuleEvaluationTime(ctx, ...)`, SBOM scan helpers and `ReportAlertSuppressed` to interface; updates mocks, noop, Prometheus impls, and registers SBOM/alert-suppression metrics.
Config, docs, and dependency updates `pkg/config/*`, `docs/CONFIGURATION.md`, `docs/metrics-migration.md`, `go.mod`	Renames `EnablePrometheusExporter` → `EnableMetricsExporter` (same mapstructure key), updates TestLoadConfig, expands environment docs and OTEL Notes, adds metrics-migration guide, and reshuffles/bumps OpenTelemetry and Prometheus-related dependencies.
Context-aware logging fixes `pkg/malwaremanager/v1/clamav/*`, various exporters	Replaces several non-context logger calls with `logger.L().Ctx(...)` variants across exporters and ClamAV flows.

Sequence Diagram(s): omitted (changes are multiple feature areas; no single three-component sequential flow benefits more than the above diagrams).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

kubescape/node-agent#782: Dependency and OpenTelemetry-related dependency updates.
kubescape/node-agent#794: Overlaps rule-evaluation/CEL changes and ReportEnrichedEvent flow.
kubescape/node-agent#753: Related SBOM manager/metrics and sidecar integration changes.

Suggested labels

release

Suggested reviewers

YakirOren
slashben

"🐰 A hop for traces, a twitch for logs,
Counters tallying dropped eBPF clogs,
Profiles learn with spans held near,
Alerts now whisper, clean and clear,
Hooray — telemetry in my paws!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 58.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: OpenTelemetry integration with traces, logs, drop counters, and metrics (Phases 1 & 2). It accurately reflects the substantial refactoring from Prometheus-only to OTEL-based instrumentation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

github-actions · 2026-05-19T19:59:53Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.196	0.185	-5.7%
Peak CPU (cores)	0.211	0.208	-1.2%
Avg Memory (MiB)	311.043	270.446	-13.1%
Peak Memory (MiB)	312.699	276.422	-11.6%

Dedup Effectiveness (AFTER only)

Event Type	Passed	Deduped	Ratio
capabilities	1	0	0.0%
hardlink	6000	0	0.0%
http	1766	119394	98.5%
network	904	77996	98.9%
open	33777	622476	94.9%
symlink	6000	0	0.0%
syscall	979	1886	65.8%
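
The Ratio column appears consistent with deduped / (passed + deduped), rendered to one decimal place; that formula is inferred from the numbers, not stated by the benchmark job. A small sketch that reproduces the rows above (`dedupRatio` is a hypothetical helper; the `N/A` case mirrors how later runs in this thread render a 0/0 row):

```go
package main

import "fmt"

// dedupRatio returns the share of events removed by deduplication as a
// percentage string. With no events at all it returns "N/A".
func dedupRatio(passed, deduped int64) string {
	total := passed + deduped
	if total == 0 {
		return "N/A"
	}
	return fmt.Sprintf("%.1f%%", 100*float64(deduped)/float64(total))
}

func main() {
	fmt.Println(dedupRatio(1766, 119394)) // http row
	fmt.Println(dedupRatio(979, 1886))    // syscall row
	fmt.Println(dedupRatio(6000, 0))      // hardlink row
}
```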

Event Counters

Metric	BEFORE	AFTER
capability_counter	8	9
dns_counter	1446	1434
exec_counter	7231	7174
network_counter	95086	94343
open_counter	791819	785344
syscall_counter	3622	3602

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/containerwatcher/v2/container_watcher.go (1)

490-501: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Release dropped events on the non-blocking drop path.

On Line 490, dropped events skip the worker callback (where enrichedEvent.Event.Release() normally happens), so they are never released.

💡 Proposed fix

 		} else {
 			logger.L().Warning("ContainerWatcher - Worker channel full, dropping event",
 				helpers.String("eventType", string(entry.EventType)),
 				helpers.String("containerID", entry.ContainerID))
 			cw.ebpfDropCounter.Add(context.Background(),
 				1,
 				metric.WithAttributes(
 					attribute.String("event_type", string(entry.EventType)),
 					attribute.String("reason", "worker_channel_full"),
 				),
 			)
+			enrichedEvent.Event.Release()
 		}
 	}
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/containerwatcher/v2/container_watcher.go` around lines 490 - 501, The
non-blocking "worker channel full" branch drops events without releasing their
resources; update the else branch so that before logging/incrementing
cw.ebpfDropCounter you call the event release used in the worker path (e.g. call
enrichedEvent.Event.Release() or the appropriate Release() on the event stored
in entry), guarding with a nil-check if necessary, so dropped events are
properly released just like in the worker callback.
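
The pattern behind this finding, reduced to a self-contained sketch: a non-blocking send either hands the event to the worker, which releases it later, or drops it, in which case the sender has to release it. The `event` type, `trySubmit`, and the plain integer drop counter are hypothetical simplifications, not the node-agent's types.

```go
package main

import "fmt"

// event is a hypothetical pooled event; Release hands it back to its pool.
type event struct{ released bool }

func (e *event) Release() { e.released = true }

// trySubmit attempts a non-blocking hand-off to the worker queue.
// On success the worker owns the event and is expected to release it.
// On a full queue the event is counted as dropped and released here,
// because no worker callback will ever see it.
func trySubmit(queue chan<- *event, e *event, dropped *int) bool {
	select {
	case queue <- e:
		return true
	default:
		*dropped++
		e.Release()
		return false
	}
}

func main() {
	queue := make(chan *event, 1) // capacity 1 so the second send is dropped
	dropped := 0
	first, second := &event{}, &event{}
	fmt.Println(trySubmit(queue, first, &dropped), trySubmit(queue, second, &dropped)) // true false
	fmt.Println(dropped, first.released, second.released)                              // 1 false true
}
```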

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/main.go`:
- Around line 116-122: The deferred OTEL shutdown in main is skipped because
main calls os.Exit; move to the safer pattern: create a run() function that
contains the existing setup and the defer block that calls otelShutdown using
context.WithTimeout (and defers cancel), keep the defer exactly as in the diff,
replace all internal os.Exit(...) calls inside run() with returns of the
corresponding exit code, and change main to call os.Exit(run()); ensure you
reference the existing otelShutdown variable and the defer with
context.WithTimeout so the shutdown runs on SIGINT/SIGTERM paths.
- Around line 103-112: The OTEL provider call currently passes raw os.Getenv
values for NodeName, PodName, Namespace and ClusterName which bypass the
resolved config and clusterData; update the otelsetup.InitProviders
ProviderConfig construction (the call site where ProviderConfig is built) to use
the resolved config and cluster values (e.g., cfg.NodeName / cfg.PodName /
cfg.Namespace / cfg.ClusterName and existing clusterData fields) with optional
fallbacks to the env vars if needed so telemetry uses the final config values
instead of the raw environment.

In `@docs/CONFIGURATION.md`:
- Around line 112-115: Add a new environment variable row documenting
KS_LOGGER_NAME to the environment variables table, noting its default value and
valid options (e.g., "slog" vs "prettylogger"/"zaplogger") and that setting
KS_LOGGER_NAME=slog activates the ring buffer used by the retroactive log
export; also reference that ENABLE_DEBUG_LISTENER must be true to keep the last
7,500 log records in memory and that they can be re-emitted via a POST to
/debug/flush-ring-buffer.

In `@pkg/otelsetup/otelsetup_test.go`:
- Around line 156-161: Test TestSlowEvalThreshold_Default modifies the
package-global atomic slowEvalThresholdNs and does not restore it; update the
test to capture the current value of slowEvalThresholdNs before calling
slowEvalThresholdNs.Store(...), and defer restoring the saved value (using
slowEvalThresholdNs.Store(old)) so the global state is returned to its prior
value after the test; reference the slowEvalThresholdNs symbol and the
TestSlowEvalThreshold_Default test and keep the assertion on SlowEvalThreshold()
unchanged.

In `@pkg/otelsetup/setup.go`:
- Around line 185-187: The exporter option construction uses
otlptracegrpc.WithEndpoint (and likewise otlpmetricgrpc.WithEndpoint and
otlplogsgrpc.WithEndpoint) which requires host:port; update the logic that
builds traceOpts/metricOpts/logOpts to detect URL-style endpoints (use the
existing isARMOEndpoint() utility or check for "://") and call
WithEndpointURL(endpoint) when a scheme is present, otherwise keep
WithEndpoint(endpoint); apply the same change for the metric and log exporter
option blocks referenced around the other diffs so all three exporters handle
both host:port and full URL endpoints correctly.
- Around line 287-289: The shutdown closure currently derives shutdownCtx from
the incoming c which can be already cancelled; change it to create a fresh
timeout context using context.WithTimeout(context.Background(), 5*time.Second)
(instead of context.WithTimeout(c, ...)) so provider shutdown/flush calls are
not short-circuited by the caller's cancelled context; update the shutdown
function (the shutdown variable in setup.go) to use that fresh context and still
call defer cancel() as before.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1c5cf97b-b20d-4d20-93f8-9fcf98af12b5

📥 Commits

Reviewing files that changed from the base of the PR and between bf71679 and 515f7bb.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum

📒 Files selected for processing (16)

cmd/main.go
docs/CONFIGURATION.md
go.mod
pkg/config/config.go
pkg/config/config_test.go
pkg/containerprofilemanager/v1/containerprofile_manager.go
pkg/containerprofilemanager/v1/monitoring.go
pkg/containerwatcher/v2/container_watcher.go
pkg/containerwatcher/v2/event_handler_factory.go
pkg/containerwatcher/v2/tracers/top.go
pkg/malwaremanager/v1/malware_manager.go
pkg/otelsetup/lifecycle.go
pkg/otelsetup/otelsetup_test.go
pkg/otelsetup/setup.go
pkg/rulemanager/rule_manager.go
pkg/sbommanager/v1/sbom_manager.go

github-actions · 2026-05-20T05:44:57Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.215	0.221	+2.8%
Peak CPU (cores)	0.223	0.235	+5.4%
Avg Memory (MiB)	343.940	267.842	-22.1%
Peak Memory (MiB)	348.543	274.531	-21.2%

Dedup Effectiveness (AFTER only)

Event Type	Passed	Deduped	Ratio
capabilities	0	0	N/A
hardlink	6000	0	0.0%
http	1708	119455	98.6%
network	903	78000	98.9%
open	35202	621052	94.6%
symlink	6000	0	0.0%
syscall	988	1904	65.8%

Event Counters

Metric	BEFORE	AFTER
capability_counter	10	8
dns_counter	1430	1449
exec_counter	7153	7250
network_counter	94110	95376
open_counter	783483	794469
syscall_counter	3635	3517

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/otelsetup/setup.go (1)
171-177: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

ProviderConfig still drops cluster/version metadata.

cfg.ClusterName and cfg.ServiceVersion are passed into InitProviders, but they never make it into the shared OTEL resource. That means all three signals still lose cluster identity and agent version even after cmd/main.go was updated to resolve those values first.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/otelsetup/setup.go` around lines 171 - 177, The OTEL resource built in
resource.Merge (inside InitProviders / setup.go) omits cluster and service
version; update the resource.NewWithAttributes call to include cfg.ClusterName
and cfg.ServiceVersion by adding the appropriate semantic attributes (e.g.
semconv.K8SClusterName(cfg.ClusterName) and
semconv.ServiceVersion(cfg.ServiceVersion)) alongside semconv.ServiceName,
semconv.K8SNodeName, semconv.K8SPodName, and semconv.K8SNamespaceName so the
shared resource carries cluster identity and agent version.

♻️ Duplicate comments (1)

cmd/main.go (1)
116-122: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Deferred OTEL shutdown is still skipped on hard-exit paths.

The defer at Lines 116-122 never runs after os.Exit(...), and the later Fatal(...) paths terminate immediately as well. Normal SIGTERM/SIGINT shutdown can still bypass provider flush and drop batched telemetry.

Also applies to: 131-139, 186-187, 495-528
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/main.go` around lines 116 - 122, The deferred otelShutdown in the
anonymous defer block will be skipped on hard exits (os.Exit and log.Fatal
paths) so ensure otelShutdown(shutdownCtx) is invoked before any immediate exit:
replace direct os.Exit / log.Fatal usage or any hard-exit paths that occur near
the locations mentioned with a small exit helper (e.g., callExitWithShutdown)
that calls otelShutdown with a context timeout (same 5s pattern used in the
current defer), waits for completion (or logs error), then performs the actual
os.Exit; update all occurrences that terminate immediately (references:
otelShutdown, the anonymous defer block, and places that call os.Exit / Fatal)
to use that helper so provider flush runs on hard-exit paths.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aa4254e2-e99b-47b8-af17-b2fe32adc73c

📥 Commits

Reviewing files that changed from the base of the PR and between 515f7bb and da9e60f.

📒 Files selected for processing (4)

cmd/main.go
pkg/containerwatcher/v2/container_watcher.go
pkg/otelsetup/otelsetup_test.go
pkg/otelsetup/setup.go

matthyx · 2026-05-20T06:28:52Z

@coderabbitai review again

coderabbitai · 2026-05-20T06:29:05Z

@matthyx: Sure, I'll re-review the changes in this PR right away!

✅ Actions performed

Full review triggered.

github-actions · 2026-05-20T06:45:25Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.199	0.197	-1.0%
Peak CPU (cores)	0.213	0.205	-3.6%
Avg Memory (MiB)	330.575	275.331	-16.7%
Peak Memory (MiB)	333.004	290.117	-12.9%

Dedup Effectiveness (AFTER only)

Event Type	Passed	Deduped	Ratio
capabilities	0	0	N/A
hardlink	6000	0	0.0%
http	1703	119457	98.6%
network	905	77828	98.9%
open	36346	619901	94.5%
symlink	6000	0	0.0%
syscall	981	1897	65.9%

Event Counters

Metric	BEFORE	AFTER
capability_counter	11	8
dns_counter	1441	1383
exec_counter	7250	6995
network_counter	95295	91840
open_counter	794354	767514
syscall_counter	3515	3365

github-actions · 2026-05-20T16:05:49Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.214	0.216	+1.1%
Peak CPU (cores)	0.224	0.228	+1.8%
Avg Memory (MiB)	331.498	269.725	-18.6%
Peak Memory (MiB)	339.590	275.367	-18.9%

Dedup Effectiveness (AFTER only)

Event Type	Passed	Deduped	Ratio
capabilities	0	0	N/A
hardlink	5999	0	0.0%
http	1705	119458	98.6%
network	900	77791	98.9%
open	36263	619957	94.5%
symlink	5999	0	0.0%
syscall	981	1885	65.8%

Event Counters

Metric	BEFORE	AFTER
capability_counter	11	8
dns_counter	1454	1414
exec_counter	7316	7246
network_counter	96129	94987
open_counter	801893	795185
syscall_counter	3672	3631

github-actions · 2026-05-21T14:15:20Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.198	0.192	-3.3%
Peak CPU (cores)	0.209	0.203	-2.9%
Avg Memory (MiB)	330.410	271.648	-17.8%
Peak Memory (MiB)	333.457	281.020	-15.7%

Dedup Effectiveness (AFTER only)

Event Type	Passed	Deduped	Ratio
capabilities	0	0	N/A
hardlink	5999	0	0.0%
http	1767	119392	98.5%
network	905	77983	98.9%
open	36400	619834	94.5%
symlink	5999	0	0.0%
syscall	980	1877	65.7%

Event Counters

Metric	BEFORE	AFTER
capability_counter	11	9
dns_counter	1429	1444
exec_counter	7199	7225
network_counter	94605	95022
open_counter	789856	792502
syscall_counter	3520	3599

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

pkg/otelsetup/lifecycle.go (1)
39-45: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid evicting another container when replacing an existing tracked one.

At capacity, eviction runs before the replace check. If containerID already exists, this can evict an unrelated active lifecycle span and reduce tracked coverage unnecessarily.
Suggested patch
 func (t *ProfileLifecycleTracker) OnLearningStarted(containerID, namespace, pod, image string) {
 	t.mu.Lock()
 	defer t.mu.Unlock()
-	if len(t.spans) >= maxTrackedProfiles {
-		t.evictOldest()
-	}
 	if existing, ok := t.spans[containerID]; ok {
 		existing.AddEvent("learning.replaced")
 		existing.End()
+	} else if len(t.spans) >= maxTrackedProfiles {
+		t.evictOldest()
 	}
 	spanCtx, span := Tracer().Start(context.Background(), "container.profile.learning",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/otelsetup/lifecycle.go` around lines 39 - 45, The current logic evicts
oldest span before checking for an existing span, which can remove an unrelated
active lifecycle span when containerID is being replaced; change the flow in the
lifecycle tracking function so you first check if containerID already exists in
t.spans (the existing := t.spans[containerID] path and calls to
existing.AddEvent("learning.replaced") / existing.End()), perform the
replacement without calling t.evictOldest, and only if the containerID is not
already present enforce capacity by calling t.evictOldest when len(t.spans) >=
maxTrackedProfiles; this preserves existing spans and only evicts when adding a
new tracked container.
pkg/containerwatcher/v2/container_watcher.go (1)
179-183: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle the Int64Counter creation error instead of discarding it.

In pkg/containerwatcher/v2/container_watcher.go (lines 179-183), the code ignores the err from otelsetup.Meter().Int64Counter(...). OpenTelemetry-Go’s contract intends the returned instrument to remain usable/non-nil even when err != nil, but the error still signals an instrument registration/name issue (e.g., ErrInstrumentName) that can lead to incorrect/conflicting metric streams—so handle/log/propagate err instead of dropping it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/containerwatcher/v2/container_watcher.go` around lines 179 - 183, The
Int64Counter creation call using otelsetup.Meter().Int64Counter assigned to
ebpfDropCounter currently discards its returned error; capture the error (e.g.,
ebpfDropCounter, err := otelsetup.Meter().Int64Counter(...)) and handle it—log
it via the existing logger (processLogger or similar) or return it from the
containing function so instrument registration problems (ErrInstrumentName,
etc.) are visible; ensure the handling occurs where ebpfDropCounter is declared
so downstream code either gets a valid instrument or the startup fails/alerts
appropriately.

🧹 Nitpick comments (1)

pkg/otelsetup/otelsetup_test.go (1)

116-121: ⚡ Quick win

Make eviction ordering deterministic in the test (avoid time.Sleep).

Using time.Sleep for ordering can cause intermittent flakes. Setting startTimes explicitly under lock keeps this test deterministic.

Suggested patch

 	tracker := NewProfileLifecycleTracker()
 	tracker.OnLearningStarted("old", "ns", "pod", "")
-	time.Sleep(time.Millisecond)
 	tracker.OnLearningStarted("new", "ns", "pod", "")

 	tracker.mu.Lock()
+	tracker.startTimes["old"] = time.Unix(0, 1)
+	tracker.startTimes["new"] = time.Unix(0, 2)
 	tracker.evictOldest()
 	tracker.mu.Unlock()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/otelsetup/otelsetup_test.go` around lines 116 - 121, The test relies on
time.Sleep to create ordering for eviction; instead make ordering deterministic
by acquiring tracker.mu, setting tracker.startTimes for the relevant keys to
explicit timestamps (older for the entry you expect evicted, newer for the one
to keep), release the lock, then call
tracker.OnLearningStarted("new","ns","pod","") and invoke tracker.mu.Lock();
tracker.evictOldest(); tracker.mu.Unlock(); this replaces the non-deterministic
time.Sleep and ensures evictOldest() sees the intended startTimes ordering.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/otelsetup/lifecycle.go`:
- Around line 97-103: OnEntrySaved currently increments t.counts[containerID]
before verifying the container is tracked, which can grow counts for unknown
IDs; fix by checking the map lookup result (ctx, ok := t.ctxs[containerID])
while holding t.mu and only incrementing t.counts[containerID] (and reading
count) if ok is true, moving the increment into the guarded branch and returning
early without touching t.counts when ok is false; ensure the lock/unlock
semantics around t.mu remain correct.
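
A minimal sketch of the guarded increment described here, with hypothetical names (`entryCounter`, `onEntrySaved`) and a boolean map standing in for the tracker's context map: the count is only touched once the ID is known to be tracked, so unknown IDs cannot grow the counts map.

```go
package main

import (
	"fmt"
	"sync"
)

// entryCounter is a hypothetical, simplified view of the tracker's state.
type entryCounter struct {
	mu      sync.Mutex
	tracked map[string]bool // stands in for the per-container context map
	counts  map[string]int
}

// onEntrySaved increments the count only for tracked containers and returns
// the new count plus whether the container was tracked.
func (c *entryCounter) onEntrySaved(id string) (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.tracked[id] {
		return 0, false // unknown ID: leave counts untouched
	}
	c.counts[id]++
	return c.counts[id], true
}

func main() {
	c := &entryCounter{
		tracked: map[string]bool{"known": true},
		counts:  map[string]int{},
	}
	c.onEntrySaved("unknown")
	n, ok := c.onEntrySaved("known")
	fmt.Println(n, ok, len(c.counts)) // 1 true 1
}
```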

---

Outside diff comments:
In `@pkg/containerwatcher/v2/container_watcher.go`:
- Around line 179-183: The Int64Counter creation call using
otelsetup.Meter().Int64Counter assigned to ebpfDropCounter currently discards
its returned error; capture the error (e.g., ebpfDropCounter, err :=
otelsetup.Meter().Int64Counter(...)) and handle it—log it via the existing
logger (processLogger or similar) or return it from the containing function so
instrument registration problems (ErrInstrumentName, etc.) are visible; ensure
the handling occurs where ebpfDropCounter is declared so downstream code either
gets a valid instrument or the startup fails/alerts appropriately.

In `@pkg/otelsetup/lifecycle.go`:
- Around line 39-45: The current logic evicts oldest span before checking for an
existing span, which can remove an unrelated active lifecycle span when
containerID is being replaced; change the flow in the lifecycle tracking
function so you first check if containerID already exists in t.spans (the
existing := t.spans[containerID] path and calls to
existing.AddEvent("learning.replaced") / existing.End()), perform the
replacement without calling t.evictOldest, and only if the containerID is not
already present enforce capacity by calling t.evictOldest when len(t.spans) >=
maxTrackedProfiles; this preserves existing spans and only evicts when adding a
new tracked container.
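
A minimal sketch of the replace-before-evict ordering follows. The `lifecycle` type, the `at` helper, and the cap of 2 are assumptions made for illustration; spans and locking are omitted so the ordering is easy to follow.

```go
package main

import (
	"fmt"
	"time"
)

// maxTracked is a tiny illustrative cap; the PR description mentions 10k.
const maxTracked = 2

// lifecycle is a hypothetical, lock-free stand-in for the span tracker.
type lifecycle struct {
	startTimes map[string]time.Time
	log        []string
}

func newLifecycle() *lifecycle {
	return &lifecycle{startTimes: map[string]time.Time{}}
}

// at builds a deterministic timestamp n seconds after the Unix epoch.
func at(n int) time.Time { return time.Unix(int64(n), 0) }

// evictOldest drops the entry with the earliest start time, if any.
func (l *lifecycle) evictOldest() {
	var oldest string
	found := false
	for id, started := range l.startTimes {
		if !found || started.Before(l.startTimes[oldest]) {
			oldest, found = id, true
		}
	}
	if found {
		delete(l.startTimes, oldest)
		l.log = append(l.log, "evicted:"+oldest)
	}
}

// start handles replacement first and enforces the cap only for new IDs.
func (l *lifecycle) start(id string, now time.Time) {
	if _, exists := l.startTimes[id]; exists {
		l.log = append(l.log, "replaced:"+id) // no eviction on replacement
	} else if len(l.startTimes) >= maxTracked {
		l.evictOldest()
	}
	l.startTimes[id] = now
}

func main() {
	l := newLifecycle()
	l.start("a", at(0))
	l.start("b", at(1))
	l.start("a", at(2)) // at capacity, but "a" is a replacement
	l.start("c", at(3)) // new ID at capacity: the oldest entry is evicted
	fmt.Println(l.log)
}
```

Passing explicit timestamps also keeps the eviction order deterministic without relying on sleeps.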

---

Nitpick comments:
In `@pkg/otelsetup/otelsetup_test.go`:
- Around line 116-121: The test relies on time.Sleep to create ordering for
eviction; instead make ordering deterministic by acquiring tracker.mu, setting
tracker.startTimes for the relevant keys to explicit timestamps (older for the
entry you expect evicted, newer for the one to keep), release the lock, then
call tracker.OnLearningStarted("new","ns","pod","") and invoke
tracker.mu.Lock(); tracker.evictOldest(); tracker.mu.Unlock(); this replaces the
non-deterministic time.Sleep and ensures evictOldest() sees the intended
startTimes ordering.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 226ea8a5-cbde-422e-82d6-c29aa8dcd6c0

📥 Commits

Reviewing files that changed from the base of the PR and between da9e60f and ea28040.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum

📒 Files selected for processing (7)

cmd/main.go
go.mod
pkg/containerprofilemanager/v1/monitoring.go
pkg/containerwatcher/v2/container_watcher.go
pkg/otelsetup/lifecycle.go
pkg/otelsetup/otelsetup_test.go
pkg/otelsetup/setup.go

github-actions · 2026-05-21T17:50:18Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.000	0.000	N/A
Peak CPU (cores)	0.000	0.000	N/A
Avg Memory (MiB)	0.000	0.000	N/A
Peak Memory (MiB)	0.000	0.000	N/A

Dedup Effectiveness

No data available.

github-actions · 2026-05-21T18:24:06Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.206	0.214	+3.8%
Peak CPU (cores)	0.214	0.225	+5.3%
Avg Memory (MiB)	350.030	266.701	-23.8%
Peak Memory (MiB)	355.156	273.195	-23.1%

Dedup Effectiveness (AFTER only)

Event Type	Passed	Deduped	Ratio
capabilities	1	0	0.0%
hardlink	6000	0	0.0%
http	1764	119394	98.5%
network	901	77998	98.9%
open	34143	622102	94.8%
symlink	6000	0	0.0%
syscall	978	1893	65.9%
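
The Ratio column appears consistent with deduped / (passed + deduped), rounded to one decimal place. The sketch below checks that reading against the rows above; the formula is inferred from the numbers, not taken from the benchmark tooling, and `ratioTenths` and `formatRatio` are hypothetical helpers.

```go
package main

import "fmt"

// ratioTenths returns the dedup ratio in tenths of a percent, rounded to
// the nearest tenth. It returns 0 when there were no events at all.
func ratioTenths(passed, deduped int) int {
	total := passed + deduped
	if total == 0 {
		return 0
	}
	return (deduped*1000 + total/2) / total
}

// formatRatio renders tenths of a percent as a string such as "98.5%".
func formatRatio(tenths int) string {
	return fmt.Sprintf("%d.%d%%", tenths/10, tenths%10)
}

func main() {
	rows := []struct {
		name            string
		passed, deduped int
	}{
		{"hardlink", 6000, 0},
		{"http", 1764, 119394},
		{"network", 901, 77998},
		{"open", 34143, 622102},
		{"syscall", 978, 1893},
	}
	for _, r := range rows {
		fmt.Println(r.name, formatRatio(ratioTenths(r.passed, r.deduped)))
	}
}
```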

Event Counters

Metric	BEFORE	AFTER
capability_counter	10	9
dns_counter	1442	1417
exec_counter	7236	7087
network_counter	95143	93217
open_counter	792738	777535
syscall_counter	3659	3516

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/rulemanager/rule_manager.go (1)

372-385: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Slow-eval span gate is always true with the documented default.

Line 374 uses evaluationTime >= otelsetup.SlowEvalThreshold(). If OTEL_SLOW_EVAL_THRESHOLD_MS defaults to 0, this records spans for every rule evaluation, not only slow ones.

Suggested fix

-		if evaluationTime >= otelsetup.SlowEvalThreshold() {
+		if threshold := otelsetup.SlowEvalThreshold(); threshold > 0 && evaluationTime >= threshold {
 			_, span := otelsetup.Tracer().Start(rm.ctx, "rule.evaluate",
 				trace.WithAttributes(

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rulemanager/rule_manager.go` around lines 372 - 385, The slow-eval span
gate currently uses >= which with the default OTEL_SLOW_EVAL_THRESHOLD_MS == 0
will record every evaluation; change the condition in rule_manager.go from
evaluationTime >= otelsetup.SlowEvalThreshold() to evaluationTime >
otelsetup.SlowEvalThreshold() (ensuring both sides are the same unit, e.g.,
time.Duration) so only truly slower-than-threshold evaluations create the
"rule.evaluate" span; if otelsetup.SlowEvalThreshold() can be zero by design,
also consider ensuring its default is a positive duration instead of 0.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0cfa1ecd-a736-45e0-baea-57fe5d4815f4

📥 Commits

Reviewing files that changed from the base of the PR and between ea28040 and 4525e89.

📒 Files selected for processing (10)

pkg/containerprofilemanager/v1/monitoring.go
pkg/containerwatcher/v2/container_watcher.go
pkg/exporters/alert_bulk_manager.go
pkg/exporters/alert_manager.go
pkg/exporters/http_exporter.go
pkg/malwaremanager/v1/clamav/clamav.go
pkg/malwaremanager/v1/clamav/exec.go
pkg/malwaremanager/v1/clamav/open.go
pkg/malwaremanager/v1/malware_manager.go
pkg/rulemanager/rule_manager.go

✅ Files skipped from review due to trivial changes (2)

pkg/exporters/http_exporter.go
pkg/exporters/alert_bulk_manager.go

github-actions · 2026-05-21T19:14:15Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.218	0.210	-3.8%
Peak CPU (cores)	0.230	0.220	-4.3%
Avg Memory (MiB)	332.306	269.525	-18.9%
Peak Memory (MiB)	334.609	275.129	-17.8%

Dedup Effectiveness (AFTER only)

Event Type	Passed	Deduped	Ratio
capabilities	1	0	0.0%
hardlink	6000	0	0.0%
http	1767	119394	98.5%
network	902	77918	98.9%
open	36198	620067	94.5%
symlink	6000	0	0.0%
syscall	988	1906	65.9%

Event Counters

Metric	BEFORE	AFTER
capability_counter	9	8
dns_counter	1449	1449
exec_counter	7249	7288
network_counter	95369	95812
open_counter	793322	798400
syscall_counter	3539	3626

Introduces OTEL instrumentation across node-agent without touching existing Prometheus metrics:

- pkg/otelsetup: new package with InitProviders (TracerProvider + LoggerProvider + MeterProvider via OTLP gRPC), ARMO auth header injection, ring-buffer log processor (7500-entry), ProfileLifecycleTracker (one span per container learning period, cap 10k), SlowEvalThreshold, EmitAlertLogRecord with 60s/1000-entry dedup LRU, and debug HTTP listener
- pkg/rulemanager: emit alert OTEL log records per fired rule; add slow-path span for evaluations exceeding SlowEvalThreshold
- pkg/malwaremanager: emit alert OTEL log records for malware detections
- pkg/containerwatcher: count dropped eBPF events via Int64Counter (node_agent.ebpf.events_dropped.total) with reason label
- pkg/containerprofilemanager: wire ProfileLifecycleTracker lifecycle hooks (OnLearningStarted/OnEntrySaved/OnLearningEnded)
- pkg/sbommanager: attach otelgrpc stats handler to gRPC dial
- docs/CONFIGURATION.md: document new OTEL env vars; mark OTEL_COLLECTOR_SVC as deprecated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

InitProviders now guards each exporter behind its own non-empty endpoint check, so a traces-only config no longer instantiates log/metric exporters against empty targets (avoiding retry loops). The ARMO no-credentials guard is extended to cover metrics-only ARMO configs. Each provider pointer is nil-checked in the combined shutdown. Debug listener is gated on logProvider being non-nil. Alert-log dedup in rule_manager replaces the racy Contains+Add pair with a mutex-protected check-and-set, closing the TOCTOU window under the ants worker pool. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
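
The difference between a Contains+Add pair and a single check-and-set can be sketched with a mutex-protected map. This is an illustration under assumptions: `alertDedup`, `firstSeen`, and `countWinners` are hypothetical names, and the size bound of the real 1000-entry LRU is left out.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// alertDedup is a hypothetical TTL-based dedup set.
type alertDedup struct {
	mu   sync.Mutex
	ttl  time.Duration
	seen map[string]time.Time
}

func newAlertDedup(ttl time.Duration) *alertDedup {
	return &alertDedup{ttl: ttl, seen: map[string]time.Time{}}
}

// firstSeen checks and records the key under one lock, so only one
// caller per TTL window is told to emit.
func (d *alertDedup) firstSeen(key string, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if last, ok := d.seen[key]; ok && now.Sub(last) < d.ttl {
		return false
	}
	d.seen[key] = now
	return true
}

// countWinners fires n concurrent firstSeen calls for the same key and
// returns how many of them were told to emit.
func countWinners(n int) int {
	d := newAlertDedup(60 * time.Second)
	now := time.Now()
	var winners int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if d.firstSeen("R1001/container-1", now) {
				atomic.AddInt64(&winners, 1)
			}
		}()
	}
	wg.Wait()
	return int(atomic.LoadInt64(&winners))
}

func main() {
	fmt.Println("emitters among 64 concurrent workers:", countWinners(64))
}
```

With separate Contains and Add calls, two workers can both observe "absent" before either records the key; holding the lock across both steps removes that window.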

- cmd/main.go: use resolved cfg/clusterData values for OTEL resource attributes instead of raw os.Getenv (NodeName, PodName, NamespaceName, ClusterName)
- setup.go: detect URL-scheme endpoints and route to WithEndpointURL; host:port paths continue using WithEndpoint for all three exporters
- setup.go: shutdown closure derives fresh context.Background() timeout so provider flush is not short-circuited by an already-cancelled caller context
- otelsetup_test.go: restore slowEvalThresholdNs global in t.Cleanup to prevent order-dependent test failures
- container_watcher.go: call Release() on dropped events in the worker_channel_full path to prevent resource leak

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>

The shared resource was missing K8SClusterName and ServiceVersion attributes, so all three signals were losing cluster identity and agent version even after cmd/main.go was updated to resolve these values from clusterData and ProviderConfig. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

resource.Default() in otel v1.43.0 sets schema URL 1.40.0 internally, while resource.NewWithAttributes(semconv.SchemaURL, ...) sets 1.26.0 from our semconv/v1.26.0 import. resource.Merge rejects two non-empty conflicting schema URLs, causing InitProviders to fail at startup. Switch to resource.NewSchemaless so our custom attributes carry no schema URL; Merge then adopts the default resource's schema without conflict. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The OTLP gRPC exporters default to TLS, which fails against plaintext collectors (e.g. a local SigNoz on port 4317). Add WithInsecure() for all three exporters when the endpoint does not start with https://. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
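
The endpoint handling described in these commits boils down to two independent decisions per endpoint string. The sketch below models only that decision logic with the standard library; `endpointPlan` and `planEndpoint` are hypothetical names, and the exporter options appear only in comments.

```go
package main

import (
	"fmt"
	"strings"
)

// endpointPlan records which exporter options an endpoint would need.
type endpointPlan struct {
	UseEndpointURL bool // true: WithEndpointURL, false: WithEndpoint
	Insecure       bool // true: also add WithInsecure()
}

// planEndpoint treats anything containing "://" as a URL and anything
// not starting with "https://" as plaintext.
func planEndpoint(endpoint string) endpointPlan {
	return endpointPlan{
		UseEndpointURL: strings.Contains(endpoint, "://"),
		Insecure:       !strings.HasPrefix(endpoint, "https://"),
	}
}

func main() {
	for _, ep := range []string{
		"localhost:4317",
		"http://collector.observability:4317",
		"https://otel.example.com:443",
	} {
		fmt.Printf("%-40s %+v\n", ep, planEndpoint(ep))
	}
}
```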

…ing buffer severity gate

- ProfileLifecycleTracker: one container.profile.learning span per container with LearningSpanID() and LearningTraceparent() for W3C propagation
- OnEntrySaved: emit container.profile.cp.saved child spans with M2 throttle (count==1, %10, or hasDropped)
- ContainerProfile annotations: OtelSpanIDMetadataKey + OtelTraceparentMetadataKey (k8s-interface v0.0.213)
- otelsetup: thin wrapper delegating to go-logger/otelsetup (v0.0.29) which includes ring buffer with severity≥Info gate
- Bump k8s-interface v0.0.212→v0.0.213 for OtelTraceparentMetadataKey

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
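
One plausible reading of the throttle rule "count==1, %10, or hasDropped" is sketched below. It assumes "%10" means every tenth save (`count%10 == 0`); `shouldEmitSaveSpan` and `emittedIn` are hypothetical helpers, not names from the PR.

```go
package main

import "fmt"

// shouldEmitSaveSpan reports whether a child span would be emitted for
// the count-th save: the first save, every tenth save, or any save that
// observed dropped events.
func shouldEmitSaveSpan(count int, hasDropped bool) bool {
	return count == 1 || count%10 == 0 || hasDropped
}

// emittedIn counts how many of the first n saves would emit a span when
// no events are dropped.
func emittedIn(n int) int {
	emitted := 0
	for count := 1; count <= n; count++ {
		if shouldEmitSaveSpan(count, false) {
			emitted++
		}
	}
	return emitted
}

func main() {
	fmt.Println("spans for 100 saves without drops:", emittedIn(100))
	fmt.Println("save 7 with drops emits:", shouldEmitSaveSpan(7, true))
}
```

Under this reading, roughly one in ten saves produces a span in steady state, while saves with dropped events are always visible.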

…amAV, drop events

Wires OTEL log context on the call sites the design doc marks as Tier 1 (direct customer impact) and Tier 2 (operational health):

- Alert delivery failures: alert_manager.go SendRuleAlert/SendMalwareAlert, alert_bulk_manager.go bulk-send max retries / queue-full / drain timeout, http_exporter.go SendRuleAlert / SendFimAlerts / alert-limit
- Rule eval failures: rule_manager.go ReportEnrichedEvent / EvaluatePolicyRules / getUniqueIdAndMessage; use rm.ctx for trace correlation
- ClamAV health: clamav.go ping failure, exec.go/open.go scan failures
- Drop events: container_watcher.go worker-channel-full lines (Tier 2)
- Profile save failures: containerprofilemanager monitoring.go; use cpm.ctx

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… to exec malware path

- rule_manager.go: set span.SetStatus(codes.Error) when CEL evaluation returns an error so failed evaluations are distinguishable from successful non-alerts in traces
- malware_manager.go: mirror reportFileOpen OTEL emission in reportFileExec: adds metrics.ReportRuleAlert + EmitAlertLogRecord to the exec malware detection path so exec-path detections appear in telemetry alongside open-path ones

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-21T19:54:19Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.225	0.217	-3.7%
Peak CPU (cores)	0.234	0.223	-4.5%
Avg Memory (MiB)	336.682	269.727	-19.9%
Peak Memory (MiB)	340.227	276.770	-18.7%

Dedup Effectiveness (AFTER only)

Event Type	Passed	Deduped	Ratio
capabilities	1	0	0.0%
hardlink	6000	0	0.0%
http	1764	119396	98.5%
network	901	77999	98.9%
open	35620	620629	94.6%
symlink	6000	0	0.0%
syscall	979	1885	65.8%

Event Counters

Metric	BEFORE	AFTER
capability_counter	9	9
dns_counter	1427	1422
exec_counter	7180	7116
network_counter	94353	93617
open_counter	785912	779420
syscall_counter	3527	3454

- Add pkg/metricsmanager/otel/ with full MetricsManager implementation backed by OTEL SDK; attribute-set caching on all hot paths eliminates per-call allocations (2× faster, 10× less memory vs Prometheus on the histogram path per A/B benchmark)
- Wire OTEL metrics in cmd/main.go; drop prometheus package import
- Add Prometheus scrape mode: OTEL_METRICS_EXPORTER=prometheus installs an OTEL→Prometheus bridge and starts :8080/metrics listener
- Standardise rule.ID at all metric call sites (was rule.Name); malware alerts use constant "malware" to bound cardinality
- Add docs/metrics-migration.md mapping every old Prometheus name to its new OTEL name (breaking rename; dashboards must be updated)
- Add A/B benchmarks in otel/ and prometheus/ packages; hard gate passes: OTEL allocs/op ≤ Prometheus allocs/op, ns/op ≤ 1.1× Prometheus

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Picks up buildAuthHeaders refactor and AC8/AC9 unit tests for ARMO auth header injection (X-API-Key, X-Customer-GUID) across trace/log/metric exporter option builders. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-22T08:08:14Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.000	0.000	N/A
Peak CPU (cores)	0.000	0.000	N/A
Avg Memory (MiB)	0.000	0.000	N/A
Peak Memory (MiB)	0.000	0.000	N/A

Dedup Effectiveness

No data available.

github-actions · 2026-05-22T08:41:39Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.204	0.203	-0.9%
Peak CPU (cores)	0.215	0.209	-2.7%
Avg Memory (MiB)	321.076	265.948	-17.2%
Peak Memory (MiB)	323.840	270.332	-16.5%

Dedup Effectiveness

No data available.

Event Counters

Metric	BEFORE	AFTER
capability_counter	8	0
dns_counter	1463	0
exec_counter	7322	0
network_counter	96293	0
open_counter	802807	0
syscall_counter	3704	0

otel.SetMeterProvider(mp) was called before net.Listen(":8080"), so a port-conflict error (sidecar when main agent owns :8080) left a leaked Prometheus-backed provider as the global, silently blackholing sidecar metrics. Reorder to attempt net.Listen first; only install the provider after the listener succeeds. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
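
The listen-first ordering can be sketched with the standard library alone. Here a package-level string stands in for the global meter provider, and `startScrape` and `demoConflict` are hypothetical helpers; the real code would call otel.SetMeterProvider where this sketch assigns `installed`.

```go
package main

import (
	"fmt"
	"net"
)

// installed stands in for the process-wide meter provider slot.
var installed = "noop"

// startScrape binds the listener first and installs the provider only
// after the bind succeeded, so a port conflict leaves the global as-is.
func startScrape(addr, provider string) (net.Listener, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, fmt.Errorf("listen %s: %w", addr, err)
	}
	installed = provider
	return ln, nil
}

// demoConflict binds an ephemeral port, then tries to bind it again and
// returns the provider that is installed after the failed second attempt.
func demoConflict() (string, error) {
	installed = "noop"
	first, err := startScrape("127.0.0.1:0", "prometheus")
	if err != nil {
		return installed, err // no networking available: nothing installed
	}
	defer first.Close()
	installed = "noop" // reset so only the second attempt is observed
	_, err = startScrape(first.Addr().String(), "prometheus-sidecar")
	return installed, err
}

func main() {
	provider, err := demoConflict()
	fmt.Println("installed after conflict:", provider)
	fmt.Println("error:", err)
}
```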

…reateSbomManager Prevents a runtime panic if CreateSbomManager is ever called with a nil MetricsManager (e.g. in tests). Substitutes the no-op implementation rather than returning an error, consistent with how other managers in this codebase treat optional metric dependencies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e values HeapAlloc is a live snapshot that can decrease when GC runs mid-scan, making heap.alloc.delta_mb negative or understated. Switch to TotalAlloc (monotonically increasing cumulative bytes) so the delta always reflects actual allocations made during the scan. Rename span attributes from heap.alloc.* to alloc.total.* to avoid implying a live-heap measurement. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
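
The difference between the two counters can be observed with `runtime.ReadMemStats`. The sketch below is illustrative only: `measure` and `allocateMiB` are hypothetical helpers, and the forced GC exists only to make the HeapAlloc effect visible.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps allocations reachable so the compiler cannot elide them.
var sink [][]byte

// measure runs work and returns the change in cumulative allocated bytes
// (TotalAlloc) and the change in live heap bytes (HeapAlloc).
func measure(work func()) (totalDelta uint64, heapDelta int64) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	work()
	runtime.ReadMemStats(&after)
	totalDelta = after.TotalAlloc - before.TotalAlloc
	heapDelta = int64(after.HeapAlloc) - int64(before.HeapAlloc)
	return totalDelta, heapDelta
}

// allocateMiB allocates n one-MiB buffers, then drops them and forces a
// collection, mimicking a GC cycle that runs in the middle of a scan.
func allocateMiB(n int) {
	for i := 0; i < n; i++ {
		sink = append(sink, make([]byte, 1<<20))
	}
	sink = nil
	runtime.GC()
}

func main() {
	total, heap := measure(func() { allocateMiB(8) })
	fmt.Println("TotalAlloc delta (bytes):", total)
	fmt.Println("HeapAlloc delta (bytes): ", heap)
}
```

The TotalAlloc delta covers at least the 8 MiB that were allocated, while the HeapAlloc delta depends on what the collector has already freed and may be small or negative.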

github-actions · 2026-05-22T09:18:29Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.000	0.000	N/A
Peak CPU (cores)	0.000	0.000	N/A
Avg Memory (MiB)	0.000	0.000	N/A
Peak Memory (MiB)	0.000	0.000	N/A

Dedup Effectiveness

No data available.

matthyx · 2026-05-22T09:20:19Z

@coderabbitai review

coderabbitai · 2026-05-22T09:20:26Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

github-actions · 2026-05-22T09:50:58Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.204	0.196	-3.8%
Peak CPU (cores)	0.213	0.204	-4.1%
Avg Memory (MiB)	312.036	270.980	-13.2%
Peak Memory (MiB)	313.836	279.043	-11.1%

Dedup Effectiveness

No data available.

Event Counters

Metric	BEFORE	AFTER
capability_counter	11	0
dns_counter	1429	0
exec_counter	7147	0
network_counter	94013	0
open_counter	782194	0
syscall_counter	3542	0

Picks up credential-presence auth header gate — drops isARMOEndpoint() hostname matching in favour of accessKey != "" check, consistent with the SBOM scan-failure reporter and HTTP exporter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
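
A credential-presence gate of this kind might look like the sketch below. The header names come from the PR description; the `authHeaders` function and its signature are assumptions for illustration and do not reflect the actual buildAuthHeaders in go-logger.

```go
package main

import "fmt"

// authHeaders returns the auth headers when an access key is present and
// nil otherwise; the endpoint hostname plays no role in the decision.
func authHeaders(accessKey, customerGUID string) map[string]string {
	if accessKey == "" {
		return nil
	}
	return map[string]string{
		"X-API-Key":       accessKey,
		"X-Customer-GUID": customerGUID,
	}
}

func main() {
	fmt.Println(authHeaders("", "guid-123"))
	fmt.Println(authHeaders("secret-key", "guid-123"))
}
```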

…tial-presence model Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…LE_DEBUG_LISTENER Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-22T10:20:42Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.000	0.000	N/A
Peak CPU (cores)	0.000	0.000	N/A
Avg Memory (MiB)	0.000	0.000	N/A
Peak Memory (MiB)	0.000	0.000	N/A

Dedup Effectiveness

No data available.

Picks up debug listener activation from KS_LOGGER_LEVEL=debug directly in the library — no ENABLE_DEBUG_LISTENER env var or DebugListener config field needed in callers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-22T10:55:40Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.216	0.220	+1.6%
Peak CPU (cores)	0.228	0.230	+1.0%
Avg Memory (MiB)	330.949	265.483	-19.8%
Peak Memory (MiB)	333.180	270.480	-18.8%

Dedup Effectiveness

No data available.

Event Counters

Metric	BEFORE	AFTER
capability_counter	9	0
dns_counter	1453	0
exec_counter	7293	0
network_counter	95876	0
open_counter	800201	0
syscall_counter	3668	0

- Add KS_LOGGER_LEVEL and KS_LOGGER_NAME to CONFIGURATION.md env vars table
- Gate runtime metrics on all metric-exporter env vars (OTEL_METRICS_EXPORTER, OTEL_EXPORTER_OTLP_METRICS_ENDPOINT) not just the base endpoint
- Unregister sbom/alert counters in PrometheusMetric.Destroy()
- Propagate prometheus HTTP server shutdown error via errors.Join

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-22T17:12:46Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.195	0.191	-2.4%
Peak CPU (cores)	0.203	0.198	-2.1%
Avg Memory (MiB)	330.452	265.105	-19.8%
Peak Memory (MiB)	332.152	267.922	-19.3%

Dedup Effectiveness

No data available.

Event Counters

Metric	BEFORE	AFTER
capability_counter	10	0
dns_counter	1427	0
exec_counter	7194	0
network_counter	94499	0
open_counter	788400	0
syscall_counter	3544	0

…gent

- New node_agent.process.memory.rss_bytes (VmRSS from /proc/self/status)
- New node_agent.process.memory.cgroup_bytes (cgroupv2 first, cgroupv1 fallback)
- New node_agent.host.memory.total_bytes (MemTotal from /proc/meminfo, read once)
- New node_agent.host.cpu.count (runtime.NumCPU())
- New node_agent.container.count (atomic start-stop delta, observable gauge)
- Wire goruntime.Start() in cmd/main.go (was only in sbom-scanner)

Closes Appendix A "Per-agent CPU and memory" and "Host total resources" gaps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-27T10:19:51Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.200	0.191	-4.5%
Peak CPU (cores)	0.206	0.199	-3.3%
Avg Memory (MiB)	319.246	266.066	-16.7%
Peak Memory (MiB)	323.586	271.633	-16.1%

Dedup Effectiveness

No data available.

Event Counters

Metric	BEFORE	AFTER
capability_counter	10	0
dns_counter	1409	0
exec_counter	7049	0
network_counter	92705	0
open_counter	772999	0
syscall_counter	3529	0

Log records emitted via logger.L().Ctx(ctx) inherit trace_id/span_id when ctx carries an active span, so the gap is purely call-site, not SDK. Pass spanned contexts at four sites where one is already in scope but the log fell back to a parent context, and add a single nodeprofile.send span to cover the only remaining warning that fires without a parent.

- pkg/otelsetup/lifecycle.go: expose LearningCtx(containerID) so callers can reach the open learning span from outside the tracker package.
- pkg/containerprofilemanager/v1/monitoring.go: four save-profile error logs now use the learning ctx, so failures correlate to the container's learning trace.
- pkg/rulemanager/rule_manager.go: hoist evalCtx so the post-eval error log inherits the rule.evaluate span when slow-path tracing fires.
- pkg/nodeprofilemanager/v1/nodeprofile_manager.go: wrap sendProfile in a nodeprofile.send span (1 span / NodeProfileInterval / pod) so the recurring HTTP-EOF warning carries trace context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-27T14:59:35Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.218	0.222	+1.6%
Peak CPU (cores)	0.225	0.228	+1.2%
Avg Memory (MiB)	339.327	267.045	-21.3%
Peak Memory (MiB)	341.730	273.941	-19.8%

Dedup Effectiveness

No data available.

Event Counters

Metric	BEFORE	AFTER
capability_counter	9	0
dns_counter	1415	0
exec_counter	7079	0
network_counter	93116	0
open_counter	776654	0
syscall_counter	3563	0

TestZapWarning_AttachesSpanContext exercises the full path used at every fix site (logger.L().Ctx(spanCtx).Warning(...)) with an in-memory log recorder and verifies the emitted record's context carries the span's TraceID/SpanID — which is what the OTEL Logs SDK stamps into the trace_id/span_id columns downstream. Adds TestProfileLifecycleTracker_LearningCtx and a nil-safety variant for the new LearningCtx getter. Pulls in go.opentelemetry.io/otel/log/logtest as a test-only dep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-27T17:11:04Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.257	0.250	-2.8%
Peak CPU (cores)	0.271	0.268	-1.2%
Avg Memory (MiB)	321.914	276.831	-14.0%
Peak Memory (MiB)	323.180	282.773	-12.5%

Dedup Effectiveness

No data available.

Event Counters

Metric	BEFORE	AFTER
capability_counter	9	0
dns_counter	1448	0
exec_counter	7244	0
network_counter	95261	0
open_counter	794096	0
syscall_counter	3688	0

When httpExporterConfig.timeoutSeconds is unset it defaults to 0, which clobbered the intended 5s default and left the http.Client with no timeout. A hung POST to the synchronizer then blocked the sequential send loop for hours — observed in production via the nodeprofile.send span (5–8h durations all ending when the synchronizer reset connections). Only override the default when a positive timeout is configured. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
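
The "only override when positive" rule can be sketched as follows. The 5s default is taken from the commit message; `clientTimeout` and `newClient` are hypothetical helpers, not the exporter's actual functions.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// defaultTimeout is the intended fallback from the commit message.
const defaultTimeout = 5 * time.Second

// clientTimeout keeps the default unless a positive value is configured.
// An unset config field arrives here as 0 and must not clobber the default.
func clientTimeout(timeoutSeconds int) time.Duration {
	if timeoutSeconds > 0 {
		return time.Duration(timeoutSeconds) * time.Second
	}
	return defaultTimeout
}

// newClient builds an http.Client that always has a finite timeout.
func newClient(timeoutSeconds int) *http.Client {
	return &http.Client{Timeout: clientTimeout(timeoutSeconds)}
}

func main() {
	fmt.Println("unset:     ", newClient(0).Timeout)
	fmt.Println("configured:", newClient(30).Timeout)
}
```

A zero `http.Client.Timeout` means no timeout at all, which is why an unset config value turning into 0 allowed a hung request to block indefinitely.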

github-actions · 2026-05-28T07:08:49Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.197	0.195	-0.7%
Peak CPU (cores)	0.204	0.200	-2.0%
Avg Memory (MiB)	332.871	270.537	-18.7%
Peak Memory (MiB)	334.980	273.953	-18.2%

Dedup Effectiveness

No data available.

Event Counters

Metric	BEFORE	AFTER
capability_counter	12	0
dns_counter	1430	0
exec_counter	7151	0
network_counter	94057	0
open_counter	784324	0
syscall_counter	3580	0

node_agent.process.memory.cgroup_bytes read 0 on every pod: node-agent runs with a private cgroup namespace (/proc/self/cgroup = "0::/") while bind-mounting the host /sys/fs/cgroup over its own, so the fixed root path has no memory.current (cgroupv2 root exposes none) and the read fell through to 0. Resolve the process's own cgroup once (cached — a process never changes cgroup): fast path joins /proc/self/cgroup when the namespace is the host's; the node-agent fallback derives the container ID from /proc/self/mountinfo and locates the matching *.scope dir in the host tree (readable thanks to the agent's root + SYS_ADMIN + spc_t context). Also add node_agent.process.memory.cgroup_limit_bytes (memory.max, "max" → 0 = unlimited) so usage pairs with the limit for OOM headroom — the cgroup figure the kernel OOM killer enforces, which RSS undercounts. Verified the resolution path empirically against the live cluster (container memory.current ~295MB vs Go RSS ~220MB). Parsers and scope resolution covered by unit tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
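
Two of the parsing steps mentioned here are small enough to sketch. The helpers below are illustrative assumptions (`parseMemoryMax` and `cgroupV2Path` are hypothetical names): one maps the content of a cgroup v2 memory.max file to bytes with "max" reported as 0, the other extracts the unified-hierarchy path from /proc/self/cgroup content.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryMax converts the content of a memory.max file to bytes.
// The literal "max" means unlimited and is reported as 0.
func parseMemoryMax(raw string) (uint64, error) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, nil
	}
	return strconv.ParseUint(s, 10, 64)
}

// cgroupV2Path returns the path from the "0::<path>" line of
// /proc/self/cgroup content, and false if no such line exists.
func cgroupV2Path(procSelfCgroup string) (string, bool) {
	for _, line := range strings.Split(procSelfCgroup, "\n") {
		if strings.HasPrefix(line, "0::") {
			return strings.TrimSpace(strings.TrimPrefix(line, "0::")), true
		}
	}
	return "", false
}

func main() {
	fmt.Println(parseMemoryMax("max\n"))
	fmt.Println(parseMemoryMax("766509056\n"))
	fmt.Println(cgroupV2Path("0::/\n"))
	fmt.Println(cgroupV2Path("0::/kubepods.slice/example.scope\n"))
}
```

A path of "/" is the private-namespace case described above, where joining it onto the cgroup mount point lands on the root and a different resolution strategy is needed.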

github-actions · 2026-05-28T08:36:00Z

Performance Benchmark Results

Node-Agent Resource Usage

Metric	BEFORE	AFTER	Delta
Avg CPU (cores)	0.197	0.204	+3.5%
Peak CPU (cores)	0.209	0.211	+0.8%
Avg Memory (MiB)	338.450	271.873	-19.7%
Peak Memory (MiB)	342.008	276.359	-19.2%

Dedup Effectiveness

No data available.

Event Counters

Metric	BEFORE	AFTER
capability_counter	11	0
dns_counter	1423	0
exec_counter	7153	0
network_counter	93994	0
open_counter	783656	0
syscall_counter	3552	0

The first attempt self-discovered the container ID from /proc/self/mountinfo, but node-agent mounts /host with shared mount propagation, so its mountinfo is polluted with every other container's ID. The "first scope that resolves" heuristic then matched a random foreign container — cgroup_bytes reported ~1MB with no limit instead of the agent's own ~295MB/731MB. Resolve node-agent's exact container ID from the k8s API at startup (POD_NAME/NAMESPACE_NAME Downward-API env + existing client → containerStatuses[node-agent].containerID) and thread it into the resource-metric resolver, which then locates the precise *.scope dir in the host cgroup tree. Drop the unreliable mountinfo guessing. Resolution is best-effort: an API failure or non-k8s mode yields "" and the gauges fall back to /proc-based resolution (0 if unavailable), never blocking startup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

matthyx force-pushed the feat/otel-instrumentation-phase1 branch from bd33437 to dcb83de Compare May 19, 2026 19:24

matthyx added this to KS PRs tracking May 20, 2026

coderabbitai Bot reviewed May 20, 2026

Comment thread cmd/main.go

Comment thread cmd/main.go

Comment thread docs/CONFIGURATION.md Outdated

Comment thread pkg/otelsetup/otelsetup_test.go Outdated

Comment thread pkg/otelsetup/setup.go Outdated

Comment thread pkg/otelsetup/setup.go Outdated

coderabbitai Bot reviewed May 20, 2026

matthyx force-pushed the feat/otel-instrumentation-phase1 branch from da9e60f to 2e19259 Compare May 20, 2026 15:29

coderabbitai Bot reviewed May 21, 2026

Comment thread pkg/otelsetup/lifecycle.go

coderabbitai Bot reviewed May 21, 2026

matthyx and others added 9 commits May 21, 2026 21:20

matthyx force-pushed the feat/otel-instrumentation-phase1 branch from 4525e89 to b41f213 Compare May 21, 2026 19:21

chore: bump go-logger to v0.0.30

cd00ceb

matthyx and others added 3 commits May 22, 2026 11:06

matthyx and others added 3 commits May 22, 2026 12:03

chore: bump go-logger to v0.0.31

52e2500

docs: remove ARMO_OTEL_AUTH, update auth header description to creden…

47dcb06

…tial-presence model Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs: debug listener activates via KS_LOGGER_LEVEL=debug, remove ENAB…

ec3002a

…LE_DEBUG_LISTENER Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: bump go-logger to v0.0.32

dda23a9

matthyx moved this to WIP in KS PRs tracking May 26, 2026
