feat(otel): OpenTelemetry traces, logs, drop counters, and OTEL metrics (Phase 1 + 2)#818
feat(otel): OpenTelemetry traces, logs, drop counters, and OTEL metrics (Phase 1 + 2)#818matthyx wants to merge 38 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces legacy OTEL wiring with provider-based InitProviders, adds an OTEL-backed metrics manager, implements ProfileLifecycleTracker, instruments rules/profiles/ebpf/malware/SBOM/gRPC, renames a metrics config flag, updates docs/go.mod, and adds tests and benchmarks. ChangesOpenTelemetry Setup and Integration
Sequence Diagram(s): omitted (changes are multiple feature areas; no single three-component sequential flow benefits more than the above diagrams). Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
bd33437 to
dcb83de
Compare
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/containerwatcher/v2/container_watcher.go (1)
490-501:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRelease dropped events on the non-blocking drop path.
On Line 490, dropped events skip the worker callback (where
enrichedEvent.Event.Release()normally happens), so they are never released.💡 Proposed fix
} else { logger.L().Warning("ContainerWatcher - Worker channel full, dropping event", helpers.String("eventType", string(entry.EventType)), helpers.String("containerID", entry.ContainerID)) cw.ebpfDropCounter.Add(context.Background(), 1, metric.WithAttributes( attribute.String("event_type", string(entry.EventType)), attribute.String("reason", "worker_channel_full"), ), ) + enrichedEvent.Event.Release() } } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/containerwatcher/v2/container_watcher.go` around lines 490 - 501, The non-blocking "worker channel full" branch drops events without releasing their resources; update the else branch so that before logging/incrementing cw.ebpfDropCounter you call the event release used in the worker path (e.g. call enrichedEvent.Event.Release() or the appropriate Release() on the event stored in entry), guarding with a nil-check if necessary, so dropped events are properly released just like in the worker callback.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/main.go`:
- Around line 116-122: The deferred OTEL shutdown in main is skipped because
main calls os.Exit; move to the safer pattern: create a run() function that
contains the existing setup and the defer block that calls otelShutdown using
context.WithTimeout (and defers cancel), keep the defer exactly as in the diff,
replace all internal os.Exit(...) calls inside run() with returns of the
corresponding exit code, and change main to call os.Exit(run()); ensure you
reference the existing otelShutdown variable and the defer with
context.WithTimeout so the shutdown runs on SIGINT/SIGTERM paths.
- Around line 103-112: The OTEL provider call currently passes raw os.Getenv
values for NodeName, PodName, Namespace and ClusterName which bypass the
resolved config and clusterData; update the otelsetup.InitProviders
ProviderConfig construction (the call site where ProviderConfig is built) to use
the resolved config and cluster values (e.g., cfg.NodeName / cfg.PodName /
cfg.Namespace / cfg.ClusterName and existing clusterData fields) with optional
fallbacks to the env vars if needed so telemetry uses the final config values
instead of the raw environment.
In `@docs/CONFIGURATION.md`:
- Around line 112-115: Add a new environment variable row documenting
KS_LOGGER_NAME to the environment variables table, noting its default value and
valid options (e.g., "slog" vs "prettylogger"/"zaplogger") and that setting
KS_LOGGER_NAME=slog activates the ring buffer used by the retroactive log
export; also reference that ENABLE_DEBUG_LISTENER must be true to keep the last
7,500 log records in memory and that they can be re-emitted via a POST to
/debug/flush-ring-buffer.
In `@pkg/otelsetup/otelsetup_test.go`:
- Around line 156-161: Test TestSlowEvalThreshold_Default modifies the
package-global atomic slowEvalThresholdNs and does not restore it; update the
test to capture the current value of slowEvalThresholdNs before calling
slowEvalThresholdNs.Store(...), and defer restoring the saved value (using
slowEvalThresholdNs.Store(old)) so the global state is returned to its prior
value after the test; reference the slowEvalThresholdNs symbol and the
TestSlowEvalThreshold_Default test and keep the assertion on SlowEvalThreshold()
unchanged.
In `@pkg/otelsetup/setup.go`:
- Around line 185-187: The exporter option construction uses
otlptracegrpc.WithEndpoint (and likewise otlpmetricgrpc.WithEndpoint and
otlplogsgrpc.WithEndpoint) which requires host:port; update the logic that
builds traceOpts/metricOpts/logOpts to detect URL-style endpoints (use the
existing isARMOEndpoint() utility or check for "://") and call
WithEndpointURL(endpoint) when a scheme is present, otherwise keep
WithEndpoint(endpoint); apply the same change for the metric and log exporter
option blocks referenced around the other diffs so all three exporters handle
both host:port and full URL endpoints correctly.
- Around line 287-289: The shutdown closure currently derives shutdownCtx from
the incoming c which can be already cancelled; change it to create a fresh
timeout context using context.WithTimeout(context.Background(), 5*time.Second)
(instead of context.WithTimeout(c, ...)) so provider shutdown/flush calls are
not short-circuited by the caller's cancelled context; update the shutdown
function (the shutdown variable in setup.go) to use that fresh context and still
call defer cancel() as before.
---
Outside diff comments:
In `@pkg/containerwatcher/v2/container_watcher.go`:
- Around line 490-501: The non-blocking "worker channel full" branch drops
events without releasing their resources; update the else branch so that before
logging/incrementing cw.ebpfDropCounter you call the event release used in the
worker path (e.g. call enrichedEvent.Event.Release() or the appropriate
Release() on the event stored in entry), guarding with a nil-check if necessary,
so dropped events are properly released just like in the worker callback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1c5cf97b-b20d-4d20-93f8-9fcf98af12b5
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (16)
cmd/main.godocs/CONFIGURATION.mdgo.modpkg/config/config.gopkg/config/config_test.gopkg/containerprofilemanager/v1/containerprofile_manager.gopkg/containerprofilemanager/v1/monitoring.gopkg/containerwatcher/v2/container_watcher.gopkg/containerwatcher/v2/event_handler_factory.gopkg/containerwatcher/v2/tracers/top.gopkg/malwaremanager/v1/malware_manager.gopkg/otelsetup/lifecycle.gopkg/otelsetup/otelsetup_test.gopkg/otelsetup/setup.gopkg/rulemanager/rule_manager.gopkg/sbommanager/v1/sbom_manager.go
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/otelsetup/setup.go (1)
171-177:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
ProviderConfigstill drops cluster/version metadata.
cfg.ClusterNameandcfg.ServiceVersionare passed intoInitProviders, but they never make it into the shared OTEL resource. That means all three signals still lose cluster identity and agent version even aftercmd/main.gowas updated to resolve those values first.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/otelsetup/setup.go` around lines 171 - 177, The OTEL resource built in resource.Merge (inside InitProviders / setup.go) omits cluster and service version; update the resource.NewWithAttributes call to include cfg.ClusterName and cfg.ServiceVersion by adding the appropriate semantic attributes (e.g. semconv.K8SClusterName(cfg.ClusterName) and semconv.ServiceVersion(cfg.ServiceVersion)) alongside semconv.ServiceName, semconv.K8SNodeName, semconv.K8SPodName, and semconv.K8SNamespaceName so the shared resource carries cluster identity and agent version.
♻️ Duplicate comments (1)
cmd/main.go (1)
116-122:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDeferred OTEL shutdown is still skipped on hard-exit paths.
The defer at Lines 116-122 never runs after
os.Exit(...), and the laterFatal(...)paths terminate immediately as well. Normal SIGTERM/SIGINT shutdown can still bypass provider flush and drop batched telemetry.Also applies to: 131-139, 186-187, 495-528
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/main.go` around lines 116 - 122, The deferred otelShutdown in the anonymous defer block will be skipped on hard exits (os.Exit and log.Fatal paths) so ensure otelShutdown(shutdownCtx) is invoked before any immediate exit: replace direct os.Exit / log.Fatal usage or any hard-exit paths that occur near the locations mentioned with a small exit helper (e.g., callExitWithShutdown) that calls otelShutdown with a context timeout (same 5s pattern used in the current defer), waits for completion (or logs error), then performs the actual os.Exit; update all occurrences that terminate immediately (references: otelShutdown, the anonymous defer block, and places that call os.Exit / Fatal) to use that helper so provider flush runs on hard-exit paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/otelsetup/setup.go`:
- Around line 171-177: The OTEL resource built in resource.Merge (inside
InitProviders / setup.go) omits cluster and service version; update the
resource.NewWithAttributes call to include cfg.ClusterName and
cfg.ServiceVersion by adding the appropriate semantic attributes (e.g.
semconv.K8SClusterName(cfg.ClusterName) and
semconv.ServiceVersion(cfg.ServiceVersion)) alongside semconv.ServiceName,
semconv.K8SNodeName, semconv.K8SPodName, and semconv.K8SNamespaceName so the
shared resource carries cluster identity and agent version.
---
Duplicate comments:
In `@cmd/main.go`:
- Around line 116-122: The deferred otelShutdown in the anonymous defer block
will be skipped on hard exits (os.Exit and log.Fatal paths) so ensure
otelShutdown(shutdownCtx) is invoked before any immediate exit: replace direct
os.Exit / log.Fatal usage or any hard-exit paths that occur near the locations
mentioned with a small exit helper (e.g., callExitWithShutdown) that calls
otelShutdown with a context timeout (same 5s pattern used in the current defer),
waits for completion (or logs error), then performs the actual os.Exit; update
all occurrences that terminate immediately (references: otelShutdown, the
anonymous defer block, and places that call os.Exit / Fatal) to use that helper
so provider flush runs on hard-exit paths.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aa4254e2-e99b-47b8-af17-b2fe32adc73c
📒 Files selected for processing (4)
cmd/main.gopkg/containerwatcher/v2/container_watcher.gopkg/otelsetup/otelsetup_test.gopkg/otelsetup/setup.go
|
@coderabbitai review again |
|
✅ Actions performedFull review triggered. |
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
da9e60f to
2e19259
Compare
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/otelsetup/lifecycle.go (1)
39-45:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid evicting another container when replacing an existing tracked one.
At capacity, eviction runs before the replace check. If
containerIDalready exists, this can evict an unrelated active lifecycle span and reduce tracked coverage unnecessarily.Suggested patch
func (t *ProfileLifecycleTracker) OnLearningStarted(containerID, namespace, pod, image string) { t.mu.Lock() defer t.mu.Unlock() - if len(t.spans) >= maxTrackedProfiles { - t.evictOldest() - } if existing, ok := t.spans[containerID]; ok { existing.AddEvent("learning.replaced") existing.End() + } else if len(t.spans) >= maxTrackedProfiles { + t.evictOldest() } spanCtx, span := Tracer().Start(context.Background(), "container.profile.learning",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/otelsetup/lifecycle.go` around lines 39 - 45, The current logic evicts oldest span before checking for an existing span, which can remove an unrelated active lifecycle span when containerID is being replaced; change the flow in the lifecycle tracking function so you first check if containerID already exists in t.spans (the existing := t.spans[containerID] path and calls to existing.AddEvent("learning.replaced") / existing.End()), perform the replacement without calling t.evictOldest, and only if the containerID is not already present enforce capacity by calling t.evictOldest when len(t.spans) >= maxTrackedProfiles; this preserves existing spans and only evicts when adding a new tracked container.pkg/containerwatcher/v2/container_watcher.go (1)
179-183:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHandle the
Int64Countercreation error instead of discarding it.In
pkg/containerwatcher/v2/container_watcher.go(lines 179-183), the code ignores theerrfromotelsetup.Meter().Int64Counter(...). OpenTelemetry-Go’s contract intends the returned instrument to remain usable/non-nil even whenerr != nil, but the error still signals an instrument registration/name issue (e.g.,ErrInstrumentName) that can lead to incorrect/conflicting metric streams—so handle/log/propagateerrinstead of dropping it.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/containerwatcher/v2/container_watcher.go` around lines 179 - 183, The Int64Counter creation call using otelsetup.Meter().Int64Counter assigned to ebpfDropCounter currently discards its returned error; capture the error (e.g., ebpfDropCounter, err := otelsetup.Meter().Int64Counter(...)) and handle it—log it via the existing logger (processLogger or similar) or return it from the containing function so instrument registration problems (ErrInstrumentName, etc.) are visible; ensure the handling occurs where ebpfDropCounter is declared so downstream code either gets a valid instrument or the startup fails/alerts appropriately.
🧹 Nitpick comments (1)
pkg/otelsetup/otelsetup_test.go (1)
116-121: ⚡ Quick winMake eviction ordering deterministic in the test (avoid
time.Sleep).Using
time.Sleepfor ordering can cause intermittent flakes. SettingstartTimesexplicitly under lock keeps this test deterministic.Suggested patch
tracker := NewProfileLifecycleTracker() tracker.OnLearningStarted("old", "ns", "pod", "") - time.Sleep(time.Millisecond) tracker.OnLearningStarted("new", "ns", "pod", "") tracker.mu.Lock() + tracker.startTimes["old"] = time.Unix(0, 1) + tracker.startTimes["new"] = time.Unix(0, 2) tracker.evictOldest() tracker.mu.Unlock()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/otelsetup/otelsetup_test.go` around lines 116 - 121, The test relies on time.Sleep to create ordering for eviction; instead make ordering deterministic by acquiring tracker.mu, setting tracker.startTimes for the relevant keys to explicit timestamps (older for the entry you expect evicted, newer for the one to keep), release the lock, then call tracker.OnLearningStarted("new","ns","pod","") and invoke tracker.mu.Lock(); tracker.evictOldest(); tracker.mu.Unlock(); this replaces the non-deterministic time.Sleep and ensures evictOldest() sees the intended startTimes ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/otelsetup/lifecycle.go`:
- Around line 97-103: OnEntrySaved currently increments t.counts[containerID]
before verifying the container is tracked, which can grow counts for unknown
IDs; fix by checking the map lookup result (ctx, ok := t.ctxs[containerID])
while holding t.mu and only incrementing t.counts[containerID] (and reading
count) if ok is true, moving the increment into the guarded branch and returning
early without touching t.counts when ok is false; ensure the lock/unlock
semantics around t.mu remain correct.
---
Outside diff comments:
In `@pkg/containerwatcher/v2/container_watcher.go`:
- Around line 179-183: The Int64Counter creation call using
otelsetup.Meter().Int64Counter assigned to ebpfDropCounter currently discards
its returned error; capture the error (e.g., ebpfDropCounter, err :=
otelsetup.Meter().Int64Counter(...)) and handle it—log it via the existing
logger (processLogger or similar) or return it from the containing function so
instrument registration problems (ErrInstrumentName, etc.) are visible; ensure
the handling occurs where ebpfDropCounter is declared so downstream code either
gets a valid instrument or the startup fails/alerts appropriately.
In `@pkg/otelsetup/lifecycle.go`:
- Around line 39-45: The current logic evicts oldest span before checking for an
existing span, which can remove an unrelated active lifecycle span when
containerID is being replaced; change the flow in the lifecycle tracking
function so you first check if containerID already exists in t.spans (the
existing := t.spans[containerID] path and calls to
existing.AddEvent("learning.replaced") / existing.End()), perform the
replacement without calling t.evictOldest, and only if the containerID is not
already present enforce capacity by calling t.evictOldest when len(t.spans) >=
maxTrackedProfiles; this preserves existing spans and only evicts when adding a
new tracked container.
---
Nitpick comments:
In `@pkg/otelsetup/otelsetup_test.go`:
- Around line 116-121: The test relies on time.Sleep to create ordering for
eviction; instead make ordering deterministic by acquiring tracker.mu, setting
tracker.startTimes for the relevant keys to explicit timestamps (older for the
entry you expect evicted, newer for the one to keep), release the lock, then
call tracker.OnLearningStarted("new","ns","pod","") and invoke
tracker.mu.Lock(); tracker.evictOldest(); tracker.mu.Unlock(); this replaces the
non-deterministic time.Sleep and ensures evictOldest() sees the intended
startTimes ordering.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 226ea8a5-cbde-422e-82d6-c29aa8dcd6c0
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (7)
cmd/main.gogo.modpkg/containerprofilemanager/v1/monitoring.gopkg/containerwatcher/v2/container_watcher.gopkg/otelsetup/lifecycle.gopkg/otelsetup/otelsetup_test.gopkg/otelsetup/setup.go
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/rulemanager/rule_manager.go (1)
372-385:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSlow-eval span gate is always true with the documented default.
Line 374 uses
evaluationTime >= otelsetup.SlowEvalThreshold(). IfOTEL_SLOW_EVAL_THRESHOLD_MSdefaults to0, this records spans for every rule evaluation, not only slow ones.Suggested fix
- if evaluationTime >= otelsetup.SlowEvalThreshold() { + if threshold := otelsetup.SlowEvalThreshold(); threshold > 0 && evaluationTime >= threshold { _, span := otelsetup.Tracer().Start(rm.ctx, "rule.evaluate", trace.WithAttributes(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/rulemanager/rule_manager.go` around lines 372 - 385, The slow-eval span gate currently uses >= which with the default OTEL_SLOW_EVAL_THRESHOLD_MS == 0 will record every evaluation; change the condition in rule_manager.go from evaluationTime >= otelsetup.SlowEvalThreshold() to evaluationTime > otelsetup.SlowEvalThreshold() (ensuring both sides are the same unit, e.g., time.Duration) so only truly slower-than-threshold evaluations create the "rule.evaluate" span; if otelsetup.SlowEvalThreshold() can be zero by design, also consider ensuring its default is a positive duration instead of 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/rulemanager/rule_manager.go`:
- Around line 372-385: The slow-eval span gate currently uses >= which with the
default OTEL_SLOW_EVAL_THRESHOLD_MS == 0 will record every evaluation; change
the condition in rule_manager.go from evaluationTime >=
otelsetup.SlowEvalThreshold() to evaluationTime > otelsetup.SlowEvalThreshold()
(ensuring both sides are the same unit, e.g., time.Duration) so only truly
slower-than-threshold evaluations create the "rule.evaluate" span; if
otelsetup.SlowEvalThreshold() can be zero by design, also consider ensuring its
default is a positive duration instead of 0.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0cfa1ecd-a736-45e0-baea-57fe5d4815f4
📒 Files selected for processing (10)
pkg/containerprofilemanager/v1/monitoring.gopkg/containerwatcher/v2/container_watcher.gopkg/exporters/alert_bulk_manager.gopkg/exporters/alert_manager.gopkg/exporters/http_exporter.gopkg/malwaremanager/v1/clamav/clamav.gopkg/malwaremanager/v1/clamav/exec.gopkg/malwaremanager/v1/clamav/open.gopkg/malwaremanager/v1/malware_manager.gopkg/rulemanager/rule_manager.go
✅ Files skipped from review due to trivial changes (2)
- pkg/exporters/http_exporter.go
- pkg/exporters/alert_bulk_manager.go
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
Introduces OTEL instrumentation across node-agent without touching existing Prometheus metrics: - pkg/otelsetup: new package with InitProviders (TracerProvider + LoggerProvider + MeterProvider via OTLP gRPC), ARMO auth header injection, ring-buffer log processor (7500-entry), ProfileLifecycleTracker (one span per container learning period, cap 10k), SlowEvalThreshold, EmitAlertLogRecord with 60s/1000-entry dedup LRU, and debug HTTP listener - pkg/rulemanager: emit alert OTEL log records per fired rule; add slow-path span for evaluations exceeding SlowEvalThreshold - pkg/malwaremanager: emit alert OTEL log records for malware detections - pkg/containerwatcher: count dropped eBPF events via Int64Counter (node_agent.ebpf.events_dropped.total) with reason label - pkg/containerprofilemanager: wire ProfileLifecycleTracker lifecycle hooks (OnLearningStarted/OnEntrySaved/OnLearningEnded) - pkg/sbommanager: attach otelgrpc stats handler to gRPC dial - docs/CONFIGURATION.md: document new OTEL env vars; mark OTEL_COLLECTOR_SVC as deprecated Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
InitProviders now guards each exporter behind its own non-empty endpoint check, so a traces-only config no longer instantiates log/metric exporters against empty targets (avoiding retry loops). The ARMO no-credentials guard is extended to cover metrics-only ARMO configs. Each provider pointer is nil-checked in the combined shutdown. Debug listener is gated on logProvider being non-nil. Alert-log dedup in rule_manager replaces the racy Contains+Add pair with a mutex-protected check-and-set, closing the TOCTOU window under the ants worker pool. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- cmd/main.go: use resolved cfg/clusterData values for OTEL resource attributes instead of raw os.Getenv (NodeName, PodName, NamespaceName, ClusterName) - setup.go: detect URL-scheme endpoints and route to WithEndpointURL; host:port paths continue using WithEndpoint for all three exporters - setup.go: shutdown closure derives fresh context.Background() timeout so provider flush is not short-circuited by an already-cancelled caller context - otelsetup_test.go: restore slowEvalThresholdNs global in t.Cleanup to prevent order-dependent test failures - container_watcher.go: call Release() on dropped events in the worker_channel_full path to prevent resource leak Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
The shared resource was missing K8SClusterName and ServiceVersion attributes, so all three signals were losing cluster identity and agent version even after cmd/main.go was updated to resolve these values from clusterData and ProviderConfig. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
resource.Default() in otel v1.43.0 sets schema URL 1.40.0 internally, while resource.NewWithAttributes(semconv.SchemaURL, ...) sets 1.26.0 from our semconv/v1.26.0 import. resource.Merge rejects two non-empty conflicting schema URLs, causing InitProviders to fail at startup. Switch to resource.NewSchemaless so our custom attributes carry no schema URL; Merge then adopts the default resource's schema without conflict. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OTLP gRPC exporters default to TLS, which fails against plaintext collectors (e.g. local SignOz on port 4317). Add WithInsecure() for all three exporters when the endpoint does not start with https://. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing buffer severity gate - ProfileLifecycleTracker: one container.profile.learning span per container with LearningSpanID() and LearningTraceparent() for W3C propagation - OnEntrySaved: emit container.profile.cp.saved child spans with M2 throttle (count==1, %10, or hasDropped) - ContainerProfile annotations: OtelSpanIDMetadataKey + OtelTraceparentMetadataKey (k8s-interface v0.0.213) - otelsetup: thin wrapper delegating to go-logger/otelsetup (v0.0.29) which includes ring buffer with severity≥Info gate - Bump k8s-interface v0.0.212→v0.0.213 for OtelTraceparentMetadataKey Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…amAV, drop events Wires OTEL log context on the call sites the design doc marks as Tier 1 (direct customer impact) and Tier 2 (operational health): - Alert delivery failures: alert_manager.go SendRuleAlert/SendMalwareAlert, alert_bulk_manager.go bulk-send max retries / queue-full / drain timeout, http_exporter.go SendRuleAlert / SendFimAlerts / alert-limit - Rule eval failures: rule_manager.go ReportEnrichedEvent / EvaluatePolicyRules / getUniqueIdAndMessage — use rm.ctx for trace correlation - ClamAV health: clamav.go ping failure, exec.go/open.go scan failures - Drop events: container_watcher.go worker-channel-full lines (Tier 2) - Profile save failures: containerprofilemanager monitoring.go — use cpm.ctx Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… to exec malware path - rule_manager.go: set span.SetStatus(codes.Error) when CEL evaluation returns an error so failed evaluations are distinguishable from successful non-alerts in traces - malware_manager.go: mirror reportFileOpen OTEL emission in reportFileExec — adds metrics.ReportRuleAlert + EmitAlertLogRecord to the exec malware detection path so exec-path detections appear in telemetry alongside open-path ones Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4525e89 to
b41f213
Compare
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
- Add pkg/metricsmanager/otel/ with full MetricsManager implementation backed by OTEL SDK; attribute-set caching on all hot paths eliminates per-call allocations (2× faster, 10× less memory vs Prometheus on the histogram path per A/B benchmark) - Wire OTEL metrics in cmd/main.go; drop prometheus package import - Add Prometheus scrape mode: OTEL_METRICS_EXPORTER=prometheus installs an OTEL→Prometheus bridge and starts :8080/metrics listener - Standardise rule.ID at all metric call sites (was rule.Name); malware alerts use constant "malware" to bound cardinality - Add docs/metrics-migration.md mapping every old Prometheus name to its new OTEL name (breaking rename — dashboards must be updated) - Add A/B benchmarks in otel/ and prometheus/ packages; hard gate passes: OTEL allocs/op ≤ Prometheus allocs/op, ns/op ≤ 1.1× Prometheus Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Picks up buildAuthHeaders refactor and AC8/AC9 unit tests for ARMO auth header injection (X-API-Key, X-Customer-GUID) across trace/log/metric exporter option builders. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
otel.SetMeterProvider(mp) was called before net.Listen(":8080"), so a
port-conflict error (sidecar when main agent owns :8080) left a leaked
Prometheus-backed provider as the global, silently blackholing sidecar
metrics. Reorder to attempt net.Listen first; only install the provider
after the listener succeeds.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…reateSbomManager Prevents a runtime panic if CreateSbomManager is ever called with a nil MetricsManager (e.g. in tests). Substitutes the no-op implementation rather than returning an error, consistent with how other managers in this codebase treat optional metric dependencies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e values HeapAlloc is a live snapshot that can decrease when GC runs mid-scan, making heap.alloc.delta_mb negative or understated. Switch to TotalAlloc (monotonically increasing cumulative bytes) so the delta always reflects actual allocations made during the scan. Rename span attributes from heap.alloc.* to alloc.total.* to avoid implying a live-heap measurement. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
Picks up credential-presence auth header gate — drops isARMOEndpoint() hostname matching in favour of accessKey != "" check, consistent with the SBOM scan-failure reporter and HTTP exporter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tial-presence model Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…LE_DEBUG_LISTENER Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
Picks up debug listener activation from KS_LOGGER_LEVEL=debug directly in the library — no ENABLE_DEBUG_LISTENER env var or DebugListener config field needed in callers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
- Add KS_LOGGER_LEVEL and KS_LOGGER_NAME to CONFIGURATION.md env vars table - Gate runtime metrics on all metric-exporter env vars (OTEL_METRICS_EXPORTER, OTEL_EXPORTER_OTLP_METRICS_ENDPOINT) not just the base endpoint - Unregister sbom/alert counters in PrometheusMetric.Destroy() - Propagate prometheus HTTP server shutdown error via errors.Join Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
…gent - New node_agent.process.memory.rss_bytes (VmRSS from /proc/self/status) - New node_agent.process.memory.cgroup_bytes (cgroupv2 first, cgroupv1 fallback) - New node_agent.host.memory.total_bytes (MemTotal from /proc/meminfo, read once) - New node_agent.host.cpu.count (runtime.NumCPU()) - New node_agent.container.count (atomic start-stop delta, observable gauge) - Wire goruntime.Start() in cmd/main.go (was only in sbom-scanner) Closes Appendix A "Per-agent CPU and memory" and "Host total resources" gaps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
Log records emitted via logger.L().Ctx(ctx) inherit trace_id/span_id when ctx carries an active span — the gap is purely call-site, not SDK. Pass spanned contexts at four sites where one is already in scope but the log fell back to a parent context, and add a single nodeprofile.send span to cover the only remaining warning that fires without a parent. - pkg/otelsetup/lifecycle.go: expose LearningCtx(containerID) so callers can reach the open learning span from outside the tracker package. - pkg/containerprofilemanager/v1/monitoring.go: four save-profile error logs now use the learning ctx — failures correlate to the container's learning trace. - pkg/rulemanager/rule_manager.go: hoist evalCtx so the post-eval error log inherits the rule.evaluate span when slow-path tracing fires. - pkg/nodeprofilemanager/v1/nodeprofile_manager.go: wrap sendProfile in a nodeprofile.send span (1 span / NodeProfileInterval / pod) so the recurring HTTP-EOF warning carries trace context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
TestZapWarning_AttachesSpanContext exercises the full path used at every fix site (logger.L().Ctx(spanCtx).Warning(...)) with an in-memory log recorder and verifies the emitted record's context carries the span's TraceID/SpanID — which is what the OTEL Logs SDK stamps into the trace_id/span_id columns downstream. Adds TestProfileLifecycleTracker_LearningCtx and a nil-safety variant for the new LearningCtx getter. Pulls in go.opentelemetry.io/otel/log/logtest as a test-only dep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
When httpExporterConfig.timeoutSeconds is unset it defaults to 0, which clobbered the intended 5s default and left the http.Client with no timeout. A hung POST to the synchronizer then blocked the sequential send loop for hours — observed in production via the nodeprofile.send span (5–8h durations all ending when the synchronizer reset connections). Only override the default when a positive timeout is configured. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
node_agent.process.memory.cgroup_bytes read 0 on every pod: node-agent runs with a private cgroup namespace (/proc/self/cgroup = "0::/") while bind-mounting the host /sys/fs/cgroup over its own, so the fixed root path has no memory.current (cgroupv2 root exposes none) and the read fell through to 0. Resolve the process's own cgroup once (cached — a process never changes cgroup): fast path joins /proc/self/cgroup when the namespace is the host's; the node-agent fallback derives the container ID from /proc/self/mountinfo and locates the matching *.scope dir in the host tree (readable thanks to the agent's root + SYS_ADMIN + spc_t context). Also add node_agent.process.memory.cgroup_limit_bytes (memory.max, "max" → 0 = unlimited) so usage pairs with the limit for OOM headroom — the cgroup figure the kernel OOM killer enforces, which RSS undercounts. Verified the resolution path empirically against the live cluster (container memory.current ~295MB vs Go RSS ~220MB). Parsers and scope resolution covered by unit tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
The first attempt self-discovered the container ID from /proc/self/mountinfo, but node-agent mounts /host with shared mount propagation, so its mountinfo is polluted with every other container's ID. The "first scope that resolves" heuristic then matched a random foreign container — cgroup_bytes reported ~1MB with no limit instead of the agent's own ~295MB/731MB. Resolve node-agent's exact container ID from the k8s API at startup (POD_NAME/NAMESPACE_NAME Downward-API env + existing client → containerStatuses[node-agent].containerID) and thread it into the resource-metric resolver, which then locates the precise *.scope dir in the host cgroup tree. Drop the unreliable mountinfo guessing. Resolution is best-effort: an API failure or non-k8s mode yields "" and the gauges fall back to /proc-based resolution (0 if unavailable), never blocking startup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 1 — Traces, logs, drop counters
Phase 2 — Replace Prometheus metrics with OTEL SDK
New env vars
`OTEL_COLLECTOR_SVC` is now deprecated (superseded by `OTEL_EXPORTER_OTLP_ENDPOINT`).
Breaking change
Metric names changed. See `docs/metrics-migration.md` for the full mapping and dashboard update checklist.
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests