feat(logging): add a component field to log metadata#2354
Conversation
Adds a `component` field to NICo's structured (logfmt) log lines, so logs can be filtered by the emitting component in centralized logging instead of grepping free-text messages. Resolves NVIDIA#2049. `component` is one flat value per line: the binary by default, or — for a line emitted from an in-process subsystem of nico-api (e.g. Site Explorer) — that subsystem. Values: nico-api — API handlers, DB, startup: anything not in a subsystem below ├── site-explorer ├── machine_state_controller ├── switch_controller ├── rack_controller ├── power_shelf_controller ├── network_segments_controller ├── vpc_prefix_controller ├── ib_partition_controller └── attestation_controller nico-bmc-proxy nico-dhcp nico-dsx-exchange-consumer nico-fmds nico-hardware-health nico-rvs nico-test-artifact-cache nico-dpu-agent nico-scout Set per binary via `logfmt::layer().with_component(...)`; subsystems set it on their root span and nested lines inherit it. The existing `controller=` field is unchanged and coexists. Details in docs/observability/logging.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Adnan Dosani <adnand@nvidia.com>
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
I think the firmware event info component should be renamed - avoids any issues when filtering logs by component. |
Renames the `component` field on nico-hardware-health's "Firmware info event" to `firmware_component`, so the logging `component` stays unambiguous when filtering by it — the event's own firmware value no longer collides with the layer-added emitting-component value. Per PR review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Adnan Dosani <adnand@nvidia.com>
| /// `nico-api`). A span may override it for lines within its scope by setting | ||
| /// its own `component` attribute; see `resolve_component`. `None` (the default | ||
| /// until `with_component` is called) emits no `component` field. | ||
| default_component: Option<String>, |
There was a problem hiding this comment.
extra_event_field should already do the job. I don't think any changes to logfmt itself are required.
@poroh can help with details
There was a problem hiding this comment.
Isn't this approach more intentional ? extra_event_fields sounds like a "random JSON blob" you might want to attach to an event. Can you initialize a logger with a set of those ?
There was a problem hiding this comment.
no, it has the exact same intention. extra_event_field specifies which records out of a span will be copied into each event log line.
This was added to make the machine_id that is sometimes a span attribute also an event attribute.
So adding component to extra_event_fields will make the component field of a span show up additionally in every event. Just the same as the new changes hardcode.
There was a problem hiding this comment.
One thing that extra_event_field doesn't cover that might be done by the component field is allowing to set a value if component is not specified in the span. That seems mostly useful for the applications which are not broken down into components yet.
There are various options here:
- Don't bother adding a default component there. The name of the application (e.g.
hw-health) can already be searched by logstream (e.g.k8s_container_name) - Instrument all the spans inside these applications to add the
componentfield - Modify
logfmtto allow specifying a default value forextra_event_fieldin case the field is not specified in the span.
- is a bit more a more generic solution than what is done in the current version of this PR.
From a user point of view I'm not too sure about 1) and whether the component field adds value besides the existing log stream selectors.
There was a problem hiding this comment.
+1 to your option 3 (default in logfmt), with one addition: also inherit the field from parent spans. Without that, a line emitted from site-explorer's per-endpoint work (the explore_endpoint span, which sets object_id but not component) falls back to the default and prints component=nico-api instead of site-explorer. With inheritance the subsystem sets component once on its root span (explore_site) and every nested line picks it up, without instrumenting each span (your option 2).
On value vs selectors: the clear win is separating nico-api's in-process subsystems (site-explorer, the controllers) — they run in one process, so no log-stream selector can tell them apart. For the services that deploy as their own container, you're right the default largely duplicates existing selectors; we'd keep it mainly for uniformity and the issue's "every line" wording.
One side effect: this makes object_id (the other extra_event_fields user) inherit too — safe, since a span that sets its own value keeps it, and one that doesn't now picks up an ancestor's instead of nothing.
There was a problem hiding this comment.
this makes object_id (the other extra_event_fields user) inherit too
that sounds actually desirable
What I'm a bit worried about with inheritance and parent span lookup is that it costs extra performance. If we do it, we might want to at least limit the recursion depth.
Alternatively we look at the codepath which spawn child tasks and enhance them to have the required fields (component, machine_id) set.
There was a problem hiding this comment.
Incorporated your feedback on performance — there's no ancestor walk. Each field is resolved once at span creation by copying the direct parent's already-resolved value (one O(1) lookup + a tiny clone), so the parent's value already encodes the whole chain. No recursion, nothing to depth-limit. Log lines do zero parent lookups — they just read the span's own resolved map.
And object_id inheriting falls out of the same mechanism — agreed that's desirable, so it's kept.
Resolving it in the layer also covers spans created by code we don't own — e.g. library spans inherit component from their parent automatically, and you can't annotate those at their construction site. Same single-level copy either way, and it won't regress when a new spawn site forgets to set the fields.
| let cloned_writer = writer.clone(); | ||
| let layer = layer() | ||
| .with_writer(Arc::new(move || Box::new(cloned_writer.clone()))) | ||
| .with_component("nico-api"); |
There was a problem hiding this comment.
besides the question whether changes to logfmt are actually required due to extra_event_fields doing a very similar job: It would be nice to keep everything in here free from anything "nico" or hardware related. The logfmt layer is a generic logging layer that is easily usable in other applications.
There was a problem hiding this comment.
isn't this simply an added test ensuring that if you log with nico-api it shows on the log line ?
There was a problem hiding this comment.
yes it is. Just saying that if we need to modify the library, then we can just use an arbitrary string here (like component1).
There was a problem hiding this comment.
Understood, so the feedback is basically not to use the string nico on a test of a generic library
There was a problem hiding this comment.
Agreed — nothing NICo/hardware-specific should live in the generic crate. With the generic approach from this thread — the application supplies the field name and an optional default, all configured at the call site — logfmt won't bake in component or any nico string at all, and I'll use generic names in the tests and doc examples.
# Conflicts: # docs/index.yml
Reworks the logfmt layer's component support (NVIDIA#2049) into a single EventField type — a field name with an optional default — configured through one with_event_fields builder. This replaces the earlier with_component setter and folds the separate default/extra field paths into one API. Each configured field is resolved once, when a span is created: the span's own value if it set one, else the value inherited from its parent span. Child spans therefore inherit fields like `component` (and nico-api's `object_id`) with no ancestor walk at log time — event and span lines just read the span's resolved value, falling back to the field's configured default or omitting it. Update all logfmt call sites to with_event_fields and refresh the logging docs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Adnan Dosani <258082943+adnandnv@users.noreply.github.com>
Description
Adds a
componentfield to NICo's structured (logfmt) log lines, so logs can be filtered by the emittingcomponent in centralized logging instead of grepping free-text messages. Resolves #2049.
componentis one flat value per line: the binary by default, or — for a line emitted from an in-processsubsystem of
nico-api(e.g. Site Explorer) — that subsystem. Values:Set per binary via
logfmt::layer().with_event_fields([logfmt::EventField::with_default("component", "nico-…")]).Subsystems of
nico-apisetcomponenton their root span, and nested instrumented functions,.instrument(...)futures, and spawned tasks inherit it. Inheritance is resolved once when a span iscreated — each span uses its own value if it set one, else copies its parent's already-resolved value — so
there is no ancestor walk at log time.
nico-api's existingobject_idevent field inherits through thesame mechanism. The existing
controller=field is unchanged and coexists. Details indocs/observability/logging.md.Type of Change
Related Issues (Optional)
#2049
Breaking Changes
Testing
logfmt: 16/16 — covers default-bearing and no-default fields on both event andlevel=SPANlines, aspan's own value overriding the default, single- and multi-level inheritance, explicit-parent spans, and
values filled in via
recordafter span creation; the pre-existing snapshot tests are unchanged and stillpass.
state-controller: 12/12 (the controller-span change is additive).api-coresuite passes.Manual verification: ran the full integration suite and confirmed
componentattribution across the ninenico-apisubsystems in the fmt-attribution logs (machine_state / site-explorer / network_segments /attestation / rack / switch / ib_partition / vpc_prefix / power_shelf controllers), and captured real
component=nico-dpu-agentlogfmt lines from the agent tests.Additional Notes
logfmtlayer stays domain-agnostic: it exposes a genericEventField(a field name + optionaldefault) plus a single
with_event_fieldsbuilder; thecomponent/object_idnames and per-binarydefaults are configured at the NICo call sites, not baked into the crate.
nico-hardware-health's firmware-inventorycomponentfield tofirmware_component, so thelogging
componentkey stays unambiguous when filtering (its firmware value no longer collides with theemitting-component value).
controller=is intentionally kept (not converged intocomponent=) so existing dashboards/alertsreading it don't break.
componentyet on the non-logfmt binaries (nico-dnsJSON,nico-pxe,nico-ssh-console,nico-dpu-otel-agent) — possible follow-up.