Skip to content

feat(logging): add a component field to log metadata#2354

Open
adnandnv wants to merge 6 commits into
NVIDIA:mainfrom
adnandnv:feature/2049-log-component
Open

feat(logging): add a component field to log metadata#2354
adnandnv wants to merge 6 commits into
NVIDIA:mainfrom
adnandnv:feature/2049-log-component

Conversation

@adnandnv

@adnandnv adnandnv commented Jun 10, 2026

Copy link
Copy Markdown

Description

Adds a component field to NICo's structured (logfmt) log lines, so logs can be filtered by the emitting
component in centralized logging instead of grepping free-text messages. Resolves #2049.

component is one flat value per line: the binary by default, or — for a line emitted from an in-process
subsystem of nico-api (e.g. Site Explorer) — that subsystem. Values:

nico-api                       — API handlers, DB, startup: anything not in a subsystem below
├── site-explorer
├── machine_state_controller
├── switch_controller
├── rack_controller
├── power_shelf_controller
├── network_segments_controller
├── vpc_prefix_controller
├── ib_partition_controller
└── attestation_controller
nico-bmc-proxy
nico-dhcp
nico-dsx-exchange-consumer
nico-fmds
nico-hardware-health
nico-rvs
nico-test-artifact-cache
nico-dpu-agent
nico-scout

Set per binary via logfmt::layer().with_event_fields([logfmt::EventField::with_default("component", "nico-…")]).
Subsystems of nico-api set component on their root span, and nested instrumented functions,
.instrument(...) futures, and spawned tasks inherit it. Inheritance is resolved once when a span is
created — each span uses its own value if it set one, else copies its parent's already-resolved value — so
there is no ancestor walk at log time. nico-api's existing object_id event field inherits through the
same mechanism. The existing controller= field is unchanged and coexists. Details in
docs/observability/logging.md.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

#2049

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

logfmt: 16/16 — covers default-bearing and no-default fields on both event and level=SPAN lines, a
span's own value overriding the default, single- and multi-level inheritance, explicit-parent spans, and
values filled in via record after span creation; the pre-existing snapshot tests are unchanged and still
pass. state-controller: 12/12 (the controller-span change is additive). api-core suite passes.

Manual verification: ran the full integration suite and confirmed component attribution across the nine
nico-api subsystems in the fmt-attribution logs (machine_state / site-explorer / network_segments /
attestation / rack / switch / ib_partition / vpc_prefix / power_shelf controllers), and captured real
component=nico-dpu-agent logfmt lines from the agent tests.

Additional Notes

  • The logfmt layer stays domain-agnostic: it exposes a generic EventField (a field name + optional
    default) plus a single with_event_fields builder; the component/object_id names and per-binary
    defaults are configured at the NICo call sites, not baked into the crate.
  • Renamed nico-hardware-health's firmware-inventory component field to firmware_component, so the
    logging component key stays unambiguous when filtering (its firmware value no longer collides with the
    emitting-component value).
  • controller= is intentionally kept (not converged into component=) so existing dashboards/alerts
    reading it don't break.
  • No component yet on the non-logfmt binaries (nico-dns JSON, nico-pxe, nico-ssh-console,
    nico-dpu-otel-agent) — possible follow-up.

Adds a `component` field to NICo's structured (logfmt) log lines, so logs can
be filtered by the emitting component in centralized logging instead of
grepping free-text messages. Resolves NVIDIA#2049.

`component` is one flat value per line: the binary by default, or — for a line
emitted from an in-process subsystem of nico-api (e.g. Site Explorer) — that
subsystem. Values:

nico-api                       — API handlers, DB, startup: anything not in a subsystem below
├── site-explorer
├── machine_state_controller
├── switch_controller
├── rack_controller
├── power_shelf_controller
├── network_segments_controller
├── vpc_prefix_controller
├── ib_partition_controller
└── attestation_controller
nico-bmc-proxy
nico-dhcp
nico-dsx-exchange-consumer
nico-fmds
nico-hardware-health
nico-rvs
nico-test-artifact-cache
nico-dpu-agent
nico-scout

Set per binary via `logfmt::layer().with_component(...)`; subsystems set it on
their root span and nested lines inherit it. The existing `controller=` field
is unchanged and coexists. Details in docs/observability/logging.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Adnan Dosani <adnand@nvidia.com>
@adnandnv adnandnv requested review from a team and Coco-Ben as code owners June 10, 2026 05:44
@copy-pr-bot

copy-pr-bot Bot commented Jun 10, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4175b4fa-7f1a-4325-a48b-4ef6a932b6f6

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@rohanharkin

Copy link
Copy Markdown
Contributor

I think the firmware event info component should be renamed - avoids any issues when filtering logs by component.
A follow up job could be created for the remaining binaries that don't use logfmt

adnandnv and others added 3 commits June 10, 2026 07:47
Renames the `component` field on nico-hardware-health's "Firmware info event"
to `firmware_component`, so the logging `component` stays unambiguous when
filtering by it — the event's own firmware value no longer collides with the
layer-added emitting-component value. Per PR review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Adnan Dosani <adnand@nvidia.com>
Comment thread crates/logfmt/src/lib.rs Outdated
/// `nico-api`). A span may override it for lines within its scope by setting
/// its own `component` attribute; see `resolve_component`. `None` (the default
/// until `with_component` is called) emits no `component` field.
default_component: Option<String>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra_event_field should already do the job. I don't think any changes to logfmt itself are required.
@poroh can help with details

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this approach more intentional ? extra_event_fields sounds like a "random JSON blob" you might want to attach to an event. Can you initialize a logger with a set of those ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it has the exact same intention. extra_event_field specifies which records out of a span will be copied into each event log line.

This was added to make the machine_id that is sometimes a span attribute also an event attribute.

So adding component to extra_event_fields will make the component field of a span show up additionally in every event. Just the same as the new changes hardcode.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, makes sense.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that extra_event_field doesn't cover that might be done by the component field is allowing to set a value if component is not specified in the span. That seems mostly useful for the applications which are not broken down into components yet.

There are various options here:

  1. Don't bother adding a default component there. The name of the application (e.g. hw-health) can already be searched by logstream (e.g. k8s_container_name)
  2. Instrument all the spans inside these applications to add the component field
  3. Modify logfmt to allow specifying a default value for extra_event_field in case the field is not specified in the span.
  1. is a bit more a more generic solution than what is done in the current version of this PR.
    From a user point of view I'm not too sure about 1) and whether the component field adds value besides the existing log stream selectors.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to your option 3 (default in logfmt), with one addition: also inherit the field from parent spans. Without that, a line emitted from site-explorer's per-endpoint work (the explore_endpoint span, which sets object_id but not component) falls back to the default and prints component=nico-api instead of site-explorer. With inheritance the subsystem sets component once on its root span (explore_site) and every nested line picks it up, without instrumenting each span (your option 2).

On value vs selectors: the clear win is separating nico-api's in-process subsystems (site-explorer, the controllers) — they run in one process, so no log-stream selector can tell them apart. For the services that deploy as their own container, you're right the default largely duplicates existing selectors; we'd keep it mainly for uniformity and the issue's "every line" wording.

One side effect: this makes object_id (the other extra_event_fields user) inherit too — safe, since a span that sets its own value keeps it, and one that doesn't now picks up an ancestor's instead of nothing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes object_id (the other extra_event_fields user) inherit too

that sounds actually desirable

What I'm a bit worried about with inheritance and parent span lookup is that it costs extra performance. If we do it, we might want to at least limit the recursion depth.

Alternatively we look at the codepath which spawn child tasks and enhance them to have the required fields (component, machine_id) set.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorporated your feedback on performance — there's no ancestor walk. Each field is resolved once at span creation by copying the direct parent's already-resolved value (one O(1) lookup + a tiny clone), so the parent's value already encodes the whole chain. No recursion, nothing to depth-limit. Log lines do zero parent lookups — they just read the span's own resolved map.

And object_id inheriting falls out of the same mechanism — agreed that's desirable, so it's kept.

Resolving it in the layer also covers spans created by code we don't own — e.g. library spans inherit component from their parent automatically, and you can't annotate those at their construction site. Same single-level copy either way, and it won't regress when a new spawn site forgets to set the fields.

Comment thread crates/logfmt/src/lib.rs Outdated
let cloned_writer = writer.clone();
let layer = layer()
.with_writer(Arc::new(move || Box::new(cloned_writer.clone())))
.with_component("nico-api");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

besides the question whether changes to logfmt are actually required due to extra_event_fields doing a very similar job: It would be nice to keep everything in here free from anything "nico" or hardware related. The logfmt layer is a generic logging layer that is easily usable in other applications.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this simply an added test ensuring that if you log with nico-api it shows on the log line ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it is. Just saying that if we need to modify the library, then we can just use an arbitrary string here (like component1).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, so the feedback is basically not to use the string nico on a test of a generic library

@adnandnv adnandnv Jun 10, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — nothing NICo/hardware-specific should live in the generic crate. With the generic approach from this thread — the application supplies the field name and an optional default, all configured at the call site — logfmt won't bake in component or any nico string at all, and I'll use generic names in the tests and doc examples.

adnandnv and others added 2 commits June 10, 2026 20:21
Reworks the logfmt layer's component support (NVIDIA#2049) into a single EventField
type — a field name with an optional default — configured through one
with_event_fields builder. This replaces the earlier with_component setter and
folds the separate default/extra field paths into one API.

Each configured field is resolved once, when a span is created: the span's own
value if it set one, else the value inherited from its parent span. Child spans
therefore inherit fields like `component` (and nico-api's `object_id`) with no
ancestor walk at log time — event and span lines just read the span's resolved
value, falling back to the field's configured default or omitting it.

Update all logfmt call sites to with_event_fields and refresh the logging docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Adnan Dosani <258082943+adnandnv@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: carbide-api: Add a component field to log message metadata

4 participants