impl(o11y): record telemetry attributes on LRO span#5695
Conversation
There was a problem hiding this comment.
Code Review
This pull request enhances tracing for Long Running Operations (LROs) by recording operation names and resource IDs in the aip151 and discovery modules, and updating the 'LRO Wait' span with additional metadata. Feedback was provided regarding the removal of instrumentation from the into_stream method in src/lro/src/internal/tracing.rs, which would lead to fragmented traces and a loss of parent span context for operations executed within the stream.
1471b6c to
5978295
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5695 +/- ##
==========================================
- Coverage 97.89% 97.89% -0.01%
==========================================
Files 226 226
Lines 55471 55485 +14
==========================================
+ Hits 54304 54316 +12
- Misses 1167 1169 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
5978295 to
7355448
Compare
| query, | ||
| ); | ||
| let p0 = poller.poll().await; | ||
| let p0 = poller.poll().instrument(test_span()).await; |
There was a problem hiding this comment.
Shouldn't we verify the span contents?
Aside: consider writing new tests just for tracing. Often it is quickest to use existing tests to test a new feature... but when you do that enough times the tests get complicated and lose their focus.
There was a problem hiding this comment.
ack. I added unit tests for both aip151 and discovery to verify the 2 specific attributes we populate for LROs.
7355448 to
a983c67
Compare
| span.record("gcp.longrunning.operation_name", name); | ||
| span.record("gcp.resource.destination.id", name); |
There was a problem hiding this comment.
are these ever different?
There was a problem hiding this comment.
in the final design we decided that they're always the same. we expected that downstream tools (e.g., AppHub) will use Operation ID as the default.
| async fn poll(&mut self) -> Option<PollingResult<ResponseType, MetadataType>> { | ||
| if let Some(start) = self.start.take() { | ||
| let result = start().await; | ||
| let (op, poll) = crate::details::handle_start(result); |
There was a problem hiding this comment.
Is it possible that the operation is marked done inside the start()? If so we might want to extract the name from the result above handle_start?
There was a problem hiding this comment.
That's a good point. I fixed the loops to extract the name directly from result (if Ok) before calling handle_start. Now we can make sure we don't miss these attributes.
I added a test for this case.
| let (op, poll) = crate::details::handle_start(result); | ||
| #[cfg(google_cloud_unstable_tracing)] | ||
| if let Some(ref name) = op { | ||
| let span = tracing::Span::current(); |
There was a problem hiding this comment.
if tracing is disabled, this might interfere with the customers' spans. Can we attach the span directly to the PollerImpl?
There was a problem hiding this comment.
Yeah you're right. I considered a few approaches for context propagation:
- adding
operation_name()as a trait method (impl(o11y): add operation name support toPollerinterface #5694): this pollutes our public API and intermediate decorators. - direct span mutation (what you were looking at): this might interfere with customers' spans.
tokio::task_local!(the newer commits).
We also used task-local back then for gaxi RequestRecorder, so I guess it's okay to adopt this.
There was a problem hiding this comment.
Regarding attaching the span directly to PollerImpl: I wanted to avoid adding a tracing field to PollerImpl because that forces our polling struct to actively manage and carry o11y variables even when tracing is disabled.
3b62b4a to
5b5d0da
Compare
5b5d0da to
27f2c17
Compare
…tion IDs internally
58f49ef to
e8ac197
Compare
| async fn poll(&mut self) -> Option<PollingResult<ResponseType, MetadataType>> { | ||
| if let Some(start) = self.start.take() { | ||
| let result = start().await; | ||
| let (op, poll) = crate::details::handle_start(result); |
| 0 // Initial triggers record nothing | ||
| }; | ||
|
|
||
| LRO_SPAN |
There was a problem hiding this comment.
would it make sense to combine into a single "LRO Recorder" that captures the LRO-level span and poll-attempt count? You might be able to make it a struct so that it encapsulates the use of task-locals.
Only consider if it simplifies the code.
There was a problem hiding this comment.
I think it does make the code looks more neat.
b48ba0f to
6da39a2
Compare
6da39a2 to
87ae245
Compare
| #[cfg(google_cloud_unstable_tracing)] | ||
| use crate::POLL_ATTEMPT_COUNT; | ||
|
|
||
| #[cfg(google_cloud_unstable_tracing)] | ||
| tokio::task_local! { | ||
| pub(crate) static LRO_SPAN: Span; | ||
| } |
There was a problem hiding this comment.
can these be unexported and use the LroRecorder API instead of LRO_SPAN directly?
There was a problem hiding this comment.
you're right, on it right now
In our previous iteration (#5694), we introduced an
operation_name(&self)query method to the publicPollertrait.To avoid polluting the public API, we utilize thread-local span propagation: because the LRO span is active in the thread context, the internal pollers (
PollerImplandDiscoveryPoller) record the LRO operation name and destination resource ID directly onto the current active span using tracing::Span::current().record().