feat(service): async context propagation for task executor #4061

Open
flyingImer wants to merge 13 commits into apache:main from flyingImer:async

Conversation

@flyingImer
Contributor

@flyingImer flyingImer commented Mar 25, 2026

This PR adds RequestIdHolder and a concrete context propagation helper for TaskExecutorImpl. Fixes #3444

Problem

When TaskExecutorImpl schedules async work, the task runs on a different thread with a fresh CDI request scope. Request-scoped context (realm, principal, request ID) was previously propagated via ad-hoc hardcoded logic, and request IDs were not propagated at all, since the only way to read them was through RESTEasy's internal CurrentRequestManager API, which is unavailable on task threads.

Solution

RequestIdHolder is a new @RequestScoped CDI bean replacing the removed ServiceProducers.requestIdSupplier() that depended on RESTEasy internals. It produces RequestIdSupplier via CDI so any component can inject it without depending on JAX-RS types. RequestIdFilter now writes to this holder on each request.

TaskContextPropagator is a package-private helper that captures realm, principal, and request ID on the request thread and restores them into the task thread's fresh CDI request scope. It directly injects RealmContextHolder, PolarisPrincipalHolder, and RequestIdHolder. No new SPI or extension point is introduced. The implementation follows the same pattern as Bootstrapper.
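
As an illustration of the capture/restore idea only (not the PR's actual code), here is a minimal plain-Java sketch. It assumes `ThreadLocal` fields as stand-ins for the request-scoped holders, and the names `PropagationSketch`, `Captured`, and `runOnWorker` are hypothetical. Values are snapshotted on the submitting thread and written back on the worker thread before the task body runs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: ThreadLocals stand in for request-scoped holder beans. */
final class PropagationSketch {
  // Each thread sees its own values, loosely mimicking a fresh request scope per thread.
  static final ThreadLocal<String> REALM = new ThreadLocal<>();
  static final ThreadLocal<String> PRINCIPAL = new ThreadLocal<>();
  static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

  /** Immutable snapshot taken on the submitting thread. */
  record Captured(String realm, String principal, String requestId) {}

  /** Reads the current thread's values into a snapshot. */
  static Captured capture() {
    return new Captured(REALM.get(), PRINCIPAL.get(), REQUEST_ID.get());
  }

  /** Writes a snapshot into the current thread's (initially empty) holders. */
  static void restore(Captured captured) {
    REALM.set(captured.realm());
    PRINCIPAL.set(captured.principal());
    REQUEST_ID.set(captured.requestId());
  }

  /** Captures on the caller thread, restores on a worker thread, and reads the values back. */
  static String runOnWorker() {
    REALM.set("realm-a");
    PRINCIPAL.set("alice");
    REQUEST_ID.set("req-1");
    Captured captured = capture();

    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      return CompletableFuture.supplyAsync(
              () -> {
                restore(captured); // without this, all three values would be null here
                return REALM.get() + "/" + PRINCIPAL.get() + "/" + REQUEST_ID.get();
              },
              pool)
          .join();
    } finally {
      pool.shutdown();
    }
  }
}
```

In the real code the holders are CDI beans and the worker's request scope is activated by the container, so the sketch only mirrors the data flow, not the scoping mechanics.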

CurrentRequestManager is no longer referenced anywhere in the codebase.

Out of scope (follow-up candidates)

  • MDC propagation: request ID is not currently written to SLF4J MDC on task threads. Can be added in a follow-up.
  • X-Request-ID header validation: client-supplied header is used verbatim. Pre-existing behavior in RequestIdFilter, not introduced by this PR.

Checklist

  • Don't disclose security issues! (contact security@apache.org)
  • Clearly explained why the changes are needed, or linked related issues: Fixes #3444 (Provide a robust CDI way to inject request_ids)
  • Added/updated tests with good coverage, or manually tested (and explained how)
      - Unit tests for TaskContextPropagator (capture, restore, round-trip)
      - TaskExecutorImplTest updated for new constructor signature
  • Added comments for complex logic
  • Updated CHANGELOG.md (if needed)
  • Updated documentation in site/content/in-dev/unreleased (if needed)

Disclaimer

Javadoc was written mainly with the assistance of a coding agent.

Contributor

@dimas-b dimas-b left a comment

Hi @flyingImer ! Good idea to normalize the code that deals with async context values propagation! Some comments and suggestions below.

Comment thread CHANGELOG.md Outdated
@flyingImer flyingImer requested a review from dimas-b March 25, 2026 22:16
Member

@jbonofre jbonofre left a comment

Overall looks good to me.

Can you please fix the spotless issues? I will do a new pass after.

Thanks!

@flyingImer flyingImer requested a review from jbonofre March 27, 2026 18:50
@flyingImer
Contributor Author

Thanks @dimas-b and @jbonofre. I’ve addressed the earlier feedback and fixed the CI issues. The latest run is green now.

When you have a chance, could you please take another look at the current diff? Thanks!

@flyingImer
Contributor Author

@dimas-b I think your latest comments convinced me the shape issue is central enough to address in this PR.

My current cut is to keep the scope narrow, but switch this SPI to a state/action model so callers no longer deal with raw Object state, and TaskExecutorImpl only deals with one captured object per propagator.

I may still keep a generic cleanup hook with a default no-op, but the intent there would be generic lifecycle cleanup, not letting MDC shape the main propagation contract.

I also plan to clean up the smaller @Unremovable and RequestIdHolder points while touching this area.

Does that sound like the right cut?

@dimas-b
Contributor

dimas-b commented Apr 2, 2026

@flyingImer : I'd prefer to keep MDC out of this PR (feel free to improve MDC data in a follow-up PR). I believe it is a totally different concern, not related to Request Context. Let's keep this PR focused on CDI concerns.

Looking forward to a new diff.

@flyingImer flyingImer requested a review from dimas-b April 2, 2026 21:22
@flyingImer flyingImer changed the title feat(service): async context propagation SPI for task executor feat(service): async context propagation for task executor Apr 3, 2026
@flyingImer flyingImer requested a review from dimas-b April 3, 2026 21:09
dimas-b previously approved these changes Apr 3, 2026
Contributor

@dimas-b dimas-b left a comment

Thanks for bearing with me, @flyingImer 🙂

  */
  CapturedTaskContext capture() {
    return new CapturedTaskContext(
        realmContextHolder.get(),
Contributor

nit: for good measure, it might be best to create a copy for the realm context too. It's similar to the principal in many aspects.

Request ID is just a String, so it's fine to reuse it.

Contributor Author

Thanks for calling this out! Let me tackle it in a follow-up PR later.

Contributor

callContext is cloned at submission and already carries the realm context we care about, so CapturedTaskContext doesn't need to store realm. In fact, downstream code uses the one from the callContext in multiple places:

  1. https://github.com/flyrain/polaris/blob/cfb175441de39bae1bdc5b362958b2029e56b614/runtime/service/src/main/java/org/apache/polaris/service/task/TaskExecutorImpl.java#L215.
  2. https://github.com/flyrain/polaris/blob/cfb175441de39bae1bdc5b362958b2029e56b614/runtime/service/src/main/java/org/apache/polaris/service/task/TaskExecutorImpl.java#L290

Looks like we didn't change these places. I'd suggest not having the realmContext here, to avoid future fragmentation, since we now have two ways to access realmContext.

Contributor Author

You're right that realm currently flows through two paths on the worker (callContext.getRealmContext() at L206/L276 and the holder via captured state). The fragmentation is real but between the clone path (pre-existing) and the holder path (this PR), not within CapturedTaskContext itself.

Holder propagation is structurally required regardless: RangerPolarisAuthorizerFactory and PolarisEventMetadataFactory @Inject RealmContext directly, so the holder must be populated even if callContext is also passed.

Current CapturedTaskContext shape ({realm, principal, requestId}) matches the follow-up's target state. The CallContext.copy() removal follow-up (thread on TaskExecutorImpl.java:148) drops the callContext parameter and has the worker produce CallContext via CDI, collapsing the two paths into one without changing CapturedTaskContext.

Moving realm out now would require adding it back in the follow-up. Principal precedent: principal is captured as a value, not bundled in a bean, so keeping realm symmetric aligns with existing convention.

Contributor

My comment is NOT related to the holder or the principal. It is about passing the realm context twice. Why would we need to add the realm context back? We can always access it via the call context.

Contributor Author

The two realm paths today are a transitional overlap: (a) the callContext clone parameter, (b) CapturedTaskContext → RealmContextHolder. Path (b) is the mechanism taking over; path (a) is the legacy being retired.

After the follow-up (thread), callContext stops being a cross-thread parcel on the worker side. It becomes an @Inject-produced bean in the worker's request scope, the same role it has on HTTP threads. CDI produces it using RealmContext from the holder that TaskContextPropagator populated. At that point, the only mechanism for realm to reach the worker is captured → holder → CDI. Realm must be in CapturedTaskContext for that to work.

So the current shape ({realm, principal, requestId} captured) is the end-state shape, not a duplication choice. Dropping realm from CapturedTaskContext in this PR would need to re-add it in the follow-up.

Contributor Author

@flyingImer flyingImer May 5, 2026

@flyrain Stepping back to make sure we're aligned on both the near-term shape and the longer-term direction.

Why CapturedTaskContext.realmContext stays

Reading your concern as "don't let worker threads grow a second realm access path on top of what CallContext already provides" — that's a legitimate design principle, and I think the PR's shape actually serves it rather than fighting it.

RealmContextHolder isn't introduced here. It's been in the repo since the initial commit, and existing HTTP-path code already injects realm through it (e.g. RangerPolarisAuthorizerFactory, PolarisEventMetadataFactory). What PR #4061 adds is a capture/restore pair that populates the holder on the worker's fresh request scope, so the existing CDI convention keeps working across the async boundary instead of breaking at it.

CapturedTaskContext.realmContext is specifically the value that feeds that holder on the worker side. Dropping it would leave the worker's holder empty, which breaks not just @Inject RealmContext but also the CallContext producer itself (the producer reads RealmContext through the holder). So it's load-bearing, not additive.

What the follow-up collapses

The current two-path situation on the worker (clone .getRealmContext() vs holder-backed access) is transitional. The follow-up @adutra acked does:

  1. Drop the CallContext parameter from the three worker methods (tryHandleTask, handleTaskWithTracing, handleTask)
  2. Replace it with @Inject CallContext as a field
  3. Drop the callContext.copy() call in addTaskHandlerContext

After that, the worker's CallContext is produced by CDI from the populated holder, the same way HTTP threads produce it today. That leaves one data source (RealmContextHolder) and two access forms (@Inject RealmContext and ctx.getRealmContext()) that converge on the same instance. This is the same composition pattern the HTTP path already uses.

If "one access pattern" is the goal

Worth naming that there's a direction here. The two forms aren't symmetric (the table below was created with the help of Claude Code):

| | @Inject RealmContext | ctx.getRealmContext() |
|---|---|---|
| Source | RealmContextHolder | same, via CallContext producer |
| Instance identity | original | original (post-follow-up); lambda repackaging (today's clone path) |
| Dependency declared | narrow | full bag |
| CDI idiomatic | yes | wrapper |
| Covers every realm-access scenario | yes | yes |

@Inject RealmContext covers every scenario ctx.getRealmContext() does, and it's the narrower, more idiomatic form. The reverse direction (collapsing toward ctx.getRealmContext()) would require retrofitting every existing @Inject RealmContext site to go through CallContext, which widens dependencies and is the path the codebase has historically not taken.

So if we want to reduce to a single access API in the longer run, I'd suggest the right move is to deprecate CallContext.getRealmContext() in favor of @Inject RealmContext. Separate RFC, not in this PR's scope, but I'm happy to open it after this lands.

Want to confirm we're aligned on:

  1. CapturedTaskContext.realmContext stays in this PR (it's the infrastructure, not a duplicate)
  2. The follow-up collapses the transitional two-path state to one data source
  3. Any further collapse to a single access API is a separate RFC, and the direction is deprecating ctx.getRealmContext() rather than the other way around

If any of those three don't match your read, happy to dig in further.

Contributor

I think CapturedTaskContext.realmContext should still be a fresh object reusing only the realm ID (String) from the parent request context.

RealmContext can be a CDI bean, and in that case it will not be reusable after its owning context is terminated.
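
To illustrate the suggestion, here is a minimal sketch of detaching a realm context from its request. It assumes a single-method interface as a stand-in for the project's RealmContext; `RealmCopySketch`, `RequestBoundRealm`, and `detach` are hypothetical names used only for illustration. The realm ID is read eagerly on the request thread, and the copy holds nothing but that String.

```java
/** Hypothetical sketch: copy only the realm ID so the copy outlives the request. */
final class RealmCopySketch {
  /** Assumed single-method stand-in for the project's RealmContext interface. */
  interface RealmContext {
    String getRealmIdentifier();
  }

  /** Mimics a request-bound bean that becomes unusable once its scope ends. */
  static final class RequestBoundRealm implements RealmContext {
    private String id;

    RequestBoundRealm(String id) {
      this.id = id;
    }

    /** Simulates termination of the owning request context. */
    void endRequest() {
      id = null;
    }

    @Override
    public String getRealmIdentifier() {
      if (id == null) {
        throw new IllegalStateException("request scope ended");
      }
      return id;
    }
  }

  /** Returns a fresh object that keeps only the realm ID, not a reference to the source. */
  static RealmContext detach(RealmContext source) {
    String realmId = source.getRealmIdentifier(); // read eagerly, while the scope is active
    return () -> realmId;
  }
}
```

By contrast, a copy written as `() -> source.getRealmIdentifier()` would still call into the request-bound object and fail once that request has ended.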

Contributor

I'm fine with dealing with CallContext in a follow-up PR.

@github-project-automation github-project-automation Bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Apr 3, 2026
@flyingImer
Contributor Author

Hi @jbonofre, I addressed the earlier feedback and the latest CI is green now.
Dimas already approved the current diff.
When you have a chance, could you take a quick final look? Thanks!

Comment thread CHANGELOG.md Outdated
@adutra
Contributor

adutra commented Apr 8, 2026

Worth noting: this PR introduces a new TaskContextPropagator bean. The bean is functionally equivalent to a custom org.eclipse.microprofile.context.spi.ThreadContextProvider.

I wonder if it wouldn't be cleaner to just wire up a custom ThreadContextProvider instead of providing an equivalent class that must be invoked explicitly at the beginning of each task.

The custom ThreadContextProvider would still require clearing ThreadContext.CDI + @ActivateRequestContext + manual population of beans. So it wouldn't look radically different, but we would at least benefit from implicit execution.

@dimas-b
Contributor

dimas-b commented Apr 8, 2026

> I wonder if it wouldn't be cleaner to just wire up a custom ThreadContextProvider instead of providing an equivalent class that must be invoked explicitly at the beginning of each task.

+1

@dimas-b
Contributor

dimas-b commented Apr 8, 2026

A note on the "Holder" class pattern (RealmContextHolder, PolarisPrincipalHolder, etc.):

These classes are meant to allow Polaris code to manage corresponding request-scoped data without relying on the REST framework (ContainerRequestContext) during async task execution.

@flyingImer
Contributor Author

> Worth noting: this PR introduces a new TaskContextPropagator bean. The bean is functionally equivalent to a custom org.eclipse.microprofile.context.spi.ThreadContextProvider.
>
> I wonder if it wouldn't be cleaner to just wire up a custom ThreadContextProvider instead of providing an equivalent class that must be invoked explicitly at the beginning of each task.
>
> The custom ThreadContextProvider would still require clearing ThreadContext.CDI + @ActivateRequestContext + manual population of beans. So it wouldn't look radically different, but we would at least benefit from implicit execution.

@adutra @dimas-b
I spiked this. It can work, but the cost outweighs the benefit. You're right that .cleared(CDI) + @ActivateRequestContext + manual holder population stays the same either way. The difference is in what the SPI route adds on top IIUC.

  1. The provider is ServiceLoader-instantiated, so no @Inject. Holder lookups go through CDI.current(). That's livable on its own, but resolving the principal during capture triggers Mutiny context propagation, which re-enters the provider. You need a recursion guard to break the cycle.

  2. The provider also fires globally, not just for task submissions. Bootstrapper has no request scope when it submits, so you need a scope-active check. Same for any future executor user.

  3. There's also the question of who owns the worker's request scope. .cleared(CDI) and the custom provider both want to manage it. Either you rely on SmallRye's ordering between built-in and ServiceLoader providers, or the provider takes over entirely. Neither is clean.
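
For readers unfamiliar with the recursion guard mentioned in point 1, a minimal sketch of the general technique follows. It is not the spiked provider: `ReentranceGuardSketch`, `capture`, and `resolvePrincipal` are hypothetical names, and the re-entry is simulated by a direct call rather than by Mutiny context propagation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a per-thread re-entrance guard around a capture hook. */
final class ReentranceGuardSketch {
  // True while the current thread is already inside capture().
  private static final ThreadLocal<Boolean> CAPTURING = ThreadLocal.withInitial(() -> false);

  // Records what happened, for illustration only.
  static final List<String> LOG = new ArrayList<>();

  /** Capture hook that tolerates being re-entered on the same thread. */
  static String capture() {
    if (CAPTURING.get()) {
      LOG.add("skipped nested capture");
      return null; // nested call: break the cycle instead of recursing
    }
    CAPTURING.set(true);
    try {
      LOG.add("capturing");
      return resolvePrincipal();
    } finally {
      CAPTURING.set(false); // always reset, even if resolution throws
    }
  }

  /** Stand-in for a lookup that indirectly triggers capture() again. */
  private static String resolvePrincipal() {
    capture();
    return "alice";
  }
}
```

The guard itself is short, but every provider entry point has to honor it, which is part of the defensive code counted below.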

Bottom line: ~50 lines of defensive code to remove ~3 explicit lines from TaskExecutorImpl. The explicit helper avoids all of that because it's a regular CDI bean.

The implicit execution benefit is real but thin here. We're propagating three app-specific values at one call site, not a cross-cutting concern. I'd keep the explicit helper. WDYT?

@dimas-b
Contributor

dimas-b commented Apr 23, 2026

@flyingImer : so, you prefer to go with the current approach in this PR?

@flyingImer flyingImer requested a review from adutra April 23, 2026 18:52
@flyingImer flyingImer requested a review from dimas-b April 23, 2026 18:52
@flyingImer
Contributor Author

> @flyingImer : so, you prefer to go with the current approach in this PR?

@dimas-b yes, I believe this is a balanced approach at the moment

@adutra
Contributor

adutra commented Apr 27, 2026

> That's livable on its own, but resolving the principal during capture triggers Mutiny context propagation, which re-enters the provider. You need a recursion guard to break the cycle.

That's true, I had to implement a re-entrance guard in my own attempt. Not the end of the world, but I agree that it adds some complexity.

OK, I think I'm fine with the current approach.

@flyingImer flyingImer force-pushed the async branch 2 times, most recently from 757d98e to fbbd34e Compare April 27, 2026 22:47
@flyingImer
Contributor Author

@dimas-b @adutra @jbonofre I think the group is aligned on the implementation direction, and it is now reflected in the PR. Could you please take another look once you get a chance? I would like to get it merged so that I can follow up with the corresponding PRs.

@flyingImer
Contributor Author

@dimas-b @adutra @jbonofre Addressed the latest inline comments: TaskExecutorImpl is now package-private (a4750ba), and the CallContext.copy() removal is covered in the thread reply as a follow-up PR. CI is green, open threads resolved. Could you take a final pass?

adutra previously approved these changes Apr 29, 2026
Contributor

@flyrain flyrain left a comment

Thanks @flyingImer for the change. The PR looks great overall. The only major concern is the extra realmContext we passed into the task execution.

return new CapturedTaskContext(
    realmContextHolder.get(),
    ImmutablePolarisPrincipal.builder().from(polarisPrincipal).build(),
    requestIdHolder.get());
Contributor

I think we should also include the callContext bean here. I'm fine with a followup though.

Contributor Author

@flyingImer flyingImer Apr 30, 2026

The direction agreed on the TaskExecutorImpl.java:148 thread is removing callContext from the worker path (follow-up PR), not bundling it into CapturedTaskContext. See the thread on TaskContextPropagator.java:75 for the fragmentation discussion.

Contributor

Can you post the link? I don't think we can easily remove the call context. There are downstream references to its field PolarisCallContext.

Contributor Author

adutra's ack on the follow-up direction: #4061 (comment)

To be precise: CallContext the class stays, and downstream code keeps calling ctx.getPolarisCallContext() / ctx.getRealmContext() unchanged. What changes is how callContext appears on the worker.

Today, the submitter clones callContext and passes the clone through tryHandleTask / handleTaskWithTracing / handleTask as a parameter. It's a plain object ferried across threads.

After the follow-up, the worker gets callContext the same way HTTP threads already get it: via @Inject CallContext, with CDI producing it in the worker's request scope from the populated holders. The role shifts from "parcel passed across the boundary" to "bean produced in the local scope", which is its normal role elsewhere in the codebase. Consumer code at the call site is unchanged.

Contributor

@dimas-b dimas-b left a comment

LGTM overall, just a couple of minor concerns remaining from my side. One is below. I'll comment separately on the other one.

try {
  return Optional.ofNullable(requestIdHolder.get());
} catch (ContextNotActiveException e) {
  // No active request scope (e.g. background thread without @ActivateRequestContext).
Contributor

I'm still hesitant about this... This method is called on creating a PolarisEventMetadata. If request context is not active at that time, it would be a logical (coding) mistake.

I tend to think we should not try to catch this exception. WDYT?
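
To make the trade-off concrete, here is a small sketch of both options. It assumes a local `ScopeNotActive` exception as a stand-in for CDI's ContextNotActiveException and a plain `Supplier<String>` in place of the holder; `RequestIdLookupSketch`, `lenient`, and `strict` are hypothetical names.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical sketch contrasting a lenient lookup with a fail-fast one. */
final class RequestIdLookupSketch {
  /** Local stand-in for CDI's ContextNotActiveException. */
  static final class ScopeNotActive extends RuntimeException {
    ScopeNotActive() {
      super("request scope not active");
    }
  }

  /** Lenient: an inactive scope is indistinguishable from "no request ID". */
  static Optional<String> lenient(Supplier<String> holder) {
    try {
      return Optional.ofNullable(holder.get());
    } catch (ScopeNotActive e) {
      return Optional.empty();
    }
  }

  /** Strict: an inactive scope surfaces immediately as an exception. */
  static Optional<String> strict(Supplier<String> holder) {
    return Optional.ofNullable(holder.get());
  }
}
```

With the lenient form a missing request scope silently produces metadata without a request ID, while the strict form turns the same mistake into a visible failure at the call site.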

Development

Successfully merging this pull request may close these issues.

Provide a robust CDI way to inject request_ids

5 participants