xds/extproc: adds ClientInterceptor and ClientStream implementation for normal mode for gRFC A93#9174
xds/extproc: adds ClientInterceptor and ClientStream implementation for normal mode for gRFC A93#9174eshitachandwani wants to merge 27 commits into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #9174 +/- ##
==========================================
- Coverage 83.19% 83.01% -0.18%
==========================================
Files 420 421 +1
Lines 34010 34747 +737
==========================================
+ Hits 28295 28846 +551
- Misses 4281 4412 +131
- Partials 1434 1489 +55
🚀 New features to boost your workflow:
|
….Done case becuase also cancel context incase of error to avoid calls to dataplane stream hanging.
| }, | ||
| } | ||
| if err = procStream.Send(headerReq); err != nil { | ||
| return cs.handleInitError(fmt.Errorf("failed to send client headers to external processor server: %v", err), newStream, opts) |
There was a problem hiding this comment.
I'm trying to follow the code when NewStream hits an initialization error and therefore handleInitError is called. Let's say failure_mode_allow is set to true. In this case, handleInitError will cancel the proc stream's context, will create the dataplane stream and will set procStreamBypass to true.
Now, let's say the application tries to send a message. In SendMsg, it will see that the proc stream is bypassed and therefore will call waitForDataplaneStream. But for cases where the proc stream is bypassed, can we be sure that a dataplane stream definitely exists? I see that this is true for the init errors. But just wanted to make sure if it is true for all cases. If that is the case, in SendMsg instead of this block:
if extClosed || cs.config.processingModes.requestBodyMode == modeSkip {
s, err := cs.waitForDataplaneStream(cs.ctx)
if err != nil {
return err
}
return s.SendMsg(m)
}Does, it make sense to separate it out into two blocks:
if extClosed {
return s.SendMsg(m)
}
if cs.config.processingModes.requestBodyMode == modeSkip {
s, err := cs.waitForDataplaneStream(cs.ctx)
if err != nil {
return err
}
return s.SendMsg(m)
}There was a problem hiding this comment.
For now , there is no such case where extClosed is true but dataplane is nil. But I think we should still keep the check waitForDataplaneStream because the proc filter is a highly concurrent system and it might be good to have a check. It will not introduce latency because waitForDataplaneStream will exit immediately because the dataplaneReady channel will already have been closed.
WDYT ?
There was a problem hiding this comment.
We cannot have checks to cover things that we don't know about. We should have some reasoning about why a check if required. Saying that this is a highly concurrent system and therefore I will check everything possible everywhere only complicates the code more and makes it harded to reason about.
There was a problem hiding this comment.
Changed . In SendMsg and CloseSend both.
| } | ||
| if err != io.EOF && (cs.ignoreFailureMode.Load() || !cs.config.failureModeAllow) { | ||
| cs.procStreamErr.Store(status.Errorf(codes.Internal, "extproc: external processor RPC failed: %v", err)) | ||
| cs.procStreamFailed.Fire() |
There was a problem hiding this comment.
Shouldn't this event fire irrespective of whether cs.ignoreFailureMode.Load() || !cs.config.failureModeAllow is true?
There was a problem hiding this comment.
This event is to indicate that we should fail the RPC , but if failure mode allow is true and we do not have to ignore the failure mode allow , that means the events should continue on the dataplane stream rather than failing with error.
And to signal continue on dataplane stream , we have triggerDrain()
| return | ||
| } | ||
| cs.procStreamBypass.Store(true) | ||
| cs.triggerDrain() |
There was a problem hiding this comment.
I'm wondering if we can guarantee that the proc stream is either marked closed (err == io.EOF) or failed (err != io.EOF), do we need to trigger this drain process here? I'm trying to see if we can restrict the drainTriggeredCh only to cases where the proc server sent us the request_drain bit.
There was a problem hiding this comment.
I don't think we can do that because restricting drainTriggeredCh to only server initiated drains would cause deadlocks in fail-open (bypass) scenarios. For example:
Setup: requestHeaderMode: SEND, failureModeAllow: true (Fail-Open enabled).
Situation: During NewStream, the client sends request headers to the processor stream and calls processInitialHeaders() to block waiting for the processor's mutated headers response. While waiting, the processor stream fails or crashes.
Deadlock:
recvFromProcServerLoop receives the connection error and calls cs.failStream(err).
Since fail-open is enabled and no body messages are sent yet, failStream does not fire the hard failure event (procStreamFailed). Instead, it attempts to trigger a bypass by setting procStreamBypass to true.
If we do not trigger a drain here, drainTriggeredCh remains open.
Consequently, processInitialHeaders remains blocked forever in its select statement, waiting for either a processor response (which will never come) or a drain/bypass event. The client stream initialization hangs.
And a similar situation with responseHeaders.
We could introduce a separate channel (like bypassTriggeredCh), but since both events require the exact same action—stop sending to the processor and redirect all traffic directly to the dataplane—splitting them would force us to duplicate select cases throughout
Or we can remane this to bypassProcCh to unify the 2 use cases it serves.
WDYT ?
There was a problem hiding this comment.
Yeah, unifying the two cases, if possible would be nicer. Lesser things to think about. This code is already huge :) and is only going to get huger.
There was a problem hiding this comment.
changed the variable names to show it bypasses the proc stream instead of triggering drain.
|
|
||
| // failStream handles stream failures, recording errors or bypassing external | ||
| // processor based on failureModeAllow configuration. | ||
| func (cs *clientStream) failStream(err error) { |
There was a problem hiding this comment.
Nit: Should we call this failProcStream to be more explicit about which stream is failing here?
There was a problem hiding this comment.
Ohh I made the change and then reverted it I guess while making other changes. Changing it now.
…planeCreated incase of err and sucesss , store error , other review comments
|
|
||
| // failStream handles stream failures, recording errors or bypassing external | ||
| // processor based on failureModeAllow configuration. | ||
| func (cs *clientStream) failStream(err error) { |
| for k, v := range reqFields { | ||
| val, err := structpb.NewValue(v) | ||
| if err != nil { | ||
| continue |
There was a problem hiding this comment.
Nit: Add a comment here or add to the docstring that we encode as many attributes as we can and ignore the ones that can't, similar to Envoy.
There was a problem hiding this comment.
I had the comment typed out and then don't know why decided against it. Adding it now
| return | ||
| } | ||
| cs.procStreamBypass.Store(true) | ||
| cs.triggerDrain() |
There was a problem hiding this comment.
Yeah, unifying the two cases, if possible would be nicer. Lesser things to think about. This code is already huge :) and is only going to get huger.
| }, | ||
| } | ||
| if err = procStream.Send(headerReq); err != nil { | ||
| return cs.handleInitError(fmt.Errorf("failed to send client headers to external processor server: %v", err), newStream, opts) |
There was a problem hiding this comment.
We cannot have checks to cover things that we don't know about. We should have some reasoning about why a check if required. Saying that this is a highly concurrent system and therefore I will check everything possible everywhere only complicates the code more and makes it harded to reason about.
| if cs.procStreamFailed.HasFired() { | ||
| return | ||
| } | ||
| if err != io.EOF && (cs.ignoreFailureMode.Load() || !cs.config.failureModeAllow) { |
There was a problem hiding this comment.
In cases where we have EOF, but drain is not initiated, then we still need to treat it as non-OK status as per the gRFC. are we handling that here? It should be handled in the success scenario. Also, curious if EOF is in headers only RPC - that should not be failed for wanting of drain right? And by the looks of it, failProcStream is conditional failure, the name should reflect that - attemptProcStreamFailure or something similar, failProcStream souds like guaranteed failure - and is confusing.
There was a problem hiding this comment.
The drain request for EOF is still being discussed and will probably be finalised after the new bidirectional drain is finalised. See the discussion here : https://chat.google.com/room/AAAAbkw9L3c/uuoZ0GXjSdE and so I have not changed it yet.
Also intention behind naming it failProcStream is the proc stream had definately failed/closed. We need to decide wether to fail the RPC of let it bypass. attemptProcStreamFailure might indicate that proc stream is being failed conditionally. WDYT ?
| if cs.protocolConfigSent.CompareAndSwap(false, true) { | ||
| req.ProtocolConfig = &v3procservicepb.ProtocolConfiguration{ | ||
| RequestBodyMode: convertBodyMode(cs.config.processingModes.requestBodyMode), | ||
| ResponseBodyMode: convertBodyMode(cs.config.processingModes.responseBodyMode), |
There was a problem hiding this comment.
I was just looking at validations for response_body_mode during the parsing of configuration - we don't have validation for response_trailer_mode : it shold be SEND when the response_body_mode is GRPC
There was a problem hiding this comment.
Ohh right! I planned to send a seperate PR for that. Here is the PR #9209
| // Signal that the response trailer is modified and ready to be sent to | ||
| // the client. | ||
| cs.responseTrailerModified.Fire() | ||
| cs.procStream.CloseSend() |
There was a problem hiding this comment.
The API for stream advises: "It is also not safe to call CloseSend concurrently with SendMsg." should this be guarded to avoid race since we are calling it from multiple places? Same goes for dataplane stream as well.
There was a problem hiding this comment.
I have made sure that all the Send to the dataplaneStream are complete before calling CloseSend and it will theoritically never be called concurrently with Send for dataplaneStream. But we can add a mutex if we want to be sure.
For the Procstream closeSend , I have changed the implementation such that when the trailers are received from the dataplane server , we send nil to the procSendCh to indicate that we need to call closeSend on proc stream. This way it will only be sent after all the sends to proc stream is done.
| // It also contains the initial metadata specified in the config. | ||
| procCtx, cancel := context.WithCancel(ctx) | ||
| if i.config.server.Timeout != 0 { | ||
| procCtx, cancel = context.WithTimeout(ctx, i.config.server.Timeout) |
There was a problem hiding this comment.
if we do this, are we not over-writing context.WithCancel(ctx) ?
There was a problem hiding this comment.
Yes , we are , because if we have a timeout, we need to use that to create the proc RPC. And context.Timeout also returns a cancellable context with a timeout.
Added some comments to make it a little more clear.
| var err error | ||
| if cs.dataplaneStream, err = newStream(ctx, opts...); err != nil { | ||
| cs.dataplaneCreationErr = err | ||
| cs.cancel() |
There was a problem hiding this comment.
Now that we call cancel and close dataplane setup, the waitForDataplaneStream will have a non deterministic error surfaced based on what happens first
There was a problem hiding this comment.
Right! Now that we have different error holder for dataplane stream , we should check for that too in ctx.Done case.
Changed.
| } | ||
|
|
||
| // TestDrainingFlowControlNoMessageLoss tests the scenario where a processor | ||
| // server sends RequestDrain: true during active flow control backpressure. |
There was a problem hiding this comment.
we are guaranteed to test the draining and message loss logic, but not really backpressure.
There was a problem hiding this comment.
Right! Changed the name of the test to reflect the same.
| // cannot receive. Verifies that backpressure correctly propagates across the | ||
| // filter: the client's Send call blocks, and receiving from the dataplane | ||
| // server also blocks. | ||
| func (s) TestFlowControl(t *testing.T) { |
There was a problem hiding this comment.
none of the tests will fail for race because they don't drive request-body forwarding concurrently with response-trailer/close processing - we should add tessts for checking all the close/cancel/send races where we have these happening in parallel.
There was a problem hiding this comment.
I am not sure how to write a deterministic concurrent test for this. I have added a TestConcurrency which will call Recv from the main goroutine, while Send, CloseSend, and context cancellation run in separate concurrent goroutines. And we assert that RPC should fail. Let me know what you think or if you have something else in mind.
| cs.failProcStream(fmt.Errorf("extproc: external processor returned invalid status instead of CONTINUE for response headers")) | ||
| return | ||
| } | ||
| if err = cs.config.mutationRules.ApplyAdditions(header.GetResponse().GetHeaderMutation().GetSetHeaders(), cs.responseHeader); err != nil { |
There was a problem hiding this comment.
for a trailers only response, this will incorrectly fail the RPC? Also the Test for trailers only message doesn't capture this scenario effectively.
There was a problem hiding this comment.
I have asked for what should be the proc behaviour for a trailer only request message.
| // external processor server returns GrpcMessageCompressed: true while | ||
| // failure_mode_allow is false. Verifies that the stream is cancelled and | ||
| // subsequent data plane RPC calls fail with Internal. | ||
| func (s) TestStreamFailureGrpcMessageCompressedDeny(t *testing.T) { |
There was a problem hiding this comment.
All the negative tests are written separate - can they be combined into table drive tests, it will help with making sense of coverage for review. Also, similar for the three TestImmediateResponse* tests. IT will reduce 9 different tests into 2 tests with cleaner code.
There was a problem hiding this comment.
Tried to combine as many as I could.
| // returned. Otherwise, the disallowed mutation is silently ignored. | ||
| // | ||
| // The input metadata must not be nil. | ||
| func (hmr *HeaderMutationRules) ApplyAdditions(hvos []*v3corepb.HeaderValueOption, input metadata.MD) error { |
There was a problem hiding this comment.
changes added here are not tested in extconfig_test.go - should they be or are you covering them in end to end tests?
There was a problem hiding this comment.
I had a test for it in TestStreamModification but added extensive test in ext_config_test.go.
This PR add implementation of NewStream and ClientStream for normal mode for A93: xds-ext-proc.
This PR does not include channel retention , metrics and observability mode.
#ext-proc-a93
RELEASE NOTES: None