feat: add dynamic primary discovery for routing layer #11
📝 Walkthrough
This pull request implements dynamic primary replica discovery in the routing layer. It introduces a …
Changes
Sequence Diagram(s)
sequenceDiagram
actor Timer as Periodic Timer
participant PM as PartitionManager
participant CC as ClusterClient
participant ES as Election Service<br/>(ClusterService)
participant Ring as ConsistentHashRing<br/>(Partition)
Timer->>PM: Discovery Interval Elapsed
PM->>PM: Initialize Replica Cache
loop For Each Partition
PM->>PM: Create ClusterClient for<br/>each Replica Address
loop For Each Replica (Primary, then Secondaries)
PM->>CC: new ClusterClient(address)
activate CC
CC->>ES: GetRole(GetRoleRequest)
activate ES
ES-->>CC: GetRoleResponse{role, term, ...}
deactivate ES
CC-->>PM: NodeRole{IsPrimary, ...}
deactivate CC
alt IsPrimary == true
PM->>PM: Found Primary!<br/>Break Loop
end
end
alt Primary Found
PM->>PM: Create/Cache ReplicaClient<br/>for New Primary
PM->>Ring: UpdatePartitionPrimary(partitionId)
activate Ring
Ring->>Ring: Find Partition by ID
Ring->>Ring: partition.UpdatePrimary(newPrimary)
deactivate Ring
else No Primary Found
PM->>PM: Log Warning,<br/>Fallback to Config
end
end
PM->>PM: Sleep for<br/>Discovery Interval
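The inner loop of the diagram above (probe the configured primary first, then the secondaries, and stop at the first node that reports itself as primary) could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `PrimaryDiscovery`, `FindPrimaryAsync`, and the `isPrimaryAsync` delegate are hypothetical names standing in for the `ClusterClient.GetRole` call, and skipping unreachable replicas is an assumption about the desired behavior.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; names are illustrative and do not come from the PR.
public static class PrimaryDiscovery
{
    // Probes replicas in the given order and returns the address of the first
    // one that reports itself as primary, or null if none does.
    public static async Task<string?> FindPrimaryAsync(
        IEnumerable<string> replicaAddresses,
        Func<string, CancellationToken, Task<bool>> isPrimaryAsync,
        CancellationToken cancellationToken)
    {
        foreach (var address in replicaAddresses)
        {
            try
            {
                if (await isPrimaryAsync(address, cancellationToken))
                {
                    return address; // "Found Primary! Break Loop" in the diagram
                }
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Assumption: an unreachable replica is skipped and the next one is probed.
            }
        }

        return null; // "No Primary Found" branch
    }
}
```

Passing the role query as a delegate keeps the sketch independent of gRPC, so the ordering and early-exit behavior can be exercised without a running cluster.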
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 9
🧹 Nitpick comments (2)
ToyDbContracts/Protos/election.proto (1)
29-36: Consider using an enum for the role field for type safety.
Using a string for role requires exact case-sensitive matching between server (ClusterService.GetRole returns "Primary" or "Secondary") and client (ClusterClient.NodeRole.IsPrimary checks Role == "Primary"). An enum would provide compile-time safety and make the contract explicit:
enum NodeRoleType {
  NODE_ROLE_UNKNOWN = 0;
  NODE_ROLE_PRIMARY = 1;
  NODE_ROLE_SECONDARY = 2;
}
message GetRoleResponse {
  NodeRoleType role = 1;
  // ...
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ToyDbContracts/Protos/election.proto` around lines 29 - 36, Replace the string-typed role in GetRoleResponse with a protobuf enum to enforce type safety: define an enum (e.g., NodeRoleType with values NODE_ROLE_UNKNOWN, NODE_ROLE_PRIMARY, NODE_ROLE_SECONDARY) and change the GetRoleResponse.role field to that enum; update any server-side return (ClusterService.GetRole) to return the enum value and adjust client checks (ClusterClient.NodeRole.IsPrimary) to compare against the enum constant instead of a string so all sides use the enum type.
ToyDbUnitTests/Services/RoutingServiceRetryTests.cs (1)
33-37: Consider adding a test for FAILED_PRECONDITION triggering rediscovery.
The mock setup is functional, but there's no test verifying that TriggerRediscovery is called when a primary replica returns FAILED_PRECONDITION. This is a key behavior introduced in this PR.
Would you like me to draft a test case that verifies TriggerRediscovery is invoked when the primary throws an RpcException with StatusCode.FailedPrecondition?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ToyDbUnitTests/Services/RoutingServiceRetryTests.cs` around lines 33 - 37, Add a unit test to verify PartitionManager.TriggerRediscovery is called when the primary replica returns an RpcException with StatusCode.FailedPrecondition: mock the primary call used by the RoutingService retry logic to throw new RpcException(new Status(StatusCode.FailedPrecondition, "...")), arrange the existing _partitionManagerMock (the Mock<PartitionManager> set up in the test class) and verify _partitionManagerMock.Verify(p => p.TriggerRediscovery(It.IsAny<string>()), Times.Once()); name the test something like FailedPrecondition_ShouldTriggerRediscovery and place it alongside the other RoutingServiceRetryTests to assert the triggered rediscovery behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plans/completed/plan-routing-redundancy.md`:
- Around line 45-70: Wrap both fenced code blocks under "Primary Discovery Flow"
and "Routing Redundancy Flow" with a language tag (use ```text) to satisfy
markdownlint, and rename the interval identifier in the Primary Discovery Flow
from DiscoveryIntervalMs to DiscoveryIntervalSeconds (or DiscoveryIntervalSec if
that matches config) and keep the default value as 10s; update any references in
the doc mentioning PartitionManager, RoutingService, Partition.PrimaryReplica,
and the discovery interval so they consistently use DiscoveryIntervalSeconds.
In `@ToyDbRouting/Clients/ClusterClient.cs`:
- Around line 11-23: ClusterClient currently allocates an HttpClientHandler and
a GrpcChannel in its constructor but never disposes them; implement
IAsyncDisposable (and IDisposable if you prefer) on ClusterClient and in the
DisposeAsync/Dispose method ensure you dispose the HttpClientHandler and call
DisposeAsync() on the GrpcChannel (or equivalent channel shutdown) to release
resources; then modify PartitionManager to cache ClusterClient instances keyed
by Address and have DiscoverPrimaryReplicasAsync reuse the cached ClusterClient
instead of creating a new one each loop (and ensure cached clients are disposed
on manager shutdown or replaced safely), referencing ClusterClient, GrpcChannel,
HttpClientHandler, and PartitionManager.DiscoverPrimaryReplicasAsync in your
changes.
- Around line 14-17: The code in ClusterClient.cs sets
HttpClientHandler.ServerCertificateCustomValidationCallback to
DangerousAcceptAnyServerCertificateValidator which disables TLS validation for
discovery; replace this by enforcing proper cert validation or gating the bypass
behind a dev-only flag. Update the handler setup in the ClusterClient
constructor (the handler variable) to either (a) validate the server certificate
chain/issuer or compare the server cert thumbprint/subject against an expected
value (certificate pinning) using ServerCertificateCustomValidationCallback, or
(b) only use DangerousAcceptAnyServerCertificateValidator when an explicit
dev/test env flag/config (e.g., "Environment" or "AllowInsecureDiscovery") is
set; ensure the default behavior performs normal TLS validation and load trusted
CA or client certs via HttpClientHandler.ClientCertificates or the system store.
- Around line 26-28: GetRole currently passes only the caller cancellation token
to _electionClient.GetRoleAsync which can hang; create a short, configurable
per-call deadline by constructing a linked CancellationTokenSource (use
CancellationTokenSource.CreateLinkedTokenSource) from the incoming
cancellationToken, call CancelAfter with the configured timeout, and pass
linked.Token to _electionClient.GetRoleAsync; ensure you dispose the linked CTS
and keep the timeout configurable (like HealthProbeService) so GetRole and the
NodeRole retrieval are bounded.
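For the per-call deadline suggested for GetRole above, a minimal sketch of the linked-token pattern might look like this. `BoundedCall` and `WithTimeoutAsync` are hypothetical names, and the wrapped call is passed as a delegate so the example does not depend on the generated gRPC client; in `ClusterClient` the delegate would presumably forward the token to `_electionClient.GetRoleAsync`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: bounds any cancellable async call with a per-call timeout.
public static class BoundedCall
{
    public static async Task<T> WithTimeoutAsync<T>(
        Func<CancellationToken, Task<T>> call,
        TimeSpan timeout,
        CancellationToken cancellationToken)
    {
        // The linked source is cancelled when either the caller cancels
        // or the timeout elapses, whichever happens first.
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        linked.CancelAfter(timeout);

        // Awaiting inside the using scope keeps the source alive until the call finishes.
        return await call(linked.Token);
    }
}
```

Because the method awaits before returning, the linked source is disposed only after the call completes, which covers the "dispose the linked CTS" part of the comment. The timeout value itself is assumed to come from configuration.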
In `@ToyDbRouting/Models/Partition.cs`:
- Around line 11-17: GetHealthyPrimaryReplica is vulnerable to a race where
UpdatePrimary can swap PrimaryReplica between the health check and the return;
fix by taking a local snapshot of the PrimaryReplica at the start of
GetHealthyPrimaryReplica (e.g., var primary = PrimaryReplica) and use that local
variable for both the health lookup and the return value instead of accessing
the PrimaryReplica field twice; this ensures the health status corresponds to
the same ReplicaClient instance even if
PartitionManager.DiscoverPrimaryReplicasAsync calls UpdatePrimary concurrently.
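The snapshot fix described for Partition.cs could look roughly like the sketch below. The `ReplicaClient` and `Partition` types here are simplified stand-ins (the real classes are not shown in this review), and the health dictionary keyed by address is an assumption about how health is tracked.

```csharp
#nullable enable
using System.Collections.Concurrent;

// Simplified stand-in for the real ReplicaClient; only the address matters here.
public sealed class ReplicaClient
{
    public ReplicaClient(string address) => Address = address;
    public string Address { get; }
}

// Simplified stand-in for the real Partition.
public sealed class Partition
{
    // Assumption: health is tracked per replica address.
    private readonly ConcurrentDictionary<string, bool> _health = new();
    private volatile ReplicaClient _primary;

    public Partition(ReplicaClient primary) => _primary = primary;

    public ReplicaClient PrimaryReplica => _primary;

    public void UpdatePrimary(ReplicaClient newPrimary) => _primary = newPrimary;

    public void SetHealth(string address, bool healthy) => _health[address] = healthy;

    public ReplicaClient? GetHealthyPrimaryReplica()
    {
        // Read the field exactly once so the health lookup and the returned
        // instance refer to the same replica even if UpdatePrimary runs concurrently.
        var primary = _primary;
        return _health.TryGetValue(primary.Address, out var healthy) && healthy
            ? primary
            : null;
    }
}
```

The key point is the single read into `primary`; every later use goes through the local variable rather than the field.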
In `@ToyDbRouting/Program.cs`:
- Around line 26-34: The /health endpoint always returns 200; change the MapGet
handler to check actual readiness (e.g., an injected readiness service or state
exposed by your router) and return 503 when the instance is not ready or has no
usable replicas. Update the handler signature used in app.MapGet("/health", ...)
to accept a readiness indicator (for example a RouterState or
IRouterHealthService with IsReady and AvailableReplicas properties) in addition
to IOptions<RoutingOptions>, then return Results.StatusCode(503) (or
Results.ServiceUnavailable) with a payload explaining the failure when IsReady
is false or AvailableReplicas == 0; otherwise return 200 with Status, InstanceId
(routingOptions.Value.RouterInstanceId) and Timestamp as before.
In `@ToyDbRouting/Services/PartitionManager.cs`:
- Around line 100-106: The code currently overwrites a previously discovered
primary with the static config when no primary is found; instead, check
_primaryReplicas for an existing entry before falling back: in the block
containing logger.LogWarning, attempt to get currentPrimary from
_primaryReplicas using partitionConfig.PartitionId and if found keep it (do not
call UpdatePartitionPrimary), otherwise fall back to the configured address by
retrieving/creating fallbackPrimary from _replicaClients
(partitionConfig.PrimaryReplicaAddress), assign it to
_primaryReplicas[partitionConfig.PartitionId] and call UpdatePartitionPrimary as
now; reference logger.LogWarning, _replicaClients, fallbackPrimary,
_primaryReplicas, UpdatePartitionPrimary, partitionConfig.PartitionId and
partitionConfig.PrimaryReplicaAddress.
- Around line 117-129: TriggerRediscovery currently spins an unbounded
fire-and-forget Task that rescans all partitions with CancellationToken.None and
only uses partitionId in the log; change it to perform a scoped discovery for
the single partitionId (call a new or existing partition-scoped method instead
of DiscoverPrimaryReplicasAsync), avoid Task.Run by scheduling via a coalescing
mechanism (e.g., track in-flight triggers per partitionId using a
ConcurrentDictionary or similar to coalesce concurrent calls), accept or use a
cancellable token (propagate a CancellationToken to the discovery call) and
ensure the task is awaited or observably scheduled so shutdown can cancel it;
also move the public TriggerRediscovery method declaration above the private
helpers InitializeReplicaClients, DiscoverPrimaryReplicasAsync, and
UpdatePartitionPrimary to follow coding guidelines.
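For the first PartitionManager.cs comment above (lines 100-106), the "keep the last discovered primary, otherwise fall back to config" rule can be expressed with a single `GetOrAdd`. The sketch below is illustrative only: `PrimaryFallback` and `Resolve` are hypothetical names, and plain address strings stand in for the `ReplicaClient` instances held in `_primaryReplicas`.

```csharp
using System.Collections.Concurrent;

// Hypothetical helper; strings stand in for ReplicaClient instances.
public static class PrimaryFallback
{
    // Returns the primary to keep using when a discovery pass finds no primary.
    public static string Resolve(
        ConcurrentDictionary<string, string> knownPrimaries,
        string partitionId,
        string configuredPrimaryAddress)
    {
        // GetOrAdd leaves an existing (previously discovered) entry untouched and
        // only stores the configured address when the partition has no entry yet.
        return knownPrimaries.GetOrAdd(partitionId, configuredPrimaryAddress);
    }
}
```

With this shape, a transient discovery failure no longer overwrites a primary learned in an earlier pass; the static configuration is used only on first start or for partitions never discovered.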
In `@ToyDbRouting/Services/RoutingService.cs`:
- Around line 22-27: Partition.PrimaryReplica is accessed concurrently (read in
RoutingService and written in PartitionManager.UpdatePartitionPrimary) causing a
data race; fix by adding synchronization: either make the backing field for
PrimaryReplica volatile and update it via Interlocked.Exchange in
PartitionManager.UpdatePartitionPrimary (and read the volatile field in
RoutingService), or implement a dedicated lock object used by both the
Partition.PrimaryReplica setter/getter and by
PartitionManager.UpdatePartitionPrimary so reads in RoutingService (where
PrimaryReplica is read) and writes in UpdatePartitionPrimary use the same lock;
pick one approach and apply it consistently to eliminate the visibility/race
issue.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ec205b49-d4cf-481a-9b9e-47fecd9b3d5f
📒 Files selected for processing (13)
- README.md
- ToyDb/Services/ClusterService.cs
- ToyDbContracts/Protos/election.proto
- ToyDbRouting/Clients/ClusterClient.cs
- ToyDbRouting/Models/Partition.cs
- ToyDbRouting/Models/RoutingOptions.cs
- ToyDbRouting/Program.cs
- ToyDbRouting/Services/PartitionManager.cs
- ToyDbRouting/Services/RoutingService.cs
- ToyDbRouting/appsettings.json
- ToyDbUnitTests/Services/RoutingServiceRetryTests.cs
- plans/completed/plan-leader-election.md
- plans/completed/plan-routing-redundancy.md
public ClusterClient(string address)
{
    Address = address;
    var handler = new HttpClientHandler
    {
        ServerCertificateCustomValidationCallback = HttpClientHandler.DangerousAcceptAnyServerCertificateValidator
    };

    var channel = GrpcChannel.ForAddress(address, new GrpcChannelOptions
    {
        HttpHandler = handler
    });
    _electionClient = new Election.ElectionClient(channel);
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "ClusterClient.cs" | head -20
Repository: ryanachten/ToyDb
Length of output: 99
🏁 Script executed:
find . -type f -name "PartitionManager.cs" | head -20
Repository: ryanachten/ToyDb
Length of output: 103
🏁 Script executed:
cat -n ./ToyDbRouting/Clients/ClusterClient.cs
Repository: ryanachten/ToyDb
Length of output: 1783
🏁 Script executed:
cat -n ./ToyDbRouting/Services/PartitionManager.cs | head -100
Repository: ryanachten/ToyDb
Length of output: 4155
🏁 Script executed:
rg "new\s+(Clients\.)?ClusterClient" --type cs
Repository: ryanachten/ToyDb
Length of output: 177
🏁 Script executed:
rg "ClusterClient" --type cs
Repository: ryanachten/ToyDb
Length of output: 321
Make ClusterClient disposable and cache instances per address.
The constructor allocates HttpClientHandler and GrpcChannel but the class does not implement IDisposable/IAsyncDisposable, leaving these unmanaged resources unreleased. Since PartitionManager.DiscoverPrimaryReplicasAsync creates a new ClusterClient instance on each invocation within its background service loop (line 80), this causes a resource leak with handlers and channels accumulating indefinitely.
Implement IAsyncDisposable in ClusterClient to dispose the channel, and cache instances per address in PartitionManager to avoid repeated allocations.
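One way to cache clients per address and release them on shutdown is sketched below. `ClientCache<TClient>` is a hypothetical helper, not part of the PR; it assumes `ClusterClient` is made disposable as suggested above, and it uses synchronous `Dispose` for brevity (an `IAsyncDisposable` variant would follow the same shape).

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical per-address cache for disposable clients.
public sealed class ClientCache<TClient> : IDisposable where TClient : IDisposable
{
    private readonly ConcurrentDictionary<string, Lazy<TClient>> _clients = new();
    private readonly Func<string, TClient> _factory;

    public ClientCache(Func<string, TClient> factory) => _factory = factory;

    // Returns the cached client for the address, creating it on first use.
    // Lazy<T> ensures the factory runs at most once per stored entry, even if
    // several threads request the same address at the same time.
    public TClient GetOrCreate(string address) =>
        _clients.GetOrAdd(address, a => new Lazy<TClient>(() => _factory(a))).Value;

    // Disposes every client that was actually created, then empties the cache.
    public void Dispose()
    {
        foreach (var entry in _clients.Values)
        {
            if (entry.IsValueCreated)
            {
                entry.Value.Dispose();
            }
        }

        _clients.Clear();
    }
}
```

In `PartitionManager`, the discovery loop would then call something like `cache.GetOrCreate(address)` instead of `new ClusterClient(address)`, and the manager would dispose the cache when the hosted service stops.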
var handler = new HttpClientHandler
{
    ServerCertificateCustomValidationCallback = HttpClientHandler.DangerousAcceptAnyServerCertificateValidator
};
🧩 Analysis chain
🏁 Script executed:
# First, find and read the ClusterClient.cs file
find . -name "ClusterClient.cs" -type f
Repository: ryanachten/ToyDb
Length of output: 99
🏁 Script executed:
cat -n ./ToyDbRouting/Clients/ClusterClient.cs
Repository: ryanachten/ToyDb
Length of output: 1783
🏁 Script executed:
# Find and examine PartitionManager to understand usage context
find . -name "PartitionManager.cs" -type f
Repository: ryanachten/ToyDb
Length of output: 103
🏁 Script executed:
cat -n ./ToyDbRouting/Services/PartitionManager.cs
Repository: ryanachten/ToyDb
Length of output: 5592
Don't bypass cert validation in discovery traffic.
DangerousAcceptAnyServerCertificateValidator accepts any server certificate unconditionally. A MITM attacker could impersonate a replica and feed the router bogus role/leader data, corrupting primary discovery. Either keep this strictly dev-only with environment guards, or configure proper certificate validation using the expected CA or certificate.
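The dev-only guard mentioned above could be sketched as a small factory. `DiscoveryHandlerFactory` and the `allowInsecureDiscovery` flag are hypothetical; how the flag is populated (environment name, configuration key) is left open here.

```csharp
using System.Net.Http;

// Hypothetical factory; the flag name is illustrative.
public static class DiscoveryHandlerFactory
{
    public static HttpClientHandler Create(bool allowInsecureDiscovery)
    {
        // By default no custom callback is set, so the platform's normal
        // certificate chain validation applies.
        var handler = new HttpClientHandler();

        if (allowInsecureDiscovery)
        {
            // Dev/test only: accepts any server certificate.
            handler.ServerCertificateCustomValidationCallback =
                HttpClientHandler.DangerousAcceptAnyServerCertificateValidator;
        }

        return handler;
    }
}
```

Keeping the insecure path behind an explicit opt-in means a missing or mistyped setting fails closed, with standard TLS validation.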
public virtual void TriggerRediscovery(string partitionId)
{
    _ = Task.Run(async () =>
    {
        try
        {
            await DiscoverPrimaryReplicasAsync(CancellationToken.None);
        }
        catch (Exception ex)
        {
            logger.LogError(ex, "Error triggering rediscovery for partition {PartitionId}", partitionId);
        }
    });
🧩 Analysis chain
🏁 Script executed:
find . -name "PartitionManager.cs" -type f
Repository: ryanachten/ToyDb
Length of output: 103
🏁 Script executed:
cat -n ./ToyDbRouting/Services/PartitionManager.cs
Repository: ryanachten/ToyDb
Length of output: 5592
🏁 Script executed:
rg "TriggerRediscovery" --type cs -A 3 -B 3
Repository: ryanachten/ToyDb
Length of output: 1785
🏁 Script executed:
rg "RoutingService" --type cs -B 3 -A 5 | grep -A 5 "TriggerRediscovery"
Repository: ryanachten/ToyDb
Length of output: 42
🏁 Script executed:
cat -n ./ToyDbRouting/Services/RoutingService.cs | head -100
Repository: ryanachten/ToyDb
Length of output: 4503
🏁 Script executed:
rg "FAILED_PRECONDITION" ./ToyDbRouting/Services/RoutingService.cs -B 5 -A 5
Repository: ryanachten/ToyDb
Length of output: 586
TriggerRediscovery is cluster-wide and unbounded.
RoutingService calls this with a concrete partition id on FAILED_PRECONDITION, but partitionId is never used in the method logic (only in the error log) and each failure spins a fire-and-forget Task.Run that rescans all partitions via DiscoverPrimaryReplicasAsync, using CancellationToken.None. A flapping partition can pile up concurrent cluster-wide discovery passes and keep them running during shutdown. Scope the work to the requested partition and coalesce in-flight triggers instead.
Additionally, move this public method before the private helper methods (InitializeReplicaClients, DiscoverPrimaryReplicasAsync, UpdatePartitionPrimary) per coding guidelines.
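A possible shape for the coalescing behavior described above is sketched here. `RediscoveryCoalescer` is a hypothetical class, and the per-partition discovery delegate stands in for a partition-scoped method that does not exist in the PR yet; the stopping token is assumed to be the hosted service's shutdown token.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: at most one in-flight rediscovery per partition.
public sealed class RediscoveryCoalescer
{
    private readonly ConcurrentDictionary<string, Lazy<Task>> _inFlight = new();
    private readonly Func<string, CancellationToken, Task> _discoverPartition;

    public RediscoveryCoalescer(Func<string, CancellationToken, Task> discoverPartition) =>
        _discoverPartition = discoverPartition;

    // Starts a rediscovery for the partition, or returns the one already running.
    // The returned task should be awaited or have its failure logged by the caller.
    public Task Trigger(string partitionId, CancellationToken stoppingToken) =>
        _inFlight
            .GetOrAdd(partitionId, id => new Lazy<Task>(() => RunAsync(id, stoppingToken)))
            .Value;

    private async Task RunAsync(string partitionId, CancellationToken stoppingToken)
    {
        try
        {
            await _discoverPartition(partitionId, stoppingToken);
        }
        finally
        {
            // Allow a later trigger for this partition once this pass has finished.
            _inFlight.TryRemove(partitionId, out _);
        }
    }
}
```

Repeated FAILED_PRECONDITION errors for a flapping partition then share a single discovery pass, and passing the shutdown token lets the host cancel outstanding work.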
    ConsistentHashRing ring,
    PartitionManager partitionManager
) : Routing.Routing.RoutingBase
{
    private readonly ConsistentHashRing _ring = ring;
    private readonly PartitionManager _partitionManager = partitionManager;
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check Partition class for synchronization around PrimaryReplica
ast-grep --pattern $'class Partition {
$$$
}'
# Also search for lock, volatile, or Interlocked usage in Partition
rg -n -C5 'lock\s*\(|volatile|Interlocked' --type=cs -g '*Partition*'
Repository: ryanachten/ToyDb
Length of output: 42
🏁 Script executed:
find . -type f -name "*.cs" | grep -E "(Partition|Routing)" | head -20
Repository: ryanachten/ToyDb
Length of output: 898
🏁 Script executed:
cat -n ToyDbRouting/Models/Partition.cs
Repository: ryanachten/ToyDb
Length of output: 2644
🏁 Script executed:
sed -n '70,115p' ToyDbRouting/Services/RoutingService.cs | cat -n
Repository: ryanachten/ToyDb
Length of output: 2019
🏁 Script executed:
sed -n '100,120p' ToyDbRouting/Services/PartitionManager.cs | cat -n
Repository: ryanachten/ToyDb
Length of output: 1132
🏁 Script executed:
wc -l ToyDbRouting/Services/RoutingService.cs
Repository: ryanachten/ToyDb
Length of output: 103
🏁 Script executed:
cat -n ToyDbRouting/Services/RoutingService.cs | sed -n '75,115p'
Repository: ryanachten/ToyDb
Length of output: 1824
🏁 Script executed:
cat -n ToyDbRouting/Services/PartitionManager.cs
Repository: ryanachten/ToyDb
Length of output: 5592
🏁 Script executed:
rg -n "lock|volatile|Interlocked|ReaderWriterLock" ToyDbRouting/Services/RoutingService.cs ToyDbRouting/Models/Partition.cs ToyDbRouting/Services/PartitionManager.cs
Repository: ryanachten/ToyDb
Length of output: 42
🏁 Script executed:
cat -n ToyDbRouting/Services/RoutingService.cs
Repository: ryanachten/ToyDb
Length of output: 10587
Add synchronization to Partition.PrimaryReplica to prevent data races between request and background threads.
RoutingService reads partition.PrimaryReplica on request threads (lines 82, 105) while PartitionManager.UpdatePartitionPrimary writes to it from a background thread (PartitionManager.cs:111-115). Without synchronization, there is a data race that could lead to visibility issues where readers don't see the latest assignment due to lack of memory barriers.
Use volatile on the backing field or add a lock around both the read and write paths. Alternatively, use Interlocked.Exchange for the assignment and ensure reads also respect the synchronization strategy.
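The Interlocked-based option above could be isolated in a small holder type, roughly as follows. `PrimarySlot<T>` is a hypothetical name; in the PR the same two operations would sit behind the Partition.PrimaryReplica getter and UpdatePrimary.

```csharp
using System.Threading;

// Hypothetical holder for a reference that is swapped by one thread and read by others.
public sealed class PrimarySlot<T> where T : class
{
    private T _current;

    public PrimarySlot(T initial) => _current = initial;

    // Volatile read: observes the most recently published reference.
    public T Read() => Volatile.Read(ref _current);

    // Atomically publishes the new reference and returns the previous one,
    // which the caller may want to dispose or log.
    public T Swap(T next) => Interlocked.Exchange(ref _current, next);
}
```

Reference assignments are already atomic in .NET, so the point of this sketch is visibility and ordering rather than tearing: readers on request threads see the swap made by the background discovery pass without taking a lock.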
…ndition, tests, health check
Summary
- GetRole RPC
- PartitionManager background service that periodically queries replicas to discover the current primary
- RoutingService to trigger rediscovery on FAILED_PRECONDITION errors (e.g., when primary fails and a new one is elected)
- RouterInstanceId and PrimaryDiscoveryIntervalSeconds config options for routing redundancy
Changes
Core Implementation
- election.proto: Add GetRoleRequest/GetRoleResponse messages and RPC
- ClusterService.cs: Implement GetRole handler returning current role from ReplicaState
- ClusterClient.cs: New gRPC client for querying node role/state
- PartitionManager.cs: Background service that discovers and updates primary replicas dynamically
- Partition.cs: Add UpdatePrimary method to support dynamic primary updates
Integration
- RoutingService.cs: Handle FAILED_PRECONDITION to trigger immediate rediscovery
- RoutingOptions.cs: Add PrimaryDiscoveryIntervalSeconds and RouterInstanceId config
- Program.cs: Register PartitionManager as hosted service, add /health endpoint
Configuration
- appsettings.json: Add PrimaryDiscoveryIntervalSeconds default (10s)
Testing
- RoutingServiceRetryTests to include PartitionManager mock