feat: add dynamic primary discovery for routing layer by ryanachten · Pull Request #11 · ryanachten/ToyDb

ryanachten · 2026-04-03T21:25:58Z

Summary

Adds dynamic primary discovery to the routing layer via a new GetRole RPC
Implements PartitionManager background service that periodically queries replicas to discover the current primary
Integrates with RoutingService to trigger rediscovery on FAILED_PRECONDITION errors (e.g., when primary fails and a new one is elected)
Adds RouterInstanceId and PrimaryDiscoveryIntervalSeconds config options for routing redundancy

Changes

Core Implementation

election.proto: Add GetRoleRequest/GetRoleResponse messages and RPC
ClusterService.cs: Implement GetRole handler returning current role from ReplicaState
ClusterClient.cs: New gRPC client for querying node role/state
PartitionManager.cs: Background service that discovers and updates primary replicas dynamically
Partition.cs: Add UpdatePrimary method to support dynamic primary updates

Integration

RoutingService.cs: Handle FAILED_PRECONDITION to trigger immediate rediscovery
RoutingOptions.cs: Add PrimaryDiscoveryIntervalSeconds and RouterInstanceId config
Program.cs: Register PartitionManager as hosted service, add /health endpoint

Configuration

appsettings.json: Add PrimaryDiscoveryIntervalSeconds default (10s)

Testing

All 39 unit tests pass
Updated RoutingServiceRetryTests to include PartitionManager mock

Summary by CodeRabbit

New Features
- Added automatic primary replica discovery—the routing layer now periodically discovers and updates partition primaries for improved availability.
- Introduced a health check endpoint for router instance monitoring and validation.
- Enabled support for multiple router instances behind a load balancer with dynamic primary discovery.
Documentation
- Updated architecture diagrams and planning documentation to reflect routing-layer enhancements.

…ilover

coderabbitai · 2026-04-03T21:26:12Z

Warning

Rate limit exceeded

@ryanachten has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 7 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 0 minutes and 7 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8c892c30-166f-4045-a1c5-449478a4b953

📥 Commits

Reviewing files that changed from the base of the PR and between ca96760 and aaf63db.

📒 Files selected for processing (11)

ToyDb/Program.cs
ToyDb/Services/ClusterService.cs
ToyDbContracts/Protos/election.proto
ToyDbRouting/Clients/ClusterClient.cs
ToyDbRouting/Clients/ReplicaClient.cs
ToyDbRouting/Models/Partition.cs
ToyDbRouting/Program.cs
ToyDbRouting/Services/PartitionManager.cs
ToyDbRouting/Services/RoutingService.cs
ToyDbUnitTests/Services/RoutingServiceRetryTests.cs
plans/completed/plan-routing-redundancy.md

📝 Walkthrough

Walkthrough

This pull request implements dynamic primary replica discovery in the routing layer. It introduces a GetRole RPC to query node roles, adds a ClusterClient for replica discovery, and implements a PartitionManager background service that periodically discovers and updates partition primaries based on actual node state rather than static configuration.

Changes

Cohort / File(s)	Summary
gRPC Service Definition `ToyDbContracts/Protos/election.proto`	Added new `GetRole` RPC to `Election` service with `GetRoleRequest` (empty) and `GetRoleResponse` containing role, term, leader_id, and leader_address fields.
Cluster Service Implementation `ToyDb/Services/ClusterService.cs`	Added public `GetRole` gRPC override that returns node's current role (`"Primary"` or `"Secondary"`), term, and leader identifiers, normalizing null leader fields to empty strings.
Routing Client & Models `ToyDbRouting/Clients/ClusterClient.cs`	New `ClusterClient` class that creates gRPC channel to cluster nodes and calls `GetRole` RPC. Includes `NodeRole` data class with init-only properties and computed `IsPrimary` boolean.
Primary Replica Discovery Service `ToyDbRouting/Services/PartitionManager.cs`	New background service that periodically discovers current primary for each partition by querying replicas via `ClusterClient.GetRole`, updates routing state via `Partition.UpdatePrimary`, and exposes `TriggerRediscovery` for on-demand re-scanning.
Partition Model Update `ToyDbRouting/Models/Partition.cs`	Changed `PrimaryReplica` from immutable field to settable property and added `UpdatePrimary` method to support dynamic primary updates as discovery runs.
Routing Service Integration `ToyDbRouting/Services/RoutingService.cs`	Updated constructor to accept `PartitionManager` dependency and added catch for `RpcException` with `FailedPrecondition` status to trigger primary rediscovery on write failures.
Configuration & DI Setup `ToyDbRouting/Program.cs`, `ToyDbRouting/Models/RoutingOptions.cs`, `ToyDbRouting/appsettings.json`	Registered `PartitionManager` as singleton and hosted service. Added `PrimaryDiscoveryIntervalSeconds` (default 10) and `RouterInstanceId` configuration options. Added `GET /health` HTTP endpoint returning status, instance ID, and timestamp.
Test Updates `ToyDbUnitTests/Services/RoutingServiceRetryTests.cs`	Added mocked `PartitionManager` dependency to test setup and updated `RoutingService` construction calls.
Documentation `README.md`, `plans/completed/plan-leader-election.md`, `plans/completed/plan-routing-redundancy.md`	Updated README architecture diagram to show `PartitionManager` and `ClusterClient` components with role discovery flow. Renamed "High Availability (Leader Election)" heading to "Leader Election". Added new Phase 4 routing redundancy plan documenting dynamic primary discovery architecture and implementation steps.

Sequence Diagram(s)

sequenceDiagram
    actor Timer as Periodic Timer
    participant PM as PartitionManager
    participant CC as ClusterClient
    participant ES as Election Service<br/>(ClusterService)
    participant Ring as ConsistentHashRing<br/>(Partition)

    Timer->>PM: Discovery Interval Elapsed
    PM->>PM: Initialize Replica Cache
    
    loop For Each Partition
        PM->>PM: Create ClusterClient for<br/>each Replica Address
        
        loop For Each Replica (Primary, then Secondaries)
            PM->>CC: new ClusterClient(address)
            activate CC
            CC->>ES: GetRole(GetRoleRequest)
            activate ES
            ES-->>CC: GetRoleResponse{role, term, ...}
            deactivate ES
            CC-->>PM: NodeRole{IsPrimary, ...}
            deactivate CC
            
            alt IsPrimary == true
                PM->>PM: Found Primary!<br/>Break Loop
            end
        end
        
        alt Primary Found
            PM->>PM: Create/Cache ReplicaClient<br/>for New Primary
            PM->>Ring: UpdatePartitionPrimary(partitionId)
            activate Ring
            Ring->>Ring: Find Partition by ID
            Ring->>Ring: partition.UpdatePrimary(newPrimary)
            deactivate Ring
        else No Primary Found
            PM->>PM: Log Warning,<br/>Fallback to Config
        end
    end
    
    PM->>PM: Sleep for<br/>Discovery Interval

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

PR #10 — Implements leader election logic that this PR builds upon by adding the GetRole RPC and integrating role discovery into the routing layer.
PR #6 — Implements ConsistentHashRing in the routing layer, which this PR extends with dynamic primary management via PartitionManager.
PR #4 — Prior routing layer enhancements to RoutingOptions and Program.cs dependency injection that this PR extends with new configuration properties and hosted service registration.

Poem

🐰 A bunny's hop through primaries so bright,
Discovery whispers left and right,
Each partition learns its leader's name,
Through graceful updates, routing's aflame! 🌟
No static chains shall bind our way,
Hop on, hop on—dynamic today! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'feat: add dynamic primary discovery for routing layer' directly and clearly describes the main change—implementing dynamic primary discovery functionality in the routing layer.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/routing-redundancy

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 9

🧹 Nitpick comments (2)

ToyDbContracts/Protos/election.proto (1)
29-36: Consider using an enum for the role field for type safety.

Using a string for role requires exact case-sensitive matching between server (ClusterService.GetRole returns "Primary" or "Secondary") and client (ClusterClient.NodeRole.IsPrimary checks Role == "Primary"). An enum would provide compile-time safety and make the contract explicit:
enum NodeRoleType {
    NODE_ROLE_UNKNOWN = 0;
    NODE_ROLE_PRIMARY = 1;
    NODE_ROLE_SECONDARY = 2;
}

message GetRoleResponse {
    NodeRoleType role = 1;
    // ...
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ToyDbContracts/Protos/election.proto` around lines 29 - 36, Replace the
string-typed role in GetRoleResponse with a protobuf enum to enforce type
safety: define an enum (e.g., NodeRoleType with values NODE_ROLE_UNKNOWN,
NODE_ROLE_PRIMARY, NODE_ROLE_SECONDARY) and change the GetRoleResponse.role
field to that enum; update any server-side return (ClusterService.GetRole) to
return the enum value and adjust client checks
(ClusterClient.NodeRole.IsPrimary) to compare against the enum constant instead
of a string so all sides use the enum type.
ToyDbUnitTests/Services/RoutingServiceRetryTests.cs (1)
33-37: Consider adding a test for FAILED_PRECONDITION triggering rediscovery.

The mock setup is functional, but there's no test verifying that TriggerRediscovery is called when a primary replica returns FAILED_PRECONDITION. This is a key behavior introduced in this PR.

Would you like me to draft a test case that verifies TriggerRediscovery is invoked when the primary throws an RpcException with StatusCode.FailedPrecondition?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ToyDbUnitTests/Services/RoutingServiceRetryTests.cs` around lines 33 - 37,
Add a unit test to verify PartitionManager.TriggerRediscovery is called when the
primary replica returns an RpcException with StatusCode.FailedPrecondition: mock
the primary call used by the RoutingService retry logic to throw new
RpcException(new Status(StatusCode.FailedPrecondition, "...")), arrange the
existing _partitionManagerMock (the Mock<PartitionManager> set up in the test
class) and verify _partitionManagerMock.Verify(p =>
p.TriggerRediscovery(It.IsAny<string>()), Times.Once()); name the test something
like FailedPrecondition_ShouldTriggerRediscovery and place it alongside the
other RoutingServiceRetryTests to assert the triggered rediscovery behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plans/completed/plan-routing-redundancy.md`:
- Around line 45-70: Wrap both fenced code blocks under "Primary Discovery Flow"
and "Routing Redundancy Flow" with a language tag (use ```text) to satisfy
markdownlint, and rename the interval identifier in the Primary Discovery Flow
from DiscoveryIntervalMs to DiscoveryIntervalSeconds (or DiscoveryIntervalSec if
that matches config) and keep the default value as 10s; update any references in
the doc mentioning PartitionManager, RoutingService, Partition.PrimaryReplica,
and the discovery interval so they consistently use DiscoveryIntervalSeconds.

In `@ToyDbRouting/Clients/ClusterClient.cs`:
- Around line 11-23: ClusterClient currently allocates an HttpClientHandler and
a GrpcChannel in its constructor but never disposes them; implement
IAsyncDisposable (and IDisposable if you prefer) on ClusterClient and in the
DisposeAsync/Dispose method ensure you dispose the HttpClientHandler and call
DisposeAsync() on the GrpcChannel (or equivalent channel shutdown) to release
resources; then modify PartitionManager to cache ClusterClient instances keyed
by Address and have DiscoverPrimaryReplicasAsync reuse the cached ClusterClient
instead of creating a new one each loop (and ensure cached clients are disposed
on manager shutdown or replaced safely), referencing ClusterClient, GrpcChannel,
HttpClientHandler, and PartitionManager.DiscoverPrimaryReplicasAsync in your
changes.
- Around line 14-17: The code in ClusterClient.cs sets
HttpClientHandler.ServerCertificateCustomValidationCallback to
DangerousAcceptAnyServerCertificateValidator which disables TLS validation for
discovery; replace this by enforcing proper cert validation or gating the bypass
behind a dev-only flag. Update the handler setup in the ClusterClient
constructor (the handler variable) to either (a) validate the server certificate
chain/issuer or compare the server cert thumbprint/subject against an expected
value (certificate pinning) using ServerCertificateCustomValidationCallback, or
(b) only use DangerousAcceptAnyServerCertificateValidator when an explicit
dev/test env flag/config (e.g., "Environment" or "AllowInsecureDiscovery") is
set; ensure the default behavior performs normal TLS validation and load trusted
CA or client certs via HttpClientHandler.ClientCertificates or the system store.
- Around line 26-28: GetRole currently passes only the caller cancellation token
to _electionClient.GetRoleAsync which can hang; create a short, configurable
per-call deadline by constructing a linked CancellationTokenSource (use
CancellationTokenSource.CreateLinkedTokenSource) from the incoming
cancellationToken, call CancelAfter with the configured timeout, and pass
linked.Token to _electionClient.GetRoleAsync; ensure you dispose the linked CTS
and keep the timeout configurable (like HealthProbeService) so GetRole and the
NodeRole retrieval are bounded.

In `@ToyDbRouting/Models/Partition.cs`:
- Around line 11-17: GetHealthyPrimaryReplica is vulnerable to a race where
UpdatePrimary can swap PrimaryReplica between the health check and the return;
fix by taking a local snapshot of the PrimaryReplica at the start of
GetHealthyPrimaryReplica (e.g., var primary = PrimaryReplica) and use that local
variable for both the health lookup and the return value instead of accessing
the PrimaryReplica field twice; this ensures the health status corresponds to
the same ReplicaClient instance even if
PartitionManager.DiscoverPrimaryReplicasAsync calls UpdatePrimary concurrently.

In `@ToyDbRouting/Program.cs`:
- Around line 26-34: The /health endpoint always returns 200; change the MapGet
handler to check actual readiness (e.g., an injected readiness service or state
exposed by your router) and return 503 when the instance is not ready or has no
usable replicas. Update the handler signature used in app.MapGet("/health", ...)
to accept a readiness indicator (for example a RouterState or
IRouterHealthService with IsReady and AvailableReplicas properties) in addition
to IOptions<RoutingOptions>, then return Results.StatusCode(503) (or
Results.ServiceUnavailable) with a payload explaining the failure when IsReady
is false or AvailableReplicas == 0; otherwise return 200 with Status, InstanceId
(routingOptions.Value.RouterInstanceId) and Timestamp as before.

In `@ToyDbRouting/Services/PartitionManager.cs`:
- Around line 100-106: The code currently overwrites a previously discovered
primary with the static config when no primary is found; instead, check
_primaryReplicas for an existing entry before falling back: in the block
containing logger.LogWarning, attempt to get currentPrimary from
_primaryReplicas using partitionConfig.PartitionId and if found keep it (do not
call UpdatePartitionPrimary), otherwise fall back to the configured address by
retrieving/creating fallbackPrimary from _replicaClients
(partitionConfig.PrimaryReplicaAddress), assign it to
_primaryReplicas[partitionConfig.PartitionId] and call UpdatePartitionPrimary as
now; reference logger.LogWarning, _replicaClients, fallbackPrimary,
_primaryReplicas, UpdatePartitionPrimary, partitionConfig.PartitionId and
partitionConfig.PrimaryReplicaAddress.
- Around line 117-129: TriggerRediscovery currently spins an unbounded
fire-and-forget Task that rescans all partitions with CancellationToken.None and
only uses partitionId in the log; change it to perform a scoped discovery for
the single partitionId (call a new or existing partition-scoped method instead
of DiscoverPrimaryReplicasAsync), avoid Task.Run by scheduling via a coalescing
mechanism (e.g., track in-flight triggers per partitionId using a
ConcurrentDictionary or similar to coalesce concurrent calls), accept or use a
cancellable token (propagate a CancellationToken to the discovery call) and
ensure the task is awaited or observably scheduled so shutdown can cancel it;
also move the public TriggerRediscovery method declaration above the private
helpers InitializeReplicaClients, DiscoverPrimaryReplicasAsync, and
UpdatePartitionPrimary to follow coding guidelines.

In `@ToyDbRouting/Services/RoutingService.cs`:
- Around line 22-27: Partition.PrimaryReplica is accessed concurrently (read in
RoutingService and written in PartitionManager.UpdatePartitionPrimary) causing a
data race; fix by adding synchronization: either make the backing field for
PrimaryReplica volatile and update it via Interlocked.Exchange in
PartitionManager.UpdatePartitionPrimary (and read the volatile field in
RoutingService), or implement a dedicated lock object used by both the
Partition.PrimaryReplica setter/getter and by
PartitionManager.UpdatePartitionPrimary so reads in RoutingService (where
PrimaryReplica is read) and writes in UpdatePartitionPrimary use the same lock;
pick one approach and apply it consistently to eliminate the visibility/race
issue.

---

Nitpick comments:
In `@ToyDbContracts/Protos/election.proto`:
- Around line 29-36: Replace the string-typed role in GetRoleResponse with a
protobuf enum to enforce type safety: define an enum (e.g., NodeRoleType with
values NODE_ROLE_UNKNOWN, NODE_ROLE_PRIMARY, NODE_ROLE_SECONDARY) and change the
GetRoleResponse.role field to that enum; update any server-side return
(ClusterService.GetRole) to return the enum value and adjust client checks
(ClusterClient.NodeRole.IsPrimary) to compare against the enum constant instead
of a string so all sides use the enum type.

In `@ToyDbUnitTests/Services/RoutingServiceRetryTests.cs`:
- Around line 33-37: Add a unit test to verify
PartitionManager.TriggerRediscovery is called when the primary replica returns
an RpcException with StatusCode.FailedPrecondition: mock the primary call used
by the RoutingService retry logic to throw new RpcException(new
Status(StatusCode.FailedPrecondition, "...")), arrange the existing
_partitionManagerMock (the Mock<PartitionManager> set up in the test class) and
verify _partitionManagerMock.Verify(p =>
p.TriggerRediscovery(It.IsAny<string>()), Times.Once()); name the test something
like FailedPrecondition_ShouldTriggerRediscovery and place it alongside the
other RoutingServiceRetryTests to assert the triggered rediscovery behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec205b49-d4cf-481a-9b9e-47fecd9b3d5f

📥 Commits

Reviewing files that changed from the base of the PR and between 6f721a9 and ca96760.

📒 Files selected for processing (13)

README.md
ToyDb/Services/ClusterService.cs
ToyDbContracts/Protos/election.proto
ToyDbRouting/Clients/ClusterClient.cs
ToyDbRouting/Models/Partition.cs
ToyDbRouting/Models/RoutingOptions.cs
ToyDbRouting/Program.cs
ToyDbRouting/Services/PartitionManager.cs
ToyDbRouting/Services/RoutingService.cs
ToyDbRouting/appsettings.json
ToyDbUnitTests/Services/RoutingServiceRetryTests.cs
plans/completed/plan-leader-election.md
plans/completed/plan-routing-redundancy.md

coderabbitai · 2026-04-03T21:41:41Z

+    public ClusterClient(string address)
+    {
+        Address = address;
+        var handler = new HttpClientHandler
+        {
+            ServerCertificateCustomValidationCallback = HttpClientHandler.DangerousAcceptAnyServerCertificateValidator
+        };
+
+        var channel = GrpcChannel.ForAddress(address, new GrpcChannelOptions
+        {
+            HttpHandler = handler
+        });
+        _electionClient = new Election.ElectionClient(channel);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "ClusterClient.cs" | head -20

Repository: ryanachten/ToyDb

Length of output: 99

🏁 Script executed:

find . -type f -name "PartitionManager.cs" | head -20

Repository: ryanachten/ToyDb

Length of output: 103

🏁 Script executed:

cat -n ./ToyDbRouting/Clients/ClusterClient.cs

Repository: ryanachten/ToyDb

Length of output: 1783

🏁 Script executed:

cat -n ./ToyDbRouting/Services/PartitionManager.cs | head -100

Repository: ryanachten/ToyDb

Length of output: 4155

🏁 Script executed:

rg "new\s+(Clients\.)?ClusterClient" --type cs

Repository: ryanachten/ToyDb

Length of output: 177

🏁 Script executed:

rg "ClusterClient" --type cs

Repository: ryanachten/ToyDb

Length of output: 321

Make ClusterClient disposable and cache instances per address.

The constructor allocates HttpClientHandler and GrpcChannel but the class does not implement IDisposable/IAsyncDisposable, leaving these unmanaged resources unreleased. Since PartitionManager.DiscoverPrimaryReplicasAsync creates a new ClusterClient instance on each invocation within its background service loop (line 80), this causes a resource leak with handlers and channels accumulating indefinitely.

Implement IAsyncDisposable in ClusterClient to dispose the channel, and cache instances per address in PartitionManager to avoid repeated allocations.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ToyDbRouting/Clients/ClusterClient.cs` around lines 11 - 23, ClusterClient currently allocates an HttpClientHandler and a GrpcChannel in its constructor but never disposes them; implement IAsyncDisposable (and IDisposable if you prefer) on ClusterClient and in the DisposeAsync/Dispose method ensure you dispose the HttpClientHandler and call DisposeAsync() on the GrpcChannel (or equivalent channel shutdown) to release resources; then modify PartitionManager to cache ClusterClient instances keyed by Address and have DiscoverPrimaryReplicasAsync reuse the cached ClusterClient instead of creating a new one each loop (and ensure cached clients are disposed on manager shutdown or replaced safely), referencing ClusterClient, GrpcChannel, HttpClientHandler, and PartitionManager.DiscoverPrimaryReplicasAsync in your changes.

coderabbitai · 2026-04-03T21:41:41Z

+        var handler = new HttpClientHandler
+        {
+            ServerCertificateCustomValidationCallback = HttpClientHandler.DangerousAcceptAnyServerCertificateValidator
+        };


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find and read the ClusterClient.cs file find . -name "ClusterClient.cs" -type f

Repository: ryanachten/ToyDb

Length of output: 99

🏁 Script executed:

cat -n ./ToyDbRouting/Clients/ClusterClient.cs

Repository: ryanachten/ToyDb

Length of output: 1783

🏁 Script executed:

# Find and examine PartitionManager to understand usage context find . -name "PartitionManager.cs" -type f

Repository: ryanachten/ToyDb

Length of output: 103

🏁 Script executed:

cat -n ./ToyDbRouting/Services/PartitionManager.cs

Repository: ryanachten/ToyDb

Length of output: 5592

Don't bypass cert validation in discovery traffic.

DangerousAcceptAnyServerCertificateValidator accepts any server certificate unconditionally. A MITM attacker could impersonate a replica and feed the router bogus role/leader data, corrupting primary discovery. Either keep this strictly dev-only with environment guards, or configure proper certificate validation using the expected CA or certificate.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ToyDbRouting/Clients/ClusterClient.cs` around lines 14 - 17, The code in ClusterClient.cs sets HttpClientHandler.ServerCertificateCustomValidationCallback to DangerousAcceptAnyServerCertificateValidator which disables TLS validation for discovery; replace this by enforcing proper cert validation or gating the bypass behind a dev-only flag. Update the handler setup in the ClusterClient constructor (the handler variable) to either (a) validate the server certificate chain/issuer or compare the server cert thumbprint/subject against an expected value (certificate pinning) using ServerCertificateCustomValidationCallback, or (b) only use DangerousAcceptAnyServerCertificateValidator when an explicit dev/test env flag/config (e.g., "Environment" or "AllowInsecureDiscovery") is set; ensure the default behavior performs normal TLS validation and load trusted CA or client certs via HttpClientHandler.ClientCertificates or the system store.

coderabbitai · 2026-04-03T21:41:41Z

+    public virtual void TriggerRediscovery(string partitionId)
+    {
+        _ = Task.Run(async () =>
+        {
+            try
+            {
+                await DiscoverPrimaryReplicasAsync(CancellationToken.None);
+            }
+            catch (Exception ex)
+            {
+                logger.LogError(ex, "Error triggering rediscovery for partition {PartitionId}", partitionId);
+            }
+        });


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "PartitionManager.cs" -type f

Repository: ryanachten/ToyDb

Length of output: 103

🏁 Script executed:

cat -n ./ToyDbRouting/Services/PartitionManager.cs

Repository: ryanachten/ToyDb

Length of output: 5592

🏁 Script executed:

rg "TriggerRediscovery" --type cs -A 3 -B 3

Repository: ryanachten/ToyDb

Length of output: 1785

🏁 Script executed:

rg "RoutingService" --type cs -B 3 -A 5 | grep -A 5 "TriggerRediscovery"

Repository: ryanachten/ToyDb

Length of output: 42

🏁 Script executed:

cat -n ./ToyDbRouting/Services/RoutingService.cs | head -100

Repository: ryanachten/ToyDb

Length of output: 4503

🏁 Script executed:

rg "FAILED_PRECONDITION" ./ToyDbRouting/Services/RoutingService.cs -B 5 -A 5

Repository: ryanachten/ToyDb

Length of output: 586

TriggerRediscovery is cluster-wide and unbounded.

RoutingService calls this with a concrete partition id on FAILED_PRECONDITION, but partitionId is never used in the method logic (only in the error log) and each failure spins a fire-and-forget Task.Run that rescans all partitions via DiscoverPrimaryReplicasAsync, using CancellationToken.None. A flapping partition can pile up concurrent cluster-wide discovery passes and keep them running during shutdown. Scope the work to the requested partition and coalesce in-flight triggers instead.

Additionally, move this public method before the private helper methods (InitializeReplicaClients, DiscoverPrimaryReplicasAsync, UpdatePartitionPrimary) per coding guidelines.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ToyDbRouting/Services/PartitionManager.cs` around lines 117 - 129, TriggerRediscovery currently spins an unbounded fire-and-forget Task that rescans all partitions with CancellationToken.None and only uses partitionId in the log; change it to perform a scoped discovery for the single partitionId (call a new or existing partition-scoped method instead of DiscoverPrimaryReplicasAsync), avoid Task.Run by scheduling via a coalescing mechanism (e.g., track in-flight triggers per partitionId using a ConcurrentDictionary or similar to coalesce concurrent calls), accept or use a cancellable token (propagate a CancellationToken to the discovery call) and ensure the task is awaited or observably scheduled so shutdown can cancel it; also move the public TriggerRediscovery method declaration above the private helpers InitializeReplicaClients, DiscoverPrimaryReplicasAsync, and UpdatePartitionPrimary to follow coding guidelines.

coderabbitai · 2026-04-03T21:41:41Z

+    ConsistentHashRing ring,
+    PartitionManager partitionManager
 ) : Routing.Routing.RoutingBase
 {
    private readonly ConsistentHashRing _ring = ring;
+    private readonly PartitionManager _partitionManager = partitionManager;


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check Partition class for synchronization around PrimaryReplica ast-grep --pattern $'class Partition { $$$ }' # Also search for lock, volatile, or Interlocked usage in Partition rg -n -C5 'lock\s*\(|volatile|Interlocked' --type=cs -g '*Partition*'

Repository: ryanachten/ToyDb

Length of output: 42

🏁 Script executed:

find . -type f -name "*.cs" | grep -E "(Partition|Routing)" | head -20

Repository: ryanachten/ToyDb

Length of output: 898

🏁 Script executed:

cat -n ToyDbRouting/Models/Partition.cs

Repository: ryanachten/ToyDb

Length of output: 2644

🏁 Script executed:

sed -n '70,115p' ToyDbRouting/Services/RoutingService.cs | cat -n

Repository: ryanachten/ToyDb

Length of output: 2019

🏁 Script executed:

sed -n '100,120p' ToyDbRouting/Services/PartitionManager.cs | cat -n

Repository: ryanachten/ToyDb

Length of output: 1132

🏁 Script executed:

wc -l ToyDbRouting/Services/RoutingService.cs

Repository: ryanachten/ToyDb

Length of output: 103

🏁 Script executed:

cat -n ToyDbRouting/Services/RoutingService.cs | sed -n '75,115p'

Repository: ryanachten/ToyDb

Length of output: 1824

🏁 Script executed:

cat -n ToyDbRouting/Services/PartitionManager.cs

Repository: ryanachten/ToyDb

Length of output: 5592

🏁 Script executed:

rg -n "lock|volatile|Interlocked|ReaderWriterLock" ToyDbRouting/Services/RoutingService.cs ToyDbRouting/Models/Partition.cs ToyDbRouting/Services/PartitionManager.cs

Repository: ryanachten/ToyDb

Length of output: 42

🏁 Script executed:

cat -n ToyDbRouting/Services/RoutingService.cs

Repository: ryanachten/ToyDb

Length of output: 10587

Add synchronization to Partition.PrimaryReplica to prevent data races between request and background threads.

RoutingService reads partition.PrimaryReplica on request threads (lines 82, 105) while PartitionManager.UpdatePartitionPrimary writes to it from a background thread (PartitionManager.cs:111-115). Without synchronization, there is a data race that could lead to visibility issues where readers don't see the latest assignment due to lack of memory barriers.

Use volatile on the backing field or add a lock around both the read and write paths. Alternatively, use Interlocked.Exchange for the assignment and ensure reads also respect the synchronization strategy.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ToyDbRouting/Services/RoutingService.cs` around lines 22 - 27, Partition.PrimaryReplica is accessed concurrently (read in RoutingService and written in PartitionManager.UpdatePartitionPrimary) causing a data race; fix by adding synchronization: either make the backing field for PrimaryReplica volatile and update it via Interlocked.Exchange in PartitionManager.UpdatePartitionPrimary (and read the volatile field in RoutingService), or implement a dedicated lock object used by both the Partition.PrimaryReplica setter/getter and by PartitionManager.UpdatePartitionPrimary so reads in RoutingService (where PrimaryReplica is read) and writes in UpdatePartitionPrimary use the same lock; pick one approach and apply it consistently to eliminate the visibility/race issue.

…ndition, tests, health check

ryanachten added 4 commits April 4, 2026 10:12

feat: add GetRole RPC to election service for role discovery

cc73004

feat: add PartitionManager service for dynamic primary discovery

a5d374e

feat: integrate PartitionManager with RoutingService for automatic fa…

adc685d

…ilover

docs: move routing redundancy plan to completed

ca96760

coderabbitai Bot reviewed Apr 3, 2026

View reviewed changes

ryanachten added 5 commits April 4, 2026 10:45

fix: address PR comments - enum for role, disposal, deadline, race co…

8bf3914

…ndition, tests, health check

fix: increase gRPC message sizes to handle large GetAllValues responses

07f1cea

fix: increase gRPC message size to 500MB for large responses

5f89102

chore: trigger CI re-run

c4d5766

fix: revert GetAllValues to original parallel implementation

aaf63db

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add dynamic primary discovery for routing layer#11

feat: add dynamic primary discovery for routing layer#11
ryanachten wants to merge 9 commits into
mainfrom
feature/routing-redundancy

ryanachten commented Apr 3, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 3, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 3, 2026

Uh oh!

coderabbitai Bot Apr 3, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 3, 2026

Uh oh!

coderabbitai Bot Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ryanachten commented Apr 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Core Implementation

Integration

Configuration

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ryanachten commented Apr 3, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 3, 2026 •

edited

Loading