Skip to content

feat: add dynamic primary discovery for routing layer#11

Open
ryanachten wants to merge 9 commits into
mainfrom
feature/routing-redundancy
Open

feat: add dynamic primary discovery for routing layer#11
ryanachten wants to merge 9 commits into
mainfrom
feature/routing-redundancy

Conversation

@ryanachten
Copy link
Copy Markdown
Owner

@ryanachten ryanachten commented Apr 3, 2026

Summary

  • Adds dynamic primary discovery to the routing layer via a new GetRole RPC
  • Implements PartitionManager background service that periodically queries replicas to discover the current primary
  • Integrates with RoutingService to trigger rediscovery on FAILED_PRECONDITION errors (e.g., when primary fails and a new one is elected)
  • Adds RouterInstanceId and PrimaryDiscoveryIntervalSeconds config options for routing redundancy

Changes

Core Implementation

  • election.proto: Add GetRoleRequest/GetRoleResponse messages and RPC
  • ClusterService.cs: Implement GetRole handler returning current role from ReplicaState
  • ClusterClient.cs: New gRPC client for querying node role/state
  • PartitionManager.cs: Background service that discovers and updates primary replicas dynamically
  • Partition.cs: Add UpdatePrimary method to support dynamic primary updates

Integration

  • RoutingService.cs: Handle FAILED_PRECONDITION to trigger immediate rediscovery
  • RoutingOptions.cs: Add PrimaryDiscoveryIntervalSeconds and RouterInstanceId config
  • Program.cs: Register PartitionManager as hosted service, add /health endpoint

Configuration

  • appsettings.json: Add PrimaryDiscoveryIntervalSeconds default (10s)

Testing

  • All 39 unit tests pass
  • Updated RoutingServiceRetryTests to include PartitionManager mock

Summary by CodeRabbit

  • New Features

    • Added automatic primary replica discovery—the routing layer now periodically discovers and updates partition primaries for improved availability.
    • Introduced a health check endpoint for router instance monitoring and validation.
    • Enabled support for multiple router instances behind a load balancer with dynamic primary discovery.
  • Documentation

    • Updated architecture diagrams and planning documentation to reflect routing-layer enhancements.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 3, 2026

Warning

Rate limit exceeded

@ryanachten has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 7 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 0 minutes and 7 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8c892c30-166f-4045-a1c5-449478a4b953

📥 Commits

Reviewing files that changed from the base of the PR and between ca96760 and aaf63db.

📒 Files selected for processing (11)
  • ToyDb/Program.cs
  • ToyDb/Services/ClusterService.cs
  • ToyDbContracts/Protos/election.proto
  • ToyDbRouting/Clients/ClusterClient.cs
  • ToyDbRouting/Clients/ReplicaClient.cs
  • ToyDbRouting/Models/Partition.cs
  • ToyDbRouting/Program.cs
  • ToyDbRouting/Services/PartitionManager.cs
  • ToyDbRouting/Services/RoutingService.cs
  • ToyDbUnitTests/Services/RoutingServiceRetryTests.cs
  • plans/completed/plan-routing-redundancy.md
📝 Walkthrough

Walkthrough

This pull request implements dynamic primary replica discovery in the routing layer. It introduces a GetRole RPC to query node roles, adds a ClusterClient for replica discovery, and implements a PartitionManager background service that periodically discovers and updates partition primaries based on actual node state rather than static configuration.

Changes

Cohort / File(s) Summary
gRPC Service Definition
ToyDbContracts/Protos/election.proto
Added new GetRole RPC to Election service with GetRoleRequest (empty) and GetRoleResponse containing role, term, leader_id, and leader_address fields.
Cluster Service Implementation
ToyDb/Services/ClusterService.cs
Added public GetRole gRPC override that returns node's current role ("Primary" or "Secondary"), term, and leader identifiers, normalizing null leader fields to empty strings.
Routing Client & Models
ToyDbRouting/Clients/ClusterClient.cs
New ClusterClient class that creates gRPC channel to cluster nodes and calls GetRole RPC. Includes NodeRole data class with init-only properties and computed IsPrimary boolean.
Primary Replica Discovery Service
ToyDbRouting/Services/PartitionManager.cs
New background service that periodically discovers current primary for each partition by querying replicas via ClusterClient.GetRole, updates routing state via Partition.UpdatePrimary, and exposes TriggerRediscovery for on-demand re-scanning.
Partition Model Update
ToyDbRouting/Models/Partition.cs
Changed PrimaryReplica from immutable field to settable property and added UpdatePrimary method to support dynamic primary updates as discovery runs.
Routing Service Integration
ToyDbRouting/Services/RoutingService.cs
Updated constructor to accept PartitionManager dependency and added catch for RpcException with FailedPrecondition status to trigger primary rediscovery on write failures.
Configuration & DI Setup
ToyDbRouting/Program.cs, ToyDbRouting/Models/RoutingOptions.cs, ToyDbRouting/appsettings.json
Registered PartitionManager as singleton and hosted service. Added PrimaryDiscoveryIntervalSeconds (default 10) and RouterInstanceId configuration options. Added GET /health HTTP endpoint returning status, instance ID, and timestamp.
Test Updates
ToyDbUnitTests/Services/RoutingServiceRetryTests.cs
Added mocked PartitionManager dependency to test setup and updated RoutingService construction calls.
Documentation
README.md, plans/completed/plan-leader-election.md, plans/completed/plan-routing-redundancy.md
Updated README architecture diagram to show PartitionManager and ClusterClient components with role discovery flow. Renamed "High Availability (Leader Election)" heading to "Leader Election". Added new Phase 4 routing redundancy plan documenting dynamic primary discovery architecture and implementation steps.

Sequence Diagram(s)

sequenceDiagram
    actor Timer as Periodic Timer
    participant PM as PartitionManager
    participant CC as ClusterClient
    participant ES as Election Service<br/>(ClusterService)
    participant Ring as ConsistentHashRing<br/>(Partition)

    Timer->>PM: Discovery Interval Elapsed
    PM->>PM: Initialize Replica Cache
    
    loop For Each Partition
        PM->>PM: Create ClusterClient for<br/>each Replica Address
        
        loop For Each Replica (Primary, then Secondaries)
            PM->>CC: new ClusterClient(address)
            activate CC
            CC->>ES: GetRole(GetRoleRequest)
            activate ES
            ES-->>CC: GetRoleResponse{role, term, ...}
            deactivate ES
            CC-->>PM: NodeRole{IsPrimary, ...}
            deactivate CC
            
            alt IsPrimary == true
                PM->>PM: Found Primary!<br/>Break Loop
            end
        end
        
        alt Primary Found
            PM->>PM: Create/Cache ReplicaClient<br/>for New Primary
            PM->>Ring: UpdatePartitionPrimary(partitionId)
            activate Ring
            Ring->>Ring: Find Partition by ID
            Ring->>Ring: partition.UpdatePrimary(newPrimary)
            deactivate Ring
        else No Primary Found
            PM->>PM: Log Warning,<br/>Fallback to Config
        end
    end
    
    PM->>PM: Sleep for<br/>Discovery Interval
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #10 — Implements leader election logic that this PR builds upon by adding the GetRole RPC and integrating role discovery into the routing layer.
  • PR #6 — Implements ConsistentHashRing in the routing layer, which this PR extends with dynamic primary management via PartitionManager.
  • PR #4 — Prior routing layer enhancements to RoutingOptions and Program.cs dependency injection that this PR extends with new configuration properties and hosted service registration.

Poem

🐰 A bunny's hop through primaries so bright,
Discovery whispers left and right,
Each partition learns its leader's name,
Through graceful updates, routing's aflame! 🌟
No static chains shall bind our way,
Hop on, hop on—dynamic today! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat: add dynamic primary discovery for routing layer' directly and clearly describes the main change—implementing dynamic primary discovery functionality in the routing layer.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/routing-redundancy

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Nitpick comments (2)
ToyDbContracts/Protos/election.proto (1)

29-36: Consider using an enum for the role field for type safety.

Using a string for role requires exact case-sensitive matching between server (ClusterService.GetRole returns "Primary" or "Secondary") and client (ClusterClient.NodeRole.IsPrimary checks Role == "Primary"). An enum would provide compile-time safety and make the contract explicit:

enum NodeRoleType {
    NODE_ROLE_UNKNOWN = 0;
    NODE_ROLE_PRIMARY = 1;
    NODE_ROLE_SECONDARY = 2;
}

message GetRoleResponse {
    NodeRoleType role = 1;
    // ...
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ToyDbContracts/Protos/election.proto` around lines 29 - 36, Replace the
string-typed role in GetRoleResponse with a protobuf enum to enforce type
safety: define an enum (e.g., NodeRoleType with values NODE_ROLE_UNKNOWN,
NODE_ROLE_PRIMARY, NODE_ROLE_SECONDARY) and change the GetRoleResponse.role
field to that enum; update any server-side return (ClusterService.GetRole) to
return the enum value and adjust client checks
(ClusterClient.NodeRole.IsPrimary) to compare against the enum constant instead
of a string so all sides use the enum type.
ToyDbUnitTests/Services/RoutingServiceRetryTests.cs (1)

33-37: Consider adding a test for FAILED_PRECONDITION triggering rediscovery.

The mock setup is functional, but there's no test verifying that TriggerRediscovery is called when a primary replica returns FAILED_PRECONDITION. This is a key behavior introduced in this PR.

Would you like me to draft a test case that verifies TriggerRediscovery is invoked when the primary throws an RpcException with StatusCode.FailedPrecondition?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ToyDbUnitTests/Services/RoutingServiceRetryTests.cs` around lines 33 - 37,
Add a unit test to verify PartitionManager.TriggerRediscovery is called when the
primary replica returns an RpcException with StatusCode.FailedPrecondition: mock
the primary call used by the RoutingService retry logic to throw new
RpcException(new Status(StatusCode.FailedPrecondition, "...")), arrange the
existing _partitionManagerMock (the Mock<PartitionManager> set up in the test
class) and verify _partitionManagerMock.Verify(p =>
p.TriggerRediscovery(It.IsAny<string>()), Times.Once()); name the test something
like FailedPrecondition_ShouldTriggerRediscovery and place it alongside the
other RoutingServiceRetryTests to assert the triggered rediscovery behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plans/completed/plan-routing-redundancy.md`:
- Around line 45-70: Wrap both fenced code blocks under "Primary Discovery Flow"
and "Routing Redundancy Flow" with a language tag (use ```text) to satisfy
markdownlint, and rename the interval identifier in the Primary Discovery Flow
from DiscoveryIntervalMs to DiscoveryIntervalSeconds (or DiscoveryIntervalSec if
that matches config) and keep the default value as 10s; update any references in
the doc mentioning PartitionManager, RoutingService, Partition.PrimaryReplica,
and the discovery interval so they consistently use DiscoveryIntervalSeconds.

In `@ToyDbRouting/Clients/ClusterClient.cs`:
- Around line 11-23: ClusterClient currently allocates an HttpClientHandler and
a GrpcChannel in its constructor but never disposes them; implement
IAsyncDisposable (and IDisposable if you prefer) on ClusterClient and in the
DisposeAsync/Dispose method ensure you dispose the HttpClientHandler and call
DisposeAsync() on the GrpcChannel (or equivalent channel shutdown) to release
resources; then modify PartitionManager to cache ClusterClient instances keyed
by Address and have DiscoverPrimaryReplicasAsync reuse the cached ClusterClient
instead of creating a new one each loop (and ensure cached clients are disposed
on manager shutdown or replaced safely), referencing ClusterClient, GrpcChannel,
HttpClientHandler, and PartitionManager.DiscoverPrimaryReplicasAsync in your
changes.
- Around line 14-17: The code in ClusterClient.cs sets
HttpClientHandler.ServerCertificateCustomValidationCallback to
DangerousAcceptAnyServerCertificateValidator which disables TLS validation for
discovery; replace this by enforcing proper cert validation or gating the bypass
behind a dev-only flag. Update the handler setup in the ClusterClient
constructor (the handler variable) to either (a) validate the server certificate
chain/issuer or compare the server cert thumbprint/subject against an expected
value (certificate pinning) using ServerCertificateCustomValidationCallback, or
(b) only use DangerousAcceptAnyServerCertificateValidator when an explicit
dev/test env flag/config (e.g., "Environment" or "AllowInsecureDiscovery") is
set; ensure the default behavior performs normal TLS validation and load trusted
CA or client certs via HttpClientHandler.ClientCertificates or the system store.
- Around line 26-28: GetRole currently passes only the caller cancellation token
to _electionClient.GetRoleAsync which can hang; create a short, configurable
per-call deadline by constructing a linked CancellationTokenSource (use
CancellationTokenSource.CreateLinkedTokenSource) from the incoming
cancellationToken, call CancelAfter with the configured timeout, and pass
linked.Token to _electionClient.GetRoleAsync; ensure you dispose the linked CTS
and keep the timeout configurable (like HealthProbeService) so GetRole and the
NodeRole retrieval are bounded.

In `@ToyDbRouting/Models/Partition.cs`:
- Around line 11-17: GetHealthyPrimaryReplica is vulnerable to a race where
UpdatePrimary can swap PrimaryReplica between the health check and the return;
fix by taking a local snapshot of the PrimaryReplica at the start of
GetHealthyPrimaryReplica (e.g., var primary = PrimaryReplica) and use that local
variable for both the health lookup and the return value instead of accessing
the PrimaryReplica field twice; this ensures the health status corresponds to
the same ReplicaClient instance even if
PartitionManager.DiscoverPrimaryReplicasAsync calls UpdatePrimary concurrently.

In `@ToyDbRouting/Program.cs`:
- Around line 26-34: The /health endpoint always returns 200; change the MapGet
handler to check actual readiness (e.g., an injected readiness service or state
exposed by your router) and return 503 when the instance is not ready or has no
usable replicas. Update the handler signature used in app.MapGet("/health", ...)
to accept a readiness indicator (for example a RouterState or
IRouterHealthService with IsReady and AvailableReplicas properties) in addition
to IOptions<RoutingOptions>, then return Results.StatusCode(503) (or
Results.ServiceUnavailable) with a payload explaining the failure when IsReady
is false or AvailableReplicas == 0; otherwise return 200 with Status, InstanceId
(routingOptions.Value.RouterInstanceId) and Timestamp as before.

In `@ToyDbRouting/Services/PartitionManager.cs`:
- Around line 100-106: The code currently overwrites a previously discovered
primary with the static config when no primary is found; instead, check
_primaryReplicas for an existing entry before falling back: in the block
containing logger.LogWarning, attempt to get currentPrimary from
_primaryReplicas using partitionConfig.PartitionId and if found keep it (do not
call UpdatePartitionPrimary), otherwise fall back to the configured address by
retrieving/creating fallbackPrimary from _replicaClients
(partitionConfig.PrimaryReplicaAddress), assign it to
_primaryReplicas[partitionConfig.PartitionId] and call UpdatePartitionPrimary as
now; reference logger.LogWarning, _replicaClients, fallbackPrimary,
_primaryReplicas, UpdatePartitionPrimary, partitionConfig.PartitionId and
partitionConfig.PrimaryReplicaAddress.
- Around line 117-129: TriggerRediscovery currently spins an unbounded
fire-and-forget Task that rescans all partitions with CancellationToken.None and
only uses partitionId in the log; change it to perform a scoped discovery for
the single partitionId (call a new or existing partition-scoped method instead
of DiscoverPrimaryReplicasAsync), avoid Task.Run by scheduling via a coalescing
mechanism (e.g., track in-flight triggers per partitionId using a
ConcurrentDictionary or similar to coalesce concurrent calls), accept or use a
cancellable token (propagate a CancellationToken to the discovery call) and
ensure the task is awaited or observably scheduled so shutdown can cancel it;
also move the public TriggerRediscovery method declaration above the private
helpers InitializeReplicaClients, DiscoverPrimaryReplicasAsync, and
UpdatePartitionPrimary to follow coding guidelines.

In `@ToyDbRouting/Services/RoutingService.cs`:
- Around line 22-27: Partition.PrimaryReplica is accessed concurrently (read in
RoutingService and written in PartitionManager.UpdatePartitionPrimary) causing a
data race; fix by adding synchronization: either make the backing field for
PrimaryReplica volatile and update it via Interlocked.Exchange in
PartitionManager.UpdatePartitionPrimary (and read the volatile field in
RoutingService), or implement a dedicated lock object used by both the
Partition.PrimaryReplica setter/getter and by
PartitionManager.UpdatePartitionPrimary so reads in RoutingService (where
PrimaryReplica is read) and writes in UpdatePartitionPrimary use the same lock;
pick one approach and apply it consistently to eliminate the visibility/race
issue.

---

Nitpick comments:
In `@ToyDbContracts/Protos/election.proto`:
- Around line 29-36: Replace the string-typed role in GetRoleResponse with a
protobuf enum to enforce type safety: define an enum (e.g., NodeRoleType with
values NODE_ROLE_UNKNOWN, NODE_ROLE_PRIMARY, NODE_ROLE_SECONDARY) and change the
GetRoleResponse.role field to that enum; update any server-side return
(ClusterService.GetRole) to return the enum value and adjust client checks
(ClusterClient.NodeRole.IsPrimary) to compare against the enum constant instead
of a string so all sides use the enum type.

In `@ToyDbUnitTests/Services/RoutingServiceRetryTests.cs`:
- Around line 33-37: Add a unit test to verify
PartitionManager.TriggerRediscovery is called when the primary replica returns
an RpcException with StatusCode.FailedPrecondition: mock the primary call used
by the RoutingService retry logic to throw new RpcException(new
Status(StatusCode.FailedPrecondition, "...")), arrange the existing
_partitionManagerMock (the Mock<PartitionManager> set up in the test class) and
verify _partitionManagerMock.Verify(p =>
p.TriggerRediscovery(It.IsAny<string>()), Times.Once()); name the test something
like FailedPrecondition_ShouldTriggerRediscovery and place it alongside the
other RoutingServiceRetryTests to assert the triggered rediscovery behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec205b49-d4cf-481a-9b9e-47fecd9b3d5f

📥 Commits

Reviewing files that changed from the base of the PR and between 6f721a9 and ca96760.

📒 Files selected for processing (13)
  • README.md
  • ToyDb/Services/ClusterService.cs
  • ToyDbContracts/Protos/election.proto
  • ToyDbRouting/Clients/ClusterClient.cs
  • ToyDbRouting/Models/Partition.cs
  • ToyDbRouting/Models/RoutingOptions.cs
  • ToyDbRouting/Program.cs
  • ToyDbRouting/Services/PartitionManager.cs
  • ToyDbRouting/Services/RoutingService.cs
  • ToyDbRouting/appsettings.json
  • ToyDbUnitTests/Services/RoutingServiceRetryTests.cs
  • plans/completed/plan-leader-election.md
  • plans/completed/plan-routing-redundancy.md

Comment thread plans/completed/plan-routing-redundancy.md
Comment thread ToyDbRouting/Clients/ClusterClient.cs Outdated
Comment on lines +11 to +23
public ClusterClient(string address)
{
Address = address;
var handler = new HttpClientHandler
{
ServerCertificateCustomValidationCallback = HttpClientHandler.DangerousAcceptAnyServerCertificateValidator
};

var channel = GrpcChannel.ForAddress(address, new GrpcChannelOptions
{
HttpHandler = handler
});
_electionClient = new Election.ElectionClient(channel);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "ClusterClient.cs" | head -20

Repository: ryanachten/ToyDb

Length of output: 99


🏁 Script executed:

find . -type f -name "PartitionManager.cs" | head -20

Repository: ryanachten/ToyDb

Length of output: 103


🏁 Script executed:

cat -n ./ToyDbRouting/Clients/ClusterClient.cs

Repository: ryanachten/ToyDb

Length of output: 1783


🏁 Script executed:

cat -n ./ToyDbRouting/Services/PartitionManager.cs | head -100

Repository: ryanachten/ToyDb

Length of output: 4155


🏁 Script executed:

rg "new\s+(Clients\.)?ClusterClient" --type cs

Repository: ryanachten/ToyDb

Length of output: 177


🏁 Script executed:

rg "ClusterClient" --type cs

Repository: ryanachten/ToyDb

Length of output: 321


Make ClusterClient disposable and cache instances per address.

The constructor allocates HttpClientHandler and GrpcChannel but the class does not implement IDisposable/IAsyncDisposable, leaving these unmanaged resources unreleased. Since PartitionManager.DiscoverPrimaryReplicasAsync creates a new ClusterClient instance on each invocation within its background service loop (line 80), this causes a resource leak with handlers and channels accumulating indefinitely.

Implement IAsyncDisposable in ClusterClient to dispose the channel, and cache instances per address in PartitionManager to avoid repeated allocations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ToyDbRouting/Clients/ClusterClient.cs` around lines 11 - 23, ClusterClient
currently allocates an HttpClientHandler and a GrpcChannel in its constructor
but never disposes them; implement IAsyncDisposable (and IDisposable if you
prefer) on ClusterClient and in the DisposeAsync/Dispose method ensure you
dispose the HttpClientHandler and call DisposeAsync() on the GrpcChannel (or
equivalent channel shutdown) to release resources; then modify PartitionManager
to cache ClusterClient instances keyed by Address and have
DiscoverPrimaryReplicasAsync reuse the cached ClusterClient instead of creating
a new one each loop (and ensure cached clients are disposed on manager shutdown
or replaced safely), referencing ClusterClient, GrpcChannel, HttpClientHandler,
and PartitionManager.DiscoverPrimaryReplicasAsync in your changes.

Comment on lines +14 to +17
var handler = new HttpClientHandler
{
ServerCertificateCustomValidationCallback = HttpClientHandler.DangerousAcceptAnyServerCertificateValidator
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find and read the ClusterClient.cs file
find . -name "ClusterClient.cs" -type f

Repository: ryanachten/ToyDb

Length of output: 99


🏁 Script executed:

cat -n ./ToyDbRouting/Clients/ClusterClient.cs

Repository: ryanachten/ToyDb

Length of output: 1783


🏁 Script executed:

# Find and examine PartitionManager to understand usage context
find . -name "PartitionManager.cs" -type f

Repository: ryanachten/ToyDb

Length of output: 103


🏁 Script executed:

cat -n ./ToyDbRouting/Services/PartitionManager.cs

Repository: ryanachten/ToyDb

Length of output: 5592


Don't bypass cert validation in discovery traffic.

DangerousAcceptAnyServerCertificateValidator accepts any server certificate unconditionally. A MITM attacker could impersonate a replica and feed the router bogus role/leader data, corrupting primary discovery. Either keep this strictly dev-only with environment guards, or configure proper certificate validation using the expected CA or certificate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ToyDbRouting/Clients/ClusterClient.cs` around lines 14 - 17, The code in
ClusterClient.cs sets
HttpClientHandler.ServerCertificateCustomValidationCallback to
DangerousAcceptAnyServerCertificateValidator which disables TLS validation for
discovery; replace this by enforcing proper cert validation or gating the bypass
behind a dev-only flag. Update the handler setup in the ClusterClient
constructor (the handler variable) to either (a) validate the server certificate
chain/issuer or compare the server cert thumbprint/subject against an expected
value (certificate pinning) using ServerCertificateCustomValidationCallback, or
(b) only use DangerousAcceptAnyServerCertificateValidator when an explicit
dev/test env flag/config (e.g., "Environment" or "AllowInsecureDiscovery") is
set; ensure the default behavior performs normal TLS validation and load trusted
CA or client certs via HttpClientHandler.ClientCertificates or the system store.

Comment thread ToyDbRouting/Clients/ClusterClient.cs Outdated
Comment thread ToyDbRouting/Models/Partition.cs
Comment thread ToyDbRouting/Program.cs Outdated
Comment thread ToyDbRouting/Services/PartitionManager.cs Outdated
Comment on lines +117 to +129
public virtual void TriggerRediscovery(string partitionId)
{
_ = Task.Run(async () =>
{
try
{
await DiscoverPrimaryReplicasAsync(CancellationToken.None);
}
catch (Exception ex)
{
logger.LogError(ex, "Error triggering rediscovery for partition {PartitionId}", partitionId);
}
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "PartitionManager.cs" -type f

Repository: ryanachten/ToyDb

Length of output: 103


🏁 Script executed:

cat -n ./ToyDbRouting/Services/PartitionManager.cs

Repository: ryanachten/ToyDb

Length of output: 5592


🏁 Script executed:

rg "TriggerRediscovery" --type cs -A 3 -B 3

Repository: ryanachten/ToyDb

Length of output: 1785


🏁 Script executed:

rg "RoutingService" --type cs -B 3 -A 5 | grep -A 5 "TriggerRediscovery"

Repository: ryanachten/ToyDb

Length of output: 42


🏁 Script executed:

cat -n ./ToyDbRouting/Services/RoutingService.cs | head -100

Repository: ryanachten/ToyDb

Length of output: 4503


🏁 Script executed:

rg "FAILED_PRECONDITION" ./ToyDbRouting/Services/RoutingService.cs -B 5 -A 5

Repository: ryanachten/ToyDb

Length of output: 586


TriggerRediscovery is cluster-wide and unbounded.

RoutingService calls this with a concrete partition id on FAILED_PRECONDITION, but partitionId is never used in the method logic (only in the error log) and each failure spins a fire-and-forget Task.Run that rescans all partitions via DiscoverPrimaryReplicasAsync, using CancellationToken.None. A flapping partition can pile up concurrent cluster-wide discovery passes and keep them running during shutdown. Scope the work to the requested partition and coalesce in-flight triggers instead.

Additionally, move this public method before the private helper methods (InitializeReplicaClients, DiscoverPrimaryReplicasAsync, UpdatePartitionPrimary) per coding guidelines.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ToyDbRouting/Services/PartitionManager.cs` around lines 117 - 129,
TriggerRediscovery currently spins an unbounded fire-and-forget Task that
rescans all partitions with CancellationToken.None and only uses partitionId in
the log; change it to perform a scoped discovery for the single partitionId
(call a new or existing partition-scoped method instead of
DiscoverPrimaryReplicasAsync), avoid Task.Run by scheduling via a coalescing
mechanism (e.g., track in-flight triggers per partitionId using a
ConcurrentDictionary or similar to coalesce concurrent calls), accept or use a
cancellable token (propagate a CancellationToken to the discovery call) and
ensure the task is awaited or observably scheduled so shutdown can cancel it;
also move the public TriggerRediscovery method declaration above the private
helpers InitializeReplicaClients, DiscoverPrimaryReplicasAsync, and
UpdatePartitionPrimary to follow coding guidelines.

Comment on lines +22 to +27
ConsistentHashRing ring,
PartitionManager partitionManager
) : Routing.Routing.RoutingBase
{
private readonly ConsistentHashRing _ring = ring;
private readonly PartitionManager _partitionManager = partitionManager;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check Partition class for synchronization around PrimaryReplica
ast-grep --pattern $'class Partition {
  $$$
}'

# Also search for lock, volatile, or Interlocked usage in Partition
rg -n -C5 'lock\s*\(|volatile|Interlocked' --type=cs -g '*Partition*'

Repository: ryanachten/ToyDb

Length of output: 42


🏁 Script executed:

find . -type f -name "*.cs" | grep -E "(Partition|Routing)" | head -20

Repository: ryanachten/ToyDb

Length of output: 898


🏁 Script executed:

cat -n ToyDbRouting/Models/Partition.cs

Repository: ryanachten/ToyDb

Length of output: 2644


🏁 Script executed:

sed -n '70,115p' ToyDbRouting/Services/RoutingService.cs | cat -n

Repository: ryanachten/ToyDb

Length of output: 2019


🏁 Script executed:

sed -n '100,120p' ToyDbRouting/Services/PartitionManager.cs | cat -n

Repository: ryanachten/ToyDb

Length of output: 1132


🏁 Script executed:

wc -l ToyDbRouting/Services/RoutingService.cs

Repository: ryanachten/ToyDb

Length of output: 103


🏁 Script executed:

cat -n ToyDbRouting/Services/RoutingService.cs | sed -n '75,115p'

Repository: ryanachten/ToyDb

Length of output: 1824


🏁 Script executed:

cat -n ToyDbRouting/Services/PartitionManager.cs

Repository: ryanachten/ToyDb

Length of output: 5592


🏁 Script executed:

rg -n "lock|volatile|Interlocked|ReaderWriterLock" ToyDbRouting/Services/RoutingService.cs ToyDbRouting/Models/Partition.cs ToyDbRouting/Services/PartitionManager.cs

Repository: ryanachten/ToyDb

Length of output: 42


🏁 Script executed:

cat -n ToyDbRouting/Services/RoutingService.cs

Repository: ryanachten/ToyDb

Length of output: 10587


Add synchronization to Partition.PrimaryReplica to prevent data races between request and background threads.

RoutingService reads partition.PrimaryReplica on request threads (lines 82, 105) while PartitionManager.UpdatePartitionPrimary writes to it from a background thread (PartitionManager.cs:111-115). Without synchronization, there is a data race that could lead to visibility issues where readers don't see the latest assignment due to lack of memory barriers.

Use volatile on the backing field or add a lock around both the read and write paths. Alternatively, use Interlocked.Exchange for the assignment and ensure reads also respect the synchronization strategy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ToyDbRouting/Services/RoutingService.cs` around lines 22 - 27,
Partition.PrimaryReplica is accessed concurrently (read in RoutingService and
written in PartitionManager.UpdatePartitionPrimary) causing a data race; fix by
adding synchronization: either make the backing field for PrimaryReplica
volatile and update it via Interlocked.Exchange in
PartitionManager.UpdatePartitionPrimary (and read the volatile field in
RoutingService), or implement a dedicated lock object used by both the
Partition.PrimaryReplica setter/getter and by
PartitionManager.UpdatePartitionPrimary so reads in RoutingService (where
PrimaryReplica is read) and writes in UpdatePartitionPrimary use the same lock;
pick one approach and apply it consistently to eliminate the visibility/race
issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant