roachperf: regression on kv0/enc=false/nodes=3/cpu=32, aws but not gce #96800

`kv0/enc=false/nodes=3/cpu=32` and a few related variants regressed on Feb 4th:

<img width="795" alt="Screen Shot 2023-02-08 at 11 41 16 AM" src="https://user-images.githubusercontent.com/5438456/217594515-ae78445d-6fc3-49de-befc-d2375f6af7a5.png">

This regression is only observed on AWS and not on GCE. It's also only observed on the 32 vCPU variant and not on the 8 vCPU variant.

I've determined that this was a result of https://github.com/cockroachdb/cockroach/pull/94165. When I run the benchmark and switch the `kv.raft_log.non_blocking_synchronization.enabled` cluster setting midway through to disable async storage writes, throughput increases.

<img width="550" alt="Screen Shot 2023-02-08 at 11 40 13 AM" src="https://user-images.githubusercontent.com/5438456/217594256-321db216-e704-4556-bc75-b0cce1b2c095.png">

Interestingly, log commit latency also drops.

<img width="990" alt="Screen Shot 2023-02-08 at 11 45 38 AM" src="https://user-images.githubusercontent.com/5438456/217595681-85f6d05d-2ca3-468a-8bfe-017ad5ff1dfe.png">

One possibility is that fsyncs on AWS instance stores (with `nobarrier`) are fast enough that we are exceeding the Pebble `SyncConcurrency` of 512. This would cause the async log writes to become synchronous. I don't know why this would be worse than the non-async storage write configuration. One thought is that we are paying the overhead of the asynchronous write path (two goroutine hops) without getting its benefit, because we still block before entering it, and so we see the throughput regression.

Another thought is that async storage writes reduce interference between writes on the same range at the cost of reduced batching of those writes. In a system where the p50 fsync latency is **0.10ms**, is this the right trade-off?

Next steps:
- experiment with a larger `SyncConcurrency`
- experiment with `nobarrier` disabled
- experiment with EBS volumes

Jira issue: CRDB-24341
