Benchmark: add statistical aggregation, throughput-window mode, and richer result reporting #33

Description

@jtnelson

Background

com.cinchapi.common.profile.Benchmark is widely used across Cinchapi projects for measuring server-side and end-to-end timing. cinchapi/concourse alone has 17+ benchmark tests built on it, plus several test classes in concourse-server (e.g. TMapsTest, DatabaseTest, SegmentTest, StoresTest, ByteableCollectionsTest, CompiledInfingramTest).

The current API exposes only the arithmetic mean (average(int)) and the total elapsed time (run(int)). That limits its usefulness in noise-prone CI environments, where outliers dominate single-shot runs and a single mean can mislead. Concrete CI evidence: OpsPerSecondTest.testVerifyAndSwap reported 1952 to 4557 ops/sec (a 2.3x spread) across four CircleCI shards running the same Concourse commit.

This issue tracks improvements needed to make Benchmark a credible regression-detection primitive for the cross-version benchmark suite in cinchapi/concourse.

Current API surface

  • Benchmark.run() returns the elapsed time of a single sample (long)
  • Benchmark.run(int n) returns the sum of n elapsed times (long)
  • Benchmark.average(int n) returns the arithmetic mean of n elapsed times (double)
  • Builder path: measure(Runnable).in(TimeUnit).warmups(int).async()?.run() / run(n) / average(n)

No median, no percentiles, no min/max, no stddev, no outlier rejection, no throughput-window mode, no rich result object.

Gaps blocking better testing

  1. Statistical aggregation — median (p50), p95, p99, min, max, stddev. A single mean is fragile under GC pauses on shared CI infrastructure; we routinely see 2-2.5x variance on the same code across CI shards.
  2. Throughput-window mode — for tests that measure ops/sec over a fixed time window rather than latency over a fixed iteration count (e.g. Concourse's TransactionThroughputTest, OpsPerSecondTest, AbstractTransporterThroughputTest). These tests roll their own loop with System.currentTimeMillis() today.
  3. Rich result object — a BenchmarkResult (or similar) carrying min/max/mean/median/percentiles/iterations/totalElapsed in one object so callers do not have to run twice to print two stats.
  4. Outlier trimming — drop top/bottom k samples before aggregating (trimmed mean), a cheap defense against single-GC-pause outliers.
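
To make gaps 1 and 4 concrete, here is a minimal sketch of percentile and trimmed-mean math over raw samples. SampleStats and its method names are hypothetical and exist only for illustration; the nearest-rank percentile definition is an assumption, since this issue does not specify an interpolation method.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the existing Benchmark API.
final class SampleStats {

    // Nearest-rank percentile (an assumed definition): the smallest sample
    // such that at least p percent of all samples are <= it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    // Plain arithmetic mean, for comparison with the robust statistics.
    static double mean(long[] samples) {
        double sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    // Mean after dropping the `low` smallest and `high` largest samples.
    static double trimmedMean(long[] samples, int low, int high) {
        if (low < 0 || high < 0 || low + high >= samples.length) {
            throw new IllegalArgumentException("trim would discard every sample");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = low; i < sorted.length - high; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - low - high);
    }

    public static void main(String[] args) {
        // Ten made-up samples with one GC-pause-style outlier (250).
        long[] millis = {10, 11, 12, 11, 10, 250, 12, 11, 10, 13};
        System.out.println(mean(millis));              // 35.0
        System.out.println(percentile(millis, 50));    // 11
        System.out.println(percentile(millis, 95));    // 250
        System.out.println(trimmedMean(millis, 1, 1)); // 11.25
    }
}
```

With these made-up samples, a single outlier pulls the mean to 35.0 while the median stays at 11 and the trimmed mean at 11.25, which is the fragility described in gap 1.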

Proposed API sketch

Existing API stays compatible. New chainable configuration on the builder:

BenchmarkResult result = Benchmark.measure(() -> ...)
        .in(TimeUnit.MILLISECONDS)
        .warmups(3)
        .iterations(10)
        .reportPercentiles(50, 95, 99)
        .trimOutliers(1, 1)   // drop bottom 1 and top 1 (low, high)
        .run();               // returns BenchmarkResult

New throughput-window mode for ops/sec measurement:

BenchmarkResult result = Benchmark.measure(() -> ...)
        .in(TimeUnit.SECONDS)
        .warmups(2)
        .runFor(Duration.ofSeconds(10));
// result.throughput()  => ops/sec
// result.iterations()  => total ops completed
// result.totalElapsed() => actual wall time
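
For reference, the kind of loop that runFor(Duration) would replace can be sketched as below. ThroughputWindow is a hypothetical stand-in, not the proposed implementation. It assumes a monotonic clock (System.nanoTime()) and that the action always completes its current invocation before the deadline is checked, so the actual elapsed time can slightly exceed the requested window.

```java
import java.time.Duration;

// Hypothetical stand-in for the proposed throughput-window mode.
final class ThroughputWindow {
    final long iterations;   // total invocations completed
    final long elapsedNanos; // actual wall time, which may exceed the window

    ThroughputWindow(long iterations, long elapsedNanos) {
        this.iterations = iterations;
        this.elapsedNanos = elapsedNanos;
    }

    // Throughput is computed from the actual elapsed time, not the requested window.
    double opsPerSecond() {
        return iterations * 1_000_000_000.0 / elapsedNanos;
    }

    // Invokes the action repeatedly until the window has elapsed.
    static ThroughputWindow runFor(Runnable action, Duration window) {
        long budget = window.toNanos();
        long start = System.nanoTime();
        long count = 0;
        long elapsed;
        do {
            action.run();
            count++;
            elapsed = System.nanoTime() - start;
        } while (elapsed < budget);
        return new ThroughputWindow(count, elapsed);
    }

    public static void main(String[] args) {
        ThroughputWindow w = runFor(() -> Math.sqrt(42.0), Duration.ofMillis(100));
        System.out.println(w.iterations + " ops, " + w.opsPerSecond() + " ops/sec");
    }
}
```

Dividing by the measured elapsed time rather than the nominal window keeps the reported throughput honest when the last invocation overruns the deadline.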

BenchmarkResult exposes:

  • min(), max(), mean(), median()
  • percentile(double p) for arbitrary percentile
  • stddev()
  • iterations(), totalElapsed(), throughput() (when runFor was used)
  • samples() for raw per-iteration samples so callers can do their own analysis
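
One possible shape for such a result object is sketched below. BenchmarkResultSketch is a hypothetical name, and several choices here are assumptions rather than decisions made in this issue: totalElapsed is taken as the sum of the samples, stddev uses the sample (n - 1) denominator, and the median of an even-sized set averages the two middle values.

```java
import java.util.Arrays;

// Hypothetical immutable result holder; names and semantics are assumptions.
final class BenchmarkResultSketch {
    private final long[] samples; // original per-iteration order
    private final long[] sorted;  // sorted copy used for order statistics
    private final long totalElapsed;

    BenchmarkResultSketch(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        this.samples = samples.clone(); // defensive copy keeps the object immutable
        this.sorted = samples.clone();
        Arrays.sort(this.sorted);
        long sum = 0;
        for (long s : samples) {
            sum += s;
        }
        this.totalElapsed = sum; // assumption: total is the sum of the samples
    }

    int iterations() { return samples.length; }
    long totalElapsed() { return totalElapsed; }
    long min() { return sorted[0]; }
    long max() { return sorted[sorted.length - 1]; }
    double mean() { return (double) totalElapsed / samples.length; }

    double median() {
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Sample standard deviation (n - 1 denominator); 0 for a single sample.
    double stddev() {
        if (samples.length < 2) {
            return 0.0;
        }
        double mean = mean();
        double squares = 0;
        for (long s : samples) {
            squares += (s - mean) * (s - mean);
        }
        return Math.sqrt(squares / (samples.length - 1));
    }

    // Returns a copy so callers cannot mutate the stored samples.
    long[] samples() { return samples.clone(); }

    public static void main(String[] args) {
        BenchmarkResultSketch r = new BenchmarkResultSketch(new long[] {2, 4, 4, 4, 5, 5, 7, 9});
        System.out.println(r.min() + " " + r.max() + " " + r.mean() + " " + r.median());
    }
}
```

Because every statistic is derived from one captured sample set, a caller can print several of them without running the benchmark twice.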

Implementation Plan

  1. Introduce BenchmarkResult with the fields above. Keep it immutable.
  2. Add iterations(int), reportPercentiles(int...), trimOutliers(int low, int high), runFor(Duration) to the builder.
  3. Update the abstract-class instance API (Benchmark subclassing path) to support the same configuration via setter methods or by routing through the builder internally.
  4. Explicitly exclude warmup runs from the result aggregation (today warmups(int) exists on the builder, but its semantics relative to average(int) are not documented).
  5. Add unit tests that:
    • Verify percentile math on a known sample distribution
    • Verify runFor(Duration) runs at least N iterations within tolerance and reports the right throughput
    • Verify warmup samples are excluded from the result
    • Verify trimmed-mean math drops the requested top/bottom k samples
  6. Update Javadoc to describe the new contract end-to-end.
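
The warmup-exclusion contract from step 4, and the unit test that checks it, could look roughly like this. WarmupRunner is a hypothetical illustration of the intended semantics and says nothing about how Benchmark handles warmups today.

```java
// Hypothetical illustration of warmup exclusion; not the existing Benchmark code.
final class WarmupRunner {

    // Runs the action `warmups` times without recording anything, then records
    // one elapsed-nanoseconds sample for each of the `iterations` measured runs.
    static long[] measure(Runnable action, int warmups, int iterations) {
        for (int i = 0; i < warmups; i++) {
            action.run(); // executed, but deliberately neither timed nor stored
        }
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            action.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }

    public static void main(String[] args) {
        long[] samples = measure(() -> Math.sqrt(42.0), 3, 10);
        System.out.println(samples.length + " measured samples");
    }
}
```

A test can then assert that the action ran warmups + iterations times while only iterations samples reached the result.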

Acceptance Criteria

  • BenchmarkResult exposes min, max, mean, median, percentiles (including p95 and p99), stddev, iterations, totalElapsed, throughput, samples
  • Builder gains iterations(int), reportPercentiles(int...), trimOutliers(int low, int high), runFor(Duration)
  • Existing run(), run(int), average(int) continue to compile and behave unchanged
  • Warmup runs are documented as excluded from result aggregation
  • Unit tests cover percentile math, throughput-window mode, warmup exclusion, and outlier trimming

Out of scope (file separately if pursued)

  • Forked-JVM isolation (JMH-style)
  • Multi-thread workload generation
  • JIT dead-code-elimination prevention via Blackhole-equivalent
  • Annotation-based parameterization (@Param)

Caller using this work

cinchapi/concourse will depend on this in its perf-test hardening effort.
