Benchmark: add statistical aggregation, throughput-window mode, and richer result reporting

## Background

`com.cinchapi.common.profile.Benchmark` is widely used across Cinchapi projects for measuring server-side and end-to-end timing. `cinchapi/concourse` alone has 17+ benchmark tests built on it, plus several test classes in `concourse-server` (e.g. `TMapsTest`, `DatabaseTest`, `SegmentTest`, `StoresTest`, `ByteableCollectionsTest`, `CompiledInfingramTest`).

The current API exposes only the **arithmetic mean** (`average(int)`) and **total elapsed** (`run(int)`). That limits its usefulness in noise-prone CI environments, where outliers dominate single-shot runs and a mean alone can be misleading. Concrete CI evidence: on the same commit, `OpsPerSecondTest.testVerifyAndSwap` reported between 1952 and 4557 ops/sec (a 2.3x spread) across four CircleCI shards.

This issue tracks improvements needed to make `Benchmark` a credible regression-detection primitive for the cross-version benchmark suite in `cinchapi/concourse`.

## Current API surface

- `Benchmark.run()` returns single-sample elapsed (`long`)
- `Benchmark.run(int n)` returns sum of `n` elapsed times (`long`)
- `Benchmark.average(int n)` returns arithmetic mean (`double`)
- Builder path: `measure(Runnable).in(TimeUnit).warmups(int).async()?.run() / run(n) / average(n)`

No median, no percentiles, no min/max, no stddev, no outlier rejection, no throughput-window mode, no rich result object.

## Gaps blocking better testing

1. **Statistical aggregation**: median (p50), p95, p99, min, max, stddev. A single mean is fragile under GC pauses on shared CI infrastructure; we routinely see 2-2.5x variance on the same code across CI shards.
2. **Throughput-window mode**: for tests that measure ops/sec over a fixed time window rather than latency over a fixed iteration count (e.g. Concourse's `TransactionThroughputTest`, `OpsPerSecondTest`, `AbstractTransporterThroughputTest`). These tests roll their own loop with `System.currentTimeMillis()` today.
3. **Rich result object**: a `BenchmarkResult` (or similar) carrying min/max/mean/median/percentiles/iterations/totalElapsed in one object, so callers do not have to run twice to print two stats.
4. **Outlier trimming**: drop the top/bottom k samples before aggregating (trimmed mean), a cheap defense against single-GC-pause outliers.
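
To illustrate why gaps 1 and 4 matter, here is a minimal sketch of median and trimmed-mean aggregation over raw samples. `SampleStats` and its method names are hypothetical and not part of the existing `Benchmark` API; the real implementation may aggregate differently.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the existing Benchmark API. */
final class SampleStats {

    private SampleStats() {}

    /** Middle value of the sorted samples, or the mean of the two middle values. */
    static double median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Mean after dropping the {@code low} smallest and {@code high} largest samples. */
    static double trimmedMean(long[] samples, int low, int high) {
        if (low < 0 || high < 0 || low + high >= samples.length) {
            throw new IllegalArgumentException("trim must leave at least one sample");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = low; i < sorted.length - high; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - low - high);
    }
}
```

For the samples `{10, 11, 12, 13, 500}` (one GC-pause-style outlier), the plain mean is 109.2, while both `median(...)` and `trimmedMean(..., 1, 1)` give 12.0.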

## Proposed API sketch

Existing API stays compatible. New chainable configuration on the builder:

```java
BenchmarkResult result = Benchmark.measure(() -> ...)
        .in(TimeUnit.MILLISECONDS)
        .warmups(3)
        .iterations(10)
        .reportPercentiles(50, 95, 99)
        .trimOutliers(1, 1)   // drop the 1 lowest and 1 highest sample
        .run();               // returns BenchmarkResult
```

New throughput-window mode for ops/sec measurement:

```java
BenchmarkResult result = Benchmark.measure(() -> ...)
        .in(TimeUnit.SECONDS)
        .warmups(2)
        .runFor(Duration.ofSeconds(10));
// result.throughput()  => ops/sec
// result.iterations()  => total ops completed
// result.totalElapsed() => actual wall time
```
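
One possible shape for the loop behind `runFor`, sketched under assumptions: `ThroughputWindow` and `Window` are hypothetical names, the clock is `System.nanoTime()` (monotonic, unlike the `System.currentTimeMillis()` loops the tests use today), and the window is checked only between operations, so the actual elapsed time can overshoot the requested duration by up to one operation.

```java
import java.time.Duration;

/** Hypothetical sketch of a fixed-window throughput loop; not the actual Benchmark code. */
final class ThroughputWindow {

    /** Completed operations and the wall time they actually took. */
    static final class Window {
        final long iterations;
        final long elapsedNanos;

        Window(long iterations, long elapsedNanos) {
            this.iterations = iterations;
            this.elapsedNanos = elapsedNanos;
        }

        /** Throughput based on actual elapsed time, not the requested window. */
        double opsPerSecond() {
            return iterations * 1_000_000_000.0 / elapsedNanos;
        }
    }

    private ThroughputWindow() {}

    /** Runs {@code action} repeatedly until at least {@code window} has elapsed. */
    static Window runFor(Runnable action, Duration window) {
        if (window.isZero() || window.isNegative()) {
            throw new IllegalArgumentException("window must be positive");
        }
        long windowNanos = window.toNanos();
        long start = System.nanoTime();
        long iterations = 0;
        long elapsed;
        do {
            action.run();
            iterations++;
            elapsed = System.nanoTime() - start;
        } while (elapsed < windowNanos);
        return new Window(iterations, elapsed);
    }
}
```

Dividing by the actual elapsed time rather than the requested window keeps the reported ops/sec honest when the final operation runs past the deadline.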

`BenchmarkResult` exposes:

- `min()`, `max()`, `mean()`, `median()`
- `percentile(double p)` for arbitrary percentile
- `stddev()`
- `iterations()`, `totalElapsed()`, `throughput()` (when `runFor` was used)
- `samples()` for raw per-iteration samples so callers can do their own analysis
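
A minimal sketch of what an immutable result object along these lines might look like. The accessor names come from the list above; everything else is an assumption made for illustration: the constructor shape, the nearest-rank percentile method, and the use of the sample (n - 1) standard deviation. `median()` and `throughput()` are omitted for brevity.

```java
import java.util.Arrays;

/** Illustrative sketch only; internals are assumptions, not the final design. */
final class BenchmarkResult {

    private final long[] samples; // per-iteration samples in run order
    private final long[] sorted;  // sorted copy used for order statistics
    private final long totalElapsed;

    BenchmarkResult(long[] samples, long totalElapsed) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        this.samples = samples.clone(); // defensive copy keeps the object immutable
        this.sorted = samples.clone();
        Arrays.sort(this.sorted);
        this.totalElapsed = totalElapsed;
    }

    long min() { return sorted[0]; }

    long max() { return sorted[sorted.length - 1]; }

    int iterations() { return samples.length; }

    long totalElapsed() { return totalElapsed; }

    long[] samples() { return samples.clone(); }

    double mean() { return Arrays.stream(samples).average().getAsDouble(); }

    /** Nearest-rank percentile for p in [0, 100]; always returns an observed sample. */
    long percentile(double p) {
        if (p < 0 || p > 100) {
            throw new IllegalArgumentException("p must be in [0, 100]");
        }
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Sample standard deviation (n - 1 denominator); 0 for a single sample. */
    double stddev() {
        if (samples.length < 2) {
            return 0;
        }
        double mean = mean();
        double sumSquares = 0;
        for (long s : samples) {
            sumSquares += (s - mean) * (s - mean);
        }
        return Math.sqrt(sumSquares / (samples.length - 1));
    }
}
```

Nearest-rank is only one of several percentile definitions; an interpolating method would return values between samples, so whichever definition is chosen should be stated in the Javadoc.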

## Implementation Plan

1. Introduce `BenchmarkResult` with the fields above. Keep it immutable.
2. Add `iterations(int)`, `reportPercentiles(int...)`, `trimOutliers(int low, int high)`, `runFor(Duration)` to the builder.
3. Update the abstract-class instance API (`Benchmark` subclassing path) to support the same configuration via setter methods or by routing through the builder internally.
4. Make warmup runs explicitly excluded from the result aggregation (today `warmups(int)` exists on the builder but its semantics relative to `average(int)` are not documented).
5. Add unit tests that:
   - Verify percentile math on a known sample distribution
   - Verify `runFor(Duration)` runs at least N iterations within tolerance and reports the right throughput
   - Verify warmup samples are excluded from the result
   - Verify trimmed-mean math drops the requested top/bottom k samples
6. Update Javadoc to describe the new contract end-to-end.
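
Step 4 and the warmup-exclusion test could be approached roughly as follows. This is a sketch under assumptions, not the planned implementation: `WarmupExclusion.measure` is a hypothetical stand-in for the measurement loop, showing warmup invocations running untimed and never entering the sample array.

```java
/** Hypothetical stand-in for the measurement loop; not the actual Benchmark code. */
final class WarmupExclusion {

    private WarmupExclusion() {}

    /**
     * Runs {@code warmups} untimed invocations, then {@code iterations} timed
     * ones, and returns one elapsed-nanoseconds sample per timed invocation.
     */
    static long[] measure(Runnable action, int warmups, int iterations) {
        for (int i = 0; i < warmups; i++) {
            action.run(); // warmup: executed but neither timed nor recorded
        }
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            action.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }
}
```

A unit test can then count invocations: with 3 warmups and 10 iterations the action should run 13 times while the result carries exactly 10 samples.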

## Acceptance Criteria

- [ ] `BenchmarkResult` exposes min, max, mean, median, percentiles (including p95 and p99 via `percentile(double)`), stddev, iterations, totalElapsed, throughput, samples
- [ ] Builder gains `iterations(int)`, `reportPercentiles(int...)`, `trimOutliers(int low, int high)`, `runFor(Duration)`
- [ ] Existing `run()`, `run(int)`, `average(int)` continue to compile and behave unchanged
- [ ] Warmup runs are documented as excluded from result aggregation
- [ ] Unit tests cover percentile math, throughput-window mode, warmup exclusion, and outlier trimming

## Out of scope (file separately if pursued)

- Forked-JVM isolation (JMH-style)
- Multi-thread workload generation
- JIT dead-code-elimination prevention via Blackhole-equivalent
- Annotation-based parameterization (`@Param`)

## Caller using this work

`cinchapi/concourse` will depend on this in its perf-test hardening effort.

