Add energy measurement support (RAPL + iDRAC) #473
holly-cummins wants to merge 10 commits into quarkusio:main
Conversation
Force-pushed from 3df8246 to 0480ec4
Add RAPL (CPU-level) and iDRAC (system-level) power measurement to the measure-rss benchmark phase. Energy sources are controlled by per-source config flags (config.energy.rapl, config.energy.idrac) with a single --energy CLI flag accepting comma-separated values (all, none, rapl, idrac, or rapl,idrac).

- New helpers/energy.yml with install-rapl-plot and capture-system-power
- Hardware guards skip unsupported measurements silently (safe on any platform)
- All sh/regex patterns use then blocks for qDup 0.11 compatibility
- Uses sudo script instead of sh: sudo
- Energy file downloads queued early for resilience
- Per-iteration average watts written to JSON output with cross-iteration averages
- Auto-adds measure-rss to the test list when --energy is explicitly set

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 0480ec4 to 460dd60
Use IFS-based splitting so comma-separated values are parsed correctly, and use a case statement for exact-match validation instead of regex substring matching. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
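The IFS-plus-case pattern described above can be sketched roughly like this (variable names here are illustrative, not necessarily the ones the PR uses):

```shell
#!/bin/sh
# Parse a comma-separated --energy value, e.g. "rapl,idrac".
ENERGY_ARG="rapl,idrac"

RAPL_ENABLED=false
IDRAC_ENABLED=false

# Split on commas with IFS instead of regex substring matching,
# so a value like "rapl" cannot accidentally match inside a longer token.
OLD_IFS="$IFS"
IFS=','
for value in $ENERGY_ARG; do
  case "$value" in
    all)   RAPL_ENABLED=true; IDRAC_ENABLED=true ;;
    none)  RAPL_ENABLED=false; IDRAC_ENABLED=false ;;
    rapl)  RAPL_ENABLED=true ;;
    idrac) IDRAC_ENABLED=true ;;
    *) echo "Unknown energy source: $value" >&2; exit 1 ;;
  esac
done
IFS="$OLD_IFS"

echo "rapl=$RAPL_ENABLED idrac=$IDRAC_ENABLED"
```

The case statement gives exact-match validation for free: any unrecognized token fails loudly rather than being silently ignored.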
rapl-plot runs indefinitely and must be explicitly terminated. Store its PID via set-state and sudo kill it before parsing the output file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
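A minimal sketch of that start/record-PID/kill pattern (with `sleep` standing in for rapl-plot; in the real qDup script the PID is stored via set-state and the kill needs sudo because rapl-plot runs as root):

```shell
#!/bin/sh
# Start a long-running capture process in the background,
# remember its PID, then terminate it before parsing the output.
OUTPUT=/tmp/rapl-demo.out

sleep 300 > "$OUTPUT" 2>&1 &
CAPTURE_PID=$!            # qDup would store this via set-state

# ... benchmark iteration runs here ...

kill "$CAPTURE_PID"       # real script: sudo kill, since rapl-plot is sudo'd
wait "$CAPTURE_PID" 2>/dev/null || true

echo "stopped capture pid $CAPTURE_PID"
```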
I ran this locally. It looks like RAPL worked, but the iDRAC didn't. I never see the regex check for … I think the problem is with … The echo never happens because the console output is redirected to … Also, what's the purpose of the …? It also looks like it's running that same command in multiple places? Additionally, in capture-system-power:
- script: idrac-login
with:
idrac: ${{= ${{servers}}.filter( server => { return server.hostname == "${{env.HOST}}"})[0].idrac}}
    password: ${{idrac_pwd}}

Where are servers and idrac_pwd set?
This may just be coincidence, but I ran both the Quarkus and Spring tests and got the exact same energy values for both:

"energy": {
"raplPkg": [
0.44
],
"raplDram": [
0.53
],
"avRaplPkg": 0.44,
"avRaplDram": 0.53
}

Full results:

{
"timing": {
"start": "2026-04-17T18:45:16Z",
"stop": "2026-04-17T18:50:15Z"
},
"results": {
"quarkus3-jvm": {
"build": {
"timings": [
9.21
],
"avBuildTime": 9.21
},
"startup": {
"timings": [
2494.544875
],
"avStartTime": 2494.544875
},
"rss": {
"startup": [
270.3828125
],
"firstRequest": [
288.33984375
],
"avStartupRss": 270.3828125,
"avFirstRequestRss": 288.33984375
},
"energy": {
"raplPkg": [
0.44
],
"raplDram": [
0.53
],
"avRaplPkg": 0.44,
"avRaplDram": 0.53
}
},
"spring4-jvm": {
"build": {
"timings": [
3.91
],
"avBuildTime": 3.91
},
"startup": {
"timings": [
6873.358205
],
"avStartTime": 6873.358205
},
"rss": {
"startup": [
610.58203125
],
"firstRequest": [
636.1015625
],
"avStartupRss": 610.58203125,
"avFirstRequestRss": 636.1015625
},
"energy": {
"raplPkg": [
0.44
],
"raplDram": [
0.53
],
"avRaplPkg": 0.44,
"avRaplDram": 0.53
}
}
},
"config": {
"jvm": {
"args": "-XX:+UseNUMA",
"memory": "-Xms512m -Xmx512m",
"graalvm": {
"version": "25.0.2-graalce",
"home": ""
},
"version": "25.0.2-tem",
"home": ""
},
"num_iterations": 1,
"quarkus": {
"build_config_args": "",
"native_build_options": "",
"version": "3.34.3"
},
"repo": {
"short_commit": "60f1a44",
"scenario": "tuned",
"commit": "60f1a446fd208a2fdf7916f6bb716274227a8fb0",
"branch": "lab-energy-measurement",
"scenarioName": "Tuned",
"url": "https://github.com/holly-cummins/spring-quarkus-perf-comparison"
},
"springboot3": {
"native_build_options": "",
"version": "3.5.13"
},
"springboot4": {
"native_build_options": "",
"version": "4.0.5"
},
"resources": {
"cpu": {
"app": "0-3",
"1st_request": 10,
"otel": "7-9",
"monitor": 13,
"load_generator": "10-12",
"db": "4-6"
},
"app_cpus": 4
},
"run": {
"dropOsFilesystemCaches": "true",
"description": "energy",
"useContainerHostNetwork": "true"
},
"units": {
"rss": {
"load": "MiB",
"startup": "MiB",
"firstRequest": "MiB"
},
"load": {
"throughputDensity": "tps per MiB",
"throughput": "tps",
"errors": {
"connectionErrors": "absolute number",
"requestTimeouts": "absolute number"
}
},
"timings": {
"build": "sec",
"startup": "ms"
},
"energy": {
"raplDram": "W",
"raplPkg": "W",
"idrac": "W"
}
},
"profiler": {
"name": "none",
"events": "cpu"
},
"energy": {
"rapl": "enabled",
"idrac": "enabled"
}
},
"env": {
"host": {
"memory": "62Gi",
"os": "Fedora Linux 43 (Workstation Edition)",
"kernel": "6.19.10-200.fc43.x86_64",
"cpu": "Intel(R) Core(TM) Ultra 7 165H (22 cpus)",
"type": "LENOVO 21KWS49V00",
"gpu": "Intel Corporation Meteor Lake-P [Intel Arc Graphics]"
}
}
}
As pointed out by @edeandrea, maybe something is off, but I still think we should compute the Joules from the energy drawn (which means the startup time matters), and eventually move to https://greensoftware.foundation/articles/gps-up-a-better-metric-for-comparing-software-energy-efficiency/ too
Add idrac-login qDup script that configures remote racadm access via shell function wrapping, with connectivity validation via getsysinfo. Replace the duplicated command -v racadm + IDRAC_READY guard pattern (repeated 3 times in measure-rss) with a single validate-idrac setup script that stores availability in RUN.IDRAC_AVAILABLE state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude says
It agrees with you. :)
I've fixed those points. I've also put in a best guess at a …
You're totally right. I'm a bit sceptical about making GPS-UP our primary metric because we want to be able to communicate to people without them having to go read a document. I think it's maybe a bit too specialised for this use case, too – the question we're asking isn't "I want to evaluate an optimisation," but more "I want to know what my energy bill might be."

But I'd got muddled about what the RSS test was doing, which may be partly why @edeandrea's results make no sense. We know, from both local measurement and our earlier publications, there's an energy difference. I'd imagined the RSS test was running for a fixed time, and taking a mean or max RSS after a warmup period. If that was the case, average and total energies would be strictly proportional, and it wouldn't really matter which I measured. (Technically, the average would be a power, not an energy, but that's a detail. :) ) But on looking more closely, I see that the RSS test only runs until the first request comes back. So that means …

So we could add a whole other test, or we could extend the RSS test if energy is enabled. I'll go for the 're-use but extend' option, and switch to measuring total.
Add configurable duration (--energy-duration, default 150s matching load test) to keep energy capture running after RSS is measured. When --energy none, duration is set to 0 so no time is wasted. Change energy metrics from average power (W) to total energy (J): iDRAC sums watt-readings instead of averaging, RAPL awk sums instead of averaging, and units updated from W to J. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
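For reference, the W-to-J conversion relies on the sampling interval: with one power sample per second, each watt reading approximates the joules consumed during that second, so total energy is just the sum. A minimal sketch, assuming a file format of one watt value per line (the actual .rapl/svrPwr formats may differ):

```shell
#!/bin/sh
# Sum per-second watt samples into total joules.
printf '27.1\n28.4\n26.9\n' > /tmp/watts.txt

awk '{ total += $1 } END { printf "%.1f J\n", total }' /tmp/watts.txt
# prints "82.4 J"
```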
I kept looking for this comment from @zakkak, thinking it was on this PR, but it's elsewhere. Pasting it here since it's relevant (but probably an advanced iteration of the basics done here): #387 (comment)
Remove remote iDRAC access (host/user/password parameters, shell function wrapper). Local racadm communicates with iDRAC directly on Dell servers without network configuration or credentials. This also addresses the reviewer question about where servers and idrac_pwd were set — they're no longer needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FYI @holly-cummins I re-ran this and I'm still getting the exact same values for both Quarkus & Spring:

"energy": {
"raplPkg": [
27
],
"raplDram": [
33
],
"avRaplPkg": 27,
"avRaplDram": 33
}
Is the …? Has enough happened to know how much energy was consumed? Shouldn't energy measurement have a time component to it? All of the tests in … Or should energy measurement be a component of each test (time to first request, load test, rss, build time, etc.)? I would think that energy expended during load testing could be interesting. If an app can handle lots of requests, but we find that the app threads are "asleep" waiting on CPU, maybe the app uses less energy? I'm also not sure what …
Yes, definitely. It's supposed to be putting load on the system for 150s, same duration as the normal tests. But I suppose it's possibly just letting the system run for 150s, with no load. That would be rather sad, but would explain the strange results.
Oh dear.
I think energy testing needs to be done with fixed load. Well, that's not quite true. It possibly would be interesting to report energy for all of the tests, because if we can handle 2x throughput with less energy, that's very interesting, but I think on the "vary one thing at a time" principle that @franz1981 has drilled into us, the 'best' energy measurement is done at fixed throughput, so it's an "all other things are equal" test.
I kind of assumed qDup had two threads so that thread was just waiting for the load to execute. But maybe that's actually not at all what it's doing, and it's just wasting time. :(
There's no load on the system during that 150s. The curl has already happened. The app is just sitting there idle without anything hitting it.
Interestingly, it's actually easier to get the max TPS test right than the fixed throughput one. By saturating the CPU, we're effectively running at a fixed hardware load (100% utilization) and just measuring the work done (Requests per Joule). Not to mention, if we use a fixed TPS, we run into a sneaky issue: if one framework is much more efficient and uses far fewer resources, it will spend more time sleeping. In that scenario, the energy cost of constant CPU wakeups could actually end up being higher than the cost of handling the requests that originated them! So we have to think about this carefully before we commit entirely to fixed TPS for energy baselines. I know that I am contradicting myself here, but for scientific reasons I cannot ignore the challenges of testing each scenario 🙏
Since I'm doing such an awesome job of getting one scenario right, I think doing two scenarios is an excellent idea. :) (Joking aside, I agree, let's experiment and think. In that order, probably. :) )
Two new test scripts: - measure-energy-max-throughput: captures energy during the 30s steady-state wrk run at max throughput (after 2m warmup) - measure-energy-fixed-throughput: captures energy during a configurable-duration wrk2 run at a fixed request rate (default 5000 tps for 150s, via --fixed-throughput-rate and --fixed-throughput-duration) Extract energy capture into reusable qDup helper scripts (start-energy-capture, stop-energy-capture, calc-energy-totals) and refactor measure-rss to use them. Remove the broken 150s energy sleep from measure-rss since idle measurement has no load. Energy metrics are reported as total joules, not average watts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Energy at max throughput is now captured during the existing load test instead of requiring a separate test job. Removes the standalone measure-energy-max-throughput script and strips energy capture from measure-rss where it added no value (no load). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix state variable naming consistency, remove stale measure-rss force-add, merge double-loop in --energy parsing, add input validation for fixed-throughput options, add java -version and monitor-processes to fixed-throughput test, fix wrk timeout, and compute energy-per-request (J/req) for both iDRAC and RAPL in test output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
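For context, the energy-per-request figure can be computed as total joules over the measurement window divided by the requests served during it. A hypothetical sketch (the numbers and variable names here are made up, not taken from the PR):

```shell
#!/bin/sh
# Energy per request = total energy (J) / requests served.
TOTAL_JOULES=8250        # e.g. summed from per-second power samples
TOTAL_REQUESTS=750000    # e.g. parsed from wrk/wrk2 output

awk -v j="$TOTAL_JOULES" -v r="$TOTAL_REQUESTS" \
  'BEGIN { printf "%.4f J/req\n", j / r }'
# prints "0.0110 J/req"
```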
@holly-cummins the energy capture - is it looking at a specific process? Or is it looking at the system as a whole during a period of time? If it's capturing the total power consumption of the system, wouldn't that capture both the app under load and the load generator (plus database and other things running)?
@franz1981, who would be triggering these "constant CPU wakeups" the framework itself or the energy measurement harness? If it's the framework itself then it sounds fair game to me. The CPU wakeups should be priced in...
@holly-cummins Tons is a bit scary on the y-axis :D
+1, it's pretty nice though so it might be worth running on each new feature release (once a month?). How long does it take to run?
The originating cause is: network interrupts due to new requests arriving (+ a causal dependency between thread pools due to the Quarkus threading model). But the actual motivation is that each of the involved thread pools (Netty and any workers) could have already completed their job, so they have a higher chance to go park and be unparked. The slower framework, instead, is kept awake, making it appear as efficient (or more?) depending on the machine's C-state settings. That's why it is dangerous to measure at just a single fixed TPS; you would like to measure at different loads, exactly like measuring the efficiency of cars under different scenarios (types of roads, different speeds, acceleration/deceleration, etc.).
@edeandrea and I were talking the other day, and the way our test is set up may make energy measurement in the max-load case non-interesting. I'd still like to collect it, but I don't think we can draw very many conclusions from it. Because RAPL collects whole-system power, and because the DBMS and load generator are co-located on the system, we'd end up measuring them as well. If their work is (relatively) constant, we can just treat it as a baseline in the comparison. But if Quarkus is able to handle more load, the load generator and DBMS will be doing more work, and consuming more power. We wouldn't be able to draw conclusions about the efficiency of the frameworks. So it sounds like we'll want to be doing a few different loads as our 'default' measurement, although I'd also like to report a 'slice through' metric. In our old data, we did see a fairly linear relationship between load and energy over wide ranges
... but less linear in small ranges, which I guess would be consistent with @franz1981's concerns.
Good, so it's part of the framework (or its underlying dependencies) in which case it's fair to measure it. No?
@franz1981 I am confused, why would an awake CPU be more efficient than a sleeping one that's being woken up?
I totally agree on that.
@holly-cummins is there no option to run the load generator and DBMS on a different system? It would sure add some overhead which would drop the peak throughput, but would make the results more realistic and representative IMHO (no matter if using a fixed throughput or going for peak). Keeping the DBMS in the mix might be OK, but the load generator doesn't make much sense.
It depends — it's more a timing artifact than a framework property. Ideally you'd measure each framework at its own 75% load (which differs between them), or size the more efficient one to match the peak throughput of the slower one, and then run both at the same fixed TPS. That makes the comparison meaningful in terms of capacity served.
It's not more efficient in steady state — but it appears so under fixed TPS measurement. With turboboost disabled and a stable P-state, a warm CPU draws a predictable, moderate amount of energy. A CPU that enters deep C-states between requests pays a wakeup tax on every request arrival: PLL relock, cache warm-up, a potential turbo burst. That cost gets attributed to the efficient framework precisely because it finishes fast enough to sleep. The slow framework never parks, avoids all that overhead, and ends up looking comparably efficient — or better — depending on the machine's C-state configuration.
Hmm, I find energy per request more interesting/representative. Running at 75% capacity doesn't say much when the capacity is not the same. Service providers probably have a rough estimate of how many requests they anticipate in a given period and are probably interested in minimizing the total cost of serving those requests while meeting some performance criteria. So I would expect them to be interested in …
I think this is tricky as well. Since both frameworks run on the same machine, the one stressing the machine will be consuming more power than the other. At the same time, the "throttled" framework, despite consuming less energy overall for the same throughput, will have a worse …
Sure, but that's OK IMHO if energy measurement is done end-to-end. On the other hand, if energy measurement is done by periodically sampling power consumption, then the non-steady state might indeed skew things... In the case of end-to-end measurements, though, I would still expect the start/stop to result in better energy consumption in most cases, just like my car is still expected to burn less fuel overall if it turns itself off at the traffic lights. To take a step back, what question are we trying to answer with these numbers?
Traffic lights are periodic and predictable — HTTP load isn't. wrk2 generates Poisson arrivals: inter-arrival times are exponentially distributed (CV=1), so even at a fixed mean rate you get bursts and gaps that the Linux idle governor (…) has to predict. Each C-state has a target residency — a minimum dwell time for the sleep savings to outweigh the entry/exit overhead (C6: ~100–300µs). The governor uses timer deadlines and recent idle history to pick the right state, but under high-variance arrivals the misprediction rate increases: it may commit to a deeper state and get pulled out before residency, paying the transition cost with reduced savings. This doesn't necessarily flip the result — race-to-idle is still the dominant design philosophy of modern CPUs, and a busy framework burning dynamic power continuously isn't free either. But it does mean the efficiency gap measured at fixed TPS can be smaller than the raw throughput ratio suggests, and the magnitude is hardware, kernel-tuning, and load-point dependent. That's why measuring at a single fixed TPS is risky — the result tells you as much about C-state behavior on that specific machine as it does about the frameworks.
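To make the burstiness concrete, here's a quick simulation (not part of the PR) of exponential inter-arrival gaps at a 5000 req/s mean rate — even at a fixed mean, individual gaps vary widely around the 200µs average:

```shell
#!/bin/sh
# Sample exponential inter-arrival times via inverse-CDF sampling,
# as a stand-in for wrk2's Poisson arrival process.
awk 'BEGIN {
  srand(42); rate = 5000; n = 5
  for (i = 1; i <= n; i++) {
    u = rand()
    gap_us = -log(1 - u) / rate * 1e6   # mean gap = 1/rate = 200us
    printf "gap %d: %.0f us\n", i, gap_us
  }
}'
```

Some gaps are far longer than a C6 target residency and some far shorter, which is exactly the mix that trips up residency prediction.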
+💯
```diff
  PROFILER_JVM_ARGS:
  BASE_JAVA_CMD: ${{APP_CMD_PREFIX}} java ${{config.jvm.memory}} ${{config.jvm.args}} ${{PROFILER_JVM_ARGS}}
- TESTS : [measure-time-to-first-request, measure-rss, run-load-test]
+ TESTS : [measure-time-to-first-request, measure-rss, run-load-test, measure-energy-fixed-throughput]
```
I don't believe a dedicated test is necessary. We can add a configuration variable to toggle energy measurement for any existing test instead.
Proposed Approach:
- Toggle via Variable: Use a flag like config.enable.energy.consumption.
- Trigger the Script: In existing tests (e.g., measure-time-to-first-request, measure-rss, run-load-test), call the energy measurement script just before the warm-up or steady-state phases.
Even if a custom test is required later, we should keep this same implementation logic.
I did apply this approach originally, but the problem is that we don't have an existing test that runs for a long period at fixed load.
What if we rename the test, to drop 'energy'? That keeps the separation between metric and test, but gives us the new test parameters we need.
In any case, since we cannot distinguish the cost of the different components, this will make energy comparison across different tests impossible
> I did apply this approach originally, but the problem is that we don't have an existing test that runs for a long period at a fixed load.
I think we should have two separate PRs:
- Create a test that represents: runs for a long period at a fixed load. (soak test)
- Another PR that adds energy consumption metrics to any test
This is what we are doing downstream. Energy measurement is a metric that can be attached to any test.
```diff
  then:
  - abort: unable to connect to target application
  - sleep: 30s # Wait for the app to complete warmup
+ - script: start-energy-capture
```
Because the energy consumption becomes a metric and not a test. See my previous comment. We can now have a special QDup file for each kind of energy consumption measurement.
I do like the approach of energy being a metric, and I'm hoping we'll capture it in several of our tests (even if we decide the results are most useful for one or two).
diegolovison left a comment:
I think we should separate the metric from the test.
I've been thinking about this more, and I think the 'gold standard' test would be a scalability test (like the one @franz1981 is adding in #492). But I think we'll probably also want to report a 'slice' through that scale, probably somewhere around the middle.
It's a good question. I think the question, ultimately, is "how will moving to Quarkus help me achieve my sustainability goals by reducing my carbon footprint?" The answer there is two parts, with one being more complex than the other.

The more complex answer is that higher transaction rates for the same hardware, combined with lower memory requirements, mean workloads can run on smaller hardware and have less embodied carbon. That ties closely into the "what container would we need?" and cost calculation efforts.

The second answer is "what would my energy usage per transaction be?" I think to be fair, and measuring-only-one-thing-at-a-time, that should be measured at a fixed transaction rate. The question isn't "if I had twice as many customers once I switched to Quarkus, what would my energy cost per transaction be?" (I mean, if it was a realistic question, that would be kind of awesome, but I don't think "doubles your customer base overnight" is part of our value proposition.) If modern processor architecture means Quarkus still consumes similar-ish energy even when it seems to be running at half its potential max capacity, that's useful information. It's not the answer we want, but it's going to more accurately reflect user interests than a theoretical energy usage that relies on throughput being higher to avoid scheduler/hardware penalties.

As an aside, if we assume our results are still mostly like our results the last time we measured energy, a vertical slice does show the energy savings ratio is less than the max-throughput ratio, but maybe not hugely. How much depends on the baseline (these charts have a non-zero baseline). For JVM, it's the ratio between the red line and the purple line:



Resolves #99.
This is based on John O'Hara's work in the quarkus-qe-spring-boot-comparison branch of his old repo. I've copied the relevant qDup scripts from his yaml, and then added some extra logic to allow it to be turned on and off, guard execution on non-Intel hardware and non-Dell hardware, and report an average energy consumption into the json.

The energy measurements will run as part of the same job that gathers rss information. That seems suitable, since it's fixed throughput, and we're running it already. (@franz1981 and @edeandrea, should that job be made longer, to get higher quality results?)
I'm stumbling in the dark a bit with the qDup, since this is my first time touching the scripts. I've tested as well as I can, but I don't have access to any Intel hardware, so I had to test by getting claude to stub out the rapl and idrac endpoints. That's not brilliant, so I'd be ever so grateful if @edeandrea or @franz1981 could test for me. :)
These are the changes:
These are the changes:

- Energy measurement added to the measure-rss benchmark phase
- scripts/perf-lab/helpers/energy.yml with install-rapl-plot and capture-system-power scripts

Energy measurement outputs:
- .rapl files: per-second CPU package and DRAM power readings from Intel RAPL
- svrPwr files: per-second system-level wattage readings from Dell iDRAC/racadm