Add energy measurement support (RAPL + iDRAC) #473
holly-cummins wants to merge 10 commits into quarkusio:main
Conversation
Force-pushed from 3df8246 to 0480ec4
Add RAPL (CPU-level) and iDRAC (system-level) power measurement to the measure-rss benchmark phase. Energy sources are controlled by per-source config flags (config.energy.rapl, config.energy.idrac) with a single --energy CLI flag accepting comma-separated values (all, none, rapl, idrac, or rapl,idrac).

- New helpers/energy.yml with install-rapl-plot and capture-system-power
- Hardware guards skip unsupported measurements silently (safe on any platform)
- All sh/regex patterns use then blocks for qDup 0.11 compatibility
- Uses sudo script instead of sh: sudo
- Energy file downloads queued early for resilience
- Per-iteration average watts written to JSON output with cross-iteration averages
- Auto-adds measure-rss to the test list when --energy is explicitly set

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 0480ec4 to 460dd60
Use IFS-based splitting so comma-separated values are parsed correctly, and use a case statement for exact-match validation instead of regex substring matching. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
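The IFS-plus-case pattern described above can be sketched roughly like this (variable names here are illustrative, not necessarily the ones the PR uses):

```shell
#!/bin/sh
# Parse a comma-separated --energy value, e.g. "rapl,idrac".
ENERGY_ARG="rapl,idrac"

RAPL_ENABLED=false
IDRAC_ENABLED=false

# Split on commas with IFS instead of regex substring matching,
# so a value like "rapl" cannot accidentally match inside a longer token.
OLD_IFS="$IFS"
IFS=','
for value in $ENERGY_ARG; do
  case "$value" in
    all)   RAPL_ENABLED=true; IDRAC_ENABLED=true ;;
    none)  RAPL_ENABLED=false; IDRAC_ENABLED=false ;;
    rapl)  RAPL_ENABLED=true ;;
    idrac) IDRAC_ENABLED=true ;;
    *) echo "Unknown energy source: $value" >&2; exit 1 ;;
  esac
done
IFS="$OLD_IFS"

echo "rapl=$RAPL_ENABLED idrac=$IDRAC_ENABLED"
```

The case statement gives exact-match validation for free: any unrecognized token fails loudly rather than being silently ignored.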
rapl-plot runs indefinitely and must be explicitly terminated. Store its PID via set-state and sudo kill it before parsing the output file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
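A minimal sketch of that start/record-PID/kill pattern (with `sleep` standing in for rapl-plot; in the real qDup script the PID is stored via set-state and the kill needs sudo because rapl-plot runs as root):

```shell
#!/bin/sh
# Start a long-running capture process in the background,
# remember its PID, then terminate it before parsing the output.
OUTPUT=/tmp/rapl-demo.out

sleep 300 > "$OUTPUT" 2>&1 &
CAPTURE_PID=$!            # qDup would store this via set-state

# ... benchmark iteration runs here ...

kill "$CAPTURE_PID"       # real script: sudo kill, since rapl-plot is sudo'd
wait "$CAPTURE_PID" 2>/dev/null || true

echo "stopped capture pid $CAPTURE_PID"
```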
I ran this locally. It looks like RAPL worked, but the iDRAC didn't. I never see the regex check for … I think the problem is with … The echo never happens because the console output is redirected to … Also, what's the purpose of the …? It also looks like it's running that same command in multiple places? Additionally, in capture-system-power:
- script: idrac-login
with:
idrac: ${{= ${{servers}}.filter( server => { return server.hostname == "${{env.HOST}}"})[0].idrac}}
    password: ${{idrac_pwd}}

Where are servers and idrac_pwd set?
This may just be coincidence, but I ran both the Quarkus and Spring tests and got the exact same energy values for both:

"energy": {
"raplPkg": [
0.44
],
"raplDram": [
0.53
],
"avRaplPkg": 0.44,
"avRaplDram": 0.53
}

Full results:

{
"timing": {
"start": "2026-04-17T18:45:16Z",
"stop": "2026-04-17T18:50:15Z"
},
"results": {
"quarkus3-jvm": {
"build": {
"timings": [
9.21
],
"avBuildTime": 9.21
},
"startup": {
"timings": [
2494.544875
],
"avStartTime": 2494.544875
},
"rss": {
"startup": [
270.3828125
],
"firstRequest": [
288.33984375
],
"avStartupRss": 270.3828125,
"avFirstRequestRss": 288.33984375
},
"energy": {
"raplPkg": [
0.44
],
"raplDram": [
0.53
],
"avRaplPkg": 0.44,
"avRaplDram": 0.53
}
},
"spring4-jvm": {
"build": {
"timings": [
3.91
],
"avBuildTime": 3.91
},
"startup": {
"timings": [
6873.358205
],
"avStartTime": 6873.358205
},
"rss": {
"startup": [
610.58203125
],
"firstRequest": [
636.1015625
],
"avStartupRss": 610.58203125,
"avFirstRequestRss": 636.1015625
},
"energy": {
"raplPkg": [
0.44
],
"raplDram": [
0.53
],
"avRaplPkg": 0.44,
"avRaplDram": 0.53
}
}
},
"config": {
"jvm": {
"args": "-XX:+UseNUMA",
"memory": "-Xms512m -Xmx512m",
"graalvm": {
"version": "25.0.2-graalce",
"home": ""
},
"version": "25.0.2-tem",
"home": ""
},
"num_iterations": 1,
"quarkus": {
"build_config_args": "",
"native_build_options": "",
"version": "3.34.3"
},
"repo": {
"short_commit": "60f1a44",
"scenario": "tuned",
"commit": "60f1a446fd208a2fdf7916f6bb716274227a8fb0",
"branch": "lab-energy-measurement",
"scenarioName": "Tuned",
"url": "https://github.com/holly-cummins/spring-quarkus-perf-comparison"
},
"springboot3": {
"native_build_options": "",
"version": "3.5.13"
},
"springboot4": {
"native_build_options": "",
"version": "4.0.5"
},
"resources": {
"cpu": {
"app": "0-3",
"1st_request": 10,
"otel": "7-9",
"monitor": 13,
"load_generator": "10-12",
"db": "4-6"
},
"app_cpus": 4
},
"run": {
"dropOsFilesystemCaches": "true",
"description": "energy",
"useContainerHostNetwork": "true"
},
"units": {
"rss": {
"load": "MiB",
"startup": "MiB",
"firstRequest": "MiB"
},
"load": {
"throughputDensity": "tps per MiB",
"throughput": "tps",
"errors": {
"connectionErrors": "absolute number",
"requestTimeouts": "absolute number"
}
},
"timings": {
"build": "sec",
"startup": "ms"
},
"energy": {
"raplDram": "W",
"raplPkg": "W",
"idrac": "W"
}
},
"profiler": {
"name": "none",
"events": "cpu"
},
"energy": {
"rapl": "enabled",
"idrac": "enabled"
}
},
"env": {
"host": {
"memory": "62Gi",
"os": "Fedora Linux 43 (Workstation Edition)",
"kernel": "6.19.10-200.fc43.x86_64",
"cpu": "Intel(R) Core(TM) Ultra 7 165H (22 cpus)",
"type": "LENOVO 21KWS49V00",
"gpu": "Intel Corporation Meteor Lake-P [Intel Arc Graphics]"
}
}
}
As pointed out by @edeandrea, maybe something is off, but I still think we should compute the Joules from the energy drawn (which means the startup time matters), and eventually move to https://greensoftware.foundation/articles/gps-up-a-better-metric-for-comparing-software-energy-efficiency/ too
Add idrac-login qDup script that configures remote racadm access via shell function wrapping, with connectivity validation via getsysinfo. Replace the duplicated command -v racadm + IDRAC_READY guard pattern (repeated 3 times in measure-rss) with a single validate-idrac setup script that stores availability in RUN.IDRAC_AVAILABLE state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude says
It agrees with you. :)
I've fixed those points. I've also put in a best guess at a …
You're totally right. I'm a bit sceptical about making GPS-UP our primary metric because we want to be able to communicate to people without them having to go read a document. I think it's maybe a bit too specialised for this use case, too – the question we're asking isn't "I want to evaluate an optimisation," but more "I want to know what my energy bill might be."

But I'd got muddled about what the RSS test was doing, which may be partly why @edeandrea's results make no sense. We know, from both local measurement and our earlier publications, there's an energy difference. I'd imagined the RSS test was running for a fixed time, and taking a mean or max RSS after a warmup period. If that was the case, average and total energies would be strictly proportional, and it wouldn't really matter which I measured. (Technically, the average would be a power, not an energy, but that's a detail. :) ) But on looking more closely, I see that the RSS test only runs until the first request comes back. So that means …

So we could add a whole other test, or we could extend the RSS test if energy is enabled. I'll go for the 're-use but extend' option, and switch to measuring total.
Add configurable duration (--energy-duration, default 150s matching load test) to keep energy capture running after RSS is measured. When --energy none, duration is set to 0 so no time is wasted. Change energy metrics from average power (W) to total energy (J): iDRAC sums watt-readings instead of averaging, RAPL awk sums instead of averaging, and units updated from W to J. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
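For reference, the W-to-J conversion relies on the sampling interval: with one power sample per second, each watt reading approximates the joules consumed during that second, so total energy is just the sum. A minimal sketch, assuming a file format of one watt value per line (the actual .rapl/svrPwr formats may differ):

```shell
#!/bin/sh
# Sum per-second watt samples into total joules.
printf '27.1\n28.4\n26.9\n' > /tmp/watts.txt

awk '{ total += $1 } END { printf "%.1f J\n", total }' /tmp/watts.txt
# prints "82.4 J"
```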
I kept looking for this comment from @zakkak, thinking it was on this PR, but it's elsewhere. Pasting it here since it's relevant (but probably an advanced iteration of the basics done here): #387 (comment)
Remove remote iDRAC access (host/user/password parameters, shell function wrapper). Local racadm communicates with iDRAC directly on Dell servers without network configuration or credentials. This also addresses the reviewer question about where servers and idrac_pwd were set — they're no longer needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FYI @holly-cummins I re-ran this and I'm still getting the exact same values for both Quarkus & Spring:

"energy": {
"raplPkg": [
27
],
"raplDram": [
33
],
"avRaplPkg": 27,
"avRaplDram": 33
}
Is the …? Has enough happened to know how much energy was consumed? Shouldn't energy measurement have a time component to it? All of the tests in … Or should energy measurement be a component of each test (time to first request, load test, rss, build time, etc.)? I would think that energy expended during load testing could be interesting. If an app can handle lots of requests, but we find that the app threads are "asleep" waiting on CPU, maybe the app uses less energy? I'm also not sure what …
Yes, definitely. It's supposed to be putting load on the system for 150s, same duration as the normal tests. But I suppose it's possibly just letting the system run for 150s, with no load. That would be rather sad, but would explain the strange results.
Oh dear.
I think energy testing needs to be done with fixed load. Well, that's not quite true. It possibly would be interesting to report energy for all of the tests, because if we can handle 2x throughput with less energy, that's very interesting, but I think on the "vary one thing at a time" principle that @franz1981 has drilled into us, the 'best' energy measurement is done at fixed throughput, so it's an "all other things are equal" test.
I kind of assumed qDup had two threads so that thread was just waiting for the load to execute. But maybe that's actually not at all what it's doing, and it's just wasting time. :(
There's no load on the system during that 150s. The curl has already happened. The app is just sitting there idle without anything hitting it.
Interestingly, it's actually easier to get the max TPS test right than the fixed throughput one. By saturating the CPU, we're effectively running at a fixed hardware load (100% utilization) and just measuring the work done (Requests per Joule). Not to mention, if we use a fixed TPS, we run into a sneaky issue: if one framework is much more efficient and uses far fewer resources, it will spend more time sleeping. In that scenario, the energy cost of constant CPU wakeups could actually end up being higher than the cost of handling the requests that originated them! So we have to think about this carefully before we commit entirely to fixed TPS for energy baselines. I know that I am contradicting myself here, but for scientific reasons I cannot ignore the challenges of testing each scenario 🙏
Since I'm doing such an awesome job of getting one scenario right, I think doing two scenarios is an excellent idea. :) (Joking aside, I agree, let's experiment and think. In that order, probably. :) )
Two new test scripts: - measure-energy-max-throughput: captures energy during the 30s steady-state wrk run at max throughput (after 2m warmup) - measure-energy-fixed-throughput: captures energy during a configurable-duration wrk2 run at a fixed request rate (default 5000 tps for 150s, via --fixed-throughput-rate and --fixed-throughput-duration) Extract energy capture into reusable qDup helper scripts (start-energy-capture, stop-energy-capture, calc-energy-totals) and refactor measure-rss to use them. Remove the broken 150s energy sleep from measure-rss since idle measurement has no load. Energy metrics are reported as total joules, not average watts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Energy at max throughput is now captured during the existing load test instead of requiring a separate test job. Removes the standalone measure-energy-max-throughput script and strips energy capture from measure-rss where it added no value (no load). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix state variable naming consistency, remove stale measure-rss force-add, merge double-loop in --energy parsing, add input validation for fixed-throughput options, add java -version and monitor-processes to fixed-throughput test, fix wrk timeout, and compute energy-per-request (J/req) for both iDRAC and RAPL in test output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
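For context, the energy-per-request figure can be computed as total joules over the measurement window divided by the requests served during it. A hypothetical sketch (the numbers and variable names here are made up, not taken from the PR):

```shell
#!/bin/sh
# Energy per request = total energy (J) / requests served.
TOTAL_JOULES=8250        # e.g. summed from per-second power samples
TOTAL_REQUESTS=750000    # e.g. parsed from wrk/wrk2 output

awk -v j="$TOTAL_JOULES" -v r="$TOTAL_REQUESTS" \
  'BEGIN { printf "%.4f J/req\n", j / r }'
# prints "0.0110 J/req"
```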
@holly-cummins the energy capture - is it looking at a specific process? Or is it looking at the system as a whole during a period of time? If it's capturing the total power consumption of the system, wouldn't that capture both the app under load and the load generator (plus database and other things running)?
@franz1981, who would be triggering these "constant CPU wakeups" the framework itself or the energy measurement harness? If it's the framework itself then it sounds fair game to me. The CPU wakeups should be priced in...
@holly-cummins Tons is a bit scary on the y-axis :D
+1, it's pretty nice though so it might be worth running on each new feature release (once a month?). How long does it take to run?
The originating cause is: network interrupts due to new requests arriving (+ a causal dependency between thread pools due to the Quarkus threading model). But the actual motivation is that each of the involved thread pools (Netty and any workers) could have already completed their job, so they have a higher chance to go park and be unparked. The slower framework, instead, is kept awake, making it appear as efficient (or more?) depending on the machine's C-state settings. That's why it is dangerous to measure at just a single fixed TPS; you would like to measure at different loads, exactly like measuring the efficiency of cars under different scenarios (types of roads, different speeds, acceleration/deceleration, etc.).
@edeandrea and I were talking the other day, and the way our test is set up may make energy measurement in the max-load case non-interesting. I'd still like to collect it, but I don't think we can draw very many conclusions from it. Because RAPL collects whole-system power, and because the DBMS and load generator are co-located on the system, we'd end up measuring them as well. If their work is (relatively) constant, we can just treat it as a baseline in the comparison. But if Quarkus is able to handle more load, the load generator and DBMS will be doing more work, and consuming more power. We wouldn't be able to draw conclusions about the efficiency of the frameworks. So it sounds like we'll want to be doing a few different loads as our 'default' measurement, although I'd also like to report a 'slice through' metric. In our old data, we did see a fairly linear relationship between load and energy over wide ranges
... but less linear in small ranges, which I guess would be consistent with @franz1981's concerns.
Good, so it's part of the framework (or its underlying dependencies) in which case it's fair to measure it. No?
@franz1981 I am confused, why would an awake CPU be more efficient than a sleeping one that's being woken up?
I totally agree on that.
@holly-cummins is there no option to run the load generator and DBMS on a different system? It would sure add some overhead which would drop the peak throughput, but would make the results more realistic and representative IMHO (no matter if using a fixed throughput or going for peak). Keeping the DBMS in the mix might be OK, but the load generator doesn't make much sense.
It depends — it's more a timing artifact than a framework property. Ideally you'd measure each framework at its own 75% load (which differs between them), or size the more efficient one to match the peak throughput of the slower one, and then run both at the same fixed TPS. That makes the comparison meaningful in terms of capacity served.
It's not more efficient in steady state — but it appears so under fixed TPS measurement. With turboboost disabled and a stable P-state, a warm CPU draws a predictable, moderate amount of energy. A CPU that enters deep C-states between requests pays a wakeup tax on every request arrival: PLL relock, cache warm-up, a potential turbo burst. That cost gets attributed to the efficient framework precisely because it finishes fast enough to sleep. The slow framework never parks, avoids all that overhead, and ends up looking comparably efficient — or better — depending on the machine's C-state configuration.
Hmm, I find energy per request more interesting/representative. Running at 75% capacity doesn't say much when the capacity is not the same. Service providers probably have a rough estimate of how many requests they anticipate in a given period and are probably interested in minimizing the total cost of serving those requests while meeting some performance criteria. So I would expect them to be interested in …
I think this is tricky as well. Since both frameworks run on the same machine, the one stressing the machine will be consuming more power than the other. At the same time, the "throttled" framework, despite consuming less energy overall for the same throughput, will have a worse …
Sure, but that's OK IMHO if energy measurement is done end-to-end. On the other hand, if energy measurement is done by periodically sampling power consumption, then the non-steady state might indeed skew things... In the case of end-to-end measurements, though, I would still expect the start/stop to result in better energy consumption in most cases, just like my car is still expected to burn less fuel overall if it turns itself off at the traffic lights. To take a step back, what question are we trying to answer with these numbers?
Traffic lights are periodic and predictable — HTTP load isn't. wrk2 generates Poisson arrivals: inter-arrival times are exponentially distributed (CV=1), so even at a fixed mean rate you get bursts and gaps that the Linux idle governor (…) has to predict. Each C-state has a target residency — a minimum dwell time for the sleep savings to outweigh the entry/exit overhead (C6: ~100–300µs). The governor uses timer deadlines and recent idle history to pick the right state, but under high-variance arrivals the misprediction rate increases: it may commit to a deeper state and get pulled out before residency, paying the transition cost with reduced savings. This doesn't necessarily flip the result — race-to-idle is still the dominant design philosophy of modern CPUs, and a busy framework burning dynamic power continuously isn't free either. But it does mean the efficiency gap measured at fixed TPS can be smaller than the raw throughput ratio suggests, and the magnitude is hardware, kernel-tuning, and load-point dependent. That's why measuring at a single fixed TPS is risky — the result tells you as much about C-state behavior on that specific machine as it does about the frameworks.
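To make the burstiness concrete, here's a quick simulation (not part of the PR) of exponential inter-arrival gaps at a 5000 req/s mean rate — even at a fixed mean, individual gaps vary widely around the 200µs average:

```shell
#!/bin/sh
# Sample exponential inter-arrival times via inverse-CDF sampling,
# as a stand-in for wrk2's Poisson arrival process.
awk 'BEGIN {
  srand(42); rate = 5000; n = 5
  for (i = 1; i <= n; i++) {
    u = rand()
    gap_us = -log(1 - u) / rate * 1e6   # mean gap = 1/rate = 200us
    printf "gap %d: %.0f us\n", i, gap_us
  }
}'
```

Some gaps are far longer than a C6 target residency and some far shorter, which is exactly the mix that trips up residency prediction.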
+💯
```diff
  PROFILER_JVM_ARGS:
  BASE_JAVA_CMD: ${{APP_CMD_PREFIX}} java ${{config.jvm.memory}} ${{config.jvm.args}} ${{PROFILER_JVM_ARGS}}
- TESTS : [measure-time-to-first-request, measure-rss, run-load-test]
+ TESTS : [measure-time-to-first-request, measure-rss, run-load-test, measure-energy-fixed-throughput]
```
I don't believe a dedicated test is necessary. We can add a configuration variable to toggle energy measurement for any existing test instead.
Proposed Approach:
- Toggle via Variable: Use a flag like config.enable.energy.consumption.
- Trigger the Script: In existing tests (e.g., measure-time-to-first-request, measure-rss, run-load-test), call the energy measurement script just before the warm-up or steady-state phases.
Even if a custom test is required later, we should keep this same implementation logic.
I did apply this approach originally, but the problem is that we don't have an existing test that runs for a long period at fixed load.
What if we rename the test, to drop 'energy'? That keeps the separation between metric and test, but gives us the new test parameters we need.
In any case, since we cannot distinguish the cost of the different components, this will make energy comparison across different tests impossible
> I did apply this approach originally, but the problem is that we don't have an existing test that runs for a long period at a fixed load.
I think we should have two separate PRs:
- Create a test that represents: runs for a long period at a fixed load. (soak test)
- Another PR that adds energy consumption metrics to any test
This is what we are doing downstream. Energy measurement is a metric that can be attached to any test.
```diff
  then:
  - abort: unable to connect to target application
  - sleep: 30s # Wait for the app to complete warmup
+ - script: start-energy-capture
```
Because the energy consumption becomes a metric and not a test. See my previous comment. We can now have a special QDup file for each kind of energy consumption measurement.
I do like the approach of energy being a metric, and I'm hoping we'll capture it in several of our tests (even if we decide the results are most useful for one or two).
diegolovison left a comment:
I think we should separate the metric from the test.
I've been thinking about this more, and I think the 'gold standard' test would be a scalability test (like the one @franz1981 is adding in #492). But I think we'll probably also want to report a 'slice' through that scale, probably somewhere around the middle.
It's a good question. I think the question, ultimately, is "how will moving to Quarkus help me achieve my sustainability goals by reducing my carbon footprint?" The answer there is two parts, with one being more complex than the other.

The more complex answer is that higher transaction rates for the same hardware, combined with lower memory requirements, mean workloads can run on smaller hardware and have less embodied carbon. That ties closely into the "what container would we need?" and cost calculation efforts.

The second answer is "what would my energy usage per transaction be?" I think to be fair, and measuring-only-one-thing-at-a-time, that should be measured at a fixed transaction rate. The question isn't "if I had twice as many customers once I switched to Quarkus, what would my energy cost per transaction be?" (I mean, if it was a realistic question, that would be kind of awesome, but I don't think "doubles your customer base overnight" is part of our value proposition.) If modern processor architecture means Quarkus still consumes similar-ish energy even when it seems to be running at half its potential max capacity, that's useful information. It's not the answer we want, but it's going to more accurately reflect user interests than a theoretical energy usage that relies on throughput being higher to avoid scheduler/hardware penalties.

As an aside, if we assume our results are still mostly like our results the last time we measured energy, a vertical slice does show the energy savings ratio is less than the max-throughput ratio, but maybe not hugely. How much depends on the baseline (these charts have a non-zero baseline). For JVM, it's the ratio between the red line and the purple line:



Resolves #99.
This is based on John O'Hara's work in the quarkus-qe-spring-boot-comparison branch of his old repo. I've copied the relevant qDup scripts from his yaml, and then added some extra logic to allow it to be turned on and off, guard execution on non-Intel hardware and non-Dell hardware, and report an average energy consumption into the json.

The energy measurements will run as part of the same job that gathers rss information. That seems suitable, since it's fixed throughput, and we're running it already. (@franz1981 and @edeandrea, should that job be made longer, to get higher quality results?)
I'm stumbling in the dark a bit with the qDup, since this is my first time touching the scripts. I've tested as well as I can, but I don't have access to any Intel hardware, so I had to test by getting claude to stub out the rapl and idrac endpoints. That's not brilliant, so I'd be ever so grateful if @edeandrea or @franz1981 could test for me. :)
These are the changes:
These are the changes:

- Energy measurement added to the measure-rss benchmark phase
- scripts/perf-lab/helpers/energy.yml with install-rapl-plot and capture-system-power scripts

Energy measurement outputs:
- .rapl files: per-second CPU package and DRAM power readings from Intel RAPL
- svrPwr files: per-second system-level wattage readings from Dell iDRAC/racadm