Integration Name
Elasticsearch Service Billing [packages/ess_billing]
Dataset Name
ess_billing.billing
Integration Version
1.9.0
Agent Version
9.4.2
Agent Output Type
elasticsearch
Elasticsearch Version
9.4.2
OS Version and Architecture
Agentless integration
Software/API Version
No response
Error Message
There is no error message, and that is part of the problem. The agent, the cel-es-agentless-output component, and both ess_billing units all report HEALTHY/Running, Fleet shows Connected, logs are info-level only (plus three benign startup warnings), and every billing API request returns HTTP 200. Collection silently produces no recent data despite valid, correctly-permissioned API keys. The complete absence of any error is what makes this hard to detect.
Event Original
Not applicable — this is not an ingest-pipeline error, so there is no event.original that reproduces a pipeline failure. The documents that are produced index correctly and are deduplicated by the pipeline's fingerprint→_id processor (fields: ess.billing.deployment_id, ess.billing.from, ess.billing.to, ess.billing.sku, ess.billing.total_ecu). The problem is upstream of ingest: the CEL input never advances its collection window far enough to emit recent events.
What did you do?
Deployed the ESS Billing integration (ess_billing package 1.9.0) as an agentless integration on Elastic Agent 9.4.2.
Relevant configuration:
Deployment type: agentless (managed Wolfi pods; hostnames agentless---)
billing data stream:
interval: 24h
state.lookbehind: 365 (days — the package default, "How far back to fetch data for the first run")
resource.url: https://billing.elastic-cloud.com/api/v2/billing/organizations/<ORG_ID>/costs/instances
add_tags: false
API key has the Billing admin role; org ID is correct; billing data is visible in the Elastic Cloud console for the current period.
The CEL billing program processes one 24-hour window per request, advancing state.cursor.last_to by 24h each execution and walking forward from now - lookbehind toward the present.
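The window walk described above can be sketched as follows. This is an illustration of the described behaviour only, not the package's actual CEL program; `next_window`, `WINDOW`, and the example date are hypothetical names and values chosen for this sketch, and a 3-day lookbehind is used to keep the output short.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=24)  # assumption: one request covers one 24-hour window


def next_window(cursor_last_to, now, lookbehind_days):
    """Return (from, to) for the next request, or None once caught up.

    cursor_last_to is None on a fresh start, i.e. when no cursor was persisted.
    """
    if cursor_last_to is None:
        start = now - timedelta(days=lookbehind_days)
    else:
        start = cursor_last_to
    end = start + WINDOW
    if end > now:
        return None  # the next full 24h window has not elapsed yet
    return start, end


now = datetime(2025, 1, 10, tzinfo=timezone.utc)  # arbitrary example date
cursor = None  # fresh start: nothing persisted
windows = []
while (w := next_window(cursor, now, lookbehind_days=3)) is not None:
    windows.append(w)
    cursor = w[1]  # stands in for advancing state.cursor.last_to by 24h

print(len(windows), windows[0][0].date(), cursor.date())
```

If `cursor` is reset to `None` (as happens when the registry is empty after a restart), the walk begins again at `now - lookbehind`.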
What did you see?
The billing data stream never ingests recent data. It either ingests documents timestamped ~a year in the past (early backfill windows) or produces nothing, and current-period billing never appears in dashboards. No error is surfaced.
Evidence from three diagnostic bundles taken from this same agentless integration over ~31 hours (two ess_billing, one openai as a control):
- The CEL registry (persisted cursor store) is empty in every capture. components/cel-es-agentless-output/registry.tar.gz → registry/filebeat/log.json is 0 bytes in all three bundles, with an mtime equal to pod boot time. The cursor (state.cursor.last_to) is the only state meant to survive a restart, and it is never durably written.
- Pods are recycled ~daily and the cursor does not survive the recycle:

| Capture | Component | Pod (replicaset/suffix) | beat.info.uptime.ms | Pod age |
|---|---|---|---|---|
| 1 | ess_billing | ...-66fb4666c5-pqznx | 83,708,933 | 23.25 h |
| 2 | ess_billing | ...-57cffd756c-f6trp | 87,100,051 | 24.19 h |
| 3 | openai | ...-7c56c55495-mk8q5 | 97,977,326 | 27.22 h |

Different ReplicaSet hashes/pod names between captures confirm the pod was replaced; each captured pod is ~1 day old. (Caveat: these readings strongly suggest ~daily recycling but do not by themselves prove a fixed timer.)
- The billing stream advances only a few windows per pod lifetime, so it never reaches "now". From input_metrics.json (billing stream): Capture 1 had cel_executions=5, http_request_total=5, 2xx=5, events_published=64 (old backfill windows from ~a year ago); Capture 2 had cel_executions=2, http_request_total=0, events_published=0. With lookbehind=365 and one 24h window per execution, the input must traverse ~365 sequential windows to reach the present, but the pod is recycled (~24h) and the cursor is lost long before that completes, so each fresh pod restarts from now - 365d and perpetually re-processes the same year-old windows.
- Control case (OpenAI) on the identical platform is unaffected. Same agent build, same agentless runtime, same cel-es-agentless-output component, same 0-byte registry, yet OpenAI collects normally (cel_executions in the hundreds, tens of thousands of events per stream). OpenAI uses interval: 5m, initial_interval: 24h, and returns many 1-minute buckets in bulk per request, so a fresh pod catches up to the present within seconds/minutes of boot, well inside a pod's lifetime, making the lost cursor irrelevant. ESS Billing's 365-day distance at one-day-per-request granularity cannot catch up within a pod's life.
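The catch-up arithmetic behind the evidence above can be sketched with a small simulation. It is a simplified model, not a reproduction of the agentless runtime: `simulate` is a hypothetical helper, `windows_per_pod=5` is taken from Capture 1's cel_executions and assumed to be representative, and each pod lifetime is assumed to add roughly one new 24h window to the backlog.

```python
def simulate(lookbehind_days, windows_per_pod, pods, cursor_persists):
    """Return, per pod lifetime, how many 24h windows were still missing at recycle."""
    backlog = lookbehind_days  # distance from the backfill start to "now", in windows
    missing_at_recycle = []
    for _ in range(pods):
        if not cursor_persists:
            backlog = lookbehind_days  # cursor lost: restart from now - lookbehind
        backlog = max(0, backlog - windows_per_pod)
        missing_at_recycle.append(backlog)
        backlog += 1  # assumption: about one more day elapses per pod lifetime
    return missing_at_recycle


# Observed situation: 365-day lookbehind, cursor lost on every recycle.
print(simulate(365, 5, 3, cursor_persists=False))
# Same lookbehind, but with a cursor that survives the recycle.
print(simulate(365, 5, 3, cursor_persists=True))
# Reduced lookbehind: a lost cursor no longer matters.
print(simulate(3, 5, 3, cursor_persists=False))
```

Under these assumptions, the first case never gets closer than 360 windows from the present, the second makes slow but real progress, and the third is caught up at the end of every pod lifetime.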
What did you expect to see?
After the initial backfill, the billing data stream should reach the present and then collect the latest day's billing on each interval, indefinitely: current-period spend should appear and stay current, as it does on a long-lived (non-agentless) Fleet agent. Concretely, the agentless runtime should persist the CEL cursor across pod recycles so backfill progress is not discarded each day, and/or the integration should be able to catch up to the present within a single pod lifetime on agentless (e.g. a smaller default lookbehind and/or chunked/bulk backfill) so that a lost cursor is survivable, as it is for OpenAI.
Anything else?
What makes this issue distinctive is that it arises only from the combination of ESS Billing and agentless; neither causes it alone.
On a classic always-on Fleet agent, ESS Billing works: the process lives long enough to finish the 365-day walk and the cursor persists on disk across the rare restart. The forward-crawling, cursor-resumable, ingest-deduplicated design is standard and correct.
On agentless, two platform properties break it: (a) pods recycle ~daily, and (b) the CEL registry/cursor is not persisted across the recycle (0-byte log.json in all captures).
ESS Billing is uniquely exposed because of (i) a 365-day default lookbehind (large catch-up distance) and (ii) one-day-per-request granularity (slow catch-up rate). OpenAI on the same platform proves a CEL integration can tolerate the daily recycle when it catches up quickly.
Observed workaround (mitigation, not a fix): reducing Lookbehind to a few days lets a fresh pod catch up to the present within minutes of boot. Re-walking those few days on each recycle is safe because the ingest pipeline fingerprints _id from deployment_id + from + to + sku + total_ecu, so repeated passes over the same window upsert idempotently and do not duplicate documents.
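Why re-walking the same windows is safe can be illustrated with a toy model of the fingerprint-derived _id. This is only a sketch: SHA-256 over a joined string is a stand-in and is not claimed to match the hash or byte layout the ingest pipeline's fingerprint processor uses, the `index` dict is a toy substitute for an Elasticsearch index, and the record values are made up.

```python
import hashlib

# Field names as listed in this report (ess.billing.* prefix omitted for brevity).
FINGERPRINT_FIELDS = ("deployment_id", "from", "to", "sku", "total_ecu")


def doc_id(billing):
    """Derive a deterministic id from the fingerprinted fields (stand-in hash)."""
    joined = "|".join(str(billing[name]) for name in FINGERPRINT_FIELDS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


index = {}  # toy stand-in for an index keyed by _id


def upsert(billing):
    index[doc_id(billing)] = billing  # same id overwrites instead of duplicating


record = {
    "deployment_id": "dep-1",          # hypothetical values
    "from": "2025-01-07T00:00:00Z",
    "to": "2025-01-08T00:00:00Z",
    "sku": "example-sku",
    "total_ecu": 1.25,
}

for _ in range(3):  # three passes over the same window, e.g. three pod recycles
    upsert(dict(record))

print(len(index))
```

One consequence of this design worth keeping in mind: because total_ecu is part of the fingerprint, a record whose total_ecu differs for the same window would get a different _id and be stored as a separate document rather than overwrite the earlier one.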