[ESS Billing] Agentless: billing data stream never reaches current data (CEL cursor lost on pod recycle) #19639

@eddxavier-elastic

Integration Name

Elasticsearch Service Billing [packages/ess_billing]

Dataset Name

ess_billing.billing

Integration Version

1.9.0

Agent Version

9.4.2

Agent Output Type

elasticsearch

Elasticsearch Version

9.4.2

OS Version and Architecture

Agentless integration

Software/API Version

No response

Error Message

There is no error message, and that is part of the problem. The agent, the cel-es-agentless-output component, and both ess_billing units all report HEALTHY/Running, Fleet shows Connected, logs are info-level only (plus three benign startup warnings), and every billing API request returns HTTP 200. Collection silently produces no recent data despite valid, correctly-permissioned API keys. The complete absence of any error is what makes this hard to detect.

Event Original

Not applicable — this is not an ingest-pipeline error, so there is no event.original that reproduces a pipeline failure. The documents that are produced index correctly and are deduplicated by the pipeline's fingerprint→_id processor (fields: ess.billing.deployment_id, ess.billing.from, ess.billing.to, ess.billing.sku, ess.billing.total_ecu). The problem is upstream of ingest: the CEL input never advances its collection window far enough to emit recent events.

What did you do?

Deployed the ESS Billing integration (ess_billing package 1.9.0) as an agentless integration on Elastic Agent 9.4.2.

Relevant configuration:

Deployment type: agentless (managed Wolfi pods; hostnames agentless---)
billing data stream:

interval: 24h
state.lookbehind: 365 (days — the package default, "How far back to fetch data for the first run")
resource.url: https://billing.elastic-cloud.com/api/v2/billing/organizations/<ORG_ID>/costs/instances
add_tags: false

API key has the Billing admin role; org ID is correct; billing data is visible in the Elastic Cloud console for the current period.

The CEL billing program processes one 24-hour window per request, advancing state.cursor.last_to by 24h each execution and walking forward from now - lookbehind toward the present.
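To make the cost of that walk concrete, here is a minimal Python sketch of the windowing as described above. It is not the package's CEL program: the function names are hypothetical, and the stop condition (stop once the next full 24h window would extend past now) is an assumption, since the exact boundary handling is not covered in this report.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=24)  # one 24-hour window per request, as described above


def next_window(last_to, now, lookbehind_days):
    """Return (from, to) for the next request, or None once caught up.

    last_to models state.cursor.last_to; None models a start with no cursor,
    which begins at now - lookbehind. The stop condition is an assumption.
    """
    start = last_to if last_to is not None else now - timedelta(days=lookbehind_days)
    end = start + WINDOW
    if end > now:
        return None  # the next full window is not complete yet
    return start, end


def windows_to_present(last_to, now, lookbehind_days):
    """Count the executions needed before the cursor reaches the present."""
    count = 0
    while (window := next_window(last_to, now, lookbehind_days)) is not None:
        last_to = window[1]  # advance the cursor by one window
        count += 1
    return count
```

Under these assumptions, a start with no cursor and lookbehind 365 needs 365 sequential executions, while a lookbehind of 3 needs 3.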

What did you see?

The billing data stream never ingests recent data. It either ingests documents timestamped ~a year in the past (early backfill windows) or produces nothing, and current-period billing never appears in dashboards. No error is surfaced.

Evidence from three diagnostic bundles taken from this same agentless integration over ~31 hours (two ess_billing, one openai as a control):

  1. The CEL registry (persisted cursor store) is empty in every capture. registry/filebeat/log.json inside components/cel-es-agentless-output/registry.tar.gz is 0 bytes in all three bundles, with an mtime equal to pod boot time. The cursor (state.cursor.last_to) is the only state meant to survive a restart, and it is never durably written.

  2. Pods are recycled ~daily and the cursor does not survive the recycle:

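| Capture | Component   | Pod (replicaset/suffix) | beat.info.uptime.ms | Pod age |
|---------|-------------|-------------------------|---------------------|---------|
| 1       | ess_billing | ...-66fb4666c5-pqznx    | 83,708,933          | 23.25 h |
| 2       | ess_billing | ...-57cffd756c-f6trp    | 87,100,051          | 24.19 h |
| 3       | openai      | ...-7c56c55495-mk8q5    | 97,977,326          | 27.22 h |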

Different ReplicaSet hashes/pod names between captures confirm the pod was replaced; each captured pod is ~1 day old. (Caveat: these readings strongly suggest ~daily recycling but do not by themselves prove a fixed timer.)
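For anyone checking the pod ages, they are simply beat.info.uptime.ms converted to hours. A small Python sketch of that arithmetic follows; the dictionary keys are just labels for the three captures, not field names from the diagnostics.

```python
MS_PER_HOUR = 3_600_000

# beat.info.uptime.ms as reported in the three captures above
uptime_ms = {
    "capture 1 (ess_billing)": 83_708_933,
    "capture 2 (ess_billing)": 87_100_051,
    "capture 3 (openai)": 97_977_326,
}


def uptime_hours(ms):
    """Convert a millisecond uptime to hours, rounded to two decimals."""
    return round(ms / MS_PER_HOUR, 2)


pod_age_hours = {name: uptime_hours(ms) for name, ms in uptime_ms.items()}
for name, hours in pod_age_hours.items():
    print(f"{name}: {hours} h")
```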

  3. The billing stream advances only a few windows per pod lifetime, so it never reaches "now". From input_metrics.json (billing stream): Capture 1 had cel_executions=5, http_request_total=5, 2xx=5, events_published=64 (old backfill windows from ~a year ago); Capture 2 had cel_executions=2, http_request_total=0, events_published=0. With lookbehind=365 and one 24h window per execution, the input must traverse ~365 sequential windows to reach the present, but the pod is recycled (~24h) and the cursor is lost long before that completes. As a result, each fresh pod restarts from now - 365d and perpetually re-processes the same year-old windows.

  4. Control case (OpenAI) on the identical platform is unaffected. Same agent build, same agentless runtime, same cel-es-agentless-output component, same 0-byte registry, yet OpenAI collects normally (cel_executions in the hundreds, tens of thousands of events per stream). OpenAI uses interval: 5m, initial_interval: 24h, and returns many 1-minute buckets in bulk per request, so a fresh pod catches up to the present within seconds/minutes of boot, well inside a pod's lifetime, which makes the lost cursor irrelevant. ESS Billing's 365-day distance at one-day-per-request granularity cannot catch up within a pod's life.
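The failure mode in items 3 and 4 can be illustrated with a toy Python model. This is a sketch under stated assumptions, not a reproduction of the agentless runtime: the function and parameter names are hypothetical, executions_per_pod=5 is borrowed from Capture 1 as an illustrative input, and the model ignores that the present itself moves forward by roughly one window per pod lifetime.

```python
def simulate(lookbehind_days, executions_per_pod, pods, cursor_persisted):
    """Return the number of windows still behind the present after each pod.

    Each execution is assumed to advance the cursor by one 24h window.
    Without a persisted cursor, every new pod restarts the full walk.
    """
    remaining = lookbehind_days
    history = []
    for _ in range(pods):
        if not cursor_persisted:
            remaining = lookbehind_days  # fresh pod restarts from now - lookbehind
        remaining = max(0, remaining - executions_per_pod)
        history.append(remaining)
    return history


# Cursor lost on every recycle: progress is discarded each time.
print(simulate(365, 5, 3, cursor_persisted=False))  # [360, 360, 360]
# Cursor persisted: progress accumulates across pods.
print(simulate(365, 5, 3, cursor_persisted=True))   # [360, 355, 350]
# Small lookbehind: a fresh pod catches up even without a cursor.
print(simulate(3, 5, 2, cursor_persisted=False))    # [0, 0]
```

In this model the lost cursor only matters when the catch-up distance exceeds what one pod lifetime can cover, which matches the difference observed between ESS Billing and OpenAI.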

What did you expect to see?

After the initial backfill, the billing data stream should reach the present and then collect the latest day's billing on each interval, indefinitely. Current-period spend should appear and stay current, as it does on a long-lived (non-agentless) Fleet agent. Concretely, at least one of the following should hold: the agentless runtime persists the CEL cursor across pod recycles so backfill progress is not discarded each day, or the integration can catch up to the present within a single pod lifetime on agentless (e.g. a smaller default lookbehind and/or chunked/bulk backfill), so that a lost cursor is survivable as it is for OpenAI.

Anything else?

What makes this issue distinctive is that it arises from the interaction of ESS Billing and agentless, not from either alone.

On a classic always-on Fleet agent, ESS Billing works: the process lives long enough to finish the 365-day walk and the cursor persists on disk across the rare restart. The forward-crawling, cursor-resumable, ingest-deduplicated design is standard and correct.
On agentless, two platform properties break it: (a) pods recycle ~daily, and (b) the CEL registry/cursor is not persisted across the recycle (0-byte log.json in all captures).
ESS Billing is uniquely exposed because of (i) a 365-day default lookbehind (large catch-up distance) and (ii) one-day-per-request granularity (slow catch-up rate). OpenAI on the same platform proves a CEL integration can tolerate the daily recycle when it catches up quickly.

Observed workaround (mitigation, not a fix): reducing Lookbehind to a few days lets a fresh pod catch up to the present within minutes of boot. Re-walking those few days on each recycle is safe because the ingest pipeline fingerprints _id from deployment_id + from + to + sku + total_ecu, so repeated passes over the same window upsert idempotently and do not duplicate documents.
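The idempotency argument behind the workaround can be sketched in Python as follows. This is an illustration only: the real fingerprint processor's hash method and field encoding are not reproduced here (SHA-256 over a simple field=value concatenation is an assumption), the helper names are hypothetical, and the sample documents are made up.

```python
import hashlib

# Fields the pipeline fingerprints into _id, as listed in this report
FINGERPRINT_FIELDS = (
    "ess.billing.deployment_id",
    "ess.billing.from",
    "ess.billing.to",
    "ess.billing.sku",
    "ess.billing.total_ecu",
)


def fingerprint_id(doc):
    """Derive a deterministic id from the fingerprint fields (assumed encoding)."""
    digest = hashlib.sha256()
    for field in FINGERPRINT_FIELDS:
        digest.update(f"{field}={doc[field]}|".encode("utf-8"))
    return digest.hexdigest()


def index_all(store, docs):
    """Upsert docs into a dict keyed by fingerprint; return the document count."""
    for doc in docs:
        store[fingerprint_id(doc)] = doc
    return len(store)


# Hypothetical documents for one 24h window
window_docs = [
    {"ess.billing.deployment_id": "dep-a", "ess.billing.from": "2025-01-01T00:00:00Z",
     "ess.billing.to": "2025-01-02T00:00:00Z", "ess.billing.sku": "sku-1",
     "ess.billing.total_ecu": 1.5},
    {"ess.billing.deployment_id": "dep-a", "ess.billing.from": "2025-01-01T00:00:00Z",
     "ess.billing.to": "2025-01-02T00:00:00Z", "ess.billing.sku": "sku-2",
     "ess.billing.total_ecu": 0.25},
]

store = {}
first_pass = index_all(store, window_docs)
second_pass = index_all(store, window_docs)  # same window re-walked after a recycle
print(first_pass, second_pass)  # prints: 2 2
```

By the same logic, a document whose total_ecu changed between passes would get a different _id, since total_ecu is one of the fingerprinted fields.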
