[ESS Billing] Agentless: billing data stream never reaches current data (CEL cursor lost on pod recycle) #19639

### Integration Name

Elasticsearch Service Billing [packages/ess_billing]

### Dataset Name

ess_billing.billing

### Integration Version

1.9.0

### Agent Version

9.4.2

### Agent Output Type

elasticsearch

### Elasticsearch Version

9.4.2

### OS Version and Architecture

Agentless integration 

### Software/API Version

_No response_

### Error Message

There is no error message, and that is part of the problem. The agent, the cel-es-agentless-output component, and both ess_billing units all report HEALTHY/Running, Fleet shows Connected, logs are info-level only (plus three benign startup warnings), and every billing API request returns HTTP 200. Collection silently produces no recent data despite valid, correctly-permissioned API keys. The complete absence of any error is what makes this hard to detect.

### Event Original

Not applicable — this is not an ingest-pipeline error, so there is no event.original that reproduces a pipeline failure. The documents that are produced index correctly and are deduplicated by the pipeline's fingerprint→_id processor (fields: ess.billing.deployment_id, ess.billing.from, ess.billing.to, ess.billing.sku, ess.billing.total_ecu). The problem is upstream of ingest: the CEL input never advances its collection window far enough to emit recent events.

### What did you do?

Deployed the ESS Billing integration (ess_billing package 1.9.0) as an agentless integration on Elastic Agent 9.4.2.

Relevant configuration:

- Deployment type: agentless (managed Wolfi pods; hostnames `agentless-<uuid>-<replicaset-hash>-<suffix>`)
- `billing` data stream:
  - `interval`: `24h`
  - `state.lookbehind`: `365` (days; the package default, described as "How far back to fetch data for the first run")
  - `resource.url`: `https://billing.elastic-cloud.com/api/v2/billing/organizations/<ORG_ID>/costs/instances`
  - `add_tags`: `false`

API key has the Billing admin role; org ID is correct; billing data is visible in the Elastic Cloud console for the current period.

The CEL billing program processes one 24-hour window per request, advancing state.cursor.last_to by 24h each execution and walking forward from now - lookbehind toward the present.
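As a minimal sketch of the window walk described above (not the package's actual CEL program), the following Python models one 24-hour window per execution. The function name `next_window`, the stop condition "do not request a window that ends in the future", and the fixed `now` value are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=24)  # one 24-hour window per request, as in the report


def next_window(cursor_last_to, now, lookbehind_days):
    """Return the (from, to) pair for the next request, or None when caught up.

    With no cursor, the walk starts at now - lookbehind (assumed behaviour).
    """
    start = cursor_last_to if cursor_last_to is not None else now - timedelta(days=lookbehind_days)
    end = start + WINDOW
    if end > now:  # assumed stop condition: never request a window ending in the future
        return None
    return start, end


now = datetime(2025, 1, 1, tzinfo=timezone.utc)  # hypothetical "present"
cursor = None  # a fresh pod with an empty registry has no cursor
executions = 0
while (window := next_window(cursor, now, lookbehind_days=365)) is not None:
    cursor = window[1]  # advance last_to by 24h
    executions += 1

print(executions)  # 365 sequential executions to reach the present
```

Under these assumptions, a process that starts without a cursor needs 365 consecutive successful executions before it first emits current-period data.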

### What did you see?

The `billing` data stream never ingests recent data. It either ingests documents timestamped ~a year in the past (early backfill windows) or produces nothing, and current-period billing never appears in dashboards. No error is surfaced.

Evidence from three diagnostic bundles taken from this same agentless integration over ~31 hours (two ess_billing, one openai as a control):

1) The CEL registry (persisted cursor store) is empty in every capture. `components/cel-es-agentless-output/registry.tar.gz` → `registry/filebeat/log.json` is 0 bytes in all three bundles, with an mtime equal to pod boot time. The cursor (`state.cursor.last_to`) is the only state meant to survive a restart, and it is never durably written.

2) Pods are recycled ~daily and the cursor does not survive the recycle:

| Capture | Component | Pod (replicaset/suffix) | beat.info.uptime.ms | Pod age |
|---|---|---|---|---|
| 1 | ess_billing | ...-66fb4666c5-pqznx | 83,708,933 | 23.25 h |
| 2 | ess_billing | ...-57cffd756c-f6trp | 87,100,051 | 24.19 h |
| 3 | openai | ...-7c56c55495-mk8q5 | 97,977,326 | 27.22 h |

Different ReplicaSet hashes/pod names between captures confirm the pod was replaced; each captured pod is ~1 day old. (Caveat: these readings strongly suggest ~daily recycling but do not by themselves prove a fixed timer.)
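The "Pod age" column is derived from `beat.info.uptime.ms`. A short sketch of that conversion, using the three readings from the table (the dictionary keys are just labels chosen here):

```python
MS_PER_HOUR = 3_600_000

# beat.info.uptime.ms readings copied from the table above
uptimes_ms = {
    "capture 1 (ess_billing)": 83_708_933,
    "capture 2 (ess_billing)": 87_100_051,
    "capture 3 (openai)": 97_977_326,
}

# Convert milliseconds of uptime to hours, rounded to two decimals
ages_h = {name: round(ms / MS_PER_HOUR, 2) for name, ms in uptimes_ms.items()}

for name, hours in ages_h.items():
    print(f"{name}: {hours} h")
```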

3) The `billing` stream advances only a few windows per pod lifetime, so it never reaches "now". From `input_metrics.json` (billing stream): Capture 1 had `cel_executions=5`, `http_request_total=5`, `2xx=5`, `events_published=64` (old backfill windows from ~a year ago); Capture 2 had `cel_executions=2`, `http_request_total=0`, `events_published=0`. With `lookbehind=365` and one 24h window per execution, the input must traverse ~365 sequential windows to reach the present, but the pod is recycled (~24h) and the cursor is lost long before that completes — so each fresh pod restarts from `now - 365d` and perpetually re-processes the same year-old windows.
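The effect described in point 3 can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the real runtime: it assumes exactly one pod per day, treats `now` as fixed at pod boot, and uses 5 windows per pod lifetime (the `cel_executions=5` figure from Capture 1, which may not be representative). The function name `simulate` and its parameters are hypothetical.

```python
def simulate(pods, windows_per_pod, lookbehind_days, persist_cursor):
    """Return the lag in days (now - last_to) at the end of each pod's life."""
    lags = []
    cursor = None  # last_to, expressed as a day index
    for pod in range(pods):
        now = pod  # assumption: pod N boots on day N
        if cursor is None or not persist_cursor:
            # Empty registry: restart the walk from now - lookbehind
            cursor = now - lookbehind_days
        for _ in range(windows_per_pod):
            if cursor + 1 > now:  # caught up; nothing left to fetch
                break
            cursor += 1  # one 24h window per execution
        lags.append(now - cursor)
    return lags


lost = simulate(pods=100, windows_per_pod=5, lookbehind_days=365, persist_cursor=False)
kept = simulate(pods=100, windows_per_pod=5, lookbehind_days=365, persist_cursor=True)

print(set(lost))        # the lag never shrinks when the cursor is lost
print(kept.index(0))    # first pod at which a persisted cursor reaches the present
```

With the cursor lost on every recycle, every pod ends its life 360 days behind the present. With the same per-pod throughput and a persisted cursor, the simulated walk would reach the present after about three months of daily pods.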

4) Control case (OpenAI) on the identical platform is unaffected. Same agent build, same agentless runtime, same `cel-es-agentless-output` component, same 0-byte registry — yet OpenAI collects normally (`cel_executions` in the hundreds, tens of thousands of events per stream). OpenAI uses `interval: 5m`, `initial_interval: 24h`, and returns many 1-minute buckets in bulk per request, so a fresh pod catches up to the present within seconds/minutes of boot — inside a pod's lifetime — making the lost cursor irrelevant. ESS Billing's 365-day distance at one-day-per-request granularity cannot catch up within a pod's life.

### What did you expect to see?

After the initial backfill, the `billing` data stream should reach the present and then collect the latest day's billing on each interval, indefinitely. Current-period spend should appear and stay current, as it does on a long-lived (non-agentless) Fleet agent. Concretely, one or both of the following should hold: the agentless runtime persists the CEL cursor across pod recycles, so that backfill progress is not discarded each day; or the integration can catch up to the present within a single pod lifetime on agentless (e.g. a smaller default `lookbehind` and/or chunked/bulk backfill), so that a lost cursor is survivable, as it is for OpenAI.
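To give a rough sense of how the two suggested knobs change the catch-up distance, here is a back-of-the-envelope sketch. The helper `requests_to_catch_up` and the `chunk_days` parameter are hypothetical, and whether the billing API accepts ranges wider than one day is an assumption that this report does not establish.

```python
import math


def requests_to_catch_up(lookbehind_days, chunk_days=1):
    """Sequential requests needed to walk from now - lookbehind to now."""
    return math.ceil(lookbehind_days / chunk_days)


print(requests_to_catch_up(365))                 # current default: one day per request
print(requests_to_catch_up(3))                   # smaller lookbehind
print(requests_to_catch_up(365, chunk_days=30))  # hypothetical 30-day chunks
```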

### Anything else?

This issue arises from the interaction of ESS Billing and agentless, not from either alone:

- On a classic always-on Fleet agent, ESS Billing works: the process lives long enough to finish the 365-day walk, and the cursor persists on disk across the rare restart. The forward-crawling, cursor-resumable, ingest-deduplicated design is standard and correct.
- On agentless, two platform properties break it: (a) pods recycle ~daily, and (b) the CEL registry/cursor is not persisted across the recycle (0-byte `log.json` in all captures).
- ESS Billing is uniquely exposed because of (i) a 365-day default lookbehind (large catch-up distance) and (ii) one-day-per-request granularity (slow catch-up rate). OpenAI on the same platform shows that a CEL integration can tolerate the daily recycle when it catches up quickly.

Observed workaround (mitigation, not a fix): reducing Lookbehind to a few days lets a fresh pod catch up to the present within minutes of boot. Re-walking those few days on each recycle is safe because the ingest pipeline fingerprints `_id` from `deployment_id` + `from` + `to` + `sku` + `total_ecu`, so repeated passes over the same window upsert idempotently and do not duplicate documents.
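The idempotency argument can be sketched as follows. This is an illustration only: it uses SHA-256 over a JSON encoding of the five fields, which is not claimed to match the byte-level behaviour of the pipeline's fingerprint processor, and the sample documents and their values are invented.

```python
import hashlib
import json

# The five fields the pipeline fingerprints, as listed in this report
FINGERPRINT_FIELDS = (
    "ess.billing.deployment_id",
    "ess.billing.from",
    "ess.billing.to",
    "ess.billing.sku",
    "ess.billing.total_ecu",
)


def doc_id(doc):
    """Derive a deterministic id from the fingerprint fields (illustrative hash)."""
    material = json.dumps([doc[f] for f in FINGERPRINT_FIELDS], separators=(",", ":"))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def index_all(index, docs):
    """Write each document under its fingerprint id; same id overwrites."""
    for doc in docs:
        index[doc_id(doc)] = doc


# Hypothetical documents for one 24h window
window = [
    {"ess.billing.deployment_id": "dep-a", "ess.billing.from": "2025-01-01T00:00:00Z",
     "ess.billing.to": "2025-01-02T00:00:00Z", "ess.billing.sku": "sku-1",
     "ess.billing.total_ecu": 1.5},
    {"ess.billing.deployment_id": "dep-a", "ess.billing.from": "2025-01-01T00:00:00Z",
     "ess.billing.to": "2025-01-02T00:00:00Z", "ess.billing.sku": "sku-2",
     "ess.billing.total_ecu": 0.25},
]

index = {}
index_all(index, window)  # first pod walks the window
index_all(index, window)  # next pod re-walks the same window after a recycle
print(len(index))  # still 2 documents
```

In this sketch `total_ecu` is part of the key, so a record whose total changed between two passes would produce a different id and a second document.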


