Skip to content

feat: Surface dataset size hints to guide large-output fetches#932

Draft
jirispilka wants to merge 2 commits into
masterfrom
claude/nice-ride-HBpVv
Draft

feat: Surface dataset size hints to guide large-output fetches#932
jirispilka wants to merge 2 commits into
masterfrom
claude/nice-ride-HBpVv

Conversation

@jirispilka
Copy link
Copy Markdown
Collaborator

@jirispilka jirispilka commented May 29, 2026

Closes #878

TL;DR

Problem: Dataset items can be huge (full page text/HTML), so fetching them blows the LLM's context window. Item count is no guide — measured datasets run ~27 KB/item, so even a few items exceed the limit, and the model has no way to know before it fetches.

Fix: Tell the model the size up front instead of truncating. get-actor-run and get-dataset surface the dataset's byte size (stats.inflatedBytes), and get-dataset-items appends an estimated full-dataset size with a steer to narrow via fields= or page with offset. Advisory only — nothing is cut, so large-context clients can still pull everything.


Add advisory size hints to dataset tools so callers can narrow fetches before hitting output limits. Hints extrapolate full-dataset byte size from the returned page and steer toward fields= projection or pagination when large.

Key changes:

  • buildDatasetItemsSizeHint(): New function that estimates full-dataset size from returned bytes/count and appends a hint to get-dataset-items output when projected size exceeds DATASET_SIZE_HINT_BYTES (50 KB). Clamps total count to avoid "N of fewer" when Apify's eventually-consistent metadata lags.
  • get-dataset-items response: Appends size hint to JSON output; updated description to mention large-item handling.
  • get-actor-run response: Surfaces inflatedBytes from dataset stats in structured output and RunDataset type. Adds datasetSizeSummarySuffix() (shows "~2.3 MB, ~50 KB/item") and datasetSizeNextStepHint() (steers to narrow when large) to the succeeded summary/nextStep.
  • formatBytes() utility: New function for human-readable byte formatting ("512 B", "27 KB", "1.2 MB", "3.4 GB") with rounding that carries at unit boundaries.
  • Constants: Added DATASET_SIZE_HINT_BYTES (50 KB threshold) and NARROW_OUTPUT_HINT (shared steer text).
  • Tests: Full coverage for buildDatasetItemsSizeHint(), formatBytes(), and integration tests for both tools surfacing hints correctly.

Hints are soft advisory only — they do not truncate output, only inform the caller of estimated full-dataset size so they can adjust limit and fields before fetching.

https://claude.ai/code/session_01StznoKeKDSsjCzUFrgJBqu

claude added 2 commits May 28, 2026 21:11
Surface dataset byte size (Apify stats.inflatedBytes) so the model sizes its
fetch before pulling items, instead of truncating output. Item count is not a
size proxy — items can hold full page text/HTML.

- get-actor-run: add size and per-item to the SUCCEEDED summary, plus a
  large-output steer in nextStep; expose inflatedBytes in structured output.
- get-dataset: document stats.inflatedBytes in the description.
- get-dataset-items: append a self-extrapolated size hint (no cap) and note in
  the description that results can be large; prefer fields= and a small limit.
- Add formatBytes() helper.

inflatedBytes is returned by the API but undeclared in the apify-client type,
so it is read defensively and used only for advisory hints, never as a guard.

Refs #878

https://claude.ai/code/session_01StznoKeKDSsjCzUFrgJBqu
- get-dataset-items: serialize the page once and pass the byte count into
  buildDatasetItemsSizeHint, instead of re-serializing the items array (avoids
  double JSON.stringify of large payloads on the request path).
- get-dataset-items: clamp a lagging `total` to at least the returned page size,
  so the hint no longer reads "Returned 20 of 5 items" under eventual consistency.
- formatBytes: carry at unit boundaries so values like 1048575 render "1.0 MB"
  instead of the malformed "1024 KB"; covers fractional per-item sizes too.
- Add DATASET_SIZE_HINT_BYTES (soft hint threshold) separate from the
  TOOL_MAX_OUTPUT_CHARS truncation cap, and a shared NARROW_OUTPUT_HINT phrase
  used by both the run-response and items hints.

https://claude.ai/code/session_01StznoKeKDSsjCzUFrgJBqu
@github-actions github-actions Bot added t-ai Issues owned by the AI team. tested Temporary label used only programatically for some analytics. labels May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-ai Issues owned by the AI team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Hint dataset size to prevent get-dataset-items context blowup

3 participants