feat: Surface dataset size hints to guide large-output fetches by jirispilka · Pull Request #932 · apify/apify-mcp-server

jirispilka · 2026-05-29T08:59:09Z

Closes #878

TL;DR

Problem: Dataset items can be huge (full page text/HTML), so fetching them blows the LLM's context window. Item count is no guide — measured datasets run ~27 KB/item, so even a few items exceed the limit, and the model has no way to know before it fetches.

Fix: Tell the model the size up front instead of truncating. get-actor-run and get-dataset surface the dataset's byte size (stats.inflatedBytes), and get-dataset-items appends an estimated full-dataset size with a steer to narrow via fields= or page with offset. Advisory only — nothing is cut, so large-context clients can still pull everything.

Add advisory size hints to dataset tools so callers can narrow fetches before hitting output limits. Hints extrapolate full-dataset byte size from the returned page and steer toward fields= projection or pagination when large.
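The extrapolation described above can be sketched as follows. This is a minimal illustration, not the PR's `buildDatasetItemsSizeHint()`: the names `PageStats`, `estimateFullDatasetBytes`, and `sizeHintSketch` are hypothetical, the hint wording is invented, and reading 50 KB as `50 * 1024` bytes is an assumption.

```typescript
// Assumed threshold: the PR states 50 KB; 50 * 1024 is a guess at the exact value.
const DATASET_SIZE_HINT_BYTES = 50 * 1024;

// Hypothetical input shape for this sketch.
interface PageStats {
    returnedBytes: number; // serialized size of the returned page
    returnedCount: number; // number of items in the returned page
    totalCount: number; // total items reported by dataset metadata (may lag)
}

// Project the full-dataset size from the average item size of the returned page.
function estimateFullDatasetBytes(page: PageStats): number | null {
    if (page.returnedCount <= 0) return null; // nothing to extrapolate from
    // Clamp a lagging total so it is never smaller than what was just returned.
    const total = Math.max(page.totalCount, page.returnedCount);
    const bytesPerItem = page.returnedBytes / page.returnedCount;
    return Math.round(bytesPerItem * total);
}

// Return an advisory hint only when the projected size exceeds the threshold.
function sizeHintSketch(page: PageStats): string | null {
    const estimated = estimateFullDatasetBytes(page);
    if (estimated === null || estimated <= DATASET_SIZE_HINT_BYTES) return null;
    const total = Math.max(page.totalCount, page.returnedCount);
    return `Returned ${page.returnedCount} of ${total} items; estimated full dataset is ~${estimated} bytes. `
        + 'Narrow with fields= or page with offset.';
}
```

Because the sketch only returns a string or `null`, the caller can append it to the response without touching the items, which matches the advisory-only intent.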

Key changes:

- buildDatasetItemsSizeHint(): New function that estimates full-dataset size from the returned bytes and item count, and appends a hint to get-dataset-items output when the projected size exceeds DATASET_SIZE_HINT_BYTES (50 KB). Clamps the total count to at least the returned page size, so the hint never reads "N of fewer" (for example "Returned 20 of 5 items") when Apify's eventually consistent metadata lags.
- get-dataset-items response: Appends the size hint to the JSON output; the description now mentions large-item handling.
- get-actor-run response: Surfaces inflatedBytes from dataset stats in the structured output and the RunDataset type. Adds datasetSizeSummarySuffix() (shows "~2.3 MB, ~50 KB/item") and datasetSizeNextStepHint() (steers to narrow when large) to the succeeded summary/nextStep.
- formatBytes() utility: New function for human-readable byte formatting ("512 B", "27 KB", "1.2 MB", "3.4 GB") with rounding that carries at unit boundaries.
- Constants: Added DATASET_SIZE_HINT_BYTES (50 KB threshold) and NARROW_OUTPUT_HINT (shared steer text).
- Tests: Full coverage for buildDatasetItemsSizeHint() and formatBytes(), plus integration tests that check both tools surface the hints correctly.
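To illustrate what "rounding that carries at unit boundaries" means, here is a minimal sketch of a byte formatter. It is not the PR's `formatBytes()`; the name `formatBytesSketch` is hypothetical, and the choices of 1024-based units, whole numbers for B/KB, and one decimal for MB/GB are assumptions inferred from the example strings above.

```typescript
const UNITS = ['B', 'KB', 'MB', 'GB'] as const;

// Assumption: B and KB print as whole numbers, MB and GB with one decimal.
function decimalsFor(unitIndex: number): number {
    return unitIndex >= 2 ? 1 : 0;
}

function formatBytesSketch(bytes: number): string {
    if (!Number.isFinite(bytes) || bytes < 0) return '0 B';
    let value = bytes;
    let unit = 0;
    while (unit < UNITS.length - 1) {
        // Check the value as it would be displayed: if rounding pushes it
        // to 1024 or more, carry into the next unit instead of printing it.
        const displayed = Number(value.toFixed(decimalsFor(unit)));
        if (displayed < 1024) break;
        value /= 1024;
        unit += 1;
    }
    return `${value.toFixed(decimalsFor(unit))} ${UNITS[unit]}`;
}
```

The key design point is that the boundary check runs on the rounded value rather than the raw one, which is what lets 1048575 carry into MB instead of rounding up to 1024 within KB.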

Hints are soft advisory only — they do not truncate output, only inform the caller of estimated full-dataset size so they can adjust limit and fields before fetching.

https://claude.ai/code/session_01StznoKeKDSsjCzUFrgJBqu

Surface dataset byte size (Apify stats.inflatedBytes) so the model sizes its fetch before pulling items, instead of truncating output. Item count is not a size proxy: items can hold full page text/HTML.

- get-actor-run: add size and per-item to the SUCCEEDED summary, plus a large-output steer in nextStep; expose inflatedBytes in structured output.
- get-dataset: document stats.inflatedBytes in the description.
- get-dataset-items: append a self-extrapolated size hint (no cap) and note in the description that results can be large; prefer fields= and a small limit.
- Add formatBytes() helper.

inflatedBytes is returned by the API but undeclared in the apify-client type, so it is read defensively and used only for advisory hints, never as a guard.

Refs #878
https://claude.ai/code/session_01StznoKeKDSsjCzUFrgJBqu
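One way to read an undeclared field defensively is to treat the dataset object as `unknown` and narrow it step by step. The sketch below assumes only that the value lives at `stats.inflatedBytes`, as stated above; the helper name `readInflatedBytes` is hypothetical and this is not necessarily how the PR does it.

```typescript
// Hypothetical helper: returns the byte size if present and sane, else undefined.
function readInflatedBytes(dataset: unknown): number | undefined {
    if (typeof dataset !== 'object' || dataset === null) return undefined;
    const stats = (dataset as { stats?: unknown }).stats;
    if (typeof stats !== 'object' || stats === null) return undefined;
    const value = (stats as { inflatedBytes?: unknown }).inflatedBytes;
    // Accept only finite, non-negative numbers; anything else means "no hint".
    return typeof value === 'number' && Number.isFinite(value) && value >= 0
        ? value
        : undefined;
}
```

Returning `undefined` rather than throwing keeps the value advisory: a missing or malformed field simply means no size hint is shown.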

- get-dataset-items: serialize the page once and pass the byte count into buildDatasetItemsSizeHint, instead of re-serializing the items array (avoids double JSON.stringify of large payloads on the request path).
- get-dataset-items: clamp a lagging `total` to at least the returned page size, so the hint no longer reads "Returned 20 of 5 items" under eventual consistency.
- formatBytes: carry at unit boundaries so values like 1048575 render "1.0 MB" instead of the malformed "1024 KB"; covers fractional per-item sizes too.
- Add DATASET_SIZE_HINT_BYTES (soft hint threshold) separate from the TOOL_MAX_OUTPUT_CHARS truncation cap, and a shared NARROW_OUTPUT_HINT phrase used by both the run-response and items hints.

https://claude.ai/code/session_01StznoKeKDSsjCzUFrgJBqu
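The serialize-once idea can be sketched like this. The names `SerializedPage` and `serializePageOnce` are hypothetical, and measuring UTF-8 bytes with the standard `TextEncoder` is an assumption; the commit does not say how the byte count is computed.

```typescript
// Hypothetical result shape: the JSON text plus its size, computed together.
interface SerializedPage {
    json: string;
    byteCount: number;
}

// Serialize the page a single time; callers reuse both the text and the size
// instead of calling JSON.stringify again just to measure the payload.
function serializePageOnce(items: unknown[]): SerializedPage {
    const json = JSON.stringify(items);
    // Assumption: size is measured in UTF-8 bytes, not UTF-16 code units.
    const byteCount = new TextEncoder().encode(json).length;
    return { json, byteCount };
}
```

The `json` string goes into the tool response and `byteCount` goes into the size-hint calculation, so a large page is stringified only once per request.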

claude added 2 commits May 28, 2026 21:11

github-actions Bot assigned jirispilka May 29, 2026

github-actions Bot added labels t-ai (Issues owned by the AI team) and tested (Temporary label used only programmatically for some analytics) May 29, 2026


jirispilka wants to merge 2 commits into master from claude/nice-ride-HBpVv
