feat: Surface dataset size hints to guide large-output fetches #932
Draft
jirispilka wants to merge 2 commits into
Surface dataset byte size (Apify stats.inflatedBytes) so the model sizes its fetch before pulling items, instead of truncating output. Item count is not a size proxy: items can hold full page text/HTML.

- get-actor-run: add size and per-item to the SUCCEEDED summary, plus a large-output steer in nextStep; expose inflatedBytes in structured output.
- get-dataset: document stats.inflatedBytes in the description.
- get-dataset-items: append a self-extrapolated size hint (no cap) and note in the description that results can be large; prefer fields= and a small limit.
- Add formatBytes() helper.

inflatedBytes is returned by the API but undeclared in the apify-client type, so it is read defensively and used only for advisory hints, never as a guard.

Refs #878
https://claude.ai/code/session_01StznoKeKDSsjCzUFrgJBqu
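As an illustration of what "read defensively" can mean for a field the client type does not declare, here is a minimal sketch. The helper names `readInflatedBytes` and `perItemBytes` are hypothetical and not taken from the PR; the only detail assumed from the commit message is that the value lives at `stats.inflatedBytes` on the dataset object.

```typescript
// Hypothetical helpers for illustration; names and shapes are assumptions,
// not the PR's actual code.

// Returns stats.inflatedBytes only when it is a finite, non-negative number.
// Anything else (missing stats, wrong type, NaN) yields undefined, so callers
// can simply skip the advisory hint instead of failing.
function readInflatedBytes(dataset: unknown): number | undefined {
    if (typeof dataset !== "object" || dataset === null) return undefined;
    const stats = (dataset as { stats?: unknown }).stats;
    if (typeof stats !== "object" || stats === null) return undefined;
    const value = (stats as { inflatedBytes?: unknown }).inflatedBytes;
    if (typeof value !== "number" || !Number.isFinite(value) || value < 0) {
        return undefined;
    }
    return value;
}

// Average bytes per item, or undefined when it cannot be computed.
function perItemBytes(
    inflatedBytes: number | undefined,
    itemCount: number,
): number | undefined {
    if (inflatedBytes === undefined || itemCount <= 0) return undefined;
    return inflatedBytes / itemCount;
}
```

Because the result is only ever used for hint text, returning `undefined` on any unexpected shape keeps the tool output unchanged rather than turning a missing field into an error.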
- get-dataset-items: serialize the page once and pass the byte count into buildDatasetItemsSizeHint, instead of re-serializing the items array (avoids double JSON.stringify of large payloads on the request path).
- get-dataset-items: clamp a lagging `total` to at least the returned page size, so the hint no longer reads "Returned 20 of 5 items" under eventual consistency.
- formatBytes: carry at unit boundaries so values like 1048575 render "1.0 MB" instead of the malformed "1024 KB"; covers fractional per-item sizes too.
- Add DATASET_SIZE_HINT_BYTES (soft hint threshold) separate from the TOOL_MAX_OUTPUT_CHARS truncation cap, and a shared NARROW_OUTPUT_HINT phrase used by both the run-response and items hints.

https://claude.ai/code/session_01StznoKeKDSsjCzUFrgJBqu
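The unit-boundary carry can be sketched as follows. This is one possible way to get the behavior described, not the PR's implementation: it assumes 1024-based units, whole numbers for B and KB, one decimal for MB and GB (matching the examples "27 KB" and "1.0 MB"), and an assumed fallback of "0 B" for invalid input.

```typescript
// Illustrative sketch only; the PR's formatBytes() may differ in detail.
const UNITS = ["B", "KB", "MB", "GB"] as const;

// Assumption: B and KB are shown as whole numbers, MB and GB with one decimal.
function decimalsFor(unitIndex: number): number {
    return unitIndex >= 2 ? 1 : 0;
}

function formatBytes(bytes: number): string {
    // Assumption: invalid or negative input falls back to "0 B".
    if (!Number.isFinite(bytes) || bytes < 0) return "0 B";
    let value = bytes;
    let unit = 0;
    while (unit < UNITS.length - 1) {
        // Compare the *rounded* value against 1024, so a value that would
        // display as "1024 KB" is carried into the next unit instead.
        const rounded = Number(value.toFixed(decimalsFor(unit)));
        if (rounded < 1024) break;
        value /= 1024;
        unit += 1;
    }
    return `${value.toFixed(decimalsFor(unit))} ${UNITS[unit]}`;
}
```

Checking the rounded value rather than the raw one is what makes 1048575 come out as "1.0 MB": the raw value is just under 1024 KB, but it would round to 1024 for display. The same check handles fractional inputs such as an average per-item size of 1023.6 bytes.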
Closes #878
TL;DR
Problem: Dataset items can be huge (full page text/HTML), so fetching them blows the LLM's context window. Item count is no guide — measured datasets run ~27 KB/item, so even a few items exceed the limit, and the model has no way to know before it fetches.
Fix: Tell the model the size up front instead of truncating.
`get-actor-run` and `get-dataset` surface the dataset's byte size (`stats.inflatedBytes`), and `get-dataset-items` appends an estimated full-dataset size with a steer to narrow via `fields=` or page with `offset`. Advisory only: nothing is cut, so large-context clients can still pull everything.

Add advisory size hints to dataset tools so callers can narrow fetches before hitting output limits. Hints extrapolate full-dataset byte size from the returned page and steer toward `fields=` projection or pagination when large.

Key changes:
- `buildDatasetItemsSizeHint()`: New function that estimates full-dataset size from returned bytes/count and appends a hint to `get-dataset-items` output when the projected size exceeds `DATASET_SIZE_HINT_BYTES` (50 KB). Clamps the total count to avoid "N of fewer" when Apify's eventually consistent metadata lags.
- `get-dataset-items` response: Appends the size hint to the JSON output; the description now mentions large-item handling.
- `get-actor-run` response: Surfaces `inflatedBytes` from dataset stats in structured output and the `RunDataset` type. Adds `datasetSizeSummarySuffix()` (shows "~2.3 MB, ~50 KB/item") and `datasetSizeNextStepHint()` (steers to narrow when large) to the succeeded summary/nextStep.
- `formatBytes()` utility: New function for human-readable byte formatting ("512 B", "27 KB", "1.2 MB", "3.4 GB") with rounding that carries at unit boundaries.
- `DATASET_SIZE_HINT_BYTES` (50 KB threshold) and `NARROW_OUTPUT_HINT` (shared steer text).
- `buildDatasetItemsSizeHint()`, `formatBytes()`, and integration tests for both tools surfacing hints correctly.

Hints are soft advisory only: they do not truncate output, only inform the caller of the estimated full-dataset size so they can adjust `limit` and `fields` before fetching.

https://claude.ai/code/session_01StznoKeKDSsjCzUFrgJBqu
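The extrapolation and clamping described above can be sketched roughly as below. Everything here is illustrative: the function names, the parameter order, the hint wording, the `NARROW_OUTPUT_HINT` text, and the choice of 50 * 1024 for the threshold are assumptions, since the description only states "50 KB" and does not show the real signature of `buildDatasetItemsSizeHint()`.

```typescript
// Illustrative sketch; not the PR's buildDatasetItemsSizeHint().
// Assumption: "50 KB" means 50 * 1024 bytes.
const DATASET_SIZE_HINT_BYTES = 50 * 1024;
// Assumption: placeholder wording for the shared steer text.
const NARROW_OUTPUT_HINT =
    "Narrow the output with fields= or fetch smaller pages with limit/offset.";

interface SizeEstimate {
    total: number;          // item count after clamping
    perItemBytes: number;   // average bytes per returned item
    projectedBytes: number; // extrapolated size of the full dataset
}

// pageBytes is the byte length of the page serialized once by the caller.
function estimateDatasetSize(
    pageBytes: number,
    returnedCount: number,
    reportedTotal: number,
): SizeEstimate | undefined {
    if (returnedCount <= 0 || pageBytes < 0) return undefined;
    // Clamp a lagging total so it is never below what was actually returned.
    const total = Math.max(reportedTotal, returnedCount);
    const perItemBytes = pageBytes / returnedCount;
    return { total, perItemBytes, projectedBytes: perItemBytes * total };
}

// Returns a hint string only when the projected size exceeds the soft
// threshold; never truncates or alters the items themselves.
function buildSizeHintSketch(
    pageBytes: number,
    returnedCount: number,
    reportedTotal: number,
): string | undefined {
    const estimate = estimateDatasetSize(pageBytes, returnedCount, reportedTotal);
    if (!estimate || estimate.projectedBytes <= DATASET_SIZE_HINT_BYTES) {
        return undefined;
    }
    const approxKb = Math.round(estimate.projectedBytes / 1024);
    return (
        `Returned ${returnedCount} of ${estimate.total} items; ` +
        `estimated full dataset size ~${approxKb} KB. ${NARROW_OUTPUT_HINT}`
    );
}
```

For example, a page of 20 items totalling 540,000 bytes with a lagging reported total of 5 gives a clamped total of 20 and a projected size of 540,000 bytes, so a hint is produced; a 1,000-byte page of 10 items out of 10 stays under the threshold and produces no hint. Taking `pageBytes` as a parameter mirrors the commit note about serializing the page once and passing the byte count in.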