Add prompt caching for model inference cost reduction
Overview
Add prompt caching support to reduce costs associated with model inference. Prompt caching allows the system prompt and repeated context to be cached across invocations, significantly reducing input token costs for multi-turn conversations. The implementation must account for differences across model families — Anthropic Claude models support prompt caching natively via cache control breakpoints, while Amazon Nova models do not currently support this feature. Users should be able to enable caching per agent or at the model family level, and should see reporting on cache hit rates, efficiency gains, and cost savings.
This issue depends on issue #30 for token usage tracking and cost estimation infrastructure.
Context
Current State
The Strands BedrockModel is initialized with only model_id, max_tokens, and streaming=True in agents/strands_agent/src/agent.py (lines 32-37) — no caching parameters are passed
SUPPORTED_MODELS in backend/app/routers/agents.py (lines 52-63) lists 5 Anthropic Claude models and 5 Amazon Nova models with model_id, display_name, group, and max_tokens, but no caching capability flags
The AGENT_CONFIG_JSON structure contains system_prompt, model_id, max_tokens, and integrations — no caching configuration
The AgentConfig dataclass in agents/strands_agent/src/config.py has no caching-related fields
The invocation flow passes only the prompt text through the API chain (invoke_agent_runtime() in backend/app/services/agentcore.py), with no cache control directives
The system prompt is static per agent deployment, making it an ideal candidate for caching since it does not change between invocations
No references to prompt caching, cache_control, or cache breakpoints exist in the codebase (the only "cache" references are HTTP Cache-Control: no-cache for SSE streaming and __pycache__ cleanup during artifact builds)
The hook system (BeforeInvocationEvent, AfterInvocationEvent) in the Strands agent provides injection points at invocation boundaries
Key Files
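agents/strands_agent/src/agent.py: builds the Strands Agent with BedrockModel (no cache params)
agents/strands_agent/src/config.py: AgentConfig dataclass (no caching fields)
agents/strands_agent/src/handler.py: entry point, invokes agent.stream_async(prompt)
agents/strands_agent/requirements.txt: dependencies strands-agents[a2a]>=0.1.0 and boto3>=1.35.0
backend/app/routers/agents.py: SUPPORTED_MODELS, _deploy_agent(), AGENT_CONFIG_JSON construction (lines 520-529)
backend/app/routers/invocations.py: SSE invocation endpoint
backend/app/services/agentcore.py: invoke_agent_runtime() boto3 call (lines 74-162)
backend/app/models/invocation.py: Invocation ORM model (timing + content, no cache metrics)
frontend/src/components/LatencySummary.tsx: timing metrics display
frontend/src/pages/AgentDetailPage.tsx: agent detail with invocation panel
Technology Stack
Strands Agents (strands-agents), BedrockModel
AWS SDK: boto3 (bedrock-agentcore client for invocation)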
Prompt Caching by Model Family
Anthropic Claude: Supports prompt caching via cache_control breakpoints in the messages API. Cached input tokens are billed at a reduced rate (typically 90% discount). A cache write occurs on the first request; subsequent requests with the same prefix get cache reads.
Amazon Nova: Does not currently support prompt caching. Caching configuration should be silently skipped or disabled for Nova models.
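For reference, a minimal sketch of what a cache breakpoint on the system prompt could look like. The block layout follows Anthropic's Messages API, where a `cache_control` marker caches the prefix up to and including the block that carries it. `build_system_blocks` is a hypothetical helper, and whether the Strands BedrockModel forwards this shape unchanged is not established here; the Bedrock API the wrapper calls may use a different block format.

```python
from typing import Any


def build_system_blocks(system_prompt: str, caching: bool) -> list[dict[str, Any]]:
    """Return system content blocks, optionally ending in a cache breakpoint.

    Hypothetical helper: the layout mirrors Anthropic's Messages API and is
    not taken from the existing codebase.
    """
    block: dict[str, Any] = {"type": "text", "text": system_prompt}
    if caching:
        # Everything up to and including this block becomes the cacheable prefix.
        block["cache_control"] = {"type": "ephemeral"}
    return [block]


if __name__ == "__main__":
    print(build_system_blocks("You are a helpful support agent.", caching=True))
    print(build_system_blocks("You are a helpful support agent.", caching=False))
```

Because the system prompt is static per deployment, the prefix stays byte-identical between invocations, which is the condition for later requests to read from the cache.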
Requirements
R1: Enable prompt caching per agent or model family
Users should be able to enable prompt caching for individual agents or at the model family level. The implementation must correctly handle differences between model families.
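Add a supports_prompt_caching boolean flag to each entry in SUPPORTED_MODELS:
Anthropic Claude models: true
Amazon Nova models: false
Expose the flag through the models API (e.g. GET /api/pricing/models from issue #30) so the frontend knows which models support caching
Add a prompt_caching_enabled field to AgentConfig (dataclass in config.py) with a default of false
Add a prompt_caching_enabled field to AgentDeployRequest and include it in the AGENT_CONFIG_JSON environment variable during deployment
Add a toggle to AgentRegistrationForm.tsx to enable/disable prompt caching
Include the prompt_caching_enabled field in JSON import/export (issue #27)
When prompt_caching_enabled is true and the model supports it, configure the Strands BedrockModel or the underlying Bedrock API call with cache control parameters:
Apply a cache breakpoint after the system prompt so it is cached across invocations within the same session
If the Strands SDK exposes cache control parameters on BedrockModel, use them directly
If the Strands SDK does not expose caching natively, explore passing additional model kwargs or extending the model wrapper to inject cache_control in the messages payload
When prompt_caching_enabled is true but the model does not support caching (e.g. the user switches models after enabling), log a warning and proceed without caching; do not fail the invocation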
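A minimal sketch of the capability flag and the warn-and-skip behavior described above. The model IDs, the dict layout of SUPPORTED_MODELS, the trimmed AgentConfig, and the `should_apply_caching` helper are all illustrative assumptions rather than the existing code.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("agent.caching")

# Illustrative entries only: real model IDs and the actual SUPPORTED_MODELS
# layout in agents.py may differ.
SUPPORTED_MODELS = {
    "example.claude-model": {"group": "Anthropic Claude", "supports_prompt_caching": True},
    "example.nova-model": {"group": "Amazon Nova", "supports_prompt_caching": False},
}


@dataclass
class AgentConfig:
    """Trimmed stand-in for the real dataclass in config.py."""
    system_prompt: str
    model_id: str
    max_tokens: int
    prompt_caching_enabled: bool = False  # default of false, per R1


def should_apply_caching(config: AgentConfig) -> bool:
    """Decide whether cache control should be configured for this agent."""
    if not config.prompt_caching_enabled:
        return False
    model = SUPPORTED_MODELS.get(config.model_id, {})
    if not model.get("supports_prompt_caching", False):
        # Enabled but unsupported (e.g. the model was switched after enabling):
        # warn and continue without caching instead of failing the invocation.
        logger.warning(
            "Prompt caching is enabled but %s does not support it; proceeding without caching",
            config.model_id,
        )
        return False
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    claude = AgentConfig("You are helpful.", "example.claude-model", 4096, True)
    nova = AgentConfig("You are helpful.", "example.nova-model", 4096, True)
    print(should_apply_caching(claude), should_apply_caching(nova))
```

Keeping the decision in one function means the agent builder only needs a single boolean, and an unknown model ID falls through to the same safe path as an unsupported one.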
R2: Cache efficiency reporting
Users should see backend reporting on cache hit rates, efficiency gains, and cost savings from prompt caching.
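Extend the Invocation model with caching metrics (builds on the token fields from issue #30):
cache_read_tokens (integer, nullable): number of tokens served from cache
cache_write_tokens (integer, nullable): number of tokens written to cache on first request
cache_hit (boolean, nullable): whether the invocation resulted in a cache hit
Read the cache metrics from usage.cache_creation_input_tokens and usage.cache_read_input_tokens in the response metadata
Add an endpoint (GET /api/agents/{agent_id}/cache-stats) that returns aggregate caching statistics:
Estimated savings: (cache_read_tokens * (regular_input_price - cached_input_price)) / 1000
Show aggregate cache savings across all agents in the group
Highlight which agents benefit most from caching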
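The reporting pieces can be sketched as below: mapping the usage metadata onto the proposed Invocation fields, applying the savings formula, and aggregating per agent. `InvocationCacheMetrics`, `metrics_from_usage`, and `cache_stats` are hypothetical names, the prices in the usage example are made-up values, and two points are assumptions: prices are per 1,000 tokens (implied by the division by 1000), and hit rate means hits divided by invocations that reported cache data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InvocationCacheMetrics:
    """Stand-in for the three nullable columns proposed for Invocation."""
    cache_read_tokens: Optional[int]
    cache_write_tokens: Optional[int]
    cache_hit: Optional[bool]


def metrics_from_usage(usage: dict) -> InvocationCacheMetrics:
    """Map usage metadata onto the proposed Invocation fields."""
    read = usage.get("cache_read_input_tokens")
    write = usage.get("cache_creation_input_tokens")
    if read is None and write is None:
        # No cache data reported (caching off or unsupported model): keep nulls.
        return InvocationCacheMetrics(None, None, None)
    return InvocationCacheMetrics(read or 0, write or 0, (read or 0) > 0)


def estimated_savings(cache_read_tokens: int, regular_input_price: float,
                      cached_input_price: float) -> float:
    """Savings formula from R2, with prices assumed to be per 1,000 tokens."""
    return (cache_read_tokens * (regular_input_price - cached_input_price)) / 1000


def cache_stats(invocations: list, regular_input_price: float,
                cached_input_price: float) -> dict:
    """Aggregate hit rate, cache reads, and savings over tracked invocations."""
    tracked = [m for m in invocations if m.cache_hit is not None]
    hits = sum(1 for m in tracked if m.cache_hit)
    total_reads = sum(m.cache_read_tokens or 0 for m in tracked)
    return {
        "hit_rate": hits / len(tracked) if tracked else 0.0,
        "total_cache_read_tokens": total_reads,
        "estimated_savings": estimated_savings(
            total_reads, regular_input_price, cached_input_price
        ),
    }


if __name__ == "__main__":
    first = metrics_from_usage({"cache_creation_input_tokens": 1200, "cache_read_input_tokens": 0})
    second = metrics_from_usage({"cache_creation_input_tokens": 0, "cache_read_input_tokens": 1200})
    # Hypothetical per-1,000-token prices, for illustration only.
    print(cache_stats([first, second], regular_input_price=0.003, cached_input_price=0.0003))
```

Invocations without cache data are excluded from the hit rate rather than counted as misses, so agents on unsupported models do not drag the aggregate down.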
R3: Cache status visibility and usage impact
Users should see that prompt caching is enabled on their agents and understand the impact on token usage and cost.
Display a visual indicator on agent cards (AgentCard.tsx) when prompt caching is enabled:
A small badge or icon (e.g. a cache/lightning icon) next to the model name
Tooltip showing "Prompt caching enabled"
On the agent detail page (AgentDetailPage.tsx), add a caching section:
Show whether caching is enabled/disabled with a toggle to change it (triggers redeployment of AGENT_CONFIG_JSON)
Display cache statistics: hit rate, total cache reads, estimated savings
Show a per-session breakdown — cache writes typically occur on the first invocation of a session, with subsequent invocations in the same session getting cache reads
In the invocation detail (InvocationDetailPage.tsx) and LatencySummary:
Show cache read/write token counts alongside input/output token counts
If a cache hit occurred, highlight the cost savings for that invocation (e.g. "Saved $X.XX from cache")
Differentiate between cached and non-cached input tokens in the token breakdown
In the sessions table on the agent detail page:
Add a column or indicator showing cache utilization per session
Sessions with caching should show the ratio of cached vs. uncached tokens
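The per-session ratio and the savings label could be computed on the backend before they reach the UI. The sketch below is one way to do that. `InvocationTokens`, `session_cache_utilization`, and `savings_label` are hypothetical names, and it assumes `input_tokens` counts only non-cached input tokens, which should be confirmed against the token fields from issue #30.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class InvocationTokens:
    session_id: str
    input_tokens: int       # assumed: input tokens NOT served from cache
    cache_read_tokens: int  # input tokens served from cache


def session_cache_utilization(invocations: list) -> dict:
    """Return {session_id: fraction of input tokens served from cache}."""
    cached = defaultdict(int)
    uncached = defaultdict(int)
    for inv in invocations:
        cached[inv.session_id] += inv.cache_read_tokens
        uncached[inv.session_id] += inv.input_tokens
    utilization = {}
    for session_id in uncached:
        total = cached[session_id] + uncached[session_id]
        # Guard against sessions that recorded no input tokens at all.
        utilization[session_id] = cached[session_id] / total if total else 0.0
    return utilization


def savings_label(saved_usd: float) -> str:
    """Format the per-invocation savings text shown in the invocation detail."""
    return f"Saved ${saved_usd:.2f} from cache"


if __name__ == "__main__":
    rows = [
        InvocationTokens("s1", input_tokens=300, cache_read_tokens=0),
        InvocationTokens("s1", input_tokens=100, cache_read_tokens=900),
    ]
    print(session_cache_utilization(rows))
    print(savings_label(0.5))
```

Per-invocation savings are often well under one cent, so two decimal places may round to $0.00; the display may need more precision or a per-session rollup.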
Testing
Run backend tests: cd backend && make test
Run frontend typecheck: cd frontend && npx tsc --noEmit
Verify caching toggle:
Select a Claude model → caching toggle is enabled and functional
Select a Nova model → caching toggle is disabled with tooltip
Enable caching → deploy agent → AGENT_CONFIG_JSON includes prompt_caching_enabled: true
Verify cache behavior during invocation:
Deploy an agent with caching enabled (Claude model)
First invocation in a session: expect cache write tokens (system prompt cached)
Subsequent invocations in the same session: expect cache read tokens (cache hit)
Verify cache_read_tokens, cache_write_tokens, and cache_hit are stored on the Invocation record
Verify cache reporting:
GET /api/agents/{agent_id}/cache-stats returns accurate hit rate and savings
Agent detail page displays cache statistics
Cost dashboard includes caching savings section
Verify model family handling:
Agent with Nova model and caching enabled → caching silently skipped, no errors
Agent switches from Claude to Nova → caching toggle auto-disables
Verify UI indicators:
Agent card shows cache badge when enabled
Invocation detail shows cache token breakdown
LatencySummary shows cache savings
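The AGENT_CONFIG_JSON check in the toggle test can also be covered by a unit test. The sketch below round-trips a config through JSON and confirms that payloads written before this change still parse with caching off. The AgentConfig shape (including the type of `integrations`) and both helper functions are assumptions, not the existing code.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentConfig:
    system_prompt: str
    model_id: str
    max_tokens: int
    integrations: list = field(default_factory=list)  # shape assumed
    prompt_caching_enabled: bool = False


def to_config_json(config: AgentConfig) -> str:
    """Serialize the config as it might be placed in AGENT_CONFIG_JSON."""
    return json.dumps(asdict(config))


def from_config_json(raw: str) -> AgentConfig:
    """Parse AGENT_CONFIG_JSON, defaulting caching to off when the key is absent."""
    data = json.loads(raw)
    return AgentConfig(
        system_prompt=data["system_prompt"],
        model_id=data["model_id"],
        max_tokens=data["max_tokens"],
        integrations=data.get("integrations", []),
        prompt_caching_enabled=bool(data.get("prompt_caching_enabled", False)),
    )


def test_enabled_flag_survives_round_trip():
    config = AgentConfig("You are helpful.", "example.claude-model", 4096,
                         prompt_caching_enabled=True)
    payload = json.loads(to_config_json(config))
    assert payload["prompt_caching_enabled"] is True
    assert from_config_json(json.dumps(payload)).prompt_caching_enabled is True


def test_legacy_payload_defaults_to_disabled():
    legacy = json.dumps({"system_prompt": "p", "model_id": "m", "max_tokens": 1})
    assert from_config_json(legacy).prompt_caching_enabled is False


if __name__ == "__main__":
    test_enabled_flag_survives_round_trip()
    test_legacy_payload_defaults_to_disabled()
    print("ok")
```

The legacy-payload case matters because agents deployed before this change will not have the field in their environment variable.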
Out of Scope
Cross-session cache sharing (caching is per-session within AgentCore)
Tool result caching or response caching (only input/system prompt caching)