Add support for rate limiting #75

@heeki

Overview

Add rate limiting to agents for both model calls and tool calls. This lets users control costs, stop runaway agents, and enforce usage constraints at several levels (per invocation, per user, per time window). Rate limiting is essential for production deployments and multi-tenant environments.

Requirements

R1: Model Call Rate Limiting

Users should be able to configure rate limiting for model calls by setting token budgets. This includes:

  • Token budget limits per invocation (input + output tokens)
  • Token budget limits per user (daily/monthly)
  • Token budget limits per agent (daily/monthly)
  • Graceful handling when token budgets are exceeded
  • Real-time token usage tracking and enforcement
  • Different limits for different model providers (see issue #74, "Add support for alternate LLM providers")
  • Warning thresholds before hitting hard limits
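
A per-invocation token budget with a warning threshold could look roughly like the sketch below. The names `TokenBudget` and `TokenBudgetExceeded` are hypothetical and not part of the existing codebase; the sketch assumes token counts are reported by the model provider after each call.

```python
from dataclasses import dataclass


class TokenBudgetExceeded(Exception):
    """Raised when recording usage would exceed the hard limit."""


@dataclass
class TokenBudget:
    limit: int               # hard cap on input + output tokens
    warn_percent: int = 80   # warning threshold as a percentage of the limit
    used: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> bool:
        """Add usage; return True once the warning threshold has been reached."""
        total = self.used + input_tokens + output_tokens
        if total > self.limit:
            # Usage is left unchanged so the caller can report the last valid value.
            raise TokenBudgetExceeded(
                f"Token budget exceeded ({total} / {self.limit})"
            )
        self.used = total
        # Integer comparison avoids floating-point rounding at the threshold.
        return self.used * 100 >= self.limit * self.warn_percent
```

With `limit=10000`, recording 5,000 tokens returns `False`, reaching 8,000 returns `True` (warning), and any call that would pass 10,000 raises instead of updating the counter.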

R2: Tool Call Rate Limiting

Users should be able to configure rate limiting for tool calls, constraining the number of tool calls that are made. This includes:

  • Maximum tool calls per invocation
  • Maximum tool calls per user (hourly/daily)
  • Maximum tool calls per agent (hourly/daily)
  • Per-tool-type limits (e.g., limit expensive operations like web fetches)
  • Graceful handling when tool call limits are exceeded
  • Real-time tool call counting and enforcement
  • Circuit breaker patterns for repeated failures
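
For the circuit breaker item, a minimal sketch is shown below. It assumes a simple policy: open after a fixed number of consecutive failures, then allow calls again after a cooldown. The class name, thresholds, and injectable `clock` parameter are illustrative assumptions, not an agreed design.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a tool call may proceed."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

One breaker instance per tool type would keep a failing web fetch from blocking unrelated tools.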

Technical Considerations

Rate Limit Types

  • Per-Invocation Limits: Hard caps on a single agent invocation
    • Token budget (input + output)
    • Tool call count
    • Time limit (max execution time)
  • Per-User Limits: Quotas across all invocations by a user
    • Daily/monthly token budget
    • Hourly/daily tool call limits
    • Concurrent invocation limits
  • Per-Agent Limits: Quotas for a specific agent
    • Daily/monthly token budget for that agent
    • Tool call limits specific to agent behavior
  • Per-Tool Limits: Constraints on specific tools
    • Web fetch limits (prevent scraping abuse)
    • Bash execution limits (prevent resource exhaustion)
    • File operation limits
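
To illustrate how a per-invocation tool call cap and per-tool caps might combine, here is a small in-memory sketch. `ToolCallGate` and `try_acquire` are hypothetical names, and the tool names used as keys are assumptions.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


class ToolCallGate:
    def __init__(
        self, total_limit: int, per_tool_limits: Optional[Dict[str, int]] = None
    ) -> None:
        self.total_limit = total_limit
        self.per_tool_limits = per_tool_limits or {}
        self.counts: Counter = Counter()

    def try_acquire(self, tool_name: str) -> Tuple[bool, str]:
        """Count the call and return (True, "") if allowed, else (False, reason)."""
        total = sum(self.counts.values())
        if total >= self.total_limit:
            return False, f"Tool call limit reached ({total} / {self.total_limit})"
        tool_limit = self.per_tool_limits.get(tool_name)
        used = self.counts[tool_name]  # Counter returns 0 for unseen tools
        if tool_limit is not None and used >= tool_limit:
            return False, f"{tool_name} limit reached ({used} / {tool_limit})"
        self.counts[tool_name] += 1
        return True, ""
```

Denied calls are not counted, so a rejected web fetch does not consume the invocation-wide allowance.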

Configuration Management

  • Update etc/environment.sh to include rate limit configuration:
    • Default token budget per invocation
    • Default tool call limit per invocation
    • Per-user daily/monthly limits
    • Per-agent limits
    • Warning threshold percentages (e.g., warn at 80% of limit)
  • Support environment-specific limits (dev vs production)
  • Store user-specific overrides in database (RDS)
  • Allow admin users to configure custom limits per user/agent

Backend Changes

Rate Limit Store

  • Token Budget Tracking: Track token usage in real-time
    • Store current usage in memory (Redis/ElastiCache for distributed deployments)
    • Persist to database for historical tracking
    • Reset counters at appropriate intervals (daily/monthly)
  • Tool Call Tracking: Track tool call counts
    • Per-invocation counter (in-memory)
    • Per-user counter (Redis/database)
    • Per-tool-type counters
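
The counter and reset behavior could be prototyped in memory before moving to Redis. The sketch below assumes windows aligned to UTC hour, day, and month boundaries and timezone-aware timestamps; `next_reset` and `WindowedCounter` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple


def next_reset(now: datetime, period: str) -> datetime:
    """Return the start of the next hourly, daily, or monthly window in UTC."""
    now = now.astimezone(timezone.utc)
    if period == "hourly":
        return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    if period == "daily":
        return now.replace(hour=0, minute=0, second=0, microsecond=0) + timedelta(days=1)
    if period == "monthly":
        first = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        # Day 1 plus 32 days always lands in the following month.
        return (first + timedelta(days=32)).replace(day=1)
    raise ValueError(f"unknown period: {period}")


class WindowedCounter:
    def __init__(self, period: str) -> None:
        self.period = period
        self._usage: Dict[str, Tuple[int, datetime]] = {}  # key -> (count, reset_at)

    def increment(self, key: str, amount: int, now: datetime) -> int:
        """Add `amount` for `key`, resetting first if the window has rolled over."""
        count, reset_at = self._usage.get(key, (0, next_reset(now, self.period)))
        if now >= reset_at:
            count, reset_at = 0, next_reset(now, self.period)
        count += amount
        self._usage[key] = (count, reset_at)
        return count
```

Resetting lazily on the next increment avoids a background job, at the cost of stale values for keys that are never touched again.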

Enforcement Layer

  • Middleware: Create rate limiting middleware that:
    • Checks limits before processing requests
    • Updates counters during execution
    • Enforces hard limits and rejects requests when exceeded
    • Returns clear error messages with limit details
  • Streaming Support: For streaming responses, enforce limits mid-stream
    • Stop generation when token budget is reached
    • Append truncation message to response
  • Tool Call Gating: Before executing any tool call:
    • Check if tool call limit would be exceeded
    • Check if specific tool has per-tool limits
    • Allow or deny with clear error message
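
Mid-stream enforcement can be sketched as a generator that wraps the provider's chunk stream. The whitespace-based token counter below is only a placeholder assumption, since real tokenizers differ by provider, and the truncation message text is illustrative.

```python
from typing import Callable, Iterable, Iterator

TRUNCATION_NOTICE = "\n[Response truncated: token budget reached]"


def stream_with_budget(
    chunks: Iterable[str],
    remaining_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> Iterator[str]:
    """Yield chunks until the next one would exceed the remaining budget."""
    for chunk in chunks:
        cost = count_tokens(chunk)
        if cost > remaining_tokens:
            # Stop generation and append the truncation message.
            yield TRUNCATION_NOTICE
            return
        remaining_tokens -= cost
        yield chunk
```

Because the check runs before each chunk is yielded, the client never receives output beyond the budget.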

Database Schema

New tables or columns in existing tables:

  • rate_limits table:
    • user_id (or agent_id)
    • limit_type (token_daily, token_monthly, tool_calls_hourly, etc.)
    • current_usage
    • limit_value
    • reset_at (timestamp for when counter resets)
    • created_at, updated_at
  • Update invocations table:
    • tokens_used (input + output)
    • tool_calls_count
    • rate_limited (boolean flag)
    • limit_exceeded_type (which limit was hit)

Frontend Changes

  • Usage Dashboard: New page or section showing:
    • Current token usage vs limits (progress bars)
    • Current tool call usage vs limits
    • Historical usage trends (charts)
    • Cost implications of usage
  • Invocation Details: Show rate limit information:
    • Tokens used / Token budget
    • Tool calls made / Tool call limit
    • Warning indicators when approaching limits
  • Rate Limit Errors: User-friendly error messages:
    • "Token budget exceeded (5,000 / 5,000). Increase your limit or try a shorter conversation."
    • "Tool call limit reached (20 / 20). Wait 1 hour before retrying."
  • Admin Panel: For managing user/agent limits (if applicable)

Monitoring & Alerting

  • CloudWatch Metrics:
    • Rate limit hits by type
    • Token usage by user/agent
    • Tool call counts by type
    • Limit breach attempts
  • Alerts:
    • Notify users when approaching limits (80%, 90%)
    • Alert admins of unusual usage patterns
    • Track cost implications of usage

Security Considerations

  • Prevent Abuse: Rate limits guard against:
    • Cost runaway from buggy or malicious agents
    • Resource exhaustion attacks
    • Unintentional infinite loops in agent logic
  • Fair Usage: Multi-tenant environments need per-user isolation
  • Admin Overrides: Support emergency limit increases for critical use cases

Acceptance Criteria

AC1: Token Budget Limits

  • Per-invocation token budgets enforced
  • Per-user daily/monthly token limits enforced
  • Token usage tracked accurately across providers
  • Streaming responses stop when token budget reached
  • Clear error messages when token limits exceeded
  • Warning notifications at threshold percentages

AC2: Tool Call Limits

  • Per-invocation tool call limits enforced
  • Per-user hourly/daily tool call limits enforced
  • Per-tool-type limits supported
  • Tool execution blocked when limits exceeded
  • Clear error messages when tool call limits exceeded
  • Circuit breaker for repeated tool failures

AC3: Configuration

  • Rate limits configurable in etc/environment.sh
  • User-specific overrides stored in database
  • Environment-specific limits (dev vs prod) work
  • Admin API for managing limits
  • Documentation for all configuration options

AC4: Frontend

  • Usage dashboard shows current usage vs limits
  • Progress bars and charts for usage visualization
  • Rate limit errors displayed clearly
  • Invocation details show limit information
  • Real-time updates during long-running invocations

AC5: Database & Persistence

  • Rate limit tracking tables created
  • Counters persist across restarts
  • Historical usage data queryable
  • Counter resets work correctly (daily/monthly)
  • Migration scripts for schema changes

AC6: Testing

  • Unit tests for rate limit enforcement
  • Integration tests with different limit types
  • Load tests verify limits under concurrent requests
  • Test limit resets at boundaries (midnight, month-end)
  • Test streaming response truncation

Implementation Notes

Suggested Approach

Phase 1: Infrastructure & Configuration (Week 1)

  1. Design database schema for rate limit tracking
  2. Create migration scripts using SQLAlchemy
  3. Add configuration parameters to etc/environment.sh
  4. Implement in-memory counter service (consider Redis for distributed deployments)
  5. Create rate limit middleware framework

Phase 2: Token Budget Enforcement (Week 1-2)

  1. Implement token tracking in model invocation layer
  2. Add token budget checks before/during invocations
  3. Handle streaming response truncation
  4. Update cost tracking to include limit information
  5. Add per-user token limit enforcement
  6. Unit tests for token budget enforcement

Phase 3: Tool Call Enforcement (Week 2)

  1. Implement tool call counter in tool execution layer
  2. Add tool call limit checks before tool execution
  3. Implement per-tool-type limits
  4. Add circuit breaker for repeated failures
  5. Unit tests for tool call limit enforcement

Phase 4: Frontend Integration (Week 2-3)

  1. Create usage dashboard component
  2. Add progress bars for token and tool call usage
  3. Update invocation detail pages with limit info
  4. Implement error message UI for limit breaches
  5. Add real-time usage updates

Phase 5: Monitoring & Admin Tools (Week 3)

  1. Add CloudWatch metrics for rate limit events
  2. Create alert rules for unusual usage
  3. Build admin API for limit management (if needed)
  4. Add usage analytics and reporting
  5. Documentation for operators and users

Key Files to Modify/Create

Backend:

  • backend/middleware/ (new directory):
    • rate_limiter.py: Core rate limiting logic
    • token_budget.py: Token budget tracking
    • tool_call_limiter.py: Tool call limiting
  • backend/models/rate_limit.py: Database models for limits
  • backend/services/:
    • counter_service.py: In-memory counter management
    • limit_enforcer.py: Enforcement logic
  • backend/api/: Update endpoints to check limits
  • backend/database/migrations/: Add rate_limits table migration

Configuration:

  • etc/environment.sh: Add rate limit configuration
  • backend/config.py: Load and validate rate limit settings

Frontend:

  • frontend/src/components/UsageDashboard.tsx: Usage visualization
  • frontend/src/components/RateLimitProgress.tsx: Progress bar component
  • frontend/src/components/RateLimitError.tsx: Error message component
  • frontend/src/pages/InvocationDetailPage.tsx: Add limit information
  • frontend/src/api/rate-limits.ts: API client for rate limit data

Infrastructure:

  • iac/: SAM templates for ElastiCache Redis (if using for distributed counters)

Tests:

  • backend/tests/middleware/test_rate_limiter.py
  • backend/tests/services/test_counter_service.py
  • backend/tests/integration/test_rate_limits.py

Example Configuration

# etc/environment.sh additions

# Per-invocation limits
export DEFAULT_TOKEN_BUDGET=10000
export DEFAULT_TOOL_CALL_LIMIT=50
export DEFAULT_INVOCATION_TIMEOUT=300  # seconds

# Per-user limits
export USER_DAILY_TOKEN_LIMIT=100000
export USER_MONTHLY_TOKEN_LIMIT=1000000
export USER_HOURLY_TOOL_CALL_LIMIT=200

# Warning thresholds
export TOKEN_BUDGET_WARNING_THRESHOLD=80  # percent
export TOOL_CALL_WARNING_THRESHOLD=80

# Per-tool limits
export WEB_FETCH_LIMIT_PER_INVOCATION=10
export BASH_EXEC_LIMIT_PER_INVOCATION=20
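
On the backend side, backend/config.py would need to load and validate these variables. The following is one possible sketch covering three of them; the `RateLimitSettings` class, the helper names, and the fallback defaults (copied from the example values above) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RateLimitSettings:
    default_token_budget: int
    default_tool_call_limit: int
    token_budget_warning_threshold: int  # percent, 1-100


def _positive_int(env: Mapping[str, str], name: str, default: int) -> int:
    """Read `name` from env as a positive integer, falling back to `default`."""
    raw = env.get(name, str(default))
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value


def load_rate_limit_settings(env: Mapping[str, str] = os.environ) -> RateLimitSettings:
    threshold = _positive_int(env, "TOKEN_BUDGET_WARNING_THRESHOLD", 80)
    if threshold > 100:
        raise ValueError("TOKEN_BUDGET_WARNING_THRESHOLD must be at most 100")
    return RateLimitSettings(
        default_token_budget=_positive_int(env, "DEFAULT_TOKEN_BUDGET", 10000),
        default_tool_call_limit=_positive_int(env, "DEFAULT_TOOL_CALL_LIMIT", 50),
        token_budget_warning_threshold=threshold,
    )
```

Failing at startup on a malformed value is a deliberate choice here: a silently ignored limit is harder to notice than a crash.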

Error Response Format

{
  "error": "RateLimitExceeded",
  "message": "Token budget exceeded",
  "details": {
    "limit_type": "token_budget_per_invocation",
    "current_usage": 10000,
    "limit_value": 10000,
    "reset_at": null
  }
}
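
A helper that builds this payload might look like the sketch below. The function name is hypothetical, and serializing `reset_at` as an ISO 8601 string is an assumption; the example above only shows the `null` case.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def rate_limit_error(
    message: str,
    limit_type: str,
    current_usage: int,
    limit_value: int,
    reset_at: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable body matching the error response format."""
    return {
        "error": "RateLimitExceeded",
        "message": message,
        "details": {
            "limit_type": limit_type,
            "current_usage": current_usage,
            "limit_value": limit_value,
            # Per-invocation limits have no reset time, so this stays None (null).
            "reset_at": reset_at.isoformat() if reset_at else None,
        },
    }
```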

Database Schema Example

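CREATE TABLE rate_limits (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(255),                       -- set for per-user limits
    agent_id VARCHAR(255),                      -- set for per-agent limits
    limit_type VARCHAR(50) NOT NULL,            -- e.g. token_daily, token_monthly, tool_calls_hourly
    current_usage INTEGER NOT NULL DEFAULT 0,   -- NOT NULL so increments never operate on NULL
    limit_value INTEGER NOT NULL,
    reset_at TIMESTAMP,                         -- when the counter next resets
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);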

CREATE INDEX idx_rate_limits_user_id ON rate_limits(user_id);
CREATE INDEX idx_rate_limits_agent_id ON rate_limits(agent_id);
CREATE INDEX idx_rate_limits_reset_at ON rate_limits(reset_at);
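
Under concurrent requests, a read-then-write check can let two callers both pass the limit. One way to avoid that is a single guarded UPDATE, sketched here with Python's built-in sqlite3 module as a stand-in for the production database; the function name is hypothetical and the table is assumed to already exist with the columns above.

```python
import sqlite3


def try_consume(
    conn: sqlite3.Connection, user_id: str, limit_type: str, amount: int
) -> bool:
    """Add `amount` to current_usage only if the result stays within limit_value."""
    cursor = conn.execute(
        "UPDATE rate_limits "
        "SET current_usage = current_usage + ?, updated_at = CURRENT_TIMESTAMP "
        "WHERE user_id = ? AND limit_type = ? "
        "AND current_usage + ? <= limit_value",
        (amount, user_id, limit_type, amount),
    )
    conn.commit()
    # rowcount is 0 when the limit (or a missing row) blocked the update.
    return cursor.rowcount > 0
```

Because the comparison and the increment happen in one statement, the database serializes competing updates and the stored usage cannot exceed the limit.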

Testing Strategy

  • Unit Tests: Mock counters and test enforcement logic
  • Integration Tests: Test with real database and counters
  • Load Tests: Verify limits under concurrent load
  • Boundary Tests: Test counter resets at time boundaries
  • Streaming Tests: Verify mid-stream truncation works correctly
  • Multi-Tenant Tests: Verify user isolation of limits

Cost Tracking Integration

Rate limiting directly impacts costs. Update the existing cost tracking to:

  • Show usage as percentage of limits
  • Project monthly costs based on current usage
  • Alert when costs approach budget limits
  • Maintain consistency with formatCost functions in:
    • CostDashboardPage.tsx
    • InvocationDetailPage.tsx
    • LatencySummary.tsx
    • InvocationTable.tsx

Priority

High - Critical for cost control and production deployments

Estimated Effort

Large (3 weeks) - Requires middleware, database changes, frontend work, and comprehensive testing

Metadata

Labels

enhancement (New feature or request)
