Add support for rate limiting #75

# Add support for rate limiting

## Overview
Add rate limiting for agents, covering both model calls and tool calls. This lets users control costs, stop runaway agents, and enforce usage constraints at several levels (per invocation, per user, per time window). Rate limiting is essential for production deployments and multi-tenant environments.

## Requirements

### R1: Model Call Rate Limiting
Users should be able to configure rate limiting for model calls by setting token budgets. This includes:
- Token budget limits per invocation (input + output tokens)
- Token budget limits per user (daily/monthly)
- Token budget limits per agent (daily/monthly)
- Graceful handling when token budgets are exceeded
- Real-time token usage tracking and enforcement
- Different limits for different model providers (see Issue #74)
- Warning thresholds before hitting hard limits
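
As a rough illustration of how a per-invocation budget with a warning threshold might behave, here is a minimal sketch. The names `TokenBudget` and `BudgetStatus` are hypothetical and not part of the existing codebase. It assumes a budget counts as exceeded once usage reaches the limit.

```python
from dataclasses import dataclass
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    WARNING = "warning"
    EXCEEDED = "exceeded"


@dataclass
class TokenBudget:
    """Hypothetical per-invocation token budget (input + output tokens)."""

    limit: int
    warning_pct: int = 80  # warn once usage reaches this percentage of the limit
    used: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> BudgetStatus:
        """Add the usage of one model call and report the resulting status."""
        self.used += input_tokens + output_tokens
        if self.used >= self.limit:
            return BudgetStatus.EXCEEDED
        # Integer comparison avoids floating-point rounding at the threshold.
        if self.used * 100 >= self.limit * self.warning_pct:
            return BudgetStatus.WARNING
        return BudgetStatus.OK

    @property
    def remaining(self) -> int:
        """Tokens left before the hard limit, never negative."""
        return max(self.limit - self.used, 0)
```

A caller would invoke `record` after each model call and stop the invocation gracefully on `EXCEEDED`. Per-user and per-agent budgets need shared storage rather than an in-process object.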

### R2: Tool Call Rate Limiting
Users should be able to configure rate limiting for tool calls, constraining the number of tool calls that are made. This includes:
- Maximum tool calls per invocation
- Maximum tool calls per user (daily/hourly)
- Maximum tool calls per agent (daily/hourly)
- Per-tool-type limits (e.g., limit expensive operations like web fetches)
- Graceful handling when tool call limits are exceeded
- Real-time tool call counting and enforcement
- Circuit breaker patterns for repeated failures
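
To make the circuit breaker requirement concrete, the sketch below shows one common shape of the pattern. After a number of consecutive failures the breaker opens and blocks calls; after a cooldown it lets a trial call through. The class name, thresholds, and injected clock are illustrative assumptions, not an existing API.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical breaker guarding one tool against repeated failures."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a tool call may proceed right now."""
        if self._opened_at is None:
            return True  # closed: calls flow normally
        # Open: block until the cooldown has elapsed, then allow a trial call.
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        """A successful call closes the breaker and clears the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Keeping one breaker per tool type would let a failing web fetch be suspended without blocking other tools.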

## Technical Considerations

### Rate Limit Types
- **Per-Invocation Limits**: Hard caps on a single agent invocation
  - Token budget (input + output)
  - Tool call count
  - Time limit (max execution time)
- **Per-User Limits**: Quotas across all invocations by a user
  - Daily/monthly token budget
  - Hourly/daily tool call limits
  - Concurrent invocation limits
- **Per-Agent Limits**: Quotas for a specific agent
  - Daily/monthly token budget for that agent
  - Tool call limits specific to agent behavior
- **Per-Tool Limits**: Constraints on specific tools
  - Web fetch limits (prevent scraping abuse)
  - Bash execution limits (prevent resource exhaustion)
  - File operation limits
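
The per-invocation and per-tool limits above can be combined in a single gate. The following is a minimal sketch under assumed names: `ToolCallLimiter`, the tool names, and the returned limit-type strings are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


class ToolCallLimiter:
    """Hypothetical per-invocation gate with optional per-tool caps."""

    def __init__(self, max_calls: int, per_tool: Optional[Dict[str, int]] = None) -> None:
        self.max_calls = max_calls
        self.per_tool = dict(per_tool or {})
        self.counts: Counter = Counter()

    def try_acquire(self, tool_name: str) -> Optional[str]:
        """Count the call and return None if allowed.

        If a limit blocks the call, return the name of that limit and
        leave the counters unchanged.
        """
        if sum(self.counts.values()) >= self.max_calls:
            return "tool_calls_per_invocation"
        tool_limit = self.per_tool.get(tool_name)
        if tool_limit is not None and self.counts[tool_name] >= tool_limit:
            return f"tool_calls_per_tool:{tool_name}"
        self.counts[tool_name] += 1
        return None


limiter = ToolCallLimiter(max_calls=3, per_tool={"web_fetch": 1})
print(limiter.try_acquire("web_fetch"))  # None (allowed)
print(limiter.try_acquire("web_fetch"))  # tool_calls_per_tool:web_fetch
```

Returning the name of the blocking limit makes it straightforward to fill in a field such as `limit_exceeded_type` later.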

### Configuration Management
- Update `etc/environment.sh` to include rate limit configuration:
  - Default token budget per invocation
  - Default tool call limit per invocation
  - Per-user daily/monthly limits
  - Per-agent limits
  - Warning threshold percentages (e.g., warn at 80% of limit)
- Support environment-specific limits (dev vs production)
- Store user-specific overrides in database (RDS)
- Allow admin users to configure custom limits per user/agent

### Backend Changes

#### Rate Limit Store
- **Token Budget Tracking**: Track token usage in real-time
  - Store current usage in memory (Redis/ElastiCache for distributed deployments)
  - Persist to database for historical tracking
  - Reset counters at appropriate intervals (daily/monthly)
- **Tool Call Tracking**: Track tool call counts
  - Per-invocation counter (in-memory)
  - Per-user counter (Redis/database)
  - Per-tool-type counters
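
A simple way to picture the counter store is a fixed-window counter keyed by subject and limit type. The sketch below is an in-process stand-in. `InMemoryCounterStore` is a hypothetical name, and a distributed deployment would keep the same state in Redis instead. It assumes the window starts at first use; calendar-aligned resets (midnight, month start) would compute `reset_at` differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple


@dataclass
class _Window:
    usage: int
    reset_at: datetime


class InMemoryCounterStore:
    """Hypothetical fixed-window counters keyed by (subject, limit_type)."""

    def __init__(self) -> None:
        self._windows: Dict[Tuple[str, str], _Window] = {}

    def add(self, subject: str, limit_type: str, amount: int,
            window: timedelta, now: datetime) -> int:
        """Add `amount` to the counter and return the usage in the current window."""
        key = (subject, limit_type)
        current = self._windows.get(key)
        if current is None or now >= current.reset_at:
            # First use, or the previous window expired: start a fresh one.
            current = _Window(usage=0, reset_at=now + window)
            self._windows[key] = current
        current.usage += amount
        return current.usage


store = InMemoryCounterStore()
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(store.add("user-1", "tool_calls_hourly", 5, timedelta(hours=1), t0))  # 5
```

Passing `now` explicitly keeps the store deterministic, which helps when testing resets at time boundaries.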

#### Enforcement Layer
- **Middleware**: Create rate limiting middleware that:
  - Checks limits before processing requests
  - Updates counters during execution
  - Enforces hard limits and rejects requests when exceeded
  - Returns clear error messages with limit details
- **Streaming Support**: For streaming responses, enforce limits mid-stream
  - Stop generation when token budget is reached
  - Append truncation message to response
- **Tool Call Gating**: Before executing any tool call:
  - Check if tool call limit would be exceeded
  - Check if specific tool has per-tool limits
  - Allow or deny with clear error message

#### Database Schema
New tables or columns in existing tables:
- `rate_limits` table:
  - `user_id` (or agent_id)
  - `limit_type` (token_daily, token_monthly, tool_calls_hourly, etc.)
  - `current_usage`
  - `limit_value`
  - `reset_at` (timestamp for when counter resets)
  - `created_at`, `updated_at`
- Update `invocations` table:
  - `tokens_used` (input + output)
  - `tool_calls_count`
  - `rate_limited` (boolean flag)
  - `limit_exceeded_type` (which limit was hit)

### Frontend Changes
- **Usage Dashboard**: New page or section showing:
  - Current token usage vs limits (progress bars)
  - Current tool call usage vs limits
  - Historical usage trends (charts)
  - Cost implications of usage
- **Invocation Details**: Show rate limit information:
  - Tokens used / Token budget
  - Tool calls made / Tool call limit
  - Warning indicators when approaching limits
- **Rate Limit Errors**: User-friendly error messages:
  - "Token budget exceeded (5,000 / 5,000). Increase your limit or try a shorter conversation."
  - "Tool call limit reached (20 / 20). Wait 1 hour before retrying."
- **Admin Panel**: For managing user/agent limits (if applicable)

### Monitoring & Alerting
- **CloudWatch Metrics**:
  - Rate limit hits by type
  - Token usage by user/agent
  - Tool call counts by type
  - Limit breach attempts
- **Alerts**:
  - Notify users when approaching limits (80%, 90%)
  - Alert admins of unusual usage patterns
  - Track cost implications of usage

### Security Considerations
- **Prevent Abuse**: Rate limits prevent:
  - Runaway costs from buggy or malicious agents
  - Resource exhaustion attacks
  - Unintentional infinite loops in agent logic
- **Fair Usage**: Multi-tenant environments need per-user isolation
- **Admin Overrides**: Support emergency limit increases for critical use cases

## Acceptance Criteria

### AC1: Token Budget Limits
- [ ] Per-invocation token budgets enforced
- [ ] Per-user daily/monthly token limits enforced
- [ ] Token usage tracked accurately across providers
- [ ] Streaming responses stop when token budget reached
- [ ] Clear error messages when token limits exceeded
- [ ] Warning notifications at threshold percentages

### AC2: Tool Call Limits
- [ ] Per-invocation tool call limits enforced
- [ ] Per-user hourly/daily tool call limits enforced
- [ ] Per-tool-type limits supported
- [ ] Tool execution blocked when limits exceeded
- [ ] Clear error messages when tool call limits exceeded
- [ ] Circuit breaker for repeated tool failures

### AC3: Configuration
- [ ] Rate limits configurable in `etc/environment.sh`
- [ ] User-specific overrides stored in database
- [ ] Environment-specific limits (dev vs prod) work
- [ ] Admin API for managing limits
- [ ] Documentation for all configuration options

### AC4: Frontend
- [ ] Usage dashboard shows current usage vs limits
- [ ] Progress bars and charts for usage visualization
- [ ] Rate limit errors displayed clearly
- [ ] Invocation details show limit information
- [ ] Real-time updates during long-running invocations

### AC5: Database & Persistence
- [ ] Rate limit tracking tables created
- [ ] Counters persist across restarts
- [ ] Historical usage data queryable
- [ ] Counter resets work correctly (daily/monthly)
- [ ] Migration scripts for schema changes

### AC6: Testing
- [ ] Unit tests for rate limit enforcement
- [ ] Integration tests with different limit types
- [ ] Load tests verify limits under concurrent requests
- [ ] Test limit resets at boundaries (midnight, month-end)
- [ ] Test streaming response truncation

## Implementation Notes

### Suggested Approach

#### Phase 1: Infrastructure & Configuration (Week 1)
1. Design database schema for rate limit tracking
2. Create migration scripts for the SQLAlchemy models (for example, with Alembic)
3. Add configuration parameters to `etc/environment.sh`
4. Implement in-memory counter service (consider Redis for distributed deployments)
5. Create rate limit middleware framework

#### Phase 2: Token Budget Enforcement (Week 1-2)
1. Implement token tracking in model invocation layer
2. Add token budget checks before/during invocations
3. Handle streaming response truncation
4. Update cost tracking to include limit information
5. Add per-user token limit enforcement
6. Unit tests for token budget enforcement

#### Phase 3: Tool Call Enforcement (Week 2)
1. Implement tool call counter in tool execution layer
2. Add tool call limit checks before tool execution
3. Implement per-tool-type limits
4. Add circuit breaker for repeated failures
5. Unit tests for tool call limit enforcement

#### Phase 4: Frontend Integration (Week 2-3)
1. Create usage dashboard component
2. Add progress bars for token and tool call usage
3. Update invocation detail pages with limit info
4. Implement error message UI for limit breaches
5. Add real-time usage updates

#### Phase 5: Monitoring & Admin Tools (Week 3)
1. Add CloudWatch metrics for rate limit events
2. Create alert rules for unusual usage
3. Build admin API for limit management (if needed)
4. Add usage analytics and reporting
5. Documentation for operators and users

### Key Files to Modify/Create

**Backend:**
- `backend/middleware/` (new directory):
  - `rate_limiter.py`: Core rate limiting logic
  - `token_budget.py`: Token budget tracking
  - `tool_call_limiter.py`: Tool call limiting
- `backend/models/rate_limit.py`: Database models for limits
- `backend/services/`:
  - `counter_service.py`: In-memory counter management
  - `limit_enforcer.py`: Enforcement logic
- `backend/api/`: Update endpoints to check limits
- `backend/database/migrations/`: Add rate_limits table migration

**Configuration:**
- `etc/environment.sh`: Add rate limit configuration
- `backend/config.py`: Load and validate rate limit settings

**Frontend:**
- `frontend/src/components/UsageDashboard.tsx`: Usage visualization
- `frontend/src/components/RateLimitProgress.tsx`: Progress bar component
- `frontend/src/components/RateLimitError.tsx`: Error message component
- `frontend/src/pages/InvocationDetailPage.tsx`: Add limit information
- `frontend/src/api/rate-limits.ts`: API client for rate limit data

**Infrastructure:**
- `iac/`: SAM templates for ElastiCache Redis (if using for distributed counters)

**Tests:**
- `backend/tests/middleware/test_rate_limiter.py`
- `backend/tests/services/test_counter_service.py`
- `backend/tests/integration/test_rate_limits.py`

### Example Configuration

```bash
# etc/environment.sh additions

# Per-invocation limits
export DEFAULT_TOKEN_BUDGET=10000
export DEFAULT_TOOL_CALL_LIMIT=50
export DEFAULT_INVOCATION_TIMEOUT=300  # seconds

# Per-user limits
export USER_DAILY_TOKEN_LIMIT=100000
export USER_MONTHLY_TOKEN_LIMIT=1000000
export USER_HOURLY_TOOL_CALL_LIMIT=200

# Warning thresholds
export TOKEN_BUDGET_WARNING_THRESHOLD=80  # percent
export TOOL_CALL_WARNING_THRESHOLD=80  # percent

# Per-tool limits
export WEB_FETCH_LIMIT_PER_INVOCATION=10
export BASH_EXEC_LIMIT_PER_INVOCATION=20
```
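
On the backend side, `backend/config.py` would need to read and validate these variables. The sketch below shows one possible shape for a subset of them. The `RateLimitSettings` dataclass, the helper names, and the fallback defaults (copied from the example values above) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RateLimitSettings:
    default_token_budget: int
    default_tool_call_limit: int
    token_budget_warning_threshold: int  # percent, 1-100


def _read_int(env: Mapping[str, str], name: str, default: int,
              minimum: int = 1, maximum: Optional[int] = None) -> int:
    """Read an integer setting, falling back to `default` when it is unset."""
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if value < minimum or (maximum is not None and value > maximum):
        raise ValueError(f"{name}={value} is outside the allowed range")
    return value


def load_rate_limit_settings(env: Mapping[str, str] = os.environ) -> RateLimitSettings:
    """Build settings from environment variables, validating each value."""
    return RateLimitSettings(
        default_token_budget=_read_int(env, "DEFAULT_TOKEN_BUDGET", 10000),
        default_tool_call_limit=_read_int(env, "DEFAULT_TOOL_CALL_LIMIT", 50),
        token_budget_warning_threshold=_read_int(
            env, "TOKEN_BUDGET_WARNING_THRESHOLD", 80, maximum=100
        ),
    )
```

Failing fast on a malformed value at startup is usually preferable to discovering it when the first limit check runs.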

### Error Response Format

```json
{
  "error": "RateLimitExceeded",
  "message": "Token budget exceeded",
  "details": {
    "limit_type": "token_budget_per_invocation",
    "current_usage": 10000,
    "limit_value": 10000,
    "reset_at": null
  }
}
```
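
One way to produce this payload consistently is to carry the details on an exception and serialize it at the API boundary. The exception class below is a hypothetical sketch. Only the field names come from the format above.

```python
import json
from datetime import datetime
from typing import Any, Dict, Optional


class RateLimitExceeded(Exception):
    """Hypothetical exception carrying the details of the limit that was hit."""

    def __init__(self, message: str, limit_type: str, current_usage: int,
                 limit_value: int, reset_at: Optional[datetime] = None) -> None:
        super().__init__(message)
        self.message = message
        self.limit_type = limit_type
        self.current_usage = current_usage
        self.limit_value = limit_value
        self.reset_at = reset_at

    def to_response(self) -> Dict[str, Any]:
        """Serialize to the error response format used by the API."""
        return {
            "error": "RateLimitExceeded",
            "message": self.message,
            "details": {
                "limit_type": self.limit_type,
                "current_usage": self.current_usage,
                "limit_value": self.limit_value,
                # ISO 8601 string, or null when the limit has no reset time.
                "reset_at": self.reset_at.isoformat() if self.reset_at else None,
            },
        }


err = RateLimitExceeded("Token budget exceeded", "token_budget_per_invocation", 10000, 10000)
print(json.dumps(err.to_response(), indent=2))
```

The printed output matches the JSON example above, with `reset_at` serialized as `null`.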

### Database Schema Example

```sql
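-- One row per subject and limit type; the subject is a user or an agent.
CREATE TABLE rate_limits (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(255),
    agent_id VARCHAR(255),
    limit_type VARCHAR(50) NOT NULL,  -- e.g., token_daily, tool_calls_hourly
    current_usage INTEGER NOT NULL DEFAULT 0,
    limit_value INTEGER NOT NULL,
    reset_at TIMESTAMP,  -- when the counter next resets
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    -- DEFAULT applies on insert only; writers must set updated_at on update.
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    -- Every limit must belong to a user or an agent.
    CHECK (user_id IS NOT NULL OR agent_id IS NOT NULL)
);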

CREATE INDEX idx_rate_limits_user_id ON rate_limits(user_id);
CREATE INDEX idx_rate_limits_agent_id ON rate_limits(agent_id);
CREATE INDEX idx_rate_limits_reset_at ON rate_limits(reset_at);
```
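
Under concurrent requests, reading `current_usage` and then writing it back can let two requests both pass the check. One way to avoid that is a single conditional `UPDATE` that only succeeds while the new total stays within `limit_value`. The sketch below demonstrates the idea against an in-memory SQLite database with a trimmed-down table. The function name `try_consume` is hypothetical, and the production database would be RDS rather than SQLite.

```python
import sqlite3


def try_consume(conn: sqlite3.Connection, user_id: str,
                limit_type: str, amount: int) -> bool:
    """Atomically add `amount` to the counter if it stays within the limit."""
    cur = conn.execute(
        """
        UPDATE rate_limits
           SET current_usage = current_usage + ?,
               updated_at = CURRENT_TIMESTAMP
         WHERE user_id = ? AND limit_type = ?
           AND current_usage + ? <= limit_value
        """,
        (amount, user_id, limit_type, amount),
    )
    conn.commit()
    # One updated row means the usage was recorded; zero means it was denied.
    return cur.rowcount == 1


# Demo setup: a simplified version of the rate_limits table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE rate_limits (
        id INTEGER PRIMARY KEY,
        user_id TEXT,
        limit_type TEXT NOT NULL,
        current_usage INTEGER NOT NULL DEFAULT 0,
        limit_value INTEGER NOT NULL,
        updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """
)
conn.execute(
    "INSERT INTO rate_limits (user_id, limit_type, limit_value) VALUES (?, ?, ?)",
    ("user-1", "token_daily", 100),
)
print(try_consume(conn, "user-1", "token_daily", 60))  # True
print(try_consume(conn, "user-1", "token_daily", 50))  # False (would reach 110)
```

Because the check and the increment happen in one statement, the database serializes competing updates and the counter cannot overshoot the limit.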

### Testing Strategy
- **Unit Tests**: Mock counters and test enforcement logic
- **Integration Tests**: Test with real database and counters
- **Load Tests**: Verify limits under concurrent load
- **Boundary Tests**: Test counter resets at time boundaries
- **Streaming Tests**: Verify mid-stream truncation works correctly
- **Multi-Tenant Tests**: Verify user isolation of limits
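
Boundary tests are easier to write when the reset time is a pure function of an injected "now". The helper below is a hypothetical sketch that assumes resets happen at UTC midnight and on the first day of the month. The real boundaries may be defined per user or per region.

```python
from datetime import datetime, timedelta, timezone


def next_reset(now: datetime, period: str) -> datetime:
    """Return the next reset time after `now` for a daily or monthly counter."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if period == "daily":
        return midnight + timedelta(days=1)
    if period == "monthly":
        first = midnight.replace(day=1)
        if first.month == 12:
            # December rolls over into January of the following year.
            return first.replace(year=first.year + 1, month=1)
        return first.replace(month=first.month + 1)
    raise ValueError(f"unknown period: {period!r}")


# Boundary cases: leap day, month-end, and year-end.
leap_day = datetime(2024, 2, 29, 23, 59, 59, tzinfo=timezone.utc)
assert next_reset(leap_day, "daily") == datetime(2024, 3, 1, tzinfo=timezone.utc)
assert next_reset(leap_day, "monthly") == datetime(2024, 3, 1, tzinfo=timezone.utc)

year_end = datetime(2024, 12, 31, 12, 0, tzinfo=timezone.utc)
assert next_reset(year_end, "monthly") == datetime(2025, 1, 1, tzinfo=timezone.utc)
```

The same approach extends to hourly windows, and the assertions can move into the test files listed under "Key Files to Modify/Create".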

### Cost Tracking Integration
Rate limiting directly impacts costs. Update the existing cost tracking to:
- Show usage as percentage of limits
- Project monthly costs based on current usage
- Alert when costs approach budget limits
- Maintain consistency with `formatCost` functions in:
  - `CostDashboardPage.tsx`
  - `InvocationDetailPage.tsx`
  - `LatencySummary.tsx`
  - `InvocationTable.tsx`

## References
- [AWS API Gateway Usage Plans](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-api-usage-plans.html)
- [Rate Limiting Patterns](https://cloud.google.com/architecture/rate-limiting-strategies-techniques)
- [Token Bucket Algorithm](https://en.wikipedia.org/wiki/Token_bucket)
- [Redis Rate Limiting](https://redis.io/docs/manual/patterns/rate-limiter/)

## Dependencies
- Issue #74 (Alternate LLM Providers) - Token limits may vary by provider

## Priority
High - Critical for cost control and production deployments

## Estimated Effort
Large (3 weeks) - Requires middleware, database changes, frontend work, and comprehensive testing

