
feat: add ai-cache plugin for LLM semantic caching #13290

@nic-6443

Description


Add an ai-cache plugin that provides two-layer caching for LLM API responses:

  • L1 exact-match: hash-based lookup on the prompt content, zero cost, deterministic
  • L2 semantic: embedding-based vector similarity search when L1 misses, catching semantically equivalent queries (e.g., "how to return an item" ≈ "what is the return policy"); a sketch of the combined lookup follows below
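
A minimal sketch of the combined lookup, assuming an in-memory L1 store and a plain list of L2 entries purely for illustration (the real plugin would back both with Redis; exact_key, embed, and the entry layout are hypothetical names, not final plugin internals):

import hashlib
import json
import math

def exact_key(messages):
    # L1 key: stable hash over the normalized prompt content
    payload = json.dumps(messages, sort_keys=True).encode()
    return "ai-cache:l1:" + hashlib.sha256(payload).hexdigest()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def lookup(messages, l1_store, l2_entries, embed, threshold=0.95):
    # L1: exact match, no embedding call, deterministic
    key = exact_key(messages)
    if key in l1_store:
        return l1_store[key], "HIT-L1"
    # L2: embed the prompt and compare against stored vectors
    vec = embed(messages)
    best, best_sim = None, 0.0
    for entry in l2_entries:
        sim = cosine(vec, entry["vector"])
        if sim > best_sim:
            best, best_sim = entry, sim
    if best and best_sim >= threshold:
        l1_store[key] = best["response"]  # backfill L1 for the next identical query
        return best["response"], "HIT-L2"
    return None, "MISS"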

This is one of the highest-ROI capabilities for an AI gateway: industry reports show 30-60% cost and latency savings for common workloads (FAQ bots, document Q&A, translation).

Background:

All major AI gateway products already ship semantic caching: Kong (ai-semantic-cache), LiteLLM, Portkey, Higress (ai-cache), Helicone. APISIX has proxy-cache for generic HTTP caching but nothing that understands LLM prompt semantics.

Proposed design (for reference, open to adjustment during implementation):

plugins:
  ai-cache:
    layers: [exact, semantic]

    cache_key:
      include_consumer: false    # default shared; set true for multi-tenant isolation
      include_vars: []           # additional variables for cache key scoping

    exact:
      ttl: 3600
      match_fields: [messages]

    semantic:
      vector_backend: redis      # Redis Stack with RediSearch module
      similarity_threshold: 0.95
      top_k: 1
      ttl: 86400
      embedding:
        provider: openai
        model: text-embedding-3-small
        endpoint: "https://api.openai.com/v1/embeddings"
        auth_ref: "$secret://..."

    bypass_on:
      - header: "X-Cache-Bypass"
        equals: "1"

    headers:
      cache_status: "X-AI-Cache-Status"      # HIT-L1 / HIT-L2 / MISS / BYPASS
      cache_similarity: "X-AI-Cache-Similarity"
      cache_age: "X-AI-Cache-Age"
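
For the cache_key block above, a rough sketch of how include_consumer and include_vars could fold into the key (build_cache_key and the ctx lookup are illustrative assumptions, not settled plugin internals):

import hashlib
import json

def build_cache_key(messages, ctx, include_consumer=False, include_vars=()):
    # Base material: the prompt fields selected by exact.match_fields
    material = {"messages": messages}
    if include_consumer:
        # Isolate entries per consumer for multi-tenant setups
        material["consumer"] = ctx.get("consumer_name")
    for var in include_vars:
        # Extra variables scope the key further, e.g. a RAG retrieval scope
        material[var] = ctx.get(var)
    digest = hashlib.sha256(json.dumps(material, sort_keys=True).encode()).hexdigest()
    return "ai-cache:" + digest

With the defaults, two consumers asking the same public FAQ question share one entry; with include_consumer: true (or extra include_vars) their entries stay separate.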

Key design points:

  1. L1 hash → L2 embedding dual-layer (industry consensus, not either/or)
  2. Cache key isolation is user-configured: default is shared (maximizes hit rate for public knowledge scenarios); multi-tenant or RAG scenarios explicitly enable include_consumer or include_vars
  3. L2 hit backfills L1 for next identical query
  4. Cache write only on complete successful upstream response (2xx)
  5. Streaming responses: accumulate chunks then write; cache replay maintains SSE contract
  6. Redis Stack as the vector backend (APISIX already depends on Redis for limit-* plugins; Redis Stack just adds the RediSearch module); a lookup sketch against RediSearch follows this list
  7. Response headers expose cache status for debugging and client-side logic
  8. Prometheus metrics: apisix_ai_cache_hits_total{layer}, apisix_ai_cache_misses_total, apisix_ai_cache_embedding_latency_seconds
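
To make points 3 and 6 concrete, here is a hedged sketch of the L2 lookup against Redis Stack, written with redis-py's RediSearch bindings purely for illustration (the plugin itself would use APISIX's Lua Redis client; the index name ai_cache_idx, the embedding/response field names, and the float32 packing are assumptions):

import struct
import redis
from redis.commands.search.query import Query

r = redis.Redis()  # assumes a Redis Stack instance with RediSearch loaded

def semantic_lookup(vec, threshold=0.95, top_k=1):
    # KNN query over a vector field named "embedding" in index "ai_cache_idx"
    q = (
        Query(f"*=>[KNN {top_k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("response", "score")
        .dialect(2)
    )
    packed = struct.pack(f"{len(vec)}f", *vec)  # FLOAT32 blob, matching the index definition
    res = r.ft("ai_cache_idx").search(q, query_params={"vec": packed})
    if not res.docs:
        return None
    doc = res.docs[0]
    # With a COSINE index, RediSearch returns a distance; convert to similarity
    similarity = 1.0 - float(doc.score)
    if similarity < threshold:
        return None
    return doc.response, similarity

On a hit, the same path would backfill the L1 hash key (point 3) and set X-AI-Cache-Status: HIT-L2 plus X-AI-Cache-Similarity; on a miss the request goes upstream and the response is written back only after a complete 2xx (point 4).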

Typical use cases:

Scenario                               | Recommended config
Public FAQ / translation               | Default (shared cache), max ROI
Multi-tenant SaaS                      | include_consumer: true
RAG with different retrieval contexts  | include_consumer: true + include_vars: [retrieval scope vars]
Sensitive prompts                      | Use bypass_on to skip caching

Happy to submit a PR if this direction makes sense.
