
feat: add ai-cache plugin for LLM semantic caching #13290

@nic-6443

Description


Add an ai-cache plugin that provides two-layer caching for LLM API responses:

  • L1 exact-match: hash-based lookup on the prompt content, zero cost, deterministic
  • L2 semantic: embedding-based vector similarity search when L1 misses, catching semantically equivalent queries (e.g., "how to return an item" ≈ "what is the return policy"); a sketch of the combined lookup follows below
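
A minimal sketch of the combined lookup, assuming an in-memory L1 store and a plain list of L2 entries purely for illustration (the real plugin would back both with Redis; exact_key, embed, and the entry layout are hypothetical names, not final plugin internals):

import hashlib
import json
import math

def exact_key(messages):
    # L1 key: stable hash over the normalized prompt content
    payload = json.dumps(messages, sort_keys=True).encode()
    return "ai-cache:l1:" + hashlib.sha256(payload).hexdigest()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def lookup(messages, l1_store, l2_entries, embed, threshold=0.95):
    # L1: exact match, no embedding call, deterministic
    key = exact_key(messages)
    if key in l1_store:
        return l1_store[key], "HIT-L1"
    # L2: embed the prompt and compare against stored vectors
    vec = embed(messages)
    best, best_sim = None, 0.0
    for entry in l2_entries:
        sim = cosine(vec, entry["vector"])
        if sim > best_sim:
            best, best_sim = entry, sim
    if best and best_sim >= threshold:
        l1_store[key] = best["response"]  # backfill L1 for the next identical query
        return best["response"], "HIT-L2"
    return None, "MISS"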

This is one of the highest-ROI capabilities for an AI gateway: industry reports show 30-60% cost and latency savings for common workloads (FAQ bots, document Q&A, translation).

Background:

All major AI gateway products already ship semantic caching: Kong (ai-semantic-cache), LiteLLM, Portkey, Higress (ai-cache), Helicone. APISIX has proxy-cache for generic HTTP caching but nothing that understands LLM prompt semantics.

Proposed design (for reference, open to adjustment during implementation):

plugins:
  ai-cache:
    layers: [exact, semantic]

    cache_key:
      include_consumer: false    # default shared; set true for multi-tenant isolation
      include_vars: []           # additional variables for cache key scoping

    exact:
      ttl: 3600
      match_fields: [messages]

    semantic:
      vector_backend: redis      # Redis Stack with RediSearch module
      similarity_threshold: 0.95
      top_k: 1
      ttl: 86400
      embedding:
        provider: openai
        model: text-embedding-3-small
        endpoint: "https://api.openai.com/v1/embeddings"
        auth_ref: "$secret://..."

    bypass_on:
      - header: "X-Cache-Bypass"
        equals: "1"

    headers:
      cache_status: "X-AI-Cache-Status"      # HIT-L1 / HIT-L2 / MISS / BYPASS
      cache_similarity: "X-AI-Cache-Similarity"
      cache_age: "X-AI-Cache-Age"
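
For the cache_key block above, a rough sketch of how include_consumer and include_vars could fold into the key (build_cache_key and the ctx lookup are illustrative assumptions, not settled plugin internals):

import hashlib
import json

def build_cache_key(messages, ctx, include_consumer=False, include_vars=()):
    # Base material: the prompt fields selected by exact.match_fields
    material = {"messages": messages}
    if include_consumer:
        # Isolate entries per consumer for multi-tenant setups
        material["consumer"] = ctx.get("consumer_name")
    for var in include_vars:
        # Extra variables scope the key further, e.g. a RAG retrieval scope
        material[var] = ctx.get(var)
    digest = hashlib.sha256(json.dumps(material, sort_keys=True).encode()).hexdigest()
    return "ai-cache:" + digest

With the defaults, two consumers asking the same public FAQ question share one entry; with include_consumer: true (or extra include_vars) their entries stay separate.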

Key design points:

  1. L1 hash → L2 embedding dual-layer (industry consensus, not either/or)
  2. Cache key isolation is user-configured: default is shared (maximizes hit rate for public knowledge scenarios); multi-tenant or RAG scenarios explicitly enable include_consumer or include_vars
  3. L2 hit backfills L1 for next identical query
  4. Cache write only on complete successful upstream response (2xx)
  5. Streaming responses: accumulate chunks then write; cache replay maintains SSE contract
  6. Redis Stack as the vector backend (APISIX already depends on Redis for limit-* plugins; Redis Stack just adds the RediSearch module); a lookup sketch against RediSearch follows this list
  7. Response headers expose cache status for debugging and client-side logic
  8. Prometheus metrics: apisix_ai_cache_hits_total{layer}, apisix_ai_cache_misses_total, apisix_ai_cache_embedding_latency_seconds
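
To make points 3 and 6 concrete, here is a hedged sketch of the L2 lookup against Redis Stack, written with redis-py's RediSearch bindings purely for illustration (the plugin itself would use APISIX's Lua Redis client; the index name ai_cache_idx, the embedding/response field names, and the float32 packing are assumptions):

import struct
import redis
from redis.commands.search.query import Query

r = redis.Redis()  # assumes a Redis Stack instance with RediSearch loaded

def semantic_lookup(vec, threshold=0.95, top_k=1):
    # KNN query over a vector field named "embedding" in index "ai_cache_idx"
    q = (
        Query(f"*=>[KNN {top_k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("response", "score")
        .dialect(2)
    )
    packed = struct.pack(f"{len(vec)}f", *vec)  # FLOAT32 blob, matching the index definition
    res = r.ft("ai_cache_idx").search(q, query_params={"vec": packed})
    if not res.docs:
        return None
    doc = res.docs[0]
    # With a COSINE index, RediSearch returns a distance; convert to similarity
    similarity = 1.0 - float(doc.score)
    if similarity < threshold:
        return None
    return doc.response, similarity

On a hit, the same path would backfill the L1 hash key (point 3) and set X-AI-Cache-Status: HIT-L2 plus X-AI-Cache-Similarity; on a miss the request goes upstream and the response is written back only after a complete 2xx (point 4).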

Typical use cases:

Scenario                               | Recommended config
Public FAQ / translation               | Default (shared cache), max ROI
Multi-tenant SaaS                      | include_consumer: true
RAG with different retrieval contexts  | include_consumer: true + include_vars: [retrieval scope vars]
Sensitive prompts                      | Use bypass_on to skip caching

Happy to submit a PR if this direction makes sense.
