β‘ Real-time inference optimizer for LLMs β faster generation, smarter decoding, and live observability πβ¨
Kairu (ζ΅γγ) β to flow, to stream.
Inference should be fluid β not blocked by latency, inefficiency, or opaque performance.
Kairu wraps any HuggingFace model and adds:
-
π¦ Speculative decoding (EAGLE-style)
-
β© Dynamic early exit
-
πΈ Token budget enforcement
-
π Live dashboard:
- tokens/sec
- latency
- quality tradeoffs
Speculative decoding works β but:
- locked inside heavy frameworks (vLLM, etc.)
- hard to experiment with
- no lightweight tooling
- no built-in observability
- Speculative decoding internals (EAGLE, Medusa)
- KV cache management
- Streaming inference
- Performance optimization
pip install kairufrom kairu import wrap_model
model = wrap_model("your-model")
model.generate("Hello world")pip install "kairu[server]"import uvicorn
from kairu import create_app, ServerConfig
app = create_app(config=ServerConfig(model_name="kairu-mock"))
uvicorn.run(app, host="0.0.0.0", port=8000)curl -N -X POST http://localhost:8000/generate \
-H 'content-type: application/json' \
-d '{"prompt": "hello world", "max_tokens": 16}'Each frame is OpenAI chat.completion.chunk-compatible with a kairu extension carrying per-token latency_ms and tokens_per_s. Stream terminates with data: [DONE]. Per-IP rate limit + per-request timeout are enforced at the boundary.
from kairu import (
AutoProfile, CachedModel, DynamicGammaScheduler,
LayerwiseEarlyExitDecoder, MockLayeredModel, MockModel, ModelWrapper,
)
# Auto-pick a decoder strategy + cache size for any model
profile = AutoProfile.recommend(MockModel(), name_hint="llama-3-8b", has_draft=True)
print(profile.strategy, profile.gamma, profile.rationale)
# Layerwise early exit on architectures that expose intermediate logits
decoder = LayerwiseEarlyExitDecoder(MockLayeredModel(num_layers=24), confidence_threshold=0.85)
tokens, stats = decoder.generate([1, 2, 3], max_new_tokens=16)
print(f"saved {stats['compute_saved']:.1%} of layer compute")
# Wrap any model with logits memoization + adaptive Ξ³
wrapper = ModelWrapper(
MockModel(), draft_model=MockModel(),
cache_capacity=256, adaptive_gamma=True,
)- Layerwise early exit β stops at the first transformer layer whose top-prob β₯ threshold
- Logits cache (
CachedModel) β recycles target-model calls across speculative verification - Adaptive Ξ³ (
DynamicGammaScheduler) β AIMD control loop over speculative lookahead AutoProfileβ picksvanilla/early_exit/layered_early_exit/speculativefrom model metadata
pip install "kairu[server,redis]"
# Single-process default
kairu serve --host 0.0.0.0 --port 8000 --cache-capacity 256
# Horizontally scaled β Redis-backed rate limit shared across replicas
kairu serve --host 0.0.0.0 --port 8000 --redis redis://redis:6379/0 \
--rate-limit 100 --rate-window 60# Prometheus scrape
curl http://localhost:8000/metrics
# kairu_requests_total{endpoint="/generate",status="200"} 42
# kairu_tokens_generated_total{finish_reason="length"} 2752
# kairu_token_latency_seconds_bucket{le="0.01"} 38
# ...# Docker β multi-arch image published to GHCR on every main push
docker run --rm -p 8000:8000 ghcr.io/konjoai/kairu:latest \
serve --host 0.0.0.0 --port 8000# HuggingFace KV-cache adapter β drop-in past_key_values reuse
from kairu._hf_backend import HuggingFaceKVCachedModel
model = HuggingFaceKVCachedModel("gpt2")
# model.next_token_logits([..., t0, t1, t2]) reuses cached state from
# the prior call when prefixes overlap. model.kv_cache_stats reports
# kv_hits / kv_misses / kv_hit_rate / cached_prefix_len.Make LLM inference fast, transparent, and controllable.