If KAME saved you from rate-limit hell, consider a tip:
Bitcoin: 36BGYhMEVFgY8PLGMVux93pjGt92KVM6dJ
Every sat helps me keep this project alive and learning.
KAME is what API rotation should have been.
Round-robin libraries cycle keys blindly. They keep banging on a key that just hit a 429 because they have no memory. They have no idea which key has capacity left. They retry through dead keys and call it "resilience."
KAME does the opposite of every assumption round-robin makes:
- **Hybrid learning.** Parses the provider's own retry-delay. No guessing. No fixed backoff. Per-minute or daily, on any provider, KAME does the right thing.
- **RPM-aware predictive selection.** A 60-second sliding window tracks each key's recent activity. KAME selects the key with the most remaining capacity, not just the next one in line. An LRU tie-break spreads load evenly across the pool.
- **ETA-driven sleep.** When the entire pool is temporarily exhausted, KAME reads the soonest recovery time and sleeps until then (capped at 60s, re-checking after) instead of burning wasted 429 requests that prolong the cooldown. Production-proven: 45 wasted requests in v0.5.7.x → 0 in v1.0.0.
- **Trust the Connection.** Zero artificial timeouts. If the API accepts your request without error, KAME waits patiently for it to finish, even if it's a 90,000-token compression that takes 90 seconds. No death-loops on slow models. No "timed out" crashes during legitimate work.
- **Smart hybrid jitter.** Every wait includes 0.1 to 1.5 seconds of random jitter, so KAME doesn't look like a bot. KAME looks like a thoughtful human who took a coffee break at exactly the right time.
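A minimal sketch of the per-key 60-second sliding window, assuming request timestamps are recorded in time order. `SlidingWindow` and the `rpm_limit` argument are hypothetical; KAME's own state keeps a plain `request_log` list per key, and its real pruning code may differ.

```python
from __future__ import annotations

import time
from collections import deque


class SlidingWindow:
    """Request timestamps for one key over the last `window` seconds (hypothetical helper)."""

    def __init__(self, window: float = 60.0) -> None:
        self.window = window
        self.timestamps: deque[float] = deque()

    def _prune(self, now: float) -> None:
        # Timestamps are appended in time order, so the oldest sit at the left end.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()

    def record(self, now: float | None = None) -> None:
        """Log one request at `now` (defaults to the current wall-clock time)."""
        now = time.time() if now is None else now
        self._prune(now)
        self.timestamps.append(now)

    def count(self, now: float | None = None) -> int:
        """Number of requests inside the window ending at `now`."""
        now = time.time() if now is None else now
        self._prune(now)
        return len(self.timestamps)

    def remaining(self, rpm_limit: int, now: float | None = None) -> int:
        """Capacity left under an assumed per-key requests-per-minute limit."""
        return max(0, rpm_limit - self.count(now))
```

Old entries are dropped lazily on every read, so `count()` stays cheap enough to call for every key on every selection.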
You give it a comma-separated list of API keys. It gives you an agent that never stops.
You don't have to take my word for it. Here's a single day of intensive Agent Zero usage:
| Metric | Value |
|---|---|
| KAME-managed operations | 1,163 |
| Rate limit (429) events encountered | 117 |
| Rate limits resolved by rotation alone | 116 (99.1%) |
| Pool-fully-sick events requiring sleep | 1 |
| False pulses (wasted retries against sick keys) | 0 |
| KAME engine crashes | 0 |
| Pool state "healthy" during operations | ~99% |
The single sleep event tells the whole story:
- KAME predicted wake: 18:09:00
- KAME actual wake: 18:09:00.291
- Off by: 291 milliseconds (the random jitter)
- After waking: picked the recovered key in 0.08ms, request succeeded
That's what production-validated means: predictions accurate within the jitter window, zero crashes, zero wasted requests.
A second real-world run, on a 15-key Gemini pool, hit the exact failure mode v1.0.1 was built for: a wave of daily-quota exhaustion that eventually took the entire pool cold at once. Here is KAME's own session summary at the 100-operation mark:
```
[KAME] Session: 100 ok · 15 limited (min 0, daily 15, quota 0) · 1 long-sleep · 11 server · 0 timeout · 0 auth · 0 other
```
| Metric | Value |
|---|---|
| Operations completed | 100 |
| Rate limits classified as daily (correctly) | 15 / 15 (the v1.0.1 fix) |
| Rate limits mis-classified as per-minute (the v1.0.0 bug) | 0 |
| Auth / timeout / unknown errors | 0 |
| KAME engine crashes | 0 |
When all 15 keys went cold, KAME announced the outage once, then slept quietly (no API calls) and woke within seconds of the first recovery:
- Announced once: "All keys cooling → next recovery in ~17m … retry around 06:32:23."
- Then ETA-sleeps only: "Sleeping 17.8s … next key in ~16s", then woke and picked the recovered key in 0.16ms.
And the eternal carousel never gave up. The single hardest call rode out the whole storm and still returned a successful response:
- One `unified_call`: 2154.1s wall time · 9 rotations · 18 sleeps · 1049s of local waiting · ✅ success.
- The pool then recovered all the way back to 15/15 healthy and stayed there for the rest of the ~6-hour session.
v1.0.0 would have trusted the provider's misleading short `retryDelay` on those daily 429/503s and re-probed dead keys roughly once per second for hours. v1.0.1 rested each one for a real hour, slept through the total outage, and lost zero requests.
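The daily vs. per-minute decision described above can be sketched as two small functions. This is an illustration under stated assumptions: the marker strings in `DAILY_MARKERS`, the function names, and the way the parsed delay is combined with backoff are all hypothetical, not KAME's actual detection rules; only the 1h default cooldown comes from this README.

```python
from typing import Optional

DAILY_QUOTA_COOLDOWN_SECONDS = 3600.0  # KAME's documented default (1h)

# Illustrative markers only: these strings are assumptions, not KAME's real rules.
DAILY_MARKERS = ("perday", "per day", "daily", "insufficient_quota")


def classify_limit(message: str) -> str:
    """Label a rate-limit error as 'daily' or 'per-minute' from its message text."""
    text = message.lower()
    return "daily" if any(marker in text for marker in DAILY_MARKERS) else "per-minute"


def cooldown_for(message: str, parsed_delay: Optional[float], backoff: float) -> float:
    """Seconds to rest a key after a rate-limit error."""
    if classify_limit(message) == "daily":
        # A daily quota will not reset within a short retryDelay (e.g. 1s),
        # so ignore it and rest the key for the full cooldown.
        return DAILY_QUOTA_COOLDOWN_SECONDS
    # Per-minute: trust the provider's delay, but never rest for less than the
    # adaptive-backoff value (combining the two with max() is an assumption).
    return max(parsed_delay or 0.0, backoff)
```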
- Copy the `api_rotation_by_kame/` folder into `/a0/usr/plugins/`.
- In Agent Zero → Settings → Model Provider, enter your keys separated by commas: `key1, key2, key3, key4, ...`
- Restart Agent Zero (or just start a new agent turn, since A0 hot-reloads plugins). That's it.
No config required. No tuning. No code changes anywhere. The plugin monkey-patches Agent Zero's LiteLLM layer at boot and reverts cleanly on uninstall.
Look for this banner on startup:
```
=======================================================
🐢⚡ KAME v1.0.1 - ACTIVE
✅ Identity-Aware Health
✅ Eternal Carousel Rotation
✅ RPM-Aware Predictive Selection
✅ Anti-Dogpile Guard
✅ Anti-Thundering-Herd (Pending Counter)
✅ Trust the Connection (No Artificial Timeouts)
✅ KAME-Aware Compression Guard
✅ Hybrid Learning (Parsed retry-delay + ETA-driven sleep)
✅ Daily-Quota & Account-Limit Aware (multi-provider)
✅ Adaptive Backoff (provider-agnostic safety net)
✅ Rate Limiter Lock Fix
✅ Token Callback Support
✅ Friendly Error Reporting (real status + kind)
Note: keys are shown as anonymized ids (e.g. 'k3f9a1'), NOT your real keys.
=======================================================
```
| | Plain round-robin | KAME |
|---|---|---|
| Selection logic | "next in line" | most remaining capacity (RPM-aware predictive) |
| Behavior on 429 | retry same key with backoff | read provider's retry-delay, sleep that exact time |
| Concurrent calls | all dogpile on key #1 | spread across keys (anti-dogpile + anti-thundering-herd) |
| Sick key recovery | guessed (often wrong) | respected to the second (parsed from response) |
| Wasted 429 requests | many | zero |
| Detectable as bot | yes (regular spin) | no (jitter on every wait) |
| Daily-quota / out-of-credit key | trusts a misleading short retry, hammers a dead key | detected; real cooldown, any provider |
| Compression flow | breaks on token limit | rotates mid-compression, finishes anyway |
| Memory of failures | none | identity-aware health (per provider:model) |
| Recovery from "all sick" pool | infinite retry, kills your quota | ETA-driven sleep, wake exactly on time |
If you're using round-robin, your keys can spend a large share of their quota proving they're still rate-limited. With KAME, every request that hits the API actually gets answered.
| # | Shield | What it gives you |
|---|---|---|
| 1 | Identity-Aware Health | Tracks key health per provider:model pair. Your `gemini-2.5-flash` pool is separate from your `gemini-2.5-pro` pool: a 429 on one doesn't disable the other. |
| 2 | Eternal Carousel | Infinite rotation. Never gives up, never crashes. Survives any combination of failures. |
| 3 | RPM-Aware Predictive Selection | 60-second sliding window per key. Picks the one with the most remaining capacity. LRU tie-break for even spreading. |
| 4 | Anti-Dogpile Guard | At selection, the chosen key is marked busy immediately, so concurrent calls naturally pick different keys. |
| 5 | Anti-Thundering-Herd | The pending request counts in the 60s window BEFORE it completes, so other threads route around it. |
| 6 | ETA-Driven Sleep | When all keys are sick, sleep until the soonest recovery (capped at 60s, then re-check). Re-select after waking. Never call the API with a sick key. |
| 7 | Smart Hybrid Jitter | `random.uniform(0.1, 1.5)` seconds on every wait. Anti-bot-detection. Prevents multi-client sync collisions. |
| 8 | Trust the Connection | Zero artificial timeouts. Slow legitimate work runs to completion. |
| 9 | KAME-Powered Compression | History compression goes through the same eternal carousel, with multi-key rotation during summarization. |
| 10 | Daily-Quota & Account-Limit Aware | Detects daily-quota and out-of-credit (`insufficient_quota`) errors across providers and applies a real cooldown instead of trusting a misleadingly short retry and hammering a dead key once per second. Configurable (`daily_quota_cooldown_seconds`, default 1h). |
| 11 | Adaptive Backoff | Provider-agnostic safety net: if the same key keeps hitting rate limits, its cooldown escalates (20s → 40s → 80s … up to the ceiling) and resets on the first success. Kills re-probe bursts even when the provider strips all error details. |
| 12 | Rate Limiter Deadlock Fix | Replaces A0's `asyncio.Lock` with `threading.Lock`, eliminating an async deadlock under specific concurrency patterns. |
| 13 | Clean Uninstall | `hooks.py::uninstall()` reverts every monkey-patch. Drop the folder and KAME is gone, with no leftover state. |
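The escalation in row 11 can be sketched as a small per-key counter. The 20s base and the doubling come from the "20s → 40s → 80s" example, and the default ceiling of 3600s mirrors `daily_quota_cooldown_seconds`; the class itself is hypothetical, and whether KAME escalates from the very first rate limit is an assumption.

```python
class AdaptiveBackoff:
    """Per-key escalating cooldown: 20s, 40s, 80s, ... capped at a ceiling (sketch)."""

    def __init__(self, base: float = 20.0, ceiling: float = 3600.0) -> None:
        self.base = base
        self.ceiling = ceiling
        self.consecutive_rl = 0  # mirrors the consecutive_rl field in the key state

    def on_rate_limit(self) -> float:
        """Record one more consecutive rate limit and return the cooldown to apply."""
        self.consecutive_rl += 1
        return min(self.base * 2 ** (self.consecutive_rl - 1), self.ceiling)

    def on_success(self) -> None:
        """The first success resets the escalation."""
        self.consecutive_rl = 0
```

Because the cooldown grows even when a provider returns no usable retry hint, a key that keeps failing is probed less and less often instead of once per second.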
```mermaid
flowchart TD
    A[Agent Zero asks LiteLLM for a chat] --> B[KAME monkey-patched unified_call]
    B --> C[_get_best_key for provider:model]
    C --> D{Any healthy keys?}
    D -->|Yes| E[Pick key with most<br/>remaining capacity]
    D -->|No, all sick| F[Read soonest sick_until]
    E --> G[Mark anti-dogpile + anti-herd]
    G --> H[acompletion - real API call]
    H --> I{Success?}
    F --> J["Sleep min(ETA+0.5s, 60s)<br/>+ jitter 0.1-1.5s"]
    J --> K[NO API calls during sleep]
    K --> C
    I -->|Yes| L[Mark healthy<br/>reset backoff<br/>Return response]
    I -->|No, rate-limit| M[Classify error +<br/>parse retry-delay]
    M --> N{Daily / account limit?}
    N -->|Yes| O[Long cooldown<br/>ignore misleading delay]
    N -->|No| P[Per-minute: trust delay<br/>+ adaptive backoff]
    O --> Q[Set sick_until]
    P --> Q
    Q --> C
    style E fill:#10b981
    style F fill:#f59e0b
    style L fill:#10b981
    style O fill:#f59e0b
    style J fill:#3b82f6
```
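The sleep branch of the flowchart (nodes F and J) reduces to a pure function of the keys' `sick_until` times. This is a sketch: the function name and the injectable `rng` parameter are hypothetical, while the constants (3s threshold, 0.5s margin, 60s cap, 2s fallback, 0.1 to 1.5s jitter) are the ones this README documents.

```python
import random
from typing import Iterable, Optional


def eta_sleep_seconds(
    sick_until: Iterable[float],
    now: float,
    rng: Optional[random.Random] = None,
) -> float:
    """How long to sleep when every key is sick (no API calls during the sleep)."""
    if rng is None:
        rng = random.Random()
    soonest_eta = min(t - now for t in sick_until)  # time until the first key recovers
    if soonest_eta > 3.0:
        base = min(soonest_eta + 0.5, 60.0)  # small margin, capped so we re-check every 60s
    else:
        base = 2.0  # fallback for very short ETAs
    return base + rng.uniform(0.1, 1.5)  # jitter on every wait
```

The 60s cap means even a one-hour daily-quota outage is re-checked regularly, and the caller always re-runs selection after waking rather than assuming which key recovered.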
The whole engine is ~1,050 lines in a single file (`kame_engine.py`), monkey-patching `LiteLLMChatWrapper.unified_call`, `Topic.summarize_messages`, `Bulk.summarize`, and the framework's rate limiter.
Every API key carries this dictionary, scoped under `{provider}:{model}`:

```python
{
    "sick_until": float,      # epoch time when key becomes available again
    "last_used": float,       # for LRU tie-break + anti-dogpile
    "request_log": [float],   # 60s sliding window of request timestamps
    "last_sick_at": float,    # for compression-aware "fresh recovery" filter
    "consecutive_rl": int,    # consecutive rate-limit fails -> adaptive backoff (resets on success)
}
```

Selection then picks among the healthy keys:

```python
best_key = min(healthy, key=lambda k: (
    len(pool[k]["request_log"]),  # primary: most remaining 60s-window capacity
    pool[k]["last_used"],         # secondary: LRU for even spreading
))
# Then: mark used NOW (anti-dogpile)
# and count the pending request NOW in request_log (anti-thundering-herd)
```

In a 15-key pool firing 100 requests in 60 seconds, KAME spreads them roughly evenly (~6-7 per key) without you doing anything.
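Putting the selection rule, anti-dogpile, and anti-thundering-herd together gives a sketch like the following. `pick_key`, `new_pool`, and the trimmed-down `KeyState` (without `last_sick_at` and `consecutive_rl`) are hypothetical; time is passed in explicitly so the behavior is reproducible.

```python
from typing import Dict, List, Optional, TypedDict

WINDOW = 60.0  # seconds of history used for RPM-aware selection


class KeyState(TypedDict):
    sick_until: float
    last_used: float
    request_log: List[float]


def new_pool(keys: List[str]) -> Dict[str, KeyState]:
    return {k: {"sick_until": 0.0, "last_used": 0.0, "request_log": []} for k in keys}


def pick_key(pool: Dict[str, KeyState], now: float) -> Optional[str]:
    """Pick the healthy key with the fewest requests in the last 60s (LRU tie-break).

    Returns None when every key is sick; the caller would then ETA-sleep.
    """
    for state in pool.values():
        # Keep only the last WINDOW seconds of history.
        state["request_log"] = [t for t in state["request_log"] if t > now - WINDOW]
    healthy = [k for k, s in pool.items() if s["sick_until"] <= now]
    if not healthy:
        return None
    best = min(healthy, key=lambda k: (len(pool[k]["request_log"]), pool[k]["last_used"]))
    # Anti-dogpile: mark the key used NOW, before the API call starts.
    pool[best]["last_used"] = now
    # Anti-thundering-herd: count the pending request in the window immediately.
    pool[best]["request_log"].append(now)
    return best
```

Because the key is marked and its pending request logged in the same synchronous step as `min()`, with no `await` in between, two coroutines on the same event loop cannot both see it as the least-loaded key. Feeding this sketch 100 requests over 50 seconds on 15 keys yields 6 or 7 picks per key.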
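When every key is sick, the engine computes its wait from the soonest recovery:

```python
import asyncio
import random
import time

# Excerpt from inside the async selection loop, reached only when no key is healthy.
now = time.time()
soonest_eta = min(s["sick_until"] - now for s in pool.values() if s["sick_until"] > now)
if soonest_eta > 3.0:
    wait = min(soonest_eta + 0.5, 60.0) + random.uniform(0.1, 1.5)
else:
    wait = 2.0 + random.uniform(0.1, 1.5)  # fallback for very short ETAs
await asyncio.sleep(wait)  # no API calls while sleeping
continue  # re-select after waking; never fall through with a sick key
```

KAME explains itself in plain language. One setting, `kame_log_level`, controls how much it writes to the Docker log. The rotation algorithm is identical at every level; this only changes what you see. Change it in Settings → Plugins → KAME → Log level; it takes effect on the next monologue start (no restart).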
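Since the log level only filters output, it can be modeled as a gate in front of the logger. A minimal sketch; the event names and the rule that an explicit `kame_log_level` overrides the legacy `verbose_trace` flag are assumptions, not KAME's documented behavior.

```python
from typing import Mapping

LEVELS = ("silent", "normal", "verbose")

# Event kinds shown at `normal` (hypothetical names for the lines described in this README).
NORMAL_EVENTS = {"success", "rotation", "limit", "sleep", "error", "hard_error"}
# Extra diagnostics that only `verbose` adds.
VERBOSE_ONLY_EVENTS = {"calling", "picked", "pool_snapshot", "cascade", "session_summary"}


def resolve_log_level(settings: Mapping[str, object]) -> str:
    """Pick the effective level; an explicit kame_log_level wins over the legacy flag."""
    level = settings.get("kame_log_level")
    if level in LEVELS:
        return str(level)
    if settings.get("verbose_trace") is True:  # legacy `verbose_trace: true` maps to verbose
        return "verbose"
    return "normal"


def should_log(level: str, event: str) -> bool:
    """Decide whether an event is printed (pure filtering; never affects API calls)."""
    if level == "silent":
        return event == "hard_error"
    if level == "normal":
        return event in NORMAL_EVENTS
    return event in NORMAL_EVENTS or event in VERBOSE_ONLY_EVENTS
```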
At `normal` (the default), KAME writes one compact line per successful call, plus rotations, limit hits, sleeps and errors. Keys appear as anonymized fingerprints (configurable via `key_log_style`). The pool-health count shows only when the pool is degraded, so a healthy pool stays quiet:
```
[KAME] Chat|gemini-2.5-flash ✅ k0a770
```
When a key hits a limit, KAME tells you the real reason (v1.0.1, any provider) and rotates; the success line then says, in plain words, how many rotations it took (no more cryptic "3 attempts"):

```
[KAME] Chat|gemini-2.5-flash k0a770 ⏳ 429 per-minute → wait 37s · next key...
[KAME] Chat|gemini-2.5-flash ✅ k1b8c2 · 1 rotation · pool 14/15 healthy
```

A daily-quota or out-of-credit key is rested for a real cooldown instead of being hammered once per second:

```
[KAME] Chat|gemini-2.5-flash k1b8c2 ⏳ 429 daily-quota → cooling 1h · next key...
[KAME] Chat|gemini-2.5-flash ✅ k2c9d4 · 1 rotation · pool 13/15 healthy
```
At `verbose`, you get everything in `normal`, plus a `Calling...` heartbeat, the picked-key line, per-call wall time, the full pool snapshot on every success, a cascade breakdown, and a periodic session summary. Best while tuning the key pool, or when you suspect KAME is the bottleneck and want to prove it isn't:

```
[KAME] Chat|gemini-2.5-flash ⚡ Calling...
[KAME] Chat|gemini-2.5-flash ⚡ k0a770 picked in 0.08ms
[KAME] Chat|gemini-2.5-flash ✅ k0a770 in 2.4s | pool 15/15 healthy
```

After a cascade across several keys (with a sleep in the middle):

```
[KAME] Chat|gemini-2.5-flash ✅ k2c9d4 in 9.4s | pool 13/15 healthy | 5 rotations, 1 sleep
```
At `silent`, KAME stays out of the log almost entirely: no banner, no per-call line, no rotation or sleep notices. Only a hard, unrecoverable error still surfaces. Internal stats and key health are still tracked; only the log output is suppressed. Use it when you want the plugin to be invisible in the Docker log for fully unattended runs.
When the whole pool is cooling, KAME never goes dark without telling you. Near a recovery it logs each cycle (throttled); on a long outage (e.g. a full daily quota) it announces once, then waits quietly instead of spamming the log:
```
[KAME] Chat|gemini-2.5-flash 💤 All keys cooling. Sleeping 7.7s (no API calls) → next key in ~7s (wake 18:09:00)
[KAME] Chat|gemini-2.5-flash 💤 All keys cooling → next recovery in ~1h. Waiting quietly (no API calls), retry around 19:05:00.
```

The sleep notice means KAME is intentionally waiting, not stuck, so you never mistake a healthy cooldown for a hang.
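The announce-once behavior can be sketched as a tiny state machine. The 300s "long outage" threshold and the 10s throttle are hypothetical values chosen for illustration; this README does not state KAME's actual thresholds.

```python
class OutageNotifier:
    """Decide when to print an 'all keys cooling' notice (hypothetical sketch)."""

    def __init__(self, long_outage_s: float = 300.0, throttle_s: float = 10.0) -> None:
        self.long_outage_s = long_outage_s  # assumed cut-off for a "long outage"
        self.throttle_s = throttle_s        # assumed minimum gap between short-sleep notices
        self.announced = False
        self.last_notice = float("-inf")

    def should_notify(self, eta: float, now: float) -> bool:
        """eta: seconds until the soonest key recovers; now: current time in seconds."""
        if eta >= self.long_outage_s:
            # Long outage: announce once, then wait quietly.
            if self.announced:
                return False
            self.announced = True
            return True
        # Near a recovery: log each sleep cycle, but no more often than throttle_s.
        if now - self.last_notice >= self.throttle_s:
            self.last_notice = now
            return True
        return False

    def on_recovered(self) -> None:
        """Call once a key is healthy again so the next long outage is announced."""
        self.announced = False
```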
| Setting | Default | Purpose |
|---|---|---|
| `kame_log_level` | `normal` | How much KAME writes to the log: `silent` (nothing but hard errors), `normal` (one line per success + events; pool count only when degraded), or `verbose` (full diagnostics). A legacy `verbose_trace: true` still maps to `verbose`. |
| `daily_quota_cooldown_seconds` | `3600` | How long to rest a key after a daily-quota / out-of-credit error (any provider). Also the adaptive-backoff ceiling. Clamped to 1 to 86400. |
| `key_log_style` | `fingerprint` | How keys appear in logs: `fingerprint` (anonymized id, never leaks the secret), `prefix8` (first 8 chars), or `full` (debug only). |
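Loading these three settings could look roughly like this. The defaults and the 1 to 86400 clamp come from the table above; `load_settings`, `display_key`, and the fingerprint recipe (`k` plus the first five hex digits of SHA-256, picked only to match the `k0a770` shape) are assumptions, so KAME's real fingerprints will not match these values.

```python
import hashlib
from typing import Any, Dict, Mapping

DEFAULTS: Dict[str, Any] = {
    "kame_log_level": "normal",
    "daily_quota_cooldown_seconds": 3600,
    "key_log_style": "fingerprint",
}


def load_settings(raw: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge user settings over the documented defaults and clamp the cooldown."""
    settings = {**DEFAULTS, **raw}
    cooldown = float(settings["daily_quota_cooldown_seconds"])
    settings["daily_quota_cooldown_seconds"] = min(max(cooldown, 1.0), 86400.0)
    return settings


def display_key(api_key: str, style: str = "fingerprint") -> str:
    """Render an API key for logs according to key_log_style."""
    if style == "full":  # debug only: prints the secret itself
        return api_key
    if style == "prefix8":
        return api_key[:8]
    # fingerprint: a short one-way hash, so the secret never reaches the log.
    return "k" + hashlib.sha256(api_key.encode("utf-8")).hexdigest()[:5]
```

A hash-based fingerprint is stable across restarts, so the same key always shows the same id in the log, which is what makes rotation lines comparable over a long session.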
Everything else is opinionated and validated in production: the algorithm, sleep timing, jitter range, 60s RPM window, and quarantine logic are all tuned and tested. If you really want to tweak, the code in `kame_engine.py` is well-commented.
```bash
rm -rf /a0/usr/plugins/api_rotation_by_kame/
# Restart Agent Zero
```

KAME's `hooks.py::uninstall()` runs BEFORE deletion and reverts every monkey-patch. No leftover state.
**Do I need to restart Agent Zero after installing KAME?**
No. A0 hot-reloads plugins: dropping KAME into your plugins folder (or toggling it on) clears the plugin/extension caches, and KAME activates on the next agent turn. There is no container restart, no framework restart, and no "reload the page" prompt (KAME ships no extensions/webui). The only prerequisite is having multiple API keys configured in A0's normal model settings; KAME never stores keys itself, it rotates the ones A0 already has.
**I only have one API key. Does KAME help?**
With one key, KAME is roughly equivalent to A0's native rate limiter. The eternal carousel needs multiple keys to rotate through; 5+ keys are recommended for good RPM spreading.
**KAME picked the same key twice in a row. Bug?**
Likely not. If you have only 2-3 keys and one just succeeded with fresh RPM capacity, RPM-aware selection may legitimately re-pick it. With more keys (10+), this becomes very rare thanks to anti-dogpile.
**I'm seeing "429 daily-quota → cooling 1h". Is this a bug?**
No, that's the v1.0.1 daily-quota shield working. KAME detected a daily-quota or out-of-credit error and is resting that key for a real cooldown (default 1h, set by `daily_quota_cooldown_seconds`) instead of trusting a misleadingly short retry and hammering a dead key once per second. Your other keys keep working; the rested key is re-tried after the cooldown.
**Compression takes a long time. Is KAME slowing it down?**
No. KAME's "Trust the Connection" philosophy means zero artificial timeouts. A 90,000-token compression that legitimately takes 90 seconds takes 90 seconds. Without KAME, A0's native flow can crash with "timed out"; KAME lets it finish.
**Does verbose / debug logging cost extra API calls?**
Zero. Every log level is pure local instrumentation: it only changes how many lines KAME prints, never how it calls the API. Even `silent` still tracks stats and key health internally; it just doesn't write them out.
**Can I use KAME with Anthropic / OpenAI / others, not just Gemini?**
Yes. KAME is provider-agnostic. It works wherever Agent Zero's LiteLLM layer works. The retry-delay parser handles Google, OpenAI, Anthropic, Groq, and generic HTTP `Retry-After` headers, including compound durations like "6m 11.52s". v1.0.1's daily-quota detection and adaptive backoff are also multi-provider, so an exhausted daily key is handled correctly no matter who serves it.
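A sketch of parsing the two retry-hint shapes mentioned above: plain seconds and compound durations such as "6m 11.52s". `parse_retry_delay` is hypothetical and deliberately narrow: it does not handle the HTTP-date form of `Retry-After` or any provider-specific JSON, which a real parser would need to extract first.

```python
import re
from typing import Optional

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# One "<number><unit>" chunk; "ms" is listed before "m" so "500ms" is not read as minutes.
_PART = re.compile(r"(\d+(?:\.\d+)?)\s*(ms|h|m|s)")
# A whole compound duration such as "6m 11.52s" (chunks separated by optional spaces).
_COMPOUND = re.compile(r"(?:\d+(?:\.\d+)?\s*(?:ms|h|m|s)\s*)+")
_BARE_SECONDS = re.compile(r"\d+(?:\.\d+)?")


def parse_retry_delay(value: str) -> Optional[float]:
    """Convert a retry hint into seconds, or None if it is not understood."""
    text = value.strip().lower()
    if _BARE_SECONDS.fullmatch(text):
        return float(text)  # Retry-After in delta-seconds form, e.g. "120"
    if not _COMPOUND.fullmatch(text):
        return None  # e.g. an HTTP-date Retry-After; not handled in this sketch
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in _PART.findall(text))
```

Requiring the whole string to match before summing keeps a date like "Mon, 01 Mar ..." from being misread as one minute.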
**What if all my keys die permanently?**
KAME keeps cycling. Auth errors (401) and daily/account limits trigger a long cooldown (default 1h) for that key. If literally every key is dead, your agent sleeps (up to 60s per cycle), wakes, re-checks, and sleeps again until you fix it, announcing a long outage just once instead of spamming the log. No infinite spin against a wall, no wasted API calls.
- Agent Zero: v1.14+ (verified through v1.18)
- Python: 3.10+
- Providers: any LiteLLM-supported provider (Google, OpenAI, Anthropic, Mistral, Groq, DeepSeek, xAI, Together, ...)
- No new dependencies: stdlib only, on top of what A0 already ships
PRs welcome. The engine is intentionally small (~1,050 LOC, single file). When proposing changes:
- Keep the engine algorithm stable: selection, anti-dogpile, and ETA-driven sleep are battle-tested.
- Add features behind opt-in settings when possible (see `kame_log_level` / `key_log_style` as a pattern).
- Log production behavior in `test_logs/` with version-named files so changes can be audited.
Bugs and feature requests via GitHub issues.
MIT License β see LICENSE.
Copyright (c) 2026 KAME (https://github.com/Kame696)
You can use, modify, distribute, and even sell KAME; the only requirement is to keep the copyright notice and license text. The author retains copyright to KAME as the original work; the license simply grants permissive usage to others.
KAME has been in development since early 2026, learning from real production logs at every step:
| Version | Focus | Key insight |
|---|---|---|
| v1.0.1 | Quota awareness + reliability fixes + log overhaul | Google sends a misleading retryDelay: 1s on a daily 429 β trusting it re-probed a dead key once per second. Fixed with strict daily/account detection + a provider-agnostic adaptive backoff, plus honest per-call error reporting. Three reliability fixes: a mid-run chat message is now received without pressing nudge agent (KAME stopped swallowing A0's InterventionException); a stream that fails after emitting content no longer re-generates from scratch on another key (got_any_chunk guard, mirroring vanilla A0); and a sustained 503 server-busy outage on a big pool now escalates gently instead of spinning forever. The logs were reworked into a clear silent/normal/verbose tri-state with self-explanatory wording (no more cryptic "N attempts"). Engine selection path unchanged. |
| v1.0.0 | First stable release | Engine validated: 1,163 ops / 117 rate limits / 0 crashes. ETA-driven sleep proven in production. |
| v0.5.8.0 | The ETA Fix | Real log revealed: pulsing every 2s against sick keys burned ~45 wasted 429s in 26s. Fixed by sleeping exactly until next recovery. |
| v0.5.7.4 | Verbose Trace | Added opt-in observability: key short id, selection latency, pool snapshot, cascade summary, compression-aware filter. |
| v0.5.7.3 | The Trust Restored | Rolled back a misattributed "bug fix" that was actually the production-validated dispersion brake. |
| v0.5.7 | Packaging Cleanup | A0 v1.15 schema compliance, clean uninstall hooks. |
| v0.5.6 | The Trust | "Trust the Connection" philosophy formalized β zero artificial timeouts. |
| v0.5.0 - v0.5.5 | The Commander → The Refined | Identity-aware health, anti-dogpile, anti-thundering-herd, smart quarantine. |
| v0.4.x | The Seed → The Strategist | Foundational rotation, eternal carousel, basic RPM-awareness. |
The lesson across versions: the only way to build something this reliable is to run it in production and read the logs honestly. Every major improvement in KAME came from a real log showing real behavior β not from theory.
Built by KAME. Engine refinement guided by real production log analysis β including the v0.5.7.4 log that revealed the wasted-pulse bug fixed in v0.5.8.0. Special thanks to every 429 that taught KAME something new.
If KAME made your agent less frustrating, drop a star ⭐. It costs you nothing and helps others find this.
Star Kame696/kame-api-engine on GitHub ⭐
🐢⚡ KAME v1.0.1: because round-robin was never enough
Bitcoin: 36BGYhMEVFgY8PLGMVux93pjGt92KVM6dJ
4P1 R0T4T10N → 4FRE3D0M