Dual-Stream Memory Framework for LLM Infinite Context
BubbleStream is an inference-layer framework that gives any LLM effectively unlimited context by managing memory externally. Instead of relying on ever-growing context windows, it compresses conversation history into indexed memory blocks that "bubble up" on demand, reaching 94% memory precision across 32K+ tokens of cumulative dialogue with only a 16K effective context window.
This project was the starting point for a broader research program into bounded-state sequence learning. The architectural limitations discovered here (embedding-based retrieval ceiling, inability to learn) led directly to the design of MathBrain, a trainable architecture that solves the memory problem from first principles.
          User message
               │
               ▼
┌─────────────────────────────────┐
│         Thinking Stream         │
│    (strong model, reasoning)    │
│                                 │
│  [Memory zone] + [rolling       │
│   context] + [active zone]      │
└──────────────┬──────────────────┘
               │ completed segments
               ▼
┌─────────────────────────────────┐
│          Memory Stream          │
│    (weak model, compression)    │
│                                 │
│  compress → store → retrieve    │
│  → deduplicate → decay          │
└─────────────────────────────────┘
The core idea: separate reasoning from memory management.
- Thinking Stream uses a strong model (DeepSeek V3, Claude, etc.) and focuses only on reasoning. It sees a rolling context window plus injected memory blocks.
- Memory Stream uses a weaker/cheaper model and handles compression, storage, retrieval, and maintenance of memory "bubbles" asynchronously.
- Memory blocks bubble up into the thinking context when relevant -- either passively (reranked after each segment) or actively (when the reasoning model calls query_memory()).
- Works with any LLM: Pure inference-layer solution, no model modification or fine-tuning
- Effectively unlimited context: Rolling window + external memory, tested up to 32K+ tokens
- 94% memory precision: With only 16K effective context window
- Async dual-stream: Memory compression never blocks reasoning
- Natural forgetting: Unused memories decay via heat-based scoring
- Full web UI + CLI + API: Complete frontend and backend included
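The hand-off between the two streams can be sketched with an asyncio queue. This is a minimal illustration under stated assumptions, not BubbleStream's actual code: the names compress, memory_stream, thinking_stream, and run_dialogue are hypothetical, and both model calls are replaced by trivial stand-ins.

```python
import asyncio


async def compress(segment: str) -> str:
    """Stand-in for the weak-model compression call (hypothetical)."""
    await asyncio.sleep(0.01)   # simulate a slow LLM request
    return segment[:40]         # placeholder "summary": plain truncation


async def memory_stream(queue: asyncio.Queue, store: list) -> None:
    """Consume completed segments in order and store their compressed form."""
    while True:
        segment = await queue.get()
        if segment is None:     # sentinel: the dialogue is over
            return
        store.append(await compress(segment))


async def thinking_stream(queue: asyncio.Queue, segments: list) -> list:
    """Produce replies and hand each finished segment to the memory stream."""
    replies = []
    for segment in segments:
        replies.append(f"reply to: {segment}")  # stand-in for strong-model reasoning
        queue.put_nowait(segment)               # enqueue without waiting for compression
        await asyncio.sleep(0)                  # yield so the memory worker can run
    return replies


async def run_dialogue(segments: list):
    """Run both streams concurrently; return (replies, stored bubbles)."""
    queue: asyncio.Queue = asyncio.Queue()
    store: list = []
    worker = asyncio.create_task(memory_stream(queue, store))
    replies = await thinking_stream(queue, segments)
    queue.put_nowait(None)      # tell the worker to stop once the queue drains
    await worker
    return replies, store
```

Calling asyncio.run(run_dialogue(["first segment", "second segment"])) returns the replies together with the stored bubbles. The reasoning loop only enqueues segments and never awaits compress directly, which is the property the "async dual-stream" bullet describes.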
# Clone and install
git clone https://github.com/Mr-Skeleton-Max/BubbleStream.git
cd BubbleStream
# Set up environment
cp .env.example .env
# Edit .env and add your API key
# Install dependencies
pip install -r requirements.txt
# Start API server
python run_api.py
# Server runs at http://localhost:8000
# Start the web frontend
cd Web
npm install
npm run dev
# Opens at http://localhost:5173
# Or chat from the CLI
python cli.py chat "Hello, let's have a long conversation"
BubbleStream/
├── run_api.py # API server entry point
├── cli.py # CLI client
├── src/
│ ├── orchestrator.py # Dual-stream coordinator
│ ├── thinking/ # Thinking Stream
│ │ ├── stream.py # Main reasoning loop
│ │ ├── segment_detector.py # Segment boundary detection
│ │ ├── context_manager.py # Rolling context window
│ │ └── memory_interface.py # Memory query interface
│ ├── memory/ # Memory Stream
│ │ ├── bubble_generator.py # Compress segments → bubbles
│ │ ├── segment_queue.py # Async segment processing
│ │ ├── integration.py # Memory pipeline coordinator
│ │ └── graph/ # Graph-based memory store
│ ├── storage/ # Persistence layer (SQLite)
│ ├── shared/ # Config, LLM client, embeddings
│ └── api/ # FastAPI routes + WebSocket
├── Web/ # React + Vite frontend
├── cli/ # TypeScript CLI (alternative)
├── prompts/ # Prompt templates for both streams
└── docs/ # Design documents (Chinese)
- Separation of concerns: Reasoning model never manages memory directly
- Greedy compression: Better to over-store than to miss information
- Async by default: Memory processing never blocks the reasoning flow
- Natural decay: Unused memories lose heat and sink; accessed memories bubble up
- Single interface: The reasoning model's only memory operation is query_memory(query) -> str
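The decay and single-interface principles above can be illustrated with a toy store. This is a sketch under assumptions, not the project's implementation: the MemoryStore class, the multiplicative decay factor, the additive boost, and the keyword-overlap relevance score (standing in for embedding similarity) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Bubble:
    text: str
    heat: float = 1.0


class MemoryStore:
    """Toy heat-scored memory; all parameters here are illustrative assumptions."""

    def __init__(self, decay: float = 0.9, boost: float = 1.0):
        self.bubbles: list = []
        self.decay = decay    # assumed multiplicative cooling per segment
        self.boost = boost    # assumed heat added when a bubble is accessed

    def add(self, text: str) -> None:
        self.bubbles.append(Bubble(text))

    def tick(self) -> None:
        """Cool every bubble; intended to be called once per completed segment."""
        for bubble in self.bubbles:
            bubble.heat *= self.decay

    def _score(self, bubble: Bubble, query: str) -> float:
        # Keyword overlap stands in for embedding similarity (assumption).
        words = set(query.lower().split())
        overlap = len(words & set(bubble.text.lower().split()))
        return overlap * bubble.heat

    def query_memory(self, query: str) -> str:
        """The single memory operation exposed to the reasoning model."""
        if not self.bubbles:
            return ""
        best = max(self.bubbles, key=lambda b: self._score(b, query))
        if self._score(best, query) == 0:
            return ""
        best.heat += self.boost   # accessed memories warm up and "bubble up"
        return best.text
```

In this sketch, bubbles that are never returned by query_memory keep cooling on every tick, so they rank lower over time, while each access adds heat to the bubble that was returned.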
BubbleStream works well as an engineering solution, but it has a fundamental ceiling:
- Retrieval quality is bounded by embedding similarity. As memory grows, embedding-based retrieval becomes increasingly unreliable for nuanced or compositional queries.
- Not trainable. The system exploits Transformer attention properties but cannot learn or improve from experience.
- Memory precision is determined by the management mechanism, not the LLM's capability.
These limitations motivated the development of MathBrain -- a trainable architecture that replaces retrieval-based memory with categorical voting from bounded state.
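To make the first limitation concrete, here is a minimal sketch of similarity-based retrieval. It assumes cosine similarity over embedding vectors, which the source does not confirm as BubbleStream's exact metric; the cosine and top_k helpers and the hand-made two-dimensional vectors are hypothetical.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, memory: dict, k: int = 1) -> list:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        memory.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hand-made toy vectors; real embeddings have hundreds of dimensions or more.
memory = {"tea": [1.0, 0.0], "oslo": [0.0, 1.0]}
nearest = top_k([0.9, 0.1], memory)
```

The ranking depends only on vector geometry, so a query whose answer requires combining several stored facts can only succeed if its single embedding happens to land near all of them.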
- Python >= 3.11
- Node.js >= 18 (for frontend)
- An OpenAI-compatible API key (SiliconFlow, OpenAI, etc.)
MIT License. See LICENSE.