BubbleStream

Dual-Stream Memory Framework for LLM Infinite Context

BubbleStream is an inference-layer framework that gives any LLM effectively unlimited context by managing memory externally. Instead of relying on ever-growing context windows, it compresses conversation history into indexed memory blocks that "bubble up" on demand. In the project's tests it reached 94% memory precision across 32K+ tokens of cumulative dialogue with only a 16K-token effective context window.

This project was the starting point for a broader research program into bounded-state sequence learning. The architectural limitations discovered here (embedding-based retrieval ceiling, inability to learn) led directly to the design of MathBrain, a trainable architecture that solves the memory problem from first principles.

How It Works

User message
     │
     ▼
┌─────────────────────────────────┐
│       Thinking Stream           │
│       (strong model, reasoning) │
│                                 │
│  [Memory zone] + [rolling       │
│   context] + [active zone]      │
└──────────────┬──────────────────┘
               │ completed segments
               ▼
┌─────────────────────────────────┐
│        Memory Stream            │
│     (weak model, compression)   │
│                                 │
│  compress → store → retrieve    │
│  → deduplicate → decay          │
└─────────────────────────────────┘

The core idea: separate reasoning from memory management.

  • Thinking Stream uses a strong model (DeepSeek V3, Claude, etc.) and focuses only on reasoning. It sees a rolling context window plus injected memory blocks.
  • Memory Stream uses a weaker/cheaper model and handles compression, storage, retrieval, and maintenance of memory "bubbles" asynchronously.
  • Memory blocks bubble up into the thinking context when relevant -- either passively (reranked after each segment) or actively (when the reasoning model calls query_memory()).
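
The hand-off between the two streams can be pictured as a producer/consumer queue. The following is a minimal sketch, not the code in src/orchestrator.py: the names compress, memory_stream, thinking_stream, and run are hypothetical, and the weak-model call is replaced by a simple truncation.

```python
import asyncio


async def compress(segment: str) -> str:
    """Hypothetical stand-in for the weak-model compression call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return segment[:40]     # truncation stands in for summarization


async def memory_stream(queue: asyncio.Queue, store: list[str]) -> None:
    """Consume completed segments and store the compressed bubbles."""
    while True:
        segment = await queue.get()
        if segment is None:  # sentinel: the thinking stream has finished
            break
        store.append(await compress(segment))


async def thinking_stream(segments: list[str], queue: asyncio.Queue) -> None:
    """Hand each completed segment to the memory stream without waiting."""
    for segment in segments:
        queue.put_nowait(segment)  # non-blocking hand-off
    queue.put_nowait(None)


async def run(segments: list[str]) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    store: list[str] = []
    worker = asyncio.create_task(memory_stream(queue, store))
    await thinking_stream(segments, queue)
    await worker  # wait here only so the sketch can return the result
    return store


bubbles = asyncio.run(run(["first segment", "second segment"]))
```

Because the thinking side only enqueues, a slow compression call delays storage but never the reasoning loop.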

Key Properties

  • Works with any LLM: Pure inference-layer solution, no model modification or fine-tuning
  • Effectively unlimited context: Rolling window + external memory, tested on 32K+ tokens of cumulative dialogue
  • 94% memory precision: Measured with only a 16K-token effective context window
  • Async dual-stream: Memory compression never blocks reasoning
  • Natural forgetting: Unused memories decay via heat-based scoring
  • Full web UI + CLI + API: Complete frontend and backend included
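
To illustrate the "rolling window + external memory" property, here is a minimal sketch of a token-budgeted window that evicts its oldest turns and prepends memory blocks to the prompt. RollingContext, count_tokens, and build_prompt are hypothetical names, and whitespace splitting is a crude stand-in for a real tokenizer. The actual logic lives in src/thinking/context_manager.py and may differ.

```python
from collections import deque


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real tokenizer would differ."""
    return len(text.split())


class RollingContext:
    def __init__(self, budget: int) -> None:
        self.budget = budget              # maximum tokens kept in the window
        self.turns: deque[str] = deque()  # oldest turn on the left
        self.used = 0

    def append(self, turn: str) -> list[str]:
        """Add a turn and return the turns evicted to stay within budget."""
        self.turns.append(turn)
        self.used += count_tokens(turn)
        evicted = []
        # Always keep the newest turn, even if it alone exceeds the budget.
        while self.used > self.budget and len(self.turns) > 1:
            old = self.turns.popleft()
            self.used -= count_tokens(old)
            evicted.append(old)
        return evicted

    def build_prompt(self, memory_blocks: list[str]) -> str:
        """Memory zone first, then the rolling context."""
        return "\n".join([*memory_blocks, *self.turns])


ctx = RollingContext(budget=5)
ctx.append("a b c")
evicted = ctx.append("d e f")  # over budget, so the oldest turn is evicted
prompt = ctx.build_prompt(["[memory] a b c"])
```

In a design like this, the evicted turns are what would be handed to the Memory Stream for compression.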

Quick Start

Backend

# Clone and install
git clone https://github.com/Mr-Skeleton-Max/BubbleStream.git
cd BubbleStream

# Set up environment
cp .env.example .env
# Edit .env and add your API key

# Install dependencies
pip install -r requirements.txt

# Start API server
python run_api.py
# Server runs at http://localhost:8000

Frontend

cd Web
npm install
npm run dev
# Opens at http://localhost:5173

CLI

python cli.py chat "Hello, let's have a long conversation"

Architecture

BubbleStream/
├── run_api.py                    # API server entry point
├── cli.py                        # CLI client
├── src/
│   ├── orchestrator.py           # Dual-stream coordinator
│   ├── thinking/                 # Thinking Stream
│   │   ├── stream.py             #   Main reasoning loop
│   │   ├── segment_detector.py   #   Segment boundary detection
│   │   ├── context_manager.py    #   Rolling context window
│   │   └── memory_interface.py   #   Memory query interface
│   ├── memory/                   # Memory Stream
│   │   ├── bubble_generator.py   #   Compress segments → bubbles
│   │   ├── segment_queue.py      #   Async segment processing
│   │   ├── integration.py        #   Memory pipeline coordinator
│   │   └── graph/                #   Graph-based memory store
│   ├── storage/                  # Persistence layer (SQLite)
│   ├── shared/                   # Config, LLM client, embeddings
│   └── api/                      # FastAPI routes + WebSocket
├── Web/                          # React + Vite frontend
├── cli/                          # TypeScript CLI (alternative)
├── prompts/                      # Prompt templates for both streams
└── docs/                         # Design documents (Chinese)

Design Principles

  1. Separation of concerns: Reasoning model never manages memory directly
  2. Greedy compression: Better to over-store than to miss information
  3. Async by default: Memory processing never blocks the reasoning flow
  4. Natural decay: Unused memories lose heat and sink; accessed memories bubble up
  5. Single interface: The reasoning model's only memory operation is query_memory(query) -> str
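
Principles 4 and 5 can be sketched together. The example below is an assumption-laden toy, not the repository's implementation: Bubble, MemoryStore, tick, and the decay and boost parameters are hypothetical, and keyword overlap stands in for the embedding-based retrieval the project actually uses. Only the query_memory(query) -> str signature comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Bubble:
    text: str
    heat: float = 1.0


class MemoryStore:
    def __init__(self, decay: float = 0.5, boost: float = 1.0) -> None:
        self.decay = decay  # multiplicative cooling applied on each tick
        self.boost = boost  # heat added whenever a bubble is retrieved
        self.bubbles: list[Bubble] = []

    def add(self, text: str) -> None:
        self.bubbles.append(Bubble(text))

    def tick(self) -> None:
        """Cool every bubble; unused memories gradually sink."""
        for bubble in self.bubbles:
            bubble.heat *= self.decay

    def query_memory(self, query: str) -> str:
        """Return matching bubbles hottest-first and reheat them."""
        words = set(query.lower().split())
        hits = [b for b in self.bubbles if words & set(b.text.lower().split())]
        for bubble in hits:
            bubble.heat += self.boost
        hits.sort(key=lambda b: b.heat, reverse=True)
        return "\n".join(b.text for b in hits)


store = MemoryStore()
store.add("user likes tea")
store.add("user owns a cat")
store.tick()
first = store.query_memory("cat")    # reheats only the cat bubble
store.tick()
second = store.query_memory("user")  # both match; the cat bubble is hotter
```

Returning a plain string keeps the reasoning model unaware of heat, storage, or ranking, which is the point of the single-interface principle.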

Limitations and Lessons Learned

BubbleStream works well as an engineering solution, but it has a fundamental ceiling:

  • Retrieval quality is bounded by embedding similarity. As memory grows, embedding-based retrieval becomes increasingly unreliable for nuanced or compositional queries.
  • Not trainable. The system exploits Transformer attention properties but cannot learn or improve from experience.
  • Memory precision is determined by the management mechanism, not the LLM's capability.
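
The first limitation can be illustrated with a deliberately exaggerated toy. The sketch below uses a bag-of-words vector as a stand-in for an embedding, so word order is discarded entirely. Real embedding models retain some ordering information, so this shows the failure mode in its most extreme form rather than BubbleStream's measured behavior. The embed and cosine helpers are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding': word counts, order discarded."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


memories = [
    "alice paid bob",
    "bob paid alice",
]
query = "who paid alice"

# Both memories contain the same words, so similarity cannot tell them apart.
scores = [cosine(embed(query), embed(m)) for m in memories]
```

Here both memories receive the same score even though only one of them answers the question. Similarity alone cannot resolve who did what to whom, and that kind of compositional query becomes harder to serve as the store grows.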

These limitations motivated the development of MathBrain -- a trainable architecture that replaces retrieval-based memory with categorical voting from bounded state.

Requirements

  • Python >= 3.11
  • Node.js >= 18 (for frontend)
  • An OpenAI-compatible API key (SiliconFlow, OpenAI, etc.)

License

MIT License. See LICENSE.
