
UFCE-Streaming

Zero-memory streaming kernels for large-scale spatial-temporal interactions on consumer hardware. Supports massive-scale, large-context exact RAG, larger-than-VRAM LLM training via Project VELOCITY, and high-throughput JAX + CUDA implementations.

License: GPL v3 · Python · JAX

Performance Benchmarks

Hardware: AMD Ryzen 9 7950X (CPU) + NVIDIA RTX 4070 Ti (GPU) + 80GB DDR5 RAM

| Kernel Implementation | Throughput |
| --- | --- |
| Ryzen 9 7950X (CPU) | 43.33 Billion ops/s |
| RTX 4070 Ti (Scientific) | 1.50 Trillion ops/s |
| JAX "block-streaming" kernel | 2.02 Trillion ops/s |
| JAX Streaming Softmax | 232.6 Billion ops/s |
| JAX Top-K | 1.40 Billion ops/s |

Additional measured results:

  • 125 billion interaction points processed in 2.89 seconds (CPU) with 0.00 MB additional memory overhead.
  • 50 trillion interactions (equivalent to a 1 billion token context) processed in 24.7 seconds on RTX 4070 Ti.
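The Streaming Softmax figure above reflects the online-softmax technique: statistics are accumulated block by block, so the full score vector is never materialized at once. A minimal JAX sketch of that principle (an illustration of the idea, not the repo's tuned kernel; the function name and block size are ours):

```python
import jax.numpy as jnp

def streaming_softmax_stats(scores, block_size=4096):
    """Accumulate softmax statistics block by block (online softmax).

    Only one block is live at a time, so working memory is O(block_size)
    regardless of how long `scores` is.
    """
    m, s = -jnp.inf, 0.0
    for start in range(0, scores.shape[0], block_size):
        block = scores[start:start + block_size]
        m_new = jnp.maximum(m, jnp.max(block))
        # Rescale the running sum to the new maximum before adding this block.
        s = s * jnp.exp(m - m_new) + jnp.sum(jnp.exp(block - m_new))
        m = m_new
    return m, s  # softmax(x_i) = exp(x_i - m) / s
```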

Massive-Scale Large-Context RAG Agent

This repository includes a functional RAG agent that performs exact retrieval over large datasets (e.g., full Wikipedia) using a tiered memory reservoir (SSD-backed vector database) and Drill-Down attention.
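To make "exact retrieval over a tiered reservoir" concrete, here is a hedged sketch (illustrative only; the file layout follows the ingestion step below, and the function is not the agent's actual API): the vector file is memory-mapped from SSD and scanned block by block, so an exhaustive cosine top-k never needs the whole database in RAM.

```python
import numpy as np

def exact_topk(query_vec, db_path, n_vectors, dim, k=5, block=65536):
    """Exhaustive (exact) cosine top-k over an SSD-backed vector file."""
    # Memory-map the reservoir: only the block being scored is resident in RAM.
    db = np.memmap(db_path, dtype=np.float32, mode="r", shape=(n_vectors, dim))
    q = query_vec / np.linalg.norm(query_vec)
    best_scores = np.full(k, -np.inf)
    best_ids = np.full(k, -1)
    for start in range(0, n_vectors, block):
        chunk = np.asarray(db[start:start + block])
        sims = (chunk @ q) / (np.linalg.norm(chunk, axis=1) + 1e-9)
        ids = np.arange(start, start + len(chunk))
        # Merge this block's scores with the running top-k.
        merged_s = np.concatenate([best_scores, sims])
        merged_i = np.concatenate([best_ids, ids])
        top = np.argsort(merged_s)[-k:][::-1]
        best_scores, best_ids = merged_s[top], merged_i[top]
    return best_ids, best_scores
```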

Wikipedia Data Preparation

Step 1: Download the Wikipedia Dump

    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Step 2: Prepare the Data

  1. Install WikiExtractor:

    pip install wikiextractor

  2. Extract clean text shards into the wiki_shards/ folder:

    WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 --output wiki_shards/ -b 1M

This produces multiple text files in wiki_shards/ (one per large article batch), ready for the ingestion pipeline.

Step 3: Ingest into Reservoir

Overview

The UFCE ingestion pipeline has been upgraded to a sharded, resumable architecture for handling truly massive datasets (e.g., full Wikipedia dumps, Common Crawl subsets, or multi-domain corpora). It consists of two scripts:

  • ufce_ingestion_pipeline_shard.py: Processes individual text shards into vector + metadata pairs.
  • merge_shards.py: Concatenates all shards into the final knowledge_base_full.dat and metadata_full.txt used by the UFCE agent.

Convert the text into a Tiered Memory Reservoir (SSD-backed Vector DB). The sharded pipeline processes Wikipedia articles in independent chunks, using streaming embedding per shard to avoid RAM overload on massive datasets.

python ufce_ingestion_pipeline_shard.py   # one pass per shard (resumable)
python merge_shards.py
# Output: knowledge_base_full.dat (Binary Vectors) + metadata_full.txt (Index)
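For orientation, a minimal sketch of what one shard pass does (the model choice, file layout, and names are illustrative assumptions; the real logic, including resumability and semantic chunking, lives in ufce_ingestion_pipeline_shard.py):

```python
from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def ingest_shard(shard_path: Path, out_dir: Path, batch_size: int = 256):
    """Embed one text shard in streaming batches, appending to disk."""
    texts = [ln.strip() for ln in shard_path.open(encoding="utf-8") if ln.strip()]
    with (out_dir / f"{shard_path.stem}.dat").open("ab") as vf, \
         (out_dir / f"{shard_path.stem}.meta.txt").open("a", encoding="utf-8") as mf:
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            # Only one batch of embeddings is ever held in RAM.
            vecs = model.encode(batch, convert_to_numpy=True).astype(np.float32)
            vf.write(vecs.tobytes())            # raw float32 vectors
            mf.write("\n".join(batch) + "\n")   # one index line per vector
```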

Step 4: Launch the Agent with Local LLM (Ollama)

The UFCE agent connects to a running Ollama server to generate responses grounded in your knowledge base.

You have two options for running Ollama — choose based on convenience and performance.

Option 1: Ollama on Host Machine (Windows — Recommended for Speed & Ease)

This is the simplest and fastest setup: Ollama runs natively on your Windows machine, with no container overhead and direct GPU access (if using the Ollama GPU build).

  1. Install Ollama (if not already): download the installer from https://ollama.com and run it.

  2. Download Llama-3:

    ollama pull llama3      # 8B model (fast, ~4.7GB)
    # or
    ollama pull llama3:70b  # 70B model (if you have 48GB+ RAM/VRAM)
  3. Start the Model (in a separate CMD/PowerShell window):

ollama run llama3

Keep this window open; it runs the server on localhost:11434 (Ollama's default port).

Option 2: Ollama Inside Docker Container

Use this if you want everything isolated in the container.

  1. Add to your Dockerfile (or run manually):

    # Install Ollama
    RUN curl -fsSL https://ollama.com/install.sh | sh

    # Pull the model at build time (the server must be running during the pull)
    RUN ollama serve & sleep 5 && ollama pull llama3

  2. Start the Ollama server in the container background:

    ollama serve &

  3. Update the agent URL (in UFCE_agent.py): OLLAMA_URL = "http://localhost:11434/api/generate"

Step 5: Launch the Agent

Connect the local LLM (via Ollama) to the reservoir.

python UFCE_agent.py
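The grounding call itself is simple. A hedged sketch (Ollama's default port 11434 and /api/generate endpoint are standard; the prompt template and function name are illustrative, not UFCE_agent.py's actual code):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def grounded_answer(question: str, passages: list) -> str:
    """Send retrieved passages plus the question to the local Ollama server."""
    prompt = ("Answer using only this context:\n"
              + "\n---\n".join(passages)
              + f"\n\nQuestion: {question}")
    r = requests.post(OLLAMA_URL, json={
        "model": "llama3",
        "prompt": prompt,
        "stream": False,  # one JSON object instead of a token stream
    }, timeout=300)
    r.raise_for_status()
    return r.json()["response"]
```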

🧬 UFCE Genome: Searching the Code of Life

The repo supports semantic search over genomic data (e.g., E. coli genome or larger sequences).

Key Capabilities

  • Semantic DNA Search: Find "CRISPR arrays" or "Lac Operon" not just by exact string matching, but by functional similarity in vector space.
  • Smart Windowing: Automatically expands context from 600bp to 1500bp when biological keywords (e.g., "operon", "cluster") are detected.
  • Regex Highlighting: Instantly highlights motifs (e.g., GATTACA) regardless of case, with precise base-pair positioning (a minimal sketch of both behaviors follows this list).
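As referenced above, a minimal sketch of the windowing and highlighting behaviors (the keyword list and function names are illustrative assumptions, not the demo's actual code):

```python
import re

BIO_KEYWORDS = {"operon", "cluster", "island"}  # illustrative trigger words

def window_size(query: str) -> int:
    """Expand context from 600 bp to 1500 bp when biological keywords appear."""
    return 1500 if any(k in query.lower() for k in BIO_KEYWORDS) else 600

def highlight_motif(sequence: str, motif: str):
    """Return (start, end) base-pair positions of a motif, case-insensitive."""
    return [(m.start(), m.end()) for m in re.finditer(motif, sequence, re.IGNORECASE)]

print(window_size("find the lac operon"))                # 1500
print(highlight_motif("ccgattacaGATTACAtt", "GATTACA"))  # [(2, 9), (9, 16)]
```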

🚀 Quick Start (Genome Demo)

Switch to the genome dataset in velocity_config.json and run the dedicated pipeline:

# 1. Download & Preprocess E. coli Genome
python genome/preprocessors/preprocess_fasta_ecoli.py

# 2. Vectorize (JAX Accelerated)
python ufce_ingestion_pipeline_shard.py

# 3. Merge Shards
python merge_shards.py

# 4. Run the Search Tool
python genome/genome_search/ufce_genome_search_demo.py

📂 Repository Contents

🧠 Core Details

  • ufce_jax_god_mode_benchmark.py — Runs the "God Mode" block-streaming kernel (2.02T ops/s).
  • ufce_jax_real_world_measurements.py — Runs the Streaming Softmax (LLM) and Top-K (Security) kernels.
  • ufce_attention_core.py (Legacy) — Core attention logic (now integrated into the agent and trainers).

🛠️ Ingestion Pipeline (Sharded & Resumable)

  • ufce_ingestion_pipeline_shard.py — Processes individual Wikipedia shards into vector + metadata pairs (resumable, semantic chunking).
  • merge_shards.py — Concatenates all shards into the final knowledge_base_full.dat + metadata_full.txt.
  • prepare_wiki_dump.py (Deprecated) — Legacy single-file processor (use WikiExtractor instead).

🚀 Training & Acceleration (VELOCITY)

  • velocity_70b_trainer_save_layers.py — Full trainer with weight persistence (70B-capable).
  • velocity_8b_trainer_save_layers.py — 8B version with saving.
  • velocity_8b_hybrid_trainer_save_layers.py — Hybrid with RAM cache and periodic checkpoints.
  • velocity_8b_hybrid_trainer_save_layers_less_128GB_ram.py — Disk-offload optimizer state for lower RAM systems.

🤖 Agent & Demo

  • UFCE_agent.py — The interactive RAG agent with massive-scale, large-context retrieval over the knowledge base.

📜 Legacy & Docs

  • validate_attention.cu — Optimized CUDA C++ kernels.

📊 Benchmarks & Legacy Demos

  • test_search.py — Simple interactive vector search demo on a test subset (loads memmap vectors + metadata, performs cosine similarity search with SentenceTransformer).
  • ufce_layer_swapper.py (Legacy Demo) — Early proof-of-concept showing a 32GB model forward pass on 12GB VRAM (~25s total). Superseded by full trainers.

🛠️ Quick Start (VS Code Dev Container)

This repo is configured as a VS Code Dev Container — the easiest way to reproduce.

  1. Install VS Code and the "Dev Containers" extension.
  2. Clone the repo and open the folder in VS Code.
  3. When prompted, click "Reopen in Container".

Run the Benchmarks

CPU Tests (inside the container terminal, from the "UFCE Algorithms python C++" folder):

python cyber_validation.py
python blockchain_validation.py

Run the "God Mode" Benchmark:

python ufce_jax_god_mode_benchmark.py

GPU Test (CUDA C++ native, from the "UFCE Algorithm CUDA" folder):

# For RTX 40-series (Ada Lovelace)
nvcc -o attention_gpu validate_attention.cu -O3 -arch=sm_89
./attention_gpu

🧠 The "Cognitive Tax": Blind vs. Smart Processing

We benchmarked the cost of introducing physics-informed decision logic into the kernel.

| Kernel Type | Logic | Throughput | Insight |
| --- | --- | --- | --- |
| Blind "God Mode" | No decision (always GPU) | 2.02 Trillion ops/s | Pure Tensor Core saturation. |
| Cognitive Hybrid | Physics check per block | 0.35 Trillion ops/s | The cost of flexibility: conditional logic (if/else) breaks pure kernel fusion but enables dynamic energy saving. |

Conclusion: For maximum raw power, use the Blind Kernel. For energy-efficient robotics (where you want to idle the GPU during low-flux), use the Cognitive Kernel.
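To make the trade-off concrete, a hedged JAX sketch (illustrative only, not the benchmarked kernels): the blind path is one unconditional fused reduction, while the cognitive path wraps each block in a lax.cond, which blocks full fusion but lets low-flux blocks skip work.

```python
import jax
import jax.numpy as jnp

@jax.jit
def blind_kernel(blocks):
    """One unconditional fused reduction over every block."""
    return jnp.sum(blocks * blocks)

@jax.jit
def cognitive_kernel(blocks, flux_threshold=1e-3):
    """Physics check per block; the branch is the 'cognitive tax'."""
    def per_block(block):
        flux = jnp.mean(jnp.abs(block))
        return jax.lax.cond(
            flux > flux_threshold,
            lambda b: jnp.sum(b * b),             # high flux: compute
            lambda b: jnp.asarray(0.0, b.dtype),  # low flux: idle
            block,
        )
    # lax.map keeps the cond a real branch (vmap would lower it to a select),
    # so low-flux blocks genuinely skip work.
    return jnp.sum(jax.lax.map(per_block, blocks))
```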

⚡ Project VELOCITY: Breaking the 12GB VRAM Barrier

While traditional training requires the entire model to fit in VRAM, Project VELOCITY implements a Layer-Wise Swapper. This treats your System RAM (DDR5) as a high-speed L4 cache and your GPU VRAM as a dedicated compute core.

The "Massive scale large-context Model" Benchmark

We successfully executed a 32GB Model (Llama-3-8B equivalent) on a single 12GB RTX 4070 Ti.

| Metric | Achievement | Impact |
| --- | --- | --- |
| Model Size | 32 GB (FP32/BF16) | 2.7x larger than physical VRAM capacity. |
| Layer Latency | ~0.78–0.94 s (forward) | Sub-second per layer with real weights. |
| Effective Throughput | 37.5M tokens/sec (empirical) | Up to 512 GB/s theoretical with 4-bit transport. |
| Training Capability | Full backpropagation | Real forward + backward passes; persistent fine-tuning possible. |

🔬 How it Works: The Quad-Buffered Ring Pipeline

Project VELOCITY eliminates the "Stop-and-Go" latency of standard data loading. By using 4-zone asynchronous DMA, the system "teleports" the next layer into the GPU while the current layer is computing. A minimal sketch of the idea follows the list below.

  1. Ingest: Fetches the next layer from the L4 Cache (System RAM) or falls back to Storage (SSD).
  2. Tokenize/Prepare: Prepares data (optional). Bitcasts or formats the tensor (e.g., Int16 View) for optimal transport.
  3. Pin/Feed: Pages the memory into a "Page-Locked" DMA zone to trigger a direct Asynchronous DMA transfer across the PCIe bus.
  4. Compute: The GPU executes the forward/backward pass. This is the only moment the layer occupies VRAM.
  5. Writeback: Updates gradients and clears the VRAM for the next incoming layer.
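As referenced above, a hedged two-buffer simplification of the quad-buffered design (the loader and layer shapes are hypothetical stand-ins): because jax.device_put dispatches asynchronously, the transfer of layer i+1 overlaps with the compute of layer i.

```python
import jax
import jax.numpy as jnp
import numpy as np

def load_layer_from_ram(i):
    """Stand-in for step 1 (Ingest): fetch layer i from the L4 cache."""
    return np.random.rand(4096, 4096).astype(np.float32)

@jax.jit
def apply_layer(weights, activations):
    """Step 4 (Compute): the only moment this layer occupies VRAM."""
    return jnp.tanh(activations @ weights)

def forward(n_layers, activations):
    next_w = jax.device_put(load_layer_from_ram(0))  # prefetch the first layer
    for i in range(n_layers):
        w = next_w
        if i + 1 < n_layers:
            # Steps 1-3: start the next transfer while this layer computes.
            next_w = jax.device_put(load_layer_from_ram(i + 1))
        activations = apply_layer(w, activations)
        del w  # step 5: drop the reference so the VRAM can be reused
    return activations

out = forward(4, jnp.ones((8, 4096)))
```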

⚖️ Licensing & Commercial Use

Open Source License

This project is open-source under the GNU General Public License v3.0 (GPLv3). This ensures that the core project remains free for researchers, students, and open-source projects.

Commercial Licensing

For proprietary software, closed-source applications, or enterprise use cases where GPLv3 compliance is not feasible, a Commercial License is available. This license waives the copyleft requirements.

Contact: thoughttimemachinexr@gmail.com for enterprise inquiries.

📚 Citation

If you use the UFCE Streaming Kernels or the massive-scale, large-context agent in your research, please cite the project:

@software{UFCE-Streaming_2025,
  author = {Kyle Killian},
  title = {The UniField Coupling Equation (UFCE): Zero-Memory Streaming Kernels},
  year = {2025},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.17906337},
  url = {https://github.com/thoughttimemachinexr/UFCE}
}

Disclaimer

This software is provided "AS IS", without warranty of any kind.  
It is experimental research code. Use it entirely at your own risk.
