Zero-memory streaming kernels for large-scale spatial-temporal interactions on consumer hardware. Supports massive-scale, large-context exact RAG; larger-than-VRAM LLM training via Project VELOCITY; and high-throughput JAX + CUDA implementations.
Hardware: AMD Ryzen 9 7950X (CPU) + NVIDIA RTX 4070 Ti (GPU) + 80GB DDR5 RAM
| Kernel Implementation | Throughput |
|---|---|
| Ryzen 9 7950X (CPU) | 43.33 Billion ops/s |
| RTX 4070 Ti (Scientific) | 1.50 Trillion ops/s |
| JAX "block-streaming kernel" | 2.02 Trillion ops/s |
| JAX Streaming Softmax | 232.6 Billion ops/s |
| JAX Top-K | 1.40 Billion ops/s |
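The Streaming Softmax entry refers to an online-softmax formulation: the row maximum and normalizer are carried as running statistics, so the full logit vector is never materialized. A minimal JAX sketch of that pattern (the block size and the fused value reduction are illustrative, not the repo's exact kernel):

```python
import jax
import jax.numpy as jnp

@jax.jit
def streaming_softmax_dot(logits, values, block=4096):
    """Compute softmax(logits) @ values one block at a time (online softmax).

    The carry holds (running max, running normalizer, running weighted sum),
    so no full-length softmax vector ever exists in memory.
    """
    n_blocks = logits.shape[0] // block

    def body(i, carry):
        m, z, s = carry
        lb = jax.lax.dynamic_slice_in_dim(logits, i * block, block)
        vb = jax.lax.dynamic_slice_in_dim(values, i * block, block)
        m_new = jnp.maximum(m, lb.max())
        scale = jnp.exp(m - m_new)          # rescale the old statistics
        w = jnp.exp(lb - m_new)
        return m_new, z * scale + w.sum(), s * scale + w @ vb

    init = (jnp.array(-jnp.inf), jnp.array(0.0), jnp.array(0.0))
    m, z, s = jax.lax.fori_loop(0, n_blocks, body, init)
    return s / z

x = jnp.linspace(-4.0, 4.0, 1 << 16)
print(streaming_softmax_dot(x, x))          # agrees with jax.nn.softmax(x) @ x
```

The same carry trick is what lets a streaming kernel visit arbitrarily long contexts at fixed memory.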
Additional measured results:
- 125 billion interaction points processed in 2.89 seconds (CPU) with 0.00 MB additional memory overhead.
- 50 trillion interactions (equivalent to a 1 billion token context) processed in 24.7 seconds on RTX 4070 Ti.
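The constant-memory claim follows from the kernel structure: the interaction matrix is never materialized; it is walked in fixed-size tiles that are folded into an accumulator. A minimal sketch of that block-streaming pattern in JAX (the tile size and the toy interaction q[i] * k[j] are illustrative):

```python
import jax
import jax.numpy as jnp

@jax.jit
def streaming_interactions(q, k, block=4096):
    """Reduce over all N x M pairwise interactions q[i] * k[j].

    Only one (N, block) tile exists at any time, so peak memory is
    constant in M: the full N x M matrix is never allocated.
    """
    n_blocks = k.shape[0] // block

    def body(i, acc):
        kb = jax.lax.dynamic_slice_in_dim(k, i * block, block)
        tile = q[:, None] * kb[None, :]     # one (N, block) tile
        return acc + tile.sum(axis=1)       # fold it into the accumulator

    return jax.lax.fori_loop(0, n_blocks, body, jnp.zeros_like(q))

q = jnp.ones(8192)
k = jnp.ones(1 << 20)                       # 2^33 pairwise interactions total
print(streaming_interactions(q, k))
```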
This repository includes a functional RAG agent that performs exact retrieval over large datasets (e.g., full Wikipedia) using a tiered memory reservoir (SSD-backed vector database) and Drill-Down attention.
- Download the Wikipedia dump:
```
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
- Extract Clean Text Shards (using WikiExtractor):
```
pip install wikiextractor
```
- Run extraction to create the wiki_shards/ folder:
```
WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 --output wiki_shards/ -b 1M
```
This produces multiple text files in wiki_shards/ (one per large article batch), ready for the ingestion pipeline.
The UFCE ingestion pipeline has been upgraded to a sharded, resumable architecture for handling truly massive datasets (e.g., full Wikipedia dumps, Common Crawl subsets, or multi-domain corpora). It consists of two scripts:
- `ufce_ingestion_pipeline_shard.py`: Processes individual text shards into vector + metadata pairs.
- `merge_shards.py`: Concatenates all shards into the final `knowledge_base_full.dat` and `metadata_full.txt` used by the UFCE agent.
Convert the text into a Tiered Memory Reservoir (SSD-backed Vector DB). The sharded pipeline processes Wikipedia articles in independent chunks, using streaming embedding per shard to avoid RAM overload on massive datasets.
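The per-shard flow amounts to roughly the following (the embedding model, chunking, and file layout here are illustrative assumptions, not the exact `ufce_ingestion_pipeline_shard.py` logic):

```python
import os
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model

def ingest_shard(shard_path, out_dir, batch=256):
    """Embed one text shard into a vector file + metadata file.

    Resumable: a shard whose output already exists is skipped, so the
    pipeline can be restarted after an interruption.
    """
    stem = os.path.splitext(os.path.basename(shard_path))[0]
    vec_path = os.path.join(out_dir, f"{stem}.dat")
    if os.path.exists(vec_path):
        return                                     # already processed
    with open(shard_path, encoding="utf-8") as f:
        chunks = [line.strip() for line in f if line.strip()]
    with open(vec_path, "wb") as vecs, \
         open(os.path.join(out_dir, f"{stem}.meta.txt"), "w", encoding="utf-8") as meta:
        # Stream in small batches: RAM usage stays flat regardless of shard size
        for i in range(0, len(chunks), batch):
            emb = model.encode(chunks[i:i + batch], normalize_embeddings=True)
            emb.astype(np.float32).tofile(vecs)
            meta.writelines(c[:200] + "\n" for c in chunks[i:i + batch])
```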
```
python UFCE_ingestion_pipeline.py
# Output: knowledge_base.dat (Binary Vectors) + metadata.txt (Index)
```
The UFCE agent connects to a running Ollama server to generate responses grounded in your knowledge base.
You have two options for running Ollama — choose based on convenience and performance.
This is the fastest and simplest setup: Ollama runs natively on your Windows machine. The host version is faster, with no container overhead and direct GPU access (if using the Ollama GPU build).
- Install Ollama (if not already):
  - Download from https://ollama.com/download
  - Run the installer.
- Download Llama-3:
```
ollama pull llama3        # 8B model (fast, ~4.7GB)
# or
ollama pull llama3:70b    # 70B model (if you have 48GB+ RAM/VRAM)
```
- Start the model (in a separate CMD/PowerShell window):
```
ollama run llama3
```
Keep this window open; it runs the server on localhost:XXXXXX.
Use this if you want everything isolated in the container.
- Add to your Dockerfile (or run manually):
```
# Install Ollama
RUN curl -fsSL https://ollama.com/install.sh | sh
# Pull model (do this at build time or first run)
RUN ollama pull llama3
```
- Start the Ollama server in the container background:
```
ollama serve &
```
- Update the agent URL (in UFCE_agent.py):
```
OLLAMA_URL = "http://localhost:XXXXXX/api/generate"
```
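For reference, the agent's call to that endpoint is a plain POST against Ollama's /api/generate; a minimal non-streaming sketch (the prompt is illustrative):

```python
import requests

OLLAMA_URL = "http://localhost:XXXXXX/api/generate"   # substitute your Ollama port

def generate(prompt, model="llama3"):
    """Single non-streaming completion from the local Ollama server."""
    r = requests.post(OLLAMA_URL,
                      json={"model": model, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

print(generate("Answer using only the retrieved context: ..."))
```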
Connect the local LLM (via Ollama) to the reservoir:
```
python UFCE_agent.py
```
The repo supports semantic search over genomic data (e.g., the E. coli genome or larger sequences).
- Semantic DNA Search: Find "CRISPR arrays" or "Lac Operon" not just by exact string matching, but by functional similarity in vector space.
- Smart Windowing: Automatically expands context from 600bp to 1500bp when biological keywords (e.g., "operon", "cluster") are detected.
- Regex Highlighting: Instantly highlights motifs (e.g., GATTACA) regardless of case, with precise base-pair positioning. (Both the windowing and highlighting behaviors are sketched below.)
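A minimal sketch of those two behaviors (the keyword list, helper names, and toy genome are illustrative; the window sizes follow the 600bp/1500bp figures above):

```python
import re

BIO_KEYWORDS = {"operon", "cluster", "array"}     # illustrative trigger words

def context_window(query, center, genome, base=600, expanded=1500):
    """Window around a hit, widened when biological keywords appear in the query."""
    span = expanded if any(k in query.lower() for k in BIO_KEYWORDS) else base
    lo = max(0, center - span // 2)
    return lo, genome[lo:lo + span]

def highlight(motif, window, offset):
    """Case-insensitive motif matches with absolute base-pair positions."""
    return [(offset + m.start(), m.group(0))
            for m in re.finditer(motif, window, flags=re.IGNORECASE)]

lo, win = context_window("lac operon region", 120_000, "ACGT" * 100_000)
print(highlight("gattaca", win, lo))              # case-insensitive motif search
```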
Switch to the genome dataset in velocity_config.json and run the dedicated pipeline:
```
# 1. Download & Preprocess E. coli Genome
python genome/preprocessors/preprocess_fasta_ecoli.py
# 2. Vectorize (JAX Accelerated)
python ufce_ingestion_pipeline_shard.py
# 3. Merge Shards
python merge_shards.py
# 4. Run the Search Tool
python genome/genome_search/ufce_genome_search_demo.py
```
- `ufce_jax_god_mode_benchmark.py` — Runs the "God Mode" block-streaming kernel (2.02T ops/s).
- `ufce_jax_real_world_measurements.py` — Runs the Streaming Softmax (LLM) and Top-K (Security) kernels.
- `ufce_attention_core.py` — Legacy — Core attention logic (now integrated into the agent and trainers).
- `ufce_ingestion_pipeline_shard.py` — Processes individual Wikipedia shards into vector + metadata pairs (resumable, semantic chunking).
- `merge_shards.py` — Concatenates all shards into the final `knowledge_base_full.dat` + `metadata_full.txt`.
- `prepare_wiki_dump.py` — Deprecated — Legacy single-file processor (use WikiExtractor instead).
- `velocity_70b_trainer_save_layers.py` — Full trainer with weight persistence (70B-capable).
- `velocity_8b_trainer_save_layers.py` — 8B version with saving.
- `velocity_8b_hybrid_trainer_save_layers.py` — Hybrid with RAM cache and periodic checkpoints.
- `velocity_8b_hybrid_trainer_save_layers_less_128GB_ram.py` — Disk-offload optimizer state for lower-RAM systems.
- `UFCE_agent.py` — The interactive RAG agent with massive-scale, large-context retrieval over the knowledge base.
- `validate_attention.cu` — Optimized CUDA C++ kernels.
- `test_search.py` — Simple interactive vector search demo on a test subset (loads memmap vectors + metadata, performs cosine similarity search with SentenceTransformer); its core loop is sketched below.
- `ufce_layer_swapper.py` — Legacy Demo — Early proof-of-concept showing a 32GB model forward pass on 12GB VRAM (~25s total). Superseded by the full trainers.
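A minimal sketch of that search loop (the embedding width, model, and filenames are illustrative):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

DIM = 384                                   # illustrative embedding width
vecs = np.memmap("knowledge_base.dat", dtype=np.float32, mode="r").reshape(-1, DIM)
meta = open("metadata.txt", encoding="utf-8").read().splitlines()
model = SentenceTransformer("all-MiniLM-L6-v2")

def search(query, k=5):
    """Cosine similarity over SSD-backed vectors (memmap pages them on demand)."""
    q = model.encode(query, normalize_embeddings=True)
    scores = vecs @ q                       # dot product == cosine: vectors normalized
    top = np.argsort(scores)[-k:][::-1]
    return [(float(scores[i]), meta[i]) for i in top]

for score, text in search("lac operon"):
    print(f"{score:.3f}  {text}")
```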
This repo is configured as a VS Code Dev Container — the easiest way to reproduce.
- Install VS Code and the "Dev Containers" extension.
- Clone the repo and open the folder in VS Code.
- When prompted, click "Reopen in Container".
CPU Tests (run inside the container terminal, from the `UFCE Algorithms python C++` folder):
```
python cyber_validation.py
python blockchain_validation.py
```
Run the "God Mode" Benchmark:
```
python ufce_jax_god_mode_benchmark.py
```
GPU Test (CUDA C++ Native, from the `UFCE Algorithm CUDA` folder):
```
# For RTX 40-series (Ada Lovelace)
nvcc -o attention_gpu validate_attention.cu -O3 -arch=sm_89
./attention_gpu
```
We benchmarked the cost of introducing physics-informed decision logic into the kernel.
| Kernel Type | Logic | Throughput | Insight |
|---|---|---|---|
| Blind "God Mode" | No decision (Always GPU) | 2.02 Trillion Ops/s | Pure Tensor Core saturation. |
| Cognitive Hybrid | Physics-Check per block | 0.35 Trillion Ops/s | The cost of flexibility. Conditional logic (if/else) breaks pure kernel fusion, but enables dynamic energy saving. |
Conclusion: For maximum raw power, use the Blind Kernel. For energy-efficient robotics (where you want to idle the GPU during low-flux), use the Cognitive Kernel.
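The throughput gap can be seen in miniature: the hybrid variant gates every block behind a predicate, and that per-block `lax.cond` is exactly the branch that prevents a fully fused schedule. A sketch under illustrative assumptions (the flux statistic and threshold are stand-ins for the physics check):

```python
import jax
import jax.numpy as jnp

FLUX_THRESHOLD = 1e-3                       # illustrative "low-flux" cutoff

@jax.jit
def cognitive_streaming(q, k, block=4096):
    """Block-streaming reduction where each block is gated by a cheap check.

    Blocks whose flux statistic falls below the threshold are skipped,
    trading peak throughput for the ability to idle on quiet data.
    """
    n_blocks = k.shape[0] // block

    def body(i, acc):
        kb = jax.lax.dynamic_slice_in_dim(k, i * block, block)
        flux = jnp.abs(kb).mean()           # cheap per-block statistic
        return jax.lax.cond(
            flux > FLUX_THRESHOLD,
            lambda a: a + (q[:, None] * kb[None, :]).sum(axis=1),   # do the work
            lambda a: a,                                            # skip the block
            acc,
        )

    return jax.lax.fori_loop(0, n_blocks, body, jnp.zeros_like(q))
```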
While traditional training requires the entire model to fit in VRAM, Project VELOCITY implements a Layer-Wise Swapper. This treats your System RAM (DDR5) as a high-speed L4 cache and your GPU VRAM as a dedicated compute core.
We successfully executed a 32GB Model (Llama-3-8B equivalent) on a single 12GB RTX 4070 Ti.
| Metric | Achievement | Impact |
|---|---|---|
| Model Size | 32 GB (FP32/BF16) | 2.7x larger than physical VRAM capacity. |
| Layer Latency | ~0.78–0.94s (forward) | Sub-second per layer with real weights. |
| Effective Throughput | 37.5M tokens/sec (empirical) | Up to 512 GB/s theoretical with 4-bit transport. |
| Training Capability | Full backpropagation | Real forward + backward passes; persistent fine-tuning possible. |
Project VELOCITY eliminates the "Stop-and-Go" latency of standard data loading. By using 4-zone asynchronous DMA, the system "teleports" the next layer into the GPU while the current layer is computing.
- Ingest: Fetches the next layer from the L4 Cache (System RAM) or falls back to Storage (SSD).
- Tokenize/Prepare: Prepares data (optional). Bitcasts or formats the tensor (e.g., Int16 View) for optimal transport.
- Pin/Feed: Pages the memory into a "Page-Locked" DMA zone to trigger a direct Asynchronous DMA transfer across the PCIe bus.
- Compute: The GPU executes the forward/backward pass. This is the only moment the layer occupies VRAM.
- Writeback: Updates gradients and clears the VRAM for the next incoming layer.
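A minimal sketch of the prefetch/compute overlap behind these stages, in PyTorch (the pinned staging, side stream, and matmul "layer" are simplifications of the real trainers):

```python
import torch

def run_layers(layers_cpu, x):
    """Forward pass in which only one layer's weights occupy VRAM at a time.

    While layer i computes on the default stream, layer i+1 is staged in
    page-locked RAM and DMA-copied host->device on a side stream.
    """
    copy_stream = torch.cuda.Stream()
    pinned = layers_cpu[0].pin_memory()               # Pin/Feed: page-locked zone
    with torch.cuda.stream(copy_stream):
        w_next = pinned.to("cuda", non_blocking=True) # async DMA across PCIe

    for i in range(len(layers_cpu)):
        torch.cuda.current_stream().wait_stream(copy_stream)   # layer i has landed
        w = w_next
        if i + 1 < len(layers_cpu):
            pinned = layers_cpu[i + 1].pin_memory()   # Ingest the next layer
            with torch.cuda.stream(copy_stream):
                w_next = pinned.to("cuda", non_blocking=True)  # prefetch during compute
        x = x @ w                                     # Compute: only now in VRAM
        w.record_stream(torch.cuda.current_stream())  # keep alive until compute ends
        del w                                         # Writeback: free for next layer
    return x

# Usage: eight random 4096x4096 "layers" held in system RAM
layers = [torch.randn(4096, 4096) for _ in range(8)]
out = run_layers(layers, torch.randn(1, 4096, device="cuda"))
```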
This project is open-source under the GNU General Public License v3.0 (GPLv3). This ensures that the core project remains free for researchers, students, and open-source projects.
For proprietary software, closed-source applications, or enterprise use cases where GPLv3 compliance is not feasible, a Commercial License is available. This license waives the copyleft requirements.
Contact: thoughttimemachinexr@gmail.com for enterprise inquiries.
If you use the UFCE Streaming Kernels or the massive-scale, large-context agent in your research, please cite the project:
```bibtex
@software{UFCE-Streaming_2025,
  author    = {Kyle Killian},
  title     = {The UniField Coupling Equation (UFCE): Zero-Memory Streaming Kernels},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17906337},
  url       = {https://github.com/thoughttimemachinexr/UFCE}
}
```
## Disclaimer
This software is provided "AS IS", without warranty of any kind.
It is experimental research code. Use it entirely at your own risk.