CodeR-RAG is an end-to-end retrieval-augmented system built on top of CodeR, a state-of-the-art code embedding model trained using a massive synthetic dataset (CodeR-Pile). This framework brings high-accuracy code retrieval, multi-language code understanding, and task-aware augmentation directly to LLM-based developer tools.
Powered by CodeR embeddings, optimized for:
- Text2Code (e.g., natural language → code examples)
- Code2Text (summaries, explanations, API usage)
- Code2Code (similar code, translation, refinement)
- Hybrid retrieval (bug fixing, optimization, style, security)
The system includes:
- Document ingestion + code parsing
- Embedding generation (instruction-aware InfoNCE)
- Vector storage (FAISS / LanceDB / Milvus)
- Query routing + retrieval
- Augmentation formatting
- LLM generation modules
CodeR-RAG supports retrieval across 20 programming languages, including:
Python, Java, Go, C++, TypeScript, Rust, C#, PHP…
Includes tools and prompts to extend CodeR-Pile using the DRU principle:
- Diversity (multiple task types)
- Reliability (LLM and GPT-filtered)
- Usability (instruction-aware queries)
Easily swap:
- Embedding models (CodeR or custom)
- LLMs for generation
- Vector DB backends
- Chunking & parsing strategies