MediaWiki Code2Code Search

A high-performance semantic code search engine designed for the MediaWiki ecosystem. It is built on the Qwen3-Embedding-0.6B neural retrieval model and optimized for large-scale codebases such as MediaWiki Core, Extensions, and WMF Operations. Metadata is stored in indexed SQLite for sub-second responses and a low memory footprint (Toolforge compatible).

As featured on Wikimedia Diff.

✨ Key Features

  • 📂 Global MediaWiki Indexing: Covers Core, Extensions, Skins, Libraries, Services, and more (2,400+ unique repos).
  • 🧠 Single-Stage Neural Retrieval: Uses Qwen3-Embedding-0.6B with FAISS IndexIVFPQ for lightning-fast results (approx. 0.3s).
  • 🌳 Granular Structural Filtering: High-precision extraction and filtering of Functions, Types, Template Functions, and Template Types across 10 languages.
  • 🏗️ Split-Build Architecture: Optimized for asymmetric hardware, so heavy extraction can run on a laptop and neural vectorization on a GPU.
  • 🌍 Massive Localization Footprint: Fully localized UI supporting 17 languages.
  • 🎨 Codex UI: A clean, accessible frontend built with Wikimedia's Codex Design System for a native look and feel.
  • 🔍 Advanced Multi-select Filtering: Granular control over results by repository group, programming language, and entry type.
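The multi-select filtering in the feature list can be pictured as one set-membership test per facet. The sketch below is illustrative only: the `Entry` fields, the `matches` and `filter_hits` helpers, and the sample names are hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Entry:
    """Hypothetical search hit; the field names are assumptions."""
    name: str
    repo_group: str   # e.g. "Core", "Extensions", "Skins"
    language: str     # e.g. "PHP", "JavaScript"
    entry_type: str   # e.g. "Function", "Type"


def matches(entry: Entry,
            repo_groups: Optional[Set[str]] = None,
            languages: Optional[Set[str]] = None,
            entry_types: Optional[Set[str]] = None) -> bool:
    """An empty or missing selection means 'no restriction' for that facet."""
    return ((not repo_groups or entry.repo_group in repo_groups)
            and (not languages or entry.language in languages)
            and (not entry_types or entry.entry_type in entry_types))


def filter_hits(hits: Iterable[Entry], **selection) -> List[Entry]:
    """Keep hits that satisfy every active facet, preserving rank order."""
    return [h for h in hits if matches(h, **selection)]


# Made-up sample data, already ordered by relevance.
hits = [
    Entry("Foo::bar", "Core", "PHP", "Function"),
    Entry("Baz", "Core", "JavaScript", "Type"),
    Entry("Ext::hook", "Extensions", "PHP", "Function"),
]
selected = filter_hits(hits, languages={"PHP"}, repo_groups={"Core", "Extensions"})
print([h.name for h in selected])
```

Values within one facet are combined with OR and the facets themselves with AND, which is the usual reading of multi-select filters; the project may combine them differently.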

📂 Project Structure

mediawiki-code2code-search/
├── frontend/                  # Codex-based Static Frontend
│   ├── css/style.css          # Stylesheets using the Codex Design System
│   ├── js/main.js             # Main frontend application logic
│   └── i18n/                  # Localization JSONs supporting 17 languages
├── backend/                   # FAISS Index, SQLite & Vector DB Management
│   ├── generate_embeddings.py # Computes neural embeddings from raw snippets (saves embeddings.npy)
│   ├── build_index.py         # Trains and builds the FAISS search index from saved embeddings
│   ├── migrate_to_sqlite.py   # RAM optimization script (JSON metadata -> SQLite)
│   ├── functions.db           # SQLite metadata store for fast lookups
│   └── mediawiki.index        # Compiled FAISS vector index
├── preprocessing/             # Global-Scale Indexing Pipeline (Phases 1-3)
│   ├── list_repos.py          # Discovers and lists 2,400+ MediaWiki repositories
│   ├── download_repos.py      # Handles shallow clones of target repositories
│   ├── extract_entities.py    # Structural parsing & AST entity extraction
│   ├── archive_to_swh.py      # Software Heritage archiving pipeline scripts
│   └── resolve_swh_hashes.py  # Resolves local Git hashes to SWH SHA1 IDs
├── tests/                     # Parser & API Verification Suite
│   ├── test_api.py            # Backend API endpoint tests
│   ├── test_*_parser.py       # Syntax extraction validations for 10+ languages
│   └── example.*              # Target language snippets parsed during testing
├── scripts/                   # Internal utilities & metadata migration helpers
├── manuscript/                # Academic paper & System documentation (LaTeX)
│   ├── main.tex               # Manuscript source file documenting architecture
│   └── main.pdf               # Compiled system documentation/paper
├── app.py                     # Root FastAPI web application entry point
├── requirements.txt           # Python backend dependencies
└── CITATION.cff               # CITATION file for academic/repository reference

🚀 Scaling & Pipeline

The indexing pipeline is split into phases so that CPU-bound extraction and GPU-bound vectorization can run on different machines.

🛠️ Setup

Backend (Python)

Create and activate a virtual environment (optional but recommended), then install dependencies:

python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate

pip install -r requirements.txt

Frontend (Static Assets)

The frontend is built with vanilla JavaScript and the Codex Design System. It consists of static HTML, CSS, and JS files located in the frontend/ directory. These files are served directly by the FastAPI backend.

There is no compilation step required for the frontend.

Phase 1: Discovery & Mirroring (Local)

First, discover the ecosystem and mirror it for processing:

cd preprocessing
python list_repos.py      # Fetches 2,400+ repo URLs
python download_repos.py  # Shallow clones (approx. 8GB disk space)
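As a rough illustration of the shallow-clone step, the sketch below builds and runs `git clone --depth 1` for a list of repository URLs. How download_repos.py actually invokes Git is not shown here; the function names, the destination layout, and the example URL are all assumptions.

```python
import subprocess
from pathlib import Path
from typing import List


def shallow_clone_cmd(url: str, dest_root: Path) -> List[str]:
    """Build a `git clone --depth 1` command for one repository URL."""
    name = url.rstrip("/").split("/")[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return ["git", "clone", "--depth", "1", "--quiet", url, str(dest_root / name)]


def clone_all(urls: List[str], dest_root: Path) -> None:
    """Clone each repository once, skipping directories that already exist."""
    dest_root.mkdir(parents=True, exist_ok=True)
    for url in urls:
        cmd = shallow_clone_cmd(url, dest_root)
        if Path(cmd[-1]).exists():
            continue  # already mirrored on a previous run
        subprocess.run(cmd, check=True)


# Placeholder URL, not taken from the project's repository list.
print(shallow_clone_cmd("https://example.org/mediawiki/extensions/Foo.git", Path("repos")))
```

A depth-1 clone fetches only the latest commit, which is what keeps the mirror small. Note that this naming scheme would collide if two repositories share the same final path component.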

Phase 2: Archiving (Global)

Ensure all repositories are archived in Software Heritage for on-demand retrieval.

Note

archive_to_swh.py requires a "bulk_save" token. For most users, it is recommended to use:

python archive_individual_to_swh.py

Phase 3: Extraction (Local/CPU)

Perform high-precision structural parsing on your local machine. This captures functions/types with qualified names (e.g., Class::Method) and handles complex language features.

Phase 3a: Structural Extraction

python extract_structural_entities.py
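To make the idea of qualified names concrete, here is a minimal sketch that extracts functions and types from Python source and joins nested scopes with `::`. The technology stack lists Tree-sitter as the parser; this example deliberately swaps in Python's standard `ast` module so that it runs without extra dependencies, and the `extract_entities` function and its output shape are assumptions rather than the project's format.

```python
import ast
from typing import List, Tuple


def extract_entities(source: str) -> List[Tuple[str, str, int]]:
    """Return (kind, qualified_name, line) for each class and function.

    Nested scopes are joined with '::' to mirror the Class::Method style.
    """
    entities: List[Tuple[str, str, int]] = []

    def visit(node: ast.AST, scope: List[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                kind = "type"
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "function"
            else:
                # Keep descending, e.g. into if/try blocks that hold definitions.
                visit(child, scope)
                continue
            entities.append((kind, "::".join(scope + [child.name]), child.lineno))
            visit(child, scope + [child.name])

    visit(ast.parse(source), [])
    return entities


sample = '''
class Parser:
    def parse(self, text):
        def helper():
            pass
        return text

def main():
    pass
'''
for entity in extract_entities(sample):
    print(entity)
```

Carrying the enclosing scope down the tree is what distinguishes `Parser::parse` from a top-level `parse` in the same repository.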

Phase 3b: Identity Resolution

Resolve Git-compatible hashes to standard SHA1. You can do this either locally (fast) or via the Software Heritage API (official):

  • Option A: Local Resolution (Recommended)
    python resolve_swh_hashes_local.py
  • Option B: API-based Resolution
    python resolve_swh_hashes.py
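The difference between the two hash kinds can be shown in a few lines. This sketch assumes that "Git-compatible hash" means the Git blob ID (the SHA-1 of a `blob <size>` header plus the content, which Software Heritage calls `sha1_git`) and that "standard SHA1" means the SHA-1 of the raw bytes alone. That reading is an interpretation of the text above, and the function names are hypothetical, not those of the resolver scripts.

```python
import hashlib
from typing import Dict


def git_blob_sha1(data: bytes) -> str:
    """Git blob ID: SHA-1 over the header 'blob <size>' + NUL, then the content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def plain_sha1(data: bytes) -> str:
    """Standard SHA-1 of the raw content, with no Git header."""
    return hashlib.sha1(data).hexdigest()


def resolve_locally(data: bytes) -> Dict[str, str]:
    """Compute both identifiers from bytes that are already on disk."""
    return {"sha1_git": git_blob_sha1(data), "sha1": plain_sha1(data)}


print(resolve_locally(b"<?php echo 'hello';\n"))
```

Because both digests are derived from the same bytes, a local mirror is enough to map one to the other without any network calls, which is presumably why the local option is the faster one.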

Phase 4: Indexing (Remote/GPU)

Move raw_functions.json to a GPU-equipped environment to compute neural vectors and build the FAISS index.

cd backend
python generate_embeddings.py  # Computes and saves embeddings to embeddings.npy
python build_index.py          # Trains and builds FAISS index from embeddings.npy
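A back-of-the-envelope calculation shows why a product-quantized index (IndexIVFPQ) matters for a memory-constrained deployment. Only the 1024 dimensions come from this README; the vector count and the PQ parameters below are hypothetical values chosen for illustration, not the project's settings.

```python
DIM = 1024             # vector dimensionality, from the technology stack
N_VECTORS = 1_000_000  # hypothetical number of indexed entities
PQ_M = 64              # hypothetical number of PQ sub-quantizers
PQ_NBITS = 8           # hypothetical bits per sub-quantizer code


def gib(n_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


raw_bytes = N_VECTORS * DIM * 4                # uncompressed float32 vectors
code_bytes = N_VECTORS * PQ_M * PQ_NBITS // 8  # PQ codes stored per vector
id_bytes = N_VECTORS * 8                       # one 64-bit ID per stored vector

print(f"raw float32 vectors: {gib(raw_bytes):.2f} GiB")
print(f"PQ codes + IDs:      {gib(code_bytes + id_bytes):.3f} GiB")
print(f"code compression:    {raw_bytes // code_bytes}x")
```

Under these assumptions the raw vectors alone would take several GiB, while the compressed codes fit in tens of MiB. The coarse centroids and PQ codebooks add a small fixed overhead that is ignored here.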

Phase 5: Memory Optimization & Deployment (Local/Toolforge)

Before deploying, convert the production metadata from JSON to SQLite so the service stays within the 6 GiB RAM limit:

cd backend
python migrate_to_sqlite.py
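The idea behind the migration is that metadata rows are fetched from disk by ID on demand instead of holding one large JSON document in memory. The following is a minimal sketch of that pattern; the table name, columns, and helper functions are hypothetical and do not reflect the actual schema of functions.db.

```python
import json
import sqlite3
from typing import Dict, List, Tuple


def migrate(records: List[Dict], conn: sqlite3.Connection) -> None:
    """Load JSON-style records into an indexed SQLite table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS functions ("
        " id INTEGER PRIMARY KEY,"  # assumed to match the vector's index position
        " name TEXT, repo TEXT, language TEXT, code TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO functions VALUES (:id, :name, :repo, :language, :code)",
        records,
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_language ON functions(language)")
    conn.commit()


def lookup(conn: sqlite3.Connection, ids: List[int]) -> List[Tuple]:
    """Fetch metadata for vector-search hits, keeping the ranking order."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, name, repo FROM functions WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]


# Made-up records standing in for the extracted metadata.
raw_json = json.dumps([
    {"id": 0, "name": "Foo::bar", "repo": "example/core", "language": "PHP", "code": "..."},
    {"id": 1, "name": "baz", "repo": "example/ext", "language": "JavaScript", "code": "..."},
])
conn = sqlite3.connect(":memory:")
migrate(json.loads(raw_json), conn)
print(lookup(conn, [1, 0]))
```

Only the rows for the current result page are ever materialized in Python, so memory use stays roughly constant regardless of how many entities are indexed.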

Once the index and database are ready, start the FastAPI backend from the root directory:

# From the project root
uvicorn app:app --host 0.0.0.0 --port 8000

The server will be available at http://localhost:8000. You can access the automatic API documentation at http://localhost:8000/docs.


🚀 Deployment (Toolforge)

Follow these steps to deploy the application on Wikimedia Toolforge.

Note

The examples below use supnabla as the username and code2codesearch as the project name. Replace these with your own Toolforge credentials where applicable.

1. Upload Assets

Since the model weights and indexes are large, they should be uploaded from your local machine to the Toolforge project data directory:

# From the project root
scp -rp "./models" supnabla@login.toolforge.org:/data/project/code2codesearch/
scp -rp "./backend/mediawiki.index" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
scp -rp "./backend/functions.db" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/

2. Configure Permissions

Log into Toolforge and set the necessary permissions:

ssh supnabla@login.toolforge.org

chmod -R a+rX /data/project/code2codesearch/models/
chmod a+r /data/project/code2codesearch/backend/functions.db
chmod a+r /data/project/code2codesearch/backend/mediawiki.index
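Before starting the webservice, it can be useful to confirm that the uploaded files really are world-readable. The sketch below is an optional helper and not part of the project: the `not_world_readable` function is hypothetical, and it checks only the read bit on the listed files, not the execute bit that `a+rX` sets on directories.

```python
import stat
from pathlib import Path
from typing import List


def not_world_readable(paths: List[Path]) -> List[Path]:
    """Return paths that are missing or lack the 'other' read bit (chmod a+r)."""
    problems = []
    for path in paths:
        try:
            mode = path.stat().st_mode
        except FileNotFoundError:
            problems.append(path)
            continue
        if not mode & stat.S_IROTH:
            problems.append(path)
    return problems


# Relative paths used for illustration; adjust to the deployed data directory.
assets = [Path("backend/functions.db"), Path("backend/mediawiki.index")]
for path in not_world_readable(assets):
    print(f"missing or not world-readable: {path}")
```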

3. Deploy

Now you are ready to deploy the webservice:

# Switch to the code2codesearch project
become code2codesearch

# Stop and clean existing build
toolforge webservice buildservice stop --mount=all
toolforge build clean -y

# Start build from repository
toolforge build start https://github.com/ftosoni/mediawiki-code2code-search

# Start webservice with 6GiB RAM
toolforge webservice buildservice start --mount=all -m 6Gi

# Monitor logs
toolforge webservice logs -f

🛠️ Technology Stack & Project Status

  • CI Status, License, Code Style: PEP8, SWH Origin, SWH Directory
  • Codex, JavaScript
  • FastAPI, Python 3.11+, Uvicorn
  • FAISS, Vector indexes (1024d), SQLite
  • Qwen3 Embedding 0.6B, Tree-sitter, Software Heritage
  • Toolforge, GitHub Actions, pytest

📄 Licence

Licensed under the Apache License 2.0. Created for advanced code-to-code retrieval within the Wikimedia developer ecosystem.

About

MediaWiki Code2Code Search is a high-performance semantic search tool designed specifically for the MediaWiki open-source ecosystem, integrated with the Software Heritage archive. It utilises a single-stage neural retrieval architecture to help developers navigate complex codebases with high precision and minimal resource usage.
