MediaWiki Code2Code Search

A semantic code search engine for the MediaWiki ecosystem. It is built on the Qwen3-Embedding-0.6B neural retrieval model and optimized for large codebases such as MediaWiki Core, Extensions, and WMF Operations. Metadata is stored in indexed SQLite, which keeps responses sub-second and the memory footprint low (Toolforge compatible).

As featured on Wikimedia Diff.

✨ Key Features

📂 Global MediaWiki Indexing: Covers Core, Extensions, Skins, Libraries, Services, and more (2,400+ unique repos).
🧠 Single-Stage Neural Retrieval: Uses Qwen3-Embedding-0.6B with a FAISS IndexIVFPQ index and returns results in approx. 0.3s.
🌳 Granular Structural Filtering: High-precision extraction and filtering of Functions, Types, Template Functions, and Template Types across 10 languages.
🏗️ Split-Build Architecture: Optimized for asymmetric hardware: run heavy extraction on a laptop and neural vectorization on a GPU.
🌍 Massive Localization Footprint: Fully localized UI supporting 17 languages.
🎨 Codex UI: A clean, accessible frontend built with Wikimedia's Codex Design System for a native look and feel.
🔍 Advanced Multi-select Filtering: Granular control over results by repository group, programming language, and entry type.

📂 Project Structure

mediawiki-code2code-search/
├── frontend/                  # Codex-based Static Frontend
│   ├── css/style.css          # Stylesheets using the Codex Design System
│   ├── js/main.js             # Main frontend application logic
│   └── i18n/                  # Localization JSONs supporting 17 languages
├── backend/                   # FAISS Index, SQLite & Vector DB Management
│   ├── generate_embeddings.py # Computes neural embeddings from raw snippets (saves embeddings.npy)
│   ├── build_index.py         # Trains and builds the FAISS search index from saved embeddings
│   ├── migrate_to_sqlite.py   # RAM optimization script (JSON metadata -> SQLite)
│   ├── functions.db           # SQLite metadata store for fast lookups
│   └── mediawiki.index        # Compiled FAISS vector index
├── preprocessing/             # Global-Scale Indexing Pipeline (Phases 1-3)
│   ├── list_repos.py          # Discovers and lists 2,400+ MediaWiki repositories
│   ├── download_repos.py      # Handles shallow clones of target repositories
│   ├── extract_entities.py    # Structural parsing & AST entity extraction
│   ├── archive_to_swh.py      # Software Heritage archiving pipeline scripts
│   └── resolve_swh_hashes.py  # Resolves local Git hashes to SWH SHA1 IDs
├── tests/                     # Parser & API Verification Suite
│   ├── test_api.py            # Backend API endpoint tests
│   ├── test_*_parser.py       # Syntax extraction validations for 10+ languages
│   └── example.*              # Target language snippets parsed during testing
├── scripts/                   # Internal utilities & metadata migration helpers
├── manuscript/                # Academic paper & System documentation (LaTeX)
│   ├── main.tex               # Manuscript source file documenting architecture
│   └── main.pdf               # Compiled system documentation/paper
├── app.py                     # Root FastAPI web application entry point
├── requirements.txt           # Python backend dependencies
└── CITATION.cff               # CITATION file for academic/repository reference

🚀 Scaling & Pipeline

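The indexing pipeline is split into phases so that the build can be distributed across machines: discovery, archiving, and extraction run on a local CPU machine, embedding and indexing run on a GPU, and the finished index is deployed locally or on Toolforge.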

🛠️ Setup

Backend (Python)

Create and activate a virtual environment (optional but recommended), then install dependencies:

python -m venv venv
# Windows:
.\venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate

pip install -r requirements.txt

Frontend (Static Assets)

The frontend is built with vanilla JavaScript and the Codex Design System. It consists of static HTML, CSS, and JS files located in the frontend/ directory. These files are served directly by the FastAPI backend.

There is no compilation step required for the frontend.

Phase 1: Discovery & Mirroring (Local)

First, discover the ecosystem and mirror it for processing:

cd preprocessing
python list_repos.py      # Fetches 2,400+ repo URLs
python download_repos.py  # Shallow clones (approx. 8GB disk space)
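
The mirroring step can be sketched as below. This is a minimal illustration, not the contents of download_repos.py: the helper names `shallow_clone_command` and `clone_all`, the `repos` target directory, and the example URL are all hypothetical.

```python
import subprocess
from pathlib import Path


def shallow_clone_command(url: str, dest_root: str = "repos") -> list[str]:
    """Build a `git clone --depth 1` command for one repository URL."""
    # Derive the directory name from the last URL segment (hypothetical layout).
    name = url.rstrip("/").split("/")[-1].removesuffix(".git")
    return ["git", "clone", "--depth", "1", url, str(Path(dest_root) / name)]


def clone_all(urls: list[str], dest_root: str = "repos") -> None:
    """Clone every URL, skipping repositories that already exist locally."""
    for url in urls:
        cmd = shallow_clone_command(url, dest_root)
        if Path(cmd[-1]).exists():
            continue
        # check=False so that one failing repository does not abort the run.
        subprocess.run(cmd, check=False)
```

A depth of 1 fetches only the latest commit of each repository, which is what keeps the mirror small.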

Phase 2: Archiving (Global)

Ensure all repositories are archived in Software Heritage for on-demand retrieval.

Note

archive_to_swh.py requires a "bulk_save" token. For most users, it is recommended to use:

python archive_individual_to_swh.py

Phase 3: Extraction (Local/CPU)

Perform high-precision structural parsing on your local machine. This captures functions/types with qualified names (e.g., Class::Method) and handles complex language features.

Phase 3a: Structural Extraction

python extract_structural_entities.py
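
To illustrate what "qualified names" means here, the sketch below extracts functions and types from Python source using the standard `ast` module. It is a stand-in for illustration only: the project's own parsers are not shown in this README, and the `extract_entities` helper and its output fields are assumptions.

```python
import ast


def extract_entities(source: str) -> list[dict]:
    """Return functions and classes with `::`-qualified names."""
    entities = []

    def visit(node: ast.AST, scope: list[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "function"
            elif isinstance(child, ast.ClassDef):
                kind = "type"
            else:
                # Not a definition itself, but definitions may be nested inside.
                visit(child, scope)
                continue
            qualified = "::".join(scope + [child.name])
            entities.append({"name": qualified, "kind": kind, "line": child.lineno})
            # Descend with the extended scope so members get a qualified name.
            visit(child, scope + [child.name])

    visit(ast.parse(source), [])
    return entities
```

For a class `Parser` with a method `parse`, this yields the entries `Parser` (a type) and `Parser::parse` (a function).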

Phase 3b: Identity Resolution

Resolve Git-compatible hashes to standard SHA1. You can do this either locally (fast) or via the Software Heritage API (official):

Option A: Local Resolution (Recommended)
```
python resolve_swh_hashes_local.py
```
Option B: API-based Resolution
```
python resolve_swh_hashes.py
```
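
The difference between the two hash kinds can be sketched as follows, assuming that "Git-compatible hash" refers to the Git blob identifier (the value Software Heritage calls `sha1_git`). The function names are hypothetical and are not taken from the resolution scripts.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Git-compatible hash: SHA1 over a 'blob <size>' header, a NUL byte, then the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def plain_sha1(content: bytes) -> str:
    """Standard SHA1 over the raw content only."""
    return hashlib.sha1(content).hexdigest()
```

Because Git hashes a header together with the content, one value cannot be derived from the other. Resolving it needs either the file content itself (the local route) or a lookup service (the API route).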

Phase 4: Indexing (Remote/GPU)

Move raw_functions.json to a GPU-equipped environment to compute neural vectors and build the FAISS index.

cd backend
python generate_embeddings.py  # Computes and saves embeddings to embeddings.npy
python build_index.py          # Trains and builds FAISS index from embeddings.npy
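
To make the search step concrete, here is an exhaustive cosine-similarity lookup in plain Python. It is a deliberately simplified stand-in: a FAISS IndexIVFPQ index approximates this kind of nearest-neighbour search by clustering and compressing the vectors, and neither FAISS nor the embedding model is used here. The function names and toy vectors are hypothetical.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query: list[float], vectors: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    q = normalize(query)
    scores = []
    for idx, vec in enumerate(vectors):
        v = normalize(vec)
        # On unit vectors the dot product equals the cosine similarity.
        scores.append((sum(a * b for a, b in zip(q, v)), idx))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scores[:k]]
```

The returned indices are positions in the embedding array. In a setup like the one described here, they would then be used to look up the matching metadata records.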

Phase 5: Memory Optimization & Deployment (Local/Toolforge)

Before deploying, convert the production metadata to SQLite so the service stays within the 6 GiB RAM limit:

cd backend
python migrate_to_sqlite.py
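
A JSON-to-SQLite migration of this kind might look roughly like the sketch below. The table name, columns, and helper functions are assumptions made for illustration. The actual schema of functions.db is not documented in this README.

```python
import json
import sqlite3


def load_records(path: str) -> list[dict]:
    """Read the JSON metadata file into a list of dictionaries."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def migrate(records: list[dict], conn: sqlite3.Connection) -> None:
    """Copy metadata records into an indexed SQLite table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS functions ("
        "id INTEGER PRIMARY KEY, name TEXT, repo TEXT, language TEXT, code TEXT)"
    )
    conn.executemany(
        "INSERT INTO functions (id, name, repo, language, code) "
        "VALUES (:id, :name, :repo, :language, :code)",
        records,
    )
    # Index the filter columns so lookups avoid full table scans.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_repo ON functions (repo)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_language ON functions (language)")
    conn.commit()


def lookup(conn: sqlite3.Connection, row_id: int):
    """Fetch one record by id, or None if it does not exist."""
    return conn.execute(
        "SELECT name, repo, language FROM functions WHERE id = ?", (row_id,)
    ).fetchone()
```

The memory saving comes from reading single rows from disk on demand instead of keeping the whole JSON document loaded in the web process.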

Once the index and database are ready, start the FastAPI backend from the root directory:

# From the project root
uvicorn app:app --host 0.0.0.0 --port 8000

The server will be available at http://localhost:8000. You can access the automatic API documentation at http://localhost:8000/docs.

🚀 Deployment (Toolforge)

Follow these steps to deploy the application on Wikimedia Toolforge.

Note

The examples below use supnabla as the username and code2codesearch as the project name. Replace these with your own Toolforge username and tool name where applicable.

1. Upload Assets

Since the model weights and indexes are large, they should be uploaded from your local machine to the Toolforge project data directory:

# From the project root
scp -rp "./models" supnabla@login.toolforge.org:/data/project/code2codesearch/
scp -rp "./backend/mediawiki.index" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/
scp -rp "./backend/functions.db" supnabla@login.toolforge.org:/data/project/code2codesearch/backend/

2. Configure Permissions

Log into Toolforge and set the necessary permissions:

ssh supnabla@login.toolforge.org

chmod -R a+rX /data/project/code2codesearch/models/
chmod a+r /data/project/code2codesearch/backend/functions.db
chmod a+r /data/project/code2codesearch/backend/mediawiki.index

3. Deploy

Now you are ready to deploy the webservice:

# Switch to the code2codesearch project
become code2codesearch

# Stop and clean existing build
toolforge webservice buildservice stop --mount=all
toolforge build clean -y

# Start build from repository
toolforge build start https://github.com/ftosoni/mediawiki-code2code-search

# Start webservice with 6GiB RAM
toolforge webservice buildservice start --mount=all -m 6Gi

# Monitor logs
toolforge webservice logs -f

🛠️ Technology Stack & Project Status

📄 Licence

Apache 2.0 License. Created for advanced code-to-code retrieval within the Wikimedia developer ecosystem.
