ECS736P/U Information Retrieval — Coursework 2
This repository contains the full implementation of a search engine built over the TREC Disk 4 & 5 corpus, including indexing, retrieval, query expansion, and evaluation.
The system implements:
- BM25F (field-aware ranking) over title and body fields
- Positional inverted index supporting phrase and proximity search
- Phrase and proximity scoring bonuses
- Controlled WordNet query expansion with drift filters
- Ablation evaluation on 249 TREC Robust04 topics
- Streamlit GUI and command-line interface
Due to licensing restrictions, the full TREC Disk 4 & 5 dataset is not included in this repository. The following are excluded:
- TREC Disk 4 & 5 corpus files
- Prebuilt full index
- Qrels and topic files
A lightweight sample dataset (prebuilt index + evaluation files) is provided separately via a OneDrive link included in the submission materials.
Instructions:
- Download the sample dataset from the provided OneDrive link
- Place the files into the root directory of this project
- Run:
run_sample.bat # Windows
bash run_sample.sh # Mac / Linux
git clone https://github.com/husaam-atq/BM25F_Search_Engine.git
cd BM25F_Search_Engine
pip install -r requirements.txt
python setup_nltk.py
After downloading from OneDrive (link in submission materials), place the files inside the project folder.
project-folder/
├── app.py
├── ...
├── sample_index/
├── sample_topics.txt
└── sample_qrels.txt
project-folder/
├── app.py
├── ...
├── TREC-Disk-4/
│   └── TREC-Disk-4/
│       ├── FT/
│       ├── FR94/
│       └── CR_103RD/
└── TREC-Disk-5/
    └── TREC-Disk-5/
        ├── FBIS/
        └── LATIMES/
Note: The outer folder name does not matter — the code locates the dataset automatically.
Runs instantly using the prebuilt sample data.
run_sample.bat # Windows
bash run_sample.sh # Mac / Linux
Then launch the GUI:
streamlit run app.py
Step 1: Build the index
python build_index.py
Takes approximately 10–30 minutes. Requires several GB of RAM.
Step 2: Run the system
streamlit run app.py
or
run_full.bat # Windows
The Streamlit interface provides:
- Free-text queries
- TREC topic picker (auto-fills query from any of the 249 Robust04 topics)
- Result cards with source badges and qrels relevance badges (green = relevant)
- Query expansion details toggle
- On-demand full article loading and plain-text download
- Evaluation Results tab with interactive ablation table and bar charts
Interactive mode:
python search.py
Single query:
python search.py "child support enforcement"
Options:
python search.py --top-k 20 "international trade"
python search.py --no-expand "information retrieval"
python search.py --debug "jet aircraft flight"
- Parsing: `parse_docs.py` extracts `docno`, `title`, and `body` from all five SGML collections (FT, FR94, CR, FBIS, LA Times) with Latin-1 encoding fallback
- Preprocessing: `preprocess.py` applies lowercasing, tokenisation, stopword removal, and Porter stemming
- Indexing: `build_index.py` builds a positional inverted index using SPIMI (Single-Pass In-Memory Indexing) in 20,000-document chunks with checkpoint/resume support
- Statistics: per-document field lengths and collection-level averages are persisted alongside the index
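The indexing steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's `build_index.py`: the stopword list, the function names, and the chunk size of 2 are assumptions, Porter stemming is left out to keep the example dependency-free, and chunks are merged in memory rather than written to disk as a full SPIMI build would do.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption; the real list is not shown here).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords (no stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_chunk_index(docs):
    """Index one chunk as term -> {docno: {field: [positions]}}.

    Positions are counted after stopword removal.
    """
    index = defaultdict(dict)
    for docno, fields in docs:
        for field, text in fields.items():
            for pos, term in enumerate(tokenize(text)):
                index[term].setdefault(docno, {}).setdefault(field, []).append(pos)
    return index


def merge_chunks(chunks):
    """Merge per-chunk indexes; each docno lives in exactly one chunk."""
    merged = defaultdict(dict)
    for chunk in chunks:
        for term, postings in chunk.items():
            merged[term].update(postings)
    return dict(merged)


def build_index(docs, chunk_size=2):
    """SPIMI-style build: index fixed-size chunks independently, then merge."""
    chunks = [
        build_chunk_index(docs[i:i + chunk_size])
        for i in range(0, len(docs), chunk_size)
    ]
    return merge_chunks(chunks)


docs = [
    ("D1", {"title": "Jet aircraft", "body": "The jet aircraft flight was delayed"}),
    ("D2", {"title": "Trade news", "body": "International trade and aircraft exports"}),
    ("D3", {"title": "Flight safety", "body": "Safety of flight in jet aircraft"}),
]
index = build_index(docs)
print(index["aircraft"])
```

Storing positions per field is what lets the same structure serve both field-aware scoring and phrase or proximity matching.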
- Query preprocessing (same pipeline as documents)
- Optional WordNet expansion with drift filters
- BM25F scoring + phrase and proximity bonuses
- Ranked results returned via GUI or CLI
| Parameter | Value |
|---|---|
| K1 | 1.2 |
| B_TITLE | 0.75 |
| B_BODY | 0.75 |
| W_TITLE | 5.0 |
| W_BODY | 1.0 |
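As a rough illustration of how these parameters interact, the sketch below scores a single query term with the standard BM25F formulation: per-field term frequencies are length-normalised, weighted, summed into one pseudo-frequency, and then saturated with K1. The function name, the argument layout, and the IDF variant are assumptions, and the code in `rank.py` may differ.

```python
import math

K1 = 1.2
B = {"title": 0.75, "body": 0.75}
W = {"title": 5.0, "body": 1.0}


def bm25f_term(tf, doc_len, avg_len, df, n_docs):
    """Score one query term for one document.

    tf, doc_len and avg_len are dicts keyed by field name.
    df is the term's document frequency, n_docs the collection size.
    """
    pseudo_tf = 0.0
    for field, weight in W.items():
        if avg_len[field] == 0:
            continue
        # Per-field length normalisation controlled by B_TITLE / B_BODY.
        norm = 1.0 - B[field] + B[field] * doc_len[field] / avg_len[field]
        pseudo_tf += weight * tf.get(field, 0) / norm
    # Assumed IDF variant: the "+1 inside the log" form, which is never negative.
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    # Saturation is applied once, after combining the fields.
    return idf * pseudo_tf / (K1 + pseudo_tf)


lengths = {"title": 8, "body": 400}
title_hit = bm25f_term({"title": 1}, lengths, lengths, df=10, n_docs=1000)
body_hit = bm25f_term({"body": 1}, lengths, lengths, df=10, n_docs=1000)
print(title_hit > body_hit)
```

With W_TITLE = 5.0, one title occurrence contributes as much pseudo-frequency as five body occurrences in a document of average length, but saturation keeps the final score from scaling linearly.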
- Phrase bonus — +1.5 for each pair of consecutive query terms appearing adjacent in a field
- Proximity bonus — up to +0.5 per term pair appearing within an 8-word window
- WordNet expansion — synonyms added at weight γ = 0.3 with five drift filters (nouns only, IDF threshold, DF cap, co-occurrence requirement, max 3 per term)
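The phrase and proximity bonuses could be computed from positional postings along these lines. This is a sketch under stated assumptions: the proximity bonus is applied to consecutive query-term pairs only, the window is read as a position gap of at most 8, and the decay is linear. The text says only "up to +0.5", so the exact shape is a guess, and all function names are hypothetical.

```python
PHRASE_BONUS = 1.5
PROX_BONUS = 0.5
WINDOW = 8


def phrase_bonus(query_terms, positions):
    """positions maps term -> sorted positions in one field of one document."""
    bonus = 0.0
    for first, second in zip(query_terms, query_terms[1:]):
        second_pos = set(positions.get(second, []))
        # Adjacent means the second term directly follows the first.
        if any(p + 1 in second_pos for p in positions.get(first, [])):
            bonus += PHRASE_BONUS
    return bonus


def min_distance(pos_a, pos_b):
    """Smallest absolute gap between two sorted position lists, or None."""
    i = j = 0
    best = None
    while i < len(pos_a) and j < len(pos_b):
        gap = abs(pos_a[i] - pos_b[j])
        best = gap if best is None else min(best, gap)
        # Advance the pointer that sits at the smaller position.
        if pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_bonus(query_terms, positions):
    bonus = 0.0
    for first, second in zip(query_terms, query_terms[1:]):
        gap = min_distance(positions.get(first, []), positions.get(second, []))
        if gap is not None and 0 < gap <= WINDOW:
            # Assumed linear decay: gap 1 earns the full bonus, gap 8 earns 1/8.
            bonus += PROX_BONUS * (WINDOW - gap + 1) / WINDOW
    return bonus


positions = {"jet": [0, 10], "aircraft": [1], "flight": [5]}
query = ["jet", "aircraft", "flight"]
print(phrase_bonus(query, positions), proximity_bonus(query, positions))
```

In this toy document "jet aircraft" is adjacent, so it earns the phrase bonus, while "aircraft" and "flight" are four positions apart and earn only a partial proximity bonus.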
Score(d, q) =
BM25F(original terms)
+ 0.3 × BM25F(expanded terms)
+ phrase_bonus
+ proximity_bonus
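To make the expansion term in this formula concrete, here is a hedged sketch of how candidate synonyms might be filtered before they are scored at weight 0.3. The threshold values, the function names, and the decision to apply the IDF threshold to the original term and the DF cap to each synonym are all assumptions, since the text lists the five filters without their settings. Candidates are passed in as plain `(word, pos)` pairs so the example runs without WordNet installed.

```python
import math

GAMMA = 0.3          # expansion weight stated in the text
MAX_PER_TERM = 3     # cap stated in the text
MIN_IDF = 2.0        # hypothetical IDF threshold
MAX_DF_RATIO = 0.1   # hypothetical DF cap: at most 10% of documents


def filter_synonyms(term, candidates, df, n_docs, cooccurs):
    """Return up to MAX_PER_TERM synonyms that pass the drift filters.

    candidates: list of (synonym, pos) pairs, pos being "n", "v", ...
    df: word -> document frequency; cooccurs(a, b) -> bool.
    """
    term_df = df.get(term, 0)
    # IDF threshold (assumed to gate the original term): skip common terms.
    if term_df == 0 or math.log(n_docs / term_df) < MIN_IDF:
        return []
    kept = []
    for synonym, pos in candidates:
        if len(kept) == MAX_PER_TERM:              # max 3 per term
            break
        if synonym == term or pos != "n":          # nouns only
            continue
        syn_df = df.get(synonym, 0)
        if syn_df == 0 or syn_df > MAX_DF_RATIO * n_docs:   # DF cap
            continue
        if not cooccurs(term, synonym):            # co-occurrence requirement
            continue
        kept.append(synonym)
    return kept


def combined_score(bm25f_original, bm25f_expanded, phrase, proximity):
    """Final score as written in the formula above."""
    return bm25f_original + GAMMA * bm25f_expanded + phrase + proximity


df = {"aircraft": 40, "plane": 60, "airplane": 30, "craft": 400,
      "fly": 50, "aeroplane": 5, "airliner": 8, "jet": 20}
candidates = [("plane", "n"), ("fly", "v"), ("craft", "n"), ("airplane", "n"),
              ("aeroplane", "n"), ("airliner", "n"), ("jet", "n")]


def cooccurs(a, b):
    # Toy stand-in for a real co-occurrence check against the index.
    return b != "aeroplane"


expanded = filter_synonyms("aircraft", candidates, df, 1000, cooccurs)
print(expanded)
```

Here "fly" is rejected as a verb, "craft" exceeds the DF cap, "aeroplane" fails the co-occurrence check, and "jet" is cut off by the per-term limit.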
Evaluated over 249 TREC Robust04 topics across six ablation variants.
| System | MAP | P@10 | nDCG@10 | Recall@100 | R-Precision |
|---|---|---|---|---|---|
| BM25 Flattened (baseline) | 0.1832 | 0.3843 | 0.3852 | 0.3777 | 0.2509 |
| BM25 Separate Fields (unweighted) | 0.1603 | 0.3631 | 0.3685 | 0.3418 | 0.2273 |
| BM25F (field-weighted) | 0.1865 | 0.4012 | 0.3997 | 0.3804 | 0.2528 |
| BM25F + Phrase & Proximity ⭐ | 0.1961 | 0.4040 | 0.4033 | 0.3938 | 0.2655 |
| BM25F + Phrase/Prox + WordNet | 0.1958 | 0.4040 | 0.4014 | 0.3936 | 0.2657 |
| BM25F + Phrase/Prox + WordNet + Neural Rerank | 0.1795 | 0.3795 | 0.3794 | 0.3936 | 0.2449 |
Key findings:
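- Phrase and proximity modelling produced the largest gain of any component added on top of field-weighted BM25F, raising MAP from 0.1865 to 0.1961 and Recall@100 from 0.3804 to 0.3938
- WordNet expansion had a negligible effect on every reported metric (MAP 0.1958 vs 0.1961, Recall@100 0.3936 vs 0.3938)
- Neural reranking lowered MAP from 0.1958 to 0.1795, likely because truncated document representations limited the cross-encoder signal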
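For readers unfamiliar with the metrics in the results table, the sketch below shows the textbook per-topic definitions. MAP is the mean of average precision over all topics. The function names and the toy ranking are assumptions, and `metrics.py` may handle edge cases such as graded relevance or ties differently.

```python
import math


def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k=10):
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k=100):
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, k=r) if r else 0.0


def ndcg_at_k(ranked, gains, k=10):
    """gains maps docno -> relevance grade (binary grades also work)."""
    dcg = sum(
        gains.get(d, 0) / math.log2(i + 1)
        for i, d in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["D1", "D2", "D3", "D4", "D5"]
relevant = {"D1", "D3", "D6"}
gains = {d: 1 for d in relevant}

ap = average_precision(ranked, relevant)
p5 = precision_at_k(ranked, relevant, k=5)
ndcg5 = ndcg_at_k(ranked, gains, k=5)
print(ap, p5, ndcg5)
```

Because D6 is never retrieved, it lowers average precision and recall without affecting P@5, which illustrates why a change can move one metric in the table while leaving another flat.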
| File | Description |
|---|---|
| `app.py` | Streamlit GUI |
| `config.py` | Configuration and hyperparameters |
| `build_index.py` | Index construction (SPIMI) |
| `parse_docs.py` | SGML document parsing |
| `preprocess.py` | Text normalisation pipeline |
| `rank.py` | BM25F scoring + phrase/proximity |
| `query_expand.py` | WordNet expansion |
| `search.py` | CLI interface |
| `evaluate.py` | Evaluation pipeline |
| `metrics.py` | MAP, P@10, nDCG, Recall, R-Precision |
| `index_store.py` | Index access layer |
| `topics_parser.py` | TREC topic parsing |
| `qrels_parser.py` | Qrels parsing |
- Blazej Olszta
- Muhamad Husaam Ateeq
- Max Monaghan
- Sulaiman Bhatti