Field-Aware BM25F Search Engine

ECS736P/U Information Retrieval — Coursework 2

This repository contains the full implementation of a search engine built over the TREC Disk 4 & 5 corpus, including indexing, retrieval, query expansion, and evaluation.

Overview

The system implements:

BM25F (field-aware ranking) over title and body fields
Positional inverted index supporting phrase and proximity search
Phrase and proximity scoring bonuses
Controlled WordNet query expansion with drift filters
Ablation evaluation on 249 TREC Robust04 topics
Streamlit GUI and command-line interface

Dataset & Reproducibility

Due to licensing restrictions, the full TREC Disk 4 & 5 dataset is not included in this repository. The following are excluded:

TREC Disk 4 & 5 corpus files
Prebuilt full index
Qrels and topic files

A lightweight sample dataset (prebuilt index + evaluation files) is provided separately via a OneDrive link included in the submission materials.

Instructions:

Download the sample dataset from the provided OneDrive link
Place the files into the root directory of this project
Run:

run_sample.bat      # Windows
bash run_sample.sh  # Mac / Linux

Setup Instructions

1. Clone the repository

git clone https://github.com/husaam-atq/BM25F_Search_Engine.git
cd BM25F_Search_Engine

2. Install dependencies

pip install -r requirements.txt

3. Download NLTK resources

python setup_nltk.py

4. Add the downloaded data

After downloading from OneDrive (link in submission materials), place the files inside the project folder.

Folder Structure

Sample Mode (recommended for quick testing)

project-folder/
├── app.py
├── ...
├── sample_index/
├── sample_topics.txt
└── sample_qrels.txt

Full Mode (requires full TREC dataset)

project-folder/
├── app.py
├── ...
├── TREC-Disk-4/
│   └── TREC-Disk-4/
│       ├── FT/
│       ├── FR94/
│       └── CR_103RD/
└── TREC-Disk-5/
    └── TREC-Disk-5/
        ├── FBIS/
        └── LATIMES/

Note: The outer folder name does not matter — the code locates the dataset automatically.

Running the System

Sample Mode (FAST — for markers)

Runs instantly using the prebuilt sample data.

run_sample.bat      # Windows
bash run_sample.sh  # Mac / Linux

Then launch the GUI:

streamlit run app.py

Full Mode (full dataset)

Step 1 — Build the index

python build_index.py

Takes approximately 10–30 minutes. Requires several GB of RAM.

Step 2 — Run the system

streamlit run app.py

or

run_full.bat        # Windows

GUI Usage

The Streamlit interface provides:

Free-text queries
TREC topic picker (auto-fills query from any of the 249 Robust04 topics)
Result cards with source badges and qrels relevance badges (green = relevant)
Query expansion details toggle
On-demand full article loading and plain-text download
Evaluation Results tab with interactive ablation table and bar charts

Command-Line Usage

Interactive mode:

python search.py

Single query:

python search.py "child support enforcement"

Options:

python search.py --top-k 20 "international trade"
python search.py --no-expand "information retrieval"
python search.py --debug "jet aircraft flight"

System Architecture

Offline Pipeline (runs once)

Parsing — parse_docs.py extracts docno, title, and body from all five SGML collections (FT, FR94, CR, FBIS, LA Times) with Latin-1 encoding fallback
Preprocessing — preprocess.py applies lowercasing, tokenisation, stopword removal, and Porter stemming
Indexing — build_index.py builds a positional inverted index using SPIMI (Single-Pass In-Memory Indexing) in 20,000-document chunks with checkpoint/resume support
Statistics — Per-document field lengths and collection-level averages persisted alongside the index
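
As a rough illustration of the indexing step, the sketch below builds a positional index chunk by chunk and then merges the chunks. It is a simplified stand-in, not the code in build_index.py: the function names are hypothetical, chunks stay in memory rather than being written to disk, only one field is indexed, and the chunk size of 2 is chosen purely for the example (the real pipeline uses 20,000 documents per chunk).

```python
from collections import defaultdict


def build_chunk_index(docs):
    """Build a positional index for one chunk: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs:
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def merge_chunks(chunks):
    """Merge per-chunk indexes; document ids are disjoint across chunks."""
    merged = defaultdict(dict)
    for chunk in chunks:
        for term, postings in chunk.items():
            merged[term].update(postings)
    return merged


def spimi_index(docs, chunk_size=2):
    """Index docs in fixed-size chunks, then merge (chunk_size is illustrative)."""
    chunks = [
        build_chunk_index(docs[i:i + chunk_size])
        for i in range(0, len(docs), chunk_size)
    ]
    return merge_chunks(chunks)


# Tokens are assumed to be already preprocessed (lowercased, stemmed).
docs = [
    ("d1", ["child", "support", "enforc"]),
    ("d2", ["support", "child"]),
    ("d3", ["child", "child", "care"]),
]
index = spimi_index(docs)
print(dict(index["child"]))  # {'d1': [0], 'd2': [1], 'd3': [0, 1]}
```

Storing positions rather than bare counts is what makes the phrase and proximity checks possible at query time.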

Online Pipeline (every query)

Query preprocessing (same pipeline as documents)
Optional WordNet expansion with drift filters
BM25F scoring + phrase and proximity bonuses
Ranked results returned via GUI or CLI
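
The optional expansion step can be pictured as a filter over candidate synonyms. The sketch below applies the five drift filters listed under Additional Scoring Signals (nouns only, IDF threshold, DF cap, co-occurrence requirement, max 3 per term). Everything else is an assumption for illustration: the function name, the threshold values, the shape of the inputs, and the order of the checks are hypothetical, and candidates are passed in directly rather than fetched from WordNet.

```python
import math


def filter_expansions(term, candidates, df, n_docs, cooccur,
                      min_idf=2.0, max_df_ratio=0.1, max_per_term=3):
    """Keep candidate synonyms that pass all drift filters.

    candidates: list of (synonym, pos_tag) pairs, e.g. from a WordNet lookup.
    df: synonym -> document frequency in the collection.
    cooccur: set of (term, synonym) pairs that co-occur in the collection.
    All threshold defaults are hypothetical, not the project's settings.
    """
    kept = []
    for synonym, pos in candidates:
        if pos != "n":                                # nouns only
            continue
        freq = df.get(synonym, 0)
        if freq == 0:                                 # not in the collection
            continue
        if math.log(n_docs / freq) < min_idf:         # IDF threshold
            continue
        if freq / n_docs > max_df_ratio:              # DF cap
            continue
        if (term, synonym) not in cooccur:            # co-occurrence requirement
            continue
        kept.append(synonym)
        if len(kept) == max_per_term:                 # max 3 per term
            break
    return kept


candidates = [("aircraft", "n"), ("fly", "v"), ("plane", "n"),
              ("airplane", "n"), ("thing", "n"), ("jetliner", "n"),
              ("craft", "n")]
df = {"aircraft": 40, "plane": 30, "airplane": 20,
      "thing": 600, "jetliner": 5, "craft": 50}
cooccur = {("jet", "aircraft"), ("jet", "plane"), ("jet", "jetliner"),
           ("jet", "craft"), ("jet", "thing")}

expansions = filter_expansions("jet", candidates, df, n_docs=1000, cooccur=cooccur)
print(expansions)  # ['aircraft', 'plane', 'jetliner']
```

Here "fly" fails the noun filter, "airplane" fails the co-occurrence check, "thing" is too common, and "craft" is never reached because the cap of three is hit first.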

Retrieval Model

BM25F Parameters

Parameter	Value
K1	1.2
B_TITLE	0.75
B_BODY	0.75
W_TITLE	5.0
W_BODY	1.0
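
To show how these parameters interact, here is a minimal sketch of one common BM25F formulation: per-field term frequencies are length-normalised, weighted, and summed into a pseudo-frequency, which is then saturated with K1. The function name, the input layout, and the IDF variant are assumptions for illustration; rank.py may differ in detail.

```python
import math

K1 = 1.2
B = {"title": 0.75, "body": 0.75}   # B_TITLE, B_BODY
W = {"title": 5.0, "body": 1.0}     # W_TITLE, W_BODY


def bm25f_score(query_terms, doc_tf, doc_len, avg_len, df, n_docs):
    """Score one document.

    doc_tf: field -> {term: term frequency in that field}
    doc_len / avg_len: field -> length of this document / collection average
    df: term -> document frequency
    """
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field in W:
            tf = doc_tf.get(field, {}).get(term, 0)
            # Per-field length normalisation controlled by B.
            norm = 1.0 - B[field] + B[field] * doc_len[field] / avg_len[field]
            pseudo_tf += W[field] * tf / norm
        if pseudo_tf == 0.0:
            continue
        # Non-negative IDF variant (an assumption, not taken from the project).
        idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * pseudo_tf / (K1 + pseudo_tf)
    return score


avg_len = {"title": 8, "body": 300}
doc_len = {"title": 8, "body": 300}
df = {"trade": 100}

with_title = {"title": {"trade": 1}, "body": {"trade": 2}}
body_only = {"title": {}, "body": {"trade": 2}}

score_a = bm25f_score(["trade"], with_title, doc_len, avg_len, df, n_docs=10000)
score_b = bm25f_score(["trade"], body_only, doc_len, avg_len, df, n_docs=10000)
print(score_a > score_b)  # True: the title match is weighted 5x
```

With W_TITLE = 5.0, a single title occurrence contributes as much to the pseudo-frequency as five body occurrences in a document of average length.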

Additional Scoring Signals

Phrase bonus — +1.5 for each pair of consecutive query terms appearing adjacent in a field
Proximity bonus — up to +0.5 per term pair appearing within an 8-word window
WordNet expansion — synonyms added at weight γ = 0.3 with five drift filters (nouns only, IDF threshold, DF cap, co-occurrence requirement, max 3 per term)
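
The phrase and proximity bonuses can be sketched from positional postings as follows. The constants come from the list above; the rest is assumed for illustration: the function names are hypothetical, only consecutive query term pairs are considered, and the proximity bonus is taken to decay linearly with the gap, which the source does not specify.

```python
PHRASE_BONUS = 1.5
PROX_BONUS = 0.5
WINDOW = 8


def is_adjacent(pos_a, pos_b):
    """True if some occurrence of b directly follows an occurrence of a."""
    set_b = set(pos_b)
    return any(a + 1 in set_b for a in pos_a)


def min_gap(pos_a, pos_b):
    """Smallest absolute distance between any two occurrences."""
    return min(abs(a - b) for a in pos_a for b in pos_b)


def bonuses(query_terms, positions):
    """Return (phrase_bonus, proximity_bonus) for one field of one document.

    positions: term -> sorted list of token positions in the field.
    """
    phrase = prox = 0.0
    for t1, t2 in zip(query_terms, query_terms[1:]):
        if t1 == t2:
            continue  # skip repeated terms to avoid a zero gap
        p1, p2 = positions.get(t1), positions.get(t2)
        if not p1 or not p2:
            continue
        if is_adjacent(p1, p2):
            phrase += PHRASE_BONUS
        gap = min_gap(p1, p2)
        if gap <= WINDOW:
            # Assumed linear decay: full bonus at gap 1, shrinking with distance.
            prox += PROX_BONUS * (1 - (gap - 1) / WINDOW)
    return phrase, prox


positions = {"child": [3, 20], "support": [4], "enforc": [9]}
result = bonuses(["child", "support", "enforc"], positions)
print(result)  # (1.5, 0.75)
```

In the example, "child support" is adjacent (phrase bonus 1.5, proximity 0.5), while "support" and "enforc" are five tokens apart and earn only a partial proximity bonus of 0.25.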

Scoring Formula

Score(d, q) =
    BM25F(original terms)
  + 0.3 × BM25F(expanded terms)
  + phrase_bonus
  + proximity_bonus
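
As a worked example of the formula, the snippet below combines the four components for a single document. The function name and the component values are made up for illustration; only the weight 0.3 comes from the formula above.

```python
GAMMA = 0.3  # weight on the expanded-term BM25F score


def final_score(bm25f_original, bm25f_expanded, phrase_bonus, proximity_bonus,
                gamma=GAMMA):
    """Combine the score components as in the formula above."""
    return bm25f_original + gamma * bm25f_expanded + phrase_bonus + proximity_bonus


# Hypothetical component values: 6.0 + 0.3 * 2.0 + 1.5 + 0.5
score = final_score(6.0, 2.0, 1.5, 0.5)
print(round(score, 2))  # 8.6
```

Because expanded terms are down-weighted, a document matching only synonyms cannot easily outrank one matching the original query terms.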

Evaluation Results

Evaluated over 249 TREC Robust04 topics across six ablation variants.

System	MAP	P@10	nDCG@10	Recall@100	R-Precision
BM25 Flattened (baseline)	0.1832	0.3843	0.3852	0.3777	0.2509
BM25 Separate Fields (unweighted)	0.1603	0.3631	0.3685	0.3418	0.2273
BM25F (field-weighted)	0.1865	0.4012	0.3997	0.3804	0.2528
BM25F + Phrase & Proximity ⭐	0.1961	0.4040	0.4033	0.3938	0.2655
BM25F + Phrase/Prox + WordNet	0.1958	0.4040	0.4014	0.3936	0.2657
BM25F + Phrase/Prox + WordNet + Neural Rerank	0.1795	0.3795	0.3794	0.3936	0.2449

Key findings:

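Phrase and proximity modelling produced the largest single gain among the cumulative steps (MAP 0.1865 → 0.1961 over field-weighted BM25F)
WordNet expansion left every metric essentially unchanged (MAP 0.1961 → 0.1958, Recall@100 0.3938 → 0.3936); it did not improve recall
Neural reranking reduced MAP (0.1958 → 0.1795), likely because truncated document representations limit the cross-encoder signal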
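
For readers unfamiliar with the metrics in the table, the sketch below computes three of them for a single topic. It is not the code in metrics.py: the function names are hypothetical and relevance is treated as binary, which is a simplifying assumption. MAP is the mean of average precision over all topics.

```python
import math


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

ap = average_precision(ranked, relevant)   # (1/2 + 2/4) / 3
p10 = precision_at_k(ranked, relevant)     # 2 / 10
ndcg = ndcg_at_k(ranked, relevant)
print(round(ap, 4), p10, round(ndcg, 4))
```

Note that average precision divides by the total number of relevant documents, so the unretrieved "d9" still lowers the score.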

Key Files

File	Description
`app.py`	Streamlit GUI
`config.py`	Configuration and hyperparameters
`build_index.py`	Index construction (SPIMI)
`parse_docs.py`	SGML document parsing
`preprocess.py`	Text normalisation pipeline
`rank.py`	BM25F scoring + phrase/proximity
`query_expand.py`	WordNet expansion
`search.py`	CLI interface
`evaluate.py`	Evaluation pipeline
`metrics.py`	MAP, P@10, nDCG, Recall, R-Precision
`index_store.py`	Index access layer
`topics_parser.py`	TREC topic parsing
`qrels_parser.py`	Qrels parsing

Authors

Blazej Olszta
Muhamad Husaam Ateeq
Max Monaghan
Sulaiman Bhatti
