CS253 Assignment 1
| Name | Ronit Kumar |
| Roll No | 230875 |
| ronitk23@iitk.ac.in |
A command-line C++ tool that builds a word-level index over large text files using a fixed-size buffer (256–1024 KB), ensuring constant I/O memory usage regardless of file size. It supports versioned files and three query types: single word frequency (word), top-K most frequent words (top), and frequency difference between two versions (diff). Missing arguments are interactively prompted at runtime.
- Fixed-size buffered file reading (256–1024 KB) — memory does not grow with file size
- Case-insensitive word indexing
- Support for multiple file versions
- Word count, Top-K, and Diff queries
- Interactive mode — prompts for any missing arguments
- Object-oriented C++ design with inheritance, templates, and polymorphism
git clone https://github.com/Ronit-k/CS253_MemoryEfficientVersionedFileIndexer.git
cd CS253_MemoryEfficientVersionedFileIndexerwget https://media.githubusercontent.com/media/krishna-kumar-bais/CS253-Software-DevOps-Assignment1/main/test_logs.txt
wget https://media.githubusercontent.com/media/krishna-kumar-bais/CS253-Software-DevOps-Assignment1/main/verbose_logs.txtg++ -std=c++17 -O2 -o analyzer analyser.cpp| Argument | Description |
|---|---|
--query |
Query type: word, top, or diff |
--buffer |
Buffer size in KB (256–1024) |
--file |
Input file path (for word / top) |
--version |
Version label (for word / top) |
--word |
Word to search (for word / diff) |
--top |
Number of top words (for top) |
--file1, --file2 |
File paths (for diff) |
--version1, --version2 |
Version labels (for diff) |
./analyzer --file test_logs.txt --version v1 --buffer 512 --query word --word errorIndexing v1 [########################################] 100%
Version: v1
Count: 605079
Buffer Size (KB): 256
Execution Time (s): 1.02554
./analyzer --file test_logs.txt --version v1 --buffer 512 --query top --top 10Indexing v1 [########################################] 100%
Top-10 words in version v1:
devops 1209558
debug 605150
error 605079
info 604266
warning 604149
orderservice 484437
paymentservice 484078
authservice 483842
searchservice 483162
userservice 483125
Buffer Size (KB): 256
Execution Time (s): 0.976015
./analyzer --file1 test_logs.txt --file2 verbose_logs.txt --version1 v1 --version2 v2 --buffer 512 --query diff --word errorIndexing v1 [########################################] 100%
Indexing v2 [########################################] 100%
Difference (v2 - v1): -495377
Buffer Size (KB): 256
Execution Time (s): 1.9306
If any required argument is missing, the program prompts for it instead of throwing an error. Running with no arguments starts a fully guided session:
./analyzerEnter query type (word / diff / top): word
Enter buffer size in KB (256–1024): 512
Enter file path (--file): test_logs.txt
Enter version name (--version): v1
Enter word to search (--word): error
Partial arguments also work — only the missing ones are prompted. Execution timing starts after all inputs are collected.
BufferedFileReader— reads files in fixed-size chunks; never loads the entire file into memoryTokenizer— extracts lowercase alphanumeric words from raw buffer data; handles words spanning chunk boundariesWordIndex<CountType>— template class storing word → frequency mappings in anunordered_mapVersionedIndex— manages multiple named versions, each backed by aWordIndexQuery(abstract) →WordCountQuery,DiffQuery,TopKQuery— query hierarchy using virtual dispatch
See the project report for detailed design discussion.
An optional progress bar is displayed during indexing if progressbar.h is present alongside analyser.cpp. It is auto-detected at compile time via __has_include — if the header is absent, the program compiles and runs identically without it. The progress bar writes to stderr so it does not interfere with query output on stdout.
Indexing v1 [####################--------------------] 50%
If progressbar.h is not present and you'd like to enable the progress bar, download it with:
wget https://raw.githubusercontent.com/Ronit-k/CS253_MemoryEfficientVersionedFileIndexer/main/progressbar.hThen recompile to pick it up.
BufferedFileReader allocates a single fixed-size buffer (256–1024 KB) and streams the file in chunks — the file is never loaded entirely into memory. The Tokenizer handles words that span chunk boundaries using a partial-token accumulator. The WordIndex grows only with the number of unique words, not the file size.
- Invalid buffer size (outside 256–1024 KB) →
invalid_argument - File not found →
runtime_error - Unknown or invalid query type →
invalid_argument - Missing required arguments after prompting →
invalid_argument - Duplicate version name →
runtime_error - Non-positive Top-K value →
invalid_argument
This project was developed as part of the CS253 course at IIT Kanpur.