Name: Virendra Kala (Roll No. 241172)
Course: CS253 – Software Development and Operations
Professor: Prof. Indranil Saha
Institute: Indian Institute of Technology Kanpur
The Memory-Efficient Versioned File Indexer is a C++ command-line application designed to process and analyze large text-based datasets while maintaining strict memory constraints. The system incrementally reads large files using a fixed-size buffer and constructs a word-level index that stores the frequency of each unique word.
The program guarantees that the entire file is never loaded into memory, making it suitable for handling large datasets efficiently. It supports multiple versions of datasets within a single execution, allowing users to perform analytical queries such as word frequency lookup, top-K frequent word retrieval, and comparison of word frequencies across different dataset versions.
This project demonstrates strong object-oriented design, disciplined memory management, and the use of core C++ programming concepts including inheritance, runtime polymorphism, function templates, and exception handling.
- Uses a fixed-size buffer (256 KB – 1024 KB) to read files incrementally.
- The entire file is never loaded into memory at any point.
- Memory usage depends only on the number of unique words, not the file size.
- Correctly handles words that are split across buffer boundaries using a carry-word mechanism.
- Builds a mapping from each unique word to its frequency in the file.
- Words are defined as contiguous sequences of alphanumeric characters.
- Word matching is fully case-insensitive (all input is normalized to lowercase).
- Each indexed file corresponds to a distinct, user-named version.
- Multiple versions can be indexed and queried within the same execution.
- Queries can operate on a single version or compare two versions.
Returns the frequency of a specific word in a given version.
Displays the K most frequent words in a specified version, sorted in descending order of frequency. Ties are broken alphabetically.
Computes the difference in frequency of a word between two versions (version2 − version1).
The program follows a modular object-oriented architecture. Each class has a single, clearly defined responsibility.
BufferedFileReader
Reads the input file incrementally using a fixed-size buffer allocated at construction
time. Validates that the requested buffer size lies within the permitted range
(256–1024 KB) and throws a runtime_error if the constraint is violated or the file
cannot be opened. Exposes a readChunk() method that fills the buffer and reports
the number of bytes actually read.
Tokenizer
Processes raw buffer data and extracts valid words. Converts each character to
lowercase and identifies alphanumeric sequences as words. Maintains a carryWord
string across calls to correctly reconstruct words that span buffer boundaries,
ensuring no token is lost or corrupted at chunk edges.
VersionedIndexer
Stores word-frequency indexes for multiple file versions. Internally uses a
map<string, unordered_map<string, int>> to associate each version name with its
frequency table. Exposes methods to add a version, retrieve the count of a specific
word, and access the full index for a version.
Query (Abstract Base Class)
Defines the interface for all query types. Declares a pure virtual execute() method
that accepts a VersionedIndexer reference. All concrete query classes derive from
this class and override execute(), enabling runtime polymorphism via dynamic dispatch
through a Query* pointer.
WordQuery
Derived from Query. Executes a word frequency lookup and prints the version name
and the count of the queried word.
DiffQuery
Derived from Query. Computes and prints the difference in frequency of a word
between two versions (v2 − v1).
TopQuery
Derived from Query. Collects all word-frequency pairs for a version, calls the
printTopK<T> function template to sort them in descending order of frequency
(alphabetically for ties), and prints the top K results.
printTopK<T> (Function Template)
A generic function template that sorts a vector<pair<string, T>> in descending
order of value (with lexicographic tie-breaking) and prints the top K entries.
Satisfies the assignment's requirement for a user-defined template.
Ensure you have the following installed:
- C++ Compiler supporting C++17 (
g++recommended) - Git (optional, for cloning the repository)
sudo apt-get update
sudo apt-get install -y build-essential gitgit clone https://github.com/virendrakala/Memory-Efficient-Versioned-File-Indexer.git
cd Memory-Efficient-Versioned-File-Indexerg++ -std=c++17 -O2 241172_Virendra.cpp -o analyzer./analyzer --file dataset_v1.txt --version v1 --buffer 512 --query word --word errorExample Output: Query: word Version: v1 Word: error Count: 605079 Buffer Size (KB): 256 Execution Time (s): 0.843523
./analyzer --file dataset_v1.txt --version v1 --buffer 512 --query top --top 10Example Output: Query: top Top-10 words in version v1:
- devops 1209558
- debug 605150
- error 605079
- info 604266
- warning 604149
- orderservice 484437
- paymentservice 484078
- authservice 483842
- searchservice 483162
- userservice 483125 Buffer Size (KB): 256 Execution Time (s): 0.848306
./analyzer --file1 dataset_v1.txt --version1 v1 \
--file2 dataset_v2.txt --version2 v2 \
--buffer 512 --
query diff --word errorExample Output: Query: diff Word: error Difference (v2 - v1) :-495377 Buffer Size (KB): 256 Execution Time (s): 1.70038
| Argument | Description |
|---|---|
--file <path> |
Input file for single-version queries |
--file1 <path> |
First input file for diff query |
--file2 <path> |
Second input file for diff query |
--version <name> |
Version identifier for single-version queries |
--version1 <name> |
First version identifier for diff query |
--version2 <name> |
Second version identifier for diff query |
--buffer <kb> |
Buffer size in KB (must be 256–1024) |
--query <type> |
Query type: word, top, or diff |
--word <token> |
Word to search (used in word and diff queries) |
--top <k> |
Number of top results to display (used in top query) |
| Concept | Where Used |
|---|---|
| Abstract base class | Query with pure virtual execute() |
| Inheritance | WordQuery, DiffQuery, TopQuery derived from Query |
| Runtime polymorphism | Query* pointer dispatches to correct execute() at runtime |
| Function template | printTopK<T> for generic sorting and printing |
| Exception handling | try/catch in main; throw runtime_error in BufferedFileReader |
| Function overloading | Overloaded getWordCount() in VersionedIndexer |
| Encapsulation | Each class exposes only the interface needed; internals are private |
The program strictly guarantees:
- Memory usage is independent of file size.
- The entire dataset is never loaded into memory.
- Processing occurs incrementally using a fixed-size buffer.
- Memory usage grows only with the number of unique words.
- Words split across buffer boundaries are correctly handled via the carry-word mechanism in
Tokenizer.
- Words consist only of alphanumeric characters; all other characters act as delimiters.
- Word matching is case-insensitive; all tokens are normalized to lowercase.
- Buffer size must be between 256 KB and 1024 KB (enforced with an exception).
- For the diff query, the difference is computed as version2 − version1.
Developed as Assignment 1 for CS253 – Software Development and Operations,
Indian Institute of Technology Kanpur.
Special thanks to Prof. Indranil Saha and the teaching assistants of CS253 for
their guidance throughout the course.