Memory-Efficient Versioned File Indexer Using C++

Name: Virendra Kala (Roll No. 241172)
Course: CS253 – Software Development and Operations
Professor: Prof. Indranil Saha
Institute: Indian Institute of Technology Kanpur

Overview

The Memory-Efficient Versioned File Indexer is a C++ command-line application designed to process and analyze large text-based datasets while maintaining strict memory constraints. The system incrementally reads large files using a fixed-size buffer and constructs a word-level index that stores the frequency of each unique word.

The program guarantees that the entire file is never loaded into memory, making it suitable for handling large datasets efficiently. It supports multiple versions of datasets within a single execution, allowing users to perform analytical queries such as word frequency lookup, top-K frequent word retrieval, and comparison of word frequencies across different dataset versions.

This project demonstrates strong object-oriented design, disciplined memory management, and the use of core C++ programming concepts including inheritance, runtime polymorphism, function templates, and exception handling.

Features

Memory-Efficient File Processing

Uses a fixed-size buffer (256 KB – 1024 KB) to read files incrementally.
The entire file is never loaded into memory at any point.
Memory usage depends only on the number of unique words, not the file size.
Correctly handles words that are split across buffer boundaries using a carry-word mechanism.

Word-Level Indexing

Builds a mapping from each unique word to its frequency in the file.
Words are defined as contiguous sequences of alphanumeric characters.
Word matching is fully case-insensitive (all input is normalized to lowercase).

Version Management

Each indexed file corresponds to a distinct, user-named version.
Multiple versions can be indexed and queried within the same execution.
Queries can operate on a single version or compare two versions.

Supported Queries

1. Word Count Query

Returns the frequency of a specific word in a given version.

2. Top-K Query

Displays the K most frequent words in a specified version, sorted in descending order of frequency. Ties are broken alphabetically.

3. Difference Query

Computes the difference in frequency of a word between two versions (version2 − version1).

System Architecture

The program follows a modular object-oriented architecture. Each class has a single, clearly defined responsibility.

Classes

BufferedFileReader
Reads the input file incrementally using a fixed-size buffer allocated at construction time. Validates that the requested buffer size lies within the permitted range (256–1024 KB) and throws a runtime_error if the constraint is violated or the file cannot be opened. Exposes a readChunk() method that fills the buffer and reports the number of bytes actually read.

Tokenizer
Processes raw buffer data and extracts valid words. Converts each character to lowercase and identifies alphanumeric sequences as words. Maintains a carryWord string across calls to correctly reconstruct words that span buffer boundaries, ensuring no token is lost or corrupted at chunk edges.

VersionedIndexer
Stores word-frequency indexes for multiple file versions. Internally uses a map<string, unordered_map<string, int>> to associate each version name with its frequency table. Exposes methods to add a version, retrieve the count of a specific word, and access the full index for a version.

Query (Abstract Base Class)
Defines the interface for all query types. Declares a pure virtual execute() method that accepts a VersionedIndexer reference. All concrete query classes derive from this class and override execute(), enabling runtime polymorphism via dynamic dispatch through a Query* pointer.

WordQuery
Derived from Query. Executes a word frequency lookup and prints the version name and the count of the queried word.

DiffQuery
Derived from Query. Computes and prints the difference in frequency of a word between two versions (v2 − v1).

TopQuery
Derived from Query. Collects all word-frequency pairs for a version, calls the printTopK<T> function template to sort them in descending order of frequency (alphabetically for ties), and prints the top K results.

printTopK<T> (Function Template)
A generic function template that sorts a vector<pair<string, T>> in descending order of value (with lexicographic tie-breaking) and prints the top K entries. Satisfies the assignment's requirement for a user-defined template.

Prerequisites

Ensure you have the following installed:

C++ Compiler supporting C++17 (g++ recommended)
Git (optional, for cloning the repository)

sudo apt-get update
sudo apt-get install -y build-essential git

Compilation and Execution

Clone the Repository

git clone https://github.com/virendrakala/Memory-Efficient-Versioned-File-Indexer.git
cd Memory-Efficient-Versioned-File-Indexer

Compile

g++ -std=c++17 -O2 241172_Virendra.cpp -o analyzer

Run — Word Query

./analyzer --file dataset_v1.txt --version v1 --buffer 512 --query word --word error

Example Output: Query: word Version: v1 Word: error Count: 605079 Buffer Size (KB): 256 Execution Time (s): 0.843523

Run — Top-K Query

./analyzer --file dataset_v1.txt --version v1 --buffer 512 --query top --top 10

Example Output: Query: top Top-10 words in version v1:

devops 1209558
debug 605150
error 605079
info 604266
warning 604149
orderservice 484437
paymentservice 484078
authservice 483842
searchservice 483162
userservice 483125 Buffer Size (KB): 256 Execution Time (s): 0.848306

Run — Difference Query

./analyzer --file1 dataset_v1.txt --version1 v1 \
           --file2 dataset_v2.txt --version2 v2 \
           --buffer 512 --
query diff --word error

Example Output: Query: diff Word: error Difference (v2 - v1) :-495377 Buffer Size (KB): 256 Execution Time (s): 1.70038

Command-Line Arguments

Argument	Description
`--file <path>`	Input file for single-version queries
`--file1 <path>`	First input file for diff query
`--file2 <path>`	Second input file for diff query
`--version <name>`	Version identifier for single-version queries
`--version1 <name>`	First version identifier for diff query
`--version2 <name>`	Second version identifier for diff query
`--buffer <kb>`	Buffer size in KB (must be 256–1024)
`--query <type>`	Query type: `word`, `top`, or `diff`
`--word <token>`	Word to search (used in `word` and `diff` queries)
`--top <k>`	Number of top results to display (used in `top` query)

OOP and C++ Concepts Demonstrated

Concept	Where Used
Abstract base class	`Query` with pure virtual `execute()`
Inheritance	`WordQuery`, `DiffQuery`, `TopQuery` derived from `Query`
Runtime polymorphism	`Query*` pointer dispatches to correct `execute()` at runtime
Function template	`printTopK<T>` for generic sorting and printing
Exception handling	`try/catch` in `main`; `throw runtime_error` in `BufferedFileReader`
Function overloading	Overloaded `getWordCount()` in `VersionedIndexer`
Encapsulation	Each class exposes only the interface needed; internals are private

Memory Constraints

The program strictly guarantees:

Memory usage is independent of file size.
The entire dataset is never loaded into memory.
Processing occurs incrementally using a fixed-size buffer.
Memory usage grows only with the number of unique words.
Words split across buffer boundaries are correctly handled via the carry-word mechanism in Tokenizer.

Assumptions

Words consist only of alphanumeric characters; all other characters act as delimiters.
Word matching is case-insensitive; all tokens are normalized to lowercase.
Buffer size must be between 256 KB and 1024 KB (enforced with an exception).
For the diff query, the difference is computed as version2 − version1.

Acknowledgements

Developed as Assignment 1 for CS253 – Software Development and Operations, Indian Institute of Technology Kanpur.
Special thanks to Prof. Indranil Saha and the teaching assistants of CS253 for their guidance throughout the course.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
241172_Virendra.cpp		241172_Virendra.cpp
OOPS_assignment.pdf		OOPS_assignment.pdf
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Memory-Efficient Versioned File Indexer Using C++

Overview

Features

Memory-Efficient File Processing

Word-Level Indexing

Version Management

Supported Queries

1. Word Count Query

2. Top-K Query

3. Difference Query

System Architecture

Classes

Prerequisites

Compilation and Execution

Clone the Repository

Compile

Run — Word Query

Run — Top-K Query

Run — Difference Query

Command-Line Arguments

OOP and C++ Concepts Demonstrated

Memory Constraints

Assumptions

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Memory-Efficient Versioned File Indexer Using C++

Overview

Features

Memory-Efficient File Processing

Word-Level Indexing

Version Management

Supported Queries

1. Word Count Query

2. Top-K Query

3. Difference Query

System Architecture

Classes

Prerequisites

Compilation and Execution

Clone the Repository

Compile

Run — Word Query

Run — Top-K Query

Run — Difference Query

Command-Line Arguments

OOP and C++ Concepts Demonstrated

Memory Constraints

Assumptions

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages