Skip to content

virendrakala/Memory-Efficient-Versioned-File-Indexer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

Memory-Efficient Versioned File Indexer Using C++

Name: Virendra Kala (Roll No. 241172)
Course: CS253 – Software Development and Operations
Professor: Prof. Indranil Saha
Institute: Indian Institute of Technology Kanpur


Overview

The Memory-Efficient Versioned File Indexer is a C++ command-line application designed to process and analyze large text-based datasets while maintaining strict memory constraints. The system incrementally reads large files using a fixed-size buffer and constructs a word-level index that stores the frequency of each unique word.

The program guarantees that the entire file is never loaded into memory, making it suitable for handling large datasets efficiently. It supports multiple versions of datasets within a single execution, allowing users to perform analytical queries such as word frequency lookup, top-K frequent word retrieval, and comparison of word frequencies across different dataset versions.

This project demonstrates strong object-oriented design, disciplined memory management, and the use of core C++ programming concepts including inheritance, runtime polymorphism, function templates, and exception handling.


Features

Memory-Efficient File Processing

  • Uses a fixed-size buffer (256 KB – 1024 KB) to read files incrementally.
  • The entire file is never loaded into memory at any point.
  • Memory usage depends only on the number of unique words, not the file size.
  • Correctly handles words that are split across buffer boundaries using a carry-word mechanism.

Word-Level Indexing

  • Builds a mapping from each unique word to its frequency in the file.
  • Words are defined as contiguous sequences of alphanumeric characters.
  • Word matching is fully case-insensitive (all input is normalized to lowercase).

Version Management

  • Each indexed file corresponds to a distinct, user-named version.
  • Multiple versions can be indexed and queried within the same execution.
  • Queries can operate on a single version or compare two versions.

Supported Queries

1. Word Count Query

Returns the frequency of a specific word in a given version.

2. Top-K Query

Displays the K most frequent words in a specified version, sorted in descending order of frequency. Ties are broken alphabetically.

3. Difference Query

Computes the difference in frequency of a word between two versions (version2 − version1).


System Architecture

The program follows a modular object-oriented architecture. Each class has a single, clearly defined responsibility.

Classes

BufferedFileReader
Reads the input file incrementally using a fixed-size buffer allocated at construction time. Validates that the requested buffer size lies within the permitted range (256–1024 KB) and throws a runtime_error if the constraint is violated or the file cannot be opened. Exposes a readChunk() method that fills the buffer and reports the number of bytes actually read.

Tokenizer
Processes raw buffer data and extracts valid words. Converts each character to lowercase and identifies alphanumeric sequences as words. Maintains a carryWord string across calls to correctly reconstruct words that span buffer boundaries, ensuring no token is lost or corrupted at chunk edges.

VersionedIndexer
Stores word-frequency indexes for multiple file versions. Internally uses a map<string, unordered_map<string, int>> to associate each version name with its frequency table. Exposes methods to add a version, retrieve the count of a specific word, and access the full index for a version.

Query (Abstract Base Class)
Defines the interface for all query types. Declares a pure virtual execute() method that accepts a VersionedIndexer reference. All concrete query classes derive from this class and override execute(), enabling runtime polymorphism via dynamic dispatch through a Query* pointer.

WordQuery
Derived from Query. Executes a word frequency lookup and prints the version name and the count of the queried word.

DiffQuery
Derived from Query. Computes and prints the difference in frequency of a word between two versions (v2 − v1).

TopQuery
Derived from Query. Collects all word-frequency pairs for a version, calls the printTopK<T> function template to sort them in descending order of frequency (alphabetically for ties), and prints the top K results.

printTopK<T> (Function Template)
A generic function template that sorts a vector<pair<string, T>> in descending order of value (with lexicographic tie-breaking) and prints the top K entries. Satisfies the assignment's requirement for a user-defined template.


Prerequisites

Ensure you have the following installed:

  • C++ Compiler supporting C++17 (g++ recommended)
  • Git (optional, for cloning the repository)
sudo apt-get update
sudo apt-get install -y build-essential git

Compilation and Execution

Clone the Repository

git clone https://github.com/virendrakala/Memory-Efficient-Versioned-File-Indexer.git
cd Memory-Efficient-Versioned-File-Indexer

Compile

g++ -std=c++17 -O2 241172_Virendra.cpp -o analyzer

Run — Word Query

./analyzer --file dataset_v1.txt --version v1 --buffer 512 --query word --word error

Example Output: Query: word Version: v1 Word: error Count: 605079 Buffer Size (KB): 256 Execution Time (s): 0.843523

Run — Top-K Query

./analyzer --file dataset_v1.txt --version v1 --buffer 512 --query top --top 10

Example Output: Query: top Top-10 words in version v1:

  1. devops 1209558
  2. debug 605150
  3. error 605079
  4. info 604266
  5. warning 604149
  6. orderservice 484437
  7. paymentservice 484078
  8. authservice 483842
  9. searchservice 483162
  10. userservice 483125 Buffer Size (KB): 256 Execution Time (s): 0.848306

Run — Difference Query

./analyzer --file1 dataset_v1.txt --version1 v1 \
           --file2 dataset_v2.txt --version2 v2 \
           --buffer 512 --
query diff --word error

Example Output: Query: diff Word: error Difference (v2 - v1) :-495377 Buffer Size (KB): 256 Execution Time (s): 1.70038

Command-Line Arguments

Argument Description
--file <path> Input file for single-version queries
--file1 <path> First input file for diff query
--file2 <path> Second input file for diff query
--version <name> Version identifier for single-version queries
--version1 <name> First version identifier for diff query
--version2 <name> Second version identifier for diff query
--buffer <kb> Buffer size in KB (must be 256–1024)
--query <type> Query type: word, top, or diff
--word <token> Word to search (used in word and diff queries)
--top <k> Number of top results to display (used in top query)

OOP and C++ Concepts Demonstrated

Concept Where Used
Abstract base class Query with pure virtual execute()
Inheritance WordQuery, DiffQuery, TopQuery derived from Query
Runtime polymorphism Query* pointer dispatches to correct execute() at runtime
Function template printTopK<T> for generic sorting and printing
Exception handling try/catch in main; throw runtime_error in BufferedFileReader
Function overloading Overloaded getWordCount() in VersionedIndexer
Encapsulation Each class exposes only the interface needed; internals are private

Memory Constraints

The program strictly guarantees:

  • Memory usage is independent of file size.
  • The entire dataset is never loaded into memory.
  • Processing occurs incrementally using a fixed-size buffer.
  • Memory usage grows only with the number of unique words.
  • Words split across buffer boundaries are correctly handled via the carry-word mechanism in Tokenizer.

Assumptions

  • Words consist only of alphanumeric characters; all other characters act as delimiters.
  • Word matching is case-insensitive; all tokens are normalized to lowercase.
  • Buffer size must be between 256 KB and 1024 KB (enforced with an exception).
  • For the diff query, the difference is computed as version2 − version1.

Acknowledgements

Developed as Assignment 1 for CS253 – Software Development and Operations, Indian Institute of Technology Kanpur.
Special thanks to Prof. Indranil Saha and the teaching assistants of CS253 for their guidance throughout the course.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages