database-internals-hashfile

A disk-resident extendible-hash file implemented in C on top of a custom buffer manager (the BF library distributed with the course). Records live in fixed-size blocks paged in and out by the BF layer; the index above them grows dynamically by splitting buckets and doubling the directory when needed.

First homework of the Database Implementation course at the University of Athens (Department of Informatics & Telecommunications), 2021–2022. Three-person team project (see Team below).

What it does

The course-provided API (hash_file.h) lets a client create, open, and close hash-file indices, insert Record { id, name, surname, city } entries, look records up by key, and dump statistics. We implemented the layer that turns that API into actual on-disk extendible hashing.

                      ┌───────────────────────────────────┐
                      │       HT_Handler (in memory)      │
                      │  ┌─────────────────────────────┐  │
                      │  │ file_header { global depth, │  │
                      │  │   2^depth dir entries,      │  │
                      │  │   records-per-block }       │  │
                      │  └─────────────────────────────┘  │
                      │  arr[ hash(key) ] = block_id      │
                      └────────────────┬──────────────────┘
                                       │
                                       ▼  via BF_GetBlock / BF_Block_GetData
                ┌──────────────────────┴──────────────────────────┐
                │                                                 │
        ┌───────┴────────┐  ┌─────────────────┐  ┌────────────────┴───┐
        │   bucket b₀    │  │   bucket b₁     │  │   bucket b₂        │
        │  records[...]  │  │  records[...]   │  │  records[...]      │
        │  block_info {  │  │  block_info {   │  │  block_info {      │
        │   local_depth, │  │   local_depth,  │  │   local_depth,     │
        │   friends,     │  │   friends,      │  │   friends,         │
        │   lowest_index,│  │   lowest_index, │  │   lowest_index,    │
        │   next_block_id│  │   next_block_id │  │   next_block_id    │
        │  }             │  │  }              │  │  }                 │
        └────────────────┘  └─────────────────┘  └────────────────────┘
                ▲
                │  multiple directory slots can point to the same bucket
                │  while local_depth < global_depth (the "friends" count
                │  tracks how many)

When a bucket fills:

  • If local_depth < global_depth, split locally — allocate a new block, redistribute records by the next bit of the hash, and update the directory entries that point to the split bucket.
  • If local_depth == global_depth, double the directory (global_depth++) and then split.

Overflow lists exist via next_block_id for the pathological case where a single hash key maps to too many records to fit in one block — but as the team readme.txt honestly notes, repeated identical IDs cause unbounded recursive directory doubling, since no split can ever separate equal keys. That is a property of the algorithm we were asked to implement, not a bug in our implementation.

What was interesting

  • Page-aware data layout. Block metadata (block_info) is packed at the end of each page (data + BF_BLOCK_SIZE - sizeof(block_info)), leaving the front of the block for Record payload. Same trick for the file header in block 0.
  • An iterator abstraction. Rather than open-coding BF_GetBlock / BF_UnpinBlock pairs every time we walked records, we wrote an iterator that hides block boundaries — caller asks for "next record", the iterator handles page transitions and pin/unpin. Big readability win across HT_InsertEntry, HT_PrintAllEntries, and the split logic.
  • friends and lowest_index per bucket. Each bucket knows how many directory slots currently point to it and which is the lowest. That makes both the split logic and the directory rewrites O(friends) instead of needing to scan the whole directory.

Team

Role Contributor
Iterator + insert path Alexandros Gkiafis (sdi2200284)
File create / open / close Charalampos Rafail Dimitriou (sdi1900354)
PrintAllEntries + statistics Christos Dokoutsidis (sdi2000269)
General-purpose helpers, design, debugging Shared

The split of "who owns which function" is a formality; design, debugging, comments, and tests were done together as a team.

Build & run

The actual project sits in code/. From there:

make ht        # builds the hash-file driver (build/runner)
./build/runner # runs examples/ht_main.c against a fresh data.db

examples/ht_main.c inserts 1000 randomly generated records into a freshly created data.db, prints all entries matching a randomly chosen id, then dumps statistics.

make bf builds a separate driver against the raw buffer-manager API, useful for sanity-checking the BF layer in isolation.

Files

Path Purpose
code/include/bf.h Buffer-manager interface (course-provided)
code/include/hash_file.h Hash-file API the assignment asked us to implement
code/src/hash_file.c Our implementation — directory, splits, iterator, insert, print, stats
code/examples/ht_main.c Driver used to evaluate the implementation
code/lib/libbf.so Pre-built buffer-manager library (course-provided)
DB_askisi_1_2021-2022.pdf The original assignment specification (Greek)

License

MIT — applies to our own code in this repo (code/src/hash_file.c). The BF library, the hash_file.h API contract, and the assignment PDF were distributed by the course and retain their original copyright.

