A disk-resident extendible-hash file implemented in C on top of a custom buffer manager (the BF library distributed with the course). Records live in fixed-size blocks paged in and out by the BF layer; the index above them grows dynamically by splitting buckets and doubling the directory when needed.
First homework of the Database Implementation course at the University of Athens (Department of Informatics & Telecommunications), 2021–2022. Three-person team project (see Team below).
The course-provided API (hash_file.h) lets a client create, open, and close hash-file indices, insert Record { id, name, surname, city } entries, look records up by key, and dump statistics. We implemented the layer that turns that API into actual on-disk extendible hashing.
┌───────────────────────────────────┐
│ HT_Handler (in memory) │
│ ┌─────────────────────────────┐ │
│ │ file_header { global depth, │ │
│ │ 2^depth dir entries, │ │
│ │ records-per-block } │ │
│ └─────────────────────────────┘ │
│ arr[ hash(key) ] = block_id │
└────────────────┬──────────────────┘
│
▼ via BF_GetBlock / BF_Block_GetData
┌──────────────────────┴──────────────────────────┐
│ │
┌───────┴────────┐ ┌─────────────────┐ ┌────────────────┴───┐
│ bucket b₀ │ │ bucket b₁ │ │ bucket b₂ │
│ records[...] │ │ records[...] │ │ records[...] │
│ block_info { │ │ block_info { │ │ block_info { │
│ local_depth, │ │ local_depth, │ │ local_depth, │
│ friends, │ │ friends, │ │ friends, │
│ lowest_index,│ │ lowest_index, │ │ lowest_index, │
│ next_block_id│ │ next_block_id │ │ next_block_id │
│ } │ │ } │ │ } │
└────────────────┘ └─────────────────┘ └────────────────────┘
▲
│ multiple directory slots can point to the same bucket
│ while local_depth < global_depth (the "friends" count
│ tracks how many)
When a bucket fills:

- If `local_depth < global_depth`, split locally — allocate a new block, redistribute records by the next bit of the hash, and update the directory entries that point to the split bucket.
- If `local_depth == global_depth`, double the directory (`global_depth++`) and then split.
Overflow lists exist via next_block_id for the pathological case where a single hash key maps to too many records to fit in one block — but as the team readme.txt honestly notes, repeated identical IDs cause infinite recursive doubling. That is a property of the algorithm we were asked to implement, not a bug in our implementation.
- **Page-aware data layout.** Block metadata (`block_info`) is packed at the end of each page (`data + BF_BLOCK_SIZE - sizeof(block_info)`), leaving the front of the block for `Record` payload. Same trick for the file header in block 0.
- **An iterator abstraction.** Rather than open-coding `BF_GetBlock`/`BF_UnpinBlock` pairs every time we walked records, we wrote an iterator that hides block boundaries: the caller asks for the next record, and the iterator handles page transitions and pin/unpin. Big readability win across `HT_InsertEntry`, `HT_PrintAllEntries`, and the split logic.
- **`friends` and `lowest_index` per bucket.** Each bucket knows how many directory slots currently point to it and which is the lowest. That makes both the split logic and the directory rewrites O(friends) instead of needing to scan the whole directory.
| Role | Contributor |
|---|---|
| Iterator + insert path | Alexandros Gkiafis — sdi2200284 |
| File create / open / close | Charalampos Rafail Dimitriou — sdi1900354 |
| PrintAllEntries + statistics | Christos Dokoutsidis — sdi2000269 |
| General-purpose helpers, design, debugging | Shared |
The split of "who owns which function" is a formality; design, debugging, comments, and tests were done together as a team.
The actual project sits in `code/`. From there:

```shell
make ht          # builds the hash-file driver (build/runner)
./build/runner   # runs examples/ht_main.c against a fresh data.db
```

`examples/ht_main.c` inserts 1000 randomly-generated records into a freshly-created `data.db`, prints all entries matching a randomly-chosen id, then dumps statistics.

`make bf` builds a separate driver against the raw buffer-manager API, useful for sanity-checking the BF layer in isolation.
| Path | Purpose |
|---|---|
| `code/include/bf.h` | Buffer-manager interface (course-provided) |
| `code/include/hash_file.h` | Hash-file API the assignment asked us to implement |
| `code/src/hash_file.c` | Our implementation — directory, splits, iterator, insert, print, stats |
| `code/examples/ht_main.c` | Driver used to evaluate the implementation |
| `code/lib/libbf.so` | Pre-built buffer-manager library (course-provided) |
| `DB_askisi_1_2021-2022.pdf` | The original assignment specification (Greek) |
MIT — applies to our own code in this repo (code/src/hash_file.c). The BF library, the hash_file.h API contract, and the assignment PDF were distributed by the course and retain their original copyright.