diff --git a/.github/workflows/theseus-engine.yml b/.github/workflows/theseus-engine.yml
index 817949a..f127981 100644
--- a/.github/workflows/theseus-engine.yml
+++ b/.github/workflows/theseus-engine.yml
@@ -40,6 +40,9 @@ jobs:
# Genesis (historical fossil) is left completely untouched.
poetry run python scripts/add_fossils.py --update-survivor
+ - name: Clean & Minify data payloads
+ run: poetry run python scripts/cleanup_data.py
+
- name: Commit and push data updates
run: |
git config --local user.email "action@github.com"
diff --git a/README.md b/README.md
index 426b70e..244d5fa 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,78 @@
-# Theseus
-Does a codebase remain the same if every line is replaced? A monthly pulse on software entropy.
+# Ship of Theseus
+
+Does a codebase remain the same if every line is replaced? A monthly pulse on software entropy.
+
+---
+
+## 📖 The Philosophy: Why This Project Matters
+
+The **Ship of Theseus** is a famous thought experiment: if you replace every wooden plank on a ship over time, is it still the same ship?
+
+This exact paradox plays out daily in modern software engineering. Repositories live for years, or even decades. The developers who started them leave, entire architectural paradigms shift, and eventually, the very last line of original code is overwritten. Yet, the repository retains its name, its URL, and its identity.
+
+**This project exists to visualize that journey.** It pulls back the curtain on repository decay and renewal by measuring *codebase entropy*: tracking when lines of code were written and how long they survive before being rewritten, which shows you the "age" of a massive software project at a glance.
+
+### Why People Care About This
+1. **Repository Health & Churn Visibility:** Open-source maintainers and engineering managers can visually assess how quickly a codebase is turning over. Is the core architecture stable (lots of old code), or is it undergoing a frantic rewrite?
+2. **Identifying Key Surviving Code:** By identifying "Historical" and "Living" fossils, this project highlights the original architectural foundation blocks that have stood the test of time (and edge-cases).
+3. **Data-Driven Storytelling:** It acts as a historical lens for famous open-source projects, allowing developers to see how massive frameworks (like React or Django) have evolved through different eras.
+
+## QuickStart Guide
+
+### 1. Requirements
+* `git`
+* `python` > 3.12
+* `poetry` (for dependency management)
+
+### 2. Installation
+```bash
+git clone https://github.com/Asifdotexe/Theseus.git
+cd Theseus
+poetry install
+```
+
+### 3. Running the Engine Locally
+The analytical engine is driven through the centralized `theseus.config.json` configuration file.
+You can run the full timeline snapshot engine:
+```bash
+poetry run python scripts/analyse_repository.py
+```
+
+To backfill or incrementally update the "Fossil" pointers (the absolute oldest lines of code):
+```bash
+poetry run python scripts/add_fossils.py --update-survivor
+```
+
+### 4. Viewing the Interactive Chart
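+Open `index.html` in a modern browser. Because the UI loads its JSON through the Fetch API, some browsers block those requests on `file://` pages; if the chart stays empty, serve the project directory over HTTP (for example with `python -m http.server`) and open the served URL instead.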
+```bash
+# On Mac
+open index.html
+
+# On Windows
+start index.html
+```
+
+---
+
+## Dive Deeper (Documentation)
+
+The technical internals of the Ship of Theseus engine are separated into structured documentation guides:
+
+- **[Architecture & The Data Pipeline](docs/ARCHITECTURE.md):** How we traverse `git` histories incrementally and capture "Fossils".
+- **[Configuration Guide](docs/CONFIGURATION.md):** How to plug in your own repositories by editing `theseus.config.json`.
+- **[DevOps & CI/CD](docs/DEVOPS.md):** How the system updates itself autonomously via GitHub Actions.
+
+---
+
+## License
+This project is open-source and available under the terms defined in the `LICENSE` file.
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 0000000..82256a4
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,96 @@
+# Architecture & Internals
+
+The Ship of Theseus engine is composed of a disconnected backend (data generator) and frontend (UI visualizer). They communicate entirely via an intermediary static JSON format.
+
+This architecture allows the system to remain highly secure, completely serverless, and free to host using static GitHub Pages. (woohoo, who doesn't like free things?)
+
+---
+
+## The Data Pipeline (`analyse_repository.py`)
+
+The heart of the application is a Python script that orchestrates `git` shell commands. Because `git` is heavily optimized C, the engine shells out to the native binary directly instead of adding a Python library layer such as `GitPython` or `pygit2`, which keeps expensive operations like `git blame` in native code.
+
+### Incremental Snapshot Generation
+
+To view codebase health *over time*, we need snapshots of the codebase. Instead of re-parsing every commit since the dawn of time, the engine works incrementally.
+
+```mermaid
+flowchart TD
+ A[Start: read `theseus.config.json`] --> B{Has Data File?}
+ B -- No --> C[Full Clone & 1st Commit]
+ B -- Yes --> D[Look at Last Snapshot Date]
+ D --> E[Is Last Snapshot < Current Month?]
+ E -- Yes --> F[Clone & Jump to Next Month]
+ E -- No --> G[Skip: Up to Date]
+ C --> H[Run Git Blame Parallel]
+ F --> H
+ H --> I[Count Lines by Authorship Year]
+ I --> J[Append Snapshot to JSON]
+```
+
+### The `git blame` Parallelization
+
+When checking out a specific month's commit, the system needs to `git blame` every single valid file in the repository.
+
+1. **Ls-Files Filter:** We run `git ls-files` to list the tracked files, then keep only text files (binaries are skipped).
+2. **ThreadPool Executor:** The script fires off multiple parallel workers to run `git blame --line-porcelain` concurrently across CPUs.
+3. **Regex Extraction:** It rips the UNIX timestamps out of the porcelain format and bins them into "years".
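The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine's actual code: the function names (`years_from_porcelain`, `blame_file`, `blame_repository`) and the `max_workers` default are hypothetical. It relies only on the fact that `git blame --line-porcelain` emits an `author-time <unix timestamp>` header for every blamed line.

```python
import re
import subprocess
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone

# Porcelain header lines start at column 0; source lines are tab-prefixed,
# so anchoring on line start never matches file content.
AUTHOR_TIME = re.compile(r"^author-time (\d+)$", re.MULTILINE)


def years_from_porcelain(porcelain: str) -> Counter:
    """Count blamed lines per authorship year (UTC) in --line-porcelain text."""
    counts = Counter()
    for match in AUTHOR_TIME.finditer(porcelain):
        stamp = int(match.group(1))
        year = datetime.fromtimestamp(stamp, tz=timezone.utc).year
        counts[str(year)] += 1
    return counts


def blame_file(repo_dir: str, path: str) -> Counter:
    """Blame one file; a failing blame contributes no lines (hypothetical policy)."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", "--", path],
        cwd=repo_dir, capture_output=True, text=True, errors="replace",
    )
    if result.returncode != 0:
        return Counter()
    return years_from_porcelain(result.stdout)


def blame_repository(repo_dir: str, paths: list[str], max_workers: int = 8) -> Counter:
    """Blame many files concurrently and merge the per-year line counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for counts in pool.map(lambda p: blame_file(repo_dir, p), paths):
            total.update(counts)
    return total
```

Threads are a reasonable fit here because each worker spends its time waiting on a `git` subprocess rather than running Python code.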
+
+---
+
+## The Fossil Extraction (`add_fossils.py`)
+
+Fossils are pointers to specific, historically significant lines of code that serve as fun easter-eggs for the UI. They are evaluated completely independently to prevent slowing down the main incremental snapshot pipeline.
+
+```mermaid
+stateDiagram-v2
+ direction LR
+ [*] --> ReadManifest
+
+ state "Fossil Extractor" as extractor {
+ GenesisFossil: Historical Genesis
+ SurvivorFossil: Living Survivor
+
+ GenesisFossil --> SortCommits
+ SortCommits --> FindOldestBlamedLine
+
+ SurvivorFossil --> CheckoutHEAD
+ CheckoutHEAD --> FindOldestStillAlive
+ }
+
+ ReadManifest --> extractor
+ extractor --> AppendMetadataJSON
+```
+
+### Historical (Genesis) Protocol
+Repos imported from SVN/Mercurial can have wildly inaccurate committer timestamps. We resolve this by running `git log --all --pretty=format:'%H %at'` to sort all commits explicitly by author time, stepping through the oldest `genesis_depth` commits, and extracting the first line of code ever committed to the repo's history, regardless of branch.
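As a rough sketch of the sorting step, assuming the log output has one `<hash> <author-time>` pair per line: the helper names `read_log` and `oldest_commits` are hypothetical, and only `genesis_depth` comes from the text above. Ties on the timestamp are broken by hash so the result is deterministic.

```python
import subprocess


def read_log(repo_dir: str) -> str:
    """Return '<hash> <author-time>' lines for every commit reachable from any ref."""
    return subprocess.run(
        ["git", "log", "--all", "--pretty=format:%H %at"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout


def oldest_commits(log_output: str, genesis_depth: int) -> list[str]:
    """Pick the genesis_depth oldest commit hashes, ordered by author time."""
    commits = []
    for line in log_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        try:
            commits.append((int(parts[1]), parts[0]))
        except ValueError:
            continue  # timestamp was not an integer
    commits.sort()  # ascending by (author time, hash)
    return [sha for _, sha in commits[:genesis_depth]]
```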
+
+### Living (Survivor) Protocol
+This focuses strictly on the default branch `HEAD`. It recursively blames the latest state of the codebase. Because it's checking `HEAD`, this value frequently moves as old code is finally refactored out.
+
+---
+
+## Data Delivery via Vanilla UI (`app.js`)
+
+The UI is intentionally lightweight. We avoided heavy React or bundle-chain systems to ensure the repository remains simple and easy to fork.
+
+The UI loads `theseus.config.json` via the browser Fetch API, builds out a repository selection grid dynamically, and upon clicking a card, pulls the corresponding static `data/{repo}_data.json`.
+
+```mermaid
+sequenceDiagram
+ participant Browser
+ participant app.js
+ participant config.json
+ participant data.json
+
+ Browser->>app.js: Load index.html
+ app.js->>config.json: fetch("theseus.config.json")
+ config.json-->>app.js: Returns [{repo1}, {repo2}]
+ app.js->>Browser: Renders UI Selection Grid
+
+ Browser->>app.js: Click Repo 1
+ app.js->>Browser: Shows CSS Skeleton Loader Overlay
+ app.js->>data.json: fetch("data/repo1_data.json")
+ data.json-->>app.js: Loads Timeseries + Fossils
+ app.js->>Browser: Computes D3/DOM Chart & Hides Skeleton
+```
diff --git a/docs/CONFIGURATION.md b/docs/CONFIGURATION.md
new file mode 100644
index 0000000..0a506fa
--- /dev/null
+++ b/docs/CONFIGURATION.md
@@ -0,0 +1,52 @@
+# ⚙️ Configuration Guide
+
+The Ship of Theseus engine operates centrally off a single file: `theseus.config.json`. By modifying this file, you instruct both the Python backend and the JavaScript frontend on which repositories to scrape and display.
+
+## Base Schema (`theseus.config.json`)
+
+```json
+{
+  "$schema": "./schema.json",
+  "dataDir": "./data",
+  "repositories": [
+    {
+      "name": "react",
+      "repo": "facebook/react",
+      "displayName": "React",
+      "description": "A JavaScript library for building user interfaces"
+    }
+  ]
+}
+```
+
+### Global Settings
+
+* `dataDir` *(string)*: The relative path to the directory where the engine saves its output JSON files. Usually `"./data"`. The JavaScript frontend reads the same setting, so it must be accurate for the UI to know where to fetch data.
+
+### Repositories Array
+
+The `repositories` array takes objects consisting of the following key attributes:
+
+| Key | Type | Description | Example |
+| :--- | :---: | :--- | :--- |
+| `name` | *String* | A safe, unique identifier. Used for the JSON filename (`{name}_data.json`). Must be snake_case or kebab-case. | `"django"` |
+| `repo` | *String* | The GitHub repository namespace (the URL ending). The engine automatically strips trailing slashes and resolves this to `https://github.com/namespace/repo.git`. | `"django/django"` |
+| `displayName` | *String* | The aesthetic name rendered on UI Cards. | `"Django"` |
+| `description` | *String* | A short UI subheading clarifying what the project is. | `"The web framework for perfectionists with deadlines."` |
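The rules in the table can be checked with a few lines of Python. This is a minimal sketch, not the engine's own validation: `NAME_PATTERN`, `clone_url` and `validate_entry` are hypothetical names, the sketch assumes all four keys are required, and the pattern is one possible reading of "snake_case or kebab-case".

```python
import re

# Assumed reading of "snake_case or kebab-case": lowercase alphanumeric
# words joined by single underscores or hyphens.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:[_-][a-z0-9]+)*$")

REQUIRED_KEYS = ("name", "repo", "displayName", "description")


def clone_url(repo: str) -> str:
    """Resolve 'namespace/repo' (trailing slashes tolerated) to a clone URL."""
    namespace = repo.strip().rstrip("/")
    return f"https://github.com/{namespace}.git"


def validate_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key in REQUIRED_KEYS:
        value = entry.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty '{key}'")
    name = entry.get("name")
    if isinstance(name, str) and name and not NAME_PATTERN.match(name):
        problems.append(f"'name' is not snake_case or kebab-case: {name!r}")
    return problems
```

Running `validate_entry` over each object in `repositories` before starting a long clone catches typos early.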
+
+---
+
+## Modifying Configurations
+
+### Adding a new target
+To begin visualizing a new repository, append it to the `repositories` array.
+
+1. Add your object to `theseus.config.json`
+2. Locally run `poetry run python scripts/analyse_repository.py`
+3. The engine will clone the repo into `./temp_repos/` (which can be over `1GB` for massive codebases, so ensure disk space).
+4. Local data processing will generate `data/{your_repo}_data.json`.
+5. Run `poetry run python scripts/add_fossils.py` to fill in the Genesis/Survivor line references.
+6. Check your `index.html` file to see the newly generated visual graph!
+
+> [!CAUTION]
+> Avoid modifying the output data within `data/` manually. Doing so will corrupt the incremental snapshot logic, forcing the pipeline to wipe out the cache and restart checking out massive commit trees from scratch.
diff --git a/docs/DEVOPS.md b/docs/DEVOPS.md
new file mode 100644
index 0000000..2c5d4a6
--- /dev/null
+++ b/docs/DEVOPS.md
@@ -0,0 +1,42 @@
+# 🤖 DevOps & CI/CD Pipeline
+
+The Ship of Theseus engine doesn't just run once, because codebases never stop evolving. The system relies entirely on GitHub Actions to provide zero-maintenance "monthly pulses" that autonomously update the data output repository.
+
+## The Automation Engine (`.github/workflows/theseus-engine.yml`)
+
+The primary workflow handles generating the JSON snapshot objects incrementally every month, tracking any changes, and pushing them back to the repository data block.
+
+```mermaid
+journey
+ title Monthly Pipeline Action execution
+ section Bootstrapping
+ Clone Primary Repository: 5: GitHub Actions
+ Read Config File: 5: Python
+ Checkout Specific Repositories: 3: Python
+ section Analysis
+ Perform Incremental Snapshot: 4: Python (analyse_repository.py)
+ Blame Lines / Evaluate Entropy: 4: Python
+ Update Survivor Fossils: 3: Python (add_fossils.py)
+ section Persistence
+ Clean & Minify data payloads: 5: Python (cleanup_data.py)
+ Commit Diff Data: 5: git config user.name "github-actions[bot]"
+ Push JSON to Origin: 5: GitHub Actions
+```
+
+### 1. `analyse_repository.py` Trigger
+The analyzer reads `theseus.config.json` and the local `data/` cache. Because `analyse_repository.py` is fully incremental, it reads the last snapshot date (e.g. `snapshot_date="2025-02"`) from the JSON, compares it with the wall-clock calendar month (e.g. `2025-05`), and works out that it needs to check out the repositories at `2025-03`, `2025-04` and `2025-05` to catch up. These checkouts run locally within the GitHub Actions runner.
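The catch-up calculation can be illustrated with a small helper. This is a sketch only: `months_to_catch_up` is a hypothetical name, and it assumes snapshot dates begin with `YYYY-MM`.

```python
from datetime import date


def months_to_catch_up(last_snapshot: str, today: date) -> list[str]:
    """List every 'YYYY-MM' after last_snapshot up to and including today's month."""
    year, month = (int(part) for part in last_snapshot.split("-")[:2])
    pending = []
    while (year, month) < (today.year, today.month):
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
        pending.append(f"{year:04d}-{month:02d}")
    return pending
```

With a last snapshot of `2025-02` and a current date in May 2025, the helper returns `2025-03`, `2025-04` and `2025-05`; when the data is already current it returns an empty list and nothing needs to be checked out.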
+
+### 2. `add_fossils.py --update-survivor` Trigger
+Genesis fossils rarely change unless a codebase undergoes an extreme edge-case rewrite of its absolute first commit history. The UI primarily benefits from tracking the *"Living Fossil"*.
+
+To save processing time in CI, the Action only runs `add_fossils.py` with the `--update-survivor` flag. This skips the commit sorting needed for Genesis creation entirely and simply updates the `view_commit` tip to track code changes.
+
+### 3. File Re-commit Handling
+Finally, the action checks whether the snapshot array or the survivor fossil actually changed relative to the origin.
+
+If `git status` shows modifications to the JSON payloads inside `data/`, the GitHub Actions bot commits them and pushes the result to `main`. This lets the repository act as its own self-updating data store.
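The change check could look something like the sketch below. The workflow's real commit step is not shown in this document, so the sketch assumes it inspects `git status --porcelain` (two status characters, a space, then the path); `changed_data_files` and `read_status` are hypothetical names.

```python
import subprocess


def changed_data_files(status_output: str, data_dir: str = "data") -> list[str]:
    """Extract modified or new JSON payloads under data_dir from porcelain v1 output."""
    prefix = data_dir.rstrip("/") + "/"
    changed = []
    for line in status_output.splitlines():
        path = line[3:]  # skip the two status characters and the separator space
        if path.startswith(prefix) and path.endswith(".json"):
            changed.append(path)
    return changed


def read_status(repo_dir: str = ".") -> str:
    """Return the porcelain status of the working tree."""
    return subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
```

A commit is only worth making when `changed_data_files(read_status())` is non-empty. Renamed or unusually named paths are reported in a different shape by `git status`, so this sketch covers only the plain modified and untracked cases.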
+
+---
+
+> [!TIP]
+> Ensure the Action is allowed Write permissions in the repository settings: `Settings -> Actions -> General -> Workflow permissions -> Read and write permissions`. Otherwise, the robotic commit will result in `HTTP 403` and the pipeline will fail silently.
diff --git a/scripts/cleanup_data.py b/scripts/cleanup_data.py
index b11c3dd..1b7369b 100644
--- a/scripts/cleanup_data.py
+++ b/scripts/cleanup_data.py
@@ -6,19 +6,25 @@
 from pathlib import Path
-def cleanup_data(data_dir: str):
+def cleanup_data(data_dir: str) -> bool:
     """
     Cleans up all JSON data files in the specified directory.
     - Removes 'total_lines' (redundant)
     - Removes future-year keys in 'composition'
     - Minifies output
+    Returns True if an error occurred, False otherwise.
     """
     data_path = Path(data_dir)
+    if not data_path.exists() or not data_path.is_dir():
+        print(f"Data directory not found or not a directory: {data_dir}")
+        return True
+
     json_files = list(data_path.glob("*.json"))
+    had_failures = False
     if not json_files:
         print(f"No JSON files found in {data_dir}")
-        return
+        return had_failures
     for json_file in json_files:
         if json_file.name == "manifest.json":
@@ -43,7 +49,7 @@ def cleanup_data(data_dir: str):
max_year = int(snapshot_date[:4])
composition = snapshot.get("composition", {})
keys_to_remove = [
- year for year in composition.keys() if int(year) > max_year
+ year for year in composition.keys() if int(year) > max_year
]
for key in keys_to_remove:
del composition[key]
@@ -60,8 +66,24 @@ def cleanup_data(data_dir: str):
         except Exception as e:
             print(f" Error processing {json_file.name}: {e}")
+            had_failures = True
+
+    return had_failures
+
+def main():
+    import sys
+    config_path = "theseus.config.json"
+    if not Path(config_path).exists():
+        print(f"Configuration file not found: {config_path}")
+        sys.exit(1)
+
+    with open(config_path, "r", encoding="utf-8") as f:
+        config = json.load(f)
+    data_dir = config.get("dataDir", "./data")
+    if cleanup_data(data_dir):
+        print("One or more files failed to clean up. Exiting non-zero.")
+        sys.exit(1)
 if __name__ == "__main__":
-    DATA_DIR = "./data"
-    cleanup_data(DATA_DIR)
+    main()