3 changes: 3 additions & 0 deletions .github/workflows/theseus-engine.yml
@@ -40,6 +40,9 @@ jobs:
# Genesis (historical fossil) is left completely untouched.
poetry run python scripts/add_fossils.py --update-survivor

- name: Clean & Minify data payloads
run: poetry run python scripts/cleanup_data.py

- name: Commit and push data updates
run: |
git config --local user.email "action@github.com"
80 changes: 78 additions & 2 deletions README.md
@@ -1,2 +1,78 @@
# Theseus
Does a codebase remain the same if every line is replaced? A monthly pulse on software entropy.
<div align="center">
<img src="assets/theseus_favicon.png" alt="Ship of Theseus Logo" width="120" />
<h1>Ship of Theseus</h1>
<p><i>Does a codebase remain the same if every line is replaced? A monthly pulse on software entropy.</i></p>
<p>
<img src="https://img.shields.io/badge/Python-3.12%2B-blue?style=flat&logo=python&logoColor=white" alt="Python 3.12+" />
<img src="https://img.shields.io/badge/Vanilla_JS-ES6-F7DF1E?style=flat&logo=javascript&logoColor=black" alt="Vanilla JS" />
<img src="https://img.shields.io/badge/Deployed-GitHub_Pages-2EA44F?style=flat&logo=github" alt="Deployed on GitHub Pages" />
<img src="https://img.shields.io/badge/GitHub_Actions-Automated-2088FF?style=flat&logo=github-actions&logoColor=white" alt="GitHub Actions" />
<img src="https://img.shields.io/badge/Code_Style-Black-000000?style=flat&logo=python&logoColor=white" alt="Black Style" />
</p>
</div>

---

## 📖 The Philosophy: Why This Project Matters

The **Ship of Theseus** is a famous thought experiment: if you replace every wooden plank on a ship over time, is it still the same ship?

This exact paradox plays out daily in modern software engineering. Repositories live for years, or even decades. The developers who started them leave, entire architectural paradigms shift, and eventually, the very last line of original code is overwritten. Yet, the repository retains its name, its URL, and its identity.

**This project exists to visualize that journey.** It pulls back the curtain on repository decay and renewal by measuring *codebase entropy*: tracking when lines of code were written and how long they survive before being rewritten. The result shows the "age" of a massive software project at a glance.

### Why People Care About This
1. **Repository Health & Churn Visibility:** Open-source maintainers and engineering managers can visually assess how quickly a codebase is turning over. Is the core architecture stable (lots of old code), or is it undergoing a frantic rewrite?
2. **Identifying Key Surviving Code:** By identifying "Historical" and "Living" fossils, this project highlights the original architectural foundation blocks that have stood the test of time (and edge-cases).
3. **Data-Driven Storytelling:** It acts as a historical lens for famous open-source projects, allowing developers to see how massive frameworks (like React or Django) have evolved through different eras.

## QuickStart Guide

### 1. Requirements
* `git`
* `python` >= 3.12
* `poetry` (for dependency management)

### 2. Installation
```bash
git clone https://github.com/Asifdotexe/Theseus.git
cd Theseus
poetry install
```

### 3. Running the Engine Locally
The analytical engine is driven through the centralized `theseus.config.json` configuration file.
You can run the full timeline snapshot engine:
```bash
poetry run python scripts/analyse_repository.py
```

To backfill or incrementally update the "Fossil" pointers (the absolute oldest lines of code):
```bash
poetry run python scripts/add_fossils.py --update-survivor
```

### 4. Viewing the Interactive Chart
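Open `index.html` in a modern browser. The page loads its JSON through `fetch`, which some browsers block on `file://` pages; if the chart stays empty, serve the folder over HTTP (for example with `python -m http.server`) and open that URL instead. To open the file directly: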
```bash
# On Mac
open index.html

# On Windows
start index.html
```

---

## Dive Deeper (Documentation)

The technical internals of the Ship of Theseus engine are separated into structured documentation guides:

- **[Architecture & The Data Pipeline](docs/ARCHITECTURE.md):** How we traverse `git` histories incrementally and capture "Fossils".
- **[Configuration Guide](docs/CONFIGURATION.md):** How to plug in your own repositories by editing `theseus.config.json`.
- **[DevOps & CI/CD](docs/DEVOPS.md):** How the system updates itself autonomously via GitHub Actions.

---

## License
This project is open-source and available under the terms defined in the `LICENSE` file.
96 changes: 96 additions & 0 deletions docs/ARCHITECTURE.md
@@ -0,0 +1,96 @@
# Architecture & Internals

The Ship of Theseus engine is composed of a disconnected backend (data generator) and frontend (UI visualizer). They communicate entirely via an intermediary static JSON format.

This architecture allows the system to remain highly secure, completely serverless, and free to host using static GitHub Pages. (woohoo, who doesn't like free things?)

---

## The Data Pipeline (`analyse_repository.py`)

The heart of the application is a Python script that orchestrates `git` shell commands. Because `git` is heavily optimized C, shelling out to the native binary is typically much faster for bulk `git blame` work than driving the same operations through Python libraries such as `GitPython` or `pygit2`.

### Incremental Snapshot Generation

To view codebase health *over time*, we need snapshots of the codebase. Instead of re-parsing every commit since the dawn of time, the engine works incrementally.

```mermaid
flowchart TD
A[Start: read `theseus.config.json`] --> B{Has Data File?}
B -- No --> C[Full Clone & 1st Commit]
B -- Yes --> D[Look at Last Snapshot Date]
D --> E[Is Last Snapshot < Current Month?]
E -- Yes --> F[Clone & Jump to Next Month]
E -- No --> G[Skip: Up to Date]
C --> H[Run Git Blame Parallel]
F --> H
H --> I[Count Lines by Authorship Year]
I --> J[Append Snapshot to JSON]
```

### The `git blame` Parallelization

When checking out a specific month's commit, the system needs to `git blame` every single valid file in the repository.

1. **Ls-Files Filter:** We run `git ls-files` to list the tracked files, then keep only text files (binary files are skipped).
2. **ThreadPoolExecutor:** The script fires off multiple parallel workers to run `git blame --line-porcelain` concurrently across CPUs.
3. **Regex Extraction:** It rips the UNIX timestamps out of the porcelain format and bins them into "years".
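
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `analyse_repository.py`: the function names, the `max_workers` default, and the choice to bin years in UTC are assumptions, and the binary-file filter is left out for brevity.

```python
import re
import subprocess
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone

# In --line-porcelain output, every blamed line carries an
# "author-time <unix timestamp>" header line. File content lines start
# with a TAB, so they can never match this anchored pattern.
AUTHOR_TIME_RE = re.compile(r"^author-time (\d+)$", re.MULTILINE)


def years_from_porcelain(porcelain: str) -> Counter:
    """Count blamed lines per authorship year (UTC) in one file's output."""
    years = Counter()
    for ts in AUTHOR_TIME_RE.findall(porcelain):
        years[datetime.fromtimestamp(int(ts), tz=timezone.utc).year] += 1
    return years


def blame_file(repo_dir: str, path: str) -> Counter:
    """Blame one file; files git cannot blame contribute nothing."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", "--", path],
        cwd=repo_dir, capture_output=True, text=True, errors="replace",
    )
    if result.returncode != 0:
        return Counter()
    return years_from_porcelain(result.stdout)


def blame_repository(repo_dir: str, max_workers: int = 8) -> Counter:
    """Blame every tracked file concurrently and merge the per-year counts."""
    listing = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    tracked = [p for p in listing.stdout.split("\0") if p]
    total = Counter()
    # Threads are enough here: the heavy lifting happens in git subprocesses.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for counts in pool.map(lambda p: blame_file(repo_dir, p), tracked):
            total.update(counts)
    return total
```

Calling `blame_repository("temp_repos/react")` on a checked-out clone would return a `Counter` mapping years to surviving line counts, which is the shape a snapshot's `composition` appears to need.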

---

## The Fossil Extraction (`add_fossils.py`)

Fossils are pointers to specific, historically significant lines of code that serve as fun easter-eggs for the UI. They are evaluated completely independently to prevent slowing down the main incremental snapshot pipeline.

```mermaid
stateDiagram-v2
direction LR
[*] --> ReadManifest

state "Fossil Extractor" as extractor {
GenesisFossil: Historical Genesis
SurvivorFossil: Living Survivor

GenesisFossil --> SortCommits
SortCommits --> FindOldestBlamedLine

SurvivorFossil --> CheckoutHEAD
CheckoutHEAD --> FindOldestStillAlive
}

ReadManifest --> extractor
extractor --> AppendMetadataJSON
```

### Historical (Genesis) Protocol
Repos imported from SVN/Mercurial can have wildly inaccurate committer timestamps. We resolve this by running `git log --all --pretty=format:"%H %at"` to list every commit with its author time, sorting explicitly by that `author-time`, stepping through the oldest `genesis_depth` commits, and extracting the first line of code ever committed to the repo's history, regardless of branch.
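
As a rough sketch of the sorting step, assuming the `git log` output has already been captured as text with one `<hash> <author-time>` pair per line (the helper name `oldest_commits` is hypothetical and not taken from `add_fossils.py`):

```python
def oldest_commits(log_output: str, genesis_depth: int) -> list[str]:
    """Return the hashes of the `genesis_depth` oldest commits by author time."""
    commits = []
    for line in log_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        commit_hash, author_time = parts
        commits.append((int(author_time), commit_hash))
    # Sort by author time; ties fall back to the hash, so the order is stable.
    commits.sort()
    return [commit_hash for _, commit_hash in commits[:genesis_depth]]
```

Sorting on the author timestamp rather than relying on `git log`'s default ordering is what lets the oldest commits surface even when imported history is topologically out of order.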

### Living (Survivor) Protocol
This focuses strictly on the default branch `HEAD`. It recursively blames the latest state of the codebase. Because it's checking `HEAD`, this value frequently moves as old code is finally refactored out.

---

## Data Delivery via Vanilla UI (`app.js`)

The UI is intentionally lightweight. We avoided heavy React or bundle-chain systems to ensure the repository remains simple and easy to fork.

The UI loads `theseus.config.json` via the browser Fetch API, builds out a repository selection grid dynamically, and upon clicking a card, pulls the corresponding static `data/{repo}_data.json`.

```mermaid
sequenceDiagram
participant Browser
participant app.js
participant config.json
participant data.json

Browser->>app.js: Load index.html
app.js->>config.json: fetch("theseus.config.json")
config.json-->>app.js: Returns [{repo1}, {repo2}]
app.js->>Browser: Renders UI Selection Grid

Browser->>app.js: Click Repo 1
app.js->>Browser: Shows CSS Skeleton Loader Overlay
app.js->>data.json: fetch("data/repo1_data.json")
data.json-->>app.js: Loads Timeseries + Fossils
app.js->>Browser: Computes D3/DOM Chart & Hides Skeleton
```
52 changes: 52 additions & 0 deletions docs/CONFIGURATION.md
@@ -0,0 +1,52 @@
# ⚙️ Configuration Guide

The Ship of Theseus engine operates centrally off a single file: `theseus.config.json`. By modifying this file, you instruct both the Python backend and the JavaScript frontend on which repositories to scrape and display.

## Base Schema (`theseus.config.json`)

```json
{
"$schema": "./schema.json",
"dataDir": "./data",
"repositories": [
{
"name": "react",
"repo": "facebook/react",
"displayName": "React",
"description": "A JavaScript library for building user interfaces"
}
]
}
```

### Global Settings

* `dataDir` *(string)*: The relative path to the directory where the engine saves its output JSON files. Usually `"./data"`. The JavaScript frontend reads the same setting, so it must be accurate for the UI to know where to fetch data.

### Repositories Array

The `repositories` array takes objects consisting of the following key attributes:

| Key | Type | Description | Example |
| :--- | :---: | :--- | :--- |
| `name` | *String* | A safe, unique identifier. Used for the JSON filename (`{name}_data.json`). Must be snake_case or kebab-case. | `"django"` |
| `repo` | *String* | The GitHub repository namespace (the URL ending). The engine automatically strips trailing slashes and resolves this to `https://github.com/namespace/repo.git`. | `"django/django"` |
| `displayName` | *String* | The aesthetic name rendered on UI Cards. | `"Django"` |
| `description` | *String* | A short UI subheading clarifying what the project is. | `"The web framework for perfectionists with deadlines."` |
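
To make the `name` and `repo` rules concrete, here is a small sketch of how the two values could be turned into a clone URL and an output path. The helper names are hypothetical; only the URL shape and the `{name}_data.json` pattern come from the table above.

```python
from pathlib import Path


def resolve_clone_url(repo: str) -> str:
    """Turn "namespace/repo" (trailing slashes tolerated) into a clone URL."""
    namespace = repo.strip().rstrip("/")
    return f"https://github.com/{namespace}.git"


def data_file_for(data_dir: str, name: str) -> Path:
    """Build the output path for a repository entry: {dataDir}/{name}_data.json."""
    return Path(data_dir) / f"{name}_data.json"
```

For the `django` row, `resolve_clone_url("django/django/")` yields `https://github.com/django/django.git`, and `data_file_for("./data", "django")` points at `django_data.json` inside the data directory.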

---

## Modifying Configurations

### Adding a new target
To begin visualizing a new repository, append it to the `repositories` array.

1. Add your object to `theseus.config.json`
2. Locally run `poetry run python scripts/analyse_repository.py`
3. The engine will clone the repo into `./temp_repos/` (which can be over `1GB` for massive codebases, so ensure disk space).
4. Local data processing will generate `data/{your_repo}_data.json`.
5. Run `poetry run python scripts/add_fossils.py` to fill in the Genesis/Survivor line references.
6. Check your `index.html` file to see the newly generated visual graph!

> [!CAUTION]
> Avoid modifying the output data within `data/` manually. Doing so will corrupt the incremental snapshot logic, forcing the pipeline to wipe out the cache and restart checking out massive commit trees from scratch.
42 changes: 42 additions & 0 deletions docs/DEVOPS.md
@@ -0,0 +1,42 @@
# 🤖 DevOps & CI/CD Pipeline

The Ship of Theseus engine doesn't just run once, because codebases never stop evolving. The system relies entirely on GitHub Actions to provide zero-maintenance "monthly pulses" that autonomously update the data stored in the repository.

## The Automation Engine (`.github/workflows/theseus-engine.yml`)

The primary workflow generates the JSON snapshot objects incrementally every month, detects any changes, and pushes them back to the repository's `data/` directory.

```mermaid
journey
title Monthly Pipeline Action execution
section Bootstrapping
Clone Primary Repository: 5: GitHub Actions
Read Config File: 5: Python
Checkout Specific Repositories: 3: Python
section Analysis
Perform Incremental Snapshot: 4: Python (analyse_repository.py)
Blame Lines / Evaluate Entropy: 4: Python
Update Survivor Fossils: 3: Python (add_fossils.py)
section Persistence
Clean & Minify data payloads: 5: Python (cleanup_data.py)
Commit Diff Data: 5: git config user.name "github-actions[bot]"
Push JSON to Origin: 5: GitHub Actions
```

### 1. `analyse_repository.py` Trigger
The analyzer reads `theseus.config.json` and the local `data/` cache. Because `analyse_repository.py` is fully incremental, it reads the last `snapshot_date` (e.g. `"2025-02"`) from the JSON, compares it with the current calendar month (e.g. `2025-05`), and works out that it needs to check out the repositories at `2025-03`, `2025-04`, and `2025-05` to catch up. These checkouts run locally within the GitHub Actions runner.
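
The catch-up calculation can be illustrated with a short sketch. It assumes snapshot dates are `YYYY-MM` strings, as in the example above; the function name `months_to_catch_up` is hypothetical and not taken from `analyse_repository.py`.

```python
def months_to_catch_up(last_snapshot: str, current: str) -> list[str]:
    """List every YYYY-MM after `last_snapshot`, up to and including `current`."""
    year, month = map(int, last_snapshot.split("-"))
    end_year, end_month = map(int, current.split("-"))
    pending = []
    while (year, month) < (end_year, end_month):
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
        pending.append(f"{year:04d}-{month:02d}")
    return pending
```

With the values from the paragraph above, `months_to_catch_up("2025-02", "2025-05")` returns `["2025-03", "2025-04", "2025-05"]`, and an up-to-date data file yields an empty list, which corresponds to the "Skip: Up to Date" branch.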

### 2. `add_fossils.py --update-survivor` Trigger
Genesis fossils rarely change unless a codebase undergoes an extreme edge-case rewrite of its absolute first commit history. The UI primarily benefits from tracking the *"Living Fossil"*.

To save processing time within CI limits, the Action only runs `add_fossils.py` with the `--update-survivor` flag. This skips the commit sorting needed for Genesis creation entirely and simply updates the `view_commit` tip to track code changes.

### 3. File Re-commit Handling
Finally, the action checks whether the new snapshots or the updated survivor fossil actually produced a diff against the origin.

If `git status` shows modifications to the JSON payloads inside `data/`, the GitHub Actions bot commits the payloads and pushes them to `main`. This allows the repository to act as its own self-updating data store.
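
The change check can be sketched as below. This is an assumption about the approach rather than the workflow's actual step: it parses the text produced by `git status --porcelain`, and the helper name `changed_data_files` is hypothetical.

```python
def changed_data_files(status_output: str, data_dir: str = "data/") -> list[str]:
    """Return paths under `data_dir` listed in `git status --porcelain` output."""
    changed = []
    for line in status_output.splitlines():
        # Porcelain v1 lines are two status characters, a space, then the path.
        path = line[3:]
        if path.startswith(data_dir):
            changed.append(path)
    return changed
```

A commit step would then run only when this list is non-empty, so months without new data do not create empty commits.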

---

> [!TIP]
> Ensure the Action is allowed Write permissions in the repository settings: `Settings -> Actions -> General -> Workflow permissions -> Read and write permissions`. Otherwise, the bot's push will be rejected with `HTTP 403` and the data will not be updated.
32 changes: 27 additions & 5 deletions scripts/cleanup_data.py
@@ -6,19 +6,25 @@
 from pathlib import Path
 
 
-def cleanup_data(data_dir: str):
+def cleanup_data(data_dir: str) -> bool:
     """
     Cleans up all JSON data files in the specified directory.
     - Removes 'total_lines' (redundant)
     - Removes future-year keys in 'composition'
     - Minifies output
+    Returns True if an error occurred, False otherwise.
     """
     data_path = Path(data_dir)
+    if not data_path.exists() or not data_path.is_dir():
+        print(f"Data directory not found or not a directory: {data_dir}")
+        return True
+
     json_files = list(data_path.glob("*.json"))
+    had_failures = False
 
     if not json_files:
         print(f"No JSON files found in {data_dir}")
-        return
+        return had_failures
 
     for json_file in json_files:
         if json_file.name == "manifest.json":
@@ -43,7 +49,7 @@ def cleanup_data(data_dir: str):
max_year = int(snapshot_date[:4])
composition = snapshot.get("composition", {})
keys_to_remove = [
year for year in composition.keys() if int(year) > max_year
year for year in composition.keys() if int(year) > max_year
]
for key in keys_to_remove:
del composition[key]
@@ -60,8 +66,24 @@ def cleanup_data(data_dir: str):
 
         except Exception as e:
             print(f" Error processing {json_file.name}: {e}")
+            had_failures = True
 
+    return had_failures
 
+def main():
+    import sys
+    config_path = "theseus.config.json"
+    if not Path(config_path).exists():
+        print(f"Configuration file not found: {config_path}")
+        sys.exit(1)
+
+    with open(config_path, "r", encoding="utf-8") as f:
+        config = json.load(f)
+
+    data_dir = config.get("dataDir", "./data")
+    if cleanup_data(data_dir):
+        print("One or more files failed to clean up. Exiting non-zero.")
+        sys.exit(1)
+
 if __name__ == "__main__":
-    DATA_DIR = "./data"
-    cleanup_data(DATA_DIR)
+    main()