perform documentation #21
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,78 @@ | ||
| # Theseus | ||
| Does a codebase remain the same if every line is replaced? A monthly pulse on software entropy. | ||
| <div align="center"> | ||
| <img src="assets/theseus_favicon.png" alt="Ship of Theseus Logo" width="120" /> | ||
| <h1>Ship of Theseus</h1> | ||
| <p><i>Does a codebase remain the same if every line is replaced? A monthly pulse on software entropy.</i></p> | ||
| <p> | ||
| <img src="https://img.shields.io/badge/Python-3.12%2B-blue?style=flat&logo=python&logoColor=white" alt="Python 3.12+" /> | ||
| <img src="https://img.shields.io/badge/Vanilla_JS-ES6-F7DF1E?style=flat&logo=javascript&logoColor=black" alt="Vanilla JS" /> | ||
| <img src="https://img.shields.io/badge/Deployed-GitHub_Pages-2EA44F?style=flat&logo=github" alt="Deployed on GitHub Pages" /> | ||
| <img src="https://img.shields.io/badge/GitHub_Actions-Automated-2088FF?style=flat&logo=github-actions&logoColor=white" alt="GitHub Actions" /> | ||
| <img src="https://img.shields.io/badge/Code_Style-Black-000000?style=flat&logo=python&logoColor=white" alt="Black Style" /> | ||
| </p> | ||
| </div> | ||
|
|
||
| --- | ||
|
|
||
| ## 📖 The Philosophy: Why This Project Matters | ||
|
|
||
| The **Ship of Theseus** is a famous thought experiment: if you replace every wooden plank on a ship over time, is it still the same ship? | ||
|
|
||
| This exact paradox plays out daily in modern software engineering. Repositories live for years, or even decades. The developers who started them leave, entire architectural paradigms shift, and eventually, the very last line of original code is overwritten. Yet, the repository retains its name, its URL, and its identity. | ||
|
|
||
| **This project exists to visualize that journey.** It pulls back the curtain on repository decay and renewal by measuring *codebase entropy*: tracking when lines of code were written and how long they survive before being rewritten, which shows you the "age" of a massive software project at a glance. | ||
|
|
||
| ### Why People Care About This | ||
| 1. **Repository Health & Churn Visibility:** Open-source maintainers and engineering managers can visually assess how quickly a codebase is turning over. Is the core architecture stable (lots of old code), or is it undergoing a frantic rewrite? | ||
| 2. **Identifying Key Surviving Code:** By identifying "Historical" and "Living" fossils, this project highlights the original architectural foundation blocks that have stood the test of time (and edge-cases). | ||
| 3. **Data-Driven Storytelling:** It acts as a historical lens for famous open-source projects, allowing developers to see how massive frameworks (like React or Django) have evolved through different eras. | ||
|
|
||
| ## QuickStart Guide | ||
|
|
||
| ### 1. Requirements | ||
| * `git` | ||
| * `python` >= 3.12 | ||
| * `poetry` (for dependency management) | ||
|
|
||
| ### 2. Installation | ||
| ```bash | ||
| git clone https://github.com/Asifdotexe/Theseus.git | ||
| cd Theseus | ||
| poetry install | ||
| ``` | ||
|
|
||
| ### 3. Running the Engine Locally | ||
| The analytical engine is driven through the centralized `theseus.config.json` configuration file. | ||
| You can run the full timeline snapshot engine: | ||
| ```bash | ||
| poetry run python scripts/analyse_repository.py | ||
| ``` | ||
|
|
||
| To backfill or incrementally update the "Fossil" pointers (the absolute oldest lines of code): | ||
| ```bash | ||
| poetry run python scripts/add_fossils.py --update-survivor | ||
| ``` | ||
|
|
||
| ### 4. Viewing the Interactive Chart | ||
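| Open `index.html` in a modern browser. Because the UI loads its data with `fetch`, some browsers block those requests on `file://` pages; if the chart stays empty, serve the folder over a local HTTP server instead. | ||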
| ```bash | ||
| # On Mac | ||
| open index.html | ||
|
|
||
| # On Windows | ||
| start index.html | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Dive Deeper (Documentation) | ||
|
|
||
| The technical internals of the Ship of Theseus engine are separated into structured documentation guides: | ||
|
|
||
| - **[Architecture & The Data Pipeline](docs/ARCHITECTURE.md):** How we traverse `git` histories incrementally and capture "Fossils". | ||
| - **[Configuration Guide](docs/CONFIGURATION.md):** How to plug in your own repositories by editing `theseus.config.json`. | ||
| - **[DevOps & CI/CD](docs/DEVOPS.md):** How the system updates itself autonomously via GitHub Actions. | ||
|
|
||
| --- | ||
|
|
||
| ## License | ||
| This project is open-source and available under the terms defined in the `LICENSE` file. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| # Architecture & Internals | ||
|
|
||
| The Ship of Theseus engine is composed of a disconnected backend (data generator) and frontend (UI visualizer). They communicate entirely via an intermediary static JSON format. | ||
|
|
||
| This architecture allows the system to remain highly secure, completely serverless, and free to host using static GitHub Pages. (woohoo, who doesn't like free things?) | ||
|
|
||
| --- | ||
|
|
||
| ## The Data Pipeline (`analyse_repository.py`) | ||
|
|
||
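| The heart of the application is a Python script that orchestrates `git` shell commands. Because `git` is heavily optimized in C, shelling out to the native binary keeps history traversal and blame fast without adding a wrapper dependency such as `GitPython` (which itself invokes the `git` binary for most operations) or `pygit2` (Python bindings to the C library `libgit2`, not a pure Python implementation). | ||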
|
|
||
| ### Incremental Snapshot Generation | ||
|
|
||
| To view codebase health *over time*, we need snapshots of the codebase. Instead of re-parsing every commit since the dawn of time, the engine works incrementally. | ||
|
|
||
| ```mermaid | ||
| flowchart TD | ||
| A[Start: read `theseus.config.json`] --> B{Has Data File?} | ||
| B -- No --> C[Full Clone & 1st Commit] | ||
| B -- Yes --> D[Look at Last Snapshot Date] | ||
| D --> E{"Is Last Snapshot < Current Month?"} | ||
| E -- Yes --> F[Clone & Jump to Next Month] | ||
| E -- No --> G[Skip: Up to Date] | ||
| C --> H[Run Git Blame Parallel] | ||
| F --> H | ||
| H --> I[Count Lines by Authorship Year] | ||
| I --> J[Append Snapshot to JSON] | ||
| ``` | ||
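The "Jump to Next Month" step needs some way to pick a commit for a given month. The source does not say how the engine does this, so the following is only a sketch under one assumption: the snapshot for a month is the last commit made before the first day of the following month. The function name `snapshot_commit_command` is hypothetical; `git rev-list -n 1 --before=<date>` is a standard git invocation.

```python
from datetime import date


def snapshot_commit_command(month: str, branch: str = "HEAD") -> list[str]:
    """Build a git command that finds the last commit of a 'YYYY-MM' month.

    Assumption: a month's snapshot is the newest commit made before the
    first day of the following month.
    """
    year, mon = (int(part) for part in month.split("-"))
    # First day of the following month, used as an exclusive upper bound.
    cutoff = date(year + mon // 12, mon % 12 + 1, 1)
    # An explicit time of day is given so git does not fill in the current one.
    return [
        "git", "rev-list", "-n", "1",
        f"--before={cutoff.isoformat()} 00:00:00",
        branch,
    ]
```

The returned list can be passed to `subprocess.run(..., cwd=repo_dir, capture_output=True, text=True)`; the commit hash, if any, is on standard output.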
|
|
||
| ### The `git blame` Parallelization | ||
|
|
||
| When checking out a specific month's commit, the system needs to `git blame` every single valid file in the repository. | ||
|
|
||
| 1. **Ls-Files Filter:** We run `git ls-files` to list the tracked files and keep only the text files (binary files are skipped). | ||
| 2. **ThreadPool Executor:** The script fires off multiple parallel workers to run `git blame --line-porcelain` concurrently across CPUs. | ||
| 3. **Regex Extraction:** It rips the UNIX timestamps out of the porcelain format and bins them into "years". | ||
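The three steps above can be sketched roughly as follows. This is not the project's actual code: the function names are hypothetical, and it assumes the engine bins on the `author-time` header that `git blame --line-porcelain` emits for every line (the source only says "UNIX timestamps").

```python
import re
import subprocess
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone

# --line-porcelain repeats the full header, including author-time, per line.
AUTHOR_TIME = re.compile(r"^author-time (\d+)$", re.MULTILINE)


def count_lines_by_year(porcelain: str) -> Counter:
    """Bin every blamed line into the (UTC) year of its author-time."""
    years = Counter()
    for match in AUTHOR_TIME.finditer(porcelain):
        stamp = int(match.group(1))
        years[datetime.fromtimestamp(stamp, tz=timezone.utc).year] += 1
    return years


def blame_file(repo_dir: str, path: str) -> Counter:
    """Run git blame on one file; a failed blame yields an empty count."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", "--", path],
        cwd=repo_dir, capture_output=True, text=True, errors="replace",
    )
    return count_lines_by_year(result.stdout) if result.returncode == 0 else Counter()


def blame_repository(repo_dir: str, paths: list[str], workers: int = 8) -> Counter:
    """Blame many files concurrently and merge the per-file counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for counts in pool.map(lambda p: blame_file(repo_dir, p), paths):
            total.update(counts)
    return total
```

Threads are sufficient here because each worker spends its time waiting on a `git` subprocess rather than running Python code.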
|
|
||
| --- | ||
|
|
||
| ## The Fossil Extraction (`add_fossils.py`) | ||
|
|
||
| Fossils are pointers to specific, historically significant lines of code that serve as fun easter-eggs for the UI. They are evaluated completely independently to prevent slowing down the main incremental snapshot pipeline. | ||
|
|
||
| ```mermaid | ||
| stateDiagram-v2 | ||
| direction LR | ||
| [*] --> ReadManifest | ||
|
|
||
| state "Fossil Extractor" as extractor { | ||
| GenesisFossil: Historical Genesis | ||
| SurvivorFossil: Living Survivor | ||
|
|
||
| GenesisFossil --> SortCommits | ||
| SortCommits --> FindOldestBlamedLine | ||
|
|
||
| SurvivorFossil --> CheckoutHEAD | ||
| CheckoutHEAD --> FindOldestStillAlive | ||
| } | ||
|
|
||
| ReadManifest --> extractor | ||
| extractor --> AppendMetadataJSON | ||
| ``` | ||
|
|
||
| ### Historical (Genesis) Protocol | ||
| Repos imported from SVN/Mercurial can have wildly inaccurate committer timestamps. We resolve this by running `git log --all --pretty=format:"%H %at"` to list every commit with its author time, sorting explicitly by that `author-time`, stepping through the oldest `genesis_depth` commits, and extracting the first line of code ever pushed to the repo's history, regardless of branch. | ||
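As an illustration of the sorting step, here is a minimal sketch that takes the text produced by that `git log` command and returns the oldest commits by author time. The function name `oldest_commits` is hypothetical, and the real script may handle ties or malformed lines differently.

```python
def oldest_commits(log_output: str, genesis_depth: int) -> list[str]:
    """Return the `genesis_depth` oldest commit hashes, oldest first.

    Expects one "<hash> <author-time>" pair per line, as printed by
    git log --all --pretty=format:"%H %at".
    """
    commits = []
    for line in log_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        sha, author_time = parts
        commits.append((int(author_time), sha))
    commits.sort()  # ascending author time; ties fall back to hash order
    return [sha for _, sha in commits[:genesis_depth]]
```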
|
|
||
| ### Living (Survivor) Protocol | ||
| This focuses strictly on the default branch `HEAD`. It recursively blames the latest state of the codebase. Because it's checking `HEAD`, this value frequently moves as old code is finally refactored out. | ||
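One way to picture the survivor search is as a minimum over the blame output of `HEAD`. The sketch below scans `git blame --line-porcelain` text for a single file and returns the line with the earliest `author-time`. The names `BlamedLine` and `oldest_line` are hypothetical, and the real script may compare lines on other criteria.

```python
import re
from typing import NamedTuple, Optional

# Each blamed line starts with "<40-hex sha> <orig line> <final line> ...".
HEADER = re.compile(r"^([0-9a-f]{40}) \d+ (\d+)")


class BlamedLine(NamedTuple):
    author_time: int
    commit: str
    line_number: int
    text: str


def oldest_line(porcelain: str) -> Optional[BlamedLine]:
    """Return the blamed line with the smallest author-time, if any."""
    oldest = None
    commit, line_number, author_time = "", 0, 0
    for raw in porcelain.split("\n"):
        if raw.startswith("\t"):
            # A tab-prefixed line is the source text closing one blame entry.
            candidate = BlamedLine(author_time, commit, line_number, raw[1:])
            if oldest is None or candidate.author_time < oldest.author_time:
                oldest = candidate
        elif raw.startswith("author-time "):
            author_time = int(raw.split()[1])
        else:
            match = HEADER.match(raw)
            if match:
                commit, line_number = match.group(1), int(match.group(2))
    return oldest
```

Running this per file and keeping the overall minimum gives a repository-wide candidate for the "Living Survivor".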
|
|
||
| --- | ||
|
|
||
| ## Data Delivery via Vanilla UI (`app.js`) | ||
|
|
||
| The UI is intentionally lightweight. We avoided heavy React or bundle-chain systems to ensure the repository remains simple and easy to fork. | ||
|
|
||
| The UI loads `theseus.config.json` via the browser Fetch API, builds a repository selection grid dynamically, and, when a card is clicked, pulls the corresponding static `data/{name}_data.json`. | ||
|
|
||
| ```mermaid | ||
| sequenceDiagram | ||
| participant Browser | ||
| participant app.js | ||
| participant config.json | ||
| participant data.json | ||
|
|
||
| Browser->>app.js: Load index.html | ||
| app.js->>config.json: fetch("theseus.config.json") | ||
| config.json-->>app.js: Returns [{repo1}, {repo2}] | ||
| app.js->>Browser: Renders UI Selection Grid | ||
|
|
||
| Browser->>app.js: Click Repo 1 | ||
| app.js->>Browser: Shows CSS Skeleton Loader Overlay | ||
| app.js->>data.json: fetch("data/repo1_data.json") | ||
| data.json-->>app.js: Loads Timeseries + Fossils | ||
| app.js->>Browser: Computes D3/DOM Chart & Hides Skeleton | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| # ⚙️ Configuration Guide | ||
|
|
||
| The Ship of Theseus engine operates centrally off a single file: `theseus.config.json`. By modifying this file, you instruct both the Python backend and the JavaScript frontend on which repositories to scrape and display. | ||
|
|
||
| ## Base Schema (`theseus.config.json`) | ||
|
|
||
| ```json | ||
| { | ||
| "$schema": "./schema.json", | ||
| "dataDir": "./data", | ||
| "repositories": [ | ||
| { | ||
| "name": "react", | ||
| "repo": "facebook/react", | ||
| "displayName": "React", | ||
| "description": "A JavaScript library for building user interfaces" | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
|
|
||
| ### Global Settings | ||
|
|
||
| * `dataDir` *(string)*: The relative path to the directory where the engine saves its output JSON files, usually `"./data"`. The JavaScript frontend reads the same value, so it must be accurate for the UI to know where to fetch data. | ||
|
|
||
| ### Repositories Array | ||
|
|
||
| The `repositories` array takes objects consisting of the following key attributes: | ||
|
|
||
| | Key | Type | Description | Example | | ||
| | :--- | :---: | :--- | :--- | | ||
| | `name` | *String* | A safe, unique identifier. Used for the JSON filename (`{name}_data.json`). Must be snake_case or kebab-case. | `"django"` | | ||
| | `repo` | *String* | The GitHub repository namespace (the URL ending). The engine automatically strips trailing slashes and resolves this to `https://github.com/namespace/repo.git`. | `"django/django"` | | ||
| | `displayName` | *String* | The aesthetic name rendered on UI Cards. | `"Django"` | | ||
| | `description` | *String* | A short UI subheading clarifying what the project is. | `"The web framework for perfectionists with deadlines."` | | ||
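To make the rules in the table concrete, the sketch below turns one `repositories` entry into a clone URL and an output path. It is illustrative only: `resolve_repository` and the exact `name` pattern are assumptions, not the engine's actual validation.

```python
import re

# Assumed reading of "snake_case or kebab-case": lowercase alphanumeric
# words joined by single underscores or hyphens.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:[_-][a-z0-9]+)*$")


def resolve_repository(entry: dict, data_dir: str = "./data") -> dict:
    """Derive the clone URL and data file path for one config entry."""
    name = entry["name"]
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"name must be snake_case or kebab-case: {name!r}")
    namespace = entry["repo"].rstrip("/")  # strip trailing slashes
    return {
        "clone_url": f"https://github.com/{namespace}.git",
        "data_file": f"{data_dir.rstrip('/')}/{name}_data.json",
    }
```

For the `django` example row, this yields `https://github.com/django/django.git` and `./data/django_data.json`.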
|
|
||
| --- | ||
|
|
||
| ## Modifying Configurations | ||
|
|
||
| ### Adding a new target | ||
| To begin visualizing a new repository, append it to the `repositories` array. | ||
|
|
||
| 1. Add your object to `theseus.config.json` | ||
| 2. Locally run `poetry run python scripts/analyse_repository.py` | ||
| 3. The engine will clone the repo into `./temp_repos/` (which can be over `1GB` for massive codebases, so ensure disk space). | ||
| 4. Local data processing will generate `data/{your_repo}_data.json`. | ||
| 5. Run `poetry run python scripts/add_fossils.py` to fill in the Genesis/Survivor line references. | ||
| 6. Check your `index.html` file to see the newly generated visual graph! | ||
|
|
||
| > [!CAUTION] | ||
| > Avoid modifying the output data within `data/` manually. Doing so will corrupt the incremental snapshot logic, forcing the pipeline to wipe out the cache and restart checking out massive commit trees from scratch. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| # 🤖 DevOps & CI/CD Pipeline | ||
|
|
||
| The Ship of Theseus engine doesn't just run once—codebases never stop evolving. The system relies entirely on GitHub Actions to provide zero-maintenance "monthly pulses" that autonomously update the data output repository. | ||
|
|
||
| ## The Automation Engine (`.github/workflows/theseus-engine.yml`) | ||
|
|
||
| The primary workflow handles generating the JSON snapshot objects incrementally every month, tracking any changes, and pushing them back to the repository data block. | ||
|
|
||
| ```mermaid | ||
| journey | ||
| title Monthly Pipeline Action execution | ||
| section Bootstrapping | ||
| Clone Primary Repository: 5: GitHub Actions | ||
| Read Config File: 5: Python | ||
| Checkout Specific Repositories: 3: Python | ||
| section Analysis | ||
| Perform Incremental Snapshot: 4: Python (analyse_repository.py) | ||
| Blame Lines / Evaluate Entropy: 4: Python | ||
| Update Survivor Fossils: 3: Python (add_fossils.py) | ||
| section Persistence | ||
| Clean & Minify data payloads: 5: Python (cleanup_data.py) | ||
| Commit Diff Data: 5: git config user.name "github-actions[bot]" | ||
| Push JSON to Origin: 5: GitHub Actions | ||
| ``` | ||
|
|
||
|
|
||
| ### 1. `analyse_repository.py` Trigger | ||
| The analyzer reads `theseus.config.json` and the local `data/` cache. Because `analyse_repository.py` is fully incremental, it reads the last recorded `snapshot_date` (e.g. `"2025-02"`) from the JSON, compares it with the current calendar month (e.g. `2025-05`), and works out that it must check out the repository at `2025-03`, `2025-04`, and `2025-05` to catch up. These checkouts run locally inside the GitHub Actions runner. | ||
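The catch-up arithmetic in this example is small enough to sketch directly. The helper name `months_to_process` is hypothetical; it assumes both dates are `YYYY-MM` strings, as in the example.

```python
def months_to_process(last_snapshot: str, current: str) -> list[str]:
    """List every 'YYYY-MM' after `last_snapshot`, up to and including `current`."""
    year, month = map(int, last_snapshot.split("-"))
    end_year, end_month = map(int, current.split("-"))
    months = []
    while (year, month) < (end_year, end_month):
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
        months.append(f"{year:04d}-{month:02d}")
    return months
```

With `last_snapshot="2025-02"` and `current="2025-05"` this returns `["2025-03", "2025-04", "2025-05"]`; when the two are equal it returns an empty list, which corresponds to the "Skip: Up to Date" branch.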
|
|
||
| ### 2. `add_fossils.py --update-survivor` Trigger | ||
| Genesis fossils rarely change unless a codebase undergoes an extreme edge-case rewrite of its absolute first commit history. The UI primarily benefits from tracking the *"Living Fossil"*. | ||
|
|
||
| To save processing time in CI, the Action runs `add_fossils.py` only with the `--update-survivor` flag. This skips the full commit sort used for Genesis detection and simply updates the `view_commit` tip to track code changes. | ||
|
|
||
| ### 3. File Re-commit Handling | ||
| Finally, the action checks if the snapshot array or the survivor fossil commit length actually triggered a diff against the origin. | ||
|
|
||
| If `git status` shows modifications to the JSON payloads inside `data/`, the robotic GitHub Actions bot commits the payload and forces a synchronized write onto `main`. This allows the repository to essentially act as its own self-healing backend Database. | ||
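The check described here can be approximated by filtering the output of `git status --porcelain` for paths under `data/`. The workflow itself may well do this in shell; the Python sketch below, with the hypothetical name `changed_data_files`, only illustrates the idea and ignores renames and quoted paths.

```python
def changed_data_files(status_output: str, data_dir: str = "data/") -> list[str]:
    """Return paths under `data_dir` that `git status --porcelain` reports."""
    changed = []
    for line in status_output.splitlines():
        # Porcelain v1 lines are "XY <path>": two status characters, a space,
        # then the path.
        if len(line) < 4:
            continue
        path = line[3:]
        if path.startswith(data_dir):
            changed.append(path)
    return changed
```

An empty result means there is nothing to commit, so the commit and push steps can be skipped.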
|
|
||
| --- | ||
|
|
||
| > [!TIP] | ||
| > Ensure the Action is allowed Write permissions in the repository settings: `Settings -> Actions -> General -> Workflow permissions -> Read and write permissions`. Otherwise, the robotic commit will result in `HTTP 403` and the pipeline will fail silently. | ||