Checkpoint/Resume for long-running init operations
Problem
When running repowise init on large repositories, the process can take many hours. If the process is interrupted (timeout, crash, user termination, network issues), all progress is lost and the init must restart from scratch.
My experience:
- Repository: ~553 pages to generate
- Process ran for 8+ hours (416/553 pages completed)
- Process timed out after ~11,218 seconds
- Result: All 416 generated pages were lost - the SQL database showed 0 pages
The pages existed in LanceDB (_transactions/, _versions/, data/ folders with 416 fragments) but were not committed to SQL and were not recoverable via standard commands. I had to manually write a Python script to extract data from LanceDB and import it to SQL.
I'm frustrated when I spend 8 hours waiting for indexing only to lose everything because of a timeout or interruption.
Proposed Solution
1. Periodic checkpoints - Commit to SQL database every N pages (configurable)
repowise init . --checkpoint-interval 50 # Save every 50 pages
2. Auto-resume on re-run - repowise init should detect partial state and continue:
$ repowise init .
Resuming from checkpoint: 416/553 pages already generated
Generating pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 137/553
3. Graceful shutdown - SIGTERM/SIGINT should trigger final checkpoint before exit
4. Recovery command (optional, for cases where checkpoint failed):
repowise recover . # Scan LanceDB and import uncommitted pages to SQL
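To make the intent concrete, here is a minimal sketch of how items 1-3 could fit together: commit to SQL every N pages, skip already-committed pages on re-run, and checkpoint on SIGTERM/SIGINT. The `wiki_pages` schema, the page dict shape, and `generate_pages` itself are hypothetical illustrations, not repowise internals.

```python
import signal
import sqlite3

CHECKPOINT_INTERVAL = 50  # hypothetical equivalent of --checkpoint-interval 50

def generate_pages(pages, db_path="state.db"):
    """Generate pages with periodic SQL checkpoints, resume, and graceful shutdown."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wiki_pages (slug TEXT PRIMARY KEY, body TEXT)"
    )
    conn.commit()

    stop_requested = False

    def request_stop(signum, frame):
        # Graceful shutdown: finish the current page, checkpoint, then exit.
        nonlocal stop_requested
        stop_requested = True

    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)

    # Auto-resume: skip pages an earlier (interrupted) run already committed.
    done = {row[0] for row in conn.execute("SELECT slug FROM wiki_pages")}
    pending = [p for p in pages if p["slug"] not in done]

    for i, page in enumerate(pending, 1):
        conn.execute(
            "INSERT OR REPLACE INTO wiki_pages VALUES (?, ?)",
            (page["slug"], page["body"]),
        )
        if i % CHECKPOINT_INTERVAL == 0:
            conn.commit()  # periodic checkpoint: this progress survives a crash
        if stop_requested:
            break
    conn.commit()  # final checkpoint before exit
    conn.close()
```

With this shape, an interrupted run loses at most `CHECKPOINT_INTERVAL - 1` pages of work instead of all of them, and re-running the same command picks up where the last commit left off.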
Alternatives Considered
Workaround I used:
- Manually extracted data from .repowise/lancedb/wiki_pages.lance/ using Python + lancedb
- Created state.json manually with page count
- Inserted pages into SQL via sqlite3
- Rebuilt FTS index manually
This is too complex for average users.
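For reference, the SQL half of that workaround can be sketched roughly as below: given pages already pulled out of LanceDB (e.g. via `lancedb.connect(...).open_table(...).to_pandas()`), insert them into SQLite and rebuild a full-text index. The `wiki_pages`/`wiki_fts` schema here is a guess at repowise's layout, not the real one, and FTS5 availability depends on the SQLite build.

```python
import sqlite3

def import_recovered_pages(pages, db_path="wiki.db"):
    """Import pages recovered from LanceDB into SQLite and rebuild the FTS index.

    `pages` is a list of {"slug": ..., "body": ...} dicts (hypothetical schema).
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wiki_pages (slug TEXT PRIMARY KEY, body TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO wiki_pages VALUES (:slug, :body)", pages
    )
    # Rebuild the full-text index from scratch (FTS5 external-content table).
    conn.execute("DROP TABLE IF EXISTS wiki_fts")
    conn.execute(
        "CREATE VIRTUAL TABLE wiki_fts USING fts5(slug, body, content='wiki_pages')"
    )
    conn.execute("INSERT INTO wiki_fts(wiki_fts) VALUES ('rebuild')")
    conn.commit()
    conn.close()
```

Having to reverse-engineer this by hand is exactly what a built-in `repowise recover` would avoid.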
Other approaches:
- External process manager (systemd, supervisord) - doesn't help, data loss occurs within repowise
- Smaller batch sizes - still loses all progress if interrupted mid-batch
- repowise update instead of init - only works for git changes, not for resume after crash
Additional Context
Environment:
- repowise: latest
- Repository size: 553 pages
- Runtime before interruption: ~8 hours
- Provider: litellm with zai/glm-5
Why this matters:
Large codebases (1000+ files, enterprise repos) are common. For these users, repowise is unusable in production environments where long-running processes may be interrupted. A single timeout wastes hours of compute time and API costs.
Relevant code locations:
- generation_jobs table tracks progress but doesn't seem to implement checkpoints
- LanceDB has uncommitted transactions in the _transactions/ folder
- repowise doctor detects "Coordinator drift" but can't fix it automatically