UPDATE Documentation by --resume case for long-running init operations #113

@kibarik

Description

Checkpoint/Resume for long-running init operations

Problem

When running repowise init on large repositories, the process can take many hours. If the process is interrupted (timeout, crash, user termination, network issues), all progress is lost and the init must restart from scratch.

My experience:

  • Repository: ~553 pages to generate
  • Process ran for 8+ hours (416/553 pages completed)
  • Process timed out after ~11,218 seconds
  • Result: all 416 generated pages were lost; the SQL database showed 0 pages

The pages existed in LanceDB (_transactions/, _versions/, data/ folders with 416 fragments) but were never committed to SQL and could not be recovered via standard commands. I had to write a Python script by hand to extract the data from LanceDB and import it into SQL.

I'm frustrated when I spend 8 hours waiting for indexing only to lose everything because of a timeout or interruption.

Proposed Solution

1. Periodic checkpoints - Commit to SQL database every N pages (configurable)

repowise init . --checkpoint-interval 50  # Save every 50 pages
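A minimal sketch of what the checkpointing loop could look like. All table, column, and function names here are hypothetical; repowise's internals are assumed, not known, and the page-generation step is stubbed out:

```python
import sqlite3

def generate_pages_with_checkpoints(pages, db_path, checkpoint_interval=50):
    """Commit generated pages to SQL every `checkpoint_interval` pages.

    `pages` is an iterable of (page_id, content) tuples; in repowise the
    content would come from the LLM provider. Everything committed before
    a crash survives it.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wiki_pages (page_id TEXT PRIMARY KEY, content TEXT)"
    )
    pending = 0
    for page_id, content in pages:
        conn.execute(
            "INSERT OR REPLACE INTO wiki_pages (page_id, content) VALUES (?, ?)",
            (page_id, content),
        )
        pending += 1
        if pending >= checkpoint_interval:
            conn.commit()  # checkpoint: progress up to here is durable
            pending = 0
    conn.commit()  # final commit for the remainder
    conn.close()
```

With interval 50, an interruption at page 416 would cost at most the last 16 pages instead of all 416.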

2. Auto-resume on re-run - repowise init should detect partial state and continue:

$ repowise init .
Resuming from checkpoint: 416/553 pages already generated
Generating pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 137/553
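The resume check could be as simple as diffing the planned page list against what is already committed. A sketch under the same assumed schema as above (names hypothetical):

```python
import sqlite3

def pages_to_generate(all_page_ids, db_path):
    """Return the page ids still missing from the SQL database.

    On a re-run, repowise init would call this first and print the
    "Resuming from checkpoint: N/M pages already generated" line.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wiki_pages (page_id TEXT PRIMARY KEY, content TEXT)"
    )
    done = {row[0] for row in conn.execute("SELECT page_id FROM wiki_pages")}
    conn.close()
    remaining = [p for p in all_page_ids if p not in done]
    print(f"Resuming from checkpoint: {len(done)}/{len(all_page_ids)} pages already generated")
    return remaining
```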

3. Graceful shutdown - SIGTERM/SIGINT should trigger final checkpoint before exit
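The graceful-shutdown part is standard signal handling: flip a flag on SIGTERM/SIGINT, finish the in-flight page, checkpoint, exit. A sketch (the loop body and checkpoint callback are stand-ins for repowise's generation and SQL commit):

```python
import signal

class GracefulShutdown:
    """Flip a flag on SIGTERM/SIGINT so the main loop can checkpoint and exit."""

    def __init__(self):
        self.stop = False
        signal.signal(signal.SIGTERM, self._handle)
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        self.stop = True  # don't exit here; let the loop finish its current page

def run(pages, shutdown, checkpoint):
    done = []
    for page in pages:
        if shutdown.stop:
            break
        done.append(page)  # stand-in for generating one page
    checkpoint(done)  # final checkpoint runs on both normal and early exit
    return done
```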

4. Recovery command (optional, for cases where checkpoint failed):

repowise recover .  # Scan LanceDB and import uncommitted pages to SQL
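The import half of such a recover command is straightforward. A sketch, assuming the `rows` have already been read out of LanceDB (e.g. with the lancedb client; that step is omitted here) and using the same hypothetical schema as above. INSERT OR IGNORE keeps pages that are already committed untouched:

```python
import sqlite3

def import_recovered_pages(rows, db_path):
    """Import (page_id, content) rows recovered from LanceDB into SQL.

    Returns the number of newly imported pages; rows whose page_id is
    already present in SQL are skipped, not overwritten.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wiki_pages (page_id TEXT PRIMARY KEY, content TEXT)"
    )
    cur = conn.executemany(
        "INSERT OR IGNORE INTO wiki_pages (page_id, content) VALUES (?, ?)", rows
    )
    imported = cur.rowcount  # OR IGNOREd duplicates are not counted
    conn.commit()
    conn.close()
    return imported
```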

Alternatives Considered

Workaround I used:

  1. Manually extracted data from .repowise/lancedb/wiki_pages.lance/ using Python + lancedb
  2. Created state.json manually with page count
  3. Inserted pages into SQL via sqlite3
  4. Rebuilt FTS index manually
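Steps 2-4 of the workaround above can be sketched with the standard library alone. This assumes an FTS5-enabled SQLite build and hypothetical table names and state.json layout; it is what a recover command could do after importing the pages:

```python
import json
import sqlite3

def rebuild_fts_and_state(db_path, state_path):
    """Re-derive the FTS index and state.json from the pages table."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS wiki_pages_fts")
    conn.execute("CREATE VIRTUAL TABLE wiki_pages_fts USING fts5(page_id, content)")
    conn.execute(
        "INSERT INTO wiki_pages_fts (page_id, content) "
        "SELECT page_id, content FROM wiki_pages"
    )
    (count,) = conn.execute("SELECT COUNT(*) FROM wiki_pages").fetchone()
    conn.commit()
    conn.close()
    with open(state_path, "w") as f:
        json.dump({"pages_generated": count}, f)
    return count
```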

This is too complex for average users.

Other approaches:

  • External process manager (systemd, supervisord) - doesn't help; the data loss occurs inside repowise itself
  • Smaller batch sizes - still loses all progress if interrupted mid-batch
  • repowise update instead of init - only works for git changes, not for resume after crash

Additional Context

Environment:

  • repowise: latest
  • Repository size: 553 pages
  • Runtime before interruption: ~8 hours
  • Provider: litellm with zai/glm-5

Why this matters:
Large codebases (1000+ files, enterprise repos) are common. For these users, repowise is unusable in production environments where long-running processes may be interrupted. A single timeout wastes hours of compute time and API costs.

Relevant code locations:

  • generation_jobs table tracks progress but doesn't seem to implement checkpoints
  • LanceDB has uncommitted transactions in _transactions/ folder
  • repowise doctor detects "Coordinator drift" but can't fix it automatically

Metadata

Labels: enhancement (New feature or request)
Assignees: none
Milestone: none