Cross-platform local archive toolkit for mapping large personal file collections, detecting duplicate candidates, extracting text/transcripts, and building a Markdown knowledge base.
The toolkit is intentionally script-first. An AI agent may help maintain it, but the repeatable work is done by PowerShell 7 scripts with validation, logs, dry-runs, and review reports.
- Source files are read-only during stages 01 through 08.
- Generated files go under the configured output directory.
- Delete and move actions are isolated in 09-ApplyReviewedActions.ps1.
- The action script refuses to run without an approved manifest.
- Deletion is disabled unless both config and command-line flags allow it.
- LLM output is never trusted for filesystem actions.
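The two-key rule for deletion can be sketched as a small guard function. This is an illustration of the rules listed above, not the code in 09-ApplyReviewedActions.ps1: the function name `Test-DeletionAllowed`, the config property `allowDelete`, and the `-AllowDelete` switch are all hypothetical.

```powershell
# Illustrative sketch only: every name here is hypothetical, not taken from the toolkit.
function Test-DeletionAllowed {
    param(
        [Parameter(Mandatory)] [pscustomobject] $Config,  # parsed pipeline config
        [switch] $AllowDelete,                            # hypothetical command-line flag
        [string] $ManifestPath                            # path to the approved manifest
    )

    # Without an approved manifest file, nothing is allowed.
    if ([string]::IsNullOrWhiteSpace($ManifestPath)) { return $false }
    if (-not (Test-Path -LiteralPath $ManifestPath -PathType Leaf)) { return $false }

    # Deletion needs both keys: the config setting and the explicit flag.
    $configAllows = $Config.PSObject.Properties['allowDelete'] -and ($Config.allowDelete -eq $true)
    return [bool]($configAllows -and $AllowDelete.IsPresent)
}
```

Requiring both the config value and the flag means a stale config or a mistyped command line cannot enable deletion on its own.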
Install PowerShell 7 first, then copy config/pipeline.example.json to a working config and edit its archive roots.
pwsh ./scripts/01-Inventory.ps1 -ConfigPath ./config/pipeline.example.json -DryRun
pwsh ./scripts/01-Inventory.ps1 -ConfigPath ./config/pipeline.example.json
pwsh ./scripts/02-Metadata.ps1 -ConfigPath ./config/pipeline.example.json
pwsh ./scripts/03-Dedupe.ps1 -ConfigPath ./config/pipeline.example.json
pwsh ./scripts/04-ExtractText.ps1 -ConfigPath ./config/pipeline.example.json
pwsh ./scripts/07-BuildKnowledgeBase.ps1 -ConfigPath ./config/pipeline.example.json
pwsh ./scripts/08-ReviewReports.ps1 -ConfigPath ./config/pipeline.example.json

Run against the included fixtures first:
pwsh ./tests/Invoke-StaticChecks.ps1
pwsh ./scripts/01-Inventory.ps1 -RootPath ./tests/fixtures/source -OutputPath ./outputs -DryRun
pwsh ./scripts/01-Inventory.ps1 -RootPath ./tests/fixtures/source -OutputPath ./outputs

Generated outputs:
- outputs/inventory/inventory.csv
- outputs/inventory/file-errors.csv
- outputs/metadata/metadata.csv
- outputs/reports/exact-duplicates.csv
- outputs/reports/near-duplicates.csv (when near-duplicate detection is enabled)
- outputs/reports/near-duplicate-status.csv
- outputs/extracted/*.md
- outputs/transcripts/*.md
- outputs/classification/classification.csv
- outputs/vault/*.md
- outputs/reports/review-summary.md
- outputs/logs/*.log
The scripts work in layers. Inventory, metadata fallback, exact duplicate detection, simple text extraction, and knowledge-base generation do not require AI.
Optional tools improve coverage:
- ExifTool for richer media/document metadata.
- FFmpeg/ffprobe for audio/video metadata and extraction.
- Czkawka CLI for near-duplicate image/video detection.
- ImageMagick for perceptual hash pre-filtering of similar images.
- Tesseract or PaddleOCR for image OCR.
- MarkItDown or Docling for document conversion.
- whisper.cpp or faster-whisper for transcripts.
- Ollama plus a local model for classification.
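To see which optional tools are available before a run, a quick PATH check along these lines may help. It is a sketch, not part of the toolkit: the function name is hypothetical, and the executable names in the default list are assumptions about typical installs (for example `magick` for ImageMagick and `czkawka_cli` for Czkawka), so adjust them to match your system.

```powershell
# Sketch: report which optional tools can be found on PATH.
# The executable names are assumptions about common installs; edit as needed.
function Get-OptionalToolStatus {
    param(
        [string[]] $Names = @('exiftool', 'ffmpeg', 'ffprobe', 'czkawka_cli',
                              'magick', 'tesseract', 'ollama')
    )

    foreach ($name in $Names) {
        # Look only for real executables, not aliases or functions.
        $cmd = Get-Command -Name $name -CommandType Application -ErrorAction SilentlyContinue |
            Select-Object -First 1

        [pscustomobject]@{
            Tool  = $name
            Found = [bool]$cmd
            Path  = if ($cmd) { $cmd.Source } else { $null }
        }
    }
}

# Usage: Get-OptionalToolStatus | Format-Table
```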
External command templates live in config/pipeline.example.json so tool flags can be adjusted without editing scripts.
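As an illustration of how a command template might be expanded at run time, the sketch below substitutes placeholders in a list of arguments. The real schema of config/pipeline.example.json is not shown here, so the array-of-arguments shape, the `{input}` placeholder, and the function name are all assumptions.

```powershell
# Sketch only: the template shape and the {input} placeholder are assumptions,
# not the toolkit's actual config schema.
function Expand-CommandTemplate {
    param(
        [Parameter(Mandatory)] [string[]]  $Template,  # one entry per argument
        [Parameter(Mandatory)] [hashtable] $Values     # placeholder name -> value
    )

    foreach ($token in $Template) {
        $expanded = $token
        foreach ($key in $Values.Keys) {
            # Replace every literal {key} with its value.
            $expanded = $expanded.Replace('{' + $key + '}', [string]$Values[$key])
        }
        $expanded
    }
}

# Usage with a hypothetical ffprobe template:
# $ffprobeArgs = Expand-CommandTemplate `
#     -Template @('-v', 'error', '-show_format', '-of', 'json', '{input}') `
#     -Values @{ input = './media/my clip.mp4' }
# & ffprobe @ffprobeArgs
```

Keeping each argument as its own array entry avoids re-parsing a single command string, so paths containing spaces are passed through intact.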
If a run fails, check:
- outputs/logs/
- Stage-specific error reports.
- Whether previous-stage CSV files exist.
- Whether the output path is outside the source root.
The scripts run independently, so fix the issue and rerun only the failed stage.
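The output-path check from the list above can be done by hand with a small helper such as the one below. This is a sketch with a hypothetical function name, not the validation the scripts perform. It compares normalized path strings only, so it does not resolve symlinks or junctions, and relative paths are resolved against the process working directory, which can differ from the PowerShell location.

```powershell
# Sketch: returns $true when $Path is the root itself or lies underneath it.
function Test-PathInsideRoot {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Root
    )

    $sep = [System.IO.Path]::DirectorySeparatorChar

    # Normalize both paths and force a trailing separator, so that
    # "/data/archive-outputs" is not mistaken for a child of "/data/archive".
    $fullPath = [System.IO.Path]::GetFullPath($Path).TrimEnd($sep) + $sep
    $fullRoot = [System.IO.Path]::GetFullPath($Root).TrimEnd($sep) + $sep

    # Case-insensitive on every platform: the conservative choice for a safety check.
    return $fullPath.StartsWith($fullRoot, [System.StringComparison]::OrdinalIgnoreCase)
}

# Usage: the output directory is safe when this prints False.
# Test-PathInsideRoot -Path 'D:\archive-outputs' -Root 'D:\archive'
```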
All pipeline scripts follow a consistent exit code contract:
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Proceed to next stage |
| 1 | Fatal error | Check logs, fix issue, rerun this stage |
| 3 | Partial success | Errors occurred but pipeline can continue; check error reports |
The test harness (Invoke-FixturePipeline.ps1) accepts both 0 and 3 as valid exit codes.
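A wrapper that honors this contract could look like the sketch below. It is not the code in Invoke-FixturePipeline.ps1: both function names are hypothetical, and treating any undocumented exit code as fatal is an assumption.

```powershell
# Sketch only: function names are illustrative, not part of the toolkit.
function Get-StageOutcome {
    param([Parameter(Mandatory)] [int] $ExitCode)

    switch ($ExitCode) {
        0       { 'Success' }         # proceed to the next stage
        3       { 'PartialSuccess' }  # continue, but review the error reports
        default { 'Fatal' }           # 1, plus (assumption) any undocumented code
    }
}

function Invoke-StagesInOrder {
    param(
        [Parameter(Mandatory)] [string[]] $ScriptPaths,
        [Parameter(Mandatory)] [string]   $ConfigPath
    )

    foreach ($stagePath in $ScriptPaths) {
        # Run each stage in its own pwsh process so $LASTEXITCODE holds its exit code.
        & pwsh -NoProfile -File $stagePath -ConfigPath $ConfigPath
        $code = $LASTEXITCODE
        $outcome = Get-StageOutcome -ExitCode $code

        if ($outcome -eq 'Fatal') {
            Write-Warning "Stage $stagePath failed with exit code $code. Check outputs/logs/."
            return $false
        }
        if ($outcome -eq 'PartialSuccess') {
            Write-Warning "Stage $stagePath finished with errors. Check the error reports."
        }
    }
    return $true
}

# Usage:
# Invoke-StagesInOrder -ConfigPath './config/pipeline.example.json' `
#     -ScriptPaths './scripts/01-Inventory.ps1', './scripts/02-Metadata.ps1'
```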
This project is licensed under the MIT License.
This project is primarily maintained by an AI agent for personal archive use. Contributions and suggestions are welcome.
Please read CONTRIBUTING.md for guidelines on commits, development setup, and the PR process. All contributors are expected to follow the Code of Conduct.
If you find a security vulnerability, see SECURITY.md for disclosure instructions.