Skip to content

lh0x00/docsifer

Repository files navigation

title Docsifer
emoji 📚
colorFrom indigo
colorTo purple
sdk docker
app_port 7860
pinned false
license mit
short_description Convert documents into clean, LLM-ready Markdown.

📚 Docsifer

Convert documents into clean, LLM-ready Markdown.

PDF · Word · PowerPoint · Excel · HTML · Audio · Image · CSV · JSON · ZIP

Python FastAPI Docker License


Features

  • Multi-format — PDF, Office, audio (Whisper), images (vision), HTML, CSV, JSON, ZIP, and more — powered by MarkItDown.
  • Optional LLM — bring your own OpenAI-compatible key for OCR, transcription and structured layout extraction.
  • Production-grade — bounded concurrency, per-IP fairness, memory watchdog, disk cleanup, circuit breaker.
  • Hardened — SSRF guard, path-traversal sanitation, body-size limits, security headers.
  • Observable — JSON logs with request id, /v1/healthz, /v1/readyz, /v1/stats.
  • Privacy-first — files are processed in-memory and discarded immediately.

Quickstart

# Local
make install
cp .env.example .env
make run

# Docker
docker build -t docsifer .
docker run --rm -p 7860:7860 --env-file .env docsifer

Open http://localhost:7860 for the UI or http://localhost:7860/docs for the API.

API

Method Path Description
POST /v1/convert Convert a file or URL to Markdown
GET /v1/stats Usage analytics snapshot
GET /v1/healthz Liveness probe
GET /v1/readyz Readiness probe

Examples

Basic conversion:

curl -X POST http://localhost:7860/v1/convert \
     -F "file=@document.pdf"

With LLM enhancement:

curl -X POST http://localhost:7860/v1/convert \
     -F "file=@page.html" \
     -F 'openai={"api_key":"sk-...","model":"gpt-4o-mini"}'

Convert a URL:

curl -X POST http://localhost:7860/v1/convert \
     -F "url=https://example.com/article"

Configuration

All settings are environment-driven (prefix DOCSIFER_). See .env.example for the full list. Common knobs:

Variable Default Purpose
DOCSIFER_MAX_UPLOAD_BYTES 10MB Hard upload limit
DOCSIFER_MAX_CONCURRENT_CONVERSIONS 2 Global parallelism
DOCSIFER_MAX_QUEUE_DEPTH 10 Reject 503 when exceeded
DOCSIFER_MAX_PER_IP_CONCURRENT 1 Per-IP fairness
DOCSIFER_REQUEST_TIMEOUT_SEC 55 Conversion timeout
DOCSIFER_REDIS_URL / _TOKEN local Upstash Redis for analytics
DOCSIFER_URL_ALLOW_PRIVATE_NETWORKS false Disable to block SSRF

Architecture

docsifer/
├── api/         FastAPI layer — routes, schemas, middleware
├── core/        Pure logic — converter, MIME, tokenizer, LLM cache
├── analytics/   Lifespan-managed analytics (Upstash + in-memory)
├── safety/      Anti-crash primitives (gate, limiter, watchdog, breaker)
├── ui/          Optional Gradio playground
├── config.py    Pydantic settings
├── exceptions.py
├── logging_config.py
└── main.py      App factory + lifespan

See ARCHITECTURE.md for the full design notes.

Development

make install   # runtime + dev deps
make lint      # ruff
make format    # ruff format
make type      # mypy
make test      # pytest
make cov       # pytest with coverage

License

MIT © Lam Hieu — built on top of the wonderful MarkItDown.

About

Docsifer is a powerful tool for converting various data formats into Markdown for applications such as indexing, text analysis, and more. It supports PDF, PowerPoint, Word, Excel, Images, Audio, HTML, and other text-based formats, and leverages LLMs to enhance performance.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors