Coverage-aware course notes from lecture slides
Turn PPT/PDF into readable, traceable notes with images, OCR/vision, Lecture-Weave writing, and coverage checks.
Not just a slide summarizer — a faithful study-document pipeline.
git clone https://github.com/Cat-blizzard/SlideNote.git
cd SlideNote
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -e ".[dev,llm]"
python -m slidenote doctor
For text-only AI notes with one LLM key:
$env:DEEPSEEK_API_KEY="..."
python -m slidenote build path\to\lecture.pdf --out outputs\lecture --use-llm --provider deepseek --vision off --figure-crop off
For image-aware notes, add a Qwen/DashScope key. Qwen is the default vision provider:
$env:DASHSCOPE_API_KEY="..."
$env:DEEPSEEK_API_KEY="..."
python -m slidenote build path\to\lecture.pdf --out outputs\lecture --use-llm --provider deepseek
Open outputs\lecture\notes.md. Images are bundled under outputs\lecture\notes.assets\.
- Supports .pptx and .pdf; .ppt is handled by attempting a LibreOffice conversion to PDF.
- Extracts titles, text blocks, tables, embedded images, and slide/page screenshots.
- Classifies each page as native text, mixed, image-only, shape-diagram-like, or decorative to route OCR, vision, and figure cropping.
- Ranks images by study value so vision calls and notes prefer diagrams, charts, figure crops, and high-signal visuals.
- Detects composite figures made from many embedded picture pieces, crops the whole region from the page screenshot, and keeps the pieces as hidden source refs.
- Writes sections.json; with LLM enabled, --section-detection auto can ask the model to refine section boundaries before Lecture-Weave.
- Writes deck_brief.json / deck_brief.md in high-quality Lecture-Weave mode: a global course map used only as navigation, not as a replacement for page-level coverage.
- Writes content_guard.json with page roles, high-confidence learning items, required visible coverage, repair attempts, and residual risks.
- Produces content.json as the source inventory.
- Produces notes.md with hidden source markers by default, plus optional visible page references.
- Produces coverage.json / coverage.md to flag elements that may be missing from the notes.
- Optional exports can generate notes.toc.md, notes.docx, notes.pdf, and notes.tex; Word/PDF/LaTeX require Pandoc.
- Optional vision extraction writes OCR text and visual summaries back into the structured content.
- Optional LLM generation supports OpenAI/ChatGPT, DeepSeek, Qwen, Doubao/Volcengine Ark, GLM, Gemini, and Claude.
- Optional lecture-weave note strategy first generates detailed per-page explanations, then weaves them into coherent sections.
- Configurable note language and term policy: English slides can produce Chinese or English notes, and Chinese notes can preserve key academic English terms.
- Local caching and usage reports make token cost visible and reusable by a future GUI.
SlideNote started from a very personal learning problem.
I have never been the kind of student who learns best by simply listening to lectures. Sometimes I cannot fully follow a teacher's explanation in real time, and I usually learn more efficiently by reading. Reading lets me slow down, go back, skip ahead, and control the pace of understanding by myself.
But lecture slides are not the same as readable notes. After class, reading the PPT directly often feels incomplete: the bullets are fragmented, the logic is implicit, and many important details live in diagrams, screenshots, formulas, or the teacher's spoken explanation. Manually rewriting everything into notes is possible, but it is time-consuming, hard to keep complete, and not always pleasant to revisit later.
So I wanted to build a tool that could turn course slides into structured, readable, traceable notes: not just a summary, but a faithful learning document that preserves images, keeps page references, checks coverage, and helps convert lecture materials into something I can actually study from.
That idea became SlideNote.
SlideNote does not require a local GPU. The system is layered: the local parser can run with only Python dependencies, while LLM rewriting, OCR, and vision extraction require API keys for the providers you choose.
- Python 3.10 or newer.
- A virtual environment is recommended.
- Core Python dependencies are managed by pyproject.toml:
  - python-pptx: parses .pptx structure, text, tables, and embedded images.
  - PyMuPDF: parses .pdf files and renders PDF page screenshots.
  - Pillow: processes, resizes, and saves images.
| Software | Required? | Purpose |
|---|---|---|
| LibreOffice | Recommended | Converts .ppt / .pptx to PDF and enables full-slide screenshots when PowerPoint is unavailable. |
| Microsoft PowerPoint | Optional | On Windows, can export PPTX full-slide screenshots through COM automation. |
| WPS Office | Manual fallback | The current CLI does not automate WPS, but you can manually export a PPT to PDF with WPS and then process the PDF with SlideNote. |
For Windows users without PowerPoint, the recommended path is to install the Windows version of LibreOffice. LibreOffice is not Linux-only. A common installation path is:
C:\Program Files\LibreOffice\program\soffice.exe
The current code looks for soffice or libreoffice on your system PATH. After installation, check:
soffice --version
If the command is not found, add this directory to your Windows PATH:
C:\Program Files\LibreOffice\program
The PowerPoint route requires pywin32:
python -m pip install pywin32
If neither LibreOffice nor PowerPoint is available:
- .pdf files can still be parsed.
- .pptx files can still yield text, tables, and embedded images, but full-slide screenshots may be missing.
- Old .ppt files are usually not handled directly; export them to PDF first with WPS, PowerPoint, or LibreOffice.
LLM, OCR, and vision extraction are optional. You only need API keys for the features you enable.
Common environment variables:
# LLM
$env:OPENAI_API_KEY="..."
$env:DEEPSEEK_API_KEY="..."
$env:DASHSCOPE_API_KEY="..."
$env:ARK_API_KEY="..."
$env:GLM_API_KEY="..."
$env:GEMINI_API_KEY="..."
$env:ANTHROPIC_API_KEY="..."
# OCR
$env:BAIDU_OCR_API_KEY="..."
$env:BAIDU_OCR_SECRET_KEY="..."
$env:MATHPIX_APP_ID="..."
$env:MATHPIX_APP_KEY="..."
$env:GOOGLE_VISION_API_KEY="..."
PowerShell $env:...="..." values only apply to the current terminal session. For regular use, configure them as Windows system environment variables or set them before each run.
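Because session-scoped variables are easy to lose, it can help to confirm from Python which provider keys the current session actually exposes. The following is only a sketch: the variable names are the ones this README lists for each provider, while PROVIDER_KEY_VARS and configured_providers are hypothetical names and are not part of SlideNote (python -m slidenote doctor is the supported check).

```python
import os
from collections.abc import Mapping

# Variable names come from this README's provider table;
# the mapping and helper are illustrative only.
PROVIDER_KEY_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "deepseek": ["DEEPSEEK_API_KEY"],
    "qwen": ["QWEN_API_KEY", "DASHSCOPE_API_KEY"],
    "doubao": ["DOUBAO_API_KEY", "ARK_API_KEY", "VOLCENGINE_API_KEY"],
    "glm": ["GLM_API_KEY", "ZAI_API_KEY", "ZHIPUAI_API_KEY"],
    "gemini": ["GEMINI_API_KEY", "GOOGLE_API_KEY"],
    "claude": ["ANTHROPIC_API_KEY", "CLAUDE_API_KEY"],
}


def configured_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return providers for which at least one key variable is set and non-empty."""
    return [
        name
        for name, variables in PROVIDER_KEY_VARS.items()
        if any(env.get(variable) for variable in variables)
    ]


# Reads the live environment of the current session.
print(configured_providers())
```

Passing a plain dict instead of os.environ makes the helper easy to exercise without touching real credentials.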
| Goal | Requirements |
|---|---|
| Parse PDF / PPTX and generate a local rule-based draft | Python 3.10+ and core dependencies |
| Generate polished notes with an LLM | Install .[llm] and configure the chosen LLM API key |
| Process scanned PDFs or image-only slides | Configure an OCR API and run with --ocr auto |
| Understand diagrams, screenshots, charts, and visual layouts | Configure a vision model API and run with --vision auto |
| Preserve PPTX full-slide screenshots | Install LibreOffice, or install PowerPoint + pywin32 on Windows |
| Process old .ppt files | Recommended: install LibreOffice; fallback: manually export to PDF |
| Export Word, PDF, or LaTeX notes | Install Pandoc and run with --export docx,pdf,latex |
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -e ".[dev,llm]"
If you only need the local rule-based draft mode:
python -m pip install -e ".[dev]"
Rule-based draft:
python -m slidenote build path\to\lecture.pptx --out outputs\lecture --vision off
LLM rewriting:
python -m slidenote build path\to\lecture.pdf --out outputs\lecture --use-llm --provider openai
Output structure:
outputs/lecture/
content.json
page_modalities.json
table_understanding.json
semantic_layout.json
element_ir.json
image_importance.json
sections.json
deck_brief.md
deck_brief.json
content_guard.json
notes.md
page_notes.md
page_notes.json
weave_report.json
llm_usage.json
composite_figures.json
figures.json
figure_usage.json
figure_grounding.json
ocr.json
ocr_usage.json
visuals.json
vision_usage.json
coverage.json
coverage.md
source_map.json
export_report.json
notes.toc.md
notes.docx
notes.pdf
notes.tex
progress.json
run_summary.json
notes.assets/
figures/
images/
screenshots/
By default, notes.md references bundled image copies under notes.assets/. If you move or package notes.md together with notes.assets/, images should continue to render.
page_modalities.json records the local page-type detector. It helps later stages choose the cheaper stable path:
- native_text: use extracted text directly.
- mixed: use extracted text plus embedded images.
- image_only: prefer page OCR, figure cropping, and page-level vision.
- shape_diagram: use extracted labels plus page screenshot cropping, because the diagram may be built from PPT shapes.
- decorative: low priority unless the user explicitly refreshes it.
image_importance.json records per-image study-value scores and reasons. Vision auto uses that ranking to choose the best local figure crop or embedded image before falling back to a full-page screenshot.
table_understanding.json records local table summaries, conclusions, and key rows. Note generation uses these fields as the primary study signal, so tables are explained by what they mean rather than by mechanically repeating every cell.
semantic_layout.json records local page-level semantic blocks, groups, and relations. It is especially useful for code examples, console output, cause/fix annotations, and multi-part visual scenes that should be explained as one learning unit.
element_ir.json is the normalized Page IR / Element IR consumed by prompts, coverage, and source maps. Each element has a stable element_id, kind, bbox, roles, evidence, and source_ids, so later GUI editing, local revise flows, and block-level source tracing can read one format instead of many dataclass-specific fields.
composite_figures.json records local detections where a diagram was assembled from multiple embedded picture pieces. SlideNote crops the whole visual region as one composite_figure, marks the small pieces as composite_child, and keeps their IDs in hidden source refs instead of inserting them separately.
figure_grounding.json records where each study-value figure belongs in the note: layout order, nearby text/table anchors, grounding confidence, explanation status, and whether the figure needs manual review. notes.md uses this metadata to place figures near the relevant paragraph instead of dumping all images at the end.
sections.json records the section plan used by --note-context section and lecture-weave. In --section-detection auto, SlideNote uses local rules when LLM notes are disabled, and switches to LLM-assisted section detection when section-based LLM notes are enabled.
deck_brief.json is generated when --deck-brief auto runs with --use-llm --note-strategy lecture-weave (or when --deck-brief force is set). It stores the deck's topic, core questions, concept map, page roles, and cross-page links. Later page-note prompts treat it as a navigation map only: the current page remains the only source for each page explanation, and coverage checks still use original text/table/image IDs.
content_guard.json is generated by default when --content-guard auto is enabled. Without --use-llm, it records a local heuristic review. With --use-llm, SlideNote first preselects candidate tables, formulas, definitions, conditions, OCR text, visual summaries, and non-decorative figures, then asks the text model to classify page roles and element learning roles. Only high-confidence must_explain items count toward required_visible_coverage and can trigger one natural repair pass; low-confidence items remain audit information.
Extra exports are opt-in. --export markdown-toc writes notes.toc.md without Pandoc. --export docx,pdf,latex uses Pandoc to write notes.docx, notes.pdf, and notes.tex; conversion status and any Pandoc errors are written to export_report.json.
If you are not sure what your machine is missing, run:
python -m slidenote doctor
It checks:
- Python version.
- Core dependencies: PyMuPDF, python-pptx, Pillow.
- Optional dependencies: OpenAI SDK, pywin32.
- External tools: LibreOffice / soffice, Pandoc.
- Common LLM/OCR API key environment variables.
- Per-check impact, fix suggestions, and GUI-readable readiness flags.
You can also write the report as JSON:
python -m slidenote doctor --json doctor.json
For the complete configuration reference, see CONFIG.zh-CN.md.
Large decks can take time, especially with OCR, vision extraction, and LLM rewriting enabled. SlideNote now writes:
progress.json # Current or most recent run progress
run_summary.json # Final run overview
The CLI also prints live stage progress. To suppress terminal progress while still writing progress.json:
python -m slidenote build lecture.pdf --out outputs\lecture --quiet
Speed modes do not enable OCR or LLM by themselves. Vision defaults to auto (the quality-first setting) and can be disabled with --vision off. Speed modes only fill in limits you have not set explicitly:
--speed-mode fast # Fewer OCR/vision targets and smaller output budgets
--speed-mode balanced # Cost/time tradeoff
--speed-mode quality # Default: higher image resolution and output budgets
--speed-mode debug # Small target counts for debugging
Example:
python -m slidenote build lecture.pdf `
--out outputs\lecture-fast `
--speed-mode fast `
--vision auto `
--vision-provider qwen `
--use-llm `
--provider deepseek
OCR, vision, and LLM note contexts can run concurrently. Higher concurrency can be faster, but may hit provider rate limits. Start with 2 or 3:
--concurrency 3
To reuse cache across different output directories, set a global cache root:
python -m slidenote build lecture.pdf `
--out outputs\lecture-v2 `
--global-cache-dir .slidenote-cache `
--use-llm `
--provider deepseek
To force selected slides to bypass local cache while other slides still reuse cache:
--refresh-pages 3,5-8
Note: --refresh-pages currently means "bypass local cache for these slides", not "only output these slides".
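The 3,5-8 value reads naturally as a comma-separated mix of single pages and inclusive ranges. Here is a minimal sketch of a parser for that interpretation; parse_page_spec is a hypothetical helper, and SlideNote's own parsing may differ in details such as error handling.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "3,5-8" into a sorted list of unique page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_spec("3,5-8"))  # [3, 5, 6, 7, 8]
```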
The default output is a detailed lecture-style study note: it is organized by concepts instead of slide-by-slide translation, while keeping depth for definitions, formulas, examples, conditions, and figure/table conclusions. Source element IDs are hidden from the visible body, and images without OCR/vision summaries are inserted without noisy "image not parsed" explanations:
--note-style article # Default: organize as study notes, not a summary
--source-display hidden # Default: store source refs in HTML comments and source_map.json
--asset-mode bundle # Default: copy images into notes.assets/
--note-context section # Default: weave notes by section
--note-strategy lecture-weave # Default: explain each page, then weave sections
--note-depth detailed # Default: detailed lecture-note depth
--deck-brief auto # Default: build a global map before Lecture-Weave only
--content-guard auto # Default: protect high-confidence required learning content
--note-language zh # Default: write Simplified Chinese notes
--term-policy bilingual # Default: preserve key English academic terms in Chinese notes
lecture-weave is the default LLM note strategy. This mode is more expensive, but it better matches the "explain this slide" workflow: first SlideNote can build a Deck Brief for global navigation, then each page is explained in detail, and finally those page notes are woven into coherent sections. The Deck Brief is explicitly guarded so it cannot replace current-page evidence or make page explanations shorter.
--content-guard auto is on by default. It prevents the model from treating prompts as a compiler by giving the note prompt explicit learning_items and then checking whether required items appear in visible prose, not only hidden source markers. Use --content-guard off when you want the older behavior or need to minimize extra LLM calls.
Language controls are independent of the slide language. For English courseware and Chinese notes, use the default --note-language zh --term-policy bilingual; key terms are prompted as 中文译名(English term/acronym) on first mention. For English notes, use --note-language en. Use --term-policy preserve when you want the source terminology kept as much as possible, or --term-policy translate when you prefer translated terms where safe.
python -m slidenote build lecture.pdf `
--out outputs\lecture `
--use-llm `
--provider deepseek `
--weave-dedup soft
lecture-weave also writes deck_brief.json, deck_brief.md, page_notes.json, page_notes.md, and weave_report.json. These are intermediate/debug artifacts; notes.md remains the final readable note.
To show compact page references in the note body:
--source-display footnote
For strict debugging, use page context and inline source references:
--note-context page --source-display inline --note-style faithful
Full-page screenshots are now a fallback by default. If a page already has an embedded image or a local figure crop, notes.md does not insert the full-page screenshot unless you opt back into it:
--screenshot-policy always
PDF/PPT files often contain logos, tiny icons, background fragments, and decorative image resources. SlideNote keeps the raw files, but marks likely decorative images in content.json:
{
"role": "decorative",
"ignored": true,
"ignore_reason": "tiny_area"
}
Ignored images are skipped by default in notes, coverage checks, OCR fallback, and standalone vision targets. Full-page screenshots are still preserved in screenshots/ as the visual fallback.
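If you post-process content.json yourself, the same skip rule is easy to reproduce. The sketch below assumes a top-level "pages" list whose entries carry an "images" list; that layout, the IDs, and the "figure" role are assumptions for illustration. Only role, ignored, and ignore_reason come from the fragment above.

```python
import json

# Hypothetical excerpt: the "pages" key, the IDs, and the "figure" role are
# assumptions; "role", "ignored", and "ignore_reason" come from the README.
CONTENT = json.loads("""
{
  "pages": [
    {"images": [
      {"id": "s1_img1", "role": "decorative", "ignored": true, "ignore_reason": "tiny_area"},
      {"id": "s1_img2", "role": "figure"}
    ]}
  ]
}
""")


def study_images(content: dict) -> list[dict]:
    """Return every image that is not flagged as ignored."""
    return [
        image
        for page in content.get("pages", [])
        for image in page.get("images", [])
        if not image.get("ignored", False)
    ]


print([image["id"] for image in study_images(CONTENT)])  # ['s1_img2']
```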
Some lecture materials do not store diagrams as independent image objects. A page may be a scanned image, or a diagram may be made from PowerPoint shapes, arrows, text boxes, or many small picture objects. SlideNote first runs a local composite-figure detector for picture-piece diagrams, then can ask a vision model to locate other meaningful local figure regions on the full-page screenshot:
--composite-figures auto # Default: crop clustered embedded picture pieces as one figure
--composite-figures off # Disable local composite-figure detection
--figure-crop auto # Default: only calls the vision model when --vision is enabled
--figure-crop vision # Force bbox detection even if --vision is off
--figure-crop off # Disable local figure cropping
Outputs:
figures/
composite_figures.json
figures.json
figure_usage.json
Default limits:
--figure-max-targets 80
--figure-max-crops-per-page 3
--figure-min-confidence 0.45
--figure-min-area 40000
--figure-cache on
Figure cropping is best-effort. The model returns bounding boxes, and SlideNote validates, filters, deduplicates, and crops them locally. If no reliable local figure is found, notes fall back to the full-page screenshot.
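The validate, filter, and deduplicate steps can be sketched as follows. This is an illustration under stated assumptions, not SlideNote's implementation: the thresholds mirror the default limits listed above, areas are assumed to be in square pixels, and the IoU-based deduplication with a 0.7 cutoff is a swapped-in technique chosen for the example. Box, iou, and select_boxes are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float
    confidence: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes (0.0 when they do not overlap)."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    if width <= 0 or height <= 0:
        return 0.0
    inter = width * height
    return inter / (a.area + b.area - inter)


def select_boxes(boxes, page_w, page_h, min_conf=0.45, min_area=40000,
                 max_per_page=3, dedup_iou=0.7):
    """Clamp boxes to the page, drop weak or tiny ones, and remove near-duplicates."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.confidence, reverse=True):
        clamped = Box(max(0.0, box.x0), max(0.0, box.y0),
                      min(page_w, box.x1), min(page_h, box.y1), box.confidence)
        if clamped.confidence < min_conf or clamped.area < min_area:
            continue
        if any(iou(clamped, other) >= dedup_iou for other in kept):
            continue  # overlaps a higher-confidence box too much
        kept.append(clamped)
        if len(kept) == max_per_page:
            break
    return kept


candidates = [
    Box(100, 100, 500, 400, 0.90),   # kept
    Box(110, 105, 505, 400, 0.80),   # near-duplicate of the first box
    Box(0, 0, 100, 100, 0.95),       # too small
    Box(600, 100, 1000, 500, 0.30),  # confidence too low
    Box(600, 500, 1100, 900, 0.60),  # kept after clamping to the page
]
for box in select_boxes(candidates, page_w=1000, page_h=800):
    print(box)
```

Sorting by confidence first means that, among overlapping candidates, the most confident box is the one that survives.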
After OCR/vision and before note writing, SlideNote anchors non-decorative figures to nearby text or table elements. The default is local and deterministic:
--figure-grounding auto # Default: local layout anchoring, reusing existing OCR/vision summaries
--figure-placement inline # Default: insert figures near their anchored concept
--figure-audit local # Report missing explanations or low-confidence anchors
Use --figure-grounding vision when you want image explanations even if --vision off; this will trigger the normal vision extraction path for important visual targets. Coverage reports now include a figure section showing which images were inserted, where they were anchored, and which ones need review.
source_map.json records the mapping between note blocks and source elements:
note block -> PPT/PDF page -> text/table/image element id
By default, visible notes use hidden comments such as <!-- slidenote-source: p4:s4_t1,s4_t2 -->, while source_map.json keeps the full mapping. This keeps reading clean without losing coverage checks or GUI traceability.
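For ad-hoc inspection, the hidden comments can be pulled out of notes.md with a regular expression. This sketch assumes every comment follows the p<page>:<id>,<id> shape of the example above, with one page per comment; comments in real output may be richer, and parse_source_refs is a hypothetical helper.

```python
import re

# Matches comments shaped like: <!-- slidenote-source: p4:s4_t1,s4_t2 -->
SOURCE_COMMENT = re.compile(r"<!--\s*slidenote-source:\s*(.*?)\s*-->")


def parse_source_refs(markdown: str) -> list[tuple[int, list[str]]]:
    """Return (page, element_ids) pairs for each hidden source comment."""
    refs = []
    for match in SOURCE_COMMENT.finditer(markdown):
        page_part, _, ids_part = match.group(1).partition(":")
        page = int(page_part.lstrip("p"))
        element_ids = [item.strip() for item in ids_part.split(",") if item.strip()]
        refs.append((page, element_ids))
    return refs


sample = "Some paragraph.\n<!-- slidenote-source: p4:s4_t1,s4_t2 -->\n"
print(parse_source_refs(sample))  # [(4, ['s4_t1', 's4_t2'])]
```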
OpenAI-compatible providers use the OpenAI SDK after installing .[llm]. Gemini and Claude use native REST calls and do not require extra SDKs.
The note-writing step is text-first by default. It does not automatically send image bytes to the note model. For non-vision models such as DeepSeek, SlideNote passes text blocks, tables, image paths, element IDs, and any existing OCR/vision summaries. Image understanding is handled as a separate vision step, so a vision-capable model can create visual_summary fields that a cheaper text model can later reuse.
| Provider | Usage | Default Model | API Key Env Vars | Base URL |
|---|---|---|---|---|
| ChatGPT/OpenAI | --provider openai | gpt-4.1-mini | OPENAI_API_KEY | OpenAI SDK default |
| DeepSeek | --provider deepseek | deepseek-v4-flash | DEEPSEEK_API_KEY | https://api.deepseek.com |
| Qwen | --provider qwen | text: qwen-plus; vision: qwen-vl-plus | QWEN_API_KEY or DASHSCOPE_API_KEY | https://dashscope.aliyuncs.com/compatible-mode/v1 |
| Doubao / Volcengine Ark | --provider doubao | pass --model; vision needs --vision-model or ARK_VISION_MODEL | DOUBAO_API_KEY / ARK_API_KEY / VOLCENGINE_API_KEY | https://ark.cn-beijing.volces.com/api/v3 |
| GLM / Zhipu | --provider glm | glm-5.1 | GLM_API_KEY / ZAI_API_KEY / ZHIPUAI_API_KEY | https://open.bigmodel.cn/api/paas/v4/ |
| Gemini | --provider gemini | gemini-3-flash-preview | GEMINI_API_KEY or GOOGLE_API_KEY | https://generativelanguage.googleapis.com/v1beta |
| Claude | --provider claude | claude-sonnet-4-20250514 | ANTHROPIC_API_KEY or CLAUDE_API_KEY | https://api.anthropic.com |
Examples:
$env:DEEPSEEK_API_KEY="..."
python -m slidenote build lecture.pptx --out outputs\lecture --use-llm --provider deepseek
$env:DASHSCOPE_API_KEY="..."
python -m slidenote build lecture.pptx --out outputs\lecture --use-llm --provider qwen --model qwen-plus
$env:ARK_API_KEY="..."
python -m slidenote build lecture.pptx --out outputs\lecture --use-llm --provider doubao --model ep-xxxxxxxx
Common generation controls:
python -m slidenote build lecture.pptx `
--out outputs\lecture `
--use-llm `
--provider glm `
--model glm-5.1 `
--max-output-tokens 6000 `
--temperature 0.2 `
--cache on
For proxy gateways, private gateways, or different regions, override the base URL:
python -m slidenote build lecture.pptx --use-llm --provider qwen --base-url https://dashscope-intl.aliyuncs.com/compatible-mode/v1
You can also override model and base URL with environment variables:
$env:SLIDENOTE_MODEL="qwen-plus"
$env:SLIDENOTE_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
SlideNote now separates OCR from visual understanding.
OCR = read text from an image
Vision = explain diagrams, layout, trends, flows, and visual relationships
Dedicated OCR is useful for scanned PDFs, image-only PPT slides, screenshots, and scanned textbook pages. It runs before vision extraction and note generation, then writes recognized text back into content.json:
{
"page_ocr_text": "...",
"page_ocr_status": "parsed",
"images": [
{
"id": "s12_img1",
"ocr_text": "...",
"ocr_status": "parsed"
}
]
}
OCR is off by default:
--ocr off
Recommended Chinese OCR setup with Baidu OCR:
$env:BAIDU_OCR_API_KEY="..."
$env:BAIDU_OCR_SECRET_KEY="..."
python -m slidenote build lecture.pdf `
--out outputs\lecture `
--ocr auto `
--ocr-provider baidu
--ocr auto does not OCR every page. It first uses local text extraction. Only pages with too little extracted text, or pages that look like scanned/image-only pages, are sent to OCR.
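The selection rule can be pictured as a small predicate. This is a sketch only: the 80-character threshold matches the --ocr-min-text-chars default, image_only is the modality name from page_modalities.json, and the page dictionaries plus the choice to ignore whitespace when counting are assumptions. needs_ocr is a hypothetical helper.

```python
def needs_ocr(page_text: str, modality: str, min_text_chars: int = 80) -> bool:
    """Return True when a page has too little extracted text or is image-only."""
    visible_chars = len("".join(page_text.split()))  # assumption: whitespace is not counted
    return modality == "image_only" or visible_chars < min_text_chars


# Hypothetical page records for illustration.
pages = [
    {"page": 1, "text": "word " * 40, "modality": "native_text"},
    {"page": 2, "text": "Fig. 3", "modality": "mixed"},
    {"page": 3, "text": "word " * 40, "modality": "image_only"},
]
targets = [p["page"] for p in pages if needs_ocr(p["text"], p["modality"])]
print(targets)  # [2, 3]
```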
Supported OCR providers:
| Provider | Usage | Required credentials |
|---|---|---|
| Baidu OCR | --ocr-provider baidu | BAIDU_OCR_API_KEY + BAIDU_OCR_SECRET_KEY |
| Mathpix | --ocr-provider mathpix | MATHPIX_APP_ID + MATHPIX_APP_KEY |
| Google Vision OCR | --ocr-provider google | GOOGLE_VISION_API_KEY or GOOGLE_API_KEY |
Examples:
$env:MATHPIX_APP_ID="..."
$env:MATHPIX_APP_KEY="..."
python -m slidenote build math_notes.pdf --out outputs\math --ocr auto --ocr-provider mathpix
$env:GOOGLE_VISION_API_KEY="..."
python -m slidenote build lecture.pdf --out outputs\lecture --ocr auto --ocr-provider google
OCR controls:
--ocr auto # OCR only pages with little extracted text
--ocr all # OCR every page screenshot when possible
--ocr-max-targets 120
--ocr-min-text-chars 80
--ocr-max-edge 1800
--ocr-language CHN_ENG
--ocr-cache on
OCR outputs:
ocr.json
ocr_usage.json
ocr_usage.json records selected targets, cache hits, API calls, and recognized character counts. OCR results are cached separately from vision summaries, so changing the note model or vision model does not force OCR to run again.
Many lecture slides are image-driven: diagrams, screenshots, formula images, charts, flowcharts, and layout cues may carry the real teaching content. SlideNote can run a separate vision extraction step before note generation. The vision step writes OCR text and visual summaries back into content.json and also outputs:
visuals.json
vision_usage.json
Vision extraction is auto by default because SlideNote now favors note quality over token savings. The default vision provider is Qwen, so image-aware runs need DASHSCOPE_API_KEY or QWEN_API_KEY unless you choose another --vision-provider. Disable vision when you only want local parsing or text-only LLM rewriting:
--vision off
Recommended China-friendly setup: use the default Qwen-VL vision path and DeepSeek for text rewriting.
$env:DASHSCOPE_API_KEY="..."
$env:DEEPSEEK_API_KEY="..."
python -m slidenote build lecture.pptx `
--out outputs\lecture `
--vision auto `
--vision-provider qwen `
--use-llm `
--provider deepseek
This does more than create isolated captions. The vision prompt includes the current slide title, text blocks, and table preview, so the generated visual_summary can describe how the image relates to the surrounding slide text. The final note-writing prompt also asks the text model to merge visual_summary with related text/table elements into coherent knowledge paragraphs.
Doubao / Volcengine Ark vision is also supported, but you usually need to create a vision model endpoint in Ark and pass the endpoint/model ID via --vision-model or ARK_VISION_MODEL:
$env:ARK_API_KEY="..."
$env:ARK_VISION_MODEL="ep-xxxxxxxx"
$env:DEEPSEEK_API_KEY="..."
python -m slidenote build lecture.pptx --out outputs\lecture --vision auto --vision-provider doubao --use-llm --provider deepseek
Selection modes:
--vision auto # Recommended: parse high-value screenshots/images only
--vision all # Parse every page screenshot when possible; highest cost
--vision off # Disable vision extraction
By default, auto prioritizes local figure crops when they exist, then large embedded images, and only falls back to full-page screenshots when no better local visual target is available. This keeps visual summaries focused on the actual diagram instead of the entire slide.
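That priority order can be sketched as a simple fallback chain. The dictionaries, the "score" and "area" fields, and the single-target-per-page simplification are assumptions made for illustration; the 120000 cutoff borrows the --vision-min-area default. choose_vision_target is a hypothetical helper, not SlideNote's selection code.

```python
def choose_vision_target(figure_crops: list[dict], embedded_images: list[dict],
                         screenshot: dict, min_area: int = 120000) -> dict:
    """Pick one visual target for a page, preferring focused regions."""
    if figure_crops:
        # 1. Local figure crops win whenever they exist.
        return max(figure_crops, key=lambda item: item["score"])
    large_images = [item for item in embedded_images if item["area"] >= min_area]
    if large_images:
        # 2. Otherwise use the best large embedded image.
        return max(large_images, key=lambda item: item["score"])
    # 3. Fall back to the full-page screenshot.
    return screenshot


screenshot = {"id": "s7_page", "kind": "screenshot"}
images = [
    {"id": "s7_img1", "area": 9000, "score": 0.9},    # small icon
    {"id": "s7_img2", "area": 300000, "score": 0.8},  # large diagram
]
print(choose_vision_target([], images, screenshot)["id"])      # s7_img2
print(choose_vision_target([], images[:1], screenshot)["id"])  # s7_page
```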
Cost controls:
--vision-max-targets 80
--vision-min-area 120000
--vision-max-edge 1400
--vision-detail low
--vision-cache on
Conservative low-cost example:
python -m slidenote build lecture.pptx `
--out outputs\lecture `
--vision auto `
--vision-provider qwen `
--vision-max-targets 30 `
--vision-max-edge 1000 `
--use-llm `
--provider deepseek
If the deck is highly image-driven and quality matters more than cost:
python -m slidenote build lecture.pptx --out outputs\lecture --vision all --vision-provider openai --use-llm --provider openai
Full vision parsing is not recommended as the default. A better workflow is: run auto, inspect vision_usage.json and coverage.md, then selectively refresh low-quality or missing pages.
Vision results are written into page/image fields:
{
"page_ocr_text": "...",
"page_visual_summary": "...",
"images": [
{
"id": "s12_img1",
"ocr_text": "...",
"visual_summary": "..."
}
]
}
Text-only models such as DeepSeek can then use these textualized vision results without seeing the image pixels directly.
LLM rewriting uses local caching by default. Each note context cache key is based on:
structured note context + prompt version + note strategy + provider + model + base_url + temperature + max-output-tokens + figure/screenshot rendering options
If the same context and parameters are generated again, SlideNote reuses the local cache instead of calling the model. Cache hit metadata is not inserted into the note body; it is written to llm_usage.json for GUI and debugging use.
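One common way to build such a key is to hash a canonical JSON serialization of all the inputs. The sketch below assumes SHA-256 over sorted-key JSON; the function name, parameter names, and hashing scheme are illustrative and may not match SlideNote's actual key format.

```python
import hashlib
import json


def cache_key(context: dict, *, prompt_version: str, note_strategy: str,
              provider: str, model: str, base_url: str, temperature: float,
              max_output_tokens: int, render_options: dict) -> str:
    """Hash every input that can change the generated note into one hex digest."""
    payload = {
        "context": context,
        "prompt_version": prompt_version,
        "note_strategy": note_strategy,
        "provider": provider,
        "model": model,
        "base_url": base_url,
        "temperature": temperature,
        "max_output_tokens": max_output_tokens,
        "render_options": render_options,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical parameter values for illustration.
params = dict(prompt_version="v1", note_strategy="lecture-weave",
              provider="deepseek", model="example-model",
              base_url="https://api.deepseek.com", temperature=0.2,
              max_output_tokens=6000, render_options={"screenshot_policy": "fallback"})
key_a = cache_key({"page": 4, "text": "Bayes rule"}, **params)
key_b = cache_key({"text": "Bayes rule", "page": 4}, **params)
key_c = cache_key({"page": 4, "text": "Bayes rule"}, **{**params, "temperature": 0.7})
print(key_a == key_b, key_a == key_c)  # True False
```

Any change to a listed parameter produces a different digest, which is why switching the model or temperature results in a cache miss.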
In lecture-weave mode, page-note caches and weave caches are separate. Refreshing one slide with --refresh-pages 12 reruns that slide's page explanation and the weave context that contains it, while unrelated page explanations can still hit cache.
Deck Brief uses the same LLM cache directory with generation_stage="deck_brief", so an unchanged deck and section plan can reuse the global map without calling the model again.
Cache modes:
--cache on # Default: read/write local cache
--cache refresh # Ignore old cache, call the model, and overwrite cache
--cache off # Disable local cache
Custom cache directory:
python -m slidenote build lecture.pptx --use-llm --provider deepseek --cache-dir .slidenote-cache\llm
llm_usage.json records:
- Per-context cache_status: local_hit, miss, refresh, or disabled
- Per-context cache_key and cache_file
- Actual LLM call count and cache hit count
- deck_brief, page_note_calls, and weave_calls when --note-strategy lecture-weave is enabled
- Provider-reported input/output/total tokens
- Provider-side cached input tokens, when returned by the API
SlideNote deliberately avoids this shortcut:
PPT -> LLM -> Summary
Instead, it follows:
PPT/PDF -> structured extraction -> source inventory -> note generation -> coverage check -> export
The local rule-based draft is only a baseline for debugging extraction and coverage. Production notes should use --use-llm, while coverage checks still rely on element IDs so the model cannot silently summarize away details.
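The ID-based idea can be illustrated with plain set arithmetic: compare the element IDs in the source inventory with the IDs the note actually references. This is a simplified sketch; the real coverage.json has its own schema, and coverage_report with its field names is hypothetical.

```python
def coverage_report(inventory_ids, referenced_ids) -> dict:
    """Compare source element IDs with the IDs referenced by the note."""
    inventory = set(inventory_ids)
    referenced = set(referenced_ids)
    covered = inventory & referenced
    return {
        "covered_ratio": 1.0 if not inventory else len(covered) / len(inventory),
        "missing": sorted(inventory - referenced),  # in the slides, absent from the note
        "unknown": sorted(referenced - inventory),  # cited by the note, not in the inventory
    }


report = coverage_report(
    inventory_ids=["s4_t1", "s4_t2", "s4_img1", "s5_t1"],
    referenced_ids=["s4_t1", "s4_t2", "s5_t1"],
)
print(report)
# {'covered_ratio': 0.75, 'missing': ['s4_img1'], 'unknown': []}
```

Because the check works on IDs rather than on text similarity, a fluent paragraph that silently skips s4_img1 still shows up as a gap.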
Internally, the build is being organized around explicit pipeline stages with named dependencies and artifacts. run_summary.json includes the registered artifact map, while element_ir.json is the shared contract for prompt payloads, coverage, and source_map.json. This keeps the current CLI behavior stable while making future GUI and partial-revise work less brittle.
- OpenAI Chat Completions API
- OpenAI Images and vision
- DeepSeek API
- Alibaba Cloud Model Studio OpenAI-compatible API
- Volcengine Ark OpenAI SDK compatibility
- Zhipu GLM OpenAI compatibility
- Baidu OCR API
- Mathpix OCR API
- Google Cloud Vision OCR
- Gemini generateContent API
- Gemini image understanding
- Claude Messages API
- Claude Vision