frontend/docs/pages/cookbooks/pdf-pipeline.mdx
This pipeline is a good fit for a DAG because the processing stages are known before execution starts and the dependencies are easy to declare up front. The workflow does not need a loop, an unknown number of iterations, or a runtime decision about which tasks exist. Once `extract_text` completes, `classify_document`, `summarize_text`, and `extract_keywords` can run independently, and `format_result` waits for all three outputs before combining them. Each task points forward to later tasks, and no child step feeds back into an ancestor. If you later wanted to process a runtime-determined set of PDFs, a durable task could handle that outer decision and spawn this DAG once per document.
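The dependency structure described above can be sketched without any SDK at all, as a small adjacency map plus a batch-by-batch runner. This is only an illustration of the graph shape (the task names mirror the pipeline; the runner is not Hatchet's scheduler):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it depends on.
DAG = {
    "extract_text": set(),
    "classify_document": {"extract_text"},
    "summarize_text": {"extract_text"},
    "extract_keywords": {"extract_text"},
    "format_result": {"classify_document", "summarize_text", "extract_keywords"},
}

def run_order(dag):
    """Return execution batches; tasks in the same batch can run in parallel."""
    ts = TopologicalSorter(dag)
    ts.prepare()
    batches = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # all tasks whose parents have finished
        batches.append(ready)
        ts.done(*ready)
    return batches
```

Running `run_order(DAG)` yields three batches: `extract_text` alone, then the three independent middle tasks together, then `format_result`, which is exactly the fan-out/fan-in shape the DAG declares.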
## Swapping in Reducto for PDF extraction
The local example above runs without external services, but production document pipelines often need more than basic text extraction. Reducto's [Parse API](https://docs.reducto.ai/parse/overview) can run OCR, detect document layout, and return structured chunks for content such as tables, figures, scanned pages, and multi-column documents. Since Hatchet's DAG isolates PDF parsing in a single task, swapping in Reducto is as simple as replacing `extract_text`; the downstream tasks stay the same. Hatchet still owns the workflow execution, retries, concurrency, and observability. Reducto handles the document parsing.
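A sketch of what the swapped-in task body might look like. The `client.upload(...)` and `client.parse.run(document_url=...)` calls follow Reducto's quickstart, but the chunk shape (a `content` text field per chunk) is an assumption; confirm both against the Parse API reference. The text-joining helper is kept pure so it can be exercised without network access:

```python
import os
from pathlib import Path

try:
    from reducto import Reducto  # assumed import name; install per Reducto's quickstart
except ImportError:
    Reducto = None  # lets the pure helper below run without the SDK installed

def join_chunk_text(chunks: list[dict]) -> str:
    """Concatenate parsed chunks into one text blob (field name is an assumption)."""
    return "\n\n".join(c["content"] for c in chunks)

def extract_text_with_reducto(pdf_path: str) -> str:
    """Replacement body for `extract_text`: parse the PDF via Reducto's Parse API."""
    client = Reducto(api_key=os.environ["REDUCTO_API_KEY"])
    upload = client.upload(file=Path(pdf_path))        # assumed upload helper
    result = client.parse.run(document_url=upload)     # assumed signature
    chunks = [c.model_dump() for c in result.result.chunks]
    return join_chunk_text(chunks)
```

Because the task's output is still a single text string, `classify_document`, `summarize_text`, and `extract_keywords` need no changes.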
[Install the Reducto SDK](https://docs.reducto.ai/quickstart#install-the-sdk) and set `REDUCTO_API_KEY` in your environment.
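Since a worker that starts without the key will only fail once a run reaches the parse task, it is worth failing fast at startup. A minimal sketch (the error message is illustrative):

```python
import os

def require_reducto_key() -> str:
    """Read REDUCTO_API_KEY, raising a clear error if it is unset or blank."""
    key = os.environ.get("REDUCTO_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "REDUCTO_API_KEY is not set; see the Reducto quickstart for setup."
        )
    return key
```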
For small documents, the snippets below handle the inline result returned when `result.result.type` is `"full"`. For larger documents, Reducto may return a URL result instead; production code should check that field and fetch the result from the URL if needed.
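That check can be sketched as a small helper. The field names here (`result.result.type`, the `"full"` variant, and a `url` attribute on the URL variant) follow the description above; confirm the exact response shapes against Reducto's API reference:

```python
import json
import urllib.request

def resolve_parse_result(result) -> dict:
    """Return the inline payload for "full" results; fetch URL results otherwise."""
    inner = result.result
    if inner.type == "full":
        return {"type": "full", "chunks": inner.chunks}
    # Larger documents: the response points at a URL holding the full payload.
    with urllib.request.urlopen(inner.url) as resp:
        return json.load(resp)
```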
From here you could add more processing stages, such as language detection, entity extraction, or metadata enrichment. You could also use Reducto for richer production parsing, replace keyword extraction or classification with an LLM call, configure retries or concurrency limits for the slowest stages, or fan out to process multiple PDFs by spawning the DAG as a child workflow from a [durable task](/v1/durable-tasks). For this cookbook, the five-task DAG is enough to show the core pattern.
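The fan-out idea can be sketched with a placeholder standing in for the durable task's child-workflow call. The `spawn` callable here is hypothetical, not a Hatchet API; it represents whatever call launches one run of this DAG per document:

```python
import asyncio

async def process_documents(pdf_paths, spawn):
    """Spawn one DAG run per PDF and gather the results.

    `spawn` is a stand-in for the durable task's child-workflow call:
    it takes a path and resolves to that document's final result.
    """
    return await asyncio.gather(*(spawn(path) for path in pdf_paths))
```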