Commit f87cecf (parent f0c7bf4)
docs(cookbooks): add Reducto PDF extraction

1 file changed: frontend/docs/pages/cookbooks/pdf-pipeline.mdx (70 additions, 1 deletion)
This pipeline is a good fit for a DAG because the processing stages are known before execution starts and the dependencies are easy to declare up front. The workflow does not need a loop, an unknown number of iterations, or a runtime decision about which tasks exist. Once `extract_text` completes, `classify_document`, `summarize_text`, and `extract_keywords` can run independently, and `format_result` waits for all three outputs before combining them. Each task points forward to later tasks, and no child step feeds back into an ancestor. If you later wanted to process a runtime-determined set of PDFs, a durable task could handle that outer decision and spawn this DAG once per document.
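The ordering constraints described above can be sketched outside of Hatchet as a plain dependency map. The task names come from this cookbook; the `ready_after` helper is illustrative only and is not part of the Hatchet API:

```python
# Parents of each task, as declared in the cookbook's DAG.
PARENTS = {
    "extract_text": [],
    "classify_document": ["extract_text"],
    "summarize_text": ["extract_text"],
    "extract_keywords": ["extract_text"],
    "format_result": ["classify_document", "summarize_text", "extract_keywords"],
}


def ready_after(done: set[str]) -> list[str]:
    """Tasks that have not run yet and whose parents are all complete."""
    return sorted(
        task
        for task, parents in PARENTS.items()
        if task not in done and all(p in done for p in parents)
    )
```

Once `extract_text` is in the completed set, `ready_after` returns all three middle tasks at once, which is why they can run concurrently, while `format_result` only becomes ready after all three finish.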
## Swapping in Reducto for PDF extraction
The local example above runs without external services, but production document pipelines often need more than basic text extraction. Reducto's [Parse API](https://docs.reducto.ai/parse/overview) can run OCR, detect document layout, and return structured chunks for content such as tables, figures, scanned pages, and multi-column documents. Since Hatchet's DAG isolates PDF parsing in a single task, swapping in Reducto is as simple as replacing `extract_text`; the downstream tasks stay the same. Hatchet still owns the workflow execution, retries, concurrency, and observability. Reducto handles the document parsing.
[Install the Reducto SDK](https://docs.reducto.ai/quickstart#install-the-sdk) and set `REDUCTO_API_KEY` in your environment.
For small documents, the snippets below handle the inline result returned when `result.result.type` is `"full"`. For larger documents, Reducto may return a URL result instead; production code should check that field and fetch the result from the URL if needed.
<UniversalTabs items={["Python", "Typescript"]}>
<Tabs.Tab title="Python">
```python
import base64
import tempfile
from pathlib import Path

from reducto import Reducto


@pdf_pipeline.task()
def extract_text(input: PdfInput, ctx: Context) -> ExtractOutput:
    client = Reducto()  # reads REDUCTO_API_KEY from the environment

    # Write the decoded PDF to a temp file so it can be uploaded.
    decoded = base64.b64decode(input.content_base64)
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / "input.pdf"
        tmp_path.write_bytes(decoded)
        upload = client.upload(file=tmp_path)

    result = client.parse.run(input=upload.file_id)
    text = "\n".join(chunk.content for chunk in result.result.chunks)

    return ExtractOutput(text=text, page_count=result.usage.num_pages)
```
</Tabs.Tab>
<Tabs.Tab title="Typescript">
```typescript
import Reducto from "reductoai";
import { createReadStream, mkdtempSync, rmSync, writeFileSync } from "fs";
import { join } from "path";
import { tmpdir } from "os";

const extractText = pdfPipeline.task({
  name: "extract_text",
  fn: async (input: PdfInput) => {
    const client = new Reducto(); // reads REDUCTO_API_KEY from the environment

    // Write the decoded PDF to a temp file so it can be uploaded.
    const decoded = Buffer.from(input.contentBase64, "base64");
    const tmpDir = mkdtempSync(join(tmpdir(), "hatchet-pdf-"));
    const tmpFile = join(tmpDir, "input.pdf");
    writeFileSync(tmpFile, decoded);

    try {
      const upload = await client.upload({ file: createReadStream(tmpFile) });
      const result = await client.parse.run({ input: upload.file_id });
      const text = result.result.chunks.map((c: any) => c.content).join("\n");

      return { text, pageCount: result.usage.num_pages };
    } finally {
      // Best-effort cleanup of the temp directory.
      try { rmSync(tmpDir, { recursive: true, force: true }); } catch {}
    }
  },
});
```
</Tabs.Tab>
</UniversalTabs>
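The URL-result branch mentioned earlier can be handled with a small helper. This is a sketch, not the Reducto SDK's API: the `type`, `chunks`, and `url` field names follow the inline result used above, but the exact shape of URL results should be verified against Reducto's Parse API reference.

```python
import json
import urllib.request


def resolve_chunks(parse_result):
    """Return parsed chunks whether the result is inline ("full") or URL-based.

    Assumes `parse_result.result.type` is "full" or "url", with inline chunks
    on `.chunks` and a fetchable JSON payload at `.url` (field names are
    assumptions; check them against the Reducto API docs).
    """
    if parse_result.result.type == "full":
        return parse_result.result.chunks
    with urllib.request.urlopen(parse_result.result.url) as resp:
        payload = json.loads(resp.read())
    return payload["chunks"]
```

In `extract_text`, `result.result.chunks` would then become `resolve_chunks(result)`, so the same task handles both small and large documents.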
## Next steps

From here you could add more processing stages, such as language detection, entity extraction, or metadata enrichment. You could also use Reducto for richer production parsing, replace keyword extraction or classification with an LLM call, configure retries or concurrency limits for the slowest stages, or fan out to process multiple PDFs by spawning the DAG as a child workflow from a [durable task](/v1/durable-tasks). For this cookbook, the five-task DAG is enough to show the core pattern.
