Commit f87cecf (parent f0c7bf4)
docs(cookbooks): add Reducto PDF extraction

1 file changed: frontend/docs/pages/cookbooks/pdf-pipeline.mdx (70 additions, 1 deletion)
This pipeline is a good fit for a DAG because the processing stages are known before execution starts and the dependencies are easy to declare up front. The workflow does not need a loop, an unknown number of iterations, or a runtime decision about which tasks exist. Once `extract_text` completes, `classify_document`, `summarize_text`, and `extract_keywords` can run independently, and `format_result` waits for all three outputs before combining them. Each task points forward to later tasks, and no child step feeds back into an ancestor. If you later wanted to process a runtime-determined set of PDFs, a durable task could handle that outer decision and spawn this DAG once per document.
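The ordering constraints described above can be sketched outside of Hatchet as a plain dependency map. The task names come from this cookbook; the `ready_after` helper is illustrative only and is not part of the Hatchet API:

```python
# Parents of each task, as declared in the cookbook's DAG.
PARENTS = {
    "extract_text": [],
    "classify_document": ["extract_text"],
    "summarize_text": ["extract_text"],
    "extract_keywords": ["extract_text"],
    "format_result": ["classify_document", "summarize_text", "extract_keywords"],
}


def ready_after(done: set[str]) -> list[str]:
    """Tasks that have not run yet and whose parents are all complete."""
    return sorted(
        task
        for task, parents in PARENTS.items()
        if task not in done and all(p in done for p in parents)
    )
```

Once `extract_text` is in the completed set, `ready_after` returns all three middle tasks at once, which is why they can run concurrently, while `format_result` only becomes ready after all three finish.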
## Swapping in Reducto for PDF extraction
The local example above runs without external services, but production document pipelines often need more than basic text extraction. Reducto's [Parse API](https://docs.reducto.ai/parse/overview) can run OCR, detect document layout, and return structured chunks for content such as tables, figures, scanned pages, and multi-column documents. Since Hatchet's DAG isolates PDF parsing in a single task, swapping in Reducto is as simple as replacing `extract_text`; the downstream tasks stay the same. Hatchet still owns the workflow execution, retries, concurrency, and observability. Reducto handles the document parsing.
[Install the Reducto SDK](https://docs.reducto.ai/quickstart#install-the-sdk) and set `REDUCTO_API_KEY` in your environment.
For small documents, the snippets below handle the inline result returned when `result.result.type` is `"full"`. For larger documents, Reducto may return a URL result instead; production code should check that field and fetch the result from the URL if needed.
<UniversalTabs items={["Python", "Typescript"]}>
<Tabs.Tab title="Python">
```python
import base64
import tempfile
from pathlib import Path

from reducto import Reducto


@pdf_pipeline.task()
def extract_text(input: PdfInput, ctx: Context) -> ExtractOutput:
    client = Reducto()  # reads REDUCTO_API_KEY from the environment

    # Write the decoded PDF to a temp file so it can be uploaded.
    decoded = base64.b64decode(input.content_base64)
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / "input.pdf"
        tmp_path.write_bytes(decoded)
        upload = client.upload(file=tmp_path)

    result = client.parse.run(input=upload.file_id)
    text = "\n".join(chunk.content for chunk in result.result.chunks)

    return ExtractOutput(text=text, page_count=result.usage.num_pages)
```
</Tabs.Tab>
<Tabs.Tab title="Typescript">
```typescript
import Reducto from "reductoai";
import { createReadStream, mkdtempSync, rmSync, writeFileSync } from "fs";
import { join } from "path";
import { tmpdir } from "os";

const extractText = pdfPipeline.task({
  name: "extract_text",
  fn: async (input: PdfInput) => {
    const client = new Reducto(); // reads REDUCTO_API_KEY from the environment

    // Write the decoded PDF to a temp file so it can be uploaded.
    const decoded = Buffer.from(input.contentBase64, "base64");
    const tmpDir = mkdtempSync(join(tmpdir(), "hatchet-pdf-"));
    const tmpFile = join(tmpDir, "input.pdf");
    writeFileSync(tmpFile, decoded);

    try {
      const upload = await client.upload({ file: createReadStream(tmpFile) });
      const result = await client.parse.run({ input: upload.file_id });
      const text = result.result.chunks.map((c: any) => c.content).join("\n");

      return { text, pageCount: result.usage.num_pages };
    } finally {
      // Best-effort cleanup of the temp directory.
      try { rmSync(tmpDir, { recursive: true, force: true }); } catch {}
    }
  },
});
```
</Tabs.Tab>
</UniversalTabs>
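The URL-result branch mentioned earlier can be handled with a small helper. This is a sketch, not the Reducto SDK's API: the `type`, `chunks`, and `url` field names follow the inline result used above, but the exact shape of URL results should be verified against Reducto's Parse API reference.

```python
import json
import urllib.request


def resolve_chunks(parse_result):
    """Return parsed chunks whether the result is inline ("full") or URL-based.

    Assumes `parse_result.result.type` is "full" or "url", with inline chunks
    on `.chunks` and a fetchable JSON payload at `.url` (field names are
    assumptions; check them against the Reducto API docs).
    """
    if parse_result.result.type == "full":
        return parse_result.result.chunks
    with urllib.request.urlopen(parse_result.result.url) as resp:
        payload = json.loads(resp.read())
    return payload["chunks"]
```

In `extract_text`, `result.result.chunks` would then become `resolve_chunks(result)`, so the same task handles both small and large documents.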
## Next steps

From here you could add more processing stages, such as language detection, entity extraction, or metadata enrichment. You could also use Reducto for richer production parsing, replace keyword extraction or classification with an LLM call, configure retries or concurrency limits for the slowest stages, or fan out to process multiple PDFs by spawning the DAG as a child workflow from a [durable task](/v1/durable-tasks). For this cookbook, the five-task DAG is enough to show the core pattern.
