@@ -16,16 +16,27 @@
"Why split the work. Nova handles the high-volume native multimodal extraction in one call. Claude is invoked once per page for the reasoning step that benefits from extra thinking. Each stage can be tuned or swapped independently.\n",
"\n",
"**What you will run.** Three synthetic yearbook pages — a portrait grid, a mixed-layout floor show page, and a group-photo academic page — end-to-end through both stages, with visualizations of the matched names and faces."
]
],
"id": "de174006e51c"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Setup\n",
"## 1. Prerequisites\n",
"\n",
"Install dependencies and create the Bedrock Runtime client. Make sure the IAM role / user calling this notebook has `bedrock:InvokeModel` and `bedrock:Converse` permission on `us.amazon.nova-2-lite-v1:0` and `us.anthropic.claude-sonnet-4-6` in `us-west-2` (or change the region below)."
]
"- An AWS account with access to Amazon Bedrock.\n",
"- Model access enabled in your Amazon Bedrock region for `us.amazon.nova-2-lite-v1:0` and `us.anthropic.claude-sonnet-4-6`.\n",
"- An AWS Identity and Access Management (IAM) role or user with `bedrock:InvokeModel` and `bedrock:Converse` permissions on those two models.\n",
"- Python 3.10 or later, plus `boto3` and `Pillow` (installed in the next cell).\n",
"\n",
"## 2. Setup\n",
"\n",
"Install dependencies and create the Amazon Bedrock Runtime client. The notebook uses the `us-west-2` region by default; change `AWS_REGION` in the configuration cell below if your access lives elsewhere.\n",
"\n",
"If you are running on Amazon SageMaker Studio or a Notebook Instance, the dependencies installed in the next cell are already available."
],
"id": "3cd01b97acf3"
},
{
"cell_type": "code",
@@ -34,7 +45,8 @@
"outputs": [],
"source": [
"%pip install --quiet boto3 Pillow"
]
],
"id": "45b6edaf3e9b"
},
{
"cell_type": "code",
@@ -67,7 +79,8 @@
"print(\"Nova model :\", NOVA_MODEL_ID)\n",
"print(\"Claude model:\", CLAUDE_MODEL_ID)\n",
"print(\"Region :\", AWS_REGION)"
]
],
"id": "fe7700dd189f"
},
{
"cell_type": "markdown",
@@ -82,7 +95,8 @@
"| `page_001_portrait_grid.png` | 4×5 portrait grid with one name printed under each headshot |\n",
"| `page_002_floor_show.png` | Mixed layout: one group photo (no caption) plus several candid/portrait photos with italic captions |\n",
"| `page_003_decathlon.png` | Single group photo with a caption listing every person back-row-then-front-row |"
]
],
"id": "ad764466c9f3"
},
{
"cell_type": "code",
@@ -93,16 +107,22 @@
"samples = sorted(SAMPLES_DIR.glob(\"page_*.png\"))\n",
"for p in samples:\n",
" print(f\"{p.name:40s} {p.stat().st_size/1024:.0f} KB\")"
]
],
"id": "431352d6e2e6"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display(IPyImage(filename=str(samples[0]), width=420))"
]
"from IPython.display import Markdown\n",
"# Render with alt text so screen readers can describe the image.\n",
"display(Markdown(\n",
" f'![Synthetic yearbook portrait grid sample page used as input to Stage 1.]({samples[0]})'\n",
"))"
],
"id": "6c92392a26e0"
},
{
"cell_type": "markdown",
@@ -117,7 +137,8 @@
"- `page_title`, `page_category`, `page_summary`: page-level metadata that doubles as the second use case (search indexing, content tagging) without an extra API call.\n",
"\n",
"All bounding boxes are on a 0–1000 normalized grid for both axes. The same coordinate space carries through to Stage 2, so no conversion is needed between calls."
]
],
"id": "1c505fbc3c27"
},
{
"cell_type": "code",
@@ -134,7 +155,8 @@
"print(\"Photos detected:\", len(extraction[\"photos\"]))\n",
"print(\"Names detected :\", len(extraction[\"names\"]))\n",
"print(\"Token usage :\", extraction[\"_usage\"])"
]
],
"id": "53c61d63ef20"
},
{
"cell_type": "code",
@@ -143,7 +165,8 @@
"outputs": [],
"source": [
"display(JSON({\"first_three_photos\": extraction[\"photos\"][:3], \"first_three_names\": extraction[\"names\"][:3]}))"
]
],
"id": "9bbd9cb9662b"
},
{
"cell_type": "markdown",
@@ -152,7 +175,8 @@
"## 4. Stage 2 — Claude Sonnet 4.6 matches names to faces\n",
"\n",
"Claude receives the original page image plus Nova's photos and names. Adaptive thinking is enabled so Claude allocates more reasoning to mixed-layout pages than to clean grids. The effort is set to `high` to make sure Claude always reasons through the spatial layout (the API call is identical to a regular Converse call apart from `additionalModelRequestFields`)."
]
],
"id": "4e3c1c2e5130"
},
{
"cell_type": "code",
@@ -172,7 +196,8 @@
"print(\"Unmatched faces :\", matching[\"unmatched_face_indices\"])\n",
"print(\"Thinking chars :\", len(matching[\"thinking_text\"] or \"\"))\n",
"print(\"Token usage :\", matching[\"_usage\"])"
]
],
"id": "4866e4205fef"
},
{
"cell_type": "code",
@@ -186,7 +211,8 @@
" f\"({assoc['match_type']}, conf {assoc['confidence']:.2f})\\n\"\n",
" f\" reason: {assoc.get('reasoning', '')}\"\n",
" )"
]
],
"id": "8a268632d99c"
},
{
"cell_type": "markdown",
@@ -195,7 +221,8 @@
"## 5. End-to-end on all three sample pages\n",
"\n",
"`run_pipeline` runs Stage 1 and Stage 2 back to back and returns one combined result. `visualize_result` draws the photo bounding boxes plus a colored line from each printed name to the matched face."
]
],
"id": "fc0c02b24433"
},
{
"cell_type": "code",
@@ -231,19 +258,25 @@
" all_results[sample.name] = result\n",
" print(f\" visualization -> {viz_path}\")\n",
" print(f\" raw output -> {json_path}\\n\")"
]
],
"id": "74d0d47ed6e8"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import Markdown\n",
"for sample in samples:\n",
" viz = RESULTS_DIR / f\"{sample.stem}_links.jpg\"\n",
" print(viz.name)\n",
" display(IPyImage(filename=str(viz), width=520))"
]
" # Render with alt text describing what the visualization shows.\n",
" display(Markdown(\n",
" f'![Visualization of {sample.stem}: photo bounding boxes with colored lines linking each printed name to the matched face.]({viz})'\n",
" ))"
],
"id": "48611cac2f90"
},
{
"cell_type": "markdown",
@@ -252,7 +285,8 @@
"## 6. Use case 2 — page-level metadata in the same call\n",
"\n",
"The Nova call already returned `page_title`, `page_category`, and `page_summary`. That is enough to build a search index, a per-event filter, or a table of contents across hundreds of pages without any extra API call."
]
],
"id": "fa815392c22f"
},
{
"cell_type": "code",
@@ -266,25 +300,31 @@
" print(f\" category : {result['page_category']}\")\n",
" print(f\" summary : {result['page_summary']}\")\n",
" print()"
]
],
"id": "33218fc3b230"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Notes on cost, latency, and tuning\n",
"\n",
"- **Cost.** Nova 2 Lite bills image inputs at a fixed 230 tokens per image (effective March 31, 2026), so Stage 1 cost is predictable regardless of page resolution. Claude's adaptive-thinking output dominates per-page cost on complex pages.\n",
"- **Cost.** Nova 2 Lite supports a fixed per-image token tier for image and document-page inputs, which makes Stage 1 cost predictable regardless of page resolution. Confirm the current rate on the [Amazon Bedrock pricing page](https://aws.amazon.com/bedrock/pricing/) before you size a workload. Claude's adaptive-thinking output dominates per-page cost on complex pages.\n",
"- **Latency.** Stage 1 typically returns in a few seconds. Stage 2 is the longer call because of the image input plus adaptive reasoning; expect roughly 20–30 seconds on complex pages.\n",
"- **Effort knob.** `output_config.effort` accepts `low`, `medium`, `high`, or `max`. Lowering effort to `medium` skips reasoning on simple grids and saves output tokens; raising to `max` (Opus only) gives Claude the deepest reasoning budget.\n",
"- **Batch inference.** Both Nova 2 Lite and Claude support [Bedrock Batch Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html) at a 50% discount for workloads that can run asynchronously.\n",
"- **Batch inference.** Both Nova 2 Lite and Claude support [Amazon Bedrock Batch Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html) at a 50% discount for workloads that can run asynchronously.\n",
"- **Prompt caching.** The Nova extraction prompt is identical across pages and is a good candidate for [prompt caching](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html) when running at volume.\n",
"- **Adapt the prompts.** `utils.py` exposes `NOVA_SYSTEM_PROMPT`, `NOVA_INSTRUCTION`, and `CLAUDE_SPATIAL_PROMPT_TEMPLATE`. Edit them in place to handle other document layouts (real estate listings, product catalogs, magazine spreads) — the rest of the pipeline stays the same.\n",
"- **Adapt the prompts.** `utils.py` exposes `NOVA_SYSTEM_PROMPT`, `NOVA_INSTRUCTION`, and `CLAUDE_SPATIAL_PROMPT_TEMPLATE`. Edit them in place to handle other document layouts (real estate listings, product catalogs, magazine spreads). The rest of the pipeline stays the same.\n",
"\n",
"## 8. Clean up\n",
"\n",
"This pattern is fully serverless — there are no Bedrock endpoints, SageMaker instances, or persistent storage to delete. If you uploaded sample pages to S3 to run this at scale, remove the bucket or objects when you finish."
]
"This pattern is fully serverless. There are no Amazon Bedrock endpoints, Amazon SageMaker AI instances, or persistent storage to delete. The notebook writes outputs to the local `results/` directory; delete this directory if you no longer need the visualization JPEGs and JSON files. If you uploaded sample pages to Amazon Simple Storage Service (Amazon S3) to run this at scale, remove the bucket or objects when you finish. Warning: deleting Amazon S3 objects is permanent and cannot be undone — back up any data you need to keep before deletion.\n",
"\n",
"## 9. Conclusion\n",
"\n",
"Two Amazon Bedrock calls are enough to map printed names to faces on a scanned page. Amazon Nova 2 Lite carries the native multimodal extraction in a single call, and Claude Sonnet 4.6 with adaptive thinking handles the spatial reasoning step that benefits from extra reasoning. Keeping the two stages on the same 0–1000 coordinate space removes glue code between calls, and each stage stays independently tunable. As next steps, swap the synthetic samples in `samples/` for your own documents, adapt the prompts in `utils.py` to match your layout, and rerun this notebook end to end to verify per-page accuracy on your own data."
],
"id": "4951ae6963d8"
}
],
"metadata": {
@@ -1,6 +1,6 @@
# Hybrid vision + spatial reasoning on Amazon Bedrock

This pattern matches printed names to the photographs they belong to on a scanned page using two Amazon Bedrock models: **Amazon Nova 2 Lite** for native multimodal extraction and **Anthropic Claude Sonnet 4.6** for spatial reasoning. The example data is a yearbook layout — portrait grids, mixed group/candid spreads, and roster-style group photos — but the same pattern applies to any document where the link between a photo and the people in it lives only in the page layout.
This pattern matches printed names to the photographs they belong to on a scanned page using two Amazon Bedrock models: **Amazon Nova 2 Lite** for native multimodal extraction and **Anthropic Claude Sonnet 4.6** for spatial reasoning. The example data is a yearbook layout — portrait grids, mixed group/candid spreads, and roster-style group photos — and the same pattern applies to many documents where the link between a photo and the people in it lives only in the page layout (for example, real estate listings, personnel directories, magazine spreads, and product catalogs).

```
┌────────────────────────────┐
@@ -90,8 +90,9 @@ Open `01_yearbook_name_face_matching.ipynb` and run the cells top to bottom. The
1. Creates a Bedrock Runtime client.
2. Calls Stage 1 (Nova) on a single page so you can see the raw extraction output.
3. Calls Stage 2 (Claude) on the same page so you can see the matched associations and adaptive-thinking trace length.
4. Runs the full `run_pipeline` helper on all three sample pages and writes JSON + JPEG outputs into `results/`.
5. Shows the page-level metadata (title, category, summary) that Nova returned in the same call as the photos and names — that metadata is the second use case (search indexing, content tagging) without any extra API call.
4. Runs the full `run_pipeline` helper on all three sample pages.
5. Writes JSON output and visualization JPEG files into the local `results/` directory.
6. Shows the page-level metadata (title, category, summary) that Nova returned in the same call as the photos and names. That metadata is the second use case (search indexing, content tagging) without any extra API call.
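
As a rough illustration of that second use case, the page-level fields can be grouped and searched with plain Python. This is a minimal sketch, not code from the repository: `build_category_index`, `search_pages`, and the sample dictionary values are hypothetical, and only the keys `page_title`, `page_category`, and `page_summary` come from the pipeline output described above.

```python
from collections import defaultdict


def build_category_index(results):
    """Group page names by their `page_category` value."""
    index = defaultdict(list)
    for page_name, result in results.items():
        # Pages without a category fall into a catch-all bucket.
        index[result.get("page_category", "uncategorized")].append(page_name)
    return dict(index)


def search_pages(results, term):
    """Return page names whose title or summary contains `term` (case-insensitive)."""
    needle = term.lower()
    hits = []
    for page_name, result in results.items():
        text = " ".join(
            str(result.get(key, "")) for key in ("page_title", "page_summary")
        )
        if needle in text.lower():
            hits.append(page_name)
    return sorted(hits)


# Hypothetical metadata, shaped like the per-page results the notebook collects.
example_results = {
    "page_001_portrait_grid.png": {
        "page_title": "Senior Portraits",
        "page_category": "portraits",
        "page_summary": "Grid of senior headshots with names.",
    },
    "page_003_decathlon.png": {
        "page_title": "Academic Decathlon",
        "page_category": "academics",
        "page_summary": "Group photo of the decathlon team.",
    },
}
```

For a few hundred pages an in-memory dictionary like this is likely enough; a larger archive would probably want a real search index fed from the same three fields.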

## Results

@@ -146,11 +147,15 @@ Each association comes with a short `reasoning` string from Claude, which is use
- `CLAUDE_SPATIAL_PROMPT_TEMPLATE` — change the reasoning rules if your captions live somewhere other than directly above/below the photo.
- `match_names_to_faces(..., effort="high")` — drop to `medium` to skip thinking on simple pages, or up to `max` (Opus models only) for the deepest reasoning budget.
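
The effort setting travels to Claude through the Converse call's `additionalModelRequestFields` parameter. The sketch below shows one way such a payload might be assembled and validated. The helper name `build_thinking_fields` is hypothetical, and the exact field layout (`thinking.type` set to `adaptive`, plus `output_config.effort`) is an assumption for illustration; check `utils.py` and the model documentation for the shape the pipeline really sends.

```python
VALID_EFFORTS = ("low", "medium", "high", "max")


def build_thinking_fields(effort="high"):
    """Build a dict intended for Converse's `additionalModelRequestFields`.

    ASSUMPTION: the key layout below is illustrative and may differ from
    what `match_names_to_faces` in utils.py actually sends.
    """
    if effort not in VALID_EFFORTS:
        raise ValueError(
            f"effort must be one of {VALID_EFFORTS}, got {effort!r}"
        )
    return {
        "thinking": {"type": "adaptive"},      # assumed field name and value
        "output_config": {"effort": effort},   # effort knob described above
    }


# Intended use (not executed here, requires AWS credentials):
#   client.converse(
#       modelId=CLAUDE_MODEL_ID,
#       messages=messages,
#       additionalModelRequestFields=build_thinking_fields("medium"),
#   )
```

Validating the value locally gives a clear error before any request is sent, which is cheaper than discovering a typo from a service-side validation failure.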

Both stages can be batched via [Bedrock Batch Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html) for a 50% discount on workloads that can run asynchronously, and the Stage 1 prompt is identical across pages — a good fit for [prompt caching](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html) at scale.
Both stages can be batched via [Amazon Bedrock Batch Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html) for a 50% discount on workloads that can run asynchronously, and the Stage 1 prompt is identical across pages — a good fit for [prompt caching](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html) at scale.

## Clean up

This pattern is fully serverless. There are no provisioned Bedrock endpoints, SageMaker instances, or persistent storage to delete. If you uploaded sample pages to S3 to run this at volume, remove the bucket or objects when you finish.
This pattern is fully serverless. There are no provisioned Amazon Bedrock endpoints, Amazon SageMaker AI instances, or persistent storage to delete. The pipeline writes outputs to the local `results/` directory; delete this directory if you no longer need the visualization JPEGs and JSON files. If you uploaded sample pages to Amazon Simple Storage Service (Amazon S3) to run this at volume, remove the bucket or objects when you finish. Warning: deleting Amazon S3 objects is permanent and cannot be undone — back up any data you need to keep before deletion.
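
Removing the local outputs can be scripted. The following is a minimal sketch using only the standard library; the helper name `remove_results_dir` is hypothetical, and it assumes the outputs live in a `results/` directory relative to the current working directory, as described above.

```python
import shutil
from pathlib import Path


def remove_results_dir(results_dir="results"):
    """Delete the local results directory and everything in it.

    Returns True if a directory was removed, False if none existed.
    """
    path = Path(results_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)  # permanent: removes the JPEG and JSON outputs
    return True
```

Like the Amazon S3 deletion, this is not reversible, so copy anything you want to keep before calling it.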

## Conclusion

This pattern shows that two Amazon Bedrock calls are enough to map printed names to faces on a scanned page. Amazon Nova 2 Lite carries the native multimodal extraction in a single call, and Claude Sonnet 4.6 with adaptive thinking handles the spatial reasoning step that benefits from extra reasoning. Keeping the two stages on the same 0–1000 coordinate space removes glue code between calls, and each stage stays independently tunable. As next steps, swap the synthetic samples in `samples/` for your own documents, adapt the prompts in `utils.py` to match your layout, and run the notebook end to end to verify per-page accuracy on your own data.
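
When you verify results on your own data, the only place the shared 0–1000 grid needs converting is at drawing time. A minimal sketch of that conversion follows. The function name `to_pixels` is hypothetical, and the `[x1, y1, x2, y2]` box order is an assumption; confirm the order the extraction output uses before relying on it.

```python
def to_pixels(box, width, height, grid=1000):
    """Convert a box on a 0..grid normalized space to pixel coordinates.

    ASSUMPTION: `box` is ordered [x1, y1, x2, y2].
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 * width / grid),
        round(y1 * height / grid),
        round(x2 * width / grid),
        round(y2 * height / grid),
    )


# A box covering the middle of a 2000x1200 pixel page scan.
pixel_box = to_pixels([100, 250, 500, 750], width=2000, height=1200)
```

Because both axes are scaled independently, the same normalized box stays valid if the page is resized before visualization.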

## Further reading
