Fix/pdf signature anonymization#82
Conversation
Reviewer's GuideRefines how PDF signature widgets are anonymized by introducing signature-specific rect refinement and redaction flows, changing widget handling to flatten rather than delete, tightening background repaint regions, and extending tests to cover variable layout signatures and pixel-level rendering behavior. Sequence diagram for the updated signature redaction application flowsequenceDiagram
participant Anonymizer
participant _apply_redactions
participant _prepare_signature_widget_ops
participant _apply_text_redactions
participant _apply_asset_redactions
participant _apply_signature_redactions
participant Document
participant Page
Anonymizer->>_apply_redactions: call _apply_redactions(doc, page_ops, widget_ops, signature_widget_ops)
_apply_redactions->>_prepare_signature_widget_ops: _prepare_signature_widget_ops(doc, signature_widget_ops)
_prepare_signature_widget_ops->>Document: doc.bake(annots=false, widgets=true)
_apply_redactions->>_apply_text_redactions: _apply_text_redactions(doc, text_page_ops)
_apply_redactions->>_apply_asset_redactions: _apply_asset_redactions(doc, asset_page_ops)
_apply_redactions->>_apply_signature_redactions: _apply_signature_redactions(doc, signature_widget_ops)
loop per_page
_apply_signature_redactions->>Document: doc[page_idx]
_apply_signature_redactions->>Page: add_redact_annot(op.redact_rect, text=None)
_apply_signature_redactions->>Page: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS, graphics=PDF_REDACT_LINE_ART_NONE, text=PDF_REDACT_TEXT_REMOVE)
_apply_signature_redactions->>Page: _render_text_op(page, op)
end
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- In
_prepare_signature_widget_ops, when a widget cannot be resolved you still proceed to usewidget_rectafter theif widget is not Noneblock; this will raise anUnboundLocalError—it would be safer tocontinueto the next group when the widget is missing. - In
_apply_signature_redactions, you indexop["redact_rect"]directly; consider usinggetand skipping ops that lack a redact rect to avoid unexpected KeyErrors if upstream logic ever produces a signature op without that field.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_prepare_signature_widget_ops`, when a widget cannot be resolved you still proceed to use `widget_rect` after the `if widget is not None` block; this will raise an `UnboundLocalError`—it would be safer to `continue` to the next group when the widget is missing.
- In `_apply_signature_redactions`, you index `op["redact_rect"]` directly; consider using `get` and skipping ops that lack a redact rect to avoid unexpected KeyErrors if upstream logic ever produces a signature op without that field.
## Individual Comments
### Comment 1
<location path="aymurai/text/anonymization/pdf/ops.py" line_range="192-201" />
<code_context>
return rects
+def _distance_between_rect_centers(
+ left: pymupdf.Rect,
+ right: pymupdf.Rect,
+) -> float:
+ """
+ Computes the squared distance between two rectangle centers.
+
+ Args:
+ left (pymupdf.Rect): The first rectangle.
+ right (pymupdf.Rect): The second rectangle.
+
+ Returns:
+ float: The squared distance between rectangle centers.
+ """
+ left_center = ((left.x0 + left.x1) / 2.0, (left.y0 + left.y1) / 2.0)
+ right_center = ((right.x0 + right.x1) / 2.0, (right.y0 + right.y1) / 2.0)
+ return (left_center[0] - right_center[0]) ** 2 + (
+ left_center[1] - right_center[1]
+ ) ** 2
+
+
</code_context>
<issue_to_address>
**nitpick:** Function name suggests actual distance while implementation returns squared distance.
Since this returns a squared distance, the name can easily be misread as returning the actual distance. That mismatch may lead to misuse if the helper is reused outside the current `min`/`key` pattern. Consider renaming to something like `_squared_distance_between_rect_centers` (or similar) so the metric is clear from the name alone.
</issue_to_address>
### Comment 2
<location path="tests/api/routers/anonymizer/test_anonymizer.py" line_range="400-409" />
<code_context>
+ assert background.y1 <= 82
+
+
+def test_signature_text_rect_refinement_does_not_include_role_text(tmp_path):
+ source_path = _write_pdf(
+ tmp_path / "signature-role.pdf",
+ lambda _doc, page: (
+ page.insert_text((100, 80), "RUIZ"),
+ page.insert_text((100, 96), "JUEZ/A"),
+ ),
+ )
+
+ with pymupdf.open(source_path) as doc:
+ page = doc[0]
+ signer_rect = page.search_for("RUIZ")[0]
+ role_rect = page.search_for("JUEZ/A")[0]
+ loose_rect = pymupdf.Rect(signer_rect)
+ loose_rect.include_rect(role_rect)
+
+ refined = _refine_signature_text_rect(
+ page,
+ "RUIZ",
+ pymupdf.Rect(80, 60, 200, 115),
+ loose_rect,
+ )
+
+ assert refined.intersects(signer_rect)
+ assert not refined.intersects(role_rect)
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Extend `_refine_signature_text_rect` coverage to cases with no hits and multiple hits.
To more fully exercise `_refine_signature_text_rect`, please also add: (1) a case where `search_for` finds no text within the widget clip (e.g., entity text lies entirely outside), asserting it returns a copy of `current_rect`; and (2) a case with multiple hits where only some intersect `current_rect`, asserting it selects the rect closest to `current_rect`’s center. This will lock in the `intersecting_hits` and `_distance_between_rect_centers` behaviors against regressions.
</issue_to_address>
### Comment 3
<location path="tests/api/routers/anonymizer/test_anonymizer.py" line_range="229-231" />
<code_context>
+ return path, preds, signers, preserved_texts, qr_rects
+
+
@pytest.mark.integration
def test_anonymization_package_exports_and_registry_are_stable():
assert PdfAnonymizer.__name__ == "PdfAnonymizer"
</code_context>
<issue_to_address>
**suggestion (testing):** Add a scenario where some signature names are not labeled to ensure only marked entities are redacted.
Please also cover a case where only a subset of signer names appear in `preds` (e.g., one signer is intentionally left unredacted). Asserting that unlabeled signer names remain visible while only labeled ones are tokenized will validate that the redaction flow doesn’t over-redact and that widget-level changes are scoped to the provided labels.
Suggested implementation:
```python
@pytest.mark.integration
def test_anonymization_package_exports_and_registry_are_stable():
assert PdfAnonymizer.__name__ == "PdfAnonymizer"
def test_only_labeled_signer_names_are_redacted(tmp_path):
# Build a test PDF with multiple signer blocks
path, preds, signers, preserved_texts, qr_rects = _make_test_document_with_signatures(
output_dir=tmp_path
)
# Use only a subset of signer labels in preds: leave the last signer unlabeled
labeled_signers = set(signers[:-1])
filtered_preds = []
for pred in preds:
# Keep labels only if their signer is in labeled_signers, otherwise drop labels
labels = []
for label in pred["labels"]:
if any(signer in label["text"] for signer in labeled_signers):
labels.append(label)
filtered_preds.append({**pred, "labels": labels})
# Run anonymization with the filtered predictions
anonymizer = PdfAnonymizer(
input_pdf=path,
predictions=filtered_preds,
redaction_strategy="tokenize",
)
output_path = anonymizer.run()
# Open the output PDF and assert that:
# - labeled signer names are tokenized
# - unlabeled signer name is still visible
doc = pymupdf.open(output_path)
try:
text = ""
for page in doc:
text += page.get_text()
# All labeled signers should be tokenized
for signer in labeled_signers:
assert signer not in text
assert "<PER>" in text
# The unlabeled signer (last in original list) must remain visible
unlabeled_signer = signers[-1]
assert unlabeled_signer in text
finally:
doc.close()
```
1. Ensure the helper used to construct the document is actually named `_make_test_document_with_signatures` and that its return signature matches `(path, preds, signers, preserved_texts, qr_rects)`. If it has a different name or signature, update the call accordingly.
2. If `PdfAnonymizer` or `pymupdf` are not already imported in this file, add the appropriate imports at the top of the file (e.g., `from anonymizer.pdf import PdfAnonymizer` and `import pymupdf` or `import fitz as pymupdf`, depending on the existing convention).
3. If labels produced by `_label_for_document_text` do not expose signer names under `label["text"]`, adjust the filtering condition inside the loop that builds `filtered_preds` so that it correctly identifies which labels correspond to which signer.
4. If the anonymizer API in this project is not used via `PdfAnonymizer(input_pdf=..., predictions=..., redaction_strategy=...)`, adapt the test to call the appropriate router/endpoint or factory that performs the redaction using the `predictions` passed in; the key part is that `filtered_preds` only labels a subset of `signers` and the assertions remain as written.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def test_signature_text_rect_refinement_does_not_include_role_text(tmp_path): | ||
| source_path = _write_pdf( | ||
| tmp_path / "signature-role.pdf", | ||
| lambda _doc, page: ( | ||
| page.insert_text((100, 80), "RUIZ"), | ||
| page.insert_text((100, 96), "JUEZ/A"), | ||
| ), | ||
| ) | ||
|
|
||
| with pymupdf.open(source_path) as doc: |
There was a problem hiding this comment.
suggestion (testing): Extend _refine_signature_text_rect coverage to cases with no hits and multiple hits.
To more fully exercise _refine_signature_text_rect, please also add: (1) a case where search_for finds no text within the widget clip (e.g., entity text lies entirely outside), asserting it returns a copy of current_rect; and (2) a case with multiple hits where only some intersect current_rect, asserting it selects the rect closest to current_rect’s center. This will lock in the intersecting_hits and _distance_between_rect_centers behaviors against regressions.
| @pytest.mark.integration | ||
| def test_anonymization_package_exports_and_registry_are_stable(): | ||
| assert PdfAnonymizer.__name__ == "PdfAnonymizer" |
There was a problem hiding this comment.
suggestion (testing): Add a scenario where some signature names are not labeled to ensure only marked entities are redacted.
Please also cover a case where only a subset of signer names appear in preds (e.g., one signer is intentionally left unredacted). Asserting that unlabeled signer names remain visible while only labeled ones are tokenized will validate that the redaction flow doesn’t over-redact and that widget-level changes are scoped to the provided labels.
Suggested implementation:
@pytest.mark.integration
def test_anonymization_package_exports_and_registry_are_stable():
assert PdfAnonymizer.__name__ == "PdfAnonymizer"
def test_only_labeled_signer_names_are_redacted(tmp_path):
# Build a test PDF with multiple signer blocks
path, preds, signers, preserved_texts, qr_rects = _make_test_document_with_signatures(
output_dir=tmp_path
)
# Use only a subset of signer labels in preds: leave the last signer unlabeled
labeled_signers = set(signers[:-1])
filtered_preds = []
for pred in preds:
# Keep labels only if their signer is in labeled_signers, otherwise drop labels
labels = []
for label in pred["labels"]:
if any(signer in label["text"] for signer in labeled_signers):
labels.append(label)
filtered_preds.append({**pred, "labels": labels})
# Run anonymization with the filtered predictions
anonymizer = PdfAnonymizer(
input_pdf=path,
predictions=filtered_preds,
redaction_strategy="tokenize",
)
output_path = anonymizer.run()
# Open the output PDF and assert that:
# - labeled signer names are tokenized
# - unlabeled signer name is still visible
doc = pymupdf.open(output_path)
try:
text = ""
for page in doc:
text += page.get_text()
# All labeled signers should be tokenized
for signer in labeled_signers:
assert signer not in text
assert "<PER>" in text
# The unlabeled signer (last in original list) must remain visible
unlabeled_signer = signers[-1]
assert unlabeled_signer in text
finally:
doc.close()- Ensure the helper used to construct the document is actually named
_make_test_document_with_signaturesand that its return signature matches(path, preds, signers, preserved_texts, qr_rects). If it has a different name or signature, update the call accordingly. - If
PdfAnonymizerorpymupdfare not already imported in this file, add the appropriate imports at the top of the file (e.g.,from anonymizer.pdf import PdfAnonymizerandimport pymupdforimport fitz as pymupdf, depending on the existing convention). - If labels produced by
_label_for_document_textdo not expose signer names underlabel["text"], adjust the filtering condition inside the loop that buildsfiltered_predsso that it correctly identifies which labels correspond to which signer. - If the anonymizer API in this project is not used via
PdfAnonymizer(input_pdf=..., predictions=..., redaction_strategy=...), adapt the test to call the appropriate router/endpoint or factory that performs the redaction using thepredictionspassed in; the key part is thatfiltered_predsonly labels a subset ofsignersand the assertions remain as written.
There was a problem hiding this comment.
Pull request overview
This PR improves PDF signature anonymization by redacting only signer-name text inside signature widgets while preserving the rest of the signature appearance (including nearby text and images), and adds integration tests covering variable signature layouts.
Changes:
- Added signature-specific rect refinement and operation-building to more tightly target signer-name text inside signature widgets.
- Introduced a dedicated signature-redaction application flow to avoid wiping the full signature appearance.
- Updated signature widget preparation/cleanup behavior and added new test utilities + integration coverage for signature preservation.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
aymurai/text/anonymization/pdf/ops.py |
Adds signature-specific rect refinement + op building and applies signature redactions in a dedicated pass. |
aymurai/text/anonymization/pdf/widgets.py |
Adjusts signature background rect computation and switches signature widget handling to flattening (bake) instead of deletion. |
aymurai/text/anonymization/pdf/sanitize.py |
Refines cleanup-rect selection for signature widget operations to prefer redaction/background regions. |
tests/api/routers/anonymizer/test_anonymizer.py |
Adds utilities and new integration tests validating signer-only redaction and preserved signature/QR appearance across layouts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if should_bake_widgets: | ||
| try: | ||
| doc.bake(annots=False, widgets=True) | ||
| except Exception as exc: | ||
| logger.warning("Failed to flatten PDF signature widgets: %s", exc) |
a37adc20 fix(useLocal): stop persisting groupOrder and remove dead categoryAssignments (#78) (#87) 47fb1fb3 fix(disambiguate): match response items by text instead of array index (#78) (#86) c873c8e0 Mover configuración de pnpm de `package.json` a `pnpm-workspace.yml` (#83) f4ce881b fix(useFileParse): use position-based paragraph ID to avoid key collisions (#85) 181e0356 Fix/invalid entity offsets (#82) git-subtree-dir: frontend git-subtree-split: a37adc20f579276b3a0e5979424ba7809fb7e2ff
This pull request introduces significant improvements to the handling of signature redactions in PDF anonymization, focusing on more precise redaction of signer names within signature widgets, improved rendering, and safer manipulation of PDF widgets. The changes include new helper functions for refining text rects, a dedicated signature redaction application flow, and more robust widget flattening and cleanup logic. Additionally, comprehensive test utilities are added to support these new features.
Signature Redaction Improvements:
_refine_signature_text_rectand_build_signature_page_opfunctions inpdf/ops.pyto more accurately identify and target signer name text within signature widgets, ensuring tighter and more reliable redaction._collect_page_redactionsto use the new_build_signature_page_opfor signature widgets, streamlining and centralizing signature-specific operation logic. [1] [2] [3]Redaction Application Flow:
_apply_signature_redactionsinpdf/ops.pyto apply redactions specifically for signer names, preserving the rest of the signature appearance and ensuring correct rendering order. The main_apply_redactionsflow now calls this new function. [1] [2]Widget Handling and Cleanup:
_prepare_signature_widget_opsinpdf/widgets.pyto flatten (not delete) signature widgets and only bake widgets if necessary, improving compatibility and reliability. The function also now removes obsolete keys from signature widget ops and computes background rects more safely. [1] [2] [3]_cleanup_rect_for_signature_widget_opinpdf/sanitize.pyto use the most appropriate rectangle for cleanup, preferring the redaction rect if available._signature_background_rectinpdf/widgets.pyto prioritize the correct rect sources and minimize the repaint area, reducing the risk of covering adjacent non-sensitive content.Testing Enhancements:
test_anonymizer.pyfor generating variable signature PDFs, constructing render contexts, and verifying dark pixel ratios and text counts, enabling more robust and targeted testing of signature redaction features. [1] [2]These changes collectively make signature redaction in PDFs more precise, visually reliable, and maintainable.
Summary by Sourcery
Improve precision and visual fidelity of PDF signature anonymization while preserving non-sensitive signature appearance and QR imagery.
Enhancements:
Tests: