Skip to content

Fix/pdf signature anonymization#82

Merged
jansaldo merged 6 commits into
release/v1.5.0from
fix/pdf-signature-anonymization
May 26, 2026
Merged

Fix/pdf signature anonymization#82
jansaldo merged 6 commits into
release/v1.5.0from
fix/pdf-signature-anonymization

Conversation

@jansaldo
Copy link
Copy Markdown
Contributor

@jansaldo jansaldo commented May 26, 2026

This pull request introduces significant improvements to the handling of signature redactions in PDF anonymization, focusing on more precise redaction of signer names within signature widgets, improved rendering, and safer manipulation of PDF widgets. The changes include new helper functions for refining text rects, a dedicated signature redaction application flow, and more robust widget flattening and cleanup logic. Additionally, comprehensive test utilities are added to support these new features.

Signature Redaction Improvements:

  • Added _refine_signature_text_rect and _build_signature_page_op functions in pdf/ops.py to more accurately identify and target signer name text within signature widgets, ensuring tighter and more reliable redaction.
  • Updated _collect_page_redactions to use the new _build_signature_page_op for signature widgets, streamlining and centralizing signature-specific operation logic. [1] [2] [3]

Redaction Application Flow:

  • Introduced _apply_signature_redactions in pdf/ops.py to apply redactions specifically for signer names, preserving the rest of the signature appearance and ensuring correct rendering order. The main _apply_redactions flow now calls this new function. [1] [2]

Widget Handling and Cleanup:

  • Changed _prepare_signature_widget_ops in pdf/widgets.py to flatten (not delete) signature widgets and only bake widgets if necessary, improving compatibility and reliability. The function also now removes obsolete keys from signature widget ops and computes background rects more safely. [1] [2] [3]
  • Updated _cleanup_rect_for_signature_widget_op in pdf/sanitize.py to use the most appropriate rectangle for cleanup, preferring the redaction rect if available.
  • Refined _signature_background_rect in pdf/widgets.py to prioritize the correct rect sources and minimize the repaint area, reducing the risk of covering adjacent non-sensitive content.

Testing Enhancements:

  • Added utility functions and fixtures in test_anonymizer.py for generating variable signature PDFs, constructing render contexts, and verifying dark pixel ratios and text counts, enabling more robust and targeted testing of signature redaction features. [1] [2]

These changes collectively make signature redaction in PDFs more precise, visually reliable, and maintainable.

Summary by Sourcery

Improve precision and visual fidelity of PDF signature anonymization while preserving non-sensitive signature appearance and QR imagery.

Enhancements:

  • Refine signature text rect detection and signature-specific redaction operations to target only signer names within widget bounds.
  • Adjust signature background computation and widget preparation to flatten widgets safely and minimize repaint area to avoid covering adjacent content.
  • Update signature widget cleanup logic to prefer redaction and background rectangles for more accurate post-processing.

Tests:

  • Add utilities for generating variable-layout signature PDFs and render contexts to validate signer-name redaction behavior.
  • Extend PDF anonymizer integration tests to ensure only marked signer names are redacted, signature/QR visuals are preserved, and role or neighboring text remains visible.

@jansaldo jansaldo requested a review from Copilot May 26, 2026 19:49
@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented May 26, 2026

Reviewer's Guide

Refines how PDF signature widgets are anonymized by introducing signature-specific rect refinement and redaction flows, changing widget handling to flatten rather than delete, tightening background repaint regions, and extending tests to cover variable layout signatures and pixel-level rendering behavior.

Sequence diagram for the updated signature redaction application flow

sequenceDiagram
    participant Anonymizer
    participant _apply_redactions
    participant _prepare_signature_widget_ops
    participant _apply_text_redactions
    participant _apply_asset_redactions
    participant _apply_signature_redactions
    participant Document
    participant Page

    Anonymizer->>_apply_redactions: call _apply_redactions(doc, page_ops, widget_ops, signature_widget_ops)
    _apply_redactions->>_prepare_signature_widget_ops: _prepare_signature_widget_ops(doc, signature_widget_ops)
    _prepare_signature_widget_ops->>Document: doc.bake(annots=false, widgets=true)

    _apply_redactions->>_apply_text_redactions: _apply_text_redactions(doc, text_page_ops)
    _apply_redactions->>_apply_asset_redactions: _apply_asset_redactions(doc, asset_page_ops)

    _apply_redactions->>_apply_signature_redactions: _apply_signature_redactions(doc, signature_widget_ops)
    loop per_page
        _apply_signature_redactions->>Document: doc[page_idx]
        _apply_signature_redactions->>Page: add_redact_annot(op.redact_rect, text=None)
        _apply_signature_redactions->>Page: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS, graphics=PDF_REDACT_LINE_ART_NONE, text=PDF_REDACT_TEXT_REMOVE)
        _apply_signature_redactions->>Page: _render_text_op(page, op)
    end
Loading

File-Level Changes

Change Details Files
Introduce signature-specific rect refinement and page op construction for signer-name redactions within widgets.
  • Add _distance_between_rect_centers to compare rect centers when choosing the best hit.
  • Add _refine_signature_text_rect to search within a widget’s clip and pick the closest matching text rect for the signer name.
  • Add _build_signature_page_op to construct redaction ops scoped to the refined signer text and attach widget metadata.
  • Update _collect_page_redactions to use _build_signature_page_op for all signature widgets and stop treating them as generic image/text ops.
aymurai/text/anonymization/pdf/ops.py
Add a dedicated application flow for signature redactions that operates separately from general text/image redactions.
  • Introduce _apply_signature_redactions to add redact annotations over signer text, apply redactions with pixel-level image handling, then re-render replacement tokens.
  • Adjust _apply_redactions to stop treating signature widget ops as generic asset ops and instead invoke _apply_signature_redactions after text/asset redactions.
  • Ensure signature widget ops now primarily encode redaction rects and text rendering parameters rather than image/asset cleanup hints.
aymurai/text/anonymization/pdf/ops.py
Change signature widget preparation to flatten widgets safely and compute minimal background repaint areas.
  • Modify _signature_background_rect to prioritize canvas/redact rects, keep repainting constrained to the text line, and reduce padding.
  • Update _prepare_signature_widget_ops to mark when widgets should be baked, remove obsolete keys like asset_rect/image_rect/graphics_mode from ops, recompute widget_rect, and call doc.bake(widgets=True) instead of deleting widgets directly.
  • Harden widget handling with logging around resolution and baking failures.
aymurai/text/anonymization/pdf/widgets.py
Refine how cleanup rectangles are chosen for signature widget operations.
  • Change _cleanup_rect_for_signature_widget_op to prefer redact_rect, then background_rect, then canvas_rect when deciding the cleanup area.
  • Remove dependency on widget_rect for cleanup, aligning with the new, tighter repaint behavior.
aymurai/text/anonymization/pdf/sanitize.py
Extend tests and helpers to validate signature redaction precision, appearance preservation, and pixel behavior in PDFs.
  • Add PNG_BLACK_1X1 fixture and image-based dark pixel ratio helper to assert that QR or seal areas stay dark after redaction.
  • Introduce helpers for deriving labels from document text, building render contexts with LabelPolicy/RenderPolicy, and counting text occurrences.
  • Add _write_variable_signature_pdf to generate pages with multiple signatures, QR placements, and signer lines versus surrounding context.
  • Add unit tests for _signature_background_rect and _refine_signature_text_rect, update existing signature anonymization test to ensure appearance preservation, and add a new integration test that verifies only marked signer names are redacted while widgets are flattened and non-sensitive text/images remain.
  • Wire render_context through _run_pdf_anonymizer for tests that rely on suffix-based token rendering.
tests/api/routers/anonymizer/test_anonymizer.py
aymurai/text/anonymization/pdf/ops.py
aymurai/text/anonymization/pdf/widgets.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 3 issues, and left some high level feedback:

  • In _prepare_signature_widget_ops, when a widget cannot be resolved you still proceed to use widget_rect after the if widget is not None block; this will raise an UnboundLocalError—it would be safer to continue to the next group when the widget is missing.
  • In _apply_signature_redactions, you index op["redact_rect"] directly; consider using get and skipping ops that lack a redact rect to avoid unexpected KeyErrors if upstream logic ever produces a signature op without that field.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `_prepare_signature_widget_ops`, when a widget cannot be resolved you still proceed to use `widget_rect` after the `if widget is not None` block; this will raise an `UnboundLocalError`—it would be safer to `continue` to the next group when the widget is missing.
- In `_apply_signature_redactions`, you index `op["redact_rect"]` directly; consider using `get` and skipping ops that lack a redact rect to avoid unexpected KeyErrors if upstream logic ever produces a signature op without that field.

## Individual Comments

### Comment 1
<location path="aymurai/text/anonymization/pdf/ops.py" line_range="192-201" />
<code_context>
     return rects


+def _distance_between_rect_centers(
+    left: pymupdf.Rect,
+    right: pymupdf.Rect,
+) -> float:
+    """
+    Computes the squared distance between two rectangle centers.
+
+    Args:
+        left (pymupdf.Rect): The first rectangle.
+        right (pymupdf.Rect): The second rectangle.
+
+    Returns:
+        float: The squared distance between rectangle centers.
+    """
+    left_center = ((left.x0 + left.x1) / 2.0, (left.y0 + left.y1) / 2.0)
+    right_center = ((right.x0 + right.x1) / 2.0, (right.y0 + right.y1) / 2.0)
+    return (left_center[0] - right_center[0]) ** 2 + (
+        left_center[1] - right_center[1]
+    ) ** 2
+
+
</code_context>
<issue_to_address>
**nitpick:** Function name suggests actual distance while implementation returns squared distance.

Since this returns a squared distance, the name can easily be misread as returning the actual distance. That mismatch may lead to misuse if the helper is reused outside the current `min`/`key` pattern. Consider renaming to something like `_squared_distance_between_rect_centers` (or similar) so the metric is clear from the name alone.
</issue_to_address>

### Comment 2
<location path="tests/api/routers/anonymizer/test_anonymizer.py" line_range="400-409" />
<code_context>
+    assert background.y1 <= 82
+
+
+def test_signature_text_rect_refinement_does_not_include_role_text(tmp_path):
+    source_path = _write_pdf(
+        tmp_path / "signature-role.pdf",
+        lambda _doc, page: (
+            page.insert_text((100, 80), "RUIZ"),
+            page.insert_text((100, 96), "JUEZ/A"),
+        ),
+    )
+
+    with pymupdf.open(source_path) as doc:
+        page = doc[0]
+        signer_rect = page.search_for("RUIZ")[0]
+        role_rect = page.search_for("JUEZ/A")[0]
+        loose_rect = pymupdf.Rect(signer_rect)
+        loose_rect.include_rect(role_rect)
+
+        refined = _refine_signature_text_rect(
+            page,
+            "RUIZ",
+            pymupdf.Rect(80, 60, 200, 115),
+            loose_rect,
+        )
+
+    assert refined.intersects(signer_rect)
+    assert not refined.intersects(role_rect)
+
+
</code_context>
<issue_to_address>
**suggestion (testing):** Extend `_refine_signature_text_rect` coverage to cases with no hits and multiple hits.

To more fully exercise `_refine_signature_text_rect`, please also add: (1) a case where `search_for` finds no text within the widget clip (e.g., entity text lies entirely outside), asserting it returns a copy of `current_rect`; and (2) a case with multiple hits where only some intersect `current_rect`, asserting it selects the rect closest to `current_rect`’s center. This will lock in the `intersecting_hits` and `_distance_between_rect_centers` behaviors against regressions.
</issue_to_address>

### Comment 3
<location path="tests/api/routers/anonymizer/test_anonymizer.py" line_range="229-231" />
<code_context>
+    return path, preds, signers, preserved_texts, qr_rects
+
+
 @pytest.mark.integration
 def test_anonymization_package_exports_and_registry_are_stable():
     assert PdfAnonymizer.__name__ == "PdfAnonymizer"
</code_context>
<issue_to_address>
**suggestion (testing):** Add a scenario where some signature names are not labeled to ensure only marked entities are redacted.

Please also cover a case where only a subset of signer names appear in `preds` (e.g., one signer is intentionally left unredacted). Asserting that unlabeled signer names remain visible while only labeled ones are tokenized will validate that the redaction flow doesn’t over-redact and that widget-level changes are scoped to the provided labels.

Suggested implementation:

```python
@pytest.mark.integration
def test_anonymization_package_exports_and_registry_are_stable():
    assert PdfAnonymizer.__name__ == "PdfAnonymizer"


def test_only_labeled_signer_names_are_redacted(tmp_path):
    # Build a test PDF with multiple signer blocks
    path, preds, signers, preserved_texts, qr_rects = _make_test_document_with_signatures(
        output_dir=tmp_path
    )

    # Use only a subset of signer labels in preds: leave the last signer unlabeled
    labeled_signers = set(signers[:-1])
    filtered_preds = []
    for pred in preds:
        # Keep labels only if their signer is in labeled_signers, otherwise drop labels
        labels = []
        for label in pred["labels"]:
            if any(signer in label["text"] for signer in labeled_signers):
                labels.append(label)
        filtered_preds.append({**pred, "labels": labels})

    # Run anonymization with the filtered predictions
    anonymizer = PdfAnonymizer(
        input_pdf=path,
        predictions=filtered_preds,
        redaction_strategy="tokenize",
    )
    output_path = anonymizer.run()

    # Open the output PDF and assert that:
    # - labeled signer names are tokenized
    # - unlabeled signer name is still visible
    doc = pymupdf.open(output_path)
    try:
        text = ""
        for page in doc:
            text += page.get_text()

        # All labeled signers should be tokenized
        for signer in labeled_signers:
            assert signer not in text
            assert "<PER>" in text

        # The unlabeled signer (last in original list) must remain visible
        unlabeled_signer = signers[-1]
        assert unlabeled_signer in text
    finally:
        doc.close()

```

1. Ensure the helper used to construct the document is actually named `_make_test_document_with_signatures` and that its return signature matches `(path, preds, signers, preserved_texts, qr_rects)`. If it has a different name or signature, update the call accordingly.
2. If `PdfAnonymizer` or `pymupdf` are not already imported in this file, add the appropriate imports at the top of the file (e.g., `from anonymizer.pdf import PdfAnonymizer` and `import pymupdf` or `import fitz as pymupdf`, depending on the existing convention).
3. If labels produced by `_label_for_document_text` do not expose signer names under `label["text"]`, adjust the filtering condition inside the loop that builds `filtered_preds` so that it correctly identifies which labels correspond to which signer.
4. If the anonymizer API in this project is not used via `PdfAnonymizer(input_pdf=..., predictions=..., redaction_strategy=...)`, adapt the test to call the appropriate router/endpoint or factory that performs the redaction using the `predictions` passed in; the key part is that `filtered_preds` only labels a subset of `signers` and the assertions remain as written.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread aymurai/text/anonymization/pdf/ops.py Outdated
Comment on lines +400 to +409
def test_signature_text_rect_refinement_does_not_include_role_text(tmp_path):
source_path = _write_pdf(
tmp_path / "signature-role.pdf",
lambda _doc, page: (
page.insert_text((100, 80), "RUIZ"),
page.insert_text((100, 96), "JUEZ/A"),
),
)

with pymupdf.open(source_path) as doc:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Extend _refine_signature_text_rect coverage to cases with no hits and multiple hits.

To more fully exercise _refine_signature_text_rect, please also add: (1) a case where search_for finds no text within the widget clip (e.g., entity text lies entirely outside), asserting it returns a copy of current_rect; and (2) a case with multiple hits where only some intersect current_rect, asserting it selects the rect closest to current_rect’s center. This will lock in the intersecting_hits and _distance_between_rect_centers behaviors against regressions.

Comment on lines 229 to 231
@pytest.mark.integration
def test_anonymization_package_exports_and_registry_are_stable():
assert PdfAnonymizer.__name__ == "PdfAnonymizer"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add a scenario where some signature names are not labeled to ensure only marked entities are redacted.

Please also cover a case where only a subset of signer names appear in preds (e.g., one signer is intentionally left unredacted). Asserting that unlabeled signer names remain visible while only labeled ones are tokenized will validate that the redaction flow doesn’t over-redact and that widget-level changes are scoped to the provided labels.

Suggested implementation:

@pytest.mark.integration
def test_anonymization_package_exports_and_registry_are_stable():
    assert PdfAnonymizer.__name__ == "PdfAnonymizer"


def test_only_labeled_signer_names_are_redacted(tmp_path):
    # Build a test PDF with multiple signer blocks
    path, preds, signers, preserved_texts, qr_rects = _make_test_document_with_signatures(
        output_dir=tmp_path
    )

    # Use only a subset of signer labels in preds: leave the last signer unlabeled
    labeled_signers = set(signers[:-1])
    filtered_preds = []
    for pred in preds:
        # Keep labels only if their signer is in labeled_signers, otherwise drop labels
        labels = []
        for label in pred["labels"]:
            if any(signer in label["text"] for signer in labeled_signers):
                labels.append(label)
        filtered_preds.append({**pred, "labels": labels})

    # Run anonymization with the filtered predictions
    anonymizer = PdfAnonymizer(
        input_pdf=path,
        predictions=filtered_preds,
        redaction_strategy="tokenize",
    )
    output_path = anonymizer.run()

    # Open the output PDF and assert that:
    # - labeled signer names are tokenized
    # - unlabeled signer name is still visible
    doc = pymupdf.open(output_path)
    try:
        text = ""
        for page in doc:
            text += page.get_text()

        # All labeled signers should be tokenized
        for signer in labeled_signers:
            assert signer not in text
            assert "<PER>" in text

        # The unlabeled signer (last in original list) must remain visible
        unlabeled_signer = signers[-1]
        assert unlabeled_signer in text
    finally:
        doc.close()
  1. Ensure the helper used to construct the document is actually named _make_test_document_with_signatures and that its return signature matches (path, preds, signers, preserved_texts, qr_rects). If it has a different name or signature, update the call accordingly.
  2. If PdfAnonymizer or pymupdf are not already imported in this file, add the appropriate imports at the top of the file (e.g., from anonymizer.pdf import PdfAnonymizer and import pymupdf or import fitz as pymupdf, depending on the existing convention).
  3. If labels produced by _label_for_document_text do not expose signer names under label["text"], adjust the filtering condition inside the loop that builds filtered_preds so that it correctly identifies which labels correspond to which signer.
  4. If the anonymizer API in this project is not used via PdfAnonymizer(input_pdf=..., predictions=..., redaction_strategy=...), adapt the test to call the appropriate router/endpoint or factory that performs the redaction using the predictions passed in; the key part is that filtered_preds only labels a subset of signers and the assertions remain as written.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves PDF signature anonymization by redacting only signer-name text inside signature widgets while preserving the rest of the signature appearance (including nearby text and images), and adds integration tests covering variable signature layouts.

Changes:

  • Added signature-specific rect refinement and operation-building to more tightly target signer-name text inside signature widgets.
  • Introduced a dedicated signature-redaction application flow to avoid wiping the full signature appearance.
  • Updated signature widget preparation/cleanup behavior and added new test utilities + integration coverage for signature preservation.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
aymurai/text/anonymization/pdf/ops.py Adds signature-specific rect refinement + op building and applies signature redactions in a dedicated pass.
aymurai/text/anonymization/pdf/widgets.py Adjusts signature background rect computation and switches signature widget handling to flattening (bake) instead of deletion.
aymurai/text/anonymization/pdf/sanitize.py Refines cleanup-rect selection for signature widget operations to prefer redaction/background regions.
tests/api/routers/anonymizer/test_anonymizer.py Adds utilities and new integration tests validating signer-only redaction and preserved signature/QR appearance across layouts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +323 to +327
if should_bake_widgets:
try:
doc.bake(annots=False, widgets=True)
except Exception as exc:
logger.warning("Failed to flatten PDF signature widgets: %s", exc)
@jansaldo jansaldo merged commit e25ddcf into release/v1.5.0 May 26, 2026
3 checks passed
@jansaldo jansaldo deleted the fix/pdf-signature-anonymization branch May 26, 2026 20:18
dmazzini added a commit that referenced this pull request May 27, 2026
a37adc20 fix(useLocal): stop persisting groupOrder and remove dead categoryAssignments (#78) (#87)
47fb1fb3 fix(disambiguate): match response items by text instead of array index (#78) (#86)
c873c8e0 Mover configuración de pnpm de `package.json` a `pnpm-workspace.yml` (#83)
f4ce881b fix(useFileParse): use position-based paragraph ID to avoid key collisions (#85)
181e0356 Fix/invalid entity offsets (#82)

git-subtree-dir: frontend
git-subtree-split: a37adc20f579276b3a0e5979424ba7809fb7e2ff
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants