BleTib/text-anonymizer

Text Anonymizer

text-anonymizer is a Python package for detecting and anonymizing personally identifiable information (PII) in text. It builds on top of Microsoft's Presidio and adds custom recognizers and stable placeholder mapping for German and English content.

Features

  • Detect and anonymize multiple types of PII: names, organizations, locations, phone numbers, email addresses, IBAN codes, ZIP codes, IP addresses, URLs, and IDs
  • Primarily intended for anonymizing German text
  • Supports German and English text (can be extended to more languages)
  • Uses the Flair model flair/ner-german-large for German text, which provides significantly better NER accuracy than spaCy, especially for German; falls back to spaCy when Flair is not installed
  • Custom recognizers for improved detection accuracy
  • Entity instance tracking: the same entity is always replaced with the same placeholder (e.g. <PERSON_0>)
  • De-anonymization capability to restore original text from the entity mapping
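
The entity tracking and de-anonymization features can be illustrated with a small standalone sketch. It does not use the package's internals: the `PlaceholderMap` class and the `anonymize_spans` and `restore` functions are hypothetical names, and the detected spans are supplied by hand instead of coming from a recognizer.

```python
from collections import defaultdict


class PlaceholderMap:
    """Hypothetical helper: one stable placeholder per distinct entity value."""

    def __init__(self):
        self._by_value = {}                # (entity_type, value) -> placeholder
        self._counters = defaultdict(int)  # entity_type -> next free index
        self.mapping = {}                  # placeholder -> original value

    def placeholder_for(self, entity_type, value):
        key = (entity_type, value)
        if key not in self._by_value:
            placeholder = f"<{entity_type}_{self._counters[entity_type]}>"
            self._counters[entity_type] += 1
            self._by_value[key] = placeholder
            self.mapping[placeholder] = value
        return self._by_value[key]


def anonymize_spans(text, spans, pmap):
    """Replace (start, end, entity_type) spans, scanning left to right."""
    pieces, cursor = [], 0
    for start, end, entity_type in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(pmap.placeholder_for(entity_type, text[start:end]))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def restore(text, mapping):
    """Undo the replacement using the placeholder -> value mapping."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


text = "Alice met Bob. Alice lives in Berlin."
# Hand-written spans standing in for recognizer output.
spans = [(0, 5, "PERSON"), (10, 13, "PERSON"), (15, 20, "PERSON"), (30, 36, "LOCATION")]

pmap = PlaceholderMap()
anonymized = anonymize_spans(text, spans, pmap)
print(anonymized)  # <PERSON_0> met <PERSON_1>. <PERSON_0> lives in <LOCATION_0>.
print(restore(anonymized, pmap.mapping) == text)  # True
```

Both occurrences of "Alice" map to `<PERSON_0>`, and the text can only be restored while the mapping is still available.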

Installation

Requires Python 3.12.

uv sync

To enable the optional German Flair NER model:

uv sync --extra german-ner

Install the required spaCy models:

uv pip install en_core_web_sm@https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl
uv pip install de_core_news_sm@https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.8.0/de_core_news_sm-3.8.0-py3-none-any.whl

For development:

uv sync --extra dev

Quick start

from text_anonymizer import TextAnonymizer

anonymizer = TextAnonymizer()

result = anonymizer.anonymize(
    text="Alice lives in Berlin and her email is alice@example.com",
    language="en",
)

print(result["text"])
print(result["entities"])

restored = anonymizer.deanonymize(result["text"], result["entities"])
print(restored)

Example output:

<PERSON_0> lives in <LOCATION_0> and her email is <EMAIL_ADDRESS_0>

CLI usage

After installation, you can anonymize text from the command line:

text-anonymizer --language en --json "Alice lives in Berlin"

Or pipe input through stdin:

echo "Alice lives in Berlin" | text-anonymizer --language en

Supported entity types

The default configuration anonymizes:

  • ID
  • ORGANIZATION
  • URL
  • PHONE_NUMBER
  • EMAIL_ADDRESS
  • EMAIL
  • IBAN_CODE
  • PERSON
  • LOCATION
  • IP_ADDRESS
  • ZIP_CODE

You can pass a smaller list through the entities argument to anonymize only selected types.
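
As an illustration, the helper below restricts anonymization to contact details. It is only a sketch: `anonymize_contact_details` and `CONTACT_ENTITIES` are hypothetical names, and it assumes that `entities` is accepted as a keyword argument of `anonymize()` alongside `text` and `language` from the quick start.

```python
# Entity type names taken from the default list above.
CONTACT_ENTITIES = ["EMAIL_ADDRESS", "PHONE_NUMBER", "IBAN_CODE"]


def anonymize_contact_details(anonymizer, text, language="de"):
    """Hypothetical wrapper: anonymize only contact-related entity types.

    Assumption: anonymizer.anonymize() accepts an `entities` keyword argument.
    """
    return anonymizer.anonymize(
        text=text,
        language=language,
        entities=CONTACT_ENTITIES,
    )
```

With a real instance this would be called as `anonymize_contact_details(TextAnonymizer(), some_text)`. Names and locations should then be left untouched, provided the filter behaves as described.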

Development

Run tests:

uv run pytest

Run linting:

uv run ruff check .
uv run ruff format --check .

Limitations

  • Quality depends on the installed NLP models and their language coverage
  • German NER is strongest when the optional Flair model is installed
  • De-anonymization requires the entity mapping returned by anonymize()
  • Regulatory compliance is not guaranteed; validate the system on your own data before relying on it

Performance notes

Response time was measured as a function of email length (in characters) on an Azure D4v3 cluster, with and without a T4 GPU.

Figure: Flair comparison, german vs german-large, on CPU and GPU

A sequential approach, in which regex-based recognizers run before NER, speeds up processing by reducing the amount of text the NER model needs to analyze.

Figure: Sequential vs parallel recognizer pipeline on GPU
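
The idea behind the sequential pipeline can be sketched as follows. This is not the package's implementation, and how the package reduces the NER input is not documented here. The sketch assumes one simple strategy: find regex matches first, then hand only the text between those matches to the NER stage. `sequential_detect`, `toy_ner` and the simplified `EMAIL_RE` pattern are hypothetical.

```python
import re

# Deliberately simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def sequential_detect(text, ner):
    """Run the regex stage first, then pass only the remaining text to `ner`.

    `ner` is any callable that returns (start, end, entity_type) tuples
    for the string it is given.
    """
    regex_hits = [(m.start(), m.end(), "EMAIL_ADDRESS") for m in EMAIL_RE.finditer(text)]

    results = list(regex_hits)
    cursor = 0
    # The sentinel span at the end makes the loop also cover the trailing text.
    for start, end, _ in regex_hits + [(len(text), len(text), None)]:
        segment = text[cursor:start]
        if segment.strip():
            # Shift NER offsets back into the coordinates of the full text.
            results += [(s + cursor, e + cursor, t) for s, e, t in ner(segment)]
        cursor = end
    return sorted(results)


def toy_ner(segment):
    """Stand-in for a real NER model: tags two hard-coded names."""
    known = {"Alice": "PERSON", "Berlin": "LOCATION"}
    return [(m.start(), m.end(), known[m.group()]) for m in re.finditer("Alice|Berlin", segment)]


text = "Alice lives in Berlin and her email is alice@example.com"
for start, end, entity_type in sequential_detect(text, toy_ner):
    print(entity_type, text[start:end])
```

In this sketch the email address never reaches the NER callable, so the expensive stage processes fewer characters. The saving grows with the share of text that the regex stage already covers.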
