text-anonymizer is a Python package for detecting and anonymizing personally identifiable information (PII) in text. It builds on top of Microsoft's Presidio and adds custom recognizers and stable placeholder mapping for German and English content.
- Detect and anonymize multiple types of PII: names, organizations, locations, phone numbers, email addresses, IBAN codes, ZIP codes, IP addresses, URLs, and IDs
- Primarily intended for anonymizing German text
- Support for German and English text (can be enhanced with more languages)
- Uses the Flair model (flair/ner-german-large) for German text, which provides significantly better NER accuracy than spaCy, especially for German. Falls back to spaCy when Flair is not installed
- Custom recognizers for improved detection accuracy
- Entity instance tracking: the same entity is always replaced with the same placeholder (e.g. <PERSON_0>)
- De-anonymization capability to restore original text from the entity mapping
Requires Python 3.12.
uv sync

To enable the optional German Flair NER model:

uv sync --extra german-ner

Install the required spaCy models:

uv pip install en_core_web_sm@https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl
uv pip install de_core_news_sm@https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.8.0/de_core_news_sm-3.8.0-py3-none-any.whl

For development:
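uv sync --extra dev

from text_anonymizer import TextAnonymizer

anonymizer = TextAnonymizer()

# Detect PII and replace it with numbered placeholders.
result = anonymizer.anonymize(
    text="Alice lives in Berlin and her email is alice@example.com",
    language="en",
)
print(result["text"])      # anonymized text
print(result["entities"])  # entity mapping needed for de-anonymization

# Restore the original text from the anonymized text and the mapping.
restored = anonymizer.deanonymize(result["text"], result["entities"])
print(restored)

Example output: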
<PERSON_0> lives in <LOCATION_0> and her email is <EMAIL_ADDRESS_0>
After installation, you can anonymize text from the command line:
text-anonymizer --language en --json "Alice lives in Berlin"

Or pipe input through stdin:

echo "Alice lives in Berlin" | text-anonymizer --language en

The default configuration anonymizes:

- ID
- ORGANIZATION
- URL
- PHONE_NUMBER
- EMAIL_ADDRESS
- EMAIL
- IBAN_CODE
- PERSON
- LOCATION
- IP_ADDRESS
- ZIP_CODE

You can pass a smaller list through the entities argument to anonymize only selected types.
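
To make the effect of restricting entity types concrete, here is a small self-contained sketch. It does not call the package: `replace_selected` and the `detections` format (dicts with `type`, `start`, `end`) are hypothetical names chosen for illustration, and the package's own filtering may work differently.

```python
def replace_selected(text, detections, entities=None):
    """Replace only detections whose type is listed in `entities`.

    `entities=None` means "replace every detected type".
    """
    allowed = None if entities is None else set(entities)
    kept = [d for d in detections if allowed is None or d["type"] in allowed]
    # Replace from the end of the text so earlier offsets stay valid.
    for d in sorted(kept, key=lambda d: d["start"], reverse=True):
        text = text[: d["start"]] + f"<{d['type']}>" + text[d["end"]:]
    return text


text = "Alice lives in Berlin and her email is alice@example.com"
# Hand-written detections standing in for real recognizer output.
detections = [
    {"type": "PERSON", "start": 0, "end": 5},
    {"type": "LOCATION", "start": 15, "end": 21},
    {"type": "EMAIL_ADDRESS", "start": 39, "end": 56},
]
print(replace_selected(text, detections, entities=["EMAIL_ADDRESS"]))
```

With only `EMAIL_ADDRESS` selected, the name and the city stay in the text and just the email address is replaced.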
Run tests:
uv run pytest

Run linting:
uv run ruff check .
uv run ruff format --check .

- Quality depends on the installed NLP models and their language coverage
- German NER is strongest when the optional Flair model is installed
- De-anonymization requires the entity mapping returned by anonymize()
- No guarantees are made for regulatory compliance without validating the system on your own data
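
The mapping requirement in the list above follows from how placeholder substitution works: the anonymized text alone no longer contains the original values. The sketch below illustrates stable placeholders and the round trip using two deliberately simple regex patterns. It is not the package's implementation; `anonymize_sketch`, `deanonymize_sketch`, the patterns, and the plain-dict mapping format are all assumptions made for this example.

```python
import re

# Hypothetical, simplified patterns; real recognizers are more thorough.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def anonymize_sketch(text):
    """Replace matches with numbered placeholders; equal values share one placeholder."""
    mapping = {}   # placeholder -> original value
    seen = {}      # (entity type, original value) -> placeholder
    counters = {}  # entity type -> next free index

    for entity_type, pattern in PATTERNS.items():
        def replace(match, entity_type=entity_type):
            key = (entity_type, match.group(0))
            if key not in seen:
                index = counters.get(entity_type, 0)
                counters[entity_type] = index + 1
                seen[key] = f"<{entity_type}_{index}>"
                mapping[seen[key]] = match.group(0)
            return seen[key]

        text = pattern.sub(replace, text)
    return text, mapping


def deanonymize_sketch(text, mapping):
    """Restore the original values; impossible without the mapping."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


original = "Mail alice@example.com or bob@example.org; alice@example.com replies from 10.0.0.1."
anonymized, mapping = anonymize_sketch(original)
print(anonymized)
print(deanonymize_sketch(anonymized, mapping) == original)
```

Both occurrences of the first address receive the same placeholder, which is what keeps the anonymized text readable and reversible.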
Response time was measured as a function of email length in characters on a D4v3 Azure cluster, with and without a T4 GPU.
Using a sequential approach (running regex-based recognizers before NER) speeds up processing by reducing the amount of text the NER model needs to analyze.
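
One way to picture the sequential approach is sketched below, under stated assumptions: the regex patterns are simplified, `toy_ner` is a stand-in for a real NER model, and `detect_sequential` is a hypothetical helper rather than the package's code. Regex matches are found first, and only the gaps between them are handed to NER.

```python
import re

# Hypothetical regex recognizers, much simpler than production patterns.
REGEX_RECOGNIZERS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://\S+"),
}


def toy_ner(segment):
    """Stand-in for a real NER model: flags capitalized words as PERSON."""
    return [("PERSON", m.start(), m.end())
            for m in re.finditer(r"\b[A-Z][a-z]+\b", segment)]


def detect_sequential(text, ner=toy_ner):
    """Run regex recognizers first, then NER only on the text they left uncovered."""
    spans = []
    for entity_type, pattern in REGEX_RECOGNIZERS.items():
        spans += [(entity_type, m.start(), m.end()) for m in pattern.finditer(text)]
    spans.sort(key=lambda s: s[1])

    results, ner_chars, cursor = [], 0, 0
    # Walk the gaps between regex matches; a final sentinel covers the tail.
    for entity_type, start, end in spans + [(None, len(text), len(text))]:
        if start > cursor:
            segment = text[cursor:start]
            ner_chars += len(segment)
            # Shift segment-relative offsets back to positions in the full text.
            results += [(t, cursor + s, cursor + e) for t, s, e in ner(segment)]
        if entity_type is not None and start >= cursor:
            results.append((entity_type, start, end))  # overlapping matches are skipped
        cursor = max(cursor, end)
    return sorted(results, key=lambda r: r[1]), ner_chars


text = "please ask Alice at alice@example.com or see https://example.com/help"
found, ner_chars = detect_sequential(text)
print(found)
print(f"NER saw {ner_chars} of {len(text)} characters")
```

Splitting the text this way removes some context around each gap, which a real NER model may rely on, so the speedup is a trade-off that should be checked against detection quality.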

