Text Anonymizer

text-anonymizer is a Python package for detecting and anonymizing personally identifiable information (PII) in text. It builds on top of Microsoft's Presidio and adds custom recognizers and stable placeholder mapping for German and English content.

Features

Detect and anonymize multiple types of PII: names, organizations, locations, phone numbers, email addresses, IBAN codes, ZIP codes, IP addresses, URLs, and IDs
Designed primarily for anonymizing German text
Supports German and English text (can be extended to more languages)
Uses the Flair model flair/ner-german-large for German text, which provides significantly better NER accuracy than spaCy, especially for German; falls back to spaCy when Flair is not installed
Custom recognizers for improved detection accuracy
Entity instance tracking: the same entity is always replaced with the same placeholder (e.g. <PERSON_0>)
De-anonymization capability to restore original text from the entity mapping
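
The entity instance tracking described above can be illustrated with a minimal sketch. This is not the package's implementation: the PlaceholderMapper class and its methods are hypothetical names, and the only detail taken from this README is the <TYPE_N> placeholder format.

```python
from collections import defaultdict


class PlaceholderMapper:
    """Hypothetical helper: one stable placeholder per (entity type, value) pair."""

    def __init__(self) -> None:
        # Next free index for each entity type, e.g. {"PERSON": 2}.
        self._counters: dict[str, int] = defaultdict(int)
        # (entity type, original value) -> placeholder already assigned.
        self._by_value: dict[tuple[str, str], str] = {}

    def placeholder_for(self, entity_type: str, value: str) -> str:
        key = (entity_type, value)
        if key not in self._by_value:
            index = self._counters[entity_type]
            self._by_value[key] = f"<{entity_type}_{index}>"
            self._counters[entity_type] += 1
        return self._by_value[key]

    def mapping(self) -> dict[str, str]:
        # Reverse view (placeholder -> original value), useful for restoring text.
        return {ph: value for (_, value), ph in self._by_value.items()}


mapper = PlaceholderMapper()
first = mapper.placeholder_for("PERSON", "Alice")
second = mapper.placeholder_for("PERSON", "Bob")
again = mapper.placeholder_for("PERSON", "Alice")
print(first, second, again)  # <PERSON_0> <PERSON_1> <PERSON_0>
```

Keying the lookup on both the entity type and the value keeps counters independent per type, so a person and a location each start at index 0.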

Installation

Requires Python 3.12.

uv sync

To enable the optional German Flair NER model:

uv sync --extra german-ner

Install the required spaCy models:

uv pip install en_core_web_sm@https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl
uv pip install de_core_news_sm@https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.8.0/de_core_news_sm-3.8.0-py3-none-any.whl

For development:

uv sync --extra dev

Quick start

from text_anonymizer import TextAnonymizer

anonymizer = TextAnonymizer()

result = anonymizer.anonymize(
    text="Alice lives in Berlin and her email is alice@example.com",
    language="en",
)

print(result["text"])
print(result["entities"])

restored = anonymizer.deanonymize(result["text"], result["entities"])
print(restored)

Example output:

<PERSON_0> lives in <LOCATION_0> and her email is <EMAIL_ADDRESS_0>
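
To show what the de-anonymization step does conceptually, here is a small standalone sketch. It assumes the entity mapping is a plain dict from placeholder to original value; the actual structure of result["entities"] is not documented in this README and may differ. The restore function and the PLACEHOLDER pattern are hypothetical, not part of text-anonymizer.

```python
import re

# Assumed placeholder shape: <ENTITY_TYPE_N>, as in the example output above.
PLACEHOLDER = re.compile(r"<[A-Z_]+_\d+>")


def restore(text: str, mapping: dict[str, str]) -> str:
    """Swap each known placeholder back to its original value in a single pass."""
    # Unknown placeholders are left untouched rather than raising an error.
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


anonymized = "<PERSON_0> lives in <LOCATION_0> and her email is <EMAIL_ADDRESS_0>"
mapping = {
    "<PERSON_0>": "Alice",
    "<LOCATION_0>": "Berlin",
    "<EMAIL_ADDRESS_0>": "alice@example.com",
}
restored = restore(anonymized, mapping)
print(restored)  # Alice lives in Berlin and her email is alice@example.com
```

A single regex pass avoids re-scanning text that has already been restored, which matters if an original value happens to look like a placeholder.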

CLI usage

After installation, you can anonymize text from the command line:

text-anonymizer --language en --json "Alice lives in Berlin"

Or pipe input through stdin:

echo "Alice lives in Berlin" | text-anonymizer --language en

Supported entity types

The default configuration anonymizes:

ID
ORGANIZATION
URL
PHONE_NUMBER
EMAIL_ADDRESS
EMAIL
IBAN_CODE
PERSON
LOCATION
IP_ADDRESS
ZIP_CODE

You can pass a smaller list through the entities argument to anonymize only selected types.

Development

Run tests:

uv run pytest

Run linting:

uv run ruff check .
uv run ruff format --check .

Limitations

Quality depends on the installed NLP models and their language coverage
German NER is strongest when the optional Flair model is installed
De-anonymization requires the entity mapping returned by anonymize()
No guarantees are made for regulatory compliance without validating the system on your own data

Performance notes

Response time was measured as a function of email length (in characters) on an Azure D4v3 cluster, with and without a T4 GPU.

A sequential approach, in which regex-based recognizers run before NER, speeds up processing by reducing the amount of text the NER model needs to analyze.
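
The idea can be sketched as follows. This is an illustration under stated assumptions, not the package's pipeline: the two regex patterns are deliberately simplified, toy_ner is a stand-in for a real NER model, and all function names are hypothetical.

```python
import re
from typing import Callable

Span = tuple[int, int, str]  # (start, end, entity type)

# Simplified patterns for illustration only; real recognizers are stricter.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def regex_spans(text: str) -> list[Span]:
    """Step 1: cheap pattern matching over the full text."""
    spans = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), entity_type))
    return sorted(spans)


def uncovered_segments(text: str, spans: list[Span]) -> list[tuple[int, str]]:
    """Return (offset, segment) pairs for text not covered by the sorted spans."""
    segments, cursor = [], 0
    for start, end, _ in spans:
        if start > cursor:
            segments.append((cursor, text[cursor:start]))
        cursor = max(cursor, end)
    if cursor < len(text):
        segments.append((cursor, text[cursor:]))
    return segments


def detect(text: str, ner: Callable[[str], list[Span]]) -> list[Span]:
    """Step 2: run the expensive NER callable only on the remaining segments."""
    spans = regex_spans(text)
    for offset, segment in uncovered_segments(text, spans):
        for start, end, entity_type in ner(segment):
            # Shift segment-relative offsets back to full-text offsets.
            spans.append((offset + start, offset + end, entity_type))
    return sorted(spans)


def toy_ner(segment: str) -> list[Span]:
    """Stand-in for a real NER model: a fixed lookup of known names."""
    known = {"Alice": "PERSON", "Berlin": "LOCATION"}
    return [
        (m.start(), m.end(), entity_type)
        for name, entity_type in known.items()
        for m in re.finditer(re.escape(name), segment)
    ]


text = "Alice in Berlin, alice@example.com, host 10.0.0.1"
result = detect(text, toy_ner)
ner_chars = sum(len(seg) for _, seg in uncovered_segments(text, regex_spans(text)))
print([(text[s:e], t) for s, e, t in result])
print(f"NER saw {ner_chars} of {len(text)} characters")
```

Splitting the text into segments removes some context around each regex match, which can affect a real NER model; masking the matched spans in place is an alternative that keeps sentence structure intact.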
