WordPress ETL Toolkit

A Python ETL pipeline for migrating WordPress export data into structured records you can clean, review, and load into another system.

What This Project Does

This tool processes WordPress export files in four stages:

Extract: Reads WordPress XML export data.
Translate: Converts raw export fields into Python objects with consistent structure.
Sanitize: Cleans and normalizes names, text, and author relationships.
Format: Produces SQL-ready output files for import.

In short, it turns messy legacy WordPress data into cleaner, migration-ready records.
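
The four stages above can be sketched as a chain of small functions. This is a minimal illustration, not the project's actual code: the `Article` record, the function names, the simplified XML layout, and the `articles` table are all assumptions made for the example.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Hypothetical record type; the project's real classes are not shown in this README.
@dataclass
class Article:
    title: str
    author: str
    body: str

# Simplified stand-in for an export file (real WordPress exports use namespaced tags).
SAMPLE_XML = """<rss><channel>
  <item><title>  Hello   World </title><author>jane DOE</author><body>It's here</body></item>
</channel></rss>"""

def extract(xml_text: str) -> list[ET.Element]:
    """Extract: read the raw <item> elements from export XML."""
    return ET.fromstring(xml_text).findall("./channel/item")

def translate(items: list[ET.Element]) -> list[Article]:
    """Translate: convert raw elements into consistently shaped Python objects."""
    return [
        Article(
            title=item.findtext("title", default=""),
            author=item.findtext("author", default=""),
            body=item.findtext("body", default=""),
        )
        for item in items
    ]

def sanitize(articles: list[Article]) -> list[Article]:
    """Sanitize: collapse whitespace and normalize author capitalization."""
    return [
        Article(
            title=" ".join(a.title.split()),
            author=a.author.strip().title(),
            body=a.body.strip(),
        )
        for a in articles
    ]

def sql_quote(value: str) -> str:
    """Quote a string literal by doubling single quotes (standard SQL escaping)."""
    return "'" + value.replace("'", "''") + "'"

def format_sql(articles: list[Article]) -> list[str]:
    """Format: emit one INSERT statement per article (table name is hypothetical)."""
    return [
        "INSERT INTO articles (title, author, body) VALUES "
        f"({sql_quote(a.title)}, {sql_quote(a.author)}, {sql_quote(a.body)});"
        for a in articles
    ]

statements = format_sql(sanitize(translate(extract(SAMPLE_XML))))
print(statements[0])
```

Keeping each stage as a pure function of the previous stage's output makes it easy to inspect intermediate results when a record looks wrong.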

Who This Is For

Teams migrating from WordPress to another CMS or database.
Developers or data operators who need repeatable cleanup of historical content.
Anyone who wants a scripted pipeline instead of manual copy/paste migration.

Requirements

Python 3.10+
WordPress export XML files (posts and related author data)

How to Export Data from WordPress

Use these steps to download the XML files this pipeline expects:

Sign in to your WordPress admin dashboard.
Open Tools -> Export.
Choose All content (recommended for full migration) or export specific content types as needed.
Click Download Export File to get the XML export.
If your site uses a guest-author plugin, export that author data too (from the plugin's export screen or plugin data tools).
Save all exported XML files in a local folder you will use as ETL input.
If the export is very large, WordPress may produce multiple XML files. Keep all of them.
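
Once the files are saved, it can help to confirm they parse before running the pipeline. The sketch below reads every XML file in a folder with the standard library and pulls a few common fields from each item. The helper names are hypothetical, and the `wp` namespace version (1.2) is an assumption; check the `xmlns` declarations at the top of your own export.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespaces commonly declared in WordPress export files. The wp version
# suffix is an assumption; adjust it to match your export.
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

def items_from_root(root: ET.Element) -> list[dict]:
    """Collect title, author login, post type, and body for each <item>."""
    rows = []
    for item in root.iterfind("./channel/item"):
        rows.append({
            "title": item.findtext("title", default=""),
            "author": item.findtext("dc:creator", default="", namespaces=NS),
            "post_type": item.findtext("wp:post_type", default="", namespaces=NS),
            "body": item.findtext("content:encoded", default="", namespaces=NS),
        })
    return rows

def read_export_folder(folder: str) -> list[dict]:
    """Read every *.xml file in the folder, in name order, so split exports are all included."""
    rows = []
    for path in sorted(Path(folder).glob("*.xml")):
        rows.extend(items_from_root(ET.parse(str(path)).getroot()))
    return rows

# Tiny inline sample so the sketch runs without a real export file.
SAMPLE = """<rss xmlns:wp="http://wordpress.org/export/1.2/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>First post</title>
      <dc:creator>jdoe</dc:creator>
      <wp:post_type>post</wp:post_type>
      <content:encoded><![CDATA[<p>Hello</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

rows = items_from_root(ET.fromstring(SAMPLE))
print(rows[0]["title"], rows[0]["author"])
```

If the author or body fields come back empty on a real export, a namespace mismatch is the first thing to check.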

Quick Start

From the wordpress-etl directory (with the dependencies listed in requirements.txt installed):

python3 main.py

Optional Modes

Generate article embedding SQL output:

python3 main.py --generate-embeddings

Run without interactive prompts (best-guess matching):

python3 main.py --best-guess

Customize embedding settings:

python3 main.py \
  --generate-embeddings \
  --embedding-model sentence-transformers/paraphrase-MiniLM-L3-v2 \
  --embedding-batch-size 64 \
  --embedding-max-chars 5000
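
For readers who want to see how flags like these fit together, here is a minimal `argparse` sketch. The flag names come from the commands above. Everything else is an assumption: the defaults simply reuse the example values, and `prepare_batches` is a hypothetical helper showing one plausible reading of the batch-size and max-chars settings (truncate each text, then group texts into batches). The project's real parsing and embedding code may differ.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--generate-embeddings", action="store_true",
                        help="also write article embedding SQL output")
    parser.add_argument("--best-guess", action="store_true",
                        help="resolve ambiguous author matches without prompting")
    # Defaults below are placeholders taken from the example command,
    # not the project's documented defaults.
    parser.add_argument("--embedding-model",
                        default="sentence-transformers/paraphrase-MiniLM-L3-v2")
    parser.add_argument("--embedding-batch-size", type=int, default=64)
    parser.add_argument("--embedding-max-chars", type=int, default=5000)
    return parser

def prepare_batches(texts: list[str], batch_size: int, max_chars: int) -> list[list[str]]:
    """Truncate each text to max_chars, then split the list into batches."""
    truncated = [text[:max_chars] for text in texts]
    return [truncated[i:i + batch_size] for i in range(0, len(truncated), batch_size)]

args = build_parser().parse_args(["--generate-embeddings", "--embedding-batch-size", "32"])
batches = prepare_batches(["abcdef", "gh", "ijk"], batch_size=2, max_chars=3)
print(args.embedding_batch_size, batches)
```

Under this reading, a smaller batch size lowers peak memory use, and a lower character limit bounds the work done per article.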

Input and Output

Input

WordPress export XML data.

Output

Cleaned, transformed records that the pipeline builds from the export.
SQL command files and log output suitable for downstream import.
Optional embedding SQL output at logs/sql/article_embeddings.sql.

Notes

--best-guess mode resolves ambiguous author matches automatically using similarity scoring.
Those match decisions are cached so future runs avoid repeating the same prompts.
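
The idea behind similarity-based matching with a decision cache can be sketched as follows. This is an illustration only: it uses `difflib.SequenceMatcher` from the standard library as a stand-in scoring method, and the function names and JSON cache format are assumptions. The README does not say which similarity measure or cache format the project uses.

```python
import json
from difflib import SequenceMatcher
from pathlib import Path

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_guess(name: str, candidates: list[str], cache: dict[str, str]) -> str:
    """Return the cached decision if one exists, else pick the closest candidate and remember it."""
    if name in cache:
        return cache[name]
    choice = max(candidates, key=lambda candidate: similarity(name, candidate))
    cache[name] = choice
    return choice

def load_cache(path: Path) -> dict[str, str]:
    """Load earlier decisions from a JSON file, or start empty (hypothetical format)."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

def save_cache(path: Path, cache: dict[str, str]) -> None:
    """Persist decisions so later runs can reuse them."""
    path.write_text(json.dumps(cache, indent=2, sort_keys=True), encoding="utf-8")

cache: dict[str, str] = {}
match = best_guess("Jon Smith", ["John Smith", "Jane Smythe"], cache)
print(match)  # John Smith
```

Because the cache is consulted first, a decision made once (automatically or by a person) is reused on later runs even if the candidate list changes.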

Disclaimer

This repository contains migration logic that reflects one real-world WordPress dataset. You may need to adjust sanitization and mapping rules for your own content model.
