A Python ETL pipeline for migrating WordPress export data into structured records you can clean, review, and load into another system.
This tool processes WordPress export files in four stages:
- Extract: Reads WordPress XML export data.
- Translate: Converts raw export fields into Python objects with consistent structure.
- Sanitize: Cleans and normalizes names, text, and author relationships.
- Format: Produces SQL-ready output files for import.
In short: it helps turn messy legacy WordPress data into cleaner, migration-ready data.
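The four stages above can be pictured as a chain of small functions. The following is a minimal sketch under stated assumptions, not this repository's actual code: the function names, the `Post` class, and the `articles` table are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Dublin Core namespace that WordPress exports use for <dc:creator>.
DC = "{http://purl.org/dc/elements/1.1/}"

# Tiny stand-in for a WordPress export file (illustrative data only).
SAMPLE = """<rss xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>  Hello   World </title>
      <dc:creator>jane  o'neil</dc:creator>
    </item>
  </channel>
</rss>"""


@dataclass
class Post:  # hypothetical record type
    title: str
    author: str


def extract(xml_text: str) -> list[ET.Element]:
    """Extract: parse the export XML and return the raw <item> elements."""
    return ET.fromstring(xml_text).findall("./channel/item")


def translate(items: list[ET.Element]) -> list[Post]:
    """Translate: turn raw XML fields into consistently shaped objects."""
    return [
        Post(
            title=item.findtext("title") or "",
            author=item.findtext(f"{DC}creator") or "",
        )
        for item in items
    ]


def sanitize(posts: list[Post]) -> list[Post]:
    """Sanitize: collapse runs of whitespace in titles and author names."""
    return [
        Post(" ".join(p.title.split()), " ".join(p.author.split()))
        for p in posts
    ]


def sql_quote(value: str) -> str:
    """Quote a string literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def format_sql(posts: list[Post]) -> list[str]:
    """Format: emit one INSERT statement per post (hypothetical table)."""
    return [
        "INSERT INTO articles (title, author) VALUES "
        f"({sql_quote(p.title)}, {sql_quote(p.author)});"
        for p in posts
    ]


statements = format_sql(sanitize(translate(extract(SAMPLE))))
print(statements[0])
# INSERT INTO articles (title, author) VALUES ('Hello World', 'jane o''neil');
```

Keeping each stage a plain function makes it easy to inspect the intermediate data between steps, which is the point of a staged pipeline over a single script.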
- Teams migrating from WordPress to another CMS or database.
- Developers or data operators who need repeatable cleanup of historical content.
- Anyone who wants a scripted pipeline instead of manual copy/paste migration.
- Python 3.10+
- WordPress export XML files (posts and related author data)
Use these steps to download the XML files this pipeline expects:
- Sign in to your WordPress admin dashboard.
- Open Tools -> Export.
- Choose All content (recommended for full migration) or export specific content types as needed.
- Click Download Export File to get the XML export.
- If your site uses a guest-author plugin, export that author data too (from the plugin's export screen or plugin data tools).
- Save all exported XML files in a local folder you will use as ETL input.
- If the export is very large, WordPress may produce multiple XML files. Keep all of them.
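Before running the pipeline, it can help to confirm that every exported file in the input folder is readable and to see what it contains. The sketch below is an assumption-laden illustration, not part of this tool: the helper names are hypothetical, and it matches elements by local tag name because the WordPress export namespace URI varies with the export format version.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]


def child_text(item: ET.Element, name: str) -> str:
    """Return the text of the first direct child whose local name matches.

    Caveat: some local names repeat across namespaces in WordPress exports
    (for example content:encoded and excerpt:encoded), so this helper is
    only safe for unambiguous names such as 'post_type'.
    """
    for child in item:
        if local_name(child.tag) == name:
            return child.text or ""
    return ""


def iter_items(input_dir: str):
    """Yield (file name, <item> element) for every XML file in the folder."""
    for path in sorted(Path(input_dir).glob("*.xml")):
        root = ET.parse(path).getroot()
        for item in root.iter("item"):
            yield path.name, item


def count_post_types(input_dir: str) -> Counter:
    """Count items per post type across all export files."""
    return Counter(
        child_text(item, "post_type") for _, item in iter_items(input_dir)
    )
```

Calling `count_post_types("path/to/your/input/folder")` returns a `Counter` keyed by post type, which gives a quick check that a multi-file export was downloaded completely.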
From the wordpress-etl directory:

    python3 main.py

Generate article embedding SQL output:

    python3 main.py --generate-embeddings

Run without interactive prompts (best-guess matching):

    python3 main.py --best-guess

Customize embedding settings:

    python3 main.py \
      --generate-embeddings \
      --embedding-model sentence-transformers/paraphrase-MiniLM-L3-v2 \
      --embedding-batch-size 64 \
      --embedding-max-chars 5000

- WordPress export XML data.
- Cleaned, transformed data structures used by the pipeline.
- SQL command files/log output suitable for downstream import.
- Optional embedding SQL output at logs/sql/article_embeddings.sql.
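The `--embedding-batch-size` and `--embedding-max-chars` options suggest that article text is truncated and then grouped into batches before being embedded. The sketch below shows one plausible reading of those two settings. It is an assumption about their behavior, not the tool's implementation, and all function names are hypothetical.

```python
from typing import Iterator


def truncate(text: str, max_chars: int = 5000) -> str:
    """Keep at most max_chars characters of an article body."""
    return text[:max_chars]


def batched(texts: list[str], batch_size: int = 64) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def prepare_batches(
    texts: list[str], batch_size: int = 64, max_chars: int = 5000
) -> list[list[str]]:
    """Truncate every text, then group the results into batches."""
    return list(batched([truncate(t, max_chars) for t in texts], batch_size))


batches = prepare_batches(["x" * 10] * 5, batch_size=2, max_chars=3)
print([len(b) for b in batches])  # [2, 2, 1]
print(batches[0][0])              # xxx
```

Each batch could then be handed to an embedding model. Smaller batches lower peak memory use, and a character cap bounds the cost of unusually long articles.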
- --best-guess mode resolves ambiguous author matches automatically using similarity scoring.
- Those match decisions are cached so future runs avoid repeating the same prompts.
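To illustrate the idea of similarity-based matching with cached decisions, here is a minimal sketch. It swaps in the standard library's `difflib.SequenceMatcher` as the scoring function and a JSON file as the cache. Both are assumptions made for illustration, and the function names are hypothetical. The scoring method and cache format this tool actually uses may differ.

```python
import json
from difflib import SequenceMatcher
from pathlib import Path


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_guess(name: str, candidates: list[str], cache: dict[str, str]) -> str:
    """Return the cached match for name, or pick the most similar candidate.

    candidates must be non-empty when name is not already in the cache.
    """
    if name in cache:
        return cache[name]
    choice = max(candidates, key=lambda c: similarity(name, c))
    cache[name] = choice  # remember the decision for later runs
    return choice


def load_cache(path: str) -> dict[str, str]:
    """Load earlier match decisions, or start empty if no file exists."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}


def save_cache(path: str, cache: dict[str, str]) -> None:
    """Persist match decisions as human-readable JSON."""
    Path(path).write_text(
        json.dumps(cache, indent=2, sort_keys=True), encoding="utf-8"
    )


cache: dict[str, str] = {}
print(best_guess("Jane Do", ["Jane Doe", "John Smith"], cache))  # Jane Doe
```

Storing decisions in a readable file also makes it possible to review or correct an automatic match by hand before the next run.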
This repository contains migration logic that reflects one real-world WordPress dataset. You may need to adjust sanitization and mapping rules for your own content model.