Comparative corpus analysis for modern Python workflows.
pycorpdiff is the missing comparative layer between R's
quanteda, the closed-source SketchEngine
platform, and the fragmented Python NLP stack
(nltk/spaCy/gensim/sentence-transformers). Three public verbs
— compare(a, b), track(c, term), compare.before_after(c, event) —
consolidate keyness, collocations, dispersion, temporal trajectories,
changepoint detection, interrupted time series, causal-impact analysis,
forecasting, online changepoint detection, and embedding-based semantic
shift under a single notebook-native API. Keyness and collocation
results carry their own KWIC evidence: .explain(term) returns the
source-text concordances behind any ranked term.
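To make the KWIC (keyword-in-context) idea concrete, here is a minimal sketch in plain Python. It is not pycorpdiff's `.explain()` implementation: the `kwic` function, its whitespace tokenisation, and the `window` parameter are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def kwic(text: str, term: str, window: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every hit of `term`.

    Hypothetical helper: splits on whitespace and matches case-insensitively,
    ignoring punctuation attached to the token.
    """
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        if token.strip(".,;:!?\"'").lower() == term.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


sample = "The criminal case was dropped. No criminal record was found."
for left, keyword, right in kwic(sample, "criminal", window=2):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {keyword} | {right}")
```

Each row of a concordance like this is the textual evidence a reader can inspect before trusting a ranked keyness or collocation score.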
The package answers the questions that corpus linguistics, digital humanities, and computational social science routinely ask:
- How does corpus A differ from corpus B? → compare(a, b).keyness()
- How has discourse around X evolved over time? → track(c, "x").over_time()
- What did "migrant" mean in 2005 vs 2023? → compare(...).semantic_shift("migrant", embedder=...)
- Did this event actually shift the conversation? → track(...).causal_impact(event_date=...)
- Where is the discourse heading? → track(...).forecast(horizon=4)
pycorpdiff is positioned as orchestration, not reinvention.
Tokenizers (spaCy, Stanza, jieba, fugashi) and embedders (any
SBERT-compatible model) plug in via two typing.Protocol extension
points — one-line adapters, no plugin registry. The base install's
direct runtime dependencies are numpy, pandas, scipy, and
pyarrow; everything else is opt-in via extras.
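A structural (duck-typed) extension point built on typing.Protocol can be sketched as follows. The protocol name `Tokenizer`, its `__call__` signature, and the `count_tokens` consumer are hypothetical; pycorpdiff's actual protocol definitions may differ, so treat this as an illustration of the pattern rather than the package's API.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class Tokenizer(Protocol):
    """Hypothetical protocol: any callable that maps text to a list of tokens."""

    def __call__(self, text: str) -> List[str]: ...


class WhitespaceTokenizer:
    """Satisfies Tokenizer structurally: no inheritance, no registration."""

    def __call__(self, text: str) -> List[str]:
        return text.lower().split()


def count_tokens(text: str, tokenizer: Tokenizer) -> int:
    """Hypothetical consumer that only relies on the protocol's call shape."""
    return len(tokenizer(text))


print(count_tokens("Hello big World", WhitespaceTokenizer()))
# A plain function or lambda also conforms, which is what makes
# one-line adapters possible.
print(count_tokens("a b", lambda text: text.split()))
```

Under this pattern, wrapping a third-party tokenizer is a matter of writing one callable with the expected shape; nothing has to be registered ahead of time.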
Status: alpha (0.1.0a12). The public API is stable for the features described below; the package is available on PyPI via
pip install pycorpdiff.
| Layer | Purpose | Key surface |
|---|---|---|
| 1: Ingestion + Corpus | get text in, slice it, hash it | from_dataframe, read_csv, read_parquet, read_txt, read_duckdb, from_huggingface, fetch_hansard, Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars |
| 2: Pure math | statistics with no I/O | keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}; collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}; semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift}; temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd} |
| 3: Verbs + Results | public API | compare, track, compare.before_after, keyness_multi, plus 9 frozen-dataclass Result types each implementing the relevant subset of .to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json() |
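Three of the collocation measures named in the pure-math layer (PMI, t-score, logDice) have widely used textbook definitions. The sketch below implements those definitions directly from raw counts; the function names and argument order are assumptions, and pycorpdiff's own functions may differ in signature, smoothing, or logarithm base.

```python
import math


def pmi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise mutual information, base 2: log2(observed / expected)."""
    return math.log2((f_xy * n) / (f_x * f_y))


def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """(observed - expected) / sqrt(observed), expected under independence."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


def logdice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice: 14 + log2 of the Dice coefficient; independent of corpus size."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Toy counts: the pair co-occurs 8 times, each word occurs 16 times,
# and the corpus holds 1024 bigram positions.
f_xy, f_x, f_y, n = 8, 16, 16, 1024
print(pmi(f_xy, f_x, f_y, n))
print(t_score(f_xy, f_x, f_y, n))
print(logdice(f_xy, f_x, f_y))
```

Note that logDice takes no corpus-size argument, which is why it is often preferred when comparing collocation strength across corpora of different sizes.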
pip install "pycorpdiff[viz]"

import pycorpdiff as pcd
# Bundled synthetic Hansard-style sample — runs offline, no data download.
corpus = pcd.load_hansard_sample()
immigration = corpus.slice(topic="immigration")
# Which words separate the humanising and criminalising frames?
keyness = pcd.compare(
immigration.slice(frame="humanising"),
immigration.slice(frame="criminalising"),
).keyness(min_count=3)
keyness.plot() # volcano plot — picture the result
# keyness.table.head(10) # or look at the ranked table directly
# keyness.explain("criminal") # KWIC concordances showing the textual evidence

That's the entire surface in five lines: load a corpus, slice it, compare two slices, plot the result. Every other analytical method (collocation shifts, semantic drift, temporal trajectories, changepoint detection, causal-impact analysis, forecasting, co-occurrence networks, N-way keyness) follows the same shape. See the showcase notebook for the full feature tour, or the cheat sheet below for one-line API previews.
# Compare verbs (returns Result objects; methods exposed vary by Result)
pcd.compare(a, b).keyness() # default formula="rayson" (LL Wizard)
pcd.compare(a, b).keyness(formula="dunning") # full 4-cell G² (matches quanteda / NLTK)
pcd.compare(a, b).collocation_shift("immigrant")
pcd.compare(a, b).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder()) # [semantic]
# SBERTEmbedder downloads a sentence-transformers model on first call;
# use pcd.HashEmbedder() for offline / deterministic-test settings.
# Track over time (requires [temporal] for the changepoint + ITS + forecast + causal_impact methods)
tr = pcd.track(corpus, "immigrant").over_time(freq="Y")
tr.changepoints() # offline PELT
tr.changepoints_online(hazard=1/24) # Bayesian online (Adams & MacKay 2007)
tr.interrupted_time_series(event_date="2016") # segmented OLS
tr.causal_impact(event_date="2016") # Bayesian counterfactual (Brodersen 2015)
tr.forecast(horizon=4) # 4 periods at the over_time freq (state-space ETS)
# Before / after a known event
pcd.compare.before_after(corpus, event_date="2016-06-23").keyness()
# N-way (≥ 2 corpora)
pcd.keyness_multi([a, b, c, d], labels=["A", "B", "C", "D"])
# The discourse as a graph
pcd.cooccurrence_network(corpus, top_n=30).plot()

See examples/pycorpdiff_showcase.ipynb for a walkthrough on the synthetic Hansard-style corpus that exercises every analytical surface.
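The two formula= options in the cheat sheet correspond to two common log-likelihood variants: a two-cell form in the style of Rayson's LL Wizard and the full four-cell G² statistic. The sketch below follows the standard published definitions; the function names and the (a, b, c, d) argument convention are assumptions for illustration, not pycorpdiff's internal code.

```python
import math


def _cell(observed: float, expected: float) -> float:
    """One O * ln(O / E) contribution; 0 * ln(0) is taken as 0 by convention."""
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def ll_two_cell(a: int, b: int, c: int, d: int) -> float:
    """Two-cell log-likelihood. a, b: term counts; c, d: corpus sizes in tokens."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    return 2 * (_cell(a, e1) + _cell(b, e2))


def g2_four_cell(a: int, b: int, c: int, d: int) -> float:
    """Full 2x2 G²: also counts the 'every other token' cells of each corpus."""
    total = c + d
    other = total - a - b  # tokens that are not the term, both corpora
    cells = [
        (a, c * (a + b) / total),
        (b, d * (a + b) / total),
        (c - a, c * other / total),
        (d - b, d * other / total),
    ]
    return 2 * sum(_cell(o, e) for o, e in cells)


# Toy example: the term occurs 30 times in corpus A and 10 times in corpus B,
# each corpus holding 1000 tokens.
print(ll_two_cell(30, 10, 1000, 1000))
print(g2_four_cell(30, 10, 1000, 1000))
```

The four-cell value is never smaller than the two-cell value, and the gap is usually small when the term is rare relative to corpus size, which is why the two variants tend to rank terms similarly while differing slightly in magnitude.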
pip install pycorpdiff # lexical-comparative core (MIT)
pip install "pycorpdiff[viz]" # + altair / matplotlib / networkx
pip install "pycorpdiff[semantic]" # + sentence-transformers
pip install "pycorpdiff[temporal]" # + ruptures / statsmodels
pip install "pycorpdiff[notebooks]" # + jupyter / vl-convert
pip install "pycorpdiff[all]" # everything MIT-compatible
pip install "pycorpdiff[all,showcase]" # + pysofra (GPL-3.0-or-later) for the JAMA-style showcase

The base install's direct runtime dependencies are numpy, pandas,
scipy, and pyarrow; optional extras land per analytical layer so
you only pay for what you use. [showcase] is broken out separately
because pysofra is GPL-3.0-or-later; using pycorpdiff without
that extra remains MIT-only.
To work from source:
git clone https://github.com/jturner-uofl/pycorpdiff
cd pycorpdiff
pip install -e ".[dev]"
pytest -q

The math is checked against standard tools by automated tests. The fast tier runs on every push (matrix CI); the slow tier needs heavy optional dependencies (R + quanteda, NLTK, rpy2, Stanford SNAP downloads) and runs only on pushes to main.
Fast tier:
- Rayson's LL Wizard: hand-derived contingency-table reference triples (tests/integration/test_crossval_rayson.py)
Slow tier:
- NLTK BigramAssocMeasures: PMI + t-score agreement to ≤ 1e-12 on every adjacent bigram
- Scattertext (Kessler 2017): behavioural agreement on the 2012 US Conventions corpus
- HistWords (Hamilton et al. 2016): known-shifter / stable-word sanity check on Stanford SNAP COHA decade embeddings (skips gracefully if the archive isn't reachable)
If you use pycorpdiff in academic work, please cite the software via
the CITATION.cff file in this repository — GitHub renders a "Cite this
repository" widget directly from it.
MIT — see LICENSE.
- docs/design.md: three-layer architecture
- docs/statistical-methods.md: every metric's formula + citation
- examples/pycorpdiff_showcase.ipynb: full feature tour as a notebook
- docs/rendered/: static HTML renders for offline viewing