An interactive 3D visualization for analyzing and characterizing lexical semantic change across corpora using topic modeling.
- Lexical Semantic Change Detection: Quantify how word meanings shift between two corpora using contextual embeddings
- Interactive Visualization: 3D visualization of word trajectories and topic clusters in PCA space over time
- Topic Modeling: Automatically discover topic clusters in document collections using Top2Vec
- Multi-Model Support: Evaluate semantic shift across different transformer models (XL-Lexeme, all-mpnet-base-v2, XLM-RoBERTa, etc.)
- Layer-wise Analysis: Compute and compare change metrics across different transformer layers
- Caching System: Efficient embedding caching to avoid recomputation across runs
- Scoring Functions: Multiple metrics for quantifying semantic change (APD, PRT)
- Python 3.10 <= version <= 3.12.10
sudo apt update
sudo apt-get install python3.12 python3.12-venv
- Python 3.13+ can be used if you install Microsoft Visual C++ 14.0 or greater
The visualization tool analyzes a pair of corpora and displays word trajectories + topic clusters in 3D space.
- Download the latest release
- Unzip the folder
- Download the sample cache files HERE
- Place all downloaded cache files into the `cache` directory
- If on Windows:
  - Open the `scripts` directory
  - Double-click install.bat and allow it to finish installation
  - Go back to the `TASC` directory
  - Double-click run.bat
- If on Linux:
chmod +x install.sh run.sh
./install.sh
./run.sh
- If on macOS:
  - Open the `scripts` directory
  - Double-click tasc_install.app and allow it to finish installation
  - Go back to the `TASC` directory
  - Double-click tasc_run.app
You only need to install once. Afterwards, you can simply double-click your Run executable to start the app!
Visit http://localhost:9000 to explore:
- Left panel: Select words to visualize
- Center: 3D plot showing word trajectories (corpus1 → corpus2) and topic clusters
- Right panel: Topic information and top words per cluster
- Bottom: Example sentences for the selected word
The release comes with a sample dataset, the SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection benchmark.
If you would like to analyze your own datasets/diachronic corpora in the tool, see Using Your Own Data.
Evaluate how word meanings change between two corpora using different models:
score [first corpus path] \
[second corpus path] \
--gold [optional, graded scores for comparison] \
--models [models to use, can add 1 or multiple]
# example
score corpora/semeval2020_ulscd_eng/corpus1 \
corpora/semeval2020_ulscd_eng/corpus2 \
--gold corpora/semeval2020_ulscd_eng/truth.csv \
--models pierluigic/xl-lexeme sentence-transformers/all-mpnet-base-v2
Output: `eval_results.csv`, containing Spearman correlations for each model/layer combination
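The Spearman correlation reported for each model/layer combination compares predicted change scores with the gold graded scores by rank. The following is a minimal pure-Python sketch of that statistic; the function names and the toy scores are illustrative assumptions, not TASC's actual implementation (which may well rely on a library routine instead).

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(predicted, gold):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(predicted), average_ranks(gold)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four target words (not real results).
predicted = [0.41, 0.10, 0.77, 0.25]
gold = [0.50, 0.05, 0.90, 0.60]
rho = spearman(predicted, gold)
```

A value near 1 means the model orders the target words by degree of change in nearly the same way as the gold annotations. The sketch does not guard against constant inputs, where the correlation is undefined.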
Supported Models:
- `sentence-transformers/all-mpnet-base-v2` (English, 768-dim)
- `pierluigic/xl-lexeme` (multilingual, 512-dim)
- `FacebookAI/roberta-base` (RoBERTa base)
- Any HuggingFace transformer model
TASC generates embeddings from the source corpora and stores them in a cache for efficient reuse. During the initial run, embeddings are computed from the corpus files and saved to the cache directory. Subsequent runs will load the cached embeddings automatically, significantly reducing processing time. Embeddings will only be regenerated if the corresponding cache files are removed or renamed.
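The load-or-compute behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: it stores data as JSON from the standard library rather than TASC's compressed NPZ files, and `load_or_compute` and `fake_embed` are hypothetical names that do not come from the TASC codebase.

```python
import json
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return cached data if the file exists; otherwise compute and store it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []


def fake_embed():
    """Stand-in for an expensive embedding pass; records each invocation."""
    calls.append(1)
    return {"word1": [0.1, 0.2], "word2": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "demo.json"
    first = load_or_compute(path, fake_embed)   # computes and writes the cache
    second = load_or_compute(path, fake_embed)  # served from the cache file
```

Because the only freshness check is whether the file exists, removing or renaming the cache file is what triggers recomputation, which matches the behaviour described in the paragraph.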
Corpus Format: Plain text files, one sentence per line
The quick brown fox jumps over the lazy dog.
Machine learning is transforming technology.
...
Gold Standard Format: Tab-separated CSV with columns lemma and change_graded
lemma change_graded
word1 0.85
word2 0.12
...
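As a rough illustration of the two formats, the sketch below parses a gold file and a corpus file using only the standard library. The helper names are hypothetical, the inputs are in-memory stand-ins for real files, and skipping blank lines is an assumption rather than documented TASC behaviour.

```python
import csv
import io


def read_gold(handle):
    """Parse a tab-separated gold file into {lemma: graded change score}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["lemma"]: float(row["change_graded"]) for row in reader}


def read_sentences(handle):
    """Yield non-empty lines from a one-sentence-per-line corpus file."""
    for line in handle:
        sentence = line.strip()
        if sentence:
            yield sentence


# In-memory stand-ins for real files so the example runs anywhere.
gold_text = "lemma\tchange_graded\nword1\t0.85\nword2\t0.12\n"
corpus_text = (
    "The quick brown fox jumps over the lazy dog.\n"
    "\n"
    "Machine learning is transforming technology.\n"
)

gold = read_gold(io.StringIO(gold_text))
sentences = list(read_sentences(io.StringIO(corpus_text)))
```

With real data, the same functions can be passed a file opened with `open(path, encoding="utf-8", newline="")`.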
- Prepare the following files:
  - Text file(s) composing the first corpus
  - Text file(s) composing the second corpus
  - (Optional) A CSV file named `truth.csv` containing at least one column named `lemma`. When provided, TASC will only create embeddings for and visualize the listed lemmas. This is useful when analyzing a predefined set of target terms or when you want to speed up the program.
- TASC expects a directory structure within the `corpora` directory similar to the following:
corpora/
├── datasets1/
│   ├── corpus1/
│   ├── corpus2/
│   └── ...
├── datasets2/
│   ├── corpus1/
│   ├── corpus2/
│   ├── truth.csv   # optional!
│   └── ...
└── ...
Each corpus should reside in its own dedicated subdirectory within its dataset directory.
- Create the required directory structure if it does not already exist (the release comes with a sample dataset with the correct structure for your reference)
- Place your corpus files in the appropriate locations
- Update the `CORPUS1`, `CORPUS2`, and `TERMS_FILE` (if applicable) variables in `tasc/config.py` to point to the corresponding file paths
# if files are in a directory called "datasets2"
# change variables in config.py to ->
CORPUS1 = str(CORPORA_DIR / "datasets2" / "corpus1")
CORPUS2 = str(CORPORA_DIR / "datasets2" / "corpus2")
TERMS_FILE = str(CORPORA_DIR / "datasets2" / "truth.csv")
Multiple diachronic datasets may be stored within the `corpora` directory for convenient organization and reuse. Ensure that each dataset directory has a unique name, as cache identifiers are derived from these names. When switching to a different dataset, repeat the setup procedure above and update the configuration paths in `config.py`.
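Before pointing the configuration at a new dataset, it can help to confirm that the layout matches what is described above. The following is a small sketch of such a check; `check_dataset` is a hypothetical helper, not part of TASC, and it assumes corpus files use the `.txt` extension as in the sample dataset. Since `truth.csv` is optional, it is not checked.

```python
import tempfile
from pathlib import Path


def check_dataset(corpora_dir, name):
    """Return a list of problems found in corpora_dir/name (empty if none)."""
    root = Path(corpora_dir) / name
    problems = []
    for sub in ("corpus1", "corpus2"):
        folder = root / sub
        if not folder.is_dir():
            problems.append(f"missing directory: {folder}")
        elif not any(folder.glob("*.txt")):
            problems.append(f"no .txt files in: {folder}")
    return problems


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "datasets2"
    (dataset / "corpus1").mkdir(parents=True)
    (dataset / "corpus1" / "part1.txt").write_text("A sentence.\n", encoding="utf-8")
    (dataset / "corpus2").mkdir()
    problems = check_dataset(tmp, "datasets2")  # corpus2 exists but is empty
```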
Embeddings are cached as compressed NPZ files to avoid recomputation:
datasets2_c1_sentence-transformers_all-mpnet-base-v2_L8.npz
datasets2_c2_sentence-transformers_all-mpnet-base-v2_L8.npz
top2vec_datasets2.pkl
Delete ALL cache files with the same label as your dataset (e.g., `datasets2`) to force recomputation.
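The naming pattern in the listing above suggests how a cache file name is derived and how the files for one dataset label could be collected before deleting them. The sketch below is an assumption based only on those example names (for instance, that `/` in the model name becomes `_`); both helper functions are hypothetical and not taken from the TASC codebase.

```python
import re
import tempfile
from pathlib import Path


def embedding_cache_name(label, corpus_index, model, layer):
    """Build an embedding cache file name in the style of the listing above."""
    safe_model = model.replace("/", "_")  # assumption: "/" is mapped to "_"
    return f"{label}_c{corpus_index}_{safe_model}_L{layer}.npz"


def cache_files_for(cache_dir, label):
    """Return the cache files that carry the given dataset label, sorted."""
    pattern = re.compile(rf"{re.escape(label)}_c\d+_.+\.npz")
    wanted = {f"top2vec_{label}.pkl"}
    return sorted(
        p for p in Path(cache_dir).iterdir()
        if p.name in wanted or pattern.fullmatch(p.name)
    )


name = embedding_cache_name(
    "datasets2", 1, "sentence-transformers/all-mpnet-base-v2", 8
)

with tempfile.TemporaryDirectory() as tmp:
    for fname in (name, "top2vec_datasets2.pkl", "datasets20_c1_demo_L8.npz"):
        (Path(tmp) / fname).touch()
    # Only the two "datasets2" files are selected; "datasets20" is left alone.
    to_delete = [p.name for p in cache_files_for(tmp, "datasets2")]
```

Matching on the full label followed by `_c<digits>_` avoids accidentally selecting files from a dataset whose name merely starts with the same characters.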
- Embedding computation: ~10-50 min per corpus (depending on model & corpus size)
- First run includes transformation & PCA fitting; uses cached results on subsequent runs
- Memory: Typical usage 4-8GB for mid-size corpora; decrease the `batch_size` parameter if memory-constrained
- GPU: Significantly faster; ensure CUDA is available via `torch.cuda.is_available()`
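A quick way to see which device would be used is sketched below. `pick_device` is a hypothetical helper, not a TASC function; it falls back to the CPU when PyTorch is missing or reports no usable GPU.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Embeddings would be computed on: {device}")
```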
.
├── LSC/                              # LSC analysis library
│   ├── pyproject.toml                # Package metadata & dependencies
│   └── lsc/                          # Installable package (import as 'lsc')
│       ├── extraction/               # Word candidate extraction & corpus processing
│       │   ├── word_extractor.py     # Extract common words, filter by POS
│       │   └── word_cache.py         # NPZ caching for extracted words
│       ├── representation/           # Embedding generation & caching
│       │   ├── embedding_creator.py  # Contextual word embeddings from transformers
│       │   ├── embed_cache.py        # Efficient NPZ-based caching
│       │   └── models.py             # Data structures (TermSummary)
│       ├── assessment/               # Semantic change scoring
│       │   └── scoring.py            # Evaluate model/layer combinations
│       ├── utils.py                  # Shared utilities (metrics, normalization)
│       └── config.py                 # Configuration & paths
│
├── Top2Vec/                          # Modified Top2Vec fork (installable)
│   ├── setup.py
│   └── top2vec/
│       └── Top2Vec.py
│
├── tasc/                             # Interactive visualization app
│   ├── cli.py                        # Entry point (tasc run / tasc dev)
│   ├── backend/                      # FastAPI server
│   │   ├── app/
│   │   │   ├── main.py               # API endpoints & lifespan
│   │   │   ├── core.py               # Data loading & PCA fitting
│   │   │   ├── topic.py              # Top2Vec topic extraction
│   │   │   └── config.py             # Backend configuration
│   │   ├── run.py                    # Production server runner
│   │   └── dev.py                    # Development server runner
│   └── frontend/                     # React visualization
│       ├── src/
│       │   ├── App.jsx               # Root component
│       │   ├── api.js                # Backend API client
│       │   ├── Plotly.jsx            # Plotly integration
│       │   └── components/
│       │       ├── PlotCanvas.jsx    # 3D scatter plot
│       │       ├── WordList.jsx      # Word selection
│       │       ├── TopicList.jsx     # Topic display
│       │       └── OccurrenceBar.jsx # Sentence examples
│       └── package.json
│
├── corpora/                          # Datasets
│   └── semeval2020_ulscd_eng/        # Sample data: SemEval2020 Task 1 English
│       ├── corpus1/*.txt             # First corpus files
│       ├── corpus2/*.txt             # Second corpus files
│       └── truth.csv                 # Graded scores
│
├── pyproject.toml                    # Project metadata, dependencies & CLI entry points
├── install.py                        # Automated installer
├── Install.bat                       # Windows: run to install
└── Run.bat                           # Windows: run to start
The framework supports multiple metrics for measuring semantic change:
- APD (Average Pairwise Distance): Mean pairwise distance between embeddings across corpora
- PRT (Inverted Similarity over Prototype Distance): Inverse of cosine similarity between mean embeddings
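To make the two definitions concrete, here is a minimal pure-Python sketch. It assumes that APD uses cosine distance and that "inverse" in PRT means the reciprocal of the cosine similarity between the mean (prototype) vectors; the function names and toy vectors are illustrative and not taken from the `lsc` package, which may compute these differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def apd(emb1, emb2):
    """Average pairwise cosine distance over every cross-corpus pair."""
    distances = [1.0 - cosine_similarity(u, v) for u in emb1 for v in emb2]
    return sum(distances) / len(distances)


def mean_vector(vectors):
    """Component-wise mean of a list of vectors (the 'prototype')."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def prt(emb1, emb2):
    """Reciprocal of the cosine similarity between the two prototypes."""
    return 1.0 / cosine_similarity(mean_vector(emb1), mean_vector(emb2))


# Toy 2-D "embeddings" of one word in each corpus (made-up values).
corpus1_vectors = [[1.0, 0.0], [1.0, 0.0]]
corpus2_vectors = [[0.0, 1.0], [1.0, 0.0]]

apd_score = apd(corpus1_vectors, corpus2_vectors)
prt_score = prt(corpus1_vectors, corpus2_vectors)
```

Both scores grow as the usages in the two corpora drift apart: APD looks at all individual usage pairs, while PRT compares a single averaged representation per corpus.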
MIT License (see LICENCE.md)
# clone repository
git clone https://github.com/tobiia/TASC.git
cd TASC
# create virtual environment
py -3.12 -m venv .venv          # Windows (on Linux/macOS: python3.12 -m venv .venv)
.venv\Scripts\activate          # Windows
# source .venv/bin/activate     # Linux/macOS
# install dependencies and package
python -m pip install --upgrade pip
pip install lib/LSC
pip install lib/Top2Vec
pip install -e .
python -m spacy download en_core_web_sm
# start dev server
tasc dev
If you clone the repo using git, you will need to download the sample dataset separately, as it is too large to upload to GitHub.
To use the sample dataset:
- Download the sample cache files HERE
- Create a directory named `cache` in the project root
- Place all downloaded cache files into the `cache` directory
# download release
# open scripts/
# click dev_install.bat --> creates venv + installs all dependencies
# start dev server
tasc dev