TASC: Topic-Aware Semantic Change

An interactive 3D visualization for analyzing and characterizing lexical semantic change across corpora using topic modeling.

Features

Lexical Semantic Change Detection: Quantify how word meanings shift between two corpora using contextual embeddings
Interactive Visualization: 3D visualization of word trajectories and topic clusters in PCA space over time
Topic Modeling: Automatically discover topic clusters in document collections using Top2Vec
Multi-Model Support: Evaluate semantic shift across different transformer models (XL-Lexeme, all-mpnet-base-v2, XLM-RoBERTa, etc.)
Layer-wise Analysis: Compute and compare change metrics across different transformer layers
Caching System: Efficient embedding caching to avoid recomputation across runs
Scoring Functions: Multiple metrics for quantifying semantic change (APD, PRT)

Requirements

Python 3.10 to 3.12.10 (inclusive)
- Direct download links for Python 3.12.10: Windows, Mac
- For Python 3.12.10 on Linux, run:

sudo apt update
sudo apt-get install python3.12 python3.12-venv

Python 3.13+ can also be used if you install Microsoft Visual C++ 14.0 or greater.

Usage

1. Topical Semantic Change (Interactive Visualization)

The visualization tool analyzes a pair of corpora and displays word trajectories and topic clusters in 3D space.

Quick Installation with Sample Data

Download the latest release
Unzip the folder
Download the sample cache files HERE
Place all downloaded cache files into the cache directory
If on Windows:
- Open scripts directory
- Double-click install.bat and allow it to finish installation
- Go back to the TASC directory
- Double-click run.bat
If on Linux:

chmod +x install.sh run.sh
./install.sh
./run.sh

If on macOS:
- Open scripts directory
- Double-click tasc_install.app and allow it to finish installation
- Go back to the TASC directory
- Double-click tasc_run.app

You only need to install once. Afterwards, you can simply double-click your Run executable to start the app!

Visit http://localhost:9000 to explore:

Left panel: Select words to visualize
Center: 3D plot showing word trajectories (corpus1 → corpus2) and topic clusters
Right panel: Topic information and top words per cluster
Bottom: Example sentences for the selected word

The release comes with a sample dataset, the SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection benchmark.

If you would like to analyze your own datasets/diachronic corpora in the tool, see Using Your Own Data.

2. Lexical Semantic Change Analysis Framework

Evaluate how word meanings change between two corpora using different models:

score [first corpus path] \
[second corpus path] \
--gold [optional, graded scores for comparison] \
--models [models to use, can add 1 or multiple]

# example
score corpora/semeval2020_ulscd_eng/corpus1 \
  corpora/semeval2020_ulscd_eng/corpus2 \
  --gold corpora/semeval2020_ulscd_eng/truth.csv \
  --models pierluigic/xl-lexeme sentence-transformers/all-mpnet-base-v2

Output: eval_results.csv containing Spearman correlations for each model/layer combination
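
The Spearman correlation reported in eval_results.csv measures how closely a model's predicted change scores rank the target words in the same order as the gold scores. A minimal, dependency-free sketch of that computation follows. The function names and the example scores are illustrative and not taken from the TASC code, which may well rely on a library routine instead.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(predicted, gold):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    rx, ry = average_ranks(predicted), average_ranks(gold)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four target words (made-up numbers).
predicted = [0.9, 0.2, 0.5, 0.4]
gold = [0.85, 0.12, 0.30, 0.40]
rho = spearman(predicted, gold)
print(f"Spearman rho: {rho:.3f}")
```

A value near 1 means the model orders the words by amount of change much as the gold annotation does; a value near 0 means the two orderings are unrelated.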

Supported Models:

sentence-transformers/all-mpnet-base-v2 (English, 768-dim)
pierluigic/xl-lexeme (Multilingual, 512-dim)
FacebookAI/roberta-base (RoBERTa base)
Any HuggingFace transformer model

Data

TASC generates embeddings from the source corpora and stores them in a cache for efficient reuse. During the initial run, embeddings are computed from the corpus files and saved to the cache directory. Subsequent runs will load the cached embeddings automatically, significantly reducing processing time. Embeddings will only be regenerated if the corresponding cache files are removed or renamed.
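
The compute-once, load-afterwards behaviour described above can be sketched as follows. This only illustrates the pattern: load_or_compute, the pickle format, and the file name are hypothetical stand-ins, since TASC itself stores embeddings as NPZ files.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_file, compute):
    """Return the cached result if the file exists; otherwise compute and save it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def expensive_embedding_step():
    """Stand-in for embedding computation; records each real invocation."""
    calls.append(1)
    return {"word1": [0.1, 0.2, 0.3]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "demo_embeddings.pkl"
    first = load_or_compute(path, expensive_embedding_step)   # computes and saves
    second = load_or_compute(path, expensive_embedding_step)  # loads from disk
    path.unlink()                                             # removing the file...
    third = load_or_compute(path, expensive_embedding_step)   # ...forces recomputation

print(f"compute step ran {len(calls)} times for 3 requests")
```

Because the cache is keyed only by file name in this sketch, changing the corpus contents without removing the file would return stale results, which matches the note above that regeneration happens only when cache files are removed or renamed.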

Data Format

Corpus Format: Plain text files, one sentence per line

The quick brown fox jumps over the lazy dog.
Machine learning is transforming technology.
...
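
A corpus in this format can be read with a few lines of standard-library Python. The sketch below assumes UTF-8 .txt files inside the corpus directory and skips blank lines; iter_sentences is an illustrative helper, not part of TASC.

```python
import tempfile
from pathlib import Path


def iter_sentences(corpus_dir):
    """Yield one stripped sentence per non-empty line from every .txt file."""
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                sentence = line.strip()
                if sentence:
                    yield sentence


# Build a tiny throwaway corpus to demonstrate the reader.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus1"
    corpus.mkdir()
    (corpus / "part1.txt").write_text(
        "The quick brown fox jumps over the lazy dog.\n\n"
        "Machine learning is transforming technology.\n",
        encoding="utf-8",
    )
    sentences = list(iter_sentences(corpus))

print(len(sentences), "sentences read")
```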

Gold Standard Format: Tab-separated CSV with columns lemma and change_graded

lemma	change_graded
word1	0.85
word2	0.12
...
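
Because the gold file is tab-separated despite its .csv extension, a reader needs an explicit tab delimiter. A minimal sketch using the standard csv module is shown here; read_gold is an illustrative name and the sample rows simply mirror the layout above.

```python
import csv
import io


def read_gold(handle):
    """Parse a tab-separated gold file into {lemma: graded change score}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["lemma"]: float(row["change_graded"]) for row in reader}


# In-memory sample mirroring the layout above; real use would pass an open file,
# e.g. open("truth.csv", newline="", encoding="utf-8").
sample = "lemma\tchange_graded\nword1\t0.85\nword2\t0.12\n"
gold = read_gold(io.StringIO(sample))
print(gold)
```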

Using Your Own Data

Prepare the following files:

Text file(s) composing the first corpus
Text file(s) composing the second corpus
(Optional) A CSV file named truth.csv containing at least one column named lemma. When provided, TASC will only create embeddings for, and visualize, the listed lemmas. This is useful when analyzing a predefined set of target terms or when you want to speed up the program.

TASC expects a directory structure within the corpora directory similar to the following:

corpora/
  datasets1/
  ├── corpus1/
  ├── corpus2/
  └── ...
  datasets2/
  ├── corpus1/
  ├── corpus2/
  ├── truth.csv # optional!
  └── ...
  ...

Each corpus should reside in its own dedicated subdirectory within its dataset directory.

Create the required directory structure if it does not already exist (the release comes with a sample dataset with the correct structure for your reference)
Place your corpus files in the appropriate locations
Update the CORPUS1, CORPUS2, and TERMS_FILE (if applicable) variables in tasc/config.py to point to the corresponding file paths

# if files are in a directory called "datasets2"
# change variables in config.py to ->
CORPUS1 = str(CORPORA_DIR / "datasets2" / "corpus1")
CORPUS2 = str(CORPORA_DIR / "datasets2" / "corpus2")
TERMS_FILE = str(CORPORA_DIR / "datasets2" / "truth.csv")

Multiple diachronic datasets may be stored within the corpora directory for convenient organization and reuse. Ensure that each dataset directory has a unique name, as cache identifiers are derived from these names. When switching to a different dataset, repeat the setup procedure above and update the configuration paths in config.py.
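
Before pointing config.py at a new dataset, it can help to confirm that the directory layout matches the structure described above. The following is a small sketch of such a check; check_dataset is a hypothetical helper, not something TASC ships, and it only verifies that corpus1 and corpus2 exist and contain at least one file.

```python
import tempfile
from pathlib import Path


def check_dataset(corpora_dir, dataset_name):
    """Return a list of problems found in corpora/<dataset_name>/; empty means OK."""
    dataset = Path(corpora_dir) / dataset_name
    problems = []
    for corpus in ("corpus1", "corpus2"):
        corpus_dir = dataset / corpus
        if not corpus_dir.is_dir():
            problems.append(f"missing directory: {corpus_dir}")
        elif not any(p.is_file() for p in corpus_dir.iterdir()):
            problems.append(f"no corpus files in: {corpus_dir}")
    # truth.csv is optional, so its absence is not reported as a problem.
    return problems


# Demonstration on a throwaway layout where corpus2 was left empty.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "corpora"
    (root / "datasets2" / "corpus1").mkdir(parents=True)
    (root / "datasets2" / "corpus1" / "a.txt").write_text("A sentence.\n", encoding="utf-8")
    (root / "datasets2" / "corpus2").mkdir()
    problems = check_dataset(root, "datasets2")

for problem in problems:
    print(problem)
```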

Caching

Embeddings are cached as compressed NPZ files to avoid recomputation:

datasets2_c1_sentence-transformers_all-mpnet-base-v2_L8.npz
datasets2_c2_sentence-transformers_all-mpnet-base-v2_L8.npz
top2vec_datasets2.pkl

To force recomputation, delete ALL cache files with the same label as your dataset (e.g., datasets2).
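
Listing the matching files before deleting anything is a cheap safeguard. The sketch below assumes, based only on the example file names above, that embedding caches are named <label>_*.npz and the topic model cache is named top2vec_<label>.pkl; cache_files_for is an illustrative helper and the real naming scheme may differ.

```python
import tempfile
from pathlib import Path


def cache_files_for(cache_dir, label):
    """List cache files whose names follow the patterns shown above for `label`."""
    cache_dir = Path(cache_dir)
    matches = set(cache_dir.glob(f"{label}_*.npz"))
    matches.update(cache_dir.glob(f"top2vec_{label}.pkl"))
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    for name in (
        "datasets2_c1_sentence-transformers_all-mpnet-base-v2_L8.npz",
        "top2vec_datasets2.pkl",
        "datasets1_c1_sentence-transformers_all-mpnet-base-v2_L8.npz",
    ):
        (cache / name).touch()

    found = [p.name for p in cache_files_for(cache, "datasets2")]
    # Review `found` first; deleting would then be:
    #     for p in cache_files_for(cache, "datasets2"): p.unlink()

print(found)
```

Note that a prefix match like this would also pick up a dataset whose label merely starts with the same text followed by an underscore, which is one more reason to inspect the list before removing files.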

Performance Notes

Embedding computation: ~10-50 min per corpus (depending on model & corpus size)
First run includes transformation & PCA fitting; uses cached results on subsequent runs
Memory: Typical usage is 4-8 GB for mid-size corpora; decrease the batch_size parameter if memory-constrained
GPU: Significantly faster; ensure CUDA is available via torch.cuda.is_available()
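
The GPU check mentioned above can be run as a short script. This sketch falls back to the CPU when PyTorch is missing or reports no CUDA device; pick_device is an illustrative name and says nothing about how TASC selects its device internally.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Embeddings would be computed on: {device}")
```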

Project Structure (OUTDATED)

.
├── LSC/                              # LSC analysis library
│   ├── pyproject.toml                # Package metadata & dependencies
│   └── lsc/                          # Installable package (import as 'lsc')
│       ├── extraction/               # Word candidate extraction & corpus processing
│       │   ├── word_extractor.py     # Extract common words, filter by POS
│       │   └── word_cache.py         # NPZ caching for extracted words
│       ├── representation/           # Embedding generation & caching
│       │   ├── embedding_creator.py  # Contextual word embeddings from transformers
│       │   ├── embed_cache.py        # Efficient NPZ-based caching
│       │   └── models.py             # Data structures (TermSummary)
│       ├── assessment/               # Semantic change scoring
│       │   └── scoring.py            # Evaluate model/layer combinations
│       ├── utils.py                  # Shared utilities (metrics, normalization)
│       └── config.py                 # Configuration & paths
│
├── Top2Vec/                          # Modified Top2Vec fork (installable)
│   ├── setup.py
│   └── top2vec/
│       └── Top2Vec.py
│
├── tasc/                             # Interactive visualization app
│   ├── cli.py                        # Entry point (tasc run / tasc dev)
│   ├── backend/                      # FastAPI server
│   │   ├── app/
│   │   │   ├── main.py               # API endpoints & lifespan
│   │   │   ├── core.py               # Data loading & PCA fitting
│   │   │   ├── topic.py              # Top2Vec topic extraction
│   │   │   └── config.py             # Backend configuration
│   │   ├── run.py                    # Production server runner
│   │   └── dev.py                    # Development server runner
│   └── frontend/                     # React visualization
│       ├── src/
│       │   ├── App.jsx               # Root component
│       │   ├── api.js                # Backend API client
│       │   ├── Plotly.jsx            # Plotly integration
│       │   └── components/
│       │       ├── PlotCanvas.jsx    # 3D scatter plot
│       │       ├── WordList.jsx      # Word selection
│       │       ├── TopicList.jsx     # Topic display
│       │       └── OccurrenceBar.jsx # Sentence examples
│       └── package.json
│
├── corpora/                          # Datasets
│   └── semeval2020_ulscd_eng/        # Sample data: SemEval2020 Task 1 English
│       ├── corpus1/*.txt             # First corpus files
│       ├── corpus2/*.txt             # Second corpus files
│       └── truth.csv                 # Graded scores
│
├── pyproject.toml                    # Project metadata, dependencies & CLI entry points
├── install.py                        # Automated installer
├── Install.bat                       # Windows: run to install
└── Run.bat                           # Windows: run to start

Scoring Functions

The framework supports multiple metrics for measuring semantic change:

APD (Average Pairwise Distance): Mean pairwise distance between embeddings across corpora
PRT (Inverted Similarity over Prototype Distance): Inverse of cosine similarity between mean embeddings
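
To make the two definitions concrete, here is a small pure-Python sketch of both metrics. It assumes cosine distance for APD (the description above does not name the distance measure) and uses plain lists for clarity; the function names and toy vectors are illustrative, and an actual implementation would more likely be vectorized with NumPy.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def apd(embs1, embs2):
    """Average pairwise cosine distance over all cross-corpus embedding pairs."""
    distances = [1.0 - cosine_similarity(u, v) for u in embs1 for v in embs2]
    return sum(distances) / len(distances)


def prototype(embs):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(embs) for col in zip(*embs)]


def prt(embs1, embs2):
    """Inverse of the cosine similarity between the two corpus prototypes."""
    return 1.0 / cosine_similarity(prototype(embs1), prototype(embs2))


# Toy 2-D "embeddings" of one word in each corpus (made-up values).
corpus1_embs = [[1.0, 0.0], [1.0, 0.0]]
corpus2_embs = [[0.0, 1.0], [1.0, 1.0]]

apd_score = apd(corpus1_embs, corpus2_embs)
prt_score = prt(corpus1_embs, corpus2_embs)
print(f"APD = {apd_score:.4f}, PRT = {prt_score:.4f}")
```

For both metrics a larger value indicates a larger shift: APD grows as individual usages in the two corpora move apart, while PRT grows as the two averaged representations become less similar.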

License

MIT License (see LICENCE.md)

Contributing

Requirements

Python 3.10 to 3.12.10 (inclusive)
Git
Node.js

Contributor Installation

Copy from Git

# clone repository
git clone https://github.com/tobiia/TASC.git
cd TASC

# create virtual environment
py -3.12 -m venv .venv        # Windows (on Linux/macOS: python3.12 -m venv .venv)
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/macOS

# install dependencies and package
python -m pip install --upgrade pip
pip install lib/LSC
pip install lib/Top2Vec
pip install -e .
python -m spacy download en_core_web_sm

# start dev server
tasc dev

If you clone the repository with Git, you will need to download the sample dataset separately, as it is too large to upload to GitHub.

To use the sample dataset:

Download the sample cache files HERE
Create a directory named cache in the project root
Place all downloaded cache files into the cache directory

Use the release version

# download release
# open scripts/
# click dev_install.bat --> creates venv + installs all dependencies

# start dev server
tasc dev
