A hybrid legal text classification system combining:
- Rule-based keyword matching
- Unsupervised machine learning (TF-IDF + K-means)
- Dimensionality reduction (UMAP)
Motivation: As a pre-law student, I developed this to solve the challenge of efficiently finding relevant legal cases for research across different domains (family law, criminal law, etc.).
Source: Kaggle Legal Text Dataset
Contents: 24,985 Australian legal cases, each with:
- Case ID (unique identifier)
- Outcome (judicial decision)
- Title (case name)
- Text (full narrative)
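# Sample preprocessing steps
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer  # assumed lemmatizer; any object with .lemmatize() works
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

text = text.lower()
text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
tokens = word_tokenize(text)  # Tokenization
lemmatized = [lemmatizer.lemmatize(t) for t in tokens]
clean_text = [t for t in lemmatized if t not in stop_words]  # Drop stop words

legal_categories = {
    'family': ['custody', 'divorce', 'marriage', ...],
    'criminal': ['theft', 'murder', 'fraud', ...],
    # 8 other domains...
}

Unsupervised Learning: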
- TF-IDF vectorization (unigrams through trigrams)
- UMAP dimensionality reduction (n_components=20)
- K-means clustering (k=19, selected by silhouette score)
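
The three steps above could be wired together roughly as follows. This is a minimal sketch rather than the project's actual code: it assumes scikit-learn and the umap-learn package are installed, and the function names (`vectorize`, `reduce_dims`, `pick_k`), the cosine metric, and the fixed random seeds are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def vectorize(docs):
    """TF-IDF over unigrams, bigrams, and trigrams."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 3))
    return vectorizer.fit_transform(docs), vectorizer


def reduce_dims(X, n_components=20, seed=42):
    """Project the TF-IDF matrix down to n_components dimensions with UMAP."""
    import umap  # provided by the umap-learn package; imported lazily

    # metric="cosine" is an assumption; it is a common choice for TF-IDF vectors.
    reducer = umap.UMAP(n_components=n_components, metric="cosine", random_state=seed)
    return reducer.fit_transform(X)


def pick_k(X, candidates, seed=42):
    """Return (k, score) for the candidate k with the highest silhouette score."""
    best_k, best_score = None, -1.0
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

With the cleaned documents in a list `docs`, a run might look like `X, _ = vectorize(docs)`, then `X_low = reduce_dims(X)`, then `k, score = pick_k(X_low, range(2, 31))`. The candidate range for k is hypothetical.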
| Stage | Silhouette Score | Calinski-Harabasz |
|---|---|---|
| Initial | 0.036 | 159.75 |
| After UMAP | 0.36 | 7607.45 |
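
Both metrics in the table are available in scikit-learn. The sketch below shows one way they might be computed for a given clustering. The `cluster_quality` helper is a hypothetical name, and the synthetic blobs are stand-in data, so the printed values will not match the table.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def cluster_quality(X, labels):
    """Return both cluster-quality scores for a given labelling of X."""
    return {
        "silhouette": silhouette_score(X, labels),  # in [-1, 1], higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # unbounded, higher is better
    }


# Synthetic stand-in data: three well-separated 2-D blobs (not the legal corpus).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
scores = cluster_quality(X, labels)
print(scores)
```

Calling the same helper once on the raw TF-IDF matrix and once on the UMAP output would give a before/after comparison like the one in the table.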
Top Business Law Terms:
['case', 'court', 'agreement', 'party', 'contract', 'ltd']
['fca', 'decision', 'applicant', 'immigration', 'tribunal']
- Successfully categorized 60% of cases via initial keyword matching
- Cluster quality improved after UMAP reduction: silhouette score rose 10x (0.036 to 0.36) and Calinski-Harabasz rose about 48x (159.75 to 7607.45)
- Identified limitations in business/financial law separation
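
The keyword-matching pass mentioned in the first finding could look something like the sketch below. It is illustrative only: the `categorize` function, the scoring rule (count keyword hits, highest count wins), and the `'uncategorized'` fallback are assumptions, and the keyword lists contain only the terms shown in this README.

```python
from collections import Counter

# Truncated keyword lists; the full dictionary covers more terms and domains.
legal_categories = {
    "family": ["custody", "divorce", "marriage"],
    "criminal": ["theft", "murder", "fraud"],
}


def categorize(tokens, categories=legal_categories):
    """Return the category whose keywords occur most often, or 'uncategorized'."""
    counts = Counter(tokens)
    scores = {
        name: sum(counts[word] for word in keywords)
        for name, keywords in categories.items()
    }
    best = max(scores, key=scores.get)
    # No keyword hit: leave the case for the clustering stage.
    return best if scores[best] > 0 else "uncategorized"


print(categorize(["applicant", "sought", "custody", "after", "divorce"]))  # family
print(categorize(["appeal", "against", "tribunal", "decision"]))  # uncategorized
```

Cases that come back as `'uncategorized'` are the ones that would fall through to the unsupervised stage.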
✅ Efficient - Quick categorization of obvious cases
✅ Transparent - Clear reasoning for classifications
✅ Adaptable - Expandable keyword dictionary
- Web interface for legal researchers
- Dynamic keyword expansion
- BERT fine-tuning with better hardware
- Hierarchical classification system
Yoshita Aligina
Rutgers University CS 2026
- Nghiem et al. (2022) - Transformer-based legal classification (https://aclanthology.org/2022.lrec-1.504.pdf)
- UMAP Documentation
- HuggingFace Transformers
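- GeeksforGeeks - Removing stop words with NLTK in Python (https://www.geeksforgeeks.org/removing-stop-words-nltk-python/)
- Stanford CoreNLP - Lemmatization documentation (https://stanfordnlp.github.io/CoreNLP/lemma.html)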