Legal Text Classification Project

📌 Introduction

A hybrid legal text classification system combining:

Rule-based keyword matching
Unsupervised machine learning (TF-IDF + K-means)
Dimensionality reduction (UMAP)

Motivation: As a pre-law student, I built this to make it faster to find relevant legal cases for research across different domains (family law, criminal law, etc.).

📂 Dataset

Source: Kaggle Legal Text Dataset
Contents: 24,985 Australian legal cases with:

Case ID (unique identifier)
Outcome (judicial decision)
Title (case name)
Text (full narrative)

🛠️ Methodology

1. Preprocessing Pipeline

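# Sample preprocessing steps (assumes NLTK with the punkt, wordnet, and stopwords data downloaded)
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # Set lookup is faster than a list
text = text.lower()                   # `text` holds the raw case narrative
text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
tokens = word_tokenize(text)          # Tokenization
lemmatized = [lemmatizer.lemmatize(t) for t in tokens]
clean_text = [t for t in lemmatized if t not in stop_words]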

2. Hybrid Classification

Rule-Based Keyword Matching:

legal_categories = {
    'family': ['custody', 'divorce', 'marriage', ...],
    'criminal': ['theft', 'murder', 'fraud', ...],
    # 8 other domains...
}
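A minimal sketch of how a keyword dictionary like the one above could be applied to a single case. The function name `match_category` and the rule of picking the category with the most keyword hits are assumptions for illustration, not the project's confirmed logic; the keyword lists are abbreviated to the terms shown above.

```python
import re

# Abbreviated keyword lists copied from the dictionary above;
# the full project dictionary covers more domains and terms.
legal_categories = {
    "family": ["custody", "divorce", "marriage"],
    "criminal": ["theft", "murder", "fraud"],
}

def match_category(text, categories=legal_categories):
    """Return the category with the most keyword hits, or None if nothing matches.

    Hypothetical helper: ties are resolved by dictionary order.
    """
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        name: sum(tokens.count(word) for word in words)
        for name, words in categories.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(match_category("The parties dispute custody after the divorce."))  # family
print(match_category("Appeal against a migration tribunal decision."))   # None
```

Cases that come back as `None` would presumably be the ones handed to the unsupervised stage.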

Unsupervised Learning:

TF-IDF Vectorization (n-grams of 1 to 3 words)
UMAP Dimensionality Reduction (n_components=20)
K-means Clustering (k=19, selected by silhouette score)
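The sketch below illustrates the vectorize-then-cluster idea with scikit-learn, choosing k by the highest silhouette score. It is an approximation under stated assumptions: the toy corpus, the helper `pick_k`, and the candidate range are hypothetical, and the UMAP step is left out to keep the example to scikit-learn only (in the pipeline listed above it would sit between vectorization and clustering).

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy stand-in corpus (hypothetical); the real input is the preprocessed case text.
docs = [
    "custody divorce marriage child parent",
    "divorce marriage settlement spouse custody",
    "marriage custody child support divorce",
    "theft murder fraud sentence offender",
    "fraud theft conviction sentence appeal",
    "murder conviction offender sentence theft",
    "contract agreement party ltd breach",
    "agreement contract ltd party damages",
    "breach contract damages agreement party",
]

def pick_k(features, candidates, seed=0):
    """Return (best_k, best_score, labels), maximising the silhouette score."""
    best = None
    for k in candidates:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(features)
        score = silhouette_score(features, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best

# Unigrams through trigrams, matching the n-gram range listed above.
features = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(docs)
best_k, best_score, labels = pick_k(features, range(2, 6))
print(best_k, round(float(best_score), 3))
```

On the full dataset the candidate range would need to be much wider to reach a value such as k=19.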

3. Performance Metrics

| Stage | Silhouette Score | Calinski-Harabasz Index |
|---|---|---|
| Initial | 0.036 | 159.75 |
| After UMAP | 0.36 | 7607.45 |
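For reference, both metrics in the table are available in scikit-learn. The sketch below computes them on synthetic data; the helper name `cluster_quality` and the `make_blobs` data are assumptions for illustration and do not reproduce the project's numbers.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

def cluster_quality(features, labels):
    """Return both cluster-quality metrics for one labelling (hypothetical helper)."""
    return {
        "silhouette": silhouette_score(features, labels),                # range [-1, 1], higher is better
        "calinski_harabasz": calinski_harabasz_score(features, labels),  # unbounded, higher is better
    }

# Synthetic 20-dimensional points standing in for the reduced document vectors.
features, _ = make_blobs(n_samples=300, centers=4, n_features=20, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

quality = cluster_quality(features, labels)
print(quality)
```

Calling the same helper on the features before and after dimensionality reduction gives a before/after comparison in the style of the table.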

📊 Results

Cluster Analysis

Top Business Law Terms:

['case', 'court', 'agreement', 'party', 'contract', 'ltd']
['fca', 'decision', 'applicant', 'immigration', 'tribunal']

🔍 Key Findings

Initial keyword matching alone categorized 60% of cases
Silhouette score improved tenfold after UMAP reduction (0.036 to 0.36); the Calinski-Harabasz index rose from 159.75 to 7607.45
Business and financial law cases remained hard to separate

📝 Discussion

Advantages

✅ Efficient - Quick categorization of obvious cases
✅ Transparent - Clear reasoning for classifications
✅ Adaptable - Expandable keyword dictionary

Limitations

⚠️ Keyword dependence - May miss niche terminology
⚠️ Cluster overlap - Some domain boundaries unclear
⚠️ Resource constraints - BERT implementation limited by available hardware

🌟 Future Work

Web interface for legal researchers
Dynamic keyword expansion
BERT fine-tuning with better hardware
Hierarchical classification system

👩‍💻 Contributor

Yoshita Aligina
Rutgers University CS 2026

📚 References

Nghiem et al. (2022) - Transformer-based legal classification (https://aclanthology.org/2022.lrec-1.504.pdf)
UMAP Documentation
HuggingFace Transformers
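GeeksforGeeks - Removing stop words with NLTK in Python (https://www.geeksforgeeks.org/removing-stop-words-nltk-python/)
Stanford CoreNLP - Lemmatization annotator documentation (https://stanfordnlp.github.io/CoreNLP/lemma.html)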
