🚀 AutoEIT: Automated Scoring for Spanish Elicited Imitation Tasks

Specific Test II Submission

AutoEIT is an end-to-end, reproducible NLP pipeline for automatically scoring Spanish Elicited Imitation Task (EIT) responses against target prompt sentences using a meaning-based rubric. Rather than depending purely on black-box LLMs, the system combines a deterministic Early Quality Gate, linguistically informed feature extraction, and an interpretable Heuristic Mathematical Scorer, with a lightweight ordinal ML baseline for comparison.

The project is designed to be:

  • Interpretable: feature contributions and thresholds are transparent
  • Efficient: obvious cases are resolved before expensive semantic inference
  • Reproducible: configuration-driven, modular, and research-friendly
  • Low-resource aware: built to perform well even with limited labeled data

🏆 Performance Summary

The AutoEIT pipeline predicts scores on a 0–4 rubric scale. Because the dataset is highly imbalanced (with a large majority of responses receiving score 4), model selection and evaluation are driven primarily by Quadratic Weighted Kappa (QWK) on a stratified validation split.

📂 Data Architecture & Evaluation Strategy

AutoEIT uses two datasets: a labeled historical set for tuning and validation, and an unlabeled holdout set that represents the production inference environment. Both datasets are proprietary and are not included in this repository.

  • Historical Tuning Set (1,560 labeled rows, 29 participants): This dataset exhibits a natural, real-world class imbalance (~75% of ground-truth scores are 4).
    • Data Cleaning: Participants 1, 2, and 3 were entirely dropped from this set due to missing transcriptions.
    • Weight Optimization: This set is used to derive the Heuristic Engine's scoring weights via grid search.
    • Baseline Validation: This set also establishes the baseline Quadratic Weighted Kappa (QWK) of 0.8187, computed with stratified K-fold cross-validation to account for class imbalance. A value in the 0.81 to 1.00 range indicates "almost perfect agreement" per Landis & Koch (1977).
  • Production Holdout Set (120 unlabeled rows, 4 participants): The actual target inference environment. Because this data lacks ground truth, our Training-Free Heuristic Engine (Approach A) is deployed here. While the thresholds were tuned on historical data, the scoring engine itself relies purely on semantic rules rather than learned ordinal classifications.

A/B Evaluation Results

| Metric | Approach A (Heuristic Scorer) | Approach B (Ordinal ML) |
| --- | --- | --- |
| Quadratic Weighted Kappa (QWK) | 0.8187 | 0.7956 |
| Accuracy | 76.92% | 69.55% |
| Macro F1-Score | 0.5854 | 0.4901 |

Conclusion: The interpretable heuristic approach outperformed the ordinal ML baseline on all three validation metrics (QWK 0.8187 vs. 0.7956, accuracy 76.92% vs. 69.55%, macro F1 0.5854 vs. 0.4901) while remaining fully transparent and easy to audit.


🧠 System Architecture

The pipeline follows a staged architecture designed for both efficiency and rubric alignment.

1. Text Cleaning & Normalization

Input transcriptions are standardized to reduce noise before scoring.

This includes:

  • lowercasing
  • punctuation normalization
  • accent normalization
  • removal of transcription artifacts such as pauses, cough markers, and gibberish tags
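The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration, not the repository's `preprocessor.py`: the function names and regular expressions are hypothetical, and the marker inventory (`[pause]`, `xxx`, `cor-`) is taken only from the examples mentioned elsewhere in this README.

```python
import re
import unicodedata

# Hypothetical artifact patterns; the pipeline's real marker list may differ.
BRACKET_TAG = re.compile(r"\[[^\]]*\]")       # bracketed markers such as [pause], [laugh]
GIBBERISH_TAG = re.compile(r"\bx{2,}\b")      # gibberish placeholders such as xxx
PARTIAL_WORD = re.compile(r"\b\w+-(?=\s|$)")  # abandoned word attempts such as cor-


def strip_accents(text: str) -> str:
    """Remove combining marks after NFD decomposition (note: this also maps ñ to n)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def clean_transcript(text: str) -> str:
    """Lowercase, drop transcription artifacts, strip accents and punctuation."""
    text = text.lower()
    text = BRACKET_TAG.sub(" ", text)
    text = GIBBERISH_TAG.sub(" ", text)
    text = PARTIAL_WORD.sub(" ", text)
    text = strip_accents(text)
    text = re.sub(r"[^\w\s]", " ", text)  # replace remaining punctuation with spaces
    return " ".join(text.split())         # collapse runs of whitespace


print(clean_transcript("El niño [pause] cor- corrió, xxx ¡rápido!"))
```

Whether accent stripping should also fold `ñ` into `n` is a design choice; the sketch does so only because NFD decomposition treats the tilde as a combining mark.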

2. Early Quality Gate

A lightweight deterministic filter handles trivial cases before any expensive NLP inference.

Rules:

  • Exact match with target stimulus → assign Score 4
  • Empty / pure gibberish response → assign Score 0
  • Otherwise → continue to feature extraction

This optimization reduces unnecessary compute and isolates the ambiguous middle-range responses that actually require modeling.
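The gate rules can be expressed as a small function that either returns a final score or defers to the later stages. The sketch below assumes both strings have already been cleaned and that gibberish markers were removed during cleaning, so a pure-gibberish response arrives as an empty string; the function name `early_quality_gate` is hypothetical.

```python
from typing import Optional


def early_quality_gate(target: str, response: str) -> Optional[int]:
    """Return 4 or 0 for trivial cases, or None if full scoring is needed.

    Assumes both inputs are already normalized (lowercased, artifacts removed).
    """
    if not response.strip():
        return 0      # empty, or nothing left once artifacts were stripped
    if response == target:
        return 4      # exact reproduction of the stimulus
    return None       # ambiguous case: continue to feature extraction


print(early_quality_gate("quiero cortarme el pelo", "quiero cortarme el pelo"))
print(early_quality_gate("quiero cortarme el pelo", ""))
print(early_quality_gate("quiero cortarme el pelo", "quiero el pelo"))
```

Returning `None` rather than a sentinel score keeps the gate's decisions clearly separate from the scores produced by the modeled path.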


3. Linguistic Feature Extraction

For unresolved cases, the pipeline extracts linguistically meaningful features using spaCy.

Features include:

  • Lemma Recall: how much core lexical content from the target was retained
  • Idea-Unit / Content-Word Recall: overlap over meaning-bearing words only

These features provide interpretable evidence for partial retention of form and meaning.
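As an illustration of the two recall features, the sketch below computes them over tokens that have already been lemmatized and POS-tagged (for example with spaCy's `token.lemma_` and `token.pos_`). It does not call spaCy itself, and the set of POS tags treated as content words is an assumption, not something stated in this README.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed definition of "meaning-bearing" words, using Universal POS tags.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def recall(target_items: Iterable[str], response_items: Iterable[str]) -> float:
    """Fraction of target items (counted with multiplicity) found in the response."""
    target_counts = Counter(target_items)
    total = sum(target_counts.values())
    if total == 0:
        return 0.0
    response_counts = Counter(response_items)
    matched = sum(min(n, response_counts[item]) for item, n in target_counts.items())
    return matched / total


def content_lemmas(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep only lemmas whose POS tag is in CONTENT_POS."""
    return [lemma for lemma, pos in tagged if pos in CONTENT_POS]


# Hand-tagged toy example standing in for spaCy output.
target = [("querer", "VERB"), ("cortar", "VERB"), ("el", "DET"), ("pelo", "NOUN")]
response = [("querer", "VERB"), ("el", "DET"), ("pelo", "NOUN")]

lemma_recall = recall([l for l, _ in target], [l for l, _ in response])
content_recall = recall(content_lemmas(target), content_lemmas(response))
print(lemma_recall, content_recall)
```

In this toy case the response drops one verb, so it retains three of four lemmas overall but only two of three content words, which is why content-word recall penalizes the omission more heavily.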


4. Semantic Similarity Layer

The pipeline computes sentence-level semantic similarity using SBERT.

  • SBERT similarity captures broad semantic closeness between target and response
  • this acts as a fast, robust semantic baseline
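The README does not say how the SBERT similarity is computed; a common convention is the cosine similarity between the two sentence embeddings. The sketch below shows that calculation on small stand-in vectors, under that assumption. Real embeddings would come from an SBERT model and have hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for target and response embeddings.
target_vec = [0.2, 0.7, 0.1]
response_vec = [0.4, 1.4, 0.2]   # same direction, different length
print(cosine_similarity(target_vec, response_vec))
```

Because cosine similarity ignores vector length, two embeddings pointing in the same direction score close to 1 regardless of their magnitudes.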

5. Deep Meaning Verification via NLI

The final semantic refinement stage uses a multilingual Natural Language Inference (NLI) model.

Instead of treating the task like generic similarity, NLI directly tests whether the learner response preserves the meaning of the target sentence.

This is especially well aligned with EIT scoring, where the central question is not just lexical overlap, but whether the response still entails the original message.


6. Interpretable Scoring Engine

For ambiguous responses, extracted features are combined using a weighted mathematical scoring function:

Raw Score = (0.55 × NLI_margin) + (0.30 × SBERT_similarity) + (0.15 × Lemma_recall)

Note: These specific weights were not arbitrarily chosen. They were determined via a Grid Search optimization process over the 1,560-row historical tuning set to maximize the Quadratic Weighted Kappa metric.
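The weighted sum above translates directly into code. In this sketch the weights are the ones quoted in the formula; the function and dictionary names are hypothetical, and the input values are made up purely to show the arithmetic, since the README does not specify the range of each feature.

```python
# Weights as quoted in the formula above.
WEIGHTS = {
    "nli_margin": 0.55,
    "sbert_similarity": 0.30,
    "lemma_recall": 0.15,
}


def raw_score(nli_margin: float, sbert_similarity: float, lemma_recall: float) -> float:
    """Weighted combination of the three features, before thresholding."""
    return (
        WEIGHTS["nli_margin"] * nli_margin
        + WEIGHTS["sbert_similarity"] * sbert_similarity
        + WEIGHTS["lemma_recall"] * lemma_recall
    )


# Illustrative feature values only: 0.275 + 0.240 + 0.090
example = raw_score(nli_margin=0.5, sbert_similarity=0.8, lemma_recall=0.6)
print(round(example, 3))
```

Since the weights sum to 1.0, the raw score stays within the range of the inputs whenever all three features share the same scale.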

7. QWK-Based Threshold Optimization

Rather than manually guessing score boundaries, the system tunes the threshold cutoffs using Powell optimization to maximize Quadratic Weighted Kappa (QWK) on the training split.

This produces empirically grounded boundaries between rubric levels:

  • 0
  • 1
  • 2
  • 3
  • 4
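To make the thresholding step concrete, the sketch below shows how four ascending cutoffs map a raw score to a rubric level, together with a plain-Python Quadratic Weighted Kappa that an optimizer would maximize. The cutoff values are invented for illustration, the function names are hypothetical, and the Powell search itself is not shown (one common option is `scipy.optimize.minimize` with `method="Powell"`, though the README does not name the library used).

```python
from bisect import bisect_right
from typing import List, Sequence


def apply_thresholds(raw: float, cutoffs: Sequence[float]) -> int:
    """Map a raw score to a level in 0..len(cutoffs) using ascending cutoffs."""
    return bisect_right(sorted(cutoffs), raw)


def quadratic_weighted_kappa(y_true: Sequence[int], y_pred: Sequence[int],
                             n_classes: int = 5) -> float:
    """QWK = 1 - (weighted observed disagreement / weighted expected disagreement)."""
    n = len(y_true)
    observed: List[List[int]] = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    num = 0.0
    den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            num += weight * observed[i][j]
            den += weight * true_hist[i] * pred_hist[j] / n
    # den is zero only when both label sets are the same single class.
    return 1.0 - num / den if den else 1.0


cutoffs = [0.2, 0.4, 0.6, 0.8]            # hypothetical boundaries, not tuned values
raw_scores = [0.1, 0.35, 0.605, 0.95]
predicted = [apply_thresholds(r, cutoffs) for r in raw_scores]
print(predicted)
print(quadratic_weighted_kappa([0, 1, 3, 4], predicted))
```

An optimizer would treat the cutoffs as free parameters and the negative QWK as the objective, re-scoring the labeled rows at each step.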

8. A/B Scientific Design

To make the project more rigorous, AutoEIT includes two scoring approaches:

Approach A: Heuristic Scorer

A fully interpretable weighted feature scorer with optimized thresholds.

Approach B: Ordinal ML Model

A lightweight ordinal classifier trained on the extracted features as an experimental challenger.

This allows the system to compare:

  • interpretability vs learned mapping
  • deterministic thresholding vs data-driven ordinal classification

🚧 System Limitations & Handled Edge Cases

Known Limitations

  • Transcription Dependency: The system evaluates text transcripts, not raw audio. The quality of the human/ASR transcription directly bounds the system's accuracy.
  • Proficiency Bias: Because the tuning dataset is heavily skewed toward high-proficiency learners, boundary detection between lower scores (0, 1, and 2) is weakly constrained.
  • Language Specificity: Currently validated exclusively for Spanish Elicited Imitation Tasks.

Handled Edge Cases

  • Non-Target Language & Gibberish: Responses entirely in English (e.g., [en inglés]) or marked as gibberish (xxx) are caught by the Early Gate and automatically scored 0.
  • Transcription Artifacts: Markers such as [pause], [laugh], and partial word attempts (e.g., cor-) are aggressively normalized prior to semantic evaluation to prevent false entailment scores.

βš™οΈ Installation

Prerequisites

  • Python 3.11+
  • Git

1. Clone the Repository

git clone https://github.com/Siddhazntx/AutoEIT-II
cd AutoEIT-II

2. Create and Activate a Virtual Environment

Windows (PowerShell):

python -m venv autoeit311
.\autoeit311\Scripts\activate

Linux / macOS:

python -m venv autoeit311
source autoeit311/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Download the spaCy Spanish Model

python -m spacy download es_core_news_lg

▶️ Running the Pipeline

The project includes a command-line entry script for reproducible execution.

Standard Run

python run_pipeline.py

Debug Mode

python run_pipeline.py --debug

🧪 Testing and Sanity Checks

To run a quick sanity check on the pipeline components with sample data:

python test_run.py

This will load a few sample data points, show preprocessing steps, and verify that the feature extractors are working correctly.


📤 Exporting Results

To export the final predictions to an Excel file:

python export_results.py

📊 Notebooks

The project includes Jupyter notebooks for exploratory data analysis and evaluation:

  • notebooks/data_exploration.ipynb: Exploratory Data Analysis (EDA) for understanding dataset distribution, class imbalance, and justifying architectural decisions.
  • notebooks/final_evaluation.ipynb: Detailed evaluation of the winning model with statistical analysis, confusion matrices, and error analysis.

To run the notebooks, ensure you have Jupyter installed and the virtual environment activated:

jupyter notebook notebooks/

πŸ“ Repository Structure

AutoEit_2/
├── configs/
│   └── config.yaml              # Centralized configuration file
├── data/
│   └── raw/                     # Input data files (Excel/CSV)
├── notebooks/
│   ├── data_exploration.ipynb   # EDA and dataset profiling
│   └── final_evaluation.ipynb   # Model evaluation and analysis
├── src/
│   ├── __init__.py
│   ├── pipeline.py              # Master orchestration pipeline
│   ├── data/
│   │   └── data_loader.py       # Data ingestion layer
│   ├── preprocessing/
│   │   └── preprocessor.py      # Cleaning + early quality gate
│   ├── features/                # Linguistic, SBERT, NLI extractors
│   │   ├── __init__.py
│   │   ├── cross_encoder.py
│   │   ├── feature_extractor.py
│   │   ├── linguistic.py
│   │   ├── nli_scorer.py
│   │   └── sbert.py
│   └── scoring/                 # Heuristic scorer, threshold optimizer, ordinal model
│       ├── heuristic_scorer.py
│       ├── ordinal_model.py
│       └── thresholding.py
├── export_results.py            # Utility for exporting final reports
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── run_pipeline.py              # Main CLI entry point
└── test_run.py                  # Sanity check script

Note: Generated outputs (in data/processed/ and data/cache/), virtual environments, and temporary files are not included in the repository and should be added to .gitignore.


🔧 Configuration

The pipeline is configured via configs/config.yaml. This file contains parameters for:

  • Data paths and preprocessing settings
  • Feature extraction hyperparameters
  • Scoring weights and thresholds
  • Model training parameters

Modify this file to customize the pipeline behavior.

