Specific Test II Submission
AutoEIT is an end-to-end, reproducible NLP pipeline for automatically scoring Spanish Elicited Imitation Task (EIT) responses against target prompt sentences using a meaning-based rubric. Rather than depending purely on black-box LLMs, the system combines a deterministic Early Quality Gate, linguistically informed feature extraction, and an interpretable Heuristic Mathematical Scorer, with a lightweight ordinal ML baseline for comparison.
The project is designed to be:
- Interpretable: feature contributions and thresholds are transparent
- Efficient: obvious cases are resolved before expensive semantic inference
- Reproducible: configuration-driven, modular, and research-friendly
- Low-resource aware: built to perform well even with limited labeled data
The AutoEIT pipeline predicts scores on a 0-4 rubric scale. Because the dataset is highly imbalanced (with a large majority of responses receiving score 4), model selection and evaluation are driven primarily by Quadratic Weighted Kappa (QWK) on a stratified validation split.
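As a reference point for the metric, a minimal pure-Python sketch of QWK for the 0-4 rubric could look like the following. The function name and the convention of returning 1.0 when every label falls into a single class are assumptions made for this illustration, not the project's evaluation code.

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_classes-1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))  # joint counts of (true, predicted)
    true_hist = Counter(y_true)              # row marginals
    pred_hist = Counter(y_pred)              # column marginals
    max_dist = (n_classes - 1) ** 2
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / max_dist                   # quadratic disagreement weight
            expected = true_hist[i] * pred_hist[j] / n    # count expected by chance
            num += w * observed[(i, j)]
            den += w * expected
    # Assumed convention: if chance disagreement is zero, report full agreement.
    return 1.0 if den == 0 else 1.0 - num / den
```

Because the penalty grows with the squared distance between labels, predicting 3 for a true 4 costs far less than predicting 0.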
AutoEIT uses two datasets: one for tuning and validation, and one for production inference. (Note: The datasets are proprietary and omitted from this repository.)
- Historical Tuning Set (1,560 labeled rows, 29 participants): This dataset exhibits a natural, real-world class imbalance (~75% of ground-truth scores are 4).
  - Data Cleaning: Participants 1, 2, and 3 were dropped entirely from this set due to missing transcriptions.
  - Weight Optimization: This set is used strictly for deriving the Heuristic Engine's scoring weights via Grid Search.
  - Baseline Validation: This set establishes the Quadratic Weighted Kappa (QWK) baseline of 0.8187, calculated via stratified K-Fold cross-validation to account for class imbalance. This value falls in the "almost perfect agreement" band (0.81 to 1.00) of Landis & Koch (1977).
- Production Holdout Set (120 unlabeled rows, 4 participants): The actual target inference environment. Because this data lacks ground truth, our Training-Free Heuristic Engine (Approach A) is deployed here. While the thresholds were tuned on historical data, the scoring engine itself relies purely on semantic rules rather than learned ordinal classifications.
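The stratified K-Fold split used for the tuning set can be sketched as follows. This is a simplified round-robin assignment written for illustration; the function name is hypothetical, and the project's actual splitter (for example a library implementation with shuffling) is not specified here.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign each row index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin across classes so fold sizes stay balanced
    for label in sorted(by_class):
        for pos, idx in enumerate(by_class[label]):
            folds[(offset + pos) % k].append(idx)
        offset += len(by_class[label])
    return folds


# Made-up labels: 15 responses scored 4, three scored 3, two scored 0.
labels = [4] * 15 + [3] * 3 + [0] * 2
folds = stratified_folds(labels, k=5)
```

With these made-up labels, every fold receives three of the majority-class rows, so no fold is evaluated on score-4 responses alone.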
| Metric | Approach A (Heuristic Scorer) | Approach B (Ordinal ML) |
|---|---|---|
| Quadratic Weighted Kappa (QWK) | 0.8187 | 0.7956 |
| Accuracy | 76.92% | 69.55% |
| Macro F1-Score | 0.5854 | 0.4901 |
Conclusion: The interpretable heuristic approach outperformed the ordinal ML baseline on all three validation metrics (QWK 0.8187 vs. 0.7956, accuracy 76.92% vs. 69.55%, Macro F1 0.5854 vs. 0.4901), reaching strong agreement with human scoring while remaining fully transparent and easy to audit.
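The gap between accuracy and Macro F1 in the table is typical of imbalanced data. The sketch below, with hypothetical helper names and made-up labels, shows how a scorer that always predicts 4 can reach high accuracy while its Macro F1 stays low. It averages F1 over all five rubric classes, which is an assumption about how the reported figure was computed.

```python
def accuracy(y_true, y_pred):
    """Share of responses whose predicted score equals the human score."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, classes=(0, 1, 2, 3, 4)):
    """Unweighted mean of per-class F1; classes with no support or predictions count as 0."""
    f1_values = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


# Made-up example: eight responses scored 4, one scored 2, one scored 0.
y_true = [4, 4, 4, 4, 4, 4, 4, 4, 2, 0]
y_pred = [4] * 10  # a degenerate scorer that always predicts 4
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Here accuracy is 0.8, but only class 4 contributes a non-zero F1, so the macro average is pulled far below it.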
The pipeline follows a staged architecture designed for both efficiency and rubric alignment.
Input transcriptions are standardized to reduce noise before scoring.
This includes:
- lowercasing
- punctuation normalization
- accent normalization
- removal of transcription artifacts such as pauses, cough markers, and gibberish tags
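The normalization steps listed above could be sketched with the standard library as follows. The regular expressions, the marker formats (`[pause]`, `xxx`, trailing-hyphen partial words), and the function name are assumptions for illustration; the project's preprocessor may differ.

```python
import re
import unicodedata

BRACKET_TAG = re.compile(r"\[[^\]]*\]")        # bracketed markers such as [pause] or [laugh]
PARTIAL_WORD = re.compile(r"\b\w+-(?=\s|$)")   # abandoned word attempts such as "cor-"
GIBBERISH = re.compile(r"\bxxx\b")             # assumed gibberish tag
PUNCTUATION = re.compile(r"[^\w\s]")           # anything that is not a word char or space


def normalize(text):
    """Lowercase, drop transcription artifacts, strip accents and punctuation."""
    text = text.lower()
    text = BRACKET_TAG.sub(" ", text)
    text = PARTIAL_WORD.sub(" ", text)
    text = GIBBERISH.sub(" ", text)
    # Accent normalization: decompose characters, then drop combining marks.
    # Caveat: this also maps "ñ" to "n", so apply it to target and response alike.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    text = PUNCTUATION.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace
```

A response that consists only of markers normalizes to the empty string, which makes it easy for a later rule to detect.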
A lightweight deterministic filter handles trivial cases before any expensive NLP inference.
Rules:
- Exact match with target stimulus → assign Score 4
- Empty / pure gibberish response → assign Score 0
- Otherwise → continue to feature extraction
This optimization reduces unnecessary compute and isolates the ambiguous middle-range responses that actually require modeling.
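A minimal sketch of such a gate is shown below. It assumes both strings have already been normalized and that gibberish markers were removed during cleaning, so a pure-gibberish response arrives empty; the function name is hypothetical.

```python
def early_gate(target, response):
    """Return 4 or 0 for trivial cases, or None when full scoring is still needed."""
    if not response.strip():
        return 0      # empty (or only artifacts before cleaning)
    if response == target:
        return 4      # exact repetition of the stimulus
    return None       # ambiguous: continue to feature extraction


pairs = [
    ("el gato duerme", "el gato duerme"),
    ("el gato duerme", ""),
    ("el gato duerme", "el gato"),
]
unresolved = [p for p in pairs if early_gate(*p) is None]
```

Only the pairs left in `unresolved` would need spaCy, SBERT, or NLI inference.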
For unresolved cases, the pipeline extracts linguistically meaningful features using spaCy.
Features include:
- Lemma Recall: how much core lexical content from the target was retained
- Idea-Unit / Content-Word Recall: overlap over meaning-bearing words only
These features provide interpretable evidence for partial retention of form and meaning.
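The two recall features can be illustrated on pre-analyzed tokens. In the sketch below the `(lemma, pos)` pairs are written by hand; in the pipeline they would come from spaCy (for example `token.lemma_` and `token.pos_`). The set-based definition of recall and the choice of content-word tags are assumptions, not confirmed details of the project.

```python
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}  # assumed meaning-bearing tags


def recall(target_items, response_items):
    """Fraction of distinct target items that also occur in the response."""
    target_set = set(target_items)
    if not target_set:
        return 0.0
    return len(target_set & set(response_items)) / len(target_set)


def lemma_recall(target_tokens, response_tokens):
    """Recall over all lemmas; tokens are (lemma, pos) pairs."""
    return recall([lemma for lemma, _ in target_tokens],
                  [lemma for lemma, _ in response_tokens])


def content_word_recall(target_tokens, response_tokens):
    """Recall restricted to lemmas whose part of speech is a content tag."""
    def content(tokens):
        return [lemma for lemma, pos in tokens if pos in CONTENT_POS]
    return recall(content(target_tokens), content(response_tokens))


target = [("el", "DET"), ("niño", "NOUN"), ("comer", "VERB"),
          ("uno", "DET"), ("manzana", "NOUN")]
response = [("el", "DET"), ("niño", "NOUN"), ("comer", "VERB")]
print(lemma_recall(target, response), content_word_recall(target, response))
```

In this example the response keeps three of five lemmas but only two of three content words, so the two features can diverge when function words are dropped or retained.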
The pipeline computes sentence-level semantic similarity using SBERT.
- SBERT similarity captures broad semantic closeness between target and response
- this acts as a fast, robust semantic baseline
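Sentence-embedding similarity is commonly computed as the cosine of the angle between two embedding vectors, and the sketch below assumes that is what is meant here. The short vectors are toy stand-ins; real SBERT embeddings would have several hundred dimensions and would be produced by the encoder model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


target_vec = [1.0, 1.0, 0.0]     # toy embedding of the target sentence
response_vec = [1.0, 0.0, 0.0]   # toy embedding of the learner response
similarity = cosine_similarity(target_vec, response_vec)
```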
The final semantic refinement stage uses a multilingual Natural Language Inference (NLI) model.
Instead of treating the task like generic similarity, NLI directly tests whether the learner response preserves the meaning of the target sentence.
This is especially well aligned with EIT scoring, where the central question is not just lexical overlap, but whether the response still entails the original message.
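The scoring formula later refers to an NLI margin, but its exact definition is not given in this document. One plausible reading, shown below purely as an assumption, is the entailment probability minus the contradiction probability, computed from the three-way logits of an NLI model. The label order is also an assumption and varies between models.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nli_margin(logits, entail_idx=0, contra_idx=2):
    """Assumed definition: P(entailment) - P(contradiction), ranging from -1 to 1."""
    probs = softmax(logits)
    return probs[entail_idx] - probs[contra_idx]


# Hypothetical logits ordered as (entailment, neutral, contradiction).
print(nli_margin([5.0, 0.0, -5.0]))   # response clearly preserves the meaning
print(nli_margin([-5.0, 0.0, 5.0]))   # response clearly contradicts the target
```

Under this reading, a response that the model considers neutral toward the target yields a margin near zero.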
For ambiguous responses, extracted features are combined using a weighted mathematical scoring function:
Raw Score = (0.55 × NLI_margin) + (0.30 × SBERT_similarity) + (0.15 × Lemma_recall)
Note: These specific weights were not arbitrarily chosen. They were determined via a Grid Search optimization process over the 1,560-row historical tuning set to maximize the Quadratic Weighted Kappa metric.
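The weighted sum and the space a grid search might cover can be sketched as follows. The weights are the ones stated above; the feature values, the 0.05 step size, and the constraint that candidate weights sum to 1 are assumptions for illustration.

```python
WEIGHTS = {"nli_margin": 0.55, "sbert_similarity": 0.30, "lemma_recall": 0.15}


def raw_score(features, weights=WEIGHTS):
    """Weighted sum of the three features."""
    return sum(weights[name] * features[name] for name in weights)


def weight_grid(step=0.05):
    """Yield (w_nli, w_sbert, w_lemma) triples on a regular grid that sum to 1."""
    n = round(1 / step)
    for i in range(n + 1):
        for j in range(n + 1 - i):
            yield (i * step, j * step, (n - i - j) * step)


# Made-up feature values for one response.
example = {"nli_margin": 0.8, "sbert_similarity": 0.9, "lemma_recall": 0.6}
print(raw_score(example))  # 0.55*0.8 + 0.30*0.9 + 0.15*0.6, approximately 0.80
```

A grid search in this spirit would score the tuning set with every triple from `weight_grid()` and keep the one with the highest QWK.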
Rather than manually guessing score boundaries, the system tunes the threshold cutoffs using Powell optimization to maximize Quadratic Weighted Kappa (QWK) on the training split.
This produces empirically grounded boundaries between rubric levels:
- 0
- 1
- 2
- 3
- 4
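Once the four cutoffs are known, mapping a raw score to a rubric level is a simple lookup. The sketch below uses placeholder cutoffs, since the tuned values are not given in this document. The optimization step itself is not shown; one common way to run Powell's method is SciPy's `scipy.optimize.minimize` with `method="Powell"`, though whether the project uses SciPy is an assumption.

```python
from bisect import bisect_right

# Placeholder cutoffs for illustration only; the tuned values are not stated here.
EXAMPLE_THRESHOLDS = [0.20, 0.40, 0.60, 0.80]


def to_rubric_score(raw, thresholds=EXAMPLE_THRESHOLDS):
    """Map a raw score to 0..4 by counting how many cutoffs are at or below it."""
    return bisect_right(sorted(thresholds), raw)


for raw in (0.10, 0.50, 0.95):
    print(raw, to_rubric_score(raw))
```

Four cutoffs split the raw-score axis into five intervals, one per rubric level, which keeps the predicted scores ordinal by construction.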
To make the project more rigorous, AutoEIT includes two scoring approaches:
- Approach A (Heuristic Scorer): a fully interpretable weighted feature scorer with optimized thresholds.
- Approach B (Ordinal ML): a lightweight ordinal classifier trained on the extracted features as an experimental challenger.
This allows the system to compare:
- interpretability vs learned mapping
- deterministic thresholding vs data-driven ordinal classification
Known Limitations
- Transcription Dependency: The system evaluates text transcripts, not raw audio. The quality of the human/ASR transcription directly bounds the system's accuracy.
- Proficiency Bias: Because the tuning dataset is heavily skewed toward high-proficiency learners, boundary detection between lower scores (0, 1, and 2) is weakly constrained.
- Language Specificity: Currently validated exclusively for Spanish Elicited Imitation Tasks.
Handled Edge Cases
- Non-Target Language & Gibberish: Responses entirely in English (e.g., [en inglés]) or marked as gibberish (xxx) are caught by the Early Gate and automatically scored 0.
- Transcription Artifacts: Markers such as [pause], [laugh], and partial word attempts (e.g., cor-) are aggressively normalized prior to semantic evaluation to prevent false entailment scores.
- Python 3.11+
- Git

    git clone https://github.com/Siddhazntx/AutoEIT-II
    cd AutoEIT-II

Windows (PowerShell):

    python -m venv autoeit311
    .\autoeit311\Scripts\activate

Linux / macOS:

    python -m venv autoeit311
    source autoeit311/bin/activate

    pip install -r requirements.txt

    python -m spacy download es_core_news_lg

The project includes a command-line entry script for reproducible execution.

    python run_pipeline.py

    python run_pipeline.py --debug

To run a quick sanity check on the pipeline components with sample data:

    python test_run.py

This will load a few sample data points, show preprocessing steps, and verify that the feature extractors are working correctly.

To export the final predictions to an Excel file:

    python export_results.py

The project includes Jupyter notebooks for exploratory data analysis and evaluation:

- notebooks/data_exploration.ipynb: Exploratory Data Analysis (EDA) for understanding dataset distribution, class imbalance, and justifying architectural decisions.
- notebooks/final_evaluation.ipynb: Detailed evaluation of the winning model with statistical analysis, confusion matrices, and error analysis.

To run the notebooks, ensure you have Jupyter installed and the virtual environment activated:
    jupyter notebook notebooks/

    AutoEit_2/
    ├── configs/
    │   └── config.yaml              # Centralized configuration file
    ├── data/
    │   └── raw/                     # Input data files (Excel/CSV)
    ├── notebooks/
    │   ├── data_exploration.ipynb   # EDA and dataset profiling
    │   └── final_evaluation.ipynb   # Model evaluation and analysis
    ├── src/
    │   ├── __init__.py
    │   ├── pipeline.py              # Master orchestration pipeline
    │   ├── data/
    │   │   └── data_loader.py       # Data ingestion layer
    │   ├── preprocessing/
    │   │   └── preprocessor.py      # Cleaning + early quality gate
    │   ├── features/                # Linguistic, SBERT, NLI extractors
    │   │   ├── __init__.py
    │   │   ├── cross_encoder.py
    │   │   ├── feature_extractor.py
    │   │   ├── linguistic.py
    │   │   ├── nli_scorer.py
    │   │   └── sbert.py
    │   └── scoring/                 # Heuristic scorer, threshold optimizer, ordinal model
    │       ├── heuristic_scorer.py
    │       ├── ordinal_model.py
    │       └── thresholding.py
    ├── export_results.py            # Utility for exporting final reports
    ├── README.md                    # This file
    ├── requirements.txt             # Python dependencies
    ├── run_pipeline.py              # Main CLI entry point
    └── test_run.py                  # Sanity check script
Note: Generated outputs (in data/processed/ and data/cache/), virtual environments, and temporary files are not included in the repository and should be added to .gitignore.
The pipeline is configured via configs/config.yaml. This file contains parameters for:
- Data paths and preprocessing settings
- Feature extraction hyperparameters
- Scoring weights and thresholds
- Model training parameters
Modify this file to customize the pipeline behavior.