This project is a complete implementation of a LipNet model that translates video sequences of human lip movements into text sentences. The architecture combines spatiotemporal convolutional networks (3D CNNs) with bidirectional gated recurrent units (BiGRUs) and a Connectionist Temporal Classification (CTC) loss to make word-level and sentence-level predictions from visual data alone, without audio.
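To make the CTC step more concrete, here is a minimal sketch of greedy CTC decoding in plain Python: repeated per-frame labels are merged and blank labels are dropped. The `VOCAB` string, the blank index of 0, and the function names are assumptions for illustration only; the project's actual vocabulary and decoder may differ.

```python
# Hypothetical vocabulary: index 0 is a placeholder for the CTC blank label.
# The project's real character set and blank position may differ.
VOCAB = "_abcdefghijklmnopqrstuvwxyz "
BLANK = 0


def ctc_greedy_decode(frame_ids, blank=BLANK):
    """Collapse consecutive duplicate labels, then remove blanks."""
    decoded = []
    previous = None
    for idx in frame_ids:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank:
            decoded.append(idx)
        previous = idx
    return decoded


def ids_to_text(ids):
    """Map label indices back to characters using the assumed vocabulary."""
    return "".join(VOCAB[i] for i in ids)


# Per-frame argmax labels for a short clip (made-up values).
frames = [2, 2, 0, 9, 9, 9, 0, 0, 14, 14]
print(ids_to_text(ctc_greedy_decode(frames)))  # prints "bin"
```

The blank label is what lets the model emit a doubled letter: `[1, 0, 1]` decodes to `"aa"`, while `[1, 1]` collapses to a single `"a"`.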
- End-to-End Deep Learning Architecture: Built with TensorFlow and Keras, fully leveraging GPU acceleration for both training and inference.
- FastAPI Backend: An asynchronous, lightweight REST API serving predictions dynamically.
- Interactive UI: A vanilla Web frontend allowing you to test videos directly against the model. It automatically serves static video files alongside predictions.
- Built-in Evaluation Suite: Dedicated evaluation script (`evaluate.py`) to systematically measure Character Error Rate (CER) and Word Error Rate (WER) across datasets.
- Blazing Fast Setup: Managed with `uv` for strict, lightning-fast dependency resolution.
.
├── backend/ # FastAPI Application and endpoints
│ ├── app.py # Main backend API entry point
│ ├── data.py # Dataset processing, normalization, and tokenization logic
│ ├── model.py # LipNet 3D CNN + BiGRU Architecture definition
│ └── predict.py # Inference wrapper for the Checkpoint
├── data/
│ ├── s1/ # GRID corpus .mpg video samples
│ └── alignments/s1/ # Ground-truth .align transcription files
├── frontend/ # Static web assets
│ ├── index.html
│ ├── script.js
│ └── style.css
├── models/ # TensorFlow Checkpoints (`checkpoint` files)
├── evaluate.py # Comprehensive CER/WER evaluation script
├── pyproject.toml # Project metadata and dependencies
├── run.sh # Quick-start script for the backend
└── uv.lock # Deterministic dependency lockfile
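As a rough illustration of how the ground-truth `.align` files could be turned into target sentences, the sketch below assumes the usual GRID layout of one `start end token` triple per line, with `sil` and `sp` marking silence and short pauses. The function name `parse_align` and the sample timings are hypothetical and not taken from `data.py`.

```python
def parse_align(text, skip=("sil", "sp")):
    """Return the spoken words of one alignment file as a single string.

    Assumes each line looks like "<start> <end> <token>" and that the
    tokens listed in `skip` are silence markers rather than words.
    """
    words = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        _start, _end, token = parts
        if token not in skip:
            words.append(token)
    return " ".join(words)


# Made-up alignment content in the assumed format.
sample = """0 23750 sil
23750 29500 bin
29500 34000 blue
34000 35500 at
35500 41000 f
41000 47250 two
47250 53000 now
53000 74500 sil"""

print(parse_align(sample))  # prints "bin blue at f two now"
```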
- Python 3.13+
- (Optional but Recommended) NVIDIA GPU + CUDA drivers for accelerated prediction and training.
We use uv for dependency management.
- Install `uv` if you haven't already:
  `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Clone this repository and navigate to the project directory.
- Sync the dependencies into your environment:
  `uv sync`

Start the FastAPI backend server using the provided shell script or directly via uv:
# Using the shell script
bash run.sh
# Or directly via uv
uv run uvicorn backend.app:app --reload

The server will initialize on http://127.0.0.1:8000.
The FastAPI application automatically mounts and serves the static frontend at the root path. Open your browser and navigate to: 👉 http://localhost:8000/
From here, you can select videos from the connected GRID corpus or upload custom .mpg files to see real-time lipreading predictions!
The project includes an evaluation script (`evaluate.py`) that calculates the Character Error Rate (CER) and Word Error Rate (WER) across the `s1` dataset directory using the Levenshtein distance metric.
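For readers unfamiliar with these metrics, the following is a minimal sketch of how CER and WER can be derived from Levenshtein distance. It is not the code in `evaluate.py`; the function names and the example sentences are assumptions chosen for illustration.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)


reference = "bin blue at f two now"
hypothesis = "bin blue at f too now"
print(f"CER: {cer(reference, hypothesis):.4f}")
print(f"WER: {wer(reference, hypothesis):.4f}")
```

In this example one of six words is wrong, so the WER is 1/6, while only one of 21 reference characters differs, so the CER is 1/21. This is why CER is typically lower than WER on the same predictions.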
On a locally provided benchmark of 1,000 samples, the LipNet model reaches error rates of ~1.65% WER and ~0.69% CER.
To run the evaluation yourself:
# Run over a subset (e.g. 50 files)
uv run python evaluate.py --num_samples 50
# Run evaluation over the entire dataset
uv run python evaluate.py --num_samples -1

The script will output predictions alongside ground-truth text for every sample and summarize the final metrics at the end.