Hinglish • Tanglish • Benglish • Tenglish • Odia+English
Code-mixed speech is not an error; it's a cultural identity.
India has 500M+ internet users, yet not a single free tool accurately captions the way Indians actually speak.
| What a Creator Says | What YouTube Auto-Captions Outputs |
|---|---|
| "Bohot saare log poochh rahe the, bhai kaise karte ho yeh, so today I'm going to show you everything." | "Bahut Saare log pooch rahe the buy case carters in Europe so today I..." |
Existing tools like Whisper and Google Speech-to-Text have WER > 25% on Indian code-mixed audio. They're built for monolingual speech and treat natural Hinglish/Tanglish as transcription errors.
Three core failures:
- No accuracy: WER > 25% on Indian code-mixed audio with existing state-of-the-art tools
- Wrong style: formal single-language output loses the authentic desi tone audiences expect
- No SRT export: no free tool generates properly timed subtitle files from code-mixed speech
DesiCaptions is a dual-model ASR pipeline that routes audio intelligently between Sarvam AI Saaras and OpenAI Whisper large-v3, post-processed by Gemini 1.5 Flash for authentic desi tone.
Three output formats. Zero cost. Fully accessible.
| Mode | Output |
|---|---|
| Desi Style | WhatsApp-style romanized code-mixed captions, e.g. "Bohot accha laga yaar, let's go!" |
| Script + English | Native script on line 1, English translation on line 2 |
| English Only | Clean, casual English translation for global reach |
Supported Language Pairs: Hinglish · Tanglish · Benglish · Tenglish · Odia+English

- Upload video/audio files (`.mp4`, `.mp3`, `.wav`)
- Live microphone input with real-time captions (< 2 sec latency)
- Dual-model ASR with automatic fallback (Sarvam → Whisper)
- Gemini-powered "Desi Style" post-processing
- SRT + TXT export with accurate timestamps (±200 ms)
- Deployed on Streamlit Community Cloud, zero server cost
- Completely free to use
Pipeline architecture:

    ┌─────────────────────────────────────────────────────┐
    │                   User Interface                    │
    │             Streamlit Web App + FastAPI             │
    └──────────────────────────┬──────────────────────────┘
                               │
    ┌──────────────────────────▼──────────────────────────┐
    │                 Audio Preprocessing                 │
    │               FFmpeg + librosa + VAD                │
    └──────────────────────────┬──────────────────────────┘
                               │
    ┌──────────────────────────▼──────────────────────────┐
    │           ASR Router (Language Detector)            │
    │          Detects Indic phonemes vs English          │
    └───────────┬─────────────────────────────┬───────────┘
                │                             │
    ┌───────────▼───────────┐     ┌───────────▼───────────┐
    │   Sarvam AI Saaras    │     │ OpenAI Whisper large  │
    │   (Primary: Indic)    │     │  (Fallback: English)  │
    └───────────┬───────────┘     └───────────┬───────────┘
                │                             │
    ┌───────────▼─────────────────────────────▼───────────┐
    │           Gemini 1.5 Flash Post-Processor           │
    │       Romanization • Desi Style • SRT Builder       │
    └─────────────────────────────────────────────────────┘

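The preprocessing stage lists VAD (voice activity detection) without saying which method is used. As a rough illustration of the idea only, the sketch below marks fixed-size frames as voiced when their RMS energy crosses a threshold. The energy-threshold technique, the function name `voiced_frames`, and its parameters are assumptions for illustration, not the project's actual preprocessing code.

```python
def voiced_frames(samples: list[float], frame_len: int, threshold: float) -> list[bool]:
    """Mark each full frame as voiced when its RMS energy exceeds the threshold.

    Hypothetical helper: a simple energy-based VAD, not the project's own method.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    flags = []
    # Step through non-overlapping frames; a trailing partial frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = (sum(x * x for x in frame) / frame_len) ** 0.5
        flags.append(rms > threshold)
    return flags


# Four silent samples followed by four louder ones, split into two frames.
silence_then_speech = [0.0] * 4 + [0.5] * 4
print(voiced_frames(silence_then_speech, frame_len=4, threshold=0.1))  # [False, True]
```

Dropping unvoiced frames before transcription is one plausible way to cut API cost and latency, though a production pipeline would likely use a trained VAD model rather than a fixed threshold.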
| Component | Technology | Role |
|---|---|---|
| Web UI | Streamlit + FastAPI | Interface & WebSocket live captions |
| Primary ASR | Sarvam AI Saaras v1 | Indic language transcription |
| Fallback ASR | OpenAI Whisper large-v3 | English & general transcription |
| Post-Processor | Google Gemini 1.5 Flash | Romanization, desi tone, formatting |
| SRT Builder | Custom Python | Timestamp formatting & export |
- Dataset: Kathbath (AI4Bharat, IIT Madras), a 1,700-hour verified corpus
- Evaluation: 100 code-mixed audio clips (20 per language pair), evaluated on Google Colab
- Metric: Word Error Rate (WER) via `jiwer`; lower is better, and text is normalized before scoring

| Language Pair | Sarvam AI (saaras:v3) | Whisper large-v3 | Winner | Sarvam Advantage |
|---|---|---|---|---|
| Hinglish | 4.9% | 26.4% | Sarvam AI | 5.4× better |
| Benglish | 7.5% | 60.0% | Sarvam AI | 8× better |
| Tanglish | 15.7% | 48.2% | Sarvam AI | 3.1× better |
| Tenglish | 15.3% | 69.7% | Sarvam AI | 4.6× better |
| Odia+Eng | 7.3% | 116.3% | Sarvam AI | 15.9× better |
| Average | 10.1% | 64.1% | Sarvam AI | 6.3× better overall |
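To make the metric concrete, here is a minimal pure-Python sketch of word error rate: the word-level edit distance divided by the number of reference words. The benchmark itself used `jiwer`; this stand-in only illustrates the arithmetic, and its normalization (lowercasing and whitespace splitting) is an assumption, since the exact normalization used in the evaluation is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("so today i will", "so today i am"))  # 0.25

# Insertions count as errors too, so the result can exceed 1.0 (100%).
print(word_error_rate("kaise karte ho", "buy case carters in Europe"))
```

Because insertions are counted against a fixed reference length, a hypothesis with many spurious words can score above 100%.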
Key Findings:
- Sarvam AI (saaras:v3) wins all 5 language pairs; its specialized Indic training data delivers a decisive advantage on every code-mixed pair tested
- The gap is most extreme for Odia+English: Whisper's 116.3% WER vs Sarvam's 7.3%, a 15.9× improvement
- A WER above 100% means the substitutions, deletions, and insertions together outnumber the reference words, so Whisper essentially cannot handle this language pair
- DesiCaptions uses Sarvam as primary across all pairs, with Whisper as a fallback only for API timeouts or English-only segments
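The primary-with-fallback policy in the last finding can be sketched as a small wrapper. This is an illustration under stated assumptions, not the contents of `asr_router.py`: the `Transcriber` type, the function name, the `english_only` flag, and the use of Python's built-in `TimeoutError` are all hypothetical, and the two engines are replaced by stubs so the example runs without API keys.

```python
from typing import Callable

# Hypothetical signature: raw audio bytes in, transcript text out.
Transcriber = Callable[[bytes], str]


def transcribe_with_fallback(
    audio: bytes,
    primary: Transcriber,
    fallback: Transcriber,
    english_only: bool = False,
) -> tuple[str, str]:
    """Return (engine_name, transcript), preferring the primary engine."""
    if english_only:
        # English-only segments go straight to the fallback engine.
        return "fallback", fallback(audio)
    try:
        return "primary", primary(audio)
    except TimeoutError:
        # The primary engine timed out: retry the same audio on the fallback.
        return "fallback", fallback(audio)


def flaky_primary(audio: bytes) -> str:
    """Stub standing in for the primary engine; always simulates a timeout."""
    raise TimeoutError("simulated API timeout")


def stub_fallback(audio: bytes) -> str:
    """Stub standing in for the fallback engine."""
    return "stub transcript"


print(transcribe_with_fallback(b"", flaky_primary, stub_fallback))
```

Returning the engine name alongside the transcript makes it easy to log how often the fallback path is taken.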
Five evaluators rated 20 caption pairs on a 1–5 Likert scale:
| Dimension | Before Gemini | After Gemini | Gain |
|---|---|---|---|
| Readability | 1.8 | 3.5 | +1.7 |
| Punctuation | 1.2 | 3.8 | +2.6 |
| Authenticity | 2.1 | 3.6 | +1.5 |
| Overall | 1.7 | 3.7 | +2.0 |
Gemini Desi Style transformation:

Raw ASR output:

    bahut saare log pooch rahe the bhai kaise karte ho yeh so today i am going
    to show you everything step by step

DesiCaptions output:

    "Bohot saare log poochh rahe the,
    bhai kaise karte ho yeh,
    so today I'm gonna show you
    everything, step by step!"

Clone the repository and install the dependencies:

    git clone https://github.com/seriescrux/desi-caption.git
    cd desi-caption
    python -m venv venv
    source venv/bin/activate  # Windows: venv\Scripts\activate
    pip install -r requirements.txt

Create a `.env` file in the root directory:

    SARVAM_API_KEY=your_sarvam_api_key
    GEMINI_API_KEY=your_gemini_api_key

Free API keys are available at sarvam.ai and aistudio.google.com.

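How the app reads these keys is not described here. As one possible approach, the sketch below parses `KEY=value` lines and fails early if a required key is missing. The helper names `parse_env_text` and `require_keys` are hypothetical, and a real project might instead rely on a dedicated dotenv library or on the hosting platform's secrets store.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def require_keys(env: dict[str, str], names: list[str]) -> dict[str, str]:
    """Return the requested keys, raising if any is missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Placeholder values copied from the .env template, not real keys.
sample = "SARVAM_API_KEY=your_sarvam_api_key\nGEMINI_API_KEY=your_gemini_api_key\n"
keys = require_keys(parse_env_text(sample), ["SARVAM_API_KEY", "GEMINI_API_KEY"])
print(sorted(keys))
```

Failing at startup with the names of the missing keys tends to be easier to debug than an authentication error on the first API call.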
Start the app:

    streamlit run app.py

Then open http://localhost:8501 in your browser.

- Upload a video/audio file or start live microphone capture
- Select your output mode (Desi Style / Script+English / English Only)
- Click Generate Captions
- Download your `.srt` or `.txt` file

Project layout:

    desi-caption/
    ├── app.py                  # Streamlit main app
    ├── asr_router.py           # Language detection & ASR routing logic
    ├── sarvam_client.py        # Sarvam AI API wrapper
    ├── whisper_client.py       # Whisper inference wrapper
    ├── gemini_postprocess.py   # Gemini prompt engineering & post-processing
    ├── srt_builder.py          # SRT timestamp builder
    ├── requirements.txt
    ├── .env.example
    ├── results/
    │   ├── benchmark_chart.png
    │   └── benchmark_results.csv
    └── tests/
        └── test_functional.py  # 15 IEEE 829 test cases

Run the test suite:

    pytest tests/ -v

All 15 functional test cases pass per IEEE 829 documentation standards. Cross-browser tested on Chrome, Firefox, and Safari.

- SRT Format: SubRip Text standard (HH:MM:SS,mmm), ≤2 lines/block, ≤42 chars/line per BBC Subtitle Guidelines 2023
- Python: PEP 8 enforced via `flake8`, Google-style docstrings, full type hints (mypy compatible)
- Security: API keys in `.env` only, never committed to version control
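The SRT constraints above (HH:MM:SS,mmm timestamps, at most 2 lines per block, at most 42 characters per line) can be illustrated with a short sketch. It is not the contents of `srt_builder.py`: the function names and the choice to raise an error on over-long captions, rather than splitting them, are assumptions made for illustration.

```python
import textwrap

MAX_CHARS = 42  # per-line limit stated above
MAX_LINES = 2   # per-block limit stated above


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SubRip HH:MM:SS,mmm form."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_block(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered SRT block, wrapping text to the line limits."""
    lines = textwrap.wrap(text, width=MAX_CHARS)
    if len(lines) > MAX_LINES:
        # Hypothetical policy: the caller is expected to split long captions.
        raise ValueError("caption too long for one block; split it upstream")
    body = "\n".join(lines)
    return f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{body}\n"


print(format_timestamp(3661.5))  # 01:01:01,500
print(srt_block(1, 0.0, 2.5, "Bohot accha laga yaar, let's go!"))
```

Working in integer milliseconds avoids rounding drift when the seconds and milliseconds fields are computed separately.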
Planned extensions:

- Kanglish, Malayalam+Eng, Marathi+Eng, Gujarati+Eng, Punjabi+Eng (→ 22 Indian languages)
- Speaker diarization for multi-speaker podcasts and interviews
- YouTube Data API v3 integration for direct SRT upload
- React Native mobile app with FastAPI backend
- Fine-tuned Whisper on Kathbath code-mixed subset
- Creator analytics dashboard (language mix stats, vocabulary trends)
| Name | Role |
|---|---|
| Abhimanyu Kumar | ASR Pipeline & Backend |
| Gaurang Ayush | Audio Preprocessing & VAD |
| Kanishk Raj | Gemini Integration & Prompt Engineering |
| Madhurim Dutta | Frontend & Streamlit UI |
| Manan Ratnam Pandey | Benchmarking & Evaluation |
| Sruti Jha | Testing & Documentation |

Supervised by Dr. Krutika Verma, School of Computer Engineering, KIIT (2025–2026)

MIT License. See LICENSE for details.

Built at KIIT School of Computer Engineering · 2025–2026

Code-mixed speech is India's natural language. DesiCaptions speaks it.
