🎙️ DesiCaptions

AI-Powered Subtitle Generator for Indian Content Creators

Hinglish • Tanglish • Benglish • Tenglish • Odia+English

Code-mixed speech is not an error; it's a cultural identity.

Benchmarks • Installation • Usage


📌 The Problem

India has 500M+ internet users, yet not a single free tool accurately captions the way Indians actually speak.

| What a Creator Says | What YouTube Auto-Captions Outputs |
| --- | --- |
| "Bohot saare log poochh rahe the, bhai kaise karte ho yeh, so today I'm going to show you everything." | "Bahut Saare log pooch rahe the buy case carters in Europe so today I..." |

Existing tools such as Whisper and Google Speech-to-Text show a word error rate (WER) above 25% on Indian code-mixed audio. They are built for monolingual speech and treat natural Hinglish or Tanglish as transcription errors.

Three core failures:

  • ❌ Low accuracy: WER above 25% on Indian code-mixed audio with existing state-of-the-art tools
  • ❌ Wrong style: formal single-language output loses the authentic desi tone audiences expect
  • ❌ No SRT export: no free tool generates properly timed subtitle files from code-mixed speech

✅ Our Solution

DesiCaptions is a dual-model ASR pipeline: a router sends each audio segment to either Sarvam AI Saaras or OpenAI Whisper large-v3, and Gemini 1.5 Flash post-processes the transcript to give it an authentic desi tone.

Three output formats. Zero cost. Fully accessible.

| Mode | Output |
| --- | --- |
| 🎯 Desi Style | WhatsApp-style romanized code-mixed captions, e.g. "Bohot accha laga yaar, let's go!" |
| 📜 Script + English | Native script on line 1, English translation on line 2 |
| 🌐 English Only | Clean, casual English translation for global reach |

Supported Language Pairs: Hinglish · Tanglish · Benglish · Tenglish · Odia+English


✨ Features

  • 🎵 Upload video/audio files (.mp4, .mp3, .wav)
  • 🎙️ Live microphone input with real-time captions (< 2 sec latency)
  • 🤖 Dual-model ASR with automatic fallback (Sarvam → Whisper)
  • 💬 Gemini-powered "Desi Style" post-processing
  • 📄 SRT + TXT export with accurate timestamps (±200ms)
  • 🌐 Deployed on Streamlit Community Cloud: zero server cost
  • 🆓 Completely free to use

🏗️ System Architecture

┌──────────────────────────────────────────────────────┐
│                   User Interface                     │
│              Streamlit Web App + FastAPI             │
└──────────────────────┬───────────────────────────────┘
                       │
┌──────────────────────▼───────────────────────────────┐
│              Audio Preprocessing                     │
│              FFmpeg + librosa + VAD                  │
└──────────────────────┬───────────────────────────────┘
                       │
┌──────────────────────▼───────────────────────────────┐
│              ASR Router (Language Detector)          │
│         Detects Indic phonemes vs English            │
└──────────────┬─────────────────────┬─────────────────┘
               │                     │
   ┌───────────▼───────┐    ┌────────▼──────────────┐
   │  Sarvam AI Saaras │    │ OpenAI Whisper large  │
   │ (Primary: Indic)  │    │ (Fallback: English)   │
   └───────────┬───────┘    └────────┬──────────────┘
               │                     │
┌──────────────▼─────────────────────▼─────────────────┐
│           Gemini 1.5 Flash Post-Processor            │
│       Romanization • Desi Style • SRT Builder        │
└──────────────────────────────────────────────────────┘

| Component | Technology | Role |
| --- | --- | --- |
| Web UI | Streamlit + FastAPI | Interface & WebSocket live captions |
| Primary ASR | Sarvam AI Saaras v1 | Indic language transcription |
| Fallback ASR | OpenAI Whisper large-v3 | English & general transcription |
| Post-Processor | Google Gemini 1.5 Flash | Romanization, desi tone, formatting |
| SRT Builder | Custom Python | Timestamp formatting & export |
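
The primary/fallback routing shown in the diagram and table above can be sketched roughly as follows. This is a minimal illustration, not the project's `asr_router.py`: `Transcript`, `route_segment`, `fake_sarvam`, and `fake_whisper` are hypothetical names, the real Sarvam and Whisper clients are replaced by local stubs so the example runs without API keys, and it assumes a primary-API timeout surfaces as Python's built-in `TimeoutError`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    text: str
    engine: str  # which backend produced the text: "primary" or "fallback"


def route_segment(
    audio: bytes,
    is_english_only: bool,
    primary: Callable[[bytes], str],
    fallback: Callable[[bytes], str],
) -> Transcript:
    """Use the primary ASR unless the segment is English-only or the primary times out."""
    if is_english_only:
        return Transcript(fallback(audio), "fallback")
    try:
        return Transcript(primary(audio), "primary")
    except TimeoutError:
        # Assumed failure mode: the primary API call timed out.
        return Transcript(fallback(audio), "fallback")


# Stand-in backends (hypothetical) so the sketch runs offline.
def fake_sarvam(audio: bytes) -> str:
    if audio == b"slow":
        raise TimeoutError("simulated API timeout")
    return "bohot accha laga yaar"


def fake_whisper(audio: bytes) -> str:
    return "let's go"


if __name__ == "__main__":
    print(route_segment(b"clip", False, fake_sarvam, fake_whisper))
    print(route_segment(b"slow", False, fake_sarvam, fake_whisper))
    print(route_segment(b"clip", True, fake_sarvam, fake_whisper))
```

Passing the backends in as callables keeps the routing decision testable without network access; how the language detector actually sets `is_english_only` is not specified here.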

📊 Benchmarking

Dataset: Kathbath (AI4Bharat, IIT Madras), a 1,700-hour verified corpus

Evaluation: 100 code-mixed audio clips (20 per language pair), evaluated on Google Colab

Metric: Word Error Rate (WER) via jiwer; lower is better, and text is normalized before scoring

Benchmark Chart

WER Results by Language Pair

| Language Pair | Sarvam AI (saaras:v3) | Whisper large-v3 | Winner | Sarvam Advantage |
| --- | --- | --- | --- | --- |
| Hinglish | 4.9% | 26.4% | ✅ Sarvam AI | 5.4× better |
| Benglish | 7.5% | 60.0% | ✅ Sarvam AI | 8× better |
| Tanglish | 15.7% | 48.2% | ✅ Sarvam AI | 3.1× better |
| Tenglish | 15.3% | 69.7% | ✅ Sarvam AI | 4.6× better |
| Odia+Eng | 7.3% | 116.3% | ✅ Sarvam AI | 15.9× better |
| Average | 10.1% | 64.1% | ✅ Sarvam AI | 6.3× better overall |
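
As a quick sanity check, the "Sarvam Advantage" column and the Average row can be recomputed from the per-pair WER values in the table. The short script below is only an illustration (the variable names are made up for this example); it shows that each advantage is the Whisper WER divided by the Sarvam WER, and that the overall 6.3× figure corresponds to the ratio of the two average WERs rather than the average of the per-pair ratios.

```python
# WER values in percent, copied from the table above: (Sarvam, Whisper).
wer_by_pair = {
    "Hinglish": (4.9, 26.4),
    "Benglish": (7.5, 60.0),
    "Tanglish": (15.7, 48.2),
    "Tenglish": (15.3, 69.7),
    "Odia+Eng": (7.3, 116.3),
}

# Per-pair advantage: how many times lower Sarvam's WER is than Whisper's.
advantage = {
    pair: round(whisper / sarvam, 1)
    for pair, (sarvam, whisper) in wer_by_pair.items()
}

# Unweighted means over the five language pairs.
mean_sarvam = sum(s for s, _ in wer_by_pair.values()) / len(wer_by_pair)
mean_whisper = sum(w for _, w in wer_by_pair.values()) / len(wer_by_pair)

# Overall advantage as the ratio of the two means.
overall_advantage = round(mean_whisper / mean_sarvam, 1)

if __name__ == "__main__":
    for pair, ratio in advantage.items():
        print(f"{pair}: {ratio}x")
    print(f"Average WER: {mean_sarvam:.1f}% vs {mean_whisper:.1f}%")
    print(f"Overall advantage: {overall_advantage}x")
```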

Key Findings:

  • Sarvam AI (saaras:v3) wins all 5 language pairs, which suggests its specialized Indic training data carries over to every code-mixed pair tested
  • The gap is widest for Odia+English: Whisper's 116.3% WER vs Sarvam's 7.3%, a 15.9× improvement
  • A WER above 100% means the substitutions, deletions, and insertions together outnumber the reference words, so Whisper's Odia+English output is effectively unusable
  • DesiCaptions uses Sarvam as primary across all pairs, with Whisper as a fallback only for API timeouts or English-only segments
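
To make the WER figures concrete, here is a small pure-Python sketch of the metric: the word-level edit distance between hypothesis and reference, divided by the number of reference words. The benchmark itself used jiwer; this stand-in only illustrates the definition. The `normalize` step (lowercasing and stripping ASCII punctuation) is an assumption, since the exact normalization used in the benchmark is not described.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split into words (assumed normalization)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of three.
    print(wer("bohot saare log", "bahut saare log"))
    # Three substitutions plus two insertions against a three-word reference.
    print(wer("bhai kaise ho", "buy case carters in Europe"))
```

The second call returns a value above 1.0: every reference word is wrong and the hypothesis adds extra words, which is how a WER such as 116.3% can arise.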

Human Evaluation: Gemini Post-Processing Quality

5 evaluators rated 20 caption pairs on a 1–5 Likert scale:

| Dimension | Before Gemini | After Gemini | Gain |
| --- | --- | --- | --- |
| Readability | 1.8 | 3.5 | +1.7 |
| Punctuation | 1.2 | 3.8 | +2.6 |
| Authenticity | 2.1 | 3.6 | +1.5 |
| Overall | 1.7 | 3.7 | +2.0 |

💬 Example Output

Gemini Desi Style transformation:

❌ Raw ASR Output:
bahut saare log pooch rahe the bhai kaise karte ho yeh so today i am going
to show you everything step by step

✅ DesiCaptions Output:
"Bohot saare log poochh rahe the,
bhai kaise karte ho yeh -
so today I'm gonna show you
everything, step by step! 🙌"
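
One way the three output modes could map onto post-processing instructions is sketched below. The prompt wording, the mode keys, and `build_prompt` are all hypothetical and do not reproduce the contents of `gemini_postprocess.py`; the sketch only builds the prompt string and does not call the Gemini API.

```python
# Hypothetical per-mode instructions; the project's real prompts are not shown in this README.
MODE_INSTRUCTIONS = {
    "desi": (
        "Rewrite the transcript as casual romanized code-mixed captions. "
        "Keep the speaker's own words and language switches; "
        "only fix spelling, casing, and punctuation."
    ),
    "script_english": (
        "Give the transcript in native script on line 1 "
        "and an English translation on line 2."
    ),
    "english": "Translate the transcript into clean, casual English.",
}


def build_prompt(raw_transcript: str, mode: str) -> str:
    """Combine the mode's instruction with the raw ASR text."""
    if mode not in MODE_INSTRUCTIONS:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{MODE_INSTRUCTIONS[mode]}\n\nTranscript:\n{raw_transcript.strip()}"


if __name__ == "__main__":
    raw = "bahut saare log pooch rahe the bhai kaise karte ho yeh"
    print(build_prompt(raw, "desi"))
```

Keeping the instructions in a single dictionary makes it easy to add a mode without touching the calling code.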

🛠️ Installation

git clone https://github.com/seriescrux/desi-caption.git
cd desi-caption
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

Environment Setup

Create a .env file in the root directory:

SARVAM_API_KEY=your_sarvam_api_key
GEMINI_API_KEY=your_gemini_api_key

Free API keys are available at sarvam.ai and aistudio.google.com.
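
A minimal sketch of how an app might read these two keys and fail early when one is missing. `load_api_keys` and `REQUIRED_KEYS` are hypothetical names, not taken from the repository. The sketch reads from the process environment only; loading the `.env` file into the environment first (for example with python-dotenv's `load_dotenv()`) is an assumption about the project's dependencies.

```python
import os
from typing import Mapping

# Key names taken from the .env example above.
REQUIRED_KEYS = ("SARVAM_API_KEY", "GEMINI_API_KEY")


def load_api_keys(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required API keys, or raise if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Calling `load_api_keys()` once at startup surfaces a missing key as a clear error instead of a failed API request later in the pipeline.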


💻 Usage

streamlit run app.py

Open http://localhost:8501 in your browser.

  1. Upload a video/audio file or start live microphone capture
  2. Select your output mode (Desi Style / Script+English / English Only)
  3. Click Generate Captions
  4. Download your .srt or .txt file

📁 Project Structure

desi-caption/
├── app.py                    # Streamlit main app
├── asr_router.py             # Language detection & ASR routing logic
├── sarvam_client.py          # Sarvam AI API wrapper
├── whisper_client.py         # Whisper inference wrapper
├── gemini_postprocess.py     # Gemini prompt engineering & post-processing
├── srt_builder.py            # SRT timestamp builder
├── requirements.txt
├── .env.example
├── results/
│   ├── benchmark_chart.png
│   └── benchmark_results.csv
└── tests/
    └── test_functional.py    # 15 IEEE 829 test cases

🧪 Testing

pytest tests/ -v

All 15 functional test cases pass per IEEE 829 documentation standards. Cross-browser tested on Chrome, Firefox, and Safari.


🌐 Standards & Code Quality

  • SRT Format: SubRip Text standard (HH:MM:SS,mmm), ≤2 lines/block, ≤42 chars/line per BBC Subtitle Guidelines 2023
  • Python: PEP 8 enforced via flake8, Google-style docstrings, full type hints (mypy compatible)
  • Security: API keys in .env only, never committed to version control
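
The SRT constraints listed above (HH:MM:SS,mmm timestamps, at most 2 lines per block, at most 42 characters per line) can be illustrated with a short standard-library sketch. `format_timestamp` and `build_srt` are hypothetical names and this is not the project's `srt_builder.py`. Splitting an over-long segment into several cues of equal duration is a simplifying assumption made for this example.

```python
import textwrap


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(segments, max_chars: int = 42, max_lines: int = 2) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    cues = []
    for start, end, text in segments:
        lines = textwrap.wrap(text, width=max_chars)
        # Group wrapped lines into blocks of at most max_lines lines.
        chunks = [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
        # Assumption: a long segment is split into cues of equal duration.
        step = (end - start) / max(len(chunks), 1)
        for n, chunk in enumerate(chunks):
            cues.append((start + n * step, start + (n + 1) * step, "\n".join(chunk)))
    return "\n".join(
        f"{index}\n{format_timestamp(s)} --> {format_timestamp(e)}\n{body}\n"
        for index, (s, e, body) in enumerate(cues, start=1)
    )


if __name__ == "__main__":
    print(build_srt([(0.0, 2.5, "Bohot accha laga yaar, let's go!")]))
```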

🔭 Roadmap

  • Kanglish, Malayalam+Eng, Marathi+Eng, Gujarati+Eng, Punjabi+Eng (→ 22 Indian languages)
  • Speaker diarization for multi-speaker podcasts and interviews
  • YouTube Data API v3 integration for direct SRT upload
  • React Native mobile app with FastAPI backend
  • Fine-tuned Whisper on Kathbath code-mixed subset
  • Creator analytics dashboard (language mix stats, vocabulary trends)

👥 Team

| Name | Role |
| --- | --- |
| Abhimanyu Kumar | ASR Pipeline & Backend |
| Gaurang Ayush | Audio Preprocessing & VAD |
| Kanishk Raj | Gemini Integration & Prompt Engineering |
| Madhurim Dutta | Frontend & Streamlit UI |
| Manan Ratnam Pandey | Benchmarking & Evaluation |
| Sruti Jha | Testing & Documentation |

Supervised by Dr. Krutika Verma, School of Computer Engineering, KIIT (2025–2026)


📄 License

MIT License: see LICENSE for details.


Built at KIIT School of Computer Engineering · 2025–2026

Code-mixed speech is India's natural language. DesiCaptions speaks it.
