
VoiceScript



A production-grade, full-stack AI speech recognition system that transcribes any audio in any language, powered by a multi-stage AI pipeline combining vocal isolation, noise reduction, and transformer-based transcription.


Try Live Demo · Quick Start · Architecture · Features



Interface

VoiceScript UI


The Problem VoiceScript Solves

Most speech recognition tools fail in the real world because they assume clean audio. They struggle with:

  • Background music underneath speech (YouTube videos, podcasts, interviews)
  • Non-English or mixed-language audio
  • Crowd noise and ambient sound that degrade accuracy
  • Long audio files that hit API rate limits and duration caps

VoiceScript addresses these with a three-stage AI pipeline: the audio is normalized and the vocals are isolated before Whisper transcribes a single word.


Features

Core

  • Live microphone recording with real-time timer
  • File upload: WAV, MP3, FLAC, OGG, WebM
  • Any Language → English in one click
  • Auto language detection badge
  • No audio length limit: long files are auto-chunked
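Auto-chunking can be pictured as cutting a long recording into fixed windows that overlap slightly, so a word on a boundary is not lost. The sketch below only computes the window boundaries; the 30-second window, the 1-second overlap, and the function name `chunk_spans` are assumptions for illustration, not values taken from the VoiceScript source.

```python
from typing import List, Tuple


def chunk_spans(duration_s: float, chunk_s: float = 30.0,
                overlap_s: float = 1.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into (start, end) windows, in seconds.

    chunk_s and overlap_s are illustrative defaults (assumptions).
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        # Step back by the overlap so speech on a boundary appears in both chunks.
        start = end - overlap_s
    return spans
```

Each span can then be transcribed on its own and the segment timestamps shifted by the span's start time before the results are merged.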

Advanced

  • Timestamped transcript: every sentence tagged
  • Translate to 55+ languages via Google Translate
  • Export to TXT: clean formatted file
  • Export to SRT: YouTube-ready subtitles
  • Export to PDF: professional document

Privacy & Performance

  • 100% local processing when self-hosted: audio never leaves your machine
  • Whisper medium: 769M params, optimized for CPU
  • Demucs vocal isolation: strips background music
  • Auto cleanup: temp files deleted after transcription
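One way to guarantee the auto cleanup is a context manager that deletes the temporary file even when transcription raises. This is a minimal sketch using only the standard library; the helper name `temp_audio_file` is hypothetical and may not match how backend/app.py actually handles uploads.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def temp_audio_file(data: bytes, suffix: str = ".wav",
                    directory: Optional[str] = None) -> Iterator[str]:
    """Write uploaded bytes to a temp file and remove it when the block exits."""
    fd, path = tempfile.mkstemp(suffix=suffix, dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on error, so no audio is left on disk.
        if os.path.exists(path):
            os.remove(path)
```

Inside the `with` block the path can be handed to the pipeline; once the block ends, the file is gone.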

UI/UX

  • Premium dark interface: Space Grotesk + Inter
  • Smooth animations: fade, slide, pulse, heartbeat
  • Fully responsive: works on any screen size
  • Animated background grid + floating orbs
  • 7 color-coded tech badges in footer

Tech Stack

Backend

| Technology        | Version | Purpose                                       |
|-------------------|---------|-----------------------------------------------|
| Python            | 3.10+   | Core backend language                         |
| Flask             | 3.0.3   | REST API server + frontend serving            |
| Flask-CORS        | 4.0.1   | Cross-origin request handling                 |
| OpenAI Whisper    | medium  | Primary speech-to-text AI (769M params)       |
| Facebook Demucs   | 4.0.1   | Neural source separation for vocal isolation  |
| pydub             | 0.25.1  | Audio format conversion + preprocessing       |
| deep-translator   | 1.11.4  | Google Translate integration (55+ languages)  |
| SpeechRecognition | 3.11.0  | Fallback recognition engine                   |
| NumPy             | 2.4.4   | Numerical processing for Whisper              |
| ffmpeg            | 8.1     | Low-level audio codec processing              |
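The pipeline's first stage converts any input to 16 kHz mono WAV. VoiceScript does this through pydub, which relies on ffmpeg to decode compressed formats; the sketch below shows the equivalent direct ffmpeg call rather than the pydub code itself. The function names are hypothetical, and only standard ffmpeg options (`-y`, `-i`, `-ac`, `-ar`) are used.

```python
import subprocess
from typing import List


def ffmpeg_to_wav_args(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that writes `src` as mono WAV at `sample_rate` Hz."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file, any format ffmpeg can decode
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # output sample rate in Hz
        dst,
    ]


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_to_wav_args(src, dst), check=True)
```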

Frontend

| Technology              | Purpose                                              |
|-------------------------|------------------------------------------------------|
| HTML5                   | Semantic structure                                   |
| CSS3                    | Dark theme, keyframe animations, responsive grid     |
| Vanilla JavaScript ES6+ | MediaRecorder API, Fetch API, DOM manipulation       |
| Web Audio API           | Live microphone capture                              |
| Google Fonts            | Space Grotesk + Inter typography                     |

Infrastructure

| Technology          | Purpose                                  |
|---------------------|------------------------------------------|
| Docker              | Containerized deployment                 |
| Hugging Face Spaces | Production hosting (16GB RAM, free tier) |
| GitHub              | Version control + CI                     |

๐Ÿ“ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                         FRONTEND                                โ”‚
โ”‚              HTML5  ยท  CSS3  ยท  Vanilla JS ES6+                โ”‚
โ”‚                                                                 โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚   File Upload        โ”‚    โ”‚   Live Mic Recording         โ”‚   โ”‚
โ”‚  โ”‚   Drag & Drop        โ”‚    โ”‚   MediaRecorder API          โ”‚   โ”‚
โ”‚  โ”‚   MP3/WAV/FLAC/OGG  โ”‚    โ”‚   webm/ogg format            โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚             โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                   โ”‚
โ”‚                            โ”‚ HTTP POST /transcribe              โ”‚
โ”‚                            โ”‚ multipart/form-data + mode         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                             โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                       FLASK API                                 โ”‚
โ”‚                            โ–ผ                                    โ”‚
โ”‚                  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                          โ”‚
โ”‚                  โ”‚   app.py         โ”‚                          โ”‚
โ”‚                  โ”‚  POST /transcribeโ”‚                          โ”‚
โ”‚                  โ”‚  POST /translate โ”‚                          โ”‚
โ”‚                  โ”‚  GET  /languages โ”‚                          โ”‚
โ”‚                  โ”‚  GET  /health    โ”‚                          โ”‚
โ”‚                  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                          โ”‚
โ”‚                           โ”‚                                     โ”‚
โ”‚              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”             โ”‚
โ”‚              โ”‚      BEAST MODE PIPELINE           โ”‚             โ”‚
โ”‚              โ”‚                                    โ”‚             โ”‚
โ”‚              โ”‚  STAGE 1 โ€” pydub                   โ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ Any format โ†’ 16kHz mono WAV    โ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ Volume normalization            โ”‚             โ”‚
โ”‚              โ”‚  โ””โ”€ Silence stripping              โ”‚             โ”‚
โ”‚              โ”‚              โ†“                     โ”‚             โ”‚
โ”‚              โ”‚  STAGE 2 โ€” Facebook Demucs         โ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ htdemucs model                 โ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ Neural source separation       โ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ Isolate vocal stem             โ”‚             โ”‚
โ”‚              โ”‚  โ””โ”€ Discard music/noise/drums      โ”‚             โ”‚
โ”‚              โ”‚              โ†“                     โ”‚             โ”‚
โ”‚              โ”‚  STAGE 3 โ€” OpenAI Whisper Medium   โ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ 769M parameter transformer     โ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ beam_size=5  best_of=5         โ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ temperature=0  patience=2      โ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ condition_on_previous_text=Trueโ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ no_speech_threshold=0.25       โ”‚             โ”‚
โ”‚              โ”‚  โ”œโ”€ word_timestamps=True           โ”‚             โ”‚
โ”‚              โ”‚  โ””โ”€ Auto language detection        โ”‚             โ”‚
โ”‚              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜             โ”‚
โ”‚                           โ”‚                                     โ”‚
โ”‚           JSON response: { transcript, segments,                โ”‚
โ”‚                           detected_language, duration,          โ”‚
โ”‚                           word_count, engine }                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
                            โ–ผ
              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
              โ”‚     FRONTEND RENDERS    โ”‚
              โ”‚  ยท Transcript text      โ”‚
              โ”‚  ยท Language badge       โ”‚
              โ”‚  ยท Timestamps view      โ”‚
              โ”‚  ยท Translation panel    โ”‚
              โ”‚  ยท Export buttons       โ”‚
              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
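The JSON response named in the diagram could be assembled from Whisper-style segments roughly as follows. The six field names come from the diagram; the segment shape (`start`, `end`, `text`), the rounding, the `"whisper-medium"` engine label, and the function name `build_response` are assumptions for illustration.

```python
from typing import Any, Dict, List


def build_response(segments: List[Dict[str, Any]], language: str,
                   engine: str = "whisper-medium") -> Dict[str, Any]:
    """Collect transcript text and summary fields into one JSON-ready dict."""
    transcript = " ".join(seg["text"].strip() for seg in segments).strip()
    # Duration is approximated by the end time of the last segment.
    duration = max((seg["end"] for seg in segments), default=0.0)
    return {
        "transcript": transcript,
        "segments": [
            {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
            for seg in segments
        ],
        "detected_language": language,
        "duration": round(duration, 2),
        "word_count": len(transcript.split()),
        "engine": engine,
    }
```

The returned dict is plain data, so a Flask route can pass it straight to `jsonify`.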

๐Ÿ“ Project Structure

VoiceScript/
โ”‚
โ”œโ”€โ”€ backend/
โ”‚   โ”œโ”€โ”€ app.py              # Flask server โ€” all API routes
โ”‚   โ”œโ”€โ”€ transcriber.py      # Beast mode pipeline: pydub โ†’ Demucs โ†’ Whisper
โ”‚   โ””โ”€โ”€ translator.py       # Google Translate integration (55+ languages)
โ”‚
โ”œโ”€โ”€ frontend/
โ”‚   โ”œโ”€โ”€ index.html          # App structure โ€” all UI elements
โ”‚   โ”œโ”€โ”€ style.css           # Dark theme, keyframe animations, responsive layout
โ”‚   โ””โ”€โ”€ app.js              # MediaRecorder, Fetch API, all interactivity
โ”‚
โ”œโ”€โ”€ uploads/                # Temp audio storage (auto-deleted after processing)
โ”‚   โ””โ”€โ”€ .gitkeep
โ”‚
โ”œโ”€โ”€ docs/
โ”‚   โ””โ”€โ”€ Screenshot.png      # UI screenshot
โ”‚
โ”œโ”€โ”€ Dockerfile              # Docker container config for HF Spaces
โ”œโ”€โ”€ .gitignore              # Excludes venv, uploads, cache
โ”œโ”€โ”€ requirements.txt        # All Python dependencies
โ””โ”€โ”€ README.md               # This file

Quick Start

Prerequisites

  • Python 3.10+
  • Git
  • ffmpeg (on Windows: winget install ffmpeg)
  • Modern browser (Chrome, Edge, Firefox)

1. Clone

git clone https://github.com/TUSHARTAMRAKAR/VoiceScript.git
cd VoiceScript

2. Virtual environment

python -m venv venv

# Windows
venv\Scripts\activate

# Mac/Linux
source venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

โš ๏ธ First run downloads:

  • Whisper medium model โ€” ~769MB (cached at ~/.cache/whisper/)
  • Demucs htdemucs model โ€” ~300MB (cached automatically)

Both are one-time downloads. Every subsequent run loads instantly.

4. Start the server

cd backend
python app.py

5. Open the app

Open http://localhost:7860 in your browser. VoiceScript is running.


Live Demo

Deployed on Hugging Face Spaces · Docker · CPU Basic · Always On


How to Use

Upload a file:

  1. Drop any audio file onto the upload zone
  2. Choose mode: Transcribe or Any Language → English
  3. Click Transcribe File
  4. Get transcript + language badge + timestamps + export options

Record live:

  1. Click Start Recording and allow mic access
  2. Speak clearly
  3. Click Stop & Transcribe
  4. Transcript appears in seconds

Translate:

  • After transcription, the Translate panel appears
  • Pick any of 55+ languages from the dropdown
  • Click Translate and Google Translate does the rest
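Long transcripts may need to be split before they are sent to Google Translate. The sketch below packs words into pieces under a character limit and translates each piece with deep-translator's `GoogleTranslator`. The 4500-character limit is an assumption chosen to stay under the roughly 5000-character request cap, and both function names are hypothetical rather than taken from backend/translator.py.

```python
from typing import List


def split_for_translation(text: str, limit: int = 4500) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most `limit` chars.

    A single word longer than `limit` is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > limit and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_text(text: str, target: str = "fr") -> str:
    """Translate `text` chunk by chunk (needs deep-translator and network access)."""
    from deep_translator import GoogleTranslator  # third-party, imported lazily

    translator = GoogleTranslator(source="auto", target=target)
    return " ".join(translator.translate(chunk)
                    for chunk in split_for_translation(text))
```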

Export:

  • TXT: plain text file with metadata header
  • SRT: subtitle file with timestamps, ready for YouTube/VLC
  • PDF: opens a clean print-ready page that you can save as PDF
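For reference, an SRT file is a sequence of numbered cues, each with an `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range followed by the cue text. The sketch below builds that text from timestamped segments. It is written in Python for illustration and assumes segments shaped like `{"start", "end", "text"}`; the app's own export may live in the frontend and differ in detail.

```python
from typing import Any, Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict[str, Any]]) -> str:
    """Render segments as SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```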

โš™๏ธ Configuration

Change Whisper model in backend/transcriber.py:

# tiny | base | small | medium | large-v3
# medium = best CPU balance (default)
# large-v3 = maximum accuracy (needs GPU for practical speed)
WHISPER_MODEL_SIZE = "medium"

Change transcription language in backend/transcriber.py:

language = "en"   # en, hi, de, fr, es, ja, ko, zh...
# Remove language= entirely for auto-detection
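The language setting and the two UI modes map naturally onto keyword arguments of Whisper's `transcribe` call. The sketch below gathers the decoding settings listed in the architecture diagram into one dict and leaves `language` out when auto-detection is wanted. Mapping "Any Language → English" to `task="translate"` is an assumption about how the backend is wired, and the function names are hypothetical.

```python
from typing import Any, Dict, Optional


def build_decode_options(mode: str = "transcribe",
                         language: Optional[str] = None) -> Dict[str, Any]:
    """Return keyword arguments for whisper's model.transcribe()."""
    if mode not in ("transcribe", "translate"):
        raise ValueError(f"unsupported mode: {mode!r}")
    options: Dict[str, Any] = {
        "task": mode,               # "translate" produces English output
        "beam_size": 5,
        "best_of": 5,               # only relevant when sampling (temperature > 0)
        "temperature": 0.0,
        "patience": 2.0,
        "condition_on_previous_text": True,
        "no_speech_threshold": 0.25,
        "word_timestamps": True,
    }
    if language is not None:
        options["language"] = language  # omit the key to let Whisper auto-detect
    return options


def transcribe(path: str, mode: str = "transcribe",
               language: Optional[str] = None) -> Dict[str, Any]:
    """Run Whisper medium on `path` (needs openai-whisper installed)."""
    import whisper  # third-party, imported lazily

    model = whisper.load_model("medium")
    return model.transcribe(path, **build_decode_options(mode, language))
```

Keeping the options in one function makes it easy to compare runs when you change the model size or a threshold.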

Roadmap

  • GPU acceleration (CUDA) for large-v3 model
  • Real-time streaming transcription
  • Speaker diarization (who said what)
  • AI summarization of long transcripts
  • Transcript history (localStorage)
  • Docker compose for one-command setup
  • REST API documentation (Swagger/OpenAPI)
  • Chrome extension for transcribing browser audio

๐Ÿค Contributing

# Fork โ†’ clone โ†’ create branch
git checkout -b feature/your-feature

# Make changes, then
git commit -m "feat: describe your change"
git push origin feature/your-feature

# Open a Pull Request

Please follow Conventional Commits.


๐Ÿ‘จโ€๐Ÿ’ป Author

Tushar Tamrakar

Full-Stack Developer · AI/ML Enthusiast · Builder

"Built with curiosity, powered by caffeine, debugged at 2AM."


License

MIT License. Copyright (c) 2026 Tushar Tamrakar

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software to deal in the Software without restriction,
including the rights to use, copy, modify, merge, publish, distribute,
sublicense, and/or sell copies of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND.

Python · Flask · JavaScript · OpenAI Whisper · Facebook Demucs · pydub · ffmpeg · Docker · Hugging Face
