a real-time application that converts voice input to text and generates svg visualizations using an llm.
BOARD/
├── backend/ # fastapi backend
│ ├── app/
│ │ ├── __init__.py
│ │ ├── main.py # fastapi application entry point
│ │ ├── config.py # application configuration
│ │ ├── models/
│ │ │ ├── __init__.py
│ │ │ └── schemas.py # pydantic models for request/response
│ │ ├── routers/
│ │ │ ├── __init__.py
│ │ │ ├── api.py # rest api endpoints
│ │ │ └── websocket.py # websocket handlers for real-time audio
│ │ ├── services/
│ │ │ ├── __init__.py
│ │ │ ├── speech_to_text.py # audio transcription service
│ │ │ ├── llm_processor.py # llm integration for svg generation
│ │ │ └── svg_generator.py # svg validation and processing
│ │ └── utils/
│ │ ├── __init__.py
│ │ └── audio_utils.py # audio format conversion utilities
│ ├── requirements.txt
│ └── .env.example
├── frontend/ # react typescript frontend
│ ├── src/
│ │ ├── components/
│ │ │ ├── AudioRecorder.tsx # main recording button component
│ │ │ ├── TranscriptionDisplay.tsx # real-time text display
│ │ │ └── SVGRenderer.tsx # svg visualization display
│ │ ├── hooks/
│ │ │ ├── useAudioRecorder.ts # microphone recording hook
│ │ │ └── useWebSocket.ts # websocket connection hook
│ │ ├── services/
│ │ │ └── api.ts # rest api client
│ │ ├── types/
│ │ │ └── index.ts # typescript type definitions
│ │ ├── App.tsx
│ │ └── main.tsx
│ ├── package.json
│ ├── tsconfig.json
│ └── vite.config.ts
└── README.md
- user clicks record button in frontend
- browser captures microphone audio
- audio chunks are base64 encoded and sent via websocket
- backend accumulates audio and sends to speech-to-text service
- transcription results are streamed back to frontend in real-time
- when recording stops, accumulated text is sent to llm processor
- llm generates svg code based on the description
- svg is sanitized, processed, and sent back to frontend
- optional live summary mode uses claude to convert transcript into slide-ready notes
- frontend renders notes on the left and visualization on the right
each service is modular and can be extended or replaced independently:
- speech_to_text.py: handles audio transcription with provider abstraction (openai whisper, google, deepgram)
- llm_processor.py: manages llm communication for svg generation with customizable prompts
- svg_generator.py: validates, sanitizes, and optimizes svg output
- AudioRecorder: main interaction point with record button and audio level visualization
- TranscriptionDisplay: shows real-time transcription as user speaks
- SVGRenderer: displays generated svg with loading and error states
cd backend
# create virtual environment
python -m venv venv
source venv/bin/activate # on windows: venv\Scripts\activate
# install dependencies
pip install -r requirements.txt
# copy environment file and add your api keys
cp .env.example .env
# run the server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000cd frontend
# install dependencies
npm install
# run development server
npm run devthe frontend will be available at http://localhost:5173 the backend api will be available at http://localhost:8000
copy backend/.env.example to backend/.env and configure:
| variable | description | default |
|---|---|---|
| OPENAI_API_KEY | openai api key for whisper and gpt | required |
| CLAUDE_KEY | anthropic api key for claude models | required for svg + summary |
| LLM_MODEL | llm model to use for svg generation | claude-opus-4-6 |
| SUMMARY_LLM_PROVIDER | provider for live speech summaries | claude |
| SUMMARY_LLM_MODEL | model used for live speech summaries | claude-sonnet-4-6 |
| STT_PROVIDER | speech-to-text provider | openai_whisper |
| CORS_ORIGINS | allowed cors origins | localhost:5173,localhost:3000 |
POST /api/text-to-svg- generate svg from text descriptionPOST /api/transcribe- transcribe uploaded audio filePOST /api/transcribe-and-generate- combined transcription and svg generationPOST /api/summarize-speech- summarize transcript chunks into slide-ready bullet notesGET /api/placeholder-svg- get loading placeholder svgGET /health- health check endpoint
ws://localhost:8000/ws/audio- real-time audio streaming endpoint- send:
{"type": "start_recording"}to begin - send:
{"type": "audio_chunk", "data": "<base64>"}for audio data - send:
{"type": "stop_recording"}to end - receive: transcription updates and generated svg
- send:
- create a new class in
speech_to_text.pyextendingBaseSpeechToText - implement the required methods:
transcribe_chunk,transcribe_stream,transcribe_file - add the provider to the factory in
SpeechToTextService._get_provider
- modify
SVG_SYSTEM_PROMPTinllm_processor.pyto change llm behavior - add new processing methods in
svg_generator.pyfor custom transformations - update the
process_svgpipeline to include new processing steps
all components use inline styles for simplicity. your team can:
- replace inline styles with css modules
- add a ui component library (chakra, material-ui, etc.)
- implement a design system
- the backend uses asyncio for non-blocking operations
- websocket connections maintain state per session
- audio is accumulated in 1-second chunks for optimal transcription
- svg code is sanitized to prevent xss attacks
- mock services are available for development without api keys
mit