AI Skin Doctor is an interactive Streamlit-based application that combines speech recognition, computer vision, and text-to-speech technologies to provide medical advice for skin conditions. Users can record their voice describing their symptoms and upload an image of their skin condition, and the AI will analyze the information and provide a voice response.
- Project Structure
- Features
- Technology Stack
- Installation
- Configuration
- Usage
- API Documentation
- Architecture
- Error Handling
- Contributing
LLMProject/
├── src/
│ ├── app.py # Main Streamlit application
│ ├── llm_brain.py # Image analysis with Groq LLM
│ ├── patient_voice.py # Speech-to-text transcription
│ ├── doctors_voice.py # Text-to-speech generation
│ ├── utils.py # Utility functions
│ └── main.py # Entry point
├── test/ # Test files
├── doc/ # Documentation
├── audio_records/
│ ├── inputs/ # User audio recordings
│ └── outputs/ # Generated doctor responses
├── images/ # Uploaded images
├── requirements.txt # Python dependencies
├── pyproject.toml # Project configuration
├── .env # Environment variables (not in repo)
└── README.md # Project overview
- Record audio through web browser
- Automatic audio format conversion (to MP3)
- Speech-to-text transcription using Groq's Whisper model
- Support for JPG, JPEG, and PNG formats
- Medical image analysis using Meta's Llama 4 Scout vision model
- Base64 encoding for secure image transmission
- Context-aware medical advice
- Natural language responses
- Differential diagnosis suggestions
- Text-to-speech conversion using ElevenLabs
- Natural-sounding voice responses
- Audio playback in browser
- Python 3.10+: Programming language
- Streamlit: Web application framework
- Groq API: LLM and STT services
- ElevenLabs API: Text-to-speech services
groq==0.15.0: Groq API clientstreamlit: Web UI frameworkSpeechRecognition: Audio processingpydub: Audio format conversionelevenlabs: Text-to-speech APIpython-dotenv: Environment variable managementgtts: Google Text-to-Speech (alternative)
- Whisper Large V3: Speech-to-text transcription
- Llama 4 Scout 17B: Multimodal image analysis
- ElevenLabs Multilingual V2: Natural voice synthesis
- Python 3.10 or higher
- FFmpeg (for audio processing)
- API Keys:
- Groq API Key
- ElevenLabs API Key
-
Clone the repository
git clone <repository-url> cd LLMProject
-
Create virtual environment (recommended)
python -m venv venv # Windows venv\Scripts\activate # Linux/Mac source venv/bin/activate
-
Install dependencies
pip install -r requirements.txt
-
Install FFmpeg
- Windows: Download from ffmpeg.org and add to PATH
- Mac:
brew install ffmpeg - Linux:
sudo apt-get install ffmpeg
-
Set up environment variables Create a
.envfile in the root directory:GROQ_API_KEY=your_groq_api_key_here ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
| Variable | Description | Required | Default |
|---|---|---|---|
GROQ_API_KEY |
Groq API authentication key | Yes | None |
ELEVENLABS_API_KEY |
ElevenLabs API authentication key | Yes | None |
Models can be configured in the respective Python files:
llm_brain.py:
model = "meta-llama/llama-4-scout-17b-16e-instruct"patient_voice.py:
stt_model = "whisper-large-v3"doctors_voice.py:
voice_id = "21m00Tcm4TlvDq8ikWAM" # Rachel voice
model_id = "eleven_multilingual_v2"The system prompt in app.py can be modified to change the AI's behavior:
system_prompt = """You have to act as a professional doctor..."""-
Navigate to the src directory
cd src -
Start the Streamlit app
streamlit run app.py
-
Access the application
- Open your browser to
http://localhost:8501
- Open your browser to
- Record Audio: Click the audio input button and describe your skin condition
- Upload Image: Select an image file showing your skin condition
- Submit: The app processes both inputs automatically
- View Results:
- Transcribed text of your audio
- AI doctor's response
- Audio playback of the response
User Audio: "I have red bumps on my face that are painful and won't go away"
User Image: [Photo of facial acne]
AI Response: "With what I see, I think you have inflammatory acne.
Try using benzoyl peroxide and consider consulting a
dermatologist if symptoms persist."
Encodes an image file to base64 format.
Parameters:
image_path(str): Path to the image file
Returns:
- str: Base64 encoded image string
Raises:
FileNotFoundError: If image file doesn't existException: For other encoding errors
Example:
encoded = encode_image("path/to/image.jpg")Analyzes an image using Groq's multimodal LLM.
Parameters:
query(str): Text prompt for analysismodel(str): Model identifierencoded_image(str): Base64 encoded image
Returns:
- str: AI-generated analysis
Raises:
Exception: If API call fails
Example:
response = analyze_image_with_query(
query="What skin condition is this?",
model="meta-llama/llama-4-scout-17b-16e-instruct",
encoded_image=encoded_img
)Converts audio file to MP3 format.
Parameters:
input_file(str): Path to input audio file
Returns:
- str: Path to converted MP3 file
Raises:
ValueError: If no audio file providedFileNotFoundError: If FFmpeg not foundException: For conversion errors
Transcribes audio to text using Groq's Whisper model.
Parameters:
audio_wav_file(str): Path to audio filestt_model(str): Model identifier
Returns:
- str: Transcribed text
Raises:
ValueError: If no audio file providedException: For transcription errors
Example:
text = transcribe_with_groq("audio.wav", "whisper-large-v3")Converts text to speech using ElevenLabs API.
Parameters:
input_text(str): Text to convertoutput_filepath(str): Path to save audio file
Returns:
- str: Path to saved audio file
Example:
audio_path = text_to_speech_with_elevenlabs(
"Hello patient",
"output.mp3"
)Creates directory if it doesn't exist.
Parameters:
directory_path(str): Path to create
Returns:
- Path: Path object of created directory
Example:
from pathlib import Path
dir_path = createDirIfNotExists("audio_records/inputs")Main processing function that orchestrates the entire workflow.
Parameters:
audio_filepath(str): Path to user's audio recordingimage_filepath(str): Path to uploaded image
Returns:
- tuple: (transcribed_text, doctor_response, audio_response_path)
Raises:
FileNotFoundError: If files not foundException: For processing errors
User Input (Audio + Image)
↓
[Streamlit UI (app.py)]
↓
[File Storage] → audio_records/inputs/, images/
↓
[Audio Processing] → patient_voice.py
├── Format Conversion (MP3)
└── Speech-to-Text (Groq Whisper)
↓
[Image Processing] → llm_brain.py
├── Base64 Encoding
└── Image Analysis (Llama 4 Scout)
↓
[Response Generation] → doctors_voice.py
└── Text-to-Speech (ElevenLabs)
↓
[Output Storage] → audio_records/outputs/
↓
[Display Results] → Streamlit UI
-
Input Phase:
- User records audio via browser
- User uploads image file
- Files saved to local storage
-
Processing Phase:
- Audio converted to MP3
- Audio transcribed to text
- Image encoded to base64
- Combined query sent to Llama model
-
Response Phase:
- AI generates text response
- Text converted to speech
- Audio file saved
-
Output Phase:
- Display transcription
- Display AI response
- Play audio response
The application implements comprehensive error handling:
-
File Errors:
FileNotFoundError: Missing audio/image files IOError: File read/write errors
-
API Errors:
ValueError: Missing API keys Exception: API request failures
-
Processing Errors:
Exception: Transcription/analysis failures
- Missing Files: User-friendly error messages displayed in UI
- API Failures: Logged with detailed error messages
- Processing Errors: Graceful degradation with informative feedback
The application uses Python's logging module:
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)Log Levels:
INFO: Normal operation eventsERROR: Error conditions
Log Locations:
- Console output (stdout/stderr)
- Can be configured to file output
- API Keys: Never commit
.envfile to version control - Testing: Test audio/image inputs before deployment
- Error Handling: Always wrap API calls in try-except blocks
- Logging: Use appropriate log levels
- Path Handling: Use Path objects for cross-platform compatibility
- Audio Quality: Record in quiet environment
- Image Quality: Use clear, well-lit photos
- File Formats: Use supported formats (MP3, WAV for audio; JPG, PNG for images)
- Privacy: Don't share real medical images without consent
1. FFmpeg Not Found
Error: FFmpeg or the input file was not found
Solution: Install FFmpeg and add to system PATH
2. API Key Missing
ValueError: GROQ_API_KEY not found in environment variables
Solution: Create .env file with valid API keys
3. Audio Recording Issues
Solution: Check browser permissions for microphone access
4. Image Upload Fails
Solution: Ensure file size < 200MB and correct format
- Fork the repository
- Create a feature branch
- Make changes with tests
- Submit pull request
- Follow PEP 8 guidelines
- Add docstrings to functions
- Include type hints where appropriate
- Write descriptive commit messages
This project is for educational purposes. Consult license file for details.
For issues or questions:
- Open an issue on GitHub
- Check existing documentation
- Review error logs
Last Updated: February 2026 Version: 1.0.0