Auralis is a speaker identification system that uses voice biometrics to identify speakers in audio files. It is particularly well-suited for scenarios where speakers are known and recur, such as earnings calls, meetings, or podcasts.
The system works by generating a unique "voiceprint" (a speaker embedding) for each person and storing it in a reference database. When given a new audio clip, Auralis compares the voice in the clip to the database to find a match.
This project is fully containerized with Docker, so a working Docker installation is all you need to set it up and run it.
The core of Auralis is a deep learning model (speechbrain/spkrec-ecapa-voxceleb) that has been trained to extract the unique characteristics of a person's voice. The process is as follows:
- Audio Processing: Raw audio files are sliced into smaller, labeled clips for each speaker.
- Embedding Generation: A speaker embedding (a vector of numbers) is generated for each clip. These embeddings, along with an average embedding for each speaker, are stored in a JSON database.
- Speaker Matching: To identify a speaker in a new audio clip, an embedding is generated for the clip and compared against the average embeddings in the database using cosine similarity. The speaker with the highest similarity score is identified as the match.
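The averaging and matching steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in src/: the function names (`average_embedding`, `cosine_similarity`, `best_match`), the JSON layout, and the toy 3-dimensional vectors are all hypothetical, and real embeddings from the model are much longer vectors.

```python
import json
import math


def average_embedding(embeddings):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(embeddings)
    return [sum(values) / count for values in zip(*embeddings)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query, database):
    """Return (speaker, score) for the highest-scoring average embedding."""
    scores = {
        speaker: cosine_similarity(query, entry["average"])
        for speaker, entry in database.items()
    }
    speaker = max(scores, key=scores.get)
    return speaker, scores[speaker]


# Hypothetical per-clip embeddings for two speakers (toy values).
clips = {
    "alice": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "bob": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
}

# Assumed database layout: per-clip embeddings plus their average.
database = {
    name: {"clips": vectors, "average": average_embedding(vectors)}
    for name, vectors in clips.items()
}

# Round-trip through JSON to mimic loading the stored database.
database = json.loads(json.dumps(database))

speaker, score = best_match([0.9, 0.1, 0.0], database)
print(speaker, round(score, 3))
```

Comparing against one average vector per speaker keeps matching to a single similarity computation per known speaker, regardless of how many clips were used to build the database.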
- Docker installed and running on your system.
- A Hugging Face account and an access token.
This project requires downloading a pre-trained model from the Hugging Face Hub. The model used is speechbrain/spkrec-ecapa-voxceleb, which is a gated repository.
- Create a Hugging Face Account: If you don't have one, create an account at huggingface.co.
- Accept the Model's Terms: Visit the model's page at https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb and accept the license agreement.
- Generate an Access Token: In your Hugging Face account settings, create an access token with "read" permissions.
This token will be passed to the Docker container as an environment variable.
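Inside the container, a script could read the token from the environment and fail early with a clear message if it is missing. The sketch below is illustrative only: the variable name `HF_TOKEN` and the helper `read_hf_token` are assumptions, so check EXAMPLES.md for the name the project actually expects.

```python
import os


def read_hf_token(env=os.environ, name="HF_TOKEN"):
    """Return the token from the environment or raise a clear error."""
    # The variable name is an assumption, not confirmed by this project.
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set; pass it to the container as an environment variable."
        )
    return token


# Example with an explicit mapping instead of the real environment.
print(read_hf_token({"HF_TOKEN": "hf_example_token"}))
```

Reading the token from the environment keeps it out of the image and out of version control.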
You can build the Docker image with or without GPU support.
For CPU:
docker build -t auralis-cpu -f docker/Dockerfile.cpu .
For GPU:
docker build -t auralis-gpu -f docker/Dockerfile.gpu .
For a complete, step-by-step walkthrough on how to process audio, generate embeddings, and test speaker matching, please refer to the EXAMPLES.md file.
This guide will walk you through the entire workflow, from raw audio to speaker identification, with copy-paste-friendly commands.
This project has been tested on an Intel-based Mac (macOS). The auralis-cpu Docker image and all scripts have been confirmed to work in this environment.
The Dockerfile.gpu is provided for users with NVIDIA GPUs, but it has not been tested.
Auralis/
├── docker/ # Dockerfiles for CPU and GPU environments
│ ├── Dockerfile.cpu
│ ├── Dockerfile.gpu
│ └── requirements.txt
├── src/ # Python source code
│ ├── process_audio.py
│ ├── generate_embeddings.py
│ └── test_matching.py
├── data/ # Data directory (ignored by git)
│ ├── raw_audio/ # Place your raw audio files here
│ ├── processed_audio/ # Processed clips will be saved here
│ └── test_audio/ # Place audio files for testing here
├── .gitignore
├── README.md # This file
├── EXAMPLES.md # Step-by-step usage examples
└── LICENSE # MIT License
This proof-of-value release provides a solid foundation. Future enhancements could include:
- A more robust database for storing embeddings.
- A user interface for easier interaction.
- Real-time transcription and speaker identification.
This project is licensed under the MIT License. See the LICENSE file for details.