An AI-powered resume screening tool using Sentence-BERT (SBERT) for semantic matching and skill gap insights.
ResumeMatch-AI is an interactive web application that automates the initial phases of candidate screening. Using deep-learning semantic representation, the system parses candidate resumes (PDF or TXT) and matches them against job descriptions to score and rank candidates based on contextual relevance.
Unlike traditional keyword-filtering tools, ResumeMatch-AI utilizes pre-trained Sentence Transformers to capture the semantic context of descriptions and experience, mapping candidate skills and highlighting missing prerequisites.
The application handles text extraction, neural sentence embedding, matching, and analysis through a streamlined processing pipeline:
graph TD
%% Files inputs
subgraph Input [Document Upload]
JD[Job Description .txt]
Resumes[Candidate Resumes .pdf / .txt]
end
%% Pipeline operations
subgraph Pipeline [Analysis Pipeline]
Parse[Text Extraction - pdfplumber / utf-8]
Embed[Neural Embedding - SBERT all-MiniLM-L6-v2]
Cosine[Cosine Similarity Calculation - PyTorch]
Skills[Skill Gap Matcher - Case-Insensitive Auditing]
end
%% UI Dashboard
subgraph UI [Streamlit User Interface]
Dash[Interactive Results Dashboard]
CSV[Downloadable CSV Report]
end
%% Flow
JD -->|Read text| Parse
Resumes -->|Extract text| Parse
Parse -->|Raw Text| Embed
Parse -->|Raw Text| Skills
Embed -->|JD Vector| Cosine
Embed -->|Resume Vectors| Cosine
Cosine -->|Match Score %| Dash
Skills -->|Matched & Missing Lists| Dash
Dash --> CSV
classDef default fill:#f9f9f9,stroke:#333,stroke-width:1px;
classDef highlight fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
class Embed,Cosine highlight;
- Semantic Matching Engine: Uses pre-trained Sentence-BERT (all-MiniLM-L6-v2) to compute dense embedding vectors for job descriptions and resumes, ranking candidates by cosine similarity.
- Skill Gap Analysis: Compares parsed candidate resumes against a core directory of technical skills to identify missing skills.
- Multi-Format Parsing: Extracts text from PDF documents using pdfplumber and handles encoding issues for flat .txt files.
- Real-time Dashboard: Displays the ranked candidates with progress bars, listing matched and missing skills for each.
- CSV Report Generation: Produces structured candidate matching summaries that can be downloaded as a CSV report.
- Language: Python 3.8+
- Web Interface: Streamlit
- NLP Model: Sentence-BERT (SBERT) via the sentence-transformers library
- Linear Algebra & Computations: PyTorch (used internally for cosine similarity calculations) and scikit-learn
- Text Parsers: pdfplumber for PDF processing
- Data Processing: pandas
ResumeMatch-AI/
├── app/ # Application Source Files
│ ├── __init__.py
│ ├── main.py # Streamlit Interface & State Controller
│ ├── matcher.py # SBERT Encoding & Similarity Engine
│ └── utils.py # File I/O, PDF Parsing & Skill Verification
├── models/ # Weights & Vectorizer Cache Directory
├── sample_data/ # Flat files used for system testing
├── uploaded_jd/ # Local directory for temporary job description uploads
├── uploaded_resumes/ # Local directory for temporary candidate resume uploads
├── requirements.txt # Python Package Dependencies
├── generate.py # Utility script to generate sample test files
└── README.md # Core Documentation
- Python 3.8 or higher installed on your system.
- A package manager like pip.
- Clone the repository:
git clone https://github.com/joel8779/University_App.git
cd AI_Resume_Screener
- Set up a virtual environment:
python -m venv venv
- Activate the virtual environment:
- Windows (PowerShell):
.\venv\Scripts\Activate.ps1
- Windows (Command Prompt):
venv\Scripts\activate.bat
- Linux/macOS:
source venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
- Run the Streamlit web application:
python -m streamlit run app/main.py
The codebase is split into modular components designed for reuse:
Computes semantic similarity scores.
- Parameters:
  - job_description (str): The text of the target job description.
  - resumes (list): A list of dictionaries containing the keys "filename", "text", and "skills".
- Returns: A sorted list of candidate dictionaries containing:
  - "filename" (str): The resume file name.
  - "score" (float): Cosine similarity expressed as a percentage (0–100) and rounded to two decimal places.
  - "skills" (list): The list of matched skills.
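As a minimal sketch of the return shape described above, the snippet below turns raw cosine similarities into the sorted candidate list. The function name build_ranking and its similarities parameter are hypothetical and not taken from matcher.py; clamping negative similarities to 0 is also an assumption made to keep scores in the documented 0–100 range.

```python
from typing import Dict, List, Sequence


def build_ranking(resumes: List[Dict], similarities: Sequence[float]) -> List[Dict]:
    """Pair each resume with its similarity and sort best-first.

    `similarities` holds one raw cosine similarity per resume, in the same
    order as `resumes` (hypothetical interface, for illustration only).
    """
    if len(resumes) != len(similarities):
        raise ValueError("expected one similarity value per resume")

    ranked = []
    for resume, sim in zip(resumes, similarities):
        # Assumption: clamp to [0, 1] so the percentage stays within 0-100.
        clamped = max(0.0, min(1.0, float(sim)))
        ranked.append(
            {
                "filename": resume["filename"],
                "score": round(clamped * 100, 2),
                "skills": list(resume["skills"]),
            }
        )

    # Highest score first.
    ranked.sort(key=lambda candidate: candidate["score"], reverse=True)
    return ranked
```

Keeping the ranking step separate from the embedding step makes it easy to test the output format without loading the SBERT model.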
Parses unstructured documents.
- Parameters:
  - file_path (str): Path to the target document on local disk.
- Returns (str): Extracted text. Falls back to an empty string for unsupported formats.
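The behaviour described above could be sketched roughly as follows. The function name extract_text is hypothetical, and the choice to ignore undecodable bytes in .txt files is an assumption; only pdfplumber.open, pdf.pages, and page.extract_text() are relied on from the pdfplumber library.

```python
from pathlib import Path


def extract_text(file_path: str) -> str:
    """Return the text of a .txt or .pdf file, or "" for other formats."""
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix == ".txt":
        # Assumption: undecodable bytes are dropped rather than raising.
        return path.read_text(encoding="utf-8", errors="ignore")

    if suffix == ".pdf":
        # Imported lazily so plain-text parsing works without pdfplumber.
        import pdfplumber

        with pdfplumber.open(str(path)) as pdf:
            # extract_text() can return None for pages without a text layer.
            pages = [page.extract_text() or "" for page in pdf.pages]
        return "\n".join(pages)

    # Unsupported format: fall back to an empty string.
    return ""
```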
Audits text for predefined technical keywords.
- Parameters:
  - text (str): Source text extracted from the resume.
  - skill_keywords (list): Target skills directory.
- Returns: List of matching strings found in the text.
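A case-insensitive skill audit along these lines might look like the sketch below. The function name find_skills is hypothetical, and matching on whole tokens (so that "Java" does not match inside "JavaScript") is an assumption; the project may use a simpler substring check.

```python
import re
from typing import List


def find_skills(text: str, skill_keywords: List[str]) -> List[str]:
    """Return the keywords that appear in `text`, ignoring case."""
    found = []
    for skill in skill_keywords:
        # Lookarounds instead of \b so skills like "C++" still match.
        pattern = r"(?<!\w)" + re.escape(skill) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(skill)
    return found


if __name__ == "__main__":
    resume_text = "Experienced in PYTHON, pytorch and C++."
    print(find_skills(resume_text, ["Python", "PyTorch", "Java", "C++"]))
    # prints ['Python', 'PyTorch', 'C++']
```

Skills required by the job description but absent from the returned list are the candidates for the "missing" column.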
Keyword matching (such as TF-IDF or regex checks) fails when candidates describe their experience in different words from the job description (e.g., writing "Deep Learning" where the posting says "Machine Learning").
- Decision: Sentence-BERT (using the all-MiniLM-L6-v2 transformer model) encodes the document text into a 384-dimensional dense vector. Note that by default this model truncates input longer than 256 word pieces.
- Benefit: Cosine similarity is computed between the job description vector and each candidate resume vector, capturing context and synonyms rather than exact spelling matches.
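For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their lengths. The pure-Python sketch below illustrates the calculation only; the application itself is described as using PyTorch for this step, and the function name cosine_similarity and the zero-vector handling are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between vectors u and v."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")

    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))

    # Assumption: treat a zero vector as having no similarity to anything.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Because the measure depends only on direction, a long resume and a short job description can still score highly when they cover the same topics.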
Resume files are frequently uploaded as unstructured PDFs. Using basic text-extraction libraries can lead to corrupted character encoding or text column alignment issues.
- Decision: Integrated pdfplumber instead of older libraries such as PyPDF2.
- Benefit: pdfplumber handles multi-column PDF layouts well, preserving logical word order so that the extracted text embeds accurately.
To prevent filename collisions when multiple candidates upload files with generic names (e.g., Resume.pdf), the system sandboxes documents:
- Files are renamed on upload with a prepended uuid.uuid4() identifier.
- The "Clear Uploaded Files" button handles safe recursive deletion of the temporary storage folders using shutil.rmtree.
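The two behaviours above can be sketched with the standard library alone. The helper names save_upload and clear_uploads are hypothetical, as is the exact filename layout (hex UUID, underscore, original name); the real code in main.py or utils.py may differ.

```python
import shutil
import uuid
from pathlib import Path


def save_upload(data: bytes, original_name: str, upload_dir: str) -> Path:
    """Write uploaded bytes under a UUID-prefixed name and return the path."""
    target_dir = Path(upload_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Keep only the final path component so a crafted name cannot escape
    # the upload directory.
    safe_name = Path(original_name).name
    target = target_dir / f"{uuid.uuid4().hex}_{safe_name}"
    target.write_bytes(data)
    return target


def clear_uploads(upload_dir: str) -> None:
    """Recursively delete the upload directory; a missing directory is fine."""
    shutil.rmtree(upload_dir, ignore_errors=True)
```

Two candidates who both upload Resume.pdf then end up with distinct files on disk, while the original name stays visible as the suffix.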
- Vector Database Integration: Replace on-the-fly model encoding with a vector database or index (like ChromaDB or FAISS) to cache candidate resume vectors, avoiding redundant model execution.
- Entity Recognition (NER): Implement spaCy or custom transformers to extract candidate contact details, length of work history, and education dynamically.
- Fine-tuned Models: Fine-tune the pre-trained SBERT model on tech-domain corpus data (like software engineering job posting descriptions) to improve industry-specific scoring.
- Asynchronous Processing: Refactor the Streamlit app to run matching tasks asynchronously via a task queue such as Celery (for example with a Redis broker) to support larger document batches.
Contributions are welcome. Please open an issue first to discuss the changes you want to make.
- Fork the repository.
- Create a feature branch (git checkout -b feature/improvement).
- Commit your changes (git commit -m 'feat: description').
- Push to your branch (git push origin feature/improvement).
- Open a Pull Request.
This project is licensed under the MIT License. See the LICENSE file for details.

