Welcome to my GitHub profile! I'm passionate about:
- AI Engineering
- Data Engineering
- Data Science
- Machine Learning
I love solving real-world problems through code.
- I'm currently working on signature detection in documents.
- École Polytechnique, Master in Data Science
- September 2025 - December 2026 (Ongoing)
- Relevant Courses:
- Optimization for Data Science
- Deep Learning (PyTorch, Keras)
- Reinforcement Learning
- Advanced AI for Text and Graphs (LoRA, RAG, Graph AI)
- INSA Lyon, Software Engineer
- September 2020 - July 2025 (Validated)
- Relevant Courses:
- Foundation of Data Engineering
- Machine Learning and Data Analytics
- Object Oriented Programming (C++)
- 2 years of STEM classes
Here are some of the notable projects I've worked on during my academic journey:
1. AiGORA
Multi-Agent Structural Bias Assessment for Educational Content
- 🥇 First Prize Winner at the IPAI Foundation Hackathon on Education. Uses multiple LLM agents that debate each other through a Socratic framework to identify different types of bias in educational text.
- Agents analyze text from different angles and synthesize their findings to surface nuanced bias patterns with quantitative scores. The debate approach helps catch biases that single-model analysis would miss.
- Evaluates content across customizable bias dimensions, combining agent outputs into consolidated results with visual summaries and detailed breakdowns by dimension.
Vision-Language Fine-tuning for Automated Document Verification
- Capstone project applying LoRA fine-tuning to Qwen2.5-VL and InternVL 2.0 to improve visual grounding and bounding box detection on document images.
- Evaluated four detection paradigms (YOLO11n, RT-DETR, Mistral OCR 3, and MLLMs), achieving F1-scores up to 95.3% and significantly outperforming zero-shot baselines.
- Developed parametrized JSON output format for precise bounding-box coordinates, enabling scalable document authentication and validation workflows.
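As a rough illustration of the JSON bounding-box idea, here is a minimal sketch that parses a model response and scores one box against a ground-truth box with intersection over union (IoU). The schema (`label`, `bbox` as `[x1, y1, x2, y2]`), the labels, and the function names are assumptions made for this example, not the project's actual format.

```python
import json


def parse_boxes(raw: str) -> list[dict]:
    """Parse a JSON response into validated boxes (hypothetical schema)."""
    boxes = []
    for item in json.loads(raw):
        x1, y1, x2, y2 = (float(v) for v in item["bbox"])
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes with no area
        boxes.append({"label": str(item["label"]), "bbox": (x1, y1, x2, y2)})
    return boxes


def iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Illustrative response: the second box has zero width and is dropped.
raw = (
    '[{"label": "signature", "bbox": [10, 20, 110, 70]},'
    ' {"label": "stamp", "bbox": [5, 5, 5, 40]}]'
)
predicted = parse_boxes(raw)
ground_truth = (10.0, 20.0, 110.0, 120.0)
score = iou(predicted[0]["bbox"], ground_truth)
```

Validating the output before scoring keeps malformed boxes from silently distorting detection metrics such as F1.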
Bridging Structured Chemical Graphs and Natural Language
- Multi-modal system that translates 2D molecular structures into human-readable scientific descriptions, automating chemical database enrichment and drug discovery reporting.
- Dual-tower architecture combining ChEmbed (BASF-AI) with a Graph Transformer using global self-attention, aligning symbolic graph representations with semantic text through trainable adapter layers.
- Trained with InfoNCE contrastive loss, hard negative mining via Tanimoto similarity, and Matryoshka representation learning for robust multi-modal alignment.
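To make the training objective more concrete, here is a minimal pure-Python sketch of a one-directional InfoNCE loss and of hard negative selection by Tanimoto similarity. Fingerprints are modeled as sets of "on" bit indices, and the temperature value, function names, and toy data are assumptions for illustration, not the project's implementation.

```python
import math


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def info_nce(sim: list, temperature: float = 0.07) -> float:
    """Mean InfoNCE loss over rows; sim[i][i] is the matching (positive) pair."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denominator - logits[i]
    return total / len(sim)


def hardest_negative(anchor: int, fingerprints: list) -> int:
    """Index of the other molecule that is structurally closest to the anchor."""
    candidates = [j for j in range(len(fingerprints)) if j != anchor]
    return max(candidates, key=lambda j: tanimoto(fingerprints[anchor], fingerprints[j]))


# Toy data: molecules 0 and 1 share three of five distinct bits.
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # graph-text pairs match perfectly
uniform = [[0.5, 0.5], [0.5, 0.5]]   # the model cannot tell pairs apart
```

Structurally similar molecules make the most informative negatives because their descriptions are the easiest to confuse.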
Other projects:
Real-time RAG Agent for Public Transport
- AI assistant that helps users navigate the Paris transport network using LLM tool-calling (Llama 3.1 405B via NVIDIA NIM) to query live APIs for itineraries and traffic disruptions.
- Microservices architecture deployed with Docker Compose, integrating Apache Kafka (KRaft) and ElasticSearch for real-time data streaming and retrieval.
- Combines RAG, tool-calling, and live API integration to deliver accurate, up-to-date transport guidance.
Time-series Event Classification on 2025 Roland Garros Final
- Classifies "Hit", "Bounce", and "Air" states from raw (x,y) coordinates by engineering kinematic features (Acceleration, Jerk, Turn Angle) to capture physical "shocks".
- Implemented an optimized LightGBM model that outperformed CatBoost and XGBoost baselines in handling extreme class imbalance.
- Built an unsupervised pipeline using UMAP embeddings and Gaussian Mixture Models (GMM) to cluster events without labels.
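The kinematic features mentioned above can be derived from raw coordinates with finite differences. The sketch below is one possible formulation, assuming evenly spaced frames and using vector magnitudes for acceleration and jerk; the function name, the `dt` parameter, and the toy trajectory are illustrative assumptions.

```python
import math


def kinematic_features(points: list, dt: float = 1.0) -> dict:
    """Speed, acceleration, jerk, and turn angle from consecutive (x, y) points."""

    def diff(vectors):
        # Finite difference between consecutive 2D vectors.
        return [((bx - ax) / dt, (by - ay) / dt)
                for (ax, ay), (bx, by) in zip(vectors, vectors[1:])]

    velocity = diff(points)
    acceleration = diff(velocity)
    jerk = diff(acceleration)

    turn_angle = []
    for (ax, ay), (bx, by) in zip(velocity, velocity[1:]):
        # Angle between successive velocity vectors, in degrees.
        angle = math.atan2(ax * by - ay * bx, ax * bx + ay * by)
        turn_angle.append(abs(math.degrees(angle)))

    def magnitude(vectors):
        return [math.hypot(x, y) for x, y in vectors]

    return {
        "speed": magnitude(velocity),
        "accel": magnitude(acceleration),
        "jerk": magnitude(jerk),
        "turn_angle": turn_angle,
    }


# Toy trajectory: straight along x, then a sharp 90-degree turn along y.
track = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
features = kinematic_features(track)
```

In this toy track the speed never changes, yet the acceleration magnitude and turn angle both spike at the corner, which is the kind of "shock" a hit or bounce would produce.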
ETL Pipeline for Mental Health Data Analysis
- Built a robust data ingestion pipeline to scrape posts from Reddit and HealthUnlocked, using Redis for deduplication and MongoDB for storage.
- Used LLMs (Mistral and a 1B model run through Ollama) for sentiment analysis, keyword extraction, gender inference, and detection of self-diagnosis and self-medication mentions to enrich the dataset.
- Designed a common database schema and cleaned the augmented data with pandas for efficient querying, visualization, and reporting.
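The deduplication step can be sketched as content fingerprinting plus a "seen" set. In the snippet below an in-memory class stands in for Redis so the example runs on its own; the class, the post fields (`source`, `text`), and the normalization rule are assumptions for illustration, not the pipeline's actual code.

```python
import hashlib


class SeenStore:
    """In-memory stand-in for a Redis set (hypothetical helper)."""

    def __init__(self) -> None:
        self._keys = set()

    def add(self, key: str) -> bool:
        """Return True if the key is new, similar to Redis SADD returning 1."""
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def fingerprint(post: dict) -> str:
    """Stable hash of the source plus whitespace- and case-normalized text."""
    normalized = " ".join(post["text"].lower().split())
    payload = f'{post["source"]}|{normalized}'
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(posts: list, store: SeenStore) -> list:
    """Keep only posts whose fingerprint has not been seen before."""
    return [post for post in posts if store.add(fingerprint(post))]


posts = [
    {"source": "reddit", "text": "I feel anxious  lately"},
    {"source": "reddit", "text": "i feel anxious lately"},  # same post, re-scraped
    {"source": "healthunlocked", "text": "Trouble sleeping this week"},
]
unique_posts = deduplicate(posts, SeenStore())
```

Hashing normalized text means a re-scraped post with different spacing or casing is still recognized as a duplicate before it reaches storage.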
Industrial Data Collection Protocol for Production Lines
- Set up a data collection protocol within the production lines of Geberit's factory in Haldensleben, Germany, working closely with stakeholders to align with business goals.
- Designed the SQL Server database schema and chose communication protocols (OPC-UA and SAP Plant Connectivity) suited to the factory environment.
- Built interactive dashboards in C# and CSHTML using the MVC model to highlight KPIs and provide real-time data visualization.
- Programming Languages: Python, C++, C, Java, JavaScript, C#
- AI / ML Frameworks: PyTorch, TensorFlow, Scikit-learn, Hugging Face, LightGBM, XGBoost, CatBoost
- LLM & GenAI: LoRA Fine-tuning, RAG, Tool-calling, Multi-Agent Systems, Prompt Engineering, Vision-Language Models (Qwen2.5-VL, InternVL), Llama, Mistral, Gemini, Ollama
- Data Engineering: Apache Kafka, Airflow, Docker, ElasticSearch, ETL Pipelines
- Databases: MySQL, MS SQL Server, MongoDB, Redis, Neo4j
- Computer Vision: YOLO, RT-DETR, OCR, Object Detection, Bounding Box Detection
- Data Science: Pandas, NumPy, UMAP, Gaussian Mixture Models, Feature Engineering, Time-series Analysis
- Tools & Other: Git, Docker Compose, Vue.js, ASP.NET, OPC-UA
- Email: youssefsidhom92@gmail.com
- LinkedIn: Youssef SIDHOM
I have played basketball all my life and I am only 175 cm (about 5 ft 9 in) tall.


