AI Research Scientist studying how machine learning models fail in real-world conditions.
I recently completed my PhD at Inria / Sorbonne Université, where I studied how large language and translation models behave under noisy, heterogeneous, and out-of-distribution data. My work focuses on representation learning, robustness, and the gap between controlled benchmarks and real-world deployment.
I’m particularly interested in foundation models, multilingual and low-resource settings, and data-centric approaches to improving reliability.
- Robust ML & Representation Learning: embeddings, noisy text, domain shift, heterogeneous data
- LLMs & Generative Models: evaluation, prompting, behaviour analysis
- Multilingual & Low-Resource NLP: transfer learning, real-world data challenges
- Python, PyTorch (Transformers, Fairseq)
- Scikit-learn, Pandas
- SLURM, Linux, Git
These are among the repositories I’ve highlighted on my GitHub profile. Check them out for code, demos, and research results.
🔹 RoLASER
My PhD research on making LASER more robust to User‑Generated Content (UGC).
Includes robust sentence embeddings and UGC data generation via augmentation.
Paper published at LREC-COLING 2024; model released on Hugging Face.
These are important open‑source toolkits I’ve forked, used and contributed to during my PhD research:
📌 fairseq
Forked from Meta’s sequence‑to‑sequence toolkit and used extensively for NMT experiments. Contributed bug fixes and enhancements (e.g., dictionary handling improvements). Read my blog post about the bug here.
Forked the transformation library, used it to generate artificial UGC for data augmentation, and contributed bug fixes and new features.
📌 LASER
Forked from the original LASER (Language‑Agnostic SEntence Representations) for use and extension in my research. Wrote scripts for evaluating new models (RoLASER) and new tasks (Massive Text Embedding Benchmark, MTEB).
📌 SONAR
Forked the SONAR repository to use its text embedding and translation capabilities in my work (text‑only use case, not speech). Used it to extend the RoLASER approach to RoSONAR.
All my PhD-related repositories are grouped under my GitHub organization: lydianish-phd.
Work in progress: still migrating repositories from my lab's private GitLab.
Built the front-end of a data augmentation tool for Machine Translation during the 3-day online Unbabel MT Half-Marathon 2021.
Collaborative project for transparent distribution of Covid-19 relief funds in Kenya. Focused on system implementation and workflow improvements.
🧬 BRAG
Developed BRAG (Biomedical RAkinG), a cross-platform tool that aggregates bibliographic data from sources like PubMed and Google Scholar to summarise researchers’ scientific output, including publications, citations, h-index, and optional graphical representations. This was a school project at Centrale Nantes.
I am passionate about building robust NLP systems and contributing to open-source research. Feel free to explore my repositories and reach out for collaborations or opportunities!
