Lydia Nishimwe lydianish

Hi, I'm Lydia Nishimwe 👋

AI Research Scientist studying how machine learning models fail in real-world conditions.

I recently completed my PhD at Inria / Sorbonne Université, where I studied how large language and translation models behave under noisy, heterogeneous, and out-of-distribution data. My work focuses on representation learning, robustness, and the gap between controlled benchmarks and real-world deployment.

I’m particularly interested in foundation models, multilingual and low-resource settings, and data-centric approaches to improving reliability.

🛠️ Expertise

Robust ML & Representation Learning: embeddings, noisy text, domain shift, heterogeneous data
LLMs & Generative Models: evaluation, prompting, behaviour analysis
Multilingual & Low-Resource NLP: transfer learning, real-world data challenges

⚙️ Stack

Python, PyTorch (Transformers, Fairseq)
Scikit-learn, Pandas
SLURM, Linux, Git

🚀 Featured Projects

These are among the repositories I’ve highlighted on my GitHub profile. Check them out for code, demos, and research results.

🔹 RoLASER

My PhD research on making LASER more robust to User‑Generated Content (UGC).
Includes robust sentence embeddings and UGC data generation via augmentation. The paper was published at LREC-COLING 2024, and the model is released on Hugging Face.

🔁 Forked and Contributed Projects (Used in Research)

These are important open‑source toolkits I’ve forked, used and contributed to during my PhD research:

📌 fairseq

Forked from Meta’s sequence‑to‑sequence toolkit and used extensively for neural machine translation (NMT) experiments. Contributed bug fixes and enhancements (e.g., dictionary handling improvements). Read my blog post about the bug here.

📌 NL‑Augmenter

Forked the transformation library, used it to generate artificial UGC for data augmentation, and contributed bug fixes and new features.

📌 LASER

Forked from the original LASER (Language‑Agnostic SEntence Representations) for use and extension in my research. Wrote scripts to evaluate new models (RoLASER) and new tasks (the Massive Text Embedding Benchmark, MTEB).

📌 SONAR

Forked the SONAR repository to use its text embedding and translation capabilities in my work (text‑only use case, not speech). I used it to extend the RoLASER approach to RoSONAR.

📂 PhD Work Organization

All my PhD-related repositories are grouped under my GitHub organization: lydianish-phd.

Work in progress: still migrating repositories from my lab's private GitLab.

🌐 Prior Collaborative / Hackathon Projects

💻 MT Challenger Frontend Legacy

Built the front-end of a data augmentation tool for Machine Translation during the 3-day online Unbabel MT Half-Marathon 2021.

🤝 Social Relief

Collaborative project for transparent distribution of COVID-19 relief funds in Kenya. I focused on system implementation and workflow improvements.

🧬 BRAG

Developed BRAG (Biomedical RAkinG), a cross-platform tool that aggregates bibliographic data from sources like PubMed and Google Scholar to summarise researchers’ scientific output, including publications, citations, h-index, and optional graphical representations. This was a school project at Centrale Nantes.

📫 Connect with Me

I am passionate about building robust NLP systems and contributing to open-source research. Feel free to explore my repositories and reach out for collaborations or opportunities!
