Initially I was not satisfied with the query responses because llama_index was overweighting funfacts. I am solving that with document weighting. Having eval tests will keep responses consistent as new content is added to the /data folder.
Discussion with ChatGPT (o4-mini-high)
“Evals” are essentially automated tests for LLM-powered systems. Instead of manually asking your model questions and eyeballing the answers, you write small scripts (or use a framework like OpenAI’s Evals) that:
1. Feed in a prompt or scenario (e.g. a sample user question).
2. Define the “right” output (or at least a set of criteria: a multiple-choice answer, keywords it must include, sentiment, format).
3. Run the model (or your RAG pipeline) against those test prompts.
4. Score the results automatically (did the answer match exactly? Did it contain the required citation? Was it concise?).
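The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular framework does it: `EvalCase`, `score`, `run_evals`, and the canned `fake_pipeline` are hypothetical names, and simple keyword matching stands in for whatever grading criteria you choose.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    prompt: str              # step 1: the prompt or scenario
    must_include: List[str]  # step 2: criteria the answer has to meet


def score(answer: str, case: EvalCase) -> bool:
    """Step 4: pass only if every required keyword appears (case-insensitive)."""
    text = answer.lower()
    return all(keyword.lower() in text for keyword in case.must_include)


def run_evals(ask: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Step 3: run each prompt through the pipeline and score the answer."""
    return {case.prompt: score(ask(case.prompt), case) for case in cases}


def fake_pipeline(prompt: str) -> str:
    """Stand-in for the real model or RAG call; returns a canned answer."""
    return "Placeholder answer that mentions the hackathon."


if __name__ == "__main__":
    cases = [EvalCase("What did Autumn do well?", ["hackathon"])]
    print(run_evals(fake_pipeline, cases))
```

In a real suite, `fake_pipeline` would be replaced by a function that calls the model or the RAG endpoint; the scoring logic would not need to change.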
⸻
Why use evals in a RAG app
1. Catch regressions
– If you tweak your prompt template or upgrade the LLM, you want to know if answers get worse before you push to production.
2. Measure retrieval quality
– You can write tests like “when I ask about my hackathon, at least one returned context chunk must come from kudos.txt.”
3. Enforce consistency
– Maybe you want every answer to mention your resume bullet “led the documentation team.” An eval can check for that phrase.
4. Automate benchmarks
– Over time, you can track a “correct-answer rate” or “context-citation rate” as you adjust your index or prompts.
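As a rough sketch of the retrieval check and the tracked rates mentioned above: the helpers below assume each retrieved chunk is a dict with a `source` key holding the file name. That shape, the function names, and the sample data are assumptions for illustration, not what llama_index returns by default.

```python
from typing import Dict, List


def cites_source(chunks: List[Dict[str, str]], filename: str) -> bool:
    """True if at least one retrieved chunk came from the given file."""
    return any(chunk.get("source") == filename for chunk in chunks)


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of checks that passed; 0.0 when there are no checks."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


if __name__ == "__main__":
    # Hypothetical retrieval output for a question about the hackathon.
    chunks = [
        {"source": "kudos.txt", "text": "..."},
        {"source": "other.txt", "text": "..."},  # made-up second file
    ]
    print(cites_source(chunks, "kudos.txt"))
    # One pass/fail flag per eval case, e.g. collected over a full run.
    print(pass_rate([True, True, False, True]))
```

Logging `pass_rate` after each index or prompt change gives a simple number to compare across runs.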
⸻
How you’d add it
1. Pick a framework
– OpenAI Evals: plugs directly into OpenAI models, and you can also call your FastAPI endpoint from within each test.
– Custom scripts: a simple Python or JS test suite (pytest or jest) that sends requests to your /api/query endpoint and asserts on the JSON response.
2. Write a handful of core tests
• Smoke test: “Can I hit /api/test?”
• Basic QA: “When I ask ‘What did Autumn do well?’ the response must contain the word ‘hackathon’.”
• RAG-specific: “When I ask ‘What does my resume say about data analytics?’ at least one context entry must reference the word ‘analytics.’”
3. Integrate into CI
– Every pull request runs your eval suite. Fail fast if an LLM or prompt change breaks your “must-mention” rules.
4. Iterate
– As you add new data sources (blog, Instagram), add new evals to cover those use cases.
⸻
Recommendation for AutBot
• Yes, even a toy RAG project benefits from a small eval suite.
• Start light: 5–10 tests covering your most important queries and citation requirements.
• Use plain pytest + requests (no heavyweight dependencies) or look at OpenAI Evals for more advanced grading.
• You’ll get confidence to experiment with prompts, models, and index parameters without fear of silently degrading the user experience.
Adding evals now will save you time later, and it’s exactly the kind of engineering rigor that recruiters (and your future self) will appreciate when they see “Automated LLM testing” on your resume.