Context
The currently-ACTIVE model (query_grader, version 20260521_021306) was trained as a bootstrap on synthetic/seed TrainingData to give the issue-#5 monitoring pipeline an ACTIVE target. Its metrics reflect that:
Training accuracy: 0.804
Validation accuracy: 0.712 (just over the 0.7 auto-deploy threshold)
Test accuracy: 0.095 (effectively no real predictive power)
The model satisfies the requirement that "an ACTIVE model exists so monitoring evaluates," but it is not a usable quality predictor. The low-confidence and low-user-agreement alerts it generates are a direct consequence of this.
Goal
Replace the synthetic bootstrap with a model trained on accumulated real user feedback once enough has been collected.
Tasks
Define a minimum real-feedback sample threshold that must be met before retraining (the current ML_MIN_TRAINING_SAMPLES=50 also counts synthetic seed rows, so a separate "validated real feedback" gate may be needed).
Confirm that the process_ml_feedback → TrainingData flow is converting real QueryFeedback into training samples.
Establish a retrain cadence (the ML_AUTO_RETRAIN / beat machinery exists) and a quality gate higher than the bootstrap's 0.7 val / 0.095 test.
Re-evaluate model_type: bootstrap is QUERY_GRADER; the grading path expects HYBRID_SCORER (per CLAUDE.md) once HybridQueryGrader is wired in.
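A minimal sketch of how the two gates from the tasks above (a real-feedback sample threshold and a stricter quality gate) could fit together. Everything here is an assumption for illustration: the `is_synthetic` flag, the `RetrainGate` class, the helper names, and the threshold values are hypothetical and not taken from the project's models or settings.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class RetrainGate:
    # Placeholder values; the real thresholds are still to be decided.
    min_real_samples: int = 50
    min_val_accuracy: float = 0.7
    min_test_accuracy: float = 0.7


def count_real_samples(rows: Iterable[Mapping]) -> int:
    """Count rows not marked as synthetic (hypothetical `is_synthetic` flag)."""
    return sum(1 for row in rows if not row.get("is_synthetic", False))


def should_retrain(rows: Iterable[Mapping], gate: RetrainGate) -> bool:
    """Allow retraining only once enough real feedback has accumulated."""
    return count_real_samples(rows) >= gate.min_real_samples


def should_promote(val_accuracy: float, test_accuracy: float,
                   gate: RetrainGate) -> bool:
    """Require both validation and test accuracy to clear the gate."""
    return (val_accuracy >= gate.min_val_accuracy
            and test_accuracy >= gate.min_test_accuracy)
```

With these placeholder thresholds, the bootstrap's own metrics (0.712 validation, 0.095 test) would fail `should_promote`, because test accuracy is checked alongside validation accuracy.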
Notes
Training runs in-container: railway ssh -s querygrade-worker "python manage.py train_ml_model …" (local railway run can't reach postgres.railway.internal).