IM3-263: feat(evaluation): v7 redesign of accuracy evaluation for the 7 LLM agents + cache schema enrichment by yezin013 · Pull Request #207 · Himidea-AI/Final_Project

yezin013 · 2026-05-07T02:44:47Z

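The v6 LLM-as-judge evaluation was found to produce false positives (for example, a MAPE of 0.1% for market_analyst). Following text analysis,
the evaluation was redesigned so that each agent type is scored with a method that can actually be measured.

Enriched the cache schema of population_node / market_analyst_node (raw data is now stored alongside the output), which extends
measurement coverage from 4 to 6 agents.
Average accuracy across the 6 agents: 87.55% (n=8~11)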

Evaluation method changes (v6 → v7)

Agent	v6	v7
market_analyst	LLM-judge	Grade classification accuracy (rule engine)
demographic_depth	LLM-judge	Direct age match (top entry of top_3_age_groups)
synthesis	LLM-judge	Quantitative consistency rules (legal, net_profit, grade-recommendation contradiction, winner)
trend_forecaster	6m future	QoQ direction match
population	LLM-judge + peak	Direct match on age, gender, and peak
competitor_intel	MAPE + signal	Signal rule engine (unchanged)
legal	RAG benchmark	Excluded (evaluated separately)
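The two rule-based methods in the table above (grade classification accuracy and QoQ direction match) can be sketched as follows. This is a minimal illustration, not the evaluators in backend/src/evaluation/: the `Case` fields, the saturation thresholds, and the 0.5% flat band are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One evaluation case: agent output plus the raw inputs behind it (hypothetical fields)."""
    predicted_grade: str      # grade emitted by the agent
    saturation: float         # raw input on an assumed 0..1 scale
    predicted_direction: str  # "up" / "down" / "flat" from the forecaster
    qoq_pct: float            # observed quarter-over-quarter change, in percent


def expected_grade(saturation: float) -> str:
    """Rule-engine ground truth: fixed thresholds (values are illustrative only)."""
    if saturation < 0.3:
        return "A"
    if saturation < 0.6:
        return "B"
    return "C"


def actual_direction(qoq_pct: float, flat_band: float = 0.5) -> str:
    """Turn an observed QoQ change into a direction label; small moves count as flat."""
    if qoq_pct > flat_band:
        return "up"
    if qoq_pct < -flat_band:
        return "down"
    return "flat"


def accuracy(hits: list[bool]) -> float:
    """Share of matching cases, in percent."""
    return 100.0 * sum(hits) / len(hits) if hits else 0.0


cases = [
    Case("A", 0.20, "up", 3.1),
    Case("B", 0.45, "down", -2.0),
    Case("B", 0.70, "up", -1.2),   # both the grade and the direction are wrong here
    Case("C", 0.90, "flat", 0.1),
]

grade_acc = accuracy([c.predicted_grade == expected_grade(c.saturation) for c in cases])
direction_acc = accuracy([c.predicted_direction == actual_direction(c.qoq_pct) for c in cases])
print(f"grade accuracy: {grade_acc:.1f}%, direction accuracy: {direction_acc:.1f}%")
```

Both scores depend only on stored raw inputs and fixed rules, so the same cache contents always give the same result, unlike a judge model.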

Results (n=8~11)

Agent	v6	v7	Change
synthesis	100%	97.7%	Stabilized as n increased
competitor_intel	100%	100%	→
demographic_depth	83.3%	100%	↑ +16.7%p
market_analyst	50%	87.5%	↑ +37.5%p ⭐
trend_forecaster	66.7%	81.8%	↑ +15.1%p
population_analyst	66.7%	58.3%	↓ -8.4%p
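The percentage-point changes and the 87.55% six-agent average can be re-derived from the table above. The snippet below is only an arithmetic check on the reported figures; the dictionary layout is an assumption for illustration and is not taken from run_all_agents_v7.py.

```python
# (v6 %, v7 %) per agent, copied from the results table above.
results = {
    "synthesis": (100.0, 97.7),
    "competitor_intel": (100.0, 100.0),
    "demographic_depth": (83.3, 100.0),
    "market_analyst": (50.0, 87.5),
    "trend_forecaster": (66.7, 81.8),
    "population_analyst": (66.7, 58.3),
}

# Change in percentage points (v7 minus v6), rounded to one decimal place.
deltas = {name: round(v7 - v6, 1) for name, (v6, v7) in results.items()}

# Unweighted mean of the v7 scores across the six measured agents.
v7_mean = sum(v7 for _, v7 in results.values()) / len(results)

for name, delta in deltas.items():
    print(f"{name:20s} {delta:+.1f}%p")
print(f"v7 mean: {v7_mean:.2f}%")
```

The mean is unweighted, so agents evaluated on n=8 count the same as those evaluated on n=11.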

Output files

backend/src/evaluation/: 7 evaluators (rule engine / direct match)
backend/scripts/eval/seed_eval_cache.py: automated batch simulation (8 cases)
backend/scripts/eval/run_all_agents_v7.py: combined run + comparison report
docs/team/agent-accuracy-v6-vs-v7.md: evaluation document for the presentation

Test plan

Evaluation framework passes lint (ruff check/format)
All 7 v7 evaluators import successfully
seed_eval_cache: all 8 cases succeed (100%)
run_all_agents_v7 dumps results correctly
Reviewer review

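Background: The IndustryClosureTrendCard on the market analysis tab ("Same-industry closure rate trend, 8 quarters") and the ClosureRatePanel on the financial simulation tab ("Average past closure rate, 4 quarters") have ambiguous labels, so the two cards risk being mistaken for the same data. Their unit, period, and filtering all differ:
· A (Market): store_quarterly DB · filtered by dong + industry · 8 quarterly values
· B (Financial): closure_rate.monthly_closure_rates · all industries in the dong combined · 4 quarters

Option 1, stronger labels:
· IndustryClosureTrendCard
  - title prefix: "{dong} · {industry} closure rate trend"
  - subtitle: "8 quarters, measured"
  - source footnote: "store_quarterly DB (quarterly, filtered by industry)"
· ClosureRatePanel
  - title: "{district} dong-wide closure rate (4 quarters)"
  - source footnote: "Dong-wide, 4 quarters measured; may differ from the per-industry 8-quarter figures"

Option 4, ℹ️ tooltip:
· lucide Info icon in both card headers + group-hover absolute tooltip
· On hover, explains how the card differs from the other panel (unit / period / filtering)
· z-20 + backdrop blur + 256px width

Call site (MarketTab):
· Passes analysisDong (the top-ranked spot dong takes priority) + simResult.business_type
· ci.meta is not in the frontend type definitions, so business_type is extracted through a SimulationOutput cast

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>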

After the v6 LLM-as-judge evaluation was found to produce false positives (for example, a MAPE of 0.1% for market_analyst), the evaluation was redesigned, following text analysis, into measurable methods per agent type.

Evaluation method changes (v6 → v7):
- market_analyst: LLM-judge → grade classification accuracy (rule-engine thresholds)
- demographic_depth: judge → direct age match (top entry of top_3_age_groups)
- synthesis: judge → quantitative consistency rules (legal preserved, net_profit, grade-recommendation contradiction, winner)
- trend_forecaster: 6m future → QoQ direction match
- population: judge-weighted → direct match on age, gender, and peak
- competitor_intel: unchanged (signal rule engine)
- legal: excluded (separate RAG benchmark)

Cache schema enrichment (raw data stored alongside the output):
- population_node: raw_metrics (age/gender/time distribution), prefix v1→v2
- market_analyst_node: raw_inputs (qoq/saturation/competitor_count), prefix v1→v2
- trend_forecaster: uses the existing dong_trend.slope_pct (loader fix)

Outputs:
- backend/scripts/eval/seed_eval_cache.py: automated batch simulation (8 cases)
- backend/scripts/eval/run_all_agents_v7.py: combined run + v6/v7 comparison report
- docs/team/agent-accuracy-v6-vs-v7.md: evaluation document for the presentation

Final results (n=8~11): 6-agent average 87.55%
- market_analyst 50%→87.5% (+37.5%p)
- demographic_depth 83%→100% (+16.7%p)
- trend_forecaster 67%→82% (+15.1%p)
- synthesis 100%→97.7% (stabilized as n increased)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
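The cache schema change in the commit message above (raw data stored next to the agent output, with the key prefix bumped from v1 to v2) could look roughly like the sketch below. Only the names population_node and raw_metrics and the age/gender/time distributions come from the PR. The key layout, the helper functions, and the sample values are hypothetical.

```python
import json

# Assumed: the version is part of the key, so v1 entries (no raw data) are never read back.
CACHE_VERSION = "v2"


def cache_key(node: str, dong: str, business_type: str) -> str:
    """Build a versioned cache key (the layout is hypothetical)."""
    return f"{node}:{CACHE_VERSION}:{dong}:{business_type}"


def build_population_entry(summary: str, age: dict, gender: dict, time_slot: dict) -> dict:
    """Store the raw distributions next to the LLM summary so an evaluator can score against them."""
    return {
        "summary": summary,
        "raw_metrics": {"age": age, "gender": gender, "time": time_slot},
    }


def top_group(distribution: dict) -> str:
    """Label with the largest share in a distribution."""
    return max(distribution, key=distribution.get)


entry = build_population_entry(
    summary="Mostly visitors in their 30s, busiest around lunch.",
    age={"20s": 0.25, "30s": 0.40, "40s": 0.35},
    gender={"female": 0.55, "male": 0.45},
    time_slot={"11-14": 0.5, "17-21": 0.3, "other": 0.2},
)
key = cache_key("population_node", "example-dong", "cafe")

# Round-trip through JSON, as a cache backend typically would.
restored = json.loads(json.dumps(entry))

# Ground truth for the direct-match evaluation: the top age group, gender, and peak time slot.
truth = {name: top_group(dist) for name, dist in restored["raw_metrics"].items()}
print(key, truth)
```

With the raw distributions in the cache, an evaluator can derive the ground truth itself instead of asking a judge model to assess the summary text.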

# Conflicts:
#	backend/src/evaluation/demographic_depth_eval.py
#	backend/src/evaluation/population_eval.py

yezin013 and others added 4 commits May 6, 2026 20:25

Merge remote-tracking branch 'origin/dev' into IM3-263-ai-summary-layout

527fb17

Merge remote-tracking branch 'origin/dev' into IM3-263-ai-summary-layout

a79b48b

github-actions Bot changed the title ~~feat(evaluation): v7 redesign of accuracy evaluation for the 7 LLM agents + cache schema enrichment~~ IM3-263: feat(evaluation): v7 redesign of accuracy evaluation for the 7 LLM agents + cache schema enrichment May 7, 2026

yezin013 merged commit cafa6ef into dev May 7, 2026
1 check passed

yezin013 merged 4 commits into dev from IM3-263-ai-summary-layout
