
Ner evaluation proposal #124

Draft

puja-trivedi wants to merge 14 commits into main from ner_evaluation_proposal

Conversation

@puja-trivedi
Contributor

No description provided.


@gemini-code-assist (Bot) left a comment


Code Review

This pull request adds a comprehensive design document for evaluating neuroscientific NER systems, comparing the multi-agent StructSense pipeline against direct API calls. The document outlines system architectures, label space alignment strategies, and a multi-layered evaluation framework. Feedback suggests improving the clarity of mapping definitions by labeling them as data structures and considering external configuration files for better scalability.

Comment thread docs/design_docs/ner_evaluation.md Outdated
…nd Layer 1 metrics

- direct_api.py: single LiteLLM call for free-form neuroscientific NER (no fixed label schema)
- layer1_metrics.py: Layer 1A/1B span and label comparison using ner_eval canonicalization
  and entity filtering; label agreement via _canonicalize_label, with no schema-collapse step
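The direct_api.py single-call approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the prompt wording, the `extract_entities` signature, and the default model name are all assumptions; only the general shape (one LiteLLM `completion` call, free-form LLM-assigned labels, JSON output) comes from the commit message.

```python
import json

# Hypothetical prompt: asks for free-form labels rather than a fixed schema.
PROMPT = (
    "Extract every neuroscientific named entity from the text below. "
    "Assign each entity the label you judge most appropriate (no fixed "
    'schema). Return only a JSON array: [{"entity": "...", "label": "...", '
    '"sentence": "..."}].'
)

def extract_entities(text: str, model: str = "gpt-4o-mini",
                     max_tokens: int = 16384) -> list[dict]:
    # Deferred import so the sketch loads even without litellm installed.
    from litellm import completion
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{text}"}],
        max_tokens=max_tokens,
    )
    return json.loads(resp.choices[0].message.content)
```

The `max_tokens=16384` default mirrors the value a later commit in this PR adopts; everything else is illustrative.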
…nonicalization approach

- direct API call no longer uses the fixed 8-category schema; labels are LLM-assigned
- replace Mapping A / Mapping B schema collapse with _canonicalize_label() from ner_eval.py
- replace Entity Schema section with Direct API Label Approach description
- replace Ontology Collapse Mapping section with Label Canonicalization section
- update Layer 1A/1B strategy, metrics, and implementation notes accordingly
- update Key Risks: collapse mapping error → canonicalization gaps; schema drift → prompt drift
puja-trivedi force-pushed the ner_evaluation_proposal branch from ddd5b14 to 497fb65 (May 5, 2026 18:19)
…-parse recovery

- Increase default max_tokens from 4096 to 16384
- Fix CompletionTokensDetailsWrapper not being JSON-serializable in usage dict
- Warn when completion_tokens hits max_tokens (response likely truncated)
- Add regex-based fallback in _parse_entities to recover complete entity objects
  from truncated responses instead of returning an empty list
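The regex-based recovery idea can be sketched as below. The function name and the assumption that entities are flat JSON objects with an `"entity"` field are hypothetical; the point is only that complete `{...}` objects can be salvaged from an array whose closing bracket was cut off by the token limit.

```python
import json
import re

def recover_entities(raw: str) -> list[dict]:
    # Hypothetical fallback for truncated responses: scan for complete,
    # non-nested {...} objects and keep those that parse as valid JSON
    # entity records, instead of discarding the whole response.
    entities = []
    for match in re.finditer(r"\{[^{}]*\}", raw):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if "entity" in obj:
            entities.append(obj)
    return entities
```

A half-emitted final object simply fails to match (no closing brace) and is dropped, so truncation costs at most one entity rather than the whole list.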
- Add latent-circuit evaluation outputs (direct API cache + Layer 1 metrics)
- Update direct_api.py prompt to request sentence field per entity
- Add normalize_sentence() to strip StructSense chunk-prefix artifacts
- Add sentence_fingerprint() (first 8 words) as a stable position proxy
- Rewrite _build_entity_map to key on (normalized_entity, sentence_fingerprint)
  instead of normalized_entity string alone
- Expand StructSense occurrences so each mention in a different sentence gets
  its own key, preserving duplicate mentions across sentences
- Update disagreement records to include sentence_fingerprint field
- Update run_evaluation.py to display sentence fingerprint in disagreement output
- Update latent-circuit outputs with sentence-grounded results
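The sentence-fingerprint idea above can be sketched as follows. The "first 8 words" rule is stated in the commit message; the function signatures and the exact normalization (lowercasing, whitespace collapsing) are assumptions.

```python
def sentence_fingerprint(sentence: str, n_words: int = 8) -> str:
    # First 8 words of the sentence act as a stable position proxy:
    # robust to trailing truncation or whitespace differences between
    # the StructSense and direct-API renderings of the same sentence.
    return " ".join(sentence.lower().split()[:n_words])

def entity_key(entity: str, sentence: str) -> tuple[str, str]:
    # Hypothetical key builder: keying the entity map on
    # (normalized_entity, sentence_fingerprint) keeps duplicate mentions
    # of the same entity in different sentences as distinct entries.
    return (entity.strip().lower(), sentence_fingerprint(sentence))
```

Keying on the pair rather than the entity string alone is what lets each mention in a different sentence get its own slot, as the commit message describes.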
Exposes the direct API max output tokens as a CLI argument so it can be
tuned per-model without code changes. Defaults to 16384 for backwards
compatibility.
- Add --chunk / --chunk-size CLI flags to run_evaluation.py
- Implement extract_entities_chunked() in direct_api.py using StructSense's
  existing sentence-boundary chunker (_chunk_doc_by_sentences) and offset
  globalizer (_globalize_entities) from src/utils/text_chunking.py
- Sentence context on chunked entities is drawn from the full document via
  spaCy, matching the same text source StructSense uses
- Falls back to paragraph-boundary splitting if spaCy is unavailable
- Add _chunk_text_by_paragraphs() as the fallback implementation
- Install en_core_web_sm (spaCy 3.8.0) for sentence-boundary splitting
- Update latent-circuit outputs with chunked gpt-5.4-mini results (844 entities,
  Jaccard 0.084 vs 0.023 single-call — ~4x improvement in recall)
- Add _resolve_output_dir() and _model_slug() helpers
- Auto-default api_cache and output_path to files inside the resolved dir
- Directory is created on demand with mkdir(parents=True, exist_ok=True)
- --api-cache and --output flags now act as overrides rather than requirements
- Results and API cache are always saved (no longer conditional on flag presence)
- Update module docstring to document the new output structure
Outputs are now organized as:
  outputs/<paper_id>/<model>/chunking/
  outputs/<paper_id>/<model>/no_chunking/

Determined automatically from the --chunk flag.
Output files now named <paper_id>_<model>_<chunk_label>_<type>.json
so they are self-contained when shared outside the directory structure.

Results for latent-circuit / gpt-5.4-mini:
  chunking:    670 API entities, Jaccard 0.079 (108 shared spans)
  no_chunking: 158 API entities, Jaccard 0.000 (sentence mismatch without globalization)
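The Jaccard figures above are presumably computed over the sets of shared span keys; a minimal sketch of that metric, under the assumption that both systems' entities are reduced to comparable keys first:

```python
def span_jaccard(a: set, b: set) -> float:
    # Jaccard index over two sets of span keys (e.g. (entity, fingerprint)
    # tuples): |A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

On this definition, the reported "Jaccard 0.079 (108 shared spans)" would mean 108 keys in the intersection against roughly 1,370 in the union.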
puja-trivedi marked this pull request as draft (May 13, 2026 21:39)
