NER evaluation proposal #124
Draft
puja-trivedi wants to merge 14 commits into
Code Review
This pull request adds a comprehensive design document for evaluating neuroscientific NER systems, comparing the multi-agent StructSense pipeline against direct API calls. The document outlines system architectures, label space alignment strategies, and a multi-layered evaluation framework. Feedback suggests improving the clarity of mapping definitions by labeling them as data structures and considering external configuration files for better scalability.
…nd Layer 1 metrics
- direct_api.py: single LiteLLM call for free-form neuroscientific NER (no fixed label schema)
- layer1_metrics.py: Layer 1A/1B span and label comparison using ner_eval canonicalization and entity filtering; label agreement via _canonicalize_label without schema collapse step
…nonicalization approach
- direct API call no longer uses fixed 8-category schema; labels are LLM-assigned
- replace Mapping A / Mapping B schema collapse with _canonicalize_label() from ner_eval.py
- replace Entity Schema section with Direct API Label Approach description
- replace Ontology Collapse Mapping section with Label Canonicalization section
- update Layer 1A/1B strategy, metrics, and implementation notes accordingly
- update Key Risks: collapse mapping error → canonicalization gaps; schema drift → prompt drift
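The label-agreement step described above can be illustrated with a minimal sketch. The `canonicalize_label` function here is a hypothetical stand-in (lowercasing and separator cleanup only); it is not the `_canonicalize_label()` from ner_eval.py, whose actual rules are not shown in this conversation. `label_agreement` is likewise an assumed helper name.

```python
import re


def canonicalize_label(label: str) -> str:
    """Hypothetical stand-in canonicalizer: unify separators and case."""
    # Treat underscores, hyphens and runs of whitespace as a single space.
    return re.sub(r"[\s_\-]+", " ", label).strip().lower()


def label_agreement(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (label_a, label_b) pairs that match after canonicalization.

    Each pair holds the two systems' labels for one shared span.
    Returns 0.0 when there are no shared spans.
    """
    if not pairs:
        return 0.0
    matches = sum(canonicalize_label(a) == canonicalize_label(b) for a, b in pairs)
    return matches / len(pairs)
```

Because the direct API labels are LLM-assigned, agreement computed this way depends heavily on how aggressive the canonicalizer is; a stricter or looser rule set would shift the number.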
ddd5b14 to
497fb65
…-parse recovery
- Increase default max_tokens from 4096 to 16384
- Fix CompletionTokensDetailsWrapper not being JSON-serializable in usage dict
- Warn when completion_tokens hits max_tokens (response likely truncated)
- Add regex-based fallback in _parse_entities to recover complete entity objects from truncated responses instead of returning an empty list
- Add latent-circuit evaluation outputs (direct API cache + Layer 1 metrics)
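As a rough illustration of the regex-based fallback idea, the sketch below tries a normal JSON parse first and, if that fails, salvages every complete flat object it can find. The function name `parse_entities`, the `"entities"` wrapper key, and the `"entity"` field are assumptions for this example; the real `_parse_entities` may differ.

```python
import json
import re

# Matches flat JSON objects only (no nested braces). This is an assumption
# about the entity shape; braces inside string values would defeat it.
_FLAT_OBJECT = re.compile(r"\{[^{}]*\}")


def parse_entities(raw: str) -> list[dict]:
    """Parse an entity list, recovering complete objects from truncated output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None

    # Accept either {"entities": [...]} or a bare list.
    if isinstance(data, dict):
        data = data.get("entities", [])
    if isinstance(data, list):
        return [item for item in data if isinstance(item, dict)]

    # Fallback: keep each object that parses on its own and looks like an entity.
    recovered = []
    for match in _FLAT_OBJECT.finditer(raw):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "entity" in obj:
            recovered.append(obj)
    return recovered
```

With a response cut off mid-object, the trailing partial entity is dropped and the complete ones before it are kept, which is the behaviour the commit describes in place of returning an empty list.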
- Update direct_api.py prompt to request sentence field per entity
- Add normalize_sentence() to strip StructSense chunk-prefix artifacts
- Add sentence_fingerprint() (first 8 words) as a stable position proxy
- Rewrite _build_entity_map to key on (normalized_entity, sentence_fingerprint) instead of normalized_entity string alone
- Expand StructSense occurrences so each mention in a different sentence gets its own key, preserving duplicate mentions across sentences
- Update disagreement records to include sentence_fingerprint field
- Update run_evaluation.py to display sentence fingerprint in disagreement output
- Update latent-circuit outputs with sentence-grounded results
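The sentence-grounded keying can be sketched as follows. This is a simplified illustration, not the PR's code: `normalize_sentence` here only collapses whitespace and lowercases (the StructSense chunk-prefix format is not described in this conversation, so stripping it is omitted), and the `"entity"` and `"sentence"` field names are assumptions.

```python
import re


def normalize_sentence(sentence: str) -> str:
    """Simplified normalizer: collapse whitespace and lowercase.

    The chunk-prefix stripping mentioned in the commit is not reproduced here.
    """
    return re.sub(r"\s+", " ", sentence).strip().lower()


def sentence_fingerprint(sentence: str, n_words: int = 8) -> str:
    """First n_words of the normalized sentence, used as a position proxy."""
    return " ".join(normalize_sentence(sentence).split()[:n_words])


def build_entity_map(entities: list[dict]) -> dict[tuple[str, str], dict]:
    """Key each mention on (normalized entity text, sentence fingerprint)."""
    entity_map: dict[tuple[str, str], dict] = {}
    for ent in entities:
        key = (
            ent["entity"].strip().lower(),
            sentence_fingerprint(ent.get("sentence", "")),
        )
        # Keep the first record for a key; identical repeats are collapsed.
        entity_map.setdefault(key, ent)
    return entity_map
```

Keying on the pair rather than the entity string alone means the same surface form in two different sentences yields two entries, while an exact repeat within one sentence still collapses to one.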
Exposes the direct API max output tokens as a CLI argument so it can be tuned per-model without code changes. Defaults to 16384 for backwards compatibility.
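A minimal argparse sketch of such an option is shown below. The flag name `--max-tokens` and the `build_parser` helper are assumptions for illustration; the commit message does not state the actual flag name used in run_evaluation.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the direct API output-token limit."""
    parser = argparse.ArgumentParser(description="NER evaluation runner (sketch)")
    parser.add_argument(
        "--max-tokens",  # hypothetical flag name
        type=int,
        default=16384,  # default taken from the commit message
        help="Maximum output tokens for the direct API call",
    )
    return parser
```

Calling `build_parser().parse_args(["--max-tokens", "4096"])` yields a namespace whose `max_tokens` is 4096; omitting the flag keeps the 16384 default.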
- Add --chunk / --chunk-size CLI flags to run_evaluation.py
- Implement extract_entities_chunked() in direct_api.py using StructSense's existing sentence-boundary chunker (_chunk_doc_by_sentences) and offset globalizer (_globalize_entities) from src/utils/text_chunking.py
- Sentence context on chunked entities is drawn from the full document via spaCy, matching the same text source StructSense uses
- Falls back to paragraph-boundary splitting if spaCy is unavailable
- Add _chunk_text_by_paragraphs() as the fallback implementation
- Install en_core_web_sm (spaCy 3.8.0) for sentence-boundary splitting
- Update latent-circuit outputs with chunked gpt-5.4-mini results (844 entities, Jaccard 0.084 vs 0.023 single-call; ~4x improvement in recall)
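The paragraph-boundary fallback could look roughly like the sketch below. It is an illustrative guess at the approach, not the PR's `_chunk_text_by_paragraphs()`: the greedy packing strategy, the character-based `chunk_size`, and the `(start_offset, text)` return shape are all assumptions.

```python
import re


def chunk_text_by_paragraphs(text: str, chunk_size: int = 2000) -> list[tuple[int, str]]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Returns (start_offset, chunk_text) pairs so chunk-local entity offsets
    can be mapped back to the full document by adding start_offset.
    A single paragraph longer than chunk_size becomes its own oversized chunk.
    """
    # Collect (start, end) character spans of each paragraph.
    spans = []
    pos = 0
    for sep in re.finditer(r"\n\s*\n", text):
        if sep.start() > pos:
            spans.append((pos, sep.start()))
        pos = sep.end()
    if pos < len(text):
        spans.append((pos, len(text)))

    chunks = []
    cur_start = cur_end = None
    for start, end in spans:
        # Close the current chunk if adding this paragraph would exceed the limit.
        if cur_start is not None and end - cur_start > chunk_size:
            chunks.append((cur_start, text[cur_start:cur_end]))
            cur_start = None
        if cur_start is None:
            cur_start = start
        cur_end = end
    if cur_start is not None:
        chunks.append((cur_start, text[cur_start:cur_end]))
    return chunks
```

Returning the start offset with each chunk is what makes globalization possible: an entity found at local position `i` in a chunk sits at `start_offset + i` in the original document.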
- Add _resolve_output_dir() and _model_slug() helpers
- Auto-default api_cache and output_path to files inside the resolved dir
- Directory is created on demand with mkdir(parents=True, exist_ok=True)
- --api-cache and --output flags now act as overrides rather than requirements
- Results and API cache are always saved (no longer conditional on flag presence)
- Update module docstring to document the new output structure
Outputs are now organized as:
  outputs/<paper_id>/<model>/chunking/
  outputs/<paper_id>/<model>/no_chunking/
Determined automatically from the --chunk flag.
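One way to derive that layout is sketched here. The function names mirror the helpers mentioned in the commits but without the leading underscore, and both the signatures and the slug rule (replace anything outside letters, digits, dot, underscore and hyphen with a hyphen) are assumptions; the PR's actual helpers may behave differently.

```python
import re
from pathlib import Path


def model_slug(model: str) -> str:
    """Hypothetical slug rule: make a model name safe as a directory name."""
    return re.sub(r"[^A-Za-z0-9._-]+", "-", model).strip("-")


def resolve_output_dir(base: Path, paper_id: str, model: str, chunk: bool) -> Path:
    """Return <base>/<paper_id>/<model>/<chunking|no_chunking>, creating it if needed."""
    chunk_label = "chunking" if chunk else "no_chunking"
    out_dir = base / paper_id / model_slug(model) / chunk_label
    # Created on demand; safe to call repeatedly.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

For example, `resolve_output_dir(Path("outputs"), "latent-circuit", "gpt-5.4-mini", chunk=True)` would create and return `outputs/latent-circuit/gpt-5.4-mini/chunking`.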
Output files are now named <paper_id>_<model>_<chunk_label>_<type>.json so they are self-contained when shared outside the directory structure.
Results for latent-circuit / gpt-5.4-mini:
  chunking: 670 API entities, Jaccard 0.079 (108 shared spans)
  no_chunking: 158 API entities, Jaccard 0.000 (sentence mismatch without globalization)
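For readers unfamiliar with the metric, the Jaccard figures are the ratio of shared span keys to all distinct span keys across both systems. A minimal sketch, assuming each span is represented by a hashable key such as an (entity, sentence fingerprint) tuple; the `jaccard` helper name is hypothetical:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

Because the key includes the sentence component, two systems can extract the same entity text and still score 0.000 if their sentence strings never line up, which is consistent with the no_chunking note above.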
No description provided.