Ner evaluation proposal by puja-trivedi · Pull Request #124 · sensein/structsense

puja-trivedi · 2026-05-04T17:04:26Z

No description provided.

gemini-code-assist

Code Review

This pull request adds a comprehensive design document for evaluating neuroscientific NER systems, comparing the multi-agent StructSense pipeline against direct API calls. The document outlines system architectures, label space alignment strategies, and a multi-layered evaluation framework. Feedback suggests improving the clarity of mapping definitions by labeling them as data structures and considering external configuration files for better scalability.

…nd Layer 1 metrics
- direct_api.py: single LiteLLM call for free-form neuroscientific NER (no fixed label schema)
- layer1_metrics.py: Layer 1A/1B span and label comparison using ner_eval canonicalization and entity filtering; label agreement via _canonicalize_label without schema collapse step

…nonicalization approach
- direct API call no longer uses fixed 8-category schema; labels are LLM-assigned
- replace Mapping A / Mapping B schema collapse with _canonicalize_label() from ner_eval.py
- replace Entity Schema section with Direct API Label Approach description
- replace Ontology Collapse Mapping section with Label Canonicalization section
- update Layer 1A/1B strategy, metrics, and implementation notes accordingly
- update Key Risks: collapse mapping error → canonicalization gaps; schema drift → prompt drift

…-parse recovery
- Increase default max_tokens from 4096 to 16384
- Fix CompletionTokensDetailsWrapper not being JSON-serializable in usage dict
- Warn when completion_tokens hits max_tokens (response likely truncated)
- Add regex-based fallback in _parse_entities to recover complete entity objects from truncated responses instead of returning an empty list
- Add latent-circuit evaluation outputs (direct API cache + Layer 1 metrics)
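The regex-based fallback described in this commit can be illustrated with a minimal sketch. This is not the PR's `_parse_entities`; the function name `recover_entities`, the required `entity` key, and the assumption that entity objects are flat (no nested braces, no braces inside string values) are all hypothetical.

```python
import json
import re

# Matches one flat JSON object. ASSUMPTION: entity objects contain no nested
# objects and no literal braces inside string values.
_FLAT_OBJECT = re.compile(r"\{[^{}]*\}")


def recover_entities(raw: str) -> list[dict]:
    """Parse a JSON list of entities, salvaging complete objects if truncated."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, list):
        return [e for e in parsed if isinstance(e, dict)]

    # Fallback: scan for every complete {...} object and keep those that
    # parse on their own and carry the (hypothetical) "entity" key.
    recovered = []
    for match in _FLAT_OBJECT.finditer(raw):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "entity" in obj:
            recovered.append(obj)
    return recovered


# A response cut off in the middle of the third object.
truncated = (
    '[{"entity": "V1", "label": "brain region"}, '
    '{"entity": "GABA", "label": "neurotransmitter"}, '
    '{"entity": "hippo'
)
salvaged = recover_entities(truncated)
```

Here the two complete objects are kept and the cut-off third one is dropped, rather than the whole response collapsing to an empty list.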

- Update direct_api.py prompt to request sentence field per entity
- Add normalize_sentence() to strip StructSense chunk-prefix artifacts
- Add sentence_fingerprint() (first 8 words) as a stable position proxy
- Rewrite _build_entity_map to key on (normalized_entity, sentence_fingerprint) instead of normalized_entity string alone
- Expand StructSense occurrences so each mention in a different sentence gets its own key, preserving duplicate mentions across sentences
- Update disagreement records to include sentence_fingerprint field
- Update run_evaluation.py to display sentence fingerprint in disagreement output
- Update latent-circuit outputs with sentence-grounded results
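The sentence-grounded keying can be sketched as follows. The names mirror the commit message, but the bodies are assumptions: the sketch lowercases and splits on whitespace to get the first 8 words, assumes each entity dict has `entity` and `sentence` fields, and does not model the chunk-prefix stripping done by `normalize_sentence()`.

```python
def normalize_entity(text: str) -> str:
    """Lowercase and collapse whitespace (ASSUMED normalization)."""
    return " ".join(text.lower().split())


def sentence_fingerprint(sentence: str, n_words: int = 8) -> str:
    """First n_words whitespace-separated words, lowercased."""
    return " ".join(sentence.lower().split()[:n_words])


def build_entity_map(entities: list[dict]) -> dict[tuple[str, str], list[dict]]:
    """Group mentions by (normalized_entity, sentence_fingerprint)."""
    entity_map: dict[tuple[str, str], list[dict]] = {}
    for ent in entities:
        key = (
            normalize_entity(ent["entity"]),
            sentence_fingerprint(ent.get("sentence", "")),
        )
        entity_map.setdefault(key, []).append(ent)
    return entity_map


mentions = [
    {"entity": "Prefrontal Cortex",
     "sentence": "The prefrontal cortex integrates sensory evidence over time in this task."},
    {"entity": "prefrontal cortex",
     "sentence": "Lesions of the prefrontal cortex impaired context-dependent choices."},
]
entity_map = build_entity_map(mentions)
```

Keying on the entity string alone would merge these two mentions into one entry; with the fingerprint in the key, each sentence keeps its own entry.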

Exposes the direct API max output tokens as a CLI argument so it can be tuned per-model without code changes. Defaults to 16384 for backwards compatibility.
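A minimal sketch of such a flag using `argparse`. Only the flag name `--max-tokens` and the default of 16384 come from the commit message; `build_parser` and `likely_truncated` are hypothetical helpers, and the truncation check simply restates the "completion_tokens hits max_tokens" warning condition from an earlier commit in this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the --max-tokens flag."""
    parser = argparse.ArgumentParser(description="Layer 1 NER comparison (sketch)")
    parser.add_argument(
        "--max-tokens",
        type=int,
        default=16384,
        help="Maximum output tokens for the direct API call",
    )
    return parser


def likely_truncated(completion_tokens: int, max_tokens: int) -> bool:
    """True when the response used the whole output budget."""
    return completion_tokens >= max_tokens


# argparse stores --max-tokens as args.max_tokens.
args = build_parser().parse_args(["--max-tokens", "8192"])
if likely_truncated(completion_tokens=8192, max_tokens=args.max_tokens):
    print("warning: response likely truncated; consider raising --max-tokens")
```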

- Add --chunk / --chunk-size CLI flags to run_evaluation.py
- Implement extract_entities_chunked() in direct_api.py using StructSense's existing sentence-boundary chunker (_chunk_doc_by_sentences) and offset globalizer (_globalize_entities) from src/utils/text_chunking.py
- Sentence context on chunked entities is drawn from the full document via spaCy, matching the same text source StructSense uses
- Falls back to paragraph-boundary splitting if spaCy is unavailable
- Add _chunk_text_by_paragraphs() as the fallback implementation
- Install en_core_web_sm (spaCy 3.8.0) for sentence-boundary splitting
- Update latent-circuit outputs with chunked gpt-5.4-mini results (844 entities, Jaccard 0.084 vs 0.023 single-call; ~4x improvement in recall)
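The paragraph-boundary fallback might look roughly like the sketch below. It is not the PR's `_chunk_text_by_paragraphs()`: the signature, the choice to measure `chunk_size` in characters, and the `(start_offset, chunk_text)` return shape are all assumptions made for illustration.

```python
import re


def chunk_text_by_paragraphs(text: str, chunk_size: int = 4000) -> list[tuple[int, str]]:
    """Pack consecutive paragraphs into chunks of at most chunk_size characters.

    Returns (start_offset, chunk_text) pairs. A single paragraph longer than
    chunk_size is emitted as its own oversized chunk rather than being split.
    """
    # Locate paragraph spans, treating blank lines as separators.
    spans = []
    pos = 0
    for sep in re.finditer(r"\n\s*\n", text):
        if sep.start() > pos:
            spans.append((pos, sep.start()))
        pos = sep.end()
    if pos < len(text):
        spans.append((pos, len(text)))

    chunks = []
    cur_start = cur_end = None
    for p_start, p_end in spans:
        if cur_start is None:
            cur_start, cur_end = p_start, p_end
        elif p_end - cur_start <= chunk_size:
            cur_end = p_end  # paragraph still fits in the current chunk
        else:
            chunks.append((cur_start, text[cur_start:cur_end]))
            cur_start, cur_end = p_start, p_end
    if cur_start is not None:
        chunks.append((cur_start, text[cur_start:cur_end]))
    return chunks


doc = "aaa\n\nbbb\n\nccc"
chunks = chunk_text_by_paragraphs(doc, chunk_size=8)
```

Returning the start offset with each chunk lets entity offsets found inside a chunk be mapped back to document positions by adding that offset, which is the role the commit assigns to the offset globalizer.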

- Add _resolve_output_dir() and _model_slug() helpers
- Auto-default api_cache and output_path to files inside the resolved dir
- Directory is created on demand with mkdir(parents=True, exist_ok=True)
- --api-cache and --output flags now act as overrides rather than requirements
- Results and API cache are always saved (no longer conditional on flag presence)
- Update module docstring to document the new output structure

Outputs are now organized as:
  outputs/<paper_id>/<model>/chunking/
  outputs/<paper_id>/<model>/no_chunking/
Determined automatically from the --chunk flag.
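A sketch of how this layout could be resolved with `pathlib`. The helper names echo `_resolve_output_dir()` and `_model_slug()` from the PR, but the signatures and the slug rule (replace runs of characters outside `A-Za-z0-9._-` with a hyphen) are assumptions, not the PR's implementation.

```python
import re
from pathlib import Path


def model_slug(model: str) -> str:
    """Make a model name safe as a single path component (ASSUMED rule)."""
    return re.sub(r"[^A-Za-z0-9._-]+", "-", model).strip("-")


def resolve_output_dir(base: str, paper_id: str, model: str, chunk: bool) -> Path:
    """Build <base>/<paper_id>/<model>/<chunking|no_chunking> and create it."""
    chunk_label = "chunking" if chunk else "no_chunking"
    out_dir = Path(base) / paper_id / model_slug(model) / chunk_label
    # Created on demand; no error if the directory already exists.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

For example, `resolve_output_dir("outputs", "latent-circuit", "gpt-5.4-mini", chunk=True)` would create and return `outputs/latent-circuit/gpt-5.4-mini/chunking`.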

Output files now named <paper_id>_<model>_<chunk_label>_<type>.json so they are self-contained when shared outside the directory structure.

Results for latent-circuit / gpt-5.4-mini:
- chunking: 670 API entities, Jaccard 0.079 (108 shared spans)
- no_chunking: 158 API entities, Jaccard 0.000 (sentence mismatch without globalization)
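For reference, Jaccard overlap between two sets of entity keys is the number of shared keys divided by the size of their union. The toy sketch below assumes keys of the form (normalized entity, sentence fingerprint) as described in an earlier commit of this PR; the data is invented for illustration and is not taken from the evaluation outputs.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy keys: (normalized_entity, sentence_fingerprint).
structsense_keys = {
    ("prefrontal cortex", "the prefrontal cortex integrates sensory evidence over time"),
    ("rnn", "we trained an rnn on the context-dependent task"),
}
# Same entities, but the sentence text differs, so no key matches.
api_keys_mismatched = {
    ("prefrontal cortex", "prefrontal cortex integrates evidence"),
    ("rnn", "an rnn was trained"),
}
# One key agrees on both the entity and the sentence.
api_keys_aligned = {
    ("prefrontal cortex", "the prefrontal cortex integrates sensory evidence over time"),
    ("dopamine", "dopamine release was not measured in this study"),
}

score_mismatched = jaccard(structsense_keys, api_keys_mismatched)
score_aligned = jaccard(structsense_keys, api_keys_aligned)
```

Because the sentence fingerprint is part of the key, two systems can extract the same entity strings and still share no keys when their sentence text does not line up, which is consistent with the 0.000 reported for the no_chunking run.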

gemini-code-assist Bot reviewed May 4, 2026

Comment thread docs/design_docs/ner_evaluation.md Outdated

Comment thread docs/design_docs/ner_evaluation.md Outdated

puja-trivedi added 5 commits May 5, 2026 11:13

first commit for ner evaluation proposal

2b22c92

removed description of structsense features due to redundancy

1e7be1a

updated the metrics section and removed layer 3 from evaluation pipeline.

8771b4c

puja-trivedi force-pushed the ner_evaluation_proposal branch from ddd5b14 to 497fb65 Compare May 5, 2026 18:19

puja-trivedi added 9 commits May 5, 2026 13:40

add run_evaluation.py runner script for Layer 1 NER comparison

695ccde

add --max-tokens CLI flag to run_evaluation.py

faafc7b

add chunking/no_chunking subdirectory to output structure

d0b1f83

remove flat output files superseded by new directory structure

0ac7fda

puja-trivedi marked this pull request as draft May 13, 2026 21:39

puja-trivedi wants to merge 14 commits into main from ner_evaluation_proposal
