Skip to content

fix(language-lesson): drop over-broad "is" keyword that mislabels nodes with the type-guards concept#454

Open
tirth8205 wants to merge 2 commits into
Egonex-AI:mainfrom
tirth8205:fix/language-lesson-is-keyword
Open

fix(language-lesson): drop over-broad "is" keyword that mislabels nodes with the type-guards concept#454
tirth8205 wants to merge 2 commits into
Egonex-AI:mainfrom
tirth8205:fix/language-lesson-is-keyword

Conversation

@tirth8205

Copy link
Copy Markdown
Contributor

Problem

  • detectLanguageConcepts matches keywords with text.toLowerCase().includes(keyword.toLowerCase()), i.e. unbounded substring matching. The type guards pattern includes the keyword "is". Because includes does not respect word boundaries, the 2-char substring is appears inside extremely common English words found in node summaries/tags such as "this", "list", "persists", "exists", "analysis", "visible". As a result virtually every node — regardless of whether it has anything to do with TypeScript type guards — gets the type guards concept appended, polluting the detected-concepts list passed into the LLM lesson prompt. Verified by running the exact matching logic: a node summary of "This function adds two numbers", "Persists data to disk", or "Renders a list of items" all return ['type guards']. The maintainers narrowly avoided this in the existing 'returns empty for nodes…

Fix

  • Drop the bare "is" keyword (it is too short/generic for substring matching) and rely on the more specific keywords already present ("type guard", "narrowing", "discriminated union"), e.g. "type guards": ["type guard", "narrowing", "discriminated union"]. (A more thorough fix would switch keyword matching to word-boundary matching, but removing the "is" token is the minimal correctness fix.)

Testing

Adds unit test(s) that fail before the change and pass after. The full core test suite, eslint, and tsc --noEmit all pass locally on this branch.

Found via a static correctness audit of the language-lesson concept detector.

🤖 Generated with Claude Code

…es with the type-guards concept

The "type guards" concept pattern included the bare keyword "is", and
detectLanguageConcepts matches via unbounded substring includes(). The
2-char substring "is" appears inside common English words (this, list,
exists, analysis), so nearly every node was tagged with "type guards",
polluting the concepts passed to the LLM lesson prompt.

Remove the "is" token and rely on the more specific existing keywords
("type guard", "narrowing", "discriminated union").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@thejesh23 thejesh23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. Same substring-matching bug remains in other concept patterns
The root cause is includes-based matching, not the "is" token specifically. decorators still has "@" (matches every JSDoc @param/email in a summary), dependency injection has "di" (matches "audio", "edit", "directory", "modifies"), and both middleware pattern and streams use "pipe". Same pollution risk; worth a follow-up issue at minimum, or just switching to a word-boundary regex helper now.

2. No predicate-style return-type signal replaces "is"
The canonical TS type guard is function foo(x): x is T. Since tags/summary are LLM-generated prose, you're relying on the model to use the words "type guard"/"narrowing" verbatim. If summaries say e.g. "checks whether value is a User", this PR yields zero detection. Consider adding "is " (with trailing space) or a \bx is \w+\b-style check against the node's signature if available.

3. Test coverage thin for the substring class
Only one negative case is added. The PR description cites three ("This function...", "Persists data", "Renders a list of items"); table-driving all three (plus a positive case where a summary genuinely mentions "type guard" / "narrowing") would lock in both the false-positive fix and the absence of precision regression.

The same unbounded substring-match false-positive class that motivated
removing 'is' also affects other short/symbol keywords. Drop '@' from the
decorators pattern (matched any JSDoc @param/@returns or email in a summary)
and 'di' from dependency injection (matched audio/edit/directory/modifies/
loading/reading). Both concepts still detect via their specific keywords
(decorator/annotation; inject/provider/container). Document the substring
over-match limitation above BASE_CONCEPT_PATTERNS for the remaining 'pipe'
case and a future word-boundary matcher.

Expand detectLanguageConcepts coverage: table-drive the 'is'-substring
negatives ('This function...', 'Persists data...', 'Renders a list...'),
add a positive type-guard case, and lock in the '@'/'di' removals with
both a prose-substring negative and a specific-keyword positive each.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tirth8205

Copy link
Copy Markdown
Contributor Author

1. Substring matching in other concept patterns — Fixed the two clearest offenders that share the exact false-positive class. Removed "@" from decorators (it matched any JSDoc @param/@returns fragment or an email in a summary) and "di" from dependency injection (matched "audio", "edit", "directory", "modifies", "loading", "reading"). Both concepts still detect via their specific keywords (decorator/annotation; inject/provider/container), which the new tests assert. I added a doc comment above BASE_CONCEPT_PATTERNS recording that matching is unbounded substring-based, that short/symbol keywords over-match, and that "pipe" (still matches "pipeline") and a proper word-boundary matcher are a tracked follow-up. I deliberately did not do the broad regex rewrite in this PR: word boundaries interact oddly with the legitimate symbol/compound keywords ("@", "try/catch", "async/await") and changing the matcher could shift detection for those, so that's better as its own change.

2. Predicate-style x is T return-type signal — Not implemented, and I'd push back on the two concrete suggestions:

  • Re-adding "is " (even with a trailing space) reintroduces the exact bug this PR fixes — common prose like "this is", "data is persisted", "value is null", "which is returned" would all match and precision regresses immediately.
  • The \bx is \w+\b check "against the node's signature" isn't implementable here: GraphNode has no signature field, and detectLanguageConcepts builds its text only from node.tags, node.summary, and node.languageNotes — all LLM-generated prose, never a real type signature. There's nothing to regex against.

Detection here intentionally relies on the LLM-authored summary/tags using terms like "type guard"/"narrowing", and the downstream prompt also asks the model to surface patterns it wasn't pre-fed, so missed pre-detection is handled gracefully. I've kept this as a known recall limitation rather than re-introducing a regression.

3. Test coverage — Table-driven all three motivating negatives ("This function adds two numbers", "Persists data to disk", "Renders a list of items"), each asserting type guards is not detected. Added a positive case ("Type guard that narrows the value to a User") proving the fix didn't over-remove and the specific keywords still detect. Also added paired negative/positive cases for the @ and di removals (a @param/@returns summary and a "reads and modifies a directory while loading audio files" summary no longer false-positive; a real decorator/DI summary still detects).

Full core suite: 698 passed. tsc --noEmit clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants