RegExp: negated shorthand escapes (\D \W \S) inside a [...] class still diverge from JS (deferred from #693) #749

Description

@nickna

Context

Follow-up to #693, which added RewriteEcmaScriptShorthands (a JS→.NET pattern rewrite shared by interpreter SharpTSRegExp and emitted $RegExp) to pin the shorthand class escapes to the JS sets. #693 deliberately scoped the rewrite to:

  • Positive shorthands (\d \w \s) — expanded everywhere (inside and outside [...]).
  • Negated shorthands (\D \W \S) — expanded only outside a class. Inside a class they pass through to .NET unchanged (ExpandShorthand returns null for D/W/S when inClass).

This issue tracks the remaining inside-a-class negated divergence.

Repro (both interpreter and compiled modes)

/[\W]/.test("İ")        // İ (U+0130): JS: true,  SharpTS: false  (.NET \w over-matches U+0130, so [\W] under-matches it)
/[\S]/.test("\u00A0")   // NBSP: JS: false, SharpTS: true  (.NET narrow \s, so [\S] over-matches every Unicode WhiteSpace)
/[\S]/.test("\u3000")   // IDEO space: JS: false, SharpTS: true
/[\S]/.test("\u2028")   // LS: JS: false, SharpTS: true

The positive in-class forms are already correct after #693:

/[\w]/.test("İ")   // JS: false, SharpTS: false  ✓
/[\s]/.test("\u00A0")   // NBSP: JS: true,  SharpTS: true   ✓
/[^\w]/.test("İ")  // JS: true,  SharpTS: true   ✓  (negated *class* with a positive shorthand inside — handled)

So the gap is strictly: a negated shorthand (\D/\W/\S) appearing inside a [...] character class. (\D is mostly harmless since .NET ECMAScript \d is already [0-9]; \W and \S carry the real divergence.)
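
The JS-side expectations can be checked in any spec-compliant engine (Node.js is assumed here). This is only a sketch of the reference behavior, not part of the SharpTS test suite; the `cases` table and its field names are made up for illustration. Escapes are used instead of literal characters so the invisible code points survive copy/paste.

```javascript
// Expected results under ECMAScript semantics (no `i` or `u` flag).
// The `cases` table is illustrative, not taken from the SharpTS repo.
const cases = [
  // Negated shorthand inside a class: the divergent forms.
  { re: /[\W]/,  input: "\u0130", expected: true  }, // İ is outside [A-Za-z0-9_]
  { re: /[\S]/,  input: "\u00A0", expected: false }, // NBSP is JS whitespace
  { re: /[\S]/,  input: "\u3000", expected: false }, // ideographic space
  { re: /[\S]/,  input: "\u2028", expected: false }, // line separator
  // Positive shorthand inside a class: reported correct after #693.
  { re: /[\w]/,  input: "\u0130", expected: false },
  { re: /[\s]/,  input: "\u00A0", expected: true  },
  { re: /[^\w]/, input: "\u0130", expected: true  },
];

// One boolean per case, in table order.
const results = cases.map((c) => c.re.test(c.input));

cases.forEach((c, i) => {
  const status = results[i] === c.expected ? "ok" : "MISMATCH";
  console.log(`${c.re} on U+${c.input.charCodeAt(0).toString(16).toUpperCase().padStart(4, "0")}: ${results[i]} (${status})`);
});
```

Running the same table in both SharpTS modes and comparing against these values would show which rows still diverge.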

Why it was deferred

A negated shorthand inside a class can't be expressed as a plain character set without engine-level class nesting or union-of-negation, which .NET's regex syntax doesn't offer in general. .NET does support class subtraction ([base-[subtract]]), so the sole-element case is tractable, e.g. [\S] → [^<wsSet>], [\W] → [^A-Za-z0-9_]. The union case is not clean: [a\S] = a ∪ \S, and since \S already includes a, that collapses to \S ("anything but whitespace"), but a general [X\S], where X may itself contain whitespace members, has no simple rewrite. A correct general fix needs either careful case analysis (sole-element vs union, and interaction with other class members, ranges, and ^ negation) or a different matching strategy.
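
The set identities above can be checked by brute force on the JS side. The sketch below assumes that `WS_SET` (a hypothetical stand-in for the issue's `<wsSet>`) lists exactly the code points a current ECMAScript engine matches with `\s`. It compares regexes over every UTF-16 code unit, without the `u` or `i` flags.

```javascript
// Assumption: WS_SET spells out the ECMAScript WhiteSpace and LineTerminator
// code points matched by \s; it stands in for the issue's <wsSet>.
const WS_SET =
  "\\t\\n\\v\\f\\r \\u00A0\\u1680\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000\\uFEFF";

// Returns true when both regexes agree on every single UTF-16 code unit.
function sameOnAllCodeUnits(a, b) {
  for (let cu = 0; cu <= 0xffff; cu++) {
    const ch = String.fromCharCode(cu);
    if (a.test(ch) !== b.test(ch)) return false;
  }
  return true;
}

// Sole-element case: the negated shorthand equals a plain negated class.
const soleS = sameOnAllCodeUnits(/[\S]/, new RegExp("[^" + WS_SET + "]"));
const soleW = sameOnAllCodeUnits(/[\W]/, /[^A-Za-z0-9_]/);

// Union with a non-whitespace member collapses to \S ...
const unionCollapses = sameOnAllCodeUnits(/[a\S]/, /\S/);
// ... but a whitespace member (here: tab) makes the union differ from \S.
const unionDiffers = !sameOnAllCodeUnits(/[\t\S]/, /\S/);

console.log({ soleS, soleW, unionCollapses, unionDiffers });
```

The last check is the awkward case: `[\t\S]` is "everything except whitespace other than tab", so any rewrite has to remove members from the whitespace set instead of simply negating it.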

Suggested approach

  1. Handle the common sole-element case in RewriteEcmaScriptShorthands: when a class body is exactly one negated shorthand (optionally with a leading ^), rewrite to the equivalent [^…]/[…]. This alone fixes the [\W]/[\S] repros above.
  2. Leave the union case ([x\S], multiple negated shorthands, ranges mixed in) for a later pass, or document it as a known edge.
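
Step 1 could look roughly like the sketch below. It is written in JavaScript to match the repros; the real rewrite lives in RewriteEcmaScriptShorthands and is not shown in this issue. The names `JS_SETS` and `rewriteSoleNegatedShorthand` are hypothetical, the input is assumed to be the source text of one complete class, and flags such as `i` are ignored.

```javascript
// Hypothetical table of the JS sets, written as character-class bodies.
const JS_SETS = {
  D: "0-9",
  W: "A-Za-z0-9_",
  S: "\\t\\n\\v\\f\\r \\u00A0\\u1680\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000\\uFEFF",
};

// Rewrites a class whose body is exactly one negated shorthand, optionally
// preceded by ^. Returns null for any other shape (unions, ranges, several
// shorthands), so the caller can leave those classes untouched.
function rewriteSoleNegatedShorthand(classSource) {
  const m = /^\[(\^?)\\([DWS])\]$/.exec(classSource);
  if (m === null) return null;
  const outerNegated = m[1] === "^";
  const set = JS_SETS[m[2]];
  // [\S] becomes [^set]; in [^\S] the two negations cancel, giving [set].
  return outerNegated ? "[" + set + "]" : "[^" + set + "]";
}

console.log(rewriteSoleNegatedShorthand("[\\W]"));  // [^A-Za-z0-9_]
console.log(rewriteSoleNegatedShorthand("[^\\D]")); // [0-9]
console.log(rewriteSoleNegatedShorthand("[a\\S]")); // null
```

Returning null for every other shape keeps step 2 explicit: union classes pass through unchanged, as they do today. The output uses only escapes (`\t`, `\uXXXX`, ranges) that the target engine is assumed to accept inside a class.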

Pointers

Acceptance

  • /[\W]/.test("İ") === true; /[\S]/.test("\u00A0") === false in both modes.
  • No regressions in RegExp/ or String/ Test262 baselines.

Metadata

Labels

  • deferred: De-prioritized; not planned for active work (see tracking comment)
  • enhancement: New feature or request
