Phase Q1: migrate __extension__ + attribute-between-specs to grammar #3

Merged: rjrodger merged 18 commits into main from claude/refactor-jsonic-grammar-Up3ZD on May 2, 2026

rjrodger commented May 2, 2026

Q1.1: the __extension__ keyword is now recognised as a leading specifier in
both the STORAGE_PREFIX token set (consumed by simple_declaration's
opening alt) and the storagePrefixSet (used by @looks-simple-decl's
walker), so __extension__ int x; flows through the grammar path.

Q1.2: int __attribute__((unused)) x; (attribute between specs) flows
through the grammar path. skipLeadingAttributes walks attribute bodies
via fetchDeep so the attribute can extend past the dispatch alt's
6-token lookahead window. The walk is bounded to ~64 tokens past the
keyword so it can't grow ctx.t unboundedly. After successful body walk,
the function pre-fetches a few tokens past the closing )) so the
caller's post-attribute lookahead works without falling off the end of
ctx.t.
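
A minimal sketch of the bounded attribute walk described above, assuming a flat token array in place of the parser's `ctx.t` plus `fetchDeep`. The `Tok` shape, the `skipAttribute` name and the `ATTR_WALK_CAP` constant are hypothetical stand-ins, not the plugin's actual `skipLeadingAttributes`:

```typescript
// Hypothetical token shape; the real parser's tokens carry more fields.
interface Tok { name: string }

// Assumed bound, mirroring the "~64 tokens past the keyword" limit.
const ATTR_WALK_CAP = 64

// Returns the index just past the closing `))` of the attribute starting at
// `start`, or -1 if there is no attribute there, the parens do not balance,
// or the body runs past the cap.
function skipAttribute(toks: Tok[], start: number): number {
  const head = toks[start]
  if (!head || head.name !== 'KW___ATTRIBUTE__') return -1
  let depth = 0
  for (let i = start + 1; i < toks.length && i - start <= ATTR_WALK_CAP; i++) {
    const tokName = toks[i].name
    if (tokName === 'PUNC_LPAREN') {
      depth++
    } else if (tokName === 'PUNC_RPAREN') {
      depth--
      if (depth === 0) return i + 1
      if (depth < 0) return -1
    } else if (depth === 0) {
      // The keyword must be followed directly by an opening paren.
      return -1
    }
  }
  return -1
}
```

For `int __attribute__((unused)) x;` the walk starts at the keyword and returns the index of `x`, which is where the caller's post-attribute lookahead resumes.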

Incidental fix: spec_loop had a latent bug where KW_STRUCT /
KW_UNION / KW_ENUM were absorbed as plain tokens whenever they
followed another type qualifier (e.g. volatile struct S0) — the
generic #SIMPLE_TYPE_HEAD alt was tried before the specific
tagged-type alts and won. Reordered so tagged-type heads dispatch into
struct_specifier / enum_specifier regardless of position, matching the
shape that legacy parseDeclarationSpecifiers has always produced. The
csmith fixtures encoded the buggy shape and were regenerated.
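
The ordering problem can be illustrated with a toy first-match dispatcher. This is a sketch only: the `Alt` type, the `dispatch` helper and the contents of the generic head set are assumptions made for illustration, not the grammar's real alt tables:

```typescript
// Hypothetical reduction of an ordered alt list: the first alt whose
// predicate accepts the token wins, so ordering decides the dispatch.
type Alt = { accepts: (tok: string) => boolean; target: string }

// Assumption: the generic head set also admits the tagged-type keywords,
// which is what let it shadow the more specific alts.
const SIMPLE_TYPE_HEAD = ['KW_INT', 'KW_CHAR', 'KW_STRUCT', 'KW_UNION', 'KW_ENUM']

const genericAlt: Alt = {
  accepts: (t) => SIMPLE_TYPE_HEAD.indexOf(t) >= 0,
  target: 'absorb_plain_token',
}
const taggedAlts: Alt[] = [
  { accepts: (t) => t === 'KW_STRUCT' || t === 'KW_UNION', target: 'struct_specifier' },
  { accepts: (t) => t === 'KW_ENUM', target: 'enum_specifier' },
]

function dispatch(alts: Alt[], tok: string): string | undefined {
  for (const alt of alts) {
    if (alt.accepts(tok)) return alt.target
  }
  return undefined
}

// Generic alt first: tagged keywords are absorbed as plain tokens (the bug).
const buggyOrder: Alt[] = [genericAlt].concat(taggedAlts)
// Specific alts first: tagged keywords reach their dedicated rules (the fix).
const fixedOrder: Alt[] = taggedAlts.concat([genericAlt])
```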

Bonus: attribute_item now accepts any KW_* token in the name slot
(e.g. __attribute__((const))), mirroring legacy parseAttributeItem.
A new KW_TOKEN token-set enumerates every C23 + extension keyword.

path-dispatch.tsv rows 82–83 flipped from legacy to grammar.

claude added 18 commits May 2, 2026 10:29
Adds an `initializer` wrapper rule to the grammar that dispatches to
`initializer_list` for `{...}` forms and to `val` for expression
initializers. The `init_declarator`'s `=` alt now goes through this
wrapper, mirroring legacy parseInitializer's shape (an `initializer`
node wrapping either an initializer_list or an expression).

The brace-init blocking gates in `@looks-simple-decl` are relaxed so
pointer-with-brace-init and array-with-brace-init declarations flow
through the grammar path: pointer/array followed by `=` now bails only
when the RHS is a non-brace expression (val still doesn't handle every
C expression form — e.g. chained subscripts `arr[i][j]`). Plain and
designated brace inits both flow through.

The post-`=` `{` peek uses fetchDeep so it sees past the dispatch
alt's 6-token lookahead window for declarations with bracket
postfixes (e.g. `int a[3] = { ... };`).

`@init_declarator-bc` no longer wraps the child node in an
`initializer` (the new wrapper rule already does that). A defensive
re-wrap remains for any legacy fallthrough.

path-dispatch.tsv rows 78–79 flipped from `legacy` to `grammar`.
Latent fix discovered while exploring Q2.2 (standalone tagged-type
definitions through grammar). struct_declaration's bo creates its
own `u.specs` (a `specifier_qualifier_list` per legacy CST shape),
so when @sd-absorb-spec-storage / @sd-absorb-spec-type fire from
struct_declaration's own open alts, specOwner should return the
struct_declaration itself rather than its parent (member_decl_list,
which has no `u.specs` scaffolding).

Without this fix, any path that exercises struct_declaration via
the new grammar — currently `typedef struct vec { int x; } vec_t;`
and similar — would crash with "Cannot read properties of undefined
(reading 'children')" inside pushTokenWithTrivia.

Q2.2 itself (standalone struct/union/enum body declarations) is
deferred: walking past a non-trivial tagged-type body via fetchDeep
exposes a state-leakage bug in struct_specifier/enum_specifier that
breaks subsequent enum-with-`=` parsing. Needs a deeper grammar fix
outside this phase's scope.
`typedef int Arr[10];` and similar array typedefs now flow through
the grammar path. The bracket walker in @looks-simple-decl uses
fetchDeep instead of raw ctx.t access so it can see the post-`]`
terminator (PUNC_SEMI / PUNC_ASSIGN / PUNC_LBRACKET) when the
declaration extends past the dispatch alt's 6-token lookahead window.

Side effect: more array declarations now flow through grammar than
before (any whose terminator landed past the dispatch window).
Grammar's array_postfix rule descends into val for the dimension,
producing a structured `literal_expression` (or arbitrary expression
tree) inside `array_postfix`. Legacy's parseDirectDeclarator absorbs
the brackets opaquely via consumeBalanced. The csmith fixtures encoded
the legacy opaque shape for declarations that previously fell back;
they're regenerated to encode the structured shape.

path-dispatch.tsv row 71 flipped from `legacy` to `grammar`.
Adds a new `extended: boolean` plugin option (default `false`). When
false, the parser handles plain C23 only; preprocessor directives,
GCC keywords (__attribute__, __asm__, __extension__, etc.), MSVC
(__declspec), and Clang nullability annotations either fail to parse
or fall through to the legacy chomp+structure path (which still
covers them today).

Mechanism:
- New `@extended-on` / `@extended-off` / `@ext-and-first-iter`
  condition refs that closure-capture the resolved options.
- Every extension dispatch alt in c-grammar.jsonic now carries
  `c: '@extended-on'`:
    external_declaration: PP_HASH, KW_ASM/__ASM/__ASM__
    simple_declaration: KW___ATTRIBUTE__/__ATTRIBUTE/__DECLSPEC
    spec_loop (open + close): same GCC/MSVC keywords
    statement: KW_ASM/__ASM/__ASM__, PP_HASH
- C23 [[…]] attributes, _Static_assert, static_assert, _Generic,
  typeof, _BitInt, statement-expressions all stay plain.

`grammarRefs` is now built by `makeGrammarRefs(opts)` so the new
condition closures can access the resolved options.
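
A rough sketch of how closure-captured condition refs could look, assuming options are resolved once and each ref is a zero-argument predicate. The `makeGrammarRefsSketch` name and the predicate signature are hypothetical; the real refs presumably receive jsonic's rule and context arguments:

```typescript
// Hypothetical option shape; only `extended` is taken from the text above.
interface COptions { extended: boolean }

type CondRef = () => boolean

interface ExtendedRefs {
  '@extended-on': CondRef
  '@extended-off': CondRef
}

// Resolve the options once, then let each condition close over the result.
function makeGrammarRefsSketch(opts: Partial<COptions>): ExtendedRefs {
  const resolved: COptions = { extended: false, ...opts }
  return {
    '@extended-on': () => resolved.extended,
    '@extended-off': () => !resolved.extended,
  }
}

const plainRefs = makeGrammarRefsSketch({})
const extendedRefs = makeGrammarRefsSketch({ extended: true })
```

Because each parser instance gets its own closures, a plain instance and an extended instance can coexist without sharing state.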

Test wiring:
- test/c.test.ts: shared instance opts in via `{extended: true}` since
  most tests in that file exercise extension constructs. The
  path-dispatch test does the same.
- test/csmith-common.ts: parseCsmithSource opts in (csmith generates
  GCC-flavoured C with #include, __attribute__, etc.).

Plain-C smoke check: a fresh `Jsonic.make().use(C)` parses
`int x;`, `typedef T; int arr[10] = {1,2,3};`, `[[nodiscard]] int f(int);`,
`struct S s;` and `_Static_assert(1, "ok");` cleanly via grammar. It rejects
`__attribute__((unused)) int x;` with a syntax error, while `#include <…>`
falls through to the legacy chomp-and-structure path (until
extension dispatch in legacy is also gated).

This is the foundation for completing the plain-C grammar without
extension shapes complicating the picture. Follow-ups: physical
split into c-grammar-{plain,extended}.jsonic and identify+fix the
remaining plain-C gaps (Q2.2/Q2.4/Q3 territory).
Adds a `pointer_qualifier_loop` sub-rule called from `pointer_list`'s
open after each `*` is absorbed. The sub-rule consumes zero or more
type qualifiers (`const`, `volatile`, `restrict`, `_Atomic`) and
appends each to the parent `pointer_list`'s most recently-pushed
pointer node. Two-step structure (rather than putting qualifiers
directly in `pointer_list.close`) keeps the multi-star recursion
working: `int **pp` still re-enters `pointer_list.open` after the
first `*`, while `int * const p` consumes the qualifier in the
sub-rule and exits cleanly.
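
The two-step structure can be sketched as a pair of nested loops over plain token strings. This is an illustrative model under that assumption, with a hypothetical `parsePointerList` helper; the real rules operate on jsonic rule instances rather than string arrays:

```typescript
interface PointerNode { kind: 'pointer'; qualifiers: string[] }

const TYPE_QUALIFIERS = ['const', 'volatile', 'restrict', '_Atomic']

// Outer loop plays the role of pointer_list (one node per `*`); the inner
// loop plays the role of pointer_qualifier_loop, attaching zero or more
// qualifiers to the most recently pushed pointer node.
function parsePointerList(toks: string[]): { pointers: PointerNode[]; next: number } {
  const pointers: PointerNode[] = []
  let i = 0
  while (i < toks.length && toks[i] === '*') {
    const node: PointerNode = { kind: 'pointer', qualifiers: [] }
    pointers.push(node)
    i++
    while (i < toks.length && TYPE_QUALIFIERS.indexOf(toks[i]) >= 0) {
      node.qualifiers.push(toks[i])
      i++
    }
  }
  return { pointers, next: i }
}
```

With this shape `* * pp` yields two unqualified pointer nodes, while `* const p` yields one node carrying `const`, and in both cases `next` points at the declarator name.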

`@looks-simple-decl`'s pointer-prefix walker is widened to skip
post-`*` qualifiers too so the dispatch alt accepts shapes like
`int * const p;` without bailing to legacy.

CSmith fixtures regenerated to encode the small set of newly-grammar
declarations (a few that previously fell to legacy now produce the
canonical structured shape with the extra qualifier inside the
pointer node).
Adds an `EXTENSION_RULES` constant listing the grammar rule names
that exist only to support compiler extensions:

  - GCC inline assembly: asm_statement, asm_template, asm_section,
    asm_operand, asm_clobber, asm_label_ref
  - Preprocessor (in-body opaque + top-level structured):
    preprocessor_line, preprocessor_directive, define_directive,
    macro_parameter_list, macro_body, undef_directive,
    include_directive, header_form, conditional_directive,
    simple_directive
  - Compiler-specific attribute spec syntax:
    attribute_spec_gcc, attribute_spec_msvc

When `extended: false` (the default) these rule definitions are
stripped from the parsed grammar spec before passing to
jsonic.grammar(). The dispatch alts that would have reached them
were already gated with `c: '@extended-on'`, so deleting the rule
definitions outright is housekeeping that makes plain-C mode
self-evidently free of extension grammar.
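
A minimal sketch of the stripping step, assuming the parsed grammar spec exposes its rules as a plain object keyed by rule name. The `stripExtensionRules` helper and the shortened `EXTENSION_RULES_SUBSET` list are hypothetical; the full list is the one enumerated above:

```typescript
// Subset of the extension-only rule names listed above, for illustration.
const EXTENSION_RULES_SUBSET = [
  'asm_statement',
  'preprocessor_line',
  'attribute_spec_gcc',
  'attribute_spec_msvc',
]

// Returns the rule map unchanged in extended mode; otherwise returns a copy
// without the extension-only rule definitions.
function stripExtensionRules<T>(
  rules: { [ruleName: string]: T },
  extended: boolean
): { [ruleName: string]: T } {
  if (extended) return rules
  const kept: { [ruleName: string]: T } = {}
  for (const ruleName of Object.keys(rules)) {
    if (EXTENSION_RULES_SUBSET.indexOf(ruleName) < 0) {
      kept[ruleName] = rules[ruleName]
    }
  }
  return kept
}
```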

Plain C keeps:
  - C23 attribute_spec_c23 + attribute_item + attribute_argument_list
  - static_assert_declaration (C23 + _Static_assert)
  - statement_expression (GCC, kept in plain by user choice)

Verified: the plain-mode parser rejects `__attribute__((unused)) int x;`
with a syntax error, and __asm__/preprocessor input falls through to
the legacy chomp+structure path (which can be tightened in a
follow-up). The extended-mode parser handles all forms via grammar as
before.
…)[1], (*p)[0])

Bug: any postfix paren-form (call, subscript, grouping) following
another paren-form failed with a syntax error. `a[0]`, `f(0)`,
`(*p)` worked in isolation, but `a[0][1]`, `f(0)[1]`, `a[0](1)`,
`f(0)(1)`, `(*p)[0]` and so on did not — val.close had no alt
matching `[OP]` (an opening paren after the val had already
produced a value).

Fix: add a `chain` alt at the bottom of val.close in vendor/
jsonic-expr/src/expr.ts (replacing the previously commented-out
WWW alt). When val.close sees an opening paren whose paren-form
has `preval.active` and val.node is not undefined:

  s: [OP], b: 1,
  c: pdef.preval.active && r.node !== undefined && allow check,
  p: 'expr',
  u: { paren_preval: true },

The alt uses `p: 'expr'` (push expr as a child) rather than
`r: 'val'` (replace val with a new val instance). The replacement
form is wrong because `ctx.root().node` — which jsonic returns as
the parser result — keeps pointing at the ORIGINAL val. With
`r: 'val'`, the original val is popped before the chained outer
paren-form is built, so the result reflects only the inner
chain step. With `p: 'expr'`, the original val stays alive: each
chained paren-form is built into the same val.node via the
existing makeCloseParen → paren.ac → expr.ac evaluation pipeline,
so the final val.node is the correctly-nested left-associative
result.

`u: { paren_preval: true }` is set so makeCloseParen's
`r.parent.parent.u.paren_preval` check passes for chain steps
where the outer val didn't open with the C-plugin's paren-preval
alt (e.g. `(*p)[0]` where the outer val opened with the bare-OP
alt).
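
The intended result shape, a left-associative nest where each chained paren-form wraps everything built so far, can be sketched independently of the jsonic machinery. The `Expr` and `Step` types and the `chainPostfix` helper are hypothetical illustrations, not the vendored expr plugin's data model:

```typescript
// Hypothetical expression model: a bare name or a postfix form whose
// target is the expression built so far.
type Expr = string | { kind: 'subscript' | 'call'; target: Expr; arg: Expr }
type Step = { kind: 'subscript' | 'call'; arg: Expr }

// Each step wraps the accumulated node, so the same "val" keeps growing
// instead of being replaced by a fresh one.
function chainPostfix(base: Expr, steps: Step[]): Expr {
  let node: Expr = base
  for (const step of steps) {
    node = { kind: step.kind, target: node, arg: step.arg }
  }
  return node
}

// a[0][1] becomes subscript(subscript(a, 0), 1)
const nested = chainPostfix('a', [
  { kind: 'subscript', arg: '0' },
  { kind: 'subscript', arg: '1' },
])
```

A replacement-style build would instead return only the last step applied to a fresh base, which is the truncated result the `r: 'val'` form produced.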

Verified against an extended test set: simple/chained subscript
(a[0], a[0][1], a[0][1][2]), simple/chained calls (f(0), f(0)(1),
f(g(h(x)))), mixed chains (a.b(c)[d], p->arr[0][1], (*pfn)(a, b)),
arithmetic-with-chain (a + b[0][1], arr[i+j][k*2]), sizeof of
chain (sizeof(a[0][1])), address-of/cast/dereference of chain.

Plain-C grammar gain: pointer-with-expression-initializer no
longer needs the legacy fallback gate. `const char *s = "hello";`,
`int *p = NULL;`, `int *p = (int*)0;`, `int *p = &arr[0];`,
`int *p = arr[0][0];`, `int x = a[0][1];` etc. now all flow
through grammar instead of legacy. Plain-C survey: legacy
declarations dropped from 21 to 16. Remaining legacy cases are
struct/union body declarations, function defs/decls with named
params, variadic, _BitInt(N), enum-with-underlying-type — all
unrelated to expression evaluation.

CSmith fixtures regenerated to encode the new (more structured)
shapes for declarations that moved from legacy to grammar.
Three changes that move large categories of declarations off the
legacy chomp+structure path:

1. @looks-simple-decl: switch the PUNC_LPAREN walker to fetchDeep so
   it can see beyond the dispatch lookahead window (max 6 tokens).
   Function declarations with named/typed parameters, multi-param
   forms, and function definitions with non-trivial bodies all now
   pass the validator and route through grammar.

2. parameter_type_list: add `, ...` ellipsis alt for variadic
   function declarations. The @ptl-take-ellipsis action builds a
   `parameter_variadic` CST node, tags ptl.variadic = true, and
   attaches the ptl onto function_postfix (since this alt completes
   the rule without falling through to @ptl-attach-and-end).

3. Vendored expr: add a `.ac` to the ternary rule that evaluates
   the completed `[op_ternary, cond, then, else]` op-array into a
   structured `conditional_expression` CST node and writes it back
   to every rule on the r:-replacement chain (each successive
   ternary instance plus the original val that triggered the
   chain). Without this, ternaries that aren't wrapped in expr
   (e.g. when val.close fires `r: 'ternary'` directly with no
   intermediate operator) leave the result as a raw op-array.
   The walk over r.prev replaces .node on every level so
   ctx.root().node — whichever rule jsonic actually returns —
   reflects the evaluated form.
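
Change 3 can be sketched in two parts: turning the op-array into a structured node, and writing that node back along the replacement chain. All types here (`TernaryOp`, `CstNode`, `RuleLike`) and both helpers are hypothetical simplifications; the field names on the resulting node are assumptions, not the plugin's CST schema:

```typescript
type Operand = string | TernaryOp
type TernaryOp = ['op_ternary', Operand, Operand, Operand]

interface ConditionalExpression {
  kind: 'conditional_expression'
  cond: CstNode
  thenBranch: CstNode
  elseBranch: CstNode
}
type CstNode = string | ConditionalExpression

// Recursively evaluate an op-array into a structured node.
function evalTernary(op: Operand): CstNode {
  if (typeof op === 'string') return op
  return {
    kind: 'conditional_expression',
    cond: evalTernary(op[1]),
    thenBranch: evalTernary(op[2]),
    elseBranch: evalTernary(op[3]),
  }
}

// Minimal stand-in for a rule instance linked to the rule it replaced.
interface RuleLike { node: unknown; prev?: RuleLike }

// Overwrite .node on every rule in the chain so that whichever one the
// parser ends up returning carries the evaluated form.
function writeBack(rule: RuleLike, node: CstNode): void {
  let r: RuleLike | undefined = rule
  while (r) {
    r.node = node
    r = r.prev
  }
}
```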

UNSUPPORTED_BODY_TOKENS: extend with KW_ASM/__ASM/__ASM__ and
PP_HASH so function bodies that contain inline assembly or
preprocessor lines stay on the legacy path. The grammar's
asm_operand is currently opaque token-list; making it produce
structured `.constraint` / `.value` / `.asm_outputs` shapes is a
separate task. Until that's done, gating these bodies preserves
the existing structured asm CST shape that the test suite asserts.

Plain-C survey: legacy declarations dropped from 16 to 9.
Remaining legacy cases are struct/union body declarations (Q2.2
state leak), C23 enum-with-underlying-type, _BitInt(N), string-
literal array init `char buf[] = "hi";` — tracked separately.

CSmith fixtures regenerated to encode the function-definition
shapes that moved from legacy to grammar.

Test suite: 329/331 pass. The two remaining failures are esoteric
plain-C parameter shapes (abstract-declarator function-pointer
parameters and K&R `int f(a, b)` identifier_list) — both require
new grammar rules and are tracked.
…meter declarators

Two new grammar rules and a parameter_declaration overhaul cover the
remaining plain-C parameter shapes that were on legacy.

1. identifier_list: when function_postfix opens `(` and the next
   tokens are `ID , ID ...` or `ID )` with the ID NOT a registered
   typedef, dispatch to a new identifier_list rule that collects the
   comma-separated bare names. This is the K&R-prototype shape
   required by `int f(a, b);`. The bo guard skips reinitialisation on
   r:-recursion so the children carry forward across each comma.

2. parameter_declaration: extend close to handle parenthesised
   abstract / named declarators via a new param_paren_inner sub-
   rule. After type spec(s), `(` opens an inner declarator that
   absorbs `*` (each as a pointer node on the OUTER parameter's
   declarator) plus an optional ID. The outer `)` is consumed by a
   new close alt gated on `@param-paren-pending`. A subsequent `(`
   that matches `@param-need-fn-postfix` (paren-form done, no
   function-postfix yet) routes into function_postfix to cover the
   `(int)` part of `int (*)(int)`. State keys use `param*` prefixes
   to avoid colliding with the same name in init_declarator's
   inherited k state.

Together these enable plain-C-only parsing of:
  int f(a, b);                                       (K&R)
  int f(int (*)(int));                               (abstract fn-ptr)
  int f(int (*name)(int));                           (named fn-ptr param)
  int qsort(void *, size_t, size_t,
            int (*)(const void *, const void *));    (multiple, mixed)
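
The identifier_list dispatch test in item 1 can be sketched as a two-token check against the typedef registry. The `looksLikeIdentifierList` helper, the `Tok` shape, and the `PUNC_COMMA` / `PUNC_RPAREN` token names are assumptions made by analogy with the token names used elsewhere in this PR:

```typescript
// Hypothetical token shape: a token-set name plus the source text.
interface Tok { name: string; src: string }

// `afterParen` holds the tokens that follow the opening `(` of the
// function postfix. The shape qualifies as a K&R identifier list when the
// first token is an ID that is not a registered typedef and the second is
// `,` or `)`.
function looksLikeIdentifierList(afterParen: Tok[], typedefNames: string[]): boolean {
  const first = afterParen[0]
  const second = afterParen[1]
  if (!first || !second) return false
  if (first.name !== 'ID') return false
  if (typedefNames.indexOf(first.src) >= 0) return false
  return second.name === 'PUNC_COMMA' || second.name === 'PUNC_RPAREN'
}
```

The typedef check is what keeps `int f(size_t);` on the parameter_declaration path: `size_t` is an ID followed by `)`, but it names a type rather than a K&R parameter.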

CSmith fixtures regenerated for the longer-parameter-list shapes
that flowed off legacy with the earlier fetchDeep validator
change.

Test suite: 331/331.
@looks-simple-decl now skips `_BitInt` plus its parenthesised
width argument as a single specifier so the dispatcher routes
declarations using the C23 width-parameterised integer type
through the new path. Without parens the validator returns false
so the (malformed) input falls to the legacy structuring path.

c-grammar.jsonic adds a `bit_int_spec` sub-rule dispatched from
spec_loop's `KW__BITINT` alts (open + close). It captures the
keyword + `(` opener, opaque-absorbs the inner tokens, and takes
the closing `)`. The completed bit_int_specifier node is appended
to the parent spec_loop owner's specs.

Note: full plain-C migration of `_BitInt(N) b;` still routes to
legacy because bit_int_spec needs additional plumbing to attach
its node correctly. Leaving the validator-pass + grammar scaffold
in place so subsequent work can complete it.

Test suite: 331/331.
Two bug fixes that unblock most of the remaining legacy fallbacks
plus the supporting grammar and validator changes.

1. fetchDeep — never return NOTOKEN to callers. ctx.t slots that the
   parser cleared after consume-and-shift are filled with a sentinel
   token whose name is ''. Callers using `t?.name === 'PUNC_LBRACE'`
   would treat this as a non-match and walkers would spin in place
   until the safety cap (or, more often, mis-detect end-of-tokens
   and return false). The fix returns undefined when the resolved
   slot is NOTOKEN, and limits both the search and the result
   buffer growth via a FETCH_DEEP_CAP of 256.

2. Control-flow took*/elseSeen state-leak. if_statement, while_
   statement, do_statement, for_statement, switch_statement and
   for_controls all use the SAME generic k keys (tookCond, tookBody,
   tookThen, tookInit, tookIter, etc.). The shallow-copy-on-push
   that jsonic does at every rule transition propagated those flags
   from a parent if/while/do into a nested for_controls and the
   for-cond / for-iter alts then misfired, e.g. for_controls saw
   `tookCond=true` and skipped the for-cond push. New helper
   `clearStmtState(rule)` is invoked from each control-flow rule's
   bo, plus the bo's existing "early return on re-entry" guard now
   tests `prev.name === rule.name` instead of `rule.node.kind ===
   <expected>` (the inherited rule.node from a parent would falsely
   match the kind even on a fresh instance).
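
The fetchDeep fix in item 1 can be sketched as follows, assuming a token buffer that may contain a cleared-slot sentinel and a lexer callback that yields further tokens. The `fetchDeepSketch` signature is hypothetical and much simpler than the real function:

```typescript
interface Tok { name: string }

// Sentinel for slots the parser cleared after consume-and-shift; as in the
// description above, its name is the empty string.
const NOTOKEN: Tok = { name: '' }

// Cap on both how far callers may look and how far the buffer may grow.
const FETCH_DEEP_CAP = 256

// Returns the token at index `i`, lexing more tokens into `buf` if needed.
// Cleared slots and out-of-range requests yield undefined, so a caller's
// `t?.name === ...` check fails cleanly instead of matching a sentinel.
function fetchDeepSketch(buf: Tok[], i: number, lexNext: () => Tok | undefined): Tok | undefined {
  if (i < 0 || i >= FETCH_DEEP_CAP) return undefined
  while (buf.length <= i) {
    const fetched = lexNext()
    if (!fetched) return undefined
    buf.push(fetched)
  }
  const slot = buf[i]
  return slot === NOTOKEN || slot.name === '' ? undefined : slot
}
```

Returning undefined gives walkers one unambiguous "nothing usable here" signal to test for, rather than a token object that silently fails every name comparison.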

simple_declaration's bo gained a similar broad k cleanup so the
tagged-type rules (struct/enum/member_decl_list/enumerator_list)
don't inherit each other's per-instance node references across
consecutive declarations in a translation unit.

Validator (`@looks-simple-decl`) widened with fetchDeep:
  - skipTaggedSpec walks past `{ ... }` bodies via fetchDeep so
    `struct S { int a; };`, `struct { int a; } v;`, `enum E { A };`
    and the C23 `enum E : int { A };` shape all reach the post-spec
    `;` / ID terminator check and route through grammar.
  - Spec-walk loop and post-spec ID/qualifier walk both upgraded
    to fetchDeep with a 256-iteration cap.
  - parameter_declaration's ID alt re-enters via `r:` so a trailing
    array postfix `int main(int argc, char *argv[])` flows through
    grammar.
  - @arr-close attaches array_postfix onto either init_declarator's
    direct_declarator or parameter_declaration's declarator.

C23 [[…]] attribute on enumerator: enumerator's close adds a
PUNC_LBRACKET PUNC_LBRACKET alt that pushes attribute_spec_c23 with
r:'enumerator', and @enumerator-bc attaches the returned attribute
node onto the enumerator (Set-deduped against re-fires).

_BitInt(N): the prior validator+scaffold attempt is reverted to
legacy until the lex-matcher integration is sorted out (the
KW__BITINT PUNC_LPAREN open alt was failing at the lexer level).

Plain-C survey: legacy declarations dropped from 21 to 1 (only
`_BitInt(N)` remains). All struct/union/enum body forms — with or
without declarators, with or without C23 underlying type — now
flow through grammar. Nested control flow (`for (;;) for (;;) {}`,
`if (1) for (;;) ;` etc.) parses cleanly.

CSmith fixtures regenerated for 60+ files where the new grammar
shapes replaced the old legacy shapes. Test suite: 331/331.
The previous attempt registered a sub-rule whose open alt was
'KW__BITINT PUNC_LPAREN' — that two-token open-pattern hit a
lexer-level edge case that left the parser unable to start
parsing on input that begins with `_BitInt`.

This pass uses a different shape:

  - simple_declaration's open dispatches `KW__BITINT b: 1 p:
    'spec_loop'` so `_BitInt` as the head specifier flows into
    spec_loop the same way `int` etc. do.
  - spec_loop (open + close) gains a `KW__BITINT` alt that absorbs
    the keyword onto the parent's u.specs via `@absorb-spec-type`
    and pushes a new `bit_int_paren` sub-rule.
  - bit_int_paren takes `(`, descends into val for the width
    expression, and takes `)`. Each piece is appended directly
    onto the parent owner's u.specs so the keyword + parens + width
    sit as adjacent children of declaration_specifiers, matching
    the legacy CST shape.

Validator: `@looks-simple-decl` now skips `_BitInt(N)` as a single
specifier (similar to how it handles tagged-type specs) so the
dispatch routes the declaration through grammar.

All four `_BitInt` shapes (`_BitInt(8) b;`, `_BitInt(64) b;`,
`unsigned _BitInt(128) x;`, leading + trailing combinations) now
parse via grammar. Plain-C survey: legacy declarations now zero —
all 40 surveyed shapes flow through grammar.

Test suite: 331/331.
@looks-simple-decl's parenthesised-compound-declarator validation
is extended to accept four shapes (previously only the
function-pointer case was accepted):

  - `int (*p)(int);`      function pointer
  - `int (*p)[10];`       pointer to array
  - `int (*arr[3])(int);` array of fn-pointers
  - `int (*matrix)[3][3];` pointer to multi-dim array

Inside the parens we now allow zero-or-more array postfixes after
the inner ID (`*name [N]...`) before the closing `)`. Outside the
parens, the trailing postfix can be either `(...)` (function) or
`[...]` (array), and they can chain.

The grammar already handles all four shapes (init_declarator's
paren_inner_declarator + array_postfix + function_postfix), so no
grammar changes were needed — only the validator gate widening.

Plain-C survey: 44/44 grammar, 0 legacy, 0 failures across the
full surveyed shape set (basic decls, pointers, arrays, complex
compound declarators, struct/union/enum bodies with and without
declarators, C23 enum underlying type, _BitInt, function decls/
defs with all parameter shapes, K&R, abstract function-pointer
parameters, nested control flow, expression initializers).

Test suite: 331/331.
Extends the existing k cleanup at the start of every
simple_declaration to also remove initializer_list / initializer_
item / declarator scaffolding keys (ilNode, ilOpened, iiNode,
hasDesig, tookEq, declarator, directDeclarator, lastPointer) that
the shallow-copy-on-push spreads forward from a previous decl in
the same translation unit.

Without this, e.g. an init_declarator's k.directDeclarator
reference from declaration N-1 would still be visible when
declaration N starts, and pointer/array-postfix actions would
mutate the wrong node.
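
The leak and the cleanup can be modelled with a plain object standing in for a rule's `k` state. This is a sketch: `pushRule` and `clearDeclState` are hypothetical helpers, and only the key names are taken from the list above:

```typescript
type KState = { [key: string]: unknown }

// Model of shallow-copy-on-push: the child starts with every key the
// parent had, including references to nodes from an earlier declaration.
function pushRule(parentK: KState): KState {
  return { ...parentK }
}

// Scaffolding keys named in the commit message above.
const DECL_SCAFFOLD_KEYS = [
  'ilNode', 'ilOpened', 'iiNode', 'hasDesig', 'tookEq',
  'declarator', 'directDeclarator', 'lastPointer',
]

// Remove the per-declaration keys so a fresh declaration cannot see, or
// mutate, nodes that belong to the previous one.
function clearDeclState(k: KState): void {
  for (const key of DECL_SCAFFOLD_KEYS) {
    delete k[key]
  }
}
```

Because the copy is shallow, the inherited `directDeclarator` is the very same object as the previous declaration's, which is why mutations land on the wrong node until the key is cleared.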

Test suite: 331/331.
Inline comment on the lex loop in fetchDeep recording why we don't
also fill NOTOKEN slots in the middle of ctx.t (a previous attempt
to do so passed plain-C tests but blew up memory on csmith's
larger files and broke the parser's consume-and-shift invariant).

The trade-off: a small set of plain-C SEQUENCE cases — where the
second declaration's validator needs to look past tokens that the
parser cleared after consuming the first declaration — fall back
to the legacy chomp path. Concrete examples:

    struct s { int x; };
    enum E : int { A };       <- this one routes via legacy

    int g[2][2] = {{1}};
    int *g2[8] = {0};         <- this one routes via legacy

Both still parse correctly via legacy. Migrating them through
grammar would require a richer NOTOKEN-handling strategy in
fetchDeep that doesn't conflict with the parser's invariant.

Plain-C single-declaration survey unchanged at 44/44 grammar.
Sequence-of-declarations survey: 2 cases use legacy (out of 46).
Test suite: 331/331.
Two cond functions used ctx.t[1] to look ahead one token past
their alt's matched s:. That works only when the second slot is
real — but the parser's consume-and-shift fills mid-buffer slots
with NOTOKEN, and ctx.t[1] then comes back as the empty-named
sentinel and the cond rejects.

The fix: extend the alt's s: pattern to require both tokens. Then
parse_alts force-fetches both via lex.next (filling NOTOKEN slots
in the order the parser expects), the cond inspects rule.o*
(the matched-tokens accessor) instead of ctx.t, and the rule
state stays consistent.

  attribute_item close, namespaced `::` form:
    s: 'PUNC_COLON'  -> s: 'PUNC_COLON PUNC_COLON' b: 1
    cond: ctx.t[1] check -> just the rule.k state gate (the s:
                            pattern itself ensures the second `:`)

  preprocessor_directive open, all five dispatch alts:
    s: 'PP_HASH'  -> s: 'PP_HASH #ANY_C_TOKEN' b: 2
    cond: ctx.t[1].src check -> rule.o1.src check

This pattern — let parse_alts do the fetch via the s:, then read
the matched tokens via rule.o*/c* — replaces fetchDeep-style
ad-hoc lookahead in conds with the standard jsonic mechanism.
The remaining fetchDeep callers (@looks-simple-decl,
isFunctionBodySupported, skipTaggedSpec) need to walk arbitrary
distance past `{...}` and `(...)` bodies, which the multi-token
s: pattern can't easily express; they're documented as the place
where the NOTOKEN-in-place limitation still bites.

Test suite: 331/331.
Adds direct-dispatch alts in external_declaration's open
that route declarations straight to simple_declaration without a
lookahead-validator walk. They fire only on the first iteration
of a fresh ext_decl AND only when extended is false:

  s: '#SIMPLE_TYPE_HEAD'           -> simple_declaration
  s: '#STORAGE_PREFIX'             -> simple_declaration
  s: 'KW__BITINT'                  -> simple_declaration
  s: 'PUNC_LBRACKET PUNC_LBRACKET' -> simple_declaration  (C23 [[…]])

(Leading GCC __attribute__ / MSVC __declspec attributes are
extension-only, so they stay on the wildcard path.)

In extended mode the existing wildcard cascade with
@looks-simple-decl + isFunctionBodySupported is still in charge —
that's important for asm-body / pp-line function definitions,
which depend on the validator's gate routing them to the legacy
structuring path (the grammar's asm_operand is opaque, while
legacy produces structured constraint / value / asm_outputs CST).
asm and preprocessor are extension features, so the gate is only
relevant in extended mode anyway.

Plain-mode benefit: the two sequence cases that previously routed
the second declaration to legacy because fetchDeep can't see past
parser-cleared NOTOKEN slots in ctx.t now flow through grammar:

  struct s { int x; };
  enum E : int { A };       <- second was legacy, now grammar

  int g[2][2] = {{1}};
  int *g2[8] = {0};         <- second was legacy, now grammar

Plain-C survey: 46/46 grammar, 0 legacy across the full surveyed
shape set including consecutive declarations.

Test suite: 331/331.
@rjrodger rjrodger merged commit e780ac3 into main May 2, 2026
3 checks passed