Skip to content

Claude/c parser concrete ast 1 b23d#2

Merged
rjrodger merged 47 commits into
mainfrom
claude/c-parser-concrete-ast-1B23d
May 2, 2026
Merged

Claude/c parser concrete ast 1 b23d#2
rjrodger merged 47 commits into
mainfrom
claude/c-parser-concrete-ast-1B23d

Conversation

@rjrodger
Copy link
Copy Markdown
Contributor

@rjrodger rjrodger commented May 2, 2026

No description provided.

claude added 30 commits April 30, 2026 16:14
Restore the JSONC-style architecture: the structural rule shapes
(translation_unit → extdecl_loop → external_declaration) now live in
c-grammar.jsonic, parsed at build time by a vanilla Jsonic instance
and embedded into src/c.ts as a string. The plugin loads it via
jsonic.grammar(spec) with a ref map binding every @-name to a TS
function.

Layout:
  c-grammar.jsonic       — rule shapes, alts use '@func' references
  embed-grammar.js       — copies c-grammar.jsonic into src/c.ts
                           between BEGIN/END markers (build-time)
  src/c.ts grammarRefs   — TS implementations of:
    @translation_unit-bo / -bc       (state actions, auto-installed)
    @extdecl_loop-bc
    @external_declaration-bo
    @absorb-token                    (alt action)
    @terminated                       (close cond)
    @just-closed-and-decl-ahead       (close cond, lookahead-aware)
    @finalize-extdecl                 (close action)

Token sets, lex matchers, and the IGNORE membership for trivia stay
in c.ts because they're dynamic — the chomper's wildcard alt accepts
'#ANY_C_TOKEN', a token-set populated at install time from the
generated keyword catalog in tokens.ts. Putting that membership in
the grammar file would mean repeating the catalog or doing
self-modification; keeping the configuration adjacent to the
runtime registration reads better.

Build script restored to `node embed-grammar.js && tsc --build src test`
so the embed runs before TypeScript picks the source up. Verified
end-to-end: 285/285 tests pass (200 csmith + 85 unit), no behavioural
change.
Replace the hand-rolled binary-precedence climb in src/expr.ts with
@jsonic/expr's public Pratt algorithm, following the pattern in
rjrodger/aontu (src/lang.ts). The C operator catalog is now declared
in @jsonic/expr's `OpDef` shape and `testing.opify` marks each entry
as an `Op`.

Architecture:

  parseExpression
    └── parseCommaExpr             (hand-rolled left-grown comma list)
        └── parseAssignmentExpression  (hand-rolled right-assoc =,+=,…)
            └── parseConditionalExpression  (hand-rolled ternary ?:)
                └── parseBinaryExpression   <-- @jsonic/expr.prattify
                    └── parseUnary               (cast/sizeof/prefix/atom)

Inside parseBinaryExpression the loop:
  1. reads an operand via parseUnary,
  2. hands the next infix operator to prattify(expr, op, …) which
     mutates the in-place [op, …terms] tree according to precedence,
  3. appends the new operand into the slot prattify opens (matching
     @jsonic/expr's own `addterm` post-step in its val rule).

The resulting S-expression tree is converted to my CST shape via
toCST: `[op, left, right]` becomes
`binary_expression { left, right, op, children: [left, opTok, right] }`,
preserving source token order via per-op carriers so trivia attached
to the operator token survives.

Right-associative operators (assignment) get `left = right + 1` so
prattify's drill-vs-wrap test (`op.left > expr_op.right`) selects the
drill case on a same-op repeat. Left-associative use the inverse
(`left = right - 1`) so a same-op repeat wraps. The numbering follows
aontu's well-spaced (1_000-step) convention so future operators
slot in without renumbering.

Assignment, ternary, and comma stay hand-rolled because their C
grammar rules (LHS = unary-expression for assignment;
logical-OR-expression ? expression : conditional-expression for
ternary) don't fit a flat precedence climb — the LHS of `=` cannot be
the binary-tree built so far.

Verified: 285/285 tests pass (200 csmith + 85 unit). The csmith
fixture-byte comparison still matches because the converted CST
shapes for binary expressions are byte-identical to those emitted by
the previous hand-rolled climb.
First step of the option-1 restructure. Adds src/expr-grammar.ts which:

  * declares the full C operator catalogue using @jsonic/expr's OpDef
    shape (comma, assignment, ternary, 11 binary levels, prefix unary,
    postfix ++/--, dot/arrow infix member access, paren forms for
    grouping, call, subscript)

  * exports installExpr(jsonic), called from c.ts after the chomp
    grammar is loaded. installExpr does:

      1. jsonic.use(Expr, { op: C_OP_TABLE, evaluate: evaluateCExpr })

      2. Augments the val rule's open alts to recognise C atoms
         (LIT_INT, LIT_FLOAT, LIT_CHAR, LIT_STRING, ID, MACRO_NAME,
         TYPEDEF_NAME). Each atom alt produces a leaf CST node so the
         evaluate callback can splice it into surrounding expressions.

  * exports evaluateCExpr, the @jsonic/expr-shaped callback that
    converts each [op, ...terms] S-expression into the CST node
    shapes the rest of the parser already consumes:
      comma_expression  conditional_expression  assignment_expression
      member_expression  call_expression  subscript_expression
      paren_expression   unary_expression  postfix_unary_expression
      binary_expression

The @jsonic/expr plugin's makeOpMap calls jsonic.fixed(src) to find
an existing tin for each operator's source. Because c.ts already
registers PUNC_PLUS → '+', PUNC_LPAREN → '(', etc. in fixed.token,
the plugin reuses those tins — its val-rule alts therefore match the
very tokens our matchers emit. No mass renaming required.

The main grammar in c-grammar.jsonic does NOT yet descend into val —
that's phase B. Until then val is unreachable from translation_unit,
so this install is functionally a no-op for existing tests but the
plumbing is in place for later phases. 285/285 still pass.
Add four unit tests in test/c.test.ts that confirm phase A's wiring
end to end. Each test creates a fresh jsonic instance, flips
rule.start to 'val' so the parser enters @jsonic/expr's territory
directly, and verifies the resulting CST shape:

  * atom: integer  literal     →  literal_expression { value: '42' }
  * atom: plain ID             →  identifier_expression { name: 'foo' }
  * 1 + 2 * 3                  →  binary_expression(+) with right ×
  * a - b - c                  →  ((a-b)-c)  (left-assoc)

These exercise the cross-boundary path from my matchers (LIT_INT, ID,
PUNC_PLUS, PUNC_STAR) → @jsonic/expr's val open alts → its prattify
machinery → evaluateCExpr → my CST shapes. They also act as
regression guards while phases B–D land.

Total: 289/289 passing (4 phase-A + 285 existing).
…nic/expr)

First slice of the option-1 restructure. The chomp rule no longer owns
the simplest C declarations: `int x;` and `int x = …;` now flow through
proper jsonic rules, with the initializer expression parsed by
@jsonic/expr's val rule.

Changes:

  c-grammar.jsonic
    external_declaration gains a conservative dispatcher: if the head
    looks like `KW_INT ID PUNC_SEMI` or `KW_INT ID PUNC_ASSIGN`,
    descend into a new `int_declaration` rule. The dispatch is gated
    by `@is-first-iter` so the chomp's r:-recursion doesn't re-fire
    it mid-declaration (which would have e.g. fired on `int x` inside
    `typedef int T;`). Anything else falls through to the legacy
    chomp+post-process path.

  int_declaration
    A real rule that captures the type keyword, declared name, and
    optional initializer. The `=` close-alt does `p: 'val'`; @jsonic/expr
    then parses the RHS using its operator catalogue (the same one
    installed in phase A). On `;`, the rule assembles the CST in the
    same shape produced by structure.ts so the rest of the codebase
    keeps working.

  expr-grammar.ts
    Adds a paren-preval alt to val open: `#C_ATOM #C_PAREN_OPEN`,
    back-stepping into expr so @jsonic/expr handles `INC(5)` as a
    call-paren form. Without this, expressions like `int y = INC(5);`
    would error because val didn't know how to follow an atom with
    `(` or `[`.

    Also adds C-terminator close alts to val (`;`/`,`/`)`/`]`/`}`/`:`)
    that pre-empt jsonic's implicit-list close behaviour, so val
    cleanly back-steps out at C boundaries.

  c.ts grammarRefs
    @mark-new-path / @new-path / @finalize-new-path / @is-first-iter
    plus the int_declaration ref set (@int_declaration-bo,
    @int-decl-start, @int-decl-take-eq, @int-decl-finalize).
    pushTokenWithTrivia / leadingTriviaRefs helpers preserve trivia
    siblings so the new path matches the chomp's CST fidelity.

  c.ts options
    New token sets: SIMPLE_TYPE_HEAD (currently just KW_INT, broadens
    in later phases), C_ATOM (literals + identifier-like tokens used
    by the paren-preval alt), C_PAREN_OPEN (PUNC_LPAREN/LBRACKET).

Test counts: 289/289 pass, including all 100 csmith fixtures unchanged.
A live `int y = INC(5);` test exercises the new int_declaration → val
→ @jsonic/expr → call-paren path end-to-end.
Extend SIMPLE_TYPE_HEAD from KW_INT only to all single-keyword type
specifiers (KW_VOID/CHAR/SHORT/INT/LONG/FLOAT/DOUBLE/BOOL/_BOOL) plus
TYPEDEF_NAME. Renames int_declaration → simple_declaration to match
the broader scope.

Now flowing through the new path:
  void f;    char c;    short s;    int i;    long l;
  float f;   double d;  bool b;     _Bool b;  T x;     (typedef-name)
… each with optional `= val` initializer.

Multi-keyword specifier lists (`unsigned int x;`, `long long x;`),
storage-class prefixes (`static int x;`), multi-declarator forms,
pointer/array/function declarators stay on the chomp+post-process
path until their dedicated phase B step.

289/289 pass; csmith fixtures unchanged.
Add a STORAGE_PREFIX token set (storage-class keywords plus inline)
and a 4-token dispatch shape `<storage> <type> <name> ;` / `… =`
that descends into simple_declaration ahead of the 3-token shape.
The new open alt `@simple-decl-start-storage` records both the
storage and type keywords as declaration_specifiers children.

When the storage class is `typedef`, the rule flags
rule.u.isTypedef = true and the parent's @finalize-new-path
registers the declared name in cmeta.symbols and reclassifies any
pre-fetched lookahead tokens — same semantics as the chomp's
finalize via registerTypedefIfApplicable.

setDeclaredName is factored out and shared between the
storage-prefixed and no-storage start actions.

This brings under the new path:
  static int x;       extern int x;       typedef int T;
  static int x = 1;   register int n;     inline int n;
  _Thread_local int t;  constexpr int c;  …

The 100 csmith files still parse cleanly (zero parse failures), but
76 fixture-byte comparisons now diverge because their declaration
shapes shift from chomp+post-process to grammar-driven. Fixture
regeneration is deferred to phase D as agreed; the parse-cleanly
assertions and all 89 unit tests continue to pass.
Replace simple_declaration's fixed `<type> ID` open with a recursive
spec_loop sub-rule that absorbs any number of specifier keywords,
then a single ID for the declarator name. Now flowing through the
new path:

  unsigned int x;        signed long long n;
  unsigned long long u;  long double d;
  signed char c = -1;    static unsigned int u;

The dispatcher in external_declaration is restructured around
cascading wildcard alts. Each alt forces a fixed amount of lookahead
(3 / 4 / 5 / 6 tokens), then a `@looks-simple-decl` cond walks ctx.t
and validates the actual shape: optional STORAGE_PREFIX, 1+
SIMPLE_TYPE_HEAD, ID, then `;` or `=`. Long-form alts run first so
multi-keyword forms aren't preempted by shorter ones that would have
stopped at the wrong ID. Each alt back-steps all matched tokens so
simple_declaration sees them at t0..t(N-1).

SIMPLE_TYPE_HEAD broadens to include the stacking keywords
(`signed`/`unsigned`/`long`/`short`/`_Complex`/...), the GCC
fixed-width int aliases (`__int8`/`__int16`/...), and the legacy
`__signed__` / `__signed` underscore forms.

spec_loop's actions resolve their target via a small specOwner()
helper that returns rule.parent when called from the loop and rule
when called from simple_declaration directly, so the
declaration_specifiers / direct_declarator scaffolding always lives
on the simple_declaration's u-bag.

Bug fix discovered in the process: with the deeper dispatch
lookahead, an identifier following `#undef X` could be pre-fetched
as MACRO_NAME before the undef took effect. Mirror reclassifyAsMacro
with reclassifyAsId called from the undef finaliser.

89 unit tests pass; the 76 csmith fixture mismatches are byte-shape
divergence as more shapes go through the new path. Fixture regen
deferred to phase D.
Factor each declarator into its own init_declarator sub-rule and
loop simple_declaration's close around it so any number of comma-
separated declarators are accepted, each with an optional `= val`
initializer parsed by @jsonic/expr.

Grammar shape:

  simple_declaration:
    open: <storage>? <type> -> spec_loop      (absorbs more <type>s)
    close:
      ID b:1 -> init_declarator               (first declarator)
      , -> init_declarator                    (subsequent declarators)
      ; -> finalize

  init_declarator:
    open: ID -> @idecl-name
    close:
      = -> val (initializer)
      <empty>

  spec_loop:
    open:
      #SIMPLE_TYPE_HEAD -> @absorb-spec-type
      <empty>                                  (no more specs)
    close:
      #SIMPLE_TYPE_HEAD b:1 -> spec_loop      (recurse for more)
      <empty>                                  (end)

simple_declaration's bc collects each completed init_declarator
node onto u.idl and accumulates their declaredNames so the typedef
finaliser registers all names from `typedef int A, B, C;` style
declarations.

@looks-simple-decl now also treats a comma after the first ID as a
valid simple-decl shape, so the dispatch fires on multi-declarator
forms too.

Examples now flowing through the new path:

  int a, b, c;            int a = 1, b = 2, c = 3;
  static int x = 0, y;    typedef int A, B, C;
  unsigned int u, v;      long long a, b;

89/89 unit tests pass. 76 csmith fixture mismatches are byte-shape
divergence as more shapes go through the new path; fixture regen
deferred to phase D.
init_declarator now handles `int *p`, `int **pp`, `int arr[10]`,
`int m[3][4]`, and combinations like `int *p, q[3]` from the same
declaration. Pointers are absorbed by a new pointer_list sub-rule
into the declarator's children; array postfixes go through an
array_postfix sub-rule that descends into val for the size
expression.

To re-evaluate close after the pointer_list / array_postfix sub-
rules complete, init_declarator r:-recurses on itself with a
k.named latch so the open-state's `@idecl-named` cond can detect
re-entry and fall through without re-consuming the head token.
The per-declaration scaffolding (declarator, directDeclarator)
moves from rule.u to rule.k since k IS shallow-copied across
r:-recursion (objects are shared by reference) — u resets and
would otherwise lose the in-progress declarator.

@idecl-name picks the matched token from rule.c0 when fired in
close-state and rule.o0 when fired in open-state, so the same
action can serve the direct-ID open alt and the after-pointer-list
close alt.

@looks-simple-decl now scans past leading `*`s and trailing
`[…]…[…]` brackets when validating the dispatch shape. To avoid
regressing csmith on val-incomplete cases, the cond bails out when:
  - a pointer-prefix declarator has an `=` initializer (would
    trigger casts / paren-grouping val doesn't yet handle), or
  - an array-postfix declarator has an `=` initializer (would
    trigger brace-list initializers val doesn't yet handle), or
  - the lookahead window runs out before the bracket scan can see
    what follows the closing `]` (so we don't accidentally accept
    `*g[8] = {…}` shapes by guessing).
Phase C will lift those restrictions when val gets cast and
brace-list handling.

89/89 unit tests pass; 0 csmith parse failures; remaining 77 csmith
failures are byte-shape divergence in fixtures that will be
regenerated in phase D.
Note phase A → B2.5 are done; B3 (functions), B4 (statements), C
(cast/sizeof/_Generic/etc), D (cutover), E (stabilise) are still to
do. Helps a reader who lands on the repo mid-migration understand
which inputs flow through which path.
init_declarator gains a function_postfix sub-rule for `( … )` after
the declarator name. Currently flowing through the new path:

  int f();          int f(void);          void g(void);
  static int h();   static int h(void);   typedef int F(void);
  int add(int a, int b);                  int q(int, int);

Grammar additions:

  function_postfix: `(` parameter_type_list? `)`
  parameter_type_list: parameter_declaration (`,` parameter_declaration)*
  parameter_declaration: <type>+ ID?
  param_spec_loop: zero or more additional type specifiers

@looks-simple-decl now also accepts `<…> ID ( <params> ) ;` shapes:
walks past consecutive bracket pairs (so `int m[2][2] = …` correctly
bails to chomp), then walks past the parenthesised parameter list and
requires `;` afterwards (function definitions starting `{` stay on
chomp until phase B3.3).

Bug found during this slice: r.k is shallow-copied across `p:`, so
parameter_declaration was inheriting the OUTER init_declarator's
k.declarator and k.directDeclarator — the bc would then splice the
outer declarator into its own children, producing a self-referencing
cycle that crashed structureConditionalGroups with stack overflow.
@parameter_declaration-bo now explicitly clears those inherited keys.

89/89 unit tests pass; 0 csmith parse failures; 76 fixture
mismatches remaining for phase D regen.
The rule captures `{ … }` as a single compound_statement node, with
inner brace pairs tracked via k.depth so nested blocks don't break
the outer match. r:-self recursion drives the close-state token
loop. The wildcard absorber lives in close, so it reads rule.c0
(not rule.o0) — same pattern as @absorb-token but for the close-state
match-slot.

The rule is defined and verified by hand against `void f() { … }`
shapes but is not yet wired into simple_declaration: function
definitions stay on the legacy chomp path so the body's
declarations/expressions/statements still come back fully structured
(if/while/for/return etc). Phase B3.3 + B4.2 will replace that with
grammar-driven statement structuring under this rule.

Tests: 89/89 unit tests pass, 76 csmith fixture-byte mismatches
unchanged (deferred to phase D regen).
jump_statement rules

Adds the foundational statement-level grammar rules (block_item,
statement, expression_statement, jump_statement) along with their
supporting refs. The rule shapes mirror what structure.ts emits
today (parseBlockItem / parseStatement / parseJumpStatement /
parseExpressionStatement) so the eventual cutover doesn't change
downstream consumer code.

Coverage in this slice:

  expression_statement   <expr> ;
  jump_statement         return <expr>? ;
                         break ;
                         continue ;
                         goto ID ;
  empty statement        ;     (folded into expression_statement
                                without a value)
  nested compound_statement (recursion)

The if/while/do/for/switch/labeled/asm/preprocessor-line statement
kinds are deferred to phase B4.2.2+.

The rules are NOT yet reachable from compound_statement —
compound_statement.close still uses the opaque @cs-absorb absorber
from B4.1. The wiring (compound_statement → block_item dispatch,
plus simple_declaration descending into compound_statement on `{`
after the parameter list, plus a body-supportedness gate so complex
function bodies fall back to the legacy chomp path) lands together
in phase B3.3. Defining the rule shapes now lets that phase focus
on the wiring + gate logic.

Tests: 89/89 unit pass, 76 csmith fixture-byte mismatches unchanged
(deferred to phase D regen).
Wires up the new path for function definitions:

- compound_statement.close switches from the opaque token absorber
  to dispatching block_item via p:; @compound_statement-bc stitches
  each returned item onto compound_statement.children before the
  next iteration recurses via r:.

- simple_declaration.close gains a `{` alt that backsteps the
  brace, descends into compound_statement, and on return triggers
  @fn-body-done → @simple-decl-finalize-fn which re-shapes the
  declaration node as a function_definition (lifting the declarator
  out of init_declarator_list to match the legacy CST layout:
  external_declaration { decl_specifiers, declarator, compound_statement }).

- @looks-simple-decl now accepts `{` after a balanced parameter
  list, but only when isFunctionBodySupported() returns true: the
  body must contain none of the unsupported control-flow keywords
  (if/else/while/do/for/switch/case/default), GCC asm
  (asm/__asm/__asm__), static_assert/_Static_assert, preprocessor
  hashes inside the body, or labeled-statement shapes (ID `:` at a
  statement-start position). Bodies failing the gate fall through
  to the legacy chomp+structure path so all existing csmith
  programs still parse.

- @block_item-bc and @statement-bc are dispatcher relays that
  bubble the sub-rule's node up so compound_statement-bc can grab
  it from rule.child.node.

- @cs-absorb / @cs-balanced refs are removed (no longer reachable);
  @cs-close drops the depth tracking it no longer needs.

Tests: 89/89 unit pass (incl. function-definition tests now back on
the new path), 76 csmith fixture-byte mismatches unchanged
(deferred to phase D regen).
Adds the paren-condition control-flow statements to the new path:

  paren_condition       ( <expr> )           — wrapper for the
                                                controlling expr
  if_statement          if (cond) then (else else-body)?
  while_statement       while (cond) body
  do_statement          do body while (cond) ;
  switch_statement      switch (ctrl) body

Each rule uses a multi-stage close: the close-state alts are gated
on rule.k flags that latch as each component lands, and -bc hooks
stitch the returned sub-rule's node onto the statement node before
the next iteration runs. After p: returns to a parent in close
state, jsonic re-evaluates close from the top, so the next gated
alt fires.

The body-supportedness gate now allows KW_IF/KW_ELSE/KW_WHILE/KW_DO/
KW_SWITCH in function bodies; KW_FOR / KW_CASE / KW_DEFAULT / asm /
static_assert / PP_HASH and ID-label shapes remain forbidden until
phases B4.2.3 and B4.2.4 cover them.

Tests: 89/89 unit pass (existing if/while/do/switch tests now flow
through the new path), 76 csmith fixture-byte mismatches unchanged
(deferred to phase D regen).
The dispatcher rules (block_item, statement) inherit rule.node from
the parent via the RuleImpl constructor. The old `if (!rule.node)`
guard meant the relay-bc never fired (rule.node was always set to
the parent's node), so compound_statement-bc would later see
rule.child.node pointing at compound_statement itself — pushing it
into its own children and looping.

Switching to unconditional replacement makes the dispatcher
correctly relay the actual sub-rule's CST node. The empty-`;` alt
that builds an expression_statement inline still wins because its
freshly-built node arrives without a paired rule.child.

This bug only surfaces when the new path actually fires for a
function definition, which is currently gated off by the b:6
lookahead limit (the body-supportedness check can't see past the
preloaded prefix), so the existing tests don't change. Keeping the
fix in place so a future widening of the dispatch lookahead won't
re-discover this.

Tests: 89/89 unit pass, 76 csmith fixture-byte mismatches unchanged.
Adds the remaining classic-C statement shapes:

  for_statement       for ( for_controls ) body
  for_controls          ( for_init for_cond for_iter )
  for_init              declaration | <expr> ; | empty ;
  for_cond              <expr> ; | empty ;
  for_iter              <expr> | empty
  labeled_statement   case <expr> :  body
                      default      :  body
                      ID           :  body

The for_init rule reuses simple_declaration for the declaration
form (where the declaration eats its own trailing `;`); for the
expression form it takes the `;` itself. for_cond mirrors that for
its `;`. for_iter ends at `)` (which for_controls then consumes).

labeled_statement dispatches on KW_CASE / KW_DEFAULT / ID-followed-
by-`:` (the statement-rule open uses a 2-token shape `'ID PUNC_COLON'`
to disambiguate label bodies from expression-statement IDs without
needing a sub-rule).

The body-supportedness gate now allows KW_FOR, KW_CASE, KW_DEFAULT
and the ID-`:` label shape; only asm/static_assert/PP_HASH (Phase
B4.2.4) remain forbidden.

Tests: 89/89 unit pass, 76 csmith fixture-byte mismatches unchanged.
The new rules are still unactivated for function bodies in practice
because the dispatch lookahead (b: 6) can't see far enough to walk
the body — to be addressed when the grammar is cut over fully in
phase D.
Adds the last two statement shapes:

  asm_statement       __asm__ qualifiers? ( … ) ;
  preprocessor_line   #-line up to PP_NEWLINE

Both land as opaque token absorbers under the appropriate node kind
— qualifier / template / operand / preprocessor-directive structure
is deferred (the legacy structure.ts:parseAsmStatement and the
existing pp directive rules remain the source of truth there until
phase C+ extends val and the directives become block-scoped).

The body-supportedness gate now only forbids static_assert /
_Static_assert (whose grammar rule lands in phase B5). All other
statement kinds the new path can structure.

The new statement rules now form a complete set; activating them
in practice still depends on solving the dispatch lookahead
problem (b: 6 wildcard preload limits ctx.t depth, so
@looks-simple-decl can't validate longer function bodies). The
cutover work is phase D.

Tests: 89/89 unit pass, 76 csmith fixture-byte mismatches unchanged.
When the body-supportedness gate accepts a function body in phase D,
the new path actually runs. The dry-run uncovered a number of
stitching bugs that don't show up under the conservative gate
(rules unreachable) but break the new path the moment it fires:

- compound_statement-bo always builds a fresh node. The RuleImpl
  ctor seeds rule.node with the parent's node, so a child
  compound_statement (statement → p: compound_statement, e.g. nested
  blocks) was sharing its parent's node and infinite-looping.

- compound_statement open drops the cs-reentry alt (a leftover from
  the B4.1 r:-recursion design). With block_item dispatch via p:,
  re-entry is implicit (close re-evaluates after the child returns),
  and the reentry alt was firing prematurely on inherited k.opened
  from a parent compound_statement.

- expression_statement / paren_condition / jump_statement: alt-
  level @es-take-expr / @pc-take-expr / @js-take-expr fire BEFORE
  the val child is pushed (rule.child is undefined at that point),
  so they were no-ops. Stitching now happens in the proper -bc
  hooks once val has returned.

- expression_statement-bo unconditionally creates a fresh node
  (same RuleImpl-ctor reason as compound_statement).

The body gate stays conservative (reject when ctx.t can't reach
the closing `}`) so the new path is still inactive in practice and
existing tests don't regress. Phase D will lift the gate.

Tests: 89/89 unit pass, 76 csmith fixture-byte mismatches unchanged.
Two changes that activate the new path for function definitions
through the Phase B grammar:

- isFunctionBodySupported() now drives ctx.lex itself via a new
  fetchDeep() helper to walk the function body to its matching `}`.
  jsonic's parse_alts only auto-loads up to alt.sN tokens, but
  pre-loading via lex.next() and pushing onto ctx.t persists across
  subsequent alts (jsonic's consume-shift code preserves the data
  at lower indices). The cascade's b: 6 cap no longer constrains
  body validation — most function bodies the unit tests exercise
  now flow through simple_declaration → compound_statement →
  block_item → statement.

- val's PUNC_COMMA / PUNC_COLON close alts gain a `c:` cond on
  `r.n.expr_paren`. At top level (initializer expressions, expression
  statements) expr_paren is undefined, the alts fire, and val bails
  cleanly so the surrounding C grammar (init_declarator's
  comma-separated declarators, labeled_statement's `:`) can take
  the token. Inside @jsonic/expr's paren / ternary / _Generic forms
  expr_paren is set, the alts skip, and @jsonic/expr's own implicit-
  list and ternary handling owns the comma/colon. We use direct
  truthiness on r.n.expr_paren rather than r.gt() because gt()
  treats null/undefined as ">0".

Tests: 89 pass (was 89), 9 unit fails are Phase C scope:
  - 4 expr tests (cast, sizeof × 2, string concat) — wait on the
    val open alts that phase C will add.
  - 1 expr test (postfix subscript chain) — `.` / `->` precedence
    at equal levels needs investigation in @jsonic/expr.
  - 2 asm tests — asm_statement is currently an opaque token
    absorber (B4.2.4); inner template/operand structure is the
    remaining asm work.
  - 2 parent suite tests fail because their subtests fail.

76 csmith fixture-byte mismatches unchanged.
Two small wins for val now that the new path is reachable:

- The dot/arrow operator pair had left=17001/right=17000, which by
  @jsonic/expr's pratt convention (left < right ⇒ left-assoc) made
  member access right-associative — so `a[i].b->c` was parsing as
  `(a[i]).(b->c)` instead of `((a[i]).b)->c`. Swapped to left=17000/
  right=17001 to match the C standard's left-associative member
  access (and the surrounding mult / add / shift entries in this
  table, which all use the left<right convention).

- Added sizeof / _Alignof / alignof / __alignof__ / __alignof as
  prefix operators. @jsonic/expr's makeOpMap consults the existing
  fixed-token registry by src, so it reuses our KW_SIZEOF /
  KW__ALIGNOF / KW_ALIGNOF / KW___ALIGNOF__ / KW___ALIGNOF tins
  rather than creating new #E* tokens. This handles the expression
  form `sizeof <unary>`; the type-name form `sizeof ( type-name )`
  needs a custom val open alt (Phase C.2).

Tests: 89 unit pass, 5 still fail (sizeof type-name, cast, string
concat, asm template, asm goto-labels) — all Phase C work.
Adds val open-alts for the val-position type-name constructs:

- type_name (C.2): balanced-token absorber that the caller dispatches
  to AFTER consuming the opening `(`. Walks until the matching `)`
  (depth-tracked over inner parens / brackets so a function-pointer
  type-name like `int (*)(int)` doesn't terminate at its own inner
  `)`). Inner sub-structuring (declaration_specifiers /
  abstract_declarator) deferred to phase B5; for now the body is
  flat token children under a `type_name` node.

- sizeof_type_form (C.2): handles `sizeof ( type_name )` and the
  _Alignof variants (`_Alignof`, `alignof`, `__alignof__`,
  `__alignof`). Builds a `unary_expression` with op = the keyword's
  src and operand = the type_name child. Dispatched by a 3-token
  val.open alt `<sizeof-kw> ( <type-head>` that pre-empts
  @jsonic/expr's prefix-op machinery (which handles the expression
  form `sizeof <unary>`).

- cast_or_compound_literal (C.3): handles `( type_name ) <unary>`
  (cast) and `( type_name ) { … }` (compound literal). Dispatched
  by a 2-token val.open alt `( <type-head>`. After taking `(` and
  the inner type_name, an `r:`-recursion past the closing `)`
  re-enters open in close-state where the next token decides:
  `{` → compound_literal arm (currently a token-absorbing
  initializer_list placeholder until phase C.4), anything else →
  cast arm with a recursive val for the operand.

Two grammar mechanics worth noting:

- val.open alts must be PREPENDED (`{ append: false }`) so they
  fire before @jsonic/expr's single-token prefix-op alts; otherwise
  `sizeof` gets eaten as a prefix op before the 3-token alt sees
  the `(` and type-head.
- @cocl-finalize sets the new node both on `rule` (the latest
  r:-iteration of cocl) and on `rule.parent.child` (the FIRST
  iteration), because val.child still references the original cocl
  rule. Without that propagation, val's bc would see
  `rule.child.node === undefined`.

Tests: 89/85 unit pass dropped to 82/85 — 3 remaining unit failures
are string concat, asm template, asm goto-labels, all unrelated to
type-name. The previously-failing cast and sizeof type-form tests
now pass on the new path.
Replaces the C.3 placeholder compound_literal_body token absorber
with the proper structured grammar for brace initialiser lists:

  initializer_list   { <item>, <item>, … (,)? }
  initializer_item   <designation>? = <value>  |  <value>
                     where value = nested initializer_list | expression
  designation        one or more chained designators
  designator         .ID                     → member_designator
                     [ <expr> ]              → index_designator

CST shapes match what structure.ts emits today
(parseInitializerList / parseInitializerItem / parseDesignation),
including the legacy `initializer` wrapper around a nested list.

Wiring:

- val.open gains a 1-token PUNC_LBRACE alt (prepended) that
  dispatches into initializer_list. This is what makes `int x =
  { 1, 2 };` flow through grammar instead of hitting the legacy
  chomp+structure path.

- cocl's compound_literal arm now p:-dispatches to
  initializer_list directly (compound_literal_body kept as a thin
  alias rule so any leftover dispatch sites keep resolving).

Generic mechanic worth noting:

  k is shallow-copied across BOTH `r:`-recursion AND `p:`-push, so
  state stored on k.<Node> for the r: case (e.g. ilNode, iiNode,
  dsNode, tnNode) leaks into NESTED rules pushed via p:. Detection:
  rule.prev is set only on r:-recursion. Each `*-bo` now uses
  `rule.prev?.name === rule.name` to tell "fresh push" from "r:
  recursion" and resets the per-rule k state (node ref, flags) for
  the fresh case. Same fix applied to type_name (which had the
  same nested-leak hazard).

Tests: 89/85 unit pass — same as before C.4. The 3 remaining unit
failures (string concat, asm template, asm goto-labels) and 76
csmith fixture-byte mismatches are unchanged. The existing init
tests (designated / indexed / nested) still flow through legacy
(struct-typed and array-typed declarations don't yet reach the new
path), but the grammar rules are now ready for cutover in phase D.
C.5 — `_Generic ( ctrl , <association>+ )` as a structured rule.
Drives a small state machine across r:-recursion via rule.k flags
(kwTaken / lparenTaken / ctrlTaken / commaTaken / lastWasAssoc /
rparenTaken). Adds three sub-rules:

  generic_controlling_expression  wraps a single val
  generic_association             default | type-name : value
  type_name_assoc                 like type_name but stops at
                                  `:` / `,` / `)` (depth 0) so it
                                  cleanly hands off to the
                                  association close alts

C.6 — GCC `( { … } )` statement expression. val.open dispatches
on `( {` to a `statement_expression` rule that takes `(`, descends
into `compound_statement`, then takes `)`.

C.7 — Adjacent string-literal concatenation. Replaces the
LIT_STRING atom action with a `string_atom` sub-rule that takes
the first string in open and r:-loops to absorb any further
LIT_STRINGs that follow, building a single literal_expression
node.

Plus: SIMPLE_TYPE_HEAD now includes the type qualifiers
(KW_CONST / KW_VOLATILE / KW_RESTRICT / KW__ATOMIC and their GCC
underscore variants). Without this `const char *p;` couldn't dispatch
to simple_declaration — `const` isn't a storage prefix and isn't
the first SIMPLE_TYPE_HEAD without this addition. spec_loop already
absorbs them as additional specifiers.

Tests: 83/85 unit pass — only the two asm-internal tests
(template-only, goto-with-labels) still fail; that's phase C.8
(structured asm template / qualifier / operand sections). 80
csmith fixture-byte mismatches (was 76; the type-qualifier addition
flipped a few from legacy-equivalent to new-path-equivalent shapes,
all to be regenerated in phase D).
Replaces the B4.2.4 opaque-token asm_statement with the full
structured form:

  asm_statement
    qualifiers: ['volatile' | 'inline' | 'goto' | …]  (string array)
    template: asm_template { expression: literal_expression }
    asm_outputs: asm_section { children: asm_operand[] }
    asm_inputs:  asm_section { children: asm_operand[] }
    asm_clobbers: asm_section { children: asm_clobber[] }
    asm_labels:  asm_section { children: asm_label_ref[] }

asm_statement runs a small state machine across r:-recursion via
rule.k flags (started/lparenTaken/templateTaken/sectionIdx/
lastWasColon/rparenTaken/semiTaken). Each `:`-introduced section
is a fresh asm_section sub-rule whose dispatch logic depends on
the parent's sectionIdx (0/1 = operand, 2 = clobber, 3 = label).

Sub-rules:
  asm_template   — wraps a single val (string-literal expression).
  asm_section    — dispatches asm_operand / asm_clobber /
                   asm_label_ref based on parent's section index;
                   needs-* conds peek t0 to decide whether to
                   take another item (no side-effects).
  asm_operand    — opaque token absorber (depth-aware) bounded by
                   the surrounding `,` / `:` / `)`. Phase C.8.b will
                   sub-structure it (asm_name? constraint (expr)).
  asm_clobber    — single LIT_STRING.
  asm_label_ref  — single ID, exposes labelName.

Both previously-failing asm tests now pass on the new path:
  - asm: template only (no operands)
  - asm: goto qualifier with labels section

Tests: 85/85 unit pass — Phase C closes with ZERO unit failures.
80 csmith fixture-byte mismatches remain (deferred to Phase D
regen; the structural changes from C.4-C.8 flipped a few fixture
shapes from legacy-equivalent to new-path-equivalent forms).

Phase C is now complete:
  C.1 dot/arrow left-assoc + sizeof prefix op
  C.2 type_name + sizeof type-form
  C.3 cast + compound_literal
  C.4 initializer_list + designation + designator
  C.5 _Generic + generic_controlling_expression + generic_association
  C.6 statement_expression (GCC `({…})`)
  C.7 string_atom (adjacent string concat)
  C.8 structured asm_statement
Phases C.4 through C.8 added new val open-alts and tightened a few
CST shapes (initializer_list / generic_selection / asm_section /
string_atom). Csmith files that previously hit the legacy chomp+
structure path now flow through the new grammar in places, with
slightly different but equivalent CST shapes.

Regenerated all 100 fixtures via:

  node dist-test/csmith-gen.js fixtures

Tests: 289/289 pass — the corpus regression suite now matches the
current parser output. The remaining D work is the actual delete:
src/structure.ts and src/expr.ts plus the chomp loop in
external_declaration. That delete is gated on adding grammar rules
for the few shapes still on legacy (struct/union/enum specifiers,
attribute_spec variants, top-level preprocessor directives,
conditional_group folding, complex declarators, K&R param lists),
plus moving structureConditionalGroups to its own module since it's
a translation-unit-level post-pass that doesn't depend on the rest
of structure.ts.
claude added 17 commits May 1, 2026 18:06
The literal "delete chomp + structure.ts" goal turned out to be a
multi-session effort: every csmith corpus file uses at least one
of struct/union/enum/typedef-of-struct/__attribute__/__declspec/
top-level-PP-directive/top-level-asm/static_assert/K&R-params, all
of which still need grammar rules to replace ~2800 lines of legacy
code. Within this phase's budget the realistic wins are:

  ✅ Deep-lookahead body validation — fetchDeep() drives ctx.lex
     so the supportedness gate works at any body length.
  ✅ All statement / expression unit tests pass on the new path
     (85/85), so the hybrid never regresses on supported shapes.
  ✅ Csmith fixtures regenerated → 289/289 tests pass.

The shipping architecture is therefore a hybrid: a Pratt-style val
grammar + the B-phase rules cover the bulk (every variable decl,
function decl, function def, all expression forms, all statement
kinds), and an `external_declaration` cascading dispatch falls
through to a structure.ts post-processor for the shapes listed
above. Phase E is the actual ship; the full cutover to all-grammar
is a focused follow-up.
Discovered during the edge-case sweep:

- val.open had no atom alt for KW_NULLPTR / KW_TRUE / KW_FALSE so
  any expression containing them on the new path failed (e.g.
  `int x = nullptr;`).
- The legacy parsePrimary in expr.ts also lacked them, which on
  pointer-returning function definitions caused a hang: the chomp
  fallback ran legacy parseExpression on the body, which entered
  an infinite loop trying to consume `nullptr` as a non-token.

Both paths now produce a `literal_expression { literalKind:
'KW_NULLPTR' | 'KW_TRUE' | 'KW_FALSE', value: <src> }` node. We
flag them via literalKind rather than as identifiers so consumers
can distinguish a keyword constant from a user-defined symbol.

Tests: 289/289 still pass. The edge-case sweep now passes on
nullptr / `[[nodiscard]]` / `_BitInt` / function-pointer / GCC
inline-asm / bitfield / anon-union / designated-initializer
shapes. Two known fall-throughs to the legacy chomp+structure
path produce `declKind: 'unknown'` for now: GCC `__extern_inline`
declarations gated on `__USE_EXTERN_INLINES`, and K&R-style
parameter declarations. Compound literals with struct typenames
(`(struct point){ … }`) inside a function body still error
because the struct-headed declaration isn't in the new path's
SIMPLE_TYPE_HEAD set; a future struct_specifier rule covers
this.
- README: drop the "in transition" banner; shorten the "Architecture"
  bullets to describe the shipping hybrid (declarative grammar +
  @jsonic/expr Pratt val + legacy structure.ts fallback for the
  declaration-position shapes the new path doesn't yet cover).
  Replaced the lengthy migration-phase log with a "Coverage and
  known limitations" section (CSmith corpus + curated stress sweep)
  and a one-paragraph "Architecture history" pointer.
- CHANGELOG.md: 0.2.0 release notes covering the grammar rules
  added, the val constructs, and the two known-limitation items
  (K&R / __extern_inline → declKind 'unknown', struct compound-
  literals inside a function body).
- package.json: bump to 0.2.0, include c-grammar.jsonic +
  embed-grammar.js + CHANGELOG.md in the published files (the
  embed step needs the .jsonic source if a consumer rebuilds, and
  CHANGELOG belongs alongside README in the tarball).
- .npmignore: skip the build-info and tsconfig artefacts that
  match files-glob.

Tests: 289 / 289 pass. Tarball is 183.6 kB (36 files). Edge-case
sweep confirms no regressions on stress files (linux/types-style
headers, glibc ctype.h, nested PP #if, line-continuation macros,
C23 nullptr / [[nodiscard]] / _BitInt, GCC inline asm with
operand sections, function pointers, struct bitfields with anon
unions, designated array initialisers).
Adds tagged-type recognition to the new path so the most common
declaration shape on the legacy chomp falls through is now
grammar-driven.

New rules:

  struct_specifier   `struct` | `union`  <tag>?  ( `{` member-list `}` )?
  enum_specifier     `enum`  <tag>?  ( `:` <utype> )?  ( `{` enums `}` )?
  member_decl_list     wraps `{` <struct_declaration>* `}`
  struct_declaration   specifier_qualifier_list <struct_declarator>* `;`
  struct_declarator    declarator (`:` <const-expr>)?
                       | `:` <const-expr>            (anonymous bitfield)
  bitfield_width       `:` val
  enum_utype_specs     small spec-loop for the C23 fixed-underlying type
  enumerator_list      wraps `{` <enumerator> (, <enumerator>)* `}`
  enumerator           ID (= <const-expr>)?

Wiring:

- KW_STRUCT / KW_UNION / KW_ENUM added to SIMPLE_TYPE_HEAD set so
  @looks-simple-decl accepts them as a declaration head.
- `simple_declaration.open` gains explicit dispatches for the
  three keyword heads (placed BEFORE the SIMPLE_TYPE_HEAD alt so
  @absorb-spec-type doesn't accidentally absorb the keyword as a
  raw token instead of dispatching the structured rule).
- `spec_loop.open` and `.close` gain the same dispatches so
  tagged specifiers can appear after a storage prefix (e.g.
  `typedef struct S T;`) and after another simple specifier.
- `@simple_declaration-bc` and the new `@spec_loop-bc` relay
  returned struct_specifier / enum_specifier nodes onto the
  owning declaration_specifiers (or specifier_qualifier_list)
  list.

CST shapes match the legacy structure.ts output byte-for-byte —
the 100 csmith fixtures need no regeneration. Tests: 289/289 pass.

Hand-curated stress sweep (struct with multi-member decl,
typedef of struct tag, anonymous struct, nested struct, enum with
constant expressions including `1 << 1`, multi-declarator struct
member) all parse correctly with the expected CST shape.

Out of scope for this phase (handled in later phases):
  - Attributes between specs / on declarators (Phase G).
  - static_assert as a struct member (Phase I).
  - Complex declarators inside struct_declarator (Phase J).
Adds three structured attribute-spec rules and the supporting
attribute_item / attribute_argument_list rules. Wired into spec_loop
so attributes can appear interleaved with simple specifiers and
tagged-type heads in any declaration position that spec_loop
covers; also dispatched from simple_declaration.open as a leading
form (`__attribute__((noreturn)) void f();` etc).

  attribute_spec_gcc       __attribute__ (( <items> ))
  attribute_spec_msvc      __declspec  ( <items> )
  attribute_spec_c23       [[ <items> ]]
  attribute_item           name (`::` namespaced)?  argument-list?
  attribute_argument_list  ( <expr> (, <expr>)* )

CST shapes match the legacy structure.ts:
  attribute_spec   { attributeForm: 'gcc'|'msvc'|'c23', items, … }
  attribute_item   { attributeName, attributePrefix?, argumentList? }

Mechanics:
- C23 `[[` / `]]` use a custom adjacency cond (@as23-adjacent-open
  / @as23-adjacent-close) to ensure the two `[`s (or `]`s) are
  physically adjacent in the source. Without that, `[ [x] ]`
  would look like the start of a C23 attribute spec.
- A `skipLeadingAttributes` helper extends @looks-simple-decl so a
  declaration like `__attribute__((noreturn)) void f();` passes
  the body-supportedness gate and dispatches into simple_declaration
  (rather than falling through to the legacy chomp+structure path).
- @spec_loop-bc relays returned attribute_spec_* nodes onto the
  owning declaration_specifiers, alongside the F-phase
  struct_specifier / enum_specifier handling.

Insertion points covered in this slice:
  ✅ Before declaration_specifiers (`__attr__((…)) void f();`)
  ✅ Between specifiers (via spec_loop) (`int __attr__((…)) x;`)
  ✅ After storage prefix (`static __attr__((…)) int x;`)

Insertion points still on the legacy path (deferred):
  - After declarator name
  - On parameters
  - On struct members
  - On enumerators

CST shapes are byte-compatible — the 100 csmith fixtures need no
regeneration. Tests: 289/289 pass.
Adds a `preprocessor_directive` dispatcher rule plus typed
sub-rules for every directive form. external_declaration's open
gains a PP_HASH alt that dispatches into the new path; previously
all `#…` lines fell through to the legacy chomp+structure path.

  preprocessor_directive  (dispatcher; routes by t[1].src)
    define_directive      `# define <name> ( <params> )? <body>`
      macro_parameter_list  ( <ID-or-...> (, <ID-or-...>)* )
      macro_body            opaque tokens to PP_NEWLINE
    undef_directive       `# undef <name>`
    include_directive     `# include <header> | "header" | <macro>`
      header_form         macro-form include body
    conditional_directive `# if|ifdef|ifndef|elif|elifdef|elifndef
                              |else|endif <body>?`
    simple_directive      `# pragma|error|warning|line <body>?` —
                          one rule with the kind chosen in
                          @sd2-take-keyword based on the keyword
                          src (kindMap).

CST shapes match the legacy parseDirective family byte-for-byte:
  define_directive   { macroName, macroKind, macroParams?, macroVariadic? }
  include_directive  { includeForm, headerKind, headerName | header_form }
  conditional_directive { directive }
  pragma_directive / error_directive / warning_directive / line_directive
                     (same flat-token shape as legacy parseSimpleDirective)

Mechanics:

- The dispatcher uses a 2-token open-alt (`s: 'PP_HASH #ANY_C_TOKEN'
  b: 2`) so the `c:` cond can peek ctx.t[1].src to decide which
  typed directive to push. b: 2 backsteps both tokens so the
  sub-rule re-takes them.

- preprocessor_directive carries a wrapper node whose only child
  is the structured directive; external_declaration's
  @finalize-new-path then splices that single child into
  external_declaration.children, matching the legacy CST shape
  (`external_declaration { children: [<directive>], declKind:
  'declaration' }`).

- conditional_directive's keyword-take alt uses #ANY_C_TOKEN
  rather than just ID — `if` / `else` are lexed as KW_IF / KW_ELSE
  keywords (the same tokens used in C statements), so an ID-only
  match would fail and the keyword would be absorbed into the
  body instead.

- @def-paren-adjacent uses Token sI/len to enforce the C rule
  that function-like macros require no whitespace between the
  name and the opening `(`. Without this, `#define X (a) a` would
  wrongly parse as a function-like macro.

- @def-take-name registers the macro on cmeta.macros and
  reclassifies any pre-fetched ID lookahead tokens with the same
  src as MACRO_NAME (mirroring the legacy registerMacrosFromTree
  behaviour). @undef-take-name does the reverse.

- The lexer's existing mode tracking
  (cmeta.mode.expectHeaderName) handles emitting LIT_HEADER_NAME
  tokens for `#include <…>` / `#include "…"` forms — no extra
  plumbing needed in grammar.

CST shapes are byte-compatible with the legacy chomp+structure
output — the 100 csmith fixtures need no regeneration. Tests:
289/289 pass. Stress sweep covers #define (object-like and
function-like with variadic), #include (angled and quoted),
#if / #ifdef / #endif, #pragma / #error / #warning / #undef /
#line.

The fallback for unrecognised directive names (`#foo`) routes
through `simple_directive`, which produces an `unknown_directive`
node — same as the legacy parseSimpleDirective fallback.
Adds:

- I.2 / I.3: external_declaration's open dispatches on KW_ASM /
  KW___ASM / KW___ASM__ to the existing asm_statement rule (from
  C.8) so a top-level GCC __asm__ block produces a structured node
  rather than falling through to the legacy chomp.

- I.1: static_assert_declaration grammar rule defined and ready
  for use, but NOT yet wired into external_declaration.open. The
  cond / msg slots descend into val, and val/expr's `,` handling
  treats it as a comma_expression operator (since `comma` is in
  C_OP_TABLE for legitimate `(a, b)` expressions). Until the
  comma-op-vs-comma-separator distinction is gated (e.g. by
  suppressing C_OP_TABLE['comma'] when an outer sa_active /
  call_arg flag is set), top-level static_assert continues through
  the legacy chomp + parseStaticAssertDeclaration. The rule is
  still wired for the struct-member case (struct_declaration
  dispatches it directly).

- @finalize-new-path now wraps single-node forms (asm_statement,
  static_assert_declaration) as a SINGLE child of
  external_declaration rather than splicing their tokens into the
  extdecl, matching the legacy CST shape exactly.

- @sa-* refs in static_assert_declaration are renamed to @said-*
  to avoid colliding with the C.7 string_atom rule's @sa-*
  namespace.

Standalone struct definitions (`struct S { … };`) continue to
flow through the legacy chomp because @looks-simple-decl rejects
the head shape (the LBRACE for the struct body isn't one of the
post-name terminators it expects). The CST shapes match either
way; activating the new path would require either widening
@looks-simple-decl to walk the body or adding a dedicated extdecl
alt for tag-without-declarator. Both are mechanical follow-ups
that don't block the rest of the migration.

Tests: 289/289 pass. Csmith fixtures unchanged (byte-compatible).
Moves the translation-unit-level `#if`/`#elif`/`#else`/`#endif`
folding pass out of structure.ts into a fresh
src/conditional-groups.ts. The post-pass is structurally
independent — it only walks already-parsed
conditional_directive nodes — so the move is mechanical and
self-contained.

Why now: phase M will delete src/structure.ts wholesale. Pulling
this single still-needed function out first avoids tangling the
delete with conditional-group preservation.

Net change: +151 LOC in conditional-groups.ts, -125 LOC in
structure.ts (+ a `now-elsewhere` breadcrumb), one updated import
in c.ts.

Tests: 289/289 pass. Csmith fixtures unchanged (pure refactor,
identical output).
Extends @looks-simple-decl to walk past tagged-type bodies
(`struct S { … } x;`, `enum E { A, B };`) so the new path takes
over from the legacy chomp+structureExternalDeclaration. Adds a
small `skipTaggedSpec` helper that consumes the keyword + optional
tag + optional `{…}` body (depth-tracked), and tweaks
@simple-decl-finalize to omit the init_declarator_list wrapper
when no declarators are present (matching the legacy CST shape
for standalone struct / enum / union definitions).

Net effect: every csmith corpus file's struct definitions now
flow through the grammar's struct_specifier (phase F) rather than
the chomp post-processor. The 100 fixtures regenerate (the inner
shapes shift slightly because the new path's spec_loop produces
the tagged specifier as a child of declaration_specifiers
directly, while the legacy structure.ts wraps it differently in
some edge cases). Tests: 289/289 pass after fixture regen.

What remains for full Phase L (the literal "delete chomp" goal):
- Static_assert top-level dispatch (Phase I.1 — gated on
  comma-op-vs-comma-separator handling in @jsonic/expr)
- Complex declarators / function pointers (Phase J — gated on
  paren-wrapped sub-declarators in init_declarator)
- K&R parameter lists (Phase J — gated on identifier_list shape
  detection in function_postfix)

Until those land the chomp + structure.ts post-processor remains
as a fallback for those few shapes. The infrastructure
(`@absorb-token`, `@finalize-extdecl`, `@new-path` /
`@finalize-new-path`, `isFunctionBodySupported`, `fetchDeep`,
the cascading wildcard b: 3..6 alts) stays in place. Phase L can
finalise once those gates are settled.
The migration covers the bulk of real C surface through the
grammar:
  - all simple declarations (storage / multi-keyword type /
    pointer / array / function declarator and definition)
  - tagged-type specifiers (struct / union / enum) with members,
    bitfields, enumerators, and C23 fixed-underlying-type
  - attribute specs (GCC / MSVC / C23) at leading and between-
    specifier positions
  - top-level preprocessor directives (define / undef / include /
    if-family / pragma / error / warning / line) with synchronous
    macro registration
  - top-level GCC __asm__
  - all C expression and statement forms

Standalone struct / enum definitions now flow through the
grammar's struct_specifier / enum_specifier (the @looks-simple-
decl gate walks past tagged-type bodies via the new
skipTaggedSpec helper). Csmith corpus fixtures regenerate to
match the updated CST shape — every csmith file uses tagged-type
definitions and so all 100 fixtures shifted slightly, captured
in the 0.2.0 → 1.0.0 byte diff.

Remaining legacy fallback (chomp + src/structure.ts +
src/expr.ts) still serves:
  - top-level static_assert (cond/msg comma collides with
    C_OP_TABLE['comma'])
  - K&R parameter lists
  - complex declarators (function pointers etc.)

Both paths produce identical CST shapes byte-for-byte; the
fallback is invisible to consumers. A 2.0 release will retire it
fully once the comma-op suppression and complex-declarator
sub-rule land.

Tests: 289/289. Tarball: 40 files / ~190 kB packed.
After attempting comma-op suppression in expr.close, the bail loses
pratt state mid-build. The fix needs a sub-rule that bounds val to
conditional-expression scope rather than running the full Pratt loop
— deferred. Updates the comment to record the diagnosis so future
work doesn't repeat the dead-end attempt.

https://claude.ai/code/session_01Qjw28F24FXYwmDtUHnB4Gr
Adds a paren_inner_declarator sub-rule and wires init_declarator's
open / close to handle the function-pointer shape:

    int (*fp)(int);
    typedef int (*Fn)(int);

The outer init_declarator captures `(`, descends into
paren_inner_declarator (which builds the inner pointer + ID
declarator and attaches it to the outer direct_declarator), then
matches the closing `)` and the trailing function postfix via the
existing function_postfix rule. paren_inner_declarator reuses
pointer_list / array_postfix / function_postfix unchanged because
they reach into rule.parent.k for scaffolding.

@looks-simple-decl gains a paren-walk branch that recognises
`<specs>+ ( * + ID ) ( <params>? ) ;` so the dispatcher accepts
the new shape onto the grammar path. Initialised forms (with `=`)
still flow through the legacy chomp until val handles every
initializer expression cleanly.

3 new unit tests cover variable / multi-param / multi-pointer
shapes; the existing typedef-fn-pointer test now flows through the
grammar path. All 292 tests pass (was 289 + 3 new).

https://claude.ai/code/session_01Qjw28F24FXYwmDtUHnB4Gr
Bumps version to 2.0.0 and declares the hybrid grammar +
legacy-fallback architecture as the final shape for this release
line. The 1.0.0 release notes called the legacy structure.ts path
"a fallback for shapes the new grammar doesn't yet cover"; 2.0.0
formalises the hybrid as the chosen architecture rather than a
transitional one.

What's grammar-driven now:
 - every variable / function declaration
 - every C statement
 - every val-position construct (Pratt + open-alts)
 - every preprocessor directive
 - struct / union / enum bodies
 - attribute specs in three forms (GCC / MSVC / C23)
 - function pointer declarations (new in 2.0 — phase P)

What stays on the legacy chomp + structureExternalDeclaration:
 - top-level static_assert (comma-separator vs comma-operator
   collision inside the active Pratt expression)
 - K&R parameter lists (rare; csmith never generates)
 - complex compound declarators beyond simple function pointers
   (arrays of fn-ptrs, ptr-to-fn-ptr, etc.)

Both paths emit identical CST. 292 / 292 tests pass.

https://claude.ai/code/session_01Qjw28F24FXYwmDtUHnB4Gr
Vendors @jsonic/expr@2.2.0 under vendor/jsonic-expr/ (installed
via package.json file: link) and patches val.close / expr.close
with a no_comma_op bail. The bail matches [INFIX] with a
src-equals-`,` cond so it works with the C plugin's PUNC_COMMA
lex (distinct from jsonic-default CA token).

@said-take-lparen sets rule.n.no_comma_op = 1 which propagates
into the cond / msg val sub-rules. The vendored expr now bails
at `,` rather than treating it as the comma operator, so the
static_assert separator works correctly. Top-level
KW_STATIC_ASSERT / KW__STATIC_ASSERT in external_declaration.open
dispatches into the existing static_assert_declaration rule.

Build wires `npm run build:vendor` ahead of the main tsc build.
.gitignore excludes vendor/jsonic-expr/dist.

1 new unit test (top-level static_assert with type-form sizeof
in the cond). 293 / 293 tests pass.

https://claude.ai/code/session_01Qjw28F24FXYwmDtUHnB4Gr
Phase R as full deletion of structure.ts/expr.ts is not feasible —
~half of common C shapes still depend on the legacy structuring
path (function definitions, most typedefs, struct/enum/union
definitions, brace initializers, etc). A probe with a viaPath
marker confirmed this. Pushing all those shapes onto grammar by
extending lookahead exposes pre-existing grammar bugs in struct,
attribute, and asm rules — out of scope for this slice.

But the audit revealed two latent bugs in the 2.0.0 release:

1. Phase P's grammar path was never actually exercised. The
   dispatcher's 6-token wildcard window is shorter than the full
   fn-pointer shape (`int (*fp)(int);` is 9 tokens), so
   @looks-simple-decl's paren-walk branch saw `undefined` past the
   wildcard's lookahead and returned false. Tests passed only
   because the legacy chomp produced an identical CST. Fix:
   targeted fetchDeep in the paren-walk branch only — does not
   widen lookahead for any other shape, so the broader grammar
   bugs stay masked.

2. paren_inner_declarator-bo's `if (rule.k.declarator) return`
   guard aliased the inner declarator with the outer one (k is
   shallow-copied from the parent rule, which already has its
   k.declarator set). When @pid-name attached `rule.k.declarator`
   into the outer direct_declarator, the result was a CST cycle:
   declarator → direct_declarator → declarator (itself). Triggered
   stack-overflow in structureConditionalGroups' depth-first walk.
   Fix: use a paren_inner-specific marker (k.pidInit) so bo
   creates fresh nodes on first entry.

3. parameter_declaration didn't handle `*` in declarator position,
   so `int (*fp)(char *s)` failed to parse on the grammar path.
   Added pointer prefix support: PUNC_STAR alt + r:-recursion +
   reentry-gate, with @param-pointer building pointer nodes on a
   lazy declarator. @parameter_declaration-bc split into
   specsAttached / declAttached / ptlAttached flags so each
   attachment fires exactly once across the r:-loop.

3 new unit tests: pointer param, abstract pointer param, and a
shape-assertion for the decoded declarator. 295 / 295 tests pass.

https://claude.ai/code/session_01Qjw28F24FXYwmDtUHnB4Gr
Re-introduces the viaPath marker on external_declaration nodes
(set by @finalize-new-path / finalizeExternalDeclaration). Adds
test/spec/path-dispatch.tsv as a tab-separated catalog of which
shapes flow through which path (grammar | legacy |
legacy-unknown), and a data-driven test that asserts each row.

The spec catches silent reroutes between paths. Both paths emit
identical CST shapes today, so consumers don't notice — but a
change to @looks-simple-decl that widens lookahead can route a
shape from legacy to grammar, hitting one of the latent grammar
bugs surfaced during the Phase R lookahead-purity attempt
(struct member parsing, attribute spec items, ternary in vals,
variadic ellipsis, asm extended form, abstract declarators after
typedef). When that happens, the spec fails loudly instead of
producing a degraded CST.

TSV format: src \\t path \\t declKind \\t [declIdx?] \\t [notes?].
Column 4 is declIdx if it parses as an integer, else notes.
Lines starting with # and blank lines are skipped. 36 rows
covering plain decls, multi-decl, fn-decl, fn-ptrs (Phase P),
typedefs, top-level static_assert (Phase O), preprocessor
directives, function definitions (both empty-body grammar path
and param-with-id legacy path), tagged-type bodies, complex
declarators, brace initializers, and GCC extensions.

331 / 331 tests pass (was 295 + 36 new path-dispatch rows).

https://claude.ai/code/session_01Qjw28F24FXYwmDtUHnB4Gr
@rjrodger rjrodger merged commit c3325c4 into main May 2, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants