Move K&R function definitions onto the grammar path #4

Open
rjrodger wants to merge 9 commits into main from claude/investigate-c-parsing-mode-OITEN

Conversation

Contributor

@rjrodger rjrodger commented May 3, 2026

K&R-style definitions (`int f(a, b) int a; long b; { ... }`) used to
fall through to the legacy chomp, where the chomp's top-level `;`
terminator fragmented them into multiple `declKind: 'unknown'`
external declarations. The validator now accepts the shape, the
grammar dispatches into a new `kr_declaration_list` rule between the
parameter-list `)` and the body `{`, and the result is a single
structured `function_definition` external declaration. The
declaration-list child preserves the source as flat token refs to
match the legacy CST shape.

https://claude.ai/code/session_01DEdkKecwpq59ydTqZ7Aobv

claude added 9 commits May 2, 2026 23:11
K&R-style definitions (`int f(a, b) int a; long b; { ... }`) used to
fall through to the legacy chomp, where the chomp's top-level `;`
terminator fragmented them into multiple `declKind: 'unknown'`
external declarations. The validator now accepts the shape, the
grammar dispatches into a new `kr_declaration_list` rule between the
parameter-list `)` and the body `{`, and the result is a single
structured `function_definition` external declaration. The
declaration-list child preserves the source as flat token refs to
match the legacy CST shape.

https://claude.ai/code/session_01DEdkKecwpq59ydTqZ7Aobv

Phase P landed simple function-pointer declarators on the grammar but
left the complex shapes — arrays of fn-pointers, fn returning ptr-to-
array, nested paren-forms, leading-pointer types with paren-form
declarators — on the legacy chomp + structure.ts path. Worse, the
validator silently accepted `int (*arr[3])(int);` while the grammar
emitted a structurally wrong CST (the inner `[3]` postfix sat as a
sibling of the inner declarator instead of inside it).

This change extends the grammar so all four shapes parse correctly:

  int (*arr[3])(int);     // inner array postfix on inner DD
  int (*get())[10];       // inner function postfix on inner DD
  int (*(*fpp))(int);     // nested paren-form (recursive PID)
  char *(*foo[3])(int);   // leading-pointer-type with paren-form

`paren_inner_declarator` now dispatches `array_postfix` /
`function_postfix` for inner postfixes (they attach to its own
direct_declarator via rule.parent.k.directDeclarator), recurses into
itself for nested paren-forms, and tracks paren-pending state
separately from `init_declarator`'s. `@pid-paren-close` performs the
declarator-attachment that `@pid-name` does for non-nested PIDs.
`init_declarator` close gains a paren-form alt gated on `!named` so
the leading-pointer-type case routes here instead of falling into
function_postfix. The validator factors out a `walkParenFormDeclarator`
helper that recursively validates the new shapes.
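
To illustrate why the inner postfix has to belong to the inner declarator, here is a minimal cdecl-style sketch. It is not the project's grammar: `describe`, its token regex, and the single-word base type are assumptions made for illustration only, and parameter lists containing nested parentheses are not handled. It reads a declarator inside-out, so a postfix that follows the name inside the parentheses binds before the pointer does.

```typescript
// Hypothetical helper: renders a C declaration as an English reading.
// Assumes a single-word base type and no nested parens in parameter lists.
function describe(decl: string): string {
  const toks: string[] = decl.match(/[A-Za-z_]\w*|\d+|\S/g) ?? [];
  let pos = 0;
  const base = toks[pos++] ?? '';

  // declarator := '*'* direct  (pointers bind after the direct declarator)
  function declarator(): string[] {
    let stars = 0;
    while (toks[pos] === '*') { stars++; pos++; }
    const parts = direct();
    for (let i = 0; i < stars; i++) parts.push('pointer to');
    return parts;
  }

  // direct := (IDENT | '(' declarator ')') postfix*
  function direct(): string[] {
    let parts: string[];
    if (toks[pos] === '(') {
      pos++;                 // consume '('
      parts = declarator();  // nested paren-form recurses here
      pos++;                 // consume ')'
    } else {
      parts = [(toks[pos++] ?? '?') + ':'];
    }
    // Postfixes attach to THIS direct declarator, innermost first.
    while (toks[pos] === '[' || toks[pos] === '(') {
      const open = toks[pos++];
      const close = open === '[' ? ']' : ')';
      const inner: string[] = [];
      while (pos < toks.length && toks[pos] !== close) inner.push(toks[pos++] ?? '');
      pos++;                 // consume the closing bracket
      parts.push(open === '['
        ? `array[${inner.join(' ')}] of`
        : `function(${inner.join(' ')}) returning`);
    }
    return parts;
  }

  return [...declarator(), base].join(' ');
}
```

With this sketch, `describe('int (*arr[3])(int)')` yields `arr: array[3] of pointer to function(int) returning int`. If the `[3]` were treated as a sibling of the inner declarator instead, the reading would change to a pointer to a function, which is the kind of structural error the commit describes.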

https://claude.ai/code/session_01DEdkKecwpq59ydTqZ7Aobv

Plain-C was already 100% covered by the grammar path; the chomp +
structureExternalDeclaration safety net was reachable only for malformed
input, and extended-mode (preprocessor, GCC __attribute__ / __asm__,
MSVC __declspec, in-body #-lines, etc.) depended on the legacy
post-process for anything the grammar didn't structure. Per the
upcoming clean-slate rewrite of extended mode, both go.

Removed:
- src/structure.ts (~1975 lines), src/conditional-groups.ts, src/expr.ts
  (the latter two were already dead code in the grammar path)
- The chomp wildcard alt + finalize-extdecl close alts in
  external_declaration; the wildcard cascade and @looks-simple-decl
  validator
- All extension grammar rules: preprocessor_directive / define_directive /
  undef_directive / include_directive / conditional_directive /
  simple_directive / macro_parameter_list / macro_body / header_form /
  preprocessor_line, asm_statement / asm_template / asm_section /
  asm_operand / asm_clobber / asm_label_ref, attribute_spec_gcc /
  attribute_spec_msvc
- COptions.extended, EXTENSION_RULES stripping, @extended-on /
  @extended-off / @plain-and-first-iter / @ext-and-first-iter /
  @plain-as23-and-first / @new-path / @mark-new-path / @absorb-token /
  @finalize-extdecl / @terminated / @just-closed-and-decl-ahead
- isFunctionBodySupported, fetchDeep, walkParenFormDeclarator,
  skipLeadingAttributes, skipTaggedSpec, UNSUPPORTED_BODY_TOKENS, plus
  the registerTypedefIfApplicable / finalizeExternalDeclaration /
  registerMacrosFromTree / firstNonTriviaIs / startsNewExternalDeclaration
  helpers and the legacy declarator-walk helpers
  (findDeclaredName, splitDeclarators, declaratorPart, findSpecBoundary,
  isSpecifierKw, matchClose) plus the unused TYPE_SPEC_KEYWORD_NAMES /
  STORAGE_CLASS_NAMES / etc. classification sets
- csmith corpus + fixture tests + generator
- test/spec/path-dispatch.tsv (single path → no dispatch to track)
- ~24 extension-feature tests in c.test.ts (preprocessor, GCC asm,
  GCC __attribute__, MSVC __declspec, conditional_group, macro tagging
  in #define bodies, ...) and the viaPath assertions

external_declaration's open now dispatches statically: KW_STATIC_ASSERT
/ KW__STATIC_ASSERT into static_assert_declaration, #SIMPLE_TYPE_HEAD /
#STORAGE_PREFIX / KW__BITINT / `[[` into simple_declaration. Anything
else is a parse error. external_declaration's close runs
@finalize-new-path unconditionally.

Tests: 78/78 pass.

https://claude.ai/code/session_01DEdkKecwpq59ydTqZ7Aobv

Brings back support for the five typed top-level preprocessor
directive families plus the conditional family, opt-in via a new
COptions.extended boolean. Plain mode stays the canonical default
and is byte-identical to the prior build (extension grammar rules
are stripped from the spec when the flag is off).

Scope this round: #define / #undef / #include / #pragma / #error /
#warning / #line, and #if / #elif / #else / #endif. Each #-line is
its own external_declaration containing a typed directive node;
conditional directives stay flat (no #if-group folding — consumers
that want grouped structure walk the tree themselves).

#define populates cmeta.macros and tags subsequent identifiers as
MACRO_NAME at lex time; #undef reverts. call_expression nodes whose
callee was tagged carry isMacro: true.
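
The tagging behaviour described above can be sketched roughly as follows. This is a simplified, line-based illustration and not the real lexer: the `Tok` shape, the `tagMacros` name, and the regexes are hypothetical, and the local `macros` set only stands in for `cmeta.macros`.

```typescript
// Hypothetical token shape; the real lexer's token type is not shown in the PR.
type Tok = { kind: 'IDENT' | 'MACRO_NAME' | 'OTHER'; text: string };

// Tags identifiers line by line, updating the macro set on #define / #undef.
function tagMacros(source: string): Tok[] {
  const macros = new Set<string>(); // stands in for cmeta.macros
  const out: Tok[] = [];
  for (const line of source.split('\n')) {
    const m = /^\s*#\s*(define|undef)\s+([A-Za-z_]\w*)/.exec(line);
    if (m !== null) {
      const name = m[2] ?? '';
      if (m[1] === 'define') macros.add(name); // later uses are tagged
      else macros.delete(name);                // #undef reverts the tagging
      continue; // directive lines themselves are skipped in this sketch
    }
    for (const text of line.match(/[A-Za-z_]\w*|\S/g) ?? []) {
      const isIdent = /^[A-Za-z_]/.test(text);
      out.push({
        kind: !isIdent ? 'OTHER' : macros.has(text) ? 'MACRO_NAME' : 'IDENT',
        text,
      });
    }
  }
  return out;
}
```

Because the set is consulted at the moment each identifier is produced, the same name can be a `MACRO_NAME` before an `#undef` and a plain `IDENT` after it.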

GCC __attribute__, MSVC __declspec, GCC __asm__, and in-body #-lines
are not in scope this round.

Implementation: grammar-only. The 9 preprocessor sub-rules
(preprocessor_directive dispatcher + define/undef/include/conditional/
simple typed sub-rules + macro_parameter_list / macro_body /
header_form helpers) are pasted from commit 84d19eb~1 into
c-grammar.jsonic. external_declaration.open gains two PP_HASH
dispatch alts gated on the new @ext-and-first-iter condition. All 54
referenced @-actions / -conditions were already present in
src/c.ts (left orphaned by the prior teardown), so no new TS
authoring is needed beyond the option plumbing and the three
extension gate refs (@extended-on, @extended-off,
@ext-and-first-iter).

Tests: 12 new under describe('extended-mode preprocessor') —
covers each directive shape, plain-mode rejection of #-lines, macro
tagging across #define/#undef cycles, function-like macro call
isMacro tagging. All 90 (78 plain + 12 extended) pass.

https://claude.ai/code/session_01DEdkKecwpq59ydTqZ7Aobv

Brings the two compiler-specific attribute forms back as grammar-only
extensions, alongside the existing C23 [[…]] form. Decoration points
match the prior implementation: leading position in
simple_declaration.open, plus interleaved alts in spec_loop.open and
spec_loop.close. external_declaration.open also gains direct
KW___ATTRIBUTE__ / KW___ATTRIBUTE / KW___DECLSPEC dispatch into
simple_declaration so leading-attribute external decls don't have to
go through a different head.

The two paren-form rules (attribute_spec_gcc with double-paren shape,
attribute_spec_msvc with single-paren shape) are pasted from
84d19eb~1; both delegate to the already-present attribute_item /
attribute_argument_list helpers shared with the C23 form. All
referenced @-actions (@asg-*, @asm2-*) were already in src/c.ts. Both
rules are added to EXTENSION_RULES so they are stripped from the spec
in plain mode.

One small follow-on fix: spec_loop.close now also accepts
#STORAGE_PREFIX so a leading attribute followed by a storage class
keyword (`__attribute__((unused)) static int q;`) parses correctly.
@absorb-spec-storage / @absorb-spec-type are state-aware about which
slot (rule.o0 vs rule.c0) holds the matched token.

Out of scope this round: post-declarator attributes (`int x
__attribute__((aligned(8)));`), GCC __asm__, in-body #-lines, #if
folding.

Tests: 11 new under describe('extended-mode GCC __attribute__ and
MSVC __declspec') — covers leading + interleaved positions, MSVC
single-paren form, attribute argument lists, keyword-as-attribute-
name, multi-form coexistence with C23, plain-mode rejection. 101 / 101
pass (78 plain + 12 preprocessor + 11 attribute).

https://claude.ai/code/session_01DEdkKecwpq59ydTqZ7Aobv

Restores the six asm grammar rules — asm_statement (the state-machine
driver), asm_template, asm_section, asm_operand, asm_clobber,
asm_label_ref — pasted from 84d19eb~1. Dispatched from
external_declaration.open (KW_ASM / KW___ASM / KW___ASM__ gated on
@ext-and-first-iter) and from statement.open (same three keywords
gated on @extended-on for in-body asm). All ~30 referenced @-actions
(@asm-*, @asec-*, @aop-*, @acl-*, @alr-*) were already present in
src/c.ts. Added to EXTENSION_RULES so the rules drop out of the spec
in plain mode.

@finalize-new-path now wraps an asm_statement child as a single child
of external_declaration (matching static_assert_declaration) rather
than splicing its children — this preserves the asm_statement node
itself for consumers.

Tests: 6 new under describe('extended-mode GCC __asm__') —
plain-mode rejection, top-level template-only, volatile + clobbers,
in-body asm under compound_statement, asm goto with labels, full
extended form with outputs / inputs / clobbers. 107 / 107 pass
(78 plain + 12 preprocessor + 11 attribute + 6 asm).

Out of scope this round: in-body #-lines, post-declarator
attributes, #if folding.

https://claude.ai/code/session_01DEdkKecwpq59ydTqZ7Aobv

Restores the preprocessor_line rule for #-lines that appear inside
function bodies (mid-body #pragma, #ifdef, #error, etc.; rare but
legal C). Pasted from 84d19eb~1; dispatched from statement.open on
PP_HASH gated on @extended-on. The five referenced @-actions
(@preprocessor_line-bo, @pp-take-hash, @pp-reentry, @pp-absorb,
@pp-take-newline) were already present in src/c.ts. Added to
EXTENSION_RULES so the rule drops out of the spec in plain mode.

The lexer's PP_HASH matcher remains line-start-gated, so the `#`
must be preceded only by whitespace since the last newline — same
constraint as top-level directives.
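
A line-start gate of this kind could look like the sketch below. The function name `isLineStartHash` is hypothetical, and the real PP_HASH matcher may treat other trivia (for example comments) differently; this only shows the "whitespace since the last newline" rule stated above.

```typescript
// Returns true when the '#' at `pos` is preceded only by spaces or tabs
// since the last newline (or since the start of the input).
function isLineStartHash(src: string, pos: number): boolean {
  if (src[pos] !== '#') return false;
  for (let i = pos - 1; i >= 0; i--) {
    const ch = src[i];
    if (ch === '\n') return true;                   // reached the line start
    if (ch !== ' ' && ch !== '\t') return false;    // something else on the line
  }
  return true; // reached the start of input
}
```

Under this rule a `#pragma` that follows `int x;` on the same line is not a directive, while an indented `#pragma` on its own line is.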

Tests: 4 new under describe('extended-mode in-body
preprocessor_line') — plain-mode rejection, in-body #pragma,
in-body #ifdef/#endif as separate preprocessor_line nodes, and
trailing PP_NEWLINE preservation. 111 / 111 pass.

This completes the preprocessor coverage (top-level + in-body).
Remaining deferred items: post-declarator attributes and #if-group
folding.

https://claude.ai/code/session_01DEdkKecwpq59ydTqZ7Aobv

Adds the previously-unsupported post-declarator attribute decoration
point. Now `int x __attribute__((aligned(8)));`, `void f(void)
__attribute__((noreturn));`, `int z [[deprecated]];`, and `int g
__declspec(thread);` all parse with the attribute attaching to the
init_declarator (between the declarator and any `=` initializer /
terminator).

Grammar: four new alts in init_declarator.close gated on a new
@idecl-named (so they only fire after the declarator is complete):
- KW___ATTRIBUTE__ / KW___ATTRIBUTE → attribute_spec_gcc, gated on
  @idecl-named-and-extended.
- KW___DECLSPEC → attribute_spec_msvc, same gate.
- PUNC_LBRACKET PUNC_LBRACKET → attribute_spec_c23, gated on
  @idecl-named-and-as23 (named + token-adjacency check; no
  extended-mode requirement since [[…]] is plain C23).

The C23 alt is positioned BEFORE the single-token PUNC_LBRACKET
array_postfix alt so the 2-token lookahead wins on `[[`. Each alt
has `r: 'init_declarator'` so multiple attributes chain and the
trailing `=` / `,` / `;` is still picked up.

@init_declarator-bc gains a branch that pushes any returned
attribute_spec child onto rule.node.children, with a per-rule
attachedAttrs Set to prevent re-pushing on the next bc cycle.

Two new conditions: @idecl-named-and-extended,
@idecl-named-and-as23.

Tests: 10 new under describe('extended-mode post-declarator
attributes') — plain-mode rejection, GCC attr on variable / function
/ array, attr-with-initializer, multi-attr chaining, per-declarator
attrs in a multi-declarator decl, C23 [[…]] in this slot, MSVC
__declspec, regression for plain `int x[10]`. 121 / 121 pass.

Remaining deferred: #if-group folding (the hardest, still on the
roadmap).

https://claude.ai/code/session_01DEdkKecwpq59ydTqZ7Aobv

Folds runs of #if/#ifdef/#ifndef … (#elif/#elifdef/#elifndef/#else)*
… #endif into a single conditional_group node, with one
conditional_branch per opening directive carrying its directive +
body items, plus a separate `endif` slot on the group. Nested #if
inside a branch produces a nested conditional_group (recursion in
the grammar, not via a post-pass). Stray #endif / unterminated #if
degrade gracefully — orphan directives stay flat as
external_declaration{conditional_directive}, unterminated groups
omit the `endif` field.

This was previously a tree post-pass (`structureConditionalGroups`,
removed in 84d19eb). Now it's pure grammar.
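
For readers who want a concrete picture of the folded shape, the following standalone sketch reproduces it over a flat list of lines. It is an illustration only and deliberately not how the commit does it (the commit folds in the grammar): the `Item`, `Branch`, and `Group` types, the `fold` function, and the use of plain strings for directives and declarations are all assumptions.

```typescript
type Item = string | Group; // string: a flat directive or declaration line
interface Branch { directive: string; children: Item[] }
interface Group { kind: 'conditional_group'; branches: Branch[]; endif?: string }

const isIf = (s: string) => /^#\s*(if|ifdef|ifndef)\b/.test(s);
const isElse = (s: string) => /^#\s*(elif|elifdef|elifndef|else)\b/.test(s);
const isEndif = (s: string) => /^#\s*endif\b/.test(s);

function fold(lines: string[]): Item[] {
  let pos = 0;
  const peek = (): string => lines[pos] ?? '';

  // Collects items; inside a branch it stops (without consuming) at the
  // next boundary directive, much like cg_branch_body is described to do.
  function items(nested: boolean): Item[] {
    const out: Item[] = [];
    while (pos < lines.length) {
      const line = peek();
      if (nested && (isElse(line) || isEndif(line))) break;
      if (isIf(line)) out.push(group());   // nested #if => nested group
      else { out.push(line); pos++; }      // stray #else / #endif stay flat
    }
    return out;
  }

  function group(): Group {
    const g: Group = { kind: 'conditional_group', branches: [] };
    // The head directive opens the first branch; each #elif / #else opens another.
    while (pos < lines.length && (g.branches.length === 0 || isElse(peek()))) {
      const directive = peek();
      pos++;
      g.branches.push({ directive, children: items(true) });
    }
    // An unterminated group simply never gets an `endif` field.
    if (pos < lines.length && isEndif(peek())) { g.endif = peek(); pos++; }
    return g;
  }

  return items(false);
}
```

Calling `fold` on an include guard that wraps an `#if` / `#else` / `#endif` yields one outer group with a nested group inside its only branch, and a trailing unmatched `#endif` remains a flat item at the top level.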

Two new grammar rules:
- conditional_group: state machine that takes the head directive,
  delegates body absorption to cg_branch_body, then reads the next
  directive (advance vs. close).
- cg_branch_body: absorber that pushes external_declarations and
  nested conditional_groups onto the parent's curBranch.children;
  stops without consuming at the next boundary directive.

extdecl_loop gains a 2-token-lookahead dispatch alt
(s: 'PP_HASH #ANY_C_TOKEN' c: '@cg-head-is-if-family' b: 2) that
routes #if-family heads into conditional_group; everything else
goes through external_declaration as before. Lookahead conditions
in close-state alts use the same s: + b: pattern to force-fetch
PP_HASH and the directive name (close-state `s: []` alts don't
auto-fetch ctx.t).

@conditional_group-bo distinguishes r:-recursion (preserve in-
progress group) from p:-descent (nested #if — fresh init) via
rule.prev: jsonic sets rule.prev to the previous instance only on
r:; p:-descent leaves it as NORULE. This is the same pattern
paren_inner_declarator uses for nested paren-forms.

EXTENSION_RULES gains conditional_group and cg_branch_body so
plain mode strips them entirely.

Tests: 9 new under describe('extended-mode #if conditional_group
folding') — basic if/endif, ifdef/else, include guards (ifndef +
multi-item body), nested groups, group followed by top-level decl,
stray #endif, unterminated #if, three-way fold's directive
identity. Updated the previously-flat-only test to assert the new
folded shape. 130 / 130 pass.

This completes the extended-mode roadmap. All deferred items from
the rewrite are now shipped: top-level preprocessor (with macro
tagging, #if folding), in-body #-lines, GCC __attribute__ + MSVC
__declspec (leading / interleaved / post-declarator), GCC __asm__
(top-level + in-body), post-declarator C23 attributes.

https://claude.ai/code/session_01DEdkKecwpq59ydTqZ7Aobv

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8eb7a2aeae


Comment thread src/c.ts
# Stop at the body-opening \`{\` (don't consume — let
# simple_declaration's PUNC_LBRACE alt drive
# compound_statement).
{ s: 'PUNC_LBRACE' b: 1 g: 'kr-end' }

P1: Track nested braces in K&R declaration list

The `kr_declaration_list` rule currently stops as soon as it sees any `PUNC_LBRACE`, but valid K&R parameter declarations may contain braces inside a type specifier (for example `int f(x) struct S { int a; } x; { ... }`). In that case this rule will terminate at the struct body opener instead of the real function-body opener, so the function definition is split at the wrong place and the resulting CST is malformed. The stop condition here needs brace-depth awareness (or declaration-aware parsing) rather than a raw first-`{` cutoff.
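
One possible shape for such a stop condition is sketched below, as an assumption rather than the project's fix. The `findBodyOpener` name and the string-token input are hypothetical. The heuristic treats a `{` as the function body only when it appears at brace depth 0 directly after a declaration-terminating `;` (or at the very start); any other `{` is assumed to open a struct, union, or enum body and is tracked by depth.

```typescript
// Returns the index of the function-body '{' among the tokens that follow
// the parameter-list ')', or -1 if none is found.
function findBodyOpener(toks: string[]): number {
  let depth = 0;
  let prev = ';'; // treat the start as a declaration boundary
  for (let i = 0; i < toks.length; i++) {
    const t = toks[i] ?? '';
    if (t === '{') {
      // At depth 0 right after ';' no declaration is in progress,
      // so this brace can only be the function body.
      if (depth === 0 && prev === ';') return i;
      depth++; // otherwise it opens a type-specifier body
    } else if (t === '}') {
      depth--;
    }
    prev = t;
  }
  return -1;
}
```

For the reviewer's example, the tokens after `)` are `struct S { int a ; } x ; { ... }`; the first `{` follows `S` and is skipped, and the `{` after `x ;` is reported as the body opener.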

