Add C language parser plugin for Jsonic #1
Convert this package from @jsonic/jsonc to @jsonic/c. Targets C23 with GCC/Clang/MSVC extensions; produces a concrete syntax tree preserving every token, comment, macro, and extension as-is.
- src/tokens.ts: token catalog covering C23 keywords, all extension keywords, and every punctuator (one named token per literal form).
- src/symbols.ts: SymbolTable with full nested scopes (file, fn-proto, fn-body, block, struct-union, enum, for-init), MacroTable, and a LexMode flag bag, all bundled as CMeta on ctx.meta.cmeta so lex matchers and rule actions share the same state.
- src/matchers.ts: focused lex matchers, one job each: whitespace, line continuation, line/block comments, preprocessor directive opener (line-start gated), directive newline, header name (mode-gated), identifier with keyword/typedef-name/macro-name reclassification, integer (dec/hex/oct/binary, separators, suffixes), float (decimal + hex), char (with prefixes), string (with prefixes and raw R""), and a single longest-match punctuator dispatch.
- src/c.ts: plugin entry. Disables jsonic's built-in lexers so our matchers fully own tokenization, registers all token names so grammar rules can reference them, and installs CMeta via parse.prepare.
Grammar covers translation_unit -> extdecl_loop -> external_declaration with a coarse-grained token chomper that recognises the typedef-name shape and registers it in the symbol table. Pre-lexed lookahead tokens are reclassified in place after typedef registration so subsequent matches see TYPEDEF_NAME immediately.
Smoke tests cover tokenization correctness (keyword vs identifier boundary, multi-char punctuator dispatch, comment trivia preservation) and the typedef disambiguation path end to end.
Subsequent slices will replace the chomper with proper declarator, declaration, statement, and expression grammar (the latter driven by @jsonic/expr); add directive nodes via @jsonic/directive with best-effort parsing of conditional groups; and surface the macro table to tag macro-call sites.
- Replace the single-name typedef heuristic with a real declarator
walker. For each declarator in `typedef T1 *p, q[3], (*fn)(int);`
the trailing identifier is found by stripping pointers, qualifiers,
attribute groups, array/function postfixes, and recursing into
parenthesised subdeclarators. Every declared name is registered.
- splitDeclarators splits the init-declarator-list at top-level commas,
ignoring commas inside (), [], or {}.
- declaratorPart drops the initializer suffix at top-level `=` so
initializer expressions don't pollute the name search.
- findSpecBoundary identifies where declaration-specifiers end: it
walks past storage-classes, type-specifiers, qualifiers, function-
specifiers, attribute groups, and a single TYPEDEF_NAME, plus the
optional brace-balanced body of struct/union/enum specifiers.
- The chomper no longer auto-terminates at a top-level `}`. Instead it
marks "just closed a brace" and only terminates if the next non-
trivia token unambiguously begins a new external declaration
(storage-class/type-specifier/qualifier keyword, attribute spec,
TYPEDEF_NAME, preprocessor hash, or EOF). This lets
`typedef struct { … } S;` and `enum E { … } var;` finish at the
trailing `;`, while function definitions still terminate at `}`
before the next top-level decl.
Tests cover multi-name typedef, pointer/array/function-pointer
typedefs, struct-tag typedefs, struct-with-body typedefs, mixed
declarator lists, function-definition-then-decl, and brace-bearing
initializers.
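The comma-splitting and initializer-stripping steps described above can be sketched as follows. This is a minimal illustration that assumes tokens are plain strings; the function names mirror the ones in the commit message, but the bodies are hypothetical and the real code operates on jsonic tokens.

```typescript
// Assumption: a token is just its source text. The real plugin uses jsonic tokens.
type Tok = string;

const OPEN = new Set(["(", "[", "{"]);
const CLOSE = new Set([")", "]", "}"]);

// Split an init-declarator-list at commas that sit at nesting depth 0.
function splitDeclarators(tokens: Tok[]): Tok[][] {
  const parts: Tok[][] = [];
  let current: Tok[] = [];
  let depth = 0;
  for (const t of tokens) {
    if (OPEN.has(t)) depth++;
    else if (CLOSE.has(t)) depth--;
    if (t === "," && depth === 0) {
      parts.push(current);
      current = [];
    } else {
      current.push(t);
    }
  }
  parts.push(current);
  return parts;
}

// Drop everything from the first top-level `=` onward.
function declaratorPart(tokens: Tok[]): Tok[] {
  let depth = 0;
  const cut = tokens.findIndex((t) => {
    if (OPEN.has(t)) depth++;
    else if (CLOSE.has(t)) depth--;
    return t === "=" && depth === 0;
  });
  return cut === -1 ? tokens : tokens.slice(0, cut);
}

// Declarator list of: int a, b = f(1, 2), c[2] = {1, 2};
const declList: Tok[] = [
  "a", ",", "b", "=", "f", "(", "1", ",", "2", ")", ",",
  "c", "[", "2", "]", "=", "{", "1", ",", "2", "}",
];
const declarators = splitDeclarators(declList).map(declaratorPart);
// declarators: [["a"], ["b"], ["c", "[", "2", "]"]]
```

The commas inside `f(1, 2)` and `{1, 2}` are skipped because the depth counter is non-zero there, which is the property the name search relies on.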
Move comment and line-continuation tokens into the IGNORE token set so
the parser proper never sees them, then capture them in a sub-lex hook
that attaches the buffered trivia to the next non-trivia token via
tkn.use.leading. The chomper drains use.leading into the AST in source
order ahead of each absorbed token, preserving comments verbatim
without any grammar rule needing to mention them.
- Split trivia into PRESERVE (block/line comments, line continuations)
and DROP (whitespace, jsonic's #LN/#CM). Only PRESERVE flows to
use.leading.
- pendingTrivia lives on CMeta so it travels with parser state.
- jsonic.sub({lex}) registers the hook once at plugin install.
Tests: trivia ordering through declarations, line-continuation
preservation, whitespace remains absent from the AST.
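A rough sketch of the buffering idea, outside of jsonic: PRESERVE trivia is held in a pending list and attached to the next non-trivia token, while DROP trivia is discarded. The `Tok` shape, the token names, and the `attachTrivia` / `flatten` helpers are assumptions made for illustration; the plugin itself does this through a sub-lex hook and `tkn.use.leading`.

```typescript
// Hypothetical, simplified token shape; the real plugin works on jsonic tokens.
interface Tok {
  name: string;   // token name (names below are assumed, not the real catalog)
  src: string;    // exact source text
  leading: Tok[]; // stand-in for tkn.use.leading
}

const PRESERVE = new Set(["COMMENT_LINE", "COMMENT_BLOCK", "LINE_CONT"]);
const DROP = new Set(["WS"]);

// Buffer PRESERVE trivia and hand it to the next non-trivia token.
function attachTrivia(raw: Tok[]): Tok[] {
  const out: Tok[] = [];
  let pending: Tok[] = [];
  for (const t of raw) {
    if (DROP.has(t.name)) continue; // whitespace never reaches the AST
    if (PRESERVE.has(t.name)) {
      pending.push(t);
      continue;
    }
    out.push({ ...t, leading: pending });
    pending = [];
  }
  return out; // trivia after the last real token is ignored in this sketch
}

// Drain leading trivia ahead of each token, in source order.
function flatten(tokens: Tok[]): string[] {
  const srcs: string[] = [];
  for (const t of tokens) {
    for (const l of t.leading) srcs.push(l.src);
    srcs.push(t.src);
  }
  return srcs;
}

const tok = (name: string, src: string): Tok => ({ name, src, leading: [] });
const attached = attachTrivia([
  tok("COMMENT_BLOCK", "/* width */"),
  tok("WS", " "),
  tok("KW_INT", "int"),
  tok("WS", " "),
  tok("ID", "w"),
  tok("COMMENT_LINE", "// done"),
  tok("PUNC_SEMI", ";"),
]);
// flatten(attached): ["/* width */", "int", "w", "// done", ";"]
```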
Add a post-processing pass over the chomped token list (src/structure.ts)
that produces structured concrete-syntax nodes for declarations and
function definitions. Walking the resulting tree depth-first yields the
original token sequence in order, so source fidelity is preserved while
clients gain a shape they can actually navigate.
The external_declaration close action now calls
structureExternalDeclaration(tokens). On a clean parse it replaces the
flat token-ref children with structured children:
  external_declaration { declKind: 'declaration' }
    declaration_specifiers
      <token-refs for storage-class / type-specifier / qualifier>
      struct_specifier|union_specifier|enum_specifier (with optional
        member_decl_list / enumerator_list still flat for now)
      attribute_spec (__attribute__((...)) / __declspec(...))
    init_declarator_list
      init_declarator { declaredName }
        declarator
          pointer* (with qualifiers + attribute specs)
          direct_declarator { declaredName }
            <ID or parenthesised subdeclarator>
            array_postfix*
            function_postfix*
          asm_label?
          attribute_spec*
        '=' initializer?
      ',' init_declarator ...
    ';'
  external_declaration { declKind: 'function_definition' }
    declaration_specifiers
    declarator
    kr_declaration_list? (K&R old-style parameter declarations)
    compound_statement (body still flat tokens for now)
The TokenStream class hides PRESERVED trivia from grammar decisions but
emits the trivia tokens in order as siblings of the next real token's
ref. This keeps comments and line continuations in the right place
inside the structured tree.
When structure can't recognise the shape (e.g. preprocessor lines, raw
expression statements at top level) the chomper's flat token-ref list
is retained and declKind is set to 'unknown'.
Tests cover: simple int x = 1, multi-decl int a,b=2,c, pointer/array/
function declarators, function-definition with compound_statement,
struct-with-body, enum with C23 fixed underlying type, and
__attribute__ on a declaration.
Replace the opaque brace-balanced member_decl_list / enumerator_list
of slice 4 with proper member parsing.
struct/union bodies now contain struct_declaration nodes, each with:
- specifier_qualifier_list (declaration_specifiers reused, renamed)
- struct_declarator_list of struct_declarator nodes
- trailing ';'
struct_declarator carries:
- the declarator (with declaredName)
- optional bitfield_width (`: const-expr`)
- optional trailing attribute_spec
This handles `struct S { unsigned f : 1; int : 7; int n; };` cleanly,
including anonymous bitfields.
static_assert at member level (C23 + GCC) becomes a
static_assert_declaration node.
enum bodies now contain enumerator nodes, each with:
- declaredName
- optional [[…]] / __attribute__((…)) attribute_spec (C23)
- optional initializer (constant-expression, opaque for now)
The trailing comma after the last enumerator is preserved.
Tests: three-field struct, bitfield + anonymous bitfield, enum with
initializer and trailing comma.
Replace the flat-token compound_statement of slice 4 with proper block-item parsing. A compound_statement now contains:
- declaration nodes (when the head is a specifier or static_assert)
- statement nodes, dispatched by leading token
Statement kinds modelled (each preserves all source tokens):
- if_statement (with paren_condition + then-stmt + optional else)
- switch_statement
- while_statement
- do_statement (do … while (…) ;)
- for_statement (for_controls captures the parenthesised header)
- jump_statement (goto/continue/break/return; jumpKind set)
- labeled_statement (case / default / label; labelKind/labelName)
- expression_statement
- compound_statement (recursive)
- asm_statement (GCC __asm__ / asm with optional qualifiers)
- preprocessor_line (PP_HASH … PP_NEWLINE, opaque)
Also factored out parseDeclaration so the same shape used at the top level is reused inside blocks.
Tests: function with mixed decl+expr+return body, if/else+while+for in one body, switch with case+default labels, goto+label round trip, do/while.
Each #-line on the input now becomes its own external_declaration
containing one structured directive node, instead of being absorbed
into the surrounding code by the chomper.
Directive node kinds and their structured fields:
- include_directive { includeForm, headerKind, headerName }
- define_directive { macroName, macroKind ('object-like' |
'function-like'), macroParams?, macroVariadic? }
- undef_directive { macroName }
- conditional_directive { directive: 'if'|'ifdef'|'ifndef'|'elif'|
'elifdef'|'elifndef'|'else'|'endif' }
- pragma_directive (opaque body)
- error_directive (opaque body)
- warning_directive (opaque body)
- line_directive (opaque body)
- unknown_directive (any other #-form)
The function-like distinction in #define checks adjacency: only
`NAME(` with no whitespace between produces a function-like macro.
Parameter list parsing pulls out parameter names and detects the
variadic ellipsis.
The chomper now terminates an external_declaration at PP_NEWLINE when
the first non-trivia token is PP_HASH, so a directive line plus a
following declaration land as two separate external_declarations
instead of one giant chomp.
Define/undef directives populate ctx.meta.cmeta.macros so future
slices can tag macro-call sites.
Conditional groups (#if … #endif) are intentionally left as a flat
sequence of directive + declaration nodes for now; collapsing them
into a single nested group with branches comes in a follow-up slice.
Tests: angled and quoted #include, object-like and function-like
#define (with variadic), #if/#endif sequencing, #pragma/#error,
#undef.
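The adjacency rule for `#define` can be illustrated on a raw source line. This is only a sketch: `classifyDefine` is a hypothetical helper that uses a regular expression, whereas the plugin makes the same decision on its token stream.

```typescript
type MacroKind = "object-like" | "function-like";

interface DefineInfo {
  macroName: string;
  macroKind: MacroKind;
}

// A '(' counts as a parameter list only when it touches the macro name.
function classifyDefine(line: string): DefineInfo | undefined {
  const m = /^\s*#\s*define\s+([A-Za-z_]\w*)(\()?/.exec(line);
  if (!m) return undefined; // not a #define line
  return {
    macroName: m[1] ?? "",
    macroKind: m[2] ? "function-like" : "object-like",
  };
}

const adjacent = classifyDefine("#define SQR(x) ((x) * (x))");
// adjacent: { macroName: "SQR", macroKind: "function-like" }
const spaced = classifyDefine("#define SQR (x)");
// spaced: { macroName: "SQR", macroKind: "object-like" }; the body is "(x)"
```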
function_postfix nodes are no longer opaque '(' … ')' chunks. Each one
now contains either:
  parameter_type_list
    parameter_declaration { declaredName? }
      declaration_specifiers
      declarator | abstract_declarator
    parameter_variadic (the '...' marker, also sets
                        parameter_type_list.variadic = true)
  identifier_list (K&R-style identifier-only list)
Special cases handled:
- `()` — empty postfix, returned as-is
- `(void)` — collapsed into a single parameter_declaration whose
spec list contains the lone void
- K&R `(a, b, c)` — detected by lookahead (every comma-separated
item is exactly one ID, ending with ')')
Abstract vs concrete declarator: parseParameterDeclaration tries the
concrete form first; if no declaredName surfaces it backtracks and
re-parses as abstract. This keeps `int qsort(void *, size_t, ...)`
clean (no spurious declaredName on the size_t parameters) once
size_t is registered as a typedef.
Tests: void prototype, named ANSI parameters with declaredName
extraction, variadic ellipsis, abstract parameters across a typedef
boundary, and K&R-style identifier_list.
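The K&R lookahead described above amounts to a small alternating check. The sketch below assumes a simplified token shape and the token names `ID`, `PUNC_COMMA`, and `PUNC_RPAREN`; `isIdentifierList` is a hypothetical name, not necessarily the one used in structure.ts.

```typescript
// Hypothetical token shape; token names are assumed for illustration.
interface Tok {
  name: string;
  src: string;
}

// `tokens` starts right after the opening '('. Returns true only for
// ID (',' ID)* ')', i.e. every comma-separated item is exactly one ID.
function isIdentifierList(tokens: Tok[]): boolean {
  let expectId = true;
  for (const t of tokens) {
    if (expectId) {
      if (t.name !== "ID") return false;
      expectId = false;
    } else {
      if (t.name === "PUNC_RPAREN") return true;
      if (t.name !== "PUNC_COMMA") return false;
      expectId = true;
    }
  }
  return false; // ran out of tokens before ')'
}
```

Because a typedef-name lexes as TYPEDEF_NAME rather than ID, a prototype such as `f(size_t)` fails this check and falls through to the parameter_type_list path, while the empty `()` case is handled separately before the lookahead runs.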
Identifiers previously seen in a #define now lex as MACRO_NAME instead of ID, mirroring the typedef-name path. The grammar accepts MACRO_NAME wherever it accepts ID (added a small isIdLike helper and replaced the relevant call sites in structure.ts), so structuring is unchanged while clients can distinguish macro references from ordinary identifiers without consulting the macro table themselves.
The identifier matcher consults ctx.meta.cmeta.macros after the typedef check. After a #define directive is structured into a define_directive node, registerMacrosFromTree calls reclassifyAsMacro, which walks the lexer's pre-fetched lookahead (ctx.t and lex.pnt.token) mutating any matching ID tokens to MACRO_NAME, so the very first post-#define occurrence already carries the correct token name even when jsonic's lookahead got there first.
#undef removes the entry from the macro table; subsequent uses of the name re-emerge as plain ID.
Tests cover all three transitions.
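The classification order and the lookahead fix-up can be sketched in isolation. The token shape and the `Set`-based tables are assumptions; the plugin keeps this state on ctx.meta.cmeta and mutates jsonic's own lookahead tokens. Keyword handling is left out of the sketch.

```typescript
// Hypothetical, simplified token shape.
interface Tok {
  name: string;
  src: string;
}

// Order per the description: typedef check first, then the macro table.
function classifyIdentifier(
  src: string,
  typedefs: Set<string>,
  macros: Set<string>,
): string {
  if (typedefs.has(src)) return "TYPEDEF_NAME";
  if (macros.has(src)) return "MACRO_NAME";
  return "ID";
}

// Fix up tokens that were lexed before the #define was registered.
function reclassifyAsMacro(lookahead: Tok[], macroName: string): void {
  for (const t of lookahead) {
    if (t.name === "ID" && t.src === macroName) t.name = "MACRO_NAME";
  }
}

const typedefTable = new Set<string>(["size_t"]);
const macroTable = new Set<string>();
// Tokens the lexer had already produced when `#define MAX ...` was structured.
const prefetched: Tok[] = [
  { name: "ID", src: "MAX" },
  { name: "ID", src: "n" },
];

macroTable.add("MAX"); // #define MAX ...
reclassifyAsMacro(prefetched, "MAX");
const whileDefined = classifyIdentifier("MAX", typedefTable, macroTable);
macroTable.delete("MAX"); // #undef MAX
const afterUndef = classifyIdentifier("MAX", typedefTable, macroTable);
// whileDefined is "MACRO_NAME"; afterUndef is "ID"
```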
Inside expression_statement, initializer, and jump_statement bodies,
the post-chomp pass now promotes ID/MACRO_NAME-followed-by-(args)
sequences into nested call_expression nodes. The grammar context for
calls is identical regardless of statement form, so a single
structureCallsInPlace helper handles all three.
  call_expression { callee, isMacro }
    <callee token>
    argument_list
      '(' <recursively-structured tokens> ')'
isMacro is set from the callee token's tname (true for MACRO_NAME),
giving consumers a syntactic flag that distinguishes a macro
invocation from a real function call without re-querying the macro
table.
Recursion is handled by structuring the argument list's interior as a
synthetic node and inlining the result, so g(f(1), h(2)) produces
three nested call_expression nodes.
Tests: simple call (isMacro false), macro invocation (isMacro true),
nested calls inside arguments.
Add a translation_unit-level post-pass that folds the flat sequence of
#if/#ifdef/#ifndef … (#elif…)* (#else)? … #endif directives into a
single conditional_group node containing typed branches. Best-effort:
unmatched #endif or unterminated #if leaves the surrounding flat
sequence untouched so the rest of the tree stays intact.
Output shape:
  conditional_group
    branches: [
      conditional_branch {
        branchKind: 'if'|'ifdef'|'ifndef'|'elif'|'elifdef'|'elifndef'|'else',
        directive: <external_declaration containing the directive>,
        body: [<external_declaration | conditional_group>...],
        children: [directive, ...body]   // depth-first walk fidelity
      },
      ...
    ]
    endif: <external_declaration containing the #endif directive>
    children: [...branches, endif]
Nested #if … #endif inside a branch are recursively grouped by
re-running structureConditionalGroups on the branch body. The pass
also re-runs inside any preserved children (e.g. function bodies) so
preprocessor groups that live mid-function get the same treatment.
Tests: simple if/endif fold, three-way if/elif/else, nested ifdef
inside an outer ifdef, and best-effort handling of a stray #endif
(left flat).
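One way to picture the folding pass is the sketch below. It is not the structureConditionalGroups implementation: the `Item` and `Branch` shapes are reduced stand-ins for the real nodes, and `foldConditionals` / `tryGroup` are hypothetical names. It does follow the same best-effort rule: a stray #endif or an unterminated #if is left flat.

```typescript
// Reduced stand-ins for the real CST nodes.
interface Item {
  kind: string;        // "directive" | "declaration" | "conditional_group"
  directive?: string;  // set when kind === "directive"
  branches?: Branch[]; // set when kind === "conditional_group"
}
interface Branch {
  branchKind: string;
  body: Item[];
}

const OPENERS = new Set(["if", "ifdef", "ifndef"]);
const CONTINUERS = new Set(["elif", "elifdef", "elifndef", "else"]);

// Try to build one group starting at an opener; undefined if unterminated.
function tryGroup(items: Item[], start: number): { group: Item; next: number } | undefined {
  const head = items[start];
  if (!head || head.kind !== "directive" || !OPENERS.has(head.directive ?? "")) return undefined;
  const branches: Branch[] = [];
  let current: Branch = { branchKind: head.directive ?? "", body: [] };
  let depth = 0; // nesting depth of inner #if groups
  for (let j = start + 1; j < items.length; j++) {
    const it = items[j];
    if (!it) break;
    const d = it.kind === "directive" ? it.directive ?? "" : "";
    if (OPENERS.has(d)) {
      depth++;
    } else if (d === "endif") {
      if (depth === 0) {
        branches.push(current);
        for (const b of branches) b.body = foldConditionals(b.body); // nested groups
        return { group: { kind: "conditional_group", branches }, next: j + 1 };
      }
      depth--;
    } else if (depth === 0 && CONTINUERS.has(d)) {
      branches.push(current);
      current = { branchKind: d, body: [] };
      continue;
    }
    current.body.push(it);
  }
  return undefined;
}

function foldConditionals(items: Item[]): Item[] {
  const out: Item[] = [];
  let i = 0;
  while (i < items.length) {
    const folded = tryGroup(items, i);
    if (folded) {
      out.push(folded.group);
      i = folded.next;
    } else {
      const it = items[i];
      if (it) out.push(it); // includes stray #endif and unterminated #if
      i++;
    }
  }
  return out;
}
```

The sketch omits the `directive`, `endif`, and `children` fields of the real output shape, so it does not demonstrate the depth-first token fidelity the commit describes.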
Add src/expr.ts: a hand-rolled Pratt-style parser covering the full C
operator-precedence table from C23 §6.5. All expression contexts —
expression_statement bodies, jump_statement (return/goto) operands,
init_declarator initializers — now flow through it instead of being
absorbed as flat tokens with a post-pass.
Output shapes (all preserve every source token via depth-first
children):
literal_expression { literalKind, value }
identifier_expression { name }
paren_expression
call_expression { callee, isMacro, argument_list }
subscript_expression { target, index_list }
member_expression { object, op ('.'|'->'), memberName }
postfix_unary_expression { target, op ('++'|'--') }
unary_expression { op, operand } // ++/--/+/-/!/~/*/&/sizeof/_Alignof/...
cast_expression { typeName, operand }
binary_expression { op, left, right } // 11 precedence levels
conditional_expression { cond, then, else }
assignment_expression { left, op, right } // right-assoc
comma_expression
generic_selection
statement_expression // GCC ({ ... })
compound_literal { typeName, initializer_list }
Implementation notes:
- Cast vs paren-expression vs compound-literal disambiguation peeks
one token past `(`. Type-name detection accepts type keywords and
TYPEDEF_NAMEs (the typedef-name table is already populated by the
earlier slices).
- sizeof / _Alignof on a parenthesised type-name produce a type_name
operand; on an expression they recurse into parseUnary.
- Adjacent string literals are folded into a single literal_expression.
- Macro-call detection moves into parsePostfix's call branch: when
the immediate target is an identifier_expression whose token was
MACRO_NAME, isMacro=true. The slice-10 post-pass is no longer
needed for these contexts and was removed.
Tests: precedence 1+2*3, right-assoc assignment chain, ternary,
postfix subscript+member chain, prefix -/!/* unary, typedef-name
cast, sizeof on expr and on type-name, adjacent-string concatenation.
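For readers unfamiliar with the technique, a Pratt-style loop for a small subset of the operators looks roughly like this. It is a minimal sketch, not src/expr.ts: the binding-power numbers are arbitrary, tokens are plain strings, only `= + - * /` and parentheses are handled, and parentheses are not kept as paren_expression nodes.

```typescript
type Expr =
  | { kind: "literal_expression"; value: string }
  | { kind: "identifier_expression"; name: string }
  | { kind: "binary_expression"; op: string; left: Expr; right: Expr }
  | { kind: "assignment_expression"; op: string; left: Expr; right: Expr };

// Illustrative subset of the table: a higher bp binds tighter.
const INFIX = new Map<string, { bp: number; rightAssoc: boolean }>([
  ["=", { bp: 1, rightAssoc: true }],
  ["+", { bp: 10, rightAssoc: false }],
  ["-", { bp: 10, rightAssoc: false }],
  ["*", { bp: 20, rightAssoc: false }],
  ["/", { bp: 20, rightAssoc: false }],
]);

function parseExpression(tokens: string[]): Expr {
  let pos = 0;

  function parsePrimary(): Expr {
    const t: string | undefined = tokens[pos++];
    if (t === undefined) throw new Error("unexpected end of input");
    if (t === "(") {
      const inner = parseInfix(0);
      if (tokens[pos++] !== ")") throw new Error("expected ')'");
      return inner;
    }
    return /^[0-9]/.test(t)
      ? { kind: "literal_expression", value: t }
      : { kind: "identifier_expression", name: t };
  }

  function parseInfix(minBp: number): Expr {
    let left = parsePrimary();
    for (;;) {
      const op: string | undefined = tokens[pos];
      const info = op === undefined ? undefined : INFIX.get(op);
      if (op === undefined || info === undefined || info.bp < minBp) return left;
      pos++;
      // Left-associative operators require a strictly tighter right side.
      const right = parseInfix(info.rightAssoc ? info.bp : info.bp + 1);
      left =
        op === "="
          ? { kind: "assignment_expression", op, left, right }
          : { kind: "binary_expression", op, left, right };
    }
  }

  const tree = parseInfix(0);
  if (pos !== tokens.length) throw new Error("unexpected trailing tokens");
  return tree;
}

// Render the tree with explicit parentheses to make the grouping visible.
function show(e: Expr): string {
  switch (e.kind) {
    case "literal_expression":
      return e.value;
    case "identifier_expression":
      return e.name;
    default:
      return `(${show(e.left)} ${e.op} ${show(e.right)})`;
  }
}

// show(parseExpression(["1", "+", "2", "*", "3"])) === "(1 + (2 * 3))"
// show(parseExpression(["a", "=", "b", "=", "c"])) === "(a = (b = c))"
```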
Parse two header- and source-shaped fragments end to end and verify
that the structural CST matches expectations. Catches regressions
where a single feature works in isolation but breaks under composition.
Header-shape coverage:
- #ifndef … #define … #endif wrapping the whole file (folds into a
single conditional_group with one ifndef branch),
- #include <…> (angled),
- eight typedef declarations (signed/unsigned char/short/int/long
long), all registered as typedef-names,
- three function-like and object-like #define directives whose
macro names land in the macro table,
- typedef of a struct-with-body (struct vec → vec_t), three int32_t
members,
- two function prototypes using the freshly-registered typedef-names,
- C23 fixed-underlying-type enum (`enum status : int`) with two
enumerators plus trailing comma.
Source-shape coverage:
- #include "vec.h",
- three function definitions (sign, vec_add, vec_dot),
- if-statements, multiple return values, member access via . and ->,
- long chained binary_expression on vec_dot's return.
Both tests verify the structural shape (e.g. conditional_group has the
expected branchKind, struct has the expected member count, top-level
return chain is rooted at +) rather than just "doesn't throw", giving
confidence that real-world C will round-trip through the parser.
initializer_list contents are no longer opaque. Each item is parsed as
an initializer_item that may carry a leading designation:
  initializer_item { designation?, value }
    designation
      member_designator { memberName }   // .x
      index_designator                   // [n]
    value: initializer | <expression-node>
Nested initializer-lists are recognised: each item's value may itself
be an initializer wrapping another initializer_list, so 2D arrays
(`{ {1,2}, {3,4} }`) and nested-struct initializers structure the
whole way down. The leading PUNC_ASSIGN of a designation is captured
on the designation node so source fidelity is preserved.
_Static_assert / static_assert split into typed children:
static_assert_declaration { condition, message? }
condition is parsed via the Pratt parser so the boolean expression is
fully structured; the optional second argument lands as a
literal_expression (or whatever expression form). Top-level
static_assert at the translation-unit level is now dispatched
explicitly in structureExternalDeclaration (it isn't a declaration-
specifier head).
Tests: .field designators, [index] designators, nested initializer
lists, static_assert with both arguments, and bare static_assert
without a message.
_Generic( ctrl, T1: e1, T2: e2, default: eD ) is no longer an opaque
balanced-paren node. Output:
  generic_selection
    controlling: generic_controlling_expression { expression }
    associations: [
      generic_association {
        associationKind: 'type'|'default',
        typeName?: type_name,
        value: <expression-node>
      },
      ...
    ]
The controlling expression and each association's value run through
the Pratt parser so binary operators, calls, identifiers etc. all
land structured. The type-name slot still holds an opaque token list
because a full type-name parser lives in structure.ts and consuming
it from inside an expression context would be circular for now;
preserved verbatim.
Test: _Generic(x, int:1, double:2, default:0) — three associations
with the right kinds and structured values.
attribute_spec is no longer an opaque ((…)) chunk. Each attribute item
inside the parens becomes a typed sub-node:
  attribute_spec { attributeForm: 'gcc'|'msvc'|'unknown', items }
    attribute_item { attributeName, attributePrefix?, argumentList? }
      attribute_argument_list   // structured args via Pratt
Form distinction:
- GCC __attribute__((items)) uses double parentheses
- MSVC __declspec(items) uses single parentheses
The attribute name slot is permissive: identifiers, typedef-names,
macro-names, and even C reserved words are accepted (so things like
__attribute__((const)) and __attribute__((noreturn)) parse the same
way).
C23 namespaced form `prefix::name` is recognised and split into
attributePrefix + attributeName.
Each argument is parsed with the Pratt expression parser so e.g.
__attribute__((format(printf, 1, 2))) yields three structured
expression arguments instead of opaque tokens.
Tests: GCC __attribute__ with bare name + format(...) + nonnull(...),
MSVC __declspec(dllexport), and __attribute__((const)) using a
keyword as the name slot.
asm_statement is no longer an opaque (...) block. The body now splits
along its colons into typed sections, each member structured:
  asm_statement { qualifiers: ['volatile'|'inline'|'goto'...] }
    asm_template { expression }   // string literal expr
    asm_outputs                   // optional
      asm_operand { asmName?, constraint { value }, value { expression } }
      ...
    asm_inputs                    // optional
      asm_operand
      ...
    asm_clobbers                  // optional
      asm_clobber { value }       // string literal
      ...
    asm_labels                    // optional
      asm_label_ref { labelName } // identifier
      ...
Output and input operand sub-shapes are identical (same C grammar):
optional [asm-name] in brackets, a string-literal constraint, and a
parenthesised C expression that the Pratt parser structures.
Trailing empty sections (e.g. `: : : "cc"` with empty outputs and
inputs) are fine — each ':' opens a new section regardless of
whether the previous one had items, and the section's children
remain empty. Walking depth-first still yields the original tokens
in order including the colons.
Tests: bare template, full extended form (output, two inputs,
clobbers), `__asm__ goto` with labels section, and operand with
[asm-name] prefix.
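The colon-driven sectioning can be sketched as below, assuming the body has already been tokenized into plain strings and every colon arrives as a single `:` token. `splitAsmSections` is a hypothetical helper; unlike the real tree it drops the colon tokens rather than keeping them as children.

```typescript
const SECTION_KINDS = [
  "asm_template",
  "asm_outputs",
  "asm_inputs",
  "asm_clobbers",
  "asm_labels",
];

interface AsmSection {
  kind: string;
  tokens: string[];
}

// `body` holds the tokens between the outer '(' and ')' of the asm statement.
function splitAsmSections(body: string[]): AsmSection[] {
  let current: AsmSection = { kind: "asm_template", tokens: [] };
  const sections: AsmSection[] = [current];
  let depth = 0;
  for (const t of body) {
    if (t === "(" || t === "[") depth++;
    else if (t === ")" || t === "]") depth--;
    if (t === ":" && depth === 0) {
      // Each top-level ':' opens the next section, even if the last was empty.
      const kind = SECTION_KINDS[sections.length];
      if (kind === undefined) throw new Error("too many ':' sections");
      current = { kind, tokens: [] };
      sections.push(current);
    } else {
      current.tokens.push(t);
    }
  }
  return sections;
}

// __asm__("nop" : : : "cc")
const clobberOnly = splitAsmSections(['"nop"', ":", ":", ":", '"cc"']);
// kinds: asm_template, asm_outputs (empty), asm_inputs (empty), asm_clobbers
```

Tracking paren depth matters because an operand expression may itself contain a colon, as in `"r"(a ? b : c)`; that colon sits at depth 1 and does not open a section.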
C23 introduces a new attribute syntax sitting alongside GCC's
__attribute__ and MSVC's __declspec. The lexer emits `[` and `]` as
single PUNC_LBRACKET / PUNC_RBRACKET tokens, so detection requires an
adjacency check against the source positions:
  isC23AttributeOpen(ts):
    PUNC_LBRACKET at offset 0,
    PUNC_LBRACKET at offset 1,
    second.sI === first.sI + first.len   // no chars between
A new parseC23AttributeSpec produces an attribute_spec node with
attributeForm: 'c23', sharing the parseAttributeItem shape from slice
16. parseAnyAttributeSpec dispatches on the head token between gcc /
msvc / c23 forms so callers don't have to.
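The adjacency check translates almost directly into code. The sketch below takes a plain token array and an index instead of the TokenStream used in structure.ts; the `Tok` shape is reduced to the three fields the check needs.

```typescript
// Reduced token shape: name, source start index, and length.
interface Tok {
  name: string;
  sI: number;
  len: number;
}

// True when two '[' tokens sit back to back with no characters between them.
function isC23AttributeOpen(tokens: Tok[], at: number): boolean {
  const first: Tok | undefined = tokens[at];
  const second: Tok | undefined = tokens[at + 1];
  return (
    first !== undefined &&
    second !== undefined &&
    first.name === "PUNC_LBRACKET" &&
    second.name === "PUNC_LBRACKET" &&
    second.sI === first.sI + first.len
  );
}

// "[[nodiscard]]": brackets at source offsets 0 and 1
const touching: Tok[] = [
  { name: "PUNC_LBRACKET", sI: 0, len: 1 },
  { name: "PUNC_LBRACKET", sI: 1, len: 1 },
];
// "[ [nodiscard]]": a space separates the brackets
const separated: Tok[] = [
  { name: "PUNC_LBRACKET", sI: 0, len: 1 },
  { name: "PUNC_LBRACKET", sI: 2, len: 1 },
];
// isC23AttributeOpen(touching, 0) is true; isC23AttributeOpen(separated, 0) is false
```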
Hooked in at every relevant site:
- parseDeclarationSpecifiers accepts a leading [[…]] block
(declaration head can be the attribute itself).
- structureExternalDeclaration's head dispatch recognises [[…]] as
starting a declaration.
- parseDeclaration (block-item path) likewise.
- parseEnumerator now uses parseAnyAttributeSpec, replacing the
ad-hoc inline handling.
Items inside [[…]] support all C23 forms:
- Plain identifier -> attribute_item { attributeName }
- Namespaced prefix::name -> { attributePrefix, attributeName }
- With argument list -> + attribute_argument_list (Pratt-parsed)
Tests: [[nodiscard]] on a function decl, [[gnu::pure]] namespaced,
[[deprecated("reason")]] with a string-literal argument, and a
[[deprecated]] applied to an enumerator inside an enum body.
The header of a for-loop is no longer captured as one opaque balanced
paren. for_controls now contains three typed slots, each populated
with the structured form of its expression or declaration:
  for_controls
    init: for_init { value: declaration | <expression-node> | (empty) }
    cond: for_cond { value: <expression-node> | (empty) }
    iter: for_iter { value: <expression-node> | (empty) }
init dispatches between declaration form (when the head is a
specifier, static_assert, or C23 [[…]]) and expression form,
consuming the trailing `;` in the expression form so subsequent
slots see the right boundary. The declaration form's terminating ';'
is part of the declaration node itself.
Empty slots (`for (;;)`) keep their `;`s as direct token children so
source fidelity is preserved while .value is undefined.
Tests: full `for (int i = 0; i < 10; i++)` declaration init form,
expression init form `for (i = 0; …)`, and the empty `for (;;)`
infinite-loop shape.
Build a 100-file regression corpus with the csmith random C program
generator (seeds 1..100). Each file's structured CST is captured as a
gzipped JSON fixture; the test suite re-parses the corpus and asserts
the result is byte-identical to the fixture.
Layout:
test/csmith-corpus/seed-NNN.c — csmith output, committed
test/csmith-fixtures/seed-NNN.json.gz — golden CST, committed
test/csmith-fixture.ts — fixture serializer
test/csmith-gen.ts — corpus + fixture generator CLI
test/csmith.test.ts — regression test runner
Approach:
1. csmith.h's stdint typedefs (int8_t, uint64_t, size_t, FILE, …) are
pre-registered in the parser via meta.cmeta before each parse,
since the parser doesn't expand `#include`. Without this, e.g.
`static int32_t g_2 = 6L;` would parse `int32_t` as the declared
name. With the pre-registration, every csmith program structures
cleanly with zero `unknown` declarations.
2. The fixture serializer (toFixture) walks `kind`, `children`, and a
stable whitelist of scalar metadata. Convenience cross-references
that the parser exposes for ergonomic access (.left/.right/.target/
.value when it points to a node, etc.) are dropped to avoid
duplication, which would otherwise blow up JSON.stringify
exponentially. Trivia tokens (block/line comments, line
continuations) are also dropped — csmith's preamble alone is a
1KB block comment per file.
3. The fixture is gzipped at level 9. Average size: ~70 KB per file,
so the 100-fixture suite takes ~7.7 MB. Plain JSON would be ~3 MB
per file; uncompressed cross-references would have been ~50 MB
per file.
4. The test harness has two assertions per seed:
- parse-cleanly: every external_declaration must structure (no
declKind === 'unknown'),
- fixture-match: re-running the parser yields byte-identical
fixture JSON to what's committed.
Regenerating after a deliberate parser change:
npx tsc --build src test
node dist-test/csmith-gen.js fixtures # rebuilds *.json.gz
Result: 285 tests, 200 from csmith (100 parse + 100 fixture), 85 from
the existing unit suite. All pass.
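The serializer idea from step 2 can be sketched as follows. The whitelist contents, the trivia kind names, and the `CstNode` shape are all assumptions made for illustration; the real toFixture lives in test/csmith-fixture.ts and may differ. Gzip compression is left out.

```typescript
// Assumed node shape: `kind`, `children`, plus arbitrary metadata.
interface CstNode {
  kind: string;
  children?: CstNode[];
  [extra: string]: unknown;
}

// Hypothetical whitelist of scalar metadata and hypothetical trivia kinds.
const SCALAR_KEYS = ["declKind", "declaredName", "op", "value"];
const TRIVIA_KINDS = new Set(["comment", "line_continuation"]);

function toFixture(node: CstNode): Record<string, unknown> {
  const out: Record<string, unknown> = { kind: node.kind };
  for (const key of SCALAR_KEYS) {
    const v = node[key];
    // Keep scalars only; a whitelisted key that points at a node is dropped.
    if (typeof v === "string" || typeof v === "number" || typeof v === "boolean") {
      out[key] = v;
    }
  }
  if (node.children) {
    out["children"] = node.children
      .filter((c) => !TRIVIA_KINDS.has(c.kind))
      .map(toFixture);
  }
  return out;
}

const lhs: CstNode = { kind: "literal_expression", value: "1", children: [] };
const rhs: CstNode = { kind: "literal_expression", value: "2", children: [] };
const sum: CstNode = {
  kind: "binary_expression",
  op: "+",
  left: lhs,  // convenience cross-reference, duplicates children[0]
  right: rhs, // convenience cross-reference, duplicates children[2]
  children: [lhs, { kind: "comment" }, rhs],
};
const fixtureJson = JSON.stringify(toFixture(sum));
```

Each operand is serialized once, through `children`, so the output grows linearly with the tree instead of repeating every subtree per cross-reference.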
Split test/csmith-gen.ts into:
test/csmith-common.ts — pure helpers (STDINT_TYPEDEFS,
parseCsmithSource, path constants).
Importing has zero side effects.
test/csmith-gen.ts — CLI only: imports common + csmith-fixture,
runs corpus/fixture generation guarded by
require.main === module.
test/csmith.test.ts now imports from csmith-common, never from
csmith-gen. The test runner therefore never reaches code that calls
execSync('csmith ...') or mkdirSync at module-load time.
Verified by removing the csmith binary from PATH and rerunning the
full suite: 285/285 pass. Restoring the binary still allows
`node dist-test/csmith-gen.js all` to regenerate the corpus.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 890bf3b989
@@ -22,16 +22,19 @@
"watch": "tsc --build src test -w",
"build": "node embed-grammar.js && tsc --build src test",
Remove deleted embed step from build command
The build script still invokes node embed-grammar.js, but this commit deletes embed-grammar.js, so npm run build fails immediately with MODULE_NOT_FOUND on a clean checkout and the package cannot be compiled or tested through the documented workflow. Please either restore the script or remove this step from build/embed scripts.
The earlier slice-1 cleanup removed embed-grammar.js (the JSONC plugin embedded its grammar at build time; the C plugin doesn't need that step) and the go/ directory, but two surfaces still referenced them:
- package.json's build/embed scripts ran `node embed-grammar.js`, which fails on a fresh checkout with MODULE_NOT_FOUND.
- .github/workflows/build.yml had a build-go job that ran `go build ./...` in a directory that no longer exists.
Strip both. The build script is now simply `tsc --build src test`.
Verified by running npm clean / install / build / test on a fresh node_modules: 285/285 still pass.
Windows CI was failing for seed 100's fixture-match because git's autocrlf rewrites the .c corpus to CRLF on checkout, while the fixtures (built on Linux) encode the parser's output for LF source. The resulting divergence in the token .src strings then propagates into the fixture-byte comparison.
Two fixes, applied together:
- .gitattributes pins .c (and other text files) to `text eol=lf`, so future Windows checkouts keep LF regardless of core.autocrlf.
- normaliseEol() in test/csmith.test.ts collapses any \r\n / \r sequences in the corpus to \n before parsing. Belt-and-suspenders: if a Windows clone slipped past the gitattributes (e.g. cloned before this commit landed), the test still passes.
Verified locally by injecting CRLF into seed-001.c and rerunning: both `parse seed 001` and `fixture seed 001` still pass.
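A normalisation helper of the kind described can be as small as the sketch below. The function name comes from the commit message, but the body is an assumption about how it might be written.

```typescript
// Collapse CRLF and lone CR line endings to LF before parsing.
function normaliseEol(src: string): string {
  return src.replace(/\r\n?/g, "\n");
}

const crlfSource = "int a;\r\nint b;\rint c;\n";
const lfSource = normaliseEol(crlfSource);
// lfSource === "int a;\nint b;\nint c;\n"
```

The pattern `\r\n?` matches a CR with an optional following LF, so a CRLF pair becomes a single `\n` rather than two.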
This PR replaces the JSONC (JSON with Comments) plugin with a new C language parser plugin for Jsonic.
Summary
The repository has been transformed from a JSONC parser to a comprehensive C language parser. This includes a complete lexer, parser, and AST builder that handles C23 syntax with support for compiler extensions and macros.
Key Changes
New C Parser Implementation:
- src/c.ts - Main plugin entry point that integrates the C parser with Jsonic
- src/structure.ts - Post-processing pass that converts flat token lists into a structured concrete syntax tree using recursive-descent parsing
- src/expr.ts - Pratt-style expression parser handling all C operator precedence levels (C23 §6.5)
- src/matchers.ts - Focused lexer matchers for C tokens (whitespace, comments, preprocessor directives, identifiers, literals, punctuators)
- src/tokens.ts - Token name catalog for all C tokens (keywords, punctuators, literals)
- src/symbols.ts - Symbol and macro tables for resolving the identifier/typedef-name disambiguation problem
Removed JSONC Implementation:
- src/jsonc.ts, go/jsonc.go, go/jsonc_test.go
Configuration Updates:
- package.json to reflect C parser package (@jsonic/c)
- README.md with C parser documentation
- Makefile with C-specific build targets
- .gitattributes for binary test fixture handling
Implementation Details
The C parser uses a two-pass approach:
Key features:
https://claude.ai/code/session_01Qjw28F24FXYwmDtUHnB4Gr