Lexical parsing C is simple, except that typedef's technically make it non-conte...

nextaccountic · 2026-02-09T06:19:11 1770617951

Clang's solution (presented at the end of the Wikipedia article you linked) seems much better - just use a single lexical token for both types and variables.

Then only the parser needs to be context-sensitive, for the A* B; construct, which is either a no-op multiplication (if A is a variable) or a declaration of a pointer variable (if A is a type).
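To make that concrete, here is a minimal sketch in Python (not Clang's actual code; lex, classify and type_names are made-up names for illustration). The lexer emits the same IDENT token for every name, and only the parser step consults a set of known type names to decide what A* B; means.

```python
import re

def lex(src):
    """Split source into (kind, text) tokens; every name is a plain IDENT."""
    tokens = []
    for text in re.findall(r"[A-Za-z_]\w*|[*;]", src):
        kind = "IDENT" if (text[0].isalpha() or text[0] == "_") else text
        tokens.append((kind, text))
    return tokens

def classify(tokens, type_names):
    """Decide what `A * B ;` means, given the type names the parser knows."""
    kinds = [kind for kind, _ in tokens]
    if kinds != ["IDENT", "*", "IDENT", ";"]:
        raise ValueError("this sketch only handles the `A * B;` shape")
    a, b = tokens[0][1], tokens[2][1]
    if a in type_names:
        return f"declaration: {b} is a pointer to {a}"
    return f"expression: {a} multiplied by {b}, result discarded"
```

Calling classify(lex("A* B;"), {"A"}) takes the declaration branch, while classify(lex("A* B;"), set()) takes the expression branch. The token stream is identical in both cases, which is the whole point.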

stevefan1999 · 2026-02-09T08:00:20 1770624020

Well, as you can see, this is inherently in the spirit of a GLL/GLR parser: defer the parse decision until we have all the information. The academic solution is not to do it at the token level but to introduce a parse tree that is "forkable", which requires a persistent data structure to "compress" the tree when there are different routes. That structure is called a graph-structured stack (https://en.wikipedia.org/wiki/Graph-structured_stack).
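A toy sketch of the sharing idea in Python, assuming nothing beyond the description above (Node and stacks are hypothetical names, and a real GLR parser stores parser states and does reductions on this graph, which is omitted here). Two readings of A* B; fork from one shared bottom node and merge again at one shared top node, so the common parts are stored once.

```python
class Node:
    """One stack cell; `preds` lists every cell that may sit directly below it."""

    def __init__(self, label, preds=()):
        self.label = label
        self.preds = list(preds)

def stacks(node):
    """Enumerate every linear stack (bottom to top) that ends at `node`."""
    if not node.preds:
        return [[node.label]]
    return [s + [node.label] for p in node.preds for s in stacks(p)]

bottom = Node("start")
as_decl = Node("decl: A *B", [bottom])    # reading 1: A is a type
as_expr = Node("expr: A * B", [bottom])   # reading 2: A is a variable
merged = Node(";", [as_decl, as_expr])    # both readings share the same top
```

Here stacks(merged) expands back into the two separate stacks, even though the graph holds only four nodes.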

mahmoudimus · 2026-02-09T07:49:44 1770623384

I think you're referring to this one: https://github.com/jhjourdan/C11parser

wahern · 2026-02-09T13:06:54 1770642414

What I had specifically in mind definitely wasn't using OCaml or Menhir, but that's a very useful resource, as is the associated paper, "A simple, possibly correct LR parser for C11", https://jhjourdan.mketjh.fr/pdf/jourdan2017simple.pdf

This is closer to what I remember, but I'm not convinced it's what I had in mind, either: https://github.com/edubart/lpegrex/blob/main/parsers/c11.lua It uses LPeg's match-time capture feature (not a pure PEG construct) to dynamically record typedef names and condition subsequent matches on them. In fact, it's effectively identical to what C11Parser is doing, down to the two dynamically invoked helper functions: declare_typedefname/is_typedefname vs set_typedef/is_typedef. C11Parser and the paper are older, so maybe the lpegrex parser is derivative. (And probably what I had in mind, if not lpegrex, was derivative, too.)
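For anyone who hasn't looked at either parser, the two-helper pattern can be sketched roughly like this in Python. This is not code from C11parser or lpegrex; the helper names are borrowed from the pair above, and the scope handling (push_scope, pop_scope, the is_type flag) is my own assumption about what a minimal version needs.

```python
class TypedefTable:
    """Scoped record of which identifiers currently name a type."""

    def __init__(self):
        self.scopes = [{}]  # innermost scope is last

    def push_scope(self):
        self.scopes.append({})

    def pop_scope(self):
        self.scopes.pop()

    def set_typedef(self, name, is_type=True):
        # is_type=False models a variable declaration shadowing an outer typedef
        self.scopes[-1][name] = is_type

    def is_typedef(self, name):
        # The innermost declaration of the name wins
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return False

table = TypedefTable()
table.set_typedef("A")            # after parsing: typedef int A;
table.push_scope()
table.set_typedef("A", False)     # inner block declares: int A;
inner = table.is_typedef("A")     # the variable shadows the typedef here
table.pop_scope()
outer = table.is_typedef("A")     # the typedef is visible again
```

The parser calls set_typedef as a side effect of matching a declaration, and calls is_typedef as a predicate before committing to the type-name rule, which is why neither grammar is a pure CFG or pure PEG.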