Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Lexical parsing C is simple, except that typedef's technically make it non-context-free. See https://en.wikipedia.org/wiki/Lexer_hack When handwriting a parser, it's no big deal, but it's often a stumbling block for parser generators or other formal approaches. Though, I recall there's a PEG-based parser for C99/C11 floating around that was supposed to be compliant. But I'm having trouble finding a link, and maybe it was using something like LPeg, which has features beyond pure PEG that help with context dependent parsing.


Clang's solution (presented at the end of the Wikipedia article you linked) seem much better - just use a single lexical token for both types and variables.

Then, only the parser needs to be context sensitive, for the A* B; construct which is either a no-op multiplication (if A is a variable) or a variable declaration of a pointer type (if A is a type)


Well, as you see this is inherently taking the spirit of GLL/GLR parser -- defer parse until we have all the information. The academic solution to this is not to do it on token level but introduce a parse tree that is "forkable", meaning a new persistent data structure is needed to "compress" the tree when we have different routes, and that thing is called: graph structured stack (https://en.wikipedia.org/wiki/Graph-structured_stack)


I think you're referring to this one: https://github.com/jhjourdan/C11parser


What I had specifically in mind definitely wasn't using OCaml or Menhir, but that's a very useful resource, as is the associated paper, "A simple, possibly correct LR parser for C11", https://jhjourdan.mketjh.fr/pdf/jourdan2017simple.pdf

This is closer to what I remember, but I'm not convinced it's what I had in mind, either: https://github.com/edubart/lpegrex/blob/main/parsers/c11.lua It uses LPeg's match-time capture feature (not a pure PEG construct) to dynamically memorize typedef's and condition subsequent matches. In fact, it's effectively identical to what C11Parser is doing, down to the two dynamically invoked helper functions: declare_typedefname/is_typedefname vs set_typedef/is_typedef. C11Parser and the paper are older, so maybe the lpegrex parser is derivative. (And probably what I had in mind, if not lpegrex, was derivative, too.)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: