neoast

A modern AST grammar lexer/parser generator written in C. Neoast takes arbitrary lexing and grammar rules and generates a scanner and parsing table. Unlike Bison/Flex, neoast does not generate the scanner and parsing runtime for each input; it uses the same algorithm everywhere and links against a static runtime library. This enables many optimizations compared to Bison.

Usage

Neoast ships with a CMakeLists.txt script that builds all of the dependencies of the lexing and parsing runtime. If you already have a CMake project, you can add this project as a sub-project:

add_subdirectory(neoast EXCLUDE_FROM_ALL)

Now you will have access to a macro to build a parser given an input file:

BuildParser(parser_name path/to/my/parser.y)

# Add the parser source code to an executable
add_executable(exec
        main.c ${parser_name_OUTPUT})

# Link the runtime library
target_link_libraries(exec PRIVATE neoast)

Examples

Check test/input/ for example input files.

What is a Lexer and Parser

A lexer is a program that breaks an input up into tokens. For example, if we have a simple C expression:

fprintf(stderr, "Hello world %s\n", "===");

A lexer may break up this expression into something like this:

IDENTIFIER '(' IDENTIFIER ',' STRING_CONSTANT ',' STRING_CONSTANT ')' ';'

This is what's known as a token table. Each token was matched by a regular expression and built using a lexing rule. Notice that the value corresponding to each token is not indicated in the token table. Values are stored in a separate table known as the value table: an array of %union objects. See the %union section for more information.
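
To make the two-table idea concrete, here is a minimal sketch of how the example above could sit in memory. The token constants, the NeoastValue layout, and the field names are assumptions for illustration, not neoast's real internal representation:

// Hypothetical parallel token/value tables for:
//   fprintf(stderr, "Hello world %s\n", "===");
typedef union {
    char* identifier;
    char* string_constant;
} NeoastValue;

enum { IDENTIFIER = 256, STRING_CONSTANT };

static const unsigned tokens[] = {
    IDENTIFIER, '(', IDENTIFIER, ',', STRING_CONSTANT, ',', STRING_CONSTANT, ')', ';'
};

// values[i] holds the semantic value for tokens[i]; untyped
// tokens such as '(' simply leave their slot unused.
static const NeoastValue values[] = {
    {.identifier = "fprintf"}, {0}, {.identifier = "stderr"}, {0},
    {.string_constant = "Hello world %s\n"}, {0}, {.string_constant = "==="}, {0}, {0}
};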

After an input string/buffer has been tokenized with the lexer, it's time for the parser to reduce sequences of tokens into expressions. This is done through grammar rules.

Neoast can generate LALR(1) and CLR(1) parsing tables and uses the LR parsing algorithm to reduce grammar rules. If you don't know what this means, it's not super important. By default, neoast will generate an LALR(1) parser, but this can be changed via the parser_type %option.

Lexer/Parser files

Neoast can read input files and generate a C source file based on their contents. This source file will initialize a parser with compile-time constants such as an LR parsing table and lexing rules.

There are three distinct sections to a parser input file:

[HEADER]

==
[LEXING RULES]
==

%%
[GRAMMAR RULES]
%%

Header section

The header section is where all of the metadata and declarations occur. Here you will define the value %union as well as %tokens and %types. The header section also takes %options that control different features of neoast code generation. Below we discuss each header switch that is parsed by neoast.

%top and %bottom

%top is where we define C source code that will be placed at the top of the generated C file. Usually header #includes will go here. For example:

%top {
    #include <stdio.h> // printf()
    #include <math.h>  // pow()
}

Similarly, %bottom defines source code that will be placed at the bottom of the generated file. Here you'll probably place extra functions that depend on generated functions and therefore need to appear after those are declared.
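
For symmetry with the %top example, here is a minimal %bottom sketch; the helper is hypothetical and only illustrates code that must come after the generated declarations:

%bottom {
    // Emitted after the generated code, so helpers here may freely
    // reference functions neoast generates above this point
    // (assumes <stdio.h> was included in %top)
    static void print_usage(FILE* out)
    {
        fprintf(out, "usage: calc <expression>\n");
    }
}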

%union

%union is the most important switch in the input file. It defines the data type that will store the Abstract Syntax Tree (AST) produced by the parser. During lexing and parsing, the programmer will have access to objects of this union type. The different fields in the union are used to give types to %token and %type declarations.

%union {
    char* identifier;
    struct {
        char* key;
        char* value;
    } key_val;
}

%token and %type

%token and %type define the lexing and parsing tokens respectively. %token is used for anything found during lexing, including ASCII characters. %type is used to define an expression that the parser can reduce to. More on this later.

Here is an example of the valid syntax to use with these switches:

// Token declarations
%token MY_TOKEN_NAME // A simple token
%token ';'  // ascii token (no unicode allowed, define a simple token for this)
%token <my_union_field> MY_TYPED_TOKEN  // typed token

// Type declarations
%type <my_union_field> MY_TYPE

Notice the special <field_name> syntax used to select the type of a %token or %type. A %type must include a type, whereas a %token does not necessarily need one. The field name must be a valid field in the %union. When the token/type is used in a grammar rule, its union field will be used. This lets the programmer easily declare different expressions and tokens with types while the runtime library never has to worry about these types during parsing.

%destructor

A destructor will tell the parser how to free a union field if it runs into a syntax error. For example, if you allocate a string during lexing and place the pointer in the value table, you may want to free this memory if the parser runs into an error. Here's an example:

%union {
    char* identifier;
    struct {
        char* key;
        char* value;
    } key_val;
};

%destructor <identifier> {free($$);}
%destructor <key_val> {free($$.key); free($$.value);}

Note: $$ denotes the value in the value table.

%left and %right

These switches are used to resolve SR conflicts. See below.

Regex macros

Regex macros are used to simplify a programmer's life when defining lexing rules. They are similar to C macros in that they expand at compile time. Here's an example:

+letter            A-Za-z
+digit             0-9
+identifier        [{letter}_][{letter}{digit}_]*

Notice that a macro does not itself need to be valid regex; however, once it is fully expanded inside a lexer rule's regular expression, the result must be valid regex. For example, {identifier} above expands to the valid regex [A-Za-z_][A-Za-z0-9_]*.

Lexer section

In this section you will define the regular expressions used to match different tokens. Neoast uses Google's RE2 as its regular expression engine. This means it is limited to RE2's syntax and will fail to compile if you attempt to use an invalid regular expression. Here is an example of a lexer rule:

"[A-Z_a-z][A-Za-z_0-9]*"             {yyval->identifier = strdup(yytext); return IDENTIFIER;}

// Or we can use the macro defined above
"{identifier}"                       {yyval->identifier = strdup(yytext); return IDENTIFIER;}

Note: All regular expressions in the lexer section must be inside a C-string with double quotes.

This is a common declaration for matching an identifier in most languages. Here we are saying that every identifier must start with a letter or underscore, followed by any combination of letters, underscores, and digits. The bracketed statement to the right of the regular expression is known as an action. A lexer action has a couple of local variables associated with it. Here's how a lexer action is defined:

static int32_t
ll_rule_LEX_STATE_DEFAULT_01(const char* yytext, NeoastValue* yyval,
                             uint32_t len, ParsingStack* lex_state)
{
    // These are purely used to ignore compiler warnings for
    // 'unused variables'
    (void) yytext;
    (void) yyval;
    (void) len;
    (void) lex_state;

    {yyval->identifier = strdup(yytext); return IDENTIFIER;}
    return -1;
}
Local Name   Purpose
yytext       Raw text matched by the regular expression (constant).
yyval        A pointer to this token's value in the value table. NeoastValue is the union generated by %union.
len          Same as strlen(yytext). Use this for optimizations because the string's length is already known.
lex_state    Push to or pop from this stack to go to a different lexing state. See Lexing states for more details.

Notice that if the defined action does not return, the function falls through and returns -1. -1 is a special token that tells the lexer to skip this block of text. You'll most likely use this when defining rules for whitespace such as [ \n\t]+, as in the rule below.
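
For example, a rule that matches whitespace and deliberately never returns:

"[ \n\t]+"          {/* no return: whitespace is skipped */}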

Lexing states

There are situations where you may want to generate certain tokens only at certain points. For example, when matching a brace, you could do something like this:

%top {
    static int brace_counter = 0;
    static void* brace_buffer = NULL;
}

%state BRACE_MODE // define the state
%token <braced_content> BRACED_CONTENT

%union {
    void* braced_content;
}

==

"{"        {
                brace_buffer = buffer_new();
                add_to_buffer(brace_buffer, yytext);
                brace_counter = 1;
                NEOAST_STACK_PUSH(lex_state, BRACE_MODE); // only match regex in the BRACE_MODE state
                // skip token (no return)
           }

<BRACE_MODE>
{
    "{"     {brace_counter++; add_to_buffer(brace_buffer, yytext);}
    "}"     {
                brace_counter--;
                add_to_buffer(brace_buffer, yytext);
                
                if (brace_counter == 0)
                {
                    yyval->braced_content = brace_buffer;
                    NEOAST_STACK_POP(lex_state); // return to default state
                    return BRACED_CONTENT;
                }
            }
    "[^\}\{]+" {add_to_buffer(brace_buffer, yytext);}
}
==

This set of lexer rules matches a balanced set of braces and returns a BRACED_CONTENT token whose value is the text inside the braces. (Obviously you would need to define your own buffer mutators.)
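
For reference, here is one minimal sketch of what those buffer mutators could look like, assuming a simple growable string; the StrBuffer type and its growth policy are illustrative, not part of neoast:

#include <stdlib.h>
#include <string.h>

// Hypothetical growable string backing buffer_new()/add_to_buffer()
typedef struct {
    char* data;
    size_t len;
    size_t cap;
} StrBuffer;

static void* buffer_new(void)
{
    StrBuffer* b = malloc(sizeof(StrBuffer));
    b->cap = 64;
    b->len = 0;
    b->data = malloc(b->cap);
    b->data[0] = '\0';
    return b;
}

static void add_to_buffer(void* buf, const char* text)
{
    StrBuffer* b = buf;
    size_t n = strlen(text);
    while (b->len + n + 1 > b->cap)
    {
        b->cap *= 2;
        b->data = realloc(b->data, b->cap);
    }
    memcpy(b->data + b->len, text, n + 1); // also copies the NUL
    b->len += n;
}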

Grammar section

The grammar section defines the grammar rules used to reduce token sequences into expressions. Neoast uses the same syntax as Bison.

A grammar rule is just a sequence of tokens followed by an action:

%union {
    char* str;
    struct Expression {
        ...
        struct Expression* next;
    }* expression;
}

// Every grammar expression must have a corresponding %type declaration
%type <expression> function_call
%type <expression> argument_list
%type <expression> expression
%token <str> IDENTIFIER
%token '('
%token ')'
%token ','

%%
expression      : function_call                {$$ = $1;}
                | IDENTIFIER                   {$$ = build_identifier($1);}
                ;

function_call   :  expression '(' argument_list ')'  {$$ = build_function_call($1, $3);}
                ;

argument_list   : expression ',' argument_list       {$$ = $1; $$->next = $3;}
                | expression                         {$$ = $1; $$->next = NULL;}
                ;

%%

This grammar defines a very simple language where all you can do is call functions with arguments. Notice that the $[0-9]+ variables are available in grammar actions to get the value of a token from the value table. ONLY tokens with a defined type can be used here; if you try to use a token without a defined type, you'll get a compilation error.
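
The build_identifier() and build_function_call() helpers used above are ordinary user code (e.g. defined in %top or %bottom). Here is one hedged sketch of what they could look like; the kind, name, and args fields are assumptions layered on top of the struct Expression from the %union:

#include <stdlib.h>

// Hypothetical AST constructors for the grammar above
struct Expression {
    enum { EXPR_IDENTIFIER, EXPR_CALL } kind;
    char* name;              // identifier text or callee name
    struct Expression* args; // first argument of a call
    struct Expression* next; // next argument in an argument_list
};

static struct Expression* build_identifier(char* name)
{
    struct Expression* e = calloc(1, sizeof(*e));
    e->kind = EXPR_IDENTIFIER;
    e->name = name;
    return e;
}

static struct Expression* build_function_call(struct Expression* callee,
                                              struct Expression* args)
{
    struct Expression* e = calloc(1, sizeof(*e));
    e->kind = EXPR_CALL;
    e->name = callee->name; // assumes the callee is an identifier
    e->args = args;
    free(callee);
    return e;
}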

SR and RR conflicts

The parsing table is an array that defines a finite state machine which keeps track of parsing states while the parser is making sense of an input. There are three types of operations embedded in the parsing table: SHIFT, REDUCE, and GOTO.

A shift happens when the parser sees a token that it expects. The parser pushes this token onto the stack along with the state it is currently in (a state in the finite state machine).

A reduce operation happens when the parser has seen the entire token sequence of a grammar rule and is ready to reduce this sequence to the rule's %type token (function_call, for example).

After a reduce operation, the parser needs to know which state to proceed to; this is done via the GOTO operation.

OK, now that those concepts are explained, on to conflicts. A conflict occurs when a grammar rule is ambiguous and could be interpreted in multiple ways.

An SR (shift-reduce) conflict happens when both a shift and a reduce operation are valid for the same location in the parsing table. If that means nothing to you, take the example below:

// Simple addition grammar
expression : NUMBER
           | expression '+' expression
           ;

and the following input:

5 + 6 + 7

Now think of how a parser would reduce this:

// Option 1   (reduce)
(5 + 6) + 7

// Option 2   (shift)
5 + (6 + 7)

To solve these types of issues, neoast has %left and %right switches that determine operator associativity for ambiguous grammar rules.

// Reduce the expression before continuing
%left '+'

// OR
// We want option 2 instead of 1
%right '+'

The second type of conflict, the RR (reduce-reduce) conflict, is much rarer than the SR conflict. It happens when two rules can validly be reduced at the same time, as in the sketch below. This is simply an error in the grammar and cannot be fixed via a header switch.
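
As an illustrative sketch (not taken from the neoast examples), the grammar below has an RR conflict: after seeing IDENTIFIER, the parser cannot decide whether to reduce it to type_name or to variable:

// Both rules reduce the same token sequence, so the parser
// cannot choose between them: a reduce-reduce conflict
start      : type_name
           | variable
           ;

type_name  : IDENTIFIER
           ;

variable   : IDENTIFIER
           ;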

Benchmark

A benchmark was done for the AutoGentoo CPortage use case. Here we needed to parse REQUIRE_USE and DEPEND* expressions for Gentoo ebuilds. The original parser was written for Bison+Flex, so we can compare the results with the old parser. The parsing and lexing rules are exactly the same, and both parsers run in a single thread.

This benchmark was taken on a Thinkpad X1 Carbon 3rd Generation, i7-5600U @ 3.2 GHz running Gentoo Linux with a 5.4.38-gentoo kernel.

Benchmarks with and without optimizations were performed to show the benefit of optimization with neoast. The entire portage repository was parsed:

Parser       CFLAGS                    Time (s)
Bison+Flex   (none)                    10.50
Bison+Flex   -O2 -march=native -pipe   10.35
neoast       (none)                    7.13
neoast       -O2 -march=native -pipe   4.83

As you can see, neoast has a lot of optimizations that reduce runtime overhead compared to Bison+Flex. This is because neoast uses a parsing table instead of a VERY long chain of if statements. The advantage of neoast is that there is no runtime cost to having more tokens or grammar rules; they simply increase the size of the parsing table, which matters very little.

Contributing

If you want to contribute or submit a bug report/feature request, you are always welcome to do so.

Any contribution should be made via a pull request. When you open a pull request, please request a review from @kronos3.

I will not merge any PR that does not pass all of the CI tests, including the sanitizer tests (ASAN + UBSAN).

If you have any questions, you can always open an issue and mark it with the "question" tag.