This is SeeCompiler, a readable and modular C compiler that translates a verified subset of the C programming language into a sequence of x86 assembly instructions. Designed for correctness, clarity and traceability, the compiler exposes each major compilation step with beginner-friendly source code and debug-friendly output.
At the core of SeeCompiler is driver.c
, the management abstraction layer that runs the full compilation pipeline in a linear and inspectable process. This makes the project ideal for education, experimentation, and systems-level programming exploration.
Internally, driver.c
calls each compiler subroutine in sequence, validating results before advancing to the next procedure. We use clear diagnostic messages and structured output to aid learning and debugging. Each major data structure (tokens, AST, IR) is printed to the console, allowing users to see the transformation pipeline from source to assembly. Memory is fully freed after compilation, ensuring stability and leak-free operations across repeated runs.
Firstly, the lexer begans this compilation pipeline, which is housed in lexer.c
. The purposw of this module is to convert raw C source code into a stream of tokens, each annotated with a category and lexeme (a literal representation).
Internally, lexer.c
has the functionality to support tokenization:
- Primitive types (
int
,char
,bool
,string
, etc.) - Control flow keywords (
if
,else
,while
,for
, etc.) - Literals: integers, characters, strings
- Symbols: both single-character and multi-character (e.g.,
==
,++
,->
) - Comments: skips both
//
and/* ... */
formats - Identifiers: variable/function names, respecting C conventions are all handled and analyzed by the module.
Tokens are dynamically allocated and returned via the tokenize()
function, which also emits a final TOKEN_EOF
to signal the end of input. A corresponding free_tokens()
function is provided to release all memory used by the token stream.
Next, the parser is the second step of our compilation, converting a stream of tokens into a tree-structured representation of the source code’s grammar: the Abstract Syntax Tree (AST). The code of which is contained in parser.c
.
Internally, the parser is powered by a Parser
struct that maintains a token stream and cursor. It uses:
match()
andadvance()
utilities for stream controlparse_expression_with_precedence()
for left-to-right operator parsing- Specialized routines for each syntactic construct (
parse_if()
,parse_while()
,parse_return()
, etc.)
This module transforms a flat token list into a deeply structured representation that guides semantic checks and backend code generation by implementing a recursive descent parser with full support for: Variable declarations and assignments
- Arithmetic and logical expressions with operator precedence
- Control flow: if, else, while, for
- Function calls and return statements.
The unit outputs an ASTNode tree rooted at the program’s main function and organizes ASTs using left/right children and statement lists. All statements are parsed into a unified AST_STATEMENT_LIST
form to simplify traversal in later processes.
The semantic analyzer is the third stage of compilation in semantic.c
, responsible for enforcing language rules beyond syntax. It validates that:
- All variables are declared before use
- No redeclarations exist in the same scope
- Expressions are well-formed
- Return statements are correctly structured
- Control flow conditions are semantically valid.
Internally, this layer runs a simple compile-time execution environment, performing symbol tracking, scope validation, and structural checks of statements and expressions.
Symbol Table Management:
- Maintains a linked-list-based symbol table for tracking variable declarations
- Implements
add_symbol()
,symbol_exists()
, andclear_symbols()
for symbol lifecycle - Uses a single-scope environment (flat, no shadowing), suitable for validating simple
main
-only programs.
Expression Validation:
- Checks that identifiers are declared before use (
AST_IDENTIFIER
) - Recursively validates binary expressions (
AST_BINARY_OP
) and function calls - Reports semantic errors for undeclared variables and malformed expressions.
Statement-Level Semantics:
AST_DECLARATION
: ensures identifiers are unique and recorded in the symbol tableAST_ASSIGNMENT
: ensures LHS is a declared identifier and RHS is a valid expressionAST_RETURN_STMT
: requires a return value and checks that it’s semantically validAST_STATEMENT_LIST
: flattens and validates sequential statements recursively.
Control Flow Validation:
AST_IF_STMT
: verifies condition expression and both branches (then/else)AST_WHILE_LOOP
: verifies condition and loop bodyAST_FOR_LOOP
: validates initializer (stmt
), condition (expr
), step (stmt
) in structured order.
Function-Level Verification:
- Entry point must be a function with
type == AST_FUNCTION_DEF
andvalue == "main"
- Entire function body is validated as a statement list.
Error Reporting:
- Use-before-declaration
- Redeclarations
- Missing return values
- Invalid constructs in branches or expressions.
AST Traversal Strategy:
- The analysis is top-down and recursive, using
switch
dispatch on AST node types - Expression analysis is composed within statement analysis, composing semantics hierarchically.
Next, the IR generation phase serves as a lowering step, translating validated AST nodes into a flat, linear sequence of instructions. This IR abstracts away high-level language syntax into stack-based, pseudo-assembly operations, enabling backend code generation to be cleanly decoupled from AST shape.
Internally, this module converts the AST into a linear, stack-based Intermediate Representation (IR)—a simplified pseudo-assembly language that decouples high-level syntax from backend code generation.
Translates each AST Node into a chain of IRInstr structs (type, arg, next) i.e. it supports core operations:
- Memory:
IR_DECL
,IR_LOAD
,IR_STORE
- Literals & Stack:
IR_PUSH
- Arithmetic:
IR_ADD
,IR_SUB
,IR_MUL
,IR_DIV
- Control Flow:
IR_LABEL
,IR_JUMP
,IR_JUMP_IF_ZERO
,IR_CMP
- Functions:
IR_CALL
,IR_RET
.
High-level flow (e.g., if, while, for) is lowered into labeled jumps:
if
emits conditional branches toelse_LABEL
/endif_LABEL
while
emits looping conditions withJUMP_IF_ZERO
for
handles init/test/step with generated blocks and labels.
Lastly, the x86 code generation module emits Intel-style x86-64 assembly from the compiler’s intermediate representation (IR). It uses a stack-based model and generates correct function prologues, epilogues, and instruction sequences for expressions, memory access, and control flow.
Internally, this module converts the linear IR instruction stream into x86-64 assembly instructions, using a stack-based calling convention and Intel syntax by handling:
Variable Layout
- Maintains a linked
VarEntry
table mapping variable names torbp
-relative offsets. - Offsets are aligned at 8 bytes, growing downward from
rbp
.
Code Structure
- Emits standard function prologue and epilogue around main.
Instruction Emission
- Each IR type (
IR_PUSH
,IR_ADD
,IR_LOAD
, etc.) maps to corresponding assembly snippets. - Binary ops (
+
,-
,*
,/
) are translated into stack-based computations usingrax
/rbx
. - Control flow (e.g.,
if
,while
,for
) is handled via jump labels and conditional branches. - Strings are stored in
.rodata
, labeled.LCn
, and referenced usinglea rax, label
.
String Handling
- Escapes quotes, backslashes, and newlines
- Allocates each unique string as
.string
in a separate.rodata
section
Prerequisites
- C99-compatible compiler (e.g.,
clang
,gcc
) - Unix-like environment (Linux/macOS recommended)
make
utility
To build the compiler, simply run:
make