Compiler Design
Structure & Phases of a Compiler
Modern production compilers are divided into logical stages called Phases. Conceptualizing these phases helps modularize compiler development into two primary divisions: the Front End (Analysis) and the Back End (Synthesis).
Front End vs Back End
- Front End (Analysis): Scans the source code, validates structure, performs semantic analysis, and compiles it into a machine-independent Intermediate Representation (IR).
- Back End (Synthesis): Optimizes the machine-independent IR, performs register allocation, and generates machine-specific assembly or binary code.
The Six Phases of a Compiler
Every standard compiler processes source code through the following pipeline:
| Phase | Input Representation | Output Representation |
|---|---|---|
| 1. Lexical Analysis (Scanner) | Character Stream (Source Code) | Token Stream |
| 2. Syntax Analysis (Parser) | Token Stream | Parse Tree / Abstract Syntax Tree (AST) |
| 3. Semantic Analysis | Abstract Syntax Tree (AST) | Annotated AST (Typed) |
| 4. Intermediate Code Generation | Annotated AST | Intermediate Representation (IR / TAC) |
| 5. Code Optimization | Intermediate Representation (IR) | Optimized IR |
| 6. Code Generation | Optimized IR | Target Assembly / Machine Code |
Global Support Structures
Two major system utilities run parallel to all six compiler phases:
- Symbol Table: A hash table or stack of scopes that stores identifiers (variables, functions) and their types, scopes, and memory references.
- Error Handler: Manages diagnostic warnings and recovery routines, allowing compilation to continue and detect multiple errors in a single execution.
Deep Walkthrough: Compiling result = a + b * 2;
Let's trace how a single line of C code travels through the six phases of compilation:
The scanner reads individual characters and groups them into Tokens, discarding whitespace and comments.
Input: result = a + b * 2; Tokens Produced: 1. <IDENTIFIER, "result"> 2. <ASSIGN_OP, "="> 3. <IDENTIFIER, "a"> 4. <ADD_OP, "+"> 5. <IDENTIFIER, "b"> 6. <MUL_OP, "*"> 7. <INTEGER_LITERAL, "2"> 8. <SEMICOLON, ";">
The parser reads the token stream and verifies syntactic correctness against a context-free grammar (CFG). It outputs an Abstract Syntax Tree (AST):
ASSIGN
/ \
result ADD_OP
/ \
a MUL_OP
/ \
b 2Note: Multiplication (*) resides deeper in the tree than addition (+) reflecting its higher operator precedence.
Verifies logical meaning, type consistency, and variable scoping. For example, if a and b are declared as int, it checks if adding them is valid. The output is an Annotated AST:
ASSIGN [type: int]
/ \
result [int] ADD [type: int]
/ \
a [int] MUL [type: int]
/ \
b [int] 2 [int]Transforms the annotated tree into a machine-independent code format. In this case, Three-Address Code (TAC):
t1 = b * 2 t2 = a + t1 result = t2
Simplifies instructions to improve execution speed. Here, Copy Propagation eliminates the redundant assignment of t2 to result:
// Before Optimization: t1 = b * 2 t2 = a + t1 result = t2 // After Copy Propagation: t1 = b * 2 result = a + t1
Translates optimized IR to machine-specific CPU assembly instructions. Here is an x86 assembly implementation:
MOV EAX, [b] ; Load value of variable 'b' into register EAX IMUL EAX, 2 ; Multiply EAX by 2 (EAX = b * 2) ADD EAX, [a] ; Add value of variable 'a' (EAX = a + b * 2) MOV [result], EAX ; Store register contents into 'result' memory address