Compiler Design

Structure & Phases of a Compiler


Modern production compilers are divided into logical stages called Phases. Conceptualizing these phases helps modularize compiler development into two primary divisions: the Front End (Analysis) and the Back End (Synthesis).

Front End vs Back End

  • Front End (Analysis): Scans the source code, validates structure, performs semantic analysis, and compiles it into a machine-independent Intermediate Representation (IR).
  • Back End (Synthesis): Optimizes the machine-independent IR, performs register allocation, and generates machine-specific assembly or binary code.

The Six Phases of a Compiler

Every standard compiler processes source code through the following pipeline:

PhaseInput RepresentationOutput Representation
1. Lexical Analysis (Scanner)Character Stream (Source Code)Token Stream
2. Syntax Analysis (Parser)Token StreamParse Tree / Abstract Syntax Tree (AST)
3. Semantic AnalysisAbstract Syntax Tree (AST)Annotated AST (Typed)
4. Intermediate Code GenerationAnnotated ASTIntermediate Representation (IR / TAC)
5. Code OptimizationIntermediate Representation (IR)Optimized IR
6. Code GenerationOptimized IRTarget Assembly / Machine Code

Global Support Structures

Two major system utilities run parallel to all six compiler phases:

  • Symbol Table: A hash table or stack of scopes that stores identifiers (variables, functions) and their types, scopes, and memory references.
  • Error Handler: Manages diagnostic warnings and recovery routines, allowing compilation to continue and detect multiple errors in a single execution.

Deep Walkthrough: Compiling result = a + b * 2;

Let's trace how a single line of C code travels through the six phases of compilation:

The scanner reads individual characters and groups them into Tokens, discarding whitespace and comments.

Input: result = a + b * 2;

Tokens Produced:
1. <IDENTIFIER, "result">
2. <ASSIGN_OP, "=">
3. <IDENTIFIER, "a">
4. <ADD_OP, "+">
5. <IDENTIFIER, "b">
6. <MUL_OP, "*">
7. <INTEGER_LITERAL, "2">
8. <SEMICOLON, ";">

The parser reads the token stream and verifies syntactic correctness against a context-free grammar (CFG). It outputs an Abstract Syntax Tree (AST):

         ASSIGN
        /      \
    result    ADD_OP
             /     \
            a    MUL_OP
                 /    \
                b      2

Note: Multiplication (*) resides deeper in the tree than addition (+) reflecting its higher operator precedence.

Verifies logical meaning, type consistency, and variable scoping. For example, if a and b are declared as int, it checks if adding them is valid. The output is an Annotated AST:

         ASSIGN [type: int]
        /              \
  result [int]    ADD [type: int]
                 /         \
              a [int]    MUL [type: int]
                         /          \
                      b [int]     2 [int]

Transforms the annotated tree into a machine-independent code format. In this case, Three-Address Code (TAC):

t1 = b * 2
t2 = a + t1
result = t2

Simplifies instructions to improve execution speed. Here, Copy Propagation eliminates the redundant assignment of t2 to result:

// Before Optimization:
t1 = b * 2
t2 = a + t1
result = t2

// After Copy Propagation:
t1 = b * 2
result = a + t1

Translates optimized IR to machine-specific CPU assembly instructions. Here is an x86 assembly implementation:

MOV EAX, [b]   ; Load value of variable 'b' into register EAX
IMUL EAX, 2    ; Multiply EAX by 2 (EAX = b * 2)
ADD EAX, [a]   ; Add value of variable 'a' (EAX = a + b * 2)
MOV [result], EAX ; Store register contents into 'result' memory address