Compiler Design

Introduction to Syntax Analysis


Role of the Parser

In the compilation pipeline, the Syntax Analyzer (Parser) is the second phase. It takes the token stream generated by the Lexical Analyzer (Lexer) and checks whether the tokens form valid expressions according to the grammatical rules of the programming language.

Tokens Stream → [ Parser ] → Parse Tree / Syntax Tree
Interacts with Symbol Table & Reports Errors

Context-Free Grammars (CFGs)

A Context-Free Grammar is a formal mathematical system used to define the syntax of a language. It is defined as a 4-tuple:

G = (V, T, P, S)
  • V: A finite set of non-terminal variables (syntactic categories).
  • T: A finite set of terminal symbols (actual tokens/words).
  • P: A set of production rules mapping variables to strings of variables and terminals (e.g., A → α).
  • S: The start symbol (S ∈ V).

Derivations and Parse Trees

A Derivation is the sequence of production steps used to generate a target string from the start symbol. There are two primary types of derivations:

  • Leftmost Derivation (LMD): At each step, the leftmost non-terminal is replaced first.
  • Rightmost Derivation (RMD): At each step, the rightmost non-terminal is replaced first.

A grammar is ambiguous if there exists a string that has more than one leftmost derivation, rightmost derivation, or parse tree.

Given Grammar (Expression Grammar):

E → E + E | E * id | id

We want to trace the string: id + id * id.

Parse Tree 1: (id + id) * id

Evaluates addition first (leaves * at the top level).

      E
    / | \
   E  *  id
 / | \
E  +  E
|     |
id    id

LMD: E ⇒ E * id ⇒ E + E * id ⇒ id + E * id ⇒ id + id * id

Parse Tree 2: id + (id * id)

Evaluates multiplication first (leaves + at the top level).

      E
    / | \
   id +  E
       / | \
      E  *  id
      |
      id

LMD: E ⇒ E + E ⇒ id + E ⇒ id + E * id ⇒ id + id * id

Conclusion: Since the string has two distinct parse trees representing different operator precedences, the grammar is ambiguous.

Elimination of Left Recursion

Top-down parsers (like LL(1)) enter infinite loops when a grammar has Left Recursion (where a production looks like A → Aα). We must eliminate this syntax pattern.

Direct Left Recursion Rule:

Given a left-recursive production:

A → Aα | β

Rewrite it as:

A → βA' A' → αA' | ε

Elimination of Left Factoring

Left Factoring is a grammar transformation useful for resolving choices when two alternative productions start with the same prefix. It allows top-down parsers to defer the decision until they see what comes after the shared prefix.

Left Factoring Rule:

Given productions with a common prefix:

A → αβ_1 | αβ_2

Factor out the prefix:

A → αA' A' → β_1 | β_2