Compiler Design

CLR(1) & LALR Parsing


LR(1) Items & Lookaheads

To build more powerful bottom-up parsers, we include a lookahead symbol inside each parsing item. An LR(1) item is written as:

[A → α • β, a]

Where A → αβ is a production, and a is a terminal symbol representing the lookahead. The lookahead acts as a gatekeeper: the parser is allowed to reduce A → α if and only if the next input terminal is exactly a.

CLR(1) vs LALR(1) Parsing

Both CLR(1) and LALR(1) use LR(1) items, but they trade off memory for power in different ways:

  • Canonical LR (CLR/LR(1)): Keeps every set of LR(1) items as a distinct state. This is the most powerful parsing method but creates a massive number of states, making tables very large.
  • Look-Ahead LR (LALR(1)): Merges CLR(1) states that have the same core (the same LR(0) items, ignoring the lookaheads). The lookaheads of the merged states are combined (unioned). This retains the small state count of SLR while providing almost the same power as CLR.

How to compute the lookahead in CLOSURE(I):

If [A → α • B β, a] is in the set of items, then for each production B → γ, we add the item [B → • γ, b] to the set for every terminal b in:

FIRST(βa)

Where β is the sequence of symbols following B, and a is the lookahead of the parent item.

PropertyLR(0)SLRLALR(1)CLR(1)
LookaheadNoneFOLLOW(A)Merged LR(1)Exact lookahead
States CountFewestSame as LR(0)Same as LR(0)Most (10x more)
Parsing PowerWeakestMedium-LowMedium-HighStrongest
ConflictsManySomeFewZero false conflicts
Real-world UseNoneRareBison/Yacc standardResearch/LLVM

Containment Hierarchy

The relationships between grammar families parsed by bottom-up methods follow a strict containment hierarchy:

LR(0) ⊂ SLR ⊂ LALR ⊂ CLR(1)

Every LR(0) grammar is also SLR, every SLR grammar is also LALR, and every LALR grammar is also CLR. However, adding more lookahead symbols beyond 1 (e.g. LR(2)) does not increase the class of languages that can be parsed; all deterministic context-free languages can be parsed using just LR(1).