Compiler Design

Compiler Writing Tools


Writing a compiler from scratch is a massive undertaking. To address this complexity, the CS community developed specialized tools called Compiler Writing Tools (or Compiler-Compilers) that automate the most mathematically complex parsing and scanning logic.

Key Automated Tools

Phase TargetedToolDescription
Lexical AnalysisLex / FlexGenerates O(n) DFA-based lexical scanners from regular expression rules.
Syntax AnalysisYacc / BisonGenerates LALR(1) parsers from Context-Free Grammars (CFG) in Backus-Naur Form (BNF).
Intermediate & BackendLLVMA modern library infrastructure for code optimization and hardware target translation.

1. LEX / FLEX

Lex reads a specification file (typically ending in .l) containing regular expressions, and produces a C source file containing a fast finite-automaton scanner function yylex().

A Lex specification file is split into three sections separated by %% delimiters:

%{
/* Section 1: C Declarations */
#include <stdio.h>
%}

/* Section 2: Definitions (regex patterns) */
digit   [0-9]
letter  [a-zA-Z]

%%
/* Section 3: Rules (Pattern & matching action) */
{digit}+    { printf("INTEGER: %s\n", yytext); }
{letter}+   { printf("WORD: %s\n", yytext); }
.           { /* ignore everything else */ }
%%

/* Section 4: User Subroutines */
int yywrap() { return 1; }

2. YACC / BISON

Yacc (Yet Another Compiler-Compiler) takes a context-free grammar specification (usually ending in .y) and generates an LALR(1) bottom-up parser function yyparse().

Like Lex, Yacc files use %% split markers for definitions, grammar rules, and helper C code:

%{
#include <stdio.h>
int yylex();
void yyerror(const char *s);
%}

%token NUMBER ID
%left '+' '-'
%left '*' '/'

%%
/* Grammar Rules */
expr : expr '+' expr   { $$ = $1 + $3; }
     | expr '*' expr   { $$ = $1 * $3; }
     | NUMBER          { $$ = $1; }
     ;
%%

void yyerror(const char *s) {
    fprintf(stderr, "Parsing error: %s\n", s);
}

LEX + YACC Integration

In practice, the lexical scanner generated by Lex acts as a helper to the parser generated by Yacc. The parser calls yylex() each time it needs a new token.

Source File
LEX Scanner (yylex)
Matches characters to match token patterns, returning Token Code and storing attribute in yylval
↓ Token Code
YACC Parser (yyparse)
Validates context-free grammar constraints and executes semantic block code actions
Abstract Syntax Tree (AST)

Modern Alternatives

ANTLR4

Generates LL(*) parsers supporting multiple target languages (Java, C++, Python, JavaScript) from unified .g4 grammar files.

LLVM

Translates frontend-produced LLVM IR (Static Single Assignment form) into optimized target machine code for x86, ARM, RISC-V, etc.