Compiler Design
Compiler Writing Tools
Writing a compiler from scratch is a massive undertaking. To address this complexity, the CS community developed specialized tools called Compiler Writing Tools (or Compiler-Compilers) that automate the most mathematically complex parsing and scanning logic.
Key Automated Tools
| Phase Targeted | Tool | Description |
|---|---|---|
| Lexical Analysis | Lex / Flex | Generates O(n) DFA-based lexical scanners from regular expression rules. |
| Syntax Analysis | Yacc / Bison | Generates LALR(1) parsers from Context-Free Grammars (CFG) in Backus-Naur Form (BNF). |
| Intermediate & Backend | LLVM | A modern library infrastructure for code optimization and hardware target translation. |
1. LEX / FLEX
Lex reads a specification file (typically ending in .l) containing regular expressions, and produces a C source file containing a fast finite-automaton scanner function yylex().
A Lex specification file is split into three sections separated by %% delimiters:
%{
/* Section 1: C Declarations */
#include <stdio.h>
%}
/* Section 2: Definitions (regex patterns) */
digit [0-9]
letter [a-zA-Z]
%%
/* Section 3: Rules (Pattern & matching action) */
{digit}+ { printf("INTEGER: %s\n", yytext); }
{letter}+ { printf("WORD: %s\n", yytext); }
. { /* ignore everything else */ }
%%
/* Section 4: User Subroutines */
int yywrap() { return 1; }2. YACC / BISON
Yacc (Yet Another Compiler-Compiler) takes a context-free grammar specification (usually ending in .y) and generates an LALR(1) bottom-up parser function yyparse().
Like Lex, Yacc files use %% split markers for definitions, grammar rules, and helper C code:
%{
#include <stdio.h>
int yylex();
void yyerror(const char *s);
%}
%token NUMBER ID
%left '+' '-'
%left '*' '/'
%%
/* Grammar Rules */
expr : expr '+' expr { $$ = $1 + $3; }
| expr '*' expr { $$ = $1 * $3; }
| NUMBER { $$ = $1; }
;
%%
void yyerror(const char *s) {
fprintf(stderr, "Parsing error: %s\n", s);
}LEX + YACC Integration
In practice, the lexical scanner generated by Lex acts as a helper to the parser generated by Yacc. The parser calls yylex() each time it needs a new token.
Modern Alternatives
ANTLR4
Generates LL(*) parsers supporting multiple target languages (Java, C++, Python, JavaScript) from unified .g4 grammar files.
LLVM
Translates frontend-produced LLVM IR (Static Single Assignment form) into optimized target machine code for x86, ARM, RISC-V, etc.