Compiler Design
Lexical Analysis & Tokens / Hand-written vs Tool Lexers
Lexical scanners can be constructed in two primary ways: written manually by hand (using custom state loops) or generated automatically by a scanner tool (using regular expression patterns and finite automata).
Hand-written vs Tool-generated Lexers
Almost all commercial and production compilers (like GCC, Clang, and rustc) use hand-written lexers. They are preferred because they offer better performance, simpler error reporting/recovery, and allow custom behavior without the overhead of state tables. Conversely, tool-generated lexers are fast to prototype and easier to maintain.
Input Buffering & Buffer Pairs
Reading one character at a time from disk/file is highly inefficient. To optimize this, compilers use an **Input Buffering** scheme using **Buffer Pairs**:
- Two N-character Buffers: Alternately reloaded from input files.
- Two Pointers:
lexemeBegin: Points to the start of the current lexeme being matched.forward: Moves ahead character-by-character to scan the token.
- Sentinels (EOF): Special marker characters placed at the end of each buffer half to verify if buffer reloading is needed in a single comparison step.
[ b = a + b * 2; |EOF| ] [ (next buffer block) |EOF| ] ^ ^ lexemeBegin forward
Designing a Hand-written Lexer
A hand-written lexer typically uses a large loop with a switch statement processing the current lookahead character.
Token next_token() {
char c = next_char();
while (isspace(c)) c = next_char(); // skip whitespace
if (isalpha(c)) {
string lexeme = "";
while (isalnum(c) || c == '_') {
lexeme += c;
c = next_char();
}
retract_char(1); // push back extra character
if (is_keyword(lexeme)) {
return Token(KEYWORD, lexeme);
}
return Token(IDENTIFIER, lexeme);
}
if (isdigit(c)) {
string value = "";
while (isdigit(c)) {
value += c;
c = next_char();
}
retract_char(1);
return Token(INTEGER_LITERAL, value);
}
switch (c) {
case '=':
if (peek_char() == '=') {
next_char();
return Token(COMPARE_OP, "==");
}
return Token(ASSIGN_OP, "=");
case '+': return Token(ADD_OP, "+");
case '*': return Token(MUL_OP, "*");
case ';': return Token(SEMICOLON, ";");
case EOF: return Token(EOF_TOKEN, "");
default: return Token(ERROR_TOKEN, string(1, c));
}
}