Compiler Design

Lexical Analysis & Tokens / Hand-written vs Tool Lexers


Lexical scanners can be constructed in two primary ways: written manually by hand (using custom state loops) or generated automatically by a scanner tool (using regular expression patterns and finite automata).

Hand-written vs Tool-generated Lexers

Almost all commercial and production compilers (like GCC, Clang, and rustc) use hand-written lexers. They are preferred because they offer better performance, simpler error reporting/recovery, and allow custom behavior without the overhead of state tables. Conversely, tool-generated lexers are fast to prototype and easier to maintain.


Input Buffering & Buffer Pairs

Reading one character at a time from disk/file is highly inefficient. To optimize this, compilers use an **Input Buffering** scheme using **Buffer Pairs**:

  • Two N-character Buffers: Alternately reloaded from input files.
  • Two Pointers:
    • lexemeBegin: Points to the start of the current lexeme being matched.
    • forward: Moves ahead character-by-character to scan the token.
  • Sentinels (EOF): Special marker characters placed at the end of each buffer half to verify if buffer reloading is needed in a single comparison step.
[ b = a + b * 2; |EOF| ]  [ (next buffer block) |EOF| ]
   ^             ^
   lexemeBegin   forward

Designing a Hand-written Lexer

A hand-written lexer typically uses a large loop with a switch statement processing the current lookahead character.

Token next_token() {
    char c = next_char();
    while (isspace(c)) c = next_char(); // skip whitespace
    
    if (isalpha(c)) {
        string lexeme = "";
        while (isalnum(c) || c == '_') {
            lexeme += c;
            c = next_char();
        }
        retract_char(1); // push back extra character
        if (is_keyword(lexeme)) {
            return Token(KEYWORD, lexeme);
        }
        return Token(IDENTIFIER, lexeme);
    }
    
    if (isdigit(c)) {
        string value = "";
        while (isdigit(c)) {
            value += c;
            c = next_char();
        }
        retract_char(1);
        return Token(INTEGER_LITERAL, value);
    }
    
    switch (c) {
        case '=':
            if (peek_char() == '=') {
                next_char();
                return Token(COMPARE_OP, "==");
            }
            return Token(ASSIGN_OP, "=");
        case '+': return Token(ADD_OP, "+");
        case '*': return Token(MUL_OP, "*");
        case ';': return Token(SEMICOLON, ";");
        case EOF: return Token(EOF_TOKEN, "");
        default: return Token(ERROR_TOKEN, string(1, c));
    }
}