Compiler Design
Lexical Analysis & Tokens
The primary responsibility of the Lexical Analyzer (or scanner) is to read the input characters of the source program and group them into lexically valid units called tokens.
Tokens, Lexemes, and Patterns
It is crucial to understand the formal distinction between these three core concepts:
Pattern
The descriptive rule (typically represented by a regular expression) that defines what character sequence is required to match a specific token category.
Example: [a-zA-Z_][a-zA-Z0-9_]*
Lexeme
The actual, physical character sequence from the source code that matches the pattern rule.
Example: "counter", "sum_val"
Token
The abstract logical category produced by the scanner, along with optional attribute values, to send to the parser.
Example: <IDENTIFIER, "counter">
Token Categories & Attributes
Typically, programming languages define these categories of tokens:
- Keywords: Reserved strings with fixed semantic meaning (e.g.,
if,while,return). - Identifiers: User-defined names for variables, classes, or functions.
- Literals: Constants (e.g., integers like
42, floating points like3.14, string values like"hello"). - Operators: Arithmetic, logical, or comparison symbols (e.g.,
+,&&,==). - Punctuation/Separators: Structural symbols (e.g.,
;,(,{).
When matching complex tokens (like numbers or variables), the parser needs to know both the token's type and its specific value (lexeme). The lexer bundles these together as:
For instance, the source code segment x = 42 translates into three token packets:
1. <IDENTIFIER, Pointer_To_Symbol_Table_Entry_For_x> 2. <ASSIGN_OP, -> 3. <INTEGER_LITERAL, 42>
Maximal Munch Rule
When scanning, the lexer must resolve ambiguity (e.g., is <= a less-than sign followed by equals, or a single less-than-or-equal-to comparison?). The lexer resolves this using the Maximal Munch (longest match) rule: always match the longest sequence of characters that can form a valid token.
Detailed Deep Dive
Learn how to design lexical analyzers, handle buffer pairs, and see the differences between hand-written scanners and tool-generated DFAs:
Explore Hand-written vs Tool Lexers →