Compiler Design

Lexical Analysis & Tokens


The primary responsibility of the Lexical Analyzer (or scanner) is to read the input characters of the source program and group them into lexically valid units called tokens.

Tokens, Lexemes, and Patterns

It is crucial to understand the formal distinction between these three core concepts:

Pattern

The descriptive rule (typically represented by a regular expression) that defines what character sequence is required to match a specific token category.

Example: [a-zA-Z_][a-zA-Z0-9_]*

Lexeme

The actual, physical character sequence from the source code that matches the pattern rule.

Example: "counter", "sum_val"

Token

The abstract logical category produced by the scanner, along with optional attribute values, to send to the parser.

Example: <IDENTIFIER, "counter">


Token Categories & Attributes

Typically, programming languages define these categories of tokens:

  • Keywords: Reserved strings with fixed semantic meaning (e.g., if, while, return).
  • Identifiers: User-defined names for variables, classes, or functions.
  • Literals: Constants (e.g., integers like 42, floating points like 3.14, string values like "hello").
  • Operators: Arithmetic, logical, or comparison symbols (e.g., +, &&, ==).
  • Punctuation/Separators: Structural symbols (e.g., ;, (, {).

When matching complex tokens (like numbers or variables), the parser needs to know both the token's type and its specific value (lexeme). The lexer bundles these together as:

<Token_Type, Attribute_Value>

For instance, the source code segment x = 42 translates into three token packets:

1. <IDENTIFIER, Pointer_To_Symbol_Table_Entry_For_x>
2. <ASSIGN_OP, ->
3. <INTEGER_LITERAL, 42>

Maximal Munch Rule

When scanning, the lexer must resolve ambiguity (e.g., is <= a less-than sign followed by equals, or a single less-than-or-equal-to comparison?). The lexer resolves this using the Maximal Munch (longest match) rule: always match the longest sequence of characters that can form a valid token.


Detailed Deep Dive

Learn how to design lexical analyzers, handle buffer pairs, and see the differences between hand-written scanners and tool-generated DFAs:

Explore Hand-written vs Tool Lexers →