Compiler Design

Architecture Dependent Code Improvement


Architecture Dependent Code Improvement

Unlike machine-independent optimizations that focus on the logic of the code, architecture-dependent code improvement tailors the generated code to the specific hardware characteristics of the target machine. This includes optimizing for the CPU pipeline, cache memory hierarchy, and instruction set.

Pipelining allows a CPU to execute multiple instructions simultaneously by overlapping their execution stages (e.g., Fetch → Decode → Execute → Write-back). However, this can be interrupted by hazards:

  • Data Hazards: An instruction depends on the result of a previous, unfinished instruction.
  • Control Hazards: A branch instruction causes the pipeline to fetch the wrong instructions.

Instruction Scheduling Solution:

The compiler reorders instructions to fill pipeline "slots" that would otherwise be wasted on stalls.

// Example Without Scheduling (causes stall)
LOAD  R1, A        ; 3-cycle latency
ADD   R2, R1, R3   ; <--- stalls waiting for R1
STORE R2, B

// Example With Scheduling (stalls eliminated)
LOAD  R1, A        ; start load
ADD   R4, R5, R6   ; independent instruction fills the slot
MOV   R7, #1       ; another independent instruction
ADD   R2, R1, R3   ; now R1 is ready — no stall!
STORE R2, B

Cache memory is much faster than RAM. Programs that access memory in cache-friendly patterns run significantly faster.

  • Loop Interchange: Swaps the order of nested loop indices to improve spatial locality (e.g., accessing arrays in row-major order rather than column-major).
  • Loop Tiling (Blocking): Divides the iteration space into small tiles that fit completely into the cache, improving data reuse (commonly used in matrix multiplication).
  • Loop Fusion: Combines two adjacent loops with the same iteration range into one loop to reduce loop overhead and improve data reuse.
  • Loop Unrolling: Replicates the loop body multiple times, reducing branch overhead and enabling more instruction-level parallelism.
  • Loop Jamming: Combines multiple loops into one (similar to fusion) to reduce control overhead and improve register reuse.