A compiler implements a formal transformation from a high-level source program to a low-level target program. Compiler design can define an end-to-end solution or tackle a defined subset that interfaces with other compilation tools, e.g. preprocessors, assemblers, and linkers. Design requirements include rigorously defined interfaces, both internally between compiler components and externally with supporting toolsets. In the early days, the approach taken to compiler design was directly affected by the complexity of the computer language to be processed, the experience of the person(s) designing it, and the resources available. Resource limitations led to the need to pass through the source code more than once. A compiler for a relatively simple language written by one person might be a single, monolithic piece of software. However, as the source language grows in complexity, the design may be split into a number of interdependent phases. Separate phases provide design improvements that focus development on the functions in the compilation process.
=== One-pass vs multi-pass compilers ===
Classifying compilers by number of passes has its background in the hardware resource limitations of computers. Compiling involves performing much work, and early computers did not have enough memory to contain one program that did all of this work. As a result, compilers were split up into smaller programs which each made a pass over the source (or some representation of it), performing some of the required analysis and translations.
The ability to compile in a single pass has classically been seen as a benefit because it simplifies the job of writing a compiler, and one-pass compilers generally perform compilations faster than multi-pass compilers. Thus, partly driven by the resource limitations of early systems, many early languages were specifically designed so that they could be compiled in a single pass (e.g., Pascal).
In some cases, the design of a language feature may require a compiler to perform more than one pass over the source. For instance, consider a declaration appearing on line 20 of the source which affects the translation of a statement appearing on line 10. In this case, the first pass needs to gather information about declarations appearing after statements that they affect, with the actual translation happening during a subsequent pass.
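A minimal sketch of this two-pass idea, assuming an invented one-statement-per-line toy source form ("decl name type" and "use name") rather than any real language:

<syntaxhighlight lang="python">
# Hypothetical toy input: the declaration appears after the statement that uses it.
SOURCE = [
    "use total",          # line 1: refers to a name declared later
    "decl total float",   # line 2
]

def first_pass(lines):
    """Pass 1: gather declarations into a symbol table without translating anything."""
    symbols = {}
    for line in lines:
        parts = line.split()
        if parts[0] == "decl":
            _, name, type_name = parts
            symbols[name] = type_name
    return symbols

def second_pass(lines, symbols):
    """Pass 2: translate 'use' statements, now that every declaration is known."""
    output = []
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if parts[0] == "use":
            name = parts[1]
            if name not in symbols:
                raise NameError(f"line {number}: '{name}' is never declared")
            output.append(f"LOAD {name} ; type {symbols[name]}")
    return output

print(second_pass(SOURCE, first_pass(SOURCE)))
</syntaxhighlight>

The translation of the use on line 1 depends on a declaration that only appears on line 2, so no single forward pass over this source could emit it correctly.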
The disadvantage of compiling in a single pass is that it is not possible to perform many of the sophisticated optimizations needed to generate high-quality code. It can be difficult to count exactly how many passes an optimizing compiler makes: for instance, different phases of optimization may analyse one expression many times but analyse another expression only once. Splitting a compiler up into small programs is also a technique used by researchers interested in producing provably correct compilers, since proving the correctness of a set of small programs often requires less effort than proving the correctness of a larger, single, equivalent program.
=== Three-stage compiler structure ===
Regardless of the exact number of phases in the compiler design, the phases can be assigned to one of three stages: the front end, the middle end, and the back end.
• The front end scans the input and verifies syntax and semantics according to a specific source language. For statically typed languages it performs type checking by collecting type information. If the input program is syntactically incorrect or has a type error, it generates error and/or warning messages, usually identifying the location in the source code where the problem was detected; in some cases the actual error may be (much) earlier in the program. Aspects of the front end include lexical analysis, syntax analysis, and semantic analysis. The front end transforms the input program into an intermediate representation (IR) for further processing by the middle end. This IR is usually a lower-level representation of the program with respect to the source code.
• The middle end performs optimizations on the IR that are independent of the CPU architecture being targeted. This source code/machine code independence is intended to enable generic optimizations to be shared between versions of the compiler supporting different languages and target processors. Examples of middle-end optimizations are removal of useless code (dead-code elimination) or unreachable code (reachability analysis), discovery and propagation of constant values (constant propagation), relocation of computation to a less frequently executed place (e.g., out of a loop), and specialization of computation based on the context, eventually producing the "optimized" IR that is used by the back end.
• The back end takes the optimized IR from the middle end. It may perform more analysis, transformations and optimizations that are specific to the target CPU architecture. The back end generates the target-dependent assembly code, performing register allocation in the process. The back end also performs instruction scheduling, which re-orders instructions to keep parallel execution units busy by filling delay slots. Although most optimization problems are NP-hard, heuristic techniques for solving them are well developed and implemented in production-quality compilers. Typically the output of a back end is machine code specialized for a particular processor and operating system.
This front/middle/back-end approach makes it possible to combine front ends for different languages with back ends for different CPUs while sharing the optimizations of the middle end, as the sketch below illustrates. Practical examples of this approach are the GNU Compiler Collection, Clang (the LLVM-based C/C++ compiler), and the Amsterdam Compiler Kit, which have multiple front ends, shared optimizations, and multiple back ends.
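The sketch below illustrates the sharing this structure enables; the IR, the front ends, and the back ends here are invented stand-ins rather than the design of GCC, Clang or any other real compiler. One target-independent middle end serves every combination of source language and target CPU.

<syntaxhighlight lang="python">
# A made-up IR: a list of (operation, destination, operands) tuples.
def front_end_lang_a(source):             # hypothetical front end for "language A"
    return [("const", "t1", (2,)), ("const", "t2", (3,)), ("add", "t0", ("t1", "t2"))]

def front_end_lang_b(source):             # hypothetical front end for "language B"
    return [("const", "t1", (2,)), ("const", "t2", (3,)), ("add", "t0", ("t1", "t2"))]

def middle_end(ir):
    """Target-independent optimization, shared by all language/CPU combinations."""
    return ir   # a real middle end would fold constants, remove dead code, etc.

def back_end_x(ir):                        # hypothetical back end for "CPU X"
    return [f"X-ASM {op} {dest}" for op, dest, _ in ir]

def back_end_y(ir):                        # hypothetical back end for "CPU Y"
    return [f"Y-ASM {op} {dest}" for op, dest, _ in ir]

# Any front end can be combined with any back end through the shared middle end.
for fe in (front_end_lang_a, front_end_lang_b):
    for be in (back_end_x, back_end_y):
        print(be(middle_end(fe("example source"))))
</syntaxhighlight>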
==== Front end ====
Scanner and parser example for C: starting from the sequence of characters "if(net>0.0)total+=net*(1.0+tax/100.0);", the scanner composes a sequence of tokens and categorizes each of them, for example as identifier, keyword, number literal, or operator. The token sequence is transformed by the parser into a syntax tree, which is then treated by the remaining compiler phases. The scanner and the parser handle the regular and the properly context-free parts of the grammar for C, respectively.
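As a rough sketch of how such a scanner might work (illustrative only; the token categories, keyword list and regular expressions below are invented for this example and are not those of any real C compiler):

<syntaxhighlight lang="python">
import re

# Token categories and the regular expressions recognizing their lexemes (illustrative only).
TOKEN_SPEC = [
    ("NUMBER",     r"\d+\.\d+|\d+"),          # numeric literals such as 0.0 or 100.0
    ("IDENTIFIER", r"[A-Za-z_]\w*"),          # names such as net, total, tax (and keywords)
    ("OPERATOR",   r"\+=|[><=+\-*/]"),        # operators, longest match first
    ("SEPARATOR",  r"[();]"),                 # punctuation
]
KEYWORDS = {"if", "while", "return"}          # a tiny subset of C keywords

def tokenize(text):
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC)
    for match in re.finditer(pattern, text):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        # Evaluation step: convert number lexemes into values usable by later phases.
        value = float(lexeme) if kind == "NUMBER" else lexeme
        yield (kind, value)

print(list(tokenize("if(net>0.0)total+=net*(1.0+tax/100.0);")))
</syntaxhighlight>

Each token pairs a category with a value; number lexemes are already converted into numeric values here, corresponding to the evaluation step described under lexical analysis below.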
The front end analyzes the source code to build an internal representation of the program, called the intermediate representation (IR). It also manages the symbol table, a data structure mapping each symbol in the source code to associated information such as location, type and scope.
While the front end can be a single monolithic function or program, as in a scannerless parser, it was traditionally implemented and analyzed as several phases, which may execute sequentially or concurrently. This method is favored due to its modularity and separation of concerns. Most commonly, the front end is broken into three phases: lexical analysis (also known as lexing or scanning), syntax analysis (also known as parsing), and semantic analysis. Lexing and parsing comprise the syntactic analysis (word syntax and phrase syntax, respectively), and in simple cases these modules (the lexer and parser) can be automatically generated from a grammar for the language, though in more complex cases they require manual modification. The lexical grammar and phrase grammar are usually context-free grammars, which simplifies analysis significantly, with context-sensitivity handled at the semantic analysis phase. The semantic analysis phase is generally more complex and written by hand, but can be partially or fully automated using attribute grammars. These phases themselves can be further broken down: lexing as scanning and evaluating, and parsing as building a concrete syntax tree (CST, parse tree) and then transforming it into an abstract syntax tree (AST, syntax tree). In some cases additional phases are used, notably line reconstruction and preprocessing, but these are rare.
The main phases of the front end include the following:
• Line reconstruction converts the input character sequence to a canonical form ready for the parser. Languages which strop their keywords or allow arbitrary spaces within identifiers require this phase. The top-down, recursive-descent, table-driven parsers used in the 1960s typically read the source one character at a time and did not require a separate tokenizing phase. Atlas Autocode and Imp (and some implementations of ALGOL and Coral 66) are examples of stropped languages whose compilers would have a line reconstruction phase.
• Preprocessing supports macro substitution and conditional compilation. Typically the preprocessing phase occurs before syntactic or semantic analysis; e.g. in the case of C, the preprocessor manipulates lexical tokens rather than syntactic forms. However, some languages such as Scheme support macro substitutions based on syntactic forms.
• Lexical analysis (also known as lexing or tokenization) breaks the source code text into a sequence of small pieces called lexical tokens. This phase can be divided into two stages: the scanning, which segments the input text into syntactic units called lexemes and assigns them a category; and the evaluating, which converts lexemes into a processed value. A token is a pair consisting of a token name and an optional token value. Common token categories may include identifiers, keywords, separators, operators, literals and comments, although the set of token categories varies in different programming languages. The lexeme syntax is typically a regular language, so a finite-state automaton constructed from a regular expression can be used to recognize it. The software doing lexical analysis is called a lexical analyzer. This may not be a separate step; it can be combined with the parsing step in scannerless parsing, in which case parsing is done at the character level, not the token level.
• Syntax analysis (also known as parsing) involves parsing the token sequence to identify the syntactic structure of the program. This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree structure built according to the rules of a formal grammar which define the language's syntax. The parse tree is often analyzed, augmented, and transformed by later phases in the compiler (see the first sketch following this list).
• Semantic analysis adds semantic information to the parse tree and builds the symbol table. This phase performs semantic checks such as type checking (checking for type errors), object binding (associating variable and function references with their definitions), or definite assignment (requiring all local variables to be initialized before use), rejecting incorrect programs or issuing warnings (see the second sketch following this list). Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows the parsing phase and logically precedes the code generation phase, though it is often possible to fold multiple phases into one pass over the code in a compiler implementation.
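As an illustration of how a parser turns a token sequence into a tree, here is a minimal recursive-descent sketch for an invented expression grammar (expr → term (('+'|'-') term)*, term → NUMBER | '(' expr ')'); it is not the parser of any production compiler, and the token list is assumed to come from a lexer like the one sketched earlier.

<syntaxhighlight lang="python">
def parse_expr(tokens, pos=0):
    """expr -> term (('+' | '-') term)*  over an invented toy grammar."""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)               # build the tree bottom-up
    return node, pos

def parse_term(tokens, pos):
    """term -> NUMBER | '(' expr ')'"""
    if tokens[pos] == "(":
        node, pos = parse_expr(tokens, pos + 1)
        assert tokens[pos] == ")", "expected ')'"
        return node, pos + 1
    return ("num", float(tokens[pos])), pos + 1

tree, _ = parse_expr(["1.0", "+", "(", "2.0", "-", "3.0", ")"])
print(tree)   # ('+', ('num', 1.0), ('-', ('num', 2.0), ('num', 3.0)))
</syntaxhighlight>

Semantic analysis can then walk such a tree. The following sketch additionally assumes an invented ("var", name) node and a hypothetical symbol table filled in by earlier phases, and performs a simple form of type checking and object binding:

<syntaxhighlight lang="python">
SYMBOL_TABLE = {"net": "float", "count": "int"}    # hypothetical symbol table from earlier phases

def check(node):
    """Return the type of an expression node, rejecting ill-typed programs."""
    kind = node[0]
    if kind == "num":
        return "float"
    if kind == "var":
        if node[1] not in SYMBOL_TABLE:
            raise NameError(f"undeclared variable '{node[1]}'")   # object binding failed
        return SYMBOL_TABLE[node[1]]
    if kind in ("+", "-", "*", "/"):
        left, right = check(node[1]), check(node[2])
        if left != right:
            raise TypeError(f"operands of '{kind}' have types {left} and {right}")
        return left
    raise ValueError(f"unknown node kind {kind!r}")

print(check(("+", ("var", "net"), ("num", 1.0))))    # float
# check(("+", ("var", "net"), ("var", "count")))     # would raise TypeError: float vs int
</syntaxhighlight>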
==== Middle end ====
The middle end, also known as the optimizer, performs optimizations on the intermediate representation in order to improve the performance and the quality of the produced machine code. The middle end contains those optimizations that are independent of the CPU architecture being targeted.
The main phases of the middle end include the following:
• Analysis: this is the gathering of program information from the intermediate representation derived from the input; data-flow analysis is used to build use-define chains, together with dependence analysis, alias analysis, pointer analysis, escape analysis, etc. Accurate analysis is the basis for any compiler optimization. The control-flow graph of every compiled function and the call graph of the program are usually also built during the analysis phase.
• Optimization: the intermediate language representation is transformed into functionally equivalent but faster (or smaller) forms. Popular optimizations are inline expansion, dead-code elimination, constant propagation, loop transformation and even automatic parallelization. Compiler analysis is the prerequisite for any compiler optimization, and the two work tightly together. For example, dependence analysis is crucial for loop transformation.
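As a toy illustration of two of the optimizations named above, and of the analysis they depend on, the following sketch performs constant propagation followed by dead-code elimination over an invented straight-line three-address form; the dead-code pass runs a small backwards liveness computation, a drastically simplified stand-in for the data-flow analyses described in the analysis item. Real implementations operate on control-flow graphs or SSA form and handle far more cases.

<syntaxhighlight lang="python">
# Invented IR: (destination, operation, operands); "ret" marks the value live at the end.
IR = [
    ("x", "const", (4,)),
    ("y", "const", (5,)),
    ("z", "add", ("x", "y")),
    ("w", "add", ("x", "x")),   # never used afterwards: dead code
    ("ret", "copy", ("z",)),
]

def constant_propagation(ir):
    constants, out = {}, []
    for dest, op, args in ir:
        args = tuple(constants.get(a, a) for a in args)            # propagate known constants
        if op == "const":
            constants[dest] = args[0]
        elif op == "add" and all(isinstance(a, (int, float)) for a in args):
            constants[dest] = args[0] + args[1]                     # fold the addition
            op, args = "const", (constants[dest],)
        out.append((dest, op, args))
    return out

def dead_code_elimination(ir):
    live, out = {"ret"}, []
    for dest, op, args in reversed(ir):                             # backwards liveness walk
        if dest in live:
            live |= {a for a in args if isinstance(a, str)}
            out.append((dest, op, args))
    return list(reversed(out))

print(dead_code_elimination(constant_propagation(IR)))
# [('ret', 'copy', (9,))]  (every other instruction was folded into a constant or found dead)
</syntaxhighlight>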
The scope of compiler analyses and optimizations varies greatly; they may range from operating within a basic block, to whole procedures, or even the whole program. There is a trade-off between the granularity of the optimizations and the cost of compilation. For example, peephole optimizations are fast to perform during compilation but only affect a small local fragment of the code, and can be performed independently of the context in which the code fragment appears. In contrast, interprocedural optimization requires more compilation time and memory space, but enables optimizations that are only possible by considering the behavior of multiple functions simultaneously. Interprocedural analysis and optimizations are common in modern commercial compilers from HP, IBM, SGI, Intel, Microsoft, and Sun Microsystems. The free software GCC was criticized for a long time for lacking powerful interprocedural optimizations, but it is changing in this respect. Another open source compiler with a full analysis and optimization infrastructure is Open64, which is used by many organizations for research and commercial purposes. Due to the extra time and space needed for compiler analysis and optimizations, some compilers skip them by default. Users have to use compilation options to explicitly tell the compiler which optimizations should be enabled.
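Inline expansion is a simple example of an optimization that needs interprocedural information, namely the callee's body. The sketch below works over an invented AST form, with a hypothetical function table standing in for what interprocedural analysis would collect:

<syntaxhighlight lang="python">
# Hypothetical function table: name -> (parameter names, body as an expression tree).
FUNCTIONS = {"square": (("x",), ("*", ("var", "x"), ("var", "x")))}

def inline(node):
    """Replace ("call", name, args) nodes by the callee's body with arguments substituted."""
    if node[0] == "call":
        params, body = FUNCTIONS[node[1]]
        bindings = dict(zip(params, [inline(arg) for arg in node[2]]))
        return substitute(body, bindings)
    return (node[0],) + tuple(inline(c) if isinstance(c, tuple) else c for c in node[1:])

def substitute(node, bindings):
    if node[0] == "var" and node[1] in bindings:
        return bindings[node[1]]
    return (node[0],) + tuple(substitute(c, bindings) if isinstance(c, tuple) else c for c in node[1:])

print(inline(("+", ("num", 1.0), ("call", "square", (("num", 3.0),)))))
# ('+', ('num', 1.0), ('*', ('num', 3.0), ('num', 3.0)))
</syntaxhighlight>

After inlining, intraprocedural passes such as the constant folding sketched above can optimize across what used to be a function-call boundary.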
==== Back end ====
The back end is responsible for the CPU-architecture-specific optimizations and for code generation. A prominent example is peephole optimization, which rewrites short sequences of assembler instructions into more efficient instructions.
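The idea can be sketched as follows; the pseudo-assembly and the two rewrite patterns are invented for the example and are not the peephole rules of any particular back end.

<syntaxhighlight lang="python">
# Invented pseudo-assembly: each instruction is a tuple of strings.
CODE = [
    ("mov", "r1", "r2"),
    ("mov", "r2", "r1"),   # redundant: copies the value straight back
    ("mul", "r3", "2"),    # multiplication by two can become a cheaper shift
]

def peephole(code):
    out, i = [], 0
    while i < len(code):
        window = code[i:i + 2]                 # a two-instruction "peephole"
        # Pattern 1: "mov a,b ; mov b,a" -> keep only the first move.
        if (len(window) == 2 and window[0][0] == "mov" and window[1][0] == "mov"
                and window[0][1:] == window[1][1:][::-1]):
            out.append(window[0])
            i += 2
            continue
        # Pattern 2: strength reduction, "mul r, 2" -> "shl r, 1".
        if window[0][0] == "mul" and window[0][2] == "2":
            out.append(("shl", window[0][1], "1"))
            i += 1
            continue
        out.append(window[0])
        i += 1
    return out

print(peephole(CODE))
# [('mov', 'r1', 'r2'), ('shl', 'r3', '1')]
</syntaxhighlight>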
• Code generation: the transformed intermediate language is translated into the output language, usually the native machine language of the system. This involves resource and storage decisions, such as deciding which variables to fit into registers and memory, and the selection and scheduling of appropriate machine instructions along with their associated addressing modes (see also the Sethi–Ullman algorithm). Debug data may also need to be generated to facilitate debugging.
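A minimal sketch of instruction selection for an expression tree, reusing the invented node shapes from the front-end sketches and emitting an equally invented pseudo-assembly; real back ends combine this with proper register allocation, instruction scheduling and addressing-mode selection, whereas this sketch simply hands out scratch registers naively.

<syntaxhighlight lang="python">
def gen(node, registers=("r0", "r1", "r2", "r3")):
    """Emit pseudo-assembly for an expression tree; returns (instructions, result register).

    Assumes the expression is shallow enough for the four scratch registers."""
    if node[0] == "num":
        reg = registers[0]
        return [f"load {reg}, #{node[1]}"], reg
    op, left, right = node
    left_code, left_reg = gen(left, registers)
    right_code, right_reg = gen(right, registers[1:])    # naive allocation: next free register
    mnemonic = {"+": "add", "-": "sub", "*": "mul", "/": "div"}[op]
    return left_code + right_code + [f"{mnemonic} {left_reg}, {left_reg}, {right_reg}"], left_reg

code, result = gen(("+", ("num", 1.0), ("*", ("num", 2.0), ("num", 3.0))))
print("\n".join(code), "; result in", result)
</syntaxhighlight>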
== Compiler correctness ==
Compiler correctness is the branch of software engineering that deals with trying to show that a compiler behaves according to its language specification. Techniques include developing the compiler using formal methods and using rigorous testing (often called compiler validation) on an existing compiler.
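One simple form of such testing is differential testing, in which the same inputs are run through the compiler's output and through a reference implementation of the language, and the results are compared. The sketch below is a toy stand-in (the "pass under test" merely folds constant additions) rather than a description of any real validation suite.

<syntaxhighlight lang="python">
import random

def interpret(node):
    """Reference semantics: evaluate an expression tree directly."""
    return node[1] if node[0] == "num" else interpret(node[1]) + interpret(node[2])

def fold(node):
    """The pass under test: fold constant additions (a deliberately simple stand-in)."""
    return node if node[0] == "num" else ("num", interpret(node))

def random_tree(depth):
    if depth == 0 or random.random() < 0.3:
        return ("num", random.randint(-100, 100))
    return ("+", random_tree(depth - 1), random_tree(depth - 1))

for _ in range(1000):
    tree = random_tree(depth=5)
    assert interpret(fold(tree)) == interpret(tree), tree   # outputs must agree on every input
print("no mismatches found")
</syntaxhighlight>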
== Compiled vis-à-vis interpreted languages ==