Computer language syntax is generally distinguished into three levels:

• Words – the lexical level, determining how characters form tokens;
• Phrases – the grammar level, narrowly speaking, determining how tokens form phrases;
• Context – determining which objects or variables names refer to, whether types are valid, etc.

Distinguishing in this way yields modularity, allowing each level to be described and processed separately and often independently. First, a lexer turns the linear sequence of characters into a linear sequence of tokens; this is known as "lexical analysis" or "lexing".
Second, the parser turns the linear sequence of tokens into a hierarchical syntax tree; this is known as "parsing" narrowly speaking. Parsing ensures that the sequence of tokens conforms to the formal grammar of the programming language. The parsing stage itself can be divided into two parts: the parse tree, or "concrete syntax tree", which is determined by the grammar but is generally far too detailed for practical use, and the abstract syntax tree (AST), which simplifies this into a usable form.
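For example, in a toy expression language the concrete syntax tree for 2*(3+4) would record every grammar rule and parenthesis used in the derivation, while the AST keeps only operators and operands. A minimal sketch of such AST nodes, with invented names:

<syntaxhighlight lang="python">
from dataclasses import dataclass

# Hypothetical AST node types for a small expression language.
@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# The expression 2*(3+4): the parentheses and intermediate grammar
# rules of the concrete syntax tree are gone; only structure remains.
tree = BinOp("*", Num(2), BinOp("+", Num(3), Num(4)))
print(tree)
</syntaxhighlight>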
The AST and contextual analysis steps can be considered a form of semantic analysis, as they add meaning and interpretation to the syntax, or alternatively as informal, manual implementations of syntactical rules that would be difficult or awkward to describe or implement formally. Third, contextual analysis resolves names and checks types. This modularity is sometimes possible, but in many real-world languages an earlier step depends on a later step – for example, the lexer hack in C exists because tokenization depends on context. Even in these cases, syntactical analysis is often seen as approximating this ideal model.
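A schematic sketch of that dependency, with invented names: in C, the fragment (T)*x is a cast of a dereference if T names a type, and a multiplication otherwise, so the lexer must consult context gathered by later stages before it can classify the token:

<syntaxhighlight lang="python">
# Sketch of the "lexer hack": token classification depends on context.
typedef_names = set()  # fed back from the parser/contextual analysis

def classify(identifier):
    return "TYPE_NAME" if identifier in typedef_names else "IDENTIFIER"

print(classify("T"))    # IDENTIFIER
typedef_names.add("T")  # after the parser has seen "typedef int T;"
print(classify("T"))    # TYPE_NAME
</syntaxhighlight>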
The levels generally correspond to levels in the Chomsky hierarchy. Words are in a regular language, specified in the lexical grammar, which is a Type-3 grammar, generally given as regular expressions. Phrases are in a context-free language (CFL), generally a deterministic context-free language (DCFL), specified in a phrase structure grammar, which is a Type-2 grammar, generally given as production rules in Backus–Naur form (BNF). Phrase grammars are often specified in much more constrained grammars than full context-free grammars, in order to make them easier to parse; while the LR parser can parse any DCFL in linear time, the simpler LALR parser and the even simpler LL parser are more efficient, but can only parse grammars whose production rules are constrained.
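A hand-written recursive-descent parser is in effect an LL parser: one function per nonterminal, with the next token deciding which production to apply. A minimal sketch for an invented toy grammar:

<syntaxhighlight lang="python">
# Toy grammar (EBNF):  expr ::= term ("+" term)*
#                      term ::= NUMBER | "(" expr ")"
# One function per nonterminal; the next token selects the production.

def parse_expr(tokens, i):
    node, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] == "+":
        right, i = parse_term(tokens, i + 1)
        node = ("+", node, right)
    return node, i

def parse_term(tokens, i):
    if tokens[i] == "(":
        node, i = parse_expr(tokens, i + 1)
        return node, i + 1  # skip the closing ")"
    return ("num", int(tokens[i])), i + 1

tree, _ = parse_expr(["2", "+", "(", "3", "+", "4", ")"], 0)
print(tree)  # ('+', ('num', 2), ('+', ('num', 3), ('num', 4)))
</syntaxhighlight>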
In principle, contextual structure can be described by a context-sensitive grammar, and automatically analyzed by means such as attribute grammars, though, in general, this step is done manually, via name resolution rules and type checking, and implemented via a symbol table which stores names and types for each scope.
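A symbol table can be as simple as a stack of per-scope dictionaries; the design below is an illustrative sketch rather than a prescribed implementation:

<syntaxhighlight lang="python">
# A symbol table as a stack of per-scope dictionaries mapping
# names to their declared types.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]  # global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_):
        self.scopes[-1][name] = type_

    def lookup(self, name):
        # Name resolution: the innermost declaration wins.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"undeclared name: {name}")

table = SymbolTable()
table.declare("x", "int")
table.enter_scope()
table.declare("x", "string")  # shadows the outer x
print(table.lookup("x"))      # string
table.exit_scope()
print(table.lookup("x"))      # int
</syntaxhighlight>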
Tools have been written that automatically generate a lexer from a lexical specification written in regular expressions and a parser from the phrase grammar written in BNF: this allows one to use declarative programming, rather than procedural or functional programming. A notable example is the lex-yacc pair. These automatically produce a concrete syntax tree; the parser writer must then manually write code describing how this is converted to an abstract syntax tree. Contextual analysis is also generally implemented manually.
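For instance, in Python the lex-yacc style is available through the third-party PLY package (an assumption here: that PLY is installed); the toy grammar below is written declaratively, as BNF in docstrings, with actions that build an AST directly:

<syntaxhighlight lang="python">
import ply.lex as lex
import ply.yacc as yacc

# Lexical specification: token names plus one regular expression each.
tokens = ("NUMBER", "PLUS", "TIMES")
t_PLUS = r"\+"
t_TIMES = r"\*"
t_ignore = " \t"

def t_NUMBER(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

# Phrase grammar: BNF production rules in the docstrings.
def p_expr_plus(p):
    "expr : expr PLUS term"
    p[0] = ("+", p[1], p[3])

def p_expr_term(p):
    "expr : term"
    p[0] = p[1]

def p_term_times(p):
    "term : term TIMES NUMBER"
    p[0] = ("*", p[1], p[3])

def p_term_number(p):
    "term : NUMBER"
    p[0] = p[1]

def p_error(p):
    raise SyntaxError(p)

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse("2+3*4"))  # ('+', 2, ('*', 3, 4))
</syntaxhighlight>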
Despite the existence of these automatic tools, parsing is often implemented manually, for various reasons – perhaps the phrase structure is not context-free, or an alternative implementation improves performance or error reporting, or makes the grammar easier to change. Parsers are often written in functional programming languages, such as Haskell; in scripting languages, such as Python or Perl; or in imperative programming languages, such as C or C++.

== Syntax definition ==