• Submatch extraction: re2c supports both POSIX-compliant capturing groups and standalone tags (with leftmost greedy disambiguation and optional handling of repeated submatch). The implementation is based on the lookahead-TDFA algorithm.
• Encoding support: re2c supports ASCII, UTF-8, UTF-16, UTF-32, UCS-2 and EBCDIC.
• Flexible user interface: the generated code uses a few primitive operations to interface with the environment (read input characters, advance to the next input position, etc.); users can redefine these primitives to whatever they need.
• Storable state: re2c supports both pull-model lexers (where the lexer runs without interruption and pulls more input as necessary) and push-model lexers (where the lexer is periodically stopped and resumed to parse new chunks of input).
• Start conditions: re2c can generate multiple interrelated lexers, where each lexer is triggered by a certain condition in the program.
• Self-validation: re2c has a special mode in which it ignores all user-defined interface code and generates a self-contained skeleton program. Additionally, re2c generates two files: one with input strings derived from the regular grammar, and one with compressed match results that are used to verify lexer behavior on all inputs. Input strings are generated so that they extensively cover DFA transitions and paths. Data generation happens right after DFA construction and prior to any optimizations, but the lexer itself is fully optimized, so skeleton programs are capable of revealing any errors in optimizations and code generation.
• Warnings: re2c performs static analysis of the program and warns its users about possible deficiencies or bugs, such as undefined control flow, unreachable code, ill-formed escape symbols and potential misuse of the interface primitives.
• Debugging: besides generating human-readable lexers, re2c has a number of options that dump various intermediate representations of the generated lexer, such as the NFA, multiple stages of the DFA and the resulting program graph in DOT format.

== Syntax ==