Assembler

An assembler program creates object code by translating combinations of mnemonics and syntax for operations and addressing modes into their numerical equivalents. This representation typically includes an operation code ("opcode") as well as other control bits and data. The assembler also calculates constant expressions and resolves symbolic names for memory locations and other entities. Assemblers have been available since the 1950s, as the first step above machine language and before high-level programming languages such as Fortran, Algol, COBOL and Lisp. There have also been several classes of translators and semi-automatic code generators with properties similar to both assembly and high-level languages, with Speedcode as perhaps one of the better-known examples.

There may be several assemblers with different syntax for a particular CPU or instruction set architecture. For instance, an instruction to add memory data to a register in an x86-family processor might be add eax,[ebx], in original Intel syntax, whereas this would be written addl (%ebx),%eax in the AT&T syntax used by the GNU Assembler. Despite different appearances, different syntactic forms generally generate the same numeric machine code. A single assembler may also have different modes in order to support variations in syntactic forms as well as their exact semantic interpretations (such as FASM-syntax, TASM-syntax, ideal mode, etc., in the special case of x86 assembly programming).
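The claim that different syntactic forms yield the same machine code can be illustrated with a small Python sketch. It handles only the single "add register, memory" form from the example above; the function names and the reduced register table are assumptions made for illustration and do not reflect how any real assembler is organised.

```python
import re

# Register numbers as used in the x86 ModR/M byte. ESP and EBP are left out
# on purpose: as a base register with mod=00 they need special encodings
# that this sketch does not cover.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "esi": 6, "edi": 7}


def encode_add_reg_mem(dst: str, base: str) -> bytes:
    """Encode ADD dst, [base] as opcode 03 followed by a ModR/M byte (mod=00)."""
    modrm = (0b00 << 6) | (REG32[dst] << 3) | REG32[base]
    return bytes([0x03, modrm])


def assemble_intel(line: str) -> bytes:
    """Accept the Intel form: add DST,[BASE] (destination first)."""
    m = re.fullmatch(r"add\s+(\w+)\s*,\s*\[(\w+)\]", line.strip().lower())
    if m is None:
        raise ValueError(f"unsupported Intel-syntax line: {line!r}")
    return encode_add_reg_mem(dst=m.group(1), base=m.group(2))


def assemble_att(line: str) -> bytes:
    """Accept the AT&T form: addl (%BASE),%DST (source first)."""
    m = re.fullmatch(r"addl\s+\(%(\w+)\)\s*,\s*%(\w+)", line.strip().lower())
    if m is None:
        raise ValueError(f"unsupported AT&T-syntax line: {line!r}")
    return encode_add_reg_mem(dst=m.group(2), base=m.group(1))


if __name__ == "__main__":
    print(assemble_intel("add eax,[ebx]").hex(" "))
    print(assemble_att("addl (%ebx),%eax").hex(" "))
```

Both front ends reduce the statement to the same (destination, base) pair before encoding, which is why the operand order in the source text has no effect on the bytes produced.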
Number of passes

There are two types of assemblers based on how many passes through the source are needed (how many times the assembler reads the source) to produce the object file.

• One-pass assemblers process the source code once. For symbols used before they are defined, the assembler will emit "errata" after the eventual definition, telling the linker or the loader to patch the locations where the as yet undefined symbols had been used.
• Multi-pass assemblers create a table with all symbols and their values in the first passes, then use the table in later passes to generate code.

In both cases, the assembler must be able to determine the size of each instruction on the initial passes in order to calculate the addresses of subsequent symbols. This means that if the size of an operation referring to an operand defined later depends on the type or distance of the operand, the assembler will make a pessimistic estimate when first encountering the operation, and if necessary, pad it with one or more "no-operation" instructions in a later pass or the errata. In an assembler with peephole optimization, addresses may be recalculated between passes to allow replacing pessimistic code with code tailored to the exact distance from the target.

The original reason for the use of one-pass assemblers was memory size and speed of assembly; often a second pass would require storing the symbol table in memory (to handle forward references), rewinding and rereading the program source on tape, or rereading a deck of cards or punched paper tape. Later computers with much larger memories (especially disc storage) had the space to perform all necessary processing without such re-reading. The advantage of the multi-pass assembler is that the absence of errata makes the linking process (or the program load if the assembler directly produces executable code) faster.
Example: in the following code snippet, a one-pass assembler would be able to determine the address of the backward reference BKWD when assembling statement S2, but would not be able to determine the address of the forward reference FWD when assembling the branch statement S1; indeed, FWD may be undefined. A two-pass assembler would determine both addresses in pass 1, so they would be known when generating code in pass 2.

    S1    B     FWD
          ...
    FWD   EQU   *
          ...
    BKWD  EQU   *
          ...
    S2    B     BKWD
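The difference between the two approaches can be sketched in Python for a hypothetical machine in which every instruction occupies one address unit and EQU * binds a label to the current location. The tuple representation of instructions, the NOP lines standing in for the elided statements, and all function names are assumptions chosen for illustration only.

```python
SOURCE = [
    "S1    B    FWD",
    "      NOP",
    "FWD   EQU  *",
    "      NOP",
    "BKWD  EQU  *",
    "      NOP",
    "S2    B    BKWD",
]


def parse(line):
    """Split a line into (label, op, operand); a label must start in column 0."""
    fields = line.split()
    label = fields.pop(0) if not line[0].isspace() else None
    op = fields[0]
    operand = fields[1] if len(fields) > 1 else None
    return label, op, operand


def pass1(source):
    """First pass: build the symbol table without generating any code."""
    symbols, location = {}, 0
    for line in source:
        label, op, _ = parse(line)
        if label:
            symbols[label] = location
        if op != "EQU":          # EQU defines a symbol but emits nothing
            location += 1
    return symbols


def pass2(source, symbols):
    """Second pass: every symbol is known, so branches are emitted directly."""
    code = []
    for line in source:
        _, op, operand = parse(line)
        if op == "EQU":
            continue
        code.append((op, symbols[operand] if op == "B" else None))
    return code


def one_pass(source):
    """Single pass: forward references get a placeholder and a fixup record."""
    symbols, code, fixups = {}, [], []
    for line in source:
        label, op, operand = parse(line)
        if label:
            symbols[label] = len(code)
        if op == "EQU":
            continue
        if op == "B" and operand not in symbols:
            fixups.append((len(code), operand))   # patch this slot later
            code.append(("B", None))
        else:
            code.append((op, symbols[operand] if op == "B" else None))
    return code, fixups, symbols


def apply_fixups(code, fixups, symbols):
    """Stand-in for the linker or loader step that patches the placeholders."""
    patched = list(code)
    for index, name in fixups:
        patched[index] = ("B", symbols[name])
    return patched


if __name__ == "__main__":
    print(pass2(SOURCE, pass1(SOURCE)))
    code, fixups, symbols = one_pass(SOURCE)
    print(fixups)
    print(apply_fixups(code, fixups, symbols))
```

In this sketch the one-pass variant resolves the backward branch at S2 immediately and records a single fixup for the forward branch at S1, which plays the role of the errata described above.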
High-level assemblers

More sophisticated high-level assemblers provide language abstractions such as:

• High-level procedure/function declarations and invocations
• Advanced control structures (IF/THEN/ELSE, SWITCH)
• High-level abstract data types, including structures/records, unions, classes, and sets
• Sophisticated macro processing (although available on ordinary assemblers since the late 1950s for, e.g., the IBM 700 series and IBM 7000 series, and since the 1960s for IBM System/360 (S/360), amongst other machines)
• Object-oriented programming features such as classes, objects, abstraction, polymorphism, and inheritance

See Language design below for more details.
Assembly language

A program written in assembly language consists of a series of mnemonic processor instructions and meta-statements (known variously as declarative operations, directives, pseudo-instructions, pseudo-operations and pseudo-ops), comments and data. Assembly language instructions usually consist of an opcode mnemonic followed by an operand, which might be a list of data, arguments or parameters. Some instructions may be "implied", which means the data upon which the instruction operates is implicitly defined by the instruction itself; such an instruction does not take an operand. The resulting statement is translated by an assembler into machine language instructions that can be loaded into memory and executed.

For example, the instruction below tells an x86/IA-32 processor to move an immediate 8-bit value into a register. The binary code for this instruction is 10110 followed by a 3-bit identifier for which register to use. The identifier for the AL register is 000, so the following machine code loads the AL register with the data 01100001.

    10110000 01100001

This binary computer code can be made more human-readable by expressing it in hexadecimal as follows.

    B0 61

Here, B0 means "Move a copy of the following value into AL", and 61 is a hexadecimal representation of the value 01100001, which is 97 in decimal. Assembly language for the 8086 family provides the mnemonic MOV (an abbreviation of move) for instructions such as this, so the machine code above can be written as follows in assembly language, complete with an explanatory comment if required, after the semicolon. This is much easier to read and to remember.

    MOV AL, 61h       ; Load AL with 97 decimal (61 hex)

In some assembly languages (including this one) the same mnemonic, such as MOV, may be used for a family of related instructions for loading, copying and moving data, whether these are immediate values, values in registers, or memory locations pointed to by values in registers or by direct addresses embedded in the instruction. Other assemblers may use separate opcode mnemonics such as L for "move memory to register", ST for "move register to memory", LR for "move register to register", MVI for "move immediate operand to memory", etc.

If the same mnemonic is used for different instructions, that means that the mnemonic corresponds to several different binary instruction codes, excluding data (e.g. the 61h in this example), depending on the operands that follow the mnemonic. For example, for the x86/IA-32 CPUs, the Intel assembly language syntax MOV AL, AH represents an instruction that moves the contents of register AH into register AL. The hexadecimal form of this instruction is:

    88 E0

The first byte, 88h, identifies a move between a byte-sized register and either another register or memory, and the second byte, E0h, is encoded (with three bit-fields) to specify that both operands are registers, the source is AH, and the destination is AL.

In a case like this where the same mnemonic can represent more than one binary instruction, the assembler determines which instruction to generate by examining the operands. In the first example, the operand 61h is a valid hexadecimal numeric constant and is not a valid register name, so only the B0 instruction can be applicable. In the second example, the operand AH is a valid register name and not a valid numeric constant (hexadecimal, decimal, octal, or binary), so only the 88 instruction can be applicable.

Assembly languages are always designed so that their syntax rules out this kind of ambiguity. For example, in the Intel x86 assembly language, a hexadecimal constant must start with a decimal digit, so that the hexadecimal number 'A' (equal to decimal ten) would be written as 0Ah or 0AH, not AH, specifically so that it cannot appear to be the name of register AH. (The same rule also prevents ambiguity with the names of registers BH, CH, and DH, as well as with any user-defined symbol that ends with the letter H and otherwise contains only characters that are hexadecimal digits, such as the word "BEACH".)

Returning to the original example, while the x86 opcode 10110000 (B0) copies an 8-bit value into the AL register, 10110001 (B1) moves it into CL and 10110010 (B2) does so into DL. Assembly language examples for these follow.

    MOV AL, 1h        ; Load AL with immediate value 1
    MOV CL, 2h        ; Load CL with immediate value 2
    MOV DL, 3h        ; Load DL with immediate value 3

The syntax of MOV can also be more complex as the following examples show.

    MOV EAX, [EBX]    ; Move the 4 bytes in memory at the address contained in EBX into EAX
    MOV [ESI+EAX], CL ; Move the contents of CL into the byte at address ESI+EAX
    MOV DS, DX        ; Move the contents of DX into segment register DS

In each case, the MOV mnemonic is translated directly into one of the opcodes 88-8C, 8E, A0-A3, B0-BF, C6 or C7 by an assembler, and the programmer normally does not have to know or remember which.

An assembler transforms assembly language into machine code, and the reverse can at least partially be achieved by a disassembler. Unlike high-level languages, there is a one-to-one correspondence between many simple assembly statements and machine language instructions. However, in some cases, an assembler may provide pseudoinstructions (essentially macros) which expand into several machine language instructions to provide commonly needed functionality. For example, for a machine that lacks a "branch if greater or equal" instruction, an assembler may provide a pseudoinstruction that expands to the machine's "set if less than" and "branch if zero (on the result of the set instruction)". Most full-featured assemblers also provide a rich macro language (discussed below) which is used by vendors and programmers to generate more complex code and data sequences. Since the information about pseudoinstructions and macros defined in the assembler environment is not present in the object program, a disassembler cannot reconstruct the macro and pseudoinstruction invocations but can only disassemble the actual machine instructions that the assembler generated from those abstract assembly-language entities. Likewise, since comments in the assembly language source file are ignored by the assembler and have no effect on the object code it generates, a disassembler is always completely unable to recover source comments.

Each computer architecture has its own machine language. Computers differ in the number and type of operations they support, in the different sizes and numbers of registers, and in the representations of data in storage. While most general-purpose computers are able to carry out essentially the same functionality, the ways they do so differ; the corresponding assembly languages reflect these differences.

Multiple sets of mnemonics or assembly-language syntax may exist for a single instruction set, typically instantiated in different assembler programs. In these cases, the most popular one is usually that supplied by the CPU manufacturer and used in its documentation.

Two examples of CPUs that have two different sets of mnemonics are the Intel 8080 family and the Intel 8086/8088. Because Intel claimed copyright on its assembly language mnemonics (on each page of their documentation published in the 1970s and early 1980s, at least), some companies that independently produced CPUs compatible with Intel instruction sets invented their own mnemonics. The Zilog Z80 CPU, an enhancement of the Intel 8080A, supports all the 8080A instructions plus many more; Zilog invented an entirely new assembly language, not only for the new instructions but also for all of the 8080A instructions. For example, where Intel uses the mnemonics MOV, MVI, LDA, STA, LXI, LDAX, STAX, LHLD, and SHLD for various data transfer instructions, the Z80 assembly language uses the mnemonic LD for all of them. A similar case is the NEC V20 and V30 CPUs, enhanced copies of the Intel 8086 and 8088, respectively. Like Zilog with the Z80, NEC invented new mnemonics for all of the 8086 and 8088 instructions, to avoid accusations of infringement of Intel's copyright. (It is questionable whether such copyrights can be valid, and later CPU companies such as AMD and Cyrix republished Intel's x86/IA-32 instruction mnemonics exactly with neither permission nor legal penalty.) It is doubtful whether in practice many people who programmed the V20 and V30 actually wrote in NEC's assembly language rather than Intel's; since any two assembly languages for the same instruction set architecture are isomorphic (somewhat like English and Pig Latin), there is no requirement to use a manufacturer's own published assembly language with that manufacturer's products.
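The way an assembler chooses between the B0 and 88 encodings of MOV by examining the operands, as described earlier in this section, can be sketched in Python. The sketch covers only 8-bit register destinations with either an 8-bit register or an immediate source; the function names are hypothetical, and the number parser is a simplified assumption that accepts only decimal constants and hexadecimal constants with an h suffix.

```python
# Register numbers for the 8-bit x86 registers as used in instruction encodings.
REG8 = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}


def parse_number(text):
    """Return the value of an Intel-style constant, or None if it is not one.

    A constant must begin with a decimal digit, which is why the text 'AH'
    is never mistaken for the hexadecimal number A.
    """
    token = text.upper()
    if not token or not token[0].isdigit():
        return None
    try:
        if token.endswith("H"):
            return int(token[:-1], 16)
        return int(token, 10)
    except ValueError:
        return None


def assemble_mov8(dst, src):
    """Encode MOV dst, src for an 8-bit register destination."""
    dst_name, src_name = dst.upper(), src.upper()
    if dst_name not in REG8:
        raise ValueError(f"unsupported destination: {dst!r}")
    if src_name in REG8:
        # Register to register: opcode 88, then a ModR/M byte with mod=11,
        # the source in the middle field and the destination in the low field.
        modrm = 0b11000000 | (REG8[src_name] << 3) | REG8[dst_name]
        return bytes([0x88, modrm])
    value = parse_number(src)
    if value is None or not 0 <= value <= 0xFF:
        raise ValueError(f"unsupported source operand: {src!r}")
    # Immediate to register: opcode B0 plus the register number, then the data.
    return bytes([0xB0 + REG8[dst_name], value])


if __name__ == "__main__":
    for dst, src in [("AL", "61h"), ("AL", "AH"), ("CL", "2h"), ("AL", "0Ah")]:
        print(f"MOV {dst}, {src:<4} -> {assemble_mov8(dst, src).hex(' ').upper()}")
```

The register lookup is tried first and the constant parser insists on a leading digit, so each operand text can be classified in exactly one way, mirroring the syntax rule for hexadecimal constants discussed above.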
"Hello, world!" on bare hardware "Hello, world!" can be printed using 32-bit assembly language for an
x86 processor with little help from an operating system. The call outchr instruction calls some mechanism that prints the character in AL to the console. The string must have non-zero length and be terminated with a zero byte.

    hello:  mov esi,msg     ; address of string into ESI
            cld             ; Set direction to increment ESI
            lodsb           ; Load first char in AL, inc ESI
    chrlp:  call outchr     ; Print character in AL
            lodsb           ; Load next character in AL, inc ESI
            or al, al       ; Is it a zero terminator?
            jnz chrlp       ; If not, continue
            ret             ; Return to caller
    msg:    db 'Hello, world!', 0xa, 0x0 ; string to be printed

"Hello, world!" on x86 Linux

In 32-bit assembly language for Linux on an
x86 processor, "Hello, world!" would be printed by a single operating system call:

    section .text           ; start of the code segment
    global _start           ; declare _start to be visible in the generated object file
    _start:
        mov edx,len         ; length of string, third argument to write()
        mov ecx,msg         ; address of string, second argument to write()
        mov ebx,1           ; file descriptor (standard output), first argument to write()
        mov eax,4           ; system call number for write()
        int 0x80            ; system call trap

        mov ebx,0           ; exit code, first argument to exit()
        mov eax,1           ; system call number for exit()
        int 0x80            ; system call trap

    section .data           ; start of data segment
    msg db 'Hello, world!', 0xa ; string to be printed
    len equ $ - msg         ; length of that string as a constant calculated at assembly time

==Language design==