Fixed-point numbers

Fixed-point formatting can be useful to represent fractions in binary. The number of bits used for the integer part and for the fractional part must be chosen to give the desired range and precision. For instance, using a 32-bit format, 16 bits may be used for the integer and 16 for the fraction.

The eight's bit is followed by the four's bit, then the two's bit, then the one's bit. The fractional bits continue the pattern set by the integer bits. The next bit is the half's bit, then the quarter's bit, then the eighth's bit, and so on. For example, the binary value 0110.1010 represents 4 + 2 + 1/2 + 1/8 = 6.625.

This form of encoding cannot represent some values exactly in binary. For example, the fraction 1/5 (0.2 in decimal) has the binary expansion 0.001100110011..., which repeats indefinitely, so a finite number of bits can only approximate it. Even if more digits are used, an exact representation is impossible. The same happens in decimal: the number 1/3, written in decimal as 0.333333333..., continues indefinitely. If prematurely terminated, the value would not represent 1/3 precisely.
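The 16-bit integer, 16-bit fraction layout mentioned above can be tried out with a short Python sketch. The helper names (`to_fixed`, `from_fixed`, `is_exact`) are illustrative and not taken from any library, and the sketch makes no attempt to check that a value fits in 32 bits. It uses exact rationals from the standard `fractions` module so that rounding in the fixed-point encoding is the only source of error.

```python
from fractions import Fraction

FRAC_BITS = 16            # assumed layout: 16 integer bits, 16 fractional bits
SCALE = 1 << FRAC_BITS    # one unit in the last place is 1/65536


def to_fixed(value: Fraction) -> int:
    """Round an exact rational to the nearest raw fixed-point integer."""
    return round(value * SCALE)


def from_fixed(raw: int) -> Fraction:
    """Return the exact rational that a raw fixed-point integer stands for."""
    return Fraction(raw, SCALE)


def is_exact(value: Fraction) -> bool:
    """True if the value survives a round trip through the encoding unchanged."""
    return from_fixed(to_fixed(value)) == value


# 1011.101 in binary = 8 + 2 + 1 + 1/2 + 1/8 = 93/8, a sum of powers of two.
print(is_exact(Fraction(93, 8)))

# 1/5 is not a finite sum of powers of two, so some error always remains.
raw = to_fixed(Fraction(1, 5))
print(raw, is_exact(Fraction(1, 5)), Fraction(1, 5) - from_fixed(raw))
```

Increasing `FRAC_BITS` shrinks the remaining error for 1/5 but never removes it, which is the limitation described in the text.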
Floating-point numbers

While both unsigned and signed integers are used in digital systems, even a 32-bit integer cannot cover the range of numbers a calculator can handle, and that is before fractions are considered. To approximate the greater range and precision of real numbers, we have to abandon signed integers and fixed-point numbers and go to a "floating-point" format.

In the decimal system, we are familiar with floating-point numbers of the form (scientific notation):

: 1.1030402 × 10^5 = 1.1030402 × 100000 = 110304.02

or, more compactly:

: 1.1030402E5

which means "1.1030402 times 1 followed by 5 zeroes". We have a certain numeric value (1.1030402) known as a "significand", multiplied by a power of 10 (E5, meaning 10^5 or 100,000), known as an "exponent". If we have a negative exponent, that means the number is multiplied by a 1 that many places to the right of the decimal point. For example:

: 2.3434E−6 = 2.3434 × 10^−6 = 2.3434 × 0.000001 = 0.0000023434

The advantage of this scheme is that by using the exponent we can get a much wider range of numbers, even if the number of digits in the significand, or the "numeric precision", is much smaller than the range. Similar binary floating-point formats can be defined for computers. There are a number of such schemes; the most popular has been defined by the
Institute of Electrical and Electronics Engineers (IEEE). The
IEEE 754-2008 standard specification defines a 64-bit floating-point format with:

• an 11-bit binary exponent, using "excess-1023" format. Excess-1023 means the exponent appears as an unsigned binary integer from 0 to 2047; subtracting 1023 gives the actual signed value
• a 52-bit significand, also an unsigned binary number, defining a fractional value with a leading implied "1"
• a sign bit, giving the sign of the number.

With the bits stored in 8 bytes of memory (listed here from the most significant byte to the least significant):

: byte 0: S x10 x9 x8 x7 x6 x5 x4
: byte 1: x3 x2 x1 x0 m51 m50 m49 m48
: byte 2: m47 m46 m45 m44 m43 m42 m41 m40
: byte 3: m39 m38 m37 m36 m35 m34 m33 m32
: byte 4: m31 m30 m29 m28 m27 m26 m25 m24
: byte 5: m23 m22 m21 m20 m19 m18 m17 m16
: byte 6: m15 m14 m13 m12 m11 m10 m9 m8
: byte 7: m7 m6 m5 m4 m3 m2 m1 m0

where "S" denotes the sign bit, "x" denotes an exponent bit, and "m" denotes a significand bit. Once the bits here have been extracted, they are converted with the computation:

: <sign> × (1 + <fractional significand>) × 2^(<exponent> − 1023)

This scheme provides numbers valid out to about 15 decimal digits. Under this formula, magnitudes range from about 2.2 × 10^−308 to about 1.8 × 10^308, for both positive and negative numbers.

The specification also defines several special values that are not defined numbers, and are known as NaNs, for "Not A Number". These are used by programs to designate invalid operations and the like.

Some programs also use 32-bit floating-point numbers. The most common scheme uses a 23-bit significand with a sign bit, plus an 8-bit exponent in "excess-127" format, giving seven valid decimal digits. The bits are converted to a numeric value with the computation:

: <sign> × (1 + <fractional significand>) × 2^(<exponent> − 127)

leading to magnitudes from about 1.2 × 10^−38 to about 3.4 × 10^38.

Such floating-point numbers are known as "reals" or "floats" in general, but with a number of variations. A 32-bit float value is sometimes called a "real32" or a "single", meaning "single-precision floating-point value". A 64-bit float is sometimes called a "real64" or a "double", meaning "double-precision floating-point value".

The relation between numbers and bit patterns is chosen for convenience in computer manipulation; eight bytes stored in computer memory may represent a 64-bit real, two 32-bit reals, or four signed or unsigned 16-bit integers, or some other kind of data that fits into eight bytes. The only difference is how the computer interprets them. If the computer stored four unsigned integers and then read them back from memory as a 64-bit real, it almost always would be a perfectly valid real number, though it would be junk data.

Only a finite range of real numbers can be represented with a given number of bits. Arithmetic operations can overflow or underflow, producing a value too large or too small to be represented.

The representation also has limited precision. For example, only about 15 decimal digits can be represented with a 64-bit real. If a very small floating-point number is added to a large one, the result is just the large one. The small number was too small to even show up in 15 or 16 digits of resolution, and the computer effectively discards it. Analyzing the effect of limited precision is a well-studied problem.
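The 64-bit conversion formula, the reinterpretation of the same eight bytes, and the loss of a small addend can all be observed from Python. The sketch below uses the standard `struct` module. The helper names `decode_double` and `rebuild_normal` are illustrative only, and the rebuild step assumes a normal number: it deliberately ignores zeros, subnormals, infinities and NaNs, which have their own encodings.

```python
import struct

EXPONENT_BIAS = 1023   # "excess-1023"
FRACTION_BITS = 52     # stored significand bits (the leading 1 is implied)


def decode_double(x: float) -> tuple:
    """Return (sign bit, stored exponent, 52-bit fraction) of a 64-bit float."""
    # Pack as a big-endian double, then reread the same 8 bytes as one unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> FRACTION_BITS) & 0x7FF
    fraction = bits & ((1 << FRACTION_BITS) - 1)
    return sign, exponent, fraction


def rebuild_normal(sign: int, exponent: int, fraction: int) -> float:
    """Apply sign x (1 + fraction) x 2^(exponent - 1023); normal numbers only."""
    significand = 1 + fraction / 2**FRACTION_BITS
    return (-1.0) ** sign * significand * 2.0 ** (exponent - EXPONENT_BIAS)


# -6.5 = -1.625 x 2^2, so the sign bit is set and the stored exponent is 1023 + 2.
sign, exponent, fraction = decode_double(-6.5)
print(sign, exponent - EXPONENT_BIAS, fraction / 2**FRACTION_BITS)
print(rebuild_normal(sign, exponent, fraction))

# The same 8 bytes interpreted as four unsigned 16-bit integers instead.
print(struct.unpack(">4H", struct.pack(">d", -6.5)))

# Limited precision: 1.0 is far below the resolution of a value near 10^20.
print(1.0e20 + 1.0 == 1.0e20)
```

The big-endian format codes (`">d"`, `">Q"`, `">4H"`) are chosen only so that the most significant byte comes first in the packed bytes; they do not reflect how any particular machine lays the bytes out in memory.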
Estimates of the magnitude of round-off errors and methods to limit their effect on large calculations are part of any large computation project. The precision limit is different from the range limit, as it affects the significand, not the exponent.

The significand is a binary fraction that doesn't necessarily perfectly match a decimal fraction. In many cases a sum of reciprocal powers of 2 does not match a specific decimal fraction, and the results of computations will be slightly off. For example, the decimal fraction "0.1" is equivalent to an infinitely repeating binary fraction:

: 0.000110011 ...

==Numbers in programming languages==