Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a unified standard for character encoding. Rather than mapping characters directly to bytes, Unicode separately defines a coded character set that maps characters to unique natural numbers (code points), how those code points are mapped to a series of fixed-size natural numbers (code units), and finally how those units are encoded as a stream of octets (bytes). The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. To describe the model precisely, Unicode uses existing terms and defines new terms.
Abstract character repertoire
An abstract character repertoire (ACR) is the full set of abstract characters that a system supports. Unicode has an open repertoire, meaning that new characters will be added to the repertoire over time.
Coded character set
A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on. Multiple coded character sets may share the same character repertoire; for example, ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover the same repertoire but map its characters to different code points.
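The point about shared repertoires can be illustrated with a minimal sketch. It assumes Python's bundled codecs `latin_1`, `cp037` and `cp500` as stand-ins for ISO/IEC 8859-1 and the two IBM code pages; the helper name `code_point_in` is hypothetical. Because all three are single-byte sets, the byte a codec produces is the number that set assigns to the character.

```python
# Codec names are Python's own spellings, used here as an assumption for
# ISO/IEC 8859-1 and IBM code pages 037 and 500.
CODED_SETS = ["latin_1", "cp037", "cp500"]


def code_point_in(charset: str, char: str) -> int:
    """Return the number a single-byte coded character set assigns to char."""
    encoded = char.encode(charset)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single-byte character in {charset}")
    return encoded[0]


if __name__ == "__main__":
    # Print the number each set assigns to the same two characters.
    for ch in "A[":
        numbers = {name: hex(code_point_in(name, ch)) for name in CODED_SETS}
        print(repr(ch), numbers)
```

ISO/IEC 8859-1 places "A" at 65, as in the example above, while the EBCDIC-based code pages place it elsewhere; the two IBM pages agree on letters but differ on some punctuation, such as "[".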
Character encoding form
A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (say, 65,536 to 1,114,111, the top of the Unicode code space) could be represented by using multiple 16-bit units. This correspondence is defined by a CEF.
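As a concrete instance of such a correspondence, the sketch below writes out the UTF-16 encoding form by hand. The function name `utf16_code_units` is hypothetical. The surrogate arithmetic is how UTF-16 splits a code point above 65,535 across two 16-bit units; it is not spelled out in the text above.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Map one Unicode code point to its UTF-16 code units (16-bit integers)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Fits directly in a single 16-bit unit.
        return [code_point]
    # Larger code points: spread the 20-bit offset over two units.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for cp in (0x41, 0x10400):
        print(f"U+{cp:04X} ->", [f"0x{u:04X}" for u in utf16_code_units(cp)])
```

The result is still a list of numbers, not bytes; deciding how each 16-bit unit is laid out as octets is a separate step.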
Character encoding scheme
A character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE, and UTF-32LE. Compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using a byte order mark or escape sequences. Compressing schemes, such as SCSU and BOCU, try to minimize the number of bytes used per code unit. Although UTF-32BE and UTF-32LE are simpler CESes, most systems working with Unicode use either UTF-8, which is backward compatible with fixed-length ASCII and maps Unicode code points to variable-length sequences of octets, or UTF-16BE, which is backward compatible with fixed-length UCS-2BE and maps Unicode code points to variable-length sequences of 16-bit words. See comparison of Unicode encodings for a detailed discussion.
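The difference between simple and compound schemes can be seen with a short sketch. It assumes Python's codec names (`utf-16-be`, `utf-16-le`, and so on) correspond to the schemes listed above; the sample string and variable names are arbitrary.

```python
import codecs

# "a" followed by U+10400, a character outside the 16-bit range.
text = "a\U00010400"

# Simple schemes: the byte order is fixed by the scheme itself.
SIMPLE_SCHEMES = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
serialized = {scheme: text.encode(scheme) for scheme in SIMPLE_SCHEMES}

# Compound scheme: Python's "utf-16" codec prepends a byte order mark and
# then uses the platform's native byte order, so the first two bytes vary.
with_bom = text.encode("utf-16")
bom_is_big_endian = with_bom.startswith(codecs.BOM_UTF16_BE)

if __name__ == "__main__":
    for scheme, octets in serialized.items():
        print(f"{scheme:10} {octets.hex(' ')}")
    print(f"{'utf-16':10} {with_bom.hex(' ')}  (big-endian BOM: {bom_is_big_endian})")
```

A decoder given the compound form reads the byte order mark to decide which simple scheme the remaining octets follow, which is why `with_bom.decode("utf-16")` recovers the text on any platform.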
Higher-level protocol
There may be a higher-level protocol which supplies additional information to select the particular variant of a Unicode character, particularly where there are regional variants that have been 'unified' in Unicode as the same character. An example is the XML attribute xml:lang.

The Unicode model uses the term "character map" for other systems which directly assign a sequence of characters to a sequence of bytes, covering all of the CCS, CEF and CES layers.
==Code point documentation==
A character is commonly documented as 'U+' followed by its code point value in hexadecimal. The range of valid code points (the code space) for the Unicode standard is U+0000 to U+10FFFF, inclusive, divided into 17 planes, identified by the numbers 0 to 16. Characters in the range U+0000 to U+FFFF are in plane 0, called the Basic Multilingual Plane (BMP). This plane contains the most commonly used characters. Characters in the range U+10000 to U+10FFFF in the other planes are called supplementary characters. The following table includes examples of code points:
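The U+ notation and the plane arithmetic described above can also be sketched in code. The helper name `describe` and its output format are hypothetical; the sketch relies only on the fact that each of the 17 planes spans 65,536 code points, so the plane number is the code point shifted right by 16 bits.

```python
def describe(code_point: int) -> str:
    """Format a code point in U+ notation together with its plane."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space U+0000..U+10FFFF")
    plane = code_point >> 16  # each plane holds 0x10000 code points
    kind = "BMP" if plane == 0 else "supplementary"
    # At least four hexadecimal digits, more when the value needs them.
    return f"U+{code_point:04X} plane {plane} ({kind})"


if __name__ == "__main__":
    for ch in "A\u00df\U00010400":
        print(repr(ch), describe(ord(ch)))
```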
==Example==
Consider "ab̲c𐐀", a string containing a Unicode combining character (U+0332 COMBINING LOW LINE, which underlines the "b") as well as a supplementary character (U+10400 𐐀 DESERET CAPITAL LETTER LONG I). This string has several Unicode representations which are logically equivalent, yet each is suited to a different set of circumstances or range of requirements:
• Four composed characters: a, b̲, c, 𐐀
• Five graphemes: a, b, ◌̲ (the combining low line), c, 𐐀
• Five Unicode code points: U+0061, U+0062, U+0332, U+0063, U+10400
• Five UTF-32 code units (32-bit integer values): 0x00000061, 0x00000062, 0x00000332, 0x00000063, 0x00010400
• Six UTF-16 code units (16-bit integers): 0x0061, 0x0062, 0x0332, 0x0063, 0xD801, 0xDC00
• Nine UTF-8 code units (8-bit values, or bytes): 0x61, 0x62, 0xCC, 0xB2, 0x63, 0xF0, 0x90, 0x90, 0x80
Note in particular that 𐐀 is represented with either one 32-bit value (UTF-32), two 16-bit values (UTF-16), or four 8-bit values (UTF-8). Although each of those forms uses the same total number of bits (32) to represent the grapheme, it is not obvious how the actual numeric byte values are related.
==Transcoding==