== Codespace and code points ==
The Unicode Standard defines a codespace: a sequence of integers called code points in the range from 0 to 1,114,111, notated according to the standard as U+0000–U+10FFFF. The codespace is a systematic, architecture-independent representation of The Unicode Standard; actual text is processed as binary data via one of several Unicode encodings, such as UTF-8. In this normative notation, the two-character prefix U+ always precedes a written code point, and the code points themselves are written as hexadecimal numbers. At least four hexadecimal digits are always written, with leading zeros prepended as needed. For example, the code point U+00F7 (the division sign) is padded with two leading zeros, but U+13254 (an Egyptian hieroglyph, five digits) is not padded. There are a total of 1,114,112 valid code points within the codespace. This number arises from the limitations of the UTF-16 character encoding, which can encode the 2^16 code points in the range U+0000 through U+FFFF except for the 2^11 code points in the range U+D800 through U+DFFF, which are used as surrogate pairs to encode the 2^20 code points in the range U+10000 through U+10FFFF.
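This arithmetic can be made concrete. The following is a minimal sketch in Python (the function names are illustrative, not part of any standard API) that splits a supplementary code point into its UTF-16 surrogate pair and recombines it:

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into
    its UTF-16 high and low surrogates."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> U+DC00..U+DFFF
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1F600 (a supplementary-plane emoji) becomes D83D DE00 in UTF-16.
high, low = to_surrogate_pair(0x1F600)
assert (high, low) == (0xD83D, 0xDE00)
assert from_surrogate_pair(high, low) == 0x1F600
```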
== Code planes and blocks ==
The Unicode codespace is divided into 17
planes, numbered 0 to 16. Plane 0 is the
Basic Multilingual Plane (BMP), and contains the most commonly used characters. All code points in the BMP are accessed as a single code unit in UTF-16 encoding and can be encoded in one, two or three bytes in UTF-8. Code points in planes 1 through 16 (the
supplementary planes) are accessed as surrogate pairs in
UTF-16 and encoded in four bytes in
UTF-8. Within each plane, characters are allocated within named
blocks of related characters. The size of a block is always a multiple of 16, and is often a multiple of 128, but is otherwise arbitrary. Characters required for a given script may be spread out over several different, potentially disjunct blocks within the codespace.
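Because each plane spans exactly 2^16 code points, the plane of a code point can be computed by shifting its value right by 16 bits; a brief illustration in Python:

```python
def plane(cp: int) -> int:
    """Return the plane number (0-16) of a code point."""
    assert 0 <= cp <= 0x10FFFF
    return cp >> 16  # each plane holds 0x10000 code points

assert plane(ord("A")) == 0   # U+0041 lies in the BMP
assert plane(0x1F600) == 1    # emoji live in Plane 1
assert plane(0xE01EF) == 14   # variation selectors supplement
```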
== General Category property ==
Each code point is assigned a classification, listed as the code point's
General Category property. Here, at the uppermost level, code points are categorized as one of Letter, Mark, Number, Punctuation, Symbol, Separator, or Other. Under each category, each code point is then further subcategorized. In most cases, other properties must be used to adequately describe all the characteristics of any given code point. The points in the range U+D800–U+DBFF are known as high-surrogate code points, and code points in the range U+DC00–U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point followed by a low-surrogate code point forms a surrogate pair in UTF-16 in order to represent code points greater than U+FFFF. In principle, these code points cannot otherwise be used, though in practice this rule is often ignored, especially when not using UTF-16. A small set of code points are guaranteed never to be assigned to characters, although third parties may make independent use of them at their discretion. There are 66 of these noncharacters: U+FDD0–U+FDEF and the last two code points in each of the 17 planes (e.g. U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined. Like surrogates, the rule that these cannot be used is often ignored, although the operation of the byte order mark assumes that U+FFFE will never be the first code point in a text. The exclusion of surrogates and noncharacters leaves 1,111,998 code points available for use.
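For illustration, Python's standard unicodedata module reports the General Category directly, and the surrogate and noncharacter ranges above can be written as simple predicates (the helper names are ours, not a library API):

```python
import unicodedata

# Two-letter General Category codes: the first letter is the top-level
# category (L = Letter, N = Number, P = Punctuation, ...).
assert unicodedata.category("A") == "Lu"       # Letter, uppercase
assert unicodedata.category("7") == "Nd"       # Number, decimal digit
assert unicodedata.category("\u00e9") == "Ll"  # Letter, lowercase

def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# 32 + 2 * 17 = 66 noncharacters in total.
assert sum(is_noncharacter(cp) for cp in range(0x110000)) == 66
```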
Private use code points are considered to be assigned, but they intentionally have no interpretation specified by The Unicode Standard, such that any interchange of such code points requires an independent agreement between the sender and receiver as to their interpretation. There are three private use areas in the Unicode codespace:
• Private Use Area: U+E000–U+F8FF (6,400 characters)
• Supplementary Private Use Area-A: U+F0000–U+FFFFD (65,534 characters)
• Supplementary Private Use Area-B: U+100000–U+10FFFD (65,534 characters)
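For illustration, a small predicate covering all three areas (the function name is ours); code points in these ranges carry General Category Co:

```python
import unicodedata

def is_private_use(cp: int) -> bool:
    """True for the BMP Private Use Area and both supplementary areas."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

assert is_private_use(0xE000)
assert unicodedata.category(chr(0xE000)) == "Co"  # Other, private use
```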
Graphic characters are those defined by The Unicode Standard to have particular semantics, either having a visible glyph shape or representing a visible space. As of Unicode 17.0, there are more than 150,000 graphic characters.
Format characters are characters that do not have a visible appearance but may have an effect on the appearance or behavior of neighboring characters. For example, U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER may be used to change the default shaping behavior of adjacent characters (e.g. to inhibit ligatures or request ligature formation). There are 172 format characters in Unicode 17.0.
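As an illustration, both joiners are ordinary code points that can be inserted into strings like any other; the sketch below (Python) builds the family emoji as a ZWJ sequence and uses ZWNJ to inhibit joining in Persian text:

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# U+1F468 MAN + ZWJ + U+1F469 WOMAN + ZWJ + U+1F467 GIRL: a renderer
# that supports this sequence shows a single family glyph.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"
print(family, len(family))  # 5 code points, ideally one visible glyph

# In Persian, ZWNJ keeps otherwise-connecting letters from joining.
print("می" + ZWNJ + "روم")  # "I go", written with a non-joined prefix
```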
65 code points, in the ranges U+0000–U+001F and U+007F–U+009F, are reserved as control codes, corresponding to the C0 and C1 control codes as defined in ISO/IEC 6429. Of these, U+0009 (tab), U+000A (line feed), and U+000D (carriage return) are widely used in texts using Unicode. In a phenomenon known as
mojibake, the C1 code points are improperly decoded according to the
Windows-1252 codepage, previously widely used in Western European contexts. Together, graphic, format, control code, and private use characters are collectively referred to as
assigned characters.
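This failure mode is easy to reproduce: encoding text as UTF-8 and decoding the resulting bytes as Windows-1252 turns each multi-byte sequence into unrelated accented letters and C1-range punctuation. A short demonstration in Python:

```python
text = "déjà vu — naïve"
# UTF-8 bytes misread through the Windows-1252 code page.
garbled = text.encode("utf-8").decode("windows-1252")
print(garbled)  # prints roughly 'dÃ©jÃ  vu â€” naÃ¯ve'
```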
Reserved code points are those code points that are valid and available for use, but have not yet been assigned. As of Unicode 17.0, the majority of the codespace, over 800,000 code points, remains reserved.
== Abstract characters ==
The set of graphic and format characters defined by Unicode does not correspond directly to the repertoire of
abstract characters representable under Unicode. Unicode encodes characters by associating an abstract character with a particular code point. However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters. For example, a Latin small letter "i" with an
ogonek, a
dot above, and an
acute accent, which is required in
Lithuanian, is represented by the character sequence U+012F LATIN SMALL LETTER I WITH OGONEK; U+0307 COMBINING DOT ABOVE; U+0301 COMBINING ACUTE ACCENT. Unicode maintains a list of uniquely named character sequences for abstract characters that are not directly encoded in Unicode. All assigned characters have a unique and immutable name by which they are identified. This immutability has been guaranteed since version 2.0 of
The Unicode Standard by its Name Stability policy.
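For illustration, the Lithuanian letter above can be assembled directly from its three code points in Python; since no single precomposed character covers the full combination, normalization leaves the sequence as it is:

```python
import unicodedata

# i-with-ogonek + combining dot above + combining acute accent
letter = "\u012F\u0307\u0301"
print(letter)  # three code points rendered as one abstract character

# No precomposed form exists for the whole combination, so NFC
# composition cannot reduce it to a single code point.
assert unicodedata.normalize("NFC", letter) == letter
```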
== Precomposed vis-à-vis composite characters ==
Unicode includes a mechanism for modifying characters that greatly extends the supported repertoire of glyphs. This covers the use of
combining diacritical marks that may be added after the base character by the user. Multiple combining diacritics may be simultaneously applied to the same character. Unicode also contains
precomposed versions of most letter/diacritic combinations in normal use. These make the conversion to and from legacy encodings simpler, and allow applications to use Unicode as an internal text format without having to implement combining characters. For example, é can be represented in Unicode as U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT, and equivalently as the precomposed character U+00E9 LATIN SMALL LETTER E WITH ACUTE. Thus, users often have multiple equivalent ways of encoding the same character. The mechanism of
canonical equivalence within
The Unicode Standard ensures the practical interchangeability of these equivalent encodings. An example of this arises with the Korean alphabet
Hangul: Unicode provides a mechanism for composing Hangul syllables from their individual
Hangul Jamo subcomponents. However, it also provides combinations of precomposed syllables made from the most common jamo.
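Canonical equivalence is made practical through normalization. In Python, unicodedata.normalize converts between the composed and decomposed forms of é, and the same machinery composes a Hangul syllable from its conjoining jamo:

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

assert decomposed != precomposed                      # different code points
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# Hangul: U+1100 KIYEOK + U+1161 A compose to the precomposed
# syllable U+AC00 (가) under NFC.
assert unicodedata.normalize("NFC", "\u1100\u1161") == "\uAC00"
```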
CJK characters presently only have codes for uncomposable radicals and precomposed forms. Most Han characters have either been intentionally composed from, or reconstructed as compositions of, simpler orthographic elements called
radicals, so in principle Unicode could have enabled their composition as it did with Hangul. While this could have greatly reduced the number of required code points, as well as allowing the algorithmic synthesis of many arbitrary new characters, the complexities of character etymologies and the post-hoc nature of radical systems add immense complexity to the proposal. Indeed, attempts to design CJK encodings on the basis of composing radicals have been met with difficulties resulting from the reality that Chinese characters do not decompose as simply or as regularly as Hangul does. The
CJK Radicals Supplement block is assigned to the range U+2E80–U+2EFF, and the Kangxi radicals are assigned to U+2F00–U+2FDF. The Ideographic Description Sequences block covers the range U+2FF0–U+2FFB, but The Unicode Standard warns against using its characters as an alternate representation for characters encoded elsewhere.
== Ligatures ==
Many scripts, including
Arabic and
Devanāgarī, have special orthographic rules that require certain combinations of letterforms to be combined into special
ligature forms. The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as ACE (Arabic Calligraphic Engine by DecoType in the 1980s and used to generate all the Arabic examples in the printed editions of
The Unicode Standard), which became the
proof of concept for
OpenType (by Adobe and Microsoft),
Graphite (by
SIL International), or
AAT (by Apple). Instructions are also embedded in fonts to tell the operating system how to properly output different character sequences. A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs. Real stacking is impossible but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Generally, this approach is only effective in monospaced fonts but may be used as a fallback rendering method when more complex methods fail.
== Standardized subsets ==
Several subsets of Unicode are standardized: Microsoft Windows since
Windows NT 4.0 supports
WGL-4 with 657 characters, which is considered to support all contemporary European languages using the Latin, Greek, or Cyrillic script. Other standardized subsets of Unicode include the Multilingual European Subsets: MES-1 (Latin scripts only; 335 characters), MES-2 (Latin, Greek, and Cyrillic; 1062 characters), and MES-3A & MES-3B (two larger subsets). MES-2 includes every character in MES-1 and WGL-4. The standard
DIN 91379 specifies a subset of Unicode letters, special characters, and sequences of letters and diacritic signs to allow the correct representation of names and to simplify data exchange in Europe. This standard supports all of the official languages of all European Union countries, as well as the German minority languages and the official languages of Iceland, Liechtenstein, Norway, and Switzerland. To allow the transliteration of names in other writing systems to the Latin script according to the relevant ISO standards, all necessary combinations of base letters and diacritic signs are provided. Rendering software that cannot process a Unicode character appropriately often displays it as an open rectangle, or as the Unicode replacement character (U+FFFD) to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. Apple's
Last Resort font will display a substitute glyph indicating the Unicode range of the character, and
SIL International's
Unicode fallback font will display a box showing the hexadecimal scalar value of the character.
== Mapping and encodings ==
Several mechanisms have been specified for storing a series of code points as a series of bytes. Unicode defines two mapping methods: the
Unicode Transformation Format (UTF) encodings, and the
Universal Coded Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode
code points to sequences of values in some fixed-size range, termed
code units. All UTF encodings map code points to a unique sequence of bytes. The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and
UTF-1). UTF-8 and UTF-16 are the most commonly used encodings.
UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent. UTF encodings include:
• UTF-8, which uses one to four 8-bit units per code point, and has maximal compatibility with ASCII
• UTF-16, which uses one 16-bit unit per code point below U+10000, and a surrogate pair of two 16-bit units per code point in the range U+10000 to U+10FFFF
• UTF-32, which uses one 32-bit unit per code point
• UTF-EBCDIC, not specified as part of The Unicode Standard, which uses one to five 8-bit units per code point, intended to maximize compatibility with EBCDIC
UTF-8 uses one to four 8-bit units (bytes) per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for the interchange of Unicode text. It is used by FreeBSD and most recent Linux distributions as a direct replacement for legacy encodings in general text handling.
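The trade-offs among these encodings show up directly in the encoded byte counts; a brief comparison in Python (the big-endian variants are used so that no BOM is prepended):

```python
for ch in ["A", "é", "€", "𝄞"]:  # U+0041, U+00E9, U+20AC, U+1D11E
    u8 = ch.encode("utf-8")
    u16 = ch.encode("utf-16-be")  # "-be" avoids an implicit BOM
    u32 = ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X}: UTF-8={len(u8)}, "
          f"UTF-16={len(u16)}, UTF-32={len(u32)} bytes")

# U+0041:  UTF-8=1, UTF-16=2, UTF-32=4
# U+00E9:  UTF-8=2, UTF-16=2, UTF-32=4
# U+20AC:  UTF-8=3, UTF-16=2, UTF-32=4
# U+1D11E: UTF-8=4, UTF-16=4 (surrogate pair), UTF-32=4
```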
The UCS-2 and UTF-16 encodings specify the Unicode byte order mark (BOM) for use at the beginnings of text files, which may be used for byte-order detection (or byte endianness detection). The BOM, encoded as U+FEFF, has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used; U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF in places other than the beginning of text conveys the zero-width non-break space. The same character converted to UTF-8 becomes the byte sequence EF BB BF.
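Byte-order detection from the BOM amounts to comparing a few byte prefixes; a sketch in Python (the function name is illustrative), noting that the UTF-32 patterns must be tested before the UTF-16 ones because the UTF-32-LE BOM begins with the UTF-16-LE BOM:

```python
def detect_bom(data: bytes) -> str | None:
    """Guess the encoding of a byte stream from a leading BOM, if any."""
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    return None  # no BOM: fall back to heuristics or metadata

assert detect_bom("hi".encode("utf-16")) in ("utf-16-le", "utf-16-be")
assert detect_bom(b"\xef\xbb\xbfplain") == "utf-8"
```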
The Unicode Standard allows that the BOM "can serve as a signature for UTF-8 encoded text where the character set is unmarked". Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit code pages. However, RFC 3629, the UTF-8 standard, recommends that byte order marks be forbidden in protocols using UTF-8, but discusses the cases where this may not be possible. In addition, the large restriction on possible patterns in UTF-8 (for instance there cannot be any lone bytes with the high bit set) means that it should be possible to distinguish UTF-8 from other character encodings without relying on the BOM. In UTF-32 and UCS-4, one
32-bit code unit serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code unit manifests as a byte sequence). In the other encodings, each code point may be represented by a variable number of code units. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system that uses the
GCC compilers to generate software uses it as the standard "
wide character" encoding. Recent versions of the
Python programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, effectively disseminating this encoding in high-level software.
Punycode, another encoding form, enables the encoding of Unicode strings into the limited character set supported by the
ASCII-based
Domain Name System (DNS). The encoding is used as part of
IDNA, which is a system enabling the use of
Internationalized Domain Names in all scripts that are supported by Unicode. Earlier and now historical proposals include
UTF-5 and
UTF-6.
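For illustration, Python ships codecs for both the raw Punycode transformation and the IDNA procedure built on it:

```python
# Raw Punycode: non-ASCII code points are removed and re-encoded
# as an ASCII suffix after the final "-" delimiter.
assert "münchen".encode("punycode") == b"mnchen-3ya"

# IDNA wraps Punycode with the "xn--" ACE prefix used in the DNS.
assert "münchen".encode("idna") == b"xn--mnchen-3ya"
```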
GB18030 is another encoding form for Unicode, from the
Standardization Administration of China. It is the official
character set of the People's Republic of China (PRC).
BOCU-1 and
SCSU are Unicode compression schemes. The
April Fools' Day RFC of 2005 specified two parody UTF encodings,
UTF-9 and
UTF-18.

== Adoption ==