=== Overview ===
JIS X 0208 prescribes a set of 6879 graphic characters that correspond to two-byte codes with either seven or eight bits to the byte; in JIS X 0208, this is called the kanji set, which includes 6355 kanji as well as 524 non-kanji characters such as Latin letters, kana, and so forth.

; Special characters
: Occupy rows 1 and 2. There are 18 punctuation symbols such as the "ideographic space" (　) and the Japanese comma and period; eight diacritical marks such as dakuten and handakuten; 10 characters that accompany kana or kanji, such as the iteration mark; 22 bracket symbols; 45 mathematical symbols; and 32 unit symbols, which include the currency sign and the postal mark, for a total of 147 characters.
; Numerals
: Occupy part of row 3. The ten digits from "0" to "9".
; Latin letters
: Occupy part of row 3. The 26 letters of the English alphabet in uppercase and lowercase form, for a total of 52.
; Hiragana
: Occupy row 4. There are 48 unvoiced kana (including the obsolete wi and we), 20 voiced kana (dakuten), 5 semi-voiced kana (handakuten), and 10 small kana for palatalized and assimilated sounds, for a total of 83 characters.
; Katakana
: Occupy row 5. There are 86 characters: the katakana equivalents of the hiragana characters, plus the small ka/ke kana (ヵ/ヶ) and the vu kana (ヴ).
; Greek letters
: Occupy row 6. The 24 letters of the Greek alphabet in uppercase and lowercase form (excluding the final sigma), for a total of 48.
; Cyrillic letters
: Occupy row 7. The 33 letters of the Russian alphabet in uppercase and lowercase form, for a total of 66.
; Box-drawing characters
: Occupy row 8. Thin segments, thick segments, and mixed thin and thick segments, 32 in total.
; Kanji
: The 2965 characters of level 1 from row 16 to row 47, and the 3390 characters of level 2 from row 48 to row 84, for a total of 6355.
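The per-group counts in this overview can be checked against the stated totals with a short tally. The following is only an arithmetic sketch; the dictionary names are illustrative and are not terms from the standard.

```python
# Character counts per group, copied from the overview above.
NON_KANJI = {
    "special characters": 147,
    "numerals": 10,
    "Latin letters": 52,
    "hiragana": 83,
    "katakana": 86,
    "Greek letters": 48,
    "Cyrillic letters": 66,
    "box-drawing characters": 32,
}
KANJI = {"level 1": 2965, "level 2": 3390}

non_kanji_total = sum(NON_KANJI.values())
kanji_total = sum(KANJI.values())
grand_total = non_kanji_total + kanji_total

print(non_kanji_total, kanji_total, grand_total)  # 524 6355 6879
```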
=== Special characters, numerals, and Latin characters ===
Among the special characters in the kanji set, some characters from the graphic character set of the International Reference Version (IRV) of ISO/IEC 646:1991 (equivalent to ASCII) are absent from JIS X 0208. These are the aforementioned four characters "QUOTATION MARK", "APOSTROPHE", "HYPHEN-MINUS", and "TILDE". The former three are each split into different code points in the kanji set (Nishimura, 1978; JIS X 0221-1:2001 standard, Section 3.8.7). The "TILDE" of the IRV has no corresponding character in the kanji set.

In the following table, the ISO/IEC 646:1991 IRV characters in question are compared with their multiple equivalents in JIS X 0208, except for the IRV character "TILDE", which is compared with the "WAVE DASH" of JIS X 0208. The entries under the "Symbol" columns use UCS/Unicode code points, so the specifics of display may differ. The ASCII/IRV characters without exact JIS X 0208 equivalents were later assigned code points by JIS X 0213; these are also listed below, as are Microsoft's mappings of the four characters.

This makes the kanji set the most widespread character set in the world that is not upward compatible with the IRV; this is counted as one of the weak points of the standard.

Even for the 90 special characters, numerals, and Latin letters that the kanji set and the IRV set have in common, this standard does not follow the arrangement of ISO/IEC 646. These 90 characters are split between row 1 (punctuation) and row 3 (letters and numbers), although row 3 does follow the ISO 646 arrangement for the 62 letters and numbers alone (e.g. 4/1 ("A") in ISO 646 becomes 2/3 4/1 (i.e. 3-33) in JIS X 0208). These incompatibilities are thought to be the reason why the numerals, Latin letters, and so forth of the kanji set came to be treated as "full-width" alphanumeric characters, and why early implementations interpreted them as distinct from the IRV characters.

Ever since the first standard, it has been possible to represent composite characters such as encircled numbers, ligatures for measurement unit names, and Roman numerals; they were not given independent kuten code points. Although individual companies that manufacture information systems could make an effort to represent these characters by composition as customers required, none requested that they be added to the standard, instead choosing to offer them proprietarily as gaiji.

In the fourth standard (1997), all these characters were explicitly defined as characters that accompany an advancement of the current position; that is to say, they are spacing characters. Furthermore, it was ruled that they should not be made by the composition of characters. For this reason, it is no longer permitted to represent Latin characters with diacritics at all, with possibly the sole exception of the ångström symbol (Å) at row 2 cell 82.
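The relationship between a kuten (row-cell) pair and the column/line notation used above, such as 3-33 and 2/3 4/1, can be sketched as follows. The function names are hypothetical. The seven-bit form adds 0x20 to the row and to the cell. The eight-bit form shown here adds 0xA0, which is the EUC-JP convention and is assumed only so that Python's built-in `euc_jp` codec can decode the result.

```python
def kuten_to_bytes(row: int, cell: int, eight_bit: bool = False) -> bytes:
    """Convert a kuten (row, cell) pair to its two-byte code.

    Hypothetical helper: 7-bit form adds 0x20 to each number;
    the 8-bit form assumed here is the EUC-JP style (adds 0xA0).
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    offset = 0xA0 if eight_bit else 0x20
    return bytes([row + offset, cell + offset])


def column_line(byte: int) -> str:
    """Render a byte in ISO column/line notation, e.g. 0x41 -> '4/1'."""
    return f"{byte >> 4}/{byte & 0x0F}"


code = kuten_to_bytes(3, 33)
print(" ".join(column_line(b) for b in code))  # 2/3 4/1

# Decoding the 8-bit form yields the full-width Latin capital letter A.
full_width_a = kuten_to_bytes(3, 33, eight_bit=True).decode("euc_jp")
print(full_width_a == "\uff21")  # True
```

The second byte of 3-33 is 4/1, the same position that "A" occupies in ISO 646, which is the sense in which row 3 follows the ISO 646 arrangement.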
=== Hiragana and katakana ===
The hiragana and katakana in JIS X 0208, unlike those in JIS X 0201, include dakuten and handakuten markings as part of a character. The katakana wi (ヰ) and we (ヱ) (both obsolete in modern Japanese) as well as the small wa (ヮ), none of which are in JIS X 0201, are also included.

The arrangement of kana in JIS X 0208 is different from the arrangement of katakana in JIS X 0201. In JIS X 0201, the syllabary starts with wo (ヲ), followed by the small kana sorted in gojūon order, followed by the full-size kana, also in gojūon order (ヲァィゥェォャュョッーアイウエオ……). In JIS X 0208, on the other hand, the kana are sorted first in gojūon order, then in the order "small kana, full-size kana, kana with dakuten, kana with handakuten", such that each fundamental kana is grouped with its derivatives (ぁあぃいぅうぇえぉおかがきぎ……はばぱ……). This ordering was chosen to simplify the sorting of kana-based dictionary look-ups (Yasuoka, 2006).

As mentioned above, JIS X 0208 does not follow the katakana order previously defined in JIS X 0201. It is thought that the JIS X 0201 katakana came to be treated as "half-width kana" because of this incompatibility with the katakana of this standard. This point is also one of the weaknesses of this standard.
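The two orderings described above can be inspected by decoding consecutive code points. This is a sketch under two stated assumptions: Python's built-in `euc_jp` codec stands in for the JIS X 0208 kana rows (row and cell each offset by 0xA0), and the `shift_jis` codec stands in for the single-byte katakana of JIS X 0201. The helper `row_chars` is hypothetical.

```python
def row_chars(row: int, cells: range) -> str:
    """Decode the given cells of one JIS X 0208 row via EUC-JP (assumed)."""
    return "".join(
        bytes([0xA0 + row, 0xA0 + cell]).decode("euc_jp") for cell in cells
    )


hiragana = row_chars(4, range(1, 84))  # row 4 holds 83 characters
katakana = row_chars(5, range(1, 87))  # row 5 holds 86 characters

# JIS X 0208: each base kana is grouped with its small/voiced derivatives.
print(hiragana[:10])
start = hiragana.index("は")
print(hiragana[start:start + 3])  # base, dakuten, handakuten

# JIS X 0201 katakana (decoded as half-width forms): wo comes first,
# then the small kana, then the full-size kana.
jis_x_0201_head = bytes(range(0xA6, 0xB6)).decode("shift_jis")
print(jis_x_0201_head)
```

Because Unicode kept the JIS X 0208 order for the hiragana and katakana blocks, the decoded row 4 and row 5 strings come out in ascending code point order.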
=== Kanji ===
The sources from which the kanji in this standard were chosen, the reason they are split into level 1 and level 2, and the way they are arranged are all explained in detail in the fourth standard (1997). Per that explanation, the kanji included in the following four kanji listings were reflected in the 6349 characters of the first standard (1978).

• The "Kanji Listing for Standard Code (Tentative)": the Information Processing Society of Japan kanji code committee compiled this list in 1971. According to the "Correspondence Analysis Results" described below, it appears to contain 6086 characters.
• A listing selected by the Administrative Management Agency of Japan in 1975, consisting of 2817 characters. As data for the purpose of selection, the Agency made a report that contrasted several kanji listings, starting with the "Kanji Listing for Standard Code (Tentative)"; this report is known for short as the "Correspondence Analysis Results".
• The "Japanese Personality Registration Name Kanji": one of the kanji listings that compose the "Correspondence Analysis Results", consisting of 3044 characters. It no longer exists. The original list was not available to the original drafting committee; this kanji list was reflected in the standard by following the "Correspondence Analysis Results".
• The "Kanji for National Administrative District Listing": one of the kanji listings that compose the "Correspondence Analysis Results", consisting of 3251 characters. They are the kanji used in the National Administrative District Listing, the list of all administrative place names compiled by the Japan Geographic Data Center. The original drafting committee did not investigate the listing itself; the kanji taken from this list followed the "Correspondence Analysis Results".

The second and third standards added four and two characters to level 2, respectively, bringing the total to 6355 kanji. In addition, the second standard changed some character forms and transposed some characters between the levels; the third standard also changed some character forms. These changes are described further below.
==== Level partitioning ====
The 2,965 level 1 kanji occupy rows 16 to 47. The 3,390 level 2 kanji occupy rows 48 to 84.

For level 1, characters common to multiple kanji listings were chosen, using the tōyō kanji, the tōyō kanji correction draft, and the jinmeiyō kanji as a basis. JIS C 6260 ("To-Do-Fu-Ken (Prefecture) Identification Code"; currently JIS X 0401) and JIS C 6261 ("Identification code for cities, towns and villages"; currently JIS X 0402) were also consulted; kanji for nearly all Japanese prefectures, cities, districts, wards, towns, villages, and so forth were intentionally placed in level 1. Furthermore, amendments by experts were added.

Level 2 was dedicated to kanji that appeared in the aforementioned four major listings but were not selected for level 1. As noted below, the kanji of level 1 were ordered by their pronunciation, so some kanji whose pronunciation was difficult to determine were transferred from level 1 to level 2 on that basis (Nishimura, 1978).

Due to these decisions, level 1 for the most part contains the more frequently used kanji and level 2 the less frequently used kanji, but frequency was judged by the standards of the day. Over time, some level 2 kanji have become more frequently used, such as one meaning "to soar" (翔) and one meaning "to glitter" (煌); conversely, some level 1 kanji have become infrequent, notably the ones meaning "centimeter" (糎) and "millimeter" (粍). Of the current
jōyō kanji, 30 fall into level 2, while three are missing altogether (塡󠄀, 剝󠄀 and 頰󠄀). Of the current
jinmeiyō kanji, 192 are in level 2, while 105 are not part of the standard.
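The row ranges given above are enough to classify a code point by level. The sketch below uses hypothetical function names and assumes Python's built-in `euc_jp` codec as a way to obtain a character's kuten position (each byte minus 0xA0).

```python
from typing import Optional, Tuple


def kanji_level(row: int) -> Optional[int]:
    """Return 1 or 2 for a kanji row, or None for non-kanji or unused rows."""
    if 16 <= row <= 47:
        return 1
    if 48 <= row <= 84:
        return 2
    return None


def kuten_of(char: str) -> Optional[Tuple[int, int]]:
    """Look up (row, cell) via EUC-JP; None if not a JIS X 0208 character."""
    try:
        encoded = char.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    # A JIS X 0208 character is exactly two bytes, both in 0xA1..0xFE.
    if len(encoded) != 2 or encoded[0] < 0xA1 or encoded[1] < 0xA1:
        return None
    return encoded[0] - 0xA0, encoded[1] - 0xA0


position = kuten_of("亜")
print(position, kanji_level(position[0]))  # (16, 1) 1
```

Note that the row test alone does not detect unassigned cells inside the kanji rows; a complete check would need the full code table.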
==== Arrangement ====
The kanji in level 1 are sorted in order of each one's "representative reading" (i.e. a canonical reading chosen for the purposes of this standard only); this reading may be an on or a kun reading, and readings are sorted in gojūon order. As a general rule, the on (Chinese-sound) reading is considered the representative reading; where a kanji has multiple on readings, the reading judged to be predominant in frequency of use is taken as the representative reading (JIS C 6226-1978 standard, Section 3.4). For the small percentage of kanji that either do not have an on reading or have an on reading that is little known and not in common use, the kun reading was employed as the representative reading. Where a verb kun reading must be used as the representative reading, the ''ren'yōkei'' (rather than the ''shūshikei'') form is used.

For example, cells 1 to 41 on row 16 are 41 characters sorted as starting with a reading of a. Of these, 22 characters, including 16-10 (葵: on reading "ki"; kun reading "aoi") and 16-32 (粟: on readings "zoku" and "shoku"; kun reading "awa"), are there on the basis of their kun readings. 16-09 (逢: on reading "hō", kun reading "a(i)") and 16-23 (扱: on readings "sō" and "kyū", kun reading "atsuka(i)") are just two examples of ''ren'yōkei''-form verbs used for the representative reading.

Where the representative reading is the same between different kanji, a kanji that uses an on reading is placed ahead of one that uses a kun reading. Where the on or kun readings are the same between more than one kanji, they are then ordered by their primary radical and stroke count.

Whether in level 1 or level 2, itaiji are arranged to directly follow their exemplar form. For example, in level 2, right after row 49 cell 88 (劍), the immediately following characters deviate from the general rule (stroke count in this case) to include three variants of 49-88 (劔, 劒, and 剱).

The kanji in level 2 are arranged in order of primary radical and stroke count. Where these two properties are the same for different kanji, they are then sorted by reading.
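The two ordering rules can be expressed as sort keys. This is only a sketch: the `Entry` type, the placeholder labels, and every reading, radical number, and stroke count below are invented for illustration and are not data from the standard. Comparing hiragana strings by code point is used as a rough stand-in for gojūon order, and the rule that itaiji follow their exemplar form is not modeled.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    label: str      # placeholder name, not a real kanji
    reading: str    # representative reading, written in hiragana
    is_kun: bool    # True if the representative reading is a kun reading
    radical: int    # primary radical number
    strokes: int    # stroke count


def level1_key(e: Entry):
    # Reading first; on before kun (False < True); then radical and strokes.
    return (e.reading, e.is_kun, e.radical, e.strokes)


def level2_key(e: Entry):
    # Radical and stroke count first; reading only breaks ties.
    return (e.radical, e.strokes, e.reading)


# Invented sample entries (hypothetical values throughout).
entries = [
    Entry("D", "あい", False, 61, 13),
    Entry("C", "あ", True, 1, 5),
    Entry("B", "あ", False, 30, 10),
    Entry("A", "あ", False, 7, 7),
]

level1_order = [e.label for e in sorted(entries, key=level1_key)]
level2_order = [e.label for e in sorted(entries, key=level2_key)]
print(level1_order)  # ['A', 'B', 'C', 'D']
print(level2_order)  # ['C', 'A', 'B', 'D']
```

Entry C sorts after A and B under the level 1 key even though its radical number is smallest, because an on reading takes precedence over a kun reading when the readings match.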
==== Kanji from unknown sources ====
It has been pointed out that there are kanji in the kanji set that are not found in comprehensive, unabridged kanji dictionaries, and that their sources are unknown. For example, only one year after the first standard was established, Tajima (1979) reported that he had confirmed 63 kanji that were found neither in Shinjigen (a large kanji dictionary published by Kadokawa Shoten) nor in Dai Kan-Wa jiten, and that did not make sense as ryakuji of any sort; he noted that it would be preferable for kanji not available in kanji dictionaries to be selected from definite sources. These kanji came to be known as "ghost characters" or "ghost kanji", among other names.

The drafting committee for the fourth version of the standard also saw the existence of kanji of unknown source as a problem, and so made an inquiry into just what sources the drafting committee of the first version had referenced. As a result, it was discovered that the original drafting committee had relied heavily on the "Correspondence Analysis Results" to collect kanji. When the drafting committee investigated the "Correspondence Analysis Results", it became clear that many of the kanji included in the kanji set but not found in exhaustive kanji dictionaries supposedly came from the "Japanese Personality Registration Name Kanji" and "Kanji for National Administrative District Listing" lists mentioned in the "Correspondence Analysis Results".

It was confirmed that no original text exists for the "Japanese Personality Registration Name Kanji" referenced in the "Correspondence Analysis Results". For the "National Administrative District Listing", Sasahara Hiroyuki of the fourth version's drafting committee examined the kanji that appeared on the in-progress development pages for the first standard. The committee also consulted many ancient writings, as well as many examples of personal names in a database of NTT phone books.

Thanks to this thorough investigation, the committee was able to pare down the number of kanji whose source cannot be confidently explained to twelve, shown in the adjacent table. Of these, it is conjectured that several glyphs came about due to copying errors. In particular, 妛 was probably created when printers tried to create 𡚴 by cutting and pasting 山 and 女 together. A shadow from that process was misinterpreted as a line, resulting in 妛 (a picture of this can be found in the Jōyō kanji jiten).
==== Unification of kanji variants ====
According to the specifications in the fourth standard (1997), unification is the action of giving the same code point to a character without regard to its different character forms. In the fourth standard, the glyphs allowed are limited; the extent to which particular allographic glyphs are unified into a graphemic code point is clearly defined.

Furthermore, according to the specifications in the standard, a glyph is an abstract notion of the graphical representation of a graphic character; a character form is the representation as a graphical shape that a glyph takes in actuality (e.g. due to a glyph being handwritten, printed, displayed on a screen, etc.). For a single glyph, there exists an endless range of possible concretely and/or visibly different character forms. A variation between character forms of one glyph is termed a design difference.

The extent to which glyphs are unified to one code point is determined according to that code point's example glyph and the unification criteria that can be applied to that example glyph; that is, the example glyph for a code point applies to that code point, and any glyphs in which the parts that compose the example glyph are replaced in accordance with the unification criteria also apply to that code point.

For example, the example glyph at 33-46 (僧) is composed of radical 9 (亻) and the kanji that eventually spawned the so kana (曽). In unification criterion 101, three kanji are displayed: the first takes the form most often seen in Japanese (曽); the second is a more traditional form (曾) in which the first two strokes form radical 12 (the kanji numeral for the number 8: 八); and the third is like the second, except that radical 12 is inverted. Consequently, all three permutations apply to the code point at row 33 cell 46.

In the fourth standard, including one of the errata for the first printing, there are 186 unification criteria.

When a code point's example glyph is composed of more than one part glyph, unification criteria can be applied to each part. After a unification criterion is applied to one part glyph, no further unification criteria can be applied to that part. Also, a unification criterion may not be applied if the resulting glyph would coincide with that of another code point entirely.

An example glyph is no more than an example for that code point; it is not a glyph "endorsed" by the standard. Also, the unification criteria need only be used for generally used kanji and for the purpose of assigning characters to the code points of this standard. The standard requests that generally unused kanji not be created based on the example glyphs and unification criteria.

The kanji of the kanji set were not chosen completely consistently with the unification criteria. For example, although 41-7 corresponds both to the form in which the third and fourth strokes cross and to the form in which they do not, according to unification criterion 72, 20-73 corresponds only to the form in which they do not cross, and 80-90 corresponds only to the form in which they do.

The terms "unification", "unification criteria", and "example glyph" were adopted in the fourth standard. From the first to the third version, kanji and the relations between kanji were grouped into three types, one of which was "equivalence"; it was explained that characters recognized as equivalent "consolidate to just one point". "Equivalence" covered, besides kanji with exactly the same shape, kanji with differences due to style and kanji where the difference in character form is small.

In the first standard, it was stipulated that "this standard ... does not establish the particulars of character forms" (Section 3.1); it also states that "the aim of this standard is to establish the general idea of characters and their codes; the design of their character forms and such lie outside its scope." The second and third standards likewise carry a note to the effect that specific designs of character forms lie outside their scope (the note on item 1). The fourth standard also stipulates that "This standard regulates graphic characters as well as their bit patterns, and the use, specific designs of individual characters, and so forth are not within the scope of this standard" (JIS X 0208:1997, item 1).
==== Unification criteria for compatibility ====
In the fourth standard, unification criteria for compatibility are defined. Their application is limited to 29 code points whose glyphs vary greatly between JIS C 6226-1978 and the standards from JIS C 6226-1983 onward. For those 29 code points, the glyphs from JIS C 6226-1983 onward are displayed as "A", and the glyphs from JIS C 6226-1978 as "B". For each of them, either the "A" or the "B" glyph may be used. However, in order to claim compatibility with the standard, whether the "A" or the "B" form has been used for each code point must be explicitly noted.

== Character encodings ==