KS X 1001

KS X 1001, "Code for Information Interchange", formerly called KS C 5601, is a South Korean coded character set standard for representing Hangul and Hanja characters on a computer.

History

This standard was previously known as KS C 5601. It has been revised several times, including in 1987, 1992, 1998 and 2002. The second edition, published in 1982, retained the main character set from the 1974 edition but defined two supplementary sets, including a version of Johab. Neither edition was adopted as widely as intended.

The present, double-byte Wansung code is an ISO 2022 compatible encoding, typically used in EUC form, which assigns double-byte codes for non-Hangul characters, Hangul jamo, and the most common Hangul syllables, in contrast to Johab. The standard also includes Johab in annex 3, and the older N-byte Hangul encoding in annex 4.

Encodings

Encoding schemes of KS X 1001 include EUC-KR (in both ASCII and ISO 646-KR based variants, the latter of which includes a won currency sign (₩) at byte 0x5C rather than a backslash) and ISO-2022-KR, as well as ISO-2022-JP-2 (which also encodes JIS X 0208 and JIS X 0212). These all have the drawback that they only assign codes for the 2350 precomposed Hangul syllables which have their own KS X 1001 codepoints (out of 11172 in total, not counting those using obsolete jamo), and require others to use eight-byte composition sequences, which are not supported by some partial implementations of the standard. The Johab encoding (stipulated in annex 3 of the 1992 version of the standard) and the EUC-KR superset known as Unified Hangul Code (UHC, also called Windows-949) provide single codes for all 11172 Hangul syllables.

Unicode includes the Wansung code Hangul Filler in the Hangul Compatibility Jamo block for round-trip compatibility, but uses its own system (with its own, differently used, filler characters) for composing Hangul. The KS X 1001 Hangul composition system is not used in Unicode, and the filler renders merely as an empty space; KS X 1001 composition sequences using modern jamo may be mapped to precomposed characters in Unicode. This is not usually done with Unified Hangul Code. For round-trip compatibility, Unicode also includes the N-byte Hangul code Hangul Filler separately in the Halfwidth and Fullwidth Forms block, named the "Halfwidth Hangul Filler".
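The coverage figures given in this section can be checked with a short Python sketch. It assumes that the standard library codecs named `euc_kr` and `cp949` are reasonable stand-ins for EUC-KR and Unified Hangul Code; the helper names `encoded_length` and `count_two_byte` are made up for this illustration. A syllable counts as having "its own code" when it encodes to exactly two bytes, so the check does not depend on whether a given codec rejects the remaining syllables or emits a longer composition sequence for them.

```python
# All 11172 modern precomposed Hangul syllables, U+AC00 through U+D7A3.
SYLLABLES = [chr(cp) for cp in range(0xAC00, 0xD7A4)]


def encoded_length(ch, codec):
    """Return the encoded byte length of ch, or None if the codec rejects it."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


def count_two_byte(codec):
    """Count syllables that the codec represents as a single two-byte code."""
    return sum(1 for ch in SYLLABLES if encoded_length(ch, codec) == 2)


euc_kr_count = count_two_byte("euc_kr")  # plain KS X 1001 coverage
uhc_count = count_two_byte("cp949")      # Unified Hangul Code coverage

print(len(SYLLABLES), euc_kr_count, uhc_count)
```

If the codec tables follow the standard, the three printed numbers should be 11172, 2350 and 11172.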

Wansung code charts

Following are the code charts for KS X 1001 in Wansung layout. Where a pair of hexadecimal numbers is given, the smaller is used when encoded over GL (0x21-0x7E), as in ISO-2022-KR after shifting to the Korean set, and the larger is used in the more typical case of it being encoded over GR (0xA1-0xFE), as in EUC-KR or UHC. Johab changes the arrangement to encode all 11172 Hangul clusters separately and in order.

To illustrate vendor differences in implementation, multiple Unicode mappings are shown for some characters. Apple's HangulTalk extensions to the Wansung plane (i.e. where both bytes are in the 0xA1-0xFE range) are shown, but other HangulTalk extension ranges are not. The additional codes for composed syllables in Unified Hangul Code, and IBM's extensions in IBM-949, are also not shown, since both fall outside of the Wansung plane.

Non-Hanja non-precomposed sets

The rows 41 and 94 may be used for user-defined purposes.

[...] or U+223C (favoured by Microsoft). Compare the similar but not identical handling of the JIS wave dash, and the handling of the tilde in the next row. Except for the backslash, if two mappings are shown below, the first is used by Apple and the second is used by Microsoft. Mapping of the circled dot also differs. Microsoft updated its Unified Hangul Code implementation to add the 1998 additions including the euro sign, but did not add the Korean postal mark when it was added to the standard.

Character set 0x23 / 0xA3 (row number 3, basic Latin / ISO 646-KR)

This set corresponds to KS X 1003 (the ISO 646 variant for Korean, a similar set to ASCII), but as two-byte codes preceded by 0x23 (or 0xA3 in GR-invoked (EUC) form). It includes the English alphabet / Basic Latin alphabet, western Arabic numerals and punctuation. Compare the Roman set of JIS X 0201, which differs by including a Yen sign rather than a Won sign. Contrast the third rows of KPS 9566 and of JIS X 0208, which follow the ISO 646 layout but only include letters and digits. Encodings such as EUC-KR and UHC combine KS X 1001 with single-byte ASCII or KS X 1003, and hence use alternative Unicode mappings to the Halfwidth and Fullwidth Forms block for the double-byte representations of these characters.

Character set 0x24 / 0xA4 (row number 4, Hangul jamo)

This set includes modern Hangul consonants, followed by vowels, both ordered by South Korean collation customs, followed by obsolete consonants. When used individually, these characters map to the Unicode Hangul Compatibility Jamo block, and do not have a one-to-one mapping with the position-specific characters in the Hangul Jamo block. Compare with row 4 of the North Korean KPS 9566. Character 04-52 is a Hangul Filler (see above), used in combining sequences.

Character set 0x25 / 0xA5 (row number 5, Roman numerals and Greek)

This set contains Roman numerals and basic support for the Greek alphabet, without diacritics or the final sigma. Apple includes some additional punctuation in this row, as well as some black circled list markers continuing from those in row 6. Apple also includes some bracketed list markers continuing from those in rows 9 and 10. Compare row 11 of KPS 9566, which uses the same layout. Compare and contrast row 5 of JIS X 0208, which also uses the same layout, but in a different row.

Character set 0x2C / 0xAC (row number 12, Cyrillic)

This set contains the modern Russian alphabet, and is not necessarily sufficient to represent other forms of the Cyrillic script. Apple also includes some black boxed list markers. Compare row 5 of KPS 9566 and row 7 of JIS X 0208, which use the same layout (but in a different row).
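The relationship between the row numbers and the byte pairs quoted in the headings above (row 3 as 0x23 / 0xA3, row 12 as 0x2C / 0xAC) can be sketched in Python as follows. This is only an illustration: the function name `row_cell_to_bytes` and its `gr` parameter are made up here, and the offsets 0x20 and 0xA0 are inferred from the GL and GR ranges stated above rather than quoted from the standard.

```python
def row_cell_to_bytes(row, cell, gr=True):
    """Convert a KS X 1001 row-cell pair (each 1..94) to its two bytes.

    gr=True gives the GR form (0xA1-0xFE), as in EUC-KR or UHC;
    gr=False gives the GL form (0x21-0x7E), as in ISO-2022-KR.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in the range 1..94")
    offset = 0xA0 if gr else 0x20
    return bytes([offset + row, offset + cell])


# 16-01 is the first code point of the precomposed Hangul block.
first_hangul = row_cell_to_bytes(16, 1)
print(first_hangul.hex(), first_hangul.decode("euc_kr"))

# 04-52 is the Hangul Filler from row 4.
print(row_cell_to_bytes(4, 52).hex())

# The same row-cell pair in GL form.
print(row_cell_to_bytes(16, 1, gr=False).hex())
```

Keeping the row-cell pair as the primary key and deriving the bytes on demand mirrors how the charts themselves are organised, with one chart serving both the GL and GR forms.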
Extended character set 0x2D / 0xAD (row number 13, Apple additional punctuation)

Precomposed Hangul sets (rows number 16 through 40)

Code points for precomposed Hangul are included in a continuous sorted block between code points 16-01 and 40-94 inclusive. Not all possible syllable clusters are included in this range. Compare the different ordering and availability in KPS 9566.

Initial+vowel+final syllables 뢨, 썅, 쏀, 쓩, and 쭁 are included but their initial+vowel counterparts 뢔, 쌰, 쎼, 쓔, and 쬬 are not. This can cause a problem with inputting, because input methods have to go through an initial+vowel syllable first in order to get to an initial+vowel+final syllable (e.g. ㅎ → 하 → 한).

Those which are not listed here may be represented using eight-byte composition sequences. All other modern-jamo clusters are assigned codes elsewhere by UHC. All possible modern-jamo clusters are assigned codes by Johab.

Hanja sets (rows number 42 through 93)

KS X 1001 encodes hanja with multiple pronunciations multiple times, once for each pronunciation. (Some pronunciations are inherited from Middle Chinese, and others are an effect of the initial sound rule.) One character, 樂, is encoded four times. The first 268 characters (U+F900–U+FA0B) in the CJK Compatibility Ideographs block correspond to these duplicates. In the table below, the first row-cell value (and reading) for each Hanja maps to the CJK Unified Ideographs block; others map to the CJK Compatibility Ideographs block.
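The duplicate encoding of hanja can be explored with the following Python sketch. It assumes that Python's `euc_kr` codec maps the U+F900–U+FA0B compatibility ideographs to their KS X 1001 codes, and it uses Unicode NFC normalisation (via `unicodedata`) to find the unified ideograph behind each duplicate. The variable names are illustrative only.

```python
import unicodedata
from collections import defaultdict

# The 268 compatibility ideographs that correspond to KS X 1001 duplicates.
COMPAT = [chr(cp) for cp in range(0xF900, 0xFA0C)]

# Group each compatibility ideograph under the unified ideograph that
# Unicode normalisation replaces it with.
duplicates = defaultdict(list)
for ch in COMPAT:
    unified = unicodedata.normalize("NFC", ch)
    duplicates[unified].append(ch)

# Collect every KS X 1001 code for U+6A02 (樂): the unified ideograph
# first, then its compatibility duplicates.
target = "\u6a02"
variants = [target] + duplicates[target]
codes = [v.encode("euc_kr") for v in variants]

for variant, code in zip(variants, codes):
    print(f"U+{ord(variant):04X} -> {code.hex()}")
```

Each printed line pairs one Unicode code point with a distinct two-byte EUC-KR code, which is how a round trip through Unicode preserves the reading-specific duplicates.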

Johab encoding

KS X 1001, since 1992, also defines an alternative encoding known as Johab. This represents a Hangul syllable as a sequence of three five-bit values, split across two 8-bit bytes, most significant bit first. The most significant bit of the lead byte is always set (allowing combination with single-byte ASCII or KS X 1003). This encoding is also used for the modern jamo from row 4 of KS X 1001, by using the filler values for the other components. The Johab encoding for Hangul is shown in the table below.

Johab encodes the remainder of KS X 1001 using lead bytes which do not correspond to an initial jamo (0xE0–0xF9 for Hanja and 0xD9–0xDE for the other non-Hangul characters), with two KS X 1001 rows per lead byte (compare and contrast Shift JIS).

The ASCII-based Johab encoding is numbered Code page 1361 by Microsoft. Other, vendor-defined, Johab variants also exist; for example, IBM defines one for use as a Shift Out set with EBCDIC. That variant uses shift in and shift out to switch between a single-byte EBCDIC page and Johab, uses a different encoding for the non-Hangul characters (using lead bytes 0x40–0x6C with a different layout), and uses lead bytes 0xD4–0xDD as a user-defined region, but uses the same Johab layout as the 1992 standard for the Hangul characters when in shift-out state. IBM numbers the EBCDIC-based, stateful Johab encoding Code page 1364.

Some other vendors, such as Samsung or GoldStar (now LG), used other "Johab" encodings in which the mappings of five-bit codes to jamo differ from the below, and which are therefore not compatible with the 1992 standard Johab. The table below corresponds to the 1992 standard and also to IBM usage.
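The bit layout described at the start of this section (one always-set high bit followed by three five-bit values) can be sketched in Python as below. The function names `split_johab` and `join_johab` are hypothetical, and the sketch deliberately returns raw five-bit numbers without interpreting them as particular jamo, since that mapping is given by the standard's table. Python's `johab` codec is assumed to implement the ASCII-based encoding.

```python
def split_johab(pair):
    """Split a two-byte Johab Hangul code into its three five-bit fields."""
    if len(pair) != 2:
        raise ValueError("expected exactly two bytes")
    value = int.from_bytes(pair, "big")
    if not value & 0x8000:
        raise ValueError("lead byte must have its most significant bit set")
    initial = (value >> 10) & 0x1F  # bits 14..10
    medial = (value >> 5) & 0x1F    # bits 9..5
    final = value & 0x1F            # bits 4..0
    return initial, medial, final


def join_johab(initial, medial, final):
    """Pack three five-bit fields into a two-byte Johab code."""
    for field in (initial, medial, final):
        if not 0 <= field <= 0x1F:
            raise ValueError("each field must fit in five bits")
    value = 0x8000 | (initial << 10) | (medial << 5) | final
    return value.to_bytes(2, "big")


encoded = "한".encode("johab")
fields = split_johab(encoded)
print(encoded.hex(), fields, join_johab(*fields) == encoded)
```

Because the three fields are independent, changing only the last field of a code moves between syllables that share an initial and a vowel, which is the property that lets Johab cover every cluster in order.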

N-byte Hangul code

This is the N-byte Hangul code specified in annex 4 of the standard. Code page 1040 is a superset of it, assigning the characters ¢¬\~ (although not £) to the same locations as in Code page 1041, while the unextended N-byte Hangul code (besides C0 control code replacement graphics in some usage contexts, shared with IBM-1040) is Code page 891. Character 0x40/0xC0 is a Hangul Filler (see above), used in combining sequences.

Similarly to its Japanese counterpart JIS C 6220 (JIS X 0201), N-byte Hangul code could be used as a 7-bit encoding, with character allocations over the range 0x40 through 0x7C. The chart below shows the code in an 8-bit environment with the high bit set (i.e. over 0xC0 through 0xFC), as it is used in e.g. code page 891 or 1040.

Source: Wikipedia ↗
