Since Release 8 of the 3GPP TS 23.038 standard (March 2008), additional character sets can be accessed through National Language Shift Tables. These tables allow a different character set to be used according to the language in which the text is written. The table used for a given message is selected in the User Data Header of the SMS message and can apply to the whole text (a Locking shift table, which replaces the standard GSM 7-bit default alphabet table) or to a single character (a Single shift table, which replaces the GSM 7-bit default alphabet extension table).
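As an illustration of how a shift table is selected in the User Data Header, the following minimal Python sketch builds the header octets. The function name `build_shift_udh` is hypothetical, and the information element identifiers (0x24 for single shift, 0x25 for locking shift) and the language codes are quoted from 3GPP TS 23.040 and TS 23.038 as understood here; they should be checked against the specifications before use.

```python
# Information Element Identifiers (assumed from 3GPP TS 23.040; verify before use).
IEI_SINGLE_SHIFT = 0x24
IEI_LOCKING_SHIFT = 0x25

# National language identifiers (assumed from 3GPP TS 23.038; verify before use).
LANGUAGE_ID = {
    "turkish": 1, "spanish": 2, "portuguese": 3, "bengali": 4,
    "gujarati": 5, "hindi": 6, "kannada": 7, "malayalam": 8,
    "oriya": 9, "punjabi": 10, "tamil": 11, "telugu": 12, "urdu": 13,
}


def build_shift_udh(locking=None, single=None) -> bytes:
    """Return the User Data Header (length octet included) that selects
    a locking and/or single shift table, or b"" if neither is requested."""
    elements = b""
    if locking is not None:
        # Each element is: identifier, data length (1), language code.
        elements += bytes([IEI_LOCKING_SHIFT, 1, LANGUAGE_ID[locking]])
    if single is not None:
        elements += bytes([IEI_SINGLE_SHIFT, 1, LANGUAGE_ID[single]])
    if not elements:
        return b""
    # The first octet is the length of the elements that follow it.
    return bytes([len(elements)]) + elements


if __name__ == "__main__":
    print(build_shift_udh(locking="turkish").hex())                    # 03250101
    print(len(build_shift_udh(locking="turkish", single="turkish")))   # 7
```

One table costs 4 header octets and two tables cost 7, which is where the octet counts quoted for shift-table messages come from.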
Locking and Single shift tables can be combined in the same message when both the standard default alphabet table and the default alphabet extension table are to be replaced. With a shift table, a message still uses 7-bit encoding for its characters, but a different set can be chosen to correctly show accented and language-specific characters. This allows up to 155 characters, encoded in 136 octets (140 octets minus the 4 octets of User Data Header required to indicate the use of a shift table and the language code). With both Locking and Single shift tables, up to 152 characters are allowed, encoded in 133 octets (140 octets minus a 7-octet User Data Header). Characters from any locking shift table take one septet; characters from a single shift table (or the Basic Character Set Extension table) take two septets.

Initially, shift tables were specified only for Turkish; Spanish and Portuguese were added in later revisions of Release 8. Release 9 introduced 10 languages used in India: nine written with Brahmic scripts (Bengali, Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil, Telugu), plus Urdu.

There is still no national language shift table defined for French, Greek, Russian, Bulgarian, Arabic, Hebrew and most Central European languages, which need better coverage than the default 7-bit character set and its 7-bit extension character set provide. If any character is entered that cannot be represented in those default GSM 7-bit sets, the message is automatically re-encoded in UCS-2, which cuts by more than half the maximum number of characters that can be sent for the price of a single SMS (when a message is split into multiple parts, a few more octets are needed in the User Data Header to indicate the sequence number of each part).

Although GSM 03.38 (as early as version 4.0.1 of September 1994) defined Data Coding Scheme values for the Cell Broadcast Service (CBS) for German, English, Italian, French, Spanish, Dutch, Swedish, Danish, Finnish, Norwegian, Greek and Turkish, with Hungarian, Polish, Czech, Hebrew, Arabic, Russian and Icelandic added in later revisions, no coding tables were defined for these languages. The purpose of this field was purely to identify the language of the message.

There is also no language shift table for Japanese written in basic kana, for Korean written in Hangul jamo, or for Chinese written in the Han script. This is often not a problem in Japan, which uses standards other than GSM and WAP for messaging. The other two languages also have too many distinct characters to fit into a 7-bit shift table.
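The capacity figures quoted above (155 and 152 characters, against 160 without a header and 70 in UCS-2) follow from simple arithmetic on the 140-octet payload. The sketch below reproduces them; the helper names and the two toy character sets are illustrative assumptions and are not the actual GSM tables.

```python
SMS_PAYLOAD_OCTETS = 140


def max_septets(udh_octets: int) -> int:
    """Septets left for text after a User Data Header of the given
    total size in octets (length octet included)."""
    remaining_bits = (SMS_PAYLOAD_OCTETS - udh_octets) * 8
    return remaining_bits // 7


def max_ucs2_chars(udh_octets: int = 0) -> int:
    """16-bit UCS-2 code units left after the header."""
    return (SMS_PAYLOAD_OCTETS - udh_octets) // 2


def septets_needed(text, locking_chars, single_chars):
    """Count septets for text: one per locking-table character, two per
    single-shift character. Returns None when a character is in neither
    set, which is the case where a UCS-2 fallback would be needed."""
    total = 0
    for ch in text:
        if ch in locking_chars:
            total += 1
        elif ch in single_chars:
            total += 2  # escape septet plus the character's own septet
        else:
            return None
    return total


# Toy placeholder sets for demonstration only (NOT the real tables).
TOY_LOCKING = set("abc ")
TOY_SINGLE = {"€"}

if __name__ == "__main__":
    print(max_septets(0))   # 160: no header
    print(max_septets(4))   # 155: one shift table
    print(max_septets(7))   # 152: locking and single shift tables
    print(max_ucs2_chars()) # 70
    print(septets_needed("ab€", TOY_LOCKING, TOY_SINGLE))  # 4
```

Because 1120 bits is an exact multiple of 7, flooring the remaining bits gives the same result as padding the header up to the next septet boundary.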
=== Spanish language (Latin script) ===
There is no specific Locking Shift Character Set for the Spanish language; the default Basic Character Set is used.
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.
Single shift characters include Í (Latin capital letter I with acute) and í (Latin small letter i with acute).
• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Portuguese language (Latin script) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.
Single shift characters include Í (Latin capital letter I with acute) and í (Latin small letter i with acute).
• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Turkish language (Latin script) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.
Single shift characters include İ (Latin capital letter dotted I) and ı (Latin small letter dotless i).
• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Urdu language (Arabic and basic Latin scripts) ===
This set may also be used for the Sindhi language, which is also written in the Arabic script. It may sometimes be used for the Arabic language as well, but in that case the Eastern digits (encoded here in their Persian-Hindu variant) will not be used, because standard Arabic prefers its traditional Eastern Arabic digits; they will frequently be replaced by Western Arabic digits (encoded in the locking shift character set in column 0x30), which are now also frequently used in Urdu. However, in India, phones recognizing the Arabic language indication may replace the Persian-Hindu variants of the Eastern Arabic digits with the traditional Eastern Arabic digits.
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.

• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
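The digit substitution described for the Urdu set can be sketched with a simple translation table. This is only an illustration of the idea, not how any particular phone implements it. It assumes that the "Persian-Hindu variant" corresponds to the Unicode Extended Arabic-Indic digits (U+06F0 to U+06F9) and that the "traditional Eastern Arabic digits" correspond to the Arabic-Indic digits (U+0660 to U+0669); the function names are hypothetical.

```python
# Assumed Unicode ranges for the three digit families discussed above.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))  # Extended Arabic-Indic
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))   # Arabic-Indic
WESTERN_DIGITS = "0123456789"

# Translation tables built once and reused for every call.
_TO_ARABIC = str.maketrans(PERSIAN_DIGITS, ARABIC_DIGITS)
_TO_WESTERN = str.maketrans(PERSIAN_DIGITS + ARABIC_DIGITS, WESTERN_DIGITS * 2)


def to_eastern_arabic(text: str) -> str:
    """Replace Persian-variant digits with traditional Eastern Arabic digits."""
    return text.translate(_TO_ARABIC)


def to_western(text: str) -> str:
    """Replace both Eastern digit variants with Western Arabic digits."""
    return text.translate(_TO_WESTERN)


if __name__ == "__main__":
    sample = "\u06f1\u06f2\u06f3"  # Persian-variant digits one, two, three
    print(to_eastern_arabic(sample) == "\u0661\u0662\u0663")  # True
    print(to_western(sample))                                  # 123
```

Characters other than digits pass through unchanged, so the functions can be applied to a whole message.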
=== Hindi language (Devanagari and basic Latin scripts) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.

• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Bengali and Assamese languages (Bengali and basic Latin scripts) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.

• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Punjabi language (Gurmukhī and basic Latin scripts) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.

• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Gujarati language (Gujarati and basic Latin scripts) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.

• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Oriya language (Oriya and basic Latin scripts) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.

• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Tamil language (Tamil and basic Latin scripts) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.

• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Telugu language (Telugu and basic Latin scripts) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.

• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Kannada language (Kannada and basic Latin scripts) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.

• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.
=== Malayalam language (Malayalam and basic Latin scripts) ===
• LF is a Line Feed control.
• CR is a Carriage Return control, or filler.
• ESC is an Escape control.
• SP is a Space character.

• FF is a Page Break control. If not recognized, it shall be treated like LF.
• CR2 is a control character. No language-specific character shall be encoded at this position.
• SS2 is a second Single Shift Escape control reserved for future extensions.

== See also ==