Each code point is assigned a value for the General Category property. This is one of the character properties that are also defined for unassigned code points and for code points designated as noncharacters.
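As a rough illustration, the sketch below uses Python's standard `unicodedata` module to look up the General Category of an assigned character, an unassigned code point, and a noncharacter. The helper `is_noncharacter` is a hypothetical name introduced here, and U+50000 is assumed to be unassigned in the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """Hypothetical helper: True for U+FDD0..U+FDEF and for any code point
    whose low 16 bits are FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# General Category is defined for every code point, assigned or not.
assigned = unicodedata.category("A")             # an uppercase letter
unassigned = unicodedata.category("\U00050000")  # assumed unassigned
nonchar = unicodedata.category("\uffff")         # a noncharacter

# Collect every noncharacter across the whole code space.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]

print(assigned, unassigned, nonchar, len(noncharacters))
```

Unassigned code points and noncharacters both report the category `Cn`, so General Category alone does not distinguish the two.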
==Punctuation==
Characters have separate properties to denote that they are punctuation characters. These properties all have Yes/No values: Dash, Quotation_Mark, Sentence_Terminal, and Terminal_Punctuation. The Punctuation property refers to characters that are used to divide or structure text; these are classified into different types based on their roles, and Unicode assigns them specific categories.
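The categories mentioned above can be inspected through the General Category values that start with `P`. The following is a minimal sketch using Python's `unicodedata` module; the helper name `punctuation_category` and the sample characters are choices made for this example. Note that `unicodedata` does not expose the binary properties Dash, Quotation_Mark, Sentence_Terminal, or Terminal_Punctuation, so only the General Category is shown.

```python
import unicodedata
from typing import Optional


def punctuation_category(ch: str) -> Optional[str]:
    """Return the two-letter General Category if ch is punctuation, else None."""
    cat = unicodedata.category(ch)
    return cat if cat.startswith("P") else None


# One sample character per punctuation subcategory.
samples = {
    "_": "Pc",       # connector punctuation
    "-": "Pd",       # dash punctuation
    "(": "Ps",       # open punctuation
    ")": "Pe",       # close punctuation
    "\u00ab": "Pi",  # initial quote punctuation (left guillemet)
    "\u00bb": "Pf",  # final quote punctuation (right guillemet)
    "!": "Po",       # other punctuation
}

for ch, expected in samples.items():
    print(repr(ch), punctuation_category(ch), expected)
```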
==Whitespace==
Whitespace is a commonly used concept for a typographic effect. It covers invisible characters that have a spacing effect in rendered text, including spaces, tabs, and new-line formatting controls. In Unicode, such a character has the property set WSpace=yes. In version , there are 25 whitespace characters.
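Python's standard library does not expose the WSpace property directly, so the sketch below hard-codes the 25 code points as a set. The list is transcribed by hand for this example and should be checked against the Unicode Character Database before being relied on; the names `WHITE_SPACE` and `is_white_space` are hypothetical.

```python
# Code points with WSpace=yes, written as inclusive (first, last) ranges.
# Transcribed for illustration; verify against the UCD before relying on it.
_WHITE_SPACE_RANGES = [
    (0x0009, 0x000D),  # tab, line feed, vertical tab, form feed, carriage return
    (0x0020, 0x0020),  # space
    (0x0085, 0x0085),  # next line
    (0x00A0, 0x00A0),  # no-break space
    (0x1680, 0x1680),  # ogham space mark
    (0x2000, 0x200A),  # en quad through hair space
    (0x2028, 0x2029),  # line separator, paragraph separator
    (0x202F, 0x202F),  # narrow no-break space
    (0x205F, 0x205F),  # medium mathematical space
    (0x3000, 0x3000),  # ideographic space
]

WHITE_SPACE = frozenset(
    cp for first, last in _WHITE_SPACE_RANGES for cp in range(first, last + 1)
)


def is_white_space(ch: str) -> bool:
    """True if the single character ch is in the hard-coded set above."""
    return ord(ch) in WHITE_SPACE


print(len(WHITE_SPACE))
```

The built-in `str.isspace()` uses its own definition of whitespace, so its results are not guaranteed to match this set for every code point.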
==Casing==
The Case value is normative in Unicode. It pertains to those scripts with uppercase and lowercase letters. Case difference occurs in the Adlam, Armenian, Beria Erfe, Cherokee, Coptic, Cyrillic, Deseret, Garay, Glagolitic, Greek, Khutsuri and Mkhedruli Georgian, Latin, Medefaidrin, Old Hungarian, Osage, Vithkuqi and Warang Citi scripts.
Different languages have different case mapping rules. In Turkish, I corresponds to ı instead of i. Similarly, i corresponds to İ instead of I. In Nawdm, the letter Ĥ corresponds to ɦ in lowercase, rather than following the usual case pairs Ĥĥ and Ɦɦ. In Greek, the letter sigma has different lowercase forms depending on where it is in a word: Σ converts to σ if it is at the start or middle of a word, and to ς if it is at the end of a word. In Lithuanian, the dot in lowercase i and j is preserved when followed by accents. For example, Í in lowercase is i̇́. Despite the existence of ẞ, ß corresponds to "SS" in uppercase.
Unicode encodes 31 titlecase characters.
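Some of these mappings can be observed with Python's built-in string methods, which apply the default, language-independent rules. The sketch below is illustrative only: `lower_tr` is a hypothetical helper that approximates Turkish lowercasing with two replacements and is not a full locale-aware implementation.

```python
import unicodedata

# Default mappings, independent of any language.
sharp_s_upper = "\u00df".upper()  # sharp s uppercases to two letters
sigma_lower = "\u03a3\u039f\u03a6\u039f\u03a3".lower()  # Greek word with two sigmas
default_i_lower = "I".lower()     # default rule, not the Turkish one


def lower_tr(text: str) -> str:
    """Hypothetical helper: lowercase text using the Turkish I/i pairing."""
    # Map dotted capital I (U+0130) to i, and plain capital I to dotless i
    # (U+0131), before applying the default lowercase rules to the rest.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


# Count code points whose General Category is Lt (titlecase letter).
titlecase_count = sum(
    1 for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Lt"
)

print(sharp_s_upper, sigma_lower, default_i_lower, lower_tr("ISTANBUL"))
print(titlecase_count)
```

Because the built-in methods ignore language, tailored behavior such as the Turkish, Lithuanian, or Nawdm rules has to come from a separate layer, for example a locale-aware library.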
==Other general characteristics==
Unicode defines several general character properties in the Unicode Character Database (UAX #44). Some of the most important ones include:
• Ideographic: characters that represent ideas or concepts rather than specific sounds. These include most Han (CJK) characters used in the Chinese, Japanese, and Korean writing systems.
• Alphabetic: characters that are considered letters in an alphabetic or syllabic writing system. This includes Latin, Greek, and Cyrillic letters, as well as characters from syllabaries such as Hiragana.
• Noncharacter: code points that are permanently reserved for internal use and are not assigned to any abstract character. These are U+FDD0 through U+FDEF and any code point ending in FFFE or FFFF (such as U+1FFFE and U+10FFFF).
==Combining class==