Mojibake

and 2.03 when running on an incorrect DOS version, displaying mojibake To correctly reproduce the original text that was encoded, the correspondence between the encoded data and the notion of its encoding must be preserved (i.e. the source and target encoding standards must be the same). As mojibake is the instance of non-compliance between these, it can be achieved by manipulating the data itself, or just relabeling it. Mojibake is often seen with text data that have been tagged with a wrong encoding; it may not even be tagged at all, but moved between computers with different default encodings. A major source of trouble are communication protocols that rely on settings on each computer rather than sending or storing metadata together with the data. The differing default settings between computers are in part due to differing deployments of Unicode among operating system families, and partly the legacy encodings' specializations for different writing systems of human languages. Whereas Linux distributions mostly switched to UTF-8 in 2004, Microsoft Windows generally uses UTF-16, and sometimes uses 8-bit code pages for text files in different languages. For some writing systems, such as Japanese, several encodings have historically been employed, causing users to see mojibake relatively often. As an example, the word mojibake itself () stored as EUC-JP might be incorrectly displayed as , (MS-932), or if interpreted as Shift-JIS, or as in software that assumes text to be in the Windows-1252 or ISO 8859-1 encodings, usually labelled Western or Western European. This is further exacerbated if other locales are involved: the same text stored as UTF-8 may appear as if interpreted as Shift-JIS, as if interpreted as Western, or (for example) as if interpreted as being in a GBK (Mainland China) locale. Underspecification If the encoding is not specified, it is up to the software to decide it by other means. Depending on the type of software, the typical solution is either configuration or charset detection heuristics, both of which are prone to mis-prediction. The encoding of text files is affected by locale setting, which depends on the user's language and brand of operating system, among other conditions. Therefore, the assumed encoding is systematically wrong for files that come from a computer with a different setting, or even from a differently localized piece of software within the same system. For Unicode, one solution is to use a byte order mark, but many parsers do not tolerate this for source code or other machine-readable text. Another solution is to store the encoding as metadata in the file system; file systems that support extended file attributes can store this as user.charset. This also requires support in software that wants to take advantage of it, but does not disturb other software. While some encodings are easy to detect, such as UTF-8, there are many that are hard to distinguish (see charset detection). For example, a web browser may not be able to distinguish between a page coded in EUC-JP and another in Shift-JIS if the encoding is not assigned explicitly using HTTP headers sent along with the documents, or using the document's meta tags that are used to substitute for missing HTTP headers if the server cannot be configured to send the proper HTTP headers; see character encodings in HTML. Mis-specification Mojibake also occurs when the encoding is incorrectly specified. This often happens between encodings that are similar. For example, the Eudora email client for Windows was known to send emails labelled as ISO 8859-1 that were in reality Windows-1252. Windows-1252 contains extra printable characters in the C1 range (the most frequently seen being curved quotation marks and extra dashes), that were not displayed properly in software complying with the ISO standard; this especially affected software running under other operating systems such as Unix. User oversight Of the encodings still in common use, many originated from taking ASCII and appending atop it; as a result, these encodings are partially compatible with each other. Examples of this include Windows-1252 and ISO 8859-1. People thus may mistake the expanded encoding set they are using with plain ASCII. Overspecification When there are layers of protocols, each trying to specify the encoding based on different information, the least certain information may be misleading to the recipient. For example, a web server serving a static HTML file over HTTP can communicate the character set to the client in any number of 3 ways: • in the HTTP header. This information can be based on server configuration (for instance, when serving a file off disk) or controlled by the application running on the server (for dynamic websites). • in the file, as an HTML meta tag (http-equiv or charset) or the encoding attribute of an XML declaration. This is the encoding that the author meant to save the particular file in. • in the file, as a byte order mark. This is the encoding that the author's editor actually saved it in. Unless an accidental encoding conversion has happened (by opening it in one encoding and saving it in another), this will be correct. It is, however, only available in Unicode encodings such as UTF-8 or UTF-16. Lack of hardware or software support Much older hardware is typically designed to support only one character set and the character set typically cannot be altered. The character table contained within the display firmware will be localized to have characters for the country the device is to be sold in, and typically the table differs from country to country. As such, these systems will potentially display mojibake when loading text generated on a system from a different country. Likewise, many early operating systems do not support multiple encoding formats and thus will end up displaying mojibake if made to display non-standard text early versions of Microsoft Windows and Palm OS for example, are localized on a per-country basis and will only support encoding standards relevant to the country the localized version will be sold in, and thus will display mojibake if a file containing a text in a different encoding format from the version that the OS is designed to support is opened. == Resolutions ==