UTF-8 has been the most common encoding for the
World Wide Web since 2008. UTF-8 is used by 98.9% of surveyed websites. Although many pages use only ASCII characters to display content, very few websites now declare their encoding as ASCII rather than UTF-8. Virtually all countries and languages have 95% or more use of UTF-8 encodings on the web. Many standards only support UTF-8, e.g.
JSON exchange requires it (without a byte-order mark (BOM)). UTF-8 is also required by the
WHATWG for HTML and
DOM specifications, which state "UTF-8 encoding is the most appropriate encoding for interchange of
Unicode". The
World Wide Web Consortium recommends UTF-8 as the default encoding in XML and HTML (and recommends declaring it in metadata, not just using it), "even when all characters are in the ASCII range ... Using non-UTF-8 encodings can have unexpected results". Version 5.3 of the W3C HTML specification and the current Living Standard by WHATWG both require UTF-8. Many software programs can read and write UTF-8, although this may require the user to change options from the default settings, or may require a BOM (byte-order mark) as the first character of the file. Examples of software supporting UTF-8 include
Microsoft Word,
Microsoft Excel (
Office 2003 and later),
Google Drive,
LibreOffice, and most databases. Software that "defaults" to UTF-8 (meaning it writes it without the user changing settings, and it reads it without a BOM) has become more common since 2010.
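The effect of a BOM on a reader can be illustrated with a short Python sketch. It relies only on the standard `utf-8` and `utf-8-sig` codecs; the sample string and the helper name `strip_bom_if_present` are illustrative assumptions and are not taken from any of the programs named above.

```python
BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def strip_bom_if_present(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if there is one."""
    # The "utf-8-sig" codec accepts input both with and without a BOM.
    return data.decode("utf-8-sig")


sample = "héllo"                        # illustrative text only
with_bom = sample.encode("utf-8-sig")   # a writer that prepends a BOM
without_bom = sample.encode("utf-8")    # a writer that does not

print(with_bom.startswith(BOM))
print(without_bom.startswith(BOM))
# A plain UTF-8 decoder keeps the BOM as a leading U+FEFF character.
print(repr(with_bom.decode("utf-8")[0]))
print(strip_bom_if_present(with_bom) == strip_bom_if_present(without_bom))
```

A reader that tolerates both forms, as `strip_bom_if_present` does, behaves like software that "defaults" to UTF-8 in the sense used here: it does not need the BOM, but is not confused by one.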
Windows Notepad, in all currently supported versions of Windows, defaults to writing UTF-8 without a BOM (a change from earlier versions of
Notepad), bringing it into line with most other text editors. Some system files on
Windows 11 require UTF-8 with no requirement for a BOM, and almost all files on macOS and most Linux distributions are required to be UTF-8 without a BOM. Programming languages that default to UTF-8 for
I/O include
Ruby 3.0,
R 4.2.2,
Raku and
Java 18.
Python 3.15 makes UTF-8 the default for I/O; previous versions require an option to open() to read/write UTF-8.
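As a minimal sketch of the `open()` option mentioned above, the following Python passes `encoding="utf-8"` explicitly, so that the result does not depend on the interpreter's default. The helper name `roundtrip_utf8` and the file name `sample.txt` are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def roundtrip_utf8(text: str) -> tuple[bytes, str]:
    """Write text as UTF-8, then return the raw bytes and the decoded text."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"  # hypothetical file name
        # An explicit encoding avoids relying on a locale-dependent default.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        raw = path.read_bytes()
        with open(path, "r", encoding="utf-8") as f:
            decoded = f.read()
    return raw, decoded


raw, decoded = roundtrip_utf8("naïve café")
print(raw)
print(decoded)
```

Passing the encoding explicitly is harmless on versions where UTF-8 is already the default, so code written this way should behave the same across versions.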
C++23 adopted UTF-8 as the only portable source code file format. Backwards compatibility is a serious impediment to changing code and APIs using
UTF-16 to use UTF-8, but this is happening. In May 2019, Microsoft
added the capability for an application to set UTF-8 as the "code page" for the Windows API, removing the need to use UTF-16; it has more recently recommended that programmers use UTF-8, and even states "UTF-16 [...] is a unique burden that Windows places on code that targets multiple platforms". The default string primitive in
Go,
Julia,
Rust,
Swift (since version 5), and
PyPy uses UTF-8 internally in all cases. Python (since version 3.3) uses UTF-8 internally for Python C API extensions and sometimes for strings, and a future version of Python is planned to store strings as UTF-8 by default. Modern versions of
Microsoft Visual Studio use UTF-8 internally. All currently supported versions of Microsoft SQL Server support UTF-8 for importing and exporting. In addition, all versions on mainstream support (that is, SQL Server 2019 and later) support UTF-8 internally; using it results in a 35% speed increase and "nearly 50% reduction in storage requirements".
Java internally uses UTF-16 for the char data type and, consequently, the Character, String, and StringBuffer classes, but for certain I/O uses
"Modified UTF-8", which is the same as CESU-8, except that the
null character uses the two-byte overlong encoding instead of a single null byte. Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points, including the null character, which allows such strings (with a null byte appended) to be processed by traditional
null-terminated string functions. Java reads and writes normal UTF-8 to files and streams, but it uses Modified UTF-8 for object
serialization, for the
Java Native Interface, and for embedding constant strings in
Java class files.
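The two differences from standard UTF-8 described above can be sketched in Python. This is an illustrative encoder only, not the JVM's implementation; the function name `encode_modified_utf8` is an assumption, and the byte values in the comments follow from the standard UTF-8 and UTF-16 bit layouts.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text in the style of Modified UTF-8 (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Null character: two-byte overlong form, so no 0x00 byte appears.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Supplementary code point: split into a UTF-16 surrogate pair and
            # encode each surrogate as its own three-byte sequence (as CESU-8 does).
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # Everything else matches standard UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


encoded = encode_modified_utf8("A\x00B")
print(encoded)
print(b"\x00" in encoded)
print(len(encode_modified_utf8("\U0001F600")), len("\U0001F600".encode("utf-8")))
```

Because the output never contains a zero byte, appending one yields a string that null-terminated string functions can handle without truncating at an embedded null character.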
Tcl also uses the same modified UTF-8 as Java for internal representation of Unicode data, but uses strict CESU-8 for external data. The
Raku programming language (formerly Perl 6) uses UTF-8 encoding by default for I/O (
Perl 5 also supports it), though that choice in Raku also implies normalization into Unicode
NFC (normalization form canonical). In some cases the user will want to ensure that no normalization is done; for this, "utf8-c8" can be used. That
UTF-8 Clean-8 variant, implemented by Raku, is an encoder/decoder that preserves bytes as is (even illegal UTF-8 sequences) and allows for Normal Form Grapheme synthetics. Version 3 of the
Python programming language treats each byte of an invalid UTF-8 bytestream as an error (see also changes with new UTF-8 mode in Python 3.7); this gives 128 different possible errors. Extensions have been created to allow any byte sequence that is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to 128 reserved code points, and transforming those code points back to error bytes to output UTF-8. The most common approach is to translate the codes to ... which are low (trailing) surrogate values and thus "invalid" UTF-16, as used by
Python's
PEP 383 (or "surrogateescape") approach.
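The round trip can be sketched with Python's built-in `surrogateescape` error handler. The input bytes below are an invented example (a Latin-1 encoded name that is not valid UTF-8), and the helper names are assumptions made for illustration.

```python
def decode_lossless(data: bytes) -> str:
    """Decode as UTF-8, mapping each undecodable byte to a lone low surrogate."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Reverse decode_lossless, turning escaped surrogates back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"         # 0xE9 is not valid UTF-8 in this position
text = decode_lossless(raw)

print(ascii(text))           # shows the escaped code point in place of the bad byte
print(hex(ord(text[3]) & 0xFF))
print(encode_lossless(text) == raw)
```

Valid UTF-8 passes through both helpers unchanged, so the escape mechanism only comes into play for bytes that a strict decoder would reject.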
NumPy version 2.0, and its file formats, support UTF-8 (adding StringDType for it). Another encoding called
MirBSD OPTU-8/16 converts them to ... in a
Private Use Area. In either approach, the byte value is encoded in the low eight bits of the output code point. These encodings are needed if invalid UTF-8 is to survive translation to and then back from the UTF-16 used internally by Python; because Unix filenames can contain invalid UTF-8, this round trip has to work. Most file systems on
Unix-like systems can use UTF-8 to encode file names, as looking up file names is done by comparing the bytes of file names. Linux's
ext4 and macOS's
APFS file systems support case-insensitive file name lookups, which require that the encoding of file names be specified; ext4 supports UTF-8 and uses it by default, and APFS requires UTF-8. Apple's older
HFS Plus uses
UTF-16 for file names, but uses UTF-8 in
symbolic links. Windows' filesystem,
NTFS, uses UTF-16 for file names.

== Standards ==