Both users and applications need to identify a file's format so that the file can be used appropriately. Generally, the methods for identification vary by
operating system, with each approach having its advantages and disadvantages.
=== Filename extension ===
One popular method used by many operating systems, including Windows, macOS, CP/M, MS-DOS, VMS, and VM/CMS, is to indicate the format of a file with a suffix of the file name, known as the extension. For example, an HTML document is identified by a file name that ends with .html or .htm, and a GIF image by .gif. In the now-antiquated
FAT file system, file names were limited to eight characters for the base name plus a three-character extension, known as an
8.3 filename. Due to the prevalence of this naming scheme, many formats still use three-character extensions even though modern systems support longer ones. Since there is no standardized list of extensions, more than one format can use the same extension, especially among three-letter extensions, where the number of possible combinations is limited. This situation can confuse both users and applications. One implication of indicating the file type with the extension is that users and applications can be tricked into treating a file as a different format simply by renaming it. For example, an
HTML file can be treated as
plain text by adding the extension .txt (or changing the existing extension to it). Although this strategy is useful, it can confuse less technical users, who may accidentally make a file unusable (or "lose" it). To try to avoid this scenario, Windows and macOS support hiding the extension. Hiding the extension, however, can create the appearance of multiple files with the same name in the same folder, which is confusing for people. For example, an image may be needed both in Encapsulated PostScript format (for publishing) and
PNG format (for web sites), and one might give both files the same base name. With extensions hidden, the two files appear to have identical names. Hiding extensions can also pose a security risk. For example, a malicious user could create an
executable program with an innocent-looking name that ends in ".jpg.exe". The ".exe" would be hidden and an unsuspecting user would see a name ending in ".jpg", which would appear to be a
JPEG image, usually unable to harm the machine. However, the operating system would still see the ".exe" extension and run the program, which would then be able to cause harm to the computer. The same is true of files with only one extension: as it is not shown to the user, no information about the file can be deduced without explicitly investigating the file. To further trick users, it is possible to store an icon inside the program, in which case some operating systems' icon assignment for the executable file (.exe) would be overridden with an icon commonly used to represent JPEG images, making the program look like an image. Extensions can also be spoofed: some
Microsoft Word macro viruses create a Word file in template format and save it with a .doc extension. Since Word generally ignores extensions and looks at the format of the file, these would open as templates, execute, and spread the virus. This represents a practical problem for Windows systems where extension-hiding is turned on by default.
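The double-extension trick described above can be illustrated with a short sketch. It assumes that only the final suffix determines how the operating system treats the file; the set of executable suffixes and the function names are hypothetical choices made for this example, not a complete or authoritative list.

```python
from pathlib import PurePath

# Illustrative (incomplete) set of suffixes that a system might execute.
EXECUTABLE_SUFFIXES = {".exe", ".com", ".bat", ".scr"}


def effective_extension(name: str) -> str:
    """Return the final suffix, lower-cased: the part the system acts on."""
    return PurePath(name).suffix.lower()


def looks_disguised(name: str) -> bool:
    """True when an executable suffix follows another, earlier suffix."""
    suffixes = [s.lower() for s in PurePath(name).suffixes]
    return len(suffixes) >= 2 and suffixes[-1] in EXECUTABLE_SUFFIXES
```

A file manager that hides extensions could use a check of this kind to warn the user before displaying a name that ends in ".jpg" but would be run as a program.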
=== Internal metadata ===
A file's format may be indicated inside the file itself, either as information intended for this purpose or as identifiable data within the format that can be used for identification even though that is not its intended purpose. Intentionally placed information is often located at the beginning of a file, since this is relatively easy for both users and applications to read. When the information at the beginning of the file is a structure that contains other
metadata, then the structure is often called a
file header. When the file starts with a relatively small datum that only indicates the format, then it is often called a
magic number.
==== File header ====
The metadata contained in a
file header are usually stored at the start of the file, but might be present in other areas too, often including the end, depending on the file format or the type of data contained. Character-based (text) files usually have character-based headers, whereas binary formats usually have binary headers, although this is not a rule. Text-based file headers usually take up more space, but being human-readable, they can easily be examined by using simple software such as a text editor or a hexadecimal editor. As well as indicating the file format, file headers may contain metadata about the file and its contents. For example, most
image files store information about image format, size, resolution and
color space, and optionally
authoring information such as who made the image, when and where it was made, what camera model and photographic settings were used (
Exif), and so on. Such metadata may be used by software reading or interpreting the file during the loading process and afterwards. File headers may be used by an
operating system to quickly gather information about a file without loading it all into memory, but doing so uses more of a computer's resources than reading directly from the
directory information. For instance, when a
graphic file manager has to display the contents of a folder, it must read the headers of many files before it can display the appropriate icons, but these will be located in different places on the storage medium thus taking longer to access. A folder containing many files with complex metadata such as
thumbnail information may require considerable time before it can be displayed. If a header is
binary hard-coded such that the header itself needs complex interpretation in order to be recognized, especially for the sake of protecting metadata content, there is a risk that the file format can be misinterpreted. The header may even have been badly written at the source. This can result in corrupt metadata which, in extremely bad cases, might even render the file unreadable. A more complex example of file headers is the kind used for
wrapper (or container) file formats.
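As a minimal sketch of reading header metadata without loading a whole file, the following assumes the GIF layout, in which a six-byte signature is followed by the logical screen width and height as 16-bit little-endian integers. The function name is hypothetical, and only the first ten bytes of the file are examined.

```python
import struct


def read_gif_header(data: bytes):
    """Return (version, width, height) from the first 10 bytes of a GIF file."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF header")
    # Bytes 6-9: logical screen width and height, little-endian uint16 each.
    width, height = struct.unpack("<HH", data[6:10])
    return data[3:6].decode("ascii"), width, height
```

A caller would typically pass the result of reading just the first ten bytes of the file, which is what allows a file manager to report image dimensions cheaply.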
==== Magic number ====
One way to incorporate file type metadata is to store a "magic number" inside the file itself. Originally, this term was used for 2-byte identifiers at the start of files, but since any binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification.
GIF images, for instance, always begin with the
ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate
document type definition that starts with <!DOCTYPE html, or, for
XHTML, the
XML identifier, which begins with <?xml. The files can also begin with HTML comments, random text, or several empty lines, but still be usable HTML. The magic number approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reasonably reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is relatively inefficient, especially for displaying large lists of files (in contrast, file name and metadata-based methods need to check only one piece of data, and match it against a sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where file types do not lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if the file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type. On the other hand, a valid magic number does not guarantee that the file is not corrupt or is of a correct type. So-called
shebang lines in
script files are a special case of magic numbers. There, the magic number consists of human-readable text within the file that identifies a specific
interpreter and options to be passed to it. Another operating system using magic numbers is
AmigaOS, where magic numbers were called "Magic Cookies" and were adopted as a standard system to recognize executables in
Hunk executable file format and also to let single programs, tools and utilities deal automatically with their saved data files, or any other kind of file types when saving and loading data. This system was then enhanced with the
Amiga standard Datatype recognition system. Another method was the
FourCC method, originating in
OSType on Macintosh, later adapted by
Interchange File Format (IFF) and derivatives.
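The magic number test described in this section can be sketched as a lookup against a small signature table. The table below is a deliberately tiny, hypothetical stand-in for a real magic database, and the function names are assumptions made for this example.

```python
from typing import Optional

# Hypothetical signature table: (offset, magic bytes, label).
SIGNATURES = [
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"#!", "script with shebang line"),
]


def sniff(data: bytes) -> Optional[str]:
    """Return the label of the first matching signature, or None."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


def shebang_interpreter(data: bytes) -> Optional[str]:
    """Return the interpreter and options named on a shebang line, if any."""
    if not data.startswith(b"#!"):
        return None
    first_line = data.split(b"\n", 1)[0]
    return first_line[2:].decode("utf-8", "replace").strip()
```

Because every file is compared against every entry, the cost grows with the size of the table, which reflects the inefficiency noted above for large magic databases.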
=== External metadata ===
A final way of storing the format of a file is to explicitly store information about the format in the file system, rather than within the file itself. This approach keeps the metadata separate from both the main data and the name, but is also less
portable than either filename extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions (for instance, for compatibility with
MS-DOS's three-character limit), most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata. Note that
zip files and other
archive files solve the problem of handling metadata. A utility program collects multiple files, together with metadata about each file and the folders/directories they came from, into one new file (e.g. a zip file with the extension .zip). The new file is also compressed and possibly encrypted, and can be transmitted as a single file across operating systems by
FTP transmissions or sent by email as an attachment. At the destination, the single file received has to be unzipped by a compatible utility to be useful.
==== Mac OS type-codes ====
The
Mac OS'
Hierarchical File System and
HFS+ file system, and the
Apple File System, store codes for
creator and
type as part of the directory entry for each file. These codes are referred to as OSTypes. These codes could be any 4-byte sequence but were often selected so that the ASCII representation formed a sequence of meaningful characters, such as an abbreviation of the application's name or the developer's initials. For instance a
HyperCard "stack" file has a
creator of WILD (from Hypercard's previous name, "WildCard") and a
type of STAK. The
BBEdit text editor has a creator code of R*ch, referring to its original programmer,
Rich Siegel. The type code specifies the format of the file, while the creator code specifies the default program to open it with when double-clicked by the user. For example, the user could have several text files all with the type code of TEXT, but each open in a different program, due to having differing creator codes. This feature was intended so that, for example, human-readable plain-text files could be opened in a general-purpose text editor, while programming or HTML code files would open in a specialized editor or
IDE. However, this feature was often the source of user confusion, as which program would launch when the files were double-clicked was often unpredictable.
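The relationship between a four-character code and its 4-byte value can be sketched as follows. The example assumes that the code is pure ASCII and that the four bytes are read as a big-endian 32-bit integer; the function names are hypothetical, and TEXT is used only as a familiar type code.

```python
import struct


def ostype_to_int(code: str) -> int:
    """Pack a four-character code into a 32-bit integer (big-endian assumed)."""
    raw = code.encode("ascii")
    if len(raw) != 4:
        raise ValueError("an OSType is exactly four bytes long")
    return struct.unpack(">I", raw)[0]


def int_to_ostype(value: int) -> str:
    """Recover the four-character code from its 32-bit integer form."""
    return struct.pack(">I", value).decode("ascii")
```

Under these assumptions, ostype_to_int("TEXT") yields 0x54455854, whose bytes spell the code again when written out in order.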
RISC OS uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions; for example, the
hexadecimal number FF5 is "aliased" to PoScript, representing a
PostScript file.
==== macOS uniform type identifiers (UTIs) ====
A Uniform Type Identifier (UTI) is a method used in
macOS for uniquely identifying "typed" classes of entities, such as file formats. It was developed by
Apple as a replacement for OSType (type & creator codes). The UTI is a
Core Foundation string, which uses a
reverse-DNS string. Some common and standard types use a domain called public (e.g. public.png for a
Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for
Portable Document Format). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image, which itself conforms to a supertype of public.data. A UTI can exist in multiple hierarchies, which provides great flexibility. In addition to file formats, UTIs can also be used for other entities which can exist in macOS, including:
• Pasteboard data
• Folders (directories)
• Translatable types (as handled by the Translation Manager)
• Bundles
• Frameworks
• Streaming data
• Aliases and symlinks
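The idea of a conformance hierarchy can be sketched with a toy lookup table. This is not Apple's API: the table below is a simplified, hand-written assumption covering only the chain from public.png through public.image to public.data, and the function name is hypothetical.

```python
# Toy conformance table: each identifier lists the supertypes it conforms to.
# Simplified for illustration; real identifiers may have several supertypes.
CONFORMS_TO = {
    "public.png": ["public.image"],
    "public.image": ["public.data"],
    "public.data": [],
}


def conforms_to(uti: str, supertype: str) -> bool:
    """True if `uti` equals `supertype` or reaches it through its supertypes."""
    if uti == supertype:
        return True
    return any(conforms_to(parent, supertype)
               for parent in CONFORMS_TO.get(uti, []))
```

Because each entry holds a list of supertypes, the same structure also accommodates an identifier that belongs to more than one hierarchy.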
==== VSAM Catalog ====
In IBM
OS/VS through
z/OS, the VSAM catalog (prior to
ICF catalogs) and the VSAM Volume Record in the VSAM Volume Data Set (VVDS) (with ICF catalogs) identify the type of VSAM dataset.
==== VTOC ====
In IBM
OS/360 through
z/OS, a format 1 or 7
Data Set Control Block (DSCB) in the
Volume Table of Contents (VTOC) identifies the Dataset Organization (
DSORG) of the dataset described by it.
==== OS/2 extended attributes ====
The
HPFS, FAT12, and FAT16 (but not FAT32) filesystems allow the storage of "extended attributes" with files. These comprise an arbitrary set of triplets with a name, a coded type for the value, and a value, where the names are unique and values can be up to 64 KB long. There are standardized meanings for certain types and names (under
OS/2). One such is that the ".TYPE" extended attribute is used to determine the file type. Its value comprises a list of one or more file types associated with the file, each of which is a string, such as "Plain Text" or "HTML document". Thus a file may have several types. The
NTFS filesystem also allows storage of OS/2 extended attributes, as one of the file
forks, but this feature is merely present to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be entirely parsed by applications.
==== POSIX extended attributes ====
On Unix and
Unix-like systems, the
ext2,
ext3,
ext4,
ReiserFS version 3,
XFS,
JFS,
FFS, and
HFS+ filesystems allow the storage of extended attributes with files. These include an arbitrary list of "name=value" strings, where the names are unique and a value can be accessed through its related name.
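Storing a file type as a name/value pair can be sketched with Python's os.setxattr and os.getxattr, which are available on Linux only. The attribute name used here is this example's own choice, the function names are hypothetical, and the sketch reports failure rather than raising when the platform or filesystem does not support extended attributes.

```python
import os

# Attribute name chosen for this sketch; the "user." prefix is the
# namespace that Linux makes available to ordinary users.
ATTR_NAME = "user.mime_type"


def tag_file_type(path: str, mime_type: str) -> bool:
    """Try to store the type as an extended attribute; False if unsupported."""
    if not hasattr(os, "setxattr"):  # os.setxattr exists on Linux only
        return False
    try:
        os.setxattr(path, ATTR_NAME, mime_type.encode("utf-8"))
        return True
    except OSError:
        return False


def read_file_type(path: str):
    """Return the stored type, or None if absent or unsupported."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR_NAME).decode("utf-8")
    except OSError:
        return None
```

The fallback paths matter in practice, since an attribute written on one filesystem may be silently lost when the file is copied to another that lacks support.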
==== PRONOM unique identifiers (PUIDs) ====
The
PRONOM Persistent Unique Identifier (PUID) is an extensible scheme of persistent, unique, and unambiguous identifiers for file formats, which has been developed by
The National Archives of the UK as part of its
PRONOM technical registry service. PUIDs can be expressed as
Uniform Resource Identifiers using the info:pronom/ namespace. Although not yet widely used outside of the UK government and some
digital preservation programs, the PUID scheme does provide greater granularity than most alternative schemes.
==== MIME types ====
MIME types are widely used in many
Internet-related applications, and increasingly elsewhere, although their usage for on-disc type information is rare. These consist of a standardised system of identifiers (managed by
IANA) consisting of a
type and a
sub-type, separated by a
slash, for instance text/html or image/gif. These were originally intended as a way of identifying what type of file was attached to an
e-mail, independent of the source and target operating systems. MIME types identify files on
BeOS,
AmigaOS 4.0 and
MorphOS, as well as store unique application signatures for application launching. In AmigaOS and MorphOS, the MIME type system works in parallel with the Amiga-specific Datatype system. MIME types have problems, though: several organizations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.
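The type/sub-type structure can be illustrated with Python's standard mimetypes module, which maps filename extensions to MIME types. This is only a sketch of one common lookup; the helper names are hypothetical, and the mapping it consults is the module's own table rather than the IANA registry itself.

```python
import mimetypes
from typing import Optional, Tuple


def mime_for(name: str) -> Optional[str]:
    """Guess a MIME type from the filename extension, or None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(name)
    return mime_type


def split_mime(mime_type: str) -> Tuple[str, str]:
    """Split a MIME type into its type and sub-type at the slash."""
    major, _, minor = mime_type.partition("/")
    return major, minor
```

For instance, split_mime("text/html") separates the identifier into the type "text" and the sub-type "html".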
==== File format identifiers (FFIDs) ====
File format identifiers are another, not widely used, way to identify file formats according to their origin and their file category. They were created for the Description Explorer suite of software. An FFID is composed of several digits of the form NNNNNNNNN-XX-YYYYYYY. The first part indicates the organization of origin or maintainer (this number represents a value in a company/standards organization database), and the two following digits categorize the type of file in
hexadecimal. The final part is composed of the usual filename extension of the file or the international standard number of the file, padded left with zeros. For example, the PNG file specification has the FFID of 000000001-31-0015948, where 31 indicates an image file, 0015948 is the standard number and 000000001 indicates the
International Organization for Standardization (ISO).
==== File content based format identification ====
Another less popular way to identify the file format is to examine the file contents for patterns that distinguish one file type from another. The contents of a file are a sequence of bytes, and a byte can take 256 distinct values (0–255). Counting how often each byte value occurs, often referred to as the byte frequency distribution, therefore gives distinguishable patterns that can identify file types. Many content-based file type identification schemes use a byte frequency distribution to build representative models for each file type and apply statistical and data mining techniques to identify file types.
== File structure ==