Content-addressable storage

== Location-based approaches ==

Traditional file systems generally track files by filename. On random-access media such as a floppy disk, this is accomplished with a directory that consists of a list of filenames and pointers to the data. The pointers refer to a physical location on the disk, normally expressed in disk sectors. On more modern systems and larger media such as hard drives, the directory is itself split into many subdirectories, each tracking a subset of the overall collection of files. Subdirectories are themselves represented as files in a parent directory, producing a hierarchy or tree-like organization. The series of directories leading to a particular file is known as a "path".

In the context of CAS, these traditional approaches are referred to as "location-addressed", as each file is represented by a list of one or more locations, the path and filename, on the physical storage. In these systems, the same file stored under two different names is stored as two files on disk and thus has two addresses. The same is true if the same file, even with the same name, is stored in more than one place in the directory hierarchy. This makes such systems less than ideal for a digital archive, where any unique piece of information should be stored only once.

== CAS and FCS ==

Although location-based storage is widely used in many fields, this was not always the case. Previously, the most common way to retrieve data from a large collection was to use some sort of identifier based on the content of the document. For instance, the ISBN system is used to generate a unique number for every book. A web search for "ISBN 0465048994" returns a list of locations for the book Why Information Grows, on the topic of information storage. Although many locations will be returned, they all refer to the same work, and the user can then pick whichever location is most appropriate.
Additionally, if any one of these locations changes or disappears, the content can be found at any of the other locations.

Because the keys that a CAS derives from a document's content are not human-readable, CAS systems implement a second type of directory that stores metadata to help users find a document. This metadata almost always includes a filename, allowing classic name-based retrieval to be used. The directory will also include fields for common identification systems like ISBN or ISSN codes, user-provided keywords, time and date stamps, and full-text search indexes. Users can search these directories and retrieve a key, which can then be used to retrieve the actual document.

Using a CAS is very similar to using a web search engine. The primary difference is that a web search is generally performed on a topic basis, using an internal algorithm that finds "related" content and then produces a list of locations. The results may list identical content in multiple locations. In a CAS, more than one document may be returned for a given search, but each of those documents will be unique and presented only once.

Another advantage of CAS is that the physical location in storage is not part of the lookup system. If, for instance, a library's card catalog stated that a book could be found on "shelf 43, bin 10", the entire catalog would have to be updated whenever the library was rearranged. In contrast, the ISBN will not change, and the book can be found by looking for the shelf with those numbers. In the computer setting, a file in the DOS filesystem at the path A:\myfiles\textfile.txt points to the physical storage of the file in the myfiles subdirectory. The file can no longer be found at that path if the floppy is moved to the B: drive, and even moving it within the disk hierarchy requires the user-facing directories to be updated. In CAS, only the internal mapping from key to physical location changes; this mapping exists in only one place and can be designed for efficient updating.
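
The metadata directory and the key-to-location mapping described above can be pictured as two small tables that share the same keys. The following is a minimal sketch rather than a description of any particular product: the class and method names are hypothetical, and SHA-256 is assumed as the function that turns content into a key.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Derive a key from the content itself (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class Catalog:
    """Hypothetical user-facing directory plus internal location map."""

    def __init__(self):
        self.metadata = {}   # key -> searchable fields (filename, ISBN, ...)
        self.locations = {}  # key -> current physical location

    def register(self, data: bytes, location: str, **fields) -> str:
        key = content_key(data)
        self.metadata.setdefault(key, {}).update(fields)
        self.locations[key] = location
        return key

    def search(self, **criteria) -> list:
        """Return the keys whose metadata matches every given field."""
        return [
            key
            for key, fields in self.metadata.items()
            if all(fields.get(name) == value for name, value in criteria.items())
        ]

    def relocate(self, key: str, new_location: str) -> None:
        """Move a document: only the internal mapping is touched."""
        self.locations[key] = new_location


catalog = Catalog()
key = catalog.register(b"example document", "shelf 43, bin 10",
                       filename="textfile.txt")
catalog.relocate(key, "shelf 7, bin 2")

# The user-facing search still yields the same key after the move.
found = catalog.search(filename="textfile.txt")
```

In this sketch, relocating a document changes one entry in `locations` and leaves the metadata that users search untouched.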
Keeping that mapping in one place allows files to be moved among storage devices, and even across media, without requiring any changes to how they are retrieved.

For data that changes frequently, CAS is not as efficient as location-based addressing. In these cases, the CAS device would need to continually recompute the address of the data as it changed. This would result in multiple copies of the entire, almost-identical document being stored, the very problem that CAS attempts to avoid. Additionally, the user-facing directories would have to be continually updated with these "new" files and would become polluted by many similar documents, making searching more difficult. In contrast, updating a file in a location-based system is highly optimized: only the internal list of sectors has to be changed, and many years of tuning have been applied to this operation.

Because CAS is used primarily for archiving, file deletion is often tightly controlled or even impossible under user control. In contrast, automatic deletion is a common feature, removing all files older than some legally defined retention period, say ten years.

== In distributed computing ==

The simplest way to implement a CAS system is to store all of the files within a typical database to which clients connect to add, query, and retrieve files. However, the unique properties of content addressability make the paradigm well suited to computer systems in which multiple hosts collaboratively manage files with no central authority, such as distributed file sharing systems. In such systems, the physical location of a hosted file can change rapidly in response to changes in network topology, while the exact content of the files to be retrieved is of more importance to users than their current physical location.
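
The simplest design mentioned above, a single store to which clients add files and from which they retrieve them, can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: an in-memory dictionary stands in for the database, SHA-256 is assumed as the content hash, and the `ContentStore` name and its methods are hypothetical.

```python
import hashlib


class ContentStore:
    """Hypothetical single-node store keyed by a hash of the content."""

    def __init__(self):
        self._blobs = {}  # stand-in for the database: key -> bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored once.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # The key doubles as an integrity check on what was stored.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored content does not match its key")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
first = store.put(b"annual report")
duplicate = store.put(b"annual report")          # same content, added again
revised = store.put(b"annual report, revised")   # small edit, new key
```

Here `first` and `duplicate` are the same key and occupy a single entry, while the revised document receives a different key and is stored in full as a second entry, which illustrates why frequently changing data suits CAS poorly.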
In a distributed system, content hashes are often used for quick network-wide searches for specific files, or to quickly determine which data in a given file has changed and must be propagated to other members of the network with minimal bandwidth usage. In these systems, content addressability allows highly variable network topology to be abstracted away from users who wish to access data, in contrast to systems like the World Wide Web, in which a consistent location for a file or service is key to easy use.

== Content-addressable networks ==

== History ==