There are two main classes of indexing schemata for document retrieval systems: form based (or word based) and content based indexing. The document classification scheme (or indexing algorithm) in use determines the nature of the document retrieval system.
===Form based===
Form based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language; the system could, for example, be used to process large sets of chemical representations in molecular biology. A suffix tree algorithm is an example of form based indexing.
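The substring-matching idea behind form based indexing can be illustrated with a small sketch. It uses a suffix array, a simpler relative of the suffix tree, rather than a suffix tree itself. The function names `build_suffix_array` and `find_occurrences` and the sample string are hypothetical and chosen only for illustration. The naive construction is suitable for short texts only.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive construction (sorts full suffixes); fine for a small illustration,
    too slow for large corpora.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions at which `pattern` occurs in `text`."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        pos = suffix_array[mid]
        if text[pos:pos + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        pos = suffix_array[mid]
        if text[pos:pos + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [start, lo) begins with the pattern.
    return sorted(suffix_array[start:lo])


# Illustrative input: a short line-notation string for a molecule (an assumption,
# standing in for the "chemical representations" mentioned above).
text = "CCOC(=O)CCO"
index = build_suffix_array(text)
print(find_occurrences(text, index, "CCO"))  # [0, 8]
```

Because the index stores only positions in the raw character sequence, it makes no assumption about words or natural language, which is why this style of indexing suits unstructured symbolic data.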
===Content based===
The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an inverted index algorithm.

A signature file is a technique that creates a quick and dirty filter, for example a Bloom filter, that keeps all the documents that match the query and, ideally, only a few that do not. This is done by creating a signature for each file, typically a hash-coded version; one method is superimposed coding. A post-processing step then discards the false alarms. Since in most cases this structure is inferior to inverted indexes in terms of speed, size and functionality, it is not widely used. However, with proper parameters it can outperform inverted indexes in certain environments.

==Example: PubMed==