Treebanks are often created on top of a corpus that has already been annotated with
part-of-speech tags. In turn, treebanks are sometimes enhanced with
semantic or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a
parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.

Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the BulTreeBank follows
HPSG) but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate
phrase structure (for example the Penn Treebank or ICE-GB) and those that annotate
dependency structure (for example the Prague Dependency Treebank or the Quranic Arabic Dependency Treebank).

The formal representation of a treebank should be distinguished from the file format used to store the annotated data. Every treebank is constructed according to a particular grammar, but the same grammar may be implemented by different file formats. For example, the syntactic analysis for
John loves Mary, shown in the accompanying figure, may be represented by simple labelled brackets in a text file, like this (following the Penn Treebank notation):

(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))

This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific
XML schemes, numbered indentation and various types of standoff notation.

== Applications ==
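Whatever the application, the stored trees first have to be read back into a program. The following is a minimal sketch of how the labelled-bracket notation described earlier could be parsed, using the John loves Mary example with the standard Penn Treebank tag VBZ for the verb. The function names (parse_bracketed, leaves, tagged_words) are illustrative and do not belong to any particular treebank toolkit, and the sketch assumes well-formed input in which every opening bracket is followed by a label.

```python
from typing import List, Tuple, Union

# A tree is (label, children); each child is another tree or a word (str).
Tree = Tuple[str, List[Union["Tree", str]]]


def tokenize(text: str) -> List[str]:
    """Split a bracketed string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_bracketed(text: str) -> Tree:
    """Parse one labelled-bracket tree (assumes well-formed input)."""
    tokens = tokenize(text)
    pos = 0

    def parse_node() -> Tree:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]  # assumption: every bracket carries a label
        pos += 1
        children: List[Union[Tree, str]] = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[pos])   # terminal word
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the closing bracket")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Return the words of the sentence in left-to-right order."""
    _, children = tree
    words: List[str] = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def tagged_words(tree: Tree) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs read off the nodes directly above words."""
    label, children = tree
    pairs: List[Tuple[str, str]] = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))
        else:
            pairs.extend(tagged_words(child))
    return pairs


example = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))"
tree = parse_bracketed(example)
print(leaves(tree))        # ['John', 'loves', 'Mary', '.']
print(tagged_words(tree))  # [('John', 'NNP'), ('loves', 'VBZ'), ('Mary', 'NNP'), ('.', '.')]
```

The nested-tuple representation is chosen only for brevity. Real treebank files typically contain further conventions (such as empty outer brackets, traces and function tags) that this sketch does not handle.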