Some of the earliest efforts at grammatical description were based at least in part on corpora of particular religious or cultural significance. For example,
Prātiśākhya literature described the sound patterns of
Sanskrit as found in the
Vedas, and
Pāṇini's grammar of
classical Sanskrit was based at least in part on analysis of that same corpus. Similarly, the early
Arabic grammarians paid particular attention to the language of the
Quran. In the Western European tradition, scholars prepared
concordances to allow detailed study of the language of the Bible and other canonical texts.
=== English corpora ===
A landmark in modern corpus linguistics was the publication of
Computational Analysis of Present-Day American English in 1967. Written by
Henry Kučera and
W. Nelson Francis, the work was based on an analysis of the
Brown Corpus, a structured and balanced corpus of one million words of American English from 1961. The corpus comprises 500 text samples of roughly 2,000 words each, drawn from a variety of genres. The Brown Corpus was the first computerized corpus designed for linguistic research. Kučera and Francis subjected it to a variety of computational analyses and then combined elements of linguistics, language teaching, psychology, statistics, and sociology to create a rich and variegated opus.
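The Brown Corpus remains widely distributed today; for instance, it ships with the Python NLTK library, which makes its genre structure easy to inspect. A minimal sketch (assuming NLTK is installed; the corpus data is downloaded on first use):

```python
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)  # fetch the corpus data on first use

# The corpus is organized into genre categories such as "news" and "fiction".
for genre in brown.categories():
    print(f"{genre:>16}: {len(brown.words(categories=genre)):>7} words")

# A word-frequency analysis in the spirit of Kučera and Francis (1967).
freq = nltk.FreqDist(w.lower() for w in brown.words(categories="news"))
print(freq.most_common(10))
```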
A further key publication was Randolph Quirk's "Towards a description of English Usage" (1960), in which he introduced
the Survey of English Usage. Quirk's corpus was the first modern corpus to be built with the purpose of representing the whole language. Shortly thereafter, Boston publisher
Houghton Mifflin approached Kučera to supply a million-word, three-line citation base for its new
American Heritage Dictionary, the first
dictionary compiled using corpus linguistics. The
AHD took the innovative step of combining prescriptive elements (how language
should be used) with descriptive information (how it actually
is used). Other publishers followed suit. The British publisher Collins'
COBUILD monolingual learner's dictionary, designed for users learning
English as a foreign language, was compiled using the
Bank of English. The
Survey of English Usage Corpus was used in the development of one of the most important corpus-based grammars, A Comprehensive Grammar of the English Language, written by Quirk et al. and published in 1985. The
Brown Corpus has also spawned a number of similarly structured corpora: the
LOB Corpus (1960s
British English), Kolhapur (
Indian English), Wellington (
New Zealand English), Australian Corpus of English (
Australian English), the Frown Corpus (early 1990s
American English), and the FLOB Corpus (1990s British English). Other corpora represent many languages, varieties and modes, and include the
International Corpus of English, and the
British National Corpus, a 100-million-word collection of a range of spoken and written texts, created in the 1990s by a consortium of publishers, universities (
Oxford and
Lancaster) and the
British Library. For contemporary American English, work has stalled on the
American National Corpus, but the Corpus of Contemporary American English (1990–present), with more than 400 million words, is now available through a web interface. The first computerized corpus of transcribed spoken language, containing one million words, was built in 1971 by the Montreal French Project; it inspired Shana Poplack's much larger corpus of spoken French in the Ottawa-Hull area.
=== Multilingual corpora ===
In the 1990s, many of the notable early successes of statistical methods in natural language processing (NLP) occurred in the field of
machine translation, due especially to work at IBM Research. These systems were able to take advantage of existing multilingual
textual corpora that had been produced by the
Parliament of Canada and the
European Union, whose laws require all governmental proceedings to be translated into each official language; a minimal sketch of the word-alignment idea behind these systems follows below. There are corpora in non-European languages as well. For example, the National Institute for Japanese Language and Linguistics in Japan has built a number of corpora of spoken and written Japanese.
Sign language corpora have also been created using video data.
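To illustrate how those systems exploited sentence-aligned parallel text, the sketch below implements IBM Model 1, the simplest of the word-alignment models developed at IBM Research, trained with expectation-maximization. The three-sentence bitext is a hypothetical toy stand-in for corpora such as the Canadian Hansard:

```python
from collections import defaultdict

# Hypothetical toy bitext (English, French) standing in for sentence-aligned
# parliamentary proceedings such as the Canadian Hansard.
bitext = [
    (["the", "house"], ["la", "maison"]),
    (["the", "blue", "house"], ["la", "maison", "bleue"]),
    (["the", "flower"], ["la", "fleur"]),
]

t = defaultdict(lambda: 1.0)  # t[(f, e)]: probability of French f given English e

for _ in range(10):  # EM iterations
    count, total = defaultdict(float), defaultdict(float)
    for e_sent, f_sent in bitext:
        for f in f_sent:
            # E-step: share each French word's count among the English words
            # of its sentence, in proportion to the current t values.
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                c = t[(f, e)] / norm
                count[(f, e)] += c
                total[e] += c
    # M-step: renormalize the expected counts into probabilities.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("maison", "house")], 3))  # approaches 1.0 on this toy data
```

On realistic data, models of this family were trained on millions of sentence pairs; the later IBM models add alignment and fertility parameters on top of this lexical core.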
=== Ancient language corpora ===
Besides these corpora of living languages, computerized corpora have also been made of collections of texts in ancient languages. An example is the
Andersen-Forbes database of the Hebrew Bible, developed since the 1970s, in which every clause is parsed using graphs representing up to seven levels of syntax, and every segment tagged with seven fields of information. The
Quranic Arabic Corpus is an annotated corpus for the Classical Arabic language of the
Quran. This is a recent project with multiple layers of annotation including morphological segmentation,
part-of-speech tagging, and syntactic analysis using dependency grammar. The Digital Corpus of Sanskrit (DCS) is a "Sandhi-split corpus of Sanskrit texts with full morphological and lexical analysis... designed for text-historical research in Sanskrit linguistics and philology."
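To make the idea of multi-layer annotation concrete, here is a hypothetical token record in the style of Universal Dependencies; the phrase, field names, and values are illustrative and not taken from any of the corpora above:

```python
from dataclasses import dataclass

@dataclass
class Token:
    """One token carrying several annotation layers at once."""
    form: str        # surface form as it appears in the text
    segments: list   # morphological segmentation (e.g. stem + suffix)
    pos: str         # part-of-speech tag
    head: int        # 1-based index of the syntactic head (0 = root)
    deprel: str      # dependency relation to the head

# Hypothetical annotation of the English phrase "in the houses".
sentence = [
    Token("in",     ["in"],         "ADP",  3, "case"),
    Token("the",    ["the"],        "DET",  3, "det"),
    Token("houses", ["house", "s"], "NOUN", 0, "root"),
]

for i, tok in enumerate(sentence, start=1):
    print(i, tok.form, "+".join(tok.segments), tok.pos, tok.head, tok.deprel)
```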
=== Corpora from specific fields ===
Besides pure linguistic inquiry, researchers have begun to apply corpus linguistics to other academic and professional fields, such as the emerging sub-discipline of Law and Corpus Linguistics, which seeks to understand legal texts using corpus data and tools such as keyword-in-context (KWIC) concordances (see the sketch below).
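A concordance lines up every occurrence of a search term with its surrounding context. A minimal keyword-in-context sketch, where both the function and the statutory snippet are hypothetical illustrations rather than any particular tool or corpus:

```python
def kwic(tokens, keyword, window=3):
    """Print each occurrence of `keyword` with `window` words of context."""
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>25} [{tok}] {right}")

text = ("the carrier shall bear the burden of proof unless the carrier "
        "shows that reasonable care was exercised")
kwic(text.split(), "carrier")
```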
The DBLP Discovery Dataset concentrates on computer science, containing relevant computer science publications together with metadata such as author affiliations, citations, and fields of study. A more focused dataset was introduced by NLP Scholar, which combines papers from the
ACL Anthology and
Google Scholar metadata. Corpora can also aid in translation efforts or in teaching foreign languages.

== Methods ==