Applications •
Automated essay scoring (AES) – the use of specialized computer programs to assign grades to essays written in an educational setting. It is a method of educational assessment and an application of natural-language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades—for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification. •
Automatic image annotation – process by which a computer system automatically assigns textual metadata in the form of captioning or keywords to a digital image. The annotations are used in image retrieval systems to organize and locate images of interest from a database. •
Automatic summarization – process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper. •
Types •
Keyphrase extraction – •
Document summarization – •
Multi-document summarization – •
Methods and techniques •
Extraction-based summarization – •
Abstraction-based summarization – •
Maximum entropy-based summarization – •
Sentence extraction – •
Aided summarization – •
Human-aided machine summarization (HAMS) – •
Machine-aided human summarization (MAHS) – •
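As a rough illustration of extraction-based summarization, sentences can be scored by the corpus frequency of their words and the top scorers returned in original order. This is a minimal sketch, not any particular published system; real summarizers use far richer features.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Score each sentence by the average corpus frequency of its words
    and return the top-scoring sentences in their original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Word frequencies over the whole document.
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Re-sort the chosen indices so the summary preserves document order.
    return ' '.join(sentences[i] for i in sorted(ranked[:num_sentences]))
```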
Automatic taxonomy induction – automated construction of
tree structures from a corpus. This may be applied to building taxonomical classification systems for reading by end users, such as web directories or subject outlines. •
Coreference resolution – in order to derive the correct interpretation of text, or even to estimate the relative importance of various mentioned subjects, pronouns and other referring expressions need to be connected to the right individuals or objects. Given a sentence or larger chunk of text, coreference resolution determines which words ("mentions") refer to which objects ("entities") included in the text. •
Anaphora resolution – concerned with matching up pronouns with the nouns or names that they refer to. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to). •
Dialog system – •
Foreign-language reading aid – computer program that assists a non-native speaker in reading their target language properly, that is, with correct pronunciation and with stress placed on the right parts of words. •
Foreign-language writing aid – computer program or any other instrument that assists a non-native language user (also referred to as a foreign-language learner) in writing decently in their target language. Assistive operations can be classified into two categories: on-the-fly prompts and post-writing checks. •
Grammar checking – the act of verifying the grammatical correctness of written text, especially if this act is performed by a
computer program. •
Information retrieval – •
Cross-language information retrieval – •
Machine translation (MT) – aims to automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "
AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly. •
Classical approach of machine translation – rule-based machine translation. •
Computer-assisted translation – •
Interactive machine translation – •
Translation memory – database that stores so-called "segments", which can be sentences, paragraphs or sentence-like units (headings, titles or elements in a list) that have previously been translated, in order to aid human translators. •
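The core lookup against such a database can be sketched as a fuzzy match over stored segments. The `memory` dictionary and the 0.6 threshold here are illustrative assumptions; production systems use indexed fuzzy matching, not a linear scan.

```python
from difflib import SequenceMatcher

def tm_lookup(segment, memory, threshold=0.6):
    """Return the best (source, translation, similarity) match from a
    translation memory, or None if nothing reaches the threshold."""
    best = None
    for source, translation in memory.items():
        ratio = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if ratio >= threshold and (best is None or ratio > best[2]):
            best = (source, translation, ratio)
    return best

# A toy memory of previously translated segments.
memory = {"Close the door.": "Ferme la porte.",
          "Open the window.": "Ouvre la fenêtre."}
```

A near-identical new segment such as "Close the doors." would retrieve the stored translation as a suggestion for the human translator.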
Example-based machine translation – •
Rule-based machine translation – •
Natural-language programming – interpreting and compiling instructions communicated in natural language into computer instructions (machine code). •
Natural-language search – •
Optical character recognition (OCR) – given an image representing printed text, determine the corresponding text. •
Question answering – given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?"). •
Open domain question answering – •
Spam filtering – •
Sentiment analysis – extracts subjective information, usually from a set of documents, often using online reviews to determine the "polarity" expressed about specific objects. It is especially useful for identifying trends of public opinion on social media, for example for marketing purposes. •
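The simplest polarity scheme counts words from positive and negative lexicons; the tiny word lists below are invented for illustration, and real systems use large lexicons or trained classifiers.

```python
def polarity(review,
             positive=frozenset({"good", "great", "excellent", "love"}),
             negative=frozenset({"bad", "poor", "terrible", "hate"})):
    """Score a review as positive-word count minus negative-word count;
    the sign of the score gives the polarity."""
    tokens = review.lower().split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```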
Speech recognition – given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of
text to speech and is one of the extremely difficult problems colloquially termed "
AI-complete" (see above). In
natural speech there are hardly any pauses between successive words, and thus
speech segmentation is a necessary subtask of speech recognition (see below). In most spoken languages, the sounds representing successive letters blend into each other in a process termed
coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process. •
Speech synthesis (Text-to-speech) – •
Text-proofing – •
Text simplification – automatically editing a document to use fewer or simpler words while retaining its underlying meaning and information.
Component processes •
Natural-language understanding – converts chunks of text into more formal representations such as
first-order logic structures that are easier for
computer programs to manipulate. Natural-language understanding involves identifying the intended meaning among the multiple possible interpretations that can be derived from a natural-language expression, which usually takes the form of organized notations of natural-language concepts. Introducing a language metamodel and an ontology is an effective, though empirical, solution. An explicit formalization of natural-language semantics, free of confusion with implicit assumptions such as the
closed-world assumption (CWA) vs.
open-world assumption, or subjective Yes/No vs. objective True/False, is needed as a basis for formalizing semantics. •
Natural-language generation – task of converting information from computer databases into readable human language.
Component processes of natural-language understanding •
Automatic document classification (text categorization) – •
Automatic language identification – •
Compound term processing – category of techniques that identify compound terms and match them to their definitions. Compound terms are built by combining two (or more) simple terms, for example "triple" is a single word term but "triple heart bypass" is a compound term. •
Automatic taxonomy induction – •
Corpus processing – •
Automatic acquisition of lexicon – •
Text normalization – •
Text simplification – •
Deep linguistic processing – •
Discourse analysis – includes a number of related tasks. One task is identifying the
discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the
speech acts in a chunk of text (e.g. yes–no questions, content questions, statements, assertions, orders, suggestions, etc.). •
Information extraction – •
Text mining – process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. •
Biomedical text mining – (also known as BioNLP), this is text mining applied to texts and literature of the biomedical and molecular biology domain. It is a rather recent research field drawing elements from natural-language processing, bioinformatics, medical informatics and computational linguistics. There is an increasing interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature due to the increasing number of electronically available publications stored in databases such as PubMed. •
Decision tree learning – •
Sentence extraction – •
Terminology extraction – •
Latent semantic indexing – •
Lemmatisation – groups together the inflected forms of a word that share the same lemma, so that they can be analysed as a single item. •
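At its simplest, lemmatisation is a table lookup over known irregular forms; the handful of entries below is a toy illustration, whereas real lemmatizers combine such exception lists with morphological rules.

```python
# Toy exception table mapping inflected forms to their lemmas.
LEMMAS = {"better": "good", "went": "go", "mice": "mouse",
          "is": "be", "are": "be", "was": "be"}

def lemmatize(word):
    """Map a word form to its lemma by table lookup, falling back to
    the lowercased form itself when the word is not in the table."""
    return LEMMAS.get(word.lower(), word.lower())
```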
Morphological segmentation – separates words into individual
morphemes and identifies the class of the morphemes. The difficulty of this task depends greatly on the complexity of the
morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology, especially
inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as
Turkish, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms. •
Named-entity recognition (NER) – given a stream of text, determines which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although
capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or
Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all
nouns, regardless of whether they refer to names, and French and Spanish do not capitalize names that serve as
adjectives. •
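The limits of capitalization described above can be seen in a deliberately crude heuristic that collects runs of capitalized words while skipping the sentence-initial word; this is an illustrative sketch only, not a real NER method.

```python
def candidate_entities(sentence):
    """Collect runs of capitalized, non-sentence-initial words as
    candidate named entities; a crude heuristic that misses entities
    at the start of a sentence, as noted above."""
    tokens = [t.strip('.,;:!?') for t in sentence.split()]
    entities, run = [], []
    for i, word in enumerate(tokens):
        if i > 0 and word and word[0].isupper():
            run.append(word)
        else:
            if run:
                entities.append(' '.join(run))
            run = []
    if run:
        entities.append(' '.join(run))
    return entities
```

Note how the heuristic finds multi-word entities but, by design, loses a name that opens the sentence, exactly the ambiguity the paragraph describes.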
Ontology learning – automatic or semi-automatic creation of
ontologies, including extracting the corresponding domain's terms and the relationships between those concepts from a corpus of natural-language text, and encoding them with an
ontology language for easy retrieval. Also called "ontology extraction", "ontology generation", and "ontology acquisition". •
Parsing – determines the
parse tree (grammatical analysis) of a given sentence. The
grammar for
natural languages is
ambiguous and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). •
Shallow parsing – •
Part-of-speech tagging – given a sentence, determines the
part of speech for each word. Many words, especially common ones, can serve as multiple
parts of speech. For example, "book" can be a
noun ("the book on the table") or
verb ("to book a flight"); "set" can be a
noun,
verb or
adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little
inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a
tonal language: tonal distinctions made in speech are not readily conveyed by the characters of its orthography. •
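A standard baseline for this task assigns each word its most frequent tag from a tag dictionary; the tiny dictionary and tag frequencies below are invented for illustration.

```python
# Toy tag dictionary: each word lists its possible parts of speech,
# most frequent first (frequencies here are assumed, not measured).
TAGS = {"book": ["NOUN", "VERB"], "set": ["NOUN", "VERB", "ADJ"],
        "the": ["DET"], "a": ["DET"], "to": ["PRT"], "flight": ["NOUN"]}

def tag(tokens):
    """Baseline tagger: assign every word its most frequent tag,
    defaulting unknown words to NOUN."""
    return [(t, TAGS.get(t.lower(), ["NOUN"])[0]) for t in tokens]
```

The baseline mis-tags "book" in "to book a flight" as a noun, which is precisely why context-sensitive models (e.g. hidden Markov model taggers) are used in practice.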
Query expansion – •
Relationship extraction – given a chunk of text, identifies the relationships among named entities (e.g. who is the wife of whom). •
Semantic analysis (computational) – formal analysis of meaning, and "computational" refers to approaches that in principle support effective implementation. •
Explicit semantic analysis – •
Latent semantic analysis – •
Semantic analytics – •
Sentence breaking (also known as
sentence boundary disambiguation and sentence detection) – given a chunk of text, finds the sentence boundaries. Sentence boundaries are often marked by
periods or other
punctuation marks, but these same characters can serve other purposes (e.g. marking
abbreviations). •
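The abbreviation problem just mentioned can be sketched with a rule-based splitter that refuses to break after words in a known abbreviation list; the list here is a toy assumption, and real systems use much larger lists or learned models.

```python
import re

ABBREVIATIONS = {"dr", "mr", "mrs", "etc"}  # toy abbreviation list

def split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace and an uppercase
    letter, unless the period ends a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r'[.!?]\s+(?=[A-Z])', text):
        before = text[start:m.start()].rstrip()
        last = before.split()[-1].lower() if before.split() else ''
        if text[m.start()] == '.' and last in ABBREVIATIONS:
            continue  # "Dr." etc. does not end a sentence here
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```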
Speech segmentation – given a sound clip of a person or people speaking, separates it into words. A subtask of
speech recognition and typically grouped with it. •
Stemming – reduces an inflected or derived word into its
word stem, base, or
root form. •
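Suffix-stripping stemmers in the spirit of Porter's algorithm can be sketched as follows; the suffix list and the three-letter minimum stem length are simplifying assumptions, not Porter's actual rules.

```python
# Toy suffix list; a real stemmer applies ordered, conditional rules.
SUFFIXES = ["ization", "ational", "ing", "ed", "es", "s"]

def stem(word):
    """Strip the longest matching suffix, keeping a stem of at least
    three letters; a crude sketch of suffix-stripping stemming."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```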
Text chunking – •
Tokenization – given a chunk of text, separates it into distinct words, symbols, sentences, or other units. •
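For space-delimited languages, a single regular expression gets surprisingly far: it splits out words (keeping simple contractions together) and punctuation marks as separate tokens. This pattern is an illustrative sketch, not a full tokenizer.

```python
import re

def tokenize(text):
    """Split text into word tokens (keeping contractions like "it's"
    intact) and individual punctuation tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
```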
Topic segmentation and recognition – given a chunk of text, separates it into segments each of which is devoted to a topic, and identifies the topic of the segment. •
Truecasing – •
Word segmentation – separates a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and
Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the
vocabulary and
morphology of words in the language. •
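A classic baseline for unspaced scripts is greedy maximum matching against a vocabulary; the English-without-spaces example below is an artificial stand-in for a real Chinese or Thai lexicon.

```python
def segment(text, vocabulary, max_word_len=4):
    """Greedy longest-match-first segmentation: at each position take
    the longest vocabulary word that matches, falling back to a
    single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words
```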
Word-sense disambiguation (WSD) – because many words have more than one
meaning, word-sense disambiguation is used to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as
WordNet. •
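Given such a sense inventory, the simplified Lesk algorithm picks the sense whose dictionary gloss shares the most words with the surrounding context. The two-sense inventory for "bank" below is a hypothetical example, not taken from WordNet.

```python
def lesk(word, context, sense_inventory):
    """Pick the sense whose gloss overlaps most with the context
    (simplified Lesk algorithm)."""
    context_words = set(context.lower().split())
    def overlap(gloss):
        return len(set(gloss.lower().split()) & context_words)
    return max(sense_inventory[word],
               key=lambda s: overlap(sense_inventory[word][s]))

# Hypothetical two-sense inventory for "bank".
INVENTORY = {"bank": {
    "bank#1": "financial institution that accepts deposits and money",
    "bank#2": "sloping land beside a river or lake",
}}
```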
Word-sense induction – open problem of natural-language processing, which concerns the automatic identification of the senses of a word (i.e. meanings). Given that the output of word-sense induction is a set of senses for the target word (sense inventory), this task is strictly related to that of word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to solve the ambiguity of words in context. •
Automatic acquisition of sense-tagged corpora – •
W-shingling – set of unique "shingles"—contiguous subsequences of tokens in a document—that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.
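The shingle set and the usual resemblance measure over it (Jaccard similarity of the two sets) can be sketched directly from the definition above:

```python
def shingles(tokens, w=3):
    """Set of unique contiguous w-token subsequences of a document."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a, b, w=3):
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```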
Component processes of natural-language generation •
Natural-language generation – task of converting information from computer databases into readable human language. •
Automatic taxonomy induction (ATI) – automated building of
tree structures from a corpus. While ATI is used to construct the core of ontologies (and doing so makes it a component process of natural-language understanding), when the ontologies being constructed are end user readable (such as a subject outline), and these are used for the construction of further documentation (such as using an outline as the basis to construct a report or treatise) this also becomes a component process of natural-language generation. •
Document structuring – •
== History of natural-language processing ==