Word segmentation

Word segmentation is the problem of dividing a string of written language into its component words.

In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter), although this concept has limits because of the variability with which languages emically regard collocations and compounds. Many English compound nouns are variably written (for example, ice box = ice-box = icebox; pig sty = pig-sty = pigsty), with a corresponding variation in whether speakers think of them as noun phrases or single nouns. There are trends in how norms are set, such as open compounds tending eventually to solidify by widespread convention, but variation remains systemic. In contrast, German compound nouns show less orthographic variation, with solidification being a stronger norm.

However, the equivalent of the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.

In some writing systems, however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.

The Unicode Consortium has published a Standard Annex on Text Segmentation, exploring the issues of segmentation in multiscript texts.
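As a minimal illustration of delimiter-based word segmentation, the Python sketch below splits one string on whitespace and another on an explicit non-whitespace divider. The function names and sample strings are chosen here for illustration only. The divider is assumed to be the Ethiopic wordspace character (U+1361), and the Ge'ez-script sample is simply two words joined by it.

```python
# Sketch only: function names and sample strings are illustrative assumptions.
ETHIOPIC_WORDSPACE = "\u1361"  # assumed non-whitespace word divider


def split_on_space(text):
    """Treat runs of whitespace as word dividers."""
    return text.split()


def split_on_divider(text, divider=ETHIOPIC_WORDSPACE):
    """Treat an explicit divider character as the word boundary."""
    return [word for word in text.split(divider) if word]


english_words = split_on_space("the ice box is empty")
geez_words = split_on_divider("ሰላም" + ETHIOPIC_WORDSPACE + "ዓለም")

print(english_words)
print(geez_words)
```

Splitting on spaces treats the open compound "ice box" as two tokens, which is the limit of the space-as-divider approximation noted above.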
Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.

Word splitting may also refer to the process of hyphenation.

Some scholars have suggested that modern Chinese should be written with word segmentation, that is, with spaces between words as in written English, because some texts are ambiguous and only the author knows the intended meaning. For example, "美国会不同意。" may mean "美国 会 不同意。" (The US will not agree.) or "美 国会 不同意。" (The US Congress does not agree.). For more details, see Chinese word-segmented writing.
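The ambiguity in the example can be reproduced with a small sketch that enumerates every way of splitting unspaced text into dictionary words. This is an illustrative brute-force approach, not a description of any particular segmenter; the function name and the five-entry lexicon are assumptions made for this example.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon covering only the example sentence.
LEXICON = {"美", "美国", "国会", "会", "不同意"}

candidates = segmentations("美国会不同意", LEXICON)
for words in candidates:
    print(" ".join(words))
```

With this lexicon the sketch finds both readings, and nothing in the string itself indicates which one the author intended. Practical systems need additional information, such as word frequencies or context, to choose between candidates.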
Intent segmentation

Intent segmentation is the problem of dividing written words into keyphrases (groups of two or more words).

In English and all other languages, the core intent or desire is identified and becomes the cornerstone of the keyphrase intent segmentation. A core product or service, idea, action, or thought anchors the keyphrase.

"[All things are made of atoms]. [Little particles that move] [around in perpetual motion], [attracting each other] [when they are a little distance apart], [but repelling] [upon being squeezed] [into one another]."
Sentence segmentation

Sentence segmentation is the problem of dividing a string of written language into its component sentences. In English and some other languages, using punctuation, particularly the full stop/period character, is a reasonable approximation. However, even in English this problem is not trivial due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example, Mr. is not its own sentence in "Mr. Smith went to the shops in Jones Street." When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.

As with word segmentation, not all written languages contain punctuation characters that are useful for approximating sentence boundaries.
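The abbreviation-table idea can be sketched as follows. This is a minimal illustration under stated assumptions: the table is a tiny hypothetical sample, the function name is chosen here, and a candidate boundary is taken to be a full stop, exclamation mark or question mark followed by whitespace or the end of the text.

```python
import re

# Hypothetical, deliberately tiny abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "e.g.", "i.e."}


def split_sentences(text):
    """Split on sentence-final punctuation unless it ends a known abbreviation."""
    sentences = []
    start = 0
    # Candidate boundary: '.', '!' or '?' followed by whitespace or end of text.
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # e.g. "Mr." does not close the sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without final punctuation
    return sentences


result = split_sentences("Mr. Smith went to the shops in Jones Street. He bought milk.")
print(result)
```

The sketch never breaks after a listed abbreviation, so it misses the case where an abbreviation also ends a sentence. That is the ambiguity described above, and a lookup table alone cannot resolve it.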
Topic segmentation

Topic analysis consists of two main tasks: topic identification and text segmentation. While the first is a simple classification of a specific text, the latter implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other cases, one needs to use techniques similar to those used in document classification.

Segmenting the text into topics or discourse turns might be useful in some natural language processing tasks: it can improve information retrieval or speech recognition significantly (by indexing/recognizing documents more precisely or by giving the specific part of a document corresponding to the query as a result). It is also needed in topic detection and tracking systems and text summarizing problems.

Many different approaches have been tried: e.g. HMM, lexical chains, passage similarity using word co-occurrence, clustering, topic modeling, etc.

It is quite an ambiguous task: people evaluating the text segmentation systems often differ in topic boundaries. Hence, text segment evaluation is also a challenging problem. One measure used for this is WindowDiff:

$$\mathrm{WindowDiff}(ref, hyp) = \frac{1}{N-k} \sum_{i=1}^{N-k} \left| b(ref_i, ref_{i+k}) - b(hyp_i, hyp_{i+k}) \right|$$

where ref and hyp are the reference and hypothesized segmentations, N is the number of text units, k is the window size, and b(x_i, x_{i+k}) is the number of boundaries between positions i and i+k.
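The WindowDiff measure can be sketched in a few lines of Python. The sketch follows the formula as written in the text (a sum of absolute differences in boundary counts), under an assumed encoding: each segmentation is a list of N-1 flags, where flag j is 1 if a boundary falls between unit j and unit j+1. The function name and the `clip` parameter are choices made here; `clip=True` gives a common variant that counts each window at most once.

```python
def window_diff(ref, hyp, k, clip=False):
    """WindowDiff between two segmentations given as 0/1 boundary flags.

    ref, hyp: equal-length lists with one flag per gap between adjacent units
              (assumed encoding; N units give N - 1 flags).
    k:        window size in units, with 0 < k < N.
    clip:     if True, each window contributes at most 1 to the sum.
    """
    if len(ref) != len(hyp):
        raise ValueError("segmentations must have the same length")
    n_units = len(ref) + 1
    if not 0 < k < n_units:
        raise ValueError("k must satisfy 0 < k < N")
    total = 0
    for i in range(n_units - k):
        # b(x_i, x_{i+k}): boundaries in the k gaps spanned by the window.
        diff = abs(sum(ref[i:i + k]) - sum(hyp[i:i + k]))
        total += min(diff, 1) if clip else diff
    return total / (n_units - k)


reference = [0, 1, 0, 0, 1, 0, 0]   # 8 units, boundaries after units 2 and 5
hypothesis = [0, 0, 1, 0, 1, 0, 0]  # first boundary placed one unit late
score = window_diff(reference, hypothesis, k=3)
print(score)
```

A score of 0 means the two segmentations agree in every window. A boundary that is slightly misplaced is penalized only in the windows where the counts differ, so near misses cost less than missing boundaries.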
Other segmentation problems

Processes may be required to segment text into units other than those mentioned above, including morphemes (a task usually called morphological analysis) or paragraphs.

== Automatic segmentation approaches ==