Constructing a co-occurrence network involves identifying keywords in the text, calculating how frequently they co-occur, and analyzing the resulting network to find central words and clusters of themes. Co-occurrence networks can be created for any given list of terms (any dictionary) in relation to any collection of texts (any text corpus). Co-occurring pairs of terms can be called “neighbors”, and these often group into “neighborhoods” based on their interconnections. Individual terms may have several neighbors. Neighborhoods may connect to one another through at least one individual term or may remain unconnected.

Individual terms are, within the context of text mining, symbolically represented as
text strings. In the real world, the entity identified by a term normally has several symbolic representations. It is therefore useful to consider each term as being represented by one primary symbol and possibly several synonymous alternative symbols. The occurrence of an individual term is established by searching for each known symbolic representation of the term. The process can be augmented through natural language processing (NLP) algorithms that interrogate segments of text for possible alternatives such as variations in word order, spacing and hyphenation. NLP can also be used to identify sentence structure and categorize text strings according to grammar (for example, categorizing a string of text as a noun based on a preceding string of text known to be an
article).

Graphic representation of co-occurrence networks allows them to be visualized and inferences to be drawn regarding relationships between entities in the domain represented by the dictionary of terms applied to the text corpus. Meaningful visualization normally requires simplification of the network. For example, networks may be drawn such that the number of neighbors connecting to each term is limited. The criteria for limiting neighbors might be based on the absolute number of co-occurrences or on more subtle criteria such as the “probability” of co-occurrence or the presence of an intervening descriptive term. Quantitative aspects of the underlying structure of a co-occurrence network might also be informative, such as the overall number of connections between entities, the clustering of entities representing sub-domains, or the detection of synonyms.

== Applications and use ==
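As a usage illustration of the method described above, the following minimal Python sketch matches a small dictionary of terms (each with a primary symbol and synonymous alternatives) against a toy corpus, counts co-occurrences, and limits each term to its strongest neighbors. The dictionary, the corpus, and the function names `terms_in`, `cooccurrence_counts` and `top_neighbors` are all hypothetical and chosen only for illustration. Treating each text as one co-occurrence window and ranking neighbors by absolute count are simplifying assumptions, not the only possible choices.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical dictionary: primary symbol -> known symbolic representations.
DICTIONARY = {
    "aspirin": ["aspirin", "acetylsalicylic acid"],
    "headache": ["headache", "head ache", "head-ache"],
    "fever": ["fever"],
}

# Hypothetical corpus; each string is treated as one co-occurrence window.
CORPUS = [
    "Aspirin is often taken for a headache.",
    "Acetylsalicylic acid can reduce fever and relieve a head-ache.",
    "A fever may come with a headache.",
]


def terms_in(text, dictionary):
    """Return the primary symbols whose representations occur in the text."""
    found = set()
    lowered = text.lower()
    for primary, symbols in dictionary.items():
        for symbol in symbols:
            # Case-insensitive, whole-word match on the literal string.
            pattern = r"\b" + re.escape(symbol.lower()) + r"\b"
            if re.search(pattern, lowered):
                found.add(primary)
                break  # one matching representation is enough
    return found


def cooccurrence_counts(corpus, dictionary):
    """Count, for each unordered pair of terms, the texts containing both."""
    counts = Counter()
    for text in corpus:
        present = sorted(terms_in(text, dictionary))
        counts.update(combinations(present, 2))
    return counts


def top_neighbors(counts, k):
    """Keep at most k neighbors per term, ranked by co-occurrence count."""
    neighbors = defaultdict(list)
    for (a, b), n in counts.items():
        neighbors[a].append((b, n))
        neighbors[b].append((a, n))
    # Ties are broken alphabetically so the result is deterministic.
    return {
        term: sorted(ns, key=lambda pair: (-pair[1], pair[0]))[:k]
        for term, ns in neighbors.items()
    }


counts = cooccurrence_counts(CORPUS, DICTIONARY)
for term, ns in sorted(top_neighbors(counts, 1).items()):
    print(term, ns)
```

Sorting the matched terms before pairing them means each unordered pair is always stored under the same key, so the counts form the edge weights of an undirected network. A probability-based criterion could replace the raw count in `top_neighbors` without changing the rest of the sketch.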