String mining typically deals with a limited alphabet for the items that appear in a sequence, although the sequence itself may be very long. Examples of such alphabets include the
ASCII character set used in natural language text,
nucleotide bases 'A', 'G', 'C' and 'T' in
DNA sequences, or
amino acids for
protein sequences. In biological applications, analysis of the arrangement of the alphabet in strings can be used to examine
gene and
protein sequences to determine their properties. Knowing the sequence of letters of a
DNA molecule or a
protein is not an end in itself. Rather, the major task is to understand the sequence in terms of its structure and
biological function. This is typically achieved by first identifying individual regions or structural units within each sequence and then assigning a function to each structural unit. In many cases, this requires comparing a given sequence with previously studied ones. Comparison between strings becomes complicated when
insertions,
deletions and
mutations occur in a string. The key algorithms for sequence comparison in bioinformatics are surveyed and classified by Abouelhoda & Ghanem (2010); they include:
• Repeat-related problems: these deal with operations on single sequences and can be based on exact string matching or approximate string matching methods for finding dispersed fixed-length and maximal-length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences.
• Alignment problems: these deal with comparison between strings by first aligning one or more sequences; examples of popular methods include BLAST for comparing a single sequence with multiple sequences in a database, and ClustalW for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global alignments, semi-global alignments and local alignments. See sequence alignment.

==Itemset mining==