The frequency of a set of
k-mers in a species's genome, in a genomic region, or in a class of sequences can be used as a "signature" of the underlying sequence. Comparing these frequencies is computationally easier than
sequence alignment and is an important method in
alignment-free sequence analysis. It can also be used as a first-stage analysis before an alignment.
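As a rough illustration of comparing k-mer signatures without alignment, the following Python sketch counts the k-mers in each sequence and measures the Euclidean distance between the resulting relative-frequency profiles. The function names and the choice of Euclidean distance are assumptions made for this example; alignment-free methods in practice use a variety of distance measures.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq, k):
    """Return the relative frequency of every k-mer observed in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def signature_distance(freq_a, freq_b):
    """Euclidean distance between two k-mer frequency profiles."""
    kmers = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(m, 0.0) - freq_b.get(m, 0.0)) ** 2 for m in kmers))


# Hypothetical toy sequences: two AT-rich, one GC-rich.
a = kmer_frequencies("ATATATATAT", 2)
b = kmer_frequencies("ATATATATTA", 2)
c = kmer_frequencies("GCGCGCGCGC", 2)

# The two AT-rich sequences have closer signatures than the AT/GC pair.
print(signature_distance(a, b) < signature_distance(a, c))
```

Each profile is built in a single pass over the sequence, which is why this kind of comparison is cheaper than aligning the sequences.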
===Sequence assembly===
In sequence assembly,
k-mers are used during the construction of
De Bruijn graphs. To create a De Bruijn graph, the
k-mer stored on each edge, with length L, must overlap the string on an adjoining edge by L-1 so that the shared overlap forms a
vertex. Reads generated by
next-generation sequencing typically vary in length. For example,
Illumina's sequencing technology produces reads of 100-mers. However, only a small fraction of all the possible 100-mers present in the genome is actually generated. This is due to read errors but, more importantly, to simple coverage holes that occur during sequencing. This small fraction of the possible
k-mers violates the key assumption of De Bruijn graphs that each
k-mer read must overlap its adjoining
k-mer in the genome by k-1 (which cannot occur when not all the possible
k-mers are present). The solution to this problem is to break these
k-mer sized reads into smaller
k-mers, such that the resulting smaller
k-mers represent all the possible
k-mers of that smaller size that are present in the genome. Furthermore, splitting the
k-mers into smaller sizes also helps alleviate the problem of different initial read lengths.

In this example, the five reads do not account for all the possible 7-mers of the genome, and as such, a De Bruijn graph cannot be created. But when they are split into 4-mers, the resultant subsequences are enough to reconstruct the genome using a De Bruijn graph.

Beyond being used directly for sequence assembly,
k-mers can also be used to detect genome mis-assembly by identifying
overrepresented k-mers, which suggest the presence of
repeated DNA sequences that have been combined. In addition,
k-mers are also used to detect bacterial contamination during eukaryotic genome assembly, an approach borrowed from the field of
metagenomics.
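The read-splitting idea described above can be sketched in Python as follows. This is a minimal illustration, not a production assembler: the genome, the two reads, and all function names are hypothetical, the reads are assumed to be error-free, and the graph is assumed to contain a single Eulerian path. Each distinct k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix, and the genome is read off by walking every edge once (Hierholzer's algorithm).

```python
from collections import defaultdict


def split_into_kmers(reads, k):
    """Collect the distinct k-mers found in a collection of reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def build_de_bruijn(kmers):
    """Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for w in successors:
            in_deg[w] += 1
    # Start at the vertex with one more outgoing than incoming edge, if any.
    start = next((v for v in graph if len(graph[v]) - in_deg[v] == 1),
                 next(iter(graph)))
    remaining = {v: list(ws) for v, ws in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            stack.append(remaining[v].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the k-mer graph."""
    path = eulerian_path(build_de_bruijn(split_into_kmers(reads, k)))
    return path[0] + "".join(v[-1] for v in path[1:])


# Two hypothetical 7-mer reads from the toy genome ATGGCGTGCA.
reads = ["ATGGCGT", "GCGTGCA"]
print(assemble(reads, 4))  # ATGGCGTGCA
```

With k = 7 the two reads share no 6-mer vertex, so the graph consists of two unconnected edges and the genome cannot be recovered. With k = 4 the reads together supply every 4-mer of the toy genome.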
====Choice of k-mer size====
The choice of
k-mer size has many different effects on sequence assembly. These effects vary greatly between smaller and larger
k-mers. Therefore, the effects of the different
k-mer sizes must be understood in order to choose a suitable size that balances them. The effects of the sizes are outlined below.
=====Lower k-mer sizes=====
• A lower k-mer size will decrease the number of edges stored in the graph and, as such, will help decrease the amount of space required to store the DNA sequence.
• Smaller sizes increase the chance that all the k-mers overlap and, as such, that the subsequences required to construct the De Bruijn graph are present.
• However, smaller k-mers also risk having many vertices in the graph leading into a single k-mer. This makes reconstruction of the genome more difficult, because the larger number of vertices that need to be traversed creates a higher level of path ambiguity.
• Information is lost as the k-mers become smaller.
  • E.g. the probability of encountering AGTCGTAGATGCTG is lower than that of ACGT, and as such, the longer sequence holds a greater amount of information (refer to entropy (information theory) for more information).
• Smaller k-mers also cannot resolve areas in the DNA where small microsatellites or repeats occur. This is because smaller k-mers tend to sit entirely within the repeat region, so it is hard to determine the amount of repetition that has actually taken place.
  • E.g. for the subsequence ATGTGTGTGTGTGTACG, the number of repetitions of TG will be lost if a k-mer size less than 16 is chosen. This is because most of the k-mers will sit in the repeated region and may just be discarded as repeats of the same k-mer instead of indicating the number of repeats.
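The repeat problem in the last example can be made concrete with a short Python sketch. It counts how often each k-mer occurs in the example subsequence for a small k. The function name is an assumption for illustration, and the sketch only shows the collapse of repeated k-mers; it does not model a full assembler.

```python
from collections import Counter


def kmer_multiplicity(seq, k):
    """Count how many times each k-mer occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = kmer_multiplicity("ATGTGTGTGTGTGTACG", 4)

# 14 windows of length 4, but only 6 distinct 4-mers.
print(sum(counts.values()), len(counts))

# The two 4-mers lying entirely inside the TG repeat dominate the counts.
print(counts["TGTG"], counts["GTGT"])  # 5 5
```

A graph that stores each distinct 4-mer once keeps TGTG and GTGT as a two-edge cycle, and the cycle by itself does not record how many times it was traversed.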
=====Higher k-mer sizes=====
• Larger k-mers will increase the number of edges in the graph, which in turn will increase the amount of memory needed to store the DNA sequence.
• By increasing the size of the k-mers, the number of vertices will decrease. This will help with the construction of the genome, as there will be fewer paths to traverse in the graph.

… and eukaryotes. Another application of
k-mers is in genomics-based taxonomy. For example, GC-content has been used to distinguish between species of
Erwinia with moderate success. Similar to the direct use of GC-content for taxonomic purposes is the use of Tm, the melting temperature of DNA. Because GC base pairs are more thermally stable, sequences with higher GC content exhibit a higher Tm. In 1987, the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics proposed the use of ΔTm as a factor in determining species boundaries as part of the
phylogenetic species concept, though this proposal does not appear to have gained traction within the scientific community.

Other applications within genetics and genomics include:
• RNA isoform quantification from RNA-seq data
• Classification of human mitochondrial haplogroup
• Detection of recombination sites in genomes
• Estimation of genome size using k-mer frequency vs k-mer depth
• Characterization of CpG islands by flanking regions
• De novo detection of repeated sequences such as transposable elements
• DNA barcoding of species
• Characterization of protein-binding sequence motifs
• Identification of mutations or polymorphisms using next-generation sequencing data
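The relationship between GC-content and melting temperature noted in the taxonomy passage above can be sketched numerically. GC-content is simply the combined frequency of the 1-mers G and C. The Tm estimate below uses the Marmur-Doty linear approximation for long duplex DNA in standard saline citrate. That formula is an assumption brought in for illustration and is not taken from this article, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (the k = 1 frequencies)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def estimate_tm_celsius(gc_fraction):
    """Assumed Marmur-Doty approximation: Tm = 69.3 + 0.41 * (%GC).

    Intended for long genomic DNA under standard salt conditions;
    it is not suitable for short oligonucleotides.
    """
    return 69.3 + 0.41 * (100.0 * gc_fraction)


gc = gc_content("ATGCGCGCTA")  # hypothetical sequence, 6 of 10 bases are G or C
print(gc, estimate_tm_celsius(gc))
```

Under this approximation the estimate rises linearly with GC-content, matching the qualitative statement that GC-rich sequences melt at higher temperatures.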
===Metagenomics===
k-mer frequency and spectrum variation are heavily used in metagenomics for both analysis and binning. In binning, the challenge is to separate sequencing reads into "bins" of reads for each organism (or
operational taxonomic unit), which will then be assembled. TETRA is a notable tool that takes metagenomic samples and bins them into organisms based on their tetranucleotide (
k = 4) frequencies. Other tools that similarly rely on
k-mer frequency for metagenomic binning are CompostBin (
k = 6), PCAHIER, PhyloPythia (5 ≤
k ≤ 6), CLARK (
k ≥ 20), and TACOA (2 ≤
k ≤ 6). Recent developments have also applied
deep learning to metagenomic binning using
k-mers.

Other applications within metagenomics include:
• Recovery of reading frames from raw reads
• Estimation of species abundance in metagenomic samples
• Determination of which species are present in samples
• Identification of biomarkers for diseases from samples
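To give a feel for composition-based binning as described above, the following Python sketch assigns a sequence to whichever reference has the most similar tetranucleotide (k = 4) frequency vector, using Pearson correlation. It is a deliberately simplified illustration and not a reimplementation of TETRA or any other tool named above. The reference names, the sequences, and the use of raw frequencies with Pearson correlation are all assumptions.

```python
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order, so vectors are comparable.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_profile(seq):
    """Relative frequency of each tetranucleotide, as a 256-element list."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skip windows containing ambiguous bases such as N
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in TETRAMERS]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if undefined)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def assign_bin(seq, references):
    """Return the name of the reference whose profile correlates best."""
    profile = tetra_profile(seq)
    return max(references,
               key=lambda name: pearson(profile, tetra_profile(references[name])))


# Hypothetical reference sequences standing in for two organisms.
references = {
    "at_rich": "ATATTATAATATTAATATAT",
    "gc_rich": "GCGGCGCCGCGCGGCC",
}
print(assign_bin("TATATAATATTA", references))  # at_rich
print(assign_bin("GCGCGGCC", references))      # gc_rich
```

Real metagenomic reads are far less cleanly separated than these toy sequences, which is one reason practical tools add statistical normalization or machine-learning classifiers on top of the raw frequencies.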
===Biotechnology===
Modifying
k-mer frequencies in DNA sequences has been used extensively in biotechnological applications to control translational efficiency. Specifically, it has been used to both up- and down-regulate protein production rates. With respect to increasing protein production, reducing unfavorable dinucleotide frequency has been used to yield higher rates of protein synthesis. In addition, codon usage bias has been modified to create synonymous sequences with greater protein expression rates. The most studied application of
k-mers for decreasing translational efficiency is codon-pair manipulation for attenuating viruses in order to create vaccines. Researchers were able to recode
dengue virus, the virus that causes
dengue fever, such that its codon-pair bias differed more from mammalian codon-usage preference than that of the
wild type. Though containing an identical amino-acid sequence, the recoded virus demonstrated significantly weakened
pathogenicity while eliciting a strong immune response. This approach has also been used effectively to create an influenza vaccine as well as a vaccine for
Marek's disease herpesvirus (MDV). Notably, the codon-pair bias manipulation employed to attenuate MDV did not effectively reduce the
oncogenicity of the virus, highlighting a potential weakness in the biotechnology applications of this approach. To date, no codon-pair deoptimized vaccine has been approved for use. Two later articles help explain the actual mechanism underlying codon-pair deoptimization: codon-pair bias is the result of dinucleotide bias. By studying viruses and their hosts, both sets of authors were able to conclude that the molecular mechanism that results in the attenuation of viruses is an increase in dinucleotides poorly suited for translation. GC-content, due to its effect on
DNA melting point, is used to predict annealing temperature in
PCR, another important biotechnology tool.

==Implementation==