Nucleotide sequence analysis identifies functional elements such as protein binding sites, uncovers genetic variations such as SNPs, characterizes gene expression patterns, and helps explain the genetic basis of traits. It also helps clarify the mechanisms that contribute to processes like replication and transcription. Some of the tasks involved are outlined below.
=== Quality control and preprocessing ===
Quality control assesses the quality of sequencing reads obtained from the sequencing technology (e.g. Illumina). It is the first step in sequence analysis and is intended to limit wrong conclusions due to poor quality data. The tools used at this stage depend on the sequencing platform. For instance, FastQC checks the quality of short reads (including RNA sequences), NanoPlot or PycoQC are used for long read sequences (e.g. Nanopore sequence reads), and MultiQC aggregates the results of FastQC in a webpage format. Quality control provides information such as read lengths, GC content, presence of adapter sequences (for short reads), and a quality score, which is often expressed on a PHRED scale. If adapters or other artifacts from PCR amplification are present in the reads (particularly short reads), they are removed using software such as Trimmomatic or Cutadapt.
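
The PHRED scale mentioned above relates a quality score Q to the estimated probability p that a base call is wrong, with Q = -10 log10(p). As a minimal sketch, assuming the common FASTQ encoding in which each quality character is the score plus an ASCII offset of 33 (older Illumina data may use an offset of 64), the following Python decodes a quality string and summarizes it. The function names and the example string are hypothetical and are not part of FastQC or any other tool named here.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer PHRED scores.

    Assumes each character encodes score + offset (33 by default).
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a PHRED score Q into an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_error_probability(quality_string, offset=33):
    """Average the per-base error probabilities of one read."""
    scores = phred_scores(quality_string, offset)
    if not scores:
        raise ValueError("empty quality string")
    return sum(error_probability(q) for q in scores) / len(scores)


if __name__ == "__main__":
    example = "II5+!"  # made-up quality string for illustration
    print(phred_scores(example))
    print(mean_error_probability(example))
```

Averaging error probabilities rather than the scores themselves is a design choice of this sketch: PHRED scores are logarithmic, so a single very poor base affects the mean error probability much more than it affects the mean score.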
=== Read alignment ===
At this step, sequencing reads whose quality has been improved are mapped to a reference genome using alignment tools like BWA for short DNA sequence reads, minimap for long read DNA sequences, and STAR for RNA sequence reads. The purpose of mapping is to find the origin of any given read based on the reference sequence. It is also important for detecting variations or for phylogenetic studies. The output from this step, that is, the aligned reads, is stored in a compatible file format known as SAM, which contains information about the reference genome as well as individual reads. Alternatively,
BAM files are often preferred because they use much less disk space.

=== Variant calling ===
Variant callers are used to identify differences compared to the reference sequence. The choice of variant calling tool depends heavily on the sequencing technology used: GATK is often used when working with short reads, while long read sequences require tools like DeepVariant and Sniffles. Tools may also differ based on organism (prokaryotes or eukaryotes), source of sequence data (cancer vs metagenomic), and variant type of interest (SNVs or structural variants). The output of variant calling is typically in VCF format, and can be filtered using allele frequencies, quality scores, or other factors based on the research question at hand.

Variants can then be annotated using dedicated tools or custom scripts and pipelines. The output from this step is an annotation file in BED or TXT format.

In RNA sequence analysis, differentially expressed genes (DEGs) between experimental conditions are identified using statistical methods like those implemented in DESeq2. This is carried out to compare the expression levels of genes or isoforms between or across different samples, and to infer biological relevance.
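
To make the idea of comparing expression levels between conditions concrete, here is a deliberately naive Python sketch. It scales raw read counts to counts per million (CPM) and reports a log2 fold change per gene. This is not the DESeq2 method: DESeq2 is an R package that fits a negative binomial model with its own normalization and performs statistical testing, none of which is reproduced here. The gene names, counts, function names, and the pseudocount are all made up for illustration.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that the sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Return log2(treated / control) per gene after CPM scaling.

    The pseudocount (an assumption of this sketch) avoids division by
    zero for genes with no reads in one condition. Only genes present
    in both samples are reported.
    """
    cpm_control = counts_per_million(control)
    cpm_treated = counts_per_million(treated)
    shared = sorted(cpm_control.keys() & cpm_treated.keys())
    return {
        gene: math.log2(
            (cpm_treated[gene] + pseudocount) / (cpm_control[gene] + pseudocount)
        )
        for gene in shared
    }


if __name__ == "__main__":
    # Hypothetical counts for one control and one treated sample.
    control = {"geneA": 100, "geneB": 100, "geneC": 800}
    treated = {"geneA": 400, "geneB": 50, "geneC": 550}
    for gene, lfc in log2_fold_changes(control, treated).items():
        print(f"{gene}\t{lfc:+.2f}")
```

A positive value suggests higher expression in the treated sample and a negative value lower expression. With a single sample per condition there is no estimate of variability, which is why real analyses rely on replicates and a statistical model rather than fold changes alone.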
=== Functional enrichment analysis ===
Functional enrichment analysis identifies biological processes, pathways, and functional impacts associated with the differentially expressed genes obtained from the previous step. It uses tools like GOSeq and Pathview. This creates a table with information about which pathways and molecular processes are associated with the differentially expressed genes, which genes are down- or upregulated, and which gene ontology terms are recurrent or over-represented. RNA sequence analysis explores gene expression dynamics and regulatory mechanisms underlying biological processes and diseases. Interpretation of images and tables is carried out within the context of the hypotheses being investigated.

== Analyzing protein sequences ==