Generating data on RNA transcripts can be achieved via either of two main principles: sequencing of individual transcripts (ESTs or RNA-Seq) or hybridisation of transcripts to an ordered array of nucleotide probes (microarrays).

RNA isolation involves disruption of macromolecules and nucleotide complexes, separation of RNA from undesired biomolecules including DNA, and concentration of the RNA via precipitation from solution or elution from a solid matrix. Isolated RNA may additionally be treated with DNase to digest any traces of DNA. It is necessary to enrich messenger RNA, as total RNA extracts are typically 98% ribosomal RNA. Enrichment for transcripts can be performed by poly-A affinity methods or by depletion of ribosomal RNA using sequence-specific probes. Degraded RNA may affect downstream results; for example, mRNA enrichment from degraded samples will result in the depletion of 5' mRNA ends and an uneven signal across the length of a transcript. Snap-freezing of tissue prior to RNA isolation is typical, and care is taken to reduce exposure to RNase enzymes once isolation is complete.
=== Serial and cap analysis of gene expression (SAGE/CAGE) ===

Figure: SAGE workflow. Within the organisms, genes are transcribed and spliced (in eukaryotes) to produce mature mRNA transcripts (red). The mRNA is extracted from the organism, and reverse transcriptase is used to copy the mRNA into stable double-stranded cDNA (ds-cDNA; blue). In SAGE, the ds-cDNA is digested by restriction enzymes (at location 'X' and 'X'+11) to produce 11-nucleotide "tag" fragments. These tags are concatenated and sequenced using long-read Sanger sequencing (different shades of blue indicate tags from different genes). The sequences are deconvoluted to find the frequency of each tag. The tag frequency can be used to report on transcription of the gene that the tag came from.
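The deconvolution step can be illustrated with a short Python sketch. It is a simplification under stated assumptions: tags are taken to be joined end to end with no linker or spacer bases, and the tag sequences and function names are hypothetical examples rather than part of any published SAGE pipeline.

```python
from collections import Counter

TAG_LENGTH = 11  # tag size given in the text


def split_concatemer(sequence, tag_length=TAG_LENGTH):
    """Cut one concatenated read into consecutive fixed-length tags.

    Assumption: tags are joined directly, with no spacer bases.
    Trailing bases that do not fill a whole tag are discarded.
    """
    usable = len(sequence) - len(sequence) % tag_length
    return [sequence[i:i + tag_length] for i in range(0, usable, tag_length)]


def tag_frequencies(concatemer_reads):
    """Count how often each tag occurs across all concatenated reads."""
    counts = Counter()
    for read in concatemer_reads:
        counts.update(split_concatemer(read))
    return counts


# Hypothetical 11-nucleotide tags standing in for two different genes.
TAG_A = "ACGTACGTACG"
TAG_B = "TTTGGGCCCAA"

reads = [TAG_A + TAG_B + TAG_A, TAG_B + TAG_A + "AC"]
sage_counts = tag_frequencies(reads)

for tag, count in sage_counts.most_common():
    print(tag, count)
```

The resulting counts are only a relative measure: each tag still has to be matched to the gene it came from before the frequency can be read as transcript abundance.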
Serial analysis of gene expression (SAGE) was a development of EST methodology to increase the throughput of the tags generated and allow some quantitation of transcript abundance. Cap analysis of gene expression (CAGE) is a variant of SAGE that sequences tags from the 5' end of mRNA transcripts only. Therefore, the transcriptional start site of genes can be identified when the tags are aligned to a reference genome. Identifying gene start sites is of use for promoter analysis and for the cloning of full-length cDNAs. SAGE and CAGE methods produce information on more genes than was possible when sequencing single ESTs, but sample preparation and data analysis are typically more labour-intensive.

Microarrays consist of nucleotide probes arranged in an ordered array. Transcript abundance is determined by hybridisation of fluorescently labelled transcripts to these probes. The fluorescence intensity at each probe location on the array indicates the transcript abundance for that probe sequence.

Spotted arrays consist of drops of a range of purified cDNAs arrayed on the surface of a glass slide. These probes are longer than those of high-density arrays and cannot identify alternative splicing events. Spotted arrays use two different fluorophores to label the test and control samples, and the ratio of fluorescence is used to calculate a relative measure of abundance. High-density arrays use a single fluorescent label, and each sample is hybridised and detected individually. High-density arrays were popularised by the Affymetrix GeneChip array, where each transcript is quantified by several short 25-mer probes that together assay one gene.

NimbleGen arrays were high-density arrays produced by a maskless-photochemistry method, which permitted flexible manufacture of arrays in small or large numbers. These arrays had hundreds of thousands of 45- to 85-mer probes and were hybridised with a one-colour labelled sample for expression analysis. Some designs incorporated up to 12 independent arrays per slide.
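The two-colour ratio used by spotted arrays can be sketched in Python as follows. This is an illustrative calculation only: the probe names and intensities are invented, the log2 transform is a common convention rather than something stated above, and the `floor` parameter is a hypothetical guard against taking the logarithm of zero. Real analyses also apply background correction and normalisation, which are omitted here.

```python
import math


def log2_ratios(test_intensity, control_intensity, floor=1.0):
    """Return log2(test / control) for every probe.

    Both arguments map probe name -> fluorescence intensity.
    Intensities below `floor` are raised to `floor` (an assumption
    made here to keep the ratio defined).
    """
    ratios = {}
    for probe, test_value in test_intensity.items():
        control_value = control_intensity[probe]
        ratio = max(test_value, floor) / max(control_value, floor)
        ratios[probe] = math.log2(ratio)
    return ratios


# Hypothetical intensities for the two fluorophore channels.
test_channel = {"probe_1": 800.0, "probe_2": 100.0, "probe_3": 250.0}
control_channel = {"probe_1": 200.0, "probe_2": 400.0, "probe_3": 250.0}

array_ratios = log2_ratios(test_channel, control_channel)
for probe, value in array_ratios.items():
    print(probe, round(value, 2))
```

A positive value indicates a stronger signal in the test sample, a negative value a stronger signal in the control, and zero indicates equal signal in both channels.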
=== RNA-Seq ===

Figure: RNA-Seq workflow. Within the organisms, genes are transcribed and spliced (in eukaryotes) to produce mature mRNA transcripts (red). The mRNA is extracted from the organism, fragmented, and copied into stable ds-cDNA (blue). The ds-cDNA is sequenced using high-throughput, short-read sequencing methods. These sequences can then be aligned to a reference genome sequence to reconstruct which genome regions were being transcribed. These data can be used to annotate where expressed genes are, their relative expression levels, and any alternative splice variants.

Theoretically, there is no upper limit of quantification in RNA-Seq, and background noise is very low for 100 bp reads in non-repetitive regions. Since the first descriptions in 2006 and 2008, RNA-Seq has been rapidly adopted and overtook microarrays as the dominant transcriptomics technique in 2015. The quest for transcriptome data at the level of individual cells has driven advances in RNA-Seq library preparation methods, resulting in dramatic gains in sensitivity. Single-cell transcriptomes are now well described and have even been extended to in situ RNA-Seq, where transcriptomes of individual cells are directly interrogated in fixed tissues.
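The idea of aligning reads to a reference and counting them per gene, as outlined in the RNA-Seq summary above, can be shown with a toy Python sketch. It is deliberately naive: reads are placed by exact string matching, whereas real aligners tolerate mismatches and reads that span splice junctions. The reference sequence, gene coordinates, and function names are all hypothetical.

```python
def align_exact(read, reference):
    """Return the 0-based start of the first exact match, or None."""
    position = reference.find(read)
    return None if position == -1 else position


def count_reads_per_gene(reads, reference, genes):
    """Count reads that overlap each gene.

    `genes` maps gene name -> (start, end) as a half-open interval
    on the reference. Reads that do not align are skipped.
    """
    counts = {name: 0 for name in genes}
    for read in reads:
        start = align_exact(read, reference)
        if start is None:
            continue
        end = start + len(read)
        for name, (gene_start, gene_end) in genes.items():
            if start < gene_end and end > gene_start:
                counts[name] += 1
    return counts


# Hypothetical 32-base reference with two annotated genes.
reference = "AAAACCCCGGGGTTTTACGTACGTTGCATGCA"
genes = {"gene_a": (0, 12), "gene_b": (16, 32)}
reads = ["AACCCC", "CCGGGG", "GTACGTTG", "TTTTTTTT"]

gene_counts = count_reads_per_gene(reads, reference, genes)
print(gene_counts)
```

Raw counts of this kind are the starting point for expression estimates; comparing genes or samples additionally requires accounting for gene length and the total number of reads.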
==== Methods ====

RNA-Seq was established in concert with the rapid development of a range of high-throughput DNA sequencing technologies. However, before the extracted RNA transcripts are sequenced, several key processing steps are performed. Methods differ in the use of transcript enrichment, fragmentation, amplification, single or paired-end sequencing, and whether to preserve strand information.

Small RNAs, such as microRNAs, can be purified based on their size by gel electrophoresis and extraction. Since mRNAs are longer than the read lengths of typical high-throughput sequencing methods, transcripts are usually fragmented prior to sequencing. The fragmentation method is a key aspect of sequencing library construction. Fragmentation may be achieved by chemical hydrolysis, nebulisation, sonication, or reverse transcription with chain-terminating nucleotides.

During preparation for sequencing, cDNA copies of transcripts may be amplified by PCR to enrich for fragments that contain the expected 5' and 3' adapter sequences. Amplification is also used to allow sequencing of very low input amounts of RNA, down to as little as 50 pg in extreme applications. Spike-in controls of known RNAs can be used for quality control assessment to check library preparation and sequencing, in terms of GC-content, fragment length, as well as the bias due to fragment position within a transcript.
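One way spike-in controls might be used to look for GC-related bias is sketched below in Python. The spike-in names, sequences, and read counts are invented for illustration, and "recovery" (observed reads divided by expected reads) is a label chosen here, not a standard metric from any particular kit or tool.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def spike_in_report(sequences, expected_reads, observed_reads):
    """Pair each spike-in's GC fraction with its read recovery.

    Returns a dict mapping spike-in name -> (gc_fraction, recovery),
    where recovery = observed reads / expected reads.
    """
    report = {}
    for name, sequence in sequences.items():
        recovery = observed_reads[name] / expected_reads[name]
        report[name] = (gc_content(sequence), recovery)
    return report


# Hypothetical spike-in RNAs added in equal known amounts.
spike_sequences = {
    "spike_low_gc": "ATATATGCAT",
    "spike_high_gc": "GCGCGCATGC",
}
expected = {"spike_low_gc": 100, "spike_high_gc": 100}
observed = {"spike_low_gc": 90, "spike_high_gc": 50}

qc_report = spike_in_report(spike_sequences, expected, observed)
for name, (gc, recovery) in qc_report.items():
    print(f"{name}: GC={gc:.2f} recovery={recovery:.2f}")
```

In this made-up example the GC-rich spike-in is recovered less efficiently than the GC-poor one, which is the kind of pattern that would prompt a closer look at library preparation.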
Unique molecular identifiers (UMIs) are short random sequences that are used to individually tag sequence fragments during library preparation so that every tagged fragment is unique. UMIs provide an absolute scale for quantification, the opportunity to correct for subsequent amplification bias introduced during library construction, and a means to accurately estimate the initial sample size. UMIs are particularly well-suited to single-cell RNA-Seq transcriptomics, where the amount of input RNA is restricted and extended amplification of the sample is required.

Once the transcript molecules have been prepared, they can be sequenced in just one direction (single-end) or both directions (paired-end). A single-end sequence is usually quicker to produce, cheaper than paired-end sequencing, and sufficient for quantification of gene expression levels. Paired-end sequencing produces more robust alignments and assemblies, which is beneficial for gene annotation and transcript isoform discovery. Without strand information, reads can be aligned to a gene locus but do not indicate the direction in which the gene is transcribed. Stranded RNA-Seq is useful for deciphering transcription for genes that overlap in different directions and for making more robust gene predictions in non-model organisms.

Direct sequencing of RNA using nanopore sequencing represents a current state-of-the-art RNA-Seq technique. Nanopore sequencing of RNA can detect modified bases that would otherwise be masked when sequencing cDNA and also eliminates amplification steps that can otherwise introduce bias.

The sensitivity and accuracy of an RNA-Seq experiment are dependent on the number of reads obtained from each sample. The current benchmarks recommended by the Encyclopedia of DNA Elements (ENCODE) Project are for 70-fold exome coverage for standard RNA-Seq and up to 500-fold exome coverage to detect rare transcripts and isoforms.

== Data analysis ==
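As a small illustration of how the UMIs described above feed into analysis, the following Python sketch collapses PCR duplicates by counting distinct UMIs per gene. It assumes that reads have already been assigned to genes and that UMIs contain no sequencing errors; practical tools also merge UMIs that differ by a base or two. The gene names, UMI sequences, and function name are hypothetical.

```python
from collections import Counter, defaultdict


def count_unique_molecules(tagged_reads):
    """Count distinct UMIs per gene.

    `tagged_reads` is an iterable of (gene, umi) pairs. Reads sharing
    both gene and UMI are treated as PCR copies of one original molecule.
    """
    umis_per_gene = defaultdict(set)
    for gene, umi in tagged_reads:
        umis_per_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}


# Hypothetical reads: gene_a has one molecule amplified three times.
tagged_reads = [
    ("gene_a", "AACGT"),
    ("gene_a", "AACGT"),
    ("gene_a", "AACGT"),
    ("gene_a", "GGTCA"),
    ("gene_b", "TTAGC"),
]

raw_counts = Counter(gene for gene, _ in tagged_reads)
molecule_counts = count_unique_molecules(tagged_reads)

print("raw reads:", dict(raw_counts))
print("unique molecules:", molecule_counts)
```

Comparing the two tallies shows the effect of the correction: the raw read count overstates gene_a because of amplification, while the UMI count reflects the number of starting molecules.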