RNA Processing: A Practical Approach: Volume II

(A) The average quality value for each sequence position is indicated by a thin curved line. The Y-axis of the graph shows the quality scores; a higher score reflects a better base call. The background of the graph divides the Y-axis into high (upper), moderate (center), and poor (lower) quality calls. The number of reads was reduced by filtering. (B) Effect of filtering sequence reads. The sequence reads obtained before and after filtering, as indicated in (A), were mapped to the reference genome and visualized. Mismatches are indicated by black circles.

Williams et al. Therefore, if trimming is applied, extreme care should be taken, and other measures, such as length filtering, should be considered in the preprocessing pipeline to minimize the introduction of unwanted bias.
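As a rough illustration of filtering rather than trimming, the following Python sketch keeps a read only if its mean base quality and its length both pass a threshold. The function names, the Phred+33 quality encoding, and the default thresholds are assumptions made for this example and are not values taken from the chapter.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 is an assumption)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(reads: Iterable[Read],
                 min_mean_q: float = 20.0,
                 min_len: int = 50) -> Iterator[Read]:
    """Keep whole reads that pass both the length and the quality threshold."""
    for name, seq, qual in reads:
        # The length check comes first so empty reads never reach mean_phred.
        if len(seq) >= min_len and mean_phred(qual) >= min_mean_q:
            yield name, seq, qual
```

Because reads are either kept whole or discarded, this kind of filter does not shorten reads, which is one way to limit the bias that aggressive trimming can introduce.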

In our follow-up examination of reads obtained on an Illumina MiSeq platform, we concluded that for relatively long sequencing reads with low error rates, aggressive trimming is generally no longer necessary for estimating gene expression levels. In the following section, we propose correcting the reference sequence using RNA-Seq reads, which avoids mismatches between the reads and the reference when the genome sequence of the strain used in the RNA-Seq experiment is not available.

The removal and trimming of unreliable sequences are necessary for this purpose. The pipeline works effectively for microorganisms whose genome sequences and gene models are reliable owing to extensive correction and curation by a large number of researchers. Typical examples of such microorganisms are Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Neurospora crassa, and Aspergillus nidulans, which are known as model organisms.

Among microorganisms, filamentous fungi generally have the largest genomes, with introns in most genes, and are thus thought to require a pipeline with the highest performance and the widest range of functions. Furthermore, filamentous fungi are potential producers of various economically important secondary metabolites and have a large number of highly diverse secondary metabolism-related genes. Thus, the genomes of filamentous fungi and actinomycetes remain attractive targets in this field.

Example of the RNA-Seq analysis pipeline. (A) Typical simple pipeline. After read mapping, tools such as Cufflinks count the number of reads mapped to each genomic feature and extract differentially expressed genes (DEGs). (B) Proposed pipeline for microorganisms whose reference sequence and gene models are not extensively corrected or curated. GenRecon is a Perl script that outputs a consensus sequence based on the output of the variant detection tool VarScan [11]. SpliceSelect is a Perl script that integrates splice site positions from multiple TopHat output files.

RNA-Seq reads can be analyzed without a corresponding genome sequence as a reference through de novo assembly of the reads. However, we do not cover the de novo assembly of RNA-Seq reads in this chapter because sequencing the genome of a microorganism using next-generation sequencers, such as Illumina technology, is relatively inexpensive in terms of cost and time.

For example, we have used the improved pipeline to analyze genome sequences obtained from a short-read sequencer, SOLiD xl, in combination with the de novo assembly pipeline that the manufacturer developed for mate-paired sequences [8], followed by automatic annotation. Because we assume that the genome sequence of the microorganism is available as a reference, the strategy of mapping the transcriptome is not included in this chapter. The sequencing platforms described above are widely used, and bioinformatics tools have been extensively developed for each platform.

The characteristics of the errors depend on the sequencing platform, such as those manufactured by Illumina, Life Technologies, and Pacific Biosciences. The number of reads, read length, and data format also vary by platform. Furthermore, more than one platform, such as a combination of Illumina and Pacific Biosciences or Life Technologies and Illumina [9], might be used, which also requires a specific methodology for obtaining reasonable results. For each RNA-Seq read, most mapping tools search the reference sequence for nucleotide sequences whose similarity to the read exceeds a certain threshold.

Multiple mapping algorithms are widely used to accurately identify the most homologous positions on the reference sequence. However, the problem is complicated by sequencing errors and by reads that are shorter than the repetitive elements in the reference sequence. A typical RNA-Seq experiment sequences both ends of a cDNA fragment to generate two reads (a read pair) separated by a sequence of variable length. Accurate alignment of these read pairs is essential to the downstream analysis of an RNA-Seq experiment, but RNA-Seq read alignment is challenging because mRNA transcripts are noncontiguous on the genome owing to the introns in eukaryotic genes.

Software programs that support spliced alignment differ in strategy in several respects [15]. The methods for determining the position on the reference sequence to which a read is mapped can be roughly classified into two groups: exon-first and seed-and-extend.

Exon-first methods, such as TopHat, use a two-step process. First, they map reads to the reference sequence without allowing large gaps. Subsequently, the unmapped reads are divided into short segments, and each segment is independently aligned to the reference sequence. Discontiguous regions of the genome to which adjacent segments of a read are mapped are treated as candidates for two exons connected by splicing.

The exon-first approach is most effective when a majority of the reads can be mapped without gaps. If retrotransposed genes or pseudogenes originating from transcripts with multiple exons are present in the genome sequence, software that employs the exon-first approach might preferentially map the reads to the retrotransposed region. In seed-and-extend methods, such as STAR, reads are divided into short seeds (k-mers), the positions where the seeds occur in the genome are searched, and alignments are built and extended using this information.
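The seed-and-extend idea can be sketched in a few lines of Python. This toy version indexes every k-mer of the reference, looks up each k-mer of the read as a seed, and extends each seed hit into an ungapped alignment scored by mismatches. It is a minimal illustration under simplifying assumptions (no gaps, no splicing, a plain dictionary index) and is not how STAR or any other production aligner is implemented; all names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int):
    """Map every k-mer in the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read: str, reference: str, index, k: int,
                    max_mismatches: int = 2):
    """Return (reference_start, mismatches) of the best ungapped hit, or None."""
    best = None
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            # Extension step: compare the full read against the reference window.
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best
```

A read with a sequencing error can still be placed because any error-free k-mer in the read can act as a seed, which is one reason why this family of methods is regarded as sensitive.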

Seed-and-extend methods are generally considered more sensitive but slower than exon-first methods. However, substantial effort has been expended in recent years, and excellent software programs using seed-and-extend or hybrid methods have been developed that are sufficiently fast. In a typical expression analysis of microorganisms using RNA-Seq, the computational processing time required for mapping reads to the reference genome sequence is no longer a major problem.

For transcript quantification, software such as Kallisto [16] and Salmon [17], which uses newer algorithms that do not require pre-mapping of reads to a reference sequence, has become increasingly fast. This type of software makes very large-scale expression analysis with RNA-Seq feasible.

Widely distributed strains, such as S. The mutation frequency can be decreased by careful handling, such as decreasing the number of inoculation steps and avoiding stressful conditions. However, the introduction of mutations cannot be completely prevented because spontaneous mutation is a natural characteristic of all organisms.

The basic procedure for resolving this problem is to sequence the genome of the strain on which RNA-Seq is performed. However, because the sequencing strategy for genome sequencing, including sample preparation, differs from that used for RNA-Seq, and because of cost and time constraints, RNA-Seq data sometimes have to be analyzed using the reference sequence deposited in a public database. To overcome this problem without losing reliability, we have addressed the correction of the reference sequence using RNA-Seq reads by two methods: (1) RNA-Seq reads are mapped to the reference sequence using the spliced mapper mentioned in the previous section, and the reference sequence is corrected using the consensus of the mapped reads; (2) the reference sequence is corrected using transcripts output by a transcriptome assembler.

The former method was almost completely automatable and worked well for small variations, such as single-base substitutions. With the latter method, it was necessary to process a number of isoform candidates output by the transcriptome assembler for the same loci of the reference genome, which required time and effort to tune the various parameters and threshold values. Unless the genome has undergone a complicated structural change relative to the reference sequence, the former method is sufficient.
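The consensus-based correction of single-base substitutions can be illustrated with a short Python sketch. It assumes that the mapped reads have already been summarized as per-position base counts (a pileup). The function name, the pileup structure, and the depth and fraction thresholds are hypothetical choices for this example; the sketch does not reproduce GenRecon or VarScan.

```python
from collections import Counter
from typing import Dict


def correct_reference(reference: str,
                      pileup: Dict[int, Counter],
                      min_depth: int = 10,
                      min_fraction: float = 0.8) -> str:
    """Replace reference bases with the read consensus where support is strong.

    pileup maps a 0-based reference position to a Counter of the bases
    observed at that position in the mapped reads.
    """
    corrected = list(reference)
    for pos, counts in pileup.items():
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust the consensus
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            corrected[pos] = base
    return "".join(corrected)
```

Only substitutions are handled here. Insertions, deletions, and larger structural changes would shift coordinates and need a different treatment, which matches the observation that the consensus approach suits small variations.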

After correcting the reference sequence, the reads were mapped again to the corrected reference sequence. This strategy worked fairly well.

Typical examples of the gene modeling problem are found in the analysis of filamentous fungi. Industrially important fungi are often isolated because they produce useful secondary metabolites. Because their genomes are generally unknown, sequencing and subsequent gene modeling are indispensable but are performed by a limited number of researchers with limited knowledge. In such cases, RNA-Seq reads can be used to correct gene models prior to expression analysis to obtain accurate expression levels. Several researchers have attempted to improve the accuracy of predicting protein-coding genes, and these attempts have included the use of RNA-Seq.

After RNA-Seq reads are mapped to the genome, the spliced reads provide valuable information for gene finding. In recent years, gene prediction software that uses RNA-Seq both for model training and for gene prediction with the trained model has been developed and has demonstrated high accuracy in gene structure prediction [19, 20]. The training of conventional gene finders depends on gene models from the genomes of species other than the target species.

However, the gene models already deposited in public databases have not always been experimentally confirmed and are often themselves predictions based on other genomes. Thus, using the results of RNA-Seq read mapping, which provide direct information on the CDSs of the target species, in combination with recent gene-finding algorithms enables significant improvement in gene modeling.

In this pipeline, exon-intron boundary information is predicted from the mapped RNA-Seq reads, and coding sequence candidates are obtained by homology searches between the genome sequence and protein sequence databases, such as the Swiss-Prot database. This pipeline worked well for gene prediction in non-model organisms and has been used for the genome analysis of various filamentous fungi.
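One common way to obtain exon-intron boundary candidates from spliced alignments is to read them from the CIGAR strings of SAM records, where the `N` operation marks a region skipped on the reference. The Python sketch below is an illustration under that assumption, not the method of the pipeline described here. The function names are hypothetical, and the input is assumed to be already parsed into (chromosome, POS, CIGAR) tuples.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance the reference position


def introns_from_alignment(pos: int, cigar: str):
    """Return (start, end) intron intervals, 1-based inclusive, for one read.

    pos is the 1-based leftmost mapping position (the SAM POS field).
    """
    introns = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":  # skipped region on the reference: a candidate intron
            introns.append((ref_pos, ref_pos + length - 1))
        if op in REF_CONSUMING:
            ref_pos += length
    return introns


def count_junctions(alignments):
    """Count the reads supporting each (chrom, intron_start, intron_end)."""
    support = Counter()
    for chrom, pos, cigar in alignments:
        for start, end in introns_from_alignment(pos, cigar):
            support[(chrom, start, end)] += 1
    return support
```

Junctions supported by many reads are stronger candidates for true splice sites, so a support threshold is a natural filter before the boundaries are passed to a gene finder.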

The improvements in the predicted gene structures are thought to contribute to more accurate RNA-Seq expression quantification when the structures are used as transcript references.

Because the degradation will not be complete, ribosomal RNA sequences have to be removed after sequencing by searching the reads for the consensus sequence. Another problem is that bacterial genes sometimes overlap on the genome and might be transcribed even in different orientations. Strand-specific RNA-Seq helps to solve this problem because it provides useful information for gene modeling.

However, because bacterial mRNA does not have poly-A tails, as described above, preparing a strand-specific library is more difficult for bacterial mRNA than for eukaryotic mRNA. A strand-specific library for bacteria can be prepared by two basic methods [21]: (i) adapter ligation to the first strand synthesized in the cDNA preparation [22] and (ii) chemical modification of the RNA or of the second strand of the cDNA [23–25].

Expression analysis with RNA-Seq typically begins by counting the number of reads mapped to reference transcript sequences. Once the various mapping problems mentioned above are resolved, reads can be mapped to a genome with accurately predicted gene structures or to transcript sequences assembled by transcriptome assembly software.
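The counting step can be sketched as follows. This Python example assigns each aligned read to a gene only when it overlaps exactly one gene and tallies the remaining reads separately. The rule for ambiguous reads, the data layout, and the names are assumptions made for illustration. The quadratic loop is acceptable for a small microbial gene set, whereas real counting tools use interval indexes.

```python
from collections import Counter


def count_reads_per_gene(genes, reads):
    """Count reads that overlap exactly one gene.

    genes: dict gene_id -> (chrom, start, end), 1-based inclusive.
    reads: iterable of (chrom, start, end) aligned intervals.
    Reads overlapping no gene or several genes are tallied separately.
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end in reads:
        hits = [g for g, (gc, gs, ge) in genes.items()
                if gc == chrom and start <= ge and end >= gs]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts
```

Keeping ambiguous reads in their own category makes the effect of overlapping genes visible. This matters for compact genomes, where neighboring genes can overlap.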

Microarrays are widely used to quantify the abundance of the mRNAs corresponding to genes. In microarray experiments, the gene expression level is measured as a continuous value, the intensity. RNA-Seq differs from microarrays in that it addresses nonnegative discrete values, i.e., read counts. Analytical methods for microarray data that assume a Gaussian distribution, such as linear discriminant analysis, might not perform as well for RNA-Seq data with a discrete distribution.

Let us consider the problem of quantifying gene expression levels using discrete RNA-Seq data and a related problem, namely, the identification of differentially expressed genes (DEGs) between conditions. Because sequenced fragments are sampled along the whole length of a transcript, the total number of observed reads for a transcript is proportional to the number of expressed mRNAs for the transcript multiplied by the length of the transcript. To compensate for this bias, it is common practice to divide the number of mapped reads by the transcript length.
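The length correction can be written out as a small Python sketch. The first function divides read counts by the transcript length in kilobases. The second rescales those values so that they sum to one million (transcripts per million, TPM), a widely used unit that is introduced here as an assumption and is not discussed in this chapter. The function names are illustrative.

```python
def reads_per_kilobase(counts, lengths):
    """Length-normalized counts: reads divided by transcript length in kb."""
    return {t: counts[t] / (lengths[t] / 1000.0) for t in counts}


def tpm(counts, lengths):
    """Transcripts per million: length-normalized counts rescaled to sum to 1e6."""
    rpk = reads_per_kilobase(counts, lengths)
    total = sum(rpk.values())
    return {t: 1e6 * v / total for t, v in rpk.items()}
```

For example, a 1 kb transcript with 100 reads and a 2 kb transcript with 200 reads receive the same length-normalized value, reflecting an equal number of expressed mRNAs under the proportionality stated above.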

Unfortunately, this correction is not sufficient to test whether gene expression differs between conditions. Oshlack and Wakefield showed that the power of a t-test on the count data, regardless of whether the counts are divided by the transcript length, is proportional to the square root of the transcript length [26]. Therefore, for a given expression level, the test becomes more significant for longer transcripts.

Many methods have been developed for assessing differential expression from RNA-Seq data. Count data, such as the counts of mapped fragments in RNA-Seq data, are often modeled with a Poisson distribution.
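The square-root dependence on transcript length can be seen in a short derivation under a Poisson model of the counts. The notation is an assumption of this sketch rather than a quotation from [26]: $X_i$ is the read count of one transcript in condition $i$, $N_i$ the total number of reads, $\mu_i$ the expression level, $L$ the transcript length, and $c$ a proportionality constant. The two counts are assumed to be independent.

```latex
\begin{align}
X_i &\sim \mathrm{Poisson}(c\,N_i\,L\,\mu_i), \qquad i = 1, 2,\\
D &= X_1 - X_2,\\
\mathrm{E}[D] &= c\,L\,(N_1\mu_1 - N_2\mu_2),\\
\mathrm{Var}[D] &= c\,L\,(N_1\mu_1 + N_2\mu_2),\\
\frac{\mathrm{E}[D]}{\sqrt{\mathrm{Var}[D]}}
  &= \sqrt{c\,L}\;\frac{N_1\mu_1 - N_2\mu_2}{\sqrt{N_1\mu_1 + N_2\mu_2}}
  \;\propto\; \sqrt{L}.
\end{align}
```

Dividing $D$ by $L$ scales the expectation and the standard deviation by the same factor $1/L$, so the ratio, and with it the length dependence of the test, is unchanged.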

The Poisson distribution has equal mean and variance, and DEGs can be identified by conducting a likelihood ratio test between conditions. However, real RNA-Seq data often exhibit overdispersion: the variance of the counts is often larger than the mean owing to various biases and errors as well as length bias. A negative binomial distribution is widely used to model such cases.
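Both ideas can be illustrated with a standard-library Python sketch. The first function is a likelihood ratio test for equal Poisson rates of one gene in two conditions, with library sizes as exposures. The second is a method-of-moments estimate of the dispersion $\alpha$ in the negative binomial variance $\mu + \alpha\mu^2$. The function names and the single-sample-per-condition setup are simplifying assumptions. Dedicated packages estimate dispersion far more carefully, for example by sharing information across genes.

```python
import math
from statistics import mean, variance


def poisson_lrt(k1, n1, k2, n2):
    """Likelihood ratio test for equal Poisson rates in two conditions.

    k1, k2: read counts for one gene; n1, n2: library sizes (exposures).
    Returns (statistic, p_value) using a chi-square distribution with 1 df.
    """
    pooled = (k1 + k2) / (n1 + n2)  # rate estimate under the null hypothesis

    def term(k, n):
        # A zero count contributes nothing to the log-likelihood ratio.
        return k * math.log((k / n) / pooled) if k > 0 else 0.0

    stat = max(2.0 * (term(k1, n1) + term(k2, n2)), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square(1) survival function
    return stat, p_value


def moment_dispersion(counts):
    """Method-of-moments estimate of alpha in Var = mu + alpha * mu**2."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance across replicates
    return max(0.0, (var - mu) / mu ** 2)
```

A dispersion estimate near zero is consistent with the Poisson assumption. A clearly positive value indicates overdispersion, in which case a Poisson test would tend to overstate significance.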

Several RNA-Seq data analysis software packages incorporating these models have been developed. Soneson and Delorenzi evaluated eleven software packages that implement various methods of modeling count data for differential expression analysis of RNA-Seq data [27]. When designing experiments to analyze differential expression using RNA-Seq, it is necessary to carefully consider the method used for DEG extraction and the number of biological replicates that are needed.

Three replicates often give results that are reproducible in successive independent experiments in terms of the assignment of the gene(s) with the expression of interest, whereas a single experiment often fails to yield reproducible results. A comparison of the transcriptomes between conditions often reveals a large number of DEGs. Therefore, a common approach is to outline the changes in the expression profile by extracting features common to the genes whose expression has changed.

Gene set enrichment analysis (GSEA) is a popular method for condensing information from gene expression profiles into a summary of pathways or functional groups. However, most RNA-Seq data obtained so far have only a small number of replicates, which forces the use of the gene-permuting GSEA method (or preranked GSEA) and results in a large number of false positives owing to the inter-gene correlation within each gene set.
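To make the gene-permuting approach concrete, the following Python sketch computes an unweighted running-sum enrichment score for a gene set in a ranked gene list and an empirical p-value from random gene sets of the same size. It is a simplified illustration: the standard GSEA statistic weights hits by their ranking metric, and the names, the unweighted score, and the permutation count are assumptions of this example.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (Kolmogorov-Smirnov-like)."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


def gene_permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(g in gene_set for g in ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (n_perm + 1)
```

The null distribution here is built from randomly drawn genes, which are treated as independent. Genes in a real pathway tend to be co-expressed, so the observed score can exceed this null more often than it should, which is the source of the false positives noted above.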

Yoon et al. As shown recently, RNA-Seq also enables the detection of alternative splicing in various fungi and higher organisms, such as mammals and plants. Both tools can detect transcript isoforms using a graph-based method applied to the mapping information generated by TopHat. These tools are widely used for the analysis of higher organisms, such as mammals and plants, but not fungi. Splicing variants have been found by deep RNA-Seq in various fungi, including Aspergillus oryzae [32], Magnaporthe grisea [33], Cryptococcus neoformans [34], and Trichoderma longibrachiatum [35], although their frequency is significantly lower than that found in higher organisms.