The ability to perform small RNA sequencing on extracellular RNA samples allows us to measure, at least in a semi-quantitative manner, the abundance of miRNAs, piRNAs, fragments of tRNAs, long non-coding RNAs, and coding RNAs. One of the most significant advantages of RNA-Seq technology is that it can detect and measure any RNA that is present, whether or not it is a known sequence. Nonetheless, there are important unanswered questions about the accuracy of RNA-Seq and the optimal approach for processing the data obtained. All RNA-seq experiments are subject to sources of systematic variation such as library size, transcript length, and G-C content (Dillies et al., 2013). Small RNA-seq experiments are further impacted by the highly non-normal distribution of expression of different small RNAs, particularly miRNAs. Often a few miRNAs account for a very large fraction of total reads while the vast majority of miRNAs each contribute a small percentage of reads. Moreover, sample input amounts of extracellular RNA are often extremely limited, increasing the potential for both sampling error and experimental bias.
To address these issues, a number of normalization methods have been developed, which can be assigned to three basic categories: 1) scaling; 2) normalizing in order to achieve similar data distributions; and 3) Reads Per Kilobase per Million mapped reads (RPKM). There appears to be a lack of consensus regarding the optimal normalization method, but it is known that different methods can produce different results in downstream analysis, particularly differential expression (DE) analysis.
The two studies highlighted here, Garmire and Subramaniam (2012) and Dillies et al. (2013), do not agree on the best way to normalize small RNA-seq data, but each analyzes a number of methods and elucidates some of the key problems. We review these studies briefly and point out the importance of rigorous comparisons of normalization methods for future differential comparisons of data sets.
The first class of methods, scaling methods, involves application of a standard linear mathematical operation to each sample. Scaling generally means changing the size of something, and normalizing by simply dividing by the total read number (going from read numbers to read fractions) is probably the simplest kind of scaling. Scaling approaches to normalization include global scaling, Lowess, and the Trimmed Mean of M-values (TMM) method (Robinson and Oshlack, 2010). These approaches each use a different method to calculate a scaling factor. Global scaling uses a factor that is based on the difference in the means of the data sets to be compared. Lowess normalizes based on a locally weighted regression model. TMM determines a scaling factor, which is the weighted trimmed mean of log expression ratios. This factor is calculated after double trimming values at the two extremes based on log-intensity ratios (M-values) and log-intensity averages (A-values). According to Garmire and Subramaniam, variability in estimated RNA content may be even more pronounced for microRNA-seq datasets after application of TMM.
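The double trimming described above for TMM can be illustrated with a small sketch. This is a simplified, unweighted version written for illustration only: the published method also applies precision weights and has its own rules for choosing a reference sample, neither of which is reproduced here. The function name, the default trim fractions (30% on M-values, 5% on A-values), and the toy counts are assumptions.

```python
import math

def simple_tmm_factor(sample, reference, m_trim=0.30, a_trim=0.05):
    """Simplified, UNWEIGHTED trimmed mean of M-values (illustrative only).

    sample, reference: per-gene read counts in the same gene order.
    Returns a factor that multiplies the sample's library size.
    """
    n_s, n_r = sum(sample), sum(reference)
    m_values, a_values = [], []
    for y_s, y_r in zip(sample, reference):
        if y_s > 0 and y_r > 0:  # logs are undefined for zero counts
            p_s, p_r = y_s / n_s, y_r / n_r
            m_values.append(math.log2(p_s / p_r))        # log ratio (M)
            a_values.append(0.5 * math.log2(p_s * p_r))  # log average (A)
    n = len(m_values)
    # Rank genes by M and by A, then drop the extremes of each ranking.
    m_rank = sorted(range(n), key=lambda i: m_values[i])
    a_rank = sorted(range(n), key=lambda i: a_values[i])
    m_cut, a_cut = int(n * m_trim), int(n * a_trim)
    keep = set(m_rank[m_cut:n - m_cut]) & set(a_rank[a_cut:n - a_cut])
    if not keep:
        raise ValueError("no genes left after trimming")
    mean_m = sum(m_values[i] for i in keep) / len(keep)
    return 2 ** mean_m

# Toy data (hypothetical): one highly abundant miRNA dominates the sample.
reference = [100] * 10
sample = [100] * 9 + [10000]
factor = simple_tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor
```

In this toy case the dominant miRNA is trimmed away, so the effective library size of the sample comes out close to that of the reference, whereas plain total-count scaling would treat the sample as roughly eleven times deeper.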
Another general approach to normalization is to preserve aspects of the distribution of the data among different data sets. As with scaling, there are a number of different approaches to achieving matched data distributions, including quantile, variance stabilization (VSN), the invariant method (INV), and DESeq. Quantile normalization has been extensively used for microarray data, with the goal of making the distribution of expression levels across samples similar. Conditional quantile normalization is a modification of quantile normalization that combines robust generalized regression and quantile normalization (Hansen, Irizarry, and Wu, 2012). The goal of the VSN method is to make the distribution of variance across different levels of expression similar. In INV normalization, a set of invariant miRNAs is selected and then used with one of the other methods (such as Lowess or VSN). In DESeq, a scaling factor for each sample in the dataset is obtained by computing the median of the ratios of each gene in one sample over the geometric mean of that gene across all samples. The same scaling factor is then applied to the read counts for all of the genes in that sample. RPKM is a method that has been widely used for long RNA-seq datasets. RPKM is performed on each sample separately, and consists of taking the ratio of the number of counts for a given gene to the product of the total counts for all genes and the mature transcript length for that gene (scaled by a constant factor of 10^9). Methods such as DESeq and TMM rely on the assumption that most genes are not differentially expressed (Dillies et al., 2013), which may not hold true for all microRNA-seq data sets.
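The DESeq scaling factor and the RPKM ratio described above can be written out in a few lines. The following is a minimal sketch using only the Python standard library, not the DESeq package itself. The function names and toy counts are assumptions, and genes with a zero count in any sample are skipped here because their geometric mean is zero.

```python
import math
from statistics import median

def median_of_ratios_factors(counts):
    """counts: list of samples, each a list of per-gene counts (same order).

    Returns one scaling factor per sample: the median, over genes, of the
    ratio of the gene's count to its geometric mean across all samples.
    """
    n_genes = len(counts[0])
    log_geo_means = []
    for g in range(n_genes):
        values = [sample[g] for sample in counts]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)  # gene excluded (assumption, see above)
    factors = []
    for sample in counts:
        log_ratios = [math.log(sample[g]) - log_geo_means[g]
                      for g in range(n_genes) if log_geo_means[g] is not None]
        factors.append(math.exp(median(log_ratios)))
    return factors

def rpkm(count, total_counts, length_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (total_counts * length_bp)

# Toy data (hypothetical): sample 2 is sequenced twice as deeply for three
# genes, while the fourth gene changes in the opposite direction.
toy_counts = [[10, 20, 30, 1000],
              [20, 40, 60, 500]]
factors = median_of_ratios_factors(toy_counts)
normalized = [[c / f for c in sample] for sample, f in zip(toy_counts, factors)]
example_rpkm = rpkm(count=50, total_counts=1_000_000, length_bp=2000)
```

Because the median ignores the single discordant gene, the two factors differ by a factor of two and the first three genes end up with equal normalized counts. This only works when most genes are unchanged, which is the assumption the paragraph above flags as questionable for some microRNA-seq data sets.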
Garmire and Subramaniam compared a number of normalization methods applied to mammalian microRNA-seq data using two publicly available datasets that were chosen by the authors due to the availability of matched PCR data. They assessed the performance of the normalization methods by calculating the mean square error (MSE) and the Kolmogorov-Smirnov (K-S) statistic (which is a measure of the difference between two distributions), by comparing the results with PCR data on the same samples, and by inspecting the results of differential expression. The authors show that Lowess, quantile and VSN normalization resulted in a smaller MSE, while TMM and INV produced a higher MSE. Similarly, TMM and INV resulted in a larger K-S statistic, indicating a bigger change in the distribution. Quantile and Lowess normalization also had the best concordance with qPCR data. The authors thus concluded that Lowess and quantile normalization performed better than the other methods studied. The primary limitations of this study are that it did not compare the results of the normalization methods to any “gold standard” for which the real distribution was known, and that the number of data sets used for the analysis was very small. In addition, other methods have been implemented since the publication of their study.
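For readers unfamiliar with the K-S statistic mentioned above, a generic two-sample version can be sketched as follows. It is the largest vertical gap between the empirical cumulative distribution functions of two sets of values. This is a textbook formulation, not the authors' exact procedure, which is not described here; the function name and the toy values are assumptions.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs
    of a and b, evaluated at every observed value.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / len(a_sorted)  # fraction of a <= x
        f_b = bisect_right(b_sorted, x) / len(b_sorted)  # fraction of b <= x
        d = max(d, abs(f_a - f_b))
    return d

# Toy values (hypothetical), e.g. log expression before and after a
# normalization step that shifts the whole distribution upward.
before = [1, 2, 3, 4]
after = [3, 4, 5, 6]
shift = ks_statistic(before, after)
```

A value of 0 means the two empirical distributions coincide and a value of 1 means they do not overlap at all, so a larger statistic corresponds to a bigger change in the distribution.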
Dillies et al. published another comparison of normalization methods for RNA-seq data in about the same time frame. Most of the datasets used were long RNA-seq data, but a murine microRNA dataset was included. The seven methods compared in this study were TC (total counts), UQ (upper quartile), Median, DESeq, TMM, quantile and RPKM. TC, UQ and Median are scaling approaches that are quite similar, involving the calculation of a scaling factor based on either the ratio of total counts (TC), the upper quartile of counts (UQ) or the median of counts. The authors assessed these normalization methods on real as well as simulated data. In their analysis, boxplots of counts before and after normalization show that the differences are most apparent in conditions where library size differs greatly between samples. These differences do not improve after TC and RPKM normalization of the microRNA-seq data. Other features of microRNA-seq data that influence the results of the normalization, according to the authors, are the presence of high-count genes and a large number of zero counts. Quantile normalization increases the intra-condition variability in the murine miRNA data. Based on the results of differential expression in this study, the differences in the final results depend on the normalization method and not on the model chosen to test for differential expression (DESeq or TSPM). In the simulated data, which contain high-count genes and may resemble microRNA-seq data more closely than the other datasets, only DESeq and TMM were successful in achieving a low false positive rate and high power. The authors conclude that DESeq and TMM are the methods of choice based on their ability to perform in the presence of different library sizes and composition.
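The three closely related scaling approaches (TC, UQ and Median) differ only in which per-sample summary they use. The sketch below illustrates that idea under stated assumptions rather than reproducing the authors' implementation: zero counts are excluded before taking the quartile and the median, factors are rescaled so that they average to one across samples, and the function names and toy counts are hypothetical.

```python
from statistics import mean, median, quantiles

def sample_summary(sample, method):
    """Per-sample summary used as the basis of the scaling factor."""
    nonzero = [c for c in sample if c > 0]  # assumption: ignore zero counts
    if method == "TC":
        return sum(sample)
    if method == "UQ":
        return quantiles(nonzero, n=4, method="inclusive")[2]  # 75th percentile
    if method == "Median":
        return median(nonzero)
    raise ValueError(f"unknown method: {method}")

def scaling_factors(samples, method):
    """One factor per sample, rescaled so the factors average to 1."""
    summaries = [sample_summary(s, method) for s in samples]
    average = mean(summaries)
    return [s / average for s in summaries]

def normalize(samples, method):
    """Divide every count in a sample by that sample's scaling factor."""
    factors = scaling_factors(samples, method)
    return [[c / f for c in s] for s, f in zip(samples, factors)]

# Toy data (hypothetical): the second library is exactly twice as deep.
toy_samples = [[0, 10, 20, 30, 40],
               [0, 20, 40, 60, 80]]
tc_normalized = normalize(toy_samples, "TC")
uq_factors = scaling_factors(toy_samples, "UQ")
```

On this idealized input all three methods give the same factors. They diverge on real small RNA data, where a few high-count miRNAs inflate the total count far more than they move the upper quartile or the median.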
The differences in the conclusions from these two studies are likely due to the different normalization methods and specific datasets each used, and to the metrics each used to evaluate the performance of the normalization methods. Since the Garmire and Subramaniam paper looks only at miRNA data and the Dillies et al. paper looks at a mix of data, including miRNA and mRNA data, it is difficult to compare the results. We suggest that the additional complexities of compiling mRNA levels from multiple read sequences may confuse the latter analysis. However, both studies agree that the use of different normalization approaches can result in significant differences in downstream differential expression results. One of the key weaknesses of both of these papers is the lack of data from “gold standard” datasets for which the quantities of the different RNAs were definitively known.
For this reason, it would be valuable to develop the tools necessary for rigorous comparison of the available normalization methods. Such tools may include the generation of standardized data sets, such as small RNA libraries constructed from purely synthetic miRNAs, for which the content can be completely controlled, or RNAs from biological specimens that contain synthetic miRNAs spiked in at controlled concentrations. Corresponding qPCR results, used with calibration curves, should be generated for such datasets, which will serve as an orthogonal measurement technology for developing and evaluating normalization methods. Overall, this important topic needs careful attention for the establishment of reference exRNA profiles, and for the realization of the full potential of the powerful technology of high throughput RNA-seq.
* * *
Note that one of us recently presented a web seminar on a related topic, Understanding and using small RNA-seq, that is available for viewing. Also, members of the ERCC will be jointly presenting a workshop on Data Normalization Challenges and Solutions as part of the CHI conference Extracellular RNA in Drug and Diagnostic Development in Cambridge, MA, 3-6 April 2016. See the Event page for more details.
Exosome Diagnostics, Inc. has announced the launch of ExoDx Lung(ALK), the first ever CLIA-validated exosome based blood test. This test detects EML4-ALK fusion transcripts in the plasma of lung cancer patients whose primary tumors carry this mutation. Although these patients make up a small minority of cases, identifying them is important because their tumors are particularly sensitive to ALK inhibitors. Current clinical tests are performed on biopsied tumor tissue. Major drawbacks include not only the risks associated with the invasive biopsy procedure but also low test performance for some of the commonly used methods (especially immunocytochemistry-based tests). In contrast, ExoDx Lung(ALK) requires only a standard blood draw and has 88% sensitivity and 100% specificity (as reported by the manufacturer). This is a major milestone in the application of exosome-based biomarkers to precision medicine, especially in the areas of companion diagnostics and targeted therapies.
The key discovery that led to this test was made eight years ago when Dr. Johan Skog, the Chief Scientific Officer at Exosome Diagnostics, demonstrated that a mutation present in the tumors of patients with glioblastoma could be detected in their blood (Skog et al., 2008). The publication reporting this finding has been cited over 1600 times, indicating its significance and potential for broad application. Now lung cancer patients can directly benefit from the first clinical test based on this discovery, in the form of a non-invasive test to determine the EML4-ALK mutation status of their tumors. We are sure that many more biofluid-based tests, targeting not only cancers but also other conditions where non-invasive testing is desired, will soon follow.
The first public release of the exRNA Atlas is now available via the ExRNA Atlas link in the Quick Links section of the exRNA Portal. The Atlas is produced by the NIH Common Fund’s Extracellular RNA Communication (ERC) Consortium and includes 519 small exRNA profiles from eight laboratories. Each profile in the exRNA Atlas acknowledges the contributing laboratory. The profiles were derived from about 6.4 billion reads uniformly processed using the exceRpt small RNA-seq pipeline. Faceted filtering and data navigation tools, hosted by GenboreeKB, are enabled by rich metadata standards developed by the consortium and metadata annotations contributed by the data producers. Uniform data quality metrics agreed upon by the consortium were applied to all datasets. On behalf of the Bioinformatics Research Lab at Baylor College of Medicine and the whole Data Management and Resource Repository (DMRR), I would like to thank the contributors and the consortium for the outstanding team effort required to reach this important milestone!
To balance open data access against the desire of data contributors to have a protected period of time to analyze and publish the data they have produced, the data access policy for datasets in the exRNA Atlas provides for a 12-month embargo period. The embargo period expires on 1 July, 2016 for the profiles in the current release. Researchers may analyze embargoed datasets from the Atlas but may not publish or make scientific presentations about them until the embargo period has ended. The Atlas will be updated regularly with new profiles, each new profile having its own 12-month embargo period per the data access policy. The read-level information for the profiles in the Atlas will be deposited in GEO (unrestricted access) or dbGaP (controlled access). The exRNA Atlas profiles will contain links to these archival records as the data are deposited.
The exRNA Atlas website is currently optimized for the Firefox browser. Extensive testing on other browsers is yet to be performed. Not too many problems are expected on other browsers, but if you do encounter a problem, consider using Firefox. Optimization for mobile devices is also yet to be completed.
Sai Subramanian of the DMRR highlighted the features of this new release of the exRNA Atlas on an ERCC webinar on 4 Feb, 2016 at 1pm ET. If you missed the live talk, it will be available soon afterwards at exRNA.org/About.
Where do we go from here? Of course, we are just at the beginning. By the end of 2016, the amount of data from the consortium’s reference profile projects will likely dwarf this first release. Our next focus here at the DMRR will be to “test drive” the data by performing a number of integrative analyses and to deploy analysis tools that may be applied both to the Atlas profiles and to profiles that are not yet public. Stay tuned and Happy New exRNA Year!