Next-generation sequencing (NGS) has emerged as an important high-throughput technology in biomedical research and translation because it can accurately capture genetic information. But choosing proper analysis methods for identifying biomarkers from high-throughput data remains a critical challenge for most users.
For instance, RNA-sequencing (RNA-seq) is an NGS technology that examines the presence and quantity of RNA in biological samples, and its output requires bioinformatics analysis to interpret. However, there are hundreds of bioinformatics tools, and the different data analysis pipelines built from them can produce different results for the same dataset. This can significantly hinder the ability to reliably reproduce RNA-seq related research and applications, especially for the regulatory approval process by the U.S. Food and Drug Administration (FDA).
To address this challenge, the FDA invited a team of researchers at the Georgia Institute of Technology to conduct a comprehensive investigation of RNA-seq data analysis pipelines for gene expression estimation and to recommend best practices.
“No common standard for selecting high throughput RNA-seq data analysis tools has been established yet. This has been a huge challenge for studying hundreds of tools that form tens of thousands of analysis pipelines,” noted May Dongmei Wang, a professor in the Wallace H. Coulter Department of Biomedical Engineering at Georgia Tech and Emory University who led the investigation.
Wang and her colleagues presented their results in the journal Scientific Reports. In their study, the researchers developed three metrics – accuracy, precision, and reliability – and systematically evaluated 278 representative RNA-seq pipelines.
“We demonstrate that those RNA-seq pipelines performing well in gene expression estimation will lead to the improved downstream prediction of disease outcome. This is an important discovery,” said Wang, the corresponding author of the paper, “Impact of RNA-seq Data Analysis Algorithms on Gene Expression Estimation and Downstream Prediction.”
She added, “Because the FDA is a regulatory agency for approving novel medical devices for NGS-genomics to be utilized in daily clinical practices for personalized and precision medicine and health, it is critical to see whether gene expression generated from RNA-seq acquisition and analysis pipeline are reproducible and reliable.”
The team’s comprehensive investigation revealed that the high throughput RNA-seq data quantification modules – mapping, quantification, and normalization – jointly impacted the accuracy, precision, and reliability of gene expression estimation, which in turn affected the downstream clinical outcome prediction (as shown in two cancer case studies of neuroblastoma and lung adenocarcinoma).
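To illustrate why the number of candidate pipelines grows so quickly, here is a minimal Python sketch that treats a pipeline as one choice from each of the three modules named above. The module option names are hypothetical placeholders, not the tools evaluated in the study, and the counts are for illustration only.

```python
from itertools import product

# Hypothetical module options (placeholder names, NOT the tools from the study).
MAPPERS = ["mapper_A", "mapper_B", "mapper_C"]
QUANTIFIERS = ["quantifier_A", "quantifier_B"]
NORMALIZERS = ["normalizer_A", "normalizer_B"]


def enumerate_pipelines(mappers, quantifiers, normalizers):
    """Return every (mapper, quantifier, normalizer) combination as a tuple."""
    return list(product(mappers, quantifiers, normalizers))


pipelines = enumerate_pipelines(MAPPERS, QUANTIFIERS, NORMALIZERS)

# Three mappers, two quantifiers, and two normalizers give 3 * 2 * 2 pipelines.
print(len(pipelines))
for mapper, quantifier, normalizer in pipelines[:3]:
    print(f"{mapper} -> {quantifier} -> {normalizer}")
```

Because the options multiply, adding even one alternative to a single module adds a whole new set of pipelines that would each need to be evaluated.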
“Clinicians and biomedical researchers can use our findings to select RNA-seq pipelines for their clinical practice or research,” Wang said. “And bioinformaticians can use these benchmark datasets, results, and metrics to develop and evaluate new RNA-seq tools and pipelines.”
But one size does not fit every need, as in any machine learning paradigm, Wang noted.
“The machine learning and algorithms are heavily dependent on goals,” she said. “Thus, based on our extensive experience in biomedical big data analytics and AI for almost two decades, we suggested that the FDA identify top goals for clinical genomics applications first. Based on different needs, different RNA-seq pipelines will be selected to achieve the optimal performance.”
Source: Georgia Tech