Moritz G. Smolka (1), Florian Breitwieser (2), Steven L. Salzberg (2, 3, 4), Arndt von Haeseler (1), Michael C. Schatz (3), Fritz J. Sedlazeck (3)
(1) Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, Vienna, Vienna, Austria; (2) Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University; (3) Department of Computer Science, Johns Hopkins University; (4) Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA
In recent years, over 100 read mappers have been published to analyze high-throughput sequencing data, each optimized for different assays or requirements. The large number of available mappers, and the even larger number of possible parameter settings, makes it challenging to choose the most appropriate mapper for a given experiment. Consequently, most users rely on the default, unoptimized parameters of one of a few popular methods, even when this choice performs very poorly compared to an optimized approach. This may introduce substantial biases in subsequent analyses, including reduced coverage, false determination of allele-specific expression, misidentification of infectious agents, or other artifacts.
We previously reported Teaser, a benchmarking tool for DNA-seq mappers that has since been used by a number of large studies. Here we extend Teaser to benchmark the mapping of bisulfite, RNA-, and metagenomic sequencing data. Teaser can be applied to any number of mapping methods and can even explore their parameter settings automatically. The benchmarks are highly customizable, so that read length, SNP rate, and other key parameters can be adapted to the experiment at hand. Once a benchmark is launched, Teaser generates a detailed assay-specific report for each mapper configuration, often in less than 20 minutes even for mammalian-sized genomes. This empowers researchers to make an informed decision on the most suitable method for their needs and allows them to fully utilize their data set.
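To make the idea of benchmarking a mapper against simulated reads concrete, the following is a minimal sketch of one common scoring scheme: a read counts as correctly mapped if it is placed on the right chromosome within a position tolerance of its simulated origin. This is an illustration only, not Teaser's implementation; the `Alignment` class, the `score_mapper` function, and the default tolerance of 50 bp are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record of where a read was simulated from or mapped to."""
    chrom: str
    pos: int            # leftmost position of the read
    mapped: bool = True


def score_mapper(truth, reported, tolerance=50):
    """Compare reported alignments with the simulated origin of each read.

    truth:    dict read_id -> Alignment (known origin from the simulation)
    reported: dict read_id -> Alignment (output of one mapper configuration)
    tolerance: maximum distance in bp still counted as correct (assumed value)
    """
    correct = wrong = unmapped = 0
    for read_id, origin in truth.items():
        aln = reported.get(read_id)
        if aln is None or not aln.mapped:
            unmapped += 1
        elif aln.chrom == origin.chrom and abs(aln.pos - origin.pos) <= tolerance:
            correct += 1
        else:
            wrong += 1

    total = len(truth)
    mapped = correct + wrong
    return {
        "correct": correct,
        "wrong": wrong,
        "unmapped": unmapped,
        # Fraction of mapped reads that were placed correctly.
        "precision": correct / mapped if mapped else 0.0,
        # Fraction of all simulated reads that were placed correctly.
        "recall": correct / total if total else 0.0,
    }


# Toy data: four simulated reads and the output of one hypothetical mapper run.
truth = {
    "r1": Alignment("chr1", 100),
    "r2": Alignment("chr1", 500),
    "r3": Alignment("chr2", 100),
    "r4": Alignment("chr2", 900),
}
reported = {
    "r1": Alignment("chr1", 110),                # within tolerance
    "r2": Alignment("chr2", 500),                # wrong chromosome
    "r3": Alignment("chr2", 0, mapped=False),    # reported as unmapped
    # r4 is missing from the mapper output entirely
}
result = score_mapper(truth, reported)
```

Running the same scoring over several mappers or parameter settings on identical simulated reads yields directly comparable precision and recall values, which is the kind of comparison a per-configuration report supports.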
Using Teaser, we investigated how well RNA-Seq mappers (e.g. HISAT, STAR) and quantification methods (Kallisto, Sailfish) perform on a variety of genomes and analysis tasks. Here, Teaser provides insights into the accuracy of read alignments spanning multiple exons, which is what enables isoform-level quantification and the detection of novel isoforms. Furthermore, we used Teaser to investigate the ability of metagenomics methods (e.g. Kraken, CLARK) to make correct predictions at different taxonomic levels (e.g. genus or species) given different read lengths and sequencing error rates, including for strains not present in the reference databases.
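Evaluating classifiers at different taxonomic levels can be sketched as follows: each predicted taxon is projected onto the rank of interest and compared with the true taxon at that rank, so a genus-level call can be correct even when the exact strain is absent from the database. This is a minimal illustration under stated assumptions, not how Teaser, Kraken, or CLARK compute their results; the `rank_accuracy` function and the toy `lineage` table are hypothetical.

```python
def rank_accuracy(truth, predicted, lineage, rank):
    """Score read classifications at a single taxonomic rank.

    truth:     dict read_id -> true taxon name
    predicted: dict read_id -> predicted taxon name (reads may be missing)
    lineage:   dict taxon name -> {rank: name at that rank}
    rank:      rank to evaluate, e.g. "genus" or "species"
    """
    hits = 0
    classified = 0
    for read_id, true_taxon in truth.items():
        pred = predicted.get(read_id)
        if pred is None:
            continue  # read was left unclassified
        pred_at_rank = lineage.get(pred, {}).get(rank)
        if pred_at_rank is None:
            continue  # prediction is too coarse to be judged at this rank
        classified += 1
        if pred_at_rank == lineage.get(true_taxon, {}).get(rank):
            hits += 1

    total = len(truth)
    return {
        "sensitivity": hits / total if total else 0.0,
        "precision": hits / classified if classified else 0.0,
    }


# Toy lineage table (illustrative only).
lineage = {
    "Escherichia coli K-12": {"species": "Escherichia coli", "genus": "Escherichia"},
    "Escherichia coli": {"species": "Escherichia coli", "genus": "Escherichia"},
    "Escherichia": {"genus": "Escherichia"},
    "Shigella flexneri": {"species": "Shigella flexneri", "genus": "Shigella"},
}
truth = {
    "r1": "Escherichia coli K-12",
    "r2": "Escherichia coli K-12",
    "r3": "Shigella flexneri",
}
predicted = {
    "r1": "Escherichia coli",   # correct at species and genus level
    "r2": "Escherichia",        # correct at genus level, no species call
    "r3": "Escherichia coli",   # wrong at both levels
}
genus = rank_accuracy(truth, predicted, lineage, "genus")
species = rank_accuracy(truth, predicted, lineage, "species")
```

In this toy case the strain itself is never predicted, yet two of three reads are still correct at the genus level, which illustrates why results are reported per rank.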
Teaser is available as a webserver (teaser.cibiv.univie.ac.at) or as a standalone package (github.com/Cibiv/Teaser).