Jacob Pritt, Computer Science
RNA sequencing (RNA-Seq) is an increasingly important tool for studying genome structure and gene expression. RNA-Seq reads must first be aligned by a tool such as TopHat, after which downstream tools such as Cufflinks, DESeq, and Derfinder can explore gene structure and expression. With the increasing availability of RNA-Seq datasets covering hundreds or thousands of samples, there is a crucial need for methods that store alignment results in a way that is both compact and easy to query.
We present Boiler, a tool for compressing a set of aligned reads in SAM/BAM format. Boiler partitions reads based on the introns they span and combines each partition into coverage vectors, compressing a BAM file to less than one-fifth of its original size. Reads can be reconstructed from the coverage vectors without significantly compromising accuracy. We tested the downstream effects of compression on Cufflinks transcript assembly and found that the assembled transcripts were almost identical before and after compression.
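As a rough illustration of the partition-and-combine idea, the sketch below groups reads by the introns they span and builds one run-length-encoded coverage vector per group. It is not Boiler's implementation: the read representation (a list of half-open aligned blocks) and the function names `introns`, `coverage_by_introns`, and `run_length_encode` are assumptions made for this example.

```python
from collections import defaultdict

def introns(blocks):
    """Return the introns (gaps between consecutive aligned blocks) of one read."""
    return tuple((a_end, b_start)
                 for (_, a_end), (b_start, _) in zip(blocks, blocks[1:]))

def coverage_by_introns(reads):
    """Group reads by the introns they span and sum per-base coverage per group.

    Each read is a list of half-open (start, end) aligned blocks (hypothetical format).
    """
    groups = defaultdict(lambda: defaultdict(int))
    for blocks in reads:
        cov = groups[introns(blocks)]
        for start, end in blocks:
            for pos in range(start, end):
                cov[pos] += 1
    return groups

def run_length_encode(cov):
    """Collapse a {position: depth} map into (start, length, depth) runs."""
    runs = []
    for pos in sorted(cov):
        depth = cov[pos]
        if runs and runs[-1][0] + runs[-1][1] == pos and runs[-1][2] == depth:
            start, length, _ = runs[-1]
            runs[-1] = (start, length + 1, depth)  # extend the current run
        else:
            runs.append((pos, 1, depth))           # start a new run
    return runs

# Two unspliced reads and one read spanning an intron from 3 to 10.
reads = [[(0, 4)], [(2, 6)], [(0, 3), (10, 12)]]
groups = coverage_by_introns(reads)
for key, cov in groups.items():
    print(key, run_length_encode(cov))
```

Storing runs rather than individual reads is what makes the representation compact; the cost is that individual read boundaries are no longer stored explicitly and have to be reconstructed.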
Boiler also supports common queries needed by tools like Cufflinks and DESeq, many of which are not naturally supported by the SAM/BAM format. These include queries for gene boundaries and for coverage levels and individual reads within a gene or interval. The compressed file format is designed to answer these queries quickly and accurately without the need to fully expand the compressed file.
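To show how an interval coverage query can be answered without expanding the data, here is a minimal sketch over a run-length-encoded coverage vector. The layout (sorted, non-overlapping `(start, length, depth)` runs) and the function `query_coverage` are assumptions for illustration and do not describe Boiler's actual file format or interface.

```python
from bisect import bisect_right

def query_coverage(runs, lo, hi):
    """Return the runs overlapping the half-open interval [lo, hi), clipped to it.

    `runs` is a sorted list of non-overlapping (start, length, depth) tuples.
    """
    # A real index would precompute the start positions once, not per query.
    starts = [start for start, _, _ in runs]
    # Begin at the last run that starts at or before `lo`, if there is one.
    i = max(bisect_right(starts, lo) - 1, 0)
    hits = []
    while i < len(runs) and runs[i][0] < hi:
        start, length, depth = runs[i]
        s, e = max(start, lo), min(start + length, hi)
        if s < e:
            hits.append((s, e - s, depth))
        i += 1
    return hits

runs = [(0, 2, 1), (2, 2, 2), (4, 2, 1), (10, 2, 1)]
print(query_coverage(runs, 1, 5))
```

Only the runs that overlap the requested interval are touched, so the cost of a query depends on the size of the interval rather than on the size of the whole dataset.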