With access to high-throughput technologies, researchers struggle to store their raw data. Many just give up.
How do you handle terabytes of data? It is a question that more and more investigators must face on a weekly basis.
Consider, for example, light-sheet fluorescence imaging. This microscopy technique captures high-density images of living samples at the impressive pace of 5 frames per second. After 36 hours, you get beautifully rich data—2 or 3 terabytes of it. The few expert users worldwide save the raw images onto hard disks. It is costly and impractical for sharing, but they accept this as a price to pay for this cutting-edge technology. From the beginning, they knew what difficulty they were getting into.
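As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope estimate only: it assumes continuous acquisition at 5 frames per second for the full 36 hours and decimal units (1 terabyte = 10^12 bytes), and the function names are purely illustrative.

```python
def frames_acquired(fps, hours):
    """Number of frames captured at a constant frame rate (assumes no pauses)."""
    return int(fps * hours * 3600)


def implied_frame_size_mb(total_tb, n_frames):
    """Average size of one frame in megabytes, using decimal units."""
    return total_tb * 1e12 / n_frames / 1e6


n_frames = frames_acquired(fps=5, hours=36)
print(f"frames acquired: {n_frames}")

# The text quotes a total of 2 or 3 terabytes for the run.
for total_tb in (2, 3):
    size_mb = implied_frame_size_mb(total_tb, n_frames)
    print(f"{total_tb} TB total -> about {size_mb:.1f} MB per frame")
```

Under these assumptions the run produces 648,000 frames, which implies an average of roughly 3 to 5 megabytes per frame.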
This is not necessarily the case for the increasing number of researchers who launch next-generation sequencing applications, often ill-prepared for the challenges and hidden costs of becoming a massive data producer. 'Next-gen' sequencing machines are rapidly becoming a laboratory staple, and their applications have expanded well beyond decoding genomes, as illustrated, for example, by articles in this issue focusing on transcriptome analysis (p. 597, 613, 621). Accordingly, problems that once were the concern of genome sequencing centers have become the lot of individual researchers.
A typical sequencing run generates on the order of 100 gigabytes of raw data (in the form of fluorescence intensity values extracted from close to a terabyte of images). Investigators must then translate fluorescence values into base identity, a process dubbed 'base calling'. Handling such an amount of data—transferring it between computers, processing and storing it—is not trivial, and it is costly. It is a big issue for large sequencing centers, but they have the know-how and a solid infrastructure. Individual labs, in contrast, have had to find pragmatic solutions, such as walking hard drives between buildings as a faster alternative to transfer via their institution's network.
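To see why carrying a hard drive can beat the network, it helps to estimate how long 100 gigabytes takes to move at a given sustained throughput. The sketch below is illustrative only: the link speeds and the efficiency factor are hypothetical values chosen for the example, not measurements from any institution, and the function name is an assumption.

```python
def transfer_hours(size_gb, link_mbit_per_s, efficiency=1.0):
    """Hours needed to move size_gb gigabytes over a link.

    efficiency is the fraction of the nominal link speed actually
    sustained (a hypothetical parameter; 1.0 means the full nominal rate).
    """
    bits = size_gb * 1e9 * 8                       # decimal gigabytes to bits
    bits_per_second = link_mbit_per_s * 1e6 * efficiency
    return bits / bits_per_second / 3600


# Hypothetical scenarios: (nominal link speed in Mbit/s, sustained fraction).
for mbit, eff in [(100, 1.0), (100, 0.5), (1000, 0.5)]:
    hours = transfer_hours(100, mbit, eff)
    print(f"{mbit} Mbit/s at {eff:.0%} efficiency: {hours:.2f} h")
```

Even at the full nominal rate of a 100 Mbit/s link, a single run would occupy the network for more than two hours, and a shared link rarely sustains its nominal rate.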
With regard to storage, the field, including sequencing centers, has largely adopted a perhaps exceedingly pragmatic solution: analyze the sequencing run, save the sequence (and its 'quality scores'), save the sample and 'dump' the raw data.
Operating on the principle that it is cheaper to resequence an occasional sample than to systematically store the data may very well be the only workable practice for the time being. It is too early, however, to be comfortable with this solution. First, base-calling algorithms are not yet as good as one would wish, and reprocessing data later will likely add value to experiments. As a historical parallel, reprocessing 'old' Sanger capillary electrophoresis traces with 'new' algorithms proved extremely fruitful in assembling Craig Venter's genome, published in 2007. Second, the development of better data-analysis algorithms relies on the availability of large amounts of raw data, for testing new solutions and calibrating them against each other. Third, although it is improbable that genome sequencing will ever be tainted by fraud, it cannot be excluded that one of the increasingly diverse applications of sequencing will come under scrutiny. More generally, physical samples can become contaminated, misplaced or otherwise lost. In some cases at least, the investment already committed to an experiment calls for a solution to preserve the data.
The 1,000 Genomes project, leading by example, requests that its participants save sequencing raw data. For this they will take advantage of public archives that are being developed at the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute with the mechanism and storage capacity to accommodate unprocessed intensity value files. Remarkably, NCBI's Short Read Archive has already started accepting such data for one of the sequencing platforms—and receives individual submissions outside large sequencing projects.
With this option in place, the systematic 'analyze and dump' tactic is not justified. One can anticipate that small labs will face difficulties, as efficient submission will require specialized data-transfer protocols and a lot of 'information technology' involvement. But at the moment, the headache is worth enduring, for some experiments at least. It is likely that once data-analysis algorithms have improved and stabilized, the need for raw data will decrease, avoiding the saturation of databases.
Comparing imaging and sequencing, it is clear that there is no one-size-fits-all solution for the raw data challenge. The solutions depend on the number of users, the fraction of the experimental cost that data storage represents and the intrinsic value of raw data compared to processed data. Even within the confines of one technology, these parameters evolve with time and vary with specific applications. But settling early into the false comfort of saving only processed data is not recommended for any field.