Figure 1: Applying a MapReduce approach in the cloud to solve embarrassingly parallelizable problems. | Nature Reviews Genetics


From: Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology


In estimating the time needed to traverse a 1 petabyte (PB) data set, Trelles et al. mistakenly assume that every node needs to traverse the full 1 PB. The ideal MapReduce application (depicted in the upper panel) instead distributes 1 terabyte (TB) to each of the 1,000 nodes for concurrent processing (the 'map' step in MapReduce). Furthermore, although Trelles et al. cite a paper that they claim indicates a 15 MB/s link between storage and nodes (ref. 6), the bandwidth quoted appears to be for a single input/output stream only. As shown in the lower panel, best practice is to launch multiple 'mappers' per node to saturate the available network bandwidth (ref. 7), which has previously been benchmarked at ~50 MB/s (ref. 8). This is more than threefold higher than the 15 MB/s claimed and is consistent with the 90+ MB/s virtual machine (VM)-to-VM bandwidth reported (ref. 6). Each node can process its 1 TB at 50 MB/s at a cost of $0.34/h; therefore, the back-of-the-envelope calculations of Trelles et al. should be updated to state that 1,000 nodes could traverse 1 PB of data in ~350 minutes (not 750 days) at a cost of ~US$2,040 (not $6,000,000).
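
The back-of-the-envelope figures in the caption can be reproduced with a short Python sketch. The node count, per-node share, bandwidth and hourly price come from the caption. Two things are assumptions made here and are not stated in the source: that units are binary (1 TB = 1,024 × 1,024 MB) and that each node is billed for whole hours, rounded up. The function and variable names are illustrative only.

```python
import math

# Figures taken from the caption.
NODES = 1_000                # nodes working concurrently
TB_PER_NODE = 1              # 1 PB split evenly across 1,000 nodes
BANDWIDTH_MB_S = 50          # benchmarked per-node bandwidth, MB/s
PRICE_PER_NODE_HOUR = 0.34   # US$ per node per hour

# Assumption (not in the source): binary units.
MB_PER_TB = 1024 * 1024


def traversal_minutes(tb_per_node, bandwidth_mb_s):
    """Minutes for one node to read its share at the given bandwidth."""
    seconds = tb_per_node * MB_PER_TB / bandwidth_mb_s
    return seconds / 60


def total_cost(nodes, minutes, price_per_node_hour):
    """Total cost, assuming each node is billed per whole hour (rounded up)."""
    billed_hours = math.ceil(minutes / 60)
    return round(nodes * billed_hours * price_per_node_hour, 2)


# All nodes run in parallel, so wall-clock time equals one node's time.
minutes = traversal_minutes(TB_PER_NODE, BANDWIDTH_MB_S)
cost = total_cost(NODES, minutes, PRICE_PER_NODE_HOUR)

print(f"Wall-clock time: ~{minutes:.0f} minutes")
print(f"Total cost: ~US${cost:,.0f}")
```

Under these assumptions the sketch gives roughly 350 minutes and US$2,040, matching the caption. With decimal units (1 TB = 1,000,000 MB) the time drops to about 333 minutes, while the cost is unchanged because it still rounds up to 6 billed hours per node.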
