On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data: A Case Study in Distributed Structural Biology using MapReduce

B. Zhang, T. Estrada, P. Cicotti, and M. Taufer, in the Proceedings of the 16th IEEE International Conference on Computational Science and Engineering (CSE), Sydney, Australia (2013).

In this paper, we present two variations of a general analysis algorithm for large datasets residing in distributed memory systems. Both variations avoid moving data among nodes: they extract relevant data properties locally and concurrently, and they transform the analysis problem (e.g., clustering or classification) into a search for property aggregates. We test the two variations using the Gordon supercomputer at SDSC, the MapReduce-MPI library, and a structural biology dataset of 100 million protein-ligand records. We evaluate both variations for their sensitivity to data distribution and load imbalance. Our observations indicate that the first variation is sensitive to data content and distribution, while the second is not. Moreover, the second variation can self-heal load imbalance, and it outperforms the first in all fifteen cases considered.
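
The general idea of extracting properties locally and then searching over their aggregates can be illustrated with a small sketch. This is not the paper's algorithm and it does not use the MapReduce-MPI API. It is a plain-Python toy that assumes each record is a 3D point and that the extracted "property" is a coarse grid cell; the function names (`extract_property`, `local_map`, `aggregate`, `densest`) and the `cell` parameter are hypothetical. Each node reduces its own records to per-cell counts, and only those small counts are combined.

```python
from collections import Counter

def extract_property(record, cell=1.0):
    """Map one record to a compact property key.
    Assumption: the property is the coarse grid cell containing a 3D point."""
    x, y, z = record
    return (int(x // cell), int(y // cell), int(z // cell))

def local_map(records, cell=1.0):
    """Runs independently on one node: summarize local records as per-key counts."""
    return Counter(extract_property(r, cell) for r in records)

def aggregate(partials):
    """Combine the per-node summaries; the raw records never leave their node."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def densest(total):
    """Search the aggregates, here for the most populated property key."""
    return max(total.items(), key=lambda kv: kv[1])

# Two hypothetical nodes, each holding its own records.
node_a = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6), (2.5, 2.5, 2.5)]
node_b = [(0.9, 0.1, 0.2), (5.0, 5.0, 5.0)]

partials = [local_map(node_a), local_map(node_b)]
total = aggregate(partials)
best_key, best_count = densest(total)
print(best_key, best_count)
```

In this toy, the data exchanged between nodes scales with the number of distinct property keys rather than with the number of records, which is the reason such a formulation can avoid moving the dataset itself.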
