Scalable Big Data Clustering by Random Projection Hashing

This project is developing a novel clustering algorithm called <i>Random Projection Hash</i>, or RPHash. RPHash combines random projection, locality-sensitive hashing (LSH), and the count-min sketch to achieve computational scalability and linear gains from parallel speedup. The approach is data agnostic, minimizes communication overhead, and has a computational time that is predictable a priori. The system is deployable on commercially available cloud resources running the Spark implementation of MapReduce. The RPHash solution will have wide applicability to a variety of standard clustering applications, while this project will focus on a subset of clustering problems in biological data analysis. RPHash also resists de-anonymization attacks as an inherent consequence of its algorithmic requirements, thus addressing requirements for the handling and privacy protection of health care data as well as the inherent privacy concerns of using cloud-based services. Furthermore, RPHash will allow researchers to scale their clustering problems without the need for specialized equipment or computing resources. The proposed cloud processing solution will allow researchers to arbitrarily scale their processing needs using virtually limitless commercial processing