10.6084/m9.figshare.6174083.v1
Philip Wilsey, Sayantan Dey, Lee Carraher, Anindya Moitra
Scalable Big Data Clustering by Random Projection Hashing
figshare, 2018
NSF-SI2-2018-Talk, data mining, Privacy preserving data mining, random projection, locality sensitive hashing, count-min sketch, Computer Engineering
2018-04-23 22:49:07
Journal contribution
https://figshare.com/articles/journal_contribution/Scalable_Big_Data_Clustering_by_Random_Projection_Hashing/6174083

This project is developing a novel algorithm called Random Projection Hash, or RPHash. RPHash combines aspects of random projection, locality sensitive hashing (LSH), and count-min sketch to achieve computational scalability and achievable linear gains from parallel speedup. The approach is data agnostic, minimizes communication overhead, and has a priori predictable computational time. The system is deployable on commercially available cloud resources running the Spark implementation of MapReduce. RPHash will be widely applicable to a variety of standard clustering applications, while this project will focus on a subset of clustering problems in the biological data analysis space. As an inherent consequence of its algorithmic requirements, RPHash also combats de-anonymization attacks, thus addressing requirements for the handling and privacy protection of health care data as well as the inherent privacy concerns of using cloud-based services. Furthermore, RPHash will allow researchers to scale their clustering problems without the need for specialized equipment or computing resources. The proposed cloud processing solution will allow researchers to arbitrarily scale their processing needs using virtually limitless commercial processing