10.6084/m9.figshare.6174083.v1
Philip Wilsey
Philip
Wilsey
Sayantan Dey
Sayantan
Dey
Lee Carraher
Lee
Carraher
Anindya Moitra
Anindya
Moitra
Scalable Big Data Clustering by Random Projection Hashing
figshare
2018
NSF-SI2-2018-Talk
data mining
Privacy preserving data mining
random projection
locality sensitive hashing
count-min sketch
Computer Engineering
2018-04-23 22:49:07
Journal contribution
https://figshare.com/articles/journal_contribution/Scalable_Big_Data_Clustering_by_Random_Projection_Hashing/6174083
This project is developing a novel algorithm called <i>Random Projection Hash</i>, or RPHash. RPHash combines aspects of random projection, locality sensitive hashing (LSH), and count-min sketch to achieve computational scalability and linear gains from parallel speedup. The approach is data-agnostic, minimizes communication overhead, and has a computational time that is predictable a priori. The system is deployable on commercially available cloud resources running the Spark implementation of MapReduce. RPHash will be widely applicable to standard clustering applications, although this project focuses on a subset of clustering problems in biological data analysis. As a consequence of its algorithmic requirements, RPHash also inherently combats de-anonymization attacks, addressing the requirements for handling and protecting the privacy of health care data as well as the privacy concerns inherent in using cloud-based services. Furthermore, RPHash will allow researchers to scale their clustering problems without the need for specialized equipment or computing resources. The proposed cloud processing solution will allow researchers to arbitrarily scale their processing needs using virtually limitless commercial processing
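
The abstract names three building blocks: random projection, locality sensitive hashing, and count-min sketch. The following is a minimal illustrative sketch of how such pieces might fit together; it is not the RPHash implementation. It assumes a Gaussian random projection, a sign-based LSH signature, and a SHA-256-derived count-min sketch, and every name and parameter in it (`random_projection_matrix`, `lsh_signature`, `CountMinSketch`, the widths, depths, and seeds) is hypothetical.

```python
import hashlib
import random


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix of Gaussian entries (assumed projection)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def lsh_signature(vector, matrix):
    """Project the vector and keep one sign bit per projected coordinate."""
    bits = 0
    for row in matrix:
        dot = sum(r * x for r, x in zip(row, vector))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    return bits


class CountMinSketch:
    """Fixed-size frequency table; estimates never fall below the true count."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One hash function per row, derived by salting SHA-256 with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += 1

    def estimate(self, key):
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))


if __name__ == "__main__":
    # Hypothetical data: 100 noisy points around a single 8-dimensional center.
    rng = random.Random(2)
    matrix = random_projection_matrix(in_dim=8, out_dim=4, seed=1)
    sketch = CountMinSketch()
    signatures = set()
    for _ in range(100):
        point = [5.0 + rng.gauss(0.0, 0.1) for _ in range(8)]
        signature = lsh_signature(point, matrix)
        sketch.add(signature)
        signatures.add(signature)
    # Buckets with high estimated counts are candidate dense regions.
    for signature in sorted(signatures, key=sketch.estimate, reverse=True):
        print(format(signature, "04b"), sketch.estimate(signature))
```

In this sketch only hash signatures and counters are retained, never the input vectors, and the memory used by the counters is fixed in advance by `width` and `depth`. Those two properties are offered only as an intuition for the predictable-cost and privacy claims in the abstract, not as a description of how RPHash achieves them.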