NSFPresentation17.pdf (2.33 MB)

Scalable Big Data Clustering by Random Projection Hashing

journal contribution
posted on 2017-02-03, 21:52 authored by Lee Carraher, Anindya Moitra, Sayantan Dey, Philip Wilsey
RPHash provides a solution to the approximate k-means clustering problem for very large distributed datasets. Distributed data models have gained popularity in recent years following efforts by commercial, academic, and government organizations to make data more widely accessible. Due to the sheer volume of available data, in-memory single-core computation quickly becomes infeasible, requiring distributed multiprocessing. Our solution achieves clustering performance comparable to that of other popular clustering algorithms, with improved overall complexity growth, while remaining amenable to distributed processing frameworks such as MapReduce. Our solution also maintains certain data-privacy guarantees against de-anonymization.
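
The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea behind clustering by random projection hashing, not the authors' RPHash implementation. It assumes a simple sign-based hash over random Gaussian directions; the function name `random_projection_hash_centroids` and the parameters `n_bits` and `seed` are hypothetical and chosen for illustration. Each point is hashed to a bucket, per-bucket sums and counts are accumulated in a single pass, and the centroids of the `k` most populated buckets are returned as approximate cluster centers.

```python
import random
from collections import defaultdict


def random_projection_hash_centroids(points, k, n_bits=8, seed=0):
    """Return up to k centroids taken from the most populated hash buckets.

    Illustrative sketch only: the sign-of-projection hash used here is an
    assumption, not the hashing scheme described by the RPHash authors.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # One random Gaussian direction per hash bit (hypothetical choice).
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(n_bits)]

    sums = {}                  # bucket key -> running coordinate sums
    counts = defaultdict(int)  # bucket key -> number of points seen
    for p in points:
        # The bucket key is the tuple of signs of the projections of p.
        key = tuple(
            sum(d_i * x_i for d_i, x_i in zip(d, p)) >= 0.0
            for d in directions
        )
        if key not in sums:
            sums[key] = [0.0] * dim
        acc = sums[key]
        for i, x in enumerate(p):
            acc[i] += x
        counts[key] += 1

    # Keep the k densest buckets and report their mean vectors.
    top = sorted(counts, key=counts.get, reverse=True)[:k]
    return [[s / counts[key] for s in sums[key]] for key in top]
```

In this sketch only per-bucket sums and counts are stored, never the raw points, and partial sums and counts from separate data partitions can be merged by addition. That is the property that would make such a scheme fit a map-and-reduce style of processing.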
 

Funding

ACI-1440420
