Scalable Big Data Clustering by Random Projection Hashing
journal contribution
posted on 2017-02-03, 21:52 authored by Lee Carraher, Anindya Moitra, Sayantan Dey, Philip Wilsey

RPHash provides a solution to the approximate k-means clustering problem for very large distributed datasets. Distributed data models have gained popularity in recent years following the efforts of commercial, academic, and government organizations to make data more widely accessible. Due to the sheer volume of available data, in-memory single-core computation quickly becomes infeasible, requiring distributed multi-processing. Our solution achieves clustering performance comparable to that of other popular clustering algorithms, with improved overall complexity growth, while being amenable to distributed processing frameworks such as Map-Reduce. Our solution also maintains certain guarantees regarding data privacy and protection against de-anonymization.
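
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of clustering by random projection hashing, not the authors' implementation. It assumes a simple sign-based hash of Gaussian random projections (a stand-in for whatever hashing scheme RPHash actually uses), and all function names (`make_projection`, `hash_point`, `local_summary`, `merge_summaries`, `top_k_centroids`) are hypothetical. Each data partition builds per-bucket counts and coordinate sums independently, which mirrors a map step; the summaries are then merged, which mirrors a reduce step; the k most populated buckets yield approximate centroids.

```python
import random


def make_projection(dim, bits, rng):
    """Build a random Gaussian projection: `bits` rows, each of length `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def hash_point(point, projection):
    """Hash a point to a bucket id: one bit per projected coordinate (its sign).

    Assumption: sign hashing is used here only for illustration.
    """
    return tuple(
        1 if sum(w * x for w, x in zip(row, point)) >= 0 else 0
        for row in projection
    )


def local_summary(points, projection):
    """'Map' step: per-bucket (count, coordinate sum) for one data partition."""
    summary = {}
    for p in points:
        key = hash_point(p, projection)
        count, total = summary.get(key, (0, [0.0] * len(p)))
        summary[key] = (count + 1, [t + x for t, x in zip(total, p)])
    return summary


def merge_summaries(summaries):
    """'Reduce' step: add up counts and sums of buckets with the same id."""
    merged = {}
    for summary in summaries:
        for key, (count, total) in summary.items():
            c0, t0 = merged.get(key, (0, [0.0] * len(total)))
            merged[key] = (c0 + count, [a + b for a, b in zip(t0, total)])
    return merged


def top_k_centroids(merged, k):
    """Return the mean point of each of the k most populated buckets."""
    heaviest = sorted(merged.values(), key=lambda ct: ct[0], reverse=True)[:k]
    return [[t / count for t in total] for count, total in heaviest]


if __name__ == "__main__":
    rng = random.Random(0)
    # Two synthetic partitions, each holding points from two well-separated groups.
    def blob(cx, cy, n):
        return [[cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1)] for _ in range(n)]

    partitions = [blob(5, 5, 50) + blob(-5, -5, 50), blob(5, 5, 50) + blob(-5, -5, 50)]
    projection = make_projection(dim=2, bits=4, rng=rng)  # shared by all partitions
    merged = merge_summaries(local_summary(p, projection) for p in partitions)
    print(top_k_centroids(merged, k=2))
```

In this sketch every partition must use the same projection so that bucket ids agree, and only per-bucket counts and sums are exchanged between partitions rather than the raw points. Whether that matches the privacy mechanism the authors have in mind is not stated in the abstract.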