Scalable Big Data Clustering by Random Projection Hashing

Wilsey, Philip; Dey, Sayantan; Carraher, Lee; Moitra, Anindya

doi:10.6084/m9.figshare.6179132.v1

poster.pdf (318.99 kB)

poster

Posted on 2018-04-24, 20:49; authored by Philip Wilsey, Sayantan Dey, Lee Carraher, Anindya Moitra

This project is developing a novel clustering algorithm called Random Projection Hash (RPHash). RPHash combines aspects of random projection, locality-sensitive hashing (LSH), and the count-min sketch to achieve computational scalability and linear gains from parallel speedup. The approach is data agnostic, minimizes communication overhead, and has a priori predictable computational time. The system is deployable on commercially available cloud resources running the Spark implementation of MapReduce. RPHash is applicable to a wide variety of standard clustering applications, although this project focuses on a subset of clustering problems in biological data analysis. As an inherent consequence of its algorithmic requirements, RPHash also resists de-anonymization attacks, which addresses requirements for the handling and privacy protection of health care data as well as the privacy concerns inherent in using cloud-based services. Furthermore, RPHash will allow researchers to scale their clustering problems without specialized equipment or computing resources: the proposed cloud processing solution will let them scale their processing needs arbitrarily using virtually limitless commercial processing resources.
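The three ingredients named in the abstract can be illustrated with a small, single-machine sketch. This is not the authors' RPHash implementation, and it does not use Spark. It is a minimal illustration under stated assumptions: a Gaussian random projection, a sign-based LSH bucket id, and a count-min sketch that estimates how many points fall into each bucket. All names (`random_projection_matrix`, `sign_hash`, `CountMinSketch`, `frequent_buckets`) and all parameters (`out_dim`, `width`, `depth`, `top_k`) are hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


def random_projection_matrix(in_dim, out_dim, rng):
    """Gaussian random matrix (out_dim x in_dim), scaled by 1/sqrt(out_dim)."""
    scale = 1.0 / out_dim ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, point):
    """Multiply the projection matrix by a point (a list of floats)."""
    return [sum(w * x for w, x in zip(row, point)) for row in matrix]


def sign_hash(projected):
    """Assumed LSH scheme: one bit per projected coordinate (its sign)."""
    bucket = 0
    for value in projected:
        bucket = (bucket << 1) | (1 if value >= 0 else 0)
    return bucket


class CountMinSketch:
    """Fixed-size frequency table; estimates never undercount."""

    def __init__(self, width, depth, rng):
        self.width = width
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, salt, key):
        # Hashing a tuple of ints is deterministic across Python runs.
        return hash((salt, key)) % self.width

    def add(self, key):
        for row, salt in zip(self.table, self.salts):
            row[self._index(salt, key)] += 1

    def estimate(self, key):
        return min(row[self._index(salt, key)]
                   for row, salt in zip(self.table, self.salts))


def frequent_buckets(points, out_dim=4, top_k=2, seed=0):
    """Return centroids of the top_k buckets ranked by estimated count.

    For brevity the bucket members are kept in memory; a streaming or
    distributed version would avoid storing them.
    """
    rng = random.Random(seed)
    matrix = random_projection_matrix(len(points[0]), out_dim, rng)
    sketch = CountMinSketch(width=64, depth=4, rng=rng)
    members = defaultdict(list)
    for point in points:
        bucket = sign_hash(project(matrix, point))
        sketch.add(bucket)
        members[bucket].append(point)
    top = sorted(members, key=sketch.estimate, reverse=True)[:top_k]
    return [[sum(col) / len(members[b]) for col in zip(*members[b])]
            for b in top]
```

Calling `frequent_buckets(points)` on a list of equal-length float vectors returns up to `top_k` centroids, one per heavily populated hash bucket. Each point is processed independently until the counting step, which is the property that makes this style of algorithm amenable to a map-reduce decomposition.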

Funding

ACI-1440420

Keywords

NSF-SI2-2018, data clustering, high-dimensional datasets, map-reduce, parallel and distributed computing, Computer Engineering

Licence

CC BY 4.0
