Scalable Tree Algorithms for Machine Learning Applications
2017-02-21T12:55:01Z (GMT) by
Are current scientific codes ready to face the rapidly growing volume of data sets? What if data analyses could be made orders of magnitude faster? What if the results of these analyses could be made far more precise for a given computing time on a given architecture? Answering these questions will play an increasingly important role in data-driven science and engineering, as the volume of data sets continues to grow while the performance gains from silicon-based technologies slow down in a post-Moore era of computing. To overcome these computing challenges and make sense of large data sets, the next generation of analysis software will need to build bridges between big data and high performance computing approaches.

The STAMLA project aims to develop efficient and scalable tree algorithms for machine learning, inspired by high performance simulation codes. Over the last few years, machine learning has become a popular data mining technique to extract information from data sets, build models and make predictions across a wide range of application areas. However, current tools have been built in high-level languages, with more focus on functionality than on raw performance. As scientific experiments accumulate more and more data, and as complex models require larger and larger training sets, scalability issues are emerging. At the same time, in high performance computing, petascale simulations have shown that optimizing fundamental data structures can have a significant impact on overall code performance. In particular, by replacing straightforward tree implementations with implicit trees based on hash tables, simulation codes are able to make the most of modern architectures, leveraging cache and vectorization.
This research project will apply this knowledge to machine learning software in order to overcome the limitations of existing libraries and make analyses of extremely large data sets possible.
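
To make the idea of an implicit tree based on a hash table more concrete, here is a minimal sketch in Python. It is not taken from the STAMLA code base: the choice of a quadtree, the Morton-style location keys, and every name in it (`location_key`, `parent_key`, `HashedQuadtree`, and so on) are assumptions made for illustration only. Each cell is identified by an integer key, so parent and child relations become bit shifts rather than pointer dereferences, and all cells live in a single hash table.

```python
def interleave_bits(x, y, level):
    """Interleave the low `level` bits of x and y into a Morton index."""
    code = 0
    for i in range(level):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def location_key(x, y, level):
    """Integer key of the cell at grid coordinates (x, y) on `level`.

    A leading sentinel bit encodes the level, so keys are unique
    across levels; the root cell has key 1.
    """
    return (1 << (2 * level)) | interleave_bits(x, y, level)


def parent_key(key):
    """Key of the parent cell: drop the last two bits."""
    return key >> 2


def child_keys(key):
    """Keys of the four children: append two bits."""
    return [(key << 2) | i for i in range(4)]


def key_level(key):
    """Level of a cell, recovered from the sentinel bit position."""
    return (key.bit_length() - 1) // 2


class HashedQuadtree:
    """Hypothetical implicit quadtree: cells live in a dict keyed by
    location key, with no explicit parent or child pointers."""

    def __init__(self, max_level):
        self.max_level = max_level
        self.cells = {}  # location key -> number of points in the cell

    def insert(self, px, py):
        """Insert a point of the unit square [0, 1) x [0, 1).

        The count is incremented on the leaf cell and on all of its
        ancestors, up to and including the root.
        """
        n = 1 << self.max_level
        x = min(int(px * n), n - 1)
        y = min(int(py * n), n - 1)
        key = location_key(x, y, self.max_level)
        while key >= 1:
            self.cells[key] = self.cells.get(key, 0) + 1
            key = parent_key(key)

    def count(self, x, y, level):
        """Number of points in cell (x, y) of `level` (0 if absent)."""
        return self.cells.get(location_key(x, y, level), 0)
```

In this sketch, reaching any cell costs one hash lookup whatever its depth, and only non-empty cells use memory. A performance-oriented implementation would more likely use a compiled language and a contiguous, cache-friendly table; the Python `dict` here only illustrates the addressing scheme.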