H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
1. Benchmarking Machine Learning Tools for Scalability, Speed and Accuracy
Szilárd Pafka, PhD
Chief Scientist, Epoch
H2O World Conference, Mountain View
Nov 2015
3. Disclaimer:
I am not representing my employer (Epoch) in this talk.
I can neither confirm nor deny whether Epoch is using any of
the methods, tools, results etc. mentioned in this talk.
The results presented in this talk should not be
considered an indication of whether Epoch is using
these methods, tools, results etc.
5. I usually use other people’s code [...] it is usually not
“efficient” (from time budget perspective) to write my own
algorithm [...] I can find open source code for what I want to
do, and my time is much better spent doing research and
feature engineering -- Owen Zhang
http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/
20. Distributed computation generally is hard, because it
adds an additional layer of complexity and [network]
communication overhead. The ideal case is scaling
linearly with the number of nodes; that’s rarely the case.
Emerging evidence shows that very often, one big
machine, or even a laptop, outperforms a cluster.
http://fastml.com/the-emperors-new-clothes-distributed-machine-learning/
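To make "scaling linearly with the number of nodes" concrete, here is a minimal sketch of the usual speedup and parallel-efficiency arithmetic. The function names and the timings are hypothetical and purely illustrative; they are not measurements from this talk.

```python
def speedup(t_single, t_cluster):
    """How many times faster the cluster run is than the single-machine run."""
    return t_single / t_cluster


def efficiency(t_single, t_cluster, n_nodes):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup(t_single, t_cluster) / n_nodes


# Hypothetical timings in seconds (assumed for illustration, not measured).
t_one_machine = 100.0
t_eight_nodes = 40.0

print(speedup(t_one_machine, t_eight_nodes))        # 2.5
print(efficiency(t_one_machine, t_eight_nodes, 8))  # 0.3125
```

Under these assumed numbers, 8 nodes give a 2.5x speedup, about 31% of the linear ideal; communication overhead of the kind the quote describes is one reason efficiency falls below 1.0.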
22. n = 10K, 100K, 1M, 10M, 100M
Training time
RAM usage
AUC
CPU % by core
read data, pre-process, score test data
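The measurements listed above could be collected with a harness along these lines. This is a minimal sketch under stated assumptions, not the benchmark code used in the talk: the synthetic data generator, the placeholder one-feature "model", and all function names are hypothetical. It uses only the Python standard library, and tracemalloc tracks Python-level allocations only, so it is a rough stand-in for RAM usage rather than process memory.

```python
import random
import time
import tracemalloc


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U), using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def make_data(n, seed):
    """Hypothetical synthetic data: one noisy feature, binary label."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [1 if x + rng.gauss(0, 1) > 0 else 0 for x in xs]
    return xs, ys


def train_class_means(xs, ys):
    """Placeholder trainer: learns only the direction in which x predicts y."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    direction = 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0
    return lambda x: direction * x


def benchmark(train_fn, sizes, n_test=2000):
    """Record training time, peak traced memory and test AUC for each n."""
    test_x, test_y = make_data(n_test, seed=1)
    results = []
    for n in sizes:
        train_x, train_y = make_data(n, seed=0)
        tracemalloc.start()
        t0 = time.perf_counter()
        model = train_fn(train_x, train_y)
        train_sec = time.perf_counter() - t0
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        test_scores = [model(x) for x in test_x]
        results.append({"n": n, "train_sec": train_sec,
                        "peak_bytes": peak_bytes,
                        "auc": auc(test_y, test_scores)})
    return results


for row in benchmark(train_class_means, sizes=[1_000, 10_000]):
    print(row)
```

To benchmark a real tool, train_fn would be swapped for a wrapper around that tool's training call, and sizes extended toward the 10M and 100M rows listed on the slide. Per-core CPU utilisation is not covered here; it needs an external monitor or a third-party package such as psutil.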
44. we will continue to run large [...] jobs to scan petabytes of [...] data to
extract interesting features, but this paper explores the interesting
possibility of switching over to a multi-core, shared-memory system for
efficient execution on more refined datasets [...] e.g., machine learning
http://openproceedings.org/2014/conf/edbt/KumarGDL14.pdf