What's new in Apache Hivemall v0.5.0

Published on

April 17, 2018 at Dots
https://techplay.jp/event/663945

Published in: Data & Analytics

  1. Hivemall v0.5.0. Makoto YUI (@myui, @ApacheHivemall), Research Engineer, Treasure Data. 2018/4/17 Hivemall meetup
  2. v0.5.0
  3. What is Apache Hivemall? A scalable machine learning library built as a collection of Hive UDFs. Key properties: multi/cross-platform, versatile, scalable, ease of use.
  4. Hivemall is easy and scalable. Ease of use: ML made easy for SQL developers. Scalable: born to be parallel and scalable. 100+ lines of code. This query automatically runs in parallel on Hadoop:

       CREATE TABLE lr_model AS
       SELECT
         feature,                -- reducers perform model averaging in parallel
         avg(weight) as weight
       FROM (
         SELECT logress(features,label,..) as (feature,weight)
         FROM train
       ) t                       -- map-only task
       GROUP BY feature;         -- shuffled to reducers
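The model-averaging step in slide 4 (`avg(weight) ... GROUP BY feature`) can be illustrated outside Hive. The following is a minimal Python sketch, not Hivemall code: the function name `average_models` and the toy weights are hypothetical, and each dictionary stands in for the `(feature, weight)` rows emitted by one map task.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average each feature's weight over the partial models that emitted it.

    Mirrors `avg(weight) ... GROUP BY feature`: a feature missing from a
    partial model contributes no row, so it is not counted as zero.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            totals[feature] += weight
            counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Hypothetical weights learned independently by two map tasks.
mapper_a = {"age": 0.5, "income": 1.0}
mapper_b = {"age": 1.5, "clicks": -2.0}
lr_model = average_models([mapper_a, mapper_b])
```

Because each partial model is trained independently and only combined at the end, the training step needs no coordination between tasks, which is what lets the query run as a map-only task followed by a shuffle.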
  5. Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/Dataframe API, Pig Latin. Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.
  6. Hivemall on Apache Hive
  7. Hivemall on Apache Spark Dataframe
  8. Hivemall on SparkSQL
  9. Hivemall on Apache Pig
  10. Online Prediction by Apache Streaming
  11. What's new in v0.5.0? Anomaly/Change Point Detection (algorithms: ChangeFinder, SST). Topic Modeling (Soft Clustering) (algorithms: LDA, pLSA). Hivemall on Spark v2.0/v2.1/v2.2 (SparkSQL/Dataframe support, Top-k data processing).
  12. Generic Classifier/Regressor: old style vs. new style from v0.5.0
  13. Generic Classifier/Regressor
       • Available loss functions for binary classification: HingeLoss, LogLoss (synonym: logistic), SquaredHingeLoss, ModifiedHuberLoss
       • Available loss functions for regression: Squared Loss, Quantile Loss, Epsilon Insensitive Loss, Squared Epsilon Insensitive Loss, Huber Loss
       • Optimizer: SGD, AdaGrad, AdaDelta, ADAM
       • Regularization: L1, L2, ElasticNet, RDA
       • Other options: iteration support, mini-batch, early stopping
  14. Generic Classifier/Regressor hyperparameters (AdaGrad + RDA by default)
       • -eta0 <arg>: The initial learning rate [default: 0.1]
       • -iter, --iterations <arg>: The maximum number of iterations [default: 10]
       • -lambda <arg>: Regularization term [default: 0.0001]
       • -loss, --loss_function <arg>: Loss function [HingeLoss (default), LogLoss, SquaredHingeLoss, ModifiedHuberLoss, or a regression loss: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss]
       • -mini_batch, --mini_batch_size <arg>: Mini batch size [default: 1]. Expecting the value in range [1,100] or so.
       • -opt, --optimizer <arg>: Optimizer to update weights [default: adagrad, sgd, adadelta, adam]
       • -reg, --regularization <arg>: Regularization type [default: rda, l1, l2, elasticnet]
  15. RandomForest in Hivemall: Ensemble of Decision Trees
  16. What's OOB in RandomForests? Uniform/stratified sampling. Image borrowed from http://alfredplpl.hatenablog.com/entry/2013/12/24/225420
  17. Stratified Sampling. https://bellcurve.jp/statistics/course/8007.html
  18. What's OOB in RandomForests? Data not used for training is used to evaluate the model's accuracy. http://alfredplpl.hatenablog.com/entry/2013/12/24/225420
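The out-of-bag (OOB) idea in slides 16 and 18 can be made concrete with a small Python sketch. It assumes the common bootstrap scheme (uniform sampling with replacement) and is not taken from Hivemall's implementation; `bootstrap_with_oob` is a hypothetical helper name.

```python
import random


def bootstrap_with_oob(n_rows, seed):
    """Draw a bootstrap sample of row indices and return the left-out rows.

    in_bag: n_rows indices sampled uniformly with replacement (training set
            for one tree).
    oob:    indices never drawn; they can be used to evaluate that tree.
    """
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    oob = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, oob


in_bag, oob = bootstrap_with_oob(1000, seed=42)
oob_fraction = len(oob) / 1000
```

With sampling with replacement, each row is left out of a given tree with probability (1 - 1/n)^n, which approaches 1/e (about 37%) for large n, so every tree gets a sizeable evaluation set without a separate hold-out split.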
  19. Training of RandomForest. Good news: Sparse Vector input (LIBSVM format) is supported since v0.5.0, in addition to Dense Vector! train_randomforest_classifier(array<double|string> features, int label [, const string options, const array<double> classWeights])
  20. Supported Feature Vector Format of Random Forests
       • Dense Vector (array<double>), e.g., 1.0, 0.0, 3.0
       • Sparse Vector (array<string>) in a LIBSVM format, e.g., 1:1.0, 2:0.0, 3:3.0 or 1:1.0, 3:3.0 or 1:1.0, 3
       • feature := <index>[":"<value>] where index := <integer> starting with 1 (index = 0 is reserved for bias clause) and value := <floating point> (default 1.0 if not provided)
       • select feature_hashing(array("userid#4505:3.3","movieid#2331:4.999", "movieid#2331")); returns ["1828616:3.3","6238429:4.999","6238429"]
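The sparse feature grammar in slide 20 (`<index>[":"<value>]`, with the value defaulting to 1.0) can be read as follows. This is a minimal Python sketch of the grammar as stated on the slide, not Hivemall's parser; `parse_feature` and `parse_sparse_vector` are hypothetical names, and rejecting index 0 is an assumption based on the slide's note that 0 is reserved.

```python
def parse_feature(feature):
    """Parse one '<index>[:<value>]' string into an (index, value) pair."""
    index_text, separator, value_text = feature.partition(":")
    index = int(index_text)
    if index < 1:
        # Assumption: the slide says indices start at 1 and 0 is reserved.
        raise ValueError(f"feature index must be >= 1, got {index}")
    # The value is optional and defaults to 1.0 when no ':' is present.
    value = float(value_text) if separator else 1.0
    return index, value


def parse_sparse_vector(features):
    """Parse a list of feature strings into an {index: value} mapping."""
    return dict(parse_feature(feature) for feature in features)


sparse = parse_sparse_vector(["1:1.0", "3"])
```

Here `"3"` carries no explicit value, so it is read as index 3 with value 1.0.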
  21. Feature Engineering – Feature Hashing
  22. Random Forests Training Hyperparameters
       • -attrs, --attribute_types <arg>: Comma separated attribute types (Q for quantitative variable and C for categorical variable, e.g., [Q,C,Q,C])
       • -depth, --max_depth <arg>: The maximum tree depth [default: Integer.MAX_VALUE]
       • -leafs, --max_leaf_nodes <arg>: The maximum number of leaf nodes [default: Integer.MAX_VALUE]
       • -min_samples_leaf <arg>: The minimum number of samples in a leaf node [default: 1]
       • -rule, --split_rule <arg>: Split algorithm [default: GINI, ENTROPY, CLASSIFICATION_ERROR]
       • -seed <arg>: Seed value in long [default: -1 (random)]
       • -splits, --min_split <arg>: A node with at least `min_split` examples will be split [default: 2]
       • -stratified, --stratified_sampling: Enable stratified sampling for unbalanced data
       • -subsample <arg>: Sampling rate in range (0.0,1.0]
       • -trees, --num_trees <arg>: The number of trees for each task [default: 50]
       • -vars, --num_variables <arg>: The number of randomly selected features [default: ceil(sqrt(x[0].length))]. int(num_variables * x[0].length) is used if num_variables is in (0.0,1.0]
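The `-vars` rule in slide 22 has three cases, which a short Python sketch can spell out. This only restates the rule as worded on the slide; `resolve_num_variables` is a hypothetical name, and treating values above 1.0 as an absolute count is an assumption.

```python
import math


def resolve_num_variables(num_features, num_variables=None):
    """Return how many features are sampled per split, per the slide's rule."""
    if num_variables is None:
        # Default: ceil(sqrt(number of features)).
        return math.ceil(math.sqrt(num_features))
    if 0.0 < num_variables <= 1.0:
        # A value in (0.0, 1.0] is read as a fraction of the features.
        return int(num_variables * num_features)
    # Assumption: larger values are an absolute feature count.
    return int(num_variables)


default_for_10 = resolve_num_variables(10)
half_of_10 = resolve_num_variables(10, 0.5)
```

Note that under this reading a value of exactly 1 falls in (0.0, 1.0] and therefore selects all features rather than a single one.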
  23. Prediction of RandomForest. Posterior probability based on the votes of the decision trees' predicted classes. Model reliability based on the OOB error rate.
  24. Decision Tree Visualization
  25. Decision Tree Visualization. http://viz-js.com/
  26. Feature Engineering – Feature Binning. Maps quantitative variables to a fixed number of bins based on quantiles/distribution. Example: map ages into 3 bins.
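Slide 26's example of mapping ages into 3 quantile-based bins can be sketched in plain Python. This is an illustration of quantile binning in general, not Hivemall's binning functions; the helper names, the simple rank-based quantile rule, and the sample ages are all assumptions.

```python
from bisect import bisect_right


def quantile_boundaries(values, num_bins):
    """Return num_bins - 1 interior cut points taken at equal-rank quantiles."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // num_bins] for k in range(1, num_bins)]


def bin_index(value, boundaries):
    """Map a value to a bin number in [0, len(boundaries)]."""
    # A value equal to a cut point goes to the bin on its right.
    return bisect_right(boundaries, value)


ages = [18, 22, 25, 31, 38, 45, 52, 60, 71]  # hypothetical sample
boundaries = quantile_boundaries(ages, num_bins=3)
binned = [bin_index(age, boundaries) for age in ages]
```

Because the cut points come from the data's own quantiles, each bin receives roughly the same number of rows even when the values are unevenly spread.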
  27. Feature Engineering – Feature Binning
  28. Evaluation Metrics
  29. Map tiling functions
  30. Map tiling functions. Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n, where n is the zoom level. Examples shown at Zoom=10 and Zoom=15.
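The tile formula in slide 30 can be tried out with a short Python sketch. It assumes the standard Web Mercator ("slippy map") definitions of xtile and ytile used by OpenStreetMap and reads n as the zoom level; the slide does not spell out either, so treat both as assumptions rather than Hivemall's exact implementation.

```python
import math


def xtile(lon, zoom):
    """Column of the tile containing longitude `lon` (degrees)."""
    n = 1 << zoom  # 2^zoom tiles per axis
    x = math.floor((lon + 180.0) / 360.0 * n)
    return min(max(x, 0), n - 1)  # clamp so lon = 180 stays in range


def ytile(lat, zoom):
    """Row of the tile containing latitude `lat` (degrees, Web Mercator)."""
    n = 1 << zoom
    lat_rad = math.radians(lat)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return min(max(y, 0), n - 1)


def tile(lat, lon, zoom):
    """Single tile number: xtile + ytile * 2^zoom, as on the slide."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)


# Approximate coordinates of central Tokyo at zoom level 10.
tokyo_tile = tile(35.6895, 139.6917, 10)
```

Packing the row and column into one integer gives a single key per map cell, which is convenient for GROUP BY or JOIN on location at a chosen zoom level.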
  31. Sketch and NLP functions. Exact count: SELECT count(distinct id) FROM data. Approximate count: SELECT approx_count_distinct(id) FROM data. Japanese tokenizer: select tokenize_ja(" ", "normal", null, null, "https://s3.amazonaws.com/td-hivemall/dist/kuromoji-user-dict-neologd.csv.gz"); [“ ”, "," "," "]
  32. Other Supported Features
       • Anomaly Detection: ✓ Local Outlier Factor (LoF) ✓ ChangeFinder
       • Clustering / Topic models: ✓ Online mini-batch LDA ✓ Online mini-batch PLSA
       • Change Point Detection: ✓ ChangeFinder ✓ Singular Spectrum Transformation
  33. Anomaly/Change-point Detection by ChangeFinder. Efficient algorithm for finding change points and outliers from time-series data. J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  34. Anomaly/Change-point Detection by ChangeFinder. Take this…
  35. Anomaly/Change-point Detection by ChangeFinder. …and do this!
  36. Anomaly/Change-point Detection by ChangeFinder. Efficient algorithm for finding change points and outliers from time-series data. J. Takeuchi and K. Yamanishi, A Unifying Framework for Detecting Outliers and Change Points from Time Series, IEEE Transactions on Knowledge and Data Engineering, pp.482-492, 2006.
  37. Change-point detection by Singular Spectrum Transformation
       • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
       • T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
  38. Online mini-batch LDA
  39. Probabilistic Latent Semantic Analysis - training
  40. Probabilistic Latent Semantic Analysis - predict
  41. Future work for v0.5.2 and later: ✓ Word2Vec support ✓ Multi-class Logistic Regression ✓ Field-aware Factorization Machines ✓ SLIM recommendation ✓ Merge Brickhouse UDFs ✓ XGBoost support ✓ LightGBM support ✓ Gradient Boosting. Pull requests shown: PR#91, PR#116, PR#58, PR#111, PR#135.
  42. Brickhouse functions. https://github.com/klout/brickhouse

       SELECT from_json(
         to_json(
           ARRAY(
             NAMED_STRUCT("country", "japan", "city", "tokyo"),
             NAMED_STRUCT("country", "japan", "city", "osaka")
           )
         ),
         'array<struct<city:string>>'
       )
  43. Prediction tracing of Decision Tree. Trace how a prediction was made.
  44. XGBoost support in Hivemall (experimental; not yet supported in TD). Training:

       SELECT train_xgboost_classifier(features, label) as (model_id, model)
       FROM training_data

     Prediction:

       SELECT rowid, AVG(predicted) as predicted
       FROM (
         -- predict with each model
         SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
         -- join each test record with each model
         FROM xgboost_models CROSS JOIN test_data_with_id
       ) t
       GROUP BY rowid;
  45. Conclusion and Takeaway. Hivemall is a multi/cross-platform ML library (HiveQL, SparkSQL/Dataframe API, Pig Latin) providing a collection of machine learning algorithms as Hive UDFs/UDTFs. Try out the first Apache release (v0.5.0)! We welcome your contributions to Apache Hivemall.
  46. Any feature requests or questions? BTW, we are hiring!
  47. Hivemall Digdag
  48. Machine Learning Workflow using Digdag
  49. Machine Learning Workflow using Digdag