
What's new in Hivemall v0.5.0

This slide deck shows what's new in Apache Hivemall v0.5.0 (English-only version).



  1. What's New in Hivemall v0.5.0. Makoto YUI, Research Engineer, Treasure Data (@myui, @ApacheHivemall)
  2. Released the first Apache release, v0.5.0, on Mar 5, 2018. hivemall.incubator.apache.org
  3. What's new in v0.5.0? Anomaly/change-point detection (algorithms: ChangeFinder, SST); topic modeling, i.e., soft clustering (algorithms: LDA, pLSA); Hivemall on Spark v2.0/v2.1/v2.2 with SparkSQL/DataFrame support and top-k data processing.
  4. Generic Classifier/Regressor: old style vs. the new style introduced in v0.5.0.
  5. Generic Classifier/Regressor, available loss functions. For binary classification: HingeLoss, LogLoss (synonym: logistic), SquaredHingeLoss, ModifiedHuberLoss. For regression: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss. Regularization: L1, L2, ElasticNet, RDA. Optimizers: SGD, AdaGrad, AdaDelta, ADAM. Other options: iteration support, mini-batch, early stopping.
  6. Generic Classifier/Regressor hyperparameters (AdaGrad + RDA by default):
     -eta0 <arg> The initial learning rate [default: 0.1]
     -iter,--iterations <arg> The maximum number of iterations [default: 10]
     -lambda <arg> Regularization term [default: 0.0001]
     -loss,--loss_function <arg> Loss function [default: HingeLoss; also LogLoss, SquaredHingeLoss, ModifiedHuberLoss, or a regression loss: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss]
     -mini_batch,--mini_batch_size <arg> Mini-batch size [default: 1]; expects a value in the range [1,100] or so
     -opt,--optimizer <arg> Optimizer used to update weights [default: adagrad; also sgd, adadelta, adam]
     -reg,--regularization <arg> Regularization type [default: rda; also l1, l2, elasticnet]
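For reference, a minimal training query in the new generic style might look like the following sketch. The option names come from the slide above; the table name (training) and the weight-averaging pattern follow the Hivemall user guide and are assumptions here, not part of the deck:

     CREATE TABLE model AS
     SELECT feature, AVG(weight) AS weight   -- average the weights produced by parallel map tasks
     FROM (
       SELECT train_classifier(features, label,
                '-loss_function logloss -optimizer adagrad -regularization rda'
              ) AS (feature, weight)
       FROM training
     ) t
     GROUP BY feature;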
  7. RandomForest in Hivemall: an ensemble of decision trees.
  8. Training of RandomForest. Good news: sparse vector input (LIBSVM format) is supported since v0.5.0 in addition to dense vectors! train_randomforest_classifier(array<double|string> features, int label [, const string options, const array<double> classWeights])
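A minimal training call might look like this sketch (the table name and option string are assumptions; per the v0.5.x user guide, each output row holds one tree of the forest):

     CREATE TABLE rf_model AS
     SELECT train_randomforest_classifier(features, label, '-trees 50')
     FROM training;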
  9. Supported feature vector formats of Random Forests:
     • Dense vector (array<double>), e.g., 1.0, 0.0, 3.0
     • Sparse vector (array<string>) in LIBSVM format, e.g., 1:1.0, 2:0.0, 3:3.0 or 1:1.0, 3:3.0
     feature := <index>[":"<value>], where index := <integer> starting with 1 (index = 0 is reserved for the bias clause) and value := <floating point> (default 1.0 if not provided), so 1:1.0, 3 is also valid.
     select feature_hashing(array("userid#4505:3.3","movieid#2331:4.999", "movieid#2331"));
     > ["1828616:3.3","6238429:4.999","6238429"]
  10. Feature Engineering – Feature Hashing
  11. Random Forests training hyperparameters:
      -attrs,--attribute_types <arg> Comma-separated attribute types (Q for quantitative variable and C for categorical variable, e.g., [Q,C,Q,C])
      -depth,--max_depth <arg> The maximum tree depth [default: Integer.MAX_VALUE]
      -leafs,--max_leaf_nodes <arg> The maximum number of leaf nodes [default: Integer.MAX_VALUE]
      -min_samples_leaf <arg> The minimum number of samples in a leaf node [default: 1]
      -rule,--split_rule <arg> Split algorithm [default: GINI; also ENTROPY, CLASSIFICATION_ERROR]
      -seed <arg> Seed value in long [default: -1 (random)]
      -splits,--min_split <arg> A node that has greater than or equal to `min_split` examples will split [default: 2]
      -stratified,--stratified_sampling Enable stratified sampling for unbalanced data
      -subsample <arg> Sampling rate in range (0.0,1.0]
      -trees,--num_trees <arg> The number of trees for each task [default: 50]
      -vars,--num_variables <arg> The number of randomly selected features [default: ceil(sqrt(x[0].length))]; int(num_variables * x[0].length) is considered if num_variables is in (0.0,1.0]
  12. Prediction of RandomForest: the posterior probability is based on voting of the decision trees, and the reliability of a model is based on its OOB (out-of-bag) error rate.
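A prediction query following the v0.5.x user guide might look like the sketch below; rf_ensemble aggregates per-tree votes into a final label (this uses the simple voting form). Table names are assumptions, and the exact tree_predict signature should be checked against the docs for your version:

     SELECT rowid,
            rf_ensemble(predicted) AS predicted   -- majority vote over the trees
     FROM (
       SELECT t.rowid,
              tree_predict(m.model_id, m.model, t.features, true) AS predicted  -- true = classification
       FROM rf_model m
       LEFT OUTER JOIN testing t
     ) p
     GROUP BY rowid;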
  13. Decision Tree Visualization
  14. Decision Tree Visualization, rendered with http://viz-js.com/
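Hivemall v0.5.0 ships a tree_export function that serializes a trained tree for visualization (as JavaScript, or as Graphviz dot renderable at viz-js.com). A sketch; the option string and table name are assumptions to verify against the docs:

     SELECT tree_export(model, '-type javascript') AS exported   -- a Graphviz/dot output type also exists per the docs
     FROM rf_model
     LIMIT 1;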
  15. Efficient all-pairs cosine similarity using DIMSUM (https://blog.twitter.com/engineering/en_us/a/2014/all-pairs-similarity-via-dimsum.html). All-pairs similarity is computationally heavy: O(N^2), where N is the number of items or users. Twitter's solution is DIMSUM.
  16. All-pairs cosine similarity using DIMSUM. Find a concrete example in https://github.com/treasure-data/workflow-examples/tree/master/machine-learning/collaborative_filtering
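As a rough sketch of the approach: the dimsum_mapper UDTF and its '-threshold' option appear in the Hivemall docs, while the input tables (item feature vectors and precomputed column magnitudes) are assumptions here:

     WITH partial AS (
       SELECT dimsum_mapper(f.feature_vector, m.mags, '-threshold 0.5') AS (itemid, other, s)
       FROM item_features f
       CROSS JOIN item_magnitude m
     )
     SELECT itemid, other, SUM(s) AS similarity   -- approximates the cosine similarity of each pair
     FROM partial
     GROUP BY itemid, other;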
  17. Feature Engineering – Feature Binning: maps quantitative variables to a fixed number of bins based on quantiles/distribution (e.g., mapping ages into 3 bins).
  18. Feature Engineering – Feature Binning
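A sketch of the quantile-based binning flow, assuming the build_bins and feature_binning functions described in the Hivemall docs; table and column names are hypothetical:

     WITH bins AS (
       SELECT map('age', build_bins(age, 3)) AS quantiles_map   -- 3 quantile-based bins for the age column
       FROM users
     )
     SELECT feature_binning(features, quantiles_map) AS features
     FROM users
     CROSS JOIN bins;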
  19. Evaluation Metrics
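Hivemall ships evaluation UDAFs such as rmse, mae, logloss, and auc; a sketch of typical usage (table and column names are assumptions; note that auc expects rows ordered by the predicted score in descending order):

     SELECT rmse(predicted, actual) AS rmse,
            mae(predicted, actual)  AS mae
     FROM predictions;

     SELECT auc(prob, label) AS auc
     FROM (SELECT prob, label FROM predictions ORDER BY prob DESC) t;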
  20. Map tiling functions
  21. Map tiling functions: tile(lat, lon, zoom) = xtile(lon, zoom) + ytile(lat, zoom) * 2^zoom. Examples shown at zoom = 10 and zoom = 15.
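A sketch using the geospatial functions added in v0.5.0 (tile, lon2tilex, lat2tiley); the input table is an assumption:

     SELECT tile(lat, lon, 15)  AS tile_id,   -- flattened tile number at zoom level 15
            lon2tilex(lon, 15)  AS xtile,
            lat2tiley(lat, 15)  AS ytile
     FROM gps_points;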
  22. Sketch and NLP functions. Approximate distinct counting:
      SELECT count(distinct id) FROM data
      → SELECT approx_count_distinct(id) FROM data
      Japanese tokenization with a user dictionary (Japanese input/output text omitted in this transcript):
      select tokenize_ja("…", "normal", null, null, "https://s3.amazonaws.com/td-hivemall/dist/kuromoji-user-dict-neologd.csv.gz");
  23. Other supported features:
      Anomaly detection: ✓ Local Outlier Factor (LOF) ✓ ChangeFinder
      Clustering / topic models: ✓ Online mini-batch LDA ✓ Online mini-batch pLSA
      Change-point detection: ✓ ChangeFinder ✓ Singular Spectrum Transformation
  24. Anomaly/change-point detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  25. Anomaly/change-point detection by ChangeFinder: take this…
  26. Anomaly/change-point detection by ChangeFinder: …and do this!
  27. Anomaly/change-point detection by ChangeFinder: an efficient algorithm for finding change points and outliers in time-series data. J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
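A sketch of ChangeFinder usage, following the example in the Hivemall docs; the threshold values and table/column names are assumptions, and rows should be fed in time order:

     SELECT changefinder(value, '-outlier_threshold 0.03 -changepoint_threshold 0.0035') AS result
     FROM timeseries
     ORDER BY num ASC;   -- num is a (hypothetical) time index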
  28. Change-point detection by Singular Spectrum Transformation.
      • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations," Proc. SDM, 2005.
      • T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning," Proc. SDM, 2007.
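SST usage looks similar to ChangeFinder; a sketch following the docs (threshold value and names are assumptions):

     SELECT num, sst(value, '-threshold 0.005') AS result
     FROM timeseries
     ORDER BY num ASC;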
  29. Online mini-batch LDA
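A training sketch, assuming the train_lda UDTF from the docs; the input table (docs, holding one feature array per document) and the '-topics' value are assumptions:

     CREATE TABLE lda_model AS
     SELECT train_lda(features, '-topics 2') AS (label, word, lambda)   -- per-topic word weights
     FROM docs;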
  30. Probabilistic Latent Semantic Analysis – training
  31. Probabilistic Latent Semantic Analysis – prediction
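A sketch of pLSA training and prediction, assuming the train_plsa and plsa_predict functions from the docs; all table and column names are hypothetical:

     -- training: learn per-topic word probabilities
     CREATE TABLE plsa_model AS
     SELECT train_plsa(features, '-topics 2') AS (label, word, prob)
     FROM docs;

     -- prediction: aggregate topic probabilities per document
     SELECT t.docid,
            plsa_predict(t.word, t.value, m.label, m.prob) AS probabilities
     FROM test_words t
     JOIN plsa_model m ON (t.word = m.word)
     GROUP BY t.docid;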
  32. Future work for v0.5.2 and later: ✓ Word2Vec support ✓ Multi-class Logistic Regression ✓ Field-aware Factorization Machines ✓ SLIM recommendation ✓ Merge Brickhouse UDFs ✓ XGBoost support ✓ LightGBM support ✓ Gradient Boosting (related PRs: #91, #116, #58, #111, #135)
  33. Brickhouse functions (https://github.com/klout/brickhouse):
      SELECT from_json(to_json(
        ARRAY(
          NAMED_STRUCT("country", "japan", "city", "tokyo"),
          NAMED_STRUCT("country", "japan", "city", "osaka")
        )
      ), 'array<struct<city:string>>')
  34. Prediction tracing of Decision Tree: trace how a prediction was made.
  35. XGBoost support in Hivemall (experimental; not yet supported in TD):
      SELECT train_xgboost_classifier(features, label) AS (model_id, model)
      FROM training_data;

      SELECT rowid, AVG(predicted) AS predicted
      FROM (
        -- predict with each model, joining each test record with each model
        SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
        FROM xgboost_models CROSS JOIN test_data_with_id
      ) t
      GROUP BY rowid;
  36. Hivemall + Digdag
  37. Machine Learning Workflow using Digdag
  38. Machine Learning Workflow using Digdag