What's new in Apache Hivemall v0.5.0

April 17, 2018 at Dots

  Hivemall v0.5.0
    Makoto YUI (@myui), Research Engineer, Treasure Data
    @ApacheHivemall
  What is Apache Hivemall?
    A scalable machine learning library built as a collection of Hive UDFs.
    Key properties: multi/cross-platform, versatile, scalable, ease of use.
  Hivemall is easy and scalable
    Ease-of-use: ML made easy for SQL developers
    Scalable: born to be parallel and scalable
    100+ lines of code
      CREATE TABLE lr_model AS
      SELECT
        feature, -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features,label,..) as (feature,weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
    This query automatically runs in parallel on Hadoop.
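The model-averaging step performed by `avg(weight) ... GROUP BY feature` can be illustrated outside Hive. The following is a minimal Python sketch, assuming each parallel learner emits a list of (feature, weight) pairs; the function name and sample data are hypothetical, and this is not Hivemall's implementation.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average weights per feature, mirroring avg(weight) GROUP BY feature."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model:
            sums[feature] += weight
            counts[feature] += 1
    # A feature is averaged only over the learners that actually emitted it.
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical outputs of two map tasks.
mapper_outputs = [
    [("price", 0.5), ("size", 1.0)],
    [("price", 1.5)],
]
averaged = average_models(mapper_outputs)
```

As in the SQL query, a feature that appears in only one partial model keeps that model's weight unchanged.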
  Hivemall is a multi/cross-platform ML library
    Interfaces: HiveQL, SparkSQL/Dataframe API, Pig Latin
    Prediction models built by Hive can be used from Spark and, conversely, prediction models built by Spark can be used from Hive.
  Hivemall on Apache Hive
  Hivemall on Apache Spark Dataframe
  Hivemall on SparkSQL
  Hivemall on Apache Pig
  Online Prediction by Apache Streaming
  What's new in v0.5.0?
    Anomaly/Change Point Detection. Algorithms: ChangeFinder, SST
    Topic Modeling (Soft Clustering). Algorithms: LDA, pLSA
    Hivemall on Spark v2.0/v2.1/v2.2: SparkSQL/Dataframe support, top-k data processing
  Generic Classifier/Regressor OLD Style New Style from v0.5.0
  Generic Classifier/Regressor: available loss functions and other options
    Loss functions for binary classification:
    • HingeLoss
    • LogLoss (synonym: logistic)
    • SquaredHingeLoss
    • ModifiedHuberLoss
    Loss functions for regression:
    • Squared Loss
    • Quantile Loss
    • Epsilon Insensitive Loss
    • Squared Epsilon Insensitive Loss
    • Huber Loss
    Optimizer:
    • SGD
    • AdaGrad
    • AdaDelta
    • ADAM
    Regularization:
    • L1
    • L2
    • ElasticNet
    • RDA
    Other options:
    • Iteration support
    • mini-batch
    • Early stopping
  Generic Classifier/Regressor Hyperparameters
    -eta0 <arg>
        The initial learning rate [default 0.1]
    -iter,--iterations <arg>
        The maximum number of iterations [default: 10]
    -lambda <arg>
        Regularization term [default 0.0001]
    -loss,--loss_function <arg>
        Loss function [HingeLoss (default), LogLoss, SquaredHingeLoss, ModifiedHuberLoss, or a regression loss: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss]
    -mini_batch,--mini_batch_size <arg>
        Mini batch size [default: 1]. Expecting the value in range [1,100] or so.
    -opt,--optimizer <arg>
        Optimizer to update weights [default: adagrad, sgd, adadelta, adam]
    -reg,--regularization <arg>
        Regularization type [default: rda, l1, l2, elasticnet]
    AdaGrad + RDA is used by default.
  RandomForest in Hivemall Ensemble of Decision Trees
  What's OOB in RandomForests?
    uniform/stratified sampling
    Image borrowed from http://alfredplpl.hatenablog.com/entry/2013/12/24/225420
  Stratified Sampling
    Source: https://bellcurve.jp/statistics/course/8007.html
  What's OOB in RandomForests?
    Out-of-bag (OOB) data, i.e., data not used for training, is used to evaluate the model's accuracy.
    Source: http://alfredplpl.hatenablog.com/entry/2013/12/24/225420
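The out-of-bag idea can be sketched in a few lines of Python. This assumes standard bagging (uniform sampling with replacement, sample size equal to the dataset size); the function name is hypothetical and Hivemall's own sampling is governed by its training options, so treat this only as an illustration of the concept.

```python
import random


def bootstrap_with_oob(n_rows, seed):
    """Draw a bootstrap sample of row indices and return the rows left out."""
    rng = random.Random(seed)
    # Sampling with replacement: some rows repeat, others are never drawn.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Rows never drawn are "out of bag" and can serve as a validation set
    # for the tree trained on the in-bag rows.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_with_oob(10, seed=42)
```

Each tree gets its own in-bag/out-of-bag split, so the OOB error can be estimated without holding out a separate test set.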
  Training of RandomForest
    train_randomforest_classifier(array<double|string> features, int label [, const string options, const array<double> classWeights])
    Good news: sparse vector input (LIBSVM format) is supported since v0.5.0, in addition to dense vectors!
  Supported Feature Vector Format of Random Forests
    • Dense Vector (array<double>)
    • Sparse Vector (array<string>) in a LIBSVM format
    • feature := <index>[":"<value>]
      where index := <integer> starting with 1 (index = 0 is reserved for bias clause)
      and value := <floating point> (default 1.0 if not provided)
    Examples:
      1.0, 0.0, 3.0
      1:1.0, 2:0.0, 3:3.0
      1:1.0, 3:3.0
      1:1.0, 3
    Feature hashing:
      select feature_hashing(array("userid#4505:3.3","movieid#2331:4.999", "movieid#2331"));
      ["1828616:3.3","6238429:4.999","6238429"]
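The sparse feature grammar above can be made concrete with a small parser. This is a Python sketch with hypothetical function names; it assumes that sparse index 1 corresponds to the first dense element (as the paired examples suggest) and that an index-0 bias entry is simply skipped when densifying.

```python
def parse_feature(feature):
    """Parse '<index>[":"<value>]' into (index, value); value defaults to 1.0."""
    index_str, sep, value_str = feature.partition(":")
    index = int(index_str)
    value = float(value_str) if sep else 1.0
    return index, value


def sparse_to_dense(features, size):
    """Expand a sparse feature list into a dense vector of the given size."""
    dense = [0.0] * size
    for feature in features:
        index, value = parse_feature(feature)
        if index == 0:
            # Assumption: index 0 is the bias clause and has no dense slot.
            continue
        dense[index - 1] = value
    return dense


dense = sparse_to_dense(["1:1.0", "3:3.0"], 3)
```

Under these assumptions, `1:1.0, 3:3.0` and `1.0, 0.0, 3.0` describe the same vector.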
  Feature Engineering – Feature Hashing
  Random Forests Training Hyperparameters
    -attrs,--attribute_types <arg>
        Comma separated attribute types (Q for quantitative variable and C for categorical variable. e.g., [Q,C,Q,C])
    -depth,--max_depth <arg>
        The maximum number of the tree depth [default: Integer.MAX_VALUE]
    -leafs,--max_leaf_nodes <arg>
        The maximum number of leaf nodes [default: Integer.MAX_VALUE]
    -min_samples_leaf <arg>
        The minimum number of samples in a leaf node [default: 1]
    -rule,--split_rule <arg>
        Split algorithm [default: GINI, ENTROPY, CLASSIFICATION_ERROR]
    -seed <arg>
        Seed value in long [default: -1 (random)]
    -splits,--min_split <arg>
        A node that has greater than or equal to `min_split` examples will split [default: 2]
    -stratified,--stratified_sampling
        Enable stratified sampling for unbalanced data
    -subsample <arg>
        Sampling rate in range (0.0,1.0]
    -trees,--num_trees <arg>
        The number of trees for each task [default: 50]
    -vars,--num_variables <arg>
        The number of randomly selected features [default: ceil(sqrt(x[0].length))]. int(num_variables * x[0].length) is used if num_variables is in (0.0,1.0]
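The rule for `-vars` can be read as a small function. The Python sketch below follows the wording of the option description only; the function name is hypothetical, and treating values above 1.0 as an absolute count is an assumption rather than something the slide states.

```python
import math


def resolve_num_variables(num_features, num_variables=None):
    """Number of features sampled per split, following the -vars description."""
    if num_variables is None:
        # Default: ceil(sqrt(number of features)).
        return math.ceil(math.sqrt(num_features))
    if 0.0 < num_variables <= 1.0:
        # Values in (0.0, 1.0] are treated as a fraction of the features.
        return int(num_variables * num_features)
    # Assumption: larger values are an absolute number of features.
    return int(num_variables)
```

For example, with 10 features the default is 4, while `0.5` selects 5 features.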
  Prediction of RandomForest
    Posterior probability based on voting over the classes predicted by the decision trees
    Model reliability based on the OOB error rate
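Voting-based posterior probability can be sketched as follows. This is a simplified Python illustration with a hypothetical function name: it counts each tree's vote equally and does not weight trees by their OOB error rate.

```python
from collections import Counter


def vote(tree_predictions):
    """Return the majority class and the fraction of trees that voted for it."""
    counts = Counter(tree_predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(tree_predictions)


# Hypothetical class predictions from four decision trees for one row.
label, probability = vote([1, 0, 1, 1])
```

A weighted variant would multiply each vote by a per-tree reliability score before normalizing.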
  Decision Tree Visualization
    http://viz-js.com/
  Feature Engineering – Feature Binning
    Maps quantitative variables to a fixed number of bins based on quantiles/distribution.
    Example: map ages into 3 bins.
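Quantile-based binning can be sketched in Python as below. The function names, the nearest-rank choice of cut points, and the sample ages are all assumptions for illustration; this is not the algorithm Hivemall uses internally.

```python
import bisect


def quantile_cuts(values, num_bins):
    """Return the interior cut points that split values into num_bins quantile bins."""
    ordered = sorted(values)
    n = len(ordered)
    # Nearest-rank cut points at the k/num_bins quantiles.
    return [ordered[(k * n) // num_bins] for k in range(1, num_bins)]


def bin_index(value, cuts):
    """Map a value to its 0-based bin; values equal to a cut go to the upper bin."""
    return bisect.bisect_right(cuts, value)


ages = [18, 22, 25, 31, 38, 45, 52, 60, 71]
cuts = quantile_cuts(ages, 3)
bins = [bin_index(age, cuts) for age in ages]
```

With nine ages and three bins, each bin receives three values.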
  Evaluation Metrics
  Map tiling functions
    Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n
    Examples shown: Zoom=10, Zoom=15
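The tile formula can be tried out with a short Python sketch. It assumes that `xtile` and `ytile` follow the standard Web Mercator (slippy map) tile equations and that `n` in `2^n` is the zoom level; neither assumption is spelled out on the slide.

```python
import math


def xtile(lon, zoom):
    """Tile column for a longitude in degrees (Web Mercator, assumed)."""
    return int(math.floor((lon + 180.0) / 360.0 * (1 << zoom)))


def ytile(lat, zoom):
    """Tile row for a latitude in degrees (Web Mercator, assumed)."""
    lat_rad = math.radians(lat)
    # asinh(tan(lat)) equals ln(tan(lat) + sec(lat)).
    mercator_y = math.asinh(math.tan(lat_rad))
    return int(math.floor((1.0 - mercator_y / math.pi) / 2.0 * (1 << zoom)))


def tile(lat, lon, zoom):
    """Single tile number: xtile + ytile * 2^zoom (assuming n = zoom)."""
    return xtile(lon, zoom) + ytile(lat, zoom) * (1 << zoom)
```

Because the row is multiplied by the number of columns at that zoom level, each (x, y) pair maps to a distinct integer, which makes the tile number usable as a grouping key.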
  Sketch and NLP functions
    SELECT count(distinct id) FROM data
    SELECT approx_count_distinct(id) FROM data
    select tokenize_ja(" ", "normal", null, null, "https://s3.amazonaws.com/td-hivemall/dist/kuromoji-user-dict-neologd.csv.gz");
    [" ", "," "," "]
  Other Supported Features
    Anomaly Detection
    ✓ Local Outlier Factor (LoF)
    ✓ ChangeFinder
    Clustering / Topic models
    ✓ Online mini-batch LDA
    ✓ Online mini-batch PLSA
    Change Point Detection
    ✓ ChangeFinder
    ✓ Singular Spectrum Transformation
  Anomaly/Change-point Detection by ChangeFinder
    Efficient algorithm for finding change points and outliers in time-series data.
    J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series", IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
  Take this… Anomaly/Change-point Detection by ChangeFinder
  Anomaly/Change-point Detection by ChangeFinder …and do this!
  Change-point detection by Singular Spectrum Transformation
    • T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
    • T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.
  Online mini-batch LDA
  Probabilistic Latent Semantic Analysis - training
  Probabilistic Latent Semantic Analysis - predict
  Future work for v0.5.2 and later
    ✓ Word2Vec support
    ✓ Multi-class Logistic Regression
    ✓ Field-aware Factorization Machines
    ✓ SLIM recommendation
    ✓ Merge Brickhouse UDFs
    ✓ XGBoost support
    ✓ LightGBM support
    ✓ Gradient Boosting
    PR#91 PR#116 PR#58 PR#111 PR#135
  Brickhouse functions
    https://github.com/klout/brickhouse
    SELECT from_json(
      to_json(
        ARRAY(
          NAMED_STRUCT("country", "japan", "city", "tokyo"),
          NAMED_STRUCT("country", "japan", "city", "osaka")
        )
      ),
      'array<struct<city:string>>'
    )
  Prediction tracing of Decision Tree
    Trace how a prediction was made.
  XGBoost support in Hivemall
    Experimental. Not yet supported in TD.
    SELECT train_xgboost_classifier(features, label) as (model_id, model)
    FROM training_data

    SELECT rowid, AVG(predicted) as predicted
    FROM (
      -- predict with each model
      SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
      -- join each test record with each model
      FROM xgboost_models CROSS JOIN test_data_with_id
    ) t
    GROUP BY rowid;
  Conclusion and Takeaway
    Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs.
    Interfaces: HiveQL, SparkSQL/Dataframe API, Pig Latin
    Try out our first Apache release (v0.5.0)!
    We welcome your contributions to Apache Hivemall.
  Any feature request or questions? BTW, we are hiring!
  Hivemall Digdag
  Machine Learning Workflow using Digdag