
Anwar Rizal – Streaming & Parallel Decision Tree in Flink


Published in

Flink Forward 2015

Published in: Technology


  1. 1. Streaming & Parallel Decision Tree in Flink. anwar.rizal @anrizal
  2. 2. Outline: Motivation, Architecture, Decision Trees, Implementation, Conclusion
  3. 3. Motivation
  4. 4. Motivation Need a classifier system on streaming data: the data used for learning comes as a stream, and so does the data to be classified.
  5. 5. Motivation [Figure: a grid of ticket prices between $75 and $200, with some flights sold out]
  6. 6. Motivation [The same price grid, annotated with predictions: prices predicted to increase in zero to two days, this week, or next week]
  7. 7. Motivation [Figure: three routes, FRA–NYC, FRA–LON, FRA–MEX]
  8. 8. Motivation FRA–NYC needs attention: revenue decrease. FRA–LON needs attention: passenger decrease. FRA–MEX needs attention: revenue decrease and cost increase.
  9. 9. Motivation Need a classifier system on streaming data: the data used for learning comes as a stream, and so does the data to be classified.
  10. 10. Motivation MUST: The classifier is kept fresh: no need for separate batch learning/evaluation, and feedback is taken into account in real time, regularly. The classifier can be introspected: transparent model structure (e.g. know the tree and the information gain at each split point) and known expected performance (accuracy, precision, recall, AUC). Seamless support for the machine learning workflow: data preprocessing (up/down sampling, imputation, …), feature selection, model evaluation, cross-validation.
  11. 11. Motivation NICE TO HAVE: The classifier is immediately available: it can already predict during learning, and when a learning phase terminates, it starts another cycle of learning. The classifier has a meta-learning capability: it maintains several models with different parameters, so it is possible to learn about the learning capability of the models.
  12. 12. Motivation [Diagram of the cycle: learning, classifying during learning, end of learning, classifying, new cycle of learning]
  13. 13. Motivation [Diagram: labeled points feed a Stream Learner, which publishes a Classifier; the Classifying Application uses it to turn unlabeled points into predicted points]
  14. 14. Decision Trees
  15. 15. Decision Trees From origin to recent developments "Understand data by asking a sequence of questions": Classification and Regression Trees (CART), Breiman et al., 1984. "Pool decision trees to improve generalization": Random Forests, Breiman, 1999. "Let's play: pose estimation for Xbox's Kinect": Shotton et al., 2011.
  16. 16. Decision Trees Streaming Decision Trees "A classifier for streaming data with a bound": Hoeffding Tree (VFDT), Domingos & Hulten, 2000. "Use of approximate histograms for decision trees": Streaming Parallel Decision Tree, Ben-Haim & Tom-Tov, 2010.
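As a reminder (a standard result, not spelled out in the deck): the bound that gives the Hoeffding tree its name states that, for a split criterion with range R observed over n points, the true mean lies within ε of the observed mean with probability 1 − δ, where

```latex
\epsilon = \sqrt{\frac{R^2 \,\ln(1/\delta)}{2n}}
```

so a split can be committed as soon as the best candidate beats the runner-up by more than ε.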
  17. 17. Train a decision tree: get the intuition! [Figure: reservation subspace, advance purchase vs. cabin class (FIRST / BUSINESS / ECONOMY), supervised with Business vs. Leisure labels; regions labeled busy procrastinators, tourists, foreseeing businessmen, Brad Pitt, save money for the company]
  18. 18. Classifying: get the intuition! [Same reservation subspace; the tree assigns Business or Leisure to new points, together with a confidence measure]
  19. 19. Decision tree: node optimization [Same reservation subspace; each candidate split is scored by its information gain]
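The information gain being maximized is the standard entropy-based one (our notation, not the deck's): for a split s of the node data D into D_L and D_R,

```latex
IG(D; s) = H(D) - \frac{|D_L|}{|D|}\,H(D_L) - \frac{|D_R|}{|D|}\,H(D_R),
\qquad H(D) = -\sum_{k} p_k \log_2 p_k
```

where p_k is the fraction of points in D with class k.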
  20. 20. Decision Trees Streaming Decision Trees The batch version of decision trees requires a view of the full learning data set. In streaming, each point can only be seen once, and the processing should be fast: we can't afford too many disk accesses.
  21. 21. Decision Trees Streaming Decision Tree: get the intuition! Instead of using every point, the points are compressed; the real position of each point is then only approximated.
  22. 22. Streaming decision tree: get the intuition! [Same reservation subspace, with a count of points kept near each region: busy procrastinators, tourists, foreseeing businessmen; Business vs. Leisure supervision]
  23. 23. Decision Trees Streaming Decision Tree: the question "How to find split points for a decision tree?"
  24. 24. Decision Trees Compressing Data [Figure: approximate histogram of feature 1 for label 1; bin counts over centers 2, 5, 7.5, 9, 11] An approximate histogram is built for each label/feature pair.
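The compression step can be sketched as the fixed-budget histogram of Ben-Haim & Tom-Tov: every point enters as a singleton bin, and the two closest neighboring bins are fused until the budget is met. This is an illustrative sketch under that assumption; the names (`Bin`, `ApproxHistogram`) are ours, not the talk's code.

```scala
// One histogram bin: a representative center and how many points it absorbed.
final case class Bin(center: Double, count: Long)

final case class ApproxHistogram(maxBins: Int, bins: Vector[Bin]) {

  // Add one point as a singleton bin, then shrink back to the budget.
  def update(p: Double): ApproxHistogram =
    copy(bins = compress((bins :+ Bin(p, 1L)).sortBy(_.center)))

  // Merge two histograms by pooling their bins and compressing.
  def merge(other: ApproxHistogram): ApproxHistogram =
    copy(bins = compress((bins ++ other.bins).sortBy(_.center)))

  // Repeatedly fuse the two closest neighboring bins into one bin
  // whose center is the count-weighted average of the pair.
  private def compress(bs: Vector[Bin]): Vector[Bin] = {
    var cur = bs
    while (cur.size > maxBins) {
      val i = (0 until cur.size - 1).minBy(j => cur(j + 1).center - cur(j).center)
      val (a, b) = (cur(i), cur(i + 1))
      val fused = Bin((a.center * a.count + b.center * b.count) / (a.count + b.count),
                      a.count + b.count)
      cur = (cur.take(i) :+ fused) ++ cur.drop(i + 2)
    }
    cur
  }
}
```

Total counts are preserved by fusion, so the histogram stays an exact summary of how many points were seen, and only their positions are approximated.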
  25. 25. Decision Trees Prepare Split Candidates (1/2) [Figure: the histograms of feature 1 for each label, and the merged "Total" histogram] For each feature, the histograms of that feature are merged across all labels.
  26. 26. Decision Trees Prepare Split Candidates (2/2) [Figure: the "Total" histogram with candidate points u1, u2] Get the split candidates such that each interval between two consecutive candidates contains the same number of points (the colored areas are equally large).
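A simplified stand-in for this equal-mass candidate selection (the full "uniform" procedure of the paper interpolates within bins; here we just take the center of the bin where each target cumulative count is reached, with hypothetical names):

```scala
final case class Bin(center: Double, count: Long)

// Pick `numCandidates` split candidates so that roughly equal mass
// lies between consecutive candidates, using cumulative bin counts.
def splitCandidates(bins: Vector[Bin], numCandidates: Int): Vector[Double] = {
  val total = bins.map(_.count).sum.toDouble
  val cum = bins.scanLeft(0L)(_ + _.count).tail  // cumulative count after each bin
  (1 to numCandidates).toVector.map { k =>
    val target = k * total / (numCandidates + 1)  // k-th equal-mass quantile
    val i = cum.indexWhere(_ >= target)           // first bin reaching that mass
    bins(i).center
  }
}
```

On four equal bins at centers 1–4, three candidates land at 1.0, 2.0 and 3.0, i.e. the boundaries of four equal-mass intervals.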
  27. 27. Decision Trees Determine Split [Figure: the "Total" histogram with candidates u1, u2] Among the candidates, find the split point that maximizes the information gain, computed from the per-feature/label histograms.
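Scoring a candidate can be sketched as entropy-based information gain over the per-label counts falling on each side of the candidate (illustrative helpers; the talk's own `maxInformationGain` is not shown in full):

```scala
// Entropy of a class-count vector, in bits.
def entropy(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}

// Information gain of splitting a node with `parent` class counts
// into two sides with `left` and `right` class counts.
def informationGain(parent: Seq[Long], left: Seq[Long], right: Seq[Long]): Double = {
  val n = parent.sum.toDouble
  entropy(parent) - (left.sum / n) * entropy(left) - (right.sum / n) * entropy(right)
}
```

A perfectly separating candidate on a balanced two-class node scores 1 bit; a candidate that leaves both sides with the parent's class mix scores 0.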
  28. 28. The intuition is not exactly precise [Same reservation subspace] After a split, the histograms can no longer be used for further splits, and of course we have already lost the original data.
  29. 29. Decision Tree Subsequent Split: get the intuition! A different data set is used for each iteration.* (*If there are not enough data, the same data can be reinjected instead; Kafka is very good for this.)
  30. 30. Implementation
  31. 31. Implementation Stream Learner
  32. 32. Implementation Stream Learner We use two Kafka streams: • one for the labeled data stream • one for the tree developed so far (this topic is also used by the classifying applications) The two are connected because we need to annotate each message with the latest tree.
  33. 33. Implementation Code Outline
     val kafkaDataStream: DataStream[Point] = …
     val kafkaTreeStream: DataStream[Node] = …

     // annotate each message with the latest tree
     val annotatedDataStream: DataStream[AnnotatedPoint] =
       (kafkaDataStream connect kafkaTreeStream)
         .flatMap(new AnnotateMessageCoFlatMap(…))

     // create one singleton histogram per feature/node, pre-reduced per time window
     val histograms = annotatedDataStream
       .map { p => toSingletonHistograms(p) }
       .timeWindowAll(Time.of(1, TimeUnit.MINUTES))
       .reduce { (n1, n2) => mergeHistogram(n1, n2) }

     // merge the windowed histograms per node
     val mergedHistograms = histograms
       .keyBy(_.id)
       .reduce { (n1, n2) => mergeHistogram(n1, n2) }

     // grow the tree where a node has enough points and is worth splitting
     val newTree = mergedHistograms
       .filter(hs => haveEnoughPoints(hs) && toSplit(hs))
       .map { n =>
         val splitPoint = maxInformationGain(calculateSplitCandidates(n))
         …
       }
  34. 34. [Figure: val histogram ➔ accumulate; var histogram ➔ re-accumulate from 0]
  35. 35. Conclusion
  36. 36. Conclusions Summary Streaming algorithms based on approximate histograms were explained. Streaming decision tree algorithms open the way to classifiers with interesting properties: freshness and continuous learning. Flink, together with Kafka, allows a clean implementation of the algorithm.
  37. 37. Conclusions Next Steps Random forests: trees with randomly selected features at each level; trees with different spans of data (trees with more but older data might behave worse than trees with less but fresher data: forgetting capabilities); providing information about which type of tree behaves better at a given period of time (meta-learning).
  38. 38. Thanks! Credit to: Yiqing Yan (Eurecom) & Tianshu Yang (Telecom Bretagne), Amadeus Interns
