
Design Patterns for Large-Scale Real-Time Learning


Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1jupcVS.

Sean Owen provides examples of operational analytics projects in the field, presenting a reference architecture and algorithm design choices for a successful implementation based on his experience with customers and Oryx/Cloudera. Filmed at qconlondon.com.

Sean Owen is Director of Data Science at Cloudera, based in London. Before Cloudera, he founded Myrrix Ltd, a company commercializing large-scale real-time recommender systems on Apache Hadoop. He has been a primary committer and VP for Apache Mahout, and co-author of Mahout in Action. Previously, Sean was a senior engineer at Google.

Published in: Technology, Education

Design Patterns for Large-Scale Real-Time Learning

  2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month • Watch the video with slide synchronization on InfoQ.com: http://www.infoq.com/presentations/pattern-real-time-learning • http://www.infoq.com/presentations/nasa-big-data
  3. Presented at QCon London (www.qconlondon.com) • Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation • Strategy: practitioner-driven conference designed for YOU: influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators • Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide
  4. Design Patterns for Large-Scale Real-Time Learning • QCon London 2014 • Sean Owen / Director of Data Science / Cloudera
  5. What We Talk About When We Talk About Data Science
  6. www.quora.com/Data-Science/What-is-the-difference-between-a-data-scientist-and-a-statistician
  9. Data Science Is Exploratory Analytics? • www.tc.umn.edu/~zief0002/Comparing-Groups/blog.html • thenextweb.com/microsoft/2013/07/08/microsoft-brings-the-office-store-to-22-new-markets-adds-power-bi-an-intelligence-tool-to-office-365/
  10. Example: Drug Interactions • Cloudera analysis of FDA drug data: “Our analysis revealed a few drug pairs with surprisingly high correlations with adverse events that did not show up in a search of the academic literature: gabapentin (a seizure medication) taken in conjunction with hydrocodone/paracetamol was correlated with memory impairment, and haloperidol in conjunction with lorazepam was correlated with the patient entering into a coma.” • blog.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/
  12. Example: Data Science in the Field • [Large European e-commerce site] • Wants real-time recommendations for new and returning users • Data streamed from web server via Flume to HDFS • Multiple data sources • 100K+ products, 20M users • Exploratory?
  13. Example: • Search, ML over Patient Data • MapReduce for indexing, learning • HBase for storage and fast access • Also: Storm for incremental update • And: relational DB for most recent derived data • API façade for input; API for querying learning • engineering.cerner.com/2013/02/near-real-time-processing-over-hadoop-and-hbase/ • Engineering • Machine Learning
  14. Adding Operational Analytics
  15. 2014: Lab to Factory
  16. Data Science Will Be Operational Analytics
  17. I Built A Model On Hadoop. Now What? • Build Model • Query Model • Collect Input • Repeat • ? ? ?
  18. Example: Oryx
  19. www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg
  20. Gaps to fill, and Goals • Model Building • Large-scale • Continuous • Apache Hadoop™-based • Few, good algorithms • Model Serving • Real-time query • Real-time update • Algorithms • Parallelizable • Updateable • Works on diverse input • Interoperable • PMML model format • Simple REST API • Open source
  21. Large-Scale or Real-Time? • Large-Scale / Offline / Batch vs Real-Time / Online / Streaming • Why Don’t We Have Both? • λ!
  22. Lambda Architecture • Batch, Stream Processing are different • Tackle separately in 2+ Layers • Batch Layer: offline, asynchronous • Serving / Speed Layer: real-time, incremental, approximate • jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting … • λ?
  23. Batch • Serving/Speed
  24. Two Layers • Computation Layer • Java-based server process • Client of Hadoop 2.x • Periodically builds “generation” from recent data and past model • Baby-sits MapReduce* jobs (or, locally in-core) • Publishes models • Serving Layer • Apache Tomcat™-based server process • Consumes models from HDFS (or local FS) • Serves queries from model in memory • Updates from new input • Also writes input to HDFS • Replicas for scale • * Apache Spark later
  25. Collaborative Filtering : ALS • Alternating Least Squares • Latent-factor model • Accepts implicit or explicit feedback • Real-time update via fold-in of input • No cold-start • Parallelizable • [Figure labels: X, Y^T]
  26. Clustering : k-means++ • Well-known and understood • Parallelizable • Clusters updateable • cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
  27. Classification / Regression : RDF • Random Decision Forests • Ensemble method • Numeric, categorical features and target • Very parallel • Nodes updateable • Works well on many problems • [Figure: decision tree with nodes “age > 30”, “female?” and “income > 20000”, branches labeled Yes / No]
  28. PMML • Predictive Modeling Markup Language • XML-based format for predictive models • Standardized by Data Mining Group (www.dmg.org) • Wide tool support • <PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1"> <Header copyright="www.dmg.org"/> <DataDictionary numberOfFields="5"> <DataField name="temperature" optype="continuous" dataType="double"/> … </DataDictionary> <TreeModel modelName="golfing" functionName="classification"> <MiningSchema> <MiningField name="temperature"/> … </MiningSchema> <Node score="will play"> <Node score="will play"> <SimplePredicate field="outlook" operator="equal" value="sunny"/> … </Node> </Node> </TreeModel> </PMML> • www.dmg.org/v4-1/TreeModel.html
  29. Extra: Apache Spark as “Crossover Hit” • Exploratory-friendly • REPL • Scala closures • MLlib • Operational-friendly • Distributed • Hadoop integration • All Java libraries available • blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
  30. Thanks! • ?
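
The split described on the "Lambda Architecture" and "Two Layers" slides can be illustrated with a toy model. The sketch below is not Oryx code: the class names, the in-memory list standing in for HDFS, and the choice of item counts as the "model" are all assumptions made for illustration. Because the model is a plain count, the incremental update here happens to be exact, whereas the slides describe the speed layer as approximate.

```python
from collections import Counter


class BatchLayer:
    """Offline layer: rebuilds a model 'generation' from all stored input."""

    def __init__(self):
        self.stored_events = []  # hypothetical stand-in for input persisted to HDFS

    def append(self, item_id):
        self.stored_events.append(item_id)

    def build_generation(self):
        # Full recomputation over everything seen so far.
        return Counter(self.stored_events)


class ServingLayer:
    """Online layer: answers queries from memory and applies new input at once."""

    def __init__(self, batch_layer):
        self.batch_layer = batch_layer
        self.generation = Counter()  # last model published by the batch layer
        self.recent = Counter()      # incremental updates since that generation

    def ingest(self, item_id):
        self.recent[item_id] += 1         # real-time, incremental update
        self.batch_layer.append(item_id)  # also persist input for the next batch run

    def load_generation(self):
        # The new generation already covers everything in `recent`, so reset it.
        self.generation = self.batch_layer.build_generation()
        self.recent = Counter()

    def top_items(self, n):
        merged = self.generation + self.recent
        return [item for item, _ in merged.most_common(n)]


batch = BatchLayer()
serving = ServingLayer(batch)
for item in ["a", "b", "a", "c", "a", "b"]:
    serving.ingest(item)
before = serving.top_items(2)   # served from incremental updates only

serving.load_generation()       # batch layer publishes a generation
for _ in range(3):
    serving.ingest("c")
after = serving.top_items(2)    # generation merged with newer input
```

Queries never wait for the batch rebuild: they always see the last published generation plus whatever arrived afterwards.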
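
The ALS slide mentions "real-time update via fold-in of input" without giving details. One common way to fold in a new user, shown here only for the explicit-feedback case, is to hold the item factors fixed and solve a small regularized least-squares problem for that user's vector. The slides do not say that Oryx does exactly this; the function names, the `reg` parameter and the item factors below are hypothetical.

```python
def solve(a, b):
    """Solve the small dense system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fold_in_user(item_factors, ratings, reg=0.1):
    """Estimate a new user's latent vector with the item factors held fixed.

    item_factors: dict item_id -> list of k floats (rows of Y from a batch model)
    ratings: dict item_id -> rating observed for this user
    Solves (Y_s^T Y_s + reg * I) x = Y_s^T r over the rated items s.
    """
    k = len(next(iter(item_factors.values())))
    a = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for item_id, rating in ratings.items():
        y = item_factors[item_id]
        for i in range(k):
            b[i] += rating * y[i]
            for j in range(k):
                a[i][j] += y[i] * y[j]
    return solve(a, b)


def predict(user_vector, item_vector):
    return sum(u * v for u, v in zip(user_vector, item_vector))


# Hypothetical item factors, as if produced by an earlier batch ALS run.
Y = {"film_a": [1.0, 0.0], "film_b": [0.0, 1.0], "film_c": [1.0, 1.0]}
new_user = fold_in_user(Y, {"film_a": 5.0, "film_b": 1.0}, reg=0.0)
score_c = predict(new_user, Y["film_c"])
```

Only a k-by-k system is solved per update, which is what makes this cheap enough to run in a serving process while the full factorization is left to the batch layer.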
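
Since PMML is plain XML, as the PMML slide notes, a consumer can inspect a model with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a trimmed variant of the slide's example. The parts the slide elides with "…" are replaced by assumed content (the `outlook` data field and `numberOfFields="2"` are illustrative choices), and the document is not validated against the PMML schema.

```python
import xml.etree.ElementTree as ET

# Trimmed variant of the slide's example; the "outlook" DataField is an assumption.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="2">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    <DataField name="outlook" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      <MiningField name="outlook"/>
    </MiningSchema>
    <Node score="will play">
      <Node score="will play">
        <SimplePredicate field="outlook" operator="equal" value="sunny"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

PMML_NS = "http://www.dmg.org/PMML-4_1"
NS = {"pmml": PMML_NS}

root = ET.fromstring(PMML_DOC)

# Map each declared field to its data type.
fields = {
    field.get("name"): field.get("dataType")
    for field in root.findall("pmml:DataDictionary/pmml:DataField", NS)
}

# Locate the tree model and collect every simple predicate in its nodes.
model = root.find("pmml:TreeModel", NS)
predicates = [
    (p.get("field"), p.get("operator"), p.get("value"))
    for p in model.iter("{%s}SimplePredicate" % PMML_NS)
]
```

The default namespace in the root element applies to every child, so each lookup has to be namespace-qualified; forgetting this is the usual reason such queries return nothing.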
