KNITTING BOAR
    Machine Learning, Mahout, and Parallel Iterative Algorithms




    Josh Patterson
    Principal Solutions Architect




✛ Josh Patterson
   > Master’s Thesis: self-organizing mesh networks
       ∗   Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
   > Conceived, built, and led Hadoop integration for openPDC project
      at Tennessee Valley Authority (TVA)
   > Twitter: @jpatanooga

   > Email:    josh@cloudera.com
✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
Introduction to
    MACHINE LEARNING




✛ What is Data Mining?
  > “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
  > Raw data essentially useless
      ∗ Data is simply recorded facts
      ∗ Information is the patterns underlying the data

✛ Machine Learning
  > Algorithms for acquiring structural descriptions from
    data “examples”
      ∗ Process of learning “concepts”
✛ Information Retrieval
   > Draws on information science, information
     architecture, cognitive psychology, linguistics, and
     statistics.
✛ Natural Language Processing
  > grounded in machine learning, especially statistical
    machine learning
✛ Statistics
  > Math and stuff
✛ Machine Learning
  > Considered a branch of artificial intelligence
✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization



        “Descriptive Statistics”
✛ Don’t always assume you need “scale” and
  parallelization
  >   Try it out on a single machine first
  >   See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy
  machine?
✛ We can always use the constructed model
  back in MapReduce to score a ton of new
  data
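The last point can be sketched as a map-only scoring pass. This is a minimal illustration in plain Python rather than actual MapReduce code: the weights, record ids, and the `map_score` helper are all hypothetical, and the model is assumed to be a logistic regression trained elsewhere on a single machine.

```python
import math

def score(weights, features):
    """Logistic-regression score: sigmoid of the dot product weights . features."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def map_score(weights, records):
    """Map-style scoring: emit one (record_id, score) pair per input record.

    Each record is scored independently, so no reduce step is needed.
    """
    for record_id, features in records:
        yield record_id, score(weights, features)

# Hypothetical model (bias weight, one feature weight) and a few new records.
weights = [0.0, 2.0]
records = [("a", [1.0, 1.5]), ("b", [1.0, -1.5]), ("c", [1.0, 0.0])]
scored = dict(map_score(weights, records))
```

Because every record is scored on its own, the same function could be handed to any number of map tasks, each holding a read-only copy of the small model.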
✛   http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
    >   Looks to study data with descriptive statistics in the hopes of building models for
        predictive analytics

✛   Twitter does the majority of its ML work via custom Pig integrations
    >   Pipeline is very “Pig-centric”
    >   Example: https://github.com/tdunning/pig-vector
    >   They mostly use SGD and ensemble methods, which are conducive
        to large-scale data mining
✛   Questions they try to answer
    >   Is this tweet spam?
    >   What star rating might this user give this movie?
✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive
  or Pig
✛ ML work performed with
  >   SAS
  >   SPSS
  >   R
  >   Mahout
Introduction to
     MAHOUT
✛ Classification
   > “Fraud detection”
 ✛ Recommendation
   > “Collaborative
     Filtering”
 ✛ Clustering
   > “Segmentation”
 ✛ Frequent Itemset
     Mining


Copyright 2010 Cloudera Inc. All rights reserved
✛ Stochastic Gradient Descent
   > Single process
   > Logistic Regression Model Construction
 ✛ Naïve Bayes
   > MapReduce-based
   > Text Classification
 ✛ Random Forests
   > MapReduce-based




✛ An algorithm that looks at a user’s past actions
  and suggests
   > Products
   > Services
   > People
✛ Advertisement
  > Cloudera has a great Data Science training course on
    this topic
  > http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
✛   Why Machine Learning?
    >   Growing interest in predictive modeling

✛   Linear Models are Simple, Useful
    >   Stochastic Gradient Descent is a very popular tool for
        building linear models like Logistic Regression

✛   Building Models Is Still Time Consuming
    >   The “Need for speed”
    >   “More data beats a cleverer algorithm”
Introducing
KNITTING BOAR




✛ Parallelize Mahout’s Stochastic Gradient Descent
  >   With as few extra dependencies as possible

✛ Wanted to explore parallel iterative algorithms
   using YARN
  >   Wanted a first-class Hadoop YARN citizen
  >   Work through dev progressions towards a stable state
  >   Worry about “frameworks” later
✛ Training
    > Simple gradient descent procedure
    > Loss function needs to be convex
 ✛ Prediction
   > Logistic Regression:
       ∗ Sigmoid function using parameter vector (dot)
         example as exponential parameter
[Diagram: Training Data → SGD → Model]

Current Limitations
 ✛ Sequential algorithms on a single node only
   go so far
 ✛ The “Data Deluge”
     > Presents algorithmic challenges when combined with
       large data sets
     > need to design algorithms that are able to perform in
       a distributed fashion
 ✛ MapReduce only fits certain types of algorithms




Distributed Learning Strategies
 ✛ Langford, 2007
    > Vowpal Wabbit
 ✛ McDonald 2010
   > Distributed Training Strategies for the Structured Perceptron
 ✛ Dekel 2010
   > Optimal Distributed Online Prediction Using Mini-Batches




[Diagram: MapReduce (Input → Map, Map, Map → Reduce, Reduce → Output) shown beside an iterative model in which rows of Processors run Superstep 1, Superstep 2, . . .]


“Are the gains gotten from using X worth the integration
     costs incurred in building the end-to-end solution?

     If no, then operationally, we can consider the Hadoop
     stack …

     there are substantial costs in knitting together a
     patchwork of different frameworks, programming
     models, etc.”

     –– Lin, 2012



✛ Parallel Iterative implementation of SGD on
     YARN

 ✛ Workers work on partitions of the data
 ✛ Master keeps global copy of merged parameter
     vector




✛ Each worker is given a split of the total dataset
   > Similar to a map task
 ✛ Using a modified OLR
   > process N samples in an epoch (subset of split)
 ✛ Local parameter vector sent to master node
    > Master averages all workers’ vectors together




✛ Master gathers and averages worker parameter vectors
   > From worker OLR runs
 ✛ Produces new global parameter vector
   > By averaging workers’ vectors
 ✛ Sends update to all workers
   > Workers replace local parameter vector with new
     global parameter vector
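The worker and master roles described above can be simulated in one process. The sketch below is plain Python under stated assumptions, not Knitting Boar's code: `worker_epoch`, `master_average`, the learning rate, the iteration count, and the toy splits are all hypothetical, and each "epoch" here simply covers the worker's whole split.

```python
import math

def sigmoid(z):
    """Logistic function, clamped so math.exp cannot overflow."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def worker_epoch(global_weights, split, learning_rate=0.1):
    """Worker: start from the global vector, run SGD over the local split,
    and return the resulting local parameter vector."""
    local = list(global_weights)
    for example, label in split:
        p = sigmoid(sum(w * x for w, x in zip(local, example)))
        for j, x in enumerate(example):
            local[j] += learning_rate * (label - p) * x
    return local

def master_average(worker_vectors):
    """Master: element-wise mean of the workers' parameter vectors."""
    n = len(worker_vectors)
    return [sum(column) / n for column in zip(*worker_vectors)]

def train(splits, num_features, iterations=20):
    """Alternate worker epochs and master averaging for a fixed number of rounds."""
    global_weights = [0.0] * num_features
    for _ in range(iterations):
        local_vectors = [worker_epoch(global_weights, s) for s in splits]
        # Every worker replaces its local vector with this new global one.
        global_weights = master_average(local_vectors)
    return global_weights

# Toy data split three ways: feature 0 is a bias term; label is 1 when x > 0.
examples = [([1.0, x / 10.0], 1 if x > 0 else 0) for x in range(-10, 11) if x != 0]
splits = [examples[i::3] for i in range(3)]
weights = train(splits, num_features=2)
```

In a real deployment the list comprehension over `splits` would be replaced by workers running in separate containers, with the averaging step acting as a synchronization point between rounds.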




✛ ComputableMaster
   > Setup()
   > Compute()
   > Complete()
 ✛ ComputableWorker
   > Setup()
   > Compute()
[Diagram: rounds alternate between a row of Workers and the Master, repeating]
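To make the lifecycle concrete, here is a small Python sketch of how a master/worker pair with these method names might be driven. Only the names ComputableMaster, ComputableWorker, Setup, Compute, and Complete come from the slide; the method signatures, the `run` driver, and the toy `MeanWorker` and `AveragingMaster` classes are assumptions for illustration and do not reflect IterativeReduce's actual Java API.

```python
from abc import ABC, abstractmethod

class ComputableWorker(ABC):
    """Hypothetical worker contract: set up on a split, then compute each round."""
    @abstractmethod
    def setup(self, split): ...
    @abstractmethod
    def compute(self, global_update): ...

class ComputableMaster(ABC):
    """Hypothetical master contract: merge worker updates, then finish."""
    @abstractmethod
    def setup(self): ...
    @abstractmethod
    def compute(self, worker_updates): ...
    @abstractmethod
    def complete(self): ...

class MeanWorker(ComputableWorker):
    """Toy worker: moves its value halfway toward the mean of its split."""
    def setup(self, split):
        self.target = sum(split) / len(split)
        self.value = 0.0
    def compute(self, global_update):
        if global_update is not None:
            self.value = global_update  # adopt the master's merged value
        self.value += 0.5 * (self.target - self.value)
        return self.value

class AveragingMaster(ComputableMaster):
    """Toy master: averages the workers' values each round."""
    def setup(self):
        self.value = None
    def compute(self, worker_updates):
        self.value = sum(worker_updates) / len(worker_updates)
        return self.value
    def complete(self):
        return self.value

def run(master, workers, splits, rounds):
    """Drive the rounds: all workers compute, then the master merges."""
    master.setup()
    for worker, split in zip(workers, splits):
        worker.setup(split)
    update = None
    for _ in range(rounds):
        partials = [worker.compute(update) for worker in workers]
        update = master.compute(partials)
    return master.complete()

result = run(AveragingMaster(), [MeanWorker(), MeanWorker()],
             [[1.0, 3.0], [5.0, 7.0]], rounds=30)
```

With these toy classes the merged value approaches 4.0, the mean of all four numbers, which mirrors how averaged parameter vectors settle over repeated rounds.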




[Diagram: OnlineLogisticRegression (Training Data → OnlineLogisticRegression → Model) compared with Knitting Boar’s POLR (Split 1, Split 2, Split 3 → Worker 1, Worker 2, … Worker N → Partial Models → Master → Global Model)]

[Chart: Input Size vs Processing Time, comparing OLR and POLR]


Knitting Boar
     PARTING THOUGHTS




✛ Parallel SGD
   > The Boar is temperamental, experimental
       ∗ Linear speedup (roughly)

 ✛ Developing YARN Applications
   > More complex than just MapReduce
   > Requires lots of “plumbing”
 ✛ IterativeReduce
    > Great native-Hadoop way to implement algorithms
    > Easy to use and well integrated



✛ Knitting Boar
   > https://github.com/jpatanooga/KnittingBoar
   > 100% Java
   > ASF 2.0 Licensed
   > Quick Start
       ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start

 ✛ IterativeReduce
    > https://github.com/emsixteeen/IterativeReduce
    > 100% Java
    > ASF 2.0 Licensed


✛ Machine Learning is hard
       > Don’t believe the hype
       > Do the work
     ✛ Model development takes
       time
       > Lots of iterations
       > Speed is key here


        Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg



✛ Strata / Hadoop World 2012 Slides
   > http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
 ✛ Mahout’s SGD implementation
   > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
 ✛ MapReduce is Good Enough? If All You Have is
     a Hammer, Throw Away Everything That’s Not a
     Nail!
     > http://arxiv.org/pdf/1209.2191v1.pdf


✛ Langford
    > http://hunch.net/~vw/
 ✛ McDonald, 2010
   > http://dl.acm.org/citation.cfm?id=1858068




✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
 ✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
 ✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg





Deck: Knitting Boar - Toronto and Boston HUGs - Nov 2012

Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNJosh Patterson
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNDataWorks Summit
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
 
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceA simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceIRJET Journal
 
Cloud computing_processing frameworks
Cloud computing_processing frameworksCloud computing_processing frameworks
Cloud computing_processing frameworksReem Abdel-Rahman
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspectiveপল্লব রায়
 
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...Jason Hearne-McGuiness
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGMapR Technologies
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning InfrastructureSigOpt
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals Vrushali Lanjewar
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Parallel Computing: Perspectives for more efficient hydrological modeling
Parallel Computing: Perspectives for more efficient hydrological modelingParallel Computing: Perspectives for more efficient hydrological modeling
Parallel Computing: Perspectives for more efficient hydrological modelingGrigoris Anagnostopoulos
 
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningA Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningMakoto Yui
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019VMware Tanzu
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
Lessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at NetflixLessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at NetflixJustin Basilico
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learningArnaud Rachez
 

Semelhante a Knitting boar - Toronto and Boston HUGs - Nov 2012 (20)

Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceA simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
 
Cloud computing_processing frameworks
Cloud computing_processing frameworksCloud computing_processing frameworks
Cloud computing_processing frameworks
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspective
 
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning Infrastructure
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Parallel Computing: Perspectives for more efficient hydrological modeling
Parallel Computing: Perspectives for more efficient hydrological modelingParallel Computing: Perspectives for more efficient hydrological modeling
Parallel Computing: Perspectives for more efficient hydrological modeling
 
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningA Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Lessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at NetflixLessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at Netflix
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
 

Knitting boar - Toronto and Boston HUGs - Nov 2012

  • 1. KNITTING BOAR: Machine Learning, Mahout, and Parallel Iterative Algorithms. Josh Patterson, Principal Solutions Architect
  • 2. ✛ Josh Patterson > Master’s Thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA) > Twitter: @jpatanooga > Email: josh@cloudera.com
  • 3. ✛ Introduction to Machine Learning ✛ Mahout ✛ Knitting Boar and YARN ✛ Parting Thoughts
  • 4. Introduction to MACHINE LEARNING
  • 5. ✛ What is Data Mining? > “the process of extracting patterns from data” ✛ Why are we interested in Data Mining? > Raw data is essentially useless ∗ Data is simply recorded facts ∗ Information is the patterns underlying the data ✛ Machine Learning > Algorithms for acquiring structural descriptions from data “examples” ∗ Process of learning “concepts”
  • 6. ✛ Information Retrieval > information science, information architecture, cognitive psychology, linguistics, and statistics. ✛ Natural Language Processing > grounded in machine learning, especially statistical machine learning ✛ Statistics > Math and stuff ✛ Machine Learning > Considered a branch of artificial intelligence
  • 7. ✛ ETL ✛ Joining multiple disparate data sources ✛ Filtering data ✛ Aggregation ✛ Cube materialization “Descriptive Statistics”
  • 8. ✛ Don’t always assume you need “scale” and parallelization > Try it out on a single machine first > See if it becomes a bottleneck! ✛ Will the data fit in memory on a beefy machine? ✛ We can always use the constructed model back in MapReduce to score a ton of new data
  • 9. ✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf > Looks to study data with descriptive statistics in the hopes of building models for predictive analytics ✛ Does the majority of ML work via custom Pig integrations > Pipeline is very “Pig-centric” > Example: https://github.com/tdunning/pig-vector > They mostly use SGD and ensemble methods, which are conducive to large-scale data mining ✛ Questions they try to answer > Is this tweet spam? > What star rating might this user give this movie?
  • 10. ✛ Data collection performed with Flume ✛ Data cleansing / ETL performed with Hive or Pig ✛ ML work performed with > SAS > SPSS > R > Mahout
  • 12. ✛ Classification > “Fraud detection” ✛ Recommendation > “Collaborative Filtering” ✛ Clustering > “Segmentation” ✛ Frequent Itemset Mining. Copyright 2010 Cloudera Inc. All rights reserved
  • 13. ✛ Stochastic Gradient Descent > Single process > Logistic Regression Model Construction ✛ Naïve Bayes > MapReduce-based > Text Classification ✛ Random Forests > MapReduce-based
  • 14. ✛ An algorithm that looks at a user’s past actions and suggests > Products > Services > People ✛ Advertisement > Cloudera has a great Data Science training course on this topic > http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
  • 15. ✛ Cluster words across docs to identify topics ✛ Latent Dirichlet Allocation
  • 16. ✛ Why Machine Learning? > Growing interest in predictive modeling ✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression ✛ Building Models Is Still Time Consuming > The “Need for speed” > “More data beats a cleverer algorithm”
  • 18. ✛ Parallelize Mahout’s Stochastic Gradient Descent > With as few extra dependencies as possible ✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first-class Hadoop YARN citizen > Work through dev progressions towards a stable state > Worry about “frameworks” later
  • 19. ✛ Training > Simple gradient descent procedure > Loss function needs to be convex ✛ Prediction > Logistic Regression: ∗ Sigmoid function using (parameter vector · example) as the exponential parameter [Diagram labels: Training Data, SGD, Model]
  • 20. Current Limitations ✛ Sequential algorithms on a single node only go so far ✛ The “Data Deluge” > Presents algorithmic challenges when combined with large data sets > Need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms
  • 21. Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald 2010 > Distributed Training Strategies for the Structured Perceptron ✛ Dekel 2010 > Optimal Distributed Online Prediction Using Mini-Batches
  • 22. [Diagram: Input; Superstep 1: three Processors (Map, Map, Map); Superstep 2: three Processors (Reduce, Reduce); . . . ; Output]
  • 23. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” (Lin, 2012)
  • 24. ✛ Parallel Iterative implementation of SGD on YARN ✛ Workers work on partitions of the data ✛ Master keeps global copy of merged parameter vector
  • 25. ✛ Each worker is given a split of the total dataset > Similar to a map task ✛ Using a modified OLR > Process N samples in an epoch (subset of split) ✛ Local parameter vector sent to master node > Master averages all workers’ vectors together
  • 26. ✛ Gathers and averages worker parameter vectors > From worker OLR runs ✛ Produces new global parameter vector > By averaging workers’ vectors ✛ Sends update to all workers > Workers replace local parameter vector with new global parameter vector
  • 27. ✛ ComputableMaster > Setup() > Compute() > Complete() ✛ ComputableWorker > Setup() > Compute() [Diagram: repeated rounds of Worker, Worker, Worker feeding into Master . . .]
  • 28. [Diagram comparing OnlineLogisticRegression with Knitting Boar’s POLR. OnlineLogisticRegression: Training Data, OnlineLogisticRegression, Model. POLR: Split 1, Split 2, Split 3; Worker 1, Worker 2, … Worker N; Partial Model, Partial Model, Partial Model; Master; Global Model]
  • 29. [Chart: Input Size vs Processing Time, comparing OLR and POLR]
  • 30. Knitting Boar PARTING THOUGHTS
  • 31. ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex than just MapReduce > Requires lots of “plumbing” ✛ IterativeReduce > Great native-Hadoop way to implement algorithms > Easy to use and well integrated
  • 32. ✛ Knitting Boar > https://github.com/jpatanooga/KnittingBoar > 100% Java > ASF 2.0 Licensed > Quick Start ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start ✛ IterativeReduce > https://github.com/emsixteeen/IterativeReduce > 100% Java > ASF 2.0 Licensed
  • 33. ✛ Machine Learning is hard > Don’t believe the hype > Do the work ✛ Model development takes time > Lots of iterations > Speed is key here Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
  • 34. ✛ Strata / Hadoop World 2012 Slides > http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html ✛ Mahout’s SGD implementation > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf ✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! > http://arxiv.org/pdf/1209.2191v1.pdf
  • 35. ✛ Langford > http://hunch.net/~vw/ ✛ McDonald, 2010 > http://dl.acm.org/citation.cfm?id=1858068
  • 36. ✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg ✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg ✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg
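
The parallel SGD scheme described on slides 19 and 24 to 26 (workers run logistic-regression SGD on their own split, the master averages the workers' parameter vectors and sends the result back) can be sketched in a few lines. This is a minimal single-process illustration in Python, not Knitting Boar's Java code: the function names, the toy data, the learning rate, and the number of epochs are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    """Logistic function; z is clamped so math.exp cannot overflow."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, example):
    """Sigmoid of (parameter vector . example), as on slide 19."""
    return sigmoid(sum(w * x for w, x in zip(weights, example)))


def sgd_pass(weights, samples, learning_rate):
    """Worker step: plain SGD on the logistic loss over one split.

    Starts from the global weights and returns a new local vector.
    """
    local = list(weights)
    for features, label in samples:
        error = label - predict(local, features)
        for i, x in enumerate(features):
            local[i] += learning_rate * error * x
    return local


def average(vectors):
    """Master step: element-wise mean of the workers' parameter vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train_parallel(splits, num_features, epochs=20, learning_rate=0.1):
    """Alternate worker passes and master averaging (hypothetical driver).

    Workers are simulated sequentially here; on a cluster each split
    would be processed by a separate worker.
    """
    global_weights = [0.0] * num_features
    for _ in range(epochs):
        local_vectors = [
            sgd_pass(global_weights, split, learning_rate) for split in splits
        ]
        global_weights = average(local_vectors)
    return global_weights


def make_toy_data(n, seed):
    """Hypothetical data: feature vector [bias, x], label 1 when x > 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append(([1.0, x], 1 if x > 0 else 0))
    return data


data = make_toy_data(300, seed=7)
splits = [data[i::3] for i in range(3)]  # three "workers"
model = train_parallel(splits, num_features=2)
```

Calling `predict(model, [1.0, 0.8])` on the trained model should give a probability above 0.5, and `predict(model, [1.0, -0.8])` one below it. Averaging once per pass keeps the communication pattern simple, at the cost of every worker waiting for the slowest one before the next pass starts.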

Editor's Notes

  1. Examples of key information: selecting embryos based on 60 features. You may be asking “why aren’t we talking about Mahout?” What we want to do here is look at the fundamentals that will underlie all of the systems, not just Mahout. Some of the wording may be different, but it’s the same.
  2. Yeah? OK, let’s look at doing ETL in Hadoop and then running the model construction phase in another tool like R. No? We need to think of a way to either refactor the algorithm into MapReduce, or partition the data such that a reducer can work on each subset.
  3. Frequent itemset mining – what appears together
  4. “What do other people w/ similar tastes like?” “strength of associations”
  5. “say hello to my leeeeetle friend….”
  6. Vowpal Wabbit: doesn’t natively run on Hadoop. Spark: Scala, overhead, integration issues.
  7. “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
  8. At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes about 6 hours to read the contents of a single hard drive (2 TB / 100 MB/s = 20,000 s, roughly 5.6 hours).
  9. Bottou similar to Xu2010 in the 2010 paper
  10. Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data: iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  11. Some of these are in progress towards being ready on YARN, some not; wanted to focus on OLR and not framework for now
  12. POLR: Parallel Online Logistic Regression. Talking points: wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, and so we used that as a base point.
  13. 3 major costs of BSP-style computations: max unit compute time; cost of global communication; cost of barrier sync at end of superstep.
  14. Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
  15. Basecamp: use story of how we get to basecamp to see how to climb some more
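
The three BSP costs listed in note 13 (max unit compute time, global communication, barrier sync) add up per superstep. The Python sketch below is a simple additive cost model under that assumption; the function names and the timing numbers are hypothetical and only illustrate why one slow worker dominates a superstep.

```python
def superstep_cost(worker_compute_seconds, communication_seconds, barrier_seconds):
    """Cost of one superstep in an assumed additive model.

    The slowest worker sets the compute time, because every worker
    must reach the barrier before the next superstep can begin.
    """
    return max(worker_compute_seconds) + communication_seconds + barrier_seconds


def total_cost(supersteps):
    """Sum the cost of a sequence of supersteps."""
    return sum(superstep_cost(*step) for step in supersteps)


# Hypothetical timings: (per-worker compute, communication, barrier sync).
steps = [
    ([4.0, 5.5, 4.2], 1.0, 0.5),
    ([3.9, 4.1, 6.0], 1.0, 0.5),
]
estimated_seconds = total_cost(steps)
```

In this model, speeding up any worker other than the slowest one in a superstep does not change the total, which is one reason evenly sized splits matter for iterative jobs.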