Introduction to Mahout and Machine Learning

This presentation gives an introduction to Apache Mahout and Machine Learning. It presents some of the important Machine Learning algorithms implemented in Mahout. Machine Learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.

Published in: Technology, Education

  1. { “Mahout” : “Scalable Machine Learning Library” } { “Presented By” : “Varad Meru”, “Company” : “Orzota, Inc”, “Twitter” : “@vrdmr” }
  2. { “Mahout” : “Introduction” }
  3. { “Introduction” : “History and Etymology” }
     • A scalable machine learning library built on Hadoop and written in Java.
     • Driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”.
     • Started as a Lucene sub-project; became an Apache top-level project (TLP) in April 2010.
     • Latest version: 0.6 (released on 6 February 2012).
     • “Mahout” means keeper or driver of elephants. The name fits because many of the algorithms are implemented in MapReduce on Hadoop, whose mascot is an elephant.
     • Mahout was started by Isabel Drost, Grant Ingersoll, and Karl Wettin.
     • The Taste recommendation framework was added later by Sean Owen.
     Figure 1.1. Apache Mahout and its related projects within the Apache Foundation.
     Figure text (excerpt): Mahout incubates a number of techniques and algorithms. At this early stage in the project's life, three core themes are evident: recommender engines, clustering, and classification.
  4. { “Mahout” : “Machine Learning” }
  5. { “Machine Learning” : “Introduction” }
     “Machine Learning is programming computers to optimize a performance criterion using example data or past experience.”
     • Branch of Artificial Intelligence
     • Design and development of algorithms
     • Computers evolve behavior based on empirical data
     • Supervised Learning
       • Uses labeled training data to create a classifier that can predict output for unseen inputs.
     • Unsupervised Learning
       • Uses unlabeled training data to create a function that finds structure in the data, such as groups of similar items.
     • Semi-Supervised Learning
  6. { “Machine Learning” : “Applications” }
     • Recommend friends, dates, and products to end users.
     • Classify content into pre-defined groups.
     • Find similar content based on object properties.
     • Identify key topics in large collections of text.
     • Detect anomalies within given data.
     • Rank search results with user feedback learning.
     • Classify DNA sequences.
     • Sentiment analysis / opinion mining.
     • Computer vision.
     • Natural language processing.
     • Bioinformatics.
     • Speech and handwriting recognition.
     • Others ...
  7. { “Machine Learning” : “Challenges” }
     • Big Data
     • Yesterday’s processing on next-generation data
     • Time for processing
     • Large and cheap storage
     Size | Classification | Tools
     Lines | Sample Data Analysis and Visualization | Whiteboard, bash, ...
     KBs - low MBs | Prototype Data Analysis and Visualization | Matlab, Octave, R, Processing, bash, ...
     MBs - low GBs | Online Data Storage | MySQL (DBs), ...
     MBs - low GBs | Online Data Analysis | NumPy, SciPy, Weka, BLAS/LAPACK, ...
     MBs - low GBs | Online Data Visualization | Flare, AmCharts, Raphael, Protovis, ...
     GBs - TBs - PBs | Big Data Storage | HDFS, HBase, Cassandra, ...
     GBs - TBs - PBs | Big Data Analysis | Hive, Mahout, Hama, Giraph, ...
  8. { “Machine Learning” : “Mahout for Big Data” }
     • Goal: “Be as fast and efficient as possible given the intrinsic design of the algorithm.”
       • Some algorithms won’t scale to massive machine clusters.
       • Others fit logically on a MapReduce framework like Apache Hadoop.
       • Most Mahout implementations are MapReduce enabled.
     • Focus: “Scalability with Hadoop’s MapReduce processing framework on Big Data in Hadoop’s HDFS storage.”
     • The only machine learning library built on a MapReduce framework. Other MapReduce frameworks, such as Disco, Skynet, FileMap, Phoenix, and AEMR, either don’t scale or don’t have an ML library.
     • The only scalable machine learning framework with MapReduce and Hadoop support (www.mloss.org: Machine Learning Open-Source Software).
  9. { “Mahout” : “Internals” }
  10. { “Internals” : “Architecture” }
      Diagram components: Math (Vectors/Matrices/SVD); Recommenders; Clustering; Classification; Freq. Pattern Mining; Evolutionary Algorithms; Utilities (Lucene/Vectorizer); Collections (primitives); Apache Hadoop; Applications; Examples; Regression; Dimension Reduction.
  11. { “Internals” : “Features” }
      • Scalable
      • Dual-mode (sequential and MapReduce enabled)
      • Support for easy extension
      • Supports a large number of data sources, including the newer NoSQL variants
      • It is a Java library: a framework of tools intended to be used and adapted by developers.
      • Advanced implementations of Java’s Collections Framework for better performance
  12. { “Mahout” : “Algorithms” }
  13. { “Algorithms” : “Recommender Systems”, “id” : “Introduction” }
      • Help users find items they might like based on historical behavior and preferences.
      • Top-level packages define the Mahout interfaces to these key abstractions:
        • DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel
        • UserSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
        • ItemSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
        • UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood
        • Recommender – KNN Item-Based Recommender, Slope One Recommender, Tree Clustering Recommender
  14. { “Algorithms” : “Recommender Systems”, “id” : “Example” }
      Binary Values Recommendation
      0 1 1 1 1 0 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 1 1 0 1
      Alice, Bob, John, Jane, Bill, Steve, Larry, Don, Jack
  15. { “Algorithms” : “Recommender Systems”, “Similarity” : “Tanimoto” }
      Tanimoto Coefficient
      1           | 1/3 ≈ 0.33  | 5/8 = 0.625 | 5/8 = 0.625
      1/3 ≈ 0.33  | 1           | 3/8 = 0.375 | 3/8 = 0.375
      5/8 = 0.625 | 3/8 = 0.375 | 1           | 5/7 ≈ 0.714
      5/8 = 0.625 | 3/8 = 0.375 | 5/7 ≈ 0.714 | 1
      NA – Number of customers who bought Product A
      NB – Number of customers who bought Product B
      Nc – Number of customers who bought both Product A and Product B
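
Using the legend on this slide, the Tanimoto coefficient of two products is Nc / (NA + NB − Nc), which is consistent with entries such as 5/8 in the matrix. The following is a minimal Python sketch of that formula, not Mahout's implementation. The purchase sets and the function name `tanimoto` are hypothetical and do not reproduce the data from slide 14.

```python
def tanimoto(buyers_a, buyers_b):
    """Tanimoto coefficient Nc / (NA + NB - Nc) for two sets of customers."""
    n_a = len(buyers_a)             # NA: customers who bought product A
    n_b = len(buyers_b)             # NB: customers who bought product B
    n_c = len(buyers_a & buyers_b)  # Nc: customers who bought both
    union = n_a + n_b - n_c
    # Two products nobody bought have no defined similarity; return 0.0 here.
    return n_c / union if union else 0.0


# Hypothetical purchase data, for illustration only.
purchases = {
    "product_a": {"Alice", "Bob", "John", "Jane"},
    "product_b": {"Bob", "Jane", "Bill"},
}

score = tanimoto(purchases["product_a"], purchases["product_b"])
print(score)  # 2 / (4 + 3 - 2)
```

Working on sets keeps the sketch close to the binary (bought / not bought) setting of the example; ratings are ignored entirely.
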
  16. { “Algorithms” : “Recommender Systems”, “Similarity” : “Cosine” }
      Cosine Coefficient
      1     | 0.507 | 0.772 | 0.772
      0.507 | 1     | 0.707 | 0.707
      0.772 | 0.707 | 1     | 0.833
      0.772 | 0.707 | 0.833 | 1
      NA – Number of customers who bought Product A
      NB – Number of customers who bought Product B
      Nc – Number of customers who bought both Product A and Product B
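
For binary purchase vectors, the cosine coefficient reduces to Nc / sqrt(NA × NB) in the notation of this slide. The sketch below illustrates that reduction in Python under stated assumptions: the customer sets and the function name `cosine_binary` are hypothetical, and the code is not taken from Mahout.

```python
import math


def cosine_binary(buyers_a, buyers_b):
    """Cosine similarity of two binary purchase vectors, given as sets."""
    n_a, n_b = len(buyers_a), len(buyers_b)
    if n_a == 0 or n_b == 0:
        # A zero vector has no direction; treat the similarity as 0.0.
        return 0.0
    n_c = len(buyers_a & buyers_b)  # dot product of two 0/1 vectors
    return n_c / math.sqrt(n_a * n_b)


# Hypothetical customers, for illustration only.
bought_a = {"Alice", "Bob", "John", "Jane"}
bought_b = {"Bob", "Jane", "Bill", "Steve"}

similarity = cosine_binary(bought_a, bought_b)
print(similarity)  # 2 / sqrt(4 * 4)
```
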
  17. { “Algorithms” : “Classification”, “id” : “Introduction” }
      • Assigning data to discrete categories.
        • Train a model on labeled data.
        • Run the model on new, unlabeled data.
      • Classifier: an algorithm that implements classification, especially in a concrete implementation.
      • Classification algorithms
        • Maximum entropy classifier
        • Naïve Bayes classifier
        • Decision trees, decision lists
        • Support vector machines
        • Kernel estimation and K-nearest-neighbor algorithms
        • Perceptrons
        • Neural networks (multi-layer perceptrons)
      Figure labels: Spam, Not spam, ?
  18. { “Algorithms” : “Classification”, “id” : “Naïve Bayes Example” }
      Train: Not Spam. President Obama’s Nobel Prize Speech
  19. { “Algorithms” : “Classification”, “id” : “Naïve Bayes Example” }
      Train: Spam. Spam Email Content
  20. { “Algorithms” : “Classification”, “id” : “Naïve Bayes Example” }
      Run: “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”
  21. { “Algorithms” : “Classification”, “id” : “Naïve Bayes in Mahout” }
      • Naïve Bayes is a fairly complex process in Mahout: training the classifier requires four separate Hadoop jobs.
      • Training:
        • Read the features
        • Calculate per-document statistics
        • Normalize across categories
        • Calculate the normalizing factor of each label
      • Testing:
        • Classification (fifth job, explicitly invoked)
      Figure 13.2. How a classification system works. Inside the dotted lasso is the heart of the classification system, a training algorithm that learns a model to emulate human decisions. A copy of the model is then used in evaluation or in production with new input examples to estimate the target variable.
      The figure shows two phases of the classification process, with the upper path representing training the classification model and the lower path providing new examples for which the model will assign categories (the target variables) as a way to emulate decisions.
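
The train-then-run flow of the Naïve Bayes example slides can be sketched on a single machine as follows. This is a minimal multinomial Naïve Bayes with add-one smoothing written for illustration; it is not Mahout's Hadoop job pipeline, and the training sentences, labels, and function names (`train`, `classify`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: fraction of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, for illustration only.
training = [
    ("spam", "order a trial now summer savings"),
    ("spam", "new savings order today"),
    ("not_spam", "the president gave a speech on peace"),
    ("not_spam", "the prize speech was about peace"),
]
model = train(training)
label = classify("summer savings order", model)
print(label)
```

Summing log-probabilities instead of multiplying raw probabilities avoids numeric underflow on longer documents.
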
  22. { “Algorithms” : “Clustering”, “id” : “Introduction” }
      • Grouping unstructured data without any training data.
      • Self-learning from experience.
      • Small intra-cluster distance (trying for local and global minima)
      • Large inter-cluster distance
      • Mahout’s Canopy Clustering MapReduce algorithm is often used to compute initial cluster centroids.
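
As a rough illustration of how canopy clustering can produce initial centroids, the sketch below follows the textbook two-threshold scheme (a loose threshold t1 and a tight threshold t2, with t1 > t2). It is a single-machine simplification, not Mahout's MapReduce implementation: it uses the first point of each canopy as its center, and the sample points, threshold values, and the function name `canopy_centers` are assumptions made for this example.

```python
import math


def canopy_centers(points, t1, t2):
    """Group points into canopies; returns a list of (center, members)."""
    if t1 <= t2:
        raise ValueError("t1 (loose threshold) must be larger than t2 (tight threshold)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # next unclaimed point starts a new canopy
        members = [center]
        still_remaining = []
        for point in remaining:
            distance = math.dist(center, point)
            if distance < t1:
                members.append(point)          # loosely belongs to this canopy
            if distance >= t2:
                still_remaining.append(point)  # may still start or join another canopy
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Hypothetical 2-D points and thresholds, for illustration only.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0), (10.0, 10.0), (10.5, 10.0)]
canopies = canopy_centers(points, t1=3.0, t2=1.5)
centers = [center for center, _ in canopies]
print(centers)
```

The resulting centers could then be handed to K-Means as starting centroids, which is the use the slide describes.
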
  23. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  24. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  25. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  26. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  27. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  28. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  29. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  30. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  31. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  32. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
      Figure labels: Cats, Dogs
  33. { “Algorithms” : “Clustering”, “id” : “K-Means in Mahout” }
      Diagram labels: chunks C0, C1, C2, C3; mappers M0, M1, M2, M3; IO0, IO1, IO2, IO3; Shuffling Data; reducers R0, R1; FO0, FO1; Map Phase; Reduce Phase.
  34. { “Algorithms” : “Clustering”, “id” : “K-Means in Mahout” }
      • Assume the number of clusters is far smaller than the number of points: |Clusters| << |Points|.
      • Hadoop’s DistributedCache is used to give each mapper access to all the current cluster centroids.
      Diagram labels: M0, M1, M2, M3; <clusterID, observation>; R0, R1
      Important arguments: --maxIter, --convergenceDelta, --method
  35. { “Algorithms” : “Clustering”, “id” : “MapReduce KMeans Clustering” }
      • Map phase: assign cluster IDs
      • Reduce phase: reset centroids
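
The two phases named on this slide can be simulated in one process to show the data flow: the map phase emits <clusterID, observation> pairs and the reduce phase averages each group into a new centroid. This is a sketch under stated assumptions, not Mahout's Hadoop code. The parameters `max_iter` and `convergence_delta` are loosely modeled on the --maxIter and --convergenceDelta arguments listed on slide 34, and the points, starting centroids, and function names are hypothetical.

```python
import math
from collections import defaultdict


def map_phase(points, centroids):
    """Emit (cluster_id, point): each point goes to its nearest centroid."""
    for point in points:
        cluster_id = min(range(len(centroids)),
                         key=lambda i: math.dist(point, centroids[i]))
        yield cluster_id, point


def reduce_phase(pairs, centroids):
    """Reset each centroid to the mean of the points assigned to it."""
    groups = defaultdict(list)
    for cluster_id, point in pairs:
        groups[cluster_id].append(point)
    new_centroids = list(centroids)  # clusters with no points keep their centroid
    for cluster_id, members in groups.items():
        new_centroids[cluster_id] = tuple(
            sum(coords) / len(members) for coords in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iter=10, convergence_delta=1e-6):
    """Repeat map and reduce until centroids stop moving or max_iter is hit."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift <= convergence_delta:
            break
    return centroids


# Hypothetical 2-D points and starting centroids, for illustration only.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
final_centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
```

Because every mapper only needs the current centroids, and there are far fewer centroids than points, broadcasting them to each mapper stays cheap; that is the role slide 34 assigns to Hadoop's DistributedCache.
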
  36. { “Algorithms” : “Other Algorithms” }
      • Classification
        ‣ Stochastic Gradient Descent
        ‣ Support Vector Machines
        ‣ Random Forests
      • Clustering
        ‣ Latent Dirichlet Allocation – topic models
        ‣ Fuzzy K-Means – points are assigned to multiple clusters
        ‣ Canopy clustering – fast approximations of clusters
        ‣ Spectral clustering – treats points as a graph
      • Evolutionary Algorithms – integration with Watchmaker for genetic programming fitness functions
      • Dimensionality Reduction
      • Regression
  37. { “Algorithms” : “Future” }
      • Classification
        ‣ Decision trees such as J48 and ID3
      • Clustering
        ‣ DBSCAN and Cobweb clustering techniques
      • Evolutionary Algorithms
        ‣ Classical genetic algorithms
      • Association Rules
        ‣ Apriori (Mahout already has an alternative frequent itemset algorithm implementation)
  38. { “Mahout” : “Summary” }
  39. { “Summary” : “Apache Mahout” }
      • Scalable Library
  40. { “Summary” : “Apache Mahout” }
      • Scalable Library
      • Three Primary Areas of Focus
  41. { “Summary” : “Apache Mahout” }
      • Scalable Library
      • Three Primary Areas of Focus
      • Other Algorithms
  42. { “Summary” : “Apache Mahout” }
      • Scalable Library
      • Three Primary Areas of Focus
      • Other Algorithms
      • All in your friendly neighborhood MapReduce
  43. { “Mahout” : “Demo” }
  44. { “Mahout” : “Questions” }
  45. { “Mahout” : “References” }
  46. { “References” : “Mahout Books, Tutorials, Links”, “id” : 1 }
      • Books
        • “Mahout in Action”, Owen et al., Manning Pub.
        • “Pattern Recognition and Machine Learning”, Christopher Bishop, Springer Pub.
        • “Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Hastie et al., Springer Pub.
      • Videos
        • CS-229, Machine Learning at Stanford University – Prof. Andrew Ng
        • Collaborative filtering at scale – Sean Owen
        • Distributed Item-based Collaborative Filtering – Sebastian Schelter
        • Email Classification with Mahout – Grant Ingersoll @ Lucid Imagination
  47. { “References” : “Mahout Books, Tutorials, Links”, “id” : 2 }
      • WWW
        • http://mahout.apache.org – Mahout@Apache
        • http://hadoop.apache.org – Hadoop@Apache
        • dev@mahout.apache.org – Developer mailing list
        • user@mahout.apache.org – User mailing list
        • http://www.ibm.com/developerworks/java/library/j-mahout/ – Introducing Apache Mahout
  48. { “Mahout” : “The End” } { “Thank You” : “Have a Nice and Green Day” }
