
Intro to Mahout -- DC Hadoop

Introduction to Apache Mahout -- talk given at DC Hadoop Meetup on April 28

  1. Intro to Apache Mahout
     • Grant Ingersoll
     • Lucid Imagination
     • http://www.lucidimagination.com
  2. Anyone Here Use Machine Learning?
     • Any users of:
       • Google?
       • Search?
       • Priority Inbox?
       • Facebook?
       • Twitter?
       • LinkedIn?
  3. Topics
     • Background and Use Cases
     • What can you do in Mahout?
     • Where’s the community at?
     • Resources
     • K-Means in Hadoop (time permitting)
  4. Definition
     • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
       • From Intro. to Machine Learning by E. Alpaydin
     • Subset of Artificial Intelligence
     • Lots of related fields:
       • Information Retrieval
       • Stats
       • Biology
       • Linear algebra
       • Many more
  5. Common Use Cases
     • Recommend friends/dates/products
     • Classify content into predefined groups
     • Find similar content
     • Find associations/patterns in actions/behaviors
     • Identify key topics/summarize text
       • Documents and Corpora
     • Detect anomalies/fraud
     • Ranking search results
     • Others?
  6. Apache Mahout
     • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
       • http://mahout.apache.org
     • Why Mahout?
       • Many Open Source ML libraries either:
         • Lack Community
         • Lack Documentation and Examples
         • Lack Scalability
         • Lack the Apache License
         • Or are research-oriented
     • Definition: http://dictionary.reference.com/browse/mahout
  7. What does scalable mean to us?
     • Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
       • Some algorithms won’t scale to massive machine clusters
       • Others fit logically on a Map Reduce framework like Apache Hadoop
       • Still others will need different distributed programming models
       • Others are already fast (SGD)
     • Be pragmatic
  8. Sampling of Who uses Mahout?
     • https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  9. What Can I do with Mahout Right Now?
     • 3C + FPM + O = Mahout
       • 3C: Collaborative Filtering, Clustering, Categorization
       • FPM: Frequent Pattern Mining
       • O: Other
  10. Collaborative Filtering
      • Extensive framework for collaborative filtering (recommenders)
      • Recommenders
        • User based
        • Item based
      • Online and Offline support
        • Offline can utilize Hadoop
      • Many different Similarity measures
        • Cosine, LLR, Tanimoto, Pearson, others
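As a rough illustration of three of the similarity measures named on this slide, the sketch below computes cosine, Tanimoto, and Pearson similarity between two users' ratings in plain Python. It is not Mahout's API. The function names, the item-to-rating dict representation, and the choice to treat unrated items as zero in the cosine norm are all assumptions made for the example, and LLR is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two rating dicts (item -> rating).

    Assumption: an item missing from a dict counts as a rating of zero.
    """
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tanimoto_similarity(a, b):
    """Size of the shared item set over the size of the combined item set.

    Rating values are ignored; only which items were rated matters.
    """
    items_a, items_b = set(a), set(b)
    union = items_a | items_b
    if not union:
        return 0.0
    return len(items_a & items_b) / len(union)


def pearson_similarity(a, b):
    """Pearson correlation over the items both users rated."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings for two users.
alice = {"a": 5, "b": 3, "c": 4}
bob = {"a": 5, "b": 3, "d": 1}
print(cosine_similarity(alice, bob))
print(tanimoto_similarity(alice, bob))
print(pearson_similarity(alice, bob))
```

A user-based recommender would rank other users by one of these scores, while an item-based recommender would apply the same idea to item vectors instead.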
  11. Clustering
      • Document level
        • Group documents based on a notion of similarity
        • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)
        • All Map/Reduce
        • Distance Measures
          • Manhattan, Euclidean, other
      • Topic Modeling
        • Cluster words across documents to identify topics
        • Latent Dirichlet Allocation (M/R)
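The two distance measures named on this slide can be sketched in a few lines of Python. This is only an illustration on plain lists of numbers, not Mahout's DistanceMeasure classes, and the function names are made up for the example.

```python
import math


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(manhattan_distance([0, 0], [3, 4]))  # 7
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

Which measure is used changes which documents count as "similar", so the same clustering algorithm can produce different groupings.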
  12. Categorization
      • Place new items into predefined categories:
        • Sports, politics, entertainment
        • Recommenders
      • Implementations
        • Naïve Bayes (M/R)
        • Compl. Naïve Bayes (M/R)
        • Decision Forests (M/R)
        • Linear Regression (Seq. but Fast!)
          • See Chapter 17 of Mahout in Action for Shop It To Me use case: http://awe.sm/5FyNe
  13. Freq. Pattern Mining
      • Identify frequently co-occurrent items
      • Useful for:
        • Query Recommendations
          • Apple -> iPhone, orange, OS X
        • Related product placement
        • Basket Analysis
      • Map/Reduce
      • http://www.amazon.com
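To make "frequently co-occurrent items" concrete, here is a naive single-machine sketch that counts how often each pair of items appears in the same basket and keeps the pairs that meet a minimum support. The slide does not describe Mahout's Map/Reduce implementation, so this is not it. The function name, the `min_support` parameter, and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {(item_a, item_b): count} for pairs seen in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical query sessions, echoing the "Apple" example on the slide.
baskets = [
    ["apple", "iphone"],
    ["apple", "iphone", "os x"],
    ["apple", "orange"],
]
print(frequent_pairs(baskets, min_support=2))  # {('apple', 'iphone'): 2}
```

Counting every pair grows quadratically with basket size, which is one reason real frequent pattern miners use more careful algorithms.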
  14. Other
      • Primitive Collections!
      • Collocations (M/R)
      • Math library
        • Vectors, Matrices, etc.
      • Noise Reduction via Singular Value Decomp (M/R)
  15. Prepare Data from Raw content
      • Data Sources:
        • Lucene integration
          • bin/mahout lucene.vector…
        • Document Vectorizer
          • bin/mahout seqdirectory …
          • bin/mahout seq2sparse …
        • Programmatically
          • See the Utils module in Mahout and the Iterator<Vector> classes
        • Database
        • File system
  16. How to: Command Line
      • Most algorithms have a Driver program
        • $MAHOUT_HOME/bin/mahout.sh helps with most tasks
      • Prepare the Data
        • Different algorithms require different setup
      • Run the algorithm
        • Single Node
        • Hadoop
      • Print out the results or incorporate into application
        • Several helper classes:
          • LDAPrintTopics, ClusterDumper, etc.
  17. What’s Happening Now?
      • Unified Framework for Clustering and Classification
      • 0.5 release on the horizon (May?)
      • Working towards 1.0 release by focusing on:
        • Tests, examples, documentation
        • API cleanup and consistency
      • Gearing up for Google Summer of Code
        • New M/R work for Hidden Markov Models
  18. Summary
      • Machine learning is all over the web today
      • Mahout is about scalable machine learning
      • Mahout has functionality for many of today’s common machine learning tasks
      • Many Mahout implementations use Hadoop
  19. Resources
      • http://mahout.apache.org
      • http://cwiki.apache.org/MAHOUT
      • {user|dev}@mahout.apache.org
      • http://svn.apache.org/repos/asf/mahout/trunk
      • http://hadoop.apache.org
  20. Resources
      • “Mahout in Action”
        • Owen, Anil, Dunning and Friedman
        • http://awe.sm/5FyNe
      • “Introducing Apache Mahout”
        • http://www.ibm.com/developerworks/java/library/j-mahout/
      • “Taming Text” by Ingersoll, Morton, Farris
      • “Programming Collective Intelligence” by Toby Segaran
      • “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
      • “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer
  21. K-Means
      • Clustering Algorithm
        • Nicely parallelizable!
      • http://en.wikipedia.org/wiki/K-means_clustering
  22. K-Means in Map-Reduce
      • Input:
        • Mahout Vectors representing the original content
        • Either:
          • A predefined set of initial centroids (can be from Canopy), or
          • --k: the number of clusters to produce
      • Iterate
        • Do the centroid calculation (more in a moment)
      • Clustering Step (optional)
      • Output
        • Centroids (as Mahout Vectors)
        • Points for each Centroid (if Clustering Step was taken)
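The overall flow on this slide (take vectors and initial centroids, iterate the centroid calculation, optionally run a clustering step) can be sketched as a single-process Python loop. This is an in-memory illustration, not Mahout's driver. The names `max_iterations`, `convergence_delta`, and `run_clustering` are hypothetical, Euclidean distance is assumed, and the centroids are passed in rather than chosen from a `--k` value.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, max_iterations=10, convergence_delta=0.001,
           run_clustering=True):
    """Iterate the centroid calculation, then optionally assign points."""
    for _ in range(max_iterations):
        # Centroid calculation: group points by nearest centroid, then average.
        groups = [[] for _ in centroids]
        for p in points:
            groups[closest(p, centroids)].append(p)
        new_centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else list(c)
            for g, c in zip(groups, centroids)
        ]
        # Converged when no centroid moved more than convergence_delta.
        converged = all(
            math.dist(c, n) <= convergence_delta
            for c, n in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if converged:
            break

    # Optional clustering step: which centroid does each point belong to?
    assignments = None
    if run_clustering:
        assignments = [(closest(p, centroids), p) for p in points]
    return centroids, assignments


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, assignments = kmeans(points, [(0, 0), (10, 10)])
print(centroids)    # [[0.0, 1.0], [10.0, 11.0]]
print(assignments)
```

With `run_clustering=False` only the centroids come back, matching the slide's note that the point assignments are produced only if the clustering step is taken.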
  23. Map-Reduce Iteration
      • Each Iteration calculates the Centroids using:
        • KMeansMapper
        • KMeansCombiner
        • KMeansReducer
      • Clustering Step
        • Calculate the points for each Centroid using:
          • KMeansClusterMapper
  24. KMeansMapper
      • During Setup:
        • Load the initial Centroids (or the Centroids from the last iteration)
      • Map Phase
        • For each input
          • Calculate its distance from each Centroid and output the closest one
      • Distance Measures are pluggable
        • Manhattan, Euclidean, Squared Euclidean, Cosine, others
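The map phase described on this slide might look like the following in plain Python: for each input point, find the nearest centroid and emit a (centroid index, point) pair, with the distance measure passed in as a function. This sketches the idea only and is not the KMeansMapper source. The function names and the use of a list index as the centroid key are assumptions.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans_map(points, centroids, distance=euclidean):
    """Yield (centroid_index, point) for each input point.

    `centroids` plays the role of the state loaded during setup, and
    `distance` is the pluggable measure.
    """
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: distance(point, centroids[i]))
        yield nearest, point


centroids = [(0, 0), (10, 10)]
print(list(kmeans_map([(1, 1), (9, 9)], centroids)))
print(list(kmeans_map([(1, 1), (9, 9)], centroids, distance=manhattan)))
```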
  25. KMeansReducer
      • Setup:
        • Load up clusters
        • Convergence information
        • Partial sums from KMeansCombiner (more in a moment)
      • Reduce Phase
        • Sum all the vectors in the cluster to produce a new Centroid
        • Check for Convergence
        • Output cluster
  26. KMeansCombiner
      • Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
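The split between combiner and reducer can be illustrated with partial sums: each mapper's output is collapsed locally into a (vector sum, count) pair per centroid, and the reducer merges those pairs into a new centroid and checks whether it moved. This is a hedged sketch rather than Mahout's code. The function names, the dict layout, and `convergence_delta` are hypothetical, and dividing the merged sum by the count is the standard k-means mean, assumed here rather than taken from the slides.

```python
import math


def combine(mapped):
    """Combiner: collapse one mapper's (centroid_id, point) pairs into
    a partial (vector_sum, count) per centroid."""
    partials = {}
    for cid, point in mapped:
        total, count = partials.get(cid, ([0.0] * len(point), 0))
        partials[cid] = ([t + x for t, x in zip(total, point)], count + 1)
    return partials


def reduce_centroids(partials_per_mapper, old_centroids, convergence_delta=0.001):
    """Reducer: merge partial sums, compute new centroids, flag convergence."""
    merged = {}
    for partials in partials_per_mapper:
        for cid, (total, count) in partials.items():
            m_total, m_count = merged.get(cid, ([0.0] * len(total), 0))
            merged[cid] = ([a + b for a, b in zip(m_total, total)], m_count + count)

    result = {}
    for cid, (total, count) in merged.items():
        centroid = [t / count for t in total]
        converged = math.dist(centroid, old_centroids[cid]) <= convergence_delta
        result[cid] = (centroid, converged)
    return result


# Output of two hypothetical mappers, each combined locally first.
mapper_a = combine([(0, (0, 0)), (1, (10, 10))])
mapper_b = combine([(0, (0, 2)), (1, (10, 12))])
print(reduce_centroids([mapper_a, mapper_b], [(0, 0), (10, 10)]))
```

Because only one sum and one count per centroid leave each mapper, far less data crosses the network than if every point were shipped to the reducer.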
  27. KMeansClusterMapper
      • Some applications only care about what the Centroids are, so this step is optional
      • Setup:
        • Load up the clusters and the DistanceMeasure used
      • Map Phase
        • Calculate which Cluster the point belongs to
        • Output <ClusterId, Vector>