Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Essentials of Mahout
Mastering Hadoop Map-reduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com




What is Apache Mahout?

• A scalable machine learning infrastructure


• Built on top of Hadoop MapReduce


• Currently supports:


   • Clustering, classification, collaborative filtering, and more




A Little History

• Founded by folks active in the Lucene community


• Inspired by work at Stanford: “Map-Reduce for Machine Learning on
  Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf




Project Goal

• Create a community-driven, scalable, and robust machine learning
  infrastructure


• Leverage Hadoop for parallel processing and scalability


• Provide an abstraction on top of Hadoop so that machine learning users
  need not deal with the map and reduce primitives when they build their
  solutions.




Supported Algorithms

 • Collaborative Filtering


 • User- and item-based recommenders


 • K-Means, Fuzzy K-Means clustering


 • Mean Shift clustering


 • Dirichlet process clustering




More Supported Algorithms

 • Latent Dirichlet Allocation


 • Singular value decomposition


 • Parallel Frequent Pattern mining


 • Complementary Naive Bayes classifier


 • Random forest (decision-tree-based) classifier


 • ...and growing




Focus Areas

 • Collaborative Filtering


 • Clustering


 • Classification




Build and Install

• Required Software:


   • Java 1.6.x


   • Maven 2.0.11+


• Get source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout


• Compile and install core and examples: mvn install


   • Alternatively, run the phases individually: mvn compile, mvn package, and mvn install




Recommendation Examples

 • mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"


 • https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples




Common Use Cases

 • Shopping: Amazon, Netflix


 • Who to follow/friend: Twitter/Facebook


 • Web resource classification, spam filtering, and pattern recognition in
   financial markets




Collaborative Filtering Basis

  • User-based: recommend items by finding similar users. User preferences
    keep changing so this method poses challenges.


  • Item-based: calculate similarity between items and make
    recommendations. Usually items don’t change much so the method is
    often reliable.


  • Slope-one: fast and efficient item-based recommendation when user
    ratings are more than a boolean yes/no or like/dislike.


  • Model-based: provide recommendations by building a model of users
    and their ratings.
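
To make the slope-one idea concrete, here is a minimal sketch of weighted Slope One in plain Java. It is not Mahout's implementation: the class name SlopeOneSketch, its methods, and the string item IDs are all assumptions made for illustration. The predictor averages, over the items a user has already rated, the user's rating plus the average observed difference between the target item and that item, weighted by how many users rated both.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal weighted Slope One sketch. All names are illustrative, not Mahout's API. */
class SlopeOneSketch {
    // diffSum.get(i).get(j): sum of (rating of i - rating of j) over users who rated both
    private final Map<String, Map<String, Double>> diffSum = new HashMap<>();
    // pairCount.get(i).get(j): number of users who rated both i and j
    private final Map<String, Map<String, Integer>> pairCount = new HashMap<>();

    /** Accumulates pairwise rating differences from userId -> (itemId -> rating). */
    void train(Map<String, Map<String, Double>> ratingsByUser) {
        for (Map<String, Double> ratings : ratingsByUser.values()) {
            for (Map.Entry<String, Double> a : ratings.entrySet()) {
                for (Map.Entry<String, Double> b : ratings.entrySet()) {
                    if (a.getKey().equals(b.getKey())) {
                        continue; // skip the item paired with itself
                    }
                    diffSum.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                           .merge(b.getKey(), a.getValue() - b.getValue(), Double::sum);
                    pairCount.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                             .merge(b.getKey(), 1, Integer::sum);
                }
            }
        }
    }

    /** Predicts the user's rating for target, or NaN when no co-rated item exists. */
    double predict(Map<String, Double> userRatings, String target) {
        Map<String, Double> sums = diffSum.get(target);
        if (sums == null) {
            return Double.NaN;
        }
        double numerator = 0.0;
        int denominator = 0;
        for (Map.Entry<String, Double> rated : userRatings.entrySet()) {
            Double sum = sums.get(rated.getKey());
            if (sum == null) {
                continue; // nobody rated both the target and this item
            }
            int n = pairCount.get(target).get(rated.getKey());
            // (average difference + user's rating) weighted by n equals sum + rating * n
            numerator += sum + rated.getValue() * n;
            denominator += n;
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> data = new HashMap<>();
        data.put("john", Map.of("A", 5.0, "B", 3.0, "C", 2.0));
        data.put("mark", Map.of("A", 3.0, "B", 4.0));
        SlopeOneSketch model = new SlopeOneSketch();
        model.train(data);
        System.out.println(model.predict(Map.of("B", 2.0, "C", 5.0), "A"));
    }
}
```

The sketch keeps sums and counts rather than averages so that new ratings could be folded in incrementally, which is one reason the technique is considered cheap to maintain.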




Clustering Basis

 • Clustering algorithms also use the notion of similarity to group similar
   items into a cluster.


 • Both collaborative filtering and clustering use the notion of a distance,
   which can be calculated using a number of different techniques.


    • Example: Euclidean distance,
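
As a small illustration of the distance notion, the following sketch computes the Euclidean distance between two feature vectors and converts it into a similarity score. The class EuclideanDistanceSketch and the 1 / (1 + d) conversion are assumptions chosen for this example, not something the slides or Mahout prescribe.

```java
/** Euclidean distance between equal-length vectors. A hypothetical helper, not a Mahout class. */
final class EuclideanDistanceSketch {

    /** Square root of the sum of squared component differences. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Maps a distance in [0, infinity) to a similarity in (0, 1]; identical vectors score 1. */
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        double[] first = {0.0, 0.0};
        double[] second = {3.0, 4.0};
        System.out.println(distance(first, second));
        System.out.println(similarity(first, second));
    }
}
```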




Mahout Taste Framework

• Taste Collaborative Filtering:


   • Taste is an open source project for CF started by Sean Owen on
     SourceForge and donated to Mahout in 2008.


   • Has been applied to a number of different data sets successfully.


• Mahout's support for building recommendation engines is based primarily on
  the Taste library.


   • The library supports both user-based and item-based recommendations.


• Can be used with Java or over RESTful web-service endpoints.




Taste Framework : Primary Classes

 • DataModel: Model for Users, Items, and Preferences


 • UserSimilarity: Interface defining the similarity between two users


 • ItemSimilarity: Interface defining the similarity between two items


 • Recommender: Interface for providing recommendations


 • UserNeighborhood: Interface for computing a neighborhood of similar
   users. These are used by the Recommenders.
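
To show how these roles fit together, here is a toy user-based recommender in plain Java. It deliberately does not use the Taste interfaces: the class TasteRolesSketch and every method on it are hypothetical stand-ins, and cosine similarity over co-rated items is just one possible choice of similarity. The sketch assumes every user ID passed in has at least one stored preference.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy user-based recommender. All types are stand-ins for the Taste roles, not Mahout's API. */
class TasteRolesSketch {
    // "DataModel" role: userId -> (itemId -> preference value)
    private final Map<Long, Map<Long, Double>> prefs = new HashMap<>();

    void setPreference(long user, long item, double value) {
        prefs.computeIfAbsent(user, k -> new HashMap<>()).put(item, value);
    }

    // "UserSimilarity" role: cosine similarity over the items both users rated
    double userSimilarity(long u, long v) {
        Map<Long, Double> a = prefs.get(u);
        Map<Long, Double> b = prefs.get(v);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue;
            }
            dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
            normB += other * other;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // "UserNeighborhood" role: the n users most similar to the given user
    List<Long> neighborhood(long user, int n) {
        List<Long> others = new ArrayList<>(prefs.keySet());
        others.remove(Long.valueOf(user));
        others.sort(Comparator.comparingDouble((Long o) -> -userSimilarity(user, o)));
        return others.subList(0, Math.min(n, others.size()));
    }

    // "Recommender" role: similarity-weighted average of the neighbours' ratings for the item
    double estimatePreference(long user, long item, int n) {
        double weighted = 0.0, totalWeight = 0.0;
        for (long neighbor : neighborhood(user, n)) {
            Double rating = prefs.get(neighbor).get(item);
            double similarity = userSimilarity(user, neighbor);
            if (rating == null || similarity <= 0.0) {
                continue;
            }
            weighted += similarity * rating;
            totalWeight += similarity;
        }
        return totalWeight == 0.0 ? Double.NaN : weighted / totalWeight;
    }

    public static void main(String[] args) {
        TasteRolesSketch model = new TasteRolesSketch();
        model.setPreference(1, 10, 5.0);
        model.setPreference(1, 11, 3.0);
        model.setPreference(2, 10, 5.0);
        model.setPreference(2, 11, 3.0);
        model.setPreference(2, 12, 4.0);
        System.out.println(model.estimatePreference(1, 12, 1));
    }
}
```

Keeping similarity, neighborhood, and recommendation as separate steps mirrors the separation the interfaces above describe: each piece can be swapped without touching the others.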




Taste Framework : Online vs Offline

 • Can do online recommendations for a few thousand data sets.


 • Leverages Hadoop for offline recommendation calculations on large data
   sets.




Understanding the Group Lens Implementation

• Provides insight into a sample Mahout Taste framework implementation.


• Uses the publicly available GroupLens data set


• Part of the distribution so you can analyze it, modify it, and use it as an
  inspiration for your own implementation


• Easy to follow example




Group Lens Implementation Source

• GroupLensDataModel.java


• GroupLensRecommender.java


• GroupLensRecommenderBuilder.java


• GroupLensRecommenderEvaluatorRunner.java




Group Lens Runner -- evaluator

• Instantiates an evaluator:


   • RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();


   • a “mean absolute error” algorithm: the average absolute difference
     between predicted and actual ratings (lower is better)


• Parses input parameters:


   • File ratingsFile = TasteOptionParser.getRatings(args);
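
The metric behind the evaluator can be sketched in a few lines. This is a standalone illustration of mean absolute error, not the code inside AverageAbsoluteDifferenceRecommenderEvaluator; the class and method names are assumptions.

```java
/** Mean absolute error between predicted and actual ratings. A hypothetical helper. */
final class MeanAbsoluteErrorSketch {

    /** Averages |predicted - actual| over all pairs; 0.0 means every prediction was exact. */
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("Need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / predicted.length;
    }

    public static void main(String[] args) {
        double[] predicted = {4.0, 3.0, 5.0};
        double[] actual = {5.0, 3.0, 3.0};
        // Absolute errors are 1, 0, and 2, so the mean is 1.0
        System.out.println(meanAbsoluteError(predicted, actual));
    }
}
```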




Group Lens Runner -- data model

 • Parses a colon-delimited ratings file:


    • DataModel model = ratingsFile == null ? new GroupLensDataModel() : new GroupLensDataModel(ratingsFile);
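
For orientation, the MovieLens ratings.dat file published by GroupLens stores one rating per line as UserID::MovieID::Rating::Timestamp. The sketch below parses one such line. It is an illustration only: RatingLineParserSketch and its nested Rating type are hypothetical names, and this is not a copy of what GroupLensDataModel does internally.

```java
import java.util.regex.Pattern;

/** Parses one "user::item::rating::timestamp" line. Hypothetical helper, not Mahout code. */
final class RatingLineParserSketch {
    private static final Pattern DELIMITER = Pattern.compile("::");

    /** Simple immutable holder for one parsed rating. */
    static final class Rating {
        final long userId;
        final long itemId;
        final float value;

        Rating(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Splits on "::" and reads the first three fields; the timestamp, if present, is ignored. */
    static Rating parse(String line) {
        String[] fields = DELIMITER.split(line.trim());
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected user::item::rating, got: " + line);
        }
        return new Rating(Long.parseLong(fields[0]),
                          Long.parseLong(fields[1]),
                          Float.parseFloat(fields[2]));
    }

    public static void main(String[] args) {
        Rating rating = parse("1::1193::5::978300760");
        System.out.println(rating.userId + " rated " + rating.itemId + " as " + rating.value);
    }
}
```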




Group Lens Runner -- evaluate with recommendation builder

• Evaluates using GroupLensRecommender, created by GroupLensRecommenderBuilder:


  • double evaluation = evaluator.evaluate(new GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);
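
In Mahout's RecommenderEvaluator.evaluate, the last two arguments are the training percentage and the evaluation percentage: 0.9 keeps roughly 90% of each user's preferences for training and holds out the rest for comparison, while 0.3 draws roughly 30% of users into the evaluation. The sketch below illustrates that kind of random hold-out split in plain Java. HoldoutSplitSketch and its methods are assumptions for illustration and do not reproduce Mahout's internal logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Random hold-out split of one user's preferences. Hypothetical helper, not Mahout code. */
final class HoldoutSplitSketch {
    final List<Integer> training = new ArrayList<>();
    final List<Integer> heldOut = new ArrayList<>();

    /** Assigns preference indices 0..count-1 to training with the given probability. */
    static HoldoutSplitSketch split(int preferenceCount, double trainingPercentage, Random random) {
        HoldoutSplitSketch result = new HoldoutSplitSketch();
        for (int i = 0; i < preferenceCount; i++) {
            if (random.nextDouble() < trainingPercentage) {
                result.training.add(i);
            } else {
                result.heldOut.add(i);
            }
        }
        return result;
    }

    /** Decides whether a user takes part in the evaluation at all. */
    static boolean includeUser(double evaluationPercentage, Random random) {
        return random.nextDouble() < evaluationPercentage;
    }

    public static void main(String[] args) {
        Random random = new Random();
        if (includeUser(0.3, random)) {
            HoldoutSplitSketch split = split(20, 0.9, random);
            System.out.println("training=" + split.training.size()
                    + " heldOut=" + split.heldOut.size());
        } else {
            System.out.println("user skipped for this evaluation run");
        }
    }
}
```

Because the split is random, repeated runs of an evaluation like the one above can return slightly different scores for the same recommender.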




Questions?




• blog: shanky.org | twitter: @tshanky


• st@treasuryofideas.com

SDEC2011 Essentials of Mahout

  • 1. Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Essentials of Mahout Mastering Hadoop Map-reduce for Data Analysis Shashank Tiwari blog: shanky.org | twitter: @tshanky st@treasuryofideas.com
  • 2. What is Apache Mahout?
      • A scalable machine learning infrastructure
      • Built on top of Hadoop MapReduce
      • Currently supports clustering, classification, collaborative filtering, and more
  • 3. A Little History
      • Founded by people active in the Lucene community
      • Inspired by work at Stanford: “Map-Reduce for Machine Learning on Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
  • 4. Project Goal
      • Create a community-driven, scalable, and robust machine learning infrastructure
      • Leverage Hadoop for parallel processing and scalability
      • Provide an abstraction on top of Hadoop so that machine-learning users need not deal with the map and reduce primitives when they build their solutions
  • 5. Supported Algorithms
      • Collaborative filtering
      • User- and item-based recommenders
      • K-Means and Fuzzy K-Means clustering
      • Mean Shift clustering
      • Dirichlet process clustering
  • 6. More Supported Algorithms
      • Latent Dirichlet Allocation
      • Singular value decomposition
      • Parallel frequent pattern mining
      • Complementary Naive Bayes classifier
      • Random forest decision-tree-based classifier
      • ...and growing
  • 7. Focus Areas
      • Collaborative filtering
      • Clustering
      • Classification
  • 8. Build and Install
      • Required software: Java 1.6.x and Maven 2.0.11+
      • Get the source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
      • Compile and install core and examples: mvn install
      • Alternatively, run mvn compile, mvn package, and mvn install individually
  • 9. Recommendation Examples
      • mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"
      • https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples
  • 10. Common Use Cases
      • Shopping: Amazon, Netflix
      • Who to follow/friend: Twitter/Facebook
      • Web resource classification, spam filtering, and pattern recognition in financial markets
  • 11. Collaborative Filtering Basis
      • User-based: recommends items by finding similar users. User preferences keep changing, which makes this method challenging.
      • Item-based: calculates the similarity between items and recommends on that basis. Items usually change little, so this method is often reliable.
      • Slope-one: fast and efficient item-based recommendation for cases where user ratings are richer than a boolean yes/no or like/dislike.
      • Model-based: recommends on the basis of a model built from users and their ratings.
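To make the slope-one idea concrete, here is a minimal sketch in plain Java. It is not Mahout's implementation: the class name SlopeOneSketch, the helper methods, and the sample ratings are all assumptions made for illustration. It uses the weighted variant, where each co-rated item contributes its average rating difference to the target item, weighted by how many users rated both.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical class, not part of Mahout or Taste.
final class SlopeOneSketch {

    /**
     * Predicts the rating of targetItem for user with weighted Slope One.
     * ratings maps userId -> (itemId -> rating). Returns NaN when no other
     * user has co-rated the target item with any item this user rated.
     */
    static double predict(Map<Long, Map<Long, Double>> ratings, long user, long targetItem) {
        Map<Long, Double> userRatings = ratings.get(user);
        if (userRatings == null) {
            return Double.NaN;
        }
        double numerator = 0.0;
        int totalCount = 0;
        for (Map.Entry<Long, Double> rated : userRatings.entrySet()) {
            long item = rated.getKey();
            if (item == targetItem) {
                continue;
            }
            // Sum of (rating of target - rating of item) over users who rated both.
            double diffSum = 0.0;
            int count = 0;
            for (Map<Long, Double> other : ratings.values()) {
                Double rTarget = other.get(targetItem);
                Double rItem = other.get(item);
                if (rTarget != null && rItem != null) {
                    diffSum += rTarget - rItem;
                    count++;
                }
            }
            if (count > 0) {
                // (average difference + user's rating) * count, simplified.
                numerator += diffSum + rated.getValue() * count;
                totalCount += count;
            }
        }
        return totalCount == 0 ? Double.NaN : numerator / totalCount;
    }

    // Builds one user's item -> rating map from parallel arrays.
    static Map<Long, Double> ratingsOf(long[] items, double[] values) {
        Map<Long, Double> map = new HashMap<>();
        for (int i = 0; i < items.length; i++) {
            map.put(items[i], values[i]);
        }
        return map;
    }

    // Made-up sample data: three users, three items (101, 102, 103).
    static Map<Long, Map<Long, Double>> sampleRatings() {
        Map<Long, Map<Long, Double>> ratings = new HashMap<>();
        ratings.put(1L, ratingsOf(new long[] {101, 102, 103}, new double[] {5, 3, 2}));
        ratings.put(2L, ratingsOf(new long[] {101, 102}, new double[] {3, 4}));
        ratings.put(3L, ratingsOf(new long[] {102, 103}, new double[] {2, 5}));
        return ratings;
    }

    public static void main(String[] args) {
        // Predict how user 3 would rate item 101, which they have not rated.
        System.out.println(predict(sampleRatings(), 3L, 101L));
    }
}
```

With the sample data, user 3's prediction for item 101 works out to 13/3, roughly 4.33: item 102 contributes (0.5 + 2) with weight 2 and item 103 contributes (3 + 5) with weight 1.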
  • 12. Clustering Basis
      • Clustering algorithms also use the notion of similarity, grouping similar items into a cluster.
      • Both collaborative filtering and clustering rely on a notion of distance, which can be calculated using a number of different techniques.
      • Example: Euclidean distance
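As an illustration of the distance notion above, the following sketch computes the Euclidean distance between two feature vectors in plain Java. The class name and the similarity mapping 1 / (1 + d) are assumptions chosen for the example; other mappings from distance to similarity are equally valid, and this is not presented as Mahout's own code.

```java
// Illustrative only: a hypothetical class, not part of Mahout.
final class EuclideanDistanceSketch {

    /** Square root of the summed squared differences between two equal-length vectors. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** One possible way to turn a distance into a similarity in (0, 1]; an assumption for this sketch. */
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        double[] p = {0.0, 0.0};
        double[] q = {3.0, 4.0};
        System.out.println(distance(p, q));   // 5.0
        System.out.println(similarity(p, p)); // 1.0
    }
}
```

Identical vectors have distance 0 and therefore similarity 1; the similarity falls toward 0 as the vectors move apart.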
  • 13. Mahout Taste Framework
      • Taste Collaborative Filtering:
      • Taste is an open source project for CF started by Sean Owen on SourceForge and donated to Mahout in 2008.
      • Has been applied successfully to a number of different data sets.
  • 14. Mahout Taste Framework (continued)
      • Mahout builds recommendation engines primarily on the Taste library.
      • The library supports both user-based and item-based recommendations.
      • Can be used from Java or over RESTful web-service endpoints.
  • 15. Taste Framework: Primary Classes
      • DataModel: model for users, items, and preferences
      • UserSimilarity: interface defining the similarity between two users
      • ItemSimilarity: interface defining the similarity between two items
      • Recommender: interface for providing recommendations
      • UserNeighborhood: interface for computing a neighborhood of similar users; neighborhoods are used by the recommenders
  • 16. Taste Framework: Online vs Offline
      • Can do online recommendations for a few thousand data sets.
      • Leverages Hadoop for offline recommendation calculations on large data sets.
  • 17. Understanding the Group Lens Implementation
      • Provides insight into a sample Mahout Taste framework implementation
      • Uses the publicly available GroupLens data set
      • Ships with the distribution, so you can analyze it, modify it, and use it as inspiration for your own implementation
      • An easy-to-follow example
  • 18. Group Lens Implementation Source
      • GroupLensDataModel.java
      • GroupLensRecommender.java
      • GroupLensRecommenderBuilder.java
      • GroupLensRecommenderEvaluatorRunner.java
  • 19. Group Lens Runner -- evaluator
      • Instantiates an evaluator: RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
      • This evaluator scores a recommender by mean absolute error: the average absolute difference between predicted and actual ratings
      • Parses input parameters: File ratingsFile = TasteOptionParser.getRatings(args);
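The metric behind the evaluator on this slide can be sketched in a few lines of plain Java. This is a stand-alone illustration of the average absolute difference, not the code inside AverageAbsoluteDifferenceRecommenderEvaluator; the class name and sample values are assumptions.

```java
// Illustrative only: a hypothetical class, not part of Mahout.
final class MeanAbsoluteErrorSketch {

    /** Average of |predicted[i] - actual[i]| over all held-out ratings. */
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / predicted.length;
    }

    public static void main(String[] args) {
        double[] predicted = {4.0, 3.0, 5.0}; // made-up predictions
        double[] actual = {5.0, 3.0, 3.0};    // made-up true ratings
        System.out.println(meanAbsoluteError(predicted, actual)); // 1.0
    }
}
```

A lower score is better: 0.0 would mean every prediction matched the held-out rating exactly.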
  • 20. Group Lens Runner -- data model
      • Builds the data model from a colon-delimited ratings file, falling back to the default constructor when no file is given: DataModel model = ratingsFile == null ? new GroupLensDataModel() : new GroupLensDataModel(ratingsFile);
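The sketch below shows what parsing one line of such a file might look like. It assumes the GroupLens ratings.dat layout of UserID::MovieID::Rating::Timestamp, which the slide does not spell out, and it is not the parsing code in GroupLensDataModel; the class and field names are hypothetical.

```java
// Illustrative only: a hypothetical parser, not Mahout's GroupLensDataModel.
final class RatingLineParserSketch {

    /** One parsed preference: who rated what, and how highly. */
    static final class Rating {
        final long userId;
        final long itemId;
        final float value;

        Rating(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Parses "user::item::rating[::timestamp]"; the timestamp, if present, is ignored. */
    static Rating parse(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected user::item::rating, got: " + line);
        }
        return new Rating(
                Long.parseLong(fields[0]),
                Long.parseLong(fields[1]),
                Float.parseFloat(fields[2]));
    }

    public static void main(String[] args) {
        Rating r = parse("1::1193::5::978300760");
        System.out.println(r.userId + " rated " + r.itemId + " as " + r.value);
    }
}
```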
  • 21. Group Lens Runner -- evaluate with recommendation builder
      • Evaluates using GroupLensRecommender: double evaluation = evaluator.evaluate(new GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);
      • Here 0.9 is the training percentage (the share of each user's preferences used to train) and 0.3 is the evaluation percentage (the share of users included in the evaluation)
  • 22. Questions?
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com