2. • A scalable machine learning library built on Hadoop, written in Java.
• It works in the areas of collaborative filtering, clustering, and classification. Many of the implementations use the Apache Hadoop platform (a minimal recommender sketch follows this slide).
• It gives Hadoop the ability to analyze data; in other words, it drives Hadoop toward data mining.
• “Machine learning is programming computers to optimize a performance criterion using example data and past experience.”
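To make the collaborative-filtering point concrete, here is a minimal sketch of a user-based recommender built with Mahout's Taste API. The file name ratings.csv, the user ID 1, the neighborhood size of 10, and the choice of Pearson correlation are illustrative assumptions, not from the slides.

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv (hypothetical): one "userID,itemID,preference" line per rating
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // consider the 10 most similar users as the neighborhood
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // top 3 recommended items for user 1
    for (RecommendedItem item : recommender.recommend(1, 3)) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}

Swapping PearsonCorrelationSimilarity for another UserSimilarity implementation changes the notion of "similar users" without touching the rest of the pipeline.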
3. Mahout: Key Points
• Takes the power of Apache Hadoop to solve complex problems by breaking them up into multiple parallel tasks (see the WordCount sketch after this slide).
• Stable release: 0.9 (1 February 2014)
• 9 October 2011: Mahout in Action released
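The "multiple parallel tasks" idea is exactly the MapReduce model Mahout inherits from Hadoop. As an illustration (not from the slides), here is the canonical Hadoop WordCount job: the framework runs many mapper tasks in parallel, one per input split, then reducers aggregate the partial results.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // each mapper processes one split of the input in parallel
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        ctx.write(word, ONE);
      }
    }
  }
  // reducers sum the partial counts emitted by the mappers
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}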
4. Why Mahout?
• Many open-source ML libraries either:
– Lack a community
– Lack documentation and examples
– Lack scalability
– Or are research-oriented
5. Hadoop
• The underlying technology was invented by Google back in their earlier days, so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users.
• There was nothing on the market that would let them do that, so they built their own platform. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that.
• Yahoo has played a key role in developing Hadoop for enterprise applications.
6. Hadoop Architecture
• Hadoop is designed to run on a large number of machines that
don’t share any memory or disks. That means you can buy a whole
bunch of commodity servers, slap them in a rack, and run the
Hadoop software on each one.
• When you want to load all of your organization's data into Hadoop, the software busts that data into pieces, which it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides (see the HDFS client sketch after this list).
• And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.
• Hadoop derives from Google's MapReduce and Google File System
papers.
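From the client's point of view, the block splitting and replication described above are invisible: you write one logical file and HDFS handles placement. Here is a minimal sketch using Hadoop's FileSystem API; the NameNode address hdfs://namenode:9000 and the path /data/example.txt are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // hypothetical NameNode address; adjust for your cluster
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);
    // the client writes one logical file; HDFS splits it into blocks,
    // spreads the blocks across DataNodes, and replicates each block
    Path file = new Path("/data/example.txt");
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hadoop");
    }
    System.out.println("replication factor: " + fs.getFileStatus(file).getReplication());
  }
}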
7. Current Status of Hadoop
• Facebook processes more than 500 TB of data daily: the site manages millions of photos and processes billions of likes each day. That's a whole lot of sharing.
• Hive, developed at Facebook, is the tool used for connecting with Hadoop through a SQL-like query language (see the JDBC sketch after this slide).
• Yahoo has a similar tool, Pig, a data-flow scripting language.
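For illustration, here is a sketch of querying Hadoop through Hive from Java over JDBC (HiveServer2). The host hiveserver, port 10000, and the page_views table are hypothetical; Hive compiles the SQL-like query into MapReduce jobs behind the scenes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // hypothetical HiveServer2 endpoint and table
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + " : " + rs.getLong(2));
      }
    }
  }
}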
8. How to Solve Common Business Problems
• Recommendation: user info + community info = recommendation
• Classification: e.g. mail spam filtering
• Clustering: making groups of similar data (see the nearest-centroid sketch below)
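As a toy illustration of the clustering idea (similar data grouped around centers), here is a single nearest-centroid assignment step using Mahout's math primitives; the centroid coordinates and the data point are made up for the example.

import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class NearestCentroidSketch {
  public static void main(String[] args) {
    EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
    // two hypothetical cluster centers
    Vector[] centroids = { new DenseVector(new double[] {0, 0}),
                           new DenseVector(new double[] {10, 10}) };
    // a data point to assign to its nearest center
    Vector point = new DenseVector(new double[] {9, 8});
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centroids.length; i++) {
      double d = measure.distance(centroids[i], point);
      if (d < bestDist) { bestDist = d; best = i; }
    }
    System.out.println("point belongs to cluster " + best); // prints cluster 1
  }
}

Repeating this assignment for every point, then recomputing each center as the mean of its points, is one iteration of k-means, which Mahout also ships as a scalable MapReduce job.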