SlideShare uma empresa Scribd logo
1 de 22
Baixar para ler offline
Machine Learning
on Big Data
with HADOOP

Albert Hui,
Associate Director, EPAM Systems
April 2013
Machine Learning on Big Data with HADOOP

Introduction

When I was doing my Master thesis in fuzzy
logic, I struggled with writing my owned
statistical algorithms as well as processing a
large set of data to train my AI (Artificial
intelligence) model. These tools I created are

1 │ EPAM SYSTEMS, INC.

widely available. Thanks for the advancement in
the software world, the forward-thinking
computer scientists and engineers to make
machine learning a reality
Machine Learning on Big Data with HADOOP

What is Machine Learning?
Supervised Machine learning
Supervised learning is tasked with learning a function from labeled training data in order to
predict the value of any valid input. Common examples of supervised learning include classifying e-mail messages as spam, labeling Web pages according to their genre, and recognizing
handwriting. Many algorithms are used to create supervised learners, the most common being
neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers.

Unsupervised Machine learning
Unsupervised learning is tasked with making sense of data without any examples of what is
correct or incorrect. It is most commonly used for clustering similar input into logical groups. It
also can be used to reduce the number of dimensions in a data set in order to focus on only the
most useful attributes, or to detect trends. Common approaches to unsupervised learning
include k-Means, hierarchical clustering, and self-organizing maps.

List A
Two main types of machine learning

This paper is an introduction to Machine Learning, and focuses on three objectives.
1. Shows you the trends and latest technologies in machine learning.
2. Helps you to start thinking how to better
use your data via the machine learning
techniques.
3. Demonstrates how to get started with
some sample codes and algorithms.

2 │ EPAM SYSTEMS, INC.

This paper also touches on a spectrum of technologies, mostly from the Apache Software
Foundation projects, e.g. Apache Hadoop,
Mahout, Lucene and Solr.
What is machine learning? First, machine learning is not a novel topic. In fact, it was the original
promise from the computer scientist when they
first invented computer software programming.
The long history shows scientists have not given
Machine Learning on Big Data with HADOOP

up the idea of using machines to help improving
our lives.
Machine Learning is a field of Artificial
Intelligence(AI), and AI textbooks define this
field as "the study and design of intelligent
agents“ where an intelligent agent is a system
that perceives its environment and takes actions
that maximize its chances of success. John
McCarthy, who coined the term in 1956, defines
it as "the science and engineering of making
intelligent machines, and is to program
computers to optimize a performance
criterion using example data or past
experience”. It is a field of combining efforts of
information technology, statistics, biology, linear
algebra and psychology etc. The modern
scientists divide machine learning into two major

types, Supervised and Unsupervised. This
paper focuses mainly on supervised machine
learning. And machine learning subtly differs
from data mining, which focuses on extracting
patterns – the unknown properties on the data.
But, machine learning focuses on prediction,
based on known properties learned from the
training data. Proudly, machine learning has
finally become closer to reality. IBM invented
Watson the computer to compete human brains
in a series of Jeopardy TV shows. The project
is called IBM Watson, it is the ultimate machine
learning example using natural language
processing method to demonstrate the power of
machine
learning,
see
link
http://www-03.ibm.com/innovation/us/watson/w
hat-is-watson/index.html.

Structured data: Walmart logs 1M transactions per hour; Boeing logs 640 terabytes from a
4-engine jumbo jet on one Atlantic crossing.

Unstructured data: Twitter products 90M tweets per day; Facebook creates 30Billions web
links, news, blogs, photo; YouTube has 10M+ views, uploads and download per day.

Social movement – as people getting richer, they become more literate that fuels the information growth

List D
Big Data Variety and Variability

What are the use cases of machine learning?
Believe or not, machine learning has been
integrated into our lives for years. Facebook is
able to generate tailored advertisement based
on how we are using it to connect with friends

3 │ EPAM SYSTEMS, INC.

and status/comments that you left there.
Amazon uses machine learning to recommend
you books and items that you mostly like. This
accounts for 30% of sales out of its recommendation engine. Google’s new piracy policy needs
Machine Learning on Big Data with HADOOP

you to agree on allowing them to integrate your
preference among its offered products, Gmail,
Google+, Google Map aiming to learn more
about you and to sell more to you.
Yahoo’s
spam mail detection and intelligent filtering is
also based on the clustering mechanism from
machine learning. An ordinary traditional
grocery British retailer Tesco collects 1.5 billion
piece of information every week in order to use

machine learning to adjust prices and promotions in real time. Not to mention it is mandatory
for all their executives to tweet about their company in a daily basis. There are no shortages of
use cases of machine learning; List B provides a
few more. The pioneers in the industry have
adopted so well using machine learning, now it
is our turn as corporate to respond this “big”
trend to further grow our businesses.

E-Tailing – product recommendation engines, cross channel analytics.

Retailing/consumer – Merchandizing and market basket analysis, campaign management and
optimization, market and consumer segmentation.
Financial Services – Next generation of Fraud detection and risk management, credit risk scoring decision making.
Government – Fraud detection, catching Bin laden, cybersecurity.
Web Scale – clickstream, social graph analysis, ad targeting/forecasting.
Telecommunication – customer churn management, Mobile behavior analysis.

List B
More examples of Use cases in machine learning

4 │ EPAM SYSTEMS, INC.
Machine Learning on Big Data with HADOOP

How Do We Use Machine
Learning to Improve Our
Business?
Before answering this question, let’s take a look
of fundamental question first. Machine learning
needs a lot of data, and actually it is a process to
turn a lot of data to make them smarter. Like in
my master thesis, my research was to optimize
machining processes using machine learning.
The fuzzy logic model that I developed was
designed to process and learn a lot of machining
data. The goal was to help people to make

better decisions. There are four steps in turning
data into smart data.
1. Collect your data;
2. Define the problem you are trying to solve;
3. Build your model to train using the data;
4. Use the model to predict and analyze the
results, and modify your model if necessary.

Advancement of technology – Digital devices soar as prices plummet, 4.6 billion mobile
phones, GPS Geospatial.

Internet – 2 billion people has access to Internet, Social networks, Digital clickstream.
Social movement – as people getting richer, they become more literate that fuels the information growth

List C
Reason of data explosion

The following discusses more details
on each step.

5 │ EPAM SYSTEMS, INC.

Step 1: Data Collection.
In this digital age, data collection is not easy.
Machine Learning on Big Data with HADOOP

We simply have too much data to collect and
process. This phenomenon is easy to explain.
The storage cost has fallen from $14M per TB in
1980 to $70 in 2010. CPU drops to $59 processor in 2009 for an AMD Radeon. Bandwidth
skyrockets from 200KB/s in 1980s’ to 4GB/s in
2010. All these factors create the “best” environment for generating a lot of data. Digital
universe estimates there will be 1.8 zetabytes (1
zetabyte = 1 billion terabyte) of data in 2015.
And by 2020, the world will have 35.2 zetabytes

growing by a factor of 44. List C contains a few
reasons for data explosion. And data today is
not just BIG. It is also very complex. There is 1.8
trillion gigabytes of data created in 2011 and
more than 90% is unstructured. It is approximately 500 quadrillion of files stored outside the
databases, and the quantity doubles every 2
years. The industry has defined it as “Era of Big
Data”, which is defined as datasets that grow so
large that they become awkward to work with
using relational databases.

Think big

Think outside the box

Think new revenue possibility

Think how to cut cost and reduce risk

List E
Think your machine learning problem

The Era of Big Data has posed some challenges
to machine learning. Many old technologies are
simply not scalable to read and process such
big data volume. However, with the advent of
Apache Hadoop, this problem has found a solution. Apache Hadoop, inspired by Google’s
white paper on Map/Reduced and Google File
System (GFS) and Big Table, is an Apache top
level project. It is open source. And it is
designed for large scale of data processing and
to deal with structured and unstructured data. It
is able to run on commodity hardware and since

6 │ EPAM SYSTEMS, INC.

it is based on Java, it is portable across heterogeneous hardware and OS platform. Why is it
best for machine learning? It is because Hadoop
is known for embarrassingly scalable as your
data grows. It is reportedly claimed that Hadoop
Map/Reduce is able to read and aggregate
petabyte of data 500X faster than other commercial software. When you have more data to
learn from, you can simply add more hardware
to the Hadoop cluster to scale. It is merely fraction of the cost comparing with other commercialized software. Corporations can simply
Machine Learning on Big Data with HADOOP

collect their data first, and have an extremely
scalable way to analyze them later. With Apache
Hadoop, it gives a new realm of possibility for

corporations to solve their machine learning
problems.

The sexy job in the next ten years will be statisticians… The ability to take data — to be able to
understand it, to process it, to extract value from it, to visualize it, to communicate it .
Hal Varian, Google Chief Economist

List F
Mahout provides statistics algorithm capability for machine learning

Step 2: Define your problem.
With Apache Hadoop, you can contextualize
your data by decomposing them in key-value
pair. For example, you can quickly learn how
many people are talking about apples, oranges
regarding fruit or windows and doors regarding
renovation. Or people are talking apple and
windows regarding Apple’s Ipad. And you can
analyze terabyte of clickstream data to form
relationship between pages and user’s behaviours and buying decision. It can also marry both
structured and unstructured data to gain broader insight of customers’ buying behaviours,
which was before nearly impossible.

(means elephant driver in Hindi) to help develop
your machine learning models. Apache Mahout
is an open source project by the Apache Software Foundation (ASF) with the primary goal of
creating scalable machine-learning algorithms
that are free to use under the Apache license. It
uses the Apache Hadoop library to enable
Mahout to scale effectively in a Hadoop cluster
cloud. Mahout contains four major implementations, they are:
1. Clustering,
2. Classification,
3. Collaborative filtering (recommendations)
and
4. Frequent-pattern mining.

Step 3: Build your model to train
using the data you collected.
Now, it is time to introduce Apache Mahout

The power of Mahout is that it contains a list of
well established statistical models. If you type
bin/mahout at the command line, you will see
the following list:

>canopy: : Canopy clustering
>cleansvd: : Cleanup and verification of SVD output
>clusterdump: : Dump cluster output to text

7 │ EPAM SYSTEMS, INC.
Machine Learning on Big Data with HADOOP

>dirichlet: : Dirichlet Clustering
>fkmeans: : Fuzzy K-means clustering
>fpg: : Frequent Pattern Growth
>itemsimilarity: : Compute the item-item-similarities for item-based collaborative
filtering
>kmeans: : K-means clustering
>lda: : Latent Dirchlet Allocation
>ldatopics: : LDA Print Topics
>lucene.vector: : Generate Vectors from a Lucene index
>matrixmult: : Take the product of two matrices
>meanshift: : Mean Shift clustering
>recommenditembased: : Compute recommendations using item-based collaborative filtering

8 │ EPAM SYSTEMS, INC.
Machine Learning on Big Data with HADOOP

Clustering

Figure 1
K-Mean Clustering

Let’s go a little deeper into each of
four implementations to help you get
started.
Given large data sets, whether they are text or
numeric, it is often useful to group together, or
cluster, similar items automatically. For
instance, given all of the news for the day from
all of the newspapers in the United States, you
might want to group all of the articles about the
same story together automatically; you can then
choose to focus on specific clusters and stories
without needing to wade through a lot of unrelated ones. Another example: Given the output
from sensors on a machine over time, you could
cluster the outputs to determine normal versus
problematic operation, because normal opera-

9 │ EPAM SYSTEMS, INC.

tions would all cluster together and abnormal
operations would be in outlying clusters. Clustering calculates the similarity between items in
the collection, but its only job is to group together similar items. In many implementations of
clustering, items in the collection are represented as vectors in an n-dimensional space. Given
the vectors in Figure 1, one can calculate the
distance between three items using measures
such as the Manhattan Distance, Euclidean
distance, or cosine similarity. Then, the actual
clusters can be calculated by grouping together
the items that are close in distance.
Popular approaches include k-Means and hierarchical clustering. Mahout comes with several
different clustering approaches.
Machine Learning on Big Data with HADOOP

Step a: Convert dataset into a Hadoop Sequence File.

$ bin/mahout seqdirectory –I mahout-work/reuters-out –o mahout-work/reuters-out-seqdir –c
UTF-8 –chunk 5

In the sequence of files, each record is a <key,
value> pair
Key=record name or file name or unique
identifier
Value = content as UTF-8 encoded string.

Step b: Writing to the sequence files
Step c: Generate Vectors using the sequence
files by running run:
$ bin/mahout seq2sparse

Step d: Start k-mean Clustering

$ bin/mahout kmeans -I mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/
mahout-work/reuters-kmeans-clusters -o mahout-work/reuters-kmeans -dm
org.apache.mahout.distance.CosineDistanceMeasure –cd 0.1 –x 10 –k 20 –ow

-c

Figure 2
Clustering: understanding data as vectors

There are many clustering algorithm available
in Mahout. They are Canopy, fuzzy k-Means,

10 │ EPAM SYSTEMS, INC.

Mean shift, Dirichlet process clustering and
spectral clustering.
Machine Learning on Big Data with HADOOP

Classification
The goal of classification (often also called categorization) is to label unseen documents, thus
grouping them together. Many classification
approaches in machine learning calculate a
variety of statistics that associate the features of
a document with the specified label, thus creating a model that can be used later to classify
unseen documents. For example, a simple
approach to classification might keep track of
the words associated with a label, as well as the
number of times those words are seen for a
given label. Then, when a new document is
classified, the words in the document are looked
up in the model, probabilities are calculated, and
the best result is output, usually along with a
score indicating the confidence t he result is
correct. Features for classification might include
words, weights for those words (based on
frequency, for instance), parts of speech, and so
on. Of course, features really can be anything
that helps associate a document with a label

Figure 3
Playing Tennis, example for classification

11 │ EPAM SYSTEMS, INC.

and can be incorporated into the algorithm.
Mahout has several implementations, Naïve
Bayes, Complementary Naïve Bayes, Decision
Forests and Logistic Regression.
The most popular classification is Naïve Bayes
which is based on condition probability of the
class given an instance. And:

where Evidence E is the instance and Event H
value for the instance.
For example: The following is to play tennis
Event H and the weather conditions E are the
evidences for the instance.
Machine Learning on Big Data with HADOOP

Based on the above dataset, we can first train the classifier
$MAHOUT_HOME/bin/mahout tennnisclassifier –i 20news-input/bayes-train-input
playtennismodel -type bayes -ng 3 –source hdfs

After the classifer is built and trained, we can
test them using Mahout by entering a new day
with its evidences.

-o

Temp=Cool, Humidity=High, Windy = true.
Hence, use the model to determine the probability of yes playing tennis vs probability of No playing tennis.

Use a new day looks like: Outlook = Sunny,

$MAHOUT_HOME/bin/mahout tennnisclassifier –m playtennismodel
3 –source hdfs
-method mapreduce

-d 20-input –type bayes –ng

[java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: -------------[java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: Testing:
example/rd/prepared-test/ playingTennis.txt
[java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: playingTennis

70.0

21/30.0
[java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: -------------[java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: Testing:
example /rd/prepared-test/ NotPlayingTennis.txt
[java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: notPlayingTennis
81.3953488372093
35/43.0
[java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier:
[java] Summary
[java] ------------------------------------------------------[java] Correctly Classified Instances

:

9

79.5%

[java] Incorrectly Classified Instances

:

5

20.5%

[java] Total Classified Instances

:

14

[java]
[java] =======================================================
[java] Confusion Matrix
[java] ------------------------------------------------------[java] a

b

<--Classified as

[java] 21

9

|

30

a

= playingTennis

[java] 8

35

|

43

b

= NotPlayingTennis

[java] Default Category: unknown: 2

12 │ EPAM SYSTEMS, INC.
Machine Learning on Big Data with HADOOP

Figure 4
Summary Result of Playing Tennis after Naïve Bayes

13 │ EPAM SYSTEMS, INC.
Machine Learning on Big Data with HADOOP

Collaborative Filtering
Collaborative filtering (aka Recommendation) is
a technique, popularized by Amazon and others,
that uses user information such as ratings,
clicks, and purchases to provide recommendations to other users of their site. Collaborative
filtering is often used to recommend consumer

items such as books, music, and movies, but it
is also used in other applications where multiple
actors need to collaborate to narrow down data.
Chances are you've seen Collaborative filtering
in action the last time you were on Amazon.

Figure 5
Amazon’s recommendation

Given a set of users and items, Collaborative
filtering applications provide recommendations
to the current user of the system. Four ways of
generating recommendations are typical:
• User-based: Recommend items by finding
similar users. This is often harder to scale
because of the dynamic nature of users.
• Item-based: Calculate similarity between
items and make recommendations. Items
usually don't change much, so this often can
be computed offline.

14 │ EPAM SYSTEMS, INC.

• Model-based: Provide recommendations
based on developing a model of users and
their ratings.
All Collaborative filtering approaches end up
calculating a notion of similarity between users
and their rated items. There are many ways to
compute similarity, and most Collaborative filtering systems allow you to plug in different measures so that you can determine which one works
best for your data.
Machine Learning on Big Data with HADOOP

Mahout currently provides tools for building a
recommendation engine through the Taste
library; it is a very flexible and fast engine for
recommendation. And it is a Java-based. The
library comes with classes that you can inherit
the interfaces and properties. Taste supports
both user-based and item-based recommendations and comes with many choices for making
recommendations, as well as interfaces for you
to define your own. Taste consists of five primary components that work with Users, Items and
Preferences:
• DataModel: Storage for Users, Items, and
Preferences.
• UserSimilarity: Interface defining the
similarity between two users.

• ItemSimilarity: Interface defining the
similarity between two items.
• Recommender: Interface for providing
recommendations.
• UserNeighborhood: Interface for
computing a neighborhood of similar users
that can then be used by the
Recommenders.
For example, to use Taste for book recommendation using a Item-based recommendation, the
first step is to load the data containing the
recommendations and store them into the DataModel storing the users, Item book ID and their
preferences.

//create the data model for the user preferences
FileDataModel dataModel = new FileDataModel(new File(bookFile));
UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
// Optional:
userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(dataModel));

The second step is to construct a LogLikeligoodSimilarity and a Recommender. The
UserNeighborhood identifies users similar to the

users and is handed off to the Recommender to
create a ranked list of recommended books.

//create the data model
FileDataModel dataModel = new FileDataModel(new File(recsFile));
//Create an ItemSimilarity
ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel);
//Create an Item Based Recommender
ItemBasedRecommender recommender =
new GenericItemBasedRecommender(dataModel, itemSimilarity);
//Get the recommendations
List<RecommendedItem> recommendations =
recommender.recommend(userId, 5);
TasteUtils.printRecs(recommendations, handler.map);

15 │ EPAM SYSTEMS, INC.
Machine Learning on Big Data with HADOOP

Figure 6
Frequent Pattern Mining: FP – Growth Algorithm

Running the main.java class will produce the following output.
[echo] Getting similar items for user: Machine Learning with Mahout with a ItemSimilarity
of 5
[java] 09/08/20 08:13:51 INFO file.FileDataModel: Creating FileDataModel
for file src/main/resources/recommendations.txt
[java] 09/08/20 08:13:51 INFO file.FileDataModel: Reading file info...
[java] 09/08/20 08:13:51 INFO file.FileDataModel: Processed 100000 lines
[java] 09/08/20 08:13:51 INFO file.FileDataModel: Read lines: 111901
[java] Data Model: Machine Learning with Mahout
[java] ----[java] Title: Mahout 21 Rating: 3.630000066757202
[java] Title: Action in Mahout Rating: 2.703000068664551
[java] Title: Big Data Rating: 4.230000019073486
[java] Title: Correction Rating: 5.0
[java] Title: Abraham Lincoln Rating: 4.739999771118164
[java] Title: Data intelligence in bigdata: 3.430000066757202
[java] Title: Boston consulting in bigdata Rating: 2.009999990463257
[java] Title: Atlanta, Georgia Rating: 4.429999828338623
[java] Recommendations list below:
[java] Doc Id: 50575 Title: April 10 Score: 4.98
[java] Doc Id: 134101348 Title: April 26 Score: 4.860541

16 │ EPAM SYSTEMS, INC.
Machine Learning on Big Data with HADOOP

[java] Doc Id: 133445748 Title: Mahout In action Score: 4.4308662
[java] Doc Id: 1193764 Title: Sam Doe Score: 4.404066
[java] Doc Id: 2417937 Title: Mike Olson Score: 4.24178

After processing the item of ‘Machine Learning with Mahout”, the system recommended several
more books with various levels of confidence.

17 │ EPAM SYSTEMS, INC.
Machine Learning on Big Data with HADOOP

Frequent Pattern Mining
Frequent patterns are itemsets and subsequence itemsets that appear in a data set with
frequency no less than a user-specified threshold. For example, a set of items, such as milk
and bread, that appear frequently together in a
transaction data set, is a frequent itemset. A
subsequence, such as buying first a PC, then a
digital camera, and then a memory card, if it
occurs frequently in a shopping history database, is a (frequent) sequential pattern. Finding
frequent patterns plays an essential role in
machine learning associations, correlations,
and many other interesting relationships among
data.
Frequent pattern mining is very powerful to analyze customer buying habits by finding associa-

tions between the different items that customers
place in their “shopping baskets”. For instance,
if customers are buying milk, how likely are they
going to also buy cereal (and what kind of
cereal) on the same trip to the supermarket?
Such information can lead to increased sales by
helping retailers do selective marketing and
arrange their shelf space.
FP-Growth Tree Algorithm is a high speed
frequent item-set mining algorithm. It builds a
prefix tree from an input data set which represents the data-set in a compressed form.
Frequent item-sets can be pulled from the tree
by a process pattern fragment growth involving
projection of the FP-tree

$ hadoop jar fp-growth-bell_mobility -0.4-job.jar
org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -i downloads-input -o reco-patterns-output
-k 50 - method mapreduce -g 500 -regex '[ ]' -s 5

Apache Mahout has been rapidly adapted by
the industry in the last two years. And it has
continued to move forward in a number of ways.
The current release has included many algorithms that can help solve real world problems.
The community's primary focus at the moment is
on pushing toward a 1.0 release by doing performance testing, documentation, API improvement, and the addition of new algorithms. And
the long term plan is to start looking at distributed, in-memory approaches to solving
machine-learning problems.
It will make
Mahout powerful and attractive to the machine
learning practitioners. Mahout is well positioned
to help solve today's most pressing big-data
problems by focusing in on scalability and
making it easier to consume complicated
machine-learning algorithms.

18 │ EPAM SYSTEMS, INC.

Step 4: Put the model to use and
fine-tune it.
It is fair to say machine learning is a trial and
error process.There is no cookie cutting solution
to a problem. When solving the basket analysis
problem for a Canadian’s biggest Drug store
chain, three different algorithms, K-means,
Latent Dirichlet Allocation and FP-Growth tree
were put into a trial and error process for a few
months to land on a right algorithm. It is not an
easy problem and requires a lot of efforts and
patience.
Thankfully, technology has helped
the data churning process and made the
process less painful.
Machine Learning on Big Data with HADOOP

Summary of Apache
Mahout
Apache Mahout has gained popularity in the last
two years. Like the name of mahout suggests, a
real mahout (elephant driver) leverages the
strength and capabilities of the elephant, so too
can Apache Mahout help you leverage the
strength and capabilities of the yellow elephant
of Apache Hadoop. Just within 2011, many user
groups, forums, blogs, webinars and even
start-up companies have sprout up around this
new technology. And its community is becoming
larger and stronger everyday to incorporate new

19 │ EPAM SYSTEMS, INC.

algorithms with increasing capabilities for solving clustering, categorization, and collaborative
filtering, frequent item mining or other machine
learning problems. Observing the industry
involvements and responses, in my opinion,
Mahout likely will become the standard technology of machine learning on Big Data. Big technology firms like IBM, Oracle will incorporate
Mahout in their product offerings to just acquire
the user base.
Machine Learning on Big Data with HADOOP

Remarks
Machine learning is definitely an exciting application that helps you to tap on the power of big
data. As for corporate data continues to grow
bigger and more complex, machine learning will
become even more attractive. The industry has
come up elegant solutions to help corporations
to solve this problem. Let’s get ready; it is just a
matter time this problem arrives at your desk.

20 │ EPAM SYSTEMS, INC.
Established in 1993, EPAM Systems (NYSE: EPAM) provides complex software engineering solutions through its award-winning
Central and Eastern European service delivery platform. Headquartered in the United States, EPAM employs approximately
8,500 IT professionals and serves clients worldwide from its locations in the United States, Canada, UK, Switzerland, Germany,
Sweden, Belarus, Hungary, Russia, Ukraine, Kazakhstan, and Poland.
EPAM is recognized among the top companies in IAOP’s “The 2012 Global Outsourcing 100”, featuring EPAM in a variety of
sub-lists, including “Leaders – Companies in Eastern Europe”. The company is also ranked among the best global service providers on “The 2012 Global Services 100” by Global Services Magazine and Neogroup, which names EPAM “Leaders - Global
Product Development” category.
For more information, please visit www.epam.com

G l ob al

EU

CIS

41 University Drive Suite 202,
Newtown (PA), 18940, USA
Phone: +1-267-759-9000
Fax: +1-267-759-8989

Corvin Offices I. Futó street 47-53
Budapest, H-1082, Hungary
Phone: +36-1-327-7400
Fax: +36-1-577-2384

9th Radialnaya Street, bldg. 2
Moscow, 115404, Russia
Phone: +7-495-730-6360
Fax: +7-495-730-6361

© 1993-2013 EPAM Systems. All Rights Reserved.

Mais conteúdo relacionado

Mais procurados

Machine Learning and Power AI Workshop v4
Machine Learning and Power AI Workshop v4Machine Learning and Power AI Workshop v4
Machine Learning and Power AI Workshop v4LennartF
 
The New Era of Cognitive Computing
The New Era of Cognitive ComputingThe New Era of Cognitive Computing
The New Era of Cognitive ComputingIBM Research
 
Km cognitive computing overview by ken martin 19 jan2015
Km   cognitive computing overview by ken martin 19 jan2015Km   cognitive computing overview by ken martin 19 jan2015
Km cognitive computing overview by ken martin 19 jan2015HCL Technologies
 
What I learned about AI, ML and Blockchain from one Wired conference!
What I learned about AI, ML and Blockchain from one Wired conference!What I learned about AI, ML and Blockchain from one Wired conference!
What I learned about AI, ML and Blockchain from one Wired conference!John Powers
 
Big data and AI presentation slides
Big data and AI presentation slidesBig data and AI presentation slides
Big data and AI presentation slidesCloudxLab
 
Cognitive Computing and the future of Artificial Intelligence
Cognitive Computing and the future of Artificial IntelligenceCognitive Computing and the future of Artificial Intelligence
Cognitive Computing and the future of Artificial IntelligenceVarun Singh
 
centurylink-business-technology-2020-ebook-br141403
centurylink-business-technology-2020-ebook-br141403centurylink-business-technology-2020-ebook-br141403
centurylink-business-technology-2020-ebook-br141403Miriam Moreno
 
Big data v4.0
Big data v4.0Big data v4.0
Big data v4.0Ian Brown
 
Machine Learning Introduction for Digital Business Leaders
Machine Learning Introduction for Digital Business LeadersMachine Learning Introduction for Digital Business Leaders
Machine Learning Introduction for Digital Business LeadersSudha Jamthe
 
Centurylink Business Technology in 2020 ebook
Centurylink Business Technology in 2020 ebookCenturylink Business Technology in 2020 ebook
Centurylink Business Technology in 2020 ebookJake Weaver
 
Wiki stage 20151128 - v001
Wiki stage   20151128 - v001Wiki stage   20151128 - v001
Wiki stage 20151128 - v001Enrico Busto
 
Introduction To Machine Learning | Edureka
Introduction To Machine Learning | EdurekaIntroduction To Machine Learning | Edureka
Introduction To Machine Learning | EdurekaEdureka!
 

Mais procurados (18)

Machine Learning and Power AI Workshop v4
Machine Learning and Power AI Workshop v4Machine Learning and Power AI Workshop v4
Machine Learning and Power AI Workshop v4
 
Information Overload Phenomena
Information Overload PhenomenaInformation Overload Phenomena
Information Overload Phenomena
 
Cognitive computing
Cognitive computing Cognitive computing
Cognitive computing
 
The New Era of Cognitive Computing
The New Era of Cognitive ComputingThe New Era of Cognitive Computing
The New Era of Cognitive Computing
 
Impact of data overloading on productivity
Impact of data overloading on productivityImpact of data overloading on productivity
Impact of data overloading on productivity
 
Km cognitive computing overview by ken martin 19 jan2015
Km   cognitive computing overview by ken martin 19 jan2015Km   cognitive computing overview by ken martin 19 jan2015
Km cognitive computing overview by ken martin 19 jan2015
 
What I learned about AI, ML and Blockchain from one Wired conference!
What I learned about AI, ML and Blockchain from one Wired conference!What I learned about AI, ML and Blockchain from one Wired conference!
What I learned about AI, ML and Blockchain from one Wired conference!
 
Big data and AI presentation slides
Big data and AI presentation slidesBig data and AI presentation slides
Big data and AI presentation slides
 
From Big Data to AI
From Big Data to AIFrom Big Data to AI
From Big Data to AI
 
Cognitive Computing and the future of Artificial Intelligence
Cognitive Computing and the future of Artificial IntelligenceCognitive Computing and the future of Artificial Intelligence
Cognitive Computing and the future of Artificial Intelligence
 
Cognitive Automation - Your AI Coworker
Cognitive Automation - Your AI CoworkerCognitive Automation - Your AI Coworker
Cognitive Automation - Your AI Coworker
 
centurylink-business-technology-2020-ebook-br141403
centurylink-business-technology-2020-ebook-br141403centurylink-business-technology-2020-ebook-br141403
centurylink-business-technology-2020-ebook-br141403
 
Big Data
Big DataBig Data
Big Data
 
Big data v4.0
Big data v4.0Big data v4.0
Big data v4.0
 
Machine Learning Introduction for Digital Business Leaders
Machine Learning Introduction for Digital Business LeadersMachine Learning Introduction for Digital Business Leaders
Machine Learning Introduction for Digital Business Leaders
 
Centurylink Business Technology in 2020 ebook
Centurylink Business Technology in 2020 ebookCenturylink Business Technology in 2020 ebook
Centurylink Business Technology in 2020 ebook
 
Wiki stage 20151128 - v001
Wiki stage   20151128 - v001Wiki stage   20151128 - v001
Wiki stage 20151128 - v001
 
Introduction To Machine Learning | Edureka
Introduction To Machine Learning | EdurekaIntroduction To Machine Learning | Edureka
Introduction To Machine Learning | Edureka
 

Destaque

Matematicas - cap 3
Matematicas - cap 3Matematicas - cap 3
Matematicas - cap 3Deiner10
 
Developing and deploying big data machine learning models
Developing and deploying big data machine learning modelsDeveloping and deploying big data machine learning models
Developing and deploying big data machine learning modelsNarayana Swamy
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointJoris Poelmans
 
L9. Real World Machine Learning - Cooking Predictions
L9. Real World Machine Learning - Cooking PredictionsL9. Real World Machine Learning - Cooking Predictions
L9. Real World Machine Learning - Cooking PredictionsMachine Learning Valencia
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
Machine Learning with Big Data using Apache Spark
Machine Learning with Big Data using Apache SparkMachine Learning with Big Data using Apache Spark
Machine Learning with Big Data using Apache SparkInSemble
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector Yahoo Developer Network
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyChris Johnson
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 

Destaque (12)

Matematicas - cap 3
Matematicas - cap 3Matematicas - cap 3
Matematicas - cap 3
 
Developing and deploying big data machine learning models
Developing and deploying big data machine learning modelsDeveloping and deploying big data machine learning models
Developing and deploying big data machine learning models
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePoint
 
L9. Real World Machine Learning - Cooking Predictions
L9. Real World Machine Learning - Cooking PredictionsL9. Real World Machine Learning - Cooking Predictions
L9. Real World Machine Learning - Cooking Predictions
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
Machine Learning with Big Data using Apache Spark
Machine Learning with Big Data using Apache SparkMachine Learning with Big Data using Apache Spark
Machine Learning with Big Data using Apache Spark
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at Spotify
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 

Semelhante a Machine Learning on Big Data with HADOOP

Interview for saby upadhyay
Interview for  saby upadhyayInterview for  saby upadhyay
Interview for saby upadhyayAnthonyBennet
 
Interview for saby upadhyay
Interview for  saby upadhyayInterview for  saby upadhyay
Interview for saby upadhyayCameronDonovan
 
Digital Disruptions and Transformation
Digital Disruptions and TransformationDigital Disruptions and Transformation
Digital Disruptions and TransformationMohammad Faiz
 
In the Dark? Understanding Big Data & AI: Talent Acquisition Strategies for 2018
In the Dark? Understanding Big Data & AI: Talent Acquisition Strategies for 2018In the Dark? Understanding Big Data & AI: Talent Acquisition Strategies for 2018
In the Dark? Understanding Big Data & AI: Talent Acquisition Strategies for 2018Yoh Staffing Solutions
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial IntelligenceEnes Bolfidan
 
Top 10 renowned big data companies
Top 10 renowned big data companiesTop 10 renowned big data companies
Top 10 renowned big data companiesRobert Smith
 
Popular Machine Learning Myths
Popular Machine Learning Myths Popular Machine Learning Myths
Popular Machine Learning Myths Rock Interview
 
The-Business-of-Artificial-Intelligence.pdf
The-Business-of-Artificial-Intelligence.pdfThe-Business-of-Artificial-Intelligence.pdf
The-Business-of-Artificial-Intelligence.pdfShaikhZarin
 
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSIS
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSISUNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSIS
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSISpijans
 
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSIS
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSISUNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSIS
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSISpijans
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learningGovind Mudumbai
 
JIMS Rohini IT Flash Monthly Newsletter - October Issue
JIMS Rohini IT Flash Monthly Newsletter  - October IssueJIMS Rohini IT Flash Monthly Newsletter  - October Issue
JIMS Rohini IT Flash Monthly Newsletter - October IssueJIMS Rohini Sector 5
 
Hype vs. Reality: The AI Explainer
Hype vs. Reality: The AI ExplainerHype vs. Reality: The AI Explainer
Hype vs. Reality: The AI ExplainerLuminaryLabs1
 

Semelhante a Machine Learning on Big Data with HADOOP (20)

Interview for saby upadhyay
Interview for  saby upadhyayInterview for  saby upadhyay
Interview for saby upadhyay
 
Interview for saby upadhyay
Interview for  saby upadhyayInterview for  saby upadhyay
Interview for saby upadhyay
 
Digital Disruptions and Transformation
Digital Disruptions and TransformationDigital Disruptions and Transformation
Digital Disruptions and Transformation
 
In the Dark? Understanding Big Data & AI: Talent Acquisition Strategies for 2018
In the Dark? Understanding Big Data & AI: Talent Acquisition Strategies for 2018In the Dark? Understanding Big Data & AI: Talent Acquisition Strategies for 2018
In the Dark? Understanding Big Data & AI: Talent Acquisition Strategies for 2018
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
ANALYTICS OF DATA USING HADOOP-A REVIEW
ANALYTICS OF DATA USING HADOOP-A REVIEWANALYTICS OF DATA USING HADOOP-A REVIEW
ANALYTICS OF DATA USING HADOOP-A REVIEW
 
Intro to AI.pptx
Intro to AI.pptxIntro to AI.pptx
Intro to AI.pptx
 
Benefits of big data
Benefits of big dataBenefits of big data
Benefits of big data
 
Salesforce - AI for CRM
Salesforce - AI for CRMSalesforce - AI for CRM
Salesforce - AI for CRM
 
AI for CRM e-book
AI for CRM e-bookAI for CRM e-book
AI for CRM e-book
 
Top 10 renowned big data companies
Top 10 renowned big data companiesTop 10 renowned big data companies
Top 10 renowned big data companies
 
Big data
Big data Big data
Big data
 
Big data
Big dataBig data
Big data
 
Popular Machine Learning Myths
Popular Machine Learning Myths Popular Machine Learning Myths
Popular Machine Learning Myths
 
The-Business-of-Artificial-Intelligence.pdf
The-Business-of-Artificial-Intelligence.pdfThe-Business-of-Artificial-Intelligence.pdf
The-Business-of-Artificial-Intelligence.pdf
 
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSIS
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSISUNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSIS
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSIS
 
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSIS
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSISUNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSIS
UNCOVERING FAKE NEWS BY MEANS OF SOCIAL NETWORK ANALYSIS
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
JIMS Rohini IT Flash Monthly Newsletter - October Issue
JIMS Rohini IT Flash Monthly Newsletter  - October IssueJIMS Rohini IT Flash Monthly Newsletter  - October Issue
JIMS Rohini IT Flash Monthly Newsletter - October Issue
 
Hype vs. Reality: The AI Explainer
Hype vs. Reality: The AI ExplainerHype vs. Reality: The AI Explainer
Hype vs. Reality: The AI Explainer
 

Mais de EPAM Systems

Halloween Costumes Spendings
Halloween Costumes SpendingsHalloween Costumes Spendings
Halloween Costumes SpendingsEPAM Systems
 
How Americans are buying Halloween Costumes?
How Americans are buying Halloween Costumes?How Americans are buying Halloween Costumes?
How Americans are buying Halloween Costumes?EPAM Systems
 
Most popular Halloween costumes for Pets
Most popular Halloween costumes for PetsMost popular Halloween costumes for Pets
Most popular Halloween costumes for PetsEPAM Systems
 
Most Popular Halloween Costumes for Adults
Most Popular Halloween Costumes for AdultsMost Popular Halloween Costumes for Adults
Most Popular Halloween Costumes for AdultsEPAM Systems
 
Halloween infographic
Halloween infographicHalloween infographic
Halloween infographicEPAM Systems
 
EPAM Cloud Problem Resolution Consulting
EPAM Cloud Problem Resolution ConsultingEPAM Cloud Problem Resolution Consulting
EPAM Cloud Problem Resolution ConsultingEPAM Systems
 
Migration to the cloud
Migration to the cloudMigration to the cloud
Migration to the cloudEPAM Systems
 

Mais de EPAM Systems (8)

Halloween Costumes Spendings
Halloween Costumes SpendingsHalloween Costumes Spendings
Halloween Costumes Spendings
 
How Americans are buying Halloween Costumes?
How Americans are buying Halloween Costumes?How Americans are buying Halloween Costumes?
How Americans are buying Halloween Costumes?
 
Most popular Halloween costumes for Pets
Most popular Halloween costumes for PetsMost popular Halloween costumes for Pets
Most popular Halloween costumes for Pets
 
Most Popular Halloween Costumes for Adults
Most Popular Halloween Costumes for AdultsMost Popular Halloween Costumes for Adults
Most Popular Halloween Costumes for Adults
 
Halloween infographic
Halloween infographicHalloween infographic
Halloween infographic
 
CMS Integration
CMS IntegrationCMS Integration
CMS Integration
 
EPAM Cloud Problem Resolution Consulting
EPAM Cloud Problem Resolution ConsultingEPAM Cloud Problem Resolution Consulting
EPAM Cloud Problem Resolution Consulting
 
Migration to the cloud
Migration to the cloudMigration to the cloud
Migration to the cloud
 

Último

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 

Último (20)

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

Machine Learning on Big Data with HADOOP

  • 1. Machine Learning on Big Data with HADOOP Albert Hui, Associate Director, EPAM Systems April 2013
  • 2. Machine Learning on Big Data with HADOOP Introduction When I was doing my Master thesis in fuzzy logic, I struggled with writing my owned statistical algorithms as well as processing a large set of data to train my AI (Artificial intelligence) model. These tools I created are 1 │ EPAM SYSTEMS, INC. widely available. Thanks for the advancement in the software world, the forward-thinking computer scientists and engineers to make machine learning a reality
  • 3. Machine Learning on Big Data with HADOOP What is Machine Learning? Supervised Machine learning Supervised learning is tasked with learning a function from labeled training data in order to predict the value of any valid input. Common examples of supervised learning include classifying e-mail messages as spam, labeling Web pages according to their genre, and recognizing handwriting. Many algorithms are used to create supervised learners, the most common being neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers. Unsupervised Machine learning Unsupervised learning is tasked with making sense of data without any examples of what is correct or incorrect. It is most commonly used for clustering similar input into logical groups. It also can be used to reduce the number of dimensions in a data set in order to focus on only the most useful attributes, or to detect trends. Common approaches to unsupervised learning include k-Means, hierarchical clustering, and self-organizing maps. List A Two main types of machine learning This paper is an introduction to Machine Learning, and focuses on three objectives. 1. Shows you the trends and latest technologies in machine learning. 2. Helps you to start thinking how to better use your data via the machine learning techniques. 3. Demonstrates how to get started with some sample codes and algorithms. 2 │ EPAM SYSTEMS, INC. This paper also touches on a spectrum of technologies, mostly from the Apache Software Foundation projects, e.g. Apache Hadoop, Mahout, Lucene and Solr. What is machine learning? First, machine learning is not a novel topic. In fact, it was the original promise from the computer scientist when they first invented computer software programming. The long history shows scientists have not given
  • 4. Machine Learning on Big Data with HADOOP up the idea of using machines to help improving our lives. Machine Learning is a field of Artificial Intelligence(AI), and AI textbooks define this field as "the study and design of intelligent agents“ where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success. John McCarthy, who coined the term in 1956, defines it as "the science and engineering of making intelligent machines, and is to program computers to optimize a performance criterion using example data or past experience”. It is a field of combining efforts of information technology, statistics, biology, linear algebra and psychology etc. The modern scientists divide machine learning into two major types, Supervised and Unsupervised. This paper focuses mainly on supervised machine learning. And machine learning subtly differs from data mining, which focuses on extracting patterns – the unknown properties on the data. But, machine learning focuses on prediction, based on known properties learned from the training data. Proudly, machine learning has finally become closer to reality. IBM invented Watson the computer to compete human brains in a series of Jeopardy TV shows. The project is called IBM Watson, it is the ultimate machine learning example using natural language processing method to demonstrate the power of machine learning, see link http://www-03.ibm.com/innovation/us/watson/w hat-is-watson/index.html. Structured data: Walmart logs 1M transactions per hour; Boeing logs 640 terabytes from a 4-engine jumbo jet on one Atlantic crossing. Unstructured data: Twitter products 90M tweets per day; Facebook creates 30Billions web links, news, blogs, photo; YouTube has 10M+ views, uploads and download per day. Social movement – as people getting richer, they become more literate that fuels the information growth List D Big Data Variety and Variability What are the use cases of machine learning? Believe or not, machine learning has been integrated into our lives for years. Facebook is able to generate tailored advertisement based on how we are using it to connect with friends 3 │ EPAM SYSTEMS, INC. and status/comments that you left there. Amazon uses machine learning to recommend you books and items that you mostly like. This accounts for 30% of sales out of its recommendation engine. Google’s new piracy policy needs
  • 5. Machine Learning on Big Data with HADOOP you to agree on allowing them to integrate your preference among its offered products, Gmail, Google+, Google Map aiming to learn more about you and to sell more to you. Yahoo’s spam mail detection and intelligent filtering is also based on the clustering mechanism from machine learning. An ordinary traditional grocery British retailer Tesco collects 1.5 billion piece of information every week in order to use machine learning to adjust prices and promotions in real time. Not to mention it is mandatory for all their executives to tweet about their company in a daily basis. There are no shortages of use cases of machine learning; List B provides a few more. The pioneers in the industry have adopted so well using machine learning, now it is our turn as corporate to respond this “big” trend to further grow our businesses. E-Tailing – product recommendation engines, cross channel analytics. Retailing/consumer – Merchandizing and market basket analysis, campaign management and optimization, market and consumer segmentation. Financial Services – Next generation of Fraud detection and risk management, credit risk scoring decision making. Government – Fraud detection, catching Bin laden, cybersecurity. Web Scale – clickstream, social graph analysis, ad targeting/forecasting. Telecommunication – customer churn management, Mobile behavior analysis. List B More examples of Use cases in machine learning 4 │ EPAM SYSTEMS, INC.
  • 6. Machine Learning on Big Data with HADOOP How Do We Use Machine Learning to Improve Our Business? Before answering this question, let’s take a look of fundamental question first. Machine learning needs a lot of data, and actually it is a process to turn a lot of data to make them smarter. Like in my master thesis, my research was to optimize machining processes using machine learning. The fuzzy logic model that I developed was designed to process and learn a lot of machining data. The goal was to help people to make better decisions. There are four steps in turning data into smart data. 1. Collect your data; 2. Define the problem you are trying to solve; 3. Build your model to train using the data; 4. Use the model to predict and analyze the results, and modify your model if necessary. Advancement of technology – Digital devices soar as prices plummet, 4.6 billion mobile phones, GPS Geospatial. Internet – 2 billion people has access to Internet, Social networks, Digital clickstream. Social movement – as people getting richer, they become more literate that fuels the information growth List C Reason of data explosion The following discusses more details on each step. 5 │ EPAM SYSTEMS, INC. Step 1: Data Collection. In this digital age, data collection is not easy.
  • 7. Machine Learning on Big Data with HADOOP We simply have too much data to collect and process. This phenomenon is easy to explain. The storage cost has fallen from $14M per TB in 1980 to $70 in 2010. CPU drops to $59 processor in 2009 for an AMD Radeon. Bandwidth skyrockets from 200KB/s in 1980s’ to 4GB/s in 2010. All these factors create the “best” environment for generating a lot of data. Digital universe estimates there will be 1.8 zetabytes (1 zetabyte = 1 billion terabyte) of data in 2015. And by 2020, the world will have 35.2 zetabytes growing by a factor of 44. List C contains a few reasons for data explosion. And data today is not just BIG. It is also very complex. There is 1.8 trillion gigabytes of data created in 2011 and more than 90% is unstructured. It is approximately 500 quadrillion of files stored outside the databases, and the quantity doubles every 2 years. The industry has defined it as “Era of Big Data”, which is defined as datasets that grow so large that they become awkward to work with using relational databases. Think big Think outside the box Think new revenue possibility Think how to cut cost and reduce risk List E Think your machine learning problem The Era of Big Data has posed some challenges to machine learning. Many old technologies are simply not scalable to read and process such big data volume. However, with the advent of Apache Hadoop, this problem has found a solution. Apache Hadoop, inspired by Google’s white paper on Map/Reduced and Google File System (GFS) and Big Table, is an Apache top level project. It is open source. And it is designed for large scale of data processing and to deal with structured and unstructured data. It is able to run on commodity hardware and since 6 │ EPAM SYSTEMS, INC. it is based on Java, it is portable across heterogeneous hardware and OS platform. Why is it best for machine learning? It is because Hadoop is known for embarrassingly scalable as your data grows. It is reportedly claimed that Hadoop Map/Reduce is able to read and aggregate petabyte of data 500X faster than other commercial software. When you have more data to learn from, you can simply add more hardware to the Hadoop cluster to scale. It is merely fraction of the cost comparing with other commercialized software. Corporations can simply
  • 8. Machine Learning on Big Data with HADOOP collect their data first, and have an extremely scalable way to analyze them later. With Apache Hadoop, it gives a new realm of possibility for corporations to solve their machine learning problems. The sexy job in the next ten years will be statisticians… The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it . Hal Varian, Google Chief Economist List F Mahout provides statistics algorithm capability for machine learning Step 2: Define your problem. With Apache Hadoop, you can contextualize your data by decomposing them in key-value pair. For example, you can quickly learn how many people are talking about apples, oranges regarding fruit or windows and doors regarding renovation. Or people are talking apple and windows regarding Apple’s Ipad. And you can analyze terabyte of clickstream data to form relationship between pages and user’s behaviours and buying decision. It can also marry both structured and unstructured data to gain broader insight of customers’ buying behaviours, which was before nearly impossible. (means elephant driver in Hindi) to help develop your machine learning models. Apache Mahout is an open source project by the Apache Software Foundation (ASF) with the primary goal of creating scalable machine-learning algorithms that are free to use under the Apache license. It uses the Apache Hadoop library to enable Mahout to scale effectively in a Hadoop cluster cloud. Mahout contains four major implementations, they are: 1. Clustering, 2. Classification, 3. Collaborative filtering (recommendations) and 4. Frequent-pattern mining. Step 3: Build your model to train using the data you collected. Now, it is time to introduce Apache Mahout The power of Mahout is that it contains a list of well established statistical models. If you type bin/mahout at the command line, you will see the following list: >canopy: : Canopy clustering >cleansvd: : Cleanup and verification of SVD output >clusterdump: : Dump cluster output to text 7 │ EPAM SYSTEMS, INC.
  • 9. Machine Learning on Big Data with HADOOP >dirichlet: : Dirichlet Clustering >fkmeans: : Fuzzy K-means clustering >fpg: : Frequent Pattern Growth >itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering >kmeans: : K-means clustering >lda: : Latent Dirchlet Allocation >ldatopics: : LDA Print Topics >lucene.vector: : Generate Vectors from a Lucene index >matrixmult: : Take the product of two matrices >meanshift: : Mean Shift clustering >recommenditembased: : Compute recommendations using item-based collaborative filtering 8 │ EPAM SYSTEMS, INC.
  • 10. Machine Learning on Big Data with HADOOP Clustering Figure 1 K-Mean Clustering Let’s go a little deeper into each of four implementations to help you get started. Given large data sets, whether they are text or numeric, it is often useful to group together, or cluster, similar items automatically. For instance, given all of the news for the day from all of the newspapers in the United States, you might want to group all of the articles about the same story together automatically; you can then choose to focus on specific clusters and stories without needing to wade through a lot of unrelated ones. Another example: Given the output from sensors on a machine over time, you could cluster the outputs to determine normal versus problematic operation, because normal opera- 9 │ EPAM SYSTEMS, INC. tions would all cluster together and abnormal operations would be in outlying clusters. Clustering calculates the similarity between items in the collection, but its only job is to group together similar items. In many implementations of clustering, items in the collection are represented as vectors in an n-dimensional space. Given the vectors in Figure 1, one can calculate the distance between three items using measures such as the Manhattan Distance, Euclidean distance, or cosine similarity. Then, the actual clusters can be calculated by grouping together the items that are close in distance. Popular approaches include k-Means and hierarchical clustering. Mahout comes with several different clustering approaches.
  • 11. Machine Learning on Big Data with HADOOP Step a: Convert dataset into a Hadoop Sequence File. $ bin/mahout seqdirectory –I mahout-work/reuters-out –o mahout-work/reuters-out-seqdir –c UTF-8 –chunk 5 In the sequence of files, each record is a <key, value> pair Key=record name or file name or unique identifier Value = content as UTF-8 encoded string. Step b: Writing to the sequence files Step c: Generate Vectors using the sequence files by running run: $ bin/mahout seq2sparse Step d: Start k-mean Clustering $ bin/mahout kmeans -I mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ mahout-work/reuters-kmeans-clusters -o mahout-work/reuters-kmeans -dm org.apache.mahout.distance.CosineDistanceMeasure –cd 0.1 –x 10 –k 20 –ow -c Figure 2 Clustering: understanding data as vectors There are many clustering algorithm available in Mahout. They are Canopy, fuzzy k-Means, 10 │ EPAM SYSTEMS, INC. Mean shift, Dirichlet process clustering and spectral clustering.
  • 12. Machine Learning on Big Data with HADOOP Classification The goal of classification (often also called categorization) is to label unseen documents, thus grouping them together. Many classification approaches in machine learning calculate a variety of statistics that associate the features of a document with the specified label, thus creating a model that can be used later to classify unseen documents. For example, a simple approach to classification might keep track of the words associated with a label, as well as the number of times those words are seen for a given label. Then, when a new document is classified, the words in the document are looked up in the model, probabilities are calculated, and the best result is output, usually along with a score indicating the confidence t he result is correct. Features for classification might include words, weights for those words (based on frequency, for instance), parts of speech, and so on. Of course, features really can be anything that helps associate a document with a label Figure 3 Playing Tennis, example for classification 11 │ EPAM SYSTEMS, INC. and can be incorporated into the algorithm. Mahout has several implementations, Naïve Bayes, Complementary Naïve Bayes, Decision Forests and Logistic Regression. The most popular classification is Naïve Bayes which is based on condition probability of the class given an instance. And: where Evidence E is the instance and Event H value for the instance. For example: The following is to play tennis Event H and the weather conditions E are the evidences for the instance.
  • 13. Machine Learning on Big Data with HADOOP Based on the above dataset, we can first train the classifier $MAHOUT_HOME/bin/mahout tennnisclassifier –i 20news-input/bayes-train-input playtennismodel -type bayes -ng 3 –source hdfs After the classifer is built and trained, we can test them using Mahout by entering a new day with its evidences. -o Temp=Cool, Humidity=High, Windy = true. Hence, use the model to determine the probability of yes playing tennis vs probability of No playing tennis. Use a new day looks like: Outlook = Sunny, $MAHOUT_HOME/bin/mahout tennnisclassifier –m playtennismodel 3 –source hdfs -method mapreduce -d 20-input –type bayes –ng [java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: -------------[java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: Testing: example/rd/prepared-test/ playingTennis.txt [java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: playingTennis 70.0 21/30.0 [java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: -------------[java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: Testing: example /rd/prepared-test/ NotPlayingTennis.txt [java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: notPlayingTennis 81.3953488372093 35/43.0 [java] 09/07/23 17:06:38 INFO bayes. tennnisclassifier: [java] Summary [java] ------------------------------------------------------[java] Correctly Classified Instances : 9 79.5% [java] Incorrectly Classified Instances : 5 20.5% [java] Total Classified Instances : 14 [java] [java] ======================================================= [java] Confusion Matrix [java] ------------------------------------------------------[java] a b <--Classified as [java] 21 9 | 30 a = playingTennis [java] 8 35 | 43 b = NotPlayingTennis [java] Default Category: unknown: 2 12 │ EPAM SYSTEMS, INC.
  • 14. Machine Learning on Big Data with HADOOP Figure 4 Summary Result of Playing Tennis after Naïve Bayes 13 │ EPAM SYSTEMS, INC.
  • 15. Machine Learning on Big Data with HADOOP Collaborative Filtering Collaborative filtering (aka Recommendation) is a technique, popularized by Amazon and others, that uses user information such as ratings, clicks, and purchases to provide recommendations to other users of their site. Collaborative filtering is often used to recommend consumer items such as books, music, and movies, but it is also used in other applications where multiple actors need to collaborate to narrow down data. Chances are you've seen Collaborative filtering in action the last time you were on Amazon. Figure 5 Amazon’s recommendation Given a set of users and items, Collaborative filtering applications provide recommendations to the current user of the system. Four ways of generating recommendations are typical: • User-based: Recommend items by finding similar users. This is often harder to scale because of the dynamic nature of users. • Item-based: Calculate similarity between items and make recommendations. Items usually don't change much, so this often can be computed offline. 14 │ EPAM SYSTEMS, INC. • Model-based: Provide recommendations based on developing a model of users and their ratings. All Collaborative filtering approaches end up calculating a notion of similarity between users and their rated items. There are many ways to compute similarity, and most Collaborative filtering systems allow you to plug in different measures so that you can determine which one works best for your data.
  • 16. Machine Learning on Big Data with HADOOP Mahout currently provides tools for building a recommendation engine through the Taste library; it is a very flexible and fast engine for recommendation. And it is a Java-based. The library comes with classes that you can inherit the interfaces and properties. Taste supports both user-based and item-based recommendations and comes with many choices for making recommendations, as well as interfaces for you to define your own. Taste consists of five primary components that work with Users, Items and Preferences: • DataModel: Storage for Users, Items, and Preferences. • UserSimilarity: Interface defining the similarity between two users. • ItemSimilarity: Interface defining the similarity between two items. • Recommender: Interface for providing recommendations. • UserNeighborhood: Interface for computing a neighborhood of similar users that can then be used by the Recommenders. For example, to use Taste for book recommendation using a Item-based recommendation, the first step is to load the data containing the recommendations and store them into the DataModel storing the users, Item book ID and their preferences. //create the data model for the user preferences FileDataModel dataModel = new FileDataModel(new File(bookFile)); UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel); // Optional: userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(dataModel)); The second step is to construct a LogLikeligoodSimilarity and a Recommender. The UserNeighborhood identifies users similar to the users and is handed off to the Recommender to create a ranked list of recommended books. //create the data model FileDataModel dataModel = new FileDataModel(new File(recsFile)); //Create an ItemSimilarity ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel); //Create an Item Based Recommender ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, itemSimilarity); //Get the recommendations List<RecommendedItem> recommendations = recommender.recommend(userId, 5); TasteUtils.printRecs(recommendations, handler.map); 15 │ EPAM SYSTEMS, INC.
  • 17. Machine Learning on Big Data with HADOOP Figure 6 Frequent Pattern Mining: FP – Growth Algorithm Running the main.java class will produce the following output. [echo] Getting similar items for user: Machine Learning with Mahout with a ItemSimilarity of 5 [java] 09/08/20 08:13:51 INFO file.FileDataModel: Creating FileDataModel for file src/main/resources/recommendations.txt [java] 09/08/20 08:13:51 INFO file.FileDataModel: Reading file info... [java] 09/08/20 08:13:51 INFO file.FileDataModel: Processed 100000 lines [java] 09/08/20 08:13:51 INFO file.FileDataModel: Read lines: 111901 [java] Data Model: Machine Learning with Mahout [java] ----[java] Title: Mahout 21 Rating: 3.630000066757202 [java] Title: Action in Mahout Rating: 2.703000068664551 [java] Title: Big Data Rating: 4.230000019073486 [java] Title: Correction Rating: 5.0 [java] Title: Abraham Lincoln Rating: 4.739999771118164 [java] Title: Data intelligence in bigdata: 3.430000066757202 [java] Title: Boston consulting in bigdata Rating: 2.009999990463257 [java] Title: Atlanta, Georgia Rating: 4.429999828338623 [java] Recommendations list below: [java] Doc Id: 50575 Title: April 10 Score: 4.98 [java] Doc Id: 134101348 Title: April 26 Score: 4.860541 16 │ EPAM SYSTEMS, INC.
  • 18. Machine Learning on Big Data with HADOOP [java] Doc Id: 133445748 Title: Mahout In action Score: 4.4308662 [java] Doc Id: 1193764 Title: Sam Doe Score: 4.404066 [java] Doc Id: 2417937 Title: Mike Olson Score: 4.24178 After processing the item of ‘Machine Learning with Mahout”, the system recommended several more books with various levels of confidence. 17 │ EPAM SYSTEMS, INC.
  • 19. Machine Learning on Big Data with HADOOP Frequent Pattern Mining Frequent patterns are itemsets and subsequence itemsets that appear in a data set with frequency no less than a user-specified threshold. For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set, is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern. Finding frequent patterns plays an essential role in machine learning associations, correlations, and many other interesting relationships among data. Frequent pattern mining is very powerful to analyze customer buying habits by finding associa- tions between the different items that customers place in their “shopping baskets”. For instance, if customers are buying milk, how likely are they going to also buy cereal (and what kind of cereal) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and arrange their shelf space. FP-Growth Tree Algorithm is a high speed frequent item-set mining algorithm. It builds a prefix tree from an input data set which represents the data-set in a compressed form. Frequent item-sets can be pulled from the tree by a process pattern fragment growth involving projection of the FP-tree $ hadoop jar fp-growth-bell_mobility -0.4-job.jar org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -i downloads-input -o reco-patterns-output -k 50 - method mapreduce -g 500 -regex '[ ]' -s 5 Apache Mahout has been rapidly adapted by the industry in the last two years. And it has continued to move forward in a number of ways. The current release has included many algorithms that can help solve real world problems. The community's primary focus at the moment is on pushing toward a 1.0 release by doing performance testing, documentation, API improvement, and the addition of new algorithms. And the long term plan is to start looking at distributed, in-memory approaches to solving machine-learning problems. It will make Mahout powerful and attractive to the machine learning practitioners. Mahout is well positioned to help solve today's most pressing big-data problems by focusing in on scalability and making it easier to consume complicated machine-learning algorithms. 18 │ EPAM SYSTEMS, INC. Step 4: Put the model to use and fine-tune it. It is fair to say machine learning is a trial and error process.There is no cookie cutting solution to a problem. When solving the basket analysis problem for a Canadian’s biggest Drug store chain, three different algorithms, K-means, Latent Dirichlet Allocation and FP-Growth tree were put into a trial and error process for a few months to land on a right algorithm. It is not an easy problem and requires a lot of efforts and patience. Thankfully, technology has helped the data churning process and made the process less painful.
  • 20. Machine Learning on Big Data with HADOOP Summary of Apache Mahout Apache Mahout has gained popularity in the last two years. Like the name of mahout suggests, a real mahout (elephant driver) leverages the strength and capabilities of the elephant, so too can Apache Mahout help you leverage the strength and capabilities of the yellow elephant of Apache Hadoop. Just within 2011, many user groups, forums, blogs, webinars and even start-up companies have sprout up around this new technology. And its community is becoming larger and stronger everyday to incorporate new 19 │ EPAM SYSTEMS, INC. algorithms with increasing capabilities for solving clustering, categorization, and collaborative filtering, frequent item mining or other machine learning problems. Observing the industry involvements and responses, in my opinion, Mahout likely will become the standard technology of machine learning on Big Data. Big technology firms like IBM, Oracle will incorporate Mahout in their product offerings to just acquire the user base.
  • 21. Machine Learning on Big Data with HADOOP Remarks Machine learning is definitely an exciting application that helps you to tap on the power of big data. As for corporate data continues to grow bigger and more complex, machine learning will become even more attractive. The industry has come up elegant solutions to help corporations to solve this problem. Let’s get ready; it is just a matter time this problem arrives at your desk. 20 │ EPAM SYSTEMS, INC.
  • 22. Established in 1993, EPAM Systems (NYSE: EPAM) provides complex software engineering solutions through its award-winning Central and Eastern European service delivery platform. Headquartered in the United States, EPAM employs approximately 8,500 IT professionals and serves clients worldwide from its locations in the United States, Canada, UK, Switzerland, Germany, Sweden, Belarus, Hungary, Russia, Ukraine, Kazakhstan, and Poland. EPAM is recognized among the top companies in IAOP’s “The 2012 Global Outsourcing 100”, featuring EPAM in a variety of sub-lists, including “Leaders – Companies in Eastern Europe”. The company is also ranked among the best global service providers on “The 2012 Global Services 100” by Global Services Magazine and Neogroup, which names EPAM “Leaders - Global Product Development” category. For more information, please visit www.epam.com G l ob al EU CIS 41 University Drive Suite 202, Newtown (PA), 18940, USA Phone: +1-267-759-9000 Fax: +1-267-759-8989 Corvin Offices I. Futó street 47-53 Budapest, H-1082, Hungary Phone: +36-1-327-7400 Fax: +36-1-577-2384 9th Radialnaya Street, bldg. 2 Moscow, 115404, Russia Phone: +7-495-730-6360 Fax: +7-495-730-6361 © 1993-2013 EPAM Systems. All Rights Reserved.