Data Clustering using MapReduce: A Look at Various Clustering Algorithms Implemented with MapReduce Paradigm

Data Clustering has always been one of the most sought-after problems to solve in Machine Learning, due to its high applicability in industrial settings and the active community of researchers and developers applying clustering techniques in various fields, such as Recommendation Systems, Image Processing, Gene sequence analysis, Text analytics, etc., and in competitions such as the Netflix Prize and Kaggle.
In this article I will describe three clustering algorithms – K-Means, Canopy clustering, and MinHash clustering – implemented with both a sequential approach and a MapReduce approach.
The MapReduce paradigm has recently gained popularity due to its ability to solve many of the problems associated with rising data volumes. Algorithms modeled in MapReduce essentially work with data in a partitioned (sharded) form, and have to take the distributed nature of the partitioned data into account. Implementations such as Apache Hadoop, Phoenix, and others have paved the way for algorithm writers to model their approach in MapReduce fashion without worrying about the underlying architecture of the system. One such field is Machine Learning, which has worked tremendously towards implementing algorithms in the MapReduce paradigm that run on MapReduce systems. Open-source implementations such as Apache Mahout (http://mahout.apache.org) are being developed by teams of active contributors and used in various organizations (https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout). This article introduces MapReduce in a nutshell, but assumes familiarity with programming paradigms.
MapReduce 101
Google introduced and patented MapReduce – a programming paradigm for processing data through partitioning and aggregation of intermediate results. The term is overloaded in Google's context: it is also a software framework, implemented in C++, that supports distributed computing on large data sets across clusters of computers by running Jobs written in the MapReduce paradigm. The name "MapReduce" and the inspiration came from the map and reduce functions of the functional programming world.
In functional programming, the map function applies a function to the input data. In the MapReduce paradigm, the map function applies the function to individual chunks (shards) of the partitioned data. This partitioning is done by the framework, and the size of a partition (the input split size) is a tunable parameter.

Let us take an example of a function, f, which converts an input rectangle into a sphere. The map function in the MapReduce paradigm then gets the following parameters – map(f, input).

f – A processing function. In our case, it is the function converting a rectangle into a sphere.
input – The input, split into many shards based on the input split size (Figure 1).
Figure 1. Map(f, input) applied to input
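As a minimal illustration of map(f, input) (a sketch of ours, not a listing from the original article), here is the idea in plain Java, with f standing in for the rectangle-to-sphere converter:

import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MapExample {
    public static void main(String[] args) {
        // f: the processing function -- here a stand-in that "converts" a
        // rectangle (width x height) into a sphere of equal surface area.
        Function<double[], Double> f = rect -> {
            double area = rect[0] * rect[1];
            return Math.sqrt(area / (4 * Math.PI)); // radius of the sphere
        };
        // input: in a real MapReduce run this list would be split into shards,
        // each shard processed by a separate map task in parallel.
        List<double[]> input = List.of(new double[]{2, 3}, new double[]{4, 5});
        List<Double> mapped = input.stream().map(f).collect(Collectors.toList());
        System.out.println(mapped);
    }
}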
The programming paradigm for Hadoop MapReduce requires the map function to take its inputs as key-value pairs in the form of records. The map function takes in an input key and input value and produces intermediate value lists for the processed keys. map functions run in parallel, creating different intermediate values from different input data sets.

After the map phase is over, all the intermediate values for a given output key are combined together into a list. The reduce function combines those intermediate values into one or more final values for the same output key. reduce functions also run in parallel, each working on a different output key. It is important to note that the reduce method cannot start until and unless the map phase has completely finished (Figure 2).

As seen above, the reducer function takes a list of values associated with a key as an input, and performs processing over the list. In the representation above we have only one reducer task, but the number of reducers depends on the configuration settings of the system. In our example, the reducer function takes the intermediate map output and produces the final output; it looks like Figure 3.
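Continuing the same plain-Java illustration (the names are ours, not the Hadoop API): intermediate <key, value> pairs are grouped per key, and a reducer function folds each per-key list into a final value:

import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ReduceExample {
    public static void main(String[] args) {
        // Intermediate map output: <key, value> pairs from several mappers.
        List<SimpleEntry<String, Integer>> intermediate = List.of(
            new SimpleEntry<>("a", 1), new SimpleEntry<>("b", 2),
            new SimpleEntry<>("a", 3), new SimpleEntry<>("b", 4));
        // Shuffle: collect all values for the same output key into a list.
        Map<String, List<Integer>> grouped = intermediate.stream()
            .collect(Collectors.groupingBy(SimpleEntry::getKey,
                     Collectors.mapping(SimpleEntry::getValue, Collectors.toList())));
        // Reduce: r() folds each per-key list into one final value (here, a sum).
        Map<String, Integer> result = new HashMap<>();
        grouped.forEach((k, vs) -> result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        System.out.println(result); // {a=4, b=6}
    }
}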
Machine Learning 101
Machine Learning is a branch of Artificial Intelligence which deals with building systems that 'learn' from data, without the need for explicit programming of all the use-cases. It is the study of ways in which a computing system can 'learn' – understand the input, make predictions based on it, and learn from the feedback. The applications of machine learning have been seen in many varied fields, e.g. e-mail analytics, the Genome project, image processing, supply chain optimization, and many more. These applications are also motivated by the simplicity of some of the algorithms and their efficiency, which is useful in real-world settings. Machine learning differentiates itself from the field of Data Mining by focusing only on known properties learned from the data; Data Mining techniques may inherently use machine learning techniques to mine for unknown properties.

Machine Learning algorithms can be classified into the following subcategories (source: Wikipedia; see Figure 4).
Supervised learning – Generates a model which takes the input and produces the desired output. The model is trained on a training set to understand the input, and later works on the incoming data to provide the output.

Unsupervised learning – Models the input into a set of representations produced by the algorithm. It does not follow the train-test model used in supervised learning methods, but works on finding a representation of the input data. Clustering is one of the prominent approaches in this space.
Semi-supervised learning – Combines supervised and unsupervised learning approaches to generate approximate classifiers, predicting outputs from a fixed set of inputs (the training set) and applying the learnings to improve results on the test set.
Figure 2. Reduce(r, intermediate_value_list) applied to intermediate_value_list to get the final output of the MapReduce phase
Figure 3. Reducer function r()
Figure 4. Machine Learning Taxonomy (Source: Wikipedia)
Reinforcement learning – Learns from feedback on the output. It improves as interactions with the system increase. One of the most important approaches uses Artificial Neural Networks, which feed the response back with each interaction with the system.

Learning to learn – Uses experience as a part of learning. The experience factor comes from learning a problem together with other, similar problems simultaneously. This allows sharing experience across the various problems and building a better model, applied to all the problems used for learning simultaneously.
Example of Supervised Learning
The Google Prediction API (https://developers.google.com/prediction/) takes a training set as an input (Listing 1). After getting trained, it predicts the language of the input by building a model based on the training input.
Unsupervised Learning
As stated above, unsupervised learning refers to the problem of finding hidden structure or representation in unlabeled data. Simply stated – finding a way to store similar data points into a data structure (e.g. cluster them together). Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning.
Clustering
Cluster analysis refers to the process of grouping similar objects without labels. Similarity is directed by the business needs or by the requirements of the clustering algorithm. Its eventual goal is to group similar objects into the same cluster and distinct objects into distinct groups, where the distinction is directed by the similarity measure. It is mainly used in exploratory mining, and in statistical analysis techniques such as pattern recognition, information retrieval, etc.
Purposes of using data clustering [1]:

• Underlying structure – to gain insight into data, generate hypotheses, detect anomalies, and identify salient features.
• Natural classification – to identify the degree of similarity among forms or organisms (phylogenetic relationship).
• Compression – a method for organizing the data and summarizing it through cluster prototypes.
We'll look into three ways to cluster data points – K-Means, Canopy Clustering, and MinHash Clustering – and into two approaches to implementing these algorithms: sequential and the MapReduce paradigm. These three algorithms have proved useful in various applications such as recommendation engines [2], image processing, and use as a pre-clustering algorithm [3].
K-Means Clustering
K-Means clustering is a method of cluster analysis which aims to form k groups from the n data points taken as input. The partitioning happens by each data point associating itself with the nearest mean.
Classical Approach
The main steps of the k-means algorithm are as follows:
Listing 1. Input file to the model with two fields – <Language, Sample Phrase in Language>
(Source: Excerpts from file available at https://developers.google.com/prediction/docs/language_id.txt)
1. Select k points as the initial seed centroids.
2. Form k clusters by assigning each data point to its closest cluster center – the assignment happens based on the closest mean.
3. Recompute the centroids using the mean formula for multidimensional data points (Figure 5).
4. Repeat steps 2 and 3 until the termination criterion is met – a specific number of iterations given by the user, or the dimensions of the centroids no longer changing.
Theoretically, let X = {x_i}, i = 1,...,n, be the set of n d-dimensional points to be clustered into a set of K clusters, C = {c_k}, k = 1,...,K. K-Means finds a partition such that the sum of squared error (distances) between the empirical mean of a cluster and the points in the cluster is minimized. Let \mu_k be the mean of cluster c_k. The sum of the squared error between \mu_k and the points in cluster c_k is

J(c_k) = \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2

The goal of K-Means is to minimize the sum of the squared error over all K clusters,

J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2
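A minimal sequential sketch of the algorithm in Java (an illustration, assuming Euclidean distance, simple random seeding, and a fixed iteration count as the termination criterion; not an optimized implementation):

import java.util.Arrays;
import java.util.Random;

public class SequentialKMeans {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static double[][] cluster(double[][] points, int k, int iterations) {
        int dim = points[0].length;
        Random rnd = new Random(42);
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++)                        // 1. seed centroids
            centroids[i] = points[rnd.nextInt(points.length)].clone();
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {                    // 2. assignment step
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += p[d];
            }
            for (int c = 0; c < k; c++)                    // 3. update step (means)
                if (counts[c] > 0)
                    for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
        }                                                  // 4. repeat for all iterations
        return centroids;
    }

    public static void main(String[] args) {
        double[][] pts = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}, {0.5, 1}};
        System.out.println(Arrays.deepToString(cluster(pts, 2, 10)));
    }
}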
As we can observe, we would need to have all the data points in memory to start comparing them with the centroids of the clusters. This is plausible only in conditions where the data sets are not very large ('very large' depends on the size of your RAM, the processing power of your CPU, and the time within which the answer is expected back). This approach does not scale well for two reasons – first, the complexity is quite high, k * n * O(distance metric) * num(iterations); and second, we need a solution that can scale with very large datasets (data files in GBs, TBs, ...). MapReduce paves the way for making such a solution feasible, by distributing the work into tasks over smaller chunks of data and making use of a multi-node setup (a distributed system).
Figure 5. K-Means applied to a dataset of 65 data points (n=65) with the number of clusters set to 3 (k=3). (Source of the dataset – [4]; representation plot produced by Wolfram Mathematica)
Figure 6. Single pass of K-Means on MapReduce
There can be multiple approaches to make the sequential version better, and many improved ways to implement the MapReduce version of k-means.
Distance Metrics
K-means is typically used with the Euclidean metric for computing the distance between points and cluster centers. Other distance metrics with which k-means is used include the Mahalanobis distance, the Manhattan distance, inner-product-space metrics, the Itakura–Saito distance, the family of Bregman divergences, or any metric you define over the space.
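As a sketch of how such pluggable metrics can be expressed (the DistanceMetric interface and constant names here are illustrative, not from any particular library):

public class Metrics {
    interface DistanceMetric {
        double distance(double[] a, double[] b);
    }

    // Straight-line distance: sqrt of the sum of squared coordinate differences.
    static final DistanceMetric EUCLIDEAN = (a, b) -> {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    };

    // City-block distance: sum of absolute coordinate differences.
    static final DistanceMetric MANHATTAN = (a, b) -> {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    };

    public static void main(String[] args) {
        double[] p = {0, 0}, q = {3, 4};
        System.out.println(EUCLIDEAN.distance(p, q)); // 5.0
        System.out.println(MANHATTAN.distance(p, q)); // 7.0
    }
}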
MapReduce Approach
MapReduce works on keys and values, and is based on data partitioning. Thus, the assumption of having all data points in memory fails in this paradigm. We have to design the algorithm in such a manner that the tasks can be parallelized and no split depends on any other split for its computation (Figure 6).
The Mappers do the distance computation and emit a key-value pair – <centroid_id, datapoint>. This step finds the associativity of a data point with a cluster. The Reducers each work with a specific cluster_id and the list of data points associated with it. A reducer computes the new means and writes them to the new centroid file.
Then, based on the user's choice, the termination check runs – a specific number of iterations, or a comparison with the centroids from the previous iteration.
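A hedged sketch of a single iteration against the Hadoop API (an illustration, not Mahout's actual implementation; the hardcoded centroids stand in for a setup() step that would load the current centroid file from the distributed cache, and points are assumed to arrive as comma-separated lines):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

    static double[] parse(String line) {
        String[] f = line.split(",");
        double[] p = new double[f.length];
        for (int i = 0; i < f.length; i++) p[i] = Double.parseDouble(f[i]);
        return p;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s; // squared Euclidean is enough for nearest-centroid comparisons
    }

    public static class AssignMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        // Illustrative stand-in; a real job loads the current centroids in
        // setup() from the centroid file shipped on the distributed cache.
        private double[][] centroids = {{0, 0}, {10, 10}};

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            double[] point = parse(line.toString());
            int best = 0;
            for (int c = 1; c < centroids.length; c++)
                if (dist(point, centroids[c]) < dist(point, centroids[best])) best = c;
            ctx.write(new IntWritable(best), line); // emit <centroid_id, datapoint>
        }
    }

    public static class RecomputeReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable clusterId, Iterable<Text> points, Context ctx)
                throws IOException, InterruptedException {
            double[] sum = null;
            long n = 0;
            for (Text t : points) { // mean of all points assigned to this cluster
                double[] p = parse(t.toString());
                if (sum == null) sum = new double[p.length];
                for (int d = 0; d < p.length; d++) sum[d] += p[d];
                n++;
            }
            StringBuilder out = new StringBuilder();
            for (int d = 0; d < sum.length; d++)
                out.append(sum[d] / n).append(d + 1 < sum.length ? "," : "");
            ctx.write(clusterId, new Text(out.toString())); // new centroid entry
        }
    }
}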
Figure 7. K-Means Algorithm. Algorithm termination method is
user-driven
Figure 8. (a) 1100 samples generated (b) Resulting Canopies superimposed upon the sample data. Two circles, with radius T1 and T2.
(Source: https://cwiki.apache.org/MAHOUT/canopy-clustering.html)
Figure 9. An example of four data clusters with the canopies
superimposed. Point A was selected at random and forms a canopy
consisting of all points within the outer (solid) threshold. (Source: [3])
A MapReduce-based K-Means implementation is also part of Apache Mahout [7][9] (Figure 7).
Canopy Clustering
Canopy clustering is an unsupervised clustering algorithm, primarily applied as a pre-clustering step – its output is given as input to a classical clustering algorithm such as k-means. Pre-clustering with canopy clustering helps speed up the actual clustering algorithm, which is very useful for very large datasets. The idea is to cheaply partition the data into overlapping subsets (called "canopies"), and then perform the more expensive clustering only within these canopies (Figure 8 and Figure 9).
Algorithm
The main steps of partitioning are as follows:
Let us assume that we have a list of data points named X.

1. Decide two threshold values – T1 and T2, where T1 > T2.
2. Randomly pick one data point from X; it will represent a canopy centroid. Let it be A.
3. Calculate the distance, d, between A and every other point in X. If d < T1, add the point to the canopy. If d < T2, remove the point from X.
4. Repeat steps 2 and 3 until X is empty.
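A minimal in-memory sketch of these steps (Euclidean distance assumed; taking the first remaining point stands in for a random pick):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class CanopyClustering {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /** Returns the canopies; a point may appear in more than one canopy. */
    public static List<List<double[]>> canopies(List<double[]> x, double t1, double t2) {
        LinkedList<double[]> remaining = new LinkedList<>(x);
        List<List<double[]>> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] a = remaining.removeFirst();     // pick a point as canopy centroid
            List<double[]> canopy = new ArrayList<>();
            canopy.add(a);
            for (double[] p : remaining)
                if (dist(p, a) < t1) canopy.add(p);   // d < T1: point joins the canopy
            remaining.removeIf(p -> dist(p, a) < t2); // d < T2: point is removed from X
            canopies.add(canopy);
        }
        return canopies;
    }

    public static void main(String[] args) {
        List<double[]> pts = List.of(new double[]{0, 0}, new double[]{0.5, 0.5},
                                     new double[]{5, 5}, new double[]{5.2, 4.8});
        System.out.println(canopies(pts, 3.0, 1.0).size()); // expect 2 canopies
    }
}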
The MapReduce version runs this procedure in two stages to find the canopy centroids [10]:

• Map – each mapper performs canopy clustering on the points in its input split. Each point found to be within an existing canopy is added to the mapper's internal list of canopies; the mapper updates all of its canopies and produces new canopy centroids, which are output to a single reducer using a constant key, as <"centroid", vector> pairs.
• Reduce – the reducer receives all of the initial centroids and again applies the canopy measure and thresholds to produce the final canopy centroids (i.e. clustering the cluster centroids).

The resulting canopy centroids can then seed, for example, a single pass of K-Means, producing the clusters within the canopies.
MinHash Clustering
MinHash clustering is a probabilistic clustering method from the family of Locality Sensitive Hashing (LSH) algorithms. It is generally used in settings where clustering needs to be done based on the range of the dimensions of the data points (especially a single dimension, such as in [2], where the clustering is done only on the basis of the news articles seen by the user). The probability of a pair of data points being assigned to the same cluster is proportional to the overlap between the sets of items that these users have voted for (clicked on). So if the overlap between the sets of clicked items is high, the probability of the two data points ending up in the same cluster is high ('high' is domain-specific).
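A sketch of the core idea (the names and the toy catalog are illustrative): under a random permutation R of the catalog, a user's cluster ID is the clicked product with the minimum permuted rank, so users with heavily overlapping click sets frequently share a cluster:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MinHashSketch {
    // R: a random permutation of the product catalog, product -> permuted rank.
    static Map<String, Integer> permute(List<String> catalog, long seed) {
        List<String> shuffled = new ArrayList<>(catalog);
        Collections.shuffle(shuffled, new java.util.Random(seed));
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < shuffled.size(); i++) rank.put(shuffled.get(i), i);
        return rank;
    }

    // Cluster ID of a user = clicked product with the minimum rank under R.
    static String minHash(Set<String> clicks, Map<String, Integer> rank) {
        String best = null;
        for (String item : clicks)
            if (best == null || rank.get(item) < rank.get(best)) best = item;
        return best; // the cluster ID is the product ID itself
    }

    public static void main(String[] args) {
        Map<String, Integer> r = permute(List.of("p1", "p2", "p3", "p4"), 42L);
        // Two users with overlapping click sets often land in the same cluster.
        System.out.println(minHash(Set.of("p1", "p2", "p3"), r));
        System.out.println(minHash(Set.of("p2", "p3"), r));
    }
}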
Example of MinHash Clustering
Let us assume that we have some e-commerce website with a catalog of products.

• Generate a random permutation of the product catalog – we refer to it as R (Figure 10).
• Use R to determine which cluster each of the users belongs to: the minimum product (under R) clicked by the user gives you the cluster number. The cluster ID is the product ID itself.
• This results in packets of users getting associated with the same product number.
• Append the cluster information to the User Record. It can act as a column entry (assuming the use of a NoSQL database in this case) in this step.
• We need to make sure that the number of products is known beforehand, and that the clusters formed are not too big, which would reduce the accuracy of the clusters.
• The users can then be grouped into a separate collection (table) based on the merged cluster values.
• We can group the random permutations into 'q' groups and cluster the users based on these groups.
Figure 10. Random Permutation of Catalog
Figure 11. The mappers share a common file (on the distributed cache) containing the permutation of the catalog
Figure 12. MinHash Clustering. The example assumes data storage in a NoSQL database, such as MongoDB, in which the ClusterID is appended as a column entry to the user
Figure 13. MinHash Clustering MapReduce workflow. The reference range for finding the cluster is shared across mappers using the distributed cache
References
[1] Jain, A. K. (2010) "Data clustering: 50 years beyond K-means", Pattern Recognition Letters 31, pp. 651-666.
[2] Das, A. S.; Datar, M.; Garg, A.; and Rajaram, S. (2007) "Google news personalization: scalable online collaborative filtering", WWW '07, pp. 271-280.
[3] McCallum, A.; Nigam, K.; and Ungar, L. H. (2000) "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD '00, pp. 169-178.
[4] Lingras, P. and West, C. (2004) "Interval Set Clustering of Web Users with Rough K-means", Journal of Intelligent Information Systems, Vol. 23, No. 1, pp. 5-16.
[6] Jain, A. K. and Dubes, R. C. (1988) "Algorithms for Clustering Data", Prentice Hall.
[7] Introducing Apache Mahout – IBM developerWorks – http://www.ibm.com/developerworks/java/library/j-mahout
[8] Chu, C. T.; Kim, S. K.; Lin, Y. A.; Yu, Y.; Bradski, G. R.; Ng, A. Y.; and Olukotun, K. (2006) "Map-Reduce for Machine Learning on Multicore", In NIPS, pp. 281-288, available at http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
[9] Apache Mahout – K-Means Clustering – http://cwiki.apache.org/MAHOUT/k-means-clustering.html
[10] Apache Mahout – Canopy Clustering – https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
[11] Apache Mahout – http://mahout.apache.org
MapReduce Approach
As shown in the example, for ease of understanding the algorithm we take each data point to be a click-logger entry for a user, containing each product click made by that user. The representation in Figure 13 describes the process of clustering the users based on clicks.

In the map phase, the mappers take the data points as input and produce a 2-tuple entry <cluster_id, user_id>, where cluster_id is the minimum entry in the click list according to the product catalog permutation R. The mappers share a common file (on the distributed cache) containing the permutation of the catalog (Figure 11).
The reduce phase does not perform any processing; it just emits the list of users associated with each cluster_id (Figure 12 and Figure 13).
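A hedged Hadoop-style sketch of this workflow (illustrative, with types simplified to Text; the hardcoded permutation stands in for a setup() step that would load R from the distributed cache):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MinHashClustering {
    public static class MinHashMapper extends Mapper<LongWritable, Text, Text, Text> {
        // Illustrative stand-in for the permutation R; a real job would
        // populate this in setup() from the file on the distributed cache.
        private final Map<String, Integer> rank =
            new HashMap<>(Map.of("p1", 2, "p2", 0, "p3", 1));

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Input record: user_id TAB comma-separated clicked product ids.
            String[] parts = value.toString().split("\t");
            String clusterId = null;
            for (String item : parts[1].split(","))  // minimum entry under R = cluster id
                if (clusterId == null || rank.get(item) < rank.get(clusterId)) clusterId = item;
            ctx.write(new Text(clusterId), new Text(parts[0])); // <cluster_id, user_id>
        }
    }

    public static class UserListReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text clusterId, Iterable<Text> users, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder(); // just list the users per cluster
            for (Text u : users) sb.append(u).append(' ');
            ctx.write(clusterId, new Text(sb.toString().trim()));
        }
    }
}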
Conclusion
Clustering algorithms are a time-tested way of finding patterns in data, and with rising data volumes it is evident that these algorithms must scale. Industry and academia are already doing a lot to scale clustering algorithms, both vertically and horizontally. MapReduce helps by managing data partitions and parallel processing over those data shards. In this article, we saw only a handful of algorithms from the vast number of machine learning techniques that can be molded into MapReduce [8][9].
VARAD MERU
Varad Meru is a Software Development Engineer at Orzota, Inc. He completed his Bachelors in Computer Science from Shivaji University, Kolhapur, in 2011. After graduating, he joined Persistent Systems as a Software Engineer and worked on Recommendation Engines, Search Engines, E-mail Analytics, etc. At Orzota, where the team is trying to make the 'Big Data' platform easy to consume, he works on Algorithms, Distributed Computing, and Data Science. He is active with Hadoop User Groups (HUGs), has given talks on Big Data and related technologies at various venues, and likes mentoring the next generation of professionals in Data Science.
LinkedIn: in.linkedin.com/in/vmeru/
Company Website: www.orzota.com

Mais conteúdo relacionado

Mais procurados

Programming Languages
Programming LanguagesProgramming Languages
Programming LanguagesMohamed Omar
 
Operating system and its types
Operating system and its types Operating system and its types
Operating system and its types vimal kumar arora
 
Message Passing Interface (MPI)-A means of machine communication
Message Passing Interface (MPI)-A means of machine communicationMessage Passing Interface (MPI)-A means of machine communication
Message Passing Interface (MPI)-A means of machine communicationHimanshi Kathuria
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel ProgrammingUday Sharma
 
Operating system notes
Operating system notesOperating system notes
Operating system notesSANTOSH RATH
 
Operating System Lecture Notes
Operating System Lecture NotesOperating System Lecture Notes
Operating System Lecture NotesFellowBuddy.com
 
Bca 2nd sem-u-1.2 digital logic circuits, digital component
Bca 2nd sem-u-1.2 digital logic circuits, digital componentBca 2nd sem-u-1.2 digital logic circuits, digital component
Bca 2nd sem-u-1.2 digital logic circuits, digital componentRai University
 
Point-to-Point Communicationsin MPI
Point-to-Point Communicationsin MPIPoint-to-Point Communicationsin MPI
Point-to-Point Communicationsin MPIHanif Durad
 
Primary memory (main memory)
Primary memory (main memory)Primary memory (main memory)
Primary memory (main memory)shah baadshah
 
Operating System - Types Of Operating System Unit-1
Operating System - Types Of Operating System Unit-1Operating System - Types Of Operating System Unit-1
Operating System - Types Of Operating System Unit-1abhinav baba
 
4 evolution-of-programming-languages
4 evolution-of-programming-languages4 evolution-of-programming-languages
4 evolution-of-programming-languagesRohit Shrivastava
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Dr.K. Thirunadana Sikamani
 
ABR Algorithms Explained (from Streaming Media East 2016)
ABR Algorithms Explained (from Streaming Media East 2016) ABR Algorithms Explained (from Streaming Media East 2016)
ABR Algorithms Explained (from Streaming Media East 2016) Erica Beavers
 

Mais procurados (20)

IPC
IPCIPC
IPC
 
Programming Languages
Programming LanguagesProgramming Languages
Programming Languages
 
Unix ch03-03(2)
Unix ch03-03(2)Unix ch03-03(2)
Unix ch03-03(2)
 
Operating system and its types
Operating system and its types Operating system and its types
Operating system and its types
 
CS6401 OPERATING SYSTEMS Unit 2
CS6401 OPERATING SYSTEMS Unit 2CS6401 OPERATING SYSTEMS Unit 2
CS6401 OPERATING SYSTEMS Unit 2
 
Deadlock ppt
Deadlock ppt Deadlock ppt
Deadlock ppt
 
Message Passing Interface (MPI)-A means of machine communication
Message Passing Interface (MPI)-A means of machine communicationMessage Passing Interface (MPI)-A means of machine communication
Message Passing Interface (MPI)-A means of machine communication
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Operating system notes
Operating system notesOperating system notes
Operating system notes
 
Operating System Lecture Notes
Operating System Lecture NotesOperating System Lecture Notes
Operating System Lecture Notes
 
computer languages
computer languagescomputer languages
computer languages
 
Bca 2nd sem-u-1.2 digital logic circuits, digital component
Bca 2nd sem-u-1.2 digital logic circuits, digital componentBca 2nd sem-u-1.2 digital logic circuits, digital component
Bca 2nd sem-u-1.2 digital logic circuits, digital component
 
Critical section operating system
Critical section  operating systemCritical section  operating system
Critical section operating system
 
Point-to-Point Communicationsin MPI
Point-to-Point Communicationsin MPIPoint-to-Point Communicationsin MPI
Point-to-Point Communicationsin MPI
 
Generation of os
Generation of osGeneration of os
Generation of os
 
Primary memory (main memory)
Primary memory (main memory)Primary memory (main memory)
Primary memory (main memory)
 
Operating System - Types Of Operating System Unit-1
Operating System - Types Of Operating System Unit-1Operating System - Types Of Operating System Unit-1
Operating System - Types Of Operating System Unit-1
 
4 evolution-of-programming-languages
4 evolution-of-programming-languages4 evolution-of-programming-languages
4 evolution-of-programming-languages
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
 
ABR Algorithms Explained (from Streaming Media East 2016)
ABR Algorithms Explained (from Streaming Media East 2016) ABR Algorithms Explained (from Streaming Media East 2016)
ABR Algorithms Explained (from Streaming Media East 2016)
 

Destaque

Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Titus Damaiyanti
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringGeorge Ang
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Kmeans in-hadoop
Kmeans in-hadoopKmeans in-hadoop
Kmeans in-hadoopTianwei Liu
 
K means Clustering
K means ClusteringK means Clustering
K means ClusteringEdureka!
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clusteringmobius.cn
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
TCI 2015 Cluster Mapping: Pattern in Irish National and Regional Economic Act...
TCI 2015 Cluster Mapping: Pattern in Irish National and Regional Economic Act...TCI 2015 Cluster Mapping: Pattern in Irish National and Regional Economic Act...
TCI 2015 Cluster Mapping: Pattern in Irish National and Regional Economic Act...TCI Network
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringIJRES Journal
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clusteringSubhas Kumar Ghosh
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreducemakoto onizuka
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticALTIC Altic
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Varad Meru
 

Destaque (20)

Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Kmeans in-hadoop
Kmeans in-hadoopKmeans in-hadoop
Kmeans in-hadoop
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
TCI 2015 Cluster Mapping: Pattern in Irish National and Regional Economic Act...
TCI 2015 Cluster Mapping: Pattern in Irish National and Regional Economic Act...TCI 2015 Cluster Mapping: Pattern in Irish National and Regional Economic Act...
TCI 2015 Cluster Mapping: Pattern in Irish National and Regional Economic Act...
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
MachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_SparkMachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_Spark
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, altic
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Canopy k-means using Hadoop
Canopy k-means using HadoopCanopy k-means using Hadoop
Canopy k-means using Hadoop
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
 

Semelhante a Data clustering using map reduce

Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelDr. Abdul Ahad Abro
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...Infrrd
 
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningA Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningVenkata Karthik Gullapalli
 
A Survey on Machine Learning Algorithms
A Survey on Machine Learning AlgorithmsA Survey on Machine Learning Algorithms
A Survey on Machine Learning AlgorithmsAM Publications
 
Classification of Machine Learning Algorithms
Classification of Machine Learning AlgorithmsClassification of Machine Learning Algorithms
Classification of Machine Learning AlgorithmsAM Publications
 
Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine LearningJanani C
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsIJDKP
 
Ijartes v1-i2-006
Ijartes v1-i2-006Ijartes v1-i2-006
Ijartes v1-i2-006IJARTES
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesIRJET Journal
 
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...ijcsa
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET Journal
 
Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...IRJET Journal
 
mapReduce for machine learning
mapReduce for machine learning mapReduce for machine learning
mapReduce for machine learning Pranya Prabhakar
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 

Semelhante a Data clustering using map reduce (20)

Regression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms ExcelRegression with Microsoft Azure & Ms Excel
Regression with Microsoft Azure & Ms Excel
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
Big Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy GaussianBig Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy Gaussian
 
Stock Market Prediction Using ANN
Stock Market Prediction Using ANNStock Market Prediction Using ANN
Stock Market Prediction Using ANN
 
Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...
 
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningA Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
 
A Survey on Machine Learning Algorithms
A Survey on Machine Learning AlgorithmsA Survey on Machine Learning Algorithms
A Survey on Machine Learning Algorithms
 
Machine Learning.pptx
Machine Learning.pptxMachine Learning.pptx
Machine Learning.pptx
 
Classification of Machine Learning Algorithms
Classification of Machine Learning AlgorithmsClassification of Machine Learning Algorithms
Classification of Machine Learning Algorithms
 
Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine Learning
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
Ijartes v1-i2-006
Ijartes v1-i2-006Ijartes v1-i2-006
Ijartes v1-i2-006
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering Techniques
 
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
 
PythonML.pptx
PythonML.pptxPythonML.pptx
PythonML.pptx
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
 
Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...Density Based Clustering Approach for Solving the Software Component Restruct...
Density Based Clustering Approach for Solving the Software Component Restruct...
 
mapReduce for machine learning
mapReduce for machine learning mapReduce for machine learning
mapReduce for machine learning
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
fINAL ML PPT.pptx
fINAL ML PPT.pptxfINAL ML PPT.pptx
fINAL ML PPT.pptx
 

Mais de Varad Meru

Predicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensemblesPredicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensemblesVarad Meru
 
Generating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep LearningGenerating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep LearningVarad Meru
 
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...Varad Meru
 
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...Varad Meru
 
Kakuro: Solving the Constraint Satisfaction Problem
Kakuro: Solving the Constraint Satisfaction ProblemKakuro: Solving the Constraint Satisfaction Problem
Kakuro: Solving the Constraint Satisfaction ProblemVarad Meru
 
CS295 Week5: Megastore - Providing Scalable, Highly Available Storage for Int...
CS295 Week5: Megastore - Providing Scalable, Highly Available Storage for Int...CS295 Week5: Megastore - Providing Scalable, Highly Available Storage for Int...
CS295 Week5: Megastore - Providing Scalable, Highly Available Storage for Int...Varad Meru
 
Cassandra - A Decentralized Structured Storage System
Cassandra - A Decentralized Structured Storage SystemCassandra - A Decentralized Structured Storage System
Cassandra - A Decentralized Structured Storage SystemVarad Meru
 
Cloud Computing: An Overview
Cloud Computing: An OverviewCloud Computing: An Overview
Cloud Computing: An OverviewVarad Meru
 
Live Wide-Area Migration of Virtual Machines including Local Persistent State.
Live Wide-Area Migration of Virtual Machines including Local Persistent State.Live Wide-Area Migration of Virtual Machines including Local Persistent State.
Live Wide-Area Migration of Virtual Machines including Local Persistent State.Varad Meru
 
Machine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionMachine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionVarad Meru
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsVarad Meru
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 
Big Data, Hadoop, NoSQL and more ...
Big Data, Hadoop, NoSQL and more ...Big Data, Hadoop, NoSQL and more ...
Big Data, Hadoop, NoSQL and more ...Varad Meru
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project GuidanceVarad Meru
 
OpenSourceEducation
OpenSourceEducationOpenSourceEducation
OpenSourceEducationVarad Meru
 

Mais de Varad Meru (15)

Predicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensemblesPredicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensembles
 
Generating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep LearningGenerating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep Learning
 
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
 
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passin...
 
Kakuro: Solving the Constraint Satisfaction Problem
Kakuro: Solving the Constraint Satisfaction ProblemKakuro: Solving the Constraint Satisfaction Problem
Kakuro: Solving the Constraint Satisfaction Problem
 
CS295 Week5: Megastore - Providing Scalable, Highly Available Storage for Int...
CS295 Week5: Megastore - Providing Scalable, Highly Available Storage for Int...CS295 Week5: Megastore - Providing Scalable, Highly Available Storage for Int...
CS295 Week5: Megastore - Providing Scalable, Highly Available Storage for Int...
 
Cassandra - A Decentralized Structured Storage System
Cassandra - A Decentralized Structured Storage SystemCassandra - A Decentralized Structured Storage System
Cassandra - A Decentralized Structured Storage System
 
Cloud Computing: An Overview
Cloud Computing: An OverviewCloud Computing: An Overview
Cloud Computing: An Overview
 
Live Wide-Area Migration of Virtual Machines including Local Persistent State.
Live Wide-Area Migration of Virtual Machines including Local Persistent State.Live Wide-Area Migration of Virtual Machines including Local Persistent State.
Live Wide-Area Migration of Virtual Machines including Local Persistent State.
 
Machine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionMachine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An Introduction
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its Applications
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 
Big Data, Hadoop, NoSQL and more ...
Big Data, Hadoop, NoSQL and more ...Big Data, Hadoop, NoSQL and more ...
Big Data, Hadoop, NoSQL and more ...
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project Guidance
 
OpenSourceEducation
OpenSourceEducationOpenSourceEducation
OpenSourceEducation
 

Último

AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 

Último (20)

AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 

Data clustering using map reduce

  • 1. 02/201340 The Tools I n this article I will describe three clustering algo- rithms – K-Means, Canopy clustering and MinHash clustering implemented by both sequential approach and MapReduce approach. MapReduce paradigm has recently gained popular- ity due to the ability to solve many of the problems associated with rising data volume by modeling al- gorithms using MapReduce paradigm. These algo- rithms essentially work with data in a partitioned (in- to shards) form and have to consider the distributed nature of partitioned data and model the algorithm accordingly. Also, implementations such as Apache Hadoop, Phoenix and others have paved way for al- gorithm writers to model their approach in MapReduce fashion without worrying about the underlying architec- ture of the system. One such field of algorithms is the field of Machine Learning, which has worked tremen- dously towards implementing algorithms in MapRe- duce paradigm which work on a MapReduce system. Open source implementations such as Apache Ma- hout (http://mahout.apache.org) is being developed by team of active contributors and used in various orga- nizations (https://cwiki.apache.org/confluence/display/ MAHOUT/Powered+By+Mahout). This article intro- duces MapReduce in a nutshell, but assumes familiar- ity of programming paradigms. MapReduce 101 Google introduced and patented MapReduce – A pro- gramming paradigm of processing data with partitioning, and aggregation of intermediate results. An overloaded term in Google’s context – its also a software framework to support distributed computing on large data sets on clusters of computers, implemented in C++ to run Jobs written with MapReduce paradigm. The name “MapRe- duce” and the inspiration came from map and reduce functions in functional programming world. In functional programming, function applies a function on input data. In MapReduce paradigm, function applies the function on an individual chunks (shard) of partitioned data. This partitioning is done by The size of the partition is a tunable parameter. Let us take an example of a function, f, which converts input rectangle into a sphere. Now the function in the MapReduce paradigm get the following parameters – map(f, input). f – A processing function. In out case, its input – Input is split into many shards, based on the input split size (Figure 1). Data Clustering using MapReduce: A Look at Various Clustering Algorithms Implemented with MapReduce Paradigm Data Clustering has always been one of the most sought after problems to solve in Machine Learning due to its high applicability in industrial settings and the active community of researchers and developers applying clustering techniques in various fields, such as Recommendation Systems, Image Processing, Gene sequence analysis, Text analytics, etc., and in competitions such as Netflix Prize and Kaggle. Figure 1. Map(f, input) applied to input
on a different output key. It is important to note that the reduce phase cannot start until the map phase has completed, because a reducer needs every intermediate value produced for its key (Figure 2).

As seen above, the reducer function takes a list of values associated with a key as its input and performs processing over that list. In the representation above we have only one reducer task, but the number of reducers depends on the configuration settings of the system. In our example, the reducer function takes the intermediate map output and produces the final output; it looks like Figure 3.

Figure 2. Reduce(r, intermediate_value_list) applied to intermediate_value_list to get the final output of the MapReduce phase
Figure 3. Reducer function r()
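To make the contract concrete, here is the classic word-count job written against the Hadoop Java API (an illustrative sketch, not from the original article): the mapper emits an intermediate <word, 1> pair for every token, and the reducer folds each key's value list into a count.

Listing 2. A minimal Hadoop word-count job (illustrative sketch)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: one input record in, many intermediate <word, 1> pairs out.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // intermediate <word, 1>
      }
    }
  }

  // Reducer: all intermediate values for one key arrive as a list; fold them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum)); // final <word, count>
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}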
Machine Learning 101
Machine Learning is a branch of Artificial Intelligence which deals with building systems that 'learn' from data, without the need for explicit programming of every use-case. It is the study of ways in which a computing system can 'learn' – understand the input, make predictions based on the input, and improve from feedback. Machine learning has been applied in many varied fields, e.g. e-mail analytics, the Genome project, image processing, supply chain optimization, and many more. These applications are also motivated by the simplicity of some of the algorithms and their efficiency, which matters in real-world settings. Machine Learning differentiates itself from Data Mining by focusing only on known properties learned from the data; Data Mining techniques may use machine learning internally to mine for unknown properties.

Machine Learning algorithms can be classified into the following subcategories (source: Wikipedia; see Figure 4).

Supervised learning – Generates a model which takes the input and produces the desired output. It is trained on a training set to understand the input, and later works on incoming data to provide the output.

Unsupervised learning – Models the input into a set of representations produced by the algorithm. It does not follow the train-test model used in supervised learning, but works on finding a representation of the input data. Clustering is one of the prominent approaches in this space.

Semi-supervised learning – Combines supervised and unsupervised approaches to generate approximate classifiers: predictions learned from a fixed, labeled set of inputs (the training set) are applied to improve performance on the unlabeled test set.

Reinforcement learning – Learns from feedback on its output, and improves as interactions with the system increase. One of the most important approaches uses Artificial Neural Networks, which feed back the response with each interaction with the system.

Learning to learn – Uses experience as part of learning. The experience factor comes from learning a problem together with other, similar problems simultaneously. This allows the learner to share experience across problems and build a better model, applied to all the problems used for learning simultaneously.

Figure 4. Machine Learning Taxonomy (Source: Wikipedia)

Example of Supervised Learning
The Google Prediction API (https://developers.google.com/prediction/) takes a training set as input (Listing 1). After being trained, it predicts the language of new input using the model built from the training data.

Listing 1. Input file to the model with two fields – <Language, Sample Phrase in Language> (Source: excerpts from the file available at https://developers.google.com/prediction/docs/language_id.txt)

Unsupervised Learning
As stated above, unsupervised learning refers to the problem of finding hidden structure or representation in unlabeled data. Simply stated – finding a way to group similar data points into a data structure (e.g. cluster them together). Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning.

Clustering
Cluster analysis refers to the process of grouping similar objects without labels. Similarity is defined by the business needs or by the requirements of the clustering algorithm. Its eventual goal is to place similar objects in the same cluster and distinct objects in distinct clusters, the distinction being driven by the similarity measure. It is mainly used in exploratory mining, and in statistical analysis techniques such as pattern recognition, information retrieval, etc.

The main purposes of using data clustering are [1]: to gain insight into data – generate hypotheses, detect anomalies, and identify salient features; to measure the natural similarity among forms or organisms (phylogenetic relationships); and to compress data by organizing and summarizing it through cluster prototypes.

We'll look into three ways to cluster data points – K-Means, Canopy Clustering, and MinHash Clustering – and into two approaches to implementing each: sequential and MapReduce. These three algorithms have proved useful in applications such as recommendation engines [2], image processing, and pre-clustering [3].

K-Means Clustering
K-Means clustering is a method of cluster analysis which aims to form k groups from the n data points taken as input. The partitioning happens by each data point associating itself with the nearest mean.
Classical Approach
The main steps of the k-means algorithm are as follows:

1. Select k points as the initial seed centroids.
2. Assign each data point to its closest cluster center – assignment happens based on the closest mean.
3. Recompute the cluster centroids using the mean formula for multidimensional data points (Figure 5).
4. Repeat steps 2 and 3 until the termination condition holds – a fixed number of iterations chosen by the user, or centroids whose dimensions no longer change.

Theoretically: let X = {x_i}, i = 1,...,n, be the set of n d-dimensional points to be clustered into a set of K clusters C = {c_k}, k = 1,...,K. K-means finds a partition such that the sum of squared error (distances) between the empirical mean of a cluster and the points in the cluster is minimized. Let μ_k be the mean of cluster c_k. The sum of squared error between μ_k and the points of cluster c_k is

J(c_k) = Σ_{x_i ∈ c_k} ||x_i − μ_k||²

The goal of K-Means is to minimize the sum of the squared error over all K clusters,

J(C) = Σ_{k=1}^{K} Σ_{x_i ∈ c_k} ||x_i − μ_k||²

Figure 5. K-Means applied to a dataset of 65 data points (n=65) with the number of clusters set to 3 (k=3). (Source of the dataset – [4]; plot produced by Wolfram Mathematica)
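The loop above translates directly into code. Below is a minimal sequential sketch (illustrative, not from the original article); random seeding and a fixed iteration count stand in for the user-chosen termination test.

Listing 3. A minimal sequential K-Means (illustrative sketch)

import java.util.Random;

public class SequentialKMeans {

  static double squaredDistance(double[] a, double[] b) {
    double d = 0;
    for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; d += t * t; }
    return d;
  }

  /** Returns the index of the centroid closest to the point. */
  static int nearest(double[] point, double[][] centroids) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int k = 0; k < centroids.length; k++) {
      double d = squaredDistance(point, centroids[k]);
      if (d < bestDist) { bestDist = d; best = k; }
    }
    return best;
  }

  static double[][] cluster(double[][] points, int k, int maxIterations) {
    int dim = points[0].length;
    Random rnd = new Random(42);
    // Seed centroids with k randomly chosen points.
    double[][] centroids = new double[k][];
    for (int i = 0; i < k; i++)
      centroids[i] = points[rnd.nextInt(points.length)].clone();

    for (int iter = 0; iter < maxIterations; iter++) {
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      // Assignment step: attach every point to its closest mean.
      for (double[] p : points) {
        int c = nearest(p, centroids);
        counts[c]++;
        for (int d = 0; d < dim; d++) sums[c][d] += p[d];
      }
      // Update step: recompute each centroid as the mean of its points.
      for (int c = 0; c < k; c++)
        if (counts[c] > 0)
          for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
    }
    return centroids;
  }
}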
As we can observe, the classical approach requires all the data points in memory so they can be compared with the cluster centroids. This is plausible only when the data sets are not very large ('very large' depends on the size of your RAM, the processing power of your CPU, and the time within which the answer is expected back). The approach does not scale well for two reasons: first, the complexity is quite high – k * n * O(distance metric) * num(iterations); and second, we need a solution which can scale to very large datasets (data files in GBs, TBs, ...). MapReduce paves the way for making such a solution feasible by distributing the work into tasks operating on smaller chunks of data, making use of a multi-node setup (a distributed system). There are multiple ways to improve the sequential approach, and many improved ways to implement a MapReduce version of k-means.

Distance Metrics
K-means is typically used with the Euclidean metric for computing the distance between points and cluster centers. Other distance metrics used with k-means are the Mahalanobis distance, the Manhattan distance, inner product spaces, the Itakura–Saito distance, the family of Bregman distances, or any metric you define over the space.

MapReduce Approach
MapReduce works on keys and values, and is based on data partitioning. Thus, the assumption of having all data points in memory fails in this paradigm. We have to design the algorithm in such a manner that the tasks can be parallelized, with no split depending on other splits for any computation (Figure 6).

The mappers do the distance computation and emit a key-value pair – <centroid_id, datapoint>. This step finds the association of a data point with a cluster. The reducers each work with a specific cluster_id and the list of data points associated with it. A reducer computes the new mean and writes it to the new centroid file. Then, based on the user's choice, the termination check runs – either a fixed number of iterations, or a comparison with the centroids of the previous iteration. A MapReduce-based K-Means implementation is also part of Mahout [7][9] (Figure 7).

Figure 6. Single pass of K-Means on MapReduce
Figure 7. K-Means Algorithm. The termination method is user-driven
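A single iteration of this job might be sketched as follows (illustrative, not from the original article; Centroids.load, Vectors.parse, Vectors.squaredDistance and Vectors.format are hypothetical helpers for reading the shared centroid file and for vector I/O). A driver program would rerun the job, feeding the reducer output back in as the next iteration's centroid file, until the termination test passes.

Listing 4. One K-Means iteration as a MapReduce job (illustrative sketch)

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  // Mapper: distance computation; emits <centroid_id, datapoint>.
  public static class AssignMapper
      extends Mapper<Object, Text, IntWritable, Text> {
    private double[][] centroids; // current means, identical on every mapper

    @Override
    protected void setup(Context context) {
      // Hypothetical helper: reads the centroid file that the driver
      // placed on the distributed cache before this iteration.
      centroids = Centroids.load(context.getConfiguration());
    }

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      double[] point = Vectors.parse(value.toString()); // hypothetical parser
      int nearest = 0;
      double best = Double.MAX_VALUE;
      for (int k = 0; k < centroids.length; k++) {
        double d = Vectors.squaredDistance(point, centroids[k]);
        if (d < best) { best = d; nearest = k; }
      }
      context.write(new IntWritable(nearest), value);
    }
  }

  // Reducer: receives every point of one cluster; writes the new mean.
  public static class RecomputeReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    public void reduce(IntWritable clusterId, Iterable<Text> points,
        Context context) throws IOException, InterruptedException {
      double[] sum = null;
      long count = 0;
      for (Text t : points) {
        double[] p = Vectors.parse(t.toString());
        if (sum == null) sum = new double[p.length];
        for (int d = 0; d < p.length; d++) sum[d] += p[d];
        count++;
      }
      for (int d = 0; d < sum.length; d++) sum[d] /= count;
      // The formatted mean becomes one line of the new centroid file.
      context.write(clusterId, new Text(Vectors.format(sum)));
    }
  }
}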
Canopy Clustering
Canopy clustering is an unsupervised clustering algorithm, primarily applied as a pre-clustering step – its output is given as input to a classical clustering algorithm such as k-means. Pre-clustering with canopies speeds up the actual clustering, which is very useful on very large datasets. The idea is to cheaply partition the data into overlapping subsets (called 'canopies'), and then perform the more expensive clustering only within these canopies (Figure 8 and Figure 9).

Figure 8. (a) 1100 samples generated (b) Resulting canopies superimposed on the sample data: two circles, with radius T1 and T2. (Source: https://cwiki.apache.org/MAHOUT/canopy-clustering.html)
Figure 9. An example of four data clusters with the canopies superimposed. Point A was selected at random and forms a canopy consisting of all points within the outer (solid) threshold. (Source: [3])

Algorithm
The main steps of the partitioning are as follows. Let us assume that we have a list of data points named X.

1. Decide on two threshold values, T1 and T2, where T1 > T2.
2. Randomly pick one data point from X; it becomes a canopy centroid. Let it be A.
3. Calculate the distance d between A and every other point. If d < T1, add the point to A's canopy. If d < T2, also remove the point from X, so it can no longer seed a canopy of its own.
4. Repeat steps 2 and 3 until X is empty.

The MapReduce version builds the canopy centroids in a first pass [10]. Each mapper runs the steps above over its own split: a point found to be within an existing canopy is added to it; otherwise it seeds a new canopy in the mapper's internal list of canopies. The mapper then outputs its canopy centroids using a constant key, so that they all reach a single reducer as <"centroid", vector>. The reducer receives all of the intermediate centroids and again applies the canopy measure and thresholds to produce the final canopy centroids (i.e. it clusters the cluster centroids). The data is then assigned to these centroids in a second pass, similar to a single pass of K-Means, producing the canopies.
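A sequential sketch of the center-selection loop above (illustrative, not from the original article): it collects only the canopy centers; assigning members – every point within T1 of a center – is a second, cheap pass, and a plain Euclidean metric stands in for whatever cheap distance the application uses.

Listing 5. Sequential canopy-center selection with thresholds T1 > T2 (illustrative sketch)

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CanopyClustering {

  static List<double[]> canopyCenters(List<double[]> points,
      double t1, double t2) {
    List<double[]> centers = new ArrayList<>();
    List<double[]> remaining = new ArrayList<>(points);
    while (!remaining.isEmpty()) {
      // Pick an arbitrary remaining point as a new canopy center.
      double[] center = remaining.remove(0);
      centers.add(center);
      Iterator<double[]> it = remaining.iterator();
      while (it.hasNext()) {
        double[] p = it.next();
        double d = distance(center, p); // cheap, approximate metric
        // d < t1: p joins this canopy (it may also join others).
        // d < t2: p is so close it can never seed its own canopy.
        if (d < t2) it.remove();
      }
    }
    return centers;
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; s += t * t; }
    return Math.sqrt(s);
  }
}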
MinHash Clustering
MinHash clustering is a probabilistic clustering method from the family of Locality Sensitive Hashing (LSH) algorithms. It is generally used where clustering can be driven by the range of one dimension of the data points, as in [2], where clustering is done purely on the basis of the news articles a user has clicked. The probability of a pair of data points being assigned to the same cluster is proportional to the overlap between the sets of items those users have clicked on: the higher the overlap of the clicked item sets, the higher the probability that the two data points end up in the same cluster ('high' is domain-specific).

Example of MinHash Clustering
Let us assume that we have an e-commerce website with a catalog of products. The clustering then proceeds as follows:

- Take a random permutation of the catalog; refer to it as R (Figure 10).
- To find the cluster a user belongs to, take the product in the user's click list that appears earliest in R; that minimum gives the cluster number. The cluster ID is the product ID itself.
- This creates packets of users associated with the same product number.
- Append the cluster information to the user record. It can act as a column entry (assuming the use of a NoSQL database in this case) written in a pre-processing step.
- We need to make sure that the number of products is known beforehand, and that the clusters formed are not so big that they reduce the accuracy of the clusters.
- The users can then be grouped into a separate collection (table) based on the cluster values. For tighter clusters, we can group q random permutations together and cluster users based on these q-grouped MinHash values.

Figure 10. Random permutation of the catalog

MapReduce Approach
As shown in the example, for ease of understanding we take a data point to be a click-logger entry for a user, containing each product click made by that user. The workflow in Figure 13 describes the process of clustering the users based on clicks. In the map phase, the mappers take the data points as input and produce a 2-tuple containing the cluster id – the minimum entry in the click list under the product-catalog permutation R. The mappers share a common file (on the distributed cache) containing the permutation of the catalog (Figure 11). The reduce phase performs no further processing and just emits the list of users associated with each cluster_id (Figure 12 and Figure 13).

Figure 11. The mappers share a common file (on the distributed cache) containing the permutation of the catalog
Figure 12. MinHash Clustering. The example assumes data storage in a NoSQL database, such as MongoDB, in which ClusterID is appended as a column entry to the user
Figure 13. MinHash Clustering MapReduce workflow. The reference range for finding the cluster is shared across mappers using the distributed cache
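The core of the mapper's computation is tiny. The sketch below (illustrative, not from the original article) derives a user's cluster ID from a click list and a permutation map, and shows how q permutations can be concatenated into a finer-grained key; it assumes the permutation map contains every product ID in the catalog.

Listing 6. MinHash cluster assignment over a permuted catalog (illustrative sketch)

import java.util.List;
import java.util.Map;

public class MinHashClustering {

  /**
   * permutation maps a product ID to its position in the random
   * permutation R of the catalog (computed once, and shared with all
   * mappers via the distributed cache in the MapReduce version).
   */
  static int clusterId(List<Integer> clickedProducts,
      Map<Integer, Integer> permutation) {
    int minPos = Integer.MAX_VALUE;
    int cluster = -1;
    for (int product : clickedProducts) {
      int pos = permutation.get(product); // assumes every product is in R
      if (pos < minPos) { minPos = pos; cluster = product; }
    }
    return cluster; // the cluster ID is the product ID itself
  }

  /** Concatenating q MinHash values yields finer-grained clusters. */
  static String clusterKey(List<Integer> clicks,
      List<Map<Integer, Integer>> permutations) {
    StringBuilder key = new StringBuilder();
    for (Map<Integer, Integer> perm : permutations)
      key.append(clusterId(clicks, perm)).append('-');
    return key.toString();
  }
}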
Conclusion
Clustering algorithms are a time-tested way of finding patterns in data, and with rising data volumes it is evident that these algorithms must scale. Industry and academia are already doing a lot to scale clustering algorithms, both vertically and horizontally. MapReduce helps by managing data partitions and parallel processing over these data shards. In this article we saw only a handful of algorithms from the vast number of machine learning techniques that can be molded into MapReduce [8][9].

VARAD MERU
Varad Meru, Software Development Engineer at Orzota, Inc. I completed my Bachelors in Computer Science from Shivaji University, Kolhapur in 2011. After graduating, I joined Persistent Systems as a Software Engineer and worked on recommendation engines, search engines, e-mail analytics, etc. At Orzota, I work on algorithms, distributed computing and data science; we are trying to make the 'Big Data' platform easy to consume. I am active with HUGs, have given talks on Big Data and related technologies at various venues, and like mentoring the next generation of professionals in data science.
LinkedIn: in.linkedin.com/in/vmeru/
Company Website: www.orzota.com

References
[1] Jain, A. K. (2010) "Data clustering: 50 years beyond K-means", Pattern Recognition Letters 31, pp. 651–666.
[2] Das, A. S.; Datar, M.; Garg, A.; and Rajaram, S. (2007) "Google news personalization: scalable online collaborative filtering", WWW '07, pp. 271–280.
[3] McCallum, A.; Nigam, K.; and Ungar, L. H. (2000) "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD '00, pp. 169–178.
[4] Lingras, P. and West, C. (2004) "Interval Set Clustering of Web Users with Rough K-means", Journal of Intelligent Information Systems, Vol. 23, No. 1, pp. 5–16.
[5] Lingras, P. and West, C. (2004) "Interval Set Clustering of Web Users with Rough K-means", Journal of Intelligent Information Systems, Vol. 23, No. 1, pp. 5–16.
[6] Jain, A. K.; and Dubes, R. C. (1988) "Algorithms for Clustering Data", Prentice Hall.
[7] Introducing Apache Mahout – IBM developerWorks – http://www.ibm.com/developerworks/java/library/j-mahout
[8] Chu, C. T.; Kim, S. K.; Lin, Y. A.; Yu, Y.; Bradski, G. R.; Ng, A. Y.; and Olukotun, K. (2006) "Map-Reduce for Machine Learning on Multicore", in NIPS, pp. 281–288, available at http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
[9] Apache Mahout – K-Means Clustering – http://cwiki.apache.org/MAHOUT/k-means-clustering.html
[10] Apache Mahout – Canopy Clustering – https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
[11] Apache Mahout – http://mahout.apache.org