Machine Learning in Big Data
(MapReduce, KNIME, Spark)
Presented by: Sahmoudi Yahia, Targhi Amal
Proposed by: Bouchra Frikh
24/12/2015
Outline
 Introduction
 Big Data
 Machine Learning
 Applications of ML Techniques to Data mining Tasks
 Why Machine Learning in Big Data?
 Big Data processing frameworks
Introduction
• Every day, 2.5 quintillion bytes of data are created and 90% of the data
in the world today were produced within the past two years (IBM 2012).
• On October 4, 2012, the first presidential debate between President
Barack Obama and Governor Mitt Romney triggered more than 10
million tweets within two hours (Twitter Blog 2012).
• Another example is Flickr, a public picture sharing site, which received
1.8 million photos per day, on average, from February to March 2012
(Michel F. 2012).
• “The most fundamental challenge for the Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions” (Rajaraman and Ullman, 2011).
• What is big data?
• What is machine learning?
• Why do machine learning in big data?
• What big data processing frameworks integrate machine learning algorithms?
Big Data
• Big data is a term that describes large volumes of data (structured, semi-structured, unstructured) which can be exploited to extract information.
• Big data can be characterized by 3Vs:
• the extreme volume of data
• the wide variety of types of data
• the velocity at which the data must be processed
P.S.: we can also add Variability and Complexity.
Big Data Analytics
Big data analytics is the process of examining large data
sets containing a variety of data types.
“ It’s important to remember that the primary value from big data
comes not from the data in its raw form, but from the processing and
analysis of it and the insights, products, and services that emerge from
analysis. The sweeping changes in big data technologies and
management approaches need to be accompanied by similarly dramatic
shifts in how data supports decisions and product/service innovation.”
Thomas H. Davenport in Big Data in Big Companies
Machine Learning
The design and implementation of methods that allow a machine to evolve through a systematic process and to complete tasks that are difficult for a classical algorithm.
Machine learning is when the data replace the algorithm.
Example
Facebook's News Feed changes according to the user's personal interactions with
other users. If a user frequently tags a friend in photos, writes on his wall or likes his
links, the News Feed will show more of that friend's activity in the user's News Feed
due to presumed closeness.
Applications of ML Techniques to
Data mining Tasks
• The amount of data seems to increase rapidly every single day in most domains related to information processing, and the need to find a way to mine databases and extract knowledge from them remains crucial.
• Data Mining (DM) refers to the automated extraction of hidden predictive information from databases. Among the tasks of DM are:
• Diagnosis
• Pattern Recognition
• Prediction
• Classification
• Clustering
• Optimization
• Control
The impact of ML techniques used
for DM tasks
Artificial Neural Networks for DM
Genetic Algorithms for DM
Inductive Logic Programming for DM
Rule Induction for DM
Decision Trees for DM
Instance-based Learning Algorithms for DM
Example
• Amazon is an interesting example of the application of machine learning in e-commerce. Suppose we search for a product on Amazon today. When we come back to the site another day, it is able to offer us products related to our specific needs (prediction), thanks to machine learning algorithms that infer our changing needs from our previous visits to the site.
Why Machine Learning in big Data?
• It delivers on the promise of extracting value from big and
disparate data sources with far less reliance on human
direction. It is data driven and runs at machine scale. It is well
suited to the complexity of dealing with disparate data sources
and the huge variety of variables and amounts of data
involved. And unlike traditional analysis, machine learning
thrives on growing datasets. The more data fed into a machine
learning system, the more it can learn and apply the results to
higher quality insights.
Big Data processing frameworks
• Machine learning is far from a recent phenomenon. What is new, however, is the number of parallelized data processing platforms for managing Big Data.
• In this presentation we will take a look at some of the most widely used frameworks: MapReduce, KNIME, and Spark.
MapReduce
• MapReduce, invented by Google, is the heart of Hadoop: it is a processing technique and a programming model for distributed computing, based on Java.
• MapReduce allows massive scalability across hundreds or thousands of servers in a Hadoop cluster, and lets us write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
• The MapReduce algorithm refers to two important tasks, Map and Reduce:
• Map: takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
• Reduce: takes the output of a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
Algorithm
• Map stage: the map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage: this stage is the combination of the Shuffle stage and the Reduce stage. The reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
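As an illustration of how these two stages are wired together, here is a minimal, hypothetical Hadoop driver for the word-count example developed in the following slides (class names, input/output paths, and configuration details are assumptions, not taken from the original slides):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // One Job object describes the whole MapReduce computation.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map stage: reads the input file from HDFS line by line (see the mapper sketch below).
        job.setMapperClass(WordCountMapper.class);
        // Reduce stage: runs after the shuffle, on the grouped mapper output (see the reducer sketch below).
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);          // the word
        job.setOutputValueClass(IntWritable.class); // its count

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the book, stored in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```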
Example
• The Word Count program is used in this section to explain how a mapper works; it will also be used to explain the operation of a reducer.
• The goal of this program is to count the number of occurrences of the different words that make up a book.
Book content
Step 1: Map
• Line 1: the mapper reads an input record in the form of a <key, value> pair, with:
• the value of type String (a line of the file);
• the key of type LongWritable.
• Line 2: for each word in the current line...
• Line 3: ...we write to an output file the couple <w, 1>, corresponding to one occurrence of the word held in the variable w.
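The mapper code itself appears on the original slide as an image; a minimal sketch consistent with the three lines described above might look like the following (the use of Hadoop's Text type for the line and whitespace tokenization are assumptions):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Line 1: the input record is a <LongWritable offset, Text line> pair.
        // Line 2: for each word w in the current line...
        for (String w : value.toString().split("\\s+")) {
            if (w.isEmpty()) continue;
            word.set(w);
            // Line 3: ...emit the couple <w, 1>, one occurrence of the word.
            context.write(word, ONE);
        }
    }
}
```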
The output file of the mapper
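The output itself is shown as an image on the original slide; for a hypothetical input line such as "to be or not to be" (not the actual book), the mapper would emit one <word, 1> pair per occurrence:

```text
<to, 1>
<be, 1>
<or, 1>
<not, 1>
<to, 1>
<be, 1>
```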
Step 2 (between Map and Reduce): shuffle and sort
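This step is shown as a diagram on the original slide. Conceptually, the framework sorts the mapper output by key and groups together all the values emitted for the same key, so each reducer receives a word along with the list of its counts. Continuing the hypothetical "to be or not to be" line from above:

```text
<be,  [1, 1]>
<not, [1]>
<or,  [1]>
<to,  [1, 1]>
```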
Step 3: Reduce
Line 1: the reducer reads an input record as a <key, value> pair, with:
the key of type Text (one word);
the value a list of values of type IntWritable.
Line 2: the reducer resets the counter wordCount whenever the word changes.
Line 3: each value v in the list (here v is always 1) is added to wordCount.
Line 4: when the word changes, we write to an output file the couple <inKey2, wordCount>, where wordCount is the number of occurrences of the word.
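Again, the reducer code is an image on the original slide. The description above suggests a counter that is reset whenever the word changes; in the standard Hadoop Java API the framework already groups the values per key, so an equivalent sketch (class and variable names are illustrative) is simply:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Line 1: the input is a <Text word, list of IntWritable values> pair.
        int wordCount = 0;
        // Lines 2-3: add each value v (always 1 here) to the counter for this word.
        for (IntWritable v : values) {
            wordCount += v.get();
        }
        // Line 4: write the couple <word, wordCount> to the output file.
        context.write(key, new IntWritable(wordCount));
    }
}
```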
Use case in real life
• http://hadoopilluminated.com/hadoop_book/Hadoop_Use_Cases.html#d1575e1290
Spark
• Apache Spark is an open source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley.
• Spark extends the popular MapReduce model to efficiently support more types of computations.
A Unified Stack
• One of the largest advantages of tight integration is the ability to build
applications that seamlessly combine different processing models. For
example, in Spark you can write one application that uses machine learning
to classify data in real time as it is ingested from streaming sources.
Simultaneously, analysts can query the resulting data, also in real time, via
SQL (e.g., to join the data with unstructured logfiles).
• In addition, more sophisticated data engineers and data scientists can
access the same data via the Python shell for ad hoc analysis. Others might
access the data in standalone batch applications. All the while, the IT team
has to maintain only one system.
RDD
• The resilient distributed dataset (RDD): an RDD is simply a distributed collection of elements. In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
• RDDs are the core concept in Spark.
• You can think of an RDD as a table in a database.
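As a minimal sketch using Spark's Java API (the local master setting and the HDFS path are assumptions for illustration), an RDD can be created either from an in-memory collection or from a file, after which Spark partitions it across the cluster:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD built from an in-memory collection...
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // ...or from a file stored in HDFS (the path is hypothetical).
        JavaRDD<String> lines = sc.textFile("hdfs:///data/book.txt");

        // The data are automatically split into partitions distributed over the cluster.
        System.out.println("partitions: " + numbers.partitions().size());

        sc.stop();
    }
}
```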
• Transformations: transformations do not return a single value; they return a new RDD. Nothing is evaluated when a transformation function is used; the function just takes an RDD and returns a new RDD. Transformation functions include map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, ...
• Actions: actions evaluate and return a value. When an action function is called on an RDD object, all the pending data processing is computed and the result is returned. Actions include reduce, collect, count, first, take, countByKey and foreach, ...
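A hedged word-count sketch in Spark's Java API (targeting the 1.x API current at the time; the input path is hypothetical) makes the contrast concrete: the transformations only build new RDDs lazily, and nothing runs until an action is called:

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "spark-word-count");
        JavaRDD<String> lines = sc.textFile("hdfs:///data/book.txt");

        // Transformations: each call returns a new RDD; nothing is evaluated yet.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")))   // words
                .mapToPair(word -> new Tuple2<>(word, 1))             // <word, 1>
                .reduceByKey((a, b) -> a + b);                        // <word, total>

        // Actions: trigger the computation and return a result to the driver.
        System.out.println(counts.count() + " distinct words");
        for (Tuple2<String, Integer> t : counts.take(10)) {
            System.out.println(t._1() + " : " + t._2());
        }

        sc.stop();
    }
}
```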
Spark Ecosystem
Spark Core
Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems,
and more. Spark Core is also home to the API that defines resilient distributed datasets
(RDDs), which are Spark’s main programming abstraction. RDDs represent a
collection of items distributed across many compute nodes that can be manipulated
in parallel. Spark Core provides many APIs for building and manipulating these
collections.
Example in real life
Another early Spark adopter is Conviva, one of the largest
streaming video companies on the Internet, with about 4
billion video feeds per month (second only to YouTube). As
you can imagine, such an operation requires pretty
sophisticated behind-the-scenes technology to ensure a
high quality of service. As it turns out, it’s using Spark to
help deliver that QoS by avoiding dreaded screen buffering.
MapReduce vs Spark
• Performance: Spark processes data in-memory; MapReduce processes data on disk.
• Compatibility: Apache Spark can run standalone, on top of Hadoop YARN, or in the cloud; MapReduce runs with Hadoop.
• Data processing: Spark supports both batch and real-time (streaming) data; MapReduce is batch only.
• Failure tolerance: both are fault tolerant.
• Security: Spark security is still in its infancy; Hadoop MapReduce has more security features and projects.
Conclusion

Editor's notes

  1. Example: from all these tweets, analysts were able to learn about the interests of the public (health insurance, for instance). Assuming the size of a photo is 2 megabytes (MB), this amounts to about 3.6 terabytes per day, and each photo is worth roughly 1000 words.
  2. The examples just cited show big data applications where the data collected go beyond the capacity of the software tools commonly used to capture, manage, and process them within an acceptable elapsed time. That is why machine learning and big data are inseparable: machine learning lets us develop and study algorithms that allow machines to read and process the data automatically and efficiently.
  3. Structured data refers to any data that resides in a fixed field within a record or file. The phrase unstructured data usually refers to information that does not reside in a traditional row-column database: mail, video, photos, etc.
  4. Because big data takes too much time and costs too much money to load into a traditional relational database for analysis, new approaches have emerged that work on the raw data plus metadata, and machine learning / AI uses complex algorithms to look for reproducible patterns (for storage and processing). Big data analytics is associated with cloud computing because analyzing a large dataset in real time requires platforms such as Hadoop.
  5. Machine learning is a type of artificial intelligence (AI) that gives computers the ability to learn without being explicitly programmed. Machine learning focuses on developing programs that can themselves learn and change when new data are added.
  6. With the growing amount of data, intelligent data analysis has become a requirement for technological progress. Click streams coming from different sources are real-time data. With traditional reporting tools, if a company wants to compute averages or totals, it runs simple SQL queries; such basic processing still relies on a human being to direct the work and specify exactly what must be computed. A record of a user's activity on the internet includes the pages visited, the time spent, e-mails, the order of the pages, and so on.
  7. Machine learning is far from a recent phenomenon. What is new, however, is the number of parallelized data processing platforms for managing Big Data.
  8. MapReduce allows massive scalability across hundreds or thousands of servers in a Hadoop cluster and lets us write applications that process huge amounts of data in parallel, on large clusters of commodity hardware, in a reliable manner, with the capacity to adapt to new data.
  9. For people new to this topic, it can be somewhat difficult to understand, because it is typically not something people have been exposed to previously. If you're new to Hadoop's MapReduce jobs, don't worry: we'll try to describe it in a way that gets you up to speed quickly. Reduce takes the output of the map as input and combines these data tuples into a smaller set of tuples.
  10. Input of the reducer.
  11. We just wrote a Hadoop program for counting the number of occurrences of the different words that make up a book. But with one mapper and one reducer, the program's performance will be no better than that of a conventional program running on a single machine; that is why, in practice, we should use more than one mapper and reducer. The purpose of this example is just to explain the principle of MapReduce.
  12. Among the major features Spark offers is the ability to run computations in memory, which makes it more efficient than MapReduce and up to 100 times faster than MapReduce executing applications on disk. An "in-memory" database (IMDB, In-Memory DataBase, also called MMDB, Main Memory DB) is a database whose data are stored in main memory in order to speed up response times. Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries.
  13. Transformations return a new RDD: the principle is simply to take our RDD and return a new one. Actions perform computations and return a value.
  14. Spark SQL exposes Spark datasets through the JDBC API and lets us run SQL-style queries using traditional BI and visualization tools. Spark SQL can extract, transform, and load data in different formats (JSON, Parquet, databases) and expose them for ad hoc queries. Spark MLlib: MLlib is a machine learning library containing the classic learning algorithms and utilities, such as classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives. Spark Streaming: Spark Streaming can be used for real-time processing of streaming data. It relies on a "micro-batch" processing model and represents real-time data as DStreams, i.e., series of RDDs (Resilient Distributed Datasets).
  15. Helps deliver a quality of service while avoiding screen buffering.
  16. According to the Spark website, it also works with BI tools via JDBC and ODBC; Hive and Pig integration are on the way. Spark is a bit bare at the moment when it comes to security: Spark security is still in its infancy, while Hadoop MapReduce has more security features and projects. Authentication is supported via a shared secret.