Big Data &
Data Science
March 20, 2017
Big Data & Data Science: Agenda, 18:30 to 20:15
1/ The Apache Spark ecosystem
Johan Picard, Big Data Expert
2/ SQL on Hadoop at scale – SparkSQL2.1 & BigSQL4.3 on 100TB Hadoop-DS
Victor Hatinguais, Big Data Architect
3/ Social Data: Machine Learning for a project with a social purpose
Samed Atouati & Abdellah Lamrani Alaoui, aspiring Data Scientists, students at Ecole Centrale Paris
4/ Data Science Experience
Zied Abidi, Data Scientist
5/ How to make data speak in order to detect anomalies?
Pauline Clavelloux, Data Scientist
Questions & Answers - Closing
IBM | Spark 3
Power of data. Simplicity of design. Speed of innovation.
Apache Spark in 15 minutes
Apache Spark
Apache Spark is a fast and general engine for large scale data processing.
https://spark.apache.org/
Spark History: one of the most active open-source projects
2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo
2008 – Hadoop Summit
2010 – Spark paper
2013 – Spark 0.7 Apache Incubator
2014 – Apache Spark top-level
2014 – 1.2.0 released in December
2015 – 1.3.0 released in March
2015 – 1.4.0 released in June
2015 – 1.5.0 released in September
2016 – 1.6.0 released in January
2016 – 2.0.0 released in July
2016 – 2.1.0 released in December
Spark is HOT!!!
Most active project in Hadoop ecosystem
One of top 3 most active Apache projects
Databricks founded by the creators of Spark from UC Berkeley’s AMPLab
Spark is the most active open source project in Big Data
Source: Syncsort – Hadoop Perspectives for 2016
[Chart: number of Spark contributors by year, 2014 to 2016, reaching 900 in 2016]
Now 1039 contributors…
Why Spark? In-memory performance and code compactness
Why Spark? A distributed framework
• Spark RDD: in-memory distribution
• HDFS: on-disk distribution
Resilient Distributed Dataset
Create RDDs:
• parallelize
• textFile
• Transformations
Get results:
• Actions
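The create, transform, act cycle listed on this slide can be mimicked in a few lines of plain Python. This is only a toy sketch, not Spark's API: the class `ToyRDD`, its methods and the helper `add_one` are hypothetical names chosen to echo `parallelize`, `map`, `filter` and `collect`, and everything runs in a single process. It illustrates one idea only: transformations are recorded without doing any work, and an action triggers the computation.

```python
from typing import Any, Callable, Iterable, List, Tuple


class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source: Iterable[Any],
                 steps: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = list(source)   # the input data
        self._steps = steps           # recorded transformations, not yet applied

    @classmethod
    def parallelize(cls, data: Iterable[Any]) -> "ToyRDD":
        """Create a dataset from an in-memory collection."""
        return cls(data)

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        """Transformation: only records the step and returns a new dataset."""
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        """Transformation: only records the step and returns a new dataset."""
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def collect(self) -> List[Any]:
        """Action: replays every recorded step and returns the result."""
        items = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:  # "filter"
                items = [x for x in items if fn(x)]
        return items


calls = []  # records when the mapped function actually runs


def add_one(x: int) -> int:
    calls.append(x)
    return x + 1


numbers = ToyRDD.parallelize(range(1, 11))                  # create
evens = numbers.map(add_one).filter(lambda x: x % 2 == 0)   # transformations
calls_before_action = len(calls)                            # nothing has run yet
result = evens.collect()                                    # action triggers the work
print(calls_before_action, result)
```

In this sketch `add_one` is not called until `collect()` runs, which is the lazy behaviour the slide's split between transformations and actions points at.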
Why Spark? A bunch of comfortable APIs
Spark Programming Languages
Why Spark? A unified framework
• Distributed File System
• Data Preparation
• SQL Engine: Spark SQL
• Stream Processing: Spark Streaming
• Graph Engine: GraphX
• Machine Learning: MLlib
• Distributed R: SparkR
Spark complements Hadoop (1/3): Hadoop Strengths
Unlimited Scale
• Multiple data sources
• Multiple applications
• Multiple users
Enterprise Platform
• Reliability
• Resiliency
• Security
Wide Range of Data Formats
• Files
• Semi-structured
• Databases
Spark complements Hadoop (2/3): MapReduce Weaknesses
Ease of Development
• Need deep Java skills
• Few abstractions available for analysts
In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle
Combine Workflows
• Only suitable for batch workloads
• Rigid processing model
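The "write to disk with each cycle" and "rigid processing model" points on this slide can be pictured with a toy word count in plain Python. This is a sketch under loud assumptions, not Hadoop code: `map_phase` and `reduce_phase` are hypothetical helpers, each input line stands in for one map task, and a temporary directory stands in for the cluster's disks. It shows the fixed map-then-reduce shape and the disk round trip between the two phases.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def map_phase(lines, workdir):
    """Each 'map task' (one per input line) emits (word, 1) pairs to its own file."""
    paths = []
    for task_id, line in enumerate(lines):
        pairs = [[word.lower(), 1] for word in line.split()]
        path = Path(workdir) / f"map-{task_id}.json"
        path.write_text(json.dumps(pairs))   # interim result goes to disk
        paths.append(path)
    return paths


def reduce_phase(paths):
    """The 'reduce task' reads every interim file back from disk and sums per word."""
    totals = defaultdict(int)
    for path in paths:
        for word, count in json.loads(path.read_text()):
            totals[word] += count
    return dict(totals)


lines = [
    "spark complements hadoop",
    "hadoop stores data",
    "spark processes data",
]

with tempfile.TemporaryDirectory() as workdir:
    interim_files = map_phase(lines, workdir)    # map, then write
    counts = reduce_phase(interim_files)         # read, then reduce

print(counts)
```

A second pass over the same data would have to repeat the whole write-and-read cycle, which is the cost that an in-memory framework is meant to avoid.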
Spark complements Hadoop (3/3): Spark Advantages
Ease of Development
• Easier APIs
• Python, Scala, Java
In-Memory Performance
• Resilient Distributed Datasets
• Unify processing
Combine Workflows
• Batch
• Interactive
• Iterative algorithms
• Micro-batch
The Flexibility of Spark on a Stable Hadoop Platform
Spark: In-Memory Performance, Ease of Development, Combine Workflows
Hadoop: Unlimited Scale, Enterprise Platform, Wide Range of Data Formats
How to develop and run a Spark job?
• Spark Shell: interactive Scala
• PySpark: interactive Python
• Spark Submit: compiled
• Notebooks: Jupyter, Zeppelin
What Spark Is Not!
• Not only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a standalone system
• Not a data store – Spark attaches to other data stores but does not provide its own
• Not only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally well
• Not a replacement for Streams – Spark Streaming is micro-batching, not true streaming, and cannot handle real-time complex event processing
• Not a language!!!
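To make the micro-batching point in the list above concrete, here is a small plain-Python sketch. It is not Spark Streaming code: `micro_batches`, the sample `events` and the 0.5 second interval are all illustrative assumptions. Events are grouped into fixed time slices and each slice is processed as one small batch, so a result can only appear once its slice has closed.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Intervals that contain no events are simply skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    return [batches[index] for index in sorted(batches)]


# Hypothetical event stream: (seconds since start, measured value)
events = [(0.1, 3), (0.4, 5), (0.6, 1), (1.2, 7), (1.4, 2)]

# One result per batch instead of one result per event
batch_sums = [sum(batch) for batch in micro_batches(events, interval=0.5)]
print(batch_sums)
```

The event at 0.1 s is only reflected in a result after its 0.5 s slice ends, which is why the slide distinguishes micro-batching from per-event stream processing.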
Spark and IBM
IBM has the largest investment in Spark of any company in the world
Visit www.spark.tc for more information
IBM Spark Technology Center
https://ibm.biz/hadoop-jira
https://ibm.biz/spark-jira
• One of the top committers/contributors
• 300+ inventors
• Commitment to educate 1 million data scientists
• Contributed SystemML
• Founding member of AMPLab
• Partnerships in the ecosystem
Leadership in Spark
• Spark Technology Center (STC) has contributed 829 code changes to Spark components since we started around the middle of 2015
• STC contributions have been: 52% to Spark SQL, 16% to PySpark, 26% to ML and MLlib
• For more details, use this dashboard: https://www.ibm.biz/spark-jira
Data Science Experience (DSX)
ALL YOUR TOOLS IN ONE PLACE
IBM Data Science Experience is an environment that brings together everything that a Data Scientist needs. It includes the most popular Open Source tools and IBM unique value-add functionalities with community and social features, integrated as a first class citizen to make Data Scientists more successful.
datascience.ibm.com
Power of data. Simplicity of design. Speed of innovation.
"PoT IBM" on Google
May 9: Handling massive data with Spark
May 10: Machine learning training using DSX

A short introduction to Spark and its benefits


Editor's Notes

  1. Open source: committers & contributors. Databricks: the company behind Spark; its policy is to keep the majority of committers so as to steer feature decisions and its business model. Project Management Committees (PMC).
     • Nearly 20% of all JIRAs were contributed by the Spark Technology Center, placing IBM as the number two contributor to the Apache Spark Project by most accounts. In Machine Learning, the Spark Technology Center contributed no less than 45% of the new features, and up to 25% of the enhancements.
     • The STC has contributed 60-75% of all lines of code (LOC) worldwide to the PySpark project.
     • Significant code contributions were also made in SparkR, WebUI and many others.
     • In Spark SQL, Spark’s most active component, IBM leveraged its long-standing SQL experience by contributing up to 25% of all bug fixes for the new release.
  2. Spark is the most active open source project in Big Data with over 600 contributors in 2015, up from 315 in the previous 12-24 months. Today (5/26/2016) that number is up to 900! Look here to get the latest count: https://github.com/apache/spark Considering that Spark was only founded in 2009 and open-sourced in 2010, this is considerable growth. In an interesting survey done by Syncsort, nearly 70 percent of respondents, when asked which compute framework they were most interested in, answered Spark, surpassing interest in all other compute frameworks, including the recognized incumbent, MapReduce. MapReduce is an original component of the Hadoop ecosystem that is being rapidly subsumed by Spark, which boasts better compute performance and a facility for interactive, streaming and other advanced Big Data analytics. We’ll talk about the advantages of Spark in a later slide. Notice that many of the market leaders leverage Spark. The list above is not exhaustive; these are some of the market leaders that presented at the 2015 Spark Summit in San Francisco, and many of their presentations can be found online. The point is, Spark is gaining speed rapidly in the market, and for good reason, as you’ll learn from this presentation. Read more about Spark’s rapid growth: http://www.techrepublic.com/article/apache-spark-rises-to-become-most-active-open-source-project-in-big-data/
  3. Add another graph? Hortonworks did not back Spark at first; its Tez project was fairly similar but was abandoned with the advent of Spark.
  4. Immutable. Two types of operations:
     • Transformations ~ DDL (Create View V2 as…)
       val rddNumbers = sc.parallelize(1 to 10): numbers from 1 to 10
       val rddNumbers2 = rddNumbers.map(x => x + 1): numbers from 2 to 11
       The lineage describing how to obtain rddNumbers2 from rddNumbers is recorded. It is a Directed Acyclic Graph (DAG).
       No actual data processing takes place: lazy evaluation.
     • Actions ~ DML (Select * From V2…)
       rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
       Performs the transformations and the action.
       Returns a value (or writes to a file).
     • Fault tolerance: if data in memory is lost, it will be recreated from lineage.
     • Caching, persistence (memory, spilling, disk) and check-pointing.
  5. A day in the life of a Hadoop developer
  6. Open source innovation is the first leg we’ve just talked about. When it comes to Big Data, Apache Hadoop has been the dominant open source technology (and collection of projects, really) up until very recently, and it continues to be very important. The reasons are captured here on this slide, which extend the point we talked about a few slides ago, when we mentioned the low cost of storage that Hadoop is able to take advantage of. First, Hadoop has virtually unlimited scale. If it’s big enough for Yahoo!, Facebook, and LinkedIn, who deal with enormous data volumes, it should be good enough for any customer. And the scale also applies to the heterogeneous nature of the data, the applications running on the data, and the users running Hadoop applications. Hadoop can store virtually any kind of data, and if the hardware is there, it can support many concurrent applications or users. Second, Hadoop has become an enterprise-class platform. Much of the recent work in the open source community around Hadoop has been hardening its security capabilities. Applications using Hadoop are in place today that are PCI-DSS compliant. Hadoop has always been known for its resiliency with its failover capabilities for both data storage and processing. More recently, the services administering the storage and processing systems in Hadoop have themselves also gained failover capability. Finally, Hadoop is now seen as a reliable data engine – reports of issues like data corruption are exceedingly rare in Apache Hadoop. Third, Hadoop supports a wide range of the kinds of data you need to store: at the lowest level, it can store any kind of file data – part of Hadoop is, after all, a file system. Hadoop can also host databases for structured data, and you can also use Hadoop to work with what many term “semi-structured” data, such as log files.
  7. Apache Hadoop was once synonymous with MapReduce. As recently as early 2014, there was still considerable hype around MapReduce and its applications. However, as Hadoop has entered the mainstream, its challenges have become increasingly apparent. First, from a developer perspective, programming Hadoop-MapReduce applications is quite difficult and requires specialized skills in parallel programming and a deep understanding of Java. Also, there are very few abstractions available to let analysts work with data easily and flexibly, and the ones that do exist typically do not perform very quickly. Second, Hadoop-MapReduce has no in-memory framework. Applications have their individual tasks load data sets, but once the tasks complete, the data sets are no longer in memory, and while they are in memory, they aren’t shared with other applications. Also, during the execution of a MapReduce application, each map task writes its interim result sets to disk. This is highly inefficient, as the reduce tasks then need to read them from disk instead of from memory. Third, Hadoop-MapReduce is only suitable for batch workloads. There is no shame in this, as that’s what it was designed for, but users who want to take advantage of Hadoop’s benefits need support for interactive or real-time workloads as well. And coming back to the execution of applications, only one pattern is supported in Hadoop-MapReduce: map, and then reduce. There are many use cases where different patterns are needed, for example map, reduce, reduce. You can make these different patterns work in Hadoop-MapReduce, but it comes at a great cost in terms of complexity and performance.
  8. Apache Spark has been an active open source project since 2010, but it became hugely popular starting around the middle of 2014. It is, in fact, the single most active project in the Apache Software Foundation, with over 500 code updates made per month by a community of over 400 contributors. The major reason for its popularity is that it addresses the weak points of Hadoop-MapReduce. While MapReduce has proven to be highly difficult, Spark is much simpler. Raw Spark applications (which can be coded in Java, like MapReduce, but also in Python and Scala) are still not for novice programmers, but they are far more accessible and require less coding than Hadoop-MapReduce. Spark itself is written in Scala, which is a relatively new language. One of the major features of Spark is its in-memory capabilities, which are based on the Spark concept of a Resilient Distributed Dataset (RDD). This greatly speeds up workloads, because you can keep data loaded in memory across multiple jobs within an application, saving them the overhead of reloading data from disk. Early benchmarking results have shown speedups of between 10x and 100x for the same applications as compared to MapReduce. Another reason for Spark’s massive appeal is its ability to support different classes of workloads. You can use Spark to build batch applications, just as you would have with Hadoop-MapReduce, but with its in-memory capabilities, interactive workloads (like running SQL queries) and iterative algorithms (running machine learning models against the same data set) are also possible. Finally, Spark Streaming enables micro-batch workloads (these are near-real-time workloads, where a micro-batch could, for example, keep latency as low as half a second for streaming data).
  9. There are some analyst reports that have provocative titles, like “Hadoop vs. Spark,” or “Does Spark Mean the End of Hadoop?”. Many of these articles are heavily sensationalized, and ignore the reality that Spark actually integrates deeply with Hadoop. Yes, Spark can run in a standalone mode, or on other distributed environments like Mesos, AWS, or Cassandra. But the majority of Spark adoption and activity we see is in concert with Hadoop. After all, Spark is just a processing framework – it needs data, resource management, and other enterprise services. Hadoop has all those things, which makes it an ideal complement to Spark. And as we can see on this slide, Spark fills holes that Hadoop itself has. Spark brings ease of use for developers, high performance from its in-memory capabilities, and much more flexible support for different kinds of workloads to Hadoop. The key point here is that it’s not “Spark or Hadoop,” but “Spark AND Hadoop.”
  10. To run the application, you first need to define its dependencies. In Scala, they are defined in the simple.sbt file. In Java, they are defined in the pom.xml file. In Python, you don’t need to define any dependencies for this simple application, but if you use third-party libraries, you can pass them with the --py-files argument. Next, you place your files in the typical directory structure shown for Scala and Java. Python does not need this. Finally, for Scala and Java, you create the JAR package using the appropriate build tool, and then you run spark-submit to execute the application.
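As an illustration of the Scala case, a build definition might look like the sketch below. The project name, the version numbers, and the `SimpleApp` class and file names are assumptions chosen to match Spark 2.1.0, the latest release on the history slide; they are not taken from this deck.

```scala
// simple.sbt (sketch; name and versions are assumptions, adjust to your cluster)
//
// Assumed directory layout:
//   ./simple.sbt
//   ./src/main/scala/SimpleApp.scala

name := "Simple Project"

version := "1.0"

// Spark 2.1.0 is built against Scala 2.11.
scalaVersion := "2.11.8"

// %% appends the Scala binary version (_2.11) to the artifact name.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
```

With this layout, `sbt package` should produce `target/scala-2.11/simple-project_2.11-1.0.jar`, which can then be run with something like `spark-submit --class "SimpleApp" --master local[4] target/scala-2.11/simple-project_2.11-1.0.jar`.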
  11. Let’s talk about some of the misconceptions about Spark. Many people are confused about the difference between Hadoop and Spark, so as we go through these points we’ll also discuss how they relate to Hadoop. Spark does not require Hadoop to run. You can run Spark in its standalone mode, on Hadoop clusters through YARN, or on Apache Mesos. Spark does not include a storage layer. You must provide a data store for Spark to access. Spark can access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source. You do not need to have a machine learning project to use Spark. Spark can handle other complex analytics, such as stream and graph processing. Spark does have a library for streaming, which can be useful for many use cases; however, it is not true streaming. Spark Streaming processes data streams in batches, where each batch contains the collection of events that arrived over the batch period (regardless of when the data were actually created). This is fine for some applications, such as simple counts into Hadoop, but be aware that the lack of true record-by-record processing limits stream-processing and time-series analytics that depend on when each event was created.
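The micro-batch behaviour described above can be modelled in a few lines of plain Scala. This is a toy model, not the Spark Streaming API: the `Event` type, its fields, and `toBatches` are hypothetical names introduced only to show that batch membership depends on arrival time rather than creation time.

```scala
// Hypothetical event type: eventTimeMs is when the data was created,
// arrivalMs is when it reached the receiver.
final case class Event(id: String, eventTimeMs: Long, arrivalMs: Long)

object MicroBatchSketch {

  // Assign each event to a batch using its ARRIVAL time only,
  // which is how the notes above describe micro-batching.
  def toBatches(events: Seq[Event], batchMs: Long): Map[Long, Seq[Event]] =
    events.groupBy(e => e.arrivalMs / batchMs)

  val sample: Seq[Event] = Seq(
    Event("a", eventTimeMs = 100, arrivalMs = 120),
    Event("b", eventTimeMs = 450, arrivalMs = 480),
    Event("c", eventTimeMs = 200, arrivalMs = 700) // created early, arrived late
  )

  def main(args: Array[String]): Unit = {
    val batches = toBatches(sample, batchMs = 500)
    batches.toSeq.sortBy(_._1).foreach { case (n, es) =>
      println(s"batch $n: ${es.map(_.id).mkString(", ")}")
    }
  }
}
```

Event `c` was created before `b` but lands in the later batch because it arrived late. That ordering gap is what makes analytics keyed on creation time awkward in a pure arrival-time batching model.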
  12. Nearly 20% of all JIRAs were contributed by the Spark Technology Center (STC), placing IBM as the number two contributor to the Apache Spark project by most accounts. In Machine Learning, the STC contributed no less than 45% of the new features and up to 25% of the enhancements. The STC has contributed 60-75% of all lines of code (LOC) worldwide to the PySpark project. Significant code contributions were also made in SparkR, WebUI and many others. In Spark SQL, Spark’s most active component, IBM leveraged its long-standing SQL experience by contributing up to 25% of all bug fixes for the new release.
  13. Hadoop: https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12327116
      Spark: https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12326761