The Spark Ecosystem

       Michael Malak


   technicaltidbit.com
Agenda
•    What Hadoop gives us
•    What everyone is complaining about in 2013
•    Spark
       – Berkeley Team
       – BDAS (Berkeley Data Analytics Stack)
       – RDDs (Resilient Distributed Datasets)
       – Shark
       – Spark Streaming
       – Other Spark subsystems
Global Big Data Apr 23, 2013   technicaltidbit.com   2
What Hadoop Gives Us
• HDFS
• Map/Reduce




Hadoop: HDFS




                                 Image from mark.chmarny.com




Hadoop: Map/Reduce




Image from blog.octo.com




                                                        Image from people.apache.org/~rdonkin




Map/Reduce Tools


          Pig Script                     HiveQL          Hbase App

              Pig                         Hive

                                        Hadoop

                                          Linux




Hadoop Distribution Dogs in the Race

  Hadoop Distribution (logo images)      Query Tool
                                         Apache Drill
                                         Stinger



Other Open Source Solutions
• Druid
• Spark




Not just caching, but streaming
•    1st generation: HDFS
•    2nd generation: Caching & “Push” Map/Reduce
•    3rd generation: Streaming




Berkeley Team
• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program

Image from Ion Stoica’s slides from Strata 2013 presentation
BDAS (Berkeley Data Analytics Stack)

  Bagel App      Shark App      Spark Streaming App
  Bagel          Shark          Spark Streaming        Spark App
  Hadoop/HDFS    Spark
  Mesos
  Linux


RDDs (Resilient Distributed Datasets)




                               Image from Matei Zaharia’s paper




RDDs: Laziness
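// Transformations: all lazy, nothing is computed yet
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))   // shorthand for x => x.startsWith("ERROR")
                  .map(_.split('\t')(2))
                  .filter(_.contains("foo"))
// Action! This triggers the computation
val cnt = errors.count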
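The lazy-until-action behavior can be tried without a cluster. The sketch below is not Spark code: it uses a plain Scala collection view as a stand-in for an RDD, and the sample log lines, the `LazinessSketch` object, and the `predicateCalls` counter are all assumptions made up for illustration.

```scala
object LazinessSketch {
  // Counts how many times the first filter predicate actually runs.
  var predicateCalls = 0

  // Hypothetical tab-separated log lines standing in for the HDFS file.
  val lines = Seq("ERROR\tdb\tfoo failed", "INFO\tweb\tok", "ERROR\tweb\tbar failed")

  // .view makes the pipeline lazy: building it evaluates nothing.
  val errors = lines.view
    .filter { l => predicateCalls += 1; l.startsWith("ERROR") }
    .map(_.split('\t')(2))
    .filter(_.contains("foo"))

  // size plays the role of the action: it forces the whole pipeline.
  def count(): Int = errors.size
}
```

Before `count()` is called the counter is still zero, which is the same effect the slide labels "All Lazy".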




RDDs: Transformations vs. Actions
  Transformations                        Actions
  map(func)                              reduce(func)
  filter(func)                           collect()
  flatMap(func)                          count()
  sample(withReplacement, frac, seed)    take(n)
  union(otherDataset)                    first()
  groupByKey[K,V]                        saveAsTextFile(path)
  reduceByKey[K,V](func)                 saveAsSequenceFile(path)
  join[K,V,W](otherDataset)              foreach(func)
  cogroup[K,V,W1,W2](other1, other2)
  cartesian[U](otherDataset)
  sortByKey[K,V]

  [K,V] in Scala plays the same role as <K,V> in C++ templates and Java generics.
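To make the keyed transformations concrete, here is a minimal sketch of what groupByKey and reduceByKey compute, written against local Scala collections rather than RDDs. The `PairOpsSketch` object and its sample pairs are assumptions for illustration; the real operations are distributed and lazy, which this sketch is not.

```scala
object PairOpsSketch {
  // Hypothetical (key, value) pairs.
  val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

  // groupByKey-like: collect all values that share a key.
  def groupByKey[K, V](xs: Seq[(K, V)]): Map[K, Seq[V]] =
    xs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // reduceByKey-like: combine each key's values with func.
  def reduceByKey[K, V](xs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    groupByKey(xs).map { case (k, vs) => k -> vs.reduce(func) }
}
```

For example, `PairOpsSketch.reduceByKey(PairOpsSketch.pairs)(_ + _)` sums the values per key.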

Hive vs. Shark

  Hive:   HiveQL -> HDFS files
  Shark:  HiveQL -> HDFS files + RDDs




Shark: Copy from HDFS to RDD
CREATE TABLE wiki_small_in_mem TBLPROPERTIES
  ("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;


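Either statement creates a table that is stored in the cluster’s
  memory using RDD.cache(). Shark caches a table when "shark.cache"
  is set to "true" or when the table name ends in _cached.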


Shark: Just a Shim

                                                                     Shark




                                 Images from Reynold Xin’s presentation




What about “Big Data”?


[Diagram: data-size scale running KB, MB, GB, TB, PB, with "Shark Effectiveness" marked between GB and TB]
Median Hadoop job input size




                               Image from Reynold Xin’s presentation


Spark Streaming: Motivation




x1,000,000 clients
                                              HDFS




Spark Streaming: DStream
• “A series of small batches”

  RDD (2 sec):  {"id": "hercman", "eventType": "buyGoods"}
                {"id": "hercman", "eventType": "buyGoods"}
                {"id": "shewolf", "eventType": "error"}

  RDD (2 sec):  {"id": "shewolf", "eventType": "error"}
                ...

  RDD (2 sec):  {"id": "catlover", "eventType": "buyGoods"}
                {"id": "hercman", "eventType": "logOff"}

  The sequence of RDDs forms the DStream.
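The "series of small batches" idea can be sketched in plain Scala by cutting a timestamped event list into fixed-length windows. This is an illustration only, not how Spark Streaming is implemented: the `Event` class, its `timeMs` field, and the `batches` helper are assumptions introduced here, and windows with no events are simply omitted.

```scala
object MicroBatchSketch {
  // Hypothetical event record; timeMs is an assumed arrival timestamp.
  case class Event(timeMs: Long, id: String, eventType: String)

  // Split events into consecutive batches of batchMs milliseconds,
  // ordered by time. Windows that contain no events produce no batch.
  def batches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1).map(_._2)
}
```

Each inner sequence plays the role of one RDD in the DStream; with `batchMs = 2000` it corresponds to the 2-second batches on the slide.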

Spark Streaming: DAG
  The DAG

  Kafka -> DStream[String] (JSON) -> DStream.transform -> DStream[EvObj] (two outgoing edges)

  Downstream chains shown:
    DStream.filter(_.eventType == "error")    -> DStream.foreach(println)
    DStream.filter(_.eventType == "buyGoods") -> DStream.foreach(println)
    DStream.map((_.id,1))                     -> DStream.groupByKey


Spark Streaming: Example Code
// Initialize
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG: parse each JSON message into an evObj
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))

// Branch 1: print the number of error events in each batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: print the number of distinct users buying in each batch
val usersBuying = events.filter(_.eventType == "buyGoods").map(e => (e.id, 1))
                        .groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start()
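What the two branches compute for a single batch can be checked locally. The sketch below is plain Scala over one in-memory batch, not the streaming API; the `EvObj` case class and the function names are assumptions that mirror the slide's fields.

```scala
object BatchDagSketch {
  // Hypothetical stand-in for the slide's event object.
  case class EvObj(id: String, eventType: String)

  // Error branch: number of error events in one batch.
  def errorCount(batch: Seq[EvObj]): Int =
    batch.count(_.eventType == "error")

  // Buying branch: filter, map to (id, 1), group by key, then count
  // the groups, i.e. the number of distinct users who bought something.
  def usersBuying(batch: Seq[EvObj]): Int =
    batch.filter(_.eventType == "buyGoods")
      .map(e => (e.id, 1))
      .groupBy(_._1)
      .size
}
```

Because the pairs are grouped by key before counting, a user who buys twice in the same batch is counted once.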




Stateful Spark Streaming
class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

// Called per user id with that user's new events and previous state
val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
    if (values.exists(_.eventType == "logOff"))
        None                                      // user logged off: drop the state
    else {
        val s = state.getOrElse(new ErrorsPerUser)
        values.foreach(e => {
            e.eventType match {
                case "error" => s.numErrors += 1
                case _       =>                   // ignore other event types
            }
        })
        Option(s)
    }
}

// DAG
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))
// Keep logOff events as well, so updateFunc can see them
val errorsAndLogOffs = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
val states = errorsAndLogOffs.map(e => (e.id, e))
                             .updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors: " + rdd.count))
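The per-key state update can be illustrated without Spark by folding batches into an ordinary Map. This is a sketch under stated assumptions: it uses an immutable Int error count in place of ErrorsPerUser, it assumes a logOff event should drop the user's state, and `EvObj`, `update`, and `step` are hypothetical names rather than Spark Streaming API.

```scala
object StateSketch {
  // Hypothetical stand-in for the slide's event object.
  case class EvObj(id: String, eventType: String)

  // Update contract: returning None drops the key, Some(n) keeps it.
  // Assumption: a logOff event ends tracking for that user.
  def update(values: Seq[EvObj], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else Some(state.getOrElse(0) + values.count(_.eventType == "error"))

  // Apply one batch to the state map, visiting both existing keys
  // and keys that first appear in this batch.
  def step(state: Map[String, Int], batch: Seq[EvObj]): Map[String, Int] = {
    val byUser = batch.groupBy(_.id)
    (state.keySet ++ byUser.keySet).flatMap { id =>
      update(byUser.getOrElse(id, Seq.empty), state.get(id)).map(id -> _)
    }.toMap
  }
}
```

Running `step` once per batch and passing the result into the next call carries the error counts across batches.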



Other Spark Subsystems
• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
•              (Machine Learning)




Teaser
                                  • Future Meetup: Machine
                                    learning from real-time
                                    data streams





Mais conteúdo relacionado

Mais procurados

Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Gabriele Modena
 
Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Robert Grossman
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?cneudecker
 
Big Data Tools : PAST, NOW and FUTURE
Big Data Tools : PAST, NOW and FUTUREBig Data Tools : PAST, NOW and FUTURE
Big Data Tools : PAST, NOW and FUTUREJazz Yao-Tsung Wang
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project GuidanceVarad Meru
 
Platforms for data science
Platforms for data sciencePlatforms for data science
Platforms for data scienceDeepak Singh
 
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...Geoffrey Fox
 
Lily @ Work Webinar
Lily @ Work WebinarLily @ Work Webinar
Lily @ Work WebinarNGDATA
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoopguest27e6764
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASAIan Foster
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Cloudera, Inc.
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 

Mais procurados (20)

Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
Hadoop
HadoopHadoop
Hadoop
 
Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)
 
Spark and shark
Spark and sharkSpark and shark
Spark and shark
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?
 
Big Data Tools : PAST, NOW and FUTURE
Big Data Tools : PAST, NOW and FUTUREBig Data Tools : PAST, NOW and FUTURE
Big Data Tools : PAST, NOW and FUTURE
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project Guidance
 
Platforms for data science
Platforms for data sciencePlatforms for data science
Platforms for data science
 
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack (with a ...
 
Lily @ Work Webinar
Lily @ Work WebinarLily @ Work Webinar
Lily @ Work Webinar
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoop
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASA
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 

Semelhante a Spark 2013-04-17

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Cedric CARBONE
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Big Data with Modern R & Spark
Big Data with Modern R & SparkBig Data with Modern R & Spark
Big Data with Modern R & SparkXavier de Pedro
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopStefano Paluello
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataRichard McDougall
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"Nicola Ferraro
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueShay Sofer
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchDirk Petersen
 
Druid meetup 4th_sql_on_druid
Druid meetup 4th_sql_on_druidDruid meetup 4th_sql_on_druid
Druid meetup 4th_sql_on_druidYousun Jeong
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemGregg Barrett
 

Semelhante a Spark 2013-04-17 (20)

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
Bids talk 9.18
Bids talk 9.18Bids talk 9.18
Bids talk 9.18
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Big Data with Modern R & Spark
Big Data with Modern R & SparkBig Data with Modern R & Spark
Big Data with Modern R & Spark
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and Hadoop
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big Data
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Tese phd
Tese phdTese phd
Tese phd
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
Druid meetup 4th_sql_on_druid
Druid meetup 4th_sql_on_druidDruid meetup 4th_sql_on_druid
Druid meetup 4th_sql_on_druid
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 

Último

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Último (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Spark 2013-04-17

  • 1. The Spark Ecosystem Michael Malak technicaltidbit.com
  • 2. Agenda • What Hadoop gives us • What everyone is complaining about in 2013 • Spark – Berkeley Team – BDAS (Berkeley Data Analytics Stack) – RDDs (Resilient Distributed Datasets) – Shark – Spark Streaming – Other Spark subsystems Global Big Data Apr 23, 2013 technicaltidbit.com 2
  • 3. What Hadoop Gives Us • HDFS • Map/Reduce
  • 4. Hadoop: HDFS Image from mark.chmarny.com
  • 5. Hadoop: Map/Reduce Image from blog.octo.com Image from people.apache.org/~rdonkin
  • 6. Map/Reduce Tools Pig Script HiveQL Hbase App Pig Hive Hadoop Linux
  • 7. Hadoop Distribution Dogs in the Race Hadoop Distribution Query Tool Apache Drill Stinger
  • 8. Other Open Source Solutions • Druid • Spark
  • 9. Not just caching, but streaming • 1st generation: HDFS • 2nd generation: Caching & “Push” Map/Reduce • 3rd generation: Streaming
  • 10. Berkeley Team • 40 students • 8 faculty • 3 staff software engineers • Silicon Valley style skunkworks office space • 2 years into a 6-year program (Image from Ion Stoica’s slides from Strata 2013 presentation)
  • 11. BDAS (Berkeley Data Analytics Stack), layers from top to bottom:
        Spark Streaming App | Bagel App | Shark App | Spark App
        Spark Streaming | Bagel | Shark
        Spark
        Hadoop/HDFS | Mesos
        Linux
  • 12. RDDs (Resilient Distributed Datasets) Image from Matei Zaharia’s paper
  • 13. RDDs: Laziness
        val lines = spark.textFile("hdfs://...")
        val errors = lines.filter(_.startsWith("ERROR"))  // shorthand for x => x.startsWith("ERROR")
                          .map(_.split('\t')(2))
                          .filter(_.contains("foo"))      // all lazy up to here
        val cnt = errors.count                            // action!
  • 14. RDDs: Transformations vs. Actions
        Transformations: map(func), filter(func), flatMap(func), sample(withReplacement, frac, seed), union(otherDataset), groupByKey[K,V](func), reduceByKey[K,V](func), join[K,V,W](otherDataset), cogroup[K,V,W1,W2](other1, other2), cartesian[U](otherDataset), sortByKey[K,V]
        Actions: reduce(func), collect(), count(), take(n), first(), saveAsTextFile(path), saveAsSequenceFile(path), foreach(func)
        Note: [K,V] in Scala is the same as <K,V> templates in C++, Java
  • 15. Hive vs. Shark
        Hive: HiveQL over HDFS files
        Shark: HiveQL over HDFS files + RDDs
  • 16. Shark: Copy from HDFS to RDD
        CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;
        CREATE TABLE wiki_cached AS SELECT * FROM wiki;
    Creates a table that is stored in a cluster’s memory using RDD.cache().
  • 17. Shark: Just a Shim Shark Images from Reynold Xin’s presentation
  • 18. What about “Big Data”? (Figure: Shark effectiveness marked against a data-size scale of KB, MB, GB, TB, PB)
  • 19. Median Hadoop job input size Image from Reynold Xin’s presentation
  • 20. Spark Streaming: Motivation x1,000,000 clients HDFS
  • 21. Spark Streaming: DStream • “A series of small batches” (Diagram: a DStream drawn as a sequence of RDDs, each holding 2 sec of JSON events; every event has an “id” such as “hercman”, “shewolf” or “catlover” and an “eventType” such as “buyGoods”, “error” or “logOff”)
  • 22. Spark Streaming: DAG
        Kafka (JSON) -> DStream[String] -> .transform -> DStream[EvObj]
        DStream[EvObj] -> .filter(_.eventType == "error") -> DStream -> .foreach(println)
        DStream[EvObj] -> .filter(_.eventType == "buyGoods") -> DStream -> .foreach(println)
        The DAG continues: DStream -> .map(e => (e.id, 1)) -> DStream -> .groupByKey
  • 23. Spark Streaming: Example Code
        // Initialize
        val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
        val messages = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

        // DAG
        val events: DStream[evObj] = messages.transform(rdd => rdd.map(new evObj(_)))
        val errorCounts = events.filter(_.eventType == "error")
        errorCounts.foreach(rdd => println(rdd.count))
        val usersBuying = events.filter(_.eventType == "buyGoods")
                                .map(e => (e.id, 1))  // explicit lambda: (_.id, 1) would not compile
                                .groupByKey
        usersBuying.foreach(rdd => println(rdd.count))

        // Go
        ssc.start()
  • 24. Stateful Spark Streaming
        class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

        val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
          if (values.exists(_.eventType == "logOff")) None  // user logged off: drop the state
          else {
            val s = state.getOrElse(new ErrorsPerUser)
            values.foreach(e => e.eventType match {
              case "error" => s.numErrors += 1
              case _       =>                               // ignore other event types
            })
            Option(s)
          }
        }

        // DAG
        val events: DStream[evObj] = messages.transform(rdd => rdd.map(new evObj(_)))
        // Keep logOff events as well as errors so that updateFunc can see them
        val errorsAndLogOffs = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
        val states = errorsAndLogOffs.map(e => (e.id, e))   // key each event by user id
                                     .updateStateByKey[ErrorsPerUser](updateFunc)

        // Off-DAG
        states.foreach(rdd => println("Num users experiencing errors:" + rdd.count))
  • 25. Other Spark Subsystems • Bagel (similar to Google Pregel) • Sparkler (Matrix decomposition) • (Machine Learning)
  • 26. Teaser • Future Meetup: Machine learning from real-time data streams
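
The laziness shown on slide 13 can be tried without a cluster. Below is a minimal sketch that assumes plain Scala collection views as a stand-in for RDDs; it does not use Spark. The names `LazinessSketch` and `predicateCalls` and the sample log lines are made up for illustration. A view is not an RDD (no partitions, no lineage-based recovery), but it defers work in a similar way: the chained `filter` and `map` calls only describe the pipeline, and nothing is evaluated until a result is requested.

```scala
object LazinessSketch {
  // Counts how many times the first filter predicate actually runs.
  var predicateCalls = 0

  // Stand-in for the lines of a log file (made-up sample data).
  val lines: Seq[String] = Seq(
    "ERROR\t2013-04-17\tfoo failed",
    "INFO\t2013-04-17\tall good",
    "ERROR\t2013-04-17\tbar failed"
  )

  // Building the pipeline on a view evaluates nothing yet,
  // much like chaining transformations on an RDD.
  val errors = lines.view
    .filter { l => predicateCalls += 1; l.startsWith("ERROR") }
    .map(_.split('\t')(2))
    .filter(_.contains("foo"))

  def main(args: Array[String]): Unit = {
    println(s"predicate calls before the action: $predicateCalls")
    val cnt = errors.size // plays the role of the count action
    println(s"count = $cnt, predicate calls after the action: $predicateCalls")
  }
}
```

The counter stays at zero until `size` is called, at which point every line passes through the pipeline. Calling `size` a second time runs the predicates again, which is roughly the recomputation that caching an RDD is meant to avoid.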