UC BERKELEY
It's All Happening On-line
Every:
  Click
  Ad impression
  Billing event
  Fast Forward, pause,…
  Friend Request
  Transaction
  Network message
  Fault
  …

User Generated (Web, Social & Mobile)
Internet of Things / M2M
Scientific Computing
Volume: Petabytes+
Variety: Unstructured
Velocity: Real-Time



Our view: More data should mean better answers


    • Must balance Cost, Time, and Answer Quality



Algorithms: Machine Learning and Analytics
Machines: Cloud Computing
People: CrowdSourcing & Human Computation

Massive and Diverse Data

throughout the entire analytics lifecycle
Alex Bayen (Mobile Sensing)       Anthony Joseph (Sec./ Privacy)
   Ken Goldberg (Crowdsourcing)      Randy Katz (Systems)
   *Michael Franklin (Databases)     Dave Patterson (Systems)
   Armando Fox (Systems)             *Ion Stoica (Systems)
   *Mike Jordan (Machine Learning)   Scott Shenker (Networking)



Organized for Collaboration:




> 450,000 downloads




• Sequencing costs (150X) → Big Data
  [Chart: $K per genome, 2001 - 2014, log scale from $100,000.0 down to $0.1]

• UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster
  @TCGA: 5 PB = 20 cancers x 1000 genomes


• See Dave Patterson’s Talk: Thursday 3-4, BDT205
  David Patterson, "Computer Scientists May Have What It Takes to Help Cure Cancer," New York Times, 12/5/2011
[Stack diagram, top to bottom]
MLBase (Declarative Machine Learning)
BlinkDB (approx QP)
Shark (SQL) + Streaming
Spark | Streaming
Shared RDDs (distributed memory)
Mesos (cluster resource manager)
HDFS
Left column: Hadoop MR, MPI, Graphlab, etc.
Legend: 3rd party | AMPLab (released) | AMPLab (in progress)


BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012
Lightning-Fast Cluster Computing
lines = spark.textFile("hdfs://...")             // Base RDD
errors = lines.filter(_.startsWith("ERROR"))     // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count       // Action
cachedMsgs.filter(_.contains("bar")).count

[Diagram: the Driver sends tasks to three Workers and receives results; the Workers hold Block 1, Block 2, Block 3 and Cache 1, Cache 2, Cache 3.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
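
The effect of cache() in the example above can be illustrated without a cluster. What follows is a minimal plain-Python sketch, not Spark code: the `LazyDataset` class, the `read_log` helper, the scan counter, and the sample log lines are all hypothetical names and data made up for illustration. It only shows the idea that transformations are lazy and that a cached dataset is computed once and then reused by later queries.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are lazy, cache() keeps results in memory."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function that yields the elements
        self._cached = None
        self._use_cache = False

    def filter(self, pred):
        # Nothing runs here; the work happens when collect() or count() is called.
        return LazyDataset(lambda: (x for x in self.collect() if pred(x)))

    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self.collect()))

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())   # computed on first use only
            return self._cached
        return list(self._compute())

    def count(self):
        return len(self.collect())


# Hypothetical tab-separated log lines; the message is the third field.
LOG = [
    "ERROR\tdb\tfoo timed out",
    "INFO\tweb\trequest served",
    "ERROR\tweb\tbar not found",
    "ERROR\tdb\tfoo refused",
]
scans = {"n": 0}   # counts how many times the source is read


def read_log():
    scans["n"] += 1
    return iter(LOG)


lines = LazyDataset(read_log)
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
cached_msgs = messages.cache()

foo_count = cached_msgs.filter(lambda m: "foo" in m).count()
bar_count = cached_msgs.filter(lambda m: "bar" in m).count()
print(foo_count, bar_count, scans["n"])
```

In this sketch the source is read once even though two queries run, which is the behavior the slide attributes to keeping the messages in memory. Removing the cache() call would make each query re-read the source.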
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD              FilteredRDD               MappedRDD
path = hdfs://…        func = _.contains(...)    func = _.split(…)
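
The chain above records how each dataset was derived from its parent. As a rough illustration of why that record is useful, here is a small plain-Python sketch (not Spark's implementation) in which each dataset stores its parent and the function applied to it, so any single partition can be rebuilt from the base data. The `Dataset` class, its method names, and the sample partitions are hypothetical.

```python
class Dataset:
    """Toy dataset that remembers its parent and the function that derived it."""

    def __init__(self, name, parent=None, func=None, source=None):
        self.name = name
        self.parent = parent
        self.func = func        # maps one parent partition to one partition
        self.source = source    # list of partitions, set only on the base dataset

    def filter(self, pred):
        return Dataset("Filtered", self, lambda part: [x for x in part if pred(x)])

    def map(self, fn):
        return Dataset("Mapped", self, lambda part: [fn(x) for x in part])

    def lineage(self):
        # Walk parent links back to the base dataset.
        chain = []
        node = self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def compute_partition(self, i):
        # Rebuild partition i by replaying the recorded functions from the base data.
        if self.parent is None:
            return list(self.source[i])
        return self.func(self.parent.compute_partition(i))


# Hypothetical base data split into two partitions.
base = Dataset("Base", source=[
    ["error\tdb\tdisk full", "info\tweb\tok"],
    ["info\tdb\tok", "error\tweb\ttimeout"],
])
messages = base.filter(lambda s: "error" in s).map(lambda s: s.split("\t")[2])

lineage = messages.lineage()
rebuilt = messages.compute_partition(1)   # as if partition 1 had been lost
print(lineage)
print(rebuilt)
```

Only the requested partition is recomputed here; the other partition is never touched.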
[Plot: a random initial line and the target line]

map readPoint, then cache: load data in memory once
w: initial parameter vector
map p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
reduce _ + _
Repeated MapReduce steps to do gradient descent
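
The per-point gradient term and the map/reduce structure above can be tried on a single machine. The following is a plain-Python sketch under stated assumptions, not the code from the talk: the helper names, the four sample points, the zero initial vector (the slide uses a random one), and the step size of 0.1 are all choices made here for illustration and determinism.

```python
import math
from functools import reduce


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, x, y):
    # Same per-point term as on the slide:
    # (1 / (1 + exp(-y * (w dot x))) - 1) * y * x
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def vec_add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]


def train(points, iterations=100, step=0.1):
    # Assumption: start from zeros instead of a random vector.
    w = [0.0] * len(points[0][0])
    for _ in range(iterations):
        grads = [point_gradient(w, x, y) for x, y in points]   # the "map" step
        gradient = reduce(vec_add, grads)                      # the "reduce" step
        w = [wi - step * gi for wi, gi in zip(w, gradient)]
    return w


# Hypothetical, linearly separable points: (features, label in {+1, -1}).
points = [
    ([1.0, 2.0], 1),
    ([2.0, 1.5], 1),
    ([-1.0, -1.5], -1),
    ([-2.0, -1.0], -1),
]

w = train(points)
correct = all(y * dot(w, x) > 0 for x, y in points)
print(w, correct)
```

The `points` list is reused unchanged on every iteration, which is the part the slide keeps in memory with cache.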
[Chart: running time (min) vs. number of iterations (1 to 30). Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.]
Java API (out now):
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return s.contains("error");
      }
    }).count();

PySpark (coming soon):
    lines = sc.textFile(...)
    lines.filter(lambda x: 'error' in x).count()
[Chart: time (hours). Hive: 20, Spark: 0.5]
[Architecture diagram, top to bottom]
Client
CLI | JDBC
Driver
Meta store | SQL Parser | Query Optimizer | Physical Plan Execution
MapReduce
HDFS

[Architecture diagram, top to bottom]
Client
CLI | JDBC
Driver | Cache Mgr.
Meta store | SQL Parser | Query Optimizer | Physical Plan Execution
Spark
HDFS

Row Storage           Column Storage
1   john    4.1       1      2      3
2   mike    3.5       john   mike   sally
3   sally   6.4       4.1    3.5    6.4
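
The two layouts above hold the same three records. As a small illustration, the plain-Python sketch below pivots the row layout into the column layout and then scans one column on its own. The column names `id`, `name`, and `score` and the `to_columns` helper are hypothetical; the slide does not name the fields.

```python
# Row storage: one tuple per record, as in the left-hand table.
rows = [
    (1, "john", 4.1),
    (2, "mike", 3.5),
    (3, "sally", 6.4),
]


def to_columns(rows, names):
    """Pivot a list of row tuples into a dict of per-column lists."""
    return {name: list(values) for name, values in zip(names, zip(*rows))}


# Column storage: one list per field, as in the right-hand table.
columns = to_columns(rows, ["id", "name", "score"])

# A query that needs a single field touches only that column.
avg_score = sum(columns["score"]) / len(columns["score"])
print(columns)
print(avg_score)
```

Each column holds values of a single type, so a scan over `score` never has to read the names or ids.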
[Chart: Selection. Series: Shark, Shark (disk), Hive. Y axis 0 to 100; one bar is labeled 1.1. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).]
[Chart: Group By. Series: Shark, Shark (disk), Hive. Y axis 0 to 600; one bar is labeled 32. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).]
[Chart: Join. Series: Shark (copartitioned), Shark, Shark (disk), Hive. Y axis 0 to 1800; one bar is labeled 105. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).]
[Charts: Query 1, Query 2, Query 3. Series: Shark, Shark (disk), Hive. Y axes 0 to 70, 0 to 70, and 0 to 100; labeled bars: 0.8 (Query 1), 0.7 (Query 2), 1.0 (Query 3). 100 m2.4xlarge nodes, 1.7 TB Conviva dataset.]
spark-project.org
amplab.cs.berkeley.edu

We are sincerely eager to hear your feedback on this presentation and on re:Invent.
Please fill out an evaluation form when you have a chance.

Mais conteúdo relacionado

Mais procurados

Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012Big Data Spain
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịDistance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịHong Ong
 
Ado.net session08
Ado.net session08Ado.net session08
Ado.net session08Niit Care
 
Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution ISSGC Summer School
 
Become a Java GC Hero - All Day Devops
Become a Java GC Hero - All Day DevopsBecome a Java GC Hero - All Day Devops
Become a Java GC Hero - All Day DevopsTier1app
 
Getting your hands dirty with deep learning in java
Getting your hands dirty with deep learning in javaGetting your hands dirty with deep learning in java
Getting your hands dirty with deep learning in javaDave Snowdon
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformMartin Zapletal
 
ModuLab DLC-Medical3
ModuLab DLC-Medical3ModuLab DLC-Medical3
ModuLab DLC-Medical3Dongheon Lee
 
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...spChains: A Declarative Framework for Data Stream Processing in Pervasive App...
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...Fulvio Corno
 
Marco Cattaneo "Event data processing in LHCb"
Marco Cattaneo "Event data processing in LHCb"Marco Cattaneo "Event data processing in LHCb"
Marco Cattaneo "Event data processing in LHCb"Yandex
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
H2O Distributed Deep Learning by Arno Candel 071614
H2O Distributed Deep Learning by Arno Candel 071614H2O Distributed Deep Learning by Arno Candel 071614
H2O Distributed Deep Learning by Arno Candel 071614Sri Ambati
 
Cassandra data structures and algorithms
Cassandra data structures and algorithmsCassandra data structures and algorithms
Cassandra data structures and algorithmsDuyhai Doan
 
Easy Scaling with Open Source Data Structures, by Talip Ozturk
Easy Scaling with Open Source Data Structures, by Talip OzturkEasy Scaling with Open Source Data Structures, by Talip Ozturk
Easy Scaling with Open Source Data Structures, by Talip OzturkZeroTurnaround
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Learning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleLearning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleYvonne K. Matos
 

Mais procurados (20)

Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
 
PyData Paris 2015 - Closing keynote Francesc Alted
PyData Paris 2015 - Closing keynote Francesc AltedPyData Paris 2015 - Closing keynote Francesc Alted
PyData Paris 2015 - Closing keynote Francesc Alted
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Bayesian Counters
Bayesian CountersBayesian Counters
Bayesian Counters
 
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịDistance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
 
Ado.net session08
Ado.net session08Ado.net session08
Ado.net session08
 
Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution Session 46 - Principles of workflow management and execution
Session 46 - Principles of workflow management and execution
 
Become a Java GC Hero - All Day Devops
Become a Java GC Hero - All Day DevopsBecome a Java GC Hero - All Day Devops
Become a Java GC Hero - All Day Devops
 
Getting your hands dirty with deep learning in java
Getting your hands dirty with deep learning in javaGetting your hands dirty with deep learning in java
Getting your hands dirty with deep learning in java
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive Platform
 
ModuLab DLC-Medical3
ModuLab DLC-Medical3ModuLab DLC-Medical3
ModuLab DLC-Medical3
 
User biglm
User biglmUser biglm
User biglm
 
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...spChains: A Declarative Framework for Data Stream Processing in Pervasive App...
spChains: A Declarative Framework for Data Stream Processing in Pervasive App...
 
Marco Cattaneo "Event data processing in LHCb"
Marco Cattaneo "Event data processing in LHCb"Marco Cattaneo "Event data processing in LHCb"
Marco Cattaneo "Event data processing in LHCb"
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
H2O Distributed Deep Learning by Arno Candel 071614
H2O Distributed Deep Learning by Arno Candel 071614H2O Distributed Deep Learning by Arno Candel 071614
H2O Distributed Deep Learning by Arno Candel 071614
 
Cassandra data structures and algorithms
Cassandra data structures and algorithmsCassandra data structures and algorithms
Cassandra data structures and algorithms
 
Easy Scaling with Open Source Data Structures, by Talip Ozturk
Easy Scaling with Open Source Data Structures, by Talip OzturkEasy Scaling with Open Source Data Structures, by Talip Ozturk
Easy Scaling with Open Source Data Structures, by Talip Ozturk
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Learning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and KaggleLearning Predictive Modeling with TSA and Kaggle
Learning Predictive Modeling with TSA and Kaggle
 

Destaque

AWS Webcast - Webinar Series for State and Local Government #3: Discover the ...
AWS Webcast - Webinar Series for State and Local Government #3: Discover the ...AWS Webcast - Webinar Series for State and Local Government #3: Discover the ...
AWS Webcast - Webinar Series for State and Local Government #3: Discover the ...Amazon Web Services
 
AWS Summit 2011: Customer Presentation - NYTimes
AWS Summit 2011: Customer Presentation - NYTimesAWS Summit 2011: Customer Presentation - NYTimes
AWS Summit 2011: Customer Presentation - NYTimesAmazon Web Services
 
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWSAWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWSAmazon Web Services
 
Staying Lean with Amazon Web Services
Staying Lean with Amazon Web ServicesStaying Lean with Amazon Web Services
Staying Lean with Amazon Web ServicesAmazon Web Services
 
AWS Customer Success Story - DotAndMedia
AWS Customer Success Story - DotAndMediaAWS Customer Success Story - DotAndMedia
AWS Customer Success Story - DotAndMediaAmazon Web Services
 
AWS Customer Presentation - Newsweek
AWS Customer Presentation - Newsweek AWS Customer Presentation - Newsweek
AWS Customer Presentation - Newsweek Amazon Web Services
 
(SPOT211) State of the Union: Amazon Compute Services | AWS re:Invent 2014
(SPOT211) State of the Union: Amazon Compute Services | AWS re:Invent 2014(SPOT211) State of the Union: Amazon Compute Services | AWS re:Invent 2014
(SPOT211) State of the Union: Amazon Compute Services | AWS re:Invent 2014Amazon Web Services
 
Time to Science, Time to Results: Accelerating Research with AWS - AWS Sympos...
Time to Science, Time to Results: Accelerating Research with AWS - AWS Sympos...Time to Science, Time to Results: Accelerating Research with AWS - AWS Sympos...
Time to Science, Time to Results: Accelerating Research with AWS - AWS Sympos...Amazon Web Services
 
Security in the AWS Cloud - Steve Riley
Security in the AWS Cloud - Steve RileySecurity in the AWS Cloud - Steve Riley
Security in the AWS Cloud - Steve RileyAmazon Web Services
 
Enterprise Empowerment & Innovation
Enterprise Empowerment & InnovationEnterprise Empowerment & Innovation
Enterprise Empowerment & InnovationAmazon Web Services
 
AWS Government, Education, and Nonprofits Symposium London, United Kingdom L...
 AWS Government, Education, and Nonprofits Symposium London, United Kingdom L... AWS Government, Education, and Nonprofits Symposium London, United Kingdom L...
AWS Government, Education, and Nonprofits Symposium London, United Kingdom L...Amazon Web Services
 
Next Generation of Storage Sydney Customer Appreciation Day
Next Generation of Storage Sydney Customer Appreciation DayNext Generation of Storage Sydney Customer Appreciation Day
Next Generation of Storage Sydney Customer Appreciation DayAmazon Web Services
 
AWS Cloud Kata 2013 | Singapore - Opening Keynote: Running Lean & Scaling Fas...
AWS Cloud Kata 2013 | Singapore - Opening Keynote: Running Lean & Scaling Fas...AWS Cloud Kata 2013 | Singapore - Opening Keynote: Running Lean & Scaling Fas...
AWS Cloud Kata 2013 | Singapore - Opening Keynote: Running Lean & Scaling Fas...Amazon Web Services
 
“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the CloudAmazon Web Services
 
Best Practices in Architecting for the Cloud Webinar - Jinesh Varia
Best Practices in Architecting for the Cloud Webinar - Jinesh VariaBest Practices in Architecting for the Cloud Webinar - Jinesh Varia
Best Practices in Architecting for the Cloud Webinar - Jinesh VariaAmazon Web Services
 
DAT201 Migrating Databases to AWS - AWS re: Invent 2012
DAT201 Migrating Databases to AWS - AWS re: Invent 2012DAT201 Migrating Databases to AWS - AWS re: Invent 2012
DAT201 Migrating Databases to AWS - AWS re: Invent 2012Amazon Web Services
 
AWSome Day Jakarta - Opening Keynote
AWSome Day Jakarta - Opening KeynoteAWSome Day Jakarta - Opening Keynote
AWSome Day Jakarta - Opening KeynoteAmazon Web Services
 

Destaque (20)

AWS Webcast - Webinar Series for State and Local Government #3: Discover the ...
AWS Webcast - Webinar Series for State and Local Government #3: Discover the ...AWS Webcast - Webinar Series for State and Local Government #3: Discover the ...
AWS Webcast - Webinar Series for State and Local Government #3: Discover the ...
 
Getting to MVP
Getting to MVPGetting to MVP
Getting to MVP
 
AWS Summit 2011: Customer Presentation - NYTimes
AWS Summit 2011: Customer Presentation - NYTimesAWS Summit 2011: Customer Presentation - NYTimes
AWS Summit 2011: Customer Presentation - NYTimes
 
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWSAWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
AWS Summit Stockholm 2014 – B2 – Migrating enterprise applications to AWS
 
Staying Lean with Amazon Web Services
Staying Lean with Amazon Web ServicesStaying Lean with Amazon Web Services
Staying Lean with Amazon Web Services
 
AWS Customer Success Story - DotAndMedia
AWS Customer Success Story - DotAndMediaAWS Customer Success Story - DotAndMedia
AWS Customer Success Story - DotAndMedia
 
Into the Cloud
Into the CloudInto the Cloud
Into the Cloud
 
AWS Customer Presentation - Newsweek
AWS Customer Presentation - Newsweek AWS Customer Presentation - Newsweek
AWS Customer Presentation - Newsweek
 
Application Portfolio Migration
Application Portfolio MigrationApplication Portfolio Migration
Application Portfolio Migration
 
(SPOT211) State of the Union: Amazon Compute Services | AWS re:Invent 2014
(SPOT211) State of the Union: Amazon Compute Services | AWS re:Invent 2014(SPOT211) State of the Union: Amazon Compute Services | AWS re:Invent 2014
(SPOT211) State of the Union: Amazon Compute Services | AWS re:Invent 2014
 
Time to Science, Time to Results: Accelerating Research with AWS - AWS Sympos...
Time to Science, Time to Results: Accelerating Research with AWS - AWS Sympos...Time to Science, Time to Results: Accelerating Research with AWS - AWS Sympos...
Time to Science, Time to Results: Accelerating Research with AWS - AWS Sympos...
 
Security in the AWS Cloud - Steve Riley
Security in the AWS Cloud - Steve RileySecurity in the AWS Cloud - Steve Riley
Security in the AWS Cloud - Steve Riley
 
Enterprise Empowerment & Innovation
Enterprise Empowerment & InnovationEnterprise Empowerment & Innovation
Enterprise Empowerment & Innovation
 
AWS Government, Education, and Nonprofits Symposium London, United Kingdom L...
 AWS Government, Education, and Nonprofits Symposium London, United Kingdom L... AWS Government, Education, and Nonprofits Symposium London, United Kingdom L...
AWS Government, Education, and Nonprofits Symposium London, United Kingdom L...
 
Next Generation of Storage Sydney Customer Appreciation Day
Next Generation of Storage Sydney Customer Appreciation DayNext Generation of Storage Sydney Customer Appreciation Day
Next Generation of Storage Sydney Customer Appreciation Day
 
AWS Cloud Kata 2013 | Singapore - Opening Keynote: Running Lean & Scaling Fas...
AWS Cloud Kata 2013 | Singapore - Opening Keynote: Running Lean & Scaling Fas...AWS Cloud Kata 2013 | Singapore - Opening Keynote: Running Lean & Scaling Fas...
AWS Cloud Kata 2013 | Singapore - Opening Keynote: Running Lean & Scaling Fas...
 
“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud“Spikey Workloads” Emergency Management in the Cloud
“Spikey Workloads” Emergency Management in the Cloud
 
Best Practices in Architecting for the Cloud Webinar - Jinesh Varia
Best Practices in Architecting for the Cloud Webinar - Jinesh VariaBest Practices in Architecting for the Cloud Webinar - Jinesh Varia
Best Practices in Architecting for the Cloud Webinar - Jinesh Varia
 
DAT201 Migrating Databases to AWS - AWS re: Invent 2012
DAT201 Migrating Databases to AWS - AWS re: Invent 2012DAT201 Migrating Databases to AWS - AWS re: Invent 2012
DAT201 Migrating Databases to AWS - AWS re: Invent 2012
 
AWSome Day Jakarta - Opening Keynote
AWSome Day Jakarta - Opening KeynoteAWSome Day Jakarta - Opening Keynote
AWSome Day Jakarta - Opening Keynote
 

Semelhante a BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012

Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一scalaconfjp
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xincaidezhi655
 
Java one2011 brisk-and_high_order_bits_from_cassandra_and_hadoop
Java one2011 brisk-and_high_order_bits_from_cassandra_and_hadoopJava one2011 brisk-and_high_order_bits_from_cassandra_and_hadoop
Java one2011 brisk-and_high_order_bits_from_cassandra_and_hadoopsrisatish ambati
 
Sedna XML Database: Executor Internals
Sedna XML Database: Executor InternalsSedna XML Database: Executor Internals
Sedna XML Database: Executor InternalsIvan Shcheklein
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesOleksii Diagiliev
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 

Semelhante a BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012 (20)

Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
20130912 YTC_Reynold Xin_Spark and Shark

BDT305 Transforming Big Data with Spark and Shark - AWS re: Invent 2012

  • 2. It’s All Happening On-line. Every: click, ad impression, billing event, fast forward, pause, …, friend request, transaction, network message, fault, … User Generated (Web, Social & Mobile); Internet of Things / M2M; Scientific Computing.
  • 3. Volume: Petabytes+. Variety: Unstructured. Velocity: Real-Time. Our view: More data should mean better answers. Must balance Cost, Time, and Answer Quality.
  • 5. UC BERKELEY. Algorithms: Machine Learning and Analytics. Machines: Cloud Computing. People: CrowdSourcing & Human Computation. Massive and Diverse Data.
  • 6. throughout the entire analytics lifecycle
  • 7. Alex Bayen (Mobile Sensing), Anthony Joseph (Sec./ Privacy), Ken Goldberg (Crowdsourcing), Randy Katz (Systems), *Michael Franklin (Databases), Dave Patterson (Systems), Armando Fox (Systems), *Ion Stoica (Systems), *Mike Jordan (Machine Learning), Scott Shenker (Networking). Organized for Collaboration.
  • 9. > 450,000 downloads
  • 10. Sequencing costs (150X): Big Data. UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster @TCGA: 5 PB = 20 cancers x 1000 genomes. See Dave Patterson’s Talk: Thursday 3-4, BDT205. David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011. [Chart: $K per genome, 2001 - 2014, axis from $0.1 to $100,000.0.]
  • 11. [Stack diagram] Components: MLBase (Declarative Machine Learning); BlinkDB (approx QP); Shark (SQL) + Streaming; Spark Streaming; Hadoop MR, MPI, Graphlab, etc.; Spark; Shared RDDs (distributed memory); Mesos (cluster resource manager); HDFS. Legend: 3rd party, AMPLab (released), AMPLab (in progress).
  • 19. Code on the slide:
        lines = spark.textFile("hdfs://...")              // base RDD
        errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
        messages = errors.map(_.split('\t')(2))
        cachedMsgs = messages.cache()
        cachedMsgs.filter(_.contains("foo")).count        // action
        cachedMsgs.filter(_.contains("bar")).count
    Diagram: the Driver sends tasks to three Workers and collects results; each Worker reads one block (Block 1 to 3) and keeps its own cache (Cache 1 to 3).
    Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).
    Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
  • 20. messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))
    Lineage: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))
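
To make the lineage chain on slide 20 and the caching on slide 19 concrete, here is a minimal pure-Python sketch. `MiniRDD` is a hypothetical single-machine stand-in written for this illustration, not Spark's API, and the three log lines are made-up sample data. Each transformation returns a new object that remembers its parent, nothing is computed until `count()` runs, and `cache()` keeps the computed records so a second query does not recompute them.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, name, compute, parent=None):
        self.name = name            # e.g. "FilteredRDD"
        self._compute = compute     # zero-argument function returning an iterator
        self.parent = parent        # previous dataset in the lineage chain
        self._want_cache = False    # set by cache(); data is stored on first use
        self._cached = None

    @classmethod
    def from_lines(cls, lines):
        # Stands in for reading a file from HDFS.
        return cls("HadoopRDD", lambda: iter(lines))

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._want_cache:
            self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def filter(self, func):
        return MiniRDD("FilteredRDD",
                       lambda: (x for x in self._iterate() if func(x)), self)

    def map(self, func):
        return MiniRDD("MappedRDD",
                       lambda: (func(x) for x in self._iterate()), self)

    def cache(self):
        self._want_cache = True
        return self

    def count(self):
        # The only "action" here: it forces the chain to be evaluated.
        return sum(1 for _ in self._iterate())

    def lineage(self):
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " <- ".join(names)


log = [
    "ERROR\tdisk\tfoo failed",
    "INFO\tnet\tall good",
    "ERROR\tnet\tbar timed out",
]

split_calls = []  # records every line the map function actually processes


def third_field(line):
    split_calls.append(line)
    return line.split("\t")[2]


messages = (MiniRDD.from_lines(log)
            .filter(lambda line: line.startswith("ERROR"))
            .map(third_field))
cached_msgs = messages.cache()

foo_count = cached_msgs.filter(lambda m: "foo" in m).count()
bar_count = cached_msgs.filter(lambda m: "bar" in m).count()
print(messages.lineage(), foo_count, bar_count, len(split_calls))
```

The second `count()` reads from the cached list, so `third_field` runs only once per matching line. In the real system the same lineage record is what lets a lost partition be recomputed, which this single-process toy does not attempt to model.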
  • 22. Load data in memory once: map(readPoint), cache. Initial parameter vector: w. Repeated MapReduce steps to do gradient descent: map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x), then reduce(_ + _).
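
The map and reduce steps on slide 22 can be sketched without a cluster. The following is a minimal single-machine version in plain Python, assuming labels `y` in {-1, +1} and a step size of 1, as the slide's expression implies. The four sample points, the seed, and the helper names `dot`, `point_gradient`, and `train` are made up for illustration.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, x, y):
    """Per-point term from the slide: (1 / (1 + exp(-y * (w . x))) - 1) * y * x."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, dims, iterations, seed=0):
    rng = random.Random(seed)
    # Initial parameter vector, chosen at random.
    w = [rng.uniform(-1.0, 1.0) for _ in range(dims)]
    for _ in range(iterations):
        # "map": one gradient term per data point.
        terms = [point_gradient(w, x, y) for x, y in points]
        # "reduce": element-wise sum of all terms.
        gradient = [sum(column) for column in zip(*terms)]
        w = [wi - gi for wi, gi in zip(w, gradient)]
    return w


# Hypothetical, linearly separable sample data: (features, label).
points = [
    ([1.0, 2.0], 1),
    ([2.0, 1.5], 1),
    ([-1.0, -2.0], -1),
    ([-2.0, -1.0], -1),
]

w = train(points, dims=2, iterations=20)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
print(w, predictions)
```

Every iteration rereads the full `points` list, which is why keeping the parsed data in memory pays off: only the first pass has to load and parse it.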
  • 23. [Chart: running time (min) vs. number of iterations (1 to 30), Hadoop vs. Spark.] Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.
  • 24. Java API (out now):
        JavaRDD<String> lines = sc.textFile(...);
        lines.filter(new Function<String, Boolean>() {
          public Boolean call(String s) { return s.contains("error"); }
        }).count();
    PySpark (coming soon):
        lines = sc.textFile(...)
        lines.filter(lambda x: 'error' in x).count()
  • 26. [Bar chart, time in hours, axis 0 to 20.] Hive: 20. Spark: 0.5.
  • 28. [Architecture diagram] Client: CLI, JDBC. Driver: SQL Parser, Query Optimizer, Physical Plan, Execution. Meta store. Runs on MapReduce and HDFS.
  • 29. [Architecture diagram] Client: CLI, JDBC. Driver: Cache Mgr., SQL Parser, Query Optimizer, Physical Plan, Execution. Meta store. Runs on Spark and HDFS.
  • 30. Row Storage: (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4). Column Storage: (1, 2, 3), (john, mike, sally), (4.1, 3.5, 6.4).
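
The two layouts on slide 30 can be written out directly. This is a small plain-Python sketch using the slide's three records. The column names `id`, `name`, and `score` are assumptions, since the slide shows no headers.

```python
# Row storage: one tuple per record, as on the left of the slide.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one list per column, as on the right of the slide.
ids, names, scores = (list(column) for column in zip(*rows))
columns = {"id": ids, "name": names, "score": scores}

# A query that needs a single column scans only that list.
average_score = sum(columns["score"]) / len(columns["score"])

# A full record is rebuilt by reading the same position in every column.
second_row = tuple(columns[key][1] for key in ("id", "name", "score"))

print(columns)
print(average_score, second_row)
```

Each column list holds values of a single type, which is what makes a columnar layout compact to store and cheap to scan when a query touches only a few columns.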
  • 33. [Bar chart: Selection. Series: Shark, Shark (disk), Hive. Axis 0 to 100. Labeled value: 1.1.] 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).
  • 34. [Bar chart: Group By. Series: Shark, Shark (disk), Hive. Axis 0 to 600. Labeled value: 32.] 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).
  • 35. [Bar chart: Join. Series: Shark (copartitioned), Shark, Shark (disk), Hive. Axis 0 to 1800. Labeled value: 105.] 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).
  • 36. [Bar charts: Query 1, Query 2, Query 3. Series: Shark, Shark (disk), Hive. Labeled values: 0.8, 0.7, 1.0.] 100 m2.4xlarge nodes, 1.7 TB Conviva dataset.
  • 38. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.