Putting Analytics in
 Big Data Analytics
Matt Casters, Chief of Data Integration
         Pentaho Corporation

        PLUG – Feb 17th, 2011




        © 2010, Pentaho. All Rights Reserved. www.pentaho.com.
Big Data




              Terabytes and petabytes of data
                    Sometimes per day



© 2010, Pentaho. All Rights Reserved. www.pentaho.com.   US and Worldwide: +1 (866) 660-7555
Example Use Cases Today
             Transactional
             •Fraud detection
             •Financial services/stock markets
             Sub-Transactional
             •Weblogs
             •Social/online media
             •Telecoms events


Example Use Cases Today
             Non-Transactional
             •Web pages, blogs etc
             •Documents
             •Physical events
             •Application events
             •Machine events

             In most cases structured or semi-structured

Data Lake
             • Single source
             • Large volume
             • Not distilled




Data Lakes
             • 0-2 lakes per company
             • Known and unknown questions
             • Multiple user communities
             • $1-10k questions, not $1m ones
             • Don’t fit in a traditional RDBMS at a
               reasonable cost



Data Lake Requirements
             • Store all the data
             • Satisfy routine reporting and analysis
             • Satisfy ad-hoc query / analysis / reporting
             • Balance performance and cost




Traditional BI
             [Diagram labels: Data Source; Data Mart(s); Tape/Trash; ? ? ?]

What if...
             [Diagram labels: Data Source; Data Lake(s); Data Mart(s); Ad-Hoc; Data Warehouse]

Big Data Does Not Replace Data Marts
             • It’s not a database
             • High latency
             • Optimized for massive data-crunching
             • Databases are immature
             • Databases are NoSQL




Big Data



                                                   Map/Reduce
                                                       and
                                                     Hadoop



What is Map/Reduce
• Obligatory Wikipedia quote: “... is a patented software
  framework introduced by Google to support
  distributed computing on large data sets on clusters
  of computers”
• Invented by Google to index “The Internet”
• Apache Hadoop is an Open Source implementation of the
  Map/Reduce algorithm
• Scalable & fault-tolerant, not efficient!
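
To make the map, shuffle, and reduce phases concrete, here is a minimal in-memory sketch in plain Java. It does not use Hadoop at all; the class name `MapReduceSketch`, its method names, and the word-count problem are assumptions chosen for illustration, not something taken from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the Map/Reduce idea (no Hadoop involved).
class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values of one key into a single sum.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases one after another over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big lake"))); // prints {big=2, data=1, lake=1}
    }
}
```

Each call to `map` and `reduce` depends only on its own input, which is what lets a real cluster run (or re-run) those calls on any node.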
What Hadoop Really Is
• Core components
   • HDFS – a distributed file system allowing massive storage across a cluster of
     commodity servers
   • Map-Reduce
      • Framework for distributed computation, common use cases include
        aggregating, sorting, and filtering BIG data sets
      • Problem is broken up into small fragments of work that can be computed or
        recomputed in isolation on any node of the cluster
• Related Projects
   • Hive – a data warehouse infrastructure on top of Hadoop
      • Implements a SQL-like query language, including a JDBC driver
      • Allows MapReduce developers to plug in custom mappers and reducers
   • HBase – the Hadoop database – AH HA!
      • A variant of NoSQL databases, problematic for traditional BI
      • Best at storing large amounts of unstructured data
No seriously, what is Hadoop?
          Java software framework that supports
            data-intensive distributed applications
          • Apache project
          • Created by Doug Cutting and Mike Cafarella, developed largely at Yahoo; based on Google’s ideas
          • Distributed filesystem + MapReduce engine
          • Commodity hardware
          • Scales out beyond the technology and/or
            economics of an RDBMS

Hadoop and BI?
              • Distributed processing
              • Distributed file system
              • Commodity hardware
              • Platform independent (in theory)
              • Scales out beyond the technology and/or
                     economics of an RDBMS

              In many cases it’s the only viable solution

Hadoop and BI?


              90% of new Hadoop use cases
                  are transformation of
                  (semi-)structured data*


              * of those companies we’ve talked to...

Hadoop and BI?




                                        “The working conditions
                                        within Hadoop are shocking”


                                     ETL Developer




Hadoop and BI?
              Instead of this...




Hadoop and BI?
              You have to do this in Java...
              public void map(
                              Text key,
                              Text value,
                              OutputCollector output,
                              Reporter reporter)

              public void reduce(
                              Text key,
                              Iterator values,
                              OutputCollector output,
                              Reporter reporter)


People don’t use Hadoop for BI because they want to...

...they do it because they have to...

... and unfortunately it wasn’t designed for most BI requirements

Why not add to Hadoop the things it’s missing...

... until it can do what we need it to?

If only we had a Java, embeddable, data transformation engine...

Pentaho Data Integration
             [Diagram labels: Hadoop; Pentaho Data Integration (shown three times); Data Marts, Data Warehouse, Analytical Applications; Design, Deploy, Orchestrate]

             [Layered diagram, top to bottom]
             Visualize: Reporting / Dashboards / Analysis (Web Tier)
             Optimize:  DM & DW (RDBMS); Hive and Files / HDFS (Hadoop)
             Load:      Applications & Systems

             [Layered diagram, top to bottom]
             Reporting / Dashboards / Analysis (Web Tier)
             DM (RDBMS)
             Hive and HDFS (Hadoop)




30000ft View

             [Diagram labels: Host Machine; pentaho-hadoop-vm; Hadoop; HDFS; Hive; Tasks and Jobs; PDI Client]




Inside the VM

             [Diagram labels: pentaho-hadoop-vm; Hadoop; HDFS; Hive; Job; Mapper; Reducer]




Inside a job
             [Diagram: a Job with a Mapper and a Reducer (* between them); under each: Java Application, Scripting]




          * Combiner can be used to pre-reduce in memory on the mappers before data is transmitted.
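
The effect of a combiner can be sketched in plain Java without Hadoop. The class `CombinerSketch` and its methods are hypothetical names used only for this illustration: the combiner sums the (word, 1) pairs of one mapper in memory, so fewer pairs need to be transmitted, while the reducer's final totals stay the same.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a combiner; this is not Hadoop's API.
class CombinerSketch {

    // Mapper output without a combiner: one (word, 1) pair per occurrence.
    static List<Map.Entry<String, Integer>> mapOnly(List<String> words) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : words) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Combiner: pre-reduce one mapper's pairs in memory, one pair per distinct key.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            partial.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(partial.entrySet());
    }

    // Reducer: sum whatever arrives from all mappers, combined or not.
    static Map<String, Integer> reduce(List<List<Map.Entry<String, Integer>>> perMapper) {
        Map<String, Integer> totals = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> pairs : perMapper) {
            for (Map.Entry<String, Integer> pair : pairs) {
                totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = mapOnly(List.of("a", "b", "a", "a"));
        List<Map.Entry<String, Integer>> combined = combine(raw);
        System.out.println(raw.size() + " pairs without combiner, " + combined.size() + " with");
        System.out.println(reduce(List.of(combined)));
    }
}
```

This only works because summing is associative and commutative; a reduce function without that property cannot simply be reused as a combiner.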


Inside a job with PDI
             [Diagram: a Job with a Mapper and a Reducer; under each: PDI Execution Engine, Transformation, Step / Step / Step]




Demo




The Single Threaded Transformation Engine

          • Designed to use a single thread
          • Processes rows per batch because Hadoop
            delivers rows in batches
          • Knows when the batch of rows is processed
          • Is only initialized once and disposed of once
          • Has reduced overhead for data passing
            between steps
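
The properties listed above can be illustrated with a small plain-Java sketch. This is not PDI's engine or API: the class `SingleThreadedEngineSketch`, its methods, and the use of `String` rows are all assumptions made for the example. It is initialized once, pushes each batch of rows through every step on the calling thread (a direct method call instead of a hand-off between threads), knows when a batch is done, and is disposed of once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a single-threaded, batch-oriented step pipeline (not PDI code).
class SingleThreadedEngineSketch {
    private final List<UnaryOperator<String>> steps;
    private boolean initialized = false;
    int batchesProcessed = 0;

    SingleThreadedEngineSketch(List<UnaryOperator<String>> steps) {
        this.steps = steps;
    }

    // Called once, before the first batch.
    void init() {
        initialized = true;
    }

    // Runs every row of the batch through all steps on the calling thread.
    List<String> processBatch(List<String> rows) {
        if (!initialized) {
            throw new IllegalStateException("call init() first");
        }
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String current = row;
            for (UnaryOperator<String> step : steps) {
                current = step.apply(current); // row is handed to the next step directly
            }
            out.add(current);
        }
        batchesProcessed++; // the engine knows the batch is complete at this point
        return out;
    }

    // Called once, after the last batch.
    void dispose() {
        initialized = false;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> steps = List.of(s -> s.trim(), s -> s.toUpperCase());
        SingleThreadedEngineSketch engine = new SingleThreadedEngineSketch(steps);
        engine.init();
        System.out.println(engine.processBatch(List.of(" a ", "b")));
        System.out.println(engine.processBatch(List.of("c")));
        engine.dispose();
    }
}
```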


The Single Threaded Transformation Engine

          • Is no longer used inside of Hadoop thanks
            to new developments. “The multi-threaded
            engine is still faster” they said.
          • Is being introduced into PDI 4.2.0 (CE)
          • You will be able to specify a mapping to run
            single threaded
          • Allows you to reduce context switching in
            large to huge transformations (lots of steps)

Pentaho for Hadoop Resources


              • Download: www.pentaho.com/download/hadoop
              • Pentaho for Hadoop webpage (resources, press,
                events, partnerships and more):
                www.pentaho.com/hadoop
              • Big Data Analytics: 5 part video series with James
                Dixon, Pentaho CTO




                  Or contact me : mcasters at pentaho dot org

Thank You.

           Join the conversation. You can find us on:

                       http://blog.pentaho.com

                       @Pentaho

                        Pentaho Facebook Group

                         Pentaho - Open Source Business Intelligence Group





Progressive f# tutorials nyc dmitry mozorov & jack pappas on code quotations ...
 
Cukeup nyc ian dees on elixir, erlang, and cucumberl
Cukeup nyc ian dees on elixir, erlang, and cucumberlCukeup nyc ian dees on elixir, erlang, and cucumberl
Cukeup nyc ian dees on elixir, erlang, and cucumberl
 
Cukeup nyc peter bell on getting started with cucumber.js
Cukeup nyc peter bell on getting started with cucumber.jsCukeup nyc peter bell on getting started with cucumber.js
Cukeup nyc peter bell on getting started with cucumber.js
 
Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...
Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...
Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...
 
Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...
Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...
Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...
 
Progressive f# tutorials nyc don syme on keynote f# in the open source world
Progressive f# tutorials nyc don syme on keynote f# in the open source worldProgressive f# tutorials nyc don syme on keynote f# in the open source world
Progressive f# tutorials nyc don syme on keynote f# in the open source world
 
Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...
Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...
Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...
 
Dmitry mozorov on code quotations code as-data for f#
Dmitry mozorov on code quotations code as-data for f#Dmitry mozorov on code quotations code as-data for f#
Dmitry mozorov on code quotations code as-data for f#
 
A poet's guide_to_acceptance_testing
A poet's guide_to_acceptance_testingA poet's guide_to_acceptance_testing
A poet's guide_to_acceptance_testing
 
Russ miles-cloudfoundry-deep-dive
Russ miles-cloudfoundry-deep-diveRuss miles-cloudfoundry-deep-dive
Russ miles-cloudfoundry-deep-dive
 
Serendipity-neo4j
Serendipity-neo4jSerendipity-neo4j
Serendipity-neo4j
 
Simon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelismSimon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelism
 
Lug presentation
Lug presentationLug presentation
Lug presentation
 
I went to_a_communications_workshop_and_they_t
I went to_a_communications_workshop_and_they_tI went to_a_communications_workshop_and_they_t
I went to_a_communications_workshop_and_they_t
 
Plug saiku
Plug   saikuPlug   saiku
Plug saiku
 
Huguk lily
Huguk lilyHuguk lily
Huguk lily
 

Plug 20110217

  • 1. Putting Analytics in Big Data Analytics Matt Casters, Chief of Data Integration Pentaho Corporation PLUG – Feb 17th, 2011 © 2010, Pentaho. All Rights Reserved. www.pentaho.com.
  • 2. Big Data: terabytes and petabytes of data, sometimes per day
  • 3. Example Use Cases Today
      Transactional
      • Fraud detection
      • Financial services/stock markets
      Sub-Transactional
      • Weblogs
      • Social/online media
      • Telecoms events
  • 4. Example Use Cases Today
      Non-Transactional
      • Web pages, blogs etc
      • Documents
      • Physical events
      • Application events
      • Machine events
      In most cases structured or semi-structured
  • 5. Data Lake
      • Single source
      • Large volume
      • Not distilled
  • 6. Data Lakes
      • 0-2 lakes per company
      • Known and unknown questions
      • Multiple user communities
      • $1-10k questions, not $1m ones
      • Don’t fit in traditional RDBMS with a reasonable cost
  • 7. Data Lake Requirements
      • Store all the data
      • Satisfy routine reporting and analysis
      • Satisfy ad-hoc query / analysis / reporting
      • Balance performance and cost
  • 8. Traditional BI [Diagram labels: Data Source; Data Mart(s); Tape/Trash; several question marks]
  • 9. What if... [Diagram labels: Data Source; Data Lake(s); Data Warehouse; Data Mart(s); Ad-Hoc]
  • 10. Big Data Does Not Replace Data Marts
      • It’s not a database
      • High latency
      • Optimized for massive data-crunching
      • Databases are immature
      • Databases are no-SQL
  • 11. Big Data: Map/Reduce and Hadoop
  • 12. What is Map/Reduce
      • Obligatory Wikipedia quote: “... is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers”
      • Invented by Google to index “The Internet”
      • Apache Hadoop is an Open Source implementation of the Map/Reduce algorithm
      • Scalable & fault-tolerant, not efficient!
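
The map/reduce idea on slide 12 can be illustrated without Hadoop at all. The following is a minimal in-memory sketch, assuming a word-count job. The class and method names (`MapReduceSketch`, `map`, `shuffle`, `reduce`, `wordCount`) are hypothetical and not part of Hadoop; a real cluster would run the map and reduce calls on different nodes rather than in one loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map -> shuffle -> reduce phases.
// No Hadoop classes are used; every name here is hypothetical.
final class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence on the calling thread.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line)); // on a cluster, each line could be mapped on a different node
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "big lake")));
    }
}
```

Because each `map` call looks at one line only and each `reduce` call looks at one key only, the work can be split up and re-run piece by piece, which is where the scalability and fault tolerance come from.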
  • 13. What Hadoop Really Is
      • Core components
          • HDFS – a distributed file system allowing massive storage across a cluster of commodity servers
          • Map-Reduce
              • Framework for distributed computation; common use cases include aggregating, sorting, and filtering BIG data sets
              • Problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
      • Related Projects
          • Hive – a data warehouse infrastructure on top of Hadoop
              • Implements a SQL-like query language, including a JDBC driver
              • Allows MapReduce developers to plug in custom mappers and reducers
          • HBase – the Hadoop database – AH HA!
              • A variant of NoSQL databases, problematic for traditional BI
              • Best at storing large amounts of unstructured data
  • 14. No seriously, what is Hadoop?
      Java software framework that supports data-intensive distributed applications
      • Apache project
      • Created by Yahoo, Google’s idea
      • Distributed filesystem + MapReduce engine
      • Commodity hardware
      • Scales out beyond technology and/or economy of RDBMS
  • 15. Hadoop and BI?
      • Distributed processing
      • Distributed file system
      • Commodity hardware
      • Platform independent (in theory)
      • Scales out beyond technology and/or economy of a RDBMS
      In many cases it’s the only viable solution
  • 16. Hadoop and BI? 90% of new Hadoop use cases are transformation of semi/structured data* (* of those companies we’ve talked to...)
  • 17. Hadoop and BI? “The working conditions within Hadoop are shocking” (ETL Developer)
  • 18. Hadoop and BI? Instead of this...
  • 19. Hadoop and BI? You have to do this in Java...
      public void map(
          Text key,
          Text value,
          OutputCollector output,
          Reporter reporter)
      public void reduce(
          Text key,
          Iterator values,
          OutputCollector output,
          Reporter reporter)
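
To give a feel for what filling in those two signatures involves, here is a hedged sketch that sums bytes per host from log lines of the assumed form `host bytes`. So that it runs without a Hadoop installation, `OutputCollector` and `Reporter` are small stand-in interfaces declared inside the example, and `String` replaces `Text`; they only mimic the shape of the types named on the slide and are not the Hadoop classes. The job name `BytesPerHostJob` and the local driver `runLocally` are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BytesPerHostJob {

    // Stand-ins shaped like the types on the slide (assumption: NOT the real Hadoop interfaces).
    interface OutputCollector<K, V> { void collect(K key, V value); }
    interface Reporter { void progress(); }

    // Mapper: the value is one log line "host bytes"; emit (host, bytes).
    public void map(String key, String value,
                    OutputCollector<String, Long> output, Reporter reporter) {
        String[] fields = value.trim().split("\\s+");
        if (fields.length == 2) {
            output.collect(fields[0], Long.parseLong(fields[1]));
        }
        reporter.progress();
    }

    // Reducer: sum every byte count seen for one host.
    public void reduce(String key, Iterator<Long> values,
                       OutputCollector<String, Long> output, Reporter reporter) {
        long total = 0;
        while (values.hasNext()) {
            total += values.next();
        }
        output.collect(key, total);
        reporter.progress();
    }

    // Tiny local driver playing the role of the framework: map, group by key, reduce.
    static Map<String, Long> runLocally(List<String> lines) {
        BytesPerHostJob job = new BytesPerHostJob();
        Reporter reporter = () -> { };
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            job.map("", line,
                    (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v),
                    reporter);
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            job.reduce(e.getKey(), e.getValue().iterator(), result::put, reporter);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(runLocally(List.of("a.example 100", "b.example 40", "a.example 25")));
    }
}
```

Even for a one-line aggregation, the parsing, typing, and plumbing all have to be written by hand, which is the point the slide is making.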
  • 20. People don’t use Hadoop for BI because they want to...
  • 21. ...they do it because they have to...
  • 22. ... and unfortunately it wasn’t designed for most BI requirements
  • 23. Why not add to Hadoop the things it’s missing...
  • 24. ... until it can do what we need it to?
  • 25. If only we had a Java, embeddable, data transformation engine...
  • 26. Pentaho Data Integration [Diagram labels: Pentaho Data Integration; Design; Deploy; Orchestrate; Hadoop; Data Marts, Data Warehouse, Analytical Applications]
  • 27. [Diagram labels: Visualize; Optimize; Load; Reporting / Dashboards / Analysis; Web Tier; DM & DW; RDBMS; Hive; Hadoop; Files / HDFS; Applications & Systems]
  • 28. [Diagram labels: Reporting / Dashboards / Analysis; Web Tier; DM; RDBMS; Hive; Hadoop; HDFS]
  • 29. 30000ft View [Diagram labels: Host Machine; PDI Client; pentaho-hadoop-vm; Hadoop; HDFS; Hive; Tasks and Jobs]
  • 30. Inside the VM [Diagram labels: pentaho-hadoop-vm; Hadoop; HDFS; Hive; Job; Mapper; Reducer]
  • 31. Inside a job [Diagram labels: Job; Mapper (Java Application, Scripting); Reducer* (Java Application, Scripting)]
      * Combiner can be used to pre-reduce in memory on the mappers before data is transmitted.
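
The combiner footnote can be made concrete with a small sketch. Assuming a counting job, the combiner pre-sums records on each mapper so that fewer records have to cross the network, while the reducer's final answer stays the same. All names (`CombinerSketch`, `mapOnly`, `combine`, `reduce`) are hypothetical and plain Java collections stand in for the framework. This shortcut is only safe because addition is associative and commutative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CombinerSketch {

    // Mapper output without a combiner: one (word, 1) record per occurrence.
    static List<Map.Entry<String, Integer>> mapOnly(List<String> words) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : words) {
            out.add(Map.entry(w, 1));
        }
        return out;
    }

    // Combiner: pre-sum in memory on the mapper, so at most one record
    // per distinct word leaves the node.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapped) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            partial.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return new ArrayList<>(partial.entrySet());
    }

    // Reducer: final sum over whatever records arrived from all mappers.
    static Map<String, Integer> reduce(List<List<Map.Entry<String, Integer>>> fromMappers) {
        Map<String, Integer> totals = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> records : fromMappers) {
            for (Map.Entry<String, Integer> e : records) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> a = mapOnly(List.of("error", "error", "ok", "error"));
        List<Map.Entry<String, Integer>> b = mapOnly(List.of("ok", "ok"));
        int plain = a.size() + b.size();
        int combined = combine(a).size() + combine(b).size();
        System.out.println("records sent without combiner: " + plain);
        System.out.println("records sent with combiner:    " + combined);
        System.out.println(reduce(List.of(combine(a), combine(b))));
    }
}
```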
  • 32. Inside a job with PDI [Diagram labels: Job; Mapper and Reducer, each with a PDI Execution Engine running a Transformation made of Steps]
  • 33. Demo
  • 34. The Single Threaded Transformation Engine
      • Designed to use a single thread
      • Processes rows per batch because Hadoop delivers rows in batches
      • Knows when the batch of rows is processed
      • Is only initialized once and disposed of once
      • Has reduced overhead for data passing between steps
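
The properties listed on slide 34 can be sketched as a toy pipeline. This is an illustration of the general idea only, not the PDI engine or its API: the class `SingleThreadedPipeline`, its counters, and the `trimAndUppercase` factory are hypothetical, and rows are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical single-threaded pipeline; this is not the PDI API.
final class SingleThreadedPipeline {
    private final List<UnaryOperator<String>> steps;
    int initCount = 0;
    int disposeCount = 0;
    int batchesProcessed = 0;

    SingleThreadedPipeline(List<UnaryOperator<String>> steps) {
        this.steps = steps;
    }

    // Example pipeline with two steps: trim the row, then upper-case it.
    static SingleThreadedPipeline trimAndUppercase() {
        List<UnaryOperator<String>> steps = new ArrayList<>();
        steps.add(row -> row.trim());
        steps.add(row -> row.toUpperCase());
        return new SingleThreadedPipeline(steps);
    }

    // Called once, before the first batch.
    void init() {
        initCount++;
    }

    // Runs every row of the batch through all steps on the calling thread.
    // Rows move between steps by plain method calls: no queues, no thread hand-offs.
    List<String> processBatch(List<String> batch) {
        List<String> out = new ArrayList<>(batch.size());
        for (String row : batch) {
            String current = row;
            for (UnaryOperator<String> step : steps) {
                current = step.apply(current);
            }
            out.add(current);
        }
        batchesProcessed++; // the pipeline knows exactly when a batch is finished
        return out;
    }

    // Called once, after the last batch.
    void dispose() {
        disposeCount++;
    }

    public static void main(String[] args) {
        SingleThreadedPipeline p = trimAndUppercase();
        p.init();
        System.out.println(p.processBatch(List.of(" a ", "b")));
        System.out.println(p.processBatch(List.of("c")));
        p.dispose();
    }
}
```

The same pipeline instance handles every batch, so setup and teardown cost is paid once rather than per batch.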
  • 35. The Single Threaded Transformation Engine
      • Is no longer used inside of Hadoop thanks to new developments. “The multi-threaded engine is still faster” they said.
      • Is being introduced into PDI 4.2.0 (CE)
      • You will be able to specify a mapping to run single threaded
      • Allows you to reduce context switching in large to huge transformations (lots of steps)
  • 36. Pentaho for Hadoop Resources
      Download: www.pentaho.com/download/hadoop
      Pentaho for Hadoop webpage (resources, press, events, partnerships and more): www.pentaho.com/hadoop
      Big Data Analytics: 5 part video series with James Dixon, Pentaho CTO
      Or contact me: mcasters at pentaho dot org
  • 37. Thank You. Join the conversation. You can find us on: http://blog.pentaho.com, @Pentaho, Pentaho Facebook Group, Pentaho - Open Source Business Intelligence Group