Krishnan Parasuraman       Greg Rokita
Netezza                    Edmunds.com




  Building Scalable Data Platforms
 Hadoop and Netezza Deployment Models
Talking Points
• Building scalable data platforms
  – Architectural considerations

• Hadoop and Massively Parallel Databases
  – Similarities and differences
  – Usage patterns


• Practitioner’s View Point
  – Edmunds.com data warehouse platform


   2                      Hadoop World 2011
Building scalable data platforms
Typical Digital Media Information Processing Pipeline


[Figure: pipeline diagram]

Inputs: Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations

Real Time Decision Engine
  • Display Ads
  • Recommendation
  • Personalized Content

Data Processing
  • Correlate
  • Structure
  • Consolidate

Reporting
  • Aggregate
  • Summarize
  • Ad-hoc analysis

Also shown at the right of the diagram
  • Scoring
  • Yield optimization
  • Audience Analytics
Building scalable data platforms
[Figure: the same inputs (Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations) and the same three stages (Real Time Decision Engine, Data Processing, Reporting), all resting on a shared layer labeled DATA PLATFORM]
Building scalable data platforms

[Table: requirements per pipeline stage, all resting on the DATA PLATFORM layer. The Data Processing stage spans two columns.]

             Real Time             Data Processing       Data Processing       Reporting
             Decision Engine       (column 1)            (column 2)

Workloads    • Real Time           • High Velocity       • Compute intensive   • Cached Queries
             • High Concurrency    • Linearly Scalable   • Full table scans    • Low Latency
             • Transactional       • Disk bound          • Disk bound          • High Concurrency
             • High Throughput

Data         • Structured          • Structured          • Mostly Structured   • Structured
             • Un-Structured       • Un-Structured       • Some unstructured   • Relational
             • Key-Value pairs     • Machine Gen.

Capability   • Stream Processing   • Low Disk I/O        • In-DB computation   • OLAP
             • Memory resident     • Fast Processing     • SQL and MR          • Columnar
             • Key based lookups   • Low Cost/TB         • Analytic Libraries
Building scalable data platforms

[Figure: the workload grid from the previous slide (Workloads, Data and Capability for the Real Time Decision Engine, Data Processing and Reporting stages), with candidate technologies overlaid on it]

  • NoSQL Databases
  • Graph DB
  • Hadoop
  • Massively Parallel DB
  • Plain Ole’ DB on steroids
  • In-Memory DB
Myth: A single technology will meet all the considerations for
  our scalable data platform needs
               Best Practices


Workloads scale differently – Monolithic architectures don’t work

Minimize components – Data movement is painful

Understand tradeoffs – Performance vs. Price vs. Effort

Start with the core architecture and work in the edge cases



Massively parallel data warehouses
SQL and MR

[Figure: architecture layers, top to bottom]
  • Hosts – host controllers
  • Network fabric
  • FPGA + CPU + Memory, repeated per node – massively parallel compute nodes
  • Distributed storage
Hadoop
Map Reduce

[Figure: architecture layers, top to bottom]
  • Job Tracker + Name Node – master node
  • Network fabric
  • Task Tracker + Data Node, repeated per node – parallel compute nodes
  • Distributed storage
There are striking similarities….
[Figure: the Hadoop Map Reduce cluster (Job Tracker, Name Node, Task Trackers, Data Nodes)]

  • Massive parallelism
  • Execute code & algorithms next to data
  • Scalable
  • Highly Available
  • Map Reduce
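To make "execute code & algorithms next to data" concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps. It is an illustration only: the partition contents and the function names are assumptions made for this example, and a real Hadoop job would run each map task on the node that holds its block of data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(partition):
    """Runs once per partition, i.e. where that slice of the data lives."""
    return [(event, 1) for event in partition]


def shuffle(mapped_partitions):
    """Groups the intermediate (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_partitions):
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapses each key's values into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical event data, split across three "nodes".
partitions = [
    ["click", "visit", "click"],
    ["like", "click"],
    ["visit"],
]

counts = reduce_phase(shuffle(map_phase(p) for p in partitions))
print(counts)
```

Only the small intermediate pairs cross partition boundaries in the shuffle; the raw events never leave the partition that holds them.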
But also key differences
[Figure: the Hadoop Map Reduce cluster (Job Tracker, Name Node, Task Trackers, Data Nodes)]

Hadoop
  • Schema on Read – Data loading is fast
  • Batch Mode data access
  • Lower cost of data storage
  • Process unstructured data
  • Data Loading = File copy ("Look Ma, No ETL")

Netezza
  • Optimized for Performance
  • Real time access, random reads, query optimizer, co-located joins
  • Hardware Accelerated queries
  • SQL and Map Reduce
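A small sketch of what "schema on read" means in practice: loading is just appending the raw line, and structure is applied only when the data is read. The in-memory list standing in for file storage, the tab-separated layout and the field names are all assumptions made for this illustration.

```python
raw_store = []  # stands in for files copied into the cluster unchanged


def load(line):
    """Loading is a plain copy: no parsing, no validation, no ETL."""
    raw_store.append(line)


def read_with_schema(fields, sep="\t"):
    """Applies a schema at read time; fields beyond the schema are ignored."""
    for line in raw_store:
        yield dict(zip(fields, line.rstrip("\n").split(sep)))


load("2011-11-08\t/new-cars\t200\n")
load("2011-11-08\t/used-cars\t404\n")

# The same stored bytes can be read under different schemas later on.
first_view = list(read_with_schema(["date", "path"]))
later_view = list(read_with_schema(["date", "path", "status"]))
print(first_view)
print(later_view)
```

The trade-off listed on the slide follows from this: the load is cheap, but every read pays the parsing cost, which suits batch access better than low-latency queries.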
These differences lead to opportunities for co-existence
for Hadoop in a Netezza environment
1. Scalable ETL engine
  – Complex data
  – Relationships not defined
  – Evolving schema
2. Queryable Archive
  – Moving computation is cheaper than moving data
3. Analytics sandbox
  – Exploratory analysis

Netezza-Hadoop: Deployment Patterns

[Figure: each data type passes through two processing steps, shown left to right]

Data type               Left step                                       Right step
unstructured data       Create context (classification, text mining)   Analyze
semi-structured data    Parse, aggregate                                Analyze, report
structured data         Analyze, report                                 Active archival, long running queries
Pattern 1: Data Processing Engine (ETL)

[Figure: Raw Weblogs enter a Hadoop Cluster (one NameNode/JobTracker and three DataNode/TaskTracker nodes), shown next to the Netezza Environment]
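As a rough illustration of the ETL role in this pattern, the sketch below parses raw weblog lines, drops what cannot be parsed, aggregates page views, and emits delimited rows of the kind a warehouse bulk loader could ingest. The log layout (Common Log Format), the "only count status 200" rule and the pipe-delimited output are assumptions for this example; the slides do not specify them.

```python
import re
from collections import Counter

# Assumed input layout: Common Log Format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Returns the named fields of one log line, or None if it is malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def page_views(lines):
    """Counts successful requests per path, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None and record["status"] == "200":
            counts[record["path"]] += 1
    return counts


def to_load_rows(counts, sep="|"):
    """Formats the aggregate as delimited rows, sorted by path."""
    return [f"{path}{sep}{views}" for path, views in sorted(counts.items())]


# Hypothetical sample input.
raw_weblogs = [
    '10.0.0.1 - - [08/Nov/2011:10:00:00 -0800] "GET /new-cars HTTP/1.1" 200 512',
    '10.0.0.2 - - [08/Nov/2011:10:00:01 -0800] "GET /new-cars HTTP/1.1" 200 512',
    '10.0.0.3 - - [08/Nov/2011:10:00:02 -0800] "GET /used-cars HTTP/1.1" 404 0',
    '10.0.0.4 - - [08/Nov/2011:10:00:03 -0800] "GET /used-cars HTTP/1.1" 200 734',
    'this line is not a valid log entry',
]

rows = to_load_rows(page_views(raw_weblogs))
print(rows)
```

On a cluster, `parse_line` would correspond to the map step and the per-path count to the reduce step; only the small aggregated rows need to move into the warehouse.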
Pattern 2: Low cost storage and dynamic
provisioning
[Figure: inside the Amazon Cloud, Amazon S3 (step 1) and Elastic MapReduce (step 2); step 3 connects to the Netezza Environment]
Pattern 3: Queryable Archive



[Figure: Data Sources and the Netezza Environment, connected by numbered steps 1, 2 and 3]
About Greg Rokita
 o       Director, Software Architecture at Edmunds, Inc
 o       M.S. in Computer Science, Stanford University
 o       Research interests
            o      Large scale programming paradigms
            o      Domain specific Data Stores
            o      Semi-structured data representation and search
 o       Designs & Implementations of Core Frameworks
            o      Publishing & Messaging infrastructure
            o      Content & Digital Asset Management systems
            o      Reviews & Ratings system
            o      Search APIs
            o      Big Data Analytics


No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds Inc., and any such
disclosure requires the express approval of Edmunds Inc.


   No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the
   Edmunds.com, Inc., and any such disclosure requires the express approval of Edmunds.com, Inc.
Edmunds.com and Scale
 o       Premier online resource for automotive information
         launched in 1995 as the first automotive information
         Web site
 o       15 million unique visitors
 o       210 million page views
 o       1 million+ new inventory items per day
 o       2 TB of new data every month
 o       40 node Hadoop cluster aggregating
         logs, advertising, vehicle, pricing, inventory and other
         data sets

Edmunds Proposition

We have developed an iterative approach to data warehouse development that has dropped the time it takes for us to deliver reports to our users from months to weeks.


How did we do it?


   o           Process
   o           Technology
   o           Understanding of Value



Process: agile approach
   o       Continuous and fast delivery of new features
   o       Collaboration between users and developers
   o       Make new data available quickly and
           inexpensively
   o       Quick problem resolution
   o       No wasting of entire development cycle if data is
           not useful
   o       Encouragement of exploration and creation of
           new applications
Process

Pre-process:
   • Complete
   • Raw
   • Modeled as source data
   • Generically loaded
   • Quick turn-around
   • Low retention
   • Slower performance

Post-process:
   • Filtered
   • Transformed
   • Modeled as star schema
   • Optimized
   • Slow turn-around
   • High retention
   • Fast performance
Post-Process Sandbox
[Figure: flowchart with the following elements]

Steps:
   • Use pre-process data
   • Load data in ad hoc manner
   • Prototype
   • Enhance
   • Change schema (by users or developers)

Decisions:
   • Schema is stable?
   • Data has value?
       – No: Discard
           prevents shadow production
           little effort lost
       – Yes: Develop Optimized Pipeline
           data is confirmed to be useful
           effort is warranted
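The two decisions in the flowchart can be sketched as a small function. This is one possible reading, not the presenters' implementation: the order in which the two questions are asked and the enum names are assumptions, since the arrows of the original diagram are not available in text form.

```python
from enum import Enum


class NextStep(Enum):
    DISCARD = "discard the prototype"
    ENHANCE = "enhance the prototype / change the schema"
    BUILD_PIPELINE = "develop the optimized pipeline"


def next_step(data_has_value: bool, schema_is_stable: bool) -> NextStep:
    """Decides what to do with a sandbox prototype (assumed ordering)."""
    if not data_has_value:
        # Little effort is lost, and no shadow production system grows.
        return NextStep.DISCARD
    if not schema_is_stable:
        # Keep iterating in the sandbox until the schema settles.
        return NextStep.ENHANCE
    # Value is confirmed, so the optimization effort is warranted.
    return NextStep.BUILD_PIPELINE


print(next_step(data_has_value=True, schema_is_stable=False).value)
```

The point of the sandbox is that the expensive branch is reached only after the cheap prototype has shown the data to be useful.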




Technology

Publishing System
  • All Data
  • Generic
  • Thrift IDL with Versioning

Hadoop Stack
  • HBase raw data
  • Oozie job coordinator
  • HDFS storage of pre and optimized data replica of RDBMS in files

Netezza
  • All data loaded from Hadoop in batch
  • Analysis and data exploration - use the speed and power
  • Report generation




Edmunds Publishing System




Generic flow for pre-process

[Figure: flow, top to bottom, with the side label "Generic"]

  1. Producers: Inventory, Pricing, Vehicle, Dealer, Leads
  2. Broker
  3. Consumer
  4. HBase
  5. Map-Reduce
  6. Netezza Action
What architecture enables generic
  consumer?
[Figure: stack of Thrift, Camel and ActiveMQ]

   o   Message
          o   Delivery
          o   Routing
          o   Persistence
          o   Durability
   o   Retries
   o   Throttling
   o   Versioning
   o   Monitoring
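As an illustration of the "Retries" capability, the sketch below wraps a message handler in a bounded retry loop with exponential backoff. It is a generic, framework-free example: the function name, the parameters and the backoff schedule are assumptions, and it does not represent how Camel or ActiveMQ are configured in the system described here.

```python
import time


def deliver_with_retries(handler, message, max_attempts=3,
                         base_delay_s=0.1, sleep=time.sleep):
    """Calls handler(message) until it succeeds or attempts run out.

    Returns True on success and False once max_attempts have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception:
            if attempt < max_attempts:
                # Back off: 0.1 s, 0.2 s, 0.4 s, ... with the default delay.
                sleep(base_delay_s * 2 ** (attempt - 1))
    return False


# Hypothetical handler that fails twice before succeeding.
attempts_seen = []


def flaky_handler(message):
    attempts_seen.append(message)
    if len(attempts_seen) < 3:
        raise RuntimeError("transient failure")


delivered = deliver_with_retries(flaky_handler, "inventory-update",
                                 sleep=lambda seconds: None)
print(delivered, len(attempts_seen))
```

Passing `sleep` as a parameter keeps the example testable without real delays.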

Flexibility for Producers and Consumers:
 Support for Topologies

          Field         Example Values                 Purpose
          Environment   PROD, TEST, DEV                Promotion cycle of deployment units
          Index         Blue, Green, Stage             Environment Index
          Data Center   LAX1, EC2                      The data center where deployment unit is located
          Site          Edmunds, Insideline            Company’s Product
          Application   HBase, Digital Asset Manager   Deployment Unit
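One way to picture the five topology fields is as an immutable record describing a deployment unit. The class name, the attribute names and the dotted `address()` format are hypothetical; the slides list the fields and example values but do not show how they are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentUnit:
    environment: str   # promotion cycle, e.g. PROD, TEST, DEV
    index: str         # environment index, e.g. Blue, Green, Stage
    data_center: str   # e.g. LAX1, EC2
    site: str          # company product, e.g. Edmunds, Insideline
    application: str   # e.g. HBase, Digital Asset Manager

    def address(self) -> str:
        """Hypothetical dotted name built from the five fields."""
        return ".".join([self.environment, self.index, self.data_center,
                         self.site, self.application])


hbase_prod = DeploymentUnit("PROD", "Blue", "LAX1", "Edmunds", "HBase")
print(hbase_prod.address())
```

Because the record is frozen and hashable, it can be used as a dictionary key, for example to look up the queue that belongs to a given deployment unit.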




Producer-Consumer matching
     [Diagram: the Producer's Virtual Topic Name and the Consumer's Queue Name meet at the Broker Destination Interceptor. Result: Match!]

                         Producer (Virtual Topic Name)        Consumer (Queue Name)
                         Publish Inventory                    Publish Inventory
          I am           Prod, Lax, Edmunds, Inventory        Test, EC2, Edmunds, Dealer
          Send To        Prod, Test; Lax, EC2; Edmunds; Dealer
          Receive From                                        Prod; Lax, EC2; Edmunds; Inventory
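
The matching rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute names (`environment`, `data_center`, `site`, `application`), the `Endpoint` class, and the `matches` function are hypothetical and do not come from the Edmunds broker code. The sketch treats a producer and a consumer as matched when each side's "I am" values fall inside the other side's "Send To" or "Receive From" sets.

```python
from dataclasses import dataclass

# Hypothetical attribute names, loosely following the deployment unit table.
ATTRIBUTES = ("environment", "data_center", "site", "application")


@dataclass
class Endpoint:
    identity: dict  # attribute -> single value ("I am")
    peers: dict     # attribute -> set of accepted values ("Send To" / "Receive From")


def accepts(endpoint, other):
    """True when every identity value of `other` is in `endpoint`'s accepted set."""
    return all(other.identity[a] in endpoint.peers[a] for a in ATTRIBUTES)


def matches(producer, consumer):
    """A pair matches only when each side accepts the other."""
    return accepts(producer, consumer) and accepts(consumer, producer)


# Values taken from the slide.
producer = Endpoint(
    identity={"environment": "Prod", "data_center": "Lax",
              "site": "Edmunds", "application": "Inventory"},
    peers={"environment": {"Prod", "Test"}, "data_center": {"Lax", "EC2"},
           "site": {"Edmunds"}, "application": {"Dealer"}},
)
consumer = Endpoint(
    identity={"environment": "Test", "data_center": "EC2",
              "site": "Edmunds", "application": "Dealer"},
    peers={"environment": {"Prod"}, "data_center": {"Lax", "EC2"},
           "site": {"Edmunds"}, "application": {"Inventory"}},
)

print(matches(producer, consumer))
```

With the values from the slide the two endpoints accept each other, which corresponds to the "Match!" label in the diagram.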



HBase: how to handle data generically
      Column Family    Columns                                    Role
      Binary           Serialized Thrift Object                   System of record
                       Hashcode of the Thrift Object              Check if updates are necessary (optimization)
      Discrete         Thrift Object Field 1, Field 2, Field 3    Versioning at the most granular level for lookups
      Type 2           Start Date, End Date, List of fields       Versioning for optimized dimension tables
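
The "check if updates are necessary" optimization in the Binary column family can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: a plain dict stands in for the HBase table, deterministic JSON stands in for Thrift serialization, SHA-256 stands in for the Thrift object's hashcode, and the names `serialize`, `upsert`, `binary:object`, and `binary:hash` are hypothetical.

```python
import hashlib
import json


def serialize(record):
    # Stand-in for Thrift serialization: deterministic JSON bytes.
    return json.dumps(record, sort_keys=True).encode("utf-8")


def upsert(table, row_key, record):
    """Write the record only when its hash differs from the stored hash.

    Returns True when a write happened, False when it was skipped.
    """
    payload = serialize(record)
    digest = hashlib.sha256(payload).hexdigest()
    stored = table.get(row_key)
    if stored is not None and stored["binary:hash"] == digest:
        return False  # unchanged object: skip the write
    table[row_key] = {"binary:object": payload, "binary:hash": digest}
    return True


table = {}  # in-memory stand-in for an HBase table
print(upsert(table, "modelyear-1368", {"year": 1993, "name": "Celica"}))
print(upsert(table, "modelyear-1368", {"year": 1993, "name": "Celica"}))
print(upsert(table, "modelyear-1368", {"year": 1994, "name": "Celica"}))
```

Comparing a stored hash avoids deserializing and diffing the full object on every incoming update, which is presumably why the slide labels the column an optimization.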




Generic Thrift Persistence in HBase
     Column Name                                                                                          Value
     [ModelYear]|F:id|T:long|I:0                                                                          1368
     [ModelYear]|F:midYear|T:boolean|I:1                                                                  false
     [ModelYear]|F:year|T:int|I:2                                                                         1993
     [ModelYear]|F:name|T:java.lang.String|I:4                                                            Celica
     [ModelYear]#[attributss][0]|F:_key|T:java.lang.Long                                                  64
     [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][0]|F:value|T:java.lang.String|I:1      Standard Sport
     [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:value|T:java.lang.String|I:1      V:GT-S 2dr Hatchback
     [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:id|T:long|I:2                     441
     [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][3]|F:value|T:java.lang.String|I:1      V:GT-S
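
The column names above appear to encode a path through a nested object, followed by the field name (`F:`), type (`T:`), and an index (`I:`). The Python sketch below shows one way such names could be generated. It is an assumption-laden illustration, not the Edmunds implementation: nested dicts and lists stand in for Thrift objects, Python type names replace the Java ones, `I:` is assumed to be the field's position within its parent, and the `flatten` function is hypothetical.

```python
def flatten(root_name, record):
    """Flatten nested dicts and lists of dicts into {column_name: value}."""
    columns = {}

    def walk(path, node):
        for index, (field, value) in enumerate(node.items()):
            if isinstance(value, dict):
                # Nested object: extend the path with the field name.
                walk(f"{path}#[{field}]", value)
            elif isinstance(value, list):
                # List of objects: extend the path with field name and position.
                for position, element in enumerate(value):
                    walk(f"{path}#[{field}][{position}]", element)
            else:
                # Scalar field: emit one column per field.
                name = f"{path}|F:{field}|T:{type(value).__name__}|I:{index}"
                columns[name] = value

    walk(f"[{root_name}]", record)
    return columns


model_year = {
    "id": 1368,
    "midYear": False,
    "year": 1993,
    "name": "Celica",
    "styles": [{"id": 441, "value": "Hatchback"}],
}

for column, value in flatten("ModelYear", model_year).items():
    print(column, "=", value)
```

Storing each field in its own column is what allows the Discrete column family to version data at the field level.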



Netezza: Time is Money
          Compared to Oracle                 Business Value

          Up to 12x faster load times        Can reload data more frequently
                                             Failed workflows are no longer a big problem
                                             Helps in transition to a real-time system: we can now create intraday reports for Leads!

          Up to 400x faster query times      More productive Business Intelligence
                                             Queries that could 'never' finish in Oracle are now providing business value




Generic and reusable Oozie actions for Netezza
     [Diagram layers, top to bottom:]
          Oozie Load and Remove Action
          Apache CLI
          Nzload and Nzsql (provisioned on worker nodes using Chef)
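
The layering on this slide (a generic action that parses options and delegates to `nzload` or `nzsql`) can be sketched as follows. This is a hedged illustration only: Python's `argparse` stands in for Apache Commons CLI, the option names (`--host`, `--db`, `--table`, `--file`, `--delim`, `--where`) and the `build_command` function are hypothetical, "remove" is assumed to mean deleting rows that match a predicate, and credentials are assumed to come from the environment. The `nzload` and `nzsql` flags shown should be checked against the Netezza documentation for the installed version. The sketch only builds the command line and does not run it.

```python
import argparse
import re

# Conservative check so a table name cannot smuggle extra SQL.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_parser():
    parser = argparse.ArgumentParser(prog="netezza-action")
    parser.add_argument("--host", required=True)
    parser.add_argument("--db", required=True)
    parser.add_argument("--table", required=True)
    actions = parser.add_subparsers(dest="action", required=True)
    load = actions.add_parser("load")
    load.add_argument("--file", required=True)
    load.add_argument("--delim", default="|")
    remove = actions.add_parser("remove")
    remove.add_argument("--where", required=True)  # trusted predicate, assumed
    return parser


def build_command(argv):
    """Translate action options into an nzload or nzsql argument list."""
    args = build_parser().parse_args(argv)
    if not IDENTIFIER.match(args.table):
        raise ValueError(f"invalid table name: {args.table!r}")
    if args.action == "load":
        return ["nzload", "-host", args.host, "-db", args.db,
                "-t", args.table, "-df", args.file, "-delim", args.delim]
    sql = f"DELETE FROM {args.table} WHERE {args.where}"
    return ["nzsql", "-host", args.host, "-d", args.db, "-c", sql]


print(build_command(["--host", "nz1", "--db", "dw", "--table", "leads",
                     "load", "--file", "/tmp/leads.dat"]))
```

Keeping the command construction separate from execution makes the action easy to test and reuse across workflows, which fits the "generic and reusable" goal in the slide title.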


Value
     o      Data warehouse proves product value both internally and to our customers
     o      Failing fast and quick turnaround let us know when we are building the right reporting and analytical products without a large up-front investment
     o      By combining all data in a single system, we are enabling new products that we previously could not develop


Krishnan Parasuraman       Greg Rokita
@kparasuraman              Edmunds.com




  Building Scalable Data Platforms
 Hadoop and Netezza Deployment Models

Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Krishnan Parasuraman, Netezza & Greg Rokita, Edmunds

  • 1. Building Scalable Data Platforms: Hadoop and Netezza Deployment Models. Krishnan Parasuraman (Netezza), Greg Rokita (Edmunds.com)
  • 2. Talking Points
      – Building scalable data platforms: architectural considerations
      – Hadoop and Massively Parallel Databases: similarities and differences, usage patterns
      – Practitioner’s View Point: Edmunds.com data warehouse platform
      Hadoop World 2011
  • 3. Building scalable data platforms: Typical Digital Media Information Processing Pipeline
      – Inputs: Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations
      – Real Time Decision Engine: Display Ads, Recommendation, Personalized Content
      – Data Processing: Correlate, Structure, Consolidate
      – Reporting: Aggregate, Summarize, Ad-hoc analysis
      – Also listed: Scoring, Yield optimization, Audience Analytics
  • 4. Building scalable data platforms: the same inputs (Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations) and the same three components (Real Time Decision Engine, Data Processing, Reporting), all sitting on a common DATA PLATFORM
  • 5. Building scalable data platforms: requirements of the Real Time Decision Engine, Data Processing, and Reporting components on the DATA PLATFORM
      – Workloads: Real Time, High Velocity, Compute intensive, Cached Queries, High Concurrency, Transactional, Linearly Scalable, Full table scans, Low Latency, Disk bound, High Throughput
      – Data: Structured, Mostly Structured, Un-Structured, Some unstructured, Relational, Key-Value pairs, Machine Generated
      – Capability: Stream Processing, Low Disk I/O, In-DB computation, OLAP, Memory resident, Fast Processing, SQL and MR, Columnar, Key based lookups, Low Cost/TB, Analytic Libraries
  • 6. Building scalable data platforms: the workload, data, and capability requirements of slide 5, overlaid with candidate technologies: Hadoop, Massively Parallel DB, NoSQL Databases, In-Memory DB, Graph DB, Plain Ole’ DB on steroids
  • 7. Building scalable data platforms (same content as slide 6)
  • 8. Myth: A single technology will meet all the considerations for our scalable data platform needs
      Best Practices
      – Workloads scale differently: monolithic architectures don’t work
      – Minimize components: data movement is painful
      – Understand tradeoffs: Performance, Price, Effort
      – Start with the core architecture and work in the edge cases
  • 9. Massively parallel data warehouses: SQL and MR; Host controllers (Hosts); Network fabric; Massively parallel compute nodes, each with FPGA, CPU, and Memory; Distributed Storage
  • 10. Hadoop: Map Reduce; Master Node (Job Tracker, Name Node); Network fabric; Parallel compute nodes, each with a Task Tracker and a Data Node; Distributed Storage
  • 11. There are striking similarities…
      – Massive parallelism
      – Execute code & algorithms next to data
      – Scalable
      – Highly Available
      – Map Reduce
  • 12. But also key differences
      – Hadoop: Schema on Read (data loading is fast); Batch Mode data access; Lower cost of data storage; Process unstructured data; Data Loading = File copy; Look Ma, No ETL
      – Netezza: Optimized for Performance; Real time access, random reads, query optimizer, co-located joins; Hardware Accelerated queries; SQL and Map Reduce
  • 13. These differences lead to opportunities for co-existence for Hadoop in a Netezza environment
      – 1. Scalable ETL engine: complex data, relationships not defined, evolving schema
      – 2. Queryable Archive: moving computation is cheaper than moving data
      – 3. Analytics sandbox: exploratory analysis
  • 14. Netezza-Hadoop: Deployment Patterns
      – Create context (classification, text mining): analyze unstructured data
      – Parse, aggregate: analyze, report semi-structured data
      – Active archival, long running queries: analyze, report structured data
  • 15. Pattern 1: Data Processing Engine (ETL). Raw Weblogs; Hadoop Cluster (NameNode, JobTracker, and three DataNode/TaskTracker nodes); Netezza Environment
  • 16. Pattern 2: Low cost storage and dynamic provisioning. Amazon Cloud (Amazon S3, Elastic MapReduce); Netezza Environment; steps 1, 2, 3 marked on the diagram
  • 17. Pattern 3: Queryable Archive. Data Sources; Netezza Environment; steps 1, 2, 3 marked on the diagram
  • 18. About Greg Rokita
      – Director, Software Architecture at Edmunds, Inc
      – M.S. in Computer Science, Stanford University
      – Research interests: large scale programming paradigms; domain specific data stores; semi-structured data representation and search
      – Designs & Implementations of Core Frameworks: publishing & messaging infrastructure; Content & Digital Asset Management systems; Reviews & Ratings system; Search APIs; Big Data Analytics
      No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds Inc., and any such disclosure requires the express approval of Edmunds Inc. No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds.com, Inc., and any such disclosure requires the express approval of Edmunds.com, Inc.
  • 19. Edmunds.com and Scale
      – Premier online resource for automotive information, launched in 1995 as the first automotive information Web site
      – 15 million unique visitors
      – 210 million page views
      – 1 million+ new inventory items per day
      – 2 TB of new data every month
      – 40 node Hadoop cluster aggregating logs, advertising, vehicle, pricing, inventory and other data sets
  • 20. Edmunds Proposition: We have developed an iterative approach to data warehouse development that has dropped the time it takes for us to deliver reports to our users from months to weeks.
  • 21. How did we do it?
      – Process
      – Technology
      – Understanding of Value
  • 22. Process: agile approach
      – Continuous and fast delivery of new features
      – Collaboration between users and developers
      – Make new data available quickly and inexpensively
      – Quick problem resolution
      – No wasting of entire development cycle if data is not useful
      – Encouragement of exploration and creation of new applications
  • 23. Process
      – Pre-process: Complete; Raw; Modeled as source data; Generically loaded; Quick turn-around; Low retention; Slower performance
      – Post-process: Filtered; Transformed; Modeled as star schema; Optimized; Slow turn-around; High retention; Fast performance
  • 24. Post-Process
      – Sandbox: use pre-process data; load data in ad hoc manner
      – Data has value? No: Discard (prevents shadow production by users or developers; little effort lost)
      – Data has value? Yes: Prototype (change schema, enhance)
      – Schema is stable? Yes: Develop Optimized Pipeline (data is confirmed to be useful; effort is warranted)
  • 25. Technology
      – Publishing System: All Data; Generic; Thrift IDL with Versioning
      – Hadoop Stack: HBase raw data; Oozie job coordinator; HDFS storage of pre and optimized data, replica of RDBMS in files
      – Netezza: All data loaded from Hadoop in batch; Analysis and data exploration (use the speed and power); Report generation
  • 26. Edmunds Publishing System
  • 27. Generic flow for pre-process: Producers (Inventory, Pricing, Vehicle, Dealer, Leads); Broker; Consumer; HBase; Map-Reduce; Netezza Action (diagram label: Generic)
  • 28. What architecture enables generic consumer? Thrift, Camel, ActiveMQ. Features listed: Message, Retries, Delivery, Throttling, Routing, Persistence, Versioning, Durability, Monitoring
  • 29. Flexibility for Producers and Consumers: Support for Topologies (Field: Example Values: Purpose)
      – Environment: PROD, TEST, DEV: Promotion cycle of deployment units
      – Index: Blue, Green, Stage: Environment Index
      – Data Center: LAX1, EC2: The data center where deployment unit is located
      – Site: Edmunds, Insideline: Company’s Product
      – Application: HBase, Digital Asset Manager: Deployment Unit
  • 30. Producer-Consumer matching (Match!)
      – Producer: Topic Name: Publish Inventory. I am: Prod, Lax, Edmunds, Inventory. Send To: Prod, Test; Lax, EC2; Edmunds; Dealer
      – Broker: Destination Interceptor
      – Consumer: Virtual Queue Name: Publish Inventory. I am: Test, EC2, Edmunds, Dealer. Receive From: Prod; Lax, EC2; Edmunds; Inventory
  • 31. HBase: how to handle data generically (Column Family: Columns: Role)
      – Binary: Serialized Thrift Object: System of record
      – Binary: Hashcode of the Thrift Object: Check if updates are necessary (optimization)
      – Discrete: Thrift Object Field 1, Field 2, Field 3: Versioning at the most granular level for lookups
      – Type 2: Start Date, End Date, List of fields: Versioning for optimized dimension tables
  • 32. Generic Thrift Persistence in HBase (Column Name = Value)
      – [ModelYear]|F:id|T:long|I:0 = 1368
      – [ModelYear]|F:midYear|T:boolean|I:1 = false
      – [ModelYear]|F:year|T:int|I:2 = 1993
      – [ModelYear]|F:name|T:java.lang.String|I:4 = Celica
      – [ModelYear]#[attributss][0]|F:_key|T:java.lang.Long = 64
      – [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][0]|F:value|T:java.lang.String|I:1 = Standard Sport
      – [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:value|T:java.lang.String|I:1 = V:GT-S 2dr Hatchback
      – [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:id|T:long|I:2 = 441
      – [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][3]|F:value|T:java.lang.String|I:1 = V:GT-S
  • 33. Netezza: Time is Money (Compared to Oracle: Business Value)
      – Up to 12x faster load times: Can reload data more frequently; Failed workflows are no longer a big problem; Helps in transition to real time system: We can now create intraday reports for Leads!
      – Up to 400x faster query times: More productive Business Intelligence; Queries that could ‘never’ finish in Oracle are now providing business value
  • 34. Generic and reusable Oozie actions for Netezza: Oozie Load and Remove Action; Apache CLI; Nzload and Nzsql (provisioned on worker nodes using Chef)
  • 35. Value
      – Data warehouse proves product value both internally and to our customers
      – Failing fast and quick turn-around allow us to know when we are building the right reporting and analytical products without a large up-front investment
      – By combining all data in a single system we are enabling new products to be developed that we previously could not
  • 36. Building Scalable Data Platforms: Hadoop and Netezza Deployment Models. Krishnan Parasuraman (@kparasuraman), Greg Rokita (Edmunds.com)
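
Slide 12 contrasts Hadoop's "Schema on Read" (loading is a file copy) with a warehouse that structures data up front. The sketch below is only an illustration of that idea in plain Python, not code from the talk: `raw_store`, `load`, `read`, and the tab-separated log line are all hypothetical stand-ins, with a list playing the role of files copied into HDFS.

```python
# Illustrative sketch only: a list stands in for raw files in HDFS.
raw_store = []

def load(lines):
    """Schema on read: loading is a plain copy, with no parsing or validation."""
    raw_store.extend(lines)

def read(schema):
    """Apply a schema (list of (name, cast) pairs) only when the data is read.

    Lines that do not fit the schema are skipped rather than rejected at load time.
    """
    for line in raw_store:
        parts = line.split("\t")
        if len(parts) != len(schema):
            continue
        yield {name: cast(value) for (name, cast), value in zip(schema, parts)}

# Hypothetical weblog line: day, page, view count.
load(["2011-11-08\t/new-cars\t3", "not a log line"])
rows = list(read([("day", str), ("page", str), ("views", int)]))
```

Because the schema lives in the reader, a later job can reread the same raw lines with a different schema, which is one way to read the "evolving schema" point on slide 13.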
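
Slides 29 and 30 describe producers and consumers that each declare who they are ("I am") and whom they talk to ("Send To" / "Receive From") in terms of topology fields. One plausible reading, sketched below, is that a match needs both sides to accept each other. This is an assumption about the rule, not the Edmunds broker implementation; the dictionary layout and the names `FIELDS`, `accepts`, and `matches` are hypothetical. The values are the ones shown on slide 30.

```python
# Topology fields taken from slide 29 (Index is left out, as on slide 30).
FIELDS = ("environment", "data_center", "site", "application")

def accepts(allowed, identity):
    """True if every topology field of `identity` is among the allowed values."""
    return all(identity[f] in allowed[f] for f in FIELDS)

def matches(producer, consumer):
    """Assumed rule: the producer must send to the consumer's identity
    and the consumer must receive from the producer's identity."""
    return (accepts(producer["send_to"], consumer["i_am"])
            and accepts(consumer["receive_from"], producer["i_am"]))

producer = {
    "i_am": {"environment": "Prod", "data_center": "Lax",
             "site": "Edmunds", "application": "Inventory"},
    "send_to": {"environment": {"Prod", "Test"}, "data_center": {"Lax", "EC2"},
                "site": {"Edmunds"}, "application": {"Dealer"}},
}
consumer = {
    "i_am": {"environment": "Test", "data_center": "EC2",
             "site": "Edmunds", "application": "Dealer"},
    "receive_from": {"environment": {"Prod"}, "data_center": {"Lax", "EC2"},
                     "site": {"Edmunds"}, "application": {"Inventory"}},
}

is_match = matches(producer, consumer)
```

With the slide's values both checks pass, which agrees with the "Match!" label on the diagram.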
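
Slide 31 says a hashcode of the Thrift object is stored so the system can "check if updates are necessary", and that start and end dates support versioning. The following is a minimal sketch of how those two ideas could combine, under loud assumptions: a Python dict replaces HBase, SHA-1 over the serialized bytes replaces whatever hashcode the authors used, and the `upsert` function and its record layout are hypothetical.

```python
import hashlib
from datetime import date

def upsert(store, key, serialized, today):
    """Write a new version only when the serialized object actually changed.

    `store` maps a key to a list of versions; each version keeps the blob,
    its hash, and a start/end date (end is None for the current version).
    Returns True if a new version was written.
    """
    digest = hashlib.sha1(serialized).hexdigest()
    versions = store.setdefault(key, [])
    if versions and versions[-1]["hash"] == digest:
        return False  # unchanged: the hash comparison lets us skip the write
    if versions:
        versions[-1]["end"] = today  # close the previous version
    versions.append({"blob": serialized, "hash": digest,
                     "start": today, "end": None})
    return True

store = {}
first = upsert(store, "vehicle:1368", b"celica-v1", date(2011, 11, 1))
repeat = upsert(store, "vehicle:1368", b"celica-v1", date(2011, 11, 2))
changed = upsert(store, "vehicle:1368", b"celica-v2", date(2011, 11, 3))
```

Comparing a short hash instead of the full object keeps the "is this an update?" check cheap, which is presumably why the slide labels it an optimization.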
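
Slide 32 shows nested objects stored as flat HBase columns whose names encode the path, field name, and type (for example `[ModelYear]|F:year|T:int|I:2`). The sketch below flattens a nested Python structure in a similar spirit. It is a simplification, not the authors' serializer: it omits the `I:` field index, uses Python type names instead of Java ones, and assumes lists contain only dicts. The `flatten` function and the shape of the sample object are hypothetical; the values are borrowed from the slide.

```python
def flatten(root_name, obj):
    """Yield (column_name, value) pairs for a nested dict/list structure."""
    def walk(path, node):
        for name, value in node.items():
            if isinstance(value, dict):
                yield from walk(f"{path}#[{name}]", value)
            elif isinstance(value, list):
                # Assumption: list elements are dicts, addressed by position.
                for i, item in enumerate(value):
                    yield from walk(f"{path}#[{name}][{i}]", item)
            else:
                yield f"{path}|F:{name}|T:{type(value).__name__}", value
    yield from walk(f"[{root_name}]", obj)

# Simplified sample object using values that appear on slide 32.
model_year = {
    "id": 1368,
    "midYear": False,
    "year": 1993,
    "name": "Celica",
    "styles": [{"attributes": [{"value": "V:GT-S", "id": 441}]}],
}
columns = dict(flatten("ModelYear", model_year))
```

Since every leaf becomes its own column, a consumer can persist any object without a per-type table design, which seems to be the point of the "generic" persistence on the slide.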