Research on Big Data
                                                     - FlexDB: A cloud-scale database engine
                                                     based on Hadoop
                                                         Jidong Chen (jidong.chen@emc.com)
                                                         Manager, Research Scientist, Big Data Lab

                                                         EMC Labs China
                                                         Sept. 2011




© Copyright 2011 EMC Corporation. All rights reserved.                                               1
Grand Opening Announcement




                      EMC Labs China is formed from EMC Research China and the
                      Advanced Technology Venture group, which were established in
                      2007 by the Office of the CTO.



EMC Labs China - Vision and Mission
• Advanced Technology Research and Development
           – Big Data Lab
           – Cloud Infrastructure and System Lab
           – Cloud Platform and Applications Lab
• University Collaboration
• Industry Standards Office
• IP Portfolio Development
• Vision
           – Become an elite research and advanced technology institute in China
           – Become the model for future EMC Labs worldwide




Outline

• Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
           – Parallel DBMS
           – MapReduce
• FlexDB - A cloud-scale database engine based on
  Hadoop
• Summary



The Digital Universe 2009-2020



• 2009: 0.8 Zettabytes
• 2020: 35.2 Zettabytes
• Growing by a factor of 44
Source: IDC Digital Universe Study, sponsored by EMC, May 2010




Big Data is Changing the World
Expanding Data Sources
• Science and research
          – Gene sequences
          – LHC accelerator
          – Earth and space exploration
• Enterprise applications
          – Email, documents, files
          – Application logs
          – Transaction records
• Web 2.0 data
          – Search log / click stream
          – Twitter / Blog / SNS
          – Wiki
• Other unstructured data
          – Video/Movie
          – Graphics
          – Digital widgets

Bigger Challenges
• Scale out automatically
          – Vs. scale up manually
• More capacity and bigger pool
          – E.g., 10 PB in a single file system
• New process capability
          – Loading, analyzing, moving data
          – Intelligence
• Better performance
          – Linear vs. exponential
          – Faster
• Autonomous
          – Less human intervention
          – Lower cost



Research Scopes and Topics in Big Data
• Search and Analytics
          – Search: Entity Search, Faceted Search, Associative Search
          – Analytics: Text Analysis, Activity Modeling and Sequence Analysis,
            Real-time Data Analysis for Streaming, Parallel Data Mining
            Algorithms
• MPP Databases and Data Services
          – Parallel Database: Parallel Query Optimization, Data Partitioning
            and Replication, Distributed Transaction
          – In-memory Database: Cache, Recovery, Consistency
          – Database as a Service: Multi-tenant Data Management, Auto-
            Administration
• Hadoop/NoSQL
          – Hadoop: Single-node Failure, Performance, Real-time MapReduce
            Scheduler and Fault Tolerance
          – NoSQL: Key-Value Store, Document Store, Graph Data Store

Project Overview
• Hadoop/NoSQL
          – vHadoop - joint project with VMware
                    • Parallel SAN file system for DISC on virtualized platform
          – Online MapReduce for Real-time Data Analytics
                    • Pipelined task execution, Group task scheduling, Enhanced fault tolerance
                    • Parallel Data Mining
          – FlexDB: Cloud-scale Parallel Database for OLAP
                    • MapReduce integration into DBMS, Parallel query execution, Cost-based query
                      optimization
          – Cloud-scale Parallel Database for OLTP
                    • Intelligent database sharding and resharding
                    • Active-active (eager) replication with group communication service
                    • Multiple masters with elastic distributed coordination




Cloud Databases
  • Two largest components of data management market
            – Transactional Data Management
                      • Banks, airline reservation, online e-commerce
                      • ACID, write-intensive
            – Analytical Data Management
                      • Business planning, decision support
                      • Query-intensive

  • Challenges of data management in the Cloud
            –     Scalability
            –     Fault Tolerance
            –     Availability & Consistency
            –     Transaction Management
            –     Flexible Schemas




Cloud Databases
  • Data analytics in the cloud
            – Parallel DBMS
            – MapReduce
  • Transactional data management in the cloud
            – NoSQL Store
            – SQL Database
  • Cloud data services (Database as a Service)
            – Multi-tenant data management
            – Auto-administration




Commercial Landscape Major Players

  • Amazon EC2
            – IaaS abstraction
            – Data management using S3 and SimpleDB
  • Microsoft Azure
            – PaaS abstraction
            – Relational engine (SQL Azure)
  • Google AppEngine
            – PaaS abstraction
            – Data management using Google Megastore



Data Analytics in the Cloud

• Scalability to large data volumes:
           – Scan 100 TB on 1 node @ 50 MB/sec = 23 days
           – Scan on 1000-node cluster = 33 minutes
 Divide-And-Conquer (i.e., data partitioning)

• Cost-efficiency:
           –     Commodity nodes (cheap, but unreliable)
           –     Commodity network
           –     Automatic fault-tolerance (fewer admins)
           –     Easy to use (fewer programmers)
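
The scan-time figures on this slide follow from simple arithmetic. The sketch below reproduces them, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), an even split of the data across nodes, and no coordination overhead; the function name is illustrative only.

```python
def scan_time_seconds(data_bytes, nodes, mb_per_sec_per_node=50.0):
    """Time to scan data_bytes when it is split evenly over `nodes`,
    each node reading at mb_per_sec_per_node (idealized, no overhead)."""
    bytes_per_sec = mb_per_sec_per_node * 1e6 * nodes
    return data_bytes / bytes_per_sec

HUNDRED_TB = 100e12  # 100 TB in decimal bytes (assumption)

single_node_days = scan_time_seconds(HUNDRED_TB, nodes=1) / 86_400
cluster_minutes = scan_time_seconds(HUNDRED_TB, nodes=1000) / 60

print(f"1 node:     {single_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} minutes")
```

Under these assumptions the single node needs about 23 days and the 1000-node cluster about 33 minutes, which is the speed-up that data partitioning is meant to deliver.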

Solutions for Large-scale Data Analysis

  • Parallel DBMS technologies
            – Proposed in late eighties
            – Matured over the last two decades
            – Multi-billion dollar industry: Proprietary DBMS Engines
              intended as Data Warehousing solutions for very large
              enterprises
  • Map Reduce
            – pioneered by Google
            – popularized by Yahoo! (Hadoop)



Parallel DBMS technologies
  • Widely used for more than two decades
            – Research Projects: Gamma, Grace, …
            – Commercial: Teradata, Greenplum (acquired by EMC), Netezza
              (acquired by IBM), DATAllegro (acquired by Microsoft), Vertica
              (acquired by HP), Aster Data (acquired by Teradata)
  •    Shared-nothing node clusters
  •    Relational Data Model
  •    Indexing
  •    Familiar SQL interface
  •    Parallel query execution
            – Horizontal partitioning of relational tables with partitioned execution of
              SQL queries
  • Advanced query optimization
  • Well understood and studied
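
Horizontal partitioning with partitioned execution can be sketched in a few lines. This is a minimal illustration and not any vendor's implementation: rows are assigned to nodes by hashing a key, each node computes a partial aggregate on its own rows, and the partial results are merged. The table, column names, and node count are made up for the example.

```python
from collections import defaultdict

def hash_partition(rows, key, num_nodes):
    """Horizontally partition rows: each row goes to the node chosen by hashing its key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)
    return partitions

def local_revenue(partition):
    """Partial aggregate computed independently on one node."""
    partial = defaultdict(int)
    for row in partition:
        partial[row["custkey"]] += row["price"]
    return partial

def parallel_revenue(partitions):
    """Run the local aggregate on every partition, then merge the partial results."""
    total = defaultdict(int)
    for partition in partitions:  # each iteration stands in for one node
        for custkey, amount in local_revenue(partition).items():
            total[custkey] += amount
    return dict(total)

# Hypothetical order rows, used only for illustration.
orders = [
    {"orderkey": 1, "custkey": 10, "price": 100},
    {"orderkey": 2, "custkey": 11, "price": 250},
    {"orderkey": 3, "custkey": 10, "price": 50},
    {"orderkey": 4, "custkey": 12, "price": 75},
    {"orderkey": 5, "custkey": 11, "price": 25},
]

partitions = hash_partition(orders, key="orderkey", num_nodes=4)
revenue = parallel_revenue(partitions)
print(revenue)
```

The merged result is the same as a single-node GROUP BY, regardless of how the rows were spread across nodes.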


Greenplum: A Shared-nothing Parallel DBMS
  • Greenplum's MPP Database has extreme scalability
            – Optimized for BI and analytics
            – Fault-tolerant reliability and optimized performance
              using commodity CPUs, disks and networking
  • Provides automatic parallelization
            – No need for manual partitioning or tuning
            – Just load and query like any database
            – Tables are automatically distributed across nodes
  • Extremely scalable and I/O optimized
            – All nodes can scan and process in parallel
            – No I/O contention between segments
  • Linear scalability by adding nodes
            – Each adds storage, query performance and loading
              performance

[Figure: nodes connected by an interconnect, with parallel loading]




Greenplum Database Architecture
  • MPP (Massively Parallel Processing), Shared-Nothing Architecture
  • Interfaces: SQL, MapReduce
  • Master Servers: query planning & dispatch
  • Network Interconnect
  • Segment Servers: query processing & data storage
  • External Sources: loading, streaming, etc.




Example of Parallel Query Optimization
Query:

select
    c_custkey, c_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    c_acctbal, n_name, c_address, c_phone, c_comment
from
    customer, orders, lineitem, nation
where
    c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate >= date '1994-08-01'
    and o_orderdate < date '1994-08-01' + interval '3 month'
    and l_returnflag = 'R'
    and c_nationkey = n_nationkey
group by
    c_custkey, c_name, c_acctbal,
    c_phone, n_name, c_address, c_comment
order by
    revenue desc

Parallel query plan (root first):

Gather Motion 4:1 (slice 3)
  Sort
    HashAggregate
      HashJoin
        Redistribute Motion 4:4 (slice 1)
          HashJoin
            Seq Scan on lineitem
            Hash
              Seq Scan on orders
        Hash
          HashJoin
            Seq Scan on customer
            Hash
              Broadcast Motion 4:4 (slice 2)
                Seq Scan on nation
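
The plan on this slide uses two kinds of data movement between segments: a Redistribute Motion and a Broadcast Motion. The sketch below illustrates the general idea of each, under the assumption that a redistribute rehashes rows on a key so that matching rows meet on the same segment, and a broadcast copies a (small) table to every segment. It is not Greenplum's implementation; the helper names and sample rows are hypothetical.

```python
def redistribute(segments, key):
    """Rehash every row on `key` so rows with equal keys land on the same segment."""
    n = len(segments)
    out = [[] for _ in range(n)]
    for segment in segments:
        for row in segment:
            out[hash(row[key]) % n].append(row)
    return out

def broadcast(segments):
    """Give every segment a full copy of the table."""
    all_rows = [row for segment in segments for row in segment]
    return [list(all_rows) for _ in segments]

def hash_join(probe_rows, build_rows, key):
    """Local hash join on one segment: build a hash table, then probe it."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    return [{**p, **b} for p in probe_rows for b in table.get(p[key], [])]

# Four segments holding arbitrary slices of each table (illustrative data).
customer = [[{"custkey": 1, "nationkey": 7}], [{"custkey": 2, "nationkey": 8}],
            [{"custkey": 3, "nationkey": 7}], []]
nation = [[{"nationkey": 7, "n_name": "GERMANY"}], [],
          [{"nationkey": 8, "n_name": "INDIA"}], []]
orders = [[{"orderkey": 100, "custkey": 3}], [{"orderkey": 101, "custkey": 1}],
          [], [{"orderkey": 102, "custkey": 3}]]

# Broadcast the small table, then join locally on each segment.
nation_copies = broadcast(nation)
cust_nation = [row for c, n in zip(customer, nation_copies)
               for row in hash_join(c, n, "nationkey")]

# Redistribute both inputs on the join key, then join locally on each segment.
cust_by_key = redistribute(customer, "custkey")
orders_by_key = redistribute(orders, "custkey")
order_cust = [row for o, c in zip(orders_by_key, cust_by_key)
              for row in hash_join(o, c, "custkey")]

print(len(cust_nation), len(order_cust))
```

Broadcasting suits a small table such as nation, while redistributing avoids copying a large table to every segment.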




MapReduce

  • Overview
            – large-scale, massively parallel data access platform
            – Simple data-parallel programming model to express relatively
              sophisticated distributed programs
            – An associated parallel and distributed implementation for commodity
              clusters
  • Pioneered by Google
            – Processes 20 PB of data per day
  • Popularized by open-source Hadoop project
            – Used by Yahoo!, Facebook, Amazon, and the list is growing …




Programming Framework

                                               Raw Input: <key, value>


                                                         MAP



                              <K1, V1>                    <K2,V2>        <K3,V3>


                                                         REDUCE


MapReduce Example: WordCount

Map(K, V) {
  For each word w in V
    Collect(w, 1);
}

Combine(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}

Reduce(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}

[Figure: the input words (Cat, Bat, Dog, other words; size: TByte) are divided into splits; each split goes through map and combine; the reduce tasks write part0, part1 and part2, giving Cat 3, Bat 4, Dog 3, …]
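
The Map, Combine and Reduce functions of the WordCount example can be tried out in a single process. The sketch below is only a simulation of the data flow (split, map, combine, reduce); it does not use Hadoop, and the three input splits are made-up text chosen so that the totals match the slide.

```python
from collections import defaultdict

def map_fn(_key, text):
    """Map: emit (word, 1) for every word in the input value."""
    for word in text.split():
        yield word, 1

def combine_fn(word, counts):
    """Combine: pre-aggregate the counts produced by one map task."""
    yield word, sum(counts)

def reduce_fn(word, counts):
    """Reduce: sum the (already combined) counts for one word."""
    yield word, sum(counts)

def group_by_key(pairs):
    """Group (key, value) pairs into key -> list of values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def word_count(splits):
    combined = []
    for split_id, text in enumerate(splits):  # one map task per split
        mapped = map_fn(split_id, text)
        for word, counts in group_by_key(mapped).items():
            combined.extend(combine_fn(word, counts))
    result = {}
    for word, counts in group_by_key(combined).items():  # shuffle, then reduce
        for key, total in reduce_fn(word, counts):
            result[key] = total
    return result

splits = ["Cat Bat Dog", "Bat Cat Bat", "Dog Cat Bat Dog"]
counts = word_count(splits)
print(counts)
```

The combiner has the same body as the reducer because addition is associative and commutative, so summing per split first does not change the final totals.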
MapReduce Implementation in Hadoop
[Figure: the client submits a job to the master, which assigns map and reduce tasks. In the map phase, mappers read the input files (split0 to split4) and write intermediate files to local disk. In the reduce phase, reducers read the intermediate files remotely and write the output files (file0, file1).]

MapReduce Advantages
     • Automatic Parallelization:
                – Depending on the size of RAW INPUT DATA => instantiate
                  multiple MAP tasks
                – Similarly, depending upon the number of intermediate <key,
                  value> partitions => instantiate multiple REDUCE tasks
     • Run-time:
                –     Data partitioning
                –     Task scheduling
                –     Handling machine failures
                –     Managing inter-machine communication
     • Completely transparent to the programmer/analyst/user
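
How the input size and the number of partitions translate into task counts can be sketched as follows. The 64 MB split size and the CRC32-based partition function are assumptions made for this example; real frameworks make both configurable, and Hadoop's default partitioner similarly takes a hash of the key modulo the number of reduce tasks.

```python
import math
import zlib

SPLIT_BYTES = 64 * 1024 * 1024  # assumed split size, not taken from the slides

def num_map_tasks(input_bytes, split_bytes=SPLIT_BYTES):
    """One map task per input split."""
    return max(1, math.ceil(input_bytes / split_bytes))

def reducer_for(key, num_reducers):
    """Pick the reduce task for an intermediate key with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

one_tib = 1024 ** 4
map_tasks = num_map_tasks(one_tib)
print(map_tasks)  # 16384 map tasks for 1 TiB of input at 64 MiB per split

# Every occurrence of a key goes to the same reducer.
assignments = {word: reducer_for(word, 3) for word in ["Cat", "Bat", "Dog"]}
print(assignments)
```

A stable hash is used instead of Python's built-in `hash` for strings, which is randomized per process and would assign keys differently on each run.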


Possible Applications
  • Special-purpose programs to process large amounts
    of data: crawled documents, Web query logs, etc.
            – ETL and “read once” data sets
            – Complex analytics
            – Semi-structured data, key-value pairs
  • At Google and others (Yahoo!, Facebook):
            –      Inverted index
            –      Graph structure of the WEB documents
            –      Summaries of #pages/host, set of frequent queries, etc.
            –      Ad Optimization
            –      Spam filtering

MapReduce vs Parallel DBMS

                                        Parallel DBMS                MapReduce
   Schema Support                       Yes                          Not out of the box
   Indexing                             Yes                          Not out of the box
   Programming Model                    Declarative (SQL)            Imperative (C/C++, Java, …); extensions through Pig and Hive
   Optimizations (Compression,          Yes                          Not out of the box
     Query Optimization)
   Flexibility                          Not out of the box           Yes
   Fault Tolerance                      Coarse-grained techniques    Yes


Further Analysis and Comparison
• Limitations of some current parallel databases / data warehouses
           – Often use expensive/specialized hardware
           – Difficult to scale to more than 100 nodes
           – Difficult to parallelize data mining applications
                     • MPI …
           – Difficult to deal with unstructured data
           – Fault tolerance
                     • One node fails, restart whole query
           – Expensive
• Disadvantages of some MapReduce-based solutions (e.g., Hive)
           – A sub-optimal brute-force implementation: No indexing, No JOINs
                     • E.g., finding the employees whose salary is $10,000 requires scanning every record
           –     Row based storage, Updates?
           –     Not SQL/BI tool compatible
           –     No support for schema
           –     Non-declarative programming model
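
The cost of having no index can be illustrated with the salary lookup mentioned on this slide. The sketch below compares a full scan with a lookup in a simple in-memory hash index; the employee records and function names are invented for the example, and a real DBMS would typically use an on-disk structure such as a B-tree.

```python
from collections import defaultdict

# Hypothetical employee records, for illustration only.
employees = [
    {"name": "Ann", "salary": 10_000},
    {"name": "Bob", "salary": 12_000},
    {"name": "Cho", "salary": 10_000},
    {"name": "Dee", "salary": 9_500},
]

def full_scan(rows, salary):
    """Brute force: look at every record, as a scan-only system must."""
    return [row["name"] for row in rows if row["salary"] == salary]

def build_index(rows):
    """Build a salary -> names index once, so later lookups skip the scan."""
    index = defaultdict(list)
    for row in rows:
        index[row["salary"]].append(row["name"])
    return index

salary_index = build_index(employees)

scan_result = full_scan(employees, 10_000)
index_result = salary_index.get(10_000, [])
print(scan_result, index_result)
```

Both approaches return the same rows; the difference is that the scan touches all records on every query, while the index lookup touches only the matching ones.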


MapReduce Integration in DBMS Context

  • FlexDB - A Cloud-scale Parallel Database Engine based on
    Hadoop MapReduce (A Research Project)
      – An architectural hybrid of MapReduce and DBMS
        technologies
      – Use the fault tolerance and scalability of the MapReduce
        framework
      – Leverage advanced data processing techniques (e.g.,
        query optimization) of an RDBMS for high performance
      – Expose a declarative interface to the user
  • Goal: Leverage the best of both worlds



FlexDB Architecture




FlexDB Master
[Figure: the FlexDB Master consists of a Query Parser, Query Optimizer, Job Generator and Job Executor, plus a Catalog manager. A query such as SELECT * FROM Account WHERE balance > 30 is turned into jobs for the MapReduce framework (mappers and reducers). The query is run as a subquery against the individual databases, and the rows of the Account table (r0 to r9) are partitioned across those databases.]
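
The idea shown on this slide, running the same subquery against each node's local database and collecting the results, can be sketched as follows. This is an illustration under stated assumptions, not FlexDB code: in-memory SQLite databases stand in for the per-node Postgres/Greenplum instances, the column names of Account are invented, and a plain loop stands in for the map tasks that the MapReduce framework would schedule.

```python
import sqlite3

SUBQUERY = "SELECT * FROM Account WHERE balance > 30"

def make_node(rows):
    """Create one node-local database holding a partition of Account."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Account (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
    )
    conn.executemany("INSERT INTO Account VALUES (?, ?, ?)", rows)
    return conn

def map_task(conn):
    """Stand-in for a map task: push the subquery down to the local database."""
    return conn.execute(SUBQUERY).fetchall()

def run_query(nodes):
    """Stand-in for the job executor: run one task per node and merge the output."""
    results = []
    for conn in nodes:
        results.extend(map_task(conn))
    return sorted(results)

# Ten hypothetical accounts, two per node, mirroring the r0..r9 layout in the figure.
accounts = [(i, f"n{i}", 10.0 * i) for i in range(10)]
nodes = [make_node(accounts[i:i + 2]) for i in range(0, 10, 2)]

rows = run_query(nodes)
print(rows)
```

Because the filter runs inside each database, only the qualifying rows leave a node, which is the main benefit of pushing the query down rather than scanning raw files.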


Comparison with other systems

                                 FlexDB                        Hive               HadoopDB                              Traditional parallel database
   Query language                SQL                           HQL                SQL (joins not currently supported)   SQL
   Storage                       Postgres/Greenplum            HDFS               JDBC compatible                       Native OS files
   Optimizer                     Cost based (DB/MR paths)      Simple rule based  Simple rule based                     Cost based
   Physical storage organization Column/Row based              Row based          Currently row based                   Column/Row based
   Implementation                FlexDB Master + Hadoop + DB   Hive + Hadoop      Hive (rev) + Hadoop + DB              Native
   Efficiency                    High                          Low                Medium                                Very high
   Scale                         Large                         Large              Large                                 Medium
   Cost                          Low                           Low                Low                                   High




Summary
  • New in cloud computing
            – Elasticity/Scalability
            – Resource sharing (multi-tenancy)
            – Focus on failure
  • Data analytics in the cloud: Different solutions suitable for
    different workloads
            – Parallel DBMSs excel at efficient querying of large data sets
            – MR-style systems excel at complex analytics and ETL tasks
  • Combine MapReduce with shared-nothing DBMS to produce a
    system that better fits the cloud computing market




Acknowledgements

  • Some slides are adapted from the following references:
            – Divy Agrawal, Sudipto Das, and Amr El Abbadi, “Big Data and Cloud
              Computing: New Wine or just New Bottles?”, VLDB 2010 Tutorial
            – Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik
              Paulson, Andrew Pavlo, and Alexander Rasin, “MapReduce and Parallel
              DBMSs: Friends or Foes?”, Communications of the ACM 2010




EMC Labs China
                                  Dr. Tao Bo (陶波)
                                  Director, EMC Labs China


                                                         Blog: http://blog.sina.com.cn/emclabschina
                                                         Weibo: http://weibo.com/emclabschina




THANK YOU




Mais conteúdo relacionado

Mais procurados

The CIOs Guide to NoSQL 2012
The CIOs Guide to NoSQL 2012The CIOs Guide to NoSQL 2012
The CIOs Guide to NoSQL 2012DATAVERSITY
 
Summit 2011 infra_dbms
Summit 2011 infra_dbmsSummit 2011 infra_dbms
Summit 2011 infra_dbmsPini Cohen
 
Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...
Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...
Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for H...EMC
 
Prepare Your Data For The Cloud
Prepare Your Data For The CloudPrepare Your Data For The Cloud
Prepare Your Data For The CloudIndicThreads
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataRichard McDougall
 
DDN: Protecting Your Data, Protecting Your Hardware
DDN: Protecting Your Data, Protecting Your HardwareDDN: Protecting Your Data, Protecting Your Hardware
DDN: Protecting Your Data, Protecting Your Hardwareinside-BigData.com
 
Big Data and HPC
Big Data and HPCBig Data and HPC
Big Data and HPCNetApp
 
Future Proofing MySQL by Robert Hodges, Continuent
Future Proofing MySQL by Robert Hodges, ContinuentFuture Proofing MySQL by Robert Hodges, Continuent
Future Proofing MySQL by Robert Hodges, ContinuentEero Teerikorpi
 
Big Data: Movement, Warehousing, & Virtualization
Big Data: Movement, Warehousing, & VirtualizationBig Data: Movement, Warehousing, & Virtualization
Big Data: Movement, Warehousing, & Virtualizationtervela
 
Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011
Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011
Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011Antonio Alba
 
Top 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data GridTop 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data GridScaleOut Software
 
NetApp Unified Scale-Out/Clustered Storage
NetApp Unified Scale-Out/Clustered StorageNetApp Unified Scale-Out/Clustered Storage
NetApp Unified Scale-Out/Clustered StorageNetApp
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 

Research on Big Data

  • 1. Research on Big Data - FlexDB: A cloud-scale database engine based on Hadoop
      Jidong Chen (jidong.chen@emc.com), Manager, Research Scientist, Big Data Lab, EMC Labs China, Sept. 2011
      © Copyright 2011 EMC Corporation. All rights reserved.
  • 2. Grand Opening Announcement
      EMC Labs China is formed from EMC Research China and the Advanced Technology Venture group, which were established in 2007 by the Office of the CTO.
  • 3. EMC Labs China - Vision and Mission
      • Vision
        – Become an elite research and advanced technology institute in China
        – Become the model for future EMC Labs worldwide
      • Advanced Technology Research and Development
        – Big Data Lab
        – Cloud Infrastructure and System Lab
        – Cloud Platform and Applications Lab
      • University Collaboration
      • Industry Standards Office
      • IP Portfolio Development
  • 4. Outline
      • Big Data projects overview at EMC Labs China
      • Introduction to Cloud Databases
      • Data analytics in the cloud
        – Parallel DBMS
        – MapReduce
      • FlexDB - A cloud-scale database engine based on Hadoop
      • Summary
  • 5. The Digital Universe 2009-2020
      • Growing by a factor of 44
        – 2009: 0.8 ZB (zettabytes)
        – 2020: 35.2 ZB
      • Source: IDC Digital Universe Study, sponsored by EMC, May 2010
  • 6. Big Data is Changing the World
      Expanding Data Sources
      • Science and research
        – Gene sequences
        – LHC accelerator
        – Earth and space exploration
      • Enterprise applications
        – Email, documents, files
        – Application logs
        – Transaction records
      • Web 2.0 data
        – Search log / click stream
        – Twitter / Blog / SNS
        – Wiki
      • Other unstructured data
        – Video/Movie
        – Graphics
        – Digital widgets
      Bigger Challenges
      • Scale out automatically
        – Vs. scale up manually
      • More capacity and bigger pool
        – E.g., 10 PB in a single file system
      • New process capability
        – Loading, analyzing, moving data
        – Intelligence
      • Better performance
        – Linear vs. exponential
        – Faster
      • Autonomous
        – Less human intervention
        – Lower cost
  • 7. Research Scopes and Topics in Big Data
      • Search and Analytics
        – Search: Entity Search, Faceted Search, Associative Search
        – Analytics: Text Analysis, Activity Modeling and Sequence Analysis, Real-time Data Analysis for Streaming, Parallel Data Mining Algorithms
      • MPP Databases and Data Services
        – Parallel Database: Parallel Query Optimization, Data Partitioning and Replication, Distributed Transaction
        – In-memory Database: Cache, Recovery, Consistency
        – Database as a Service: Multi-tenant Data Management, Auto-Administration
      • Hadoop/NoSQL
        – Hadoop: Single-node Failure, Performance, Real-time MapReduce Scheduler and Fault Tolerance
        – NoSQL: Key-Value Store, Document Store, Graph Data Store
  • 8. Project Overview
      • Hadoop/NoSQL
        – vHadoop - joint project with VMware
          • Parallel SAN file system for DISC on virtualized platform
        – Online MapReduce for Real-time Data Analytics
          • Pipelined task execution, Group task scheduling, Enhanced fault tolerance
          • Parallel Data Mining
        – FlexDB: Cloud-scale Parallel Database for OLAP
          • MapReduce integration into DBMS, Parallel query execution, Cost-based query optimization
        – Cloud-scale Parallel Database for OLTP
          • Intelligent database sharding and resharding
          • Active-active (eager) replication with group communication service
          • Multiple masters with elastic distributed coordination
  • 9. Cloud Databases
      • Two largest components of the data management market
        – Transactional Data Management
          • Banks, airline reservation, online e-commerce
          • ACID, write-intensive
        – Analytical Data Management
          • Business planning, decision support
          • Query-intensive
      • Challenges of data management in the Cloud
        – Scalability
        – Fault Tolerance
        – Availability & Consistency
        – Transaction Management
        – Flexible Schemas
  • 10. Cloud Databases
      • Data analytics in the cloud
        – Parallel DBMS
        – MapReduce
      • Transactional data management in the cloud
        – NoSQL Store
        – SQL Database
      • Cloud data services (Database as a Service)
        – Multi-tenant data management
        – Auto-administration
  • 11. Commercial Landscape: Major Players
      • Amazon EC2
        – IaaS abstraction
        – Data management using S3 and SimpleDB
      • Microsoft Azure
        – PaaS abstraction
        – Relational engine (SQL Azure)
      • Google App Engine
        – PaaS abstraction
        – Data management using Google Megastore
  • 12. Data Analytics in the Cloud
      • Scalability to large data volumes:
        – Scan 100 TB on 1 node @ 50 MB/sec = 23 days
        – Scan on 1000-node cluster = 33 minutes
        → Divide-and-conquer (i.e., data partitioning)
      • Cost-efficiency:
        – Commodity nodes (cheap, but unreliable)
        – Commodity network
        – Automatic fault-tolerance (fewer admins)
        – Easy to use (fewer programmers)
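The scan times quoted on slide 12 can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even split of the data across nodes; the function name `scan_seconds` is illustrative only.

```python
TB = 10**12  # decimal terabyte in bytes (assumption)
MB = 10**6   # decimal megabyte in bytes (assumption)


def scan_seconds(data_bytes, bytes_per_sec_per_node, nodes=1):
    """Time to scan the data when it is split evenly across `nodes`."""
    return data_bytes / (bytes_per_sec_per_node * nodes)


single = scan_seconds(100 * TB, 50 * MB)                # one node
cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)   # 1000-node cluster

print(f"1 node:     {single / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

The thousandfold speedup only holds if the data is partitioned so that every node scans its own share, which is the divide-and-conquer point the slide makes.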
  • 13. Solutions for Large-scale Data Analysis
      • Parallel DBMS technologies
        – Proposed in the late eighties
        – Matured over the last two decades
        – Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
      • MapReduce
        – Pioneered by Google
        – Popularized by Yahoo! (Hadoop)
  • 14. Parallel DBMS technologies
      • Popularly used for more than two decades
        – Research projects: Gamma, Grace, …
        – Commercial: Teradata, Greenplum (acquired by EMC), Netezza (acquired by IBM), DATAllegro (acquired by Microsoft), Vertica (acquired by HP), Aster Data (acquired by Teradata)
      • Shared-nothing node clusters
      • Relational data model
      • Indexing
      • Familiar SQL interface
      • Parallel query execution
        – Horizontal partitioning of relational tables with partitioned execution of SQL queries
      • Advanced query optimization
      • Well understood and studied
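The "horizontal partitioning with partitioned execution" idea on slide 14 can be sketched as follows. This is an illustrative toy, not how any of the listed products implement it: the table, the column names (`order_id`, `region`, `amount`) and the helper functions are all hypothetical. Rows are hash-partitioned across nodes, each node computes a partial `SUM ... GROUP BY` over its own rows, and a coordinator merges the partial results.

```python
from collections import defaultdict


def hash_partition(rows, key, num_nodes):
    """Assign each row to exactly one node by hashing its partitioning key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes


def local_sum(rows, group_key, value_key):
    """Partial aggregate computed by one node over its own partition."""
    partial = defaultdict(float)
    for row in rows:
        partial[row[group_key]] += row[value_key]
    return partial


def parallel_sum(rows, partition_key, group_key, value_key, num_nodes=4):
    """Coordinator: run the partial aggregate per node, then merge."""
    merged = defaultdict(float)
    for node_rows in hash_partition(rows, partition_key, num_nodes):
        for group, subtotal in local_sum(node_rows, group_key, value_key).items():
            merged[group] += subtotal
    return dict(merged)


# Hypothetical table: SELECT region, SUM(amount) FROM orders GROUP BY region
orders = [
    {"order_id": 1, "region": "east", "amount": 10.0},
    {"order_id": 2, "region": "west", "amount": 5.0},
    {"order_id": 3, "region": "east", "amount": 20.0},
    {"order_id": 4, "region": "west", "amount": 7.5},
]
totals = parallel_sum(orders, "order_id", "region", "amount")
print(totals)
```

Because the table is partitioned on `order_id` rather than on `region`, rows of one region land on several nodes, so the merge step at the coordinator is required for a correct total.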
  • 15. Greenplum: A Shared-nothing Parallel DBMS
      • Greenplum's MPP Database has extreme scalability
        – Optimized for BI and analytics
        – Fault-tolerant reliability and optimized performance using commodity CPUs, disks and networking
      • Provides automatic parallelization
        – No need for manual partitioning or tuning
        – Just load and query like any database
        – Tables are automatically distributed across nodes
      • Extremely scalable and I/O optimized
        – All nodes can scan and process in parallel
        – No I/O contention between segments
      • Linear scalability by adding nodes
        – Each adds storage, query performance and loading performance
      (Diagram labels: Interconnect, Loading)
  • 16. Greenplum Database Architecture
      • MPP (Massively Parallel Processing) Shared-Nothing Architecture
      • SQL, MapReduce
      • Master Servers: query planning & dispatch
      • Network Interconnect
      • Segment Servers: query processing & data storage
      • External Sources: loading, streaming, etc.
  • 17. Example of Parallel Query Optimization
      Query:
        select c_custkey, c_name,
               sum(l_extendedprice * (1 - l_discount)) as revenue,
               c_acctbal, n_name, c_address, c_phone, c_comment
        from customer, orders, lineitem, nation
        where c_custkey = o_custkey
          and l_orderkey = o_orderkey
          and o_orderdate >= date '1994-08-01'
          and o_orderdate < date '1994-08-01' + interval '3 month'
          and l_returnflag = 'R'
          and c_nationkey = n_nationkey
        group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
        order by revenue desc
      Plan diagram operators: Gather Motion 4:1 (slice 3); Sort; HashAggregate; HashJoin; Redistribute Motion 4:4 (slice 1); Hash; HashJoin; HashJoin; Seq Scan on lineitem; Seq Scan on customer; Hash; Hash; Broadcast Motion 4:4 (slice 2); Seq Scan on orders; Seq Scan on nation
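The plan on slide 17 names two data-movement operators, Redistribute Motion and Broadcast Motion, plus a final Gather Motion. The sketch below illustrates the general idea behind them for a distributed hash join. It is a toy model under stated assumptions, not Greenplum's implementation: the four in-memory "segments", the sample rows and every function name are hypothetical.

```python
def redistribute(segments, key, n):
    """Redistribute motion: re-hash every row to the segment that owns its join key."""
    out = [[] for _ in range(n)]
    for seg in segments:
        for row in seg:
            out[hash(row[key]) % n].append(row)
    return out


def broadcast(segments, n):
    """Broadcast motion: every segment receives a full copy of the table."""
    full = [row for seg in segments for row in seg]
    return [list(full) for _ in range(n)]


def hash_join(left, right, left_key, right_key):
    """Local hash join: build a hash table on `right`, probe it with `left`."""
    table = {}
    for r in right:
        table.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in table.get(l[left_key], [])]


def gather(per_segment_results):
    """Gather motion: collect all segment results at the master."""
    return [row for seg in per_segment_results for row in seg]


N = 4  # number of segments (assumption, matching the 4:4 in the plan labels)
customers = [{"c_custkey": i, "c_nationkey": i % 3} for i in range(10)]
nations = [{"n_nationkey": k, "n_name": name}
           for k, name in enumerate(["CHINA", "FRANCE", "PERU"])]
cust_segs = [customers[i::N] for i in range(N)]    # initial round-robin placement
nation_segs = [nations[i::N] for i in range(N)]

# Strategy 1: broadcast the small table, join against local customer rows.
nation_copies = broadcast(nation_segs, N)
via_broadcast = gather(
    hash_join(cust_segs[i], nation_copies[i], "c_nationkey", "n_nationkey")
    for i in range(N))

# Strategy 2: redistribute both tables on the join key, then join locally.
c2 = redistribute(cust_segs, "c_nationkey", N)
n2 = redistribute(nation_segs, "n_nationkey", N)
via_redistribute = gather(
    hash_join(c2[i], n2[i], "c_nationkey", "n_nationkey") for i in range(N))

by_key = lambda r: r["c_custkey"]
print(sorted(via_broadcast, key=by_key) == sorted(via_redistribute, key=by_key))
```

Both strategies produce the same join result; a cost-based optimizer would typically prefer broadcasting when one input is small (such as `nation`) and redistributing when both inputs are large.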
  • 18. MapReduce
      • Overview
        – Large-scale, massively parallel data access platform
        – Simple data-parallel programming model to express relatively sophisticated distributed programs
        – An associated parallel and distributed implementation for commodity clusters
      • Pioneered by Google
        – Processes 20 PB of data per day
      • Popularized by the open-source Hadoop project
        – Used by Yahoo!, Facebook, Amazon, and the list is growing …
  • 19. Programming Framework
      [Diagram: Raw Input: <key, value> → MAP → <K1, V1>, <K2, V2>, <K3, V3> → REDUCE]
  • 20. MapReduce Example: WordCount
      Map(K, V) {
        For each word w in V
          Collect(w, 1);
      }
      Combine(K, V[ ]) {
        Int count = 0;
        For each v in V
          count += v;
        Collect(K, count);
      }
      Reduce(K, V[ ]) {
        Int count = 0;
        For each v in V
          count += v;
        Collect(K, count);
      }
      [Diagram: the input (size: TByte) is divided into splits; each split passes through map and combine; the reduce tasks write part0, part1 and part2. Example results: Cat 3, Bat 4, Dog 3, other words …]
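      The WordCount pseudocode on this slide can be tried out in a single process. The sketch below is an illustration only and does not use Hadoop: `word_count`, `group_by_key` and the three sample splits are hypothetical, and partitioning by `hash(key) % num_reducers` is an assumed stand-in for the framework's partitioner.

```python
from collections import defaultdict


def map_fn(_key, text):
    """Map: emit (word, 1) for every word in the split."""
    for word in text.split():
        yield word, 1


def combine_fn(word, counts):
    """Combine: pre-aggregate counts on the mapper side."""
    yield word, sum(counts)


def reduce_fn(word, counts):
    """Reduce: sum the partial counts for one word."""
    yield word, sum(counts)


def group_by_key(pairs):
    """Group (key, value) pairs into key -> list of values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def word_count(splits, num_reducers=3):
    # Map + combine phase: one pass per input split.
    partitions = [[] for _ in range(num_reducers)]
    for split_id, text in enumerate(splits):
        for word, ones in group_by_key(map_fn(split_id, text)).items():
            for key, partial in combine_fn(word, ones):
                # Assumed partitioner: route each key to one reducer.
                partitions[hash(key) % num_reducers].append((key, partial))
    # Reduce phase: each partition is processed independently.
    result = {}
    for part in partitions:
        for word, partials in group_by_key(part).items():
            for key, total in reduce_fn(word, partials):
                result[key] = total
    return result


counts = word_count(["Cat Bat Dog", "Bat Cat Bat", "Dog Cat Bat Dog"])
```

      With these three sample splits, `counts` holds Cat 3, Bat 4 and Dog 3. Because summation is associative, the combiner can reuse the reducer's logic and only reduces the number of pairs sent to the reducers.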
  • 21. MapReduce Implementation in Hadoop
      [Diagram: a client submits a job to the master, which assigns map and reduce tasks. Mappers read the input splits (split0 to split4) and write intermediate files to local disk; reducers read those files remotely and write the output files (file0, file1). Stages, left to right: input files, map phase, intermediate files (local disk), reduce phase, output files.]
  • 22. MapReduce Advantages
      • Automatic Parallelization:
          – Depending on the size of RAW INPUT DATA → instantiate multiple MAP tasks
          – Similarly, depending upon the number of intermediate <key, value> partitions → instantiate multiple REDUCE tasks
      • Run-time:
          – Data partitioning
          – Task scheduling
          – Handling machine failures
          – Managing inter-machine communication
      • Completely transparent to the programmer/analyst/user
  • 23. Possible Applications
      • Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc.
          – ETL and “read once” data sets
          – Complex analytics
          – Semi-structured data, key-value pairs
      • At Google and others (Yahoo!, Facebook):
          – Inverted index
          – Graph structure of the Web documents
          – Summaries of #pages/host, set of frequent queries, etc.
          – Ad optimization
          – Spam filtering
  • 24. MapReduce vs Parallel DBMS
      Feature | Parallel DBMS | MapReduce
      Schema Support | ✓ | Not out of the box
      Indexing | ✓ | Not out of the box
      Programming Model | Declarative (SQL) | Imperative (C/C++, Java, …); extensions through Pig and Hive
      Optimizations (Compression, Query Optimization) | ✓ | Not out of the box
      Flexibility | Not out of the box | ✓
      Fault Tolerance | Coarse grained techniques | ✓
  • 25. Further Analysis and Comparison
      • Limitations of some current parallel databases / data warehouses
          – Often use expensive/specialized hardware
          – Difficult to scale to more than 100 nodes
          – Difficult to parallelize data mining applications
              • MPI …
          – Difficult to deal with unstructured data
          – Fault tolerance
              • If one node fails, the whole query restarts
          – Expensive
      • Disadvantages of some MapReduce-based solutions (Hive)
          – A sub-optimal brute force implementation: no indexing, no JOINs
              • Example: find the people whose salary is $10,000
          – Row based storage, Updates?
          – Not SQL/BI tool compatible
          – No support for schema
          – Non-declarative programming model
  • 26. MapReduce Integration in DBMS Context
      • FlexDB: A Cloud-scale Parallel Database Engine based on Hadoop MapReduce (A Research Project)
          – An architectural hybrid of MapReduce and DBMS technologies
          – Uses the fault tolerance and scalability of the MapReduce framework
          – Leverages advanced data processing techniques (e.g., query optimization) of an RDBMS for high performance
          – Exposes a declarative interface to the user
      • Goal: leverage the best of both worlds
  • 27. FlexDB Architecture
  • 28. FlexDB Master
      [Diagram: the FlexDB Master contains the Query Parser, Query Optimizer, Job Generator, Catalog Manager and Job Executor. For the query SELECT * FROM Account WHERE balance > 30, it submits jobs to the MapReduce framework (mappers and a reducer). Each mapper sends the same statement as a subquery to one of the databases, and each database holds a portion of the Account rows (labelled r0…r9, n0…n9, m0…m9).]
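      The data flow on this slide, where every database node runs the same subquery and a reduce step merges the partial results, can be imitated on one machine. The sketch below is an assumption-laden illustration, not FlexDB code: in-memory SQLite databases stand in for the Postgres/Greenplum nodes, plain Python functions stand in for Hadoop map and reduce tasks, and the `Account(id, balance)` schema and sample rows are hypothetical.

```python
import sqlite3

# The subquery from the slide, with the threshold passed as a parameter.
SUBQUERY = "SELECT * FROM Account WHERE balance > ?"


def make_node(rows):
    """Create one stand-in database node holding a slice of Account."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Account (id INTEGER PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO Account VALUES (?, ?)", rows)
    return conn


def map_task(node, threshold):
    """Map side: push the subquery down to a single node's local database."""
    return node.execute(SUBQUERY, (threshold,)).fetchall()


def reduce_task(partial_results):
    """Reduce side: merge the per-node result sets into one answer."""
    return sorted(row for part in partial_results for row in part)


# Hypothetical partitioning of Account across three nodes.
nodes = [
    make_node([(0, 10.0), (1, 45.0)]),
    make_node([(2, 31.0), (3, 30.0)]),
    make_node([(4, 99.5)]),
]

result = reduce_task(map_task(node, 30) for node in nodes)
```

      Here `result` contains the rows with ids 1, 2 and 4; the row with balance exactly 30 is excluded by the strict comparison. The filter runs inside each database, so only qualifying rows leave a node.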
  • 29. Comparison with other systems
      Aspect | FlexDB | Hive | HadoopDB | Traditional parallel database
      Query language | SQL | HQL | SQL (joins not currently supported) | SQL
      Storage | Postgres/Greenplum | HDFS | JDBC compatible | Native OS files
      Optimizer | Cost based (DB/MR paths) | Simple rule based | Simple rule based | Cost based
      Physical storage organization | Column/Row based | Row based | Currently row based | Column/Row based
      Implementation | FlexDB Master + Hadoop + DB | Hive + Hadoop | Hive (rev) + Hadoop + DB | Native
      Efficiency | High | Low | Middle | Very High
      Scale | Large | Large | Large | Middle
      Cost | Low | Low | Low | High
  • 30. Summary
      • New in cloud computing
          – Elasticity/Scalability
          – Resource sharing (multi-tenancy)
          – Focus on failure
      • Data analytics in the cloud: different solutions suit different workloads
          – Parallel DBMSs excel at efficient querying of large data sets
          – MR-style systems excel at complex analytics and ETL tasks
      • Combine MapReduce with a shared-nothing DBMS to produce a system that better fits the cloud computing market
  • 31. Acknowledgements
      • Some slides are adapted from the following references:
          – Divy Agrawal, Sudipto Das, and Amr El Abbadi, “Big Data and Cloud Computing: New Wine or just New Bottles?”, VLDB 2010 Tutorial
          – Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin, “MapReduce and Parallel DBMSs: Friends or Foes?”, Communications of the ACM, 2010
  • 32. EMC Labs China
      Dr. Tao Bo (陶波), Head of EMC Labs China
      Blog: http://blog.sina.com.cn/emclabschina
      Weibo: http://weibo.com/emclabschina
  • 33. THANK YOU