Research on Big Data
mysqlops
bigdata,greenplum,flexdb,hadoop,mapreduce
1.
Research on Big Data - FlexDB: A cloud-scale database engine based on Hadoop
Jidong Chen (jidong.chen@emc.com)
Manager, Research Scientist, Big Data Lab
EMC Labs China
Sept. 2011
© Copyright 2011 EMC Corporation. All rights reserved.
2.
Grand Opening Announcement
EMC Labs China is formed from EMC Research China and the Advanced Technology Venture group, both established in 2007 by the Office of the CTO.
3.
EMC Labs China - Vision and Mission
• Vision
  – Become an elite research and advanced technology institute in China
  – Become the model for future EMC Labs worldwide
• Mission areas
  – Advanced Technology Research and Development
  – University Collaboration
  – Industry Standards Office
  – IP Portfolio Development
• Labs
  – Big Data Lab
  – Cloud Infrastructure and System Lab
  – Cloud Platform and Applications Lab
4.
Outline
• Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
  – Parallel DBMS
  – MapReduce
• FlexDB - A cloud-scale database engine based on Hadoop
• Summary
5.
The Digital Universe 2009-2020: Growing by a Factor of 44
• 2009: 0.8 ZB
• 2020: 35.2 ZB (zettabytes)
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
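As a quick check on the headline figure, the ratio of the two IDC estimates quoted above does come out at 44. The implied annual growth rate in the sketch below is a derived number, not something stated on the slide, and it assumes smooth compound growth over the 11 years from 2009 to 2020.

```python
# Figures quoted on the slide (IDC Digital Universe Study, May 2010).
size_2009_zb = 0.8
size_2020_zb = 35.2

growth_factor = size_2020_zb / size_2009_zb

# Assumption (not on the slide): constant compound growth across the period.
years = 2020 - 2009
annual_growth = growth_factor ** (1 / years)

print(f"growth factor: {growth_factor:.0f}x")
print(f"implied annual growth: {(annual_growth - 1) * 100:.0f}% per year")
```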
6.
Big Data is Changing the World
Expanding Data Sources
• Science and research
  – Gene sequences
  – LHC accelerator
  – Earth and space exploration
• Enterprise applications
  – Email, documents, files
  – Application logs
  – Transaction records
• Web 2.0 data
  – Search log / click stream
  – Twitter / Blog / SNS
  – Wiki
• Other unstructured data
  – Video/Movie
  – Graphics
  – Digital widgets
Bigger Challenges
• Scale out automatically
  – Vs. scale up manually
• More capacity and bigger pool
  – E.g., 10 PB in a single file system
• New process capability
  – Loading, analyzing, moving data
  – Intelligence
• Better performance
  – Linear vs. exponential
  – Faster
• Autonomous
  – Less human intervention
  – Lower cost
7.
Research Scopes and Topics in Big Data
• Search and Analytics
  – Search: Entity Search, Faceted Search, Associative Search
  – Analytics: Text Analysis, Activity Modeling and Sequence Analysis, Real-time Data Analysis for Streaming, Parallel Data Mining Algorithms
• MPP Databases and Data Services
  – Parallel Database: Parallel Query Optimization, Data Partitioning and Replication, Distributed Transaction
  – In-memory Database: Cache, Recovery, Consistency
  – Database as a Service: Multi-tenant Data Management, Auto-Administration
• Hadoop/NoSQL
  – Hadoop: Single-node Failure, Performance, Real-time MapReduce Scheduler and Fault Tolerance
  – NoSQL: Key-Value Store, Document Store, Graph Data Store
8.
Project Overview
• Hadoop/NoSQL
  – vHadoop - joint project with VMware
    • Parallel SAN file system for DISC on virtualized platform
  – Online MapReduce for Real-time Data Analytics
    • Pipelined task execution, Group task scheduling, Enhanced fault tolerance
    • Parallel Data Mining
  – FlexDB: Cloud-scale Parallel Database for OLAP
    • MapReduce integration into DBMS, Parallel query execution, Cost-based query optimization
  – Cloud-scale Parallel Database for OLTP
    • Intelligent database sharding and resharding
    • Active-active (eager) replication with group communication service
    • Multiple masters with elastic distributed coordination
9.
Cloud Databases
• Two largest components of the data management market
  – Transactional Data Management
    • Banks, airline reservation, online e-commerce
    • ACID, write-intensive
  – Analytical Data Management
    • Business planning, decision support
    • Query-intensive
• Challenges of data management in the Cloud
  – Scalability
  – Fault Tolerance
  – Availability & Consistency
  – Transaction Management
  – Flexible Schemas
10.
Cloud Databases
• Data analytics in the cloud
  – Parallel DBMS
  – MapReduce
• Transactional data management in the cloud
  – NoSQL Store
  – SQL Database
• Cloud data services (Database as a Service)
  – Multi-tenant data management
  – Auto-administration
11.
Commercial Landscape: Major Players
• Amazon EC2
  – IaaS abstraction
  – Data management using S3 and SimpleDB
• Microsoft Azure
  – PaaS abstraction
  – Relational engine (SQL Azure)
• Google AppEngine
  – PaaS abstraction
  – Data management using Google MegaStore
12.
Data Analytics in the Cloud
• Scalability to large data volumes:
  – Scan 100 TB on 1 node @ 50 MB/sec = 23 days
  – Scan on 1000-node cluster = 33 minutes
  – Divide-and-conquer (i.e., data partitioning)
• Cost-efficiency:
  – Commodity nodes (cheap, but unreliable)
  – Commodity network
  – Automatic fault-tolerance (fewer admins)
  – Easy to use (fewer programmers)
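The scan-time figures above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), data split perfectly evenly across nodes, and no coordination or network overhead. The function name `scan_seconds` is made up for the illustration.

```python
TB = 10**12  # decimal terabyte, in bytes (assumption)
MB = 10**6   # decimal megabyte, in bytes (assumption)

def scan_seconds(data_bytes, nodes, mb_per_sec_per_node=50):
    """Time to scan data split evenly over `nodes`, each reading at a fixed rate."""
    per_node_bytes = data_bytes / nodes
    return per_node_bytes / (mb_per_sec_per_node * MB)

single = scan_seconds(100 * TB, nodes=1)
cluster = scan_seconds(100 * TB, nodes=1000)

print(f"1 node: {single / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Under these idealized assumptions the speed-up is exactly linear in the number of nodes; real clusters lose some of it to stragglers and data movement.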
13.
Solutions for Large-scale Data Analysis
• Parallel DBMS technologies
  – Proposed in the late eighties
  – Matured over the last two decades
  – Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
• MapReduce
  – Pioneered by Google
  – Popularized by Yahoo! (Hadoop)
14.
Parallel DBMS technologies
• Widely used for more than two decades
  – Research projects: Gamma, Grace, …
  – Commercial: Teradata, Greenplum (acquired by EMC), Netezza (acquired by IBM), DATAllegro (acquired by Microsoft), Vertica (acquired by HP), Aster Data (acquired by Teradata)
• Shared-nothing clusters of nodes
• Relational data model
• Indexing
• Familiar SQL interface
• Parallel query execution
  – Horizontal partitioning of relational tables with partitioned execution of SQL queries
• Advanced query optimization
• Well understood and studied
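Horizontal partitioning with partitioned execution can be sketched as follows: rows are assigned to segments by hashing a distribution key, each segment computes a local aggregate, and the partial results are merged. The table contents, the column names, and the use of `zlib.crc32` as the hash are assumptions chosen for the illustration, not details of any particular product.

```python
import zlib
from collections import defaultdict

def segment_for(key, num_segments):
    """Pick a segment with a stable hash of the distribution key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def partition(rows, key, num_segments):
    """Horizontally partition rows across segments by hashing `key`."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key], num_segments)].append(row)
    return segments

def local_sum_by(rows, group_key, value_key):
    """The per-segment part of: SELECT group_key, SUM(value_key) ... GROUP BY group_key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)

def merge_partials(partials):
    """Combine the per-segment partial sums into the final answer."""
    merged = defaultdict(float)
    for partial in partials:
        for group, total in partial.items():
            merged[group] += total
    return dict(merged)

# Hypothetical sample data.
orders = [
    {"custkey": 1, "price": 10.0},
    {"custkey": 2, "price": 5.0},
    {"custkey": 1, "price": 25.0},
    {"custkey": 3, "price": 7.5},
]

segments = partition(orders, "custkey", num_segments=4)
partials = [local_sum_by(seg, "custkey", "price") for seg in segments]
result = merge_partials(partials)
print(result)
```

Because the grouping column is also the distribution key here, every group lives on exactly one segment and the merge step never has to add two partial sums for the same customer.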
15.
Greenplum: A Shared-nothing Parallel DBMS
• Greenplum's MPP Database has extreme scalability
  – Optimized for BI and analytics
  – Fault-tolerant reliability and optimized performance using commodity CPUs, disks and networking
• Provides automatic parallelization
  – No need for manual partitioning or tuning
  – Just load and query like any database
  – Tables are automatically distributed across nodes
• Extremely scalable and I/O optimized
  – All nodes can scan and process in parallel
  – No I/O contention between segments
• Linear scalability by adding nodes
  – Each adds storage, query performance and loading performance
(Diagram labels: Interconnect, Loading)
16.
Greenplum Database Architecture
MPP (Massively Parallel Processing), Shared-Nothing Architecture
• Interfaces: SQL, MapReduce
• Master Servers: query planning & dispatch
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.
17.
Example of Parallel Query Optimization
Query:
  select c_custkey, c_name,
         sum(l_extendedprice * (1 - l_discount)) as revenue,
         c_acctbal, n_name, c_address, c_phone, c_comment
  from customer, orders, lineitem, nation
  where c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate >= date '1994-08-01'
    and o_orderdate < date '1994-08-01' + interval '3 month'
    and l_returnflag = 'R'
    and c_nationkey = n_nationkey
  group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
  order by revenue desc
Plan operators shown in the diagram:
• Gather Motion 4:1 (slice 3)
• Sort
• HashAggregate
• HashJoin (three) and Hash (three)
• Redistribute Motion 4:4 (slice 1)
• Broadcast Motion 4:4 (slice 2)
• Seq Scan on lineitem, customer, orders, nation
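The three motion operators named in the plan move rows between segments in different ways. The toy model below illustrates the idea only and is not Greenplum's implementation: segments are plain Python lists, the sample rows are invented, and the helper names (`redistribute`, `broadcast`, `gather`, `hash_join`) are hypothetical.

```python
NUM_SEGMENTS = 4

def redistribute(segments, key):
    """Redistribute Motion N:N: rehash every row so equal keys meet on one segment."""
    out = [[] for _ in segments]
    for seg in segments:
        for row in seg:
            out[hash(row[key]) % len(segments)].append(row)
    return out

def broadcast(segments):
    """Broadcast Motion N:N: every segment receives a full copy of the table."""
    all_rows = [row for seg in segments for row in seg]
    return [list(all_rows) for _ in segments]

def gather(segments):
    """Gather Motion N:1: collect the rows of all segments in one place."""
    return [row for seg in segments for row in seg]

def hash_join(left, right, left_key, right_key):
    """Local hash join: build a hash table on `right`, probe it with `left`."""
    table = {}
    for r in right:
        table.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in table.get(l[left_key], [])]

# Invented sample data, initially spread round-robin over the segments.
customers = [{"c_custkey": k, "c_nationkey": k % 2} for k in range(8)]
nations = [{"n_nationkey": 0, "n_name": "A"}, {"n_nationkey": 1, "n_name": "B"}]
customer_segs = [customers[i::NUM_SEGMENTS] for i in range(NUM_SEGMENTS)]
nation_segs = [nations[i::NUM_SEGMENTS] for i in range(NUM_SEGMENTS)]

# Small table: broadcast it so each segment can join its local customers.
nation_everywhere = broadcast(nation_segs)
joined_segs = [
    hash_join(c, n, "c_nationkey", "n_nationkey")
    for c, n in zip(customer_segs, nation_everywhere)
]
result = gather(joined_segs)

# Large table: redistribute on the join key instead of copying it everywhere.
redistributed = redistribute(customer_segs, "c_custkey")
print(len(result), [len(seg) for seg in redistributed])
```

Broadcasting suits a small table such as nation, while redistribution avoids copying a large table to every segment.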
18.
MapReduce
• Overview
  – Large-scale, massively parallel data access platform
  – Simple data-parallel programming model to express relatively sophisticated distributed programs
  – An associated parallel and distributed implementation for commodity clusters
• Pioneered by Google
  – Processes 20 PB of data per day
• Popularized by open-source Hadoop project
  – Used by Yahoo!, Facebook, Amazon, and the list is growing …
19.
Programming Framework
Raw Input: <key, value> → MAP → <K1, V1>, <K2, V2>, <K3, V3> → REDUCE
20.
MapReduce Example: WordCount
Map(K, V) {
  For each word w in V
    Collect(w, 1);
}
Combine(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}
Reduce(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}
Data flow in the diagram: input (size: TByte) → split → map → combine → reduce → part0, part1, part2
Example output: Cat 3, Bat 4, Dog 3, other words …
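The pseudocode above can be turned into a small single-process simulation. This is a sketch and not Hadoop code: the driver `run_wordcount`, the helper `group_by_key`, and the three input splits are assumptions made up for the example. The splits are chosen so that the totals match the counts shown in the slide diagram.

```python
from collections import defaultdict

def map_fn(_key, text):
    """Map: emit (word, 1) for every word in the input value."""
    for word in text.split():
        yield word, 1

def combine_fn(key, values):
    """Combine: pre-aggregate the counts produced by one mapper."""
    yield key, sum(values)

def reduce_fn(key, values):
    """Reduce: add up all partial counts for one word."""
    yield key, sum(values)

def group_by_key(pairs):
    """Collect values per key, standing in for the shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run_wordcount(splits):
    combined = []
    # Map + combine run independently per input split.
    for split_id, text in enumerate(splits):
        for key, values in group_by_key(map_fn(split_id, text)).items():
            combined.extend(combine_fn(key, values))
    # Reduce sees every partial count for a given word.
    result = {}
    for key, values in group_by_key(combined).items():
        for word, count in reduce_fn(key, values):
            result[word] = count
    return result

splits = ["Cat Bat Dog Cat", "Bat Dog Bat", "Dog Cat Bat"]
counts = run_wordcount(splits)
print(counts)
```

The combiner is the same summation as the reducer, applied early on each mapper's output. This works because addition is associative and commutative, and it reduces the data sent over the network.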
21.
MapReduce Implementation in Hadoop
Diagram: the client submits a job to the master, which assigns map tasks and reduce tasks.
• Mappers read the input splits (split0 to split4) and write intermediate files to local disk (local write)
• Reducers fetch the intermediate files (remote read) and write the output files (file0, file1)
• Stages: input files → map phase → intermediate files (local disk) → reduce phase → output files
22.
MapReduce Advantages
• Automatic parallelization:
  – Depending on the size of the raw input data, instantiate multiple MAP tasks
  – Similarly, depending on the number of intermediate <key, value> partitions, instantiate multiple REDUCE tasks
• Run-time:
  – Data partitioning
  – Task scheduling
  – Handling machine failures
  – Managing inter-machine communication
• Completely transparent to the programmer/analyst/user
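How input size and intermediate partitions translate into task counts can be sketched as below. The 64 MiB split size and the CRC32-based partitioner are assumptions for the illustration. Hadoop's actual defaults and hash function depend on version and configuration.

```python
import math
import zlib

MiB = 1024 * 1024

def num_map_tasks(input_bytes, split_bytes=64 * MiB):
    """One map task per input split; the split size is an assumed default."""
    return max(1, math.ceil(input_bytes / split_bytes))

def partition_for(key, num_reducers):
    """Route an intermediate key to a reduce task with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

one_gib = 1024 * MiB
print(num_map_tasks(one_gib))           # map tasks for a 1 GiB input
print(partition_for("Cat", 3))          # reduce partition for the key "Cat"
```

Since every occurrence of a key hashes to the same partition, a single reduce task sees all values for that key, which is what makes the per-key aggregation correct.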
23.
Possible Applications
• Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc.
  – ETL and "read once" data sets
  – Complex analytics
  – Semi-structured data, key-value pairs
• At Google and others (Yahoo!, Facebook):
  – Inverted index
  – Graph structure of the Web documents
  – Summaries of #pages/host, set of frequent queries, etc.
  – Ad optimization
  – Spam filtering
24.
Map Reduce vs Parallel DBMS
Feature | Parallel DBMS | MapReduce
Schema Support | Yes | Not out of the box
Indexing | Yes | Not out of the box
Programming Model | Declarative (SQL) | Imperative (C/C++, Java, …); extensions through Pig and Hive
Optimizations (Compression, Query Optimization) | Yes | Not out of the box
Flexibility | Not out of the box | Yes
Fault Tolerance | Coarse-grained techniques | Yes
25.
Further Analysis and Comparison
• Limitations of some current parallel databases / data warehouses
  – Often use expensive/specialized hardware
  – Difficult to scale to more than 100 nodes
  – Difficult to parallelize data mining applications
    • MPI …
  – Difficult to deal with unstructured data
  – Fault tolerance
    • If one node fails, the whole query is restarted
  – Expensive
• Disadvantages of some MapReduce-based solutions (Hive)
  – A sub-optimal brute-force implementation: no indexing, no JOINs
    • E.g., find the employees whose salary is $10,000
  – Row-based storage; updates?
  – Not SQL/BI tool compatible
  – No support for schema
  – Non-declarative programming model
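The salary example contrasts a brute-force scan with an index lookup. The sketch below makes the difference concrete by counting how many rows each approach touches. The data set is synthetic, and `build_index` is a plain in-memory dictionary standing in for a database index.

```python
from collections import defaultdict

# Synthetic data: 1000 employees, salaries 5000, 6000, ..., 14000.
employees = [
    {"name": f"emp{i}", "salary": 5000 + (i % 10) * 1000} for i in range(1000)
]

def full_scan(rows, salary):
    """Brute force: every row is examined, as in a scan without an index."""
    examined, hits = 0, []
    for row in rows:
        examined += 1
        if row["salary"] == salary:
            hits.append(row)
    return hits, examined

def build_index(rows, column):
    """Build a hash index once: column value -> matching rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row)
    return index

def index_lookup(index, salary):
    """With an index, only the matching rows are touched."""
    hits = index.get(salary, [])
    return hits, len(hits)

scan_hits, scan_examined = full_scan(employees, 10000)
salary_index = build_index(employees, "salary")
index_hits, index_examined = index_lookup(salary_index, 10000)
print(scan_examined, index_examined)
```

The index costs one pass to build and extra memory, but every later lookup touches only the matching rows instead of the whole table.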
26.
MapReduce Integration in DBMS Context
• FlexDB - A Cloud-scale Parallel Database Engine based on Hadoop MapReduce (a research project)
  – An architectural hybrid of MapReduce and DBMS technologies
  – Use the fault tolerance and scalability of the MapReduce framework
  – Leverage advanced data processing techniques (e.g., query optimization) of an RDBMS for high performance
  – Expose a declarative interface to the user
• Goal: leverage the best of both worlds
27.
FlexDB Architecture
28.
FlexDB Master
• Components: Query Parser, Query Optimizer, Job Generator, Catalog Manager, Job Executor
• Example query: SELECT * FROM Account WHERE balance > 30
• The Job Executor submits jobs to the MapReduce framework (Mappers and a Reducer)
• Each Mapper runs the same subquery, SELECT * FROM Account WHERE balance > 30, against a Database
• The Account rows (r0 n0 m0 through r9 n9 m9) are spread across the underlying Databases
• The Account relation shown next to the Reducer lists rows r0 n0 m0 through r7 n7 m7
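The flow on this slide, where one SQL query is pushed down as the same subquery to every database partition and the results are merged, can be imitated in a few lines. This is an illustrative sketch and not FlexDB code: in-memory SQLite databases stand in for the Postgres/Greenplum nodes, the column names `id` and `owner` are invented, and `map_task` / `reduce_task` are hypothetical helpers that only mimic the roles of the Mappers and the Reducer.

```python
import sqlite3

# The subquery each mapper runs against its local database partition.
SUBQUERY = "SELECT id, owner, balance FROM Account WHERE balance > ?"

def make_partition(rows):
    """Create one in-memory database holding a slice of the Account table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Account (id INTEGER, owner TEXT, balance REAL)")
    db.executemany("INSERT INTO Account VALUES (?, ?, ?)", rows)
    return db

def map_task(db, threshold):
    """Mapper role: run the pushed-down subquery on the local database."""
    return db.execute(SUBQUERY, (threshold,)).fetchall()

def reduce_task(partials):
    """Reducer role: merge the per-partition result sets."""
    return sorted(row for part in partials for row in part)

# Ten hypothetical accounts with balances 0, 10, ..., 90, two per partition.
accounts = [(i, f"owner{i}", float(i * 10)) for i in range(10)]
partitions = [make_partition(accounts[i:i + 2]) for i in range(0, 10, 2)]

partials = [map_task(db, 30) for db in partitions]
result = reduce_task(partials)
print([row[0] for row in result])
```

Pushing the filter into each local database means only qualifying rows leave a node, and each node's own query engine can do the filtering.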
29.
Comparison with other systems
Aspect | FlexDB | Hive | HadoopDB | Traditional parallel database
Query Language | SQL | HQL | SQL (join not currently supported) | SQL
Storage | Postgres/Greenplum | HDFS | JDBC compatible | Native OS files
Optimizer | Cost based (DB/MR paths) | Simple rule based | Simple rule based | Cost based
Physical storage organization | Column/Row based | Row based | Currently Row based | Column/Row based
Implementation | FlexDB Master + Hadoop + DB | Hive + Hadoop | Hive (rev) + Hadoop + DB | Native
Efficiency | High | Low | Middle | Very High
Scale | Large | Large | Large | Middle
Cost | Low | Low | Low | High
30.
Summary
• New in cloud computing
  – Elasticity/Scalability
  – Resource sharing (multi-tenancy)
  – Focus on failure
• Data analytics in the cloud: different solutions suit different workloads
  – Parallel DBMSs excel at efficient querying of large data sets
  – MR-style systems excel at complex analytics and ETL tasks
• Combine MapReduce with a shared-nothing DBMS to produce a system that better fits the cloud computing market
31.
Acknowledgements
• Some slides are adapted from the following references:
  – Divy Agrawal, Sudipto Das, and Amr El Abbadi, "Big Data and Cloud Computing: New Wine or just New Bottles?", VLDB 2010 Tutorial
  – Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?", Communications of the ACM, 2010
32.
EMC Labs China (易安信中国研究院)
Dr. Tao Bo (陶波), Director, EMC Labs China
Blog: http://blog.sina.com.cn/emclabschina
Weibo: http://weibo.com/emclabschina
33.
THANK YOU