© Copyright 2010 EMC Corporation. All rights reserved.
RDBMS and Hadoop
A Powerful Combination
Jacque Istok
You Know Hadoop, But What Is Greenplum?
EMC/Greenplum is an MPP data warehouse system, based on PostgreSQL, with the full capabilities of a traditional RDBMS. Alongside SQL-99 compliance for structured analysis, Greenplum also offers a MapReduce implementation for unstructured analysis. In short:
Greenplum ~ Hadoop/Hive
Data in a Typical Enterprise
• Data is everywhere: corporate EDW, hundreds of data marts, ‘shadow’ databases, spreadsheets, logs, etc.
• The goal of centralizing all data in a single EDW has proven untenable
[Figure: EDW, ~10% of data; Data Marts and ‘Personal Databases’, ~90% of data]
Today’s Big Data Challenges
• The sources of data and the amount of data to analyze are growing exponentially
• Data goes stale because DW solutions cannot ingest the vast amounts of data fast enough
• Performance is lacking for advanced analytics and complex queries
• The number of users and their concurrency are increasing rapidly
• Security and privacy around the data are both preferred and often mandated
Architecture of HDFS/Hadoop/Hive
• Hive Server accepts SQL and dynamically generates and executes MapReduce code
• Flexible framework for processing large datasets
• Materialize data subsets to reduce impact of node failure
• DataNode servers process analytics close to the data in parallel
• Process large datasets with support for both SQL and MapReduce
[Diagram labels: SQL (subset), Hive, MapReduce, NameNode, DataNode, DataNode, …]
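To make the first callout concrete, here is a minimal sketch of the kind of map, shuffle, and reduce steps that a query such as `SELECT source, COUNT(*) FROM events GROUP BY source` could be broken into. It is an illustration only, not the code Hive actually generates; the `events` rows, the `source` column, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit a (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["source"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, giving COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_source(rows):
    """Run the three phases in sequence on an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))


# Hypothetical sample data standing in for a large table.
events = [{"source": "firewall"}, {"source": "proxy"}, {"source": "firewall"}]
print(count_by_source(events))
```

In a real cluster the map and reduce functions run on many nodes at once and the shuffle moves data over the network; the single-process version above only shows the data flow.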
Architecture of Greenplum
• Master servers optimize queries for the most efficient query execution
• MPP Scatter/Gather streaming for fast loading of data
• Flexible framework for processing large datasets
• Interconnect for continuous pipelining of data processing
• Segment servers process queries close to the data in parallel
• Process large datasets with support for both SQL and MapReduce
[Diagram labels: SQL, MapReduce, Master, Segment, Segment, …]
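The master/segment split described on this slide can be sketched roughly as follows: rows are spread across segments by hashing a distribution key, each segment aggregates its own rows, and the master merges the partial results. This is a simplified illustration under stated assumptions, not Greenplum's implementation; the segment count, the CRC32-based hash, and every function name here are hypothetical.

```python
import zlib
from collections import Counter

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key):
    """Pick a segment with a deterministic hash of the distribution key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


def scatter(rows, key):
    """Distribute rows so that equal keys always land on the same segment."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row[key])].append(row)
    return segments


def segment_aggregate(rows, key):
    """Work done locally on one segment: count rows per key."""
    return Counter(row[key] for row in rows)


def gather(partials):
    """Work done on the master: merge the per-segment partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def mpp_count(rows, key):
    """Scatter rows, aggregate per segment, then gather on the master."""
    return gather(segment_aggregate(seg, key) for seg in scatter(rows, key))


# Hypothetical sample data.
logins = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
print(mpp_count(logins, "user"))
```

Because rows with the same key share a segment, each segment's partial count for that key is already complete, which is why the gather step only needs to merge.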
RDBMS Advantages
Common Real World Implementation
Lots ‘O Data
A Cyber-Analytics Data Mart Use Case
• Commercial SIEM products struggle with the volumes of data generated in a large enterprise. Non-parallel event processing systems can’t keep up with ingest, user load, etc.
• Greenplum provides the ability to cost-effectively ingest and store large volumes of sensor data.
• Greenplum provides the parallel analytics that support data mining, event correlation, etc., over datasets ranging from terabytes to petabytes in size.
[Diagram labels: Access and Events; Greenplum Analytics Data Mart; GPLoad; SQL; MapReduce; (Perl); (Python Math Lib); (R); SoR; ETL; ODS; BI]
Coexistence Approach – Use Case
[Diagram labels: Compute; Storage; Analytics; Network; General Purpose X86 Cluster of Systems]
• Provides true, complete SQL-compliant analytics
• Data can be read from and written to Hadoop via Greenplum
• Store your data structured or unstructured, column- or row-oriented, and compressed, leveraging index support where appropriate
• SQL can be executed, through Greenplum, on data residing within Greenplum as well as data residing within HDFS
• MapReduce can be executed through Greenplum in Java, C, Perl, or Python, or through Java in Hadoop
• Designed for rapid analysis of data volumes from less than a terabyte, scaling into the petabytes
Big Data is Complementary to EDW
[Diagram labels: Commodity Hardware; Virtual Machines; Public Cloud; Greenplum]
Enterprise Data Warehouse
• Single Source of Truth
• 1 Logical Model
• Heavy data governance and quality
• Operational Reporting
• Financial Consolidation
MapReduce Analytics Cloud
• Source of all raw data (often 10X the size of the EDW)
• Self-service infrastructure to support multiple marts and sandboxes
• Rapid analytic iteration and business-owned solutions
Greenplum - Jacque Istok - Hadoop World 2010
