Greenplum - Jacque Istok - Hadoop World 2010
- 1. © Copyright 2010 EMC Corporation. All rights reserved.
RDBMS and Hadoop
A Powerful Combination
Jacque Istok
- 2.
You Know Hadoop, But What Is Greenplum?
EMC/Greenplum is an MPP data warehouse system based on PostgreSQL, with the full capabilities of a traditional RDBMS. Alongside SQL-99 compliance for structured analysis, Greenplum also offers a MapReduce implementation for unstructured analysis. In short:
Greenplum ~ Hadoop/Hive
- 3.
Data in a Typical Enterprise
• Data is everywhere: corporate EDW, 100s of data marts, ‘shadow’ databases, spreadsheets, logs, etc.
• The goal of centralizing all data in a single EDW has proven untenable
EDW: ~10% of data
Data Marts and ‘Personal Databases’: ~90% of data
- 4.
Today’s Big Data Challenges
• The number of data sources and the amount of data to analyze are growing exponentially
• Stale data exists because DW solutions cannot ingest the vast amounts of data fast enough
• Lack of performance for advanced analytics and complex queries
• The number of users and their concurrency are increasing rapidly
• Security and privacy around the data are both preferred and often mandated
- 5.
Architecture of HDFS/Hadoop/Hive
• Hive Server accepts SQL and dynamically generates and executes MapReduce code
• Flexible framework for processing large datasets
• Materialize data subsets to reduce impact of node failure
• DataNode servers process analytics close to the data in parallel
• Process large datasets with support for both SQL and MapReduce
[Diagram labels: NameNode (×2); DataNode, DataNode, DataNode, DataNode, DataNode, …; SQL (subset); Hive; MapReduce]
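The idea that Hive turns SQL into MapReduce jobs can be illustrated with a minimal, pure-Python sketch. This is not Hive's actual code generator. The `EVENTS` rows and the function names `map_phase`, `shuffle`, `reduce_phase`, and `count_by_source` are hypothetical, chosen only to show how a query such as `SELECT source, COUNT(*) FROM events GROUP BY source` could decompose into map, shuffle, and reduce steps.

```python
from collections import defaultdict

# Hypothetical rows of an "events" table: (source, message).
EVENTS = [
    ("firewall", "deny tcp"),
    ("proxy", "GET /index"),
    ("firewall", "allow udp"),
    ("ids", "port scan"),
    ("proxy", "GET /login"),
    ("firewall", "deny icmp"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row; the key is the GROUP BY column."""
    for source, _message in rows:
        yield source, 1


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_source(rows):
    """Run the three phases in order on a single machine."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_source(EVENTS))
```

In a real cluster the map and reduce functions would run on many DataNodes at once; here everything runs in one process so the data flow is easy to follow.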
- 6.
Architecture of Greenplum
• Master servers optimize queries for the most efficient query execution
• MPP Scatter/Gather streaming for fast loading of data
• Flexible framework for processing large datasets
• Interconnect for continuous pipelining of data processing
• Segment servers process queries close to the data in parallel
• Process large datasets with support for both SQL and MapReduce
[Diagram labels: Master (×2); Segment, Segment, Segment, Segment, Segment, …; SQL; MapReduce]
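The division of labor between segments and the master can be sketched in a few lines of Python. This is a simplified illustration of the general MPP pattern (hash-distribute rows, aggregate locally on each segment, merge on the master), not Greenplum's implementation. The segment count, the `(host, nbytes)` row layout, and all function names are assumptions made for the example.

```python
from collections import Counter
from zlib import crc32

NUM_SEGMENTS = 4  # assumed segment count, for illustration only


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return crc32(key.encode("utf-8")) % num_segments


def scatter(rows, num_segments=NUM_SEGMENTS):
    """Distribute rows across segments by the hash of their first column."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[0], num_segments)].append(row)
    return segments


def segment_aggregate(rows):
    """Work done locally on one segment: sum bytes per host."""
    totals = Counter()
    for host, nbytes in rows:
        totals[host] += nbytes
    return totals


def gather(partials):
    """Work done on the master: merge the per-segment partial results."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)


def total_bytes_by_host(rows, num_segments=NUM_SEGMENTS):
    """Scatter rows, aggregate on each segment, then gather on the master."""
    segments = scatter(rows, num_segments)
    return gather(segment_aggregate(seg) for seg in segments)
```

Because rows with the same key always hash to the same segment, each segment can finish its share of the aggregation without consulting the others, and the master only merges small partial results.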
- 8.
Common Real World Implementation
Lots o’ Data
- 9.
A Cyber-Analytics Data Mart Use Case
• Commercial SIEM products struggle with the volumes of data generated in a large enterprise. Non-parallel event processing systems can’t keep up with ingest, user load, etc.
• Greenplum provides the ability to cost-effectively ingest and store large volumes of sensor data.
• Greenplum provides the parallel analytics that support data mining, event correlation, etc., over datasets from TBs to PBs in size.
[Diagram labels: Access and Events; Greenplum Analytics Data Mart; GPLoad; SQL; MapReduce; (Perl); (Python Math Lib); (R); SoR; ETL; ODS; BI]
- 10.
Coexistence Approach – Use Case
[Diagram labels: Compute; Storage; Analytics; Network; General Purpose x86 Cluster of Systems]
• Provides true, complete SQL-compliant analytics
• Data can be read from and written to Hadoop via Greenplum
• Store your data structured or unstructured, column- or row-oriented, compressed, leveraging index support where appropriate
• SQL can be executed, through Greenplum, on data residing within Greenplum as well as data residing within HDFS
• MapReduce can be executed through Greenplum in Java, C, Perl, Python or through Java in Hadoop
• Designed for rapid analysis of data volumes from less than a terabyte scaling into the petabytes
- 11.
Big Data is Complementary to EDW
[Diagram labels: Commodity Hardware; Virtual Machines; Public Cloud; Greenplum]
Enterprise Data Warehouse
• Single Source of Truth
• 1 Logical Model
• Heavy data governance and quality
• Operational Reporting
• Financial Consolidation
MapReduce Analytics Cloud
• Source of all raw data (often 10X size of EDW)
• Self-service infrastructure to support multiple marts and sandboxes
• Rapid analytic iteration and business-owned solutions