This document discusses building big data analytics platforms and infrastructure using Supermicro, Greenplum, and SAS. Its agenda covers big data analytics platforms and infrastructure, as well as a 1,000-node Hadoop cluster built by EMC and Supermicro. It then describes Greenplum's data computing appliances and how Greenplum became the foundation of EMC's data computing division. It also gives an overview of SAS and shows how the big data analytics "stack" is built from analytic toolsets, Greenplum Chorus, Greenplum data computing appliances, Greenplum Database, Greenplum HD, and SAS.
3. THE ERA OF BIG DATA IS HERE…
“Big Data Is Less About Size, And More About Freedom” (TechCrunch)
“Findings: ‘Big Data’ Is More Extreme Than Volume” (Gartner)
“Total data: ‘bigger’ than big data” (451 Group)
“Big Data! It’s Real, It’s Real-time, and It’s Already Changing Your World” (IDB)
4. Data Sources Are Expanding
THE DIGITAL UNIVERSE WILL GROW 44X IN THE NEXT 10 YEARS
Source: 2011 IDC Digital Universe Study
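To put the 44x figure in yearly terms, the short sketch below converts a total growth multiple into the constant annual rate that would produce it. The function name is illustrative and the assumption of steady year-over-year growth is ours, not IDC's.

```python
def implied_annual_growth(multiple: float, years: int) -> float:
    """Constant yearly growth rate that compounds to `multiple` after `years`."""
    return multiple ** (1.0 / years) - 1.0


# 44x over 10 years, assuming the same growth rate every year (our assumption).
rate = implied_annual_growth(44, 10)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption, 44x in ten years works out to roughly 46% growth per year.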
5. BIG Data is Just a Bunch of Data to Store…? OR
[Chart: Big Data Sources, 2009-2014; vertical axis 0 to 90, units not shown]
File Based: 60.7% CAGR; Block Based: 21.8% CAGR
By 2012, 80% of all storage capacity sold will be for file-based data
Source: IDC
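As a rough illustration of what the two CAGR figures imply, the sketch below compounds each rate over the five yearly steps of the chart's 2009-2014 axis. Treating the CAGRs as applying to exactly that span is an assumption; the slide does not state the period.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Total growth factor after compounding `cagr` for `years` years."""
    return (1.0 + cagr) ** years


YEARS = 5  # 2009 -> 2014, assumed from the chart axis

file_multiple = growth_multiple(0.607, YEARS)   # file-based storage
block_multiple = growth_multiple(0.218, YEARS)  # block-based storage

print(f"File-based:  {file_multiple:.1f}x")
print(f"Block-based: {block_multiple:.1f}x")
```

Under these assumptions file-based capacity grows by roughly 10.7x over the period, against roughly 2.7x for block-based capacity.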
7. Make BIG Data Accessible
Identify the data source
Store the data
Connect applications and users
Utilize the data in different views
8. EMC UAP Solutions – Analytics Platform
“This is what my analytics environment looks like…”
9. Building The Big Data Analytics “Stack”
Analytic Toolsets (Business Analytics, BI, Statistics, etc.)
Greenplum Chorus: Enterprise Collaboration Platform for Data
Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics
Greenplum Database: Enterprise & Community Editions; World’s Most Scalable MPP Database Platform
Greenplum HD: Hadoop Enterprise & Community Editions; Enterprise Analytics Platform for Unstructured Data
10. Greenplum Becomes the Foundation of EMC’s Data Computing Division
EMC ACQUIRES GREENPLUM IN JULY 2010
“For three years, Gartner has identified Greenplum as the most advanced vendor in the visionary quadrant of its data warehouse DBMS Magic Quadrant….”
- Gartner
12. SAS at a Glance
Company Highlight:
• Founded 1976; 11,000+ employees in 400+ offices
• 2010 worldwide revenue: $2.43 B
• IDC: SAS is the leader in Analytics with a 34.5% market share (Analytics and Reporting)
• 4.5 million users worldwide
• 50,000+ sites in 114 countries
• From Tools to Vertical Solutions
[Pie chart: share by industry]
Financial Services 42%; Government 14%; Healthcare & Life Sciences 8%; Communications 8%; Manufacturing 6%; Education 3%; Energy & Utilities 2%
Services, Retail, Other: 11%, 4%, 2% (label-to-value pairing unclear in the extracted chart)
13. Overview
Locations: SMC Inc. (HQ), San Jose, CA; SMC BV, The Netherlands; SMC TW, Taiwan
Founded in 1993; HQ: San Jose, CA; listed on NASDAQ (SMCI) in 2007
Revenues: FY09 $500M, FY10 $721M, FY11 ~$1B
Global Footprint: >100 Countries
Production: facilities in the US, EU, and Asia
Engineering: 70% of workforce in engineering (30% growth through recession)
Market Share: #1 Server Channel (SMCI enables ~10% of global server market)
Brand Equity: Growing public profile since 2007 IPO
Corporate Focus: Energy Efficiency, Earth-friendly, Green Technology Innovation
14. Product Family
• Resource Optimized (WIO/UIO)
• Twin Architecture
• GPU SuperComputing
• Data Center Optimized
• Embedded
• Application Optimized: Multi I/O
• SuperBlade
• Workstation
• Mainstream Business Solutions
• Storage Server
15. In-House Design and Server Building Block Solutions®
[Diagram: Technology Partners; In-House Design / Server Building Block Solutions®; Customer Requirements]
Customer Requirements: Application Optimized; OEM Specs; Tri-Lab; Optimized Data Center
Building blocks (1): >550 Motherboards; >1300 Chassis; >140 Power Supplies; >350 Cooling Modules; Open CPU/Memory; Operating Systems / Applications
(1) As of Q2, 2009
16. Big Data Analytics on Hadoop
Internet companies are not built on SQL but are building Analytics on Hadoop/NoSQL
Existing Hadoop Users (Internet): “This is what I think my analytics environment looks like…”
[Diagram, top to bottom]
ETL Tools | BI & Reporting | Web Apps
Management & Coordination
Pig | Hive | HBase
Hadoop System MapReduce Layer
Hadoop Storage
Data sources: Web Portal, Social Networks
17. Hadoop Components (hadoop.apache.org)
HDFS • Hadoop Distributed File System
MapReduce • Framework for writing scalable data applications
Pig • Procedural language that abstracts lower-level MapReduce
ZooKeeper • Highly reliable distributed coordination
Hive • Data warehouse infrastructure built on top of Hadoop
HBase • Database for random, real-time read/write access
Oozie • Workflow/coordination to manage jobs
Mahout • Scalable machine learning libraries
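For readers new to the MapReduce model listed above, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It is plain Python, not the Hadoop API; the function names are illustrative, and a real job would run these stages in parallel across the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data is here", "big data is real"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both stages can be spread over many nodes, which is the property the framework exploits.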
18. What can Hadoop do for you?
Financial Services
• Better knowing customers
• Risk analysis and management
• Fraud detection and security analytics
Web & e-Tailing
• Web usage, click stream behavior
• Market & customer segmentation
• Ad customer targeting
• On-line fraud detection
Telecommunications
• Customer churn prevention
• Price optimization and marketing
• Network analysis and optimization
• Customer experience management
Government
• Fraud detection
• Compliance and regulatory analytics
Retail
• Market and consumer segmentation
• Merchandizing and cross-selling
• Promotion and campaign analysis
Healthcare
• Patient care quality
• Drug development
Data Source: Cloudera
19. Hadoop Use Cases
LinkedIn – “People You May Know” and other facts
Yahoo! – Hadoop to support AdSystems and web search
Visa – Credit card fraud detection and analysis
T-Mobile – Churn analysis, user experience
Amazon, Baidu, AOL, eBay, Facebook, Twitter, …
Data Source: Cloudera
20. Hadoop Cluster HW selection
What’s the HW configuration for Hadoop clusters?... It depends, workloads matter.
CPU Intensive: Machine learning; Natural language processing; Complex data mining; Feature extraction
I/O Intensive: Data importing and exporting; Indexing; Searching; Grouping; Decoding/decompressing
Data Storage: Capacity; # of data mirroring
TCO: Rack space; Power consumption
Different workloads
General Configuration: 2 Quad Core CPUs; 16-96GB Memory; 2 x GE; 1TB-2TB Disk x n; 1U/2U Rack mount
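The "Data Storage" items on this slide (capacity, number of data mirrors, and "Disk x n") combine into a simple sizing calculation, sketched below. The function and its parameters are hypothetical, and the example inputs (100 TB of data, 3 copies, 12 disks of 2 TB per node) are illustrative values rather than figures from the deck; the sketch also ignores headroom for temporary data and the operating system.

```python
import math


def data_nodes_needed(dataset_tb: float, replication: int,
                      disks_per_node: int, disk_tb: float) -> int:
    """Estimate how many data nodes are needed to hold a replicated dataset."""
    raw_tb = dataset_tb * replication        # every block is stored `replication` times
    per_node_tb = disks_per_node * disk_tb   # raw disk capacity of one node
    return math.ceil(raw_tb / per_node_tb)


# Illustrative inputs only: 100 TB of data, 3 copies, 12 x 2 TB disks per node.
nodes = data_nodes_needed(dataset_tb=100, replication=3,
                          disks_per_node=12, disk_tb=2)
print(nodes)
```

With these inputs the raw requirement is 300 TB against 24 TB per node, so 13 nodes. Lowering the number of mirrors or using denser nodes changes the count directly, which is why the slide ties storage choices to TCO.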
21. Proven at Scale with Worldwide Support
Production-scale testing of Apache Trunk & hosted environment for customer POCs
Industry’s largest Hadoop support team
Industry’s most accomplished Hadoop talents (from Yahoo!, LinkedIn, Talend, etc.)
Tested at scale on the Greenplum Analytics Workbench
• 1,000-node, 24-petabyte cluster
• Multi-million dollar investment by EMC and partners
Reduced risk for EMC customers
Bringing Rapid Innovation to Hadoop
Certification of partner products
22. Supermicro Server Functions in the Cluster
Supermicro Data Nodes: 2U Storage Server
Supermicro Infrastructure Nodes: 2U Twin2 Server
• 1,000+ Physical Supermicro Server Nodes (10k virtual nodes)
• 12,000 Processor Cores
• 24 Petabytes of Storage Capacity (6Gbps SATA)
• 48 Terabytes RAM
• 56 Gbps InfiniBand Connectivity
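Dividing the cluster totals by the node count gives a feel for the average node. The sketch below assumes exactly 1,000 nodes, decimal units (1 PB = 1,000 TB), and resources spread evenly across all nodes; none of these is stated in the slides, and the 2 TB disk size is borrowed from the upper end of the general configuration on slide 20.

```python
NODES = 1_000                   # assumed: exactly 1,000 physical nodes
total_cores = 12_000
total_storage_tb = 24 * 1_000   # 24 PB expressed in decimal TB
total_ram_gb = 48 * 1_000       # 48 TB expressed in decimal GB

cores_per_node = total_cores / NODES
storage_per_node_tb = total_storage_tb / NODES
ram_per_node_gb = total_ram_gb / NODES

# Assumed 2 TB disks (upper end of the "1TB-2TB Disk x n" range on slide 20).
disks_per_node = storage_per_node_tb / 2

print(cores_per_node, storage_per_node_tb, ram_per_node_gb, disks_per_node)
```

Under these assumptions the average node has 12 cores, 24 TB of disk, and 48 GB of RAM, the last of which sits inside the 16-96GB memory range given on slide 20.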
26. Supermicro Advantages
Why Supermicro…
Building Blocks for different Workloads & Requirement
- Meet any Hadoop workloads by models
- I/O, CPU, Disks, Density
- Customize by specific workload requirement
High Efficiency, High Quality
- Green IT
- High Efficiency Power
- High Quality for highest system availability and best utilization
Proven solutions
- EMC Greenplum proven solutions
- 100% Apache Hadoop Compatible
- Benchmark and testing programs with partners
TCO
- Solutions to Cost-Effective Hadoop Clusters
- Best choice of Hadoop Hardware platforms
27. Turnkey Hadoop: Supermicro Complete Rack Solutions
One Stop Shop for Hardware, End to End Total Solutions
Speed Up Deployment With Ready to Run Rack Systems
Single Source, Consistent Build Quality and Delivery Time
Multi-Vendor Compatibility Test, Zero Compatibility Issues
Premium Service With Competitive Pricing
Shipped Directly From US, NL, TW