Expect More from Hadoop

1©MapR Technologies
Expect More from Hadoop
Jack Norris, MapR Technologies

Hadoop Growth

Important Drivers for Hadoop
 Data on compute
 You don’t need to know what
questions to ask beforehand
 Simple algorithms on Big Data
 Analysis of unstructured data

The Cost of Enterprise Storage
SAN Storage
$2 - $10/Gigabyte
$1M gets:
0.5Petabytes
200,000 IOPS
1Gbyte/sec
NAS Filers
$1 - $5/Gigabyte
$1M gets:
1 Petabyte
400,000 IOPS
2Gbyte/sec
Local Storage
$0.02/Gigabyte
$1M gets:
50 Petabytes
10,000,000 IOPS
800 Gbytes/sec
1/100 to 1/20 the cost

MapReduce: A Paradigm Shift
 Distributed, scalable computing platform
– Data/Compute framework
– Commodity hardware
 Pioneered at Google
 Commercially available as Hadoop

MapR Distribution for Apache Hadoop
 Complete Hadoop
distribution
 Comprehensive
management suite
 Industry-standard
interfaces
 Enterprise-grade
dependability
 Higher performance

How do you Benefit?

Expanding data
for existing applications

Use Case #1
 Major telecom vendor
 Key step in billing pipeline handled by data warehouse (EDW)
 EDW at maximum capacity
 Multiple rounds of software optimization already done
 Revenue limiting (= career limiting) bottleneck

Transformation
Extract and Load
CDR billing
records
Billing
reports
Data Warehouse
Customer
bills
Original Flow

Problem Analysis
 70% of EDW load is related to call detail record (CDR)
normalization
–< 10% of total lines of code
–CDR normalization difficult within the EDW
–Binary extraction and conversion
 Data rates are too high for upstream transform
–Requires high volume joins

ETL
CDR billing
records
Billing
reports
Data Warehouse
Customer
billing
With ETL Offload
Hadoop Cluster

Simplified Analysis
 70% of EDW consumed by ETL processing – Offload
frees capacity
 EDW direct hardware cost is approximately $30 million
vs. Hadoop cluster at 1/50 the cost
 Additional EDW only increases capacity by 50% due to
poor division of labor

The Results
 EDW strategy
–1.5 x performance
–$30 million
 MapR Strategy
–3 x faster
–20x cost/performance advantage for MapR strategy
–With High Availability and data protection

Use Case #2
Combine Many Different Data Sources

Use Case #2 – Customer Example
 Global Credit Card Issuer
 Launching a New Location Based Service
 Benefits both Merchants and Consumers

Combining different feeds on one platform
Hadoop and HBase
Storage and Processing
…
Real-time data feed
from social network
Stored in
Hadoop
Historical
Purchase
Information
Predictive Analytics from
Historical data combined with
NoSQL querying on real-time
social networking data
Billing
Data

Results
 New Service Rolled out in 1 quarter
 Processing time cut from 20 hours per day to 3
 Recommendation engine load time decreased from 8
hours to 3 minutes
 Includes data versioning support for easier
development and updating of models

Use Case #3
New Application from New Data Source

Ancestry.com – Family Tree

Overview and Requirements
 Collect and Collate information from disparate sources
(Text files, Images, etc.)
 Leverage new data source: Spit
 Machine learning techniques and DNA Matching
Algorithms

The Results
 Storage Infrastructure for billions of small and large files
 Blob Store for large images through NoSQL solutions
 Multi-tenant capability for data-mining and machine-learning
algorithm development
 One highly available, efficient platform

MapR M7: Making HBase Enterprise Grade
Disks
ext3
JVM
DFS
JVM
HBase
Other Distributions
Disks
Unified
Easy Dependable Fast
No RegionServers No compactions Consistent low latency
Seamless splits Instant recovery from node
failure
Real-time in-memory
configuration
Automatic merges Snapshots Disk and network compression
In-memory column families Mirroring Reduced I/O to disk

Use Case
New Analytics on Existing Data

Analytic Flexibility
 MapReduce enabled Machine learning algorithms
 Enhanced Search
 Real-time event processing
 No need to sample the data
Fraud Detection Target Marketing
Consumer
Behavior Analysis …

Hadoop Expands Analytics
“Simple algorithms and lots of data
trump complex models ”
Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems

Use Case #4
Combine All Three

Where do you Start?

One Platform for Big Data
…
Batch
99.999%
HA
Data
Protection
Disaster
Recovery
Scalability
&
Performance
Enterprise
Integration
Multi-
tenancy
Batch
Processing
File-Based
Applications
SQL Database Search Stream
Processing
Interactive Realtime

World Record Performance
Why is MapR faster and more efficient?
– C/C++ vs. Java
– Distributed metadata
– Optimized shuffle
New Minute Sort World
Record
1.5 TB in 1 minute
2103 nodes

Thank You

Expect More from Hadoop

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (6)

Semelhante a Expect More from Hadoop

Semelhante a Expect More from Hadoop (20)

Mais de MapR Technologies

Mais de MapR Technologies (20)

Último

Último (20)

Expect More from Hadoop

Notas do Editor