But it isn’t easy
Changing your company is not easy.
Give an example: you’ve just invested $1M in a data warehouse, but the business now wants to … It will now cost you tenfold.
We are announcing Vector on Hadoop: industrial-strength SQL on Hadoop with atom-smashing speed never before seen in the industry. This is a core part of our Actian Analytics Platform – Hadoop SQL Edition. Let me tell you about it (details below) and show you a few things.
What are we announcing?
Highest-performing, most industrialized SQL in Hadoop; turns Hadoop into a high-performance, fully functional analytics database
Actian Analytics Platform – Hadoop SQL Edition includes our hardened (patented) X100 vector processing engine, combined with Actian’s visual data and analytics workflow, all running natively in Hadoop via YARN
How is this unique?
Highest performing, most industrialized SQL access to Hadoop data
Only end-to-end analytic processing natively in Hadoop (covers the full analytics process: data blending & enrichment, discovery & data science, analytics & operational BI)
Most consumable, accessible, manageable Hadoop analytics
What does this mean to our customers?
Removes all barriers for business access to big data analytics
Unleashes millions of business-savvy SQL users, with no constraints on Hadoop data, to improve the accuracy of their analytical predictions and decision-making
Accelerates time to value and turns Hadoop data into transformational value: customer delight, competitive advantage, world-class risk management, disruptive business models
I’m going to show you three things: how fast it is, how easy it is to get started, and how it can be used in real-world scenarios.
1: We use vectorized processing to exploit modern CPU architecture. We execute one operation at a time on a vector of data, which allows tight inner code loops without branching. This lets us use SIMD instructions and, because there is no branching, keeps the CPU pipelines from stalling.
A vector is typically 1024 rows of a single column, so it’s a manageable amount of data while the overhead per row is still negligible.
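The vector-at-a-time idea above can be sketched in a few lines. This is a minimal illustration, not the engine’s actual code (X100’s primitives are compiled native kernels); the function names and the choice of addition as the operation are ours.

```python
VECTOR_SIZE = 1024  # rows per vector, as described above

def add_vectors(a, b, out, n):
    """One execution primitive: out[i] = a[i] + b[i] for a vector of n values.
    The inner loop has no branches and no per-row interpretation overhead,
    which is what lets a real engine apply SIMD and keep pipelines full."""
    for i in range(n):
        out[i] = a[i] + b[i]
    return out

def scan_in_vectors(column, vector_size=VECTOR_SIZE):
    """Feed a column to the engine one vector-sized chunk at a time."""
    for off in range(0, len(column), vector_size):
        yield column[off:off + vector_size]
```

The point of the 1024-row granularity is visible here: interpretation overhead (function dispatch, chunking) is paid once per vector rather than once per row, while each chunk stays small enough to be cache-resident.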
2: A vector will fit in the CPU cache together with the code for a particular operation, so all execution is in-cache.
3: To feed this engine with enough data, we also apply the vectorized paradigm to the storage subsystem. First of all, we use a column store, so only the relevant columns are read from disk. Data is stored in blocks of typically 512 MB, and a single block contains data from only one column (with some exceptions). Blocks of different columns can be interleaved, but typically multiple blocks of the same column are grouped together.
To keep the stable storage fast and defragmented, we use in-memory overlays to store updates to the data. These overlays are automatically flushed to stable storage when needed.
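The overlay mechanism can be sketched as follows. This is a simplified model of the idea, with hypothetical names, not the actual update structure: updates land in memory, reads merge the overlay over the base data, and a bulk flush keeps stable storage sequential.

```python
class ColumnWithOverlay:
    """Toy model: an immutable base column plus an in-memory update overlay."""

    def __init__(self, base_values):
        self.base = list(base_values)  # stands in for stable block storage
        self.overlay = {}              # row -> updated value, held in memory

    def update(self, row, value):
        # No random write ever hits stable storage; the update is buffered.
        self.overlay[row] = value

    def read(self, row):
        # Reads see the overlay first, then fall through to the base.
        return self.overlay.get(row, self.base[row])

    def flush(self):
        # Periodic bulk rewrite keeps stable storage fast and defragmented.
        for row, value in self.overlay.items():
            self.base[row] = value
        self.overlay.clear()
```

The design choice this illustrates: scan performance depends on large sequential reads, so small in-place updates are deferred and merged at read time instead of fragmenting the on-disk blocks.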
4: The blocks are stored compressed on-disk. We’ve got a number of lightweight compression algorithms and the most efficient one is chosen per block, depending on the data characteristics. The decompression takes place per vector and can be done in the CPU cache, which neatly ties in with our in-cache execution.
We have a buffer manager that predicts what blocks are needed when and makes sure no blocks that will be used in the near future are evicted from the buffer cache.
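The per-block compression choice can be sketched like this. It is an illustration under assumptions, not the engine’s actual codec set: we show only one lightweight scheme (run-length encoding) competed against storing the block raw, with invented function names.

```python
def rle_encode(values):
    """Run-length encode a non-empty block: [(value, run_length), ...]."""
    runs, prev, count = [], values[0], 1
    for v in values[1:]:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

def choose_encoding(block):
    """Pick the scheme that represents THIS block most compactly,
    based on the block's own data characteristics."""
    rle = rle_encode(block)
    # Cost model: 2 slots per run vs. 1 slot per raw value.
    if 2 * len(rle) < len(block):
        return ("rle", rle)
    return ("raw", block)
```

A column of repeated status codes would come out `"rle"`, while a column of unique IDs would stay `"raw"`; a real engine makes this choice per block among several lightweight schemes, so decompression stays cheap enough to run per vector inside the CPU cache.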
5: We have min-max indexes on the disk blocks, so when data is not completely random we can narrow down the ranges of blocks we need to read from disk, per column.
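Min-max block skipping amounts to a simple interval check; here is a minimal sketch (illustrative names, range predicate only) of how the per-block (min, max) metadata prunes reads.

```python
def blocks_to_read(minmax, lo, hi):
    """minmax: list of (block_min, block_max) per on-disk block.
    Return indices of blocks whose value range overlaps [lo, hi];
    all other blocks are skipped without touching the disk."""
    return [i for i, (bmin, bmax) in enumerate(minmax)
            if bmax >= lo and bmin <= hi]
```

When data is not completely random (e.g. roughly ordered by date), most blocks fall entirely outside the queried range, and the scan reads only a small fraction of the column.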
All in all, the execution engine sustains about 1.5 GB/s per core, and high-end I/O subsystems are able to keep up with this.
Execution
Subset of TPC-DS as chosen by Impala
Data size is 3TB (SF3000)
Executed on 5-node “rushcluster” in Austin
Both Impala and Vector numbers are on the same hardware
Comparison with Impala
Verified that Impala plans are sensible
Currently observed average speedup is 11x
Optimal query plans (manually written) give us a 16x speedup
These are real numbers! We executed manual plans directly
Changes in the cost model would get us to this performance
Performance improvements
Cost model changes will get us to 16x speedup
Pipeline of query execution changes
Well into H2
Estimated to get us 2x improvement
So, estimated speedup vs Impala would be ~30x (no guarantees)
Planning to run TPC-H SF1000 and SF3000
With all planned improvements (end of the year) we should be able to beat the EXASOL cluster numbers.
What are we announcing?
Actian Analytics Platform – Hadoop SQL Edition, the first offering that turns Hadoop into a fully-functioning analytics platform.
This new edition introduces the highest-performing, most industrialized SQL in Hadoop, powered by our hardened (patented) X100 vector processing engine, combined with Actian’s visual data and analytics workflow, all running natively in Hadoop via YARN.
How is this unique?
Provides the only end-to-end analytic processing natively in Hadoop (covers the full analytics processes: data blending & enrichment, discovery & data science, analytics & operational BI)
Delivers the highest performing, most industrialized SQL access to Hadoop data
Makes the entire analytic process more consumable, easier to access, and easier to manage than on any other platform
What does this mean to our customers?
Industrialized SQL in Hadoop removes all barriers for business access to big data analytics
Broad SQL access unleashes millions of business-savvy SQL users, with no constraints on Hadoop data, to improve the accuracy of their analytical predictions and decision-making
Turbocharged Hadoop analytics and SQL in Hadoop accelerates time to value and turns Hadoop data into transformational value: customer delight, competitive advantage, world-class risk management, disruptive business models
We want to partner with you to identify the most obvious places where big data analytics could be applied in your organization.