VLDB 2013 Early Career Research Contribution Award Presentation
Abstract: Four years ago at VLDB 2009, a paper was published about a research prototype, called HadoopDB, that attempted to transform Hadoop --- a batch-oriented scalable system designed for processing unstructured data --- into a full-fledged parallel database system that can achieve real-time (interactive) query responses across both structured and unstructured data. In 2010 it was commercialized by Hadapt, a start-up that was formed to accelerate the engineering of the HadoopDB ideas, and to harden the codebase for deployment in real-world, mission-critical applications. In this talk I will give an overview of HadoopDB, and how it combines ideas from the Hadoop and database system communities. I will then describe how the project transitioned from a research prototype written by PhD students at Yale University into enterprise-ready software written by a team of experienced engineers. We will examine particular technical features that are required in enterprise Hadoop deployments, and technical challenges that we ran into while making HadoopDB robust enough to be deployed in the real world. The talk will conclude with an analysis of how starting a company impacts the tenure process, and some thoughts for graduate students and junior faculty considering a similar path.
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments
1. From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments
Daniel Abadi
Yale University
August 28th, 2013
Twitter: @daniel_abadi
2. Overview of Talk
Motivation for HadoopDB
Overview of HadoopDB
Overview of the commercialization process
Technical features missing from HadoopDB that
Hadapt needed to implement
What does this mean for tenure?
3. Situation in 2008
Hadoop starting to take off as a “Big Data”
processing platform
Parallel database startups such as
Netezza, Vertica, and Greenplum gaining
traction for “Big Data” analysis
2 Schools of Thought
– School 1: They are on a collision course
– School 2: They are complementary
technologies
4. From 10,000 feet, Hadoop and Parallel Database Systems are Quite Similar
Both are suitable for large-scale data
processing
– I.e. analytical processing workloads
– Bulk loads
– Not optimized for transactional workloads
– Queries over large amounts of data
– Both can handle both relational and non-relational queries (DBMSs via UDFs)
5. SIGMOD 2009 Paper
Benchmarked Hadoop vs. 2 parallel
database systems
– Mostly focused on performance differences
– Measured differences in load and query time
for some common data processing tasks
– Used Web analytics benchmark whose goal
was to be representative of tasks that:
Both should excel at
Hadoop should excel at
Databases should excel at
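To make the benchmark's flavor concrete, here is a sketch (not from the paper; table and column names are illustrative) of an aggregation-style task expressed both as SQL, the way the parallel databases would run it, and as a map/reduce pair, the way Hadoop would:

```python
from collections import defaultdict

# SQL formulation, as a parallel DBMS would execute it:
SQL = "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"

def map_phase(records):
    # Emit one (sourceIP, adRevenue) pair per input record.
    for source_ip, ad_revenue in records:
        yield source_ip, ad_revenue

def reduce_phase(pairs):
    # Sum revenue per source IP, as a reducer would.
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

visits = [("1.2.3.4", 0.5), ("1.2.3.4", 1.5), ("5.6.7.8", 2.0)]
print(reduce_phase(map_phase(visits)))  # {'1.2.3.4': 2.0, '5.6.7.8': 2.0}
```

The same logical query, two very different execution stacks: that gap is what the benchmark measured.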
6. Hardware Setup
100 node cluster
Each node
– 2.4 GHz Core 2 Duo processors
– 4 GB RAM
– 2× 250 GB SATA HDs (74 MB/sec sequential I/O)
Dual GigE switches, each with 50 nodes
– 128 Gbit/sec fabric
Connected by a 64 Gbit/sec ring
8. UDF Task
[Chart: query time (seconds) for DBMS vs. Hadoop at 10, 25, 50, and 100 nodes]
Calculate PageRank over a set of HTML documents
Performed via a UDF
DBMS clearly doesn’t scale
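The core of the UDF task can be sketched as a single power-iteration step of PageRank over a tiny link graph; this is an illustrative reconstruction, not the benchmark's actual code:

```python
def pagerank_step(links, ranks, damping=0.85):
    # links: page -> list of outgoing links; ranks: page -> current rank.
    n = len(ranks)
    # Every page gets the baseline (1 - d) / n ...
    new_ranks = {page: (1 - damping) / n for page in ranks}
    # ... plus a damped share of rank from each page linking to it.
    for page, outgoing in links.items():
        if not outgoing:
            continue
        share = damping * ranks[page] / len(outgoing)
        for target in outgoing:
            new_ranks[target] += share
    return new_ranks

links = {"a": ["b"], "b": ["a"]}
ranks = {"a": 0.5, "b": 0.5}
print(pagerank_step(links, ranks))
```

Expressing this inside a relational engine requires a UDF; in Hadoop it is just another MapReduce job, which is why the parallelization gap showed up here.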
9. Scalability
Except for UDFs all systems scale near
linearly
BUT: only ran on 100 nodes
As nodes approach 1000, other effects
come into play
– Faults go from being rare, to not so rare
– It is nearly impossible to maintain
homogeneity at scale
10. Fault Tolerance and Cluster Heterogeneity Results
[Chart: percentage slowdown of DBMS vs. Hadoop under fault-tolerance and slowdown-tolerance tests]
Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
11. Benchmark Conclusions
Hadoop had scalability advantages
– Checkpointing allows for better fault tolerance
– Runtime scheduling allows for better tolerance of
unexpectedly slow nodes
– Better parallelization of UDFs
Hadoop was consistently less efficient for
structured, relational data
– Reasons mostly non-fundamental
– Needed better support for compression and direct
operation on compressed data
– Needed better support for indexing
– Needed better support for co-partitioning of datasets
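The co-partitioning point can be illustrated with a small sketch (names are hypothetical): hash-partition both tables on the join key with the same function, so matching rows land on the same node and the join runs locally with no network shuffle:

```python
def partition(rows, key_index, num_nodes):
    # Assign each row to a node by hashing its partitioning key.
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key_index]) % num_nodes].append(row)
    return nodes

orders = [(1, "order-a"), (2, "order-b")]
customers = [(1, "alice"), (2, "bob")]

# Partition both tables on the same key with the same hash function:
order_parts = partition(orders, 0, 4)
customer_parts = partition(customers, 0, 4)
# Rows with matching keys now live on the same node,
# so each node can compute its piece of the join independently.
```

Hadoop's storage layer at the time had no notion of this kind of physical design, which is one reason it lost on relational workloads.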
13. Problems With the Connector Approach
Network delays and bandwidth limitations
Data silos
Multiple vendors
Fundamentally wasteful
– Very similar architectures
Both partition data across a cluster
Both parallelize processing across the cluster
Both optimize for local data processing (to
minimize network costs)
14. Unified System
Two options:
– Bring Hadoop technology to a parallel
database system
Problem: Hadoop is more than just technology
– Bring parallel database system technology to
Hadoop
Far more likely to have impact
15. Adding DBMS Technology to Hadoop
Option 1: Keep Hadoop’s storage and build parallel
executor on top of it
Cloudera Impala (which is sort of a combination of Hadoop++
and NoDB research projects)
Need better Storage Formats (Trevni and Parquet are
promising)
Updates and Deletes are hard (Impala doesn’t support them)
Option 2: Use relational storage on each node
Accelerates “time to complete system”
We chose this option for HadoopDB
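A minimal sketch of Option 2's execution model, with `sqlite3` standing in for the per-node relational stores (HadoopDB actually used PostgreSQL on each node) and plain Python standing in for Hadoop's merge phase:

```python
import sqlite3

def node_with_data(rows):
    # Each "node" runs its own local relational store.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue REAL)")
    db.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return db

nodes = [node_with_data([("1.2.3.4", 0.5), ("5.6.7.8", 2.0)]),
         node_with_data([("1.2.3.4", 1.5)])]

# Each node evaluates the pushed-down SQL aggregate locally...
partials = [dict(db.execute(
    "SELECT source_ip, SUM(ad_revenue) FROM visits GROUP BY source_ip"))
    for db in nodes]

# ...and the framework (Hadoop, in HadoopDB) merges the partial results.
totals = {}
for part in partials:
    for ip, rev in part.items():
        totals[ip] = totals.get(ip, 0.0) + rev
print(totals)  # {'1.2.3.4': 2.0, '5.6.7.8': 2.0}
```

Pushing work below the framework boundary is what lets each node use indexes, compression, and a query optimizer, while Hadoop still supplies scheduling and fault tolerance across nodes.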
21. HadoopDB Commercialization
Wanted to build a real system
Released initial prototype open source
Blog post about HadoopDB got slashdotted, led
to VC interest
– Initially reluctant to take VC money
Posted a job for an engineer to help build out
open source codebase
– Low quality of applicants
– Not enough government funding for more than 1
engineer
22. HadoopDB Commercialization
VC money only route to building a
complete system
– Launched with $1.5 million in seed money in
2010
– Raised an additional $8 million in 2011
– Raised an additional $6.75 million in 2012
23. Commercializing HadoopDB: Where does development time go?
Work we expected to transition from
research prototype to commercial product
– SQL coverage
– Failover for high availability
– Authorization / authentication
– Error codes / messages for every situation
– Installer
– Documentation
But what about unexpected work?
24. Infrastructure Tools
Distributed systems are unwieldy
– For a cluster of size n, many things need to be done n times
Automated tools are critical
Just to try some new code, the following needs to
happen:
– Build product
– Provision a cluster
– Deploy build to cluster
– Install dependencies (Hadoop distro, libraries, etc)
– Install Hadapt with correct configuration parameters for that
cluster
– Generate data or copy data files to cluster for load
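The per-node steps above are exactly the kind of thing that gets scripted. A minimal sketch, with invented host names, paths, and commands:

```python
import subprocess

NODES = ["node01", "node02", "node03"]

# One command template per setup step; every step must run on every node.
STEPS = [
    "scp build/hadapt.tar.gz {host}:/tmp/",
    "ssh {host} tar -xzf /tmp/hadapt.tar.gz -C /opt/",
    "ssh {host} /opt/hadapt/bin/install --config /opt/hadapt/cluster.conf",
]

def deploy(run=subprocess.run):
    # Run every step on every node; `run` is injectable so the
    # plan can be dry-run or tested without touching real machines.
    for host in NODES:
        for step in STEPS:
            run(step.format(host=host), shell=True, check=True)
```

For a cluster of n nodes this collapses 3n manual operations into one command, which is the entire point of the slide.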
25. Upgrader
Start-ups need to move fast
Hadapt delivers a new release every
couple of months
Upgrade process must be easy
Downgrade (!) process must be easy
Changes in storage layout or APIs add
complexity to the process
26. UDF Support
HadoopDB supported both MapReduce
and SQL as interfaces
MapReduce was not a sufficient
replacement for database UDFs
Hadapt provides an “HDK” that enables
analysts to create functions that are
invokable from SQL
– Integrates with 3rd party tools
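The HDK's actual API is not shown in the talk; as a stand-in, the general pattern of a scalar function registered with a SQL engine and then invoked from SQL can be sketched with `sqlite3`'s UDF registration:

```python
import sqlite3

def sentiment(text):
    # Toy scalar function; a real HDK function would be far richer.
    return 1 if "good" in text.lower() else -1

db = sqlite3.connect(":memory:")
# Register the Python function under a SQL-visible name.
db.create_function("sentiment", 1, sentiment)
db.execute("CREATE TABLE reviews (body TEXT)")
db.executemany("INSERT INTO reviews VALUES (?)",
               [("Good product",), ("Bad box",)])
print(db.execute("SELECT body, sentiment(body) FROM reviews").fetchall())
# [('Good product', 1), ('Bad box', -1)]
```

Because the function is addressable from SQL rather than only from MapReduce jobs, third-party BI tools that speak SQL can invoke it transparently.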
27. Search
Hadoop is increasingly used as a data
landfill
– Granular data
– Messy data
– Unprocessed data
Database for Hadoop cannot assume all
data fits in rows and columns
Search support was the first thing we built
after our A round of financing
28. Is doing a start-up pre-tenure a good idea?
Spinning off a company takes a ton of time
– At first, you are the ONLY person who can give a
complete description of the technical vision, so
You’re talking to all the VCs to fundraise
You’re talking to all the prospective customers
You’re talking to all the prospective employees
– Lots of travel
– Eventually, others can help with the above, but a
good CEO will not let you escape
Ups and downs can be mentally draining
29. If you do a start-up you will:
Publish less
Advise fewer students
Pursue fewer grants
Avoid university committees as much as
possible
Skip faculty meetings (usually because of
travel)
Attend fewer academic conferences
30. At the end of the day
Unless there are changes (see SIGMOD panel
from June):
– Publishing a lot is the best way to get tenure
– Spinning off a company necessarily detracts from measurable university objectives
Doing a start-up is putting all your eggs in one
basket
– If successful, you have a lot of impact you can point to
– If not successful, you have nothing
– A lot of market forces that you have no control over
determine success