VLDB 2013 Early Career Research Contribution Award Presentation
Abstract: Four years ago at VLDB 2009, a paper was published about a research prototype, called HadoopDB, that attempted to transform Hadoop --- a batch-oriented scalable system designed for processing unstructured data --- into a full-fledged parallel database system that can achieve real-time (interactive) query responses across both structured and unstructured data. In 2010 it was commercialized by Hadapt, a start-up that was formed to accelerate the engineering of the HadoopDB ideas, and to harden the codebase for deployment in real-world, mission-critical applications. In this talk I will give an overview of HadoopDB, and how it combines ideas from the Hadoop and database system communities. I will then describe how the project transitioned from a research prototype written by PhD students at Yale University into enterprise-ready software written by a team of experienced engineers. We will examine particular technical features that are required in enterprise Hadoop deployments, and technical challenges that we ran into while making HadoopDB robust enough to be deployed in the real world. The talk will conclude with an analysis of how starting a company impacts the tenure process, and some thoughts for graduate students and junior faculty considering a similar path.
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments
1. From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real World Deployments
Daniel Abadi
Yale University
August 28th, 2013
Twitter: @daniel_abadi
2. Overview of Talk
Motivation for HadoopDB
Overview of HadoopDB
Overview of the commercialization process
Technical features missing from HadoopDB that
Hadapt needed to implement
What does this mean for tenure?
3. Situation in 2008
Hadoop starting to take off as a “Big Data”
processing platform
Parallel database startups such as
Netezza, Vertica, and Greenplum gaining
traction for “Big Data” analysis
2 Schools of Thought
– School 1: They are on a collision course
– School 2: They are complementary
technologies
4. From 10,000 feet, Hadoop and Parallel Database Systems are Quite Similar
Both are suitable for large-scale data
processing
– I.e. analytical processing workloads
– Bulk loads
– Not optimized for transactional workloads
– Queries over large amounts of data
– Both can handle both relational and non-relational queries (DBMSs via UDFs)
5. SIGMOD 2009 Paper
Benchmarked Hadoop vs. 2 parallel
database systems
– Mostly focused on performance differences
– Measured differences in load and query time
for some common data processing tasks
– Used Web analytics benchmark whose goal
was to be representative of tasks that:
Both should excel at
Hadoop should excel at
Databases should excel at
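To make the benchmark's flavor concrete, here is a sketch (not from the paper; table and column names are illustrative) of an aggregation-style task expressed both as SQL, the way the parallel databases would run it, and as a map/reduce pair, the way Hadoop would:

```python
from collections import defaultdict

# SQL formulation, as a parallel DBMS would execute it:
SQL = "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"

def map_phase(records):
    # Emit one (sourceIP, adRevenue) pair per input record.
    for source_ip, ad_revenue in records:
        yield source_ip, ad_revenue

def reduce_phase(pairs):
    # Sum revenue per source IP, as a reducer would.
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

visits = [("1.2.3.4", 0.5), ("1.2.3.4", 1.5), ("5.6.7.8", 2.0)]
print(reduce_phase(map_phase(visits)))  # {'1.2.3.4': 2.0, '5.6.7.8': 2.0}
```

The same logical query, two very different execution stacks: that gap is what the benchmark measured.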
6. Hardware Setup
100 node cluster
Each node
– 2.4 GHz Core 2 Duo processors
– 4 GB RAM
– 2× 250 GB SATA HDs (74 MB/sec sequential I/O)
Dual GigE switches, each with 50 nodes
– 128 Gbit/sec fabric
Connected by a 64 Gbit/sec ring
8. UDF Task
[Chart: query time (seconds) for DBMS vs. Hadoop at 10, 25, 50, and 100 nodes]
Calculate PageRank over a set of HTML documents
Performed via a UDF
DBMS clearly doesn’t scale
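The core of the UDF task can be sketched as a single power-iteration step of PageRank over a tiny link graph; this is an illustrative reconstruction, not the benchmark's actual code:

```python
def pagerank_step(links, ranks, damping=0.85):
    # links: page -> list of outgoing links; ranks: page -> current rank.
    n = len(ranks)
    # Every page gets the baseline (1 - d) / n ...
    new_ranks = {page: (1 - damping) / n for page in ranks}
    # ... plus a damped share of rank from each page linking to it.
    for page, outgoing in links.items():
        if not outgoing:
            continue
        share = damping * ranks[page] / len(outgoing)
        for target in outgoing:
            new_ranks[target] += share
    return new_ranks

links = {"a": ["b"], "b": ["a"]}
ranks = {"a": 0.5, "b": 0.5}
print(pagerank_step(links, ranks))
```

Expressing this inside a relational engine requires a UDF; in Hadoop it is just another MapReduce job, which is why the parallelization gap showed up here.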
9. Scalability
Except for UDFs all systems scale near
linearly
BUT: only ran on 100 nodes
As nodes approach 1000, other effects
come into play
– Faults go from being rare, to not so rare
– It is nearly impossible to maintain
homogeneity at scale
10. Fault Tolerance and Cluster Heterogeneity Results
[Chart: percentage slowdown of DBMS vs. Hadoop under fault-tolerance and slowdown-tolerance tests]
Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
11. Benchmark Conclusions
Hadoop had scalability advantages
– Checkpointing allows for better fault tolerance
– Runtime scheduling allows for better tolerance of
unexpectedly slow nodes
– Better parallelization of UDFs
Hadoop was consistently less efficient for
structured, relational data
– Reasons mostly non-fundamental
– Needed better support for compression and direct
operation on compressed data
– Needed better support for indexing
– Needed better support for co-partitioning of datasets
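The co-partitioning point can be illustrated with a small sketch (names are hypothetical): hash-partition both tables on the join key with the same function, so matching rows land on the same node and the join runs locally with no network shuffle:

```python
def partition(rows, key_index, num_nodes):
    # Assign each row to a node by hashing its partitioning key.
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key_index]) % num_nodes].append(row)
    return nodes

orders = [(1, "order-a"), (2, "order-b")]
customers = [(1, "alice"), (2, "bob")]

# Partition both tables on the same key with the same hash function:
order_parts = partition(orders, 0, 4)
customer_parts = partition(customers, 0, 4)
# Rows with matching keys now live on the same node,
# so each node can compute its piece of the join independently.
```

Hadoop's storage layer at the time had no notion of this kind of physical design, which is one reason it lost on relational workloads.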
13. Problems With the Connector Approach
Network delays and bandwidth limitations
Data silos
Multiple vendors
Fundamentally wasteful
– Very similar architectures
Both partition data across a cluster
Both parallelize processing across the cluster
Both optimize for local data processing (to
minimize network costs)
14. Unified System
Two options:
– Bring Hadoop technology to a parallel
database system
Problem: Hadoop is more than just technology
– Bring parallel database system technology to
Hadoop
Far more likely to have impact
15. Adding DBMS Technology to Hadoop
Option 1: Keep Hadoop’s storage and build parallel
executor on top of it
Cloudera Impala (which is sort of a combination of Hadoop++
and NoDB research projects)
Need better Storage Formats (Trevni and Parquet are
promising)
Updates and Deletes are hard (Impala doesn’t support them)
Option 2: Use relational storage on each node
Accelerates “time to complete system”
We chose this option for HadoopDB
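A minimal sketch of Option 2's execution model, with `sqlite3` standing in for the per-node relational stores (HadoopDB actually used PostgreSQL on each node) and plain Python standing in for Hadoop's merge phase:

```python
import sqlite3

def node_with_data(rows):
    # Each "node" runs its own local relational store.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue REAL)")
    db.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return db

nodes = [node_with_data([("1.2.3.4", 0.5), ("5.6.7.8", 2.0)]),
         node_with_data([("1.2.3.4", 1.5)])]

# Each node evaluates the pushed-down SQL aggregate locally...
partials = [dict(db.execute(
    "SELECT source_ip, SUM(ad_revenue) FROM visits GROUP BY source_ip"))
    for db in nodes]

# ...and the framework (Hadoop, in HadoopDB) merges the partial results.
totals = {}
for part in partials:
    for ip, rev in part.items():
        totals[ip] = totals.get(ip, 0.0) + rev
print(totals)  # {'1.2.3.4': 2.0, '5.6.7.8': 2.0}
```

Pushing work below the framework boundary is what lets each node use indexes, compression, and a query optimizer, while Hadoop still supplies scheduling and fault tolerance across nodes.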
21. HadoopDB Commercialization
Wanted to build a real system
Released initial prototype open source
Blog post about HadoopDB got slashdotted, led
to VC interest
– Initially reluctant to take VC money
Posted a job for an engineer to help build out
open source codebase
– Low quality of applicants
– Not enough government funding for more than 1
engineer
22. HadoopDB Commercialization
VC money only route to building a
complete system
– Launched with $1.5 million in seed money in
2010
– Raised an additional $8 million in 2011
– Raised an additional $6.75 million in 2012
23. Commercializing HadoopDB: Where does development time go?
Work we expected to transition from
research prototype to commercial product
– SQL coverage
– Failover for high availability
– Authorization / authentication
– Error codes / messages for every situation
– Installer
– Documentation
But what about unexpected work?
24. Infrastructure Tools
Distributed systems are unwieldy
– For a cluster of size n, many things need to be done n times
Automated tools are critical
Just to try some new code, the following needs to
happen:
– Build product
– Provision a cluster
– Deploy build to cluster
– Install dependencies (Hadoop distro, libraries, etc)
– Install Hadapt with correct configuration parameters for that
cluster
– Generate data or copy data files to cluster for load
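The per-node steps above are exactly the kind of thing that gets scripted. A minimal sketch, with invented host names, paths, and commands:

```python
import subprocess

NODES = ["node01", "node02", "node03"]

# One command template per setup step; every step must run on every node.
STEPS = [
    "scp build/hadapt.tar.gz {host}:/tmp/",
    "ssh {host} tar -xzf /tmp/hadapt.tar.gz -C /opt/",
    "ssh {host} /opt/hadapt/bin/install --config /opt/hadapt/cluster.conf",
]

def deploy(run=subprocess.run):
    # Run every step on every node; `run` is injectable so the
    # plan can be dry-run or tested without touching real machines.
    for host in NODES:
        for step in STEPS:
            run(step.format(host=host), shell=True, check=True)
```

For a cluster of n nodes this collapses 3n manual operations into one command, which is the entire point of the slide.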
25. Upgrader
Start-ups need to move fast
Hadapt delivers a new release every
couple of months
Upgrade process must be easy
Downgrade (!) process must be easy
Changes in storage layout or APIs add
complexity to the process
26. UDF Support
HadoopDB supported both MapReduce
and SQL as interfaces
MapReduce was not a sufficient
replacement for database UDFs
Hadapt provides an “HDK” that enables
analysts to create functions that are
invokable from SQL
– Integrates with 3rd party tools
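The HDK's actual API is not shown in the talk; as a stand-in, the general pattern of a scalar function registered with a SQL engine and then invoked from SQL can be sketched with `sqlite3`'s UDF registration:

```python
import sqlite3

def sentiment(text):
    # Toy scalar function; a real HDK function would be far richer.
    return 1 if "good" in text.lower() else -1

db = sqlite3.connect(":memory:")
# Register the Python function under a SQL-visible name.
db.create_function("sentiment", 1, sentiment)
db.execute("CREATE TABLE reviews (body TEXT)")
db.executemany("INSERT INTO reviews VALUES (?)",
               [("Good product",), ("Bad box",)])
print(db.execute("SELECT body, sentiment(body) FROM reviews").fetchall())
# [('Good product', 1), ('Bad box', -1)]
```

Because the function is addressable from SQL rather than only from MapReduce jobs, third-party BI tools that speak SQL can invoke it transparently.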
27. Search
Hadoop is increasingly used as a data
landfill
– Granular data
– Messy data
– Unprocessed data
Database for Hadoop cannot assume all
data fits in rows and columns
Search support was the first thing we built
after our A round of financing
28. Is doing a start-up pre-tenure a good idea?
Spinning off a company takes a ton of time
– At first, you are the ONLY person who can give a
complete description of the technical vision, so
You’re talking to all the VCs to fundraise
You’re talking to all the prospective customers
You’re talking to all the prospective employees
– Lots of travel
– Eventually, others can help with the above, but a
good CEO will not let you escape
Ups and downs can be mentally draining
29. If you do a start-up you will:
Publish less
Advise fewer students
Pursue fewer grants
Avoid university committees as much as
possible
Skip faculty meetings (usually because of
travel)
Attend fewer academic conferences
30. At the end of the day
Unless there are changes (see SIGMOD panel
from June):
– Publishing a lot is the best way to get tenure
– Spinning off a company necessarily detracts from measurable university objectives
Doing a start-up is putting all your eggs in one
basket
– If successful, you have a lot of impact you can point to
– If not successful, you have nothing
– A lot of market forces that you have no control over
determine success