SlideShare a Scribd company logo
1 of 45
Download to read offline
Data Loading Challenges when
Transforming Hadoop into a
Parallel Database System
Daniel Abadi
Yale University
March 19th, 2013
Twitter: @daniel_abadi
Overview of Talk
Motivation for HadoopDB
Overview of HadoopDB
Lessons learned from commercializing
HadoopDB into Hadapt
Ideas for overcoming the loading challenge
(invisible loading)
EDBT 2013 paper on invisible loading that is
published alongside this talk can be found at:
http://cs-
www.cs.yale.edu/homes/dna/papers/invisibleloadi
ng.pdf
Big Data
Data is growing (currently) faster than Moore’s Law
– Scale becoming a bigger concern
– Cost becoming a bigger concern
– Unstructured data “variety” becoming a bigger concern
Increasing desire for incremental scale out on
commodity hardware
– Fault tolerance a bigger concern
– Dealing with unpredictable performance a bigger concern
Parallel database systems are the traditional solution
A lot of excitement around Hadoop as a tool to help with
these problems (initiated by Yahoo and Facebook)
Hadoop/MapReduce
Data is partitioned across N machines
– Typically stored in a distributed file system (GFS/HDFS)
On each machine n, apply a function, Map, to each data item d
– Map(d)  {(key1,value1)} “map job”
– Sort output of all map jobs on n by key
– Send (key1,value1) pairs with same key value to same machine
(using e.g., hashing)
On each machine m, apply reduce function to (key1, {value1})
pairs mapped to it
– Reduce(key1,{value1})  (key2,value2) “reduce job”
Optionally, collect output and present to user
Map Workers
Example
Count occurrences of the word “Italy” and “DBMS” in all documents
map(d):
words = split(d,’ ‘)
foreach w in words:
if w == ‘Italy’
emit (‘Italy’, 1)
if w == ‘DBMS’
emit (‘DBMS’, 1)
reduce(key, valueSet):
count = 0
for each v in valueSet:
count += v
emit (key,count)
Data Partitions
Reduce Workers
Italy reducer DBMS reducer
(DBMS,1)(Italy, 1)
Relational Operators In MR
Straightforward to implement relational
operators in MapReduce
– Select: simple filter in Map Phase
– Project: project function in Map Phase
– Join: Map produces tuples with join key as
key; Reduce performs the join
Query plans can be implemented as a
sequence of MapReduce jobs (e.g Hive)
Similarities Between Hadoop
and Parallel Database Systems
Both are suitable for large-scale data
processing
– I.e. analytical processing workloads
– Bulk loads
– Not optimized for transactional workloads
– Queries over large amounts of data
– Both can handle both relational and nonrelational
queries (DBMS via UDFs)
Differences
Hadoop can operate on in-situ data, without
requiring transformation or loading
Schemas:
– Hadoop doesn’t require them, DBMSs do
– Easy to write simple MR programs over raw data
Indexes
– Hadoop’s built in support is primitive
Declarative vs imperative programming
Hadoop uses a run-time scheduler for fine-grained
load balancing
Hadoop checkpoints intermediate results for fault
tolerance
SIGMOD 2009 Paper
Benchmarked Hadoop vs. 2 parallel
database systems
– Mostly focused on performance differences
– Measured differences in load and query time
for some common data processing tasks
– Used Web analytics benchmark whose goal
was to be representative of tasks that:
Both should excel at
Hadoop should excel at
Databases should excel at
Hardware Setup
100 node cluster
Each node
– 2.4 GHz Code 2 Duo Processors
– 4 GB RAM
– 2 250 GB SATA HDs (74 MB/Sec sequential I/O)
Dual GigE switches, each with 50 nodes
– 128 Gbit/sec fabric
Connected by a 64 Gbit/sec ring
Loading
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
10 nodes 25 nodes 50 nodes 100
nodes
Time(seconds)
Vertica
DBMS-X
Hadoop
Join Task
0
200
400
600
800
1000
1200
1400
1600
10 nodes 25 nodes 50 nodes 100 nodes
Time(seconds)
Vertica
DBMS-X
Hadoop
UDF Task
0
200
400
600
800
1000
1200
10 nodes 25 nodes 50 nodes 100
nodes
Time(seconds)
DBMS
Hadoop
DBMS clearly doesn’t scaleCalculate
PageRank
over a set of
HTML
documents
Performed
via a UDF
Scalability
Except for DBMS-X load time and UDFs
all systems scale near linearly
BUT: only ran on 100 nodes
As nodes approach 1000, other effects
come into play
– Faults go from being rare, to not so rare
– It is nearly impossible to maintain
homogeneity at scale
Fault Tolerance and Cluster
Heterogeneity Results
0
20
40
60
80
100
120
140
160
180
200
Fault tolerance Slowdown tolerance
PercentageSlowdown
DBMS
Hadoop
Database systems restart entire
query upon a single node failure,
and do not adapt if a node is
running slowly
Benchmark Conclusions
Hadoop is consistently more scalable
– Checkpointing allows for better fault tolerance
– Runtime scheduling allows for better tolerance of
unexpectedly slow nodes
– Better parallelization of UDFs
Hadoop is consistently less efficient for
structured, relational data
– Reasons mostly non-fundamental
– Needs better support for compression and direct
operation on compressed data
– Needs better support for indexing
– Needs better support for co-partitioning of datasets
Best of Both Worlds Possible?
Connector
Problems With the Connector
Approach
Network delays and bandwidth limitations
Data silos
Multiple vendors
Fundamentally wasteful
– Very similar architectures
Both partition data across a cluster
Both parallelize processing across the cluster
Both optimize for local data processing (to
minimize network costs)
Unified System
Two options:
– Bring Hadoop technology to a parallel
database system
Problem: Hadoop is more than just technology
– Bring parallel database system technology to
Hadoop
Far more likely to have impact
Adding DBMS Technology to
Hadoop
Option 1:
– Keep Hadoop’s storage and build parallel executor on
top of it
Cloudera Impala (which is sort of a combination of Hadoop++
and NoDB research projects)
Need better Storage Formats (Trevni and Parquet are
promising)
Updates and Deletes are hard (Impala doesn’t support them)
– Use relational storage on each node
Better support for traditional database management
(including updates and deletes)
Accelerates “time to complete system”
We chose this option for HadoopDB
HadoopDB Architecture
SMS Planner
HadoopDB Experiments
VLDB 2009 paper ran same Web analytics
benchmark as SIGMOD 2009 paper
Used PostgreSQL as the DBMS storage
layer
Join Task
0
200
400
600
800
1000
1200
1400
1600
10 nodes 25 nodes 50 nodes
Vertica
DBMS-X
Hadoop
HadoopDB
•HadoopDB much faster than Hadoop
•In 2009, didn’t quite match the database
systems in performance (but see next slide)
•Hadoop start-up costs
•PostgreSQL join performance
•Fault tolerance/run-time scheduling
TPC-H Benchmark Results
•More advanced storage and better
join algorithms can lead to better
performance (even beating database
systems for some queries)
UDF Task
0
100
200
300
400
500
600
700
800
10 nodes 25 nodes 50 nodes
Time(seconds)
DBMS
Hadoop
HadoopDB
Fault Tolerance and Cluster
Heterogeneity Results
0
20
40
60
80
100
120
140
160
180
200
Fault tolerance Slowdown tolerance
PercentageSlowdown
DBMS
Hadoop
HadoopDB
Hadapt
Goal: Turn HadoopDB into enterprise-
grade software so it can be used for real
analytical applications
– Launched with $1.5 million in seed money in
2010
– Raised an additional $8 million in 2011
– Raised an additional $6.75 million in 2012
Speaker has a significant financial interest in Hadapt
Hadapt: Lessons Learned
Location matters
– Impossible to scale quickly in New Haven,
Connecticut
– Company was forced to move to Cambridge
(near other successful DB companies)
Personally painful for me
Hadapt: Lessons Learned
Business side matters
– Need intelligence and competence across the
whole company (not just in engineering)
Decisions have to made all the time that have
enormous effects on the company
– Technical founders often find themselves
doing non-technical tasks more than half the
time
Hadapt: Lessons Learned
Good enterprise software: more time on
minutia than innovative new features
– SQL coverage
– Failover for high availability
– Authorization / authentication
– Disaster Recovery
– Installer, upgrader, downgrader (!)
– Documentation
Hadapt: Lessons Learned
Customer input to the design process
critical
– Hadapt had no plans to integrate text search
into the product until request from a customer
– Today ~75% of Hadapt customers use the
text search feature
Hadapt: Lessons Learned
Installation and load processes are critical
– They are the customer’s first two interactions
with your product
– Lead to the invisible loading idea
Discussed in the remaining slides of this
presentation
If only we could make loading
invisible …
Review: Analytic Workflows
Assume data already in file system (HDFS)
Two workflows
– Workflow #1: In-situ querying
e.g. Map Reduce, HIVE / Pig, Impala, etc
Latency optimization (time to first query answer)
File
System
Query
EngineData Retrieval
Query Result
Submit Query
Review: Analytic Workflows
Assume data already in file system (HDFS)
Two workflows
– Workflow #2: Querying after bulk load
E.g. RDBMS, Hadapt
Storage and retrieval optimization for throughput
(repeated queries)
Other benefits: data validation and quality control
File
System
Query
Engine
1. Bulk Load
Query Result
2. Submit Query
Storage
Engine
Immediate Versus Deferred
Gratification
Use Case Example
Context-Dependent Schema Awareness
Different analysts know the schema of
what they are looking for and don’t care
about other log messages
Message field
varies depending
on application
Workload-Based Data Loading
Invisible loading makes workload-directed
decisions on:
– When to bulk load which data
– By piggy-backing on queries
File
System
Query
Engine
Partial Load 1
Query 1 Result
Submit Query 1
Storage
Engine
Workload-Based Data Loading
Invisible loading makes workload-directed
decisions on:
– When to bulk load which data
– By piggy-backing on queries
File
System
Query
Engine
Partial Load 2
Query 2 Result
Submit Query 2
Storage
Engine
Workload-Based Data Loading
Invisible loading makes workload-directed
decisions on:
– When to bulk load which data
– By piggy-backing on queries
Benefits
– Eliminates manual workflows on
data modeling and bulk load scripting
– Optimizes query performance over time
– Boosts productivity and performance
Interface and User Experience
Design
Structured parser interface for mapper
– getAttribute(int index)
– Similar to PigLatin
Minimize load’s impact on query performance
– Load a subset of rows (based on HDFS file splits)
– Load size tunable based on query performance
impact
Optional: load a subset of columns
– Useful for columnar file formats
Physical Tuning: Incremental
Reorganization
Performance Study
Conclusion
Invisible loading: Immediate gratification
+ long term performance benefits
Applicability beyond HDFS
Future work
– Automatic data schema / format recognition
– Schema evolution support
– Decouple load from query
Workload-sensitive loading strategies

More Related Content

What's hot

Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop TechnologyOpenDev
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFSBrendan Tierney
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Performance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersXiao Qin
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseEdureka!
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoopdarugar
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopDataWorks Summit
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 

What's hot (20)

Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop Technology
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Overview of Hadoop and HDFS
Overview of Hadoop and HDFSOverview of Hadoop and HDFS
Overview of Hadoop and HDFS
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Performance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop Clusters
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 

Viewers also liked

Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignArinto Murdopo
 
CAP, PACELC, and Determinism
CAP, PACELC, and DeterminismCAP, PACELC, and Determinism
CAP, PACELC, and DeterminismDaniel Abadi
 
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs Daniel Abadi
 
VLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresDaniel Abadi
 
The Power of Determinism in Database Systems
The Power of Determinism in Database SystemsThe Power of Determinism in Database Systems
The Power of Determinism in Database SystemsDaniel Abadi
 
Column-Stores vs. Row-Stores: How Different are they Really?
Column-Stores vs. Row-Stores: How Different are they Really?Column-Stores vs. Row-Stores: How Different are they Really?
Column-Stores vs. Row-Stores: How Different are they Really?Daniel Abadi
 
CAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesYoav Francis
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use CasesMax De Marzi
 

Viewers also liked (9)

Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System Design
 
CAP, PACELC, and Determinism
CAP, PACELC, and DeterminismCAP, PACELC, and Determinism
CAP, PACELC, and Determinism
 
Invisible loading
Invisible loadingInvisible loading
Invisible loading
 
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
 
VLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-Stores
 
The Power of Determinism in Database Systems
The Power of Determinism in Database SystemsThe Power of Determinism in Database Systems
The Power of Determinism in Database Systems
 
Column-Stores vs. Row-Stores: How Different are they Really?
Column-Stores vs. Row-Stores: How Different are they Really?Column-Stores vs. Row-Stores: How Different are they Really?
Column-Stores vs. Row-Stores: How Different are they Really?
 
CAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and PracticesCAP Theorem - Theory, Implications and Practices
CAP Theorem - Theory, Implications and Practices
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 

Similar to Shared slides-edbt-keynote-03-19-13

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khanKamranKhan587
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
DWH & big data architecture approaches
DWH & big data architecture approachesDWH & big data architecture approaches
DWH & big data architecture approachesLuxoft
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Anna Shymchenko
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdfEdureka!
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangaloreTIB Academy
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxUttara University
 

Similar to Shared slides-edbt-keynote-03-19-13 (20)

How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
 
DWH & big data architecture approaches
DWH & big data architecture approachesDWH & big data architecture approaches
DWH & big data architecture approaches
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptx
 
hadoop
hadoophadoop
hadoop
 

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Shared slides-edbt-keynote-03-19-13

  • 1. Data Loading Challenges when Transforming Hadoop into a Parallel Database System Daniel Abadi Yale University March 19th, 2013 Twitter: @daniel_abadi
  • 2. Overview of Talk Motivation for HadoopDB Overview of HadoopDB Lessons learned from commercializing HadoopDB into Hadapt Ideas for overcoming the loading challenge (invisible loading) EDBT 2013 paper on invisible loading that is published alongside this talk can be found at: http://cs- www.cs.yale.edu/homes/dna/papers/invisibleloadi ng.pdf
  • 3. Big Data Data is growing (currently) faster than Moore’s Law – Scale becoming a bigger concern – Cost becoming a bigger concern – Unstructured data “variety” becoming a bigger concern Increasing desire for incremental scale out on commodity hardware – Fault tolerance a bigger concern – Dealing with unpredictable performance a bigger concern Parallel database systems are the traditional solution A lot of excitement around Hadoop as a tool to help with these problems (initiated by Yahoo and Facebook)
  • 4. Hadoop/MapReduce Data is partitioned across N machines – Typically stored in a distributed file system (GFS/HDFS) On each machine n, apply a function, Map, to each data item d – Map(d)  {(key1,value1)} “map job” – Sort output of all map jobs on n by key – Send (key1,value1) pairs with same key value to same machine (using e.g., hashing) On each machine m, apply reduce function to (key1, {value1}) pairs mapped to it – Reduce(key1,{value1})  (key2,value2) “reduce job” Optionally, collect output and present to user
  • 5. Map Workers Example Count occurrences of the word “Italy” and “DBMS” in all documents map(d): words = split(d,’ ‘) foreach w in words: if w == ‘Italy’ emit (‘Italy’, 1) if w == ‘DBMS’ emit (‘DBMS’, 1) reduce(key, valueSet): count = 0 for each v in valueSet: count += v emit (key,count) Data Partitions Reduce Workers Italy reducer DBMS reducer (DBMS,1)(Italy, 1)
  • 6. Relational Operators In MR Straightforward to implement relational operators in MapReduce – Select: simple filter in Map Phase – Project: project function in Map Phase – Join: Map produces tuples with join key as key; Reduce performs the join Query plans can be implemented as a sequence of MapReduce jobs (e.g Hive)
  • 7. Similarities Between Hadoop and Parallel Database Systems Both are suitable for large-scale data processing – I.e. analytical processing workloads – Bulk loads – Not optimized for transactional workloads – Queries over large amounts of data – Both can handle both relational and nonrelational queries (DBMS via UDFs)
  • 8. Differences Hadoop can operate on in-situ data, without requiring transformation or loading Schemas: – Hadoop doesn’t require them, DBMSs do – Easy to write simple MR programs over raw data Indexes – Hadoop’s built in support is primitive Declarative vs imperative programming Hadoop uses a run-time scheduler for fine-grained load balancing Hadoop checkpoints intermediate results for fault tolerance
  • 9. SIGMOD 2009 Paper Benchmarked Hadoop vs. 2 parallel database systems – Mostly focused on performance differences – Measured differences in load and query time for some common data processing tasks – Used Web analytics benchmark whose goal was to be representative of tasks that: Both should excel at Hadoop should excel at Databases should excel at
  • 10. Hardware Setup 100 node cluster Each node – 2.4 GHz Code 2 Duo Processors – 4 GB RAM – 2 250 GB SATA HDs (74 MB/Sec sequential I/O) Dual GigE switches, each with 50 nodes – 128 Gbit/sec fabric Connected by a 64 Gbit/sec ring
  • 11. Loading 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 10 nodes 25 nodes 50 nodes 100 nodes Time(seconds) Vertica DBMS-X Hadoop
  • 12. Join Task 0 200 400 600 800 1000 1200 1400 1600 10 nodes 25 nodes 50 nodes 100 nodes Time(seconds) Vertica DBMS-X Hadoop
  • 13. UDF Task 0 200 400 600 800 1000 1200 10 nodes 25 nodes 50 nodes 100 nodes Time(seconds) DBMS Hadoop DBMS clearly doesn’t scaleCalculate PageRank over a set of HTML documents Performed via a UDF
  • 14. Scalability Except for DBMS-X load time and UDFs all systems scale near linearly BUT: only ran on 100 nodes As nodes approach 1000, other effects come into play – Faults go from being rare, to not so rare – It is nearly impossible to maintain homogeneity at scale
  • 15. Fault Tolerance and Cluster Heterogeneity Results 0 20 40 60 80 100 120 140 160 180 200 Fault tolerance Slowdown tolerance PercentageSlowdown DBMS Hadoop Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
  • 16. Benchmark Conclusions Hadoop is consistently more scalable – Checkpointing allows for better fault tolerance – Runtime scheduling allows for better tolerance of unexpectedly slow nodes – Better parallelization of UDFs Hadoop is consistently less efficient for structured, relational data – Reasons mostly non-fundamental – Needs better support for compression and direct operation on compressed data – Needs better support for indexing – Needs better support for co-partitioning of datasets
  • 17. Best of Both Worlds Possible? Connector
  • 18. Problems With the Connector Approach Network delays and bandwidth limitations Data silos Multiple vendors Fundamentally wasteful – Very similar architectures Both partition data across a cluster Both parallelize processing across the cluster Both optimize for local data processing (to minimize network costs)
  • 19. Unified System Two options: – Bring Hadoop technology to a parallel database system Problem: Hadoop is more than just technology – Bring parallel database system technology to Hadoop Far more likely to have impact
  • 20. Adding DBMS Technology to Hadoop Option 1: – Keep Hadoop’s storage and build parallel executor on top of it Cloudera Impala (which is sort of a combination of Hadoop++ and NoDB research projects) Need better Storage Formats (Trevni and Parquet are promising) Updates and Deletes are hard (Impala doesn’t support them) – Use relational storage on each node Better support for traditional database management (including updates and deletes) Accelerates “time to complete system” We chose this option for HadoopDB
  • 23. HadoopDB Experiments VLDB 2009 paper ran same Web analytics benchmark as SIGMOD 2009 paper Used PostgreSQL as the DBMS storage layer
  • 24. Join Task 0 200 400 600 800 1000 1200 1400 1600 10 nodes 25 nodes 50 nodes Vertica DBMS-X Hadoop HadoopDB •HadoopDB much faster than Hadoop •In 2009, didn’t quite match the database systems in performance (but see next slide) •Hadoop start-up costs •PostgreSQL join performance •Fault tolerance/run-time scheduling
  • 25. TPC-H Benchmark Results •More advanced storage and better join algorithms can lead to better performance (even beating database systems for some queries)
  • 26. UDF Task 0 100 200 300 400 500 600 700 800 10 nodes 25 nodes 50 nodes Time(seconds) DBMS Hadoop HadoopDB
  • 27. Fault Tolerance and Cluster Heterogeneity Results 0 20 40 60 80 100 120 140 160 180 200 Fault tolerance Slowdown tolerance PercentageSlowdown DBMS Hadoop HadoopDB
  • 28. Hadapt Goal: Turn HadoopDB into enterprise- grade software so it can be used for real analytical applications – Launched with $1.5 million in seed money in 2010 – Raised an additional $8 million in 2011 – Raised an additional $6.75 million in 2012 Speaker has a significant financial interest in Hadapt
  • 29. Hadapt: Lessons Learned Location matters – Impossible to scale quickly in New Haven, Connecticut – Company was forced to move to Cambridge (near other successful DB companies) Personally painful for me
  • 30. Hadapt: Lessons Learned Business side matters – Need intelligence and competence across the whole company (not just in engineering) Decisions have to made all the time that have enormous effects on the company – Technical founders often find themselves doing non-technical tasks more than half the time
  • 31. Hadapt: Lessons Learned Good enterprise software: more time on minutia than innovative new features – SQL coverage – Failover for high availability – Authorization / authentication – Disaster Recovery – Installer, upgrader, downgrader (!) – Documentation
  • 32. Hadapt: Lessons Learned Customer input to the design process critical – Hadapt had no plans to integrate text search into the product until request from a customer – Today ~75% of Hadapt customers use the text search feature
  • 33. Hadapt: Lessons Learned Installation and load processes are critical – They are the customer’s first two interactions with your product – Lead to the invisible loading idea Discussed in the remaining slides of this presentation
  • 34. If only we could make loading invisible …
  • 35. Review: Analytic Workflows Assume data already in file system (HDFS) Two workflows – Workflow #1: In-situ querying e.g. Map Reduce, HIVE / Pig, Impala, etc Latency optimization (time to first query answer) File System Query EngineData Retrieval Query Result Submit Query
  • 36. Review: Analytic Workflows Assume data already in file system (HDFS) Two workflows – Workflow #2: Querying after bulk load E.g. RDBMS, Hadapt Storage and retrieval optimization for throughput (repeated queries) Other benefits: data validation and quality control File System Query Engine 1. Bulk Load Query Result 2. Submit Query Storage Engine
  • 38. Use Case Example Context-Dependent Schema Awareness Different analysts know the schema of what they are looking for and don’t care about other log messages Message field varies depending on application
  • 39. Workload-Based Data Loading Invisible loading makes workload-directed decisions on: – When to bulk load which data – By piggy-backing on queries File System Query Engine Partial Load 1 Query 1 Result Submit Query 1 Storage Engine
  • 40. Workload-Based Data Loading Invisible loading makes workload-directed decisions on: – When to bulk load which data – By piggy-backing on queries File System Query Engine Partial Load 2 Query 2 Result Submit Query 2 Storage Engine
  • 41. Workload-Based Data Loading Invisible loading makes workload-directed decisions on: – When to bulk load which data – By piggy-backing on queries Benefits – Eliminates manual workflows on data modeling and bulk load scripting – Optimizes query performance over time – Boosts productivity and performance
  • 42. Interface and User Experience Design Structured parser interface for mapper – getAttribute(int index) – Similar to PigLatin Minimize load’s impact on query performance – Load a subset of rows (based on HDFS file splits) – Load size tunable based on query performance impact Optional: load a subset of columns – Useful for columnar file formats
  • 45. Conclusion Invisible loading: Immediate gratification + long term performance benefits Applicability beyond HDFS Future work – Automatic data schema / format recognition – Schema evolution support – Decouple load from query Workload-sensitive loading strategies