Copyright © 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on-Hadoop, and the Virtual EDW
Marcel Kornacker | marcel@cloudera.com 
April 2014
Analytic Workloads on Hadoop: Where Do
We Stand?
“DeWitt Clause” prohibits
using DBMS vendor name
Hadoop for Analytic Workloads
•Hadoop has traditionally been used for offline batch processing:
ETL and ELT
•Next step: Hadoop for traditional business intelligence (BI)/data
warehouse (EDW) workloads:
•interactive
•concurrent users
•Topic of this talk: a Hadoop-based open-source stack for EDW
workloads:
•HDFS: a high-performance storage system
•Parquet: a state-of-the-art columnar storage format
•Impala: a modern, open-source SQL engine for Hadoop
Hadoop for Analytic Workloads
•Thesis of this talk:
•techniques and functionality of established commercial
solutions are either already available or are rapidly being
implemented in the Hadoop stack
•the Hadoop stack is an effective solution for certain EDW workloads
•Hadoop-based EDW solution maintains Hadoop’s strengths:
flexibility, ease of scaling, cost effectiveness
HDFS: A Storage System for Analytic
Workloads
•Available in HDFS today:
•high-efficiency data scans at or near hardware speed, both
from disk and memory
•On the immediate roadmap:
•co-partitioned tables for even faster distributed joins
•temp-FS: write temp table data straight to memory,
bypassing disk

HDFS: The Details
•High efficiency data transfers
•short-circuit reads: bypass DataNode protocol when reading
from local disk

-> read at 100+MB/s per disk
•HDFS caching: access explicitly cached data w/o copy or
checksumming

-> access memory-resident data at memory bus speed

-> enable in-memory processing
HDFS: The Details
•Coming attractions:
•affinity groups: collocate blocks from different files

-> create co-partitioned tables for improved join
performance
•temp-fs: write temp table data straight to memory,
bypassing disk

-> ideal for iterative interactive data analysis
Parquet: Columnar Storage for Hadoop
•What it is:
•state-of-the-art, open-source columnar file format that’s
available for (most) Hadoop processing frameworks:

Impala, Hive, Pig, MapReduce, Cascading, …
•offers both high compression and high scan efficiency
•co-developed by Twitter and Cloudera; hosted on GitHub and
soon to be an Apache Incubator project
•with contributors from Criteo, Stripe, Berkeley AMPlab,
LinkedIn
•used in production at Twitter and Criteo
Parquet: The Details
•columnar storage: column-major instead of the traditional
row-major layout; used by all high-end analytic DBMSs
•optimized storage of nested data structures: patterned
after Dremel’s ColumnIO format
•extensible set of column encodings:
•run-length and dictionary encodings in current version (1.2)
•delta and optimized string encodings in 2.0
•embedded statistics: version 2.0 stores inlined column
statistics for further optimization of scan efficiency
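
The dictionary and run-length encodings named above can be illustrated with a minimal sketch. This is not Parquet's on-disk format or its API; the function names and the sample column are hypothetical, and the point is only to show why a low-cardinality column compresses well once it is stored contiguously.

```python
from itertools import groupby

def dictionary_encode(values):
    """Map each distinct value to a small integer id, in first-seen order."""
    dictionary = {}
    ids = []
    for v in values:
        # setdefault assigns the next free id only when v is new
        ids.append(dictionary.setdefault(v, len(dictionary)))
    return list(dictionary), ids

def run_length_encode(ids):
    """Collapse consecutive repeats into (id, run_length) pairs."""
    return [(k, len(list(g))) for k, g in groupby(ids)]

def decode(dictionary, runs):
    """Invert both steps to recover the original column values."""
    return [dictionary[i] for i, n in runs for _ in range(n)]

# Hypothetical column with few distinct values
country = ["CH", "CH", "CH", "DE", "DE", "CH", "US", "US", "US", "US"]
dictionary, ids = dictionary_encode(country)
runs = run_length_encode(ids)
```

Ten strings shrink to a three-entry dictionary plus four runs; sorting or clustering the data before writing would lengthen the runs further.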
Parquet: Storage Efficiency
Parquet: Scan Efficiency
Impala: A Modern, Open-Source SQL Engine
•implementation of an MPP SQL query engine for the Hadoop
environment
•highest-performance SQL engine for the Hadoop ecosystem;

already outperforms some of its commercial competitors
•effective for EDW-style workloads
•maintains Hadoop flexibility by utilizing standard Hadoop
components (HDFS, HBase, Metastore, YARN)
•plays well with traditional BI tools:

exposes/interacts with industry-standard interfaces (ODBC/JDBC,
Kerberos and LDAP, ANSI SQL)
Impala: A Modern, Open-Source SQL Engine
•history:
•developed by Cloudera and fully open-source; hosted on
GitHub
•released as beta in 10/2012
•1.0 version available in 05/2013
•current version is 1.2.3, available for CDH4 and CDH5 beta
Impala from The User’s Perspective
•create tables as virtual views over data stored in HDFS
or HBase;

schema metadata is stored in Metastore (shared with
Hive, Pig, etc.; basis of HCatalog)
•connect via ODBC/JDBC; authenticate via Kerberos or
LDAP
•run standard SQL:
•current version: ANSI SQL-92 (limited to SELECT and bulk
insert) minus correlated subqueries, has UDFs and UDAs
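
Because correlated subqueries are excluded, such queries have to be expressed as joins. The sketch below shows one common rewrite. It uses Python's built-in sqlite3 purely as a stand-in engine so that it runs anywhere; the `orders` table and its data are hypothetical, and whether a particular Impala release accepts the rewritten form should be checked against its documentation. Sorting is done in Python rather than with ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 30.0), (2, 5.0), (2, 7.0), (2, 20.0);
""")

# Correlated form: the inner query refers to the outer row (o.customer_id).
correlated = """
    SELECT o.customer_id, o.amount
    FROM orders o
    WHERE o.amount > (SELECT AVG(i.amount) FROM orders i
                      WHERE i.customer_id = o.customer_id)
"""

# Join form: aggregate once in a derived table, then join back to the rows.
rewritten = """
    SELECT o.customer_id, o.amount
    FROM orders o
    JOIN (SELECT customer_id, AVG(amount) AS avg_amount
          FROM orders GROUP BY customer_id) a
      ON a.customer_id = o.customer_id
    WHERE o.amount > a.avg_amount
"""

rows_correlated = sorted(conn.execute(correlated).fetchall())
rows_rewritten = sorted(conn.execute(rewritten).fetchall())
```

Both queries return the orders that exceed their customer's average amount.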
Impala from The User’s Perspective
•2014 roadmap:
•1.3: admission control, Order By without Limit,
Decimal(<precision>, <scale>)
•1.4: analytic window functions
•2.0: support for nested types (structs, arrays, maps), UDTFs,
disk-based joins and aggregation
Impala Architecture
•distributed service:
•daemon process (impalad) runs on every node with data
•easily deployed with Cloudera Manager
•each node can handle user requests; load balancer
configuration for multi-user environments recommended
•query execution phases:
•client request arrives via ODBC/JDBC
•planner turns request into collection of plan fragments
•coordinator initiates execution on remote impalad nodes
Impala Query Execution
• Request arrives via ODBC/JDBC
Impala Query Execution
• Planner turns request into collection of plan fragments
• Coordinator initiates execution on remote impalad nodes
Impala Query Execution
• Intermediate results are streamed between impalad nodes
• Query results are streamed back to client
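
The execution flow on these slides can be approximated with a toy model. This is a sketch under stated assumptions, not Impala code: the node names, rows, and function names are hypothetical, the query modeled is a grouped sum, and the fragments run one after another here, whereas real fragments run in parallel on separate nodes.

```python
from collections import Counter

# Hypothetical data: each list stands for the rows local to one impalad node.
node_data = {
    "node1": [("CH", 3), ("DE", 1)],
    "node2": [("CH", 2), ("US", 5)],
    "node3": [("DE", 4), ("US", 1)],
}

def scan_fragment(rows):
    """Leaf fragment: stream local rows one at a time."""
    yield from rows

def partial_agg_fragment(rows):
    """Pre-aggregate on the node so that less data crosses the network."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    yield from totals.items()

def coordinator(streams):
    """Merge the partial results streamed in from every node."""
    final = Counter()
    for stream in streams:
        for key, value in stream:
            final[key] += value
    return dict(final)

result = coordinator(
    partial_agg_fragment(scan_fragment(rows)) for rows in node_data.values()
)
```

Generators stand in for the streaming of intermediate results: the coordinator pulls rows from each fragment as it consumes them.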
Impala Architecture: Query Planning
•2-phase process:
•single-node plan: left-deep tree of query operators
•partitioning into plan fragments for distributed parallel
execution:

maximize scan locality/minimize data movement, parallelize
all query operators
•cost-based join order optimization
•cost-based join distribution optimization
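
The slide does not spell out the cost model behind join distribution. The sketch below assumes a deliberately simplified one that counts only bytes sent over the network: a broadcast join ships the right-hand side to every node, while a partitioned join reshuffles both inputs once. The function name and the formulas are illustrative assumptions, not Impala's planner.

```python
def choose_join_distribution(lhs_bytes, rhs_bytes, num_nodes):
    """Pick the strategy that moves fewer bytes under the assumed model."""
    # Broadcast: every node receives a full copy of the right-hand side.
    broadcast_cost = rhs_bytes * num_nodes
    # Partitioned: both inputs are hash-partitioned across the cluster once.
    partitioned_cost = lhs_bytes + rhs_bytes
    if broadcast_cost <= partitioned_cost:
        return "broadcast", broadcast_cost
    return "partitioned", partitioned_cost

TB = 10**12
GB = 10**9

# Large fact table joined with a small dimension table on 20 nodes
small_rhs = choose_join_distribution(1 * TB, 1 * GB, 20)
# Two large tables on the same cluster
large_rhs = choose_join_distribution(1 * TB, 500 * GB, 20)
```

Under this model a small right-hand side favors broadcast, and two large inputs favor partitioning.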
Impala Architecture: Query Execution
•execution engine designed for efficiency, written from scratch
in C++; no reuse of decades-old open-source code
•circumvents MapReduce completely
•in-memory execution:
•aggregation results and right-hand side inputs of joins are
cached in memory
•example: join with 1TB table, reference 2 of 200 cols, 10% of
rows 

-> need to cache 1GB across all nodes in cluster

-> not a limitation for most workloads
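
The 1GB figure follows from simple arithmetic, reproduced below. The calculation assumes all 200 columns are equally wide and uses decimal units; the 20-node cluster size is a hypothetical value added only to show a per-node share.

```python
TB = 10**12
GB = 10**9

table_bytes = 1 * TB
referenced_cols, total_cols = 2, 200   # 2 of 200 columns are referenced
selected_pct = 10                      # 10% of rows survive the filter

# Integer arithmetic keeps the result exact.
column_bytes = table_bytes * referenced_cols // total_cols   # 1% of the table
cached_bytes = column_bytes * selected_pct // 100            # 10% of that

num_nodes = 20                         # hypothetical cluster size
per_node_bytes = cached_bytes // num_nodes
```

The join input to cache is 1GB in total, or 50MB per node if it is spread evenly over the assumed 20 nodes.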
Impala Architecture: Query Execution
•runtime code generation:
•uses LLVM to JIT-compile the runtime-intensive parts of a
query
•effect the same as custom-coding a query:
•remove branches
•propagate constants, offsets, pointers, etc.
•inline function calls
•optimized execution for modern CPUs (instruction pipelines)
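
The idea of query-specific code can be shown by analogy in Python. This is not how Impala generates LLVM IR; it is a sketch in which a generic interpreter dispatches on every operator for every row, while the generated function has the column offsets and constants written directly into its source. All names are hypothetical.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def interpret(predicates, row):
    """Generic path: look up and dispatch each operator for each row."""
    for column, op, constant in predicates:
        if not OPS[op](row[column], constant):
            return False
    return True

def generate(predicates):
    """Specialized path: emit source with offsets and constants inlined."""
    terms = []
    for column, op, constant in predicates:
        if op not in OPS:
            raise ValueError(f"unsupported operator: {op}")
        terms.append(f"row[{int(column)}] {op} {constant!r}")
    source = "def matches(row):\n    return " + (" and ".join(terms) or "True")
    namespace = {}
    exec(source, namespace)  # compile once per query, call once per row
    return namespace["matches"]

# Hypothetical predicate list: column 0 > 10 AND column 2 == 'CH'
predicates = [(0, ">", 10), (2, "==", "CH")]
matches = generate(predicates)
rows = [(11, "x", "CH"), (5, "y", "CH"), (42, "z", "DE")]
selected = [r for r in rows if matches(r)]
```

The generated function contains no loop over predicates and no operator lookup, which mirrors the "remove branches" and "propagate constants" points above.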
Impala Architecture: Query Execution
Impala vs MR for Analytic Workloads
•Impala vs. SQL-on-MR
•Impala 1.1.1/Hive 0.12 (“Stinger Phases 1 and 2”)
•file formats: Parquet/ORCFile
•TPC-DS, 3TB data set running on 5-node cluster
Impala vs MR for Analytic Workloads
• Impala speedup:
• interactive: 8-69x
• report: 6-68x
• deep analytics: 10-58x
Impala vs non-MR for Analytic Workloads
•Impala 1.2.3/Presto 0.6/Shark
•file formats: RCFile (+ Parquet)
•TPC-DS, 15TB data set running on 21-node cluster
Impala vs non-MR for Analytic Workloads
Impala vs non-MR for Analytic Workloads
• Multi-user benchmark:
• 10 users concurrently
• same dataset, same
hardware
• workload: queries from
“interactive” group
Impala vs non-MR for Analytic Workloads
Scalability in Hadoop
•Hadoop’s promise of linear scalability: add more
nodes to cluster, gain a proportional increase in
capabilities

-> adapt to any kind of workload changes simply by
adding more nodes to cluster
•scaling dimensions for EDW workloads:
•response time
•concurrency/query throughput
•data size
Scalability in Hadoop
•Scalability results for Impala:
•tests show linear scaling along all 3 dimensions
•setup:
•2 clusters: 18 and 36 nodes
•15TB TPC-DS data set
•6 “interactive” TPC-DS queries
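
One way to quantify "linear scaling" for the response-time dimension is to compare the measured speedup with the ideal speedup implied by the node counts. The helper below is a sketch; the latencies passed to it are made-up placeholders, not results from the talk.

```python
def scaling_efficiency(time_small, time_large, nodes_small=18, nodes_large=36):
    """Return measured speedup divided by ideal speedup (1.0 means linear)."""
    speedup = time_small / time_large
    ideal = nodes_large / nodes_small
    return speedup / ideal

# Hypothetical latencies in seconds for the same query on 18 and 36 nodes
linear = scaling_efficiency(12.0, 6.0)      # latency halves when nodes double
sublinear = scaling_efficiency(12.0, 8.0)   # latency drops by only a third
```

The same ratio can be applied to the other two dimensions by holding latency fixed and doubling the number of concurrent users or the data size along with the node count.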
Impala Scalability: Latency
Impala Scalability: Concurrency
• Comparison: 10 vs. 20 concurrent users
Impala Scalability: Data Size
• Comparison: 15TB vs. 30TB data set
Summary: Hadoop for Analytic Workloads
•Thesis of this talk:
•techniques and functionality of established commercial
solutions are either already available or are rapidly being
implemented in the Hadoop stack
•Impala/Parquet/HDFS is an effective solution for certain EDW
workloads
•Hadoop-based EDW solution maintains Hadoop’s strengths:
flexibility, ease of scaling, cost effectiveness
Summary: Hadoop for Analytic Workloads
•latest technological innovations add capabilities that
originated in high-end proprietary systems:
•high-performance disk scans and memory caching in HDFS
•Parquet: columnar storage for analytic workloads
•Impala: high-performance parallel SQL execution
Summary: Hadoop for Analytic Workloads
•Impala/Parquet/HDFS for EDW workloads:
•integrates into BI environment via standard connectivity and
security
•comparable or better performance than commercial
competitors
•some SQL limitations remain
•but those are rapidly diminishing
Summary: Hadoop for Analytic Workloads
•Impala/Parquet/HDFS maintains traditional Hadoop
strengths:
•flexibility: Parquet is understood across the platform, natively
processed by most popular frameworks
•demonstrated scalability and cost effectiveness
The End
Summary: Hadoop for Analytic Workloads
•what the future holds:
•further performance gains
•more complete SQL capabilities
•improved resource management and ability to handle multiple
concurrent workloads in a single cluster
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexitySwiss Big Data User Group
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceSwiss Big Data User Group
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketSwiss Big Data User Group
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridSwiss Big Data User Group
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseSwiss Big Data User Group
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computingSwiss Big Data User Group
 

Mais de Swiss Big Data User Group (20)

Making Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useMaking Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to use
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operator
 
Data Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2CData Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2C
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data Analysis
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companies
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Educating Data Scientists of the Future
Educating Data Scientists of the FutureEducating Data Scientists of the Future
Educating Data Scientists of the Future
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data Warehouse
 
Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexity
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density Choice
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maket
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC Datagrid
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph database
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computing
 
In-Store Analysis with Hadoop
In-Store Analysis with HadoopIn-Store Analysis with Hadoop
In-Store Analysis with Hadoop
 
Big Data Visualization With ParaView
Big Data Visualization With ParaViewBig Data Visualization With ParaView
Big Data Visualization With ParaView
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Oracle's BigData solutions
Oracle's BigData solutionsOracle's BigData solutions
Oracle's BigData solutions
 

Último

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Último (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Real-time SQL-on-Hadoop, Parquet and Impala for EDW Workloads

  • 1. Copyright © 2013 Cloudera Inc. All rights reserved.
      Hadoop Beyond Batch: Real-time Workloads, SQL-on-Hadoop, and the Virtual EDW
      Marcel Kornacker | marcel@cloudera.com
      April 2014
  • 2. Analytic Workloads on Hadoop: Where Do We Stand?
      “DeWitt Clause” prohibits using DBMS vendor name
  • 3. Hadoop for Analytic Workloads
      •Hadoop has traditionally been used for offline batch processing: ETL and ELT
      •Next step: Hadoop for traditional business intelligence (BI)/data warehouse (EDW) workloads:
        •interactive
        •concurrent users
      •Topic of this talk: a Hadoop-based open-source stack for EDW workloads:
        •HDFS: a high-performance storage system
        •Parquet: a state-of-the-art columnar storage format
        •Impala: a modern, open-source SQL engine for Hadoop
  • 4. Hadoop for Analytic Workloads
      •Thesis of this talk:
        •techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in the Hadoop stack
        •the Hadoop stack is an effective solution for certain EDW workloads
        •a Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness
  • 5. HDFS: A Storage System for Analytic Workloads
      •Available in HDFS today:
        •high-efficiency data scans at or near hardware speed, both from disk and memory
      •On the immediate roadmap:
        •co-partitioned tables for even faster distributed joins
        •temp-FS: write temp table data straight to memory, bypassing disk
  • 6. HDFS: The Details
      •High-efficiency data transfers
        •short-circuit reads: bypass DataNode protocol when reading from local disk
          -> read at 100+MB/s per disk
        •HDFS caching: access explicitly cached data w/o copy or checksumming
          -> access memory-resident data at memory bus speed
          -> enable in-memory processing
  • 7. HDFS: The Details
      •Coming attractions:
        •affinity groups: collocate blocks from different files
          -> create co-partitioned tables for improved join performance
        •temp-FS: write temp table data straight to memory, bypassing disk
          -> ideal for iterative interactive data analysis
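
The benefit of co-partitioned tables for joins can be sketched in a few lines. This is a toy, single-process illustration, not HDFS or Impala code: the function names and the sample tables are hypothetical. It assumes both tables are hash-partitioned on the join key with the same number of partitions, so that matching rows always land in the same partition and each partition can be joined locally without moving data between nodes.

```python
from collections import defaultdict

def hash_partition(rows, key_index, num_partitions):
    """Assign every row to a partition based on the hash of its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions

def local_hash_join(left_rows, right_rows, left_key, right_key):
    """Plain hash join over rows that already live in the same partition."""
    build = defaultdict(list)
    for r in right_rows:
        build[r[right_key]].append(r)
    return [l + r for l in left_rows for r in build.get(l[left_key], [])]

def co_partitioned_join(left, right, left_key, right_key, num_partitions):
    """Join partition i of the left table only with partition i of the right."""
    lp = hash_partition(left, left_key, num_partitions)
    rp = hash_partition(right, right_key, num_partitions)
    result = []
    for p in range(num_partitions):
        result.extend(
            local_hash_join(lp.get(p, []), rp.get(p, []), left_key, right_key)
        )
    return result

# Hypothetical sample data: (customer_id, item) and (customer_id, name).
orders = [(1, "a"), (2, "b"), (1, "c")]
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
joined = co_partitioned_join(orders, customers, 0, 0, num_partitions=4)
```

Because equal keys hash to the same partition number, no pair of matching rows is ever split across partitions, which is what lets each node work only on its local blocks.
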
  • 8. Parquet: Columnar Storage for Hadoop
      •What it is:
        •state-of-the-art, open-source columnar file format that’s available for (most) Hadoop processing frameworks: Impala, Hive, Pig, MapReduce, Cascading, …
        •offers both high compression and high scan efficiency
        •co-developed by Twitter and Cloudera; hosted on GitHub and soon to be an Apache incubator project
        •with contributors from Criteo, Stripe, Berkeley AMPLab, LinkedIn
        •used in production at Twitter and Criteo
  • 9. Parquet: The Details
      •columnar storage: column-major instead of the traditional row-major layout; used by all high-end analytic DBMSs
      •optimized storage of nested data structures: patterned after Dremel’s ColumnIO format
      •extensible set of column encodings:
        •run-length and dictionary encodings in current version (1.2)
        •delta and optimized string encodings in 2.0
      •embedded statistics: version 2.0 stores inlined column statistics for further optimization of scan efficiency
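
To make the column-major layout and the two encodings named above concrete, here is a minimal sketch. It is a toy illustration, not Parquet's on-disk format (which, among other things, bit-packs the encoded values); the function names and the sample rows are hypothetical.

```python
from itertools import groupby

def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []
    index_of = {}
    codes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse runs of equal codes into (code, run_length) pairs."""
    return [(code, len(list(run))) for code, run in groupby(codes)]

def decode(dictionary, runs):
    """Invert both encodings to recover the original column values."""
    return [dictionary[code] for code, n in runs for _ in range(n)]

# Row-major records, as a row-oriented format would store them.
rows = [("CH", 10), ("CH", 12), ("CH", 7), ("DE", 3), ("DE", 9), ("US", 4)]

# Column-major: all values of one column are stored contiguously,
# so repeated values sit next to each other and compress well.
country_column = [r[0] for r in rows]
dictionary, codes = dictionary_encode(country_column)
runs = run_length_encode(codes)
```

Six strings shrink to a three-entry dictionary plus three (code, length) pairs; a query that reads only this column never touches the other one.
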
  • 10. Parquet: Storage Efficiency
  • 11. Parquet: Scan Efficiency
  • 12. Impala: A Modern, Open-Source SQL Engine
      •implementation of an MPP SQL query engine for the Hadoop environment
      •highest-performance SQL engine for the Hadoop ecosystem; already outperforms some of its commercial competitors
      •effective for EDW-style workloads
      •maintains Hadoop flexibility by utilizing standard Hadoop components (HDFS, HBase, Metastore, YARN)
      •plays well with traditional BI tools: exposes/interacts with industry-standard interfaces (ODBC/JDBC, Kerberos and LDAP, ANSI SQL)
  • 13. Impala: A Modern, Open-Source SQL Engine
      •history:
        •developed by Cloudera and fully open-source; hosted on GitHub
        •released as beta in 10/2012
        •1.0 version available in 05/2013
        •current version is 1.2.3, available for CDH4 and CDH5 beta
  • 14. Impala from The User’s Perspective
      •create tables as virtual views over data stored in HDFS or HBase; schema metadata is stored in Metastore (shared with Hive, Pig, etc.; basis of HCatalog)
      •connect via ODBC/JDBC; authenticate via Kerberos or LDAP
      •run standard SQL:
        •current version: ANSI SQL-92 (limited to SELECT and bulk insert), without correlated subqueries; supports UDFs and UDAs
  • 15. Impala from The User’s Perspective
      •2014 roadmap:
        •1.3: admission control, ORDER BY without LIMIT, DECIMAL(<precision>, <scale>)
        •1.4: analytic window functions
        •2.0: support for nested types (structs, arrays, maps), UDTFs, disk-based joins and aggregation
  • 16. Impala Architecture
      •distributed service:
        •daemon process (impalad) runs on every node with data
        •easily deployed with Cloudera Manager
        •each node can handle user requests; load balancer configuration recommended for multi-user environments
      •query execution phases:
        •client request arrives via ODBC/JDBC
        •planner turns request into collection of plan fragments
        •coordinator initiates execution on remote impalad nodes
  • 17. Impala Query Execution
      •Request arrives via ODBC/JDBC
  • 18. Impala Query Execution
      •Planner turns request into collection of plan fragments
      •Coordinator initiates execution on remote impalad nodes
  • 19. Impala Query Execution
      •Intermediate results are streamed between impalad nodes
      •Query results are streamed back to client
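
The execution flow on these slides (fragments run next to the data, partial results flow back, the coordinator returns the final answer) can be sketched as a small simulation. This is a hedged, single-process illustration only: the function names, the node names, and the count-per-key query are hypothetical, and a sequential loop stands in for parallel execution on remote impalad nodes.

```python
from collections import Counter

def scan_fragment(local_rows, predicate):
    """Runs on one node: scan local data, filter, pre-aggregate a count per key."""
    partial = Counter()
    for key, value in local_rows:
        if predicate(value):
            partial[key] += 1
    return partial

def coordinator(node_data, predicate):
    """Run the fragment against each node's local data and merge the partials."""
    total = Counter()
    for local_rows in node_data.values():
        # In a real cluster this would be a remote, parallel call whose
        # partial result is streamed back to the coordinator.
        total.update(scan_fragment(local_rows, predicate))
    return dict(total)

# Hypothetical data: each node holds its own slice of a (key, value) table.
node_data = {
    "node1": [("a", 5), ("b", 1)],
    "node2": [("a", 7), ("a", 0), ("b", 9)],
}
# Roughly: SELECT key, COUNT(*) FROM t WHERE value > 0 GROUP BY key
result = coordinator(node_data, lambda v: v > 0)
```

Only the small per-node counters travel to the coordinator, not the scanned rows, which is the point of pushing scan and pre-aggregation into the fragments.
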
  • 20. Impala Architecture: Query Planning
      •2-phase process:
        •single-node plan: left-deep tree of query operators
        •partitioning into plan fragments for distributed parallel execution: maximize scan locality/minimize data movement, parallelize all query operators
      •cost-based join order optimization
      •cost-based join distribution optimization
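
A cost-based choice of join distribution can be illustrated with a deliberately simplified model. The sketch below is an assumption made for illustration, not Impala's planner code: it compares only estimated network transfer, charging a broadcast join for sending the right-hand input to every node and a partitioned join for reshuffling both inputs once. The function name and the sample sizes are hypothetical.

```python
def choose_join_distribution(left_bytes, right_bytes, num_nodes):
    """Pick the strategy with the lower estimated network transfer."""
    # Broadcast: every node receives a full copy of the right-hand input.
    broadcast_cost = right_bytes * num_nodes
    # Partitioned: both inputs are hash-redistributed across the cluster once.
    partitioned_cost = left_bytes + right_bytes
    if broadcast_cost <= partitioned_cost:
        return "broadcast", broadcast_cost
    return "partitioned", partitioned_cost

# Large fact table joined with a small dimension table on 20 nodes.
small_dim = choose_join_distribution(10**12, 10**8, 20)
# Two inputs of similar size on the same cluster.
similar = choose_join_distribution(10**10, 10**10, 20)
```

Under this model a small right-hand side favors broadcasting, while two large inputs favor partitioning, since copying a large table to every node would dominate the cost.
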
  • 21. Impala Architecture: Query Execution
      •execution engine designed for efficiency, written from scratch in C++; no reuse of decades-old open-source code
      •circumvents MapReduce completely
      •in-memory execution:
        •aggregation results and right-hand side inputs of joins are cached in memory
        •example: join with 1TB table, reference 2 of 200 cols, 10% of rows
          -> need to cache 1GB across all nodes in cluster
          -> not a limitation for most workloads
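
The 1GB figure in the example follows from simple arithmetic, worked through below. The sketch assumes decimal units (1TB = 10^12 bytes) and, as a simplification not stated on the slide, that all 200 columns are equally wide; the function name is hypothetical.

```python
def join_cache_bytes(table_bytes, cols_referenced, cols_total, row_percent):
    """Bytes of the join's right-hand input that must be held in memory.

    Assumes uniform column width, so referencing k of n columns reads k/n
    of the table, and that row_percent of the rows survive filtering.
    """
    return table_bytes * cols_referenced * row_percent // (cols_total * 100)

TB = 10**12
GB = 10**9

# 1TB table, 2 of 200 columns referenced, 10% of rows selected:
# 1TB * (2/200) = 10GB of column data, and 10% of that is 1GB.
needed = join_cache_bytes(1 * TB, 2, 200, 10)
```
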
  • 22. Impala Architecture: Query Execution
      •runtime code generation:
        •uses LLVM to JIT-compile the runtime-intensive parts of a query
        •same effect as custom-coding a query:
          •remove branches
          •propagate constants, offsets, pointers, etc.
          •inline function calls
        •optimized execution for modern CPUs (instruction pipelines)
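
The idea behind runtime code generation can be shown by analogy in Python. This sketch is not how Impala does it (Impala generates native code with LLVM); it uses Python's built-in `exec` to produce a filter specialized for one query, with the operator, column offset, and constant baked into the generated source. All function names and the sample rows are hypothetical.

```python
def interpret_filter(rows, col_index, op, constant):
    """Generic path: re-examines the operator for every single row."""
    out = []
    for row in rows:
        value = row[col_index]
        if op == "<":
            keep = value < constant
        elif op == ">":
            keep = value > constant
        elif op == "==":
            keep = value == constant
        else:
            raise ValueError(op)
        if keep:
            out.append(row)
    return out

def generate_filter(col_index, op, constant):
    """Specialized path: emit source with offset, operator and constant inlined."""
    if op not in ("<", ">", "=="):
        raise ValueError(op)
    source = (
        "def specialized(rows):\n"
        f"    return [row for row in rows if row[{col_index}] {op} {constant!r}]\n"
    )
    namespace = {}
    exec(source, namespace)  # compile the query-specific function once
    return namespace["specialized"], source

rows = [(1, 5), (2, 50), (3, 500)]
fast_filter, generated_source = generate_filter(1, ">", 10)
fast_result = fast_filter(rows)
slow_result = interpret_filter(rows, 1, ">", 10)
```

The generated function contains no operator dispatch at all: the per-row branches of the generic version are resolved once, at generation time, which mirrors the "remove branches" and "propagate constants" points above.
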
  • 23. Impala Architecture: Query Execution
  • 24. Impala vs MR for Analytic Workloads
      •Impala vs. SQL-on-MR
        •Impala 1.1.1/Hive 0.12 (“Stinger Phases 1 and 2”)
        •file formats: Parquet/ORCFile
        •TPC-DS, 3TB data set running on 5-node cluster
  • 25. Impala vs MR for Analytic Workloads
      •Impala speedup:
        •interactive: 8-69x
        •report: 6-68x
        •deep analytics: 10-58x
  • 26. Impala vs non-MR for Analytic Workloads
      •Impala 1.2.3/Presto 0.6/Shark
      •file formats: RCFile (+ Parquet)
      •TPC-DS, 15TB data set running on 21-node cluster
  • 27. Impala vs non-MR for Analytic Workloads
  • 28. Impala vs non-MR for Analytic Workloads
      •Multi-user benchmark:
        •10 users concurrently
        •same dataset, same hardware
        •workload: queries from “interactive” group
  • 29. Impala vs non-MR for Analytic Workloads
  • 30. Scalability in Hadoop
      •Hadoop’s promise of linear scalability: add more nodes to cluster, gain a proportional increase in capabilities
        -> adapt to any kind of workload changes simply by adding more nodes to cluster
      •scaling dimensions for EDW workloads:
        •response time
        •concurrency/query throughput
        •data size
  • 31. Scalability in Hadoop
      •Scalability results for Impala:
        •tests show linear scaling along all 3 dimensions
        •setup:
          •2 clusters: 18 and 36 nodes
          •15TB TPC-DS data set
          •6 “interactive” TPC-DS queries
  • 32. Impala Scalability: Latency
  • 33. Impala Scalability: Concurrency
      •Comparison: 10 vs. 20 concurrent users
  • 34. Impala Scalability: Data Size
      •Comparison: 15TB vs. 30TB data set
  • 35. Summary: Hadoop for Analytic Workloads
      •Thesis of this talk:
        •techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in the Hadoop stack
        •Impala/Parquet/HDFS is an effective solution for certain EDW workloads
        •a Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness
  • 36. Summary: Hadoop for Analytic Workloads
      •latest technological innovations add capabilities that originated in high-end proprietary systems:
        •high-performance disk scans and memory caching in HDFS
        •Parquet: columnar storage for analytic workloads
        •Impala: high-performance parallel SQL execution
  • 37. Summary: Hadoop for Analytic Workloads
      •Impala/Parquet/HDFS for EDW workloads:
        •integrates into BI environment via standard connectivity and security
        •comparable or better performance than commercial competitors
        •SQL support is currently still limited
        •but those limitations are rapidly diminishing
  • 38. Summary: Hadoop for Analytic Workloads
      •Impala/Parquet/HDFS maintains traditional Hadoop strengths:
        •flexibility: Parquet is understood across the platform, natively processed by most popular frameworks
        •demonstrated scalability and cost effectiveness
  • 40. Summary: Hadoop for Analytic Workloads
      •what the future holds:
        •further performance gains
        •more complete SQL capabilities
        •improved resource management and ability to handle multiple concurrent workloads in a single cluster