SlideShare a Scribd company logo
1 of 40
Data cubes in Apache Hive
Amareshwari Sriramadasu
Jaideep Dhok
Amareshwari
Sriramadasu
• Engineer at Inmobi
• Working in Hadoop and eco
systems since 2007
• Apache Hadoop – PMC
• Apache Hive– Committer
Jaideep
Dhok
• Engineer at Inmobi
• Working in distributed
systems and Hadoop eco
system since 2007
• Apache Hive and Hadoop
Contributor
Background
Why Apache Hive
Data cubes in Apache Hive
Queries on data cubes with examples
Grill
Agenda
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
Global Mobile technology company enabling
- Developers & Publishers to monetize
- Advertisers to engage and acquire users
@ Scale
About InMobi
Digital advertising – Intro
Courtesy: http://www.liesdamnedlies.com/
Owns & Sells
Real estate
on digital
inventory
Has reach to
users
Wants to
target Users
Brings money
Market place
Consumer
Hadoop @ InMobi – Factual Reporting & Analytics
 130 TB Hadoop warehouse
 5 TB SQL warehouse
Users
• Advertisers and publishers
• Regional officers
• Account managers
• Executive team
• Business analysts
• Product analysts
• Data scientists
• Engineering systems
• Developers
Use cases
Sharing reports/data with customer (account level)
Understand trends through data & exploration (analysis)
Debugging / Postmortem of issues (troubleshooting)
Sizing & Estimation (Ex: inventory, reach)
Summary of Product lines, Geographies, Network (Ex: Rev by Geo)
Sales/Revenue Targets vs Actuals
Tracking campaign performance
Tracking any metrics on REAL- TIME basis
Use cases
Categorize the use cases
• Batch queries
• Adhoc queries
• Interactive queries
• Scheduled reports
• Infer insights through ML algorithms
Use cases
Adhoc querying system Dashboard system
Customer facing system
Analytics systems at Inmobi
• Disparate user experience
• Disparate data storage systems causing inability to scale
• Schema management
• Not leveraging community around
Problems
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
Associates structure to data
Provides Metastore and catalog
service – Hcatalog
Provides pluggable storage
interface
Accepts SQL like batch queries
HQL is widely adopted
language by systems like
Shark, Impala
Has strong apache community
Data warehouse features like
facts, dimensions
Logical table associated with
multiple physical storages
Interactive queries
Scheduling queries
UI
WhatdoesHiveprovide
WhatismissinginHive
Apache Hive to the rescue
Background
Why Apache Hive
Data cubes in Apache Hive
Queries on data cubes with examples
Grill
Agenda
Data Layout – Fact data
Aggrk : measures (mak
<= ma(k-1)), dimensions
(dak < da(k-1))
…..
Aggr2 : measures (ma2 <= ma1),
dimensions (da2 < da1)
Aggr1 : measures (ma1 <= mr), dimensions (da1
< dr)
Raw data : measures (mr), dimensions(dr)
Other side of pyramid is aggregated at timed dimension
Data Layout – snowflake
Aggrk : measures
(mak <= ma(k-1)),
dimensions (dak <
da(k-1))
…..
Aggr2 : measures (ma2 <= ma1),
dimensions (da2 < da1)
Aggr1 : measures (ma1 <= mr), dimensions
(da1 < dr)
Raw data : measures (mr), dimensions(dr)
Dim2_1
Dim3
Dim2
Dim4_1
Dim4
Dim1
Data Model
Cube Storage Fact Table
Physical
Fact tables
Dimension
Table
Physical
Dimension
tables
Data Model - Cube
Dimension
• Simple Dimension
• Referenced Dimension
• Hierarchical Dimension
• Expression Dimension
• Timed dimension
Measure
• Column Measure
• Expression Measure
Cube
Measures Dimensions
Note : Some of the concepts are borrowed from
http://community.pentaho.com/projects/mondrian/
Data Model – Storage
Storage
Name
End point
Properties
Ex : ProdCluster,
StagingCluster, Postgres1,
HBase1, HBase2
Data Model – Fact Table
Fact
table
Cube
Fact
table
Storage
Fact Table
Columns
Cube that it belongs
Storages on which it is
present and the associated
update periods
Data Model – Dimension table
Dimension Table
Columns
Dimension references
Storages on which it is present
and associated snapshot dump
period, if any.
Cube
Dimension
table
Dimension
table
Dimension
table
Storage
Data Model – Storage tables and partitions
Storage table
Belongs to fact/dimension
Associated storage
descriptor
Partitioned by columns
Naming convention –
storage name followed by
fact/dimension name
Partition can override its
storage descriptor
• Fact storage table
Fact table
• Dimension storage table
Dimension table
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
CUBE SELECT [DISTINCT] select_expr,
select_expr, ...
FROM cube_table_reference
WHERE [where_condition AND]
TIME_RANGE_IN(colName , from, to)
[GROUP BY col_list]
[HAVING having_expr]
[ORDER BY colList]
[LIMIT number]
cube_table_reference:
cube_table_factor
| join_table
join_table:
cube_table_reference JOIN cube_table_factor
[join_condition]
| cube_table_reference {LEFT|RIGHT|FULL} [OUTER]
JOIN cube_table_reference [join_condition]
cube_table_factor:
cube_name [alias]
| ( cube_table_reference )
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
colOrder: ( ASC | DESC )
colList : colName colOrder? (',' colName colOrder?)*
Queries on Data cubes
Resolve candidate tables and storages
Automatically resolve joins
Resolve aggregates and groupby expressions
Allows SQL over Cube QL
Queries can span multiple storages
Accepts multi time range queries
All Hive QL features
Query features
• SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable
citytable LIMIT 100
• SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable
citytable WHERE (citytable.dt = 'latest') LIMIT 100
cube select name, stateid from citytable limit 100
Example query
Example query
• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER
JOIN c1_citytable citytable ON ((testcube.cityid)= (citytable.id)) WHERE ((
testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-
10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR
(testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-
10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR
(testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-
10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR
(testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-
10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR
(testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-
00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (citytable.dt
= 'latest')
GROUP BY(citytable.name)
cube select citytable.name, msr2 from testcube where
timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)
Available in Hive
• Data warehouse features
like facts, dimensions
• Logical table associated
with multiple physical
storages
Available in Grill
• Pluggable execution engine for
HQL
• Query history, caching
• Scheduling queries
Where is it available
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
Unified analytics platform at Inmobi
• Supports multiple execution engines
• Supports multiple storages
Provides analytics on the system
Provides query history
Grill
Grill Architecture
Implements an interface
• execute
• explain
• executeAsynchronously
• fetchResults
• Specify all storages it can support
Pluggable execution engine
Cube QL query
Rewrite query for available execution engine’s
supported storages
Get cost of the rewritten query from each
execution engine
Pick up execution engine with least cost and
fire the query
Cube query with multiple execution engines
Grill Server
Grill server – code status
• Number of queries - 700 to 900 per day
• Number of dimension tables - 125
• Number of fact tables – 24
• Number cubes – 15
• Size of the data
• Total size – 136 TB
• Dimension data – 400 MB compressed per hour
• Raw data - 1.2 TB per day
• Aggregated facts- 53GB per day
Data ware house statistics
Jira reference
• https://issues.apache.org/jira/browse/HIVE-4115
Github source for Hive
• https://github.com/InMobi/hive
Github source for grill
• https://github.com/InMobi/grill
Documentation
• http://inmobi.github.io/grill
References
Thank you!
• amareshwari@apache.org
• jaideep.dhok@inmobi.com

More Related Content

What's hot

What's hot (20)

Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)
 
Apache Kylin Streaming
Apache Kylin Streaming Apache Kylin Streaming
Apache Kylin Streaming
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 Updates
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Kylin olap part 1- getting started
Kylin olap   part 1- getting startedKylin olap   part 1- getting started
Kylin olap part 1- getting started
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Big Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache KylinBig Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache Kylin
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke Han
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Apache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and JapanApache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and Japan
 
Accelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinAccelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache Kylin
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Apache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingApache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 Beijing
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
 
Working with the Scalding Type -Safe API
Working with the Scalding Type -Safe APIWorking with the Scalding Type -Safe API
Working with the Scalding Type -Safe API
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 

Viewers also liked

User-Defined Table Generating Functions
User-Defined Table Generating FunctionsUser-Defined Table Generating Functions
User-Defined Table Generating Functions
pauly1
 
Integrate Hive and R
Integrate Hive and RIntegrate Hive and R
Integrate Hive and R
JunHo Cho
 

Viewers also liked (12)

Hive introduction 介绍
Hive  introduction 介绍Hive  introduction 介绍
Hive introduction 介绍
 
User-Defined Table Generating Functions
User-Defined Table Generating FunctionsUser-Defined Table Generating Functions
User-Defined Table Generating Functions
 
Advanced topics in hive
Advanced topics in hiveAdvanced topics in hive
Advanced topics in hive
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering Principles
 
Apache Hive Hook
Apache Hive HookApache Hive Hook
Apache Hive Hook
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Replacing Telco DB/DW to Hadoop and Hive
Replacing Telco DB/DW to Hadoop and HiveReplacing Telco DB/DW to Hadoop and Hive
Replacing Telco DB/DW to Hadoop and Hive
 
Integrate Hive and R
Integrate Hive and RIntegrate Hive and R
Integrate Hive and R
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 

Similar to Datacubes in Apache Hive at ApacheCon

HQL over Tiered Data Warehouse
HQL over Tiered Data WarehouseHQL over Tiered Data Warehouse
HQL over Tiered Data Warehouse
DataWorks Summit
 
Fifth elephant-grill
Fifth elephant-grillFifth elephant-grill
Fifth elephant-grill
amarsri
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud conf
amarsri
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
DataWorks Summit
 

Similar to Datacubes in Apache Hive at ApacheCon (20)

HQL over Tiered Data Warehouse
HQL over Tiered Data WarehouseHQL over Tiered Data Warehouse
HQL over Tiered Data Warehouse
 
Grill at HadoopSummit
Grill at HadoopSummitGrill at HadoopSummit
Grill at HadoopSummit
 
Fifth elephant-grill
Fifth elephant-grillFifth elephant-grill
Fifth elephant-grill
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud conf
 
ASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH dataASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH data
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of GruterBig Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
How we evolved data pipeline at Celtra and what we learned along the way
How we evolved data pipeline at Celtra and what we learned along the wayHow we evolved data pipeline at Celtra and what we learned along the way
How we evolved data pipeline at Celtra and what we learned along the way
 
Munich March 2015 - Cassandra + Spark Overview
Munich March 2015 -  Cassandra + Spark OverviewMunich March 2015 -  Cassandra + Spark Overview
Munich March 2015 - Cassandra + Spark Overview
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 

Recently uploaded

Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 

Recently uploaded (20)

VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 

Datacubes in Apache Hive at ApacheCon

  • 1. Data cubes in Apache Hive Amareshwari Sriramadasu Jaideep Dhok
  • 2. Amareshwari Sriramadasu • Engineer at Inmobi • Working in Hadoop and eco systems since 2007 • Apache Hadoop – PMC • Apache Hive– Committer
  • 3. Jaideep Dhok • Engineer at Inmobi • Working in distributed systems and Hadoop eco system since 2007 • Apache Hive and Hadoop Contributor
  • 4. Background Why Apache Hive Data cubes in Apache Hive Queries on data cubes with examples Grill Agenda
  • 5. Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
  • 6. Global Mobile technology company enabling - Developers & Publishers to monetize - Advertisers to engage and acquire users @ Scale About InMobi
  • 7. Digital advertising – Intro Courtesy: http://www.liesdamnedlies.com/ Owns & Sells Real estate on digital inventory Has reach to users Wants to target Users Brings money Market place Consumer
  • 8. Hadoop @ InMobi – Factual Reporting & Analytics  130 TB Hadoop warehouse  5 TB SQL warehouse
  • 9. Users • Advertisers and publishers • Regional officers • Account managers • Executive team • Business analysts • Product analysts • Data scientists • Engineering systems • Developers Use cases
  • 10. Sharing reports/data with customer (account level) Understand trends through data & exploration (analysis) Debugging / Postmortem of issues (troubleshooting) Sizing & Estimation (Ex: inventory, reach) Summary of Product lines, Geographies, Network (Ex: Rev by Geo) Sales/Revenue Targets vs Actuals Tracking campaign performance Tracking any metrics on REAL- TIME basis Use cases
  • 11. Categorize the use cases • Batch queries • Adhoc queries • Interactive queries • Scheduled reports • Infer insights through ML algorithms Use cases
  • 12. Adhoc querying system Dashboard system Customer facing system Analytics systems at Inmobi
  • 13. • Disparate user experience • Disparate data storage systems causing inability to scale • Schema management • Not leveraging community around Problems
  • 14. Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
  • 15. Associates structure to data Provides Metastore and catalog service – Hcatalog Provides pluggable storage interface Accepts SQL like batch queries HQL is widely adopted language by systems like Shark, Impala Has strong apache community Data warehouse features like facts, dimensions Logical table associated with multiple physical storages Interactive queries Scheduling queries UI WhatdoesHiveprovide WhatismissinginHive Apache Hive to the rescue
  • 16. Background Why Apache Hive Data cubes in Apache Hive Queries on data cubes with examples Grill Agenda
  • 17. Data Layout – Fact data Aggrk : measures (mak <= ma(k-1)), dimensions (dak < da(k-1)) ….. Aggr2 : measures (ma2 <= ma1), dimensions (da2 < da1) Aggr1 : measures (ma1 <= mr), dimensions (da1 < dr) Raw data : measures (mr), dimensions(dr) Other side of pyramid is aggregated at timed dimension
  • 18. Data Layout – snowflake Aggrk : measures (mak <= ma(k-1)), dimensions (dak < da(k-1)) ….. Aggr2 : measures (ma2 <= ma1), dimensions (da2 < da1) Aggr1 : measures (ma1 <= mr), dimensions (da1 < dr) Raw data : measures (mr), dimensions(dr) Dim2_1 Dim3 Dim2 Dim4_1 Dim4 Dim1
  • 19. Data Model Cube Storage Fact Table Physical Fact tables Dimension Table Physical Dimension tables
  • 20. Data Model - Cube Dimension • Simple Dimension • Referenced Dimension • Hierarchical Dimension • Expression Dimension • Timed dimension Measure • Column Measure • Expression Measure Cube Measures Dimensions Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/
  • 21. Data Model – Storage Storage Name End point Properties Ex : ProdCluster, StagingCluster, Postgres1, HBase1, HBase2
  • 22. Data Model – Fact Table Fact table Cube Fact table Storage Fact Table Columns Cube that it belongs Storages on which it is present and the associated update periods
  • 23. Data Model – Dimension table Dimension Table Columns Dimension references Storages on which it is present and associated snapshot dump period, if any. Cube Dimension table Dimension table Dimension table Storage
  • 24. Data Model – Storage tables and partitions Storage table Belongs to fact/dimension Associated storage descriptor Partitioned by columns Naming convention – storage name followed by fact/dimension name Partition can override its storage descriptor • Fact storage table Fact table • Dimension storage table Dimension table
  • 25. Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
  • 26. CUBE SELECT [DISTINCT] select_expr, select_expr, ... FROM cube_table_reference WHERE [where_condition AND] TIME_RANGE_IN(colName , from, to) [GROUP BY col_list] [HAVING having_expr] [ORDER BY colList] [LIMIT number] cube_table_reference: cube_table_factor | join_table join_table: cube_table_reference JOIN cube_table_factor [join_condition] | cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition] cube_table_factor: cube_name [alias] | ( cube_table_reference ) join_condition: ON equality_expression ( AND equality_expression )* equality_expression: expression = expression colOrder: ( ASC | DESC ) colList : colName colOrder? (',' colName colOrder?)* Queries on Data cubes
  • 27. Resolve candidate tables and storages Automatically resolve joins Resolve aggregates and groupby expressions Allows SQL over Cube QL Queries can span multiple storages Accepts multi time range queries All Hive QL features Query features
  • 28. • SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable citytable LIMIT 100 • SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100 cube select name, stateid from citytable limit 100 Example query
  • 29. Example query • SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable citytable ON ((testcube.cityid)= (citytable.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03- 10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03- 10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03- 10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03- 10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12- 00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (citytable.dt = 'latest') GROUP BY(citytable.name) cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)
  • 30. Available in Hive • Data warehouse features like facts, dimensions • Logical table associated with multiple physical storages Available in Grill • Pluggable execution engine for HQL • Query history, caching • Scheduling queries Where is it available
  • 31. Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
  • 32. Unified analytics platform at Inmobi • Supports multiple execution engines • Supports multiple storages Provides analytics on the system Provides query history Grill
  • 34. Implements an interface • execute • explain • executeAsynchronously • fetchResults • Specify all storages it can support Pluggable execution engine
  • 35. Cube QL query Rewrite query for available execution engine’s supported storages Get cost of the rewritten query from each execution engine Pick up execution engine with least cost and fire the query Cube query with multiple execution engines
  • 37. Grill server – code status
  • 38. • Number of queries - 700 to 900 per day • Number of dimension tables - 125 • Number of fact tables – 24 • Number cubes – 15 • Size of the data • Total size – 136 TB • Dimension data – 400 MB compressed per hour • Raw data - 1.2 TB per day • Aggregated facts- 53GB per day Data ware house statistics
  • 39. Jira reference • https://issues.apache.org/jira/browse/HIVE-4115 Github source for Hive • https://github.com/InMobi/hive Github source for grill • https://github.com/InMobi/grill Documentation • http://inmobi.github.io/grill References
  • 40. Thank you! • amareshwari@apache.org • jaideep.dhok@inmobi.com