SlideShare uma empresa Scribd logo
1 de 25
Page1 © Hortonworks Inc. 2014
Discardable, In-Memory
Materialized Query for Hadoop
Julian Hyde Julian Hyde
June 3rd, 2014
Page2 © Hortonworks Inc. 2014
About me
Julian Hyde
Architect at Hortonworks
Open source:
• Founder & lead, Apache Optiq (query optimization framework)
• Founder & lead, Pentaho Mondrian (analysis engine)
• Committer, Apache Drill
• Contributor, Apache Hive
• Contributor, Cascading Lingual (SQL interface to Cascading)
Past:
• SQLstream (streaming SQL)
• Broadbase (data warehouse)
• Oracle (SQL kernel development)
Page3 © Hortonworks Inc. 2014
Before we get started…
The bad news
• This software is not available to download
• I am a database bigot
• I am a BI bigot
The good news
• I believe in Hadoop
• Now is a great time to discuss where Hadoop is going
Page4 © Hortonworks Inc. 2014
Hadoop today
Brute force
Hadoop brings a lot of CPU, disk, IO
Yarn, Tez, Vectorization are making Hadoop faster
How to use that brute force is left to the application
Business Intelligence
Best practice is to pull data out of Hadoop
• Populate enterprise data warehouse
• In-memory analytics
• Custom analytics, e.g. Lambda architecture
Ineffective use of memory
Opportunity to make Hadoop smarter
Page5 © Hortonworks Inc. 2014
Brute force + diplomacy
Page6 © Hortonworks Inc. 2014
Hardware trends
Typical Hadoop server configuration
Lots of memory - but not enough for all data
More heterogeneous mix (disk + SSD + memory)
What to do about memory?
Year 2009 2014
Cores 4 – 8 24
Memory 8 GB 128 GB
SSD None 1 TB
Disk 4 x 1 TB 12 x 4 TB
Disk : memory 512 : 1 384 : 1
Page7 © Hortonworks Inc. 2014
Dumb use of memory - Buffer cache
Example:
50 TB data
1 TB memory
Full-table scan
Analogous to virtual
memory
• Great while it works, but…
• Table scan nukes the cache!
Table (disk)
Buffer cache
(memory)
Table scan
Query
operators
(Tez, or
whatever)
Pre-fetch
Page8 © Hortonworks Inc. 2014
“Document-oriented” analysis
Operate on working sets small enough to fit in memory
Analogous to working on a document (e.g. a spreadsheet)
• Works well for problems that fit into memory (e.g. some machine-learning algorithms)
• If your problem grows, you’re out of luck
Working set
(memory)
Table (disk)
Interactive user
Page9 © Hortonworks Inc. 2014
Smarter use of memory - Materialized queries
In-memory
materialized
queries
Tables
on disk
Page10 © Hortonworks Inc. 2014
Census data
Census table – 300M records
Which state has the most
males?
Brute force – read 150M
records
Smarter – read 50 records
CREATE TABLE Census (id, gender,
zipcode, state, age);
SELECT state, COUNT(*) AS c
FROM Census
WHERE gender = ‘M’
GROUP BY state
ORDER BY c DESC LIMIT 1;
CREATE TABLE CensusSummary AS
SELECT state, gender, COUNT(*) AS c
FROM Census
GROUP BY state, gender;
SELECT state, c
FROM CensusSummary
WHERE gender = ‘M’
ORDER BY c DESC LIMIT 1;
Page11 © Hortonworks Inc. 2014
Materialized view – Automatic smartness
A materialized view is a
table that is declared to be
identical to a given query
Optimizer rewrites query
on “Census” to use
“CensusSummary” instead
Even smarter – read 50
records
CREATE MATERIALIZED VIEW CensusSummary
STORAGE (MEMORY, DISCARDABLE) AS
SELECT state, gender, COUNT(*) AS c
FROM Census
GROUP BY state, gender;
SELECT state, COUNT(*) AS c
FROM Census
WHERE gender = ‘M’
GROUP BY state
ORDER BY c DESC LIMIT 1;
Page12 © Hortonworks Inc. 2014
Indistinguishable from magic
Run query #1
System creates
materialized view in
background
Related query #2 runs
faster
SELECT state, COUNT(*) AS c
FROM Census
WHERE gender = ‘M’
GROUP BY state
ORDER BY c DESC LIMIT 1;
CREATE MATERIALIZED VIEW CensusSummary
STORAGE (MEMORY, DISCARDABLE) AS
SELECT state, gender, COUNT(*) AS c
FROM Census
GROUP BY state, gender;
SELECT state,
COUNT(NULLIF(gender, ‘F’)) AS males,
COUNT(NULLIF(gender, ‘M’)) AS females
FROM Census
GROUP BY state
HAVING females > males;
Page13 © Hortonworks Inc. 2014
Materialized views - Classic
Classic materialized view
(Oracle, DB2, Teradata, MSSql)
1. A table defined using a SQL query
2. Designed by DBA
3. Storage same as a regular table
1. On disk
2. Can define indexes
4. DB populates the table
5. Queries are rewritten to use the
table**
6. DB updates the table to reflect
changes to source data (usually
deferred)*
*Magic required
SELECT t.year, AVG(s.units)
FROM SalesFact AS s
JOIN TimeDim AS t USING (timeId)
GROUP BY t.year;
CREATE MATERIALIZED VIEW SalesMonthZip AS
SELECT t.year, t.month,
c.state, c.zipcode,
COUNT(*), SUM(s.units), SUM(s.price)
FROM SalesFact AS s
JOIN TimeDim AS t USING (timeId)
JOIN CustomerDim AS c USING (customerId)
GROUP BY t.year, t.month,
c.state, c.zipcode;
Page14 © Hortonworks Inc. 2014
Materialized views - DIMMQ
DIMMQs - Discardable, In-memory Materialized Queries
Differences with classic materialized views
1. May be in-memory
2. HDFS may discard – based on DDM (Distributed Discardable Memory)
3. Lifecycle support:
1. Assume table is populated
2. Don’t populate & maintain
3. User can flag as valid, invalid, or change definition (e.g. date range)
4. HDFS may discard
4. More design options:
1. DBA specifies
2. Retain query results (or partial results)
3. An agent builds MVs based on query traffic
Page15 © Hortonworks Inc. 2014
DIMMQ compared to Spark RDDs
It’s not “either / or”
• Spark-on-YARN, SQL-on-Spark already exist; Cascading-on-Spark common soon
• Hive-queries-on-RDDs, DIMMQs populated by Spark, Spark-on-Hive-tables are possible
Spark RDD DIMMQ
Access By reference By algebraic expression
Sharing Within session Across sessions
Execution model Built-in External
Recovery Yes No
Discard Failure or GC Failure or cost-based
Native organization Memory Disk
Language Scala (or other JVM
language)
Language-independent
Page16 © Hortonworks Inc. 2014
Data independence
This is not just about SQL standards compliance!
Materialized views are supposed to be transparent in creation, maintenance and use.
If not one DBA ever types “CREATE MATERIALIZED VIEW”, we have still succeeded
Data independence
Ability to move data around and not tell your application
Replicas
Redundant copies
Moving between disk and memory
Sort order, projections (à la Vertica), aggregates (à la Microstrategy)
Indexes, and other weird data structures
Page17 © Hortonworks Inc. 2014
Implementing DIMMQs
Relational algebra
Apache Optiq (just entered incubator – yay!)
Algebra, rewrite rules, cost model
Metadata
Hive: “CREATE MATERIALIZED VIEW”
Definitions of materialized views in HCatalog
HDFS - Discardable Distributed Memory (DDM)
Off-heap data in memory-mapped files
Discard policy
Build in-memory, replicate to disk; or vice versa
Central namespace
Evolution of existing components
Page18 © Hortonworks Inc. 2014
Tiled queries in distributed memory
Query: SELECT x, SUM(y) FROM t GROUP BY x
In-memory
materialized
queries
Tables
on disk
Page19 © Hortonworks Inc. 2014
An adaptive system
Ongoing activities:
• Agent suggests new MVs
• MVs are built in background
• Ongoing query activity uses MVs
• User marks MVs as invalid due to source data
changes
• HDFS throws out MVs that are not pulling their
weight
Dynamic equilibrium
DIMMQs continually created & destroyed
System moves data around to adapt to
changing usage patterns
Page20 © Hortonworks Inc. 2014
Lambda architecture
From “Runaway complexity in Big Data and a plan to stop it” (Nathan Marz)
Page21 © Hortonworks Inc. 2014
Lambda architecture in Hadoop via DIMMQs
Use DIMMQs for materialized historic & streaming data
MV on disk
MV in key-
value store
Hadoop
Query via
SQL
Page22 © Hortonworks Inc. 2014
Variations on a theme
Materialized queries don’t have to be in memory
Materialized queries don’t need to be discardable
Materialized queries don’t need to be accessed via SQL
Materialized queries allow novel data structures to be
described
Maintaining materialized queries - build on Hive ACID
Fine-grained invalidation
Streaming into DIMMQs
In-memory tables don’t have to be materialized queries
Data aging – Older data to cheaper storage
Discardable
In-memory
Algebraic
Page23 © Hortonworks Inc. 2014
Lattice
Space of possible materialized views
A star schema, with mandatory many-to-one
relationships
Each view is a projected, filtered
aggregation
• Sales by zipcode and quarter in 2013
• Sales by state in Q1, 2012
Lattice gathers stats
• “I used MV m to answer query q and avoided
fetching r rows”
• Cost of MV = construction effort + memory * time
• Utility of MV = query processing effort saved
Recommends & builds optimal MVs
CREATE LATTICE SalesStar AS
SELECT *
FROM SalesFact AS s
JOIN TimeDim AS t USING (timeId)
JOIN CustomerDim AS c USING (customerId);
Page24 © Hortonworks Inc. 2014
Conclusion
Broadening the tent – batch, interactive BI, streaming, iterative
Declarative, algebraic, transparent
Seamless movement between memory and disk
Adding brains to complement Hadoop’s brawn
Page25 © Hortonworks Inc. 2014
Thank you!
My next talk:
“Cost-based query optimization in Hive”
4:35pm Wednesday
@julianhyde
http://hortonworks.com/blog/dmmq/
http://hortonworks.com/blog/ddm/
http://incubator.apache.org/projects/optiq.html

Mais conteúdo relacionado

Mais procurados

Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Microsoft SQL Server Data Warehouses for SQL Server DBAs
Microsoft SQL Server Data Warehouses for SQL Server DBAsMicrosoft SQL Server Data Warehouses for SQL Server DBAs
Microsoft SQL Server Data Warehouses for SQL Server DBAsMark Kromer
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQueryCsaba Toth
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft PlatformAndrew Brust
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLCloudera, Inc.
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedAnant Kumar
 

Mais procurados (19)

Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Microsoft SQL Server Data Warehouses for SQL Server DBAs
Microsoft SQL Server Data Warehouses for SQL Server DBAsMicrosoft SQL Server Data Warehouses for SQL Server DBAs
Microsoft SQL Server Data Warehouses for SQL Server DBAs
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQuery
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft Platform
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme based
 

Destaque

hadoop&zing
hadoop&zinghadoop&zing
hadoop&zingzingopen
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the MoviesDataWorks Summit
 
Move Beyond ETL: Tapping the True Business Value of Hadoop
Move Beyond ETL: Tapping the True Business Value of HadoopMove Beyond ETL: Tapping the True Business Value of Hadoop
Move Beyond ETL: Tapping the True Business Value of HadoopDataWorks Summit
 
Practical Computing With Chaos
Practical Computing With ChaosPractical Computing With Chaos
Practical Computing With ChaosDataWorks Summit
 
HBase Read High Availabilty using Timeline Consistent Region Replicas
HBase Read High Availabilty using Timeline Consistent Region ReplicasHBase Read High Availabilty using Timeline Consistent Region Replicas
HBase Read High Availabilty using Timeline Consistent Region ReplicasDataWorks Summit
 
Hadoop: Making it work for the Business Unit
Hadoop: Making it work for the Business UnitHadoop: Making it work for the Business Unit
Hadoop: Making it work for the Business UnitDataWorks Summit
 
Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant ScaleHadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant ScaleDataWorks Summit
 
Gaining Support for Hadoop in a Large Corporate Environment
Gaining Support for Hadoop in a Large Corporate EnvironmentGaining Support for Hadoop in a Large Corporate Environment
Gaining Support for Hadoop in a Large Corporate EnvironmentDataWorks Summit
 
Enterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop ClusterEnterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop ClusterDataWorks Summit
 
Scaling Ancestry DNA with the Hadoop Ecosystem
Scaling Ancestry DNA with the Hadoop EcosystemScaling Ancestry DNA with the Hadoop Ecosystem
Scaling Ancestry DNA with the Hadoop EcosystemDataWorks Summit
 
Railroad Modeling at Hadoop Scale
Railroad Modeling at Hadoop ScaleRailroad Modeling at Hadoop Scale
Railroad Modeling at Hadoop ScaleDataWorks Summit
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionDataWorks Summit
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopDataWorks Summit
 
Visualising your Big Data: Eye Vegetables and Eye Candy
Visualising your Big Data: Eye Vegetables and Eye CandyVisualising your Big Data: Eye Vegetables and Eye Candy
Visualising your Big Data: Eye Vegetables and Eye CandyDataWorks Summit
 
Applying Big Data to Flow Theory
Applying Big Data to Flow TheoryApplying Big Data to Flow Theory
Applying Big Data to Flow TheoryDataWorks Summit
 
One Grid to rule them all: Building a Multi-tenant Data Cloud with YARN
One Grid to rule them all: Building a Multi-tenant Data Cloud with YARNOne Grid to rule them all: Building a Multi-tenant Data Cloud with YARN
One Grid to rule them all: Building a Multi-tenant Data Cloud with YARNDataWorks Summit
 
50 Billion pins and counting: Using Hadoop to build data driven Products
50 Billion pins and counting: Using Hadoop to build data driven Products50 Billion pins and counting: Using Hadoop to build data driven Products
50 Billion pins and counting: Using Hadoop to build data driven ProductsDataWorks Summit
 

Destaque (20)

hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the Movies
 
Move Beyond ETL: Tapping the True Business Value of Hadoop
Move Beyond ETL: Tapping the True Business Value of HadoopMove Beyond ETL: Tapping the True Business Value of Hadoop
Move Beyond ETL: Tapping the True Business Value of Hadoop
 
Practical Computing With Chaos
Practical Computing With ChaosPractical Computing With Chaos
Practical Computing With Chaos
 
HBase Read High Availabilty using Timeline Consistent Region Replicas
HBase Read High Availabilty using Timeline Consistent Region ReplicasHBase Read High Availabilty using Timeline Consistent Region Replicas
HBase Read High Availabilty using Timeline Consistent Region Replicas
 
Hadoop: Making it work for the Business Unit
Hadoop: Making it work for the Business UnitHadoop: Making it work for the Business Unit
Hadoop: Making it work for the Business Unit
 
Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant ScaleHadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale
 
Gaining Support for Hadoop in a Large Corporate Environment
Gaining Support for Hadoop in a Large Corporate EnvironmentGaining Support for Hadoop in a Large Corporate Environment
Gaining Support for Hadoop in a Large Corporate Environment
 
Enterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop ClusterEnterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
 
Hado "OPS" or Had "oops"
Hado "OPS" or Had "oops" Hado "OPS" or Had "oops"
Hado "OPS" or Had "oops"
 
Scaling Ancestry DNA with the Hadoop Ecosystem
Scaling Ancestry DNA with the Hadoop EcosystemScaling Ancestry DNA with the Hadoop Ecosystem
Scaling Ancestry DNA with the Hadoop Ecosystem
 
Hadoop YARN Services
Hadoop YARN ServicesHadoop YARN Services
Hadoop YARN Services
 
Algorithms of the heart
Algorithms of the heartAlgorithms of the heart
Algorithms of the heart
 
Railroad Modeling at Hadoop Scale
Railroad Modeling at Hadoop ScaleRailroad Modeling at Hadoop Scale
Railroad Modeling at Hadoop Scale
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly Detection
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
 
Visualising your Big Data: Eye Vegetables and Eye Candy
Visualising your Big Data: Eye Vegetables and Eye CandyVisualising your Big Data: Eye Vegetables and Eye Candy
Visualising your Big Data: Eye Vegetables and Eye Candy
 
Applying Big Data to Flow Theory
Applying Big Data to Flow TheoryApplying Big Data to Flow Theory
Applying Big Data to Flow Theory
 
One Grid to rule them all: Building a Multi-tenant Data Cloud with YARN
One Grid to rule them all: Building a Multi-tenant Data Cloud with YARNOne Grid to rule them all: Building a Multi-tenant Data Cloud with YARN
One Grid to rule them all: Building a Multi-tenant Data Cloud with YARN
 
50 Billion pins and counting: Using Hadoop to build data driven Products
50 Billion pins and counting: Using Hadoop to build data driven Products50 Billion pins and counting: Using Hadoop to build data driven Products
50 Billion pins and counting: Using Hadoop to build data driven Products
 

Semelhante a Discardable, In-Memory Materialized Query for Hadoop

Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache HadoopHortonworks
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
The modern analytics architecture
The modern analytics architectureThe modern analytics architecture
The modern analytics architectureJoseph D'Antoni
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemJames Serra
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data ModelingAdam Doyle
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...Lucas Jellema
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep DiveHortonworks
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptxIke Ellis
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupVictor Coustenoble
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 

Semelhante a Discardable, In-Memory Materialized Query for Hadoop (20)

Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Spark
SparkSpark
Spark
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
 
NoSQL
NoSQLNoSQL
NoSQL
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
The modern analytics architecture
The modern analytics architectureThe modern analytics architecture
The modern analytics architecture
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform System
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptx
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Último (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Discardable, In-Memory Materialized Query for Hadoop

  • 1. Page1 © Hortonworks Inc. 2014 Discardable, In-Memory Materialized Query for Hadoop Julian Hyde Julian Hyde June 3rd, 2014
  • 2. Page2 © Hortonworks Inc. 2014 About me Julian Hyde Architect at Hortonworks Open source: • Founder & lead, Apache Optiq (query optimization framework) • Founder & lead, Pentaho Mondrian (analysis engine) • Committer, Apache Drill • Contributor, Apache Hive • Contributor, Cascading Lingual (SQL interface to Cascading) Past: • SQLstream (streaming SQL) • Broadbase (data warehouse) • Oracle (SQL kernel development)
  • 3. Page3 © Hortonworks Inc. 2014 Before we get started… The bad news • This software is not available to download • I am a database bigot • I am a BI bigot The good news • I believe in Hadoop • Now is a great time to discuss where Hadoop is going
  • 4. Page4 © Hortonworks Inc. 2014 Hadoop today Brute force Hadoop brings a lot of CPU, disk, IO Yarn, Tez, Vectorization are making Hadoop faster How to use that brute force is left to the application Business Intelligence Best practice is to pull data out of Hadoop • Populate enterprise data warehouse • In-memory analytics • Custom analytics, e.g. Lambda architecture Ineffective use of memory Opportunity to make Hadoop smarter
  • 5. Page5 © Hortonworks Inc. 2014 Brute force + diplomacy
  • 6. Page6 © Hortonworks Inc. 2014 Hardware trends Typical Hadoop server configuration Lots of memory - but not enough for all data More heterogeneous mix (disk + SSD + memory) What to do about memory? Year 2009 2014 Cores 4 – 8 24 Memory 8 GB 128 GB SSD None 1 TB Disk 4 x 1 TB 12 x 4 TB Disk : memory 512 : 1 384 : 1
  • 7. Page7 © Hortonworks Inc. 2014 Dumb use of memory - Buffer cache Example: 50 TB data 1 TB memory Full-table scan Analogous to virtual memory • Great while it works, but… • Table scan nukes the cache! Table (disk) Buffer cache (memory) Table scan Query operators (Tez, or whatever) Pre-fetch
  • 8. Page8 © Hortonworks Inc. 2014 “Document-oriented” analysis Operate on working sets small enough to fit in memory Analogous to working on a document (e.g. a spreadsheet) • Works well for problems that fit into memory (e.g. some machine-learning algorithms) • If your problem grows, you’re out of luck Working set (memory) Table (disk) Interactive user
  • 9. Page9 © Hortonworks Inc. 2014 Smarter use of memory - Materialized queries In-memory materialized queries Tables on disk
  • 10. Page10 © Hortonworks Inc. 2014 Census data Census table – 300M records Which state has the most males? Brute force – read 150M records Smarter – read 50 records CREATE TABLE Census (id, gender, zipcode, state, age); SELECT state, COUNT(*) AS c FROM Census WHERE gender = ‘M’ GROUP BY state ORDER BY c DESC LIMIT 1; CREATE TABLE CensusSummary AS SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender; SELECT state, c FROM CensusSummary WHERE gender = ‘M’ ORDER BY c DESC LIMIT 1;
  • 11. Page11 © Hortonworks Inc. 2014 Materialized view – Automatic smartness A materialized view is a table that is declared to be identical to a given query Optimizer rewrites query on “Census” to use “CensusSummary” instead Even smarter – read 50 records CREATE MATERIALIZED VIEW CensusSummary STORAGE (MEMORY, DISCARDABLE) AS SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender; SELECT state, COUNT(*) AS c FROM Census WHERE gender = ‘M’ GROUP BY state ORDER BY c DESC LIMIT 1;
  • 12. Page12 © Hortonworks Inc. 2014 Indistinguishable from magic Run query #1 System creates materialized view in background Related query #2 runs faster SELECT state, COUNT(*) AS c FROM Census WHERE gender = ‘M’ GROUP BY state ORDER BY c DESC LIMIT 1; CREATE MATERIALIZED VIEW CensusSummary STORAGE (MEMORY, DISCARDABLE) AS SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender; SELECT state, COUNT(NULLIF(gender, ‘F’)) AS males, COUNT(NULLIF(gender, ‘M’)) AS females FROM Census GROUP BY state HAVING females > males;
  • 13. Page13 © Hortonworks Inc. 2014 Materialized views - Classic Classic materialized view (Oracle, DB2, Teradata, MSSql) 1. A table defined using a SQL query 2. Designed by DBA 3. Storage same as a regular table 1. On disk 2. Can define indexes 4. DB populates the table 5. Queries are rewritten to use the table** 6. DB updates the table to reflect changes to source data (usually deferred)* *Magic required SELECT t.year, AVG(s.units) FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) GROUP BY t.year; CREATE MATERIALIZED VIEW SalesMonthZip AS SELECT t.year, t.month, c.state, c.zipcode, COUNT(*), SUM(s.units), SUM(s.price) FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) JOIN CustomerDim AS c USING (customerId) GROUP BY t.year, t.month, c.state, c.zipcode;
  • 14. Page14 © Hortonworks Inc. 2014 Materialized views - DIMMQ DIMMQs - Discardable, In-memory Materialized Queries Differences with classic materialized views 1. May be in-memory 2. HDFS may discard – based on DDM (Distributed Discardable Memory) 3. Lifecycle support: 1. Assume table is populated 2. Don’t populate & maintain 3. User can flag as valid, invalid, or change definition (e.g. date range) 4. HDFS may discard 4. More design options: 1. DBA specifies 2. Retain query results (or partial results) 3. An agent builds MVs based on query traffic
  • 15. Page15 © Hortonworks Inc. 2014 DIMMQ compared to Spark RDDs It’s not “either / or” • Spark-on-YARN, SQL-on-Spark already exist; Cascading-on-Spark common soon • Hive-queries-on-RDDs, DIMMQs populated by Spark, Spark-on-Hive-tables are possible Spark RDD DIMMQ Access By reference By algebraic expression Sharing Within session Across sessions Execution model Built-in External Recovery Yes No Discard Failure or GC Failure or cost-based Native organization Memory Disk Language Scala (or other JVM language) Language-independent
  • 16. Page16 © Hortonworks Inc. 2014 Data independence This is not just about SQL standards compliance! Materialized views are supposed to be transparent in creation, maintenance and use. If not one DBA ever types “CREATE MATERIALIZED VIEW”, we have still succeeded Data independence Ability to move data around and not tell your application Replicas Redundant copies Moving between disk and memory Sort order, projections (à la Vertica), aggregates (à la Microstrategy) Indexes, and other weird data structures
  • 17. Page17 © Hortonworks Inc. 2014 Implementing DIMMQs Relational algebra Apache Optiq (just entered incubator – yay!) Algebra, rewrite rules, cost model Metadata Hive: “CREATE MATERIALIZED VIEW” Definitions of materialized views in HCatalog HDFS - Discardable Distributed Memory (DDM) Off-heap data in memory-mapped files Discard policy Build in-memory, replicate to disk; or vice versa Central namespace Evolution of existing components
  • 18. Page18 © Hortonworks Inc. 2014 Tiled queries in distributed memory Query: SELECT x, SUM(y) FROM t GROUP BY x In-memory materialized queries Tables on disk
  • 19. Page19 © Hortonworks Inc. 2014 An adaptive system Ongoing activities: • Agent suggests new MVs • MVs are built in background • Ongoing query activity uses MVs • User marks MVs as invalid due to source data changes • HDFS throws out MVs that are not pulling their weight Dynamic equilibrium DIMMQs continually created & destroyed System moves data around to adapt to changing usage patterns
  • 20. Page20 © Hortonworks Inc. 2014 Lambda architecture From “Runaway complexity in Big Data and a plan to stop it” (Nathan Marz)
  • 21. Page21 © Hortonworks Inc. 2014 Lambda architecture in Hadoop via DIMMQs Use DIMMQs for materialized historic & streaming data MV on disk MV in key- value store Hadoop Query via SQL
  • 22. Page22 © Hortonworks Inc. 2014 Variations on a theme Materialized queries don’t have to be in memory Materialized queries don’t need to be discardable Materialized queries don’t need to be accessed via SQL Materialized queries allow novel data structures to be described Maintaining materialized queries - build on Hive ACID Fine-grained invalidation Streaming into DIMMQs In-memory tables don’t have to be materialized queries Data aging – Older data to cheaper storage Discardable In-memory Algebraic
  • 23. Page23 © Hortonworks Inc. 2014 Lattice Space of possible materialized views A star schema, with mandatory many-to-one relationships Each view is a projected, filtered aggregation • Sales by zipcode and quarter in 2013 • Sales by state in Q1, 2012 Lattice gathers stats • “I used MV m to answer query q and avoided fetching r rows” • Cost of MV = construction effort + memory * time • Utility of MV = query processing effort saved Recommends & builds optimal MVs CREATE LATTICE SalesStar AS SELECT * FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) JOIN CustomerDim AS c USING (customerId);
  • 24. Page24 © Hortonworks Inc. 2014 Conclusion Broadening the tent – batch, interactive BI, streaming, iterative Declarative, algebraic, transparent Seamless movement between memory and disk Adding brains to complement Hadoop’s brawn
  • 25. Page25 © Hortonworks Inc. 2014 Thank you! My next talk: “Cost-based query optimization in Hive” 4:35pm Wednesday @julianhyde http://hortonworks.com/blog/dmmq/ http://hortonworks.com/blog/ddm/ http://incubator.apache.org/projects/optiq.html

Notas do Editor

  1. This software does not yet exist. I hope and believe that it will. I don’t believe in promoting vaporware! I want to have a discussion about where Hadoop is going so as many of us can shape it. This seems like a great time and place to have it. My background is in database. I think like a database guy. I’ve been thinking really hard about how to bring the “good stuff” from databases over to Hadoop I have a deep knowledge of BI, and what BI tools need from their underlying data system, having written Mondrian
  2. Hadoop today brings a lot of brute force. It can be more effective if it were a bit smarter. We cannot make effective use of memory unless we are a bit smarter.
  3. Lots of memory now. But still not enough
  4. Bringing all kinds of data into the Hadoop tent - Batch, interactive, streaming, iterative Declarative, algebraic, transparent - Applications use different copies of data without knowing it Making Hadoop smarter – Adding brains to compliment Hadoop’s formidable brawn