SQL on Big Data using Optiq

SQL on Big Data is not "one size fits all". Optiq is a framework that lets you build a data management system on top of any back-end system, including NoSQL and Hadoop, and add rules that optimize query processing for the capabilities of each data source. We show how Optiq is used in the Apache Drill and Cascading Lingual projects, and how we plan to combine Optiq materialized views, Mondrian, and a data grid to create next-generation in-memory analytics. This presentation was given at the Real-Time Big Data meetup at RichRelevance in San Francisco on 2013-04-09.

SQL on Big Data
using Optiq
@julianhyde

Real-time Big Data Meetup
at RichRelevance

April 2013

What is “SQL on Big Data?”
□ “Open-source Teradata”
□ SQL generator for Map-Reduce
□ ETL (Extract, Transform, Load)
□ Scalable transaction processing
□ Querying nested data sets
□ Querying documents & populating databases
□ Continuous query/streaming

(Check one or more.)

Revolution & counter-revolution
“Big Data” was a revolution in data management.

Lots of broken things got fixed (unlimited scale,
data anywhere & any format, late schema,
flexible queries).

Some useful things got broken (standard
interface, data independence, central control).

“In 5 years everyone will be using Hadoop and
they won't even know it.” – me, a few years
ago

Conventional DBMS architecture

[Diagram: components, top to bottom]
JDBC client
JDBC server
SQL parser / validator; Metadata
Query optimizer
Data-flow operators
Data (two stores)

Optiq architecture

[Diagram: components, top to bottom. Labels on the diagram: Optional, Core, Pluggable]
JDBC client
JDBC server
SQL parser / validator; Metadata SPI
Query optimizer; Pluggable rules
3rd party ops (two boxes)
3rd party data (two boxes)
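
To make the "Metadata SPI" and "3rd party data" boxes concrete, here is a minimal sketch in Python. It is not Optiq's API (Optiq is a Java library); the names Table, ListTable, CsvTextTable and Catalog are hypothetical and exist only for this illustration. The idea shown is that the core only needs each back end to describe its columns and hand over rows.

```python
from typing import Dict, Iterator, Protocol, Tuple

Row = Tuple


class Table(Protocol):
    """Hypothetical SPI: a table lists its columns and can scan its rows."""

    columns: Tuple[str, ...]

    def scan(self) -> Iterator[Row]: ...


class ListTable:
    """A 'third-party' table backed by an in-memory list of tuples."""

    def __init__(self, columns, rows):
        self.columns = tuple(columns)
        self._rows = list(rows)

    def scan(self):
        return iter(self._rows)


class CsvTextTable:
    """Another 'third-party' table, backed by comma-separated text."""

    def __init__(self, text):
        header, *lines = text.strip().splitlines()
        self.columns = tuple(header.split(","))
        self._lines = lines

    def scan(self):
        return (tuple(line.split(",")) for line in self._lines)


class Catalog:
    """Metadata registry mapping 'schema.table' names to Table objects."""

    def __init__(self):
        self._tables: Dict[str, Table] = {}

    def register(self, name, table):
        self._tables[name] = table

    def scan(self, name):
        table = self._tables[name]
        # Return dicts so callers do not depend on the storage format.
        return [dict(zip(table.columns, row)) for row in table.scan()]


catalog = Catalog()
catalog.register(
    "mem.products",
    ListTable(["product_id", "product_name"], [(1, "lamp"), (2, "desk")]),
)
catalog.register("csv.events", CsvTextTable("product_id,action\n1,purchase\n2,view\n"))
```

Both tables are read through the same catalog.scan call even though one holds tuples and the other holds text, which is the property the pluggable layers in the diagram rely on.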

Expression tree

SELECT p.product_name, COUNT(*) AS c
FROM splunk.splunk AS s
  JOIN mysql.products AS p
  ON s.product_id = p.product_id
WHERE s.action = 'purchase'
GROUP BY p.product_name
ORDER BY c DESC

[Diagram: operator tree, inputs first]
scan (Splunk; Table: splunk) and scan (MySQL; Table: products)
→ join (Key: product_id)
→ filter (Condition: action = 'purchase')
→ group (Key: product_name; Agg: count)
→ sort (Key: c DESC)
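
The operator tree on this slide can be written down as a small data structure. The sketch below is an illustration in Python, not Optiq's own representation; the Node class and the render helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    op: str                          # operator name, e.g. "scan" or "join"
    detail: str = ""                 # human-readable operator argument
    inputs: Tuple["Node", ...] = ()  # child operators that feed this one


def render(node, depth=0):
    """Return the tree as a list of indented lines, root first."""
    lines = ["  " * depth + f"{node.op}({node.detail})"]
    for child in node.inputs:
        lines.extend(render(child, depth + 1))
    return lines


# The unoptimized plan: both scans feed the join, and the filter runs after it.
splunk = Node("scan", "splunk.splunk")
mysql = Node("scan", "mysql.products")
join = Node("join", "product_id", (splunk, mysql))
filt = Node("filter", "action = 'purchase'", (join,))
group = Node("group", "product_name, count", (filt,))
plan = Node("sort", "c DESC", (group,))

print("\n".join(render(plan)))
```

Printing the plan lists the root (sort) first and the two scans last, which is the slide's left-to-right data flow read in reverse.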

Expression tree (optimized)

SELECT p.product_name, COUNT(*) AS c
FROM splunk.splunk AS s
  JOIN mysql.products AS p
  ON s.product_id = p.product_id
WHERE s.action = 'purchase'
GROUP BY p.product_name
ORDER BY c DESC

[Diagram: operator tree, inputs first]
scan (Splunk; Table: splunk) → filter (Condition: action = 'purchase'), performed inside Splunk
scan (MySQL; Table: products)
→ join (Key: product_id)
→ group (Key: product_name; Agg: count)
→ sort (Key: c DESC)
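
The difference between the two trees is a single rewrite: the filter moves below the join, onto the Splunk side. A minimal sketch of such a rule in Python follows. It assumes an inner join and invents the column lists for both tables; Scan, Join, Filter and push_filter_into_join are hypothetical names, not Optiq classes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    table: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    key: str


@dataclass(frozen=True)
class Filter:
    input: object
    column: str
    value: str


def columns_of(node):
    """Columns produced by a node."""
    if isinstance(node, Scan):
        return set(node.columns)
    if isinstance(node, Filter):
        return columns_of(node.input)
    return columns_of(node.left) | columns_of(node.right)


def push_filter_into_join(node):
    """Rewrite Filter(Join(l, r)) so the filter runs on the side owning its column.

    Only valid for inner joins, which is what this sketch assumes.
    """
    if not (isinstance(node, Filter) and isinstance(node.input, Join)):
        return node
    join = node.input
    in_left = node.column in columns_of(join.left)
    in_right = node.column in columns_of(join.right)
    if in_left and not in_right:
        return Join(Filter(join.left, node.column, node.value), join.right, join.key)
    if in_right and not in_left:
        return Join(join.left, Filter(join.right, node.column, node.value), join.key)
    return node  # column on both sides or on neither: leave the plan unchanged


# Column lists are assumptions made for the example.
splunk = Scan("splunk.splunk", ("product_id", "action"))
products = Scan("mysql.products", ("product_id", "product_name"))
before = Filter(Join(splunk, products, "product_id"), "action", "purchase")
after = push_filter_into_join(before)
```

After the rewrite, the filter sits directly on the Splunk scan, so a back end that can evaluate the condition itself returns fewer rows to the join.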

Apache Drill
“Apache Drill (incubating) is a distributed system
for interactive analysis of large-scale datasets,
based on Google's Dremel. Its goal is to
efficiently process nested data. It is a design
goal to scale to 10,000 servers or more and to
be able to process petabytes of data and
trillions of records in seconds.”
Data model: JSON, late-binding
Optiq:
SQL → logical plan (current)
Logical → physical plan (proposed)

Cascading Lingual
“Cascading is the de facto Java API for creating
complex data processing workloads and the
engine underneath Scalding, Cascalog, and
others.”

Lingual uses Optiq to translate SQL into
Cascading flows
SQL is “yet another DSL” for Cascading
Just released!

Mondrian (Pentaho Analysis)

Mondrian next-gen architecture

[Diagram: three stacks of mondrian over optiq over cache, with the caches joined by a data grid; below them, three control / cache pairs in front of HDFS, MongoDB and DBMS]
Optiq provides SQL view onto hybrid SQL + NoSQL + in-memory store
Data grid: In-memory tables (query results, planned & on-the-fly materializations)
HDFS, MongoDB, DBMS: Raw data + summarized / projected / sorted / re-organized data. Partitions.
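
One way to picture "query results" and "on-the-fly materializations" held in memory is a cache of summary tables that can answer coarser queries without going back to raw data. The sketch below is an assumption-laden illustration in Python, not the Mondrian or Optiq design; SummaryCache and the sample rows are made up for the example, and it handles only COUNT(*).

```python
from collections import Counter

# Made-up raw data standing in for a back-end store.
RAW = [
    {"product": "lamp", "region": "west"},
    {"product": "lamp", "region": "east"},
    {"product": "desk", "region": "west"},
    {"product": "lamp", "region": "west"},
]


class SummaryCache:
    """Keeps COUNT(*) summaries in memory, keyed by their group-by columns."""

    def __init__(self, rows):
        self._rows = rows
        self._summaries = {}   # tuple of column names -> Counter of key tuples
        self.raw_scans = 0     # how many times the raw data was read

    def count_by(self, columns):
        columns = tuple(columns)
        if columns in self._summaries:
            return self._summaries[columns]
        # Look for a finer-grained summary that contains every requested column.
        source = next(
            (have for have in self._summaries if set(columns) <= set(have)), None
        )
        if source is not None:
            # Roll up: counts can be summed, so no raw scan is needed.
            idx = [source.index(c) for c in columns]
            result = Counter()
            for key, n in self._summaries[source].items():
                result[tuple(key[i] for i in idx)] += n
        else:
            self.raw_scans += 1
            result = Counter(tuple(r[c] for c in columns) for r in self._rows)
        self._summaries[columns] = result
        return result


cache = SummaryCache(RAW)
by_both = cache.count_by(["product", "region"])   # scans raw data once
by_product = cache.count_by(["product"])          # rolled up from by_both
```

The second query is answered from the first query's result, so raw_scans stays at 1. Roll-up works here because counts add up; other aggregates would need their own treatment.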

Summary: Data independence
Logical & physical data models
Requires & allows query optimization
Allows you (or the system) to re-organize data
Query federation, data movement, caching
SQL interface for humans & machines
Optiq lets you add rules to optimize better
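
Data independence can be seen in miniature with any SQL engine. The sketch below uses Python's built-in sqlite3 module rather than Optiq, with a made-up events table: the physical layout changes (an index is added) while the query text and its answer stay the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (product_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "purchase"), (2, "view"), (1, "purchase"), (3, "purchase")],
)

# The logical query: it names tables and columns, never access paths.
QUERY = (
    "SELECT product_id, COUNT(*) FROM events "
    "WHERE action = 'purchase' GROUP BY product_id ORDER BY product_id"
)

before = conn.execute(QUERY).fetchall()

# Re-organize the data physically; the query is not edited.
conn.execute("CREATE INDEX events_by_action ON events (action, product_id)")

after = conn.execute(QUERY).fetchall()
```

Whether the engine reads the table or the index is the optimizer's decision, which is the sense in which data independence both requires and allows query optimization.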

Thank you!
@julianhyde

optiq https://github.com/julianhyde/optiq
drill http://incubator.apache.org/drill/
lingual http://www.cascading.org/lingual/
mondrian http://mondrian.pentaho.com
slides https://github.com/julianhyde/share/tree/master/slides

