Optiq: a SQL front-end for everything


Optiq is a dynamic query planning framework. It can potentially help integrate Pentaho Mondrian and Kettle with various SQL, NoSQL and BigData data sources.

Julian Hyde @julianhyde

http://github.com/julianhyde/optiq
http://github.com/julianhyde/optiq-splunk

Pentaho Community Meetup
Amsterdam, 2012

http://www.flickr.com/photos/torkildr/3462606643

http://www.flickr.com/photos/sylvar/31436961/

“Big Data”
Right data, right time
Diverse data sources / Performance / Suitable format

Use case: Splunk

NoSQL database

Every log file in the enterprise

A single “table”

A record for every line in every log file

A column for every field that exists in any log file

No schema
SELECT "source", "product_id", "http_code"
FROM "splunk"."splunk"
WHERE "action" = 'purchase'

How to do it (wrong)

SELECT "source", "product_id"
FROM "splunk"."splunk"
WHERE "action" = 'purchase'

[Diagram: Splunk runs a bare "search" and sends its rows to Optiq; Optiq applies the filter action = 'purchase'.]

How to do it (right)

SELECT "source", "product_id"
FROM "splunk"."splunk"
WHERE "action" = 'purchase'

[Diagram: Splunk runs "search action=purchase" and sends the result to Optiq; no filter is left on the Optiq side.]

Example #2
Combining data from 2 sources (Splunk & MySQL)
Also possible: 3 or more sources; 3-way joins; unions

Expression tree

SELECT p."product_name", COUNT(*) AS c
FROM "splunk"."splunk" AS s
JOIN "mysql"."products" AS p
ON s."product_id" = p."product_id"
WHERE s."action" = 'purchase'
GROUP BY p."product_name"
ORDER BY c DESC

[Diagram:
Splunk: scan (Table: splunk)
MySQL: scan (Table: products)
Both scans feed: join (Key: product_id) -> filter (Condition: action = 'purchase') -> group (Key: product_name, Agg: count) -> sort (Key: c DESC)]
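
The tree above can be read as a pipeline, evaluated bottom-up. The following is a minimal Python sketch of that reading, not Optiq code: the two lists are made-up stand-ins for the Splunk and MySQL tables, and `join` is a hypothetical helper. It performs the steps in the unoptimized order (join, then filter, then group, then sort).

```python
from collections import Counter

# Made-up sample data standing in for the two sources.
splunk_events = [
    {"product_id": 1, "action": "purchase"},
    {"product_id": 1, "action": "view"},
    {"product_id": 2, "action": "purchase"},
    {"product_id": 1, "action": "purchase"},
]
mysql_products = [
    {"product_id": 1, "product_name": "widget"},
    {"product_id": 2, "product_name": "gadget"},
]


def join(left, right, key):
    """Hash join: index the right input on the key, then probe with the left."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            yield {**left_row, **right_row}


# scan + scan -> join -> filter -> group -> sort
joined = join(splunk_events, mysql_products, "product_id")
filtered = (row for row in joined if row["action"] == "purchase")
counts = Counter(row["product_name"] for row in filtered)
result = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(result)
```

Note that every Splunk event, including the "view" rows, passes through the join before the filter discards it. That wasted work is what the optimized tree avoids.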

Expression tree (optimized)

SELECT p."product_name", COUNT(*) AS c
FROM "splunk"."splunk" AS s
JOIN "mysql"."products" AS p
ON s."product_id" = p."product_id"
WHERE s."action" = 'purchase'
GROUP BY p."product_name"
ORDER BY c DESC

[Diagram:
Splunk: scan (Table: splunk) -> filter (Condition: action = 'purchase')
MySQL: scan (Table: products)
Both branches feed: join (Key: product_id) -> group (Key: product_name, Agg: count) -> sort (Key: c DESC)]

Optiq is not a database.

http://www.flickr.com/photos/telstra-corp/5069403309/

Conventional database architecture

[Diagram, top to bottom: JDBC client; JDBC server; SQL parser / validator, with Metadata alongside; Query optimizer; Data-flow operators; Data, Data]

Optiq architecture

[Diagram, top to bottom: JDBC client; JDBC server and SQL parser / validator (optional), with the Metadata SPI alongside; Query optimizer (core), with pluggable rules alongside; 3rd party ops (pluggable); 3rd party data]

What is Optiq?
A really, really smart JDBC driver
Framework
Potential core of a data management system

Writing an adapter
Driver – if you want a vanity URL like “jdbc:splunk:”
Schema – describes what tables exist (Splunk has just one)
Table – what are the columns, and how to get the data. (Splunk's table has any column you like... just ask for it.)
Operators (optional) – non-relational operations
Rules (optional, but recommended) – improve efficiency by changing the question
Parser (optional) – to query via a language other than SQL
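
As a rough illustration of the Schema and Table parts of that list, here is a minimal Python sketch. It is not Optiq's actual interface (Optiq adapters are written in Java): `Schema`, `Table`, `LogSchema`, `LogTable` and the method names are all hypothetical. It shows a schema with a single table whose rows can be asked for any column, as in the Splunk case.

```python
from abc import ABC, abstractmethod


class Table(ABC):
    """Hypothetical adapter contract: how to get the data."""

    @abstractmethod
    def scan(self, columns):
        """Yield one tuple per row, holding the requested columns in order."""


class Schema(ABC):
    """Hypothetical adapter contract: which tables exist."""

    @abstractmethod
    def tables(self):
        """Return a mapping of table name to Table."""


class LogTable(Table):
    """Schema-less table: any column may be requested; absent fields are None."""

    def __init__(self, events):
        self._events = events

    def scan(self, columns):
        for event in self._events:
            yield tuple(event.get(column) for column in columns)


class LogSchema(Schema):
    """Exposes exactly one table, mirroring the single Splunk table."""

    def __init__(self, events):
        self._table = LogTable(events)

    def tables(self):
        return {"splunk": self._table}


# Made-up events; the second has no "action" or "product_id" field.
schema = LogSchema([
    {"source": "web1.log", "action": "purchase", "product_id": 1},
    {"source": "db.log", "sourcetype": "mysqld-4"},
])
rows = list(schema.tables()["splunk"].scan(["source", "http_code"]))
print(rows)
```

Returning `None` for a field that an event lacks is one possible design choice for a table with no fixed schema; a real adapter might treat unknown columns differently.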

http://www.flickr.com/photos/walkercarpenter/4697637143/

Optiq roadmap ideas
Mondrian: use Optiq to read from data sources such as Splunk & MongoDB; combine multiple data sources
Kettle integration: JDBC front-end; optimize jobs; push down filters & aggregations to data sources (e.g. SQL database)
Adapters: Cascading, MongoDB, HBase, Apache Drill, …?
Front-ends: linq4j, Scala SLICK, Java8 streams
Contributions

Conclusions
Liberate your data!
Optiq is a framework
Build & share Optiq adapters

Questions?

@julianhyde
http://julianhyde.blogspot.com
http://github.com/julianhyde/optiq
http://github.com/julianhyde/optiq-splunk

Additional material: The following queries were used in the demo

select s."source", s."sourcetype"
from "splunk"."splunk" as s;

select * from "mysql"."products";

select s."source", s."sourcetype", s."action"
from "splunk"."splunk" as s
where s."action" = 'purchase';

select p."product_name", s."action"
from "splunk"."splunk" as s
join "mysql"."products" as p
on s."product_id" = p."product_id";

select s."source",



Editor's Notes

The obligatory “big data” definition slide. What is “big data”? It's not really about “big”. We need to access data from different parts of the organization, when we need it (which often means we don't have time to copy it), and the performance needs to be reasonable. If the data is large, it is often larger than the disks one can fit on one machine. It helps if we can process the data in place, leveraging the CPU and memory of the machines where the data is stored. We'd rather not copy it from one system to another. It needs to be flexible, to deal with diverse systems and formats. That often means that open source is involved. Some systems (e.g. reporting tools) can't easily be changed to accommodate new formats. So it helps if the data can be presented in standard formats, e.g. SQL.
The wrong way to execute the query is for Splunk to send all of the data to Optiq. Splunk does more work than it needs to and doesn't use any indexes, the network carries too much data, and Optiq does too much work.
The right way to execute the query is to pass the filter down to Splunk. This lets Splunk use its indexes, so it does less work, passes less data over the network, and the query finishes faster. This is just a simple answer, but a lot of problems can be solved by “pushing down” expressions, filters, computation of summaries. Do the work, and reduce the volume of data, as early in the process as possible.
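
To make the difference concrete, here is a minimal Python sketch, not Optiq code. `FakeSplunk`, its `search` method and the sample events are hypothetical stand-ins; the only purpose is to compare how many rows cross the "network" under the two strategies.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSplunk:
    """Hypothetical stand-in for a remote source that can filter on its side."""
    events: list
    rows_sent: int = field(default=0)

    def search(self, **criteria):
        # Filter at the source, then "send" only the matching rows.
        matches = [e for e in self.events
                   if all(e.get(k) == v for k, v in criteria.items())]
        self.rows_sent += len(matches)
        return matches


# Made-up log events.
EVENTS = [
    {"source": "web1.log", "product_id": 1, "action": "purchase"},
    {"source": "web1.log", "product_id": 2, "action": "view"},
    {"source": "web2.log", "product_id": 1, "action": "view"},
    {"source": "web2.log", "product_id": 3, "action": "purchase"},
]


def query_without_pushdown(splunk):
    # Wrong way: fetch everything, then filter on the caller's side.
    rows = splunk.search()
    return [(r["source"], r["product_id"]) for r in rows
            if r["action"] == "purchase"]


def query_with_pushdown(splunk):
    # Right way: hand the predicate to the source as part of the search.
    rows = splunk.search(action="purchase")
    return [(r["source"], r["product_id"]) for r in rows]


naive, pushed = FakeSplunk(EVENTS), FakeSplunk(EVENTS)
assert query_without_pushdown(naive) == query_with_pushdown(pushed)
print(naive.rows_sent, pushed.rows_sent)
```

Both functions return the same answer; only the number of rows the source has to ship differs.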
Demo connecting to Splunk via the Optiq driver. We are using sqlline as the shell (it works with any JDBC driver). select "source" from "splunk"."splunk" where "sourcetype" = 'mysqld-4'; In the generated Java on the screen, note how sourcetype is pushed down to Splunk.
It's much more efficient if we push filters and aggregations to Splunk. But the user writing SQL shouldn't have to worry about that. This is not about processing data. This is about processing expressions. Reformulating the question. The question is the parse tree of a query. The parse tree is a data flow. In Splunk, a data flow looks like a pipeline of Linux commands. SQL systems have pipelines too (sometimes they are dataflow trees) built up of the basic relational operators. Think of the SQL SELECT, WHERE, JOIN, GROUP BY, ORDER BY clauses.
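
"Processing expressions" can be sketched as a rewrite on a small tree of relational operators. The Python below is an illustration only: the `Scan`, `Filter` and `Join` classes and the `push_filter_below_join` rule are hypothetical, not Optiq's planner API. The rule moves a filter beneath a join, onto the input whose table the predicate refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    input: object
    table: str     # table whose column the predicate reads
    column: str
    value: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    key: str


def tables(node):
    """Return the set of table names scanned beneath a node."""
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Filter):
        return tables(node.input)
    return tables(node.left) | tables(node.right)


def push_filter_below_join(node):
    """Rewrite Filter(Join(l, r)) so the filter sits on the side it refers to."""
    if isinstance(node, Filter) and isinstance(node.input, Join):
        join = node.input
        if node.table in tables(join.left):
            return Join(Filter(join.left, node.table, node.column, node.value),
                        join.right, join.key)
        if node.table in tables(join.right):
            return Join(join.left,
                        Filter(join.right, node.table, node.column, node.value),
                        join.key)
    return node  # rule does not apply; leave the tree unchanged


before = Filter(Join(Scan("splunk"), Scan("products"), "product_id"),
                "splunk", "action", "purchase")
after = push_filter_below_join(before)
print(after)
```

No data is touched here: the rule takes one description of the question and returns an equivalent one that is cheaper to answer.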
To recap. Optiq is not a database. It does as little of the database processing as it can get away with. Ideally, nothing at all. But what is it?
Optiq is not a database... it is more like a telephone exchange. Applications can get the data they need, quickly and efficiently.
A conventional database has an ODBC/JDBC driver, a SQL parser, data sources, an expression tree, expression transformation rules, and an optimizer. For NoSQL databases, the language may not be SQL, and the optimizer may be less sophisticated, but the picture is basically the same. For frameworks, such as Hadoop, there is no planner. You end up writing code (e.g. MapReduce jobs).
In Optiq, the query optimizer (we modestly call it the planner) is central. The JDBC driver/server and SQL parser are optional; skip them if you have another language. Plug-ins provide metadata (the schema), planner rules, and runtime operators. There are built-in relational operators and rules, and there are built-in operators implemented in Java. But to access data, you need to provide at least one operator.
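
One way to picture a planner with plug-in rules is a loop that applies every registered rule until the plan stops changing. The sketch below is a simplification under stated assumptions, not how Optiq's planner is implemented: the `Planner` class, the tuple-based plan encoding and both rules are hypothetical, and the rules only inspect the root of the plan.

```python
class Planner:
    """Hypothetical minimal planner: applies registered rules to a fixed point."""

    def __init__(self):
        self._rules = []

    def add_rule(self, rule):
        # A rule is any function from plan to plan; it returns the
        # plan unchanged when it does not apply.
        self._rules.append(rule)

    def optimize(self, plan):
        changed = True
        while changed:
            changed = False
            for rule in self._rules:
                new_plan = rule(plan)
                if new_plan != plan:
                    plan, changed = new_plan, True
        return plan


# Plans are nested tuples:
#   ("scan", table) | ("filter", predicates, input) | ("search", table, predicates)

def push_filter_into_scan(plan):
    """Plug-in rule: turn a filter over a scan into a search at the source."""
    if plan[0] == "filter" and plan[2][0] == "scan":
        return ("search", plan[2][1], plan[1])
    return plan


def merge_filters(plan):
    """Built-in style rule: collapse two stacked filters into one."""
    if plan[0] == "filter" and plan[2][0] == "filter":
        return ("filter", plan[1] + plan[2][1], plan[2][2])
    return plan


planner = Planner()
planner.add_rule(push_filter_into_scan)
planner.add_rule(merge_filters)

plan = ("filter", ("action=purchase",),
        ("filter", ("http_code=200",), ("scan", "splunk")))
optimized = planner.optimize(plan)
print(optimized)
```

The push-down rule cannot fire on the first pass because a second filter sits between it and the scan; once `merge_filters` has run, the next pass lets it apply. That interplay is why the loop repeats until no rule changes the plan.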
It needs to be said. Optiq is not a database. It looks like a database to your applications, and that's great. But when you want to integrate data from multiple sources, in different formats, and have those systems talk to each other, it doesn't force you to copy the data around. It gets out of your way. You configure Optiq by writing Java code. Therefore it is a framework, like Spring and, yes, like Hadoop. Optiq masquerades as a really, really smart JDBC driver. It has a SQL parser and JDBC driver. And actually you can embed it into another data management system, with a language other than SQL.
