Optiq: A dynamic data management framework

•

7 gostaram•5,126 visualizações

Optiq is a framework that allows heterogeneous data from different locations and formats to be queried using a standard SQL interface. It helps optimize queries by caching results, pre-computing aggregates, and using column-oriented storage and materialized views to speed up analytics queries on large datasets. By treating different data sources as a single database and helping "cheat" through query optimization techniques, Optiq provides the benefits of a centralized database for managing and querying hybrid transactional and analytical workloads.

Tecnologia

Optiq
A dynamic data management framework
@julianhyde
Friday, October 4, 13

3 things
1. Databases are good.
2. It is hard to build analytics if your data is
“all over the place.”
3. Optiq makes heterogeneous data look and
behave more like a database.
Friday, October 4, 13

Mondrian on Optiq on
hybrid data + memory
Friday, October 4, 13

“Data all over the
place”
• Different locations (HDFS, memory, DBMS)
• Different formats
• Different workloads
• Transactional: Mainly write, targeted read
• Analytic: Mainly read (bulk write)
• Query latency: Interactive vs. batch
• Data latency: Can we show out-of-date data?
Friday, October 4, 13

Databases are good
• Central point to manage data
• Simple, standard API for apps
• Powerful modeling techniques (e.g. star
schemas)
• Data independence (i.e. tune your data
after you write your application)
• Query optimization
Friday, October 4, 13

Conventional DB architecture
Friday, October 4, 13

Optiq architecture
Friday, October 4, 13

Example #1: CSV
• Uses CSV adapter (optiq-csv)
• Demo using sqlline
• Easy to run this for yourself:
$ git clone https://github.com/julianhyde/optiq-csv
$ cd optiq-csv
$ mvn install
$ ./sqlline
Friday, October 4, 13

More adapters
Adapters Embedded Planned
CSV Cascading (Lingual) HBase (Phoenix)
JDBC Apache Drill Spark
MongoDB Cassandra
Splunk Mondrian
linq4j
Friday, October 4, 13

Example #2:
Splunk + MySQL
SELECT p.product_name,
COUNT(*) AS c
FROM splunk.splunk AS s
JOIN mysql.products AS p
ON s.product_id = p.product_id
WHERE s.action = 'purchase'
GROUP BY p.product_name
ORDER BY c DESC
Friday, October 4, 13

Analytics on
heterogeneous data
Friday, October 4, 13

Simple analytics
problem
• 100M U.S. census records
• 1KB each record, 100GB total
• 4 SATA3 disks, total 1.2GB/s
• How to count all records in under 5s?
Friday, October 4, 13

Simple analytics
problem
• 100M U.S. census records
• 1KB each record, 100GB total
• 4 SATA3 disks, total 1.2GB/s
• How to count all records in under 5s?
• Not possible?! It takes 80s just to read the
data.
Friday, October 4, 13

Solution: Cheat!
• Compress data
• Column-oriented storage
• Store data in sorted order
• Put data in memory
• Cache previous query results
• Pre-compute (materialize) aggregates
Friday, October 4, 13

How Optiq helps you
to cheat
Friday, October 4, 13

How Optiq helps you
to cheat
• Materialized views
• Pre-deﬁned aggregate tables
• Cached query results = In-memory tables
• Smart cache maintenance
• Quickly bring materializations online & ofﬂine
• Materializations over a subset of the data
• Spark distributed, in-memory processing & cache
• Application thinks it is talking to a single SQL database
Friday, October 4, 13

3 things (reprise)
1. Databases are good. Especially the
ﬂexibility that SQL gives us.
2. It is hard to build analytics if your
data is all over the place. Different
workloads (operational vs. analytic, small write
vs. bulk read) require different data structures.
3. Optiq is not a database. But Optiq creates
a federated data architecture that performs
well, and looks like a database to your tools.
Friday, October 4, 13

@julianhyde
optiq https://github.com/julianhyde/optiq
mondrian http://mondrian.pentaho.com
blog http://julianhyde.blogspot.com
Friday, October 4, 13

Mais conteúdo relacionado

Mais procurados

Insights into Customer Behavior from Clickstream Data by Ronald Nowling

Spark Summit

Migration de données structurées entre Hadoop et RDBMS par Louis Rabiet (Squid Solution) Avec l'extraction de données stockées dans une base de données relationnelle à l'aide d'un outil de BI avancé, et avec l'envoi via Kafka des données vers Tachyon, plusieurs sessions Spark peuvent travailler sur le même dataset en limitant la duplication. On obtient grâce à cela une communication à coût contrôlé entre la base de données d'origine et Spark ce qui permet de réintroduire de manière dynamique les données modifiées avec MLlib tout en travaillant sur des données à jour. Les résultats préliminaires seront partagés durant cette présentation.

HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...

Modern Data Stack France

Azure DocumentDB 101

Ike Ellis

NoSQL for SQL Users

IBM Cloud Data Services

Benjamin Hopp (Solutions Architect) @ Imply: Druid is an emerging standard in the data infrastructure world, designed for high-performance slice-and-dice analytics (“OLAP”-style) on large data sets. This talk is for you if you’re interested in learning more about pushing Druid’s analytical performance to the limit. Perhaps you’re already running Druid and are looking to speed up your deployment, or perhaps you aren’t familiar with Druid and are interested in learning the basics. Some of the tips in this talk are Druid-specific, but many of them will apply to any operational analytics technology stack. The most important contributor to a fast analytical setup is getting the data model right. The talk will center around various choices you can make to prepare your data to get best possible query performance. We’ll look at some general best practices to model your data before ingestion such as OLAP dimensional modeling (called “roll-up” in Druid), data partitioning, and tips for choosing column types and indexes. We’ll also look at how more can be less: often, storing copies of your data partitioned, sorted, or aggregated in different ways can speed up queries by reducing the amount of computation needed. We’ll also look at Druid-specific optimizations that take advantage of approximations; where you can trade accuracy for performance and reduced storage. You’ll get introduced to Druid’s features for approximate counting, set operations, ranking, quantiles, and more. And we will finish with the latest and greatest Druid news, including details about the latest roadmap and releases.

A Day in the Life of a Druid Implementor and Druid's Roadmap

Itai Yaffe

This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta. * Optimal file sizes in a data lake * File compaction to fix the small file problem * Why Spark hates globbing S3 files * Partitioning data lakes with partitionBy * Parquet predicate pushdown filtering * Limitations of Parquet data lakes (files aren't mutable!) * Mutating Delta lakes * Data skipping with Delta ZORDER indexes Speaker: Matthew Powers

Optimizing Delta/Parquet Data Lakes for Apache Spark

Databricks

Hadoop for the Absolute Beginner

Ike Ellis

Introduction to Google BigQuery

Csaba Toth

From DataEngConf 2017 - Everybody wants to get to data faster. As we move from more general solution to specific optimization techniques, the level of performance impact grows. This talk will discuss how layering in-memory caching, columnar storage and relational caching can combine to provide a substantial improvement in overall data science and analytical workloads. It will include a detailed overview of how you can use Apache Arrow, Calcite and Parquet to achieve multiple magnitudes improvement in performance over what is currently possible.

Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache

Dremio Corporation

Vadim Solovey is a CTO of DoiT International has helped to implement Google BigQuery as a cloud data warehouse for many medium and large sized data and analytics initiatives. BigQuery’s serverless architecture had redefined what it means to be fully managed for hundreds of Israeli's startups. Recently, Google announced an update to BigQuery that dramatically advances cloud data analytics for large-scale businesses such as BigQuery now support Standard SQL, implementing the SQL 2011 standard as well as new ODBC drivers making it possible to use BigQuery with a number of tools ranging from Microsoft Excel to traditional business intelligence systems such as Microstrategy and Qlik. Agenda: • Partitioned tables • The ability to update, delete rows and columns using SQL • Integration with IAM for fine-grained security policies • Monitoring w/ StackDriver to track performance and usage • Query sharing via links, to foster knowledge within orgs • Cost optimisation strategies

Google BigQuery 101 & What’s New

DoiT International

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. One key feature in Presto is the ability to query data where it lives via an uniform ANSI SQL interface. Presto’s connector architecture creates an abstraction layer for anything that can be represented in a columnar or row-like format, such as HDFS, Amazon S3, Azure Storage, NoSQL stores, relational databases, Kafka streams and even proprietary data stores. Furthermore, a single Presto query can combine data from multiple sources, allowing for analytics across an entire organization.

Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup

Wojciech Biela

Real-World NoSQL Schema Design

DataWorks Summit/Hadoop Summit

PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"

Wes McKinney

Practical Use of a NoSQL Database

IBM Cloud Data Services

Apache Arrow: Cross-language Development Platform for In-memory Data

Wes McKinney

SQL on Big Data is not a "one size fits all". Optiq is a framework that allows you to build a data management system on top of any back-end system, including NoSQL and Hadoop, and rules that optimize query processing for capabilities of the data source. We show how Optiq is used in the Apache Drill and Cascading Lingual projects, and how we plan to combine Optiq materialized views, Mondrian, and a data grid to create next-generation in-memory analytics. This presentation was given at the Real-Time Big Data meetup at RichRelevance in San Francisco, 2013-04-09.

SQL on Big Data using Optiq

Julian Hyde

Apache Arrow is designed to make things faster. Its focused on speeding communication between systems as well as processing within any one system. In this talk I'll start by discussing what Arrow is and why it was built. This will include covering an overview of the key components, goals, vision and current state. I’ll then take the audience through a detailed engineering review of how we used Arrow to solve several problems when building the Apache-Licensed Dremio product. This will include talking about Arrow performance characteristics, working with Arrow APIs, managing memory, sizing Arrow vectors, and moving data between processes and/or nodes. We’ll also review several code examples of specific data processing implementations and how they interact with Arrow data. Lastly we’ll spend a short amount of time on what’s next for Arrow. This will be a highly technical talk targeted towards people building data infrastructure systems and complex workflows.

Apache Arrow: In Theory, In Practice

Dremio Corporation

Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...

Lucidworks

Database Choices

Lynn Langit

Dremio introduction

Alexis Gendronneau

Mais procurados (20)

Insights into Customer Behavior from Clickstream Data by Ronald Nowling

HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...

Azure DocumentDB 101

NoSQL for SQL Users

A Day in the Life of a Druid Implementor and Druid's Roadmap

Optimizing Delta/Parquet Data Lakes for Apache Spark

Hadoop for the Absolute Beginner

Introduction to Google BigQuery

Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache

Google BigQuery 101 & What’s New

Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup

Real-World NoSQL Schema Design

PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"

Practical Use of a NoSQL Database

Apache Arrow: Cross-language Development Platform for In-memory Data

SQL on Big Data using Optiq

Apache Arrow: In Theory, In Practice

Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...

Database Choices

Dremio introduction

Destaque

Apache Calcite: One planner fits all

Julian Hyde

Streaming SQL

Julian Hyde

With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects. This talk was given Julian Hyde at Apache Big Data conference, Vancouver, on 2016/05/09.

Streaming SQL with Apache Calcite

Julian Hyde

Apache Calcite overview

Julian Hyde

A talk given by Julian Hyde at FlinkForward, Berlin, on 2016/09/12. Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Flink is using Calcite to support both regular and streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.

Streaming SQL (at FlinkForward, Berlin, 2016/09/12)

Julian Hyde

7 tecnicas para construir un equipo brillante

David Bonilla

En esta charla pretendía comentar cómo nos habíamos dado cuenta, en una empresa de 25 empleados, de que éramos poco homogéneos a la hora de proporcionar servicios, sobre todo en uno de los puntos vitales: definir y estimar un proyecto en base a historias y minimizar riesgos. Un modo de conseguirlo es crear métodos para que todo el mundo vaya adquiriendo los conocimientos. Se describieron los elementos fundamentales: épicas, historias, chores, temas, spikes, etc. También la técnica para estimar proyectos grandes: - Conseguir todas las historias. - Descomponerlas en ventanas por prioridad. - Estudiar en detalle a primera ventana (de pongamos 25). - Estimar la primera ventana. - Asegurarse que no hay historias (que no tenemos capacidad de detallar y estimar) que se desmadren, estudiando los posibles estados del sistema y viendo el riesgo técnico/funcional. Para descomponer historias de usuario existen numerosos patrones que sería interesante conocer. Otro asunto vital es "mirar el todo" y estudiar los riesgos. Describimos un método que hemos llamado Risk Evaluation Machine (REM) para hacer una gestión visual de riesgos: Área de definición. - Definir los riesgos entre los compañeros. - Agruparlos por categoría (para ver solapes y los que faltan). - Sugerir otras categorías para encontrar otros riesgos que no tenemos en la cabeza. Máquina (área central de estudio que tiene que terminar limpia) - Se estudia la probabilidad e impacto. - Se cualifican y descartan los menos importantes. - Se estudia en detalle los vitales (gran probabilidad e impacto) - Se hacen preguntas (5 por que´s) para encontrar las causas primeras y sugerir soluciones. Área resumen de proyecto - Se estructuran los riesgos en un panel para cada proyecto. - Usaremos este panel como elemento de comunicación incluso con técnicas originales como pegar globos o luces en él para que nadie ignore los riesgos. Esta charla fue curiosa porque como lo conté no tube apenas feedback sobre el contenido pero si sobre el jefismo que transmití (no hice el discleimer inicial) :-) A nadie pareció gustarle el concepto de disciplina en un evento ágil... Parece que la gente no entiende que nada es blanco ni es negro sino que todo tiene matices. Me gusta la frase: Esto es Esparta haciendo referencia a la película 300: Un grupo pequeño, de élite, tiene cada miembro una gran capacidad pero tiene una gran disciplina colectiva, que es lo que les da el poder. Si entras en los SWAT ya tendrás grandes destrezas que seguro que mejoraras. Por muy bueno que sea el grupo será mejor cuanto más compenetrados estén. Todo el mundo asumirá que hay una disciplina. Los mismos métodos que se usan el un grupo de élite (donde alguien voluntariamente ha querido entrar) no se pueden usar en otros contextos. Obviamente hay que adaptar el comportamiento al contexto.

Patrones de toma de requisitos en proyectos ágiles en la Cas2013

Roberto Canales

Advanced Reporting and ETL for MongoDB: Easily Build a 360-Degree View of You...

MongoDB

Streaming is a paradigm for data processing that is rapidly growing in popularity, because it allows high throughput, low latency responses, and efficiently manages multitudes of IoT devices. Is it an alternative to database processing, or is it complementary? Julian Hyde argues for applying the database paradigm to streaming systems, using SQL as a high-level language for streaming. He presents streaming SQL, a super-set of standard SQL developed in collaboration with several Apache projects, and the use cases it can solve, such as combining data in flight with historic data at rest. He also shows how query optimization techniques can make streaming applications more efficient. A talk given by Julian Hyde at 9th XLDB conference at SLAC, Menlo Park, on 2016/05/25.

Streaming SQL

Julian Hyde

The twins that everyone loved too much

Julian Hyde

What's new in Mondrian 4?

Julian Hyde

A talk given by Julian Hyde at Enterprise Data World on in Washington, DC on April 2nd, 2015. With data in different systems, in different formats, and accessed via different tools, we need a lingua franca for data. Not all tools speak SQL, and data cannot be moved into a single convenient location. Relational algebra underpins SQL and many other DB languages. It is also perfect for optimizing, caching and mediating. Apache Calcite (formerly Optiq) is a framework for building and optimizing expressions in relational algebra. We show how to write queries, optimize queries using rewrite rules, and write adapters for back-end systems. We also show to configure Calcite to materialize queries, so your interactive analytics are effectively running against a fast in-memory database.

Why you care about  relational algebra (even though you didn’t know it)

Julian Hyde

Cost-based query optimization in Apache Hive 0.14

Julian Hyde

A talk from given by Julian Hyde and Tomer Shiran at Hadoop Summit, Dublin. Data scientists and analysts want the best API, DSL or query language possible, not to be limited by what the processing engine can support. Polyalgebra is an extension to relational algebra that separates the user language from the engine, so you can choose the best language and engine for the job. It also allows the system to optimize queries and cache results. We demonstrate how Ibis uses Polyalgebra to execute the same Python-based machine learning queries on Impala, Drill and Spark. And we show how to build Polyalgebra expressions in Calcite and how to define optimization rules and storage handlers.

Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...

Julian Hyde

A talk given at Hadoop Summit 2016, Dublin. The internet of things (IoT) generates data at an unprecedented rate and requires results at low latency. Only streaming technologies can keep up. But IoT applications must be integrated with existing applications, such as Tableau, whose lingua franca is SQL. Samza and Storm are adding support standard SQL with extensions for streaming, using Calcite for parsing and planning. We show, with examples, how you would accomplish typical tasks in Kafka/Samza or Storm/Trident using SQL. Streaming SQL allows you to work at a higher level. For example, many queries need to combine streams with historic or reference data in databases or HDFS, or recent data in memory. We show how to define multiple data source systems, and how to write queries that compute joins or aggregations on streams.

Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident

Julian Hyde

Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive introduces cost-based optimization for the first time, based on the Optiq framework. Optiq's lead developer Julian Hyde shows the improvements that CBO is bringing in Apache Hive 0.14. For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive with the Stinger.next initiative.

Cost-based query optimization in Apache Hive 0.14

Julian Hyde

Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL. Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.

SQL on everything, in memory

Julian Hyde

This talk, given by Maryann Xue and Julian Hyde at Hadoop Summit, San Jose on June 30th, 2016, describes how we re-engineered Apache Phoenix with a cost-based optimizer based on Apache Calcite. Apache Phoenix has rapidly become a workhorse in many organizations, providing a convenient standard SQL interface to HBase suitable for a wide variety of workloads from transactions to ETL and analytics. But Phoenix's initial query optimizer was based on static optimization procedures and thus could not choose between several potential plans or indices based on cost metrics. We describe how we rebuilt Phoenix's parser and query optimizer using the Calcite framework, improving Phoenix's performance and SQL compliance. The new architecture uses relational algebra as an intermediate language, and this enables you to switch in other engines, especially those also based on Calcite. As an example of this, we demonstrate querying a Phoenix database via Apache Drill.

Cost-based Query Optimization in Apache Phoenix using Apache Calcite

Julian Hyde

Pentaho Analytics for MongoDB - presentation from MongoDB World 2014

Pentaho

Streaming is necessary to handle data rates and latency but SQL is unquestionably the lingua franca of data. Where do the two meet? Apache Calcite is extending SQL to include streaming, and the Samza, Storm and Flink are projects are each building it into their engines. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency. Julian Hyde gave this talk at the first Kafka Summit, San Francisco, 2016/04/26.

Streaming SQL

Julian Hyde

Destaque (20)

Apache Calcite: One planner fits all

Streaming SQL

Streaming SQL with Apache Calcite

Apache Calcite overview

Streaming SQL (at FlinkForward, Berlin, 2016/09/12)

7 tecnicas para construir un equipo brillante

Patrones de toma de requisitos en proyectos ágiles en la Cas2013

Advanced Reporting and ETL for MongoDB: Easily Build a 360-Degree View of You...

Streaming SQL

The twins that everyone loved too much

What's new in Mondrian 4?

Why you care about  relational algebra (even though you didn’t know it)

Cost-based query optimization in Apache Hive 0.14

Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...

Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident

Cost-based query optimization in Apache Hive 0.14

SQL on everything, in memory

Cost-based Query Optimization in Apache Phoenix using Apache Calcite

Pentaho Analytics for MongoDB - presentation from MongoDB World 2014

Streaming SQL

Semelhante a Optiq: A dynamic data management framework

Archiving is the most impactful way to alleviate overall storage costs, which today, consume up to 70% of IT hardware budgets. Organizations are pressured to build reliable and auditable archive practices to meet regulatory requirements. Given the growing importance of archives, is tape, disk or some other storage technology the best choice for building an archive? Please join Jon Toigo, data management veteran, to discuss the latest developments among competing archive platforms and reveal the six key criteria for building enterprise archives.

Is Disk Now a Viable Solution for Archive - Jon Toigo

spectralogic

A talk given at the August, 2013 NoSQL Now! conference in San Jose, California. 2013 is the year of that SQL meets Hadoop & NoSQL. There is a surprisingly good fit between thirty-something SQL and the upstart millennial data technologies. SQL helps you deliver faster value, improve integration with visualization and analytics tools, and bring your data to a larger audience. Julian Hyde, author of Mondrian and Optiq, and a contributor to Apache Drill and Cascading Lingual, shows how to quickly build a SQL interface to a NoSQL system using the Optiq framework. As a case study, he shows how Optiq has allows the Mondrian in-memory analysis engine (part of Pentaho's business intelligence suite) to leverage the memory and compute power of a cluster and pull data from hybrid SQL and NoSQL sources.

SQL Now! How Optiq brings the best of SQL to NoSQL data.

Julian Hyde

Agile analytics applications on hadoop

Russell Jurney

AppEngine Performance Tuning

David Chen

2013 - Matías Paterlini: Escalando PHP con sharding y Amazon Web Services

PHP Conference Argentina

Escalando una PHP App con DB sharding - PHP Conference

Matias Paterlini

2010 AIRI Petabyte Challenge - View From The Trenches

George Ang

MongoDB at Yle

MongoDB

Data Management 101

Kristin Briney

Over the past year, more and more Java applications have benefited from gaining access to elastic, cloud-ready, data grids thanks to Infinispan, and from now on, Ruby apps running within TorqueBox get the same benefit as well thanks to the ability of TorqueBox to talk to Infinispan Hot Rod servers. In this talk, Galder will demo the integrations between TorqueBox, which is a Ruby application plattform, and Infinispan, which is a Java based data grid plattform, highlighting the benefits for TorqueBox administrators and Ruby developers, which include, amongst others, access to highly-scalable, low-latency data store that avoids single point of failure.

Ruby-on-Infinispan

Galder Zamarreño

How to interactively visualise and explore a billion objects (wit vaex)

Ali-ziane Myriam

With datasets becoming larger, new visualizations techniques are needed. One problem is that the number of datapoints is so large, that a scatter plot becomes cluttered. Another problem is that with over a billion objects, only a few cpu cycles are available per object if one wants to process them within a second, making traditional methods not viable. I will show that it is possible to visualize a billion objects in about 1 second on a modern desktop computer using memory mapping of hdf5 files together with a simple binning algorithm in Python. Density fields in 1, 2 and 3d are computed and statistics of any properties of the objects, or any mathematical operation on them can be done on the fly. This enables efficient exploration or large datasets interactively, making science exploration feasible. This idea is implemented in a Python library called vaex, which integrates well in the Jupyter/Numpy/Astropy stack. Build on top of this is the vaex application, which allows for interactive exploration and visualization. The motivation for vaex is the upcoming astronomical catalogue of the Gaia satellite, which will contain properties of over a billion stars. However, vaex can also be used on N-body simulations, other catalogues or any tabular data. https://www.astro.rug.nl/~breddels/vaex https://github.com/maartenbreddels/vaex

Vaex talk-pydata-paris

Maarten Breddels

Big Data Analytics: Finding diamonds in the rough with Azure

Christos Charmatzis

Cassandra at scale

Patrick McFadin

Big Data with MySQL

Ivan Zoratti

Bender kuszmaul tutorial-xldb12

Atner Yegorov

Data Structures and Algorithms for Big Databases

omnidba

Data Management Crash Course

Kristin Briney

Searching Billions of Product Logs in Real Time (Use Case)

Ryan Tabora

COMPLEMENTING HADOOP WITH REAL-TIME DATA ANALYSIS from Structure:Data 2013

Gigaom

Semelhante a Optiq: A dynamic data management framework (20)

Is Disk Now a Viable Solution for Archive - Jon Toigo

SQL Now! How Optiq brings the best of SQL to NoSQL data.

Agile analytics applications on hadoop

AppEngine Performance Tuning

2013 - Matías Paterlini: Escalando PHP con sharding y Amazon Web Services

Escalando una PHP App con DB sharding - PHP Conference

2010 AIRI Petabyte Challenge - View From The Trenches

MongoDB at Yle

Data Management 101

Ruby-on-Infinispan

How to interactively visualise and explore a billion objects (wit vaex)

Vaex talk-pydata-paris

Big Data Analytics: Finding diamonds in the rough with Azure

Cassandra at scale

Big Data with MySQL

Bender kuszmaul tutorial-xldb12

Data Structures and Algorithms for Big Databases

Data Management Crash Course

Searching Billions of Product Logs in Real Time (Use Case)

COMPLEMENTING HADOOP WITH REAL-TIME DATA ANALYSIS from Structure:Data 2013

Mais de Julian Hyde

A semantic layer, also known as a metrics layer, lies between business users and the database, and lets those users compose queries in the concepts that they understand. It also governs access to the data, manages data transformations, and can tune the database by defining materializations. Like many new ideas, the semantic layer is a distillation and evolution of many old ideas, such as query languages, multidimensional OLAP, and query federation. In this talk, we describe the features we are adding to Calcite to define business views, query measures, and optimize performance. A talk given at Community over Code, the annual conference of the Apache Software Foundation, in Halifax, NS, on 9th October, 2023.

Building a semantic/metrics layer using Calcite

Julian Hyde

If SQL is the universal language of data, why do we author our most important data applications (metrics, analytics, business intelligence) in languages other than SQL? Multidimensional databases and languages such as MDX, DAX and Tableau LOD solve these problems but introduce others: they require specialized knowledge, complicate the data pipeline and don’t integrate well. Is it possible to define and query business intelligence models in SQL? Apache Calcite has extended SQL to support metrics (which we call ‘measures’), filter context, and analytic expressions. With these concepts you can define data models (which we call Analytic Views) that contain metrics, use them in queries, and define new metrics in queries. In this talk by the original developer of Apache Calcite, we describe the SQL syntax extensions for metrics, and how to use them for cross-dimensional calculations such as period-over-period, percent-of-total, non-additive and semi-additive measures. We describe how we got around fundamental limitations in SQL semantics, and approaches for optimizing queries that use metrics. A talk given by Julian Hyde at Data Council, Austin, TX, on March 29, 2023.

Cubing and Metrics in SQL, oh my!

Julian Hyde

Adding measures to Calcite SQL

Julian Hyde

What would the perfect data-parallel programming language look like? It would be as expressive as a general-purpose functional programming language, as powerful and concise as SQL, and run programs just as efficiently on a laptop or a thousand-node cluster. We present Morel, a functional programming language with relational extensions, working towards that goal. Morel is implemented in the Apache Calcite community on top of Calcite’s relational algebra framework. In this talk, we describe Morel’s evolution, including how we are pushing Calcite’s capabilities with graph and recursive queries. A talk given by Julian Hyde at ApacheCon, New Orleans, October 4th 2022.

Morel, a data-parallel programming language

Julian Hyde

The perfect data parallel language has not yet been invented. SQL queries can achieve great performance and scale, but there are many general purpose algorithms that it cannot express. In Morel, we build on the functional and relational roots of MapReduce in an elegant and strongly-typed general-purpose programming language. But Morel is, in a real sense, a query language; programs are executed on relational frameworks such as Google BigQuery and Spark. In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems. We also introduce Apache Calcite, the popular open source framework for query planning, and describe how Morel's compiler uses Calcite's relational algebra and rewrite rules to generate efficient plans.

Is there a perfect data-parallel programming language? (Experiments with More...

Julian Hyde

Is it easier to add functional programming features to a query language, or to add query capabilities to a functional language? In Morel, we have done the latter. Functional and query languages have much in common, and yet much to learn from each other. Functional languages have a rich type system that includes polymorphism and functions-as-values and Turing-complete expressiveness; query languages have optimization techniques that can make programs several orders of magnitude faster, and runtimes that can use thousands of nodes to execute queries over terabytes of data. Morel is an implementation of Standard ML on the JVM, with language extensions to allow relational expressions. Its compiler can translate programs to relational algebra and, via Apache Calcite’s query optimizer, run those programs on relational backends. In this talk, we describe the principles that drove Morel’s design, the problems that we had to solve in order to implement a hybrid functional/relational language, and how Morel can be applied to implement data-intensive systems. (A talk given by Julian Hyde at Strange Loop 2021, St. Louis, MO, on October 1st, 2021.)

Morel, a Functional Query Language

Julian Hyde

Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial (given at BOSS '21 in Copenhagen as part of VLDB '21) the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research. Presenters: Julian Hyde and Stamatis Zampetakis

Apache Calcite (a tutorial given at BOSS '21)

Julian Hyde

Apache Calcite is an open source framework for building databases, and includes a SQL parser, relational algebra, and a highly extensible query optimizer. It has achieved wide adoption, used in many commercial products, open source projects, and as a test bed for computer science research. But there is a bootstrap problem: If software is written by a community of contributors, and each contributor acts in their own self-interest, how do you get the first working version of the product? The answer is in the story of how the technology evolved, and how the community evolved with it, and in this talk we tell that story.

The evolution of Apache Calcite and its Community

Julian Hyde

What to expect when you're Incubating

Julian Hyde

Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite

Julian Hyde

A talk given by Julian Hyde at the Apache Calcite online meetup, 2021/01/20. Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them. In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!

Efficient spatial queries on vanilla databases

Julian Hyde

What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible. A talk by Julian Hyde at JOIN 2019 in San Francisco.

Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...

Julian Hyde

A talk given by Julian Hyde at DataCouncil SF on April 18, 2019 How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar. This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.

Tactical data engineering

Julian Hyde

Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries.

Don't optimize my queries, organize my data!

Julian Hyde

A talk given by Julian Hyde at ApacheCon NA 2018 in Montreal on September 26th, 2018. Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them. In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!

Spatial query on vanilla databases

Julian Hyde

The revolution has happened. We are living the age of the deconstructed database. The modern enterprises are powered by data, and that data lives in many formats and locations, in-flight and at rest, but somewhat surprisingly, the lingua franca for remains SQL. In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like. He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard. A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.

Data all over the place! How SQL and Apache Calcite bring sanity to streaming...

Julian Hyde

A talk given at ACM SIGMOD 2018 in support of the paper <a href="https://arxiv.org/abs/1802.10233"> Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources</a>. Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization.

Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...

Julian Hyde

A talk given by Julian Hyde at DataEngConf SF on April 17th 2018. Did you know that databases often “cheat”? Even with a scalable query engine and smart optimizer, many real-world queries would be too slow if the engine read all the data, so the engine re-writes your query to use a pre-materialized result. B-tree indexes made the first relational databases possible, and there are now many flavors of materialization, from explicit materialized views to OLAP-style caching and spatial indexes. Materialization is more relevant than ever in today’s heterogenous, distributed systems. If you are evaluating data engines, we describe what materialization features to look for in your next engine. If you are implementing an engine, we describe the features provided by Apache Calcite to design, maintain and use materializations.

Lazy beats Smart and Fast

Julian Hyde

Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we evolve it so that it can optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries, suggesting materialized views that will make your queries run faster without you changing them. A talk given by Julian Hyde at DataEngConf NYC, Columbia University, on 2017/10/30.

Don’t optimize my queries, optimize my data!

Julian Hyde

Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data. A talk given by Julian Hyde at DataWorks Summit, San Jose, on June 14th 2017.

Data profiling with Apache Calcite

Julian Hyde

Mais de Julian Hyde (20)

Building a semantic/metrics layer using Calcite

Cubing and Metrics in SQL, oh my!

Adding measures to Calcite SQL

Morel, a data-parallel programming language

Is there a perfect data-parallel programming language? (Experiments with More...

Morel, a Functional Query Language

Apache Calcite (a tutorial given at BOSS '21)

The evolution of Apache Calcite and its Community

What to expect when you're Incubating

Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite

Efficient spatial queries on vanilla databases

Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...

Tactical data engineering

Don't optimize my queries, organize my data!

Spatial query on vanilla databases

Data all over the place! How SQL and Apache Calcite bring sanity to streaming...

Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...

Lazy beats Smart and Fast

Don’t optimize my queries, optimize my data!

Data profiling with Apache Calcite

Último

Data Cloud, More than a CDP by Matt Robison

Anna Loughnan Colquhoun

Scaling API-first – The story of a global engineering organization Ian Reasor, Senior Computer Scientist - Adobe Radu Cotescu, Senior Computer Scientist - Adobe Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

apidays

💉💊+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI}}+971581248768 +971581248768 Mtp-Kit (500MG) Prices » Dubai [(+971581248768**)] Abortion Pills For Sale In Dubai, UAE, Mifepristone and Misoprostol Tablets Available In Dubai, UAE CONTACT DR.Maya Whatsapp +971581248768 We Have Abortion Pills / Cytotec Tablets /Mifegest Kit Available in Dubai, Sharjah, Abudhabi, Ajman, Alain, Fujairah, Ras Al Khaimah, Umm Al Quwain, UAE, Buy cytotec in Dubai +971581248768''''Abortion Pills near me DUBAI | ABU DHABI|UAE. Price of Misoprostol, Cytotec” +971581248768' Dr.DEEM ''BUY ABORTION PILLS MIFEGEST KIT, MISOPROTONE, CYTOTEC PILLS IN DUBAI, ABU DHABI,UAE'' Contact me now via What's App…… abortion Pills Cytotec also available Oman Qatar Doha Saudi Arabia Bahrain Above all, Cytotec Abortion Pills are Available In Dubai / UAE, you will be very happy to do abortion in Dubai we are providing cytotec 200mg abortion pill in Dubai, UAE. Medication abortion offers an alternative to Surgical Abortion for women in the early weeks of pregnancy. We only offer abortion pills from 1 week-6 Months. We then advise you to use surgery if its beyond 6 months. Our Abu Dhabi, Ajman, Al Ain, Dubai, Fujairah, Ras Al Khaimah (RAK), Sharjah, Umm Al Quwain (UAQ) United Arab Emirates Abortion Clinic provides the safest and most advanced techniques for providing non-surgical, medical and surgical abortion methods for early through late second trimester, including the Abortion By Pill Procedure (RU 486, Mifeprex, Mifepristone, early options French Abortion Pill), Tamoxifen, Methotrexate and Cytotec (Misoprostol). The Abu Dhabi, United Arab Emirates Abortion Clinic performs Same Day Abortion Procedure using medications that are taken on the first day of the office visit and will cause the abortion to occur generally within 4 to 6 hours (as early as 30 minutes) for patients who are 3 to 12 weeks pregnant. When Mifepristone and Misoprostol are used, 50% of patients complete in 4 to 6 hours; 75% to 80% in 12 hours; and 90% in 24 hours. We use a regimen that allows for completion without the need for surgery 99% of the time. All advanced second trimester and late term pregnancies at our Tampa clinic (17 to 24 weeks or greater) can be completed within 24 hours or less 99% of the time without the need surgery. The procedure is completed with minimal to no complications. Our Women's Health Center located in Abu Dhabi, United Arab Emirates, uses the latest medications for medical abortions (RU-486, Mifeprex, Mifegyne, Mifepristone, early options French abortion pill), Methotrexate and Cytotec (Misoprostol). The safety standards of our Abu Dhabi, United Arab Emirates Abortion Doctors remain unparalleled. They consistently maintain the lowest complication rates throughout the nation. Our Physicians and staff are always available to answer questions and care for women in one of the most difficult times in their lives. The decision to have an abortion at the Abortion Cl

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

The Digital Insurer

Partners Life - Insurer Innovation Award 2024

The Digital Insurer

A Principled Technologies deployment guide Conclusion Deploying VMware Cloud Foundation 5.1 on next gen Dell PowerEdge servers brings together critical virtualization capabilities and high-performing hardware infrastructure. Relying on our hands-on experience, this deployment guide offers a comprehensive roadmap that can guide your organization through the seamless integration of advanced VMware cloud solutions with the performance and reliability of Dell PowerEdge servers. In addition to the deployment efficiency, the Cloud Foundation 5.1 and PowerEdge solution delivered strong performance while running a MySQL database workload. By leveraging VMware Cloud Foundation 5.1 and PowerEdge servers, you could help your organization embrace cloud computing with confidence, potentially unlocking a new level of agility, scalability, and efficiency in your data center operations.

Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...

Principled Technologies

presentation ICT roal in 21st century education

jfdjdjcjdnsjd

Following the popularity of "Cloud Revolution: Exploring the New Wave of Serverless Spatial Data," we're thrilled to announce this much-anticipated encore webinar. In this sequel, we'll dive deeper into the Cloud-Native realm by uncovering practical applications and FME support for these new formats, including COGs, COPC, FlatGeoBuf, GeoParquet, STAC, and ZARR. Building on the foundation laid by industry leaders Michelle Roby of Radiant Earth and Chris Holmes of Planet in the first webinar, this second part offers an in-depth look at the real-world application and behind-the-scenes dynamics of these cutting-edge formats. We will spotlight specific use-cases and workflows, showcasing their efficiency and relevance in practical scenarios. Discover the vast possibilities each format holds, highlighted through detailed discussions and demonstrations. Our expert speakers will dissect the key aspects and provide critical takeaways for effective use, ensuring attendees leave with a thorough understanding of how to apply these formats in their own projects. Elevate your understanding of how FME supports these cutting-edge technologies, enhancing your ability to manage, share, and analyze spatial data. Whether you're building on knowledge from our initial session or are new to the serverless spatial data landscape, this webinar is your gateway to mastering cloud-native formats in your workflows.

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Safe Software

A Domino Admins Adventures (Engage 2024)

Gabriella Davis

This presentations targets students or working professionals. You may know Google for search, YouTube, Android, Chrome, and Gmail, but did you know Google has many developer tools, platforms & APIs? This comprehensive yet still high-level overview outlines the most impactful tools for where to run your code, store & analyze your data. It will also inspire you as to what's possible. This talk is 50 minutes in length.

Powerful Google developer tools for immediate impact! (2023-24 C)

wesley chun

Effective data discovery is crucial for maintaining compliance and mitigating risks in today's rapidly evolving privacy landscape. However, traditional manual approaches often struggle to keep pace with the growing volume and complexity of data. Join us for an insightful webinar where industry leaders from TrustArc and Privya will share their expertise on leveraging AI-powered solutions to revolutionize data discovery. You'll learn how to: - Effortlessly maintain a comprehensive, up-to-date data inventory - Harness code scanning insights to gain complete visibility into data flows leveraging the advantages of code scanning over DB scanning - Simplify compliance by leveraging Privya's integration with TrustArc - Implement proven strategies to mitigate third-party risks Our panel of experts will discuss real-world case studies and share practical strategies for overcoming common data discovery challenges. They'll also explore the latest trends and innovations in AI-driven data management, and how these technologies can help organizations stay ahead of the curve in an ever-changing privacy landscape.

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

Product Anonymous

GenAI Risks & Security Meetup 01052024.pdf

lior mazor

How to Troubleshoot Apps for the Modern Connected Worker

ThousandEyes

Top 10 Most Downloaded Games on Play Store in 2024

SynarionITSolutions

Real Time Object Detection Using Open CV

Khem

Abhishek Deb(1), Mr Abdul Kalam(2) M. Des (UX) , School of Design, DIT University , Dehradun. This paper explores the future potential of AI-enabled smartphone processors, aiming to investigate the advancements, capabilities, and implications of integrating artificial intelligence (AI) into smartphone technology. The research study goals consist of evaluating the development of AI in mobile phone processors, analyzing the existing state as well as abilities of AI-enabled cpus determining future patterns as well as chances together with reviewing obstacles as well as factors to consider for more growth.

Exploring the Future Potential of AI-Enabled Smartphone Processors

debabhi2

Artificial Intelligence Chap.5 : Uncertainty

Khushali Kathiriya

Tata AIG General Insurance Company - Insurer Innovation Award 2024

The Digital Insurer

This presentation explores the impact of HTML injection attacks on web applications, detailing how attackers exploit vulnerabilities to inject malicious code into web pages. Learn about the potential consequences of such attacks and discover effective mitigation strategies to protect your web applications from HTML injection vulnerabilities. for more information visit https://bostoninstituteofanalytics.org/category/cyber-security-ethical-hacking/

HTML Injection Attacks: Impact and Mitigation Strategies

Boston Institute of Analytics

Optiq: A dynamic data management framework

1. Optiq A dynamic data management framework @julianhyde Friday, October 4, 13

2. 3 things 1. Databases are good. 2. It is hard to build analytics if your data is “all over the place.” 3. Optiq makes heterogeneous data look and behave more like a database. Friday, October 4, 13

3. Mondrian on Optiq on hybrid data + memory Friday, October 4, 13

4. Big Data Friday, October 4, 13

5. Friday, October 4, 13

6. “Data all over the place” • Different locations (HDFS, memory, DBMS) • Different formats • Different workloads • Transactional: Mainly write, targeted read • Analytic: Mainly read (bulk write) • Query latency: Interactive vs. batch • Data latency: Can we show out-of-date data? Friday, October 4, 13

7. Databases are good • Central point to manage data • Simple, standard API for apps • Powerful modeling techniques (e.g. star schemas) • Data independence (i.e. tune your data after you write your application) • Query optimization Friday, October 4, 13

8. Optiq Friday, October 4, 13

9. Conventional DB architecture Friday, October 4, 13

10. Optiq architecture Friday, October 4, 13

11. Examples Friday, October 4, 13

12. Example #1: CSV • Uses CSV adapter (optiq-csv) • Demo using sqlline • Easy to run this for yourself: $ git clone https://github.com/julianhyde/optiq-csv $ cd optiq-csv $ mvn install $ ./sqlline Friday, October 4, 13

13. More adapters Adapters Embedded Planned CSV Cascading (Lingual) HBase (Phoenix) JDBC Apache Drill Spark MongoDB Cassandra Splunk Mondrian linq4j Friday, October 4, 13

14. Example #2: Splunk + MySQL SELECT p.product_name, COUNT(*) AS c FROM splunk.splunk AS s JOIN mysql.products AS p ON s.product_id = p.product_id WHERE s.action = 'purchase' GROUP BY p.product_name ORDER BY c DESC Friday, October 4, 13

15. Expression tree Friday, October 4, 13

16. Optimized tree Friday, October 4, 13

17. Analytics on heterogeneous data Friday, October 4, 13

18. Simple analytics problem • 100M U.S. census records • 1KB each record, 100GB total • 4 SATA3 disks, total 1.2GB/s • How to count all records in under 5s? Friday, October 4, 13

19. Simple analytics problem • 100M U.S. census records • 1KB each record, 100GB total • 4 SATA3 disks, total 1.2GB/s • How to count all records in under 5s? Friday, October 4, 13

20. Simple analytics problem • 100M U.S. census records • 1KB each record, 100GB total • 4 SATA3 disks, total 1.2GB/s • How to count all records in under 5s? • Not possible?! It takes 80s just to read the data. Friday, October 4, 13

21. Solution: Cheat! Friday, October 4, 13

22. Solution: Cheat! • Compress data • Column-oriented storage • Store data in sorted order • Put data in memory • Cache previous query results • Pre-compute (materialize) aggregates Friday, October 4, 13

23. How Optiq helps you to cheat Friday, October 4, 13

24. How Optiq helps you to cheat • Materialized views • Pre-deﬁned aggregate tables • Cached query results = In-memory tables • Smart cache maintenance • Quickly bring materializations online & ofﬂine • Materializations over a subset of the data • Spark distributed, in-memory processing & cache • Application thinks it is talking to a single SQL database Friday, October 4, 13

25. Mondrian on Optiq on hybrid data + memory Friday, October 4, 13

26. Summary Friday, October 4, 13

27. 3 things (reprise) 1. Databases are good. Especially the ﬂexibility that SQL gives us. 2. It is hard to build analytics if your data is all over the place. Different workloads (operational vs. analytic, small write vs. bulk read) require different data structures. 3. Optiq is not a database. But Optiq creates a federated data architecture that performs well, and looks like a database to your tools. Friday, October 4, 13

28. @julianhyde optiq https://github.com/julianhyde/optiq mondrian http://mondrian.pentaho.com blog http://julianhyde.blogspot.com Friday, October 4, 13

Optiq: A dynamic data management framework

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (20)

Semelhante a Optiq: A dynamic data management framework

Semelhante a Optiq: A dynamic data management framework (20)

Mais de Julian Hyde

Mais de Julian Hyde (20)

Último

Último (20)

Optiq: A dynamic data management framework