An introduction to self-service data with Dremio. Dremio reimagines analytics for modern data. Created by veterans of open source and big data technologies, Dremio is a fundamentally new approach that dramatically simplifies and accelerates time to insight. Dremio empowers business users to curate precisely the data they need, from any data source, then accelerate analytical processing for BI tools, machine learning, data science, and SQL clients. Dremio starts to deliver value in minutes, and learns from your data and queries, making your data engineers, analysts, and data scientists more productive.
3. Tomer Shiran
Founder & CEO
Previously VP Product, MapR
Previously Microsoft; IBM
Jacques Nadeau
Founder & CTO
Recognized SQL & NoSQL expert
Founder of Apache Arrow
Kelly Stirman
CMO
Previously VP Strategy, MongoDB
Previously Field CTO, MarkLogic
Ajay Singh
Head of Field Engineering
Previously Technical Alliances, Hortonworks
Previously Alliances, MarkLogic
Ron Avnur
VP Engineering
Previously VP Product, MongoDB
Previously VP Engineering, MarkLogic
Collin Weitzman
VP Customer Success
Previously Sales Executive, Mesosphere
Previously Sales Executive, MapR
4. Dremio team members by area:
Columnar memory: Creators of Apache Arrow
Columnar storage: Creator of Apache Parquet
ETL: Tech lead of Twitter analytics data pipeline
UI: UI lead for Apple (iCloud Photos, iTunes U)
UX: UX lead for Splunk
World-Class Technical Team
Top Silicon Valley VCs
7. The demands for data are growing rapidly
Increasing demands:
Reporting
New products
Forecasting
Threat detection
BI
Machine learning
Segmenting
Fraud prevention
9.–11. Today you engineer data flows and reshaping
Data Staging → Data Warehouse → Cubes, BI Extracts & Aggregation Tables → SQL
Data Staging:
• Custom ETL
• Fragile transforms
• Slow moving
Data Warehouse:
• $$$
• High overhead
• Proprietary lock-in
Cubes, BI Extracts & Aggregation Tables:
• Data sprawl
• Governance issues
• Slow to update
13. There's a better way:
✓ Works with any data source
✓ Works with any BI tool
✓ No ETL, no data warehouse, no cubes
✓ Makes data self-service, collaborative
✓ Makes Big Data feel small
✓ Open source
16. Four key areas excite customers
1. BI on Modern Data: use any BI tool with Elasticsearch, MongoDB, S3, HDFS, plus joins to relational data (sketched below)
2. Autonomous Data Acceleration: make PB-scale queries fast, without cubes, aggregation tables, or ETL
3. Data Lineage: improve governance with a full view of access patterns, data flows, data reshaping, and sharing
4. Self-Service Data: empower IT and analysts to discover, curate, accelerate, and share data
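To make area 1 concrete, here is a minimal sketch of a cross-source join issued through Dremio from Python. It assumes the Dremio ODBC driver is installed and a DSN named "Dremio" is configured; the dataset paths (an Elasticsearch index and a Postgres table exposed as Dremio sources) are hypothetical.

```python
# A sketch of a join across a non-relational and a relational source,
# issued as one SQL statement through Dremio. All source and dataset
# names below are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cursor = conn.cursor()

# Dremio exposes every connected source through one SQL namespace, so the
# join reads like an ordinary two-table query.
cursor.execute(
    """
    SELECT c.region, COUNT(*) AS click_count
    FROM elastic.web.clicks AS e          -- Elasticsearch index
    JOIN postgres.crm.customers AS c      -- relational table
      ON e.customer_id = c.customer_id
    GROUP BY c.region
    """
)
for region, clicks in cursor.fetchall():
    print(region, clicks)
conn.close()
```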
BI assumes a single relational database, but…
Data in non-relational technologies
Data fragmented across many systems
Massive scale and velocity
Data is the business, and…
Era of impatient smartphone natives
Rise of self-service BI
Accelerating time to market
Because of the complexity of modern data and increasing demands for data, IT gets crushed in the middle:
Slow or non-responsive IT
“Shadow Analytics”
Data governance risk
Elusive data engineers
Immature software
Competing strategic initiatives
Here’s the problem everyone is trying to solve today.
You have consumers of data with their favorite tools: BI products like Tableau, Power BI, and Qlik, as well as data science tools like Python, R, Spark, and SQL.
Then you have all your data, in a mix of relational, NoSQL, Hadoop, and cloud like S3.
So how are you going to get the data to the people asking for it?
Here’s how everyone tries to solve it:
First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store.
You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too.
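To make that fragility concrete, here is a minimal sketch of the kind of hand-written extract script this stage accumulates. All database, table, and column names are hypothetical, with sqlite3 and a local path standing in for the operational source and the S3/HDFS staging area.

```python
# A fragile, hand-written extract script (hypothetical names throughout).
# sqlite3 stands in for the operational database; a local path stands in
# for the S3/HDFS staging area.
import csv
import sqlite3

SOURCE_DB = "orders.db"               # hypothetical operational database
STAGING_PATH = "staging/orders.csv"   # stand-in for an S3/HDFS staging path

# Hard-coding the column list is exactly what makes scripts like this
# fragile: when the source schema changes, the extract breaks or silently
# drops data until someone rewrites it.
COLUMNS = ["order_id", "customer_id", "amount", "created_at"]

def extract_and_stage():
    conn = sqlite3.connect(SOURCE_DB)
    try:
        rows = conn.execute(f"SELECT {', '.join(COLUMNS)} FROM orders")
        with open(STAGING_PATH, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(COLUMNS)   # header for downstream loads
            writer.writerows(rows)     # full re-extract on every run
    finally:
        conn.close()

if __name__ == "__main__":
    extract_and_stage()
```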
Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts.
But what we see with many customers is that the performance here isn’t sufficient for their needs, and so …
You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts.
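As an illustration, here is a minimal sketch of one such aggregation script, again with hypothetical table names and sqlite3 standing in for the warehouse. Every new dashboard question tends to spawn another rollup like this, and each one goes stale until the script is re-run.

```python
# A sketch of the "yet another set of scripts" layer: precomputing an
# aggregation table so dashboards stay fast. sqlite3 stands in for the
# warehouse (Redshift, Teradata, ...); table names are hypothetical.
import sqlite3

warehouse = sqlite3.connect("warehouse.db")
with warehouse:
    warehouse.execute("DROP TABLE IF EXISTS daily_sales_agg")
    # Rebuild the rollup from scratch; until the next run, the
    # aggregates drift out of date.
    warehouse.execute(
        """
        CREATE TABLE daily_sales_agg AS
        SELECT DATE(created_at) AS sale_date,
               customer_id,
               SUM(amount)  AS total_amount,
               COUNT(*)     AS order_count
        FROM orders
        GROUP BY DATE(created_at), customer_id
        """
    )
warehouse.close()
```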
In the end you’re left with something like this picture. You may have more layers, the technologies may be different, but you’re probably living with something like this. And nobody likes this – it’s expensive, the data movement is slow, it’s hard to change.
But worst of all, you’re left with a dynamic where every time a consumer of the data wants a new piece of data:
They open a ticket with IT
IT begins an engineering project to build another set of pipelines, over several weeks or months
And so we started Dremio to say, hey, we think there’s a better way to do this.
And when we got started, we asked ourselves: what would we need to do to make this better? We came up with these requirements.
Works with any source: relational, non-relational, third-party apps. Five years ago nobody was using Hadoop, S3, or MongoDB, and five years from now there will be new products. You need a solution that is future-proof.
Works with any BI tool. In every company multiple tools are in use, and each department has its favorite. We need to work with all of them.
No ETL, data warehouse, or cubes. It would need to give you a genuinely good alternative to all of these.
Makes data self-service, collaborative. Probably most important of all, we need to change the dynamic between the business and IT. We need to make it so business users can get the data they want, in the shape they want it, without waiting on IT.
Makes Big Data feel small. It needs to make billions of rows feel like a spreadsheet on your desktop.
Open source. It’s 2017, so we think this has to be open source.
And that’s Dremio. It sits between all the places you’re creating or capturing data, and all the tools you use to access data. At a high level, that’s how Dremio works. We’ll get into how it works a little later.
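For a feel of what "sitting between" looks like from a client, here is a minimal sketch of submitting SQL to Dremio over its REST interface. The endpoint paths and payload shapes (login on /apiv2/login, query submission on /api/v3/sql, the "_dremio" token prefix) are written from memory of Dremio's documentation and should be verified for your version; the host and credentials are placeholders.

```python
# Submitting SQL to Dremio over HTTP (endpoints from memory of Dremio's
# REST API; verify against the docs for your version).
import json
import urllib.request

BASE = "http://localhost:9047"   # Dremio coordinator's default HTTP port

def post(path, payload, token=""):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    if token:
        # Dremio expects the session token prefixed with "_dremio"
        req.add_header("Authorization", "_dremio" + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Placeholder credentials; returns a session token.
token = post("/apiv2/login", {"userName": "admin", "password": "secret"})["token"]
# Submits the query and returns a job id to poll for results.
job = post("/api/v3/sql", {"sql": "SELECT 1"}, token)
print(job)
```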
To go one level deeper, Dremio is a distributed process that you run on anywhere from one to 1,000+ servers. You can run it on dedicated infrastructure, like you see on the left, or in your Hadoop cluster, provisioned and managed via YARN.
OK, enough with the pictures, let’s get into the demo.
But one quick question for you.
Dremio is a big product, and there are lots of things we could show you, but it would be great to get a little guidance on how to spend our time.
When we show Dremio to customers, they tend to get excited by four key areas: