Today's data economics is flawed. There is a need for a fundamental change in the way we produce, distribute and consume data. This presentation describes a solution with TileDB that can shape the future of data management.
2. Data Economics
Consumption: How tools compute on the data, and where the computation happens
Distribution: Who has access to the data, what the means of access is, and how it is monetized
Production: What format the data gets produced in, and where it gets stored
3. The Problem
Distribution (sharing & collaboration) is an afterthought
Data produced in inefficient formats
Most formats are domain-specific
Tools mostly domain-specific
A lot of ETL involved for generic tools (e.g., DBs)
All data management solutions focus here
Consumption: How tools can compute on the data, where does the computation happen
4. Examples
Marine traffic (AIS): Data in CSV files; storage in cloud buckets; distributed / sold as files; analysis / visualization with GIS tools
Genomics: Data in FastQ, BAM, VCF formats; storage in cloud buckets; Dropbox-like solutions for secure sharing; analysis with tools like bcftools, GATK
Finance (historical market data): Data in text files; storage in cloud buckets; distributed / sold as files; analysis with DB and data science tools
LiDAR: Data in LAS files; storage in cloud buckets or FTP servers; distributed as files with metadata; analysis / visualization with GIS tools
5. The Production Problem
Data in some custom format
Storage in some cloud bucket or file manager
Extract-Transform-Load (ETL)
Some analytics infra
slow & expensive
often custom & in-house
costly & time consuming
6. The Distribution Problem #1
Data in some custom format
Storage in some cloud bucket or marketplace
Org #1: Download + Wrangle + Build analytics infra
Org #N: Download + Wrangle + Build analytics infra
wasteful re-invention
7. The Distribution Problem #2
Data in some custom format
ETL, etc., some analytics infra
Queries by consumer #1
Queries by consumer #N
Data owner bears the distribution cost, also re-invention across application domains
8. The Consumption Problem
Data in some custom format
Storage in some cloud bucket or server
Group #1: Wrangle + Copy - Use tool & infra #1
Group #N: Wrangle + Copy - Use tool & infra #N
inefficient & costly with poor governance
9. The Solution
No ETL, no copies
Common for all data applications
Unified governance
Built-in marketplace
One infra, any backend, any scale
Universal data management platform
Data in a universal, analysis-ready format
User / group #1: any tool, any scale
User / group #N: any tool, any scale
10. Enter TileDB
Secure governance & collaboration
Scalable, serverless compute
Data & code sharing & monetization
Pay-as-you-go, consumer pays
Extreme interoperability
No infra hassles
Universal data management platform
Data in a universal, analysis-ready format (multi-dimensional arrays)
User / group #1: any tool, any scale
User / group #N: any tool, any scale
11. The Secret Sauce | The Data Model
Store everything as dense or sparse multi-dimensional arrays
Dense array
Sparse array
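To make the dense / sparse distinction concrete, here is a minimal conceptual sketch in plain Python. It does not use the TileDB API or reflect its storage format; the names `dense`, `sparse`, `slice_dense` and `slice_sparse` are hypothetical and exist only for illustration. The assumption is the usual one: a dense array has a value in every cell, while a sparse array stores only its non-empty cells together with their coordinates.

```python
# Conceptual sketch only: plain Python, not the TileDB API.

# Dense array: every cell in the domain holds a value, so values can be
# stored contiguously and addressed by position.
dense = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

# Sparse array: most cells are empty, so only the non-empty cells are
# stored, each keyed by its (row, column) coordinates.
sparse = {
    (0, 1): 7.5,
    (2, 3): 1.2,
    (3, 0): 4.0,
}


def slice_dense(array, rows, cols):
    """Return the sub-array covering the half-open ranges rows x cols."""
    return [row[cols[0]:cols[1]] for row in array[rows[0]:rows[1]]]


def slice_sparse(cells, rows, cols):
    """Return the non-empty cells whose coordinates fall in rows x cols."""
    return {
        (r, c): v
        for (r, c), v in cells.items()
        if rows[0] <= r < rows[1] and cols[0] <= c < cols[1]
    }


dense_result = slice_dense(dense, (0, 2), (1, 3))
sparse_result = slice_sparse(sparse, (0, 3), (0, 4))
```

Both kinds of array answer the same question, "give me the cells inside this multi-dimensional range". A dense slice returns a full block of values, whereas a sparse slice returns only the cells that exist.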
12. How we built a Universal Database
TileDB Cloud (cloud.tiledb.com)
Unified data management and easy serverless compute at global scale
❏ Access control and logging
❏ Serverless SQL, UDFs, task graphs
❏ Jupyter notebooks and dashboards
Pluggable Compute: Efficient APIs & Tool Integrations
TileDB Embedded (github.com/TileDB-Inc/TileDB)
Open-source interoperable storage with a universal open-spec array format
❏ Parallel IO, rapid reads & writes
❏ Columnar, cloud-optimized
❏ Data versioning & time traveling
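As an illustration of what "data versioning & time traveling" can mean for an array, here is a small conceptual sketch in plain Python. It is not the TileDB API and makes no claim about how TileDB implements the feature. It assumes only that each write is kept as an immutable, timestamped batch and that a read can ask for the array as of a given timestamp. The class `VersionedArray` and its methods are hypothetical.

```python
# Conceptual sketch only: plain Python, not the TileDB API or on-disk format.

class VersionedArray:
    """A 1-D array where every write is kept as an immutable, timestamped batch."""

    def __init__(self, size, fill=None):
        self.size = size
        self.fill = fill
        self.writes = []  # list of (timestamp, {index: value})

    def write(self, timestamp, cells):
        # Never modify earlier batches; validate and append a new one.
        for index in cells:
            if not 0 <= index < self.size:
                raise IndexError(f"index {index} outside [0, {self.size})")
        self.writes.append((timestamp, dict(cells)))

    def read(self, at=None):
        """Return the array as of timestamp `at` (latest state if None)."""
        result = [self.fill] * self.size
        # Apply batches oldest first so that later writes win.
        for timestamp, cells in sorted(self.writes, key=lambda w: w[0]):
            if at is not None and timestamp > at:
                break
            for index, value in cells.items():
                result[index] = value
        return result


arr = VersionedArray(size=4, fill=0)
arr.write(1, {0: 10, 1: 20})
arr.write(2, {1: 25, 3: 40})

latest = arr.read()        # all writes applied
as_of_t1 = arr.read(at=1)  # only the first write is visible
```

Because earlier batches are never overwritten, reading an older version means ignoring the batches written after the requested timestamp.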
13. Who we are
Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
40 members with expertise across all applications and domains
WHERE IT ALL STARTED: TileDB was spun out of MIT and Intel Labs in 2017
INVESTORS: Raised over $20M; we are very well capitalized