SlideShare uma empresa Scribd logo
1 de 37
Baixar para ler offline
Andrew Lamb
Guest Lecture
CSC-132 Database Systems Implementation in Spring '21.
UCSD
April 20, 2021
Apache Arrow and its Impact
on the Database industry
IOx Team at InfluxData
Apache Arrow PMC Member
Query Optimizer / Architect @ Vertica
Technical Staff @ Oracle, on Database Server
Chief Architect @ DataRobot (ML Startup)
Chief Architect/VP Engineering @ Nutonian (ML Startup)
XLST JIT Compiler Team at DataPower (XML startup)
Goals + Outline
Goal: ⇒ Arrow is a good basis for a new (time series) Databases ❤
● Opinions and Perspectives of Databases
● Why and What of Arrow
● Arrow Examples, in Rust
Props to Wes McKinney who provided many of the arrow slides
Databases -- Trend Towards Specialization
Relational
Key-Value
Timeseries
Graph
Array / Scientific
Document
Stream
Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st
International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI:https://doi.org/10.1109/ICDE.2005.1
Data Model Deployment
Embedded / Edge
Cloud
Single-Node
Hybrid
Ecosystem
Hadoop
Java
Json / Javascript
AWS
GCP
Azure
Apple Cloud
Use Case
Transactions
Analytics
Streaming
...
… and our new database is …
🎉
InfluxDB IOx - The Future Core of InfluxDB Built
with Rust and Arrow
And we are not alone
A proliferation of new databases…. And many are choosing to use some of the
Arrow ecosystem
⇒ lets them innovate at a higher level not the nuts and bolts
6
Arrow Parquet
● AWS Athena
● ClickHouse
● Snowflake
● Vertica
● DuckDb
● Greenplum
● DataBricks
● …
Arrow Flight
● Google BigQuery
● Snowflake
● Dremio
● DataBricks
● ...
Arrow Arrays
● Dremio
● TileDB
● SciDB
● ...
Analytic Systems (vs Transactional)
● Transactional (OLTP, Key-value stores, etc)
○ Workload is “lookup a record by id”, “update a record”, “keep data durable and consistent”
○ Examples: Oracle, Postgres, Cassandra, DynamoDB, MongoDB, SQLite etc etc
● Analytic (OLAP, “Big Data”, etc)
○ Workload: aggregate many rows to get historical view, bulk loads, rarely updated
○ Examples: ClickHouse, MapReduce, Spark, Vertica, Pig, Hive, InfluxDB, etc etc
⇒ Rest of the talk focused on Analytic Databases
So, you want to build a new database… ?
Databases need many features just to look like a database:
● Get Data In and Out
● Store Data and Catalog / Metadata
● Query Language + Query Engine
● Connect: Client API
…
Before you can invest in what makes your database special
Implementation timeline for a new Database system
Client
API
In memory
storage
In-Memory
filter + aggregation
Durability /
persistence
Metadata Catalog +
Management
Query
Language
Parser
Optimized /
Compressed
storage
Execution on
Compressed
Data
Joins!
Additional Client
Languages
Outer
Joins
Subquery
support
More advanced
analytics
Cost
based
optimizer
Out of core
algorithms
Storage
Rearrangement
Heuristic
Query
Planner
Arithmetic
expressions
Date / time
Expressions
Concurrency
Control
Data Model /
Type System
Distributed query
execution
Resource
Management
“Lets Build
a Database”
🤔
“Ok now this
is pretty
good”
😐
“Look mom!
I have a
database!”
😃
Online
recovery
10
Arrow Project Goals
“Build a better open source
foundation for data science”
🤔 How is this related to databases?
https://arrow.apache.org/
• Open source project begun in 2016, founded by
Wes McKinney
• Initially focused on applying systems / database
research to data science ecosystems (pandas)
Data scientists working with “small” data
have not experienced great pain
Small Data (< ~10GB)
Medium Data (~10 - ~100GB)
Big Data (> ~100GB-1TB)
Current Python/R
stack begins to “fail”
around this point
Users doing fine here
NYC R Conference 2018-04-20
We can do so much better through modern
systems techniques
Multi-core algorithms,
GPU acceleration,
Code generation (LLVM)
Lazy evaluation,
“query” optimization
Sophisticated memory
management,
Efficient access to huge
data sets
Interoperable memory
models, zero-copy
interchange between
system components
Note 1
Moore’s Law (and small
data) enabled us to get by
for a long time without
confronting some of these
challenges
Note 2
Most of these methods
have already been widely
employed in analytic
databases. Limited
“novel” research needed
NYC R Conference 2018-04-20
• Language agnostic, state-of-the-art library
for building fast data systems
• Arrow Memory Format
• Compute Kernels (not standardized)
• Parquet File Format
• Flight: Data RPC / Network Transfer format
• Query Engines (not standardized)
Arrow Implementations and Bindings
16
Java
C++
JavaScript
Go
Rust
C Ruby
Python
Julia
R
C#
Matlab
17
Defragmenting Data Access
https://www.snowflake.com/blog/fetching-query-results-from-snowflake-just-got-a-lot-faster-with-apache-arrow/
https://medium.com/google-cloud/announcing-google-cloud-bigquery-version-1-17-0-1fc428512171
Apache Arrow in Academic Papers
20
Some Industry Contributors in Apache Arrow
ClearCode
Arrow == toolkit for a modern analytic databases
match tool_needed {
File Format (persistence) => Parquet
Columnar memory representation => Arrow Arrays
Operations (e.g. add, multiply) => Compute Kernels
Network transfer => Arrow Flight IPC
_ => ... to be continued ...
}
Code Examples
(In Rust)
23
Code Examples
Thesis: “When writing an analytic database, you will end up implementing the
Arrow feature set”
+
* Take performance comparisons with a large grain of salt
Compare Plain Rust and Rust using the Arrow library
InfluxDB line protocol
weather,location=us-east temperature=82,humidity=67 1465839830100400200
weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
weather,location=us-west temperature=70,humidity=54 1465839830100400200
weather,location=us-east temperature=83,humidity=69 1465839830200400200
weather,location=us-midwest temperature=87,humidity=78 1465839830200400200
weather,location=us-west temperature=72,humidity=56 1465839830200400200
weather,location=us-east temperature=84,humidity=67 1465839830300400200
weather,location=us-midwest temperature=90,humidity=82 1465839830400400200
weather,location=us-west temperature=71,humidity=57 1465839830400400200
Line Protocol Tutorial (link)
Measurements
Tags Fields
Timestamp
IOx Data Model
weather,location=us-east temperature=82,humidity=67 1465839830100400200
weather,location=us-midwest temperature=82,humidity=65 1465839830100400200
weather,location=us-west temperature=70,humidity=54 1465839830100400200
weather,location=us-east temperature=83,humidity=69 1465839830200400200
weather,location=us-midwest temperature=87,humidity=78 1465839830200400200
weather,location=us-west temperature=72,humidity=56 1465839830200400200
weather,location=us-east temperature=84,humidity=67 1465839830300400200
weather,location=us-midwest temperature=90,humidity=82 1465839830400400200
weather,location=us-west temperature=71,humidity=57 1465839830400400200
location
"us-east"
"us-midwest"
"us-west"
"us-east"
"us-midwest"
"us-west"
"us-east"
"us-midwest"
"us-west"
temperature
82
82
70
83
87
72
84
90
71
humidity
67
65
54
69
78
56
67
82
57
timestamp
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.1004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.2004002Z
2016-06-13T17:43:50.3004002Z
2016-06-13T17:43:50.3004002Z
2016-06-13T17:43:50.3004002Z
Table: weather
Motivating Example
“Find the rows that are not in `us-west`”
Create the Array
let string_vec: Vec<String> =
(0..NUM_TAGS)
.map(|i| {
match i % 3 {
0 => "us-east",
1 => "us-midwest",
2 => "us-west",
}.into()
})
.collect();
let mut builder =
StringBuilder::new(NUM_TAGS);
(0..NUM_TAGS).enumerate()
.for_each(|(i, _)| {
let location = match i % 3 {
0 => "us-east",
1 => "us-midwest",
2 => "us-west",
};
builder.append_value(location)
.unwrap()
});
let array = builder.finish();
> created array with 10000000 elements
~600ms
> created array with 10000000 elements
~400ms
+
Memory Footprint
let size =
size_of::<Vec<String>>() +
string_vec
.iter()
.fold(0, |sz, s| {
sz + size_of::<String>() + s.len()
});
println!("total size: {} bytes", size);
println!("total size: {} bytes",
array.get_array_memory_size());
> total size: 320000023 bytes
~320 MB *
> total size: 149206128 bytes
~150 MB
+
Find Rows != “us-west”
let not_west_bitset: Vec<bool> =
string_vec
.iter()
.map(|s| s != "us-west")
.collect();
let num_not_west = not_west_bitset
.iter()
.filter(|&&v| v)
.count();
let not_west_bitset =
neq_utf8_scalar(
&array,
"us-west"
).unwrap();
let num_not_west = not_west_bitset
.iter()
.filter(|v| matches!(v, Some(true)))
.count();
> Found 6666667 not in west
~50ms
> Found 6666667 not in west
~120ms
+
Find Rows != “us-west” (with null handling)
let string_vec: Vec<Option<String>> = ...;
let not_west_bitset: Vec<bool> =
string_vec
.iter()
.map(|s| {
s.as_ref()
.map(|s| s != "us-west")
.unwrap_or(false)
})
.collect();
let num_not_west = not_west_bitset
.iter()
.filter(|&&v| v)
.count();
+
Same as previous
> Found 6666667 not in west
~50ms
Materialize rows for future processing
let not_west: Vec<String> = not_west_bitset
.iter()
.enumerate()
.filter_map(|(i, &v)| {
if v {
Some(string_vec[i].clone())
} else {
None
}
})
.collect();
let not_west = filter(
&array,
&not_west_bitset
).unwrap();
> Made array of 6666667 Strings not in west
~450 ms
> Made array of 6666667 Strings not in west
~50 ms
+
More efficient encoding (dictionary)
let vb = StringBuilder::new();
let kb = Int8Builder::new();
let mut builder =
StringDictionaryBuilder::new(vb,kb);
(0..NUM_TAGS)
.enumerate()
.for_each(|(i, _)| {
let location = match i % 3 {
0 => "us-east",
1 => "us-midwest",
2 => "us-west",
};
builder.append(location).unwrap();
});
let array = builder.finish();
> total size: 10000688 bytes
10MB
250 ms
+
dictionary
"us-east"
"us-midwest"
"us-west"
Location
0
1
2
0
1
2
0
1
2
[0]
[1]
[2]
[u8]
SIMD Anyone?
let output = gt(
&left,
&right
).unwrap();
+
10
20
17
5
23
5
9
12
4
5
76
2
3
5
2
33
2
1
6
7
8
2
7
2
5
6
7
8
left right output
1
0
1
1
1
0
1
1
0
1
1
0
0
0
>
>
>
>
SIMD Implementation
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"),
feature = "simd"))]
fn simd_compare_op<T, F>(left: &PrimitiveArray<T>,
right: &PrimitiveArray<T>, op: F) -> Result<BooleanArray>
where
T: ArrowNumericType,
F: Fn(T::Simd, T::Simd) -> T::SimdMask,
{
// use / error checking elided
let null_bit_buffer = combine_option_bitmap(
left.data_ref(), right.data_ref(), len
)?;
let lanes = T::lanes();
let mut result = MutableBuffer::new(
left.len() * mem::size_of::<bool>()
);
let rem = len % lanes;
for i in (0..len - rem).step_by(lanes) {
let simd_left = T::load(left.value_slice(i, lanes));
let simd_right = T::load(right.value_slice(i, lanes));
let simd_result = op(simd_left, simd_right);
T::bitmask(&simd_result, |b| {
result.write(b).unwrap();
});
}
Source: arrow/src/compute/kernels/comparison.rs
if rem > 0 {
let simd_left = T::load(left.value_slice(len - rem, lanes));
let simd_right = T::load(right.value_slice(len - rem,
lanes));
let simd_result = op(simd_left, simd_right);
let rem_buffer_size = (rem as f32 / 8f32).ceil() as usize;
T::bitmask(&simd_result, |b| {
result.write(&b[0..rem_buffer_size]).unwrap();
});
}
let data = ArrayData::new(
DataType::Boolean,
left.len(),
None,
null_bit_buffer,
0,
vec![result.freeze()],
vec![],
);
Ok(PrimitiveArray::<BooleanType>::from(Arc::new(data)))
}
Other things needed in a database
Vec<Option<String>> to support nulls
Handle other data types with same code
Vectorized implementations of filter, aggregate, etc
Persist it to storage
Send data over the network
Ecosystem compatibility
...
Thank You

Mais conteúdo relacionado

Mais procurados

How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceDataWorks Summit
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Julian Hyde
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsDatabricks
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeDremio Corporation
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...Andrew Lamb
 

Mais procurados (20)

How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
 

Semelhante a 2021 04-20 apache arrow and its impact on the database industry.pptx

Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?Örjan Lundberg
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...Marcin Bielak
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
OrientDB for real & Web App development
OrientDB for real & Web App developmentOrientDB for real & Web App development
OrientDB for real & Web App developmentLuca Garulli
 
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...NoSQLmatters
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeIBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeTorsten Steinbach
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseFlorian Lautenschlager
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Zhenxiao Luo
 

Semelhante a 2021 04-20 apache arrow and its impact on the database industry.pptx (20)

Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
MCT Virtual Summit 2021
MCT Virtual Summit 2021MCT Virtual Summit 2021
MCT Virtual Summit 2021
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
OrientDB for real & Web App development
OrientDB for real & Web App developmentOrientDB for real & Web App development
OrientDB for real & Web App development
 
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
Simon Elliston Ball – When to NoSQL and When to Know SQL - NoSQL matters Barc...
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeIBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data Lake
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 
DA_01_Intro.pptx
DA_01_Intro.pptxDA_01_Intro.pptx
DA_01_Intro.pptx
 
NextGenML
NextGenML NextGenML
NextGenML
 

Último

BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxalwaysnagaraju26
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfayushiqss
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyAnusha Are
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfVishalKumarJha10
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...Nitya salvi
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 

Último (20)

BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 

2021 04-20 apache arrow and its impact on the database industry.pptx

  • 1. Andrew Lamb Guest Lecture CSC-132 Database Systems Implementation in Spring '21. UCSD April 20, 2021 Apache Arrow and its Impact on the Database industry
  • 2. IOx Team at InfluxData Apache Arrow PMC Member Query Optimizer / Architect @ Vertica Technical Staff @ Oracle, on Database Server Chief Architect @ DataRobot (ML Startup) Chief Architect/VP Engineering @ Nutonian (ML Startup) XLST JIT Compiler Team at DataPower (XML startup)
  • 3. Goals + Outline Goal: ⇒ Arrow is a good basis for a new (time series) Databases ❤ ● Opinions and Perspectives of Databases ● Why and What of Arrow ● Arrow Examples, in Rust Props to Wes McKinney who provided many of the arrow slides
  • 4. Databases -- Trend Towards Specialization Relational Key-Value Timeseries Graph Array / Scientific Document Stream Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI:https://doi.org/10.1109/ICDE.2005.1 Data Model Deployment Embedded / Edge Cloud Single-Node Hybrid Ecosystem Hadoop Java Json / Javascript AWS GCP Azure Apple Cloud Use Case Transactions Analytics Streaming ...
  • 5. … and our new database is … 🎉 InfluxDB IOx - The Future Core of InfluxDB Built with Rust and Arrow
  • 6. And we are not alone A proliferation of new databases…. And many are choosing to use some of the Arrow ecosystem ⇒ lets them innovate at a higher level not the nuts and bolts 6 Arrow Parquet ● AWS Athena ● ClickHouse ● Snowflake ● Vertica ● DuckDb ● Greenplum ● DataBricks ● … Arrow Flight ● Google BigQuery ● Snowflake ● Dremio ● DataBricks ● ... Arrow Arrays ● Dremio ● TileDB ● SciDB ● ...
  • 7. Analytic Systems (vs Transactional) ● Transactional (OLTP, Key-value stores, etc) ○ Workload is “lookup a record by id”, “update a record”, “keep data durable and consistent” ○ Examples: Oracle, Postgres, Cassandra, DynamoDB, MongoDB, SQLite etc etc ● Analytic (OLAP, “Big Data”, etc) ○ Workload: aggregate many rows to get historical view, bulk loads, rarely updated ○ Examples: ClickHouse, MapReduce, Spark, Vertica, Pig, Hive, InfluxDB, etc etc ⇒ Rest of the talk focused on Analytic Databases
  • 8. So, you want to build a new database… ? Databases need many features just to look like a database: ● Get Data In and Out ● Store Data and Catalog / Metadata ● Query Language + Query Engine ● Connect: Client API … Before you can invest in what makes your database special
  • 9. Implementation timeline for a new Database system Client API In memory storage In-Memory filter + aggregation Durability / persistence Metadata Catalog + Management Query Language Parser Optimized / Compressed storage Execution on Compressed Data Joins! Additional Client Languages Outer Joins Subquery support More advanced analytics Cost based optimizer Out of core algorithms Storage Rearrangement Heuristic Query Planner Arithmetic expressions Date / time Expressions Concurrency Control Data Model / Type System Distributed query execution Resource Management “Lets Build a Database” 🤔 “Ok now this is pretty good” 😐 “Look mom! I have a database!” 😃 Online recovery
  • 10. 10
  • 11. Arrow Project Goals “Build a better open source foundation for data science” 🤔 How is this related to databases? https://arrow.apache.org/
  • 12. • Open source project begun in 2016, founded by Wes McKinney • Initially focused on applying systems / database research to data science ecosystems (pandas)
  • 13. Data scientists working with “small” data have not experienced great pain Small Data (< ~10GB) Medium Data (~10 - ~100GB) Big Data (> ~100GB-1TB) Current Python/R stack begins to “fail” around this point Users doing fine here NYC R Conference 2018-04-20
  • 14. We can do so much better through modern systems techniques Multi-core algorithms, GPU acceleration, Code generation (LLVM) Lazy evaluation, “query” optimization Sophisticated memory management, Efficient access to huge data sets Interoperable memory models, zero-copy interchange between system components Note 1 Moore’s Law (and small data) enabled us to get by for a long time without confronting some of these challenges Note 2 Most of these methods have already been widely employed in analytic databases. Limited “novel” research needed NYC R Conference 2018-04-20
  • 15. • Language agnostic, state-of-the-art library for building fast data systems • Arrow Memory Format • Compute Kernels (not standardized) • Parquet File Format • Flight: Data RPC / Network Transfer format • Query Engines (not standardized)
  • 16. Arrow Implementations and Bindings 16 Java C++ JavaScript Go Rust C Ruby Python Julia R C# Matlab
  • 20. Apache Arrow in Academic Papers 20
  • 21. Some Industry Contributors in Apache Arrow ClearCode
  • 22. Arrow == toolkit for a modern analytic databases match tool_needed { File Format (persistence) => Parquet Columnar memory representation => Arrow Arrays Operations (e.g. add, multiply) => Compute Kernels Network transfer => Arrow Flight IPC _ => ... to be continued ... }
  • 24. Code Examples Thesis: “When writing an analytic database, you will end up implementing the Arrow feature set” + * Take performance comparisons with a large grain of salt Compare Plain Rust and Rust using the Arrow library
  • 25. InfluxDB line protocol weather,location=us-east temperature=82,humidity=67 1465839830100400200 weather,location=us-midwest temperature=82,humidity=65 1465839830100400200 weather,location=us-west temperature=70,humidity=54 1465839830100400200 weather,location=us-east temperature=83,humidity=69 1465839830200400200 weather,location=us-midwest temperature=87,humidity=78 1465839830200400200 weather,location=us-west temperature=72,humidity=56 1465839830200400200 weather,location=us-east temperature=84,humidity=67 1465839830300400200 weather,location=us-midwest temperature=90,humidity=82 1465839830400400200 weather,location=us-west temperature=71,humidity=57 1465839830400400200 Line Protocol Tutorial (link) Measurements Tags Fields Timestamp
  • 26. IOx Data Model weather,location=us-east temperature=82,humidity=67 1465839830100400200 weather,location=us-midwest temperature=82,humidity=65 1465839830100400200 weather,location=us-west temperature=70,humidity=54 1465839830100400200 weather,location=us-east temperature=83,humidity=69 1465839830200400200 weather,location=us-midwest temperature=87,humidity=78 1465839830200400200 weather,location=us-west temperature=72,humidity=56 1465839830200400200 weather,location=us-east temperature=84,humidity=67 1465839830300400200 weather,location=us-midwest temperature=90,humidity=82 1465839830400400200 weather,location=us-west temperature=71,humidity=57 1465839830400400200 location "us-east" "us-midwest" "us-west" "us-east" "us-midwest" "us-west" "us-east" "us-midwest" "us-west" temperature 82 82 70 83 87 72 84 90 71 humidity 67 65 54 69 78 56 67 82 57 timestamp 2016-06-13T17:43:50.1004002Z 2016-06-13T17:43:50.1004002Z 2016-06-13T17:43:50.1004002Z 2016-06-13T17:43:50.2004002Z 2016-06-13T17:43:50.2004002Z 2016-06-13T17:43:50.2004002Z 2016-06-13T17:43:50.3004002Z 2016-06-13T17:43:50.3004002Z 2016-06-13T17:43:50.3004002Z Table: weather
  • 27. Motivating Example “Find the rows that are not in `us-west`”
  • 28. Create the Array let string_vec: Vec<String> = (0..NUM_TAGS) .map(|i| { match i % 3 { 0 => "us-east", 1 => "us-midwest", 2 => "us-west", }.into() }) .collect(); let mut builder = StringBuilder::new(NUM_TAGS); (0..NUM_TAGS).enumerate() .for_each(|(i, _)| { let location = match i % 3 { 0 => "us-east", 1 => "us-midwest", 2 => "us-west", }; builder.append_value(location) .unwrap() }); let array = builder.finish(); > created array with 10000000 elements ~600ms > created array with 10000000 elements ~400ms +
  • 29. Memory Footprint let size = size_of::<Vec<String>>() + string_vec .iter() .fold(0, |sz, s| { sz + size_of::<String>() + s.len() }); println!("total size: {} bytes", size); println!("total size: {} bytes", array.get_array_memory_size()); > total size: 320000023 bytes ~320 MB * > total size: 149206128 bytes ~150 MB +
  • 30. Find Rows != “us-west” let not_west_bitset: Vec<bool> = string_vec .iter() .map(|s| s != "us-west") .collect(); let num_not_west = not_west_bitset .iter() .filter(|&&v| v) .count(); let not_west_bitset = neq_utf8_scalar( &array, "us-west" ).unwrap(); let num_not_west = not_west_bitset .iter() .filter(|v| matches!(v, Some(true))) .count(); > Found 6666667 not in west ~50ms > Found 6666667 not in west ~120ms +
  • 31. Find Rows != “us-west” (with null handling) let string_vec: Vec<Option<String>> = ...; let not_west_bitset: Vec<bool> = string_vec .iter() .map(|s| { s.as_ref() .map(|s| s != "us-west") .unwrap_or(false) }) .collect(); let num_not_west = not_west_bitset .iter() .filter(|&&v| v) .count(); + Same as previous > Found 6666667 not in west ~50ms
  • 32. Materialize rows for future processing let not_west: Vec<String> = not_west_bitset .iter() .enumerate() .filter_map(|(i, &v)| { if v { Some(string_vec[i].clone()) } else { None } }) .collect(); let not_west = filter( &array, &not_west_bitset ).unwrap(); > Made array of 6666667 Strings not in west ~450 ms > Made array of 6666667 Strings not in west ~50 ms +
  • 33. More efficient encoding (dictionary) let vb = StringBuilder::new(); let kb = Int8Builder::new(); let mut builder = StringDictionaryBuilder::new(vb,kb); (0..NUM_TAGS) .enumerate() .for_each(|(i, _)| { let location = match i % 3 { 0 => "us-east", 1 => "us-midwest", 2 => "us-west", }; builder.append(location).unwrap(); }); let array = builder.finish(); > total size: 10000688 bytes 10MB 250 ms + dictionary "us-east" "us-midwest" "us-west" Location 0 1 2 0 1 2 0 1 2 [0] [1] [2] [u8]
  • 34. SIMD Anyone? let output = gt( &left, &right ).unwrap(); + 10 20 17 5 23 5 9 12 4 5 76 2 3 5 2 33 2 1 6 7 8 2 7 2 5 6 7 8 left right output 1 0 1 1 1 0 1 1 0 1 1 0 0 0 > > > >
  • 35. SIMD Implementation #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), feature = "simd"))] fn simd_compare_op<T, F>(left: &PrimitiveArray<T>, right: &PrimitiveArray<T>, op: F) -> Result<BooleanArray> where T: ArrowNumericType, F: Fn(T::Simd, T::Simd) -> T::SimdMask, { // use / error checking elided let null_bit_buffer = combine_option_bitmap( left.data_ref(), right.data_ref(), len )?; let lanes = T::lanes(); let mut result = MutableBuffer::new( left.len() * mem::size_of::<bool>() ); let rem = len % lanes; for i in (0..len - rem).step_by(lanes) { let simd_left = T::load(left.value_slice(i, lanes)); let simd_right = T::load(right.value_slice(i, lanes)); let simd_result = op(simd_left, simd_right); T::bitmask(&simd_result, |b| { result.write(b).unwrap(); }); } Source: arrow/src/compute/kernels/comparison.rs if rem > 0 { let simd_left = T::load(left.value_slice(len - rem, lanes)); let simd_right = T::load(right.value_slice(len - rem, lanes)); let simd_result = op(simd_left, simd_right); let rem_buffer_size = (rem as f32 / 8f32).ceil() as usize; T::bitmask(&simd_result, |b| { result.write(&b[0..rem_buffer_size]).unwrap(); }); } let data = ArrayData::new( DataType::Boolean, left.len(), None, null_bit_buffer, 0, vec![result.freeze()], vec![], ); Ok(PrimitiveArray::<BooleanType>::from(Arc::new(data))) }
  • 36. Other things needed in a database Vec<Option<String>> to support nulls Handle other data types with same code Vectorized implementations of filter, aggregate, etc Persist it to storage Send data over the network Ecosystem compatibility ...