Parquet
Columnar storage for the people
Julien Le Dem (@J_), processing tools lead, analytics infrastructure at Twitter
Nong Li (nong@cloudera.com), software engineer, Cloudera Impala
http://parquet.io
Outline
• Context: from various companies
• Results: in production and benchmarks
• Format: deep-dive
Twitter Context
• Twitter’s data:
  • 230M+ monthly active users generating and consuming 500M+ tweets a day.
  • 100TB+ a day of compressed data
  • Scale is huge: instrumentation, user graph, derived data, ...
• Analytics infrastructure:
  • Several 1K+ node Hadoop clusters
  • Log collection pipeline
  • Processing tools

(Image: "The Parquet Planers", Gustave Caillebotte)
Twitter’s use case
• Logs available on HDFS
• Thrift to store logs
• Example: one schema has 87 columns, up to 7 levels of nesting.

struct LogEvent {
  1: optional logbase.LogBase log_base
  2: optional i64 event_value
  3: optional string context
  4: optional string referring_event
  ...
  18: optional EventNamespace event_namespace
  19: optional list<Item> items
  20: optional map<AssociationType,Association> associations
  21: optional MobileDetails mobile_details
  22: optional WidgetDetails widget_details
  23: optional map<ExternalService,string> external_ids
}

struct LogBase {
  1: string transaction_id,
  2: string ip_address,
  ...
  15: optional string country,
  16: optional string pid,
}
Goal
To have a state-of-the-art columnar storage format available across the Hadoop platform.
• Hadoop is very reliable for big, long-running queries, but also IO-heavy.
• Incrementally take advantage of column-based storage in existing frameworks.
• Not tied to any framework in particular.
Columnar Storage
• Limits IO to data actually needed: loads only the columns that need to be accessed.
• Saves space:
  • Columnar layout compresses better
  • Type-specific encodings
• Enables vectorized execution engines.

(Image: @EmrgencyKittens)
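A toy model can make the IO argument concrete. This is not Parquet's actual on-disk layout, just an illustration of why a columnar arrangement lets a scan touch only the columns a query needs:

```python
# Toy model (not Parquet's real layout): the same table stored row-wise
# and column-wise, to show why columnar storage limits IO.
rows = [
    {"user": "a", "tweets": 10, "country": "US"},
    {"user": "b", "tweets": 25, "country": "FR"},
    {"user": "c", "tweets": 3,  "country": "US"},
]

# Row layout: every record interleaves all columns, so scanning one field
# still reads every byte of every record.
row_storage = rows

# Columnar layout: one contiguous sequence per column. Similar values sit
# next to each other, which is also why it compresses better.
columnar_storage = {
    col: [r[col] for r in rows] for col in ("user", "tweets", "country")
}

# A query over "tweets" reads exactly one column, not the whole table.
total_tweets = sum(columnar_storage["tweets"])
print(total_tweets)  # 38
```
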
Collaboration between Twitter and Cloudera:
• Common file format definition:
  • Language independent
  • Formally specified
• Implementation in Java for Map/Reduce:
  • https://github.com/Parquet/parquet-mr
• C++ and code generation in Cloudera Impala:
  • https://github.com/cloudera/impala
Results in Impala

(Chart: size in GB of the TPC-H lineitem table @ 1TB scale factor, stored as Text, Seq w/ Snappy, RC w/ Snappy, and Parquet w/ Snappy)

(Chart: Impala query times on TPC-DS in seconds, wall clock, 0-500s scale, for queries Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79, Q96)
Criteo: The Context
• Billions of new events per day
• ~60 columns per log
• Heavy analytic workload
• BI analysts using Hive and RCFile
• Frequent schema modifications
• Perfect use case for Parquet + Hive!
Parquet + Hive: Basic Reqs
• MapRed compatibility due to Hive.
• Correctly handle evolving schemas across Parquet files.
• Read only the columns used by the query to minimize data read.
• Interoperability with other execution engines (e.g. Pig, Impala, etc.)
Performance of Hive 0.11 with Parquet vs ORC
Size relative to text:
  orc-snappy: 35%
  parquet-snappy: 33%

Setup: TPC-DS scale factor 100; all jobs calibrated to run ~50 mappers; nodes: 2 x 6 cores, 96 GB RAM, 14 x 3TB disks.

(Chart: total CPU seconds, orc-snappy vs parquet-snappy, for queries q7, q8, q19, q34, q42, q43, q46, q52, q55, q59, q63, q65, q68, q73, q79, q89, q98)
Twitter: production results
Data converted: similar to access logs. 30 columns.
Original format: Thrift binary in block-compressed files (LZO)
New format: Parquet (LZO)

(Charts: space and scan time of Parquet relative to Thrift, reading from 1 to 30 columns)

• Space saving: 30% using the same compression algorithm
• Scan + assembly time compared to original:
  • One column: 10%
  • All columns: 110%
Production savings at Twitter
• Petabytes of storage saved.
• Example jobs taking advantage of projection push down:
  • Job 1 (Pig): reading 32% less data => 20% task time saving.
  • Job 2 (Scalding): reading 14 out of 35 columns, i.e. 80% less data => 66% task time saving.
• Terabytes of scanning saved every day.
Format
• Row group: a group of rows in columnar format.
  • Max size buffered in memory while writing.
  • One (or more) per split while reading.
  • Roughly: 50MB < row group < 1GB
• Column chunk: the data for one column in a row group.
  • Column chunks can be read independently for efficient scans.
• Page: unit of access in a column chunk.
  • Should be big enough for compression to be efficient.
  • Minimum size to read to access a single record (when index pages are available).
  • Roughly: 8KB < page < 1MB

(Diagram: a row group containing one column chunk per column (a, b, c), each chunk split into pages)
Format
• Layout: row groups in columnar format. A footer contains the column chunk offsets and the schema.
• Language independent: well-defined format, with Hadoop and Cloudera Impala support.
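A minimal sketch of this physical layout: row groups written first, then a footer holding the schema and each column chunk's offset. The "PAR1" magic and the trailing 4-byte little-endian footer length mirror the real format; the JSON footer and fake chunk payloads are illustrative only (real Parquet uses a Thrift-encoded footer).

```python
import json
import struct

MAGIC = b"PAR1"  # real Parquet files start and end with this magic

def write_file(schema, row_groups):
    # row_groups: list of {column_name: bytes}; payloads are fake stand-ins
    out = bytearray(MAGIC)
    footer = {"schema": schema, "row_groups": []}
    for rg in row_groups:
        chunks = {}
        for name, data in rg.items():
            # record where each column chunk starts so readers can seek to it
            chunks[name] = {"offset": len(out), "length": len(data)}
            out += data
        footer["row_groups"].append(chunks)
    footer_bytes = json.dumps(footer).encode()
    out += footer_bytes
    out += struct.pack("<I", len(footer_bytes))  # footer length, little-endian
    out += MAGIC
    return bytes(out)

def read_footer(buf):
    # readers start from the end: magic, footer length, then the footer itself
    assert buf[:4] == MAGIC and buf[-4:] == MAGIC
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return json.loads(buf[-8 - footer_len:-8])

data = write_file(["a", "b"], [{"a": b"AAAA", "b": b"BB"}])
footer = read_footer(data)
print(footer["row_groups"][0]["a"]["offset"])  # 4
```

Because the footer sits at the end, a writer can stream row groups out without seeking back, and a reader can jump straight to any column chunk.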
Nested record shredding/assembly
• Algorithm borrowed from Google Dremel's column IO
• Each cell is encoded as a triplet: repetition level, definition level, value.
• Level values are bounded by the depth of the schema: stored in a compact form.

Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
}

Record:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80

Columns:
  Column          Max rep. level  Max def. level
  DocId           0               0
  Links.Backward  1               2
  Links.Forward   1               2

  Column          Value  R  D
  DocId           20     0  0
  Links.Backward  10     0  2
  Links.Backward  30     1  2
  Links.Forward   80     0  2
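A sketch of the shredding step, hard-coded to the Document schema above (not a general-purpose shredder). Each value becomes a (repetition, definition, value) triplet; a missing optional or repeated field is recorded as a null whose definition level says how much of the path was present:

```python
def shred_document(doc):
    # doc: {"DocId": int, "Links": {"Backward": [...], "Forward": [...]}}
    cols = {"DocId": [], "Links.Backward": [], "Links.Forward": []}
    # DocId is required and not repeated: max rep = 0, max def = 0.
    cols["DocId"].append((0, 0, doc["DocId"]))
    links = doc.get("Links")
    for name in ("Backward", "Forward"):
        values = (links or {}).get(name) or []
        if not values:
            # def 0: Links missing entirely; def 1: Links present, list empty.
            cols["Links." + name].append((0, 0 if links is None else 1, None))
        else:
            for i, v in enumerate(values):
                # rep 0 starts a new record; rep 1 continues the repeated list.
                cols["Links." + name].append((0 if i == 0 else 1, 2, v))
    return cols

doc = {"DocId": 20, "Links": {"Backward": [10, 30], "Forward": [80]}}
print(shred_document(doc))
```

Running it on the example record reproduces the triplet table above: DocId (20, R=0, D=0); Backward 10 (R=0, D=2) and 30 (R=1, D=2); Forward 80 (R=0, D=2).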
Repetition level
Schema:
message nestedLists {
  repeated group level1 {
    repeated string level2;
  }
}

Records:
[[a, b, c], [d, e, f, g]]
[[h], [i, j]]

  Value      R  meaning
  level2: a  0  new record
  level2: b  2  new level2 entry
  level2: c  2  new level2 entry
  level2: d  1  new level1 entry
  level2: e  2  new level2 entry
  level2: f  2  new level2 entry
  level2: g  2  new level2 entry
  level2: h  0  new record
  level2: i  1  new level1 entry
  level2: j  2  new level2 entry

Columns:
Level: 0,2,2,1,2,2,2,0,1,2
Data: a,b,c,d,e,f,g,h,i,j

more details: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
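The repetition-level column for the nestedLists schema can be computed with a few nested loops (this sketch ignores empty lists, which additionally need definition levels). The level says which enclosing repeated field is being continued: 0 starts a new record, 1 a new level1 entry, 2 a new level2 entry:

```python
def repetition_levels(records):
    # records: list of level1 lists, each a list of level2 string values
    levels, data = [], []
    for record in records:
        for i1, level1 in enumerate(record):
            for i2, value in enumerate(level1):
                if i1 == 0 and i2 == 0:
                    levels.append(0)  # first value of a new record
                elif i2 == 0:
                    levels.append(1)  # new level1 entry
                else:
                    levels.append(2)  # new level2 entry
                data.append(value)
    return levels, data

records = [[["a", "b", "c"], ["d", "e", "f", "g"]], [["h"], ["i", "j"]]]
levels, data = repetition_levels(records)
print(levels)  # [0, 2, 2, 1, 2, 2, 2, 0, 1, 2]
```

This reproduces the Level column shown above for the two example records.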
Differences of Parquet and ORC nesting support
• Parquet:
  Repetition/definition levels capture the structure.
  => one column per leaf in the schema. Array<int> is one column.
  Nullity/repetition of an inner node is stored in each of its children
  => one column independently of nesting, with some redundancy.
• ORC:
  An extra column for each Map or List to record their size.
  => one column per node in the schema. Array<int> is two columns: array size and content.
  => An extra column per nesting level.
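The two encodings of an Array<int> column can be contrasted in a few lines. This is a simplified sketch (empty lists, which Parquet handles with definition levels, are omitted): the ORC-style version materializes a separate lengths column, while the Parquet-style version stores repetition levels alongside the values in a single leaf column:

```python
def orc_style(lists):
    # one lengths column plus one content column
    return {"lengths": [len(l) for l in lists],
            "data": [v for l in lists for v in l]}

def parquet_style(lists):
    # repetition levels interleaved with the values of a single leaf column
    levels, data = [], []
    for l in lists:
        for i, v in enumerate(l):
            levels.append(0 if i == 0 else 1)  # 0 starts a new list
            data.append(v)
    return {"rep_levels": levels, "data": data}

lists = [[1, 2, 3], [4, 5]]
print(orc_style(lists))      # {'lengths': [3, 2], 'data': [1, 2, 3, 4, 5]}
print(parquet_style(lists))  # {'rep_levels': [0, 1, 1, 0, 1], 'data': [1, 2, 3, 4, 5]}
```
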
Reading assembled records
• Record-level API to integrate with existing row-based engines (Hive, Pig, M/R).
• Aware of dictionary encoding: enables optimizations.
• Assembles projection for any subset of the columns: only those are loaded from disk.

(Diagrams: columns a and b assembled back into rows a1/b1, a2/b2, a3/b3; the Document record projected onto subsets of its columns)
Projection push down
• Automated in Pig and Hive:
  Based on the query being executed, only the columns for the fields accessed will be fetched.
• Explicit in MapReduce, Scalding and Cascading using a globbing syntax.

Example: field1;field2/**;field4/{subfield1,subfield2}
Will return:
  field1
  all the columns under field2
  subfield1 and subfield2 under field4
  but not field3
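A hypothetical re-implementation of this globbing syntax (the real matcher lives in parquet-mr, in Java): `;` separates patterns, `**` matches any subtree, and `{a,b}` matches alternatives:

```python
import re

def glob_to_regex(pattern):
    # translate one projection pattern into an anchored regex
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")  # whole subtree
            i += 2
        elif pattern[i] == "{":
            j = pattern.index("}", i)
            out.append("(" + "|".join(pattern[i + 1:j].split(",")) + ")")
            i = j + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + "$")

def select_columns(glob, columns):
    regexes = [glob_to_regex(p) for p in glob.split(";")]
    return [c for c in columns if any(r.match(c) for r in regexes)]

columns = ["field1", "field2/a", "field2/b/c", "field3",
           "field4/subfield1", "field4/subfield2", "field4/subfield3"]
print(select_columns("field1;field2/**;field4/{subfield1,subfield2}", columns))
# ['field1', 'field2/a', 'field2/b/c', 'field4/subfield1', 'field4/subfield2']
```
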
Reading columns
To implement a column-based execution engine:
• Iteration on triplets: repetition level, definition level, value.
  • R=1 => same row
  • D<1 => null
• Repetition level = 0 indicates a new record.
• Encoded or decoded values: computing aggregations on integers is faster than on strings.

(Diagram: a column of R/D/V triplets mapped onto rows 0-3, values A, B, C, D)
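The triplet iteration can be sketched for a toy column with max definition level 1, following the rules above: R = 0 starts a new record, R = 1 continues the current one, and D below the max means the value is null (the triplet values here are made up for illustration):

```python
def read_rows(triplets, max_def=1):
    # triplets: iterable of (repetition_level, definition_level, value)
    rows = []
    for r, d, v in triplets:
        if r == 0:
            rows.append([])  # repetition level 0 starts a new record
        # definition level below the max means the value is null
        rows[-1].append(v if d == max_def else None)
    return rows

# hypothetical triplet stream: three records, the second one null
triplets = [(0, 1, "A"), (1, 1, "B"), (0, 0, None), (0, 1, "C")]
print(read_rows(triplets))  # [['A', 'B'], [None], ['C']]
```

A vectorized engine would consume such triplets directly, ideally on the still-encoded (integer) values, rather than assembling full records first.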
Integration APIs
• Schema definition and record materialization:
  • Hadoop does not have a notion of schema; however Impala, Pig, Hive, Thrift, Avro and ProtocolBuffers do.
  • Event-based, SAX-style record materialization layer. No double conversion.
• Integration with existing type systems and processing frameworks:
  • Impala
  • Pig
  • Thrift and Scrooge for M/R, Cascading and Scalding
  • Cascading tuples
  • Avro
  • Hive
  • Spark
Encodings
• Bit packing:
  1 3 2 0 0 2 2 0  =>  01|11|10|00 00|10|10|00
  • Small integers encoded in the minimum bits required
  • Useful for repetition levels, definition levels and dictionary keys
• Run Length Encoding:
  1 1 1 1 1 1 1 1  =>  8 x 1
  • Cheap compression
  • Used in combination with bit packing
  • Works well for definition levels of sparse columns
• Dictionary encoding:
  • Useful for columns with few (< 50,000) distinct values
  • When applicable, compresses better and faster than heavyweight algorithms (gzip, lzo, snappy)
• Extensible: defining new encodings is supported by the format
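The three encodings can be sketched in a few lines each. Real Parquet packs bits in little-endian groups of 8 values and interleaves RLE runs with bit-packed runs in a single hybrid encoding; these stand-alone versions just illustrate the ideas:

```python
def bit_pack(values, width):
    # concatenate each value's `width` bits (MSB first here), pad to whole bytes
    bits = "".join(format(v, "0{}b".format(width)) for v in values)
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def run_length_encode(values):
    # (count, value) runs; cheap when the same level repeats many times
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1
        else:
            runs.append([1, v])
    return [tuple(r) for r in runs]

def dictionary_encode(values):
    # replace each value by a small integer key into a per-column dictionary;
    # the keys themselves can then be bit packed
    dictionary, index, keys = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        keys.append(index[v])
    return dictionary, keys

print(bit_pack([1, 3, 2, 0, 0, 2, 2, 0], 2))       # two bytes: 01111000 00101000
print(run_length_encode([1] * 8))                   # [(8, 1)]
print(dictionary_encode(["US", "FR", "US", "US"]))  # (['US', 'FR'], [0, 1, 0, 0])
```

Note how the bit-packing output matches the 2-bit example on the slide, and how dictionary keys are exactly the kind of small integers that bit packing and RLE compress well.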
Parquet 2.0
• More encodings: compact storage without heavyweight compression
  • Delta encodings: for integers, strings and sorted dictionaries.
  • Improved encodings for strings and booleans.
• Statistics: to be used by query planners and predicate pushdown.
• New page format: to facilitate skipping ahead at a more granular level.
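The core idea behind the delta encodings for integers can be sketched as follows: store a first value plus differences, which stay small (and therefore bit pack well) for sorted or slowly-varying data. Real Parquet uses a block-based delta binary packing scheme; this shows only the principle:

```python
def delta_encode(ints):
    # first value plus successive differences
    if not ints:
        return None, []
    return ints[0], [b - a for a, b in zip(ints, ints[1:])]

def delta_decode(first, deltas):
    # running sum reconstructs the original sequence
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

first, deltas = delta_encode([100, 101, 103, 106, 110])
print(deltas)                       # [1, 2, 3, 4]
print(delta_decode(first, deltas))  # [100, 101, 103, 106, 110]
```
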
Main contributors
Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings
Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala
Jonathan Coveney, Alex Levenson, Aniket Mokashi, Tianshuo Deng (Twitter): Encodings, projection push down
Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration
Tom White (Cloudera): Avro integration
Avi Bryant, Colin Marc (Stripe): Cascading tuples integration
Matt Massie (Berkeley AMP lab): predicate and projection push down
David Chen (LinkedIn): Avro integration improvements
How to contribute
Questions? Ideas? Want to contribute?
Contribute at: github.com/Parquet
Come talk to us.
Cloudera
Criteo
Twitter
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet Arrow
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
If you have your own Columnar format, stop now and use Parquet 😛
If you have your own Columnar format,  stop now and use Parquet  😛If you have your own Columnar format,  stop now and use Parquet  😛
If you have your own Columnar format, stop now and use Parquet 😛
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languages
 

Parquet Strata/Hadoop World, New York 2013

  • 7. Collaboration between Twitter and Cloudera:
    • Common file format definition: language independent, formally specified.
    • Implementation in Java for Map/Reduce: https://github.com/Parquet/parquet-mr
    • C++ and code generation in Cloudera Impala: https://github.com/cloudera/impala
    http://parquet.io
  • 8. Results in Impala
    TPC-H lineitem table @ 1 TB scale factor (chart: size in GB per format).
    http://parquet.io
  • 9. Impala query times on TPC-DS
    (chart: seconds, wall clock, for Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79 and Q96, comparing Text, Seq w/ Snappy, RC w/ Snappy and Parquet w/ Snappy)
    http://parquet.io
  • 10. Criteo: The Context
    • Billions of new events per day
    • ~60 columns per log
    • Heavy analytic workload
    • BI analysts using Hive and RCFile
    • Frequent schema modifications
    • Perfect use case for Parquet + Hive!
    http://parquet.io
  • 11. Parquet + Hive: Basic Requirements
    • MapReduce compatibility due to Hive.
    • Correctly handle evolving schemas across Parquet files.
    • Read only the columns used by the query to minimize data read.
    • Interoperability with other execution engines (e.g. Pig, Impala, etc.)
    http://parquet.io
  • 12. Performance of Hive 0.11 with Parquet vs. ORC
    TPC-DS scale factor 100; all jobs calibrated to run ~50 mappers; nodes: 2 x 6 cores, 96 GB RAM, 14 x 3 TB disk.
    Size relative to text: orc-snappy 35%, parquet-snappy 33%.
    (chart: total CPU seconds per query, q7–q98, orc-snappy vs. parquet-snappy)
    http://parquet.io
  • 13. Twitter: production results
    Data converted: similar to access logs, 30 columns.
    Original format: Thrift binary in block-compressed files (LZO). New format: Parquet (LZO).
    • Space saving: 30% using the same compression algorithm.
    • Scan + assembly time compared to original: one column 10%, all columns 110%.
    http://parquet.io
  • 14. Production savings at Twitter
    • Petabytes of storage saved.
    • Example jobs taking advantage of projection push down:
    • Job 1 (Pig): reading 32% less data => 20% task time saving.
    • Job 2 (Scalding): reading 14 out of 35 columns, i.e. 80% less data => 66% task time saving.
    • Terabytes of scanning saved every day.
    http://parquet.io
  • 15. Format
    • Row group: a group of rows in columnar format.
    • Max size buffered in memory while writing.
    • One (or more) per split while reading.
    • Roughly: 50 MB < row group < 1 GB.
    • Column chunk: the data for one column in a row group. Column chunks can be read independently for efficient scans.
    • Page: unit of access in a column chunk.
    • Should be big enough for compression to be efficient.
    • Minimum size to read to access a single record (when index pages are available).
    • Roughly: 8 KB < page < 1 MB.
    http://parquet.io
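The file → row group → column chunk → page hierarchy can be modeled with a tiny sketch. Everything here is invented for illustration (`plan_pages`, the fixed per-value size, the chosen page target inside the slide's rough bounds); real writers size pages by encoded bytes, not value counts.

```python
def plan_pages(column_values, value_size, page_target=64 * 1024):
    """Split one column chunk's values into pages of ~page_target bytes,
    assuming (unrealistically) a fixed encoded size per value.
    page_target is an illustrative pick inside the slide's bounds
    (8 KB < page < 1 MB)."""
    per_page = max(1, page_target // value_size)
    return [column_values[i:i + per_page]
            for i in range(0, len(column_values), per_page)]

# 100 values of 8 bytes each, toy 64-byte pages -> 8 values per page
pages = plan_pages(list(range(100)), value_size=8, page_target=64)
print(len(pages))  # 13
```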
  • 16. Format
    • Layout: row groups in columnar format. A footer contains column chunk offsets and the schema.
    • Language independent: well-defined format. Hadoop and Cloudera Impala support.
    http://parquet.io
  • 17. Nested record shredding/assembly
    • Algorithm borrowed from Google Dremel's column IO.
    • Each cell is encoded as a triplet: repetition level, definition level, value.
    • Level values are bound by the depth of the schema: stored in a compact form.
    Schema:
    message Document {
      required int64 DocId;
      optional group Links {
        repeated int64 Backward;
        repeated int64 Forward;
      }
    }
    Max levels (repetition, definition): DocId (0, 0); Links.Backward (1, 2); Links.Forward (1, 2).
    Record:
    DocId: 20
    Links
      Backward: 10
      Backward: 30
      Forward: 80
    Columns (value, R, D):
    DocId: (20, 0, 0)
    Links.Backward: (10, 0, 2), (30, 1, 2)
    Links.Forward: (80, 0, 2)
    http://parquet.io
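The Backward/Forward columns above can be reproduced with a small hand-rolled shredder. This is a sketch hard-wired to this one schema shape (a repeated leaf inside an optional group, so max definition level 2); the function name and the dict-based record encoding are invented here, not Parquet API.

```python
def shred_repeated_leaf(records, group, field):
    """Shred one repeated leaf (e.g. Links.Backward) into
    (repetition, definition, value) triplets.
    Definition level: 0 = group missing, 1 = group present but list
    empty, 2 = value present. Repetition level: 0 = first value in
    the record, 1 = repeated value in the same record."""
    triplets = []
    for rec in records:
        g = rec.get(group)
        if g is None:
            triplets.append((0, 0, None))   # Links missing entirely
            continue
        values = g.get(field, [])
        if not values:
            triplets.append((0, 1, None))   # Links there, list empty
            continue
        for i, v in enumerate(values):
            triplets.append((0 if i == 0 else 1, 2, v))
    return triplets

doc = {"DocId": 20, "Links": {"Backward": [10, 30], "Forward": [80]}}
print(shred_repeated_leaf([doc], "Links", "Backward"))  # [(0, 2, 10), (1, 2, 30)]
print(shred_repeated_leaf([doc], "Links", "Forward"))   # [(0, 2, 80)]
```

The output matches the column table on the slide.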
  • 18. Repetition level
    Schema:
    message nestedLists {
      repeated group level1 {
        repeated string level2;
      }
    }
    Records: [[a, b, c], [d, e, f, g]] and [[h], [i, j]]
    Repetition level meaning: 0 = new record, 1 = new level1 entry, 2 = new level2 entry.
    Columns:
    Level: 0,2,2,1,2,2,2,0,1,2
    Data: a,b,c,d,e,f,g,h,i,j
    More details: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
    http://parquet.io
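The level column above can be reproduced in a few lines. A sketch with invented names; it tracks only repetition levels, so it assumes no record or level1 list is empty (handling those would require definition levels as well).

```python
def shred_nested_lists(records):
    """Flatten list<list<str>> records into a repetition-level column
    and a data column. 0 = new record, 1 = new level1 entry,
    2 = new level2 entry within the current level1 list."""
    levels, data = [], []
    for record in records:
        for i, level1 in enumerate(record):
            for j, value in enumerate(level1):
                if i == 0 and j == 0:
                    levels.append(0)   # first value of a new record
                elif j == 0:
                    levels.append(1)   # first value of a new level1 entry
                else:
                    levels.append(2)   # continuing the current level2 list
                data.append(value)
    return levels, data

records = [[["a", "b", "c"], ["d", "e", "f", "g"]],
           [["h"], ["i", "j"]]]
levels, data = shred_nested_lists(records)
print(levels)  # [0, 2, 2, 1, 2, 2, 2, 0, 1, 2]
print(data)    # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```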
  • 19. Differences between Parquet and ORC: nesting support
    • Parquet: repetition/definition levels capture the structure => one column per leaf in the schema. Array<int> is one column. Nullity/repetition of an inner node is stored in each of its children => one column independently of nesting, with some redundancy.
    • ORC: an extra column for each Map or List to record their size => one column per node in the schema. Array<int> is two columns: array size and content => an extra column per nesting level.
    http://parquet.io
  • 20. Reading assembled records
    • Record level API to integrate with existing row based engines (Hive, Pig, M/R).
    • Aware of dictionary encoding: enables optimizations.
    • Assembles projection for any subset of the columns: only those are loaded from disk.
    http://parquet.io
  • 21. Projection push down
    • Automated in Pig and Hive: based on the query being executed, only the columns for the fields accessed will be fetched.
    • Explicit in MapReduce, Scalding and Cascading using a globbing syntax.
    Example: field1;field2/**;field4/{subfield1,subfield2}
    Will return: field1, all the columns under field2, subfield1 and subfield2 under field4, but not field3.
    http://parquet.io
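The globbing syntax can be illustrated with a toy matcher. This is a sketch of the syntax shown on the slide, not the actual parquet-mr parser, and it assumes field names contain no regex metacharacters.

```python
import re

def compile_projection(spec):
    """Compile a ';'-separated projection glob ('/' = path separator,
    '/**' = whole subtree, '{a,b}' = alternatives) into one regex
    over column paths like 'field4/subfield1'."""
    alts = []
    for pattern in spec.split(";"):
        p = pattern.replace("/**", "\x00")            # protect subtree marker
        p = re.sub(r"\{([^}]*)\}",                     # {a,b} -> (a|b)
                   lambda m: "(" + m.group(1).replace(",", "|") + ")", p)
        p = p.replace("\x00", "(/.*)?")                # subtree, root included
        alts.append(p)
    return re.compile("^(?:" + "|".join(alts) + ")$")

matcher = compile_projection("field1;field2/**;field4/{subfield1,subfield2}")
for path in ["field1", "field2/x/y", "field4/subfield1", "field4/subfield3"]:
    print(path, bool(matcher.match(path)))
```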
  • 22. Reading columns
    To implement a column based execution engine:
    • Iteration on triplets: repetition level, definition level, value.
    • Repetition level = 0 indicates a new record; R = 1 => same row; D below the max definition level => null.
    • Encoded or decoded values: computing aggregations on integers is faster than on strings.
    http://parquet.io
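Going the other way, triplets can be turned back into per-record values. A sketch with invented names, reusing the Links.Backward column shape from slide 17 (max definition level 2): R == 0 starts a new record, and D below the max marks a missing or empty list.

```python
def assemble(triplets, max_def=2):
    """Rebuild per-record value lists from (R, D, value) triplets.
    R == 0 starts a new record; only cells with D == max_def carry
    an actual value (lower D means null at some nesting level)."""
    records = []
    for r, d, v in triplets:
        if r == 0:
            records.append([])          # new record begins
        if d == max_def:
            records[-1].append(v)       # defined value, keep it
    return records

triplets = [(0, 2, 10), (1, 2, 30),     # record 1: Backward = [10, 30]
            (0, 0, None),               # record 2: Links missing
            (0, 1, None),               # record 3: Backward empty
            (0, 2, 80)]                 # record 4: Backward = [80]
print(assemble(triplets))  # [[10, 30], [], [], [80]]
```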
  • 23. Integration APIs
    • Schema definition and record materialization:
    • Hadoop does not have a notion of schema; however Impala, Pig, Hive, Thrift, Avro and Protocol Buffers do.
    • Event-based SAX-style record materialization layer. No double conversion.
    • Integration with existing type systems and processing frameworks: Impala, Pig, Thrift and Scrooge for M/R, Cascading and Scalding, Cascading tuples, Avro, Hive, Spark.
    http://parquet.io
  • 24. Encodings
    • Bit packing: small integers encoded in the minimum bits required, e.g. 1,3,2,0,0,2,2,0 => 01|11|10|00 00|10|10|00. Useful for repetition levels, definition levels and dictionary keys.
    • Run length encoding: used in combination with bit packing; cheap compression, e.g. a run of eight 1s => 8,1. Works well for definition levels of sparse columns.
    • Dictionary encoding: useful for columns with few (< 50,000) distinct values. When applicable, compresses better and faster than heavyweight algorithms (gzip, lzo, snappy).
    • Extensible: defining new encodings is supported by the format.
    http://parquet.io
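The first two encodings are easy to sketch. Illustrative only: this version packs bits MSB-first so the bytes read like the slide's 01|11|10|00 notation, whereas the real format packs values little-endian within each byte and interleaves RLE runs with bit-packed runs in a hybrid layout.

```python
def bit_pack(values, width):
    """Pack small ints into bytes, `width` bits each, MSB-first
    (matches the slide's notation, not the on-disk bit order)."""
    bits = "".join(format(v, "0{}b".format(width)) for v in values)
    bits = bits.ljust(-(-len(bits) // 8) * 8, "0")      # pad to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def rle(values):
    """Run-length encode a sequence into (run length, value) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1        # extend the current run
        else:
            runs.append([1, v])     # start a new run
    return [tuple(r) for r in runs]

print(bit_pack([1, 3, 2, 0, 0, 2, 2, 0], 2).hex())  # '7828' = 01111000 00101000
print(rle([1] * 8))                                  # [(8, 1)]
```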
  • 25. Parquet 2.0
    • More encodings: compact storage without heavyweight compression.
    • Delta encodings: for integers, strings and sorted dictionaries.
    • Improved encodings for strings and booleans.
    • Statistics: to be used by query planners and predicate pushdown.
    • New page format: to facilitate skipping ahead at a more granular level.
    http://parquet.io
  • 26. Main contributors
    • Julien Le Dem (Twitter): format, core, Pig, Thrift integration, encodings
    • Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): format, Impala
    • Jonathan Coveney, Alex Levenson, Aniket Mokashi, Tianshuo Deng (Twitter): encodings, projection push down
    • Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
    • Dmitriy Ryaboy (Twitter): format, Thrift and Scrooge Cascading integration
    • Tom White (Cloudera): Avro integration
    • Avi Bryant, Colin Marc (Stripe): Cascading tuples integration
    • Matt Massie (Berkeley AMP Lab): predicate and projection push down
    • David Chen (LinkedIn): Avro integration improvements
    http://parquet.io
  • 27. How to contribute
    Questions? Ideas? Want to contribute?
    Contribute at: github.com/Parquet
    Come talk to us: Cloudera, Criteo, Twitter.
    http://parquet.io