SlideShare uma empresa Scribd logo
1 de 25
Baixar para ler offline
Parquet
Columnar storage for the people
Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter
Nong Li nong@cloudera.com Software engineer, Cloudera Impala
http://parquet.io
1
• Context from various companies
• Early results
• Format deep-dive
2
Outline
http://parquet.io
3
Twitter Context
• Twitter’s data
• 200M+ monthly active users generating and consuming 400M+ tweets a day.
• 100TB+ a day of compressed data
• Scale is huge: Instrumentation, User graph, Derived data, ...
• Analytics infrastructure:
• Several 1K+ node Hadoop clusters
• Log collection pipeline
• Processing tools
http://parquet.io
The Parquet Planers
Gustave Caillebotte
• Logs available on HDFS
• Thrift to store logs
• example: one schema has 87 columns, up to 7 levels of nesting.
4
Twitter’s use case
struct LogEvent {
1: optional logbase.LogBase log_base
2: optional i64 event_value
3: optional string context
4: optional string referring_event
...
18: optional EventNamespace event_namespace
19: optional list<Item> items
20: optional map<AssociationType,Association> associations
21: optional MobileDetails mobile_details
22: optional WidgetDetails widget_details
23: optional map<ExternalService,string> external_ids
}
struct LogBase {
1: string transaction_id,
2: string ip_address,
...
15: optional string country,
16: optional string pid,
}
http://parquet.io
5
Goal
To have a state of the art columnar storage available across the
Hadoop platform
• Hadoop is very reliable for big long running queries but also IO heavy.
• Incrementally take advantage of column based storage in existing framework.
• Not tied to any framework in particular
http://parquet.io
• Limits the IO to only the data that is needed.
• Saves space: columnar layout compresses better
• Enables better scans: load only the columns that need to be accessed
• Enables vectorized execution engines.
6
Columnar Storage
http://parquet.io
Collaboration between Twitter and Cloudera:
• Common file format definition:
• Language independent
• Formally specified.
• Implementation in Java for Map/Reduce:
• https://github.com/Parquet/parquet-mr
• C++ and code generation in Cloudera Impala:
• https://github.com/cloudera/impala
7http://parquet.io
Results in Impala TPC-H lineitem table @ 1TB scale factor
8http://parquet.io
GB
Impala query times on TPC-H lineitem table
9http://parquet.io
Seconds GB
Criteo: The Context
• ~20B Events per day
• ~60 Columns per Log
• Heavy analytic workload
• BI Analysts using Hive and RCFile
• Frequent Schema Modifications
==
• Perfect use case for Parquet + Hive !
10http://parquet.io
Parquet + Hive: Basic Reqs
• MapRed Compatibility due to Hive
• Correctly Handle Different Schemas in Parquet Files
• Read Only The Columns Used by Query
• Interoperability with Other Execution Engines (eg Pig, Impala, etc.)
• Optimize Amount of Data Read by each Mapper
11http://parquet.io
Parquet + Hive: Early User Experience
12
Relative Performance of Hive+Parquet vs Orc and RCFile:
http://parquet.io
LZO compressed
Twitter: Initial results
Space saving: 28% using the same compression algorithm
Scan + assembly time compared to original:
One column: 10%
All columns: 114% 13
Data converted: similar to access logs. 30 columns.
Original format: Thrift binary in block compressed files
http://parquet.io
0%
25.00%
50.00%
75.00%
100.00%
Space
Space
Thrift Parquet
0%
20.0%
40.0%
60.0%
80.0%
100.0%
120.0%
1 30
Scan time
columns
Thrift Parquet
Additional gains with dictionary encoding
14http://parquet.io
13 out of the 30 columns are suitable for dictionary encoding:
they represent 27% of raw data but only 3% of compressed data
Space saving: another 52% using the same compression algorithm (on
top of the original columnar storage gains)
Scan + assembly time compared to plain Parquet:
All 13 columns: 48% (of the already faster columnar scan)
0%
25.00%
50.00%
75.00%
100.00%
Space
Space
Parquet compressed (LZO)
Parquet Dictionary uncompressed
Parquet Dictionary compressed (LZO)
Columns
0%
10.0%
20.0%
30.0%
40.0%
50.0%
60.0%
1 13
Scan time
Parquet compressed
Parquet dictionary compressed
to thrift
Row group
Column a
Page 0
Row group
15
Format
Page 1
Page 2
Column b Column c
Page 0
Page 1
Page 2
Page 0
Page 1
Page 2
• Row group: A group of rows in columnar format.
• Max size buffered in memory while writing.
• One (or more) per split while reading. 
• roughly: 50MB < row group < 1 GB
• Column chunk: The data for one column in a row group.
• Column chunks can be read independently for efficient scans.
• Page: Unit of access in a column chunk.
• Should be big enough for compression to be efficient.
• Minimum size to read to access a single record (when index pages are available).
• roughly: 8KB < page < 1MB
http://parquet.io
16
Format
Layout:
Row groups in columnar
format. A footer contains
column chunks offset and
schema.
Language independent:
Well defined format.
Hadoop and Cloudera
Impala support.
http://parquet.io
17
Nested record shredding/assembly
• Algorithm borrowed from Google Dremel's column IO
• Each cell is encoded as a triplet: repetition level, definition level, value.
• Level values are bound by the depth of the schema: stored in a compact form.
Columns Max rep.
level
Max def. level
DocId 0 0
Links.Backward 1 2
Links.Forward 1 2
Column value R D
DocId 20 0 0
Links.Backward 10 0 2
Links.Backward 30 1 2
Links.Forward 80 0 2
Schema:
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
}
Record:
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Document
DocId Links
Backward Forward
Document
DocId Links
Backward Forward
20
10 30 80
http://parquet.io
Differences of Parquet and ORC Nesting support
Parquet:
• Repetition/Definition levels capture the structure.
=> one column per Leaf in the schema.
• Array<int> is one column.
• Nullity/repetition of an inner node is stored in each of its children
• => One column independently of nesting with some redundancy.
ORC:
• An extra column for each Map or List to record their size.
=> one column per Node in the schema.
• Array<int> is two columns: array size and content.
• => An extra column per nesting level.
18
Document
DocId Links
Backward Forward
http://parquet.io
19
Iteration on fully assembled records
• To integrate with existing row based engines (Hive, Pig, M/R).
• Aware of dictionary encoding: enable optimizations.
• Assembles projection for any subset of the columns: only those are loaded from disc.
http://parquet.io
Document
DocId 20
Document
Links
Backward 10 30
Document
Links
Backward Forward10 30 80
Document
Links
Forward 80
a1
a2
a3
b1
b2
b3
a1
a2
a3
b1
b2
b3
Iteration on columns
• To implement column based execution engine
• Iteration on triplets: repetition level, definition level, value.
• Repetition level = 0 indicates a new record.
• Encoded or decoded values: computing aggregations on integers is faster than on
strings.
20http://parquet.io
D<1 => Null
Row:
0
1
2
3
R=1 => same row
0
0
1
R D V
0
1
1
1
0
A
B
C
0 1 D
APIs
• Schema definition and record materialization:
• Hadoop does not have a notion of schema, however Impala, Pig, Hive, Thrift, Avro, ProtocolBuffers do.
• Event-based SAX-style record materialization layer. No double conversion.
• Integration with existing type systems and processing frameworks:
• Impala
• Pig
• Thrift and Scrooge for M/R, Cascading and Scalding
• Cascading tuples
• Avro
• Hive
21http://parquet.io
Encodings
• Bit packing:
• Small integers encoded in the minimum bits required
• Useful for repetition level, definition levels and dictionary keys
• Run Length Encoding:
• Used in combination with bit packing,
• Cheap compression
• Works well for definition level of sparse columns.
• Dictionary encoding:
• Useful for columns with few ( < 50,000 ) distinct values
• Extensible:
• Defining new encodings is supported by the format
22http://parquet.io
01|11|10|00 00|10|10|001 3 2 0 0 2 2 0
8 11 1 1 1 1 1 1 1
Contributors
Main contributors:
• Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings
• Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala
• Jonathan Coveney, Alex Levenson, Aniket Mokashi (Twitter): Encodings
• Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
• Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration
• Tom White (Cloudera): Avro integration
• Avi Bryant (Stripe): Cascading tuples integration
23http://parquet.io
Future
• Indices for random access (lookup by ID).
• More encodings.
• Extensibility
• Statistics pages (max, min, ...)
24http://parquet.io
25
How to contribute
Questions? Ideas?
Contribute at: github.com/Parquet
Come talk to us:
Twitter booth #26
Cloudera booth #45
http://parquet.io

Mais conteúdo relacionado

Mais procurados

Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet FormatYue Chen
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionDatabricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVROairisData
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseUsing Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseDataWorks Summit
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 

Mais procurados (20)

Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet Format
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseUsing Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 

Destaque

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State DrivesDataStax Academy
 
Gc and-pagescan-attacks-by-linux
Gc and-pagescan-attacks-by-linuxGc and-pagescan-attacks-by-linux
Gc and-pagescan-attacks-by-linuxCuong Tran
 
Everything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterEverything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterAttila Szegedi
 
A21 列指向DB HP Vertica ~その圧倒的な高速検索の謎を解き明かす~ byKeizo Aizawa
A21 列指向DB HP Vertica ~その圧倒的な高速検索の謎を解き明かす~ byKeizo AizawaA21 列指向DB HP Vertica ~その圧倒的な高速検索の謎を解き明かす~ byKeizo Aizawa
A21 列指向DB HP Vertica ~その圧倒的な高速検索の謎を解き明かす~ byKeizo AizawaInsight Technology, Inc.
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to HiveOwen O'Malley
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersOwen O'Malley
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopOwen O'Malley
 
Get More Out of MySQL with TokuDB
Get More Out of MySQL with TokuDBGet More Out of MySQL with TokuDB
Get More Out of MySQL with TokuDBTim Callaghan
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopOwen O'Malley
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionThe DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionDataWorks Summit/Hadoop Summit
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File IntroductionOwen O'Malley
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduceOwen O'Malley
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroOwen O'Malley
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 

Destaque (20)

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State Drives
 
Gc and-pagescan-attacks-by-linux
Gc and-pagescan-attacks-by-linuxGc and-pagescan-attacks-by-linux
Gc and-pagescan-attacks-by-linux
 
Everything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @TwitterEverything I Ever Learned About JVM Performance Tuning @Twitter
Everything I Ever Learned About JVM Performance Tuning @Twitter
 
A21 列指向DB HP Vertica ~その圧倒的な高速検索の謎を解き明かす~ byKeizo Aizawa
A21 列指向DB HP Vertica ~その圧倒的な高速検索の謎を解き明かす~ byKeizo AizawaA21 列指向DB HP Vertica ~その圧倒的な高速検索の謎を解き明かす~ byKeizo Aizawa
A21 列指向DB HP Vertica ~その圧倒的な高速検索の謎を解き明かす~ byKeizo Aizawa
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to Hive
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop Clusters
 
Data protection2015
Data protection2015Data protection2015
Data protection2015
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache Hadoop
 
Get More Out of MySQL with TokuDB
Get More Out of MySQL with TokuDBGet More Out of MySQL with TokuDB
Get More Out of MySQL with TokuDB
 
ORC 2015
ORC 2015ORC 2015
ORC 2015
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in Hadoop
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionThe DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to Production
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File Introduction
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 Intro
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache FlinkThe Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
ORC Files
ORC FilesORC Files
ORC Files
 

Semelhante a Parquet Hadoop Summit 2013

Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open houseJulien Le Dem
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquetNAVER D2
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackDataWorks Summit/Hadoop Summit
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
 
ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用LINE Corporation
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Redis Streams - Fiverr Tech5 meetup
Redis Streams - Fiverr Tech5 meetupRedis Streams - Fiverr Tech5 meetup
Redis Streams - Fiverr Tech5 meetupItamar Haber
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibabahbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at AlibabaMichael Stack
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
From Device to Data Center to Insights: Architectural Considerations for the ...
From Device to Data Center to Insights: Architectural Considerations for the ...From Device to Data Center to Insights: Architectural Considerations for the ...
From Device to Data Center to Insights: Architectural Considerations for the ...P. Taylor Goetz
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application Apache Apex
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 

Semelhante a Parquet Hadoop Summit 2013 (20)

Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquet
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用ログ収集プラットフォーム開発におけるElasticsearchの運用
ログ収集プラットフォーム開発におけるElasticsearchの運用
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Redis Streams - Fiverr Tech5 meetup
Redis Streams - Fiverr Tech5 meetupRedis Streams - Fiverr Tech5 meetup
Redis Streams - Fiverr Tech5 meetup
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibabahbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
From Device to Data Center to Insights: Architectural Considerations for the ...
From Device to Data Center to Insights: Architectural Considerations for the ...From Device to Data Center to Insights: Architectural Considerations for the ...
From Device to Data Center to Insights: Architectural Considerations for the ...
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 

Mais de Julien Le Dem

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & MarquezJulien Le Dem
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Julien Le Dem
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseJulien Le Dem
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed databaseJulien Le Dem
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapJulien Le Dem
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowJulien Le Dem
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowJulien Le Dem
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowJulien Le Dem
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Julien Le Dem
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
If you have your own Columnar format, stop now and use Parquet 😛
If you have your own Columnar format,  stop now and use Parquet  😛If you have your own Columnar format,  stop now and use Parquet  😛
If you have your own Columnar format, stop now and use Parquet 😛Julien Le Dem
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Julien Le Dem
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesJulien Le Dem
 
Embedding Pig in scripting languages
Embedding Pig in scripting languagesEmbedding Pig in scripting languages
Embedding Pig in scripting languagesJulien Le Dem
 

Mais de Julien Le Dem (19)

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet Arrow
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
If you have your own Columnar format, stop now and use Parquet 😛
If you have your own Columnar format,  stop now and use Parquet  😛If you have your own Columnar format,  stop now and use Parquet  😛
If you have your own Columnar format, stop now and use Parquet 😛
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languages
 
Embedding Pig in scripting languages
Embedding Pig in scripting languagesEmbedding Pig in scripting languages
Embedding Pig in scripting languages
 

Último

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Último (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Parquet Hadoop Summit 2013

  • 1. Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li nong@cloudera.com Software engineer, Cloudera Impala http://parquet.io 1
  • 2. • Context from various companies • Early results • Format deep-dive 2 Outline http://parquet.io
  • 3. 3 Twitter Context • Twitter’s data • 200M+ monthly active users generating and consuming 400M+ tweets a day. • 100TB+ a day of compressed data • Scale is huge: Instrumentation, User graph, Derived data, ... • Analytics infrastructure: • Several 1K+ node Hadoop clusters • Log collection pipeline • Processing tools http://parquet.io The Parquet Planers Gustave Caillebotte
  • 4. • Logs available on HDFS • Thrift to store logs • example: one schema has 87 columns, up to 7 levels of nesting. 4 Twitter’s use case struct LogEvent { 1: optional logbase.LogBase log_base 2: optional i64 event_value 3: optional string context 4: optional string referring_event ... 18: optional EventNamespace event_namespace 19: optional list<Item> items 20: optional map<AssociationType,Association> associations 21: optional MobileDetails mobile_details 22: optional WidgetDetails widget_details 23: optional map<ExternalService,string> external_ids } struct LogBase { 1: string transaction_id, 2: string ip_address, ... 15: optional string country, 16: optional string pid, } http://parquet.io
  • 5. 5 Goal To have a state of the art columnar storage available across the Hadoop platform • Hadoop is very reliable for big long running queries but also IO heavy. • Incrementally take advantage of column based storage in existing framework. • Not tied to any framework in particular http://parquet.io
  • 6. • Limits the IO to only the data that is needed. • Saves space: columnar layout compresses better • Enables better scans: load only the columns that need to be accessed • Enables vectorized execution engines. 6 Columnar Storage http://parquet.io
  • 7. Collaboration between Twitter and Cloudera: • Common file format definition: • Language independent • Formally specified. • Implementation in Java for Map/Reduce: • https://github.com/Parquet/parquet-mr • C++ and code generation in Cloudera Impala: • https://github.com/cloudera/impala 7http://parquet.io
  • 8. Results in Impala TPC-H lineitem table @ 1TB scale factor 8http://parquet.io GB
  • 9. Impala query times on TPC-H lineitem table 9http://parquet.io Seconds GB
  • 10. Criteo: The Context • ~20B Events per day • ~60 Columns per Log • Heavy analytic workload • BI Analysts using Hive and RCFile • Frequent Schema Modifications == • Perfect use case for Parquet + Hive ! 10http://parquet.io
  • 11. Parquet + Hive: Basic Reqs • MapRed Compatibility due to Hive • Correctly Handle Different Schemas in Parquet Files • Read Only The Columns Used by Query • Interoperability with Other Execution Engines (eg Pig, Impala, etc.) • Optimize Amount of Data Read by each Mapper 11http://parquet.io
  • 12. Parquet + Hive: Early User Experience 12 Relative Performance of Hive+Parquet vs Orc and RCFile: http://parquet.io LZO compressed
  • 13. Twitter: Initial results Space saving: 28% using the same compression algorithm Scan + assembly time compared to original: One column: 10% All columns: 114% 13 Data converted: similar to access logs. 30 columns. Original format: Thrift binary in block compressed files http://parquet.io 0% 25.00% 50.00% 75.00% 100.00% Space Space Thrift Parquet 0% 20.0% 40.0% 60.0% 80.0% 100.0% 120.0% 1 30 Scan time columns Thrift Parquet
  • 14. Additional gains with dictionary encoding 14http://parquet.io 13 out of the 30 columns are suitable for dictionary encoding: they represent 27% of raw data but only 3% of compressed data Space saving: another 52% using the same compression algorithm (on top of the original columnar storage gains) Scan + assembly time compared to plain Parquet: All 13 columns: 48% (of the already faster columnar scan) 0% 25.00% 50.00% 75.00% 100.00% Space Space Parquet compressed (LZO) Parquet Dictionary uncompressed Parquet Dictionary compressed (LZO) Columns 0% 10.0% 20.0% 30.0% 40.0% 50.0% 60.0% 1 13 Scan time Parquet compressed Parquet dictionary compressed to thrift
  • 15. Row group Column a Page 0 Row group 15 Format Page 1 Page 2 Column b Column c Page 0 Page 1 Page 2 Page 0 Page 1 Page 2 • Row group: A group of rows in columnar format. • Max size buffered in memory while writing. • One (or more) per split while reading.  • roughly: 50MB < row group < 1 GB • Column chunk: The data for one column in a row group. • Column chunks can be read independently for efficient scans. • Page: Unit of access in a column chunk. • Should be big enough for compression to be efficient. • Minimum size to read to access a single record (when index pages are available). • roughly: 8KB < page < 1MB http://parquet.io
  • 16. 16 Format Layout: Row groups in columnar format. A footer contains column chunks offset and schema. Language independent: Well defined format. Hadoop and Cloudera Impala support. http://parquet.io
  • 17. 17 Nested record shredding/assembly • Algorithm borrowed from Google Dremel's column IO • Each cell is encoded as a triplet: repetition level, definition level, value. • Level values are bound by the depth of the schema: stored in a compact form. Columns Max rep. level Max def. level DocId 0 0 Links.Backward 1 2 Links.Forward 1 2 Column value R D DocId 20 0 0 Links.Backward 10 0 2 Links.Backward 30 1 2 Links.Forward 80 0 2 Schema: message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } } Record: DocId: 20 Links Backward: 10 Backward: 30 Forward: 80 Document DocId Links Backward Forward Document DocId Links Backward Forward 20 10 30 80 http://parquet.io
  • 18. Differences of Parquet and ORC Nesting support Parquet: • Repetition/Definition levels capture the structure. => one column per Leaf in the schema. • Array<int> is one column. • Nullity/repetition of an inner node is stored in each of its children • => One column independently of nesting with some redundancy. ORC: • An extra column for each Map or List to record their size. => one column per Node in the schema. • Array<int> is two columns: array size and content. • => An extra column per nesting level. 18 Document DocId Links Backward Forward http://parquet.io
  • 19. 19 Iteration on fully assembled records • To integrate with existing row based engines (Hive, Pig, M/R). • Aware of dictionary encoding: enable optimizations. • Assembles projection for any subset of the columns: only those are loaded from disc. http://parquet.io Document DocId 20 Document Links Backward 10 30 Document Links Backward Forward10 30 80 Document Links Forward 80 a1 a2 a3 b1 b2 b3 a1 a2 a3 b1 b2 b3
  • 20. Iteration on columns • To implement column based execution engine • Iteration on triplets: repetition level, definition level, value. • Repetition level = 0 indicates a new record. • Encoded or decoded values: computing aggregations on integers is faster than on strings. 20http://parquet.io D<1 => Null Row: 0 1 2 3 R=1 => same row 0 0 1 R D V 0 1 1 1 0 A B C 0 1 D
  • 21. APIs • Schema definition and record materialization: • Hadoop does not have a notion of schema, however Impala, Pig, Hive, Thrift, Avro, ProtocolBuffers do. • Event-based SAX-style record materialization layer. No double conversion. • Integration with existing type systems and processing frameworks: • Impala • Pig • Thrift and Scrooge for M/R, Cascading and Scalding • Cascading tuples • Avro • Hive 21http://parquet.io
  • 22. Encodings • Bit packing: • Small integers encoded in the minimum bits required • Useful for repetition level, definition levels and dictionary keys • Run Length Encoding: • Used in combination with bit packing, • Cheap compression • Works well for definition level of sparse columns. • Dictionary encoding: • Useful for columns with few ( < 50,000 ) distinct values • Extensible: • Defining new encodings is supported by the format 22http://parquet.io 01|11|10|00 00|10|10|001 3 2 0 0 2 2 0 8 11 1 1 1 1 1 1 1
  • 23. Contributors Main contributors: • Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings • Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala • Jonathan Coveney, Alex Levenson, Aniket Mokashi (Twitter): Encodings • Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration • Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration • Tom White (Cloudera): Avro integration • Avi Bryant (Stripe): Cascading tuples integration 23http://parquet.io
  • 24. Future • Indices for random access (lookup by ID). • More encodings. • Extensibility • Statistics pages (max, min, ...) 24http://parquet.io
  • 25. 25 How to contribute Questions? Ideas? Contribute at: github.com/Parquet Come talk to us: Twitter booth #26 Cloudera booth #45 http://parquet.io