Parquet overview

Julien Le Dem

Parquet overview given to the Apache Drill meetup

Format
Schema definition: specifies the binary representation.

Layout: currently PAX; supports one file per column once Hadoop allows a block placement policy.

Not Java-centric: encodings, compression codecs, etc. are enums, not Java class names, i.e. they are formally defined. Impala reads Parquet files.

Footer: contains the column chunk offsets.
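
A minimal sketch of the two ideas above: codecs and encodings identified by enum values rather than class names, and a footer that records where each column chunk starts. The class names, field names, and numeric IDs below are assumptions made for illustration, not the actual Parquet metadata definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Codec(IntEnum):
    # Illustrative IDs only; the point is that a reader in any language
    # can interpret a small integer, unlike a Java class name.
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2


class Encoding(IntEnum):
    PLAIN = 0


@dataclass
class ColumnChunkMeta:
    path: str        # column path, e.g. "Links.Forward"
    codec: Codec
    encoding: Encoding
    offset: int      # byte offset of the chunk in the file
    length: int      # byte length of the chunk


def build_footer(chunks):
    """chunks: list of (path, codec, encoding, data) written back to back."""
    footer = []
    offset = 0
    for path, codec, encoding, data in chunks:
        footer.append(ColumnChunkMeta(path, codec, encoding, offset, len(data)))
        offset += len(data)
    return footer


def read_chunk(file_bytes, footer, path):
    """Use the footer to jump straight to one column chunk."""
    meta = next(m for m in footer if m.path == path)
    return file_bytes[meta.offset:meta.offset + meta.length]
```

With a footer like this, a reader only needs the offsets to seek to the columns it wants, without scanning the rest of the file.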

Format

• Row group: A group of rows in columnar format.
  • Max size buffered in memory while writing.
  • One (or more) per split while reading.
  • Roughly: 10 MB < row group < 1 GB

• Column chunk: The data for one column in a row group.
  • Column chunks can be read independently for efficient scans.

• Page: Unit of compression in a column chunk.
  • Should be big enough for compression to be efficient.
  • Minimum size to read to access a single record (when index pages are available).
  • Roughly: 8 KB < page < 100 KB
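
The row group / column chunk / page nesting can be sketched as follows. This is a toy model under stated assumptions: sizes are counted in rows and values rather than bytes, rows are plain dicts, and the function names are hypothetical.

```python
def to_row_groups(rows, columns, rows_per_group, values_per_page):
    """Split rows into row groups; inside each group, store one column
    chunk per column, and split each chunk into pages."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        group = {}
        for col in columns:
            values = [row[col] for row in batch]
            # A column chunk is a list of pages; a page is a list of values.
            group[col] = [
                values[i:i + values_per_page]
                for i in range(0, len(values), values_per_page)
            ]
        groups.append(group)
    return groups


def scan_column(groups, col):
    """Read one column across all row groups without touching the others."""
    for group in groups:
        for page in group[col]:
            yield from page


rows = [{"id": i, "name": f"n{i}"} for i in range(10)]
groups = to_row_groups(rows, ["id", "name"], rows_per_group=4, values_per_page=3)
ids = list(scan_column(groups, "id"))
```

Scanning "id" here never touches the "name" chunks, which is the property that makes column chunks efficient to scan independently.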

Dremel’s shredding/assembly
Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward; }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country; }
    optional string Url; }}

Columns:
DocId
Links.Backward
Links.Forward
Name.Language.Code
Name.Language.Country
Name.Url

Reference:
http://research.google.com/pubs/pub36632.html
• Each cell is encoded as a triplet: repetition level, definition level, value.
• This allows reconstructing the nested records.
• Level values are bounded by the depth of the schema, so they can be stored in a compact form.

Example:
Column                  Max repetition level   Max definition level
DocId                   0                      0
Links.Backward          1                      2
Links.Forward           1                      2
Name.Language.Code      2                      2
Name.Language.Country   2                      3
Name.Url                1                      2
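
The maximum levels in the example can be derived mechanically from the schema: each repeated field on the path adds one to the maximum repetition level, and each optional or repeated field adds one to the maximum definition level. Below is a minimal sketch; the tuple-based schema representation and the function names are assumptions made for illustration.

```python
# (name, repetition, children); children is None for a leaf column.
SCHEMA = ("Document", "required", [
    ("DocId", "required", None),
    ("Links", "optional", [
        ("Backward", "repeated", None),
        ("Forward", "repeated", None),
    ]),
    ("Name", "repeated", [
        ("Language", "repeated", [
            ("Code", "required", None),
            ("Country", "optional", None),
        ]),
        ("Url", "optional", None),
    ]),
])


def max_levels(field, path=(), rep=0, definition=0, is_root=True):
    """Return {column path: (max repetition level, max definition level)}."""
    name, repetition, children = field
    if not is_root:  # the root message itself contributes no level
        path = path + (name,)
        if repetition == "repeated":
            rep += 1
        if repetition != "required":
            definition += 1
    if children is None:
        return {".".join(path): (rep, definition)}
    levels = {}
    for child in children:
        levels.update(max_levels(child, path, rep, definition, False))
    return levels


def bit_width(max_level):
    """Bits needed to store any level in the range 0..max_level."""
    return max_level.bit_length()


levels = max_levels(SCHEMA)
# levels["Name.Language.Country"] → (2, 3)
```

Because the levels never exceed the schema depth, `bit_width` stays tiny: for this schema no level needs more than 2 bits, and a column whose maximum level is 0 needs none at all.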

Abstractions

• Column layer:
  • Iteration on triplets: repetition level, definition level, value.
  • Repetition level = 0 indicates a new record.
  • Once dictionary encoding and other compact encodings are implemented, readers can iterate over either encoded or decoded values.

• Record layer:
  • Iteration on fully assembled records.
  • Provides assembled records for any subset of the columns, so that only the columns actually accessed are loaded.
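
A minimal sketch of the column layer's rule that repetition level 0 starts a new record. The triplets below are made-up sample data for a column shaped like Links.Forward (max definition level 2), and the function names are hypothetical; the stream is assumed to be well formed, i.e. to start with repetition level 0.

```python
def split_records(triplets):
    """Group (repetition level, definition level, value) triplets by record."""
    records = []
    for rep, definition, value in triplets:
        if rep == 0:              # repetition level 0: a new record begins
            records.append([])
        records[-1].append((rep, definition, value))
    return records


def defined_values(record, max_definition_level):
    """Keep only values that are actually present (fully defined)."""
    return [v for _, d, v in record if d == max_definition_level]


# Made-up column data: three records, the last one has no Links at all.
triplets = [
    (0, 2, 20), (1, 2, 40), (1, 2, 60),   # record 1: three values
    (0, 2, 80),                           # record 2: one value
    (0, 0, None),                         # record 3: nothing defined
]
records = split_records(triplets)
values = [defined_values(r, max_definition_level=2) for r in records]
# values → [[20, 40, 60], [80], []]
```

The record layer would run the same kind of loop over several columns at once and stitch the values back into nested records.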

Extensibility

• Schema conversion:
  • Hadoop does not have a notion of schema.
  • However, Pig, Hive, Thrift, Avro, Protocol Buffers, etc. do.

• Record materialization:
  • Pluggable record materialization layer.
  • No double conversion.
  • SAX-style event-based API.
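
A minimal sketch of what a SAX-style, pluggable materializer could look like: the reader emits events, and each framework plugs in a consumer that builds its own record type directly, so no intermediate record is built and then converted. The class, its method names, and the event tuples are hypothetical, not the actual Parquet API.

```python
class DictMaterializer:
    """Builds nested dicts from events; another consumer could build
    Pig tuples or Thrift objects from the same event stream."""

    def __init__(self):
        self.records = []
        self._stack = []

    def start_record(self):
        self._stack = [{}]

    def start_group(self, name):
        group = {}
        self._stack[-1].setdefault(name, []).append(group)
        self._stack.append(group)

    def end_group(self, name):
        self._stack.pop()

    def add_value(self, name, value):
        # Simplification: every field is stored as a list of values.
        self._stack[-1].setdefault(name, []).append(value)

    def end_record(self):
        self.records.append(self._stack.pop())


def replay(events, consumer):
    """Stand-in for the reader: push each event into the consumer."""
    for method, *args in events:
        getattr(consumer, method)(*args)


events = [
    ("start_record",),
    ("add_value", "DocId", 10),
    ("start_group", "Links"),
    ("add_value", "Forward", 20),
    ("add_value", "Forward", 40),
    ("end_group", "Links"),
    ("end_record",),
]
consumer = DictMaterializer()
replay(events, consumer)
```

Swapping `DictMaterializer` for another class with the same methods changes the output type without touching the reader.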

• Encodings:
  • Extensible encoding definitions.
  • Planned: dictionary encoding, zigzag, RLE, ...
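
For readers unfamiliar with the planned encodings, here is a sketch of the generic textbook form of each. These are not the byte-level formats Parquet defines; the function names and return shapes are assumptions chosen for illustration.

```python
def dictionary_encode(values):
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary, index, ids = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids


def zigzag_encode(n):
    """Map signed ints to unsigned so small magnitudes stay small:
    0, -1, 1, -2, 2 become 0, 1, 2, 3, 4."""
    return (n << 1) if n >= 0 else (-2 * n - 1)


def zigzag_decode(z):
    return (z >> 1) ^ -(z & 1)


def rle_encode(values):
    """Run-length encoding: collapse runs into (value, run length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]
```

These compose naturally: dictionary IDs and repetition/definition levels are small integers with long runs, which is the kind of data run-length encoding compresses well.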

Parquet overview. Julien Le Dem, Twitter. http://parquet.github.com
