ApacheCon Europe Big Data 2016 – Parquet in practice & detail


Apache Parquet is among the most commonly used column-oriented data formats in the big data processing space. It leverages various techniques to store data in a CPU- and I/O-efficient way. Furthermore, it can push analytical queries down to the I/O layer to avoid loading irrelevant data chunks. With several Java implementations and a C++ implementation, Parquet is also well suited to exchanging data between different technology stacks. This talk gives a general introduction to the format and its techniques. Their benefits and some of the inner workings are explained to give a better understanding of how Parquet achieves its performance. At the end, benchmarks comparing the new C++ & Python implementation with other formats are shown.


What is Parquet? How is it so efficient? Why should I actually use it?
Parquet in Practice & Detail

About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• Committer to Apache {Arrow, Parquet}
• Work in Python, Cython, C++11 and SQL
xhochy
uwe@apache.org

Agenda
Origin and Use Case
Parquet under the bonnet
Python & C++
The Community and its neighbours

About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. top-level Apache project
5. Fall 2016: Python & C++ support
6. State of the art format in the Hadoop ecosystem
• often used as the default I/O option

Why use Parquet?
1. Columnar format 
—> vectorized operations
2. Efficient encodings and compressions
—> small size without the need for a fat CPU
3. Query push-down 
—> bring computation to the I/O layer
4. Language independent format 
—> libs in Java / Scala / C++ / Python /…

Who uses Parquet?
• Query Engines
• Hive
• Impala
• Drill
• Presto
• …
• Frameworks
• Spark
• MapReduce
• …
• Pandas

Nested data
• More than a flat table!
• Structure borrowed from Dremel paper
• https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Document
• DocId
• Links
  • Backward
  • Forward
• Name
  • Language
    • Code
    • Country
  • Url
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
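
How the nested Document structure maps to the flat column list above can be sketched in a few lines of Python. This is only an illustration: the dict-based schema and the `leaf_columns` helper are hypothetical and are not part of any Parquet library.

```python
def leaf_columns(schema, prefix=""):
    """Return the dotted paths of all leaf fields in a nested schema.

    `schema` is a hypothetical representation: a dict mapping a field
    name to a dict of child fields (an empty dict marks a leaf).
    """
    paths = []
    for name, children in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        if children:
            # A group: recurse into its child fields.
            paths.extend(leaf_columns(children, path))
        else:
            # A leaf: it becomes exactly one column.
            paths.append(path)
    return paths


document = {
    "docid": {},
    "links": {"backward": {}, "forward": {}},
    "name": {"language": {"code": {}, "country": {}}, "url": {}},
}

columns = leaf_columns(document)
print(columns)
```

Only the leaves become columns; the groups (links, name, language) exist solely as path prefixes.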

Why columnar?
2D Table
row layout
columnar layout
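
The difference between the two layouts can be shown on a tiny table. The following is a minimal sketch with made-up values; it only illustrates the order in which values end up on disk, not an actual file format.

```python
# A small 2D table: each tuple is one row (id, label, amount).
rows = [
    (1, "a", 1.5),
    (2, "b", 2.5),
    (3, "c", 3.5),
]

# Row layout: all values of the first row, then the second row, and so on.
row_layout = [value for row in rows for value in row]

# Columnar layout: all values of the first column, then the second column, ...
columnar_layout = [value for column in zip(*rows) for value in column]

print(row_layout)
print(columnar_layout)
```

In the columnar layout, values of the same type sit next to each other, which is what makes per-column encodings and vectorized operations possible.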

File Structure
File
RowGroup
Column Chunks
Page
Statistics
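
The hierarchy on this slide can be modelled roughly as follows. This is a simplified sketch, not the real file layout: the class names, the `build_row_groups` helper, and the choice to keep min/max statistics on the column chunk are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    values: list  # smallest unit that is encoded and compressed


@dataclass
class ColumnChunk:
    name: str
    pages: list
    min_value: object  # statistics kept per chunk in this sketch
    max_value: object


@dataclass
class RowGroup:
    num_rows: int
    column_chunks: list  # one chunk per column


def build_row_groups(table, rows_per_group, values_per_page):
    """Split a dict of equally long columns into row groups, chunks and pages."""
    num_rows = len(next(iter(table.values())))
    row_groups = []
    for start in range(0, num_rows, rows_per_group):
        stop = min(start + rows_per_group, num_rows)
        chunks = []
        for name, column in table.items():
            part = column[start:stop]
            pages = [
                Page(part[i:i + values_per_page])
                for i in range(0, len(part), values_per_page)
            ]
            chunks.append(ColumnChunk(name, pages, min(part), max(part)))
        row_groups.append(RowGroup(stop - start, chunks))
    return row_groups


table = {"id": [1, 2, 3, 4, 5], "fare": [7.5, 3.0, 9.0, 4.5, 6.0]}
row_groups = build_row_groups(table, rows_per_group=2, values_per_page=1)
for group in row_groups:
    for chunk in group.column_chunks:
        print(group.num_rows, chunk.name, chunk.min_value, chunk.max_value)
```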

Encodings
• Know the data
• Exploit the knowledge
• Cheaper than universal compression
• Example dataset:
  • NYC TLC Trip Record data for January 2016
  • 1629 MiB as CSV
  • columns: bool(1), datetime(2), float(12), int(4)
  • Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Encodings — PLAIN
• Simply write the binary representation to disk
• Simple to read & write
• Performance limited by I/O throughput
• —> 1499 MiB
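
For 32-bit integers, PLAIN encoding amounts to writing each value as 4 little-endian bytes. A minimal sketch using Python's `struct` module (the sample values are made up):

```python
import struct

values = [12, 12, 7, 1024]

# Write every value as a 4-byte little-endian signed integer, back to back.
encoded = struct.pack(f"<{len(values)}i", *values)

# Reading is the mirror image: 4 bytes per value, no further decoding step.
decoded = list(struct.unpack(f"<{len(encoded) // 4}i", encoded))

print(len(encoded), decoded)
```

The size is always 4 bytes per value regardless of content, which is why throughput, not CPU, is the limit here.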

Encodings — RLE & Bit Packing
• bit-packing: only use the necessary bits
• run-length encoding: store "378 times 12" instead of 378 separate values
• hybrid: dynamically choose the best
• Used for Definition & Repetition levels
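
Both ideas can be sketched in plain Python. This is a simplified illustration and not the byte layout of Parquet's hybrid encoding; the helper names `bit_width`, `bit_pack` and `run_length_encode` are hypothetical.

```python
from itertools import groupby


def bit_width(values):
    """Number of bits needed to represent the largest value."""
    return max(values).bit_length()


def bit_pack(values, width):
    """Pack non-negative ints into bytes using `width` bits each (lowest bits first)."""
    buffer = 0
    for position, value in enumerate(values):
        buffer |= value << (position * width)
    return buffer.to_bytes((len(values) * width + 7) // 8, "little")


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


values = [12] * 378

width = bit_width(values)          # 12 fits into 4 bits
packed = bit_pack(values, width)   # 378 values * 4 bits
runs = run_length_encode(values)   # a single run

print(width, len(packed), runs)
```

For this input the run-length form is far smaller than the bit-packed form; for data without repeats the opposite holds, which is why a hybrid picks per stretch of data.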

Encodings — Dictionary
• PLAIN_DICTIONARY / RLE_DICTIONARY
• every value is assigned a code
• Dictionary: store a map of code —> value
• Data: store only codes, use RLE on that
• —> 329 MiB (22%)
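
A minimal sketch of the dictionary idea, with run-length encoding applied to the resulting codes. The payment-type strings and the helper names are made up for illustration; the real encodings store the codes in a more compact binary form.

```python
from itertools import groupby


def dictionary_encode(values):
    """Assign each distinct value a code; return (dictionary, codes)."""
    dictionary = []  # code -> value
    index = {}       # value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Look every code up in the dictionary again."""
    return [dictionary[code] for code in codes]


values = ["card", "card", "card", "cash", "card", "card"]

dictionary, codes = dictionary_encode(values)
# Store the small integer codes as (code, run length) pairs.
runs = [(code, len(list(run))) for code, run in groupby(codes)]

print(dictionary, codes, runs)
```

Each distinct string is stored once; the data itself shrinks to small integers that compress well with RLE and bit-packing.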

Compression
1. Shrink data size independent of its content
2. More CPU intensive than encoding
3. encoding+compression performs better than compression alone with less CPU cost
4. LZO, Snappy, GZIP, Brotli 
—> If in doubt: use Snappy
5. GZIP: 174 MiB (11%)
Snappy: 216 MiB (14%)
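
To experiment with the interplay of encoding and compression, the sketch below compresses the same values once in PLAIN form and once after bit-packing. It uses `zlib` from the standard library, whose DEFLATE algorithm is the one behind GZIP; Snappy and the other codecs are not in the standard library. The data is synthetic and the `bit_pack` helper is hypothetical, so the printed sizes say nothing about the trip-record numbers above.

```python
import struct
import zlib


def bit_pack(values, width):
    """Pack non-negative ints into bytes using `width` bits each."""
    buffer = 0
    for position, value in enumerate(values):
        buffer |= value << (position * width)
    return buffer.to_bytes((len(values) * width + 7) // 8, "little")


# Synthetic column with only four distinct values (0..3).
values = [i % 4 for i in range(10_000)]

plain = struct.pack(f"<{len(values)}i", *values)  # 4 bytes per value
packed = bit_pack(values, 2)                      # 2 bits per value

plain_compressed = zlib.compress(plain)
packed_compressed = zlib.compress(packed)

for label, data in [
    ("plain", plain),
    ("plain + zlib", plain_compressed),
    ("bit-packed", packed),
    ("bit-packed + zlib", packed_compressed),
]:
    print(f"{label:>18}: {len(data)} bytes")
```

The compressor also has less input to scan in the bit-packed case, which is where the CPU saving comes from.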

https://github.com/apache/parquet-mr/pull/384

Query pushdown
1. Only load used data
  1. skip columns that are not needed
  2. skip (chunks of) rows that are not relevant
2. saves I/O load as the data is not transferred
3. saves CPU as the data is not decoded
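
Skipping chunks of rows relies on the per-chunk statistics: if the maximum of a chunk is below the filter threshold, no row in it can match. A minimal sketch with in-memory row groups follows; the data, `write_row_group` and `query` are hypothetical and do not mirror a real reader API.

```python
def write_row_group(columns):
    """Keep the column data together with min/max statistics per column."""
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return {"columns": columns, "stats": stats}


def query(row_groups, select, filter_column, threshold):
    """Return `select` values of rows where `filter_column` > threshold."""
    results = []
    groups_read = 0
    for group in row_groups:
        _, max_value = group["stats"][filter_column]
        if max_value <= threshold:
            continue  # no row can match: skip without touching the data
        groups_read += 1
        # Only the two columns involved are accessed; "tip" is never read.
        filter_values = group["columns"][filter_column]
        selected = group["columns"][select]
        results.extend(
            s for f, s in zip(filter_values, selected) if f > threshold
        )
    return results, groups_read


row_groups = [
    write_row_group({"distance": [1.0, 2.5, 3.0], "fare": [5.0, 9.0, 10.5], "tip": [1.0, 0.0, 2.0]}),
    write_row_group({"distance": [12.0, 4.0, 15.5], "fare": [40.0, 14.0, 52.0], "tip": [8.0, 2.0, 9.5]}),
    write_row_group({"distance": [0.5, 0.8], "fare": [3.5, 4.0], "tip": [0.0, 0.5]}),
]

fares, groups_read = query(row_groups, "fare", "distance", 10.0)
print(fares, groups_read)
```

Only one of the three row groups is actually scanned; the other two are ruled out from their statistics alone.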

Competitors (Python)
• HDF5
  • binary (with schema)
  • fast, just not with strings
  • not a first-class citizen in the Hadoop ecosystem
• msgpack
  • fast but unstable
• CSV
  • The universal standard.
  • row-based
  • schema-less

C++
1. General purpose read & write of Parquet
  • data structure independent
  • pluggable interfaces (allocator, I/O, …)
2. Routines to read into specific data structures
  • Apache Arrow
  • …

Use Parquet in Python
https://pyarrow.readthedocs.io/en/latest/install.html#building-from-source

Get involved!
1. Mailing list: dev@parquet.apache.org
2. Website: https://parquet.apache.org/
3. Or directly start contributing by grabbing an issue on https://issues.apache.org/jira/browse/PARQUET
4. Slack: https://parquet-slack-invite.herokuapp.com/

We're hiring!
Questions?!
