How to use Parquet
as a basis for ETL and analytics
Julien Le Dem @J_
VP Apache Parquet
Analytics Data Pipeline tech lead, Data Platform
@ApacheParquet
Outline
2
- Storing data efficiently for analysis
- Context: Instrumentation and data collection
- Constraints of ETL
Storing data efficiently for analysis
Why do we need to worry about efficiency?
Producing a lot of data is easy
5
Producing a lot of derived data is even easier.

Solution: Compress all the things!
Scanning a lot of data is easy
6
1% completed
… but not necessarily fast.

Waiting is not productive. We want faster turnaround.

Compression but not at the cost of reading speed.
Interoperability: not that easy
7
We need a storage format that is interoperable with all the tools we use and keeps our options open for the next big thing.
Enter Apache Parquet
Parquet design goals
9
Interoperability
Space efficiency
Query efficiency
Efficiency
Columnar storage
11
Logical table representation (nested schema):

a   b   c
a1  b1  c1
a2  b2  c2
a3  b3  c3
a4  b4  c4
a5  b5  c5

Row layout:
a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5

Column layout:
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5

Encoding: each column is then stored as encoded chunks.
Properties of efficient encodings
12
- Minimize CPU pipeline bubbles:
  highly predictable branching
  reduce data dependencies
- Minimize CPU cache misses:
  reduce the size of the working set
(see the decoding sketch below)
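Why run-length decoding keeps the pipeline full — a minimal sketch in Java (illustrative only, not Parquet's actual decoder): the decoder branches once per run instead of once per value, and the fill loop carries no data dependency between iterations.

import java.util.Arrays;

public class RleDecode {
  // Input: runs as (length, value) pairs. Output: the expanded column values.
  public static int[] decode(int[] runLengths, int[] runValues, int totalCount) {
    int[] out = new int[totalCount];
    int pos = 0;
    for (int run = 0; run < runLengths.length; run++) { // one branch per run
      Arrays.fill(out, pos, pos + runLengths[run], runValues[run]);
      pos += runLengths[run];
    }
    return out;
  }
}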
The right encoding for the right job
13
- Delta encodings (sketched after this list):
  for sorted datasets or signals where the variation is less important than the absolute value (timestamps, auto-generated ids, metrics, …). Focuses on avoiding branching.
- Prefix coding (delta encoding for strings):
  when dictionary encoding does not work.
- Dictionary encoding:
  small (60K) set of values (server IP, experiment id, …)
- Run Length Encoding:
  repetitive data.
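The delta idea in miniature — a sketch only (Parquet's DELTA_BINARY_PACKED encoding additionally bit-packs the deltas in blocks): a sorted column such as timestamps turns into small, repetitive numbers that later stages compress well.

public class DeltaEncode {
  // Store the first value, then only the differences between neighbors.
  public static long[] encode(long[] sortedValues) {
    long[] deltas = new long[sortedValues.length];
    deltas[0] = sortedValues[0];                          // base value
    for (int i = 1; i < sortedValues.length; i++) {
      deltas[i] = sortedValues[i] - sortedValues[i - 1];  // small for sorted input
    }
    return deltas;
  }
}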
Parquet nested representation
14
Schema:

Document
├── DocId
├── Links
│   ├── Backward
│   └── Forward
└── Name
    ├── Language
    │   ├── Code
    │   └── Country
    └── Url

Columns:

docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
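In the Dremel paper's schema notation (which Parquet's message types follow), the tree above reads:

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}

Each leaf field becomes one of the columns listed above; repetition and definition levels record where each value sits in the nesting.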
Statistics for filter and query optimization
15
Vertical partitioning (projection push down)
+ horizontal partitioning (predicate push down)
= read only the data you need!

[Diagram: the same table shown three times — a subset of its columns (projection) plus a subset of its rows (predicate) equals the small intersection of cells actually read.]
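In parquet-mr, predicate push down is expressed as a FilterPredicate handed to the input format; row groups whose statistics rule out a match are skipped without being read. A minimal sketch, assuming zip is stored as an int column (package names are from the Apache-era releases; pre-Apache versions used the parquet.* prefix):

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.IntColumn;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class ZipPushDown {
  // Row groups whose min/max statistics exclude zip == 95113 are never scanned.
  public static void configure(Configuration conf) {
    IntColumn zip = FilterApi.intColumn("zip");
    FilterPredicate keepSanJose = FilterApi.eq(zip, 95113);
    ParquetInputFormat.setFilterPredicate(conf, keepSanJose);
  }
}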
Interoperability
Interoperable
17
Model agnostic
Language agnostic (Java, C++)

- Object models (query execution): Avro, Thrift, Protocol Buffers, Pig Tuple, Hive SerDe, Impala, …
- Converters (assembly/striping): parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive, …
- Parquet file format: column encoding
Frameworks and libraries integrated with Parquet
18
Query engines:
Hive, Impala, HAWQ,
IBM Big SQL, Drill, Tajo,
Pig, Presto, SparkSQL
Frameworks:
Spark, MapReduce, Cascading,
Crunch, Scalding, Kite
Data Models:
Avro, Thrift, ProtocolBuffers,
POJOs
Context: Instrumentation and data collection
Typical data flow
20
[Diagram: happy users interact with instrumented services, which apply mutations to mutable serving stores.]
Typical data flow
21
[Diagram: the previous flow plus data collection — instrumented services log events to a streaming log (Kafka, Scribe, Chukwa, …) for log collection and streaming analysis, while periodic snapshots are pulled from the mutable serving stores; logs and snapshots are periodically consolidated, governed by a schema.]
Typical data flow
22
[Diagram: the collected logs and snapshots are pulled into storage (HDFS) in a query-efficient format — Parquet — feeding ad-hoc queries (Impala, Hive, Drill, …), automated dashboards, batch computation (graph, machine learning, …), and streaming computation (Storm, Samza, Spark Streaming, …).]
Typical data flow
23
[Diagram: the same flow as the previous slide, ending with a happy data scientist.]
Schema management
Schema in Hadoop
25
Hadoop does not define a standard notion of schema, but there are many available:
- Avro
- Thrift
- Protocol Buffers
- Pig
- Hive
- …
And they are all different.
What they define
26
Schema:
- structure of a record
- constraints on the types

Row oriented binary format:
- how records are represented, one at a time
What they *do not* define
27
Column oriented binary format:
Parquet reuses the schema
definitions and provides a common
column oriented binary format
Example: address book
28
AddressBook
└── addresses: list of Address
    Address: street, city, state, zip, comment
Protocol Buffers
29
message AddressBook {
  repeated group Addresses = 1 {
    required string street = 2;
    required string city = 3;
    required string state = 4;
    required string zip = 5;
    optional string comment = 6;
  }
}

- Allows recursive definition
- Types: group or primitive
- Binary format refers to field ids only => renaming fields does not impact the binary format
- Requires installing a native compiler separate from your build

Fields have ids and can be optional, required or repeated.
Lists are repeated fields.
Thrift
30
struct Address {
  1: required string street;
  2: required string city;
  3: required string state;
  4: required string zip;
  5: optional string comment;
}

struct AddressBook {
  1: required list<Address> addresses;
}

- No recursive definition
- Types: struct, map, list, set, union or primitive
- Binary format refers to field ids only => renaming fields does not impact the binary format
- Requires installing a native compiler separately from the build

Fields have ids and can be optional or required.
Explicit collection types.
Avro
31
{
  "type": "record",
  "name": "AddressBook",
  "fields": [{
    "name": "addresses",
    "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "Address",
        "fields": [
          {"name": "street", "type": "string"},
          {"name": "city", "type": "string"},
          {"name": "state", "type": "string"},
          {"name": "zip", "type": "string"},
          {"name": "comment", "type": ["null", "string"]}
        ]
      }
    }
  }]
}

Explicit collection types.
null is a type; optional is a union with null.

- Allows recursive definition
- Types: records, arrays, maps, unions or primitive
- Binary format requires knowing the write-time schema
  ➡ more compact but not self descriptive
  ➡ renaming fields does not impact the binary format
- Code generator in Java (well integrated in the build)
Requirements of ETL
Log event collection
33
Initial collection is fundamentally
row oriented:
- Sync to disk as early as
possible to minimize event loss
- Counting events sent and
received is a good idea.
Columnar storage conversion
34
Columnar storage requires writes to be buffered in memory for the entire row group:
- Write many records at a time.
- Better executed in batch (see the writer sketch below).
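What the conversion step can look like with parquet-mr's Avro writer — a minimal sketch, assuming Avro GenericRecords as input (this constructor is from the older parquet-avro API; the output path and sizes are illustrative):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ConvertToParquet {
  public static void convert(Iterable<GenericRecord> records, Schema schema, Path out)
      throws IOException {
    int rowGroupSize = 128 * 1024 * 1024; // buffered in memory before each flush
    int pageSize = 1024 * 1024;           // unit of encoding and compression
    AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(
        out, schema, CompressionCodecName.SNAPPY, rowGroupSize, pageSize);
    try {
      for (GenericRecord record : records) {
        writer.write(record);             // accumulated column by column
      }
    } finally {
      writer.close();                     // flushes the last row group and the footer
    }
  }
}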
Columnar storage conversion
35
Not just columnar storage:
- Dynamic partitioning
- Sort order
- Stats generation
Write to Parquet
36
MapReduce — pick the OutputFormat matching your object model and define the schema (job wiring is sketched below):
- Protocol Buffers: ProtoParquetOutputFormat, setProtobufClass(job, AddressBook.class)
- Thrift: ParquetThriftOutputFormat, setThriftClass(job, AddressBook.class)
- Avro: AvroParquetOutputFormat, setSchema(job, AddressBook.SCHEMA$)
Scalding:
// define the Parquet source
case class AddressBookParquetSource(override implicit val dateRange: DateRange)
  extends HourlySuffixParquetThrift[AddressBook]("/my/data/address_book", dateRange)
// load and transform data
pipe.write(ParquetSource())
Pig:
STORE mydata INTO 'my/data' USING parquet.pig.ParquetStorer();
Hive / Impala:
create table parquet_table (x int, y string) stored as parquetfile;
insert into parquet_table select x, y from some_other_table;
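The MapReduce wiring, as a sketch around the setThriftClass(job, …) call named above (AddressBook is the generated Thrift class from the schema slides; the output path is illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.hadoop.thrift.ParquetThriftOutputFormat;

public class AddressBookWriteJob {
  public static Job configure(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf, "address-book-to-parquet");
    job.setOutputFormatClass(ParquetThriftOutputFormat.class);
    // the Thrift class supplies the schema stored in the Parquet footer
    ParquetThriftOutputFormat.setThriftClass(job, AddressBook.class);
    FileOutputFormat.setOutputPath(job, new Path("/my/data/address_book"));
    return job;
  }
}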
Query engines
Scalding
38
loading:
new FixedPathParquetThrift[AddressBook]("my", "data") {
  val city = StringColumn("city")
  override val withFilter: Option[FilterPredicate] =
    Some(city === "San Jose")
}

operations:
p.map( (r) => r.a + r.b )
p.groupBy( (r) => r.c )
p.join
…

Explicit push down.
Pig
39
loading:
mydata = LOAD 'my/data' USING parquet.pig.ParquetLoader();

operations:
A = FOREACH mydata GENERATE a + b;
B = GROUP mydata BY c;
C = JOIN A BY a, B BY b;

Projection push down happens automatically.
SQL engines
40
Hive, Impala, Presto, Drill:
load (optional — Drill can directly query Parquet files):
create table … as SELECT city FROM addresses WHERE zip == 95113
query:
SELECT city FROM addresses WHERE zip == 95113
Drill, directly against the files:
SELECT city FROM dfs.`/table/addresses` WHERE zip == 95113

SparkSQL:
val parquetFile = sqlContext.parquetFile("/table/addresses")
val result = sqlContext
  .sql("SELECT city FROM addresses WHERE zip == 95113")
result.map((r) => …)

Projection push down happens automatically.
Community
Parquet timeline
42
- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats

- March 2013: OSS announcement; Criteo signs on for Hive integration

- July 2013: 1.0 release. 18 contributors from more than 5 organizations.

- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.

- Apr 2015: Parquet graduates from the Apache Incubator
Thank you to our contributors
43
Open Source announcement
1.0 release
Get involved
44
Mailing lists:
- dev@parquet.apache.org
GitHub repo:
- https://github.com/apache/parquet-mr
Parquet sync ups:
- Regular meetings on Google Hangout
@ApacheParquet
Questions
45
SELECT answer(question) FROM audience
@ApacheParquet
