How to use Parquet
as a basis for ETL and analytics
Julien Le Dem @J_
Analytics Data Pipeline tech lead, Data Platform
@ApacheParquet
Outline
2
- Instrumentation and data collection
- Storing data efficiently for analysis
- Openness and Interoperability
Instrumentation and data collection
Typical data flow
4-7
[Diagram, built up over four slides] Instrumented services handle user traffic
and write mutations to mutable serving stores, keeping users happy. A data
collection layer taps this in two ways: log collection through a streaming log
system (Kafka, Scribe, Chukwa, ...) and periodic snapshots pulled from the
serving stores. The log stream feeds streaming analysis; the snapshots are
periodically consolidated against a schema. Both paths land in storage (HDFS)
in a query-efficient format: Parquet. From there the data serves ad-hoc queries
(Impala, Hive, Drill, ...), automated dashboards, batch computation (graph,
machine learning, ...), and streaming computation (Storm, Samza, Spark
Streaming, ...). The result: a happy data scientist.
Storing data for analysis
Producing a lot of data is easy
9
Producing a lot of derived data is even easier.

Solution: Compress all the things!
Scanning a lot of data is easy
10
[progress bar: 1% completed]
… but not necessarily fast.
Waiting is not productive. We want faster turnaround.
Compression, but not at the cost of reading speed.
Interoperability: not that easy
11
We need a storage format that is interoperable with all the tools we use
and keeps our options open for the next big thing.
Enter Apache Parquet
Parquet design goals
13
- Interoperability

- Space efficiency

- Query efficiency
Efficiency
Columnar storage
15
[Diagram] The same logical table (columns a, b, c; rows 1 through 5) can be
laid out two ways:
- Row layout: values interleaved row by row (a1 b1 c1, a2 b2 c2, ..., a5 b5 c5).
- Column layout: all values of each column stored contiguously (a1 a2 a3 a4 a5,
b1 b2 b3 b4 b5, c1 c2 c3 c4 c5), each column written as encoded chunks.
A nested schema is flattened into columns the same way (next slide).
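To make the two layouts concrete, a tiny self-contained Java illustration
(hypothetical values, not Parquet's actual on-disk representation):

public class LayoutDemo {
  public static void main(String[] args) {
    // Row layout: values interleaved row by row.
    String[] rowLayout = {"a1", "b1", "c1", "a2", "b2", "c2", "a3", "b3", "c3"};
    // Column layout: each column stored contiguously, then encoded in chunks.
    String[] columnLayout = {"a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"};
    // Reading only column b touches every third value in the row layout,
    // but one contiguous (and highly compressible) run in the column layout.
    for (int i = 3; i < 6; i++) {
      System.out.println(columnLayout[i]); // prints b1 b2 b3
    }
  }
}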
Parquet nested representation
16
Schema (borrowed from the Google Dremel paper):
Document
  DocId
  Links
    Backward
    Forward
  Name
    Language
      Code
      Country
    Url
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
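Written out in Parquet's schema syntax, the same structure (repetition and
optionality as in the Dremel paper example) looks like this:

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required binary Code;
      optional binary Country;
    }
    optional binary Url;
  }
}

Each leaf primitive becomes one of the columns listed above.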
Statistics for filter and query optimization
17
[Diagram] Vertical partitioning (projection push down) + horizontal
partitioning (predicate push down) = read only the data you need!
Projection push down reads only the columns a query references; predicate push
down uses per-chunk statistics to skip chunks that cannot match the filter.
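A minimal sketch of predicate push down, assuming parquet-mr's filter2 API
(package names follow the pre-Apache-rename releases used elsewhere in this
deck; the column name is hypothetical):

import org.apache.hadoop.mapreduce.Job;
import parquet.filter2.predicate.FilterApi;
import parquet.filter2.predicate.FilterPredicate;
import parquet.hadoop.ParquetInputFormat;
import parquet.io.api.Binary;

public class PushdownSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Chunks whose min/max statistics for "city" exclude "San Jose" are
    // skipped without being read or decoded.
    FilterPredicate cityIsSanJose =
        FilterApi.eq(FilterApi.binaryColumn("city"), Binary.fromString("San Jose"));
    ParquetInputFormat.setFilterPredicate(job.getConfiguration(), cityIsSanJose);
  }
}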
Properties of efficient encodings
18
- Minimize CPU pipeline bubbles:
  highly predictable branching
  reduced data dependencies
- Minimize CPU cache misses:
  reduced size of the working set
The right encoding for the right job
19
- Delta encodings: for sorted datasets, or signals where the variation is small
compared to the absolute value (timestamps, auto-generated ids, metrics, ...).
Focuses on avoiding branching. See the sketch after this list.
- Prefix coding (delta encoding for strings): when dictionary encoding does not
work.
- Dictionary encoding: small (up to ~60K distinct) set of values (server IP,
experiment id, ...).
- Run Length Encoding: repetitive data.
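A minimal sketch of the delta idea for sorted integers (illustrative only;
Parquet's actual DELTA_BINARY_PACKED encoding additionally bit-packs the
deltas in blocks):

public class DeltaSketch {
  public static void main(String[] args) {
    // Sorted timestamps: large absolute values, tiny differences.
    long[] ts = {1400000000L, 1400000003L, 1400000004L, 1400000009L};
    long first = ts[0];                    // store the first value once
    long[] deltas = new long[ts.length - 1];
    for (int i = 1; i < ts.length; i++) {
      deltas[i - 1] = ts[i] - ts[i - 1];   // 3, 1, 5: each fits in a few bits
    }
    // Decoding is a branch-free running sum, keeping the CPU pipeline full.
    long value = first;
    System.out.println(value);
    for (long d : deltas) {
      value += d;
      System.out.println(value);
    }
  }
}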
Interoperability
Interoperable
21
[Diagram] Parquet is model agnostic and language agnostic (Java and C++
implementations). Query execution lives in the object model layer (Avro,
Thrift, Protocol Buffers, Pig Tuple, Hive SerDe, Impala, ...). Converters
(parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive, ...)
handle the assembly/striping between object-model records and columns. The
Parquet file format layer below them handles the column encoding.
Frameworks and libraries integrated with Parquet
22
Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto
Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
Data models: Avro, Thrift, Protocol Buffers, POJOs
Schema management
Schema in Hadoop
24
Hadoop does not define a standard notion of schema but there are many
available:

- Avro

- Thrift

- Protocol Buffers

- Pig

- Hive

- …

And they are all different
What they define
25
Schema:
Structure of a record
Constraints on the types
Row oriented binary format:
How records are represented one at a time
What they *do not* define
26
Column oriented binary format:
Parquet reuses the schema definitions and provides a common column oriented
binary format.
Example: address book
27
[Diagram] An AddressBook contains a list of addresses; each Address has a
street, city, state, zip, and an optional comment.
Protocol Buffers
28
message AddressBook {
  repeated group addresses = 1 {
    required string street = 2;
    required string city = 3;
    required string state = 4;
    required string zip = 5;
    optional string comment = 6;
  }
}

- Allows recursive definitions
- Types: group or primitive
- The binary format refers to field ids only => renaming fields does not impact
the binary format
- Requires installing a native compiler separate from your build
Fields have ids and can be optional, required or repeated.
Lists are repeated fields.
Thrift
29
struct AddressBook {
  1: required list<Address> addresses;
}
struct Address {
  1: required string street;
  2: required string city;
  3: required string state;
  4: required string zip;
  5: optional string comment;
}

- No recursive definitions
- Types: struct, map, list, set, union or primitive
- The binary format refers to field ids only => renaming fields does not impact
the binary format
- Requires installing a native compiler separately from the build
Fields have ids and can be optional or required.
Explicit collection types.
Avro
30
{
  "type": "record",
  "name": "AddressBook",
  "fields": [{
    "name": "addresses",
    "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "Address",
        "fields": [
          {"name": "street", "type": "string"},
          {"name": "city", "type": "string"},
          {"name": "state", "type": "string"},
          {"name": "zip", "type": "string"},
          {"name": "comment", "type": ["null", "string"]}
        ]
      }
    }
  }]
}

Explicit collection types.
- Allows recursive definitions
- Types: records, arrays, maps, unions or primitive
- The binary format requires knowing the write-time schema
➡ more compact but not self descriptive
➡ renaming fields does not impact the binary format
- Generator in Java (well integrated in the build)
null is a type; an optional field is a union with null.
Write to Parquet
Write to Parquet with Map Reduce
32
Protocol Buffers:
job.setOutputFormatClass(ProtoParquetOutputFormat.class);
ProtoParquetOutputFormat.setProtobufClass(job, AddressBook.class);

Thrift:
job.setOutputFormatClass(ParquetThriftOutputFormat.class);
ParquetThriftOutputFormat.setThriftClass(job, AddressBook.class);

Avro:
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, AddressBook.SCHEMA$);
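For context, a minimal sketch of the job setup these snippets plug into (the
mapper and output path are hypothetical; AddressBook is the user's generated
class):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import parquet.proto.ProtoParquetOutputFormat;

public class WriteAddressBooks {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "write-address-books");
    // ... configure a mapper that emits AddressBook records here ...
    job.setOutputFormatClass(ProtoParquetOutputFormat.class);
    ProtoParquetOutputFormat.setProtobufClass(job, AddressBook.class);
    FileOutputFormat.setOutputPath(job, new Path("/my/data/address_book"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}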
Write to Parquet with Scalding
33
// define the Parquet source
case class AddressBookParquetSource(override implicit val dateRange: DateRange)
  extends HourlySuffixParquetThrift[AddressBook]("/my/data/address_book", dateRange)
// load and transform data
…
pipe.write(AddressBookParquetSource())
Write to Parquet with Pig
34
…
STORE mydata
  INTO 'my/data'
  USING parquet.pig.ParquetStorer();
Query engines
Scalding
36
loading:
new FixedPathParquetThrift[AddressBook]("my", "data") {
  val city = StringColumn("city")
  override val withFilter: Option[FilterPredicate] =
    Some(city === "San Jose")
}

operations:
p.map( (r) => r.a + r.b )
p.groupBy( (r) => r.c )
p.join
…
Pig
37
loading:
mydata = LOAD 'my/data' USING parquet.pig.ParquetLoader();

operations:
A = FOREACH mydata GENERATE a + b;
B = GROUP mydata BY c;
C = JOIN A BY a, B BY b;
Hive
38
loading:
create table parquet_table_name (x INT, y STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.MapredParquetInputFormat"
OUTPUTFORMAT "parquet.hive.MapredParquetOutputFormat";

operations:
SQL
Impala
39
loading:
create table parquet_table (x int, y string) stored as parquetfile;
insert into parquet_table select x, y from some_other_table;
select y from parquet_table where x between 70 and 100;

operations:
SQL
Drill
40
SELECT * FROM dfs.`/my/data`
Spark SQL
41
loading:
val address = sqlContext.parquetFile("/my/data/addresses")
address.registerAsTable("addresses") // make it queryable by name

operations:
val result = sqlContext
  .sql("SELECT city FROM addresses WHERE zip == 94707")
result.map((r) => …)
Community
Parquet timeline
43
- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats

- March 2013: OSS announcement; Criteo signs on for Hive integration

- July 2013: 1.0 release. 18 contributors from more than 5 organizations.

- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.

- Parquet 2.0 coming as Apache release
Thank you to our contributors
44
[Word clouds of contributor names at the Open Source announcement and at the
1.0 release]
Get involved
45
Mailing lists:
- dev@parquet.incubator.apache.org

Parquet sync ups:
- Regular meetings on Google Hangout
Questions
46
Questions.foreach( answer(_) )
@ApacheParquet
