1
Parquet data format &
Impala overview
2
Agenda
• Objective
• Various data formats
• Use case
• Parquet
• Impala
3
Objective
• Twofold:
• Quest for a more performant data format than
Avro for nested data
• Understand and test new data formats in general
4
Hadoop data formats
• Sequence file. It stores key-value pairs of data in
a flat binary file. Rows stored as values.
• ORC. Stores column-oriented data. Adds RLE
and dictionary encoding, statistics, and single-file
output. Bloom filters are planned.
• Avro. Data serialization framework: serialization
format & exchange service, for any language. Data
accompanied by schema (in JSON). Supports
schema evolution.
5
Parquet
• Columnar storage
• Automatic dictionary encoding and run-length
encoding. Separation of encoding vs compression.
• Run-length encoding: replaces sequences ("runs")
of consecutive repeated characters (or other units
of data) with a single character and the length of
the run.
• Dictionary encoding takes the different values
present in a column, and represents each one in
compact 2-byte form
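
The two encodings described above can be pictured with a minimal Python sketch. It is an illustration of the general techniques only, not Parquet's actual on-disk layout or API; the function names and the sample column are assumptions made for this example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of consecutive equal values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Map each distinct value to a small integer code, in first-seen order."""
    dictionary = {}
    codes = []
    for value in values:
        # setdefault assigns the next free code only for values not seen before
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


# Hypothetical low-cardinality column with repeated values
column = ["US", "US", "US", "DE", "DE", "US"]

rle = run_length_encode(column)
dictionary, codes = dictionary_encode(column)

print(rle)         # runs of repeated values
print(dictionary)  # distinct value -> code
print(codes)       # the column rewritten as codes
```

Both encodings shrink the data before any general-purpose compression codec is applied, which is the "separation of encoding vs compression" mentioned above.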
6
Parquet
• Parquet can handle multiple schemas. Supports
schema evolution.
• LogType A : organizationId, userId, timestamp,
recordId, cpuTime
• LogType V : userId, organizationId, timestamp,
foo, bar
• Can be used by any project in the Hadoop
ecosystem. Integrations provided for M/R, Pig,
Hive, Cascading and Impala.
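
One way to picture reading the two log types above under a single evolved schema is the sketch below. It is a conceptual illustration in plain Python, not how Parquet or its integrations implement schema merging; `merge_schemas`, `project`, and the sample record are hypothetical names and values introduced for this example.

```python
# Column lists copied from the two log types on the slide
LOG_TYPE_A = ["organizationId", "userId", "timestamp", "recordId", "cpuTime"]
LOG_TYPE_V = ["userId", "organizationId", "timestamp", "foo", "bar"]


def merge_schemas(*schemas):
    """Return the union of column names, keeping first-seen order."""
    merged = []
    for schema in schemas:
        for column in schema:
            if column not in merged:
                merged.append(column)
    return merged


def project(record, schema):
    """Read a record under a schema; columns the record lacks come back as None."""
    return {column: record.get(column) for column in schema}


merged = merge_schemas(LOG_TYPE_A, LOG_TYPE_V)

# Hypothetical LogType V record read under the merged schema
row_v = project(
    {"userId": 7, "organizationId": 1, "timestamp": 1000, "foo": "x", "bar": "y"},
    merged,
)
print(merged)
print(row_v)
```

Because the storage is columnar, a column that one log type never populates costs little: it is simply absent (null) for those records.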
7
Parquet
• SELECT vs INSERT.
• Parquet tables require relatively little memory to
query, because a query reads and decompresses data
in 8 MB chunks.
• Inserting into a Parquet table is a more
memory-intensive operation because the data for each data file
(with a maximum size of 1 GB) is stored in memory
until encoded, compressed, and written to disk.
8
Parquet
• Memory issues (Heap space error) resolved by:
• Reducing parquet.block.size. The block size is the
size of a row group being buffered in memory; its
default value is 256 MB.
• The total memory allocated was around 1 GB.
• Using multiple Hive partitions -> multiple buffers were
created (one for writing into each partition).
• So writing data with Parquet will always have a high
memory requirement.
• Hive’s DISTRIBUTE BY was the workaround for the memory issues!
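
The relationship between block size, open partitions, and writer memory can be sketched as simple arithmetic. This is a rough back-of-the-envelope model under the assumption that each partition being written holds one row-group buffer of `parquet.block.size` bytes; real writers have additional overhead. The function name, the partition count, and the 64 MB alternative are illustrative assumptions, not values from the test.

```python
MB = 1024 * 1024


def estimated_write_buffer_bytes(open_partitions, block_size_bytes):
    """Rough estimate: one row-group buffer per partition written at once."""
    return open_partitions * block_size_bytes


default_block = 256 * MB  # default value stated on the slide

# Hypothetical task writing into 4 partitions at the same time
with_default = estimated_write_buffer_bytes(4, default_block)

# Same task with a reduced block size (64 MB is an arbitrary example value)
with_smaller_block = estimated_write_buffer_bytes(4, 64 * MB)

# Same block size, but rows distributed so each task writes one partition
with_one_partition = estimated_write_buffer_bytes(1, default_block)

for label, size in [
    ("4 partitions, 256 MB blocks", with_default),
    ("4 partitions, 64 MB blocks", with_smaller_block),
    ("1 partition, 256 MB blocks", with_one_partition),
]:
    print(f"{label}: {size // MB} MB")
```

Under this model, both remedies on the slide attack the same product: reducing the block size shrinks each buffer, while distributing rows by partition key reduces how many buffers a single task keeps open.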
9
Parquet vs other formats
Performance test with 100 GB of data over multiple queries
Parquet wins
10
Impala overview
• MPP implementation of a query engine
• Impala vs Hive: Impala runs SQL queries for interactive
exploratory analytics on large data sets, whereas Hive
runs as batch.
• Does not use M/R, but does use HDFS
• Not CEP; closer to an RDBMS.
• Impala uses the same metadata store as Hive to
record information about table structure and
properties
11
Impala overview
• Can create a table in Hive, and use it in Impala
• E.g. Impala doesn’t support creating Avro tables, but Hive does
• Language is a mix between SQL & HiveQL
• Requires a lot of memory (128 GB min. per node)
• Initial load of data via REFRESH; can take a lot of time
• loads the block location data for newly added data
files
12
Impala overview
• Shortcomings
• Impala doesn’t support nested types at this point
(version 1.2.3). A table can be used as long as it contains
only Impala-compatible data types; it cannot contain
nested types such as array, map, or struct.
• Impala currently does not "spill to disk"
if intermediate results being processed on a node
exceed the memory reserved for Impala on that
node.
• No Custom Serializer/Deserializer classes (SerDes)
• Impala cancels a running query if any host on which that
query is executing fails
13
Impala overview
• Example. There are 3 ways to create a Parquet table in
Impala:
• -> Use a Parquet table created in Hive (with no nested
data types).
• -> Create a normal text table in Impala, load it with data,
and create the Parquet table from it:
• IMPALA> CREATE TABLE parquet_table_name LIKE
text_table_name STORED AS PARQUET LOCATION
'/user/hdfs/..';
• -> Create a Parquet-format table and then insert into it
from the normal text table:
• IMPALA> INSERT OVERWRITE TABLE parquet_table_name
SELECT * FROM text_table_name;
14
Use Case
• Can't query the Avro table in Impala because it has
nested columns.
• An Avro table created through Hive can be used in
Impala as long as it contains only Impala-compatible
data types
(it cannot contain nested types such as array, map,
or struct).
15
Use Case
• How to deal with nested XML data in Hadoop?
• There is no direct mapping from XML to Avro. The process goes:
• Parse XML and convert to Avro: parse the XML using XMLStreamReader,
• perform JAXB unmarshalling, and create Avro records from the JAXB objects. This requires writing
a Java class.
• Tried using Parquet/Avro:
• Tested: process XML by first converting it into Avro and then storing it in Parquet format using
the parquet-avro APIs.
• The problem is that the provided schema has some arrays that are a union of the string and
null types.
• Currently AvroSchemaConverter is not able to handle such an Avro schema and throws an
exception.
• Tested: Impala 1.2.3 on CDH 4.5
• Impala doesn’t support nested types at this point
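
To make the problematic schema shape concrete, here is a minimal sketch of an Avro record schema containing an array whose items are a union of null and string. This is one plausible reading of the slide's description, not the actual schema used in the test; the record name `Order`, the field names, and the helper function are hypothetical.

```python
import json

# Hypothetical Avro schema, written as a Python dict and serialized to JSON.
# The "tags" field is an array whose items may be either null or a string.
schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {
            "name": "tags",
            "type": {"type": "array", "items": ["null", "string"]},
        },
    ],
}


def array_fields_with_union_items(record_schema):
    """List fields that are arrays whose item type is a union (a JSON list)."""
    found = []
    for field in record_schema["fields"]:
        field_type = field["type"]
        if (
            isinstance(field_type, dict)
            and field_type.get("type") == "array"
            and isinstance(field_type.get("items"), list)
        ):
            found.append(field["name"])
    return found


schema_json = json.dumps(schema, indent=2)
print(schema_json)
print(array_fields_with_union_items(schema))
```

A check like this could be run over a schema before conversion to spot the fields likely to trigger the converter exception described above.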
16
Thank you


Editor's Notes

  1. Also splittable.