Your Hive honeymoon can be cut short if you don't take the necessary precautions. In this talk I'll share my experience with Hive in the last 3 years (in Elastic MapReduce and Cloudera CDH3), describing what I got wrong the first time around, and what eventually saved the day. I've used Hive in environments with a number of events ranging from a few million to a few billion a day, so hopefully there'll be something for everyone.
2. Who am I?
Pedro Figueiredo (pfig@89clouds.com)
Hadoop et al
Social games (Facebook), media (TV, publishing)
Elastic MapReduce, Cloudera
NoSQL, as in “Not a SQL guy”
4. No, seriously
SELECT
CONCAT(vishi,vislo),
SUM(
CASE WHEN searchengine = 'google'
THEN 1
ELSE 0
END
) AS google_searches
FROM omniture
WHERE
year(hittime) = 2011 AND
month(hittime) = 8 AND
is_search = 'Y'
GROUP BY CONCAT(vishi,vislo);
5. “It’s just like Oracle!”
Analysts will be very happy
At least until they join with that 30 billion-record table
Pro tip: explain MapReduce, and then MAPJOIN
set hive.mapjoin.smalltable.filesize=xxx;
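A MAPJOIN hint looks roughly like this (table and column names here are illustrative, not from the talk); the small table is loaded into memory on every mapper, so the join skips the shuffle entirely:

-- d is the small table: it must fit under hive.mapjoin.smalltable.filesize
SELECT /*+ MAPJOIN(d) */ f.vishi, d.name
FROM facts f
JOIN dim_small d ON (f.dim_id = d.id);

Later Hive versions can pick this plan automatically with set hive.auto.convert.join=true;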
6. Your first interview question
“Explain the difference between CREATE TABLE and CREATE EXTERNAL TABLE”
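The short answer: Hive owns the data of a managed table, but only the metadata of an external one. Illustrative DDL (hypothetical table names):

-- Managed: DROP TABLE deletes metadata AND the data in the warehouse directory
CREATE TABLE managed_logs (line STRING);

-- External: DROP TABLE deletes only the metadata; /data/logs is left intact
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/data/logs';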
7. Dynamic partitions
Partitions are the poor person’s indexes
Unstructured data is full of surprises
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
Plan your partitions ahead
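A minimal dynamic-partition load, with hypothetical table and column names; note the partition column comes last in the SELECT:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- ds is derived from the data itself; one partition is created per distinct value
INSERT OVERWRITE TABLE events PARTITION (ds)
SELECT user_id, event_name, event_date AS ds
FROM raw_events;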
8. Multi-vitamins
You can minimise input scans by using multi-table INSERTs:
FROM input
INSERT INTO TABLE output1 SELECT foo
INSERT INTO TABLE output2 SELECT bar;
9. Persistence, do you speak it?
External Hive metastore
Avoid the pain of cluster set-up
Use an RDS metastore if on AWS, an RDBMS otherwise
10 GB will get you a long way; this thing is tiny
10. Now you have 2 problems
Regular expressions are great, if you’re using a real programming language.
WHERE foo RLIKE '(a|b|c)' will hurt
WHERE foo='a' OR foo='b' OR foo='c'
Generate these statements if need be; it will pay off.
11. Avro
Serialisation framework (think Thrift/Protocol Buffers).
Avro container files are SequenceFile-like and splittable.
Built-in support for Snappy.
If using the LinkedIn SerDe, the table creation syntax changes.
12. Avro
CREATE EXTERNAL TABLE IF NOT EXISTS mytable
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE
'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/hadoop/avro/myschema.avsc')
STORED AS
INPUTFORMAT
'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT
'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION '/data/mytable'
;
13. MAKE! MONEY! FAST!
Use spot instances in EMR
Usually stick around until America wakes up
Brilliant for worker nodes
14. Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
set hive.exec.compress.intermediate=true;
set hive.exec.parallel=true;
20. To be or not to be
“Consider a traditional RDBMS”
At what size should we do this?
Hive is not an end, it’s the means
Data on HDFS/S3 is simply available, not “available to Hive”
Hive isn’t suitable for near-real-time
21. Hive != MapReduce
Don’t use Hive instead of native/streaming MapReduce
“I know, I’ll just stream this bit through a shell script!”
IMO, Hive excels at analysis and aggregation, so use it for that
https://www.facebook.com/note.php?note_id=470667928919
“Currently, if the total size of small tables is larger than 25MB, then the conditional task will choose the original common join to run. 25MB is a very conservative number and you can change this number with set hive.smalltable.filesize=30000000”
SELECT /* +mapjoin(f,b,g) */
set hive.auto.convert.join = true;
hive.smalltable.filesize, depending on version
set hive.mapjoin.localtask.max.memory.usage = 0.999;
Also, there’s no UPDATE; you can only overwrite a whole table, so use partitions
e.g., 20 games with 40 events with 5 attrs on average, per day (date=/game=/event=/attr=): 1.46M partitions per year (4000/day)
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
Avoid RECOVER PARTITIONS; generate a partition list and add them statically, or use a persistent metastore
Or INSERT OVERWRITE. Append (INSERT INTO) is only available from 0.8 onwards
Obviously works with partitions, static (with the value in the INSERT statement) or dynamic, but:
The dynamic partition columns must be specified last among the columns in the SELECT statement, and in the same order in which they appear in the PARTITION() clause
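A sketch of both forms, with hypothetical table and column names:

-- Static: the partition values are given in the statement
INSERT OVERWRITE TABLE events PARTITION (ds='2011-08-01', game='foo')
SELECT user_id, event_name FROM staging;

-- Dynamic: partition columns go last in the SELECT,
-- in the same order as in the PARTITION() clause
INSERT OVERWRITE TABLE events PARTITION (ds, game)
SELECT user_id, event_name, event_date, game_id FROM staging;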
Untagged data: since the schema is present when data is read, considerably less type information need be encoded with the data, resulting in smaller serialization size.
No manually-assigned field IDs: when a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
The schema (defined in JSON) is included in the data files
Hive >= 0.9.1
The new SerDe uses TBLPROPERTIES and avro.schema.url / avro.schema.literal; the SerDe class is org.apache.hadoop.hive.serde2.avro.AvroSerDe
Also, the statement order is important!
One more thing: 1.6.x won’t read files created with 1.7.x. CDH3 up to u3 comes with 1.6.0, so be conservative
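With the built-in SerDe, the table from slide 12 would look roughly like this (reusing the same hypothetical schema path; class names are the stock Hive ones, so check them against your version):

CREATE EXTERNAL TABLE IF NOT EXISTS mytable
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/mytable'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hadoop/avro/myschema.avsc');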
Look at the historical prices and bid above them
Regular price: $0.38, spot: $0.03
These give you the number of slots per node; adjust the above accordingly:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
Watch the memory you give the JVM if you change these.
mapred.output.compress.*
hive.exec.parallel.thread.number
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
When using an RDBMS, it’s much harder to get at your data from other tools