Your Hive honeymoon can be cut short if you don't take the necessary precautions. In this talk I'll share my experience with Hive in the last 3 years (in Elastic MapReduce and Cloudera CDH3), describing what I got wrong the first time around, and what eventually saved the day. I've used Hive in environments with a number of events ranging from a few million to a few billion a day, so hopefully there'll be something for everyone.
2. Who am I?
Pedro Figueiredo (pfig@89clouds.com)
Hadoop et al
Social games (Facebook), media (TV, publishing)
Elastic MapReduce, Cloudera
NoSQL, as in “Not a SQL guy”
4. No, seriously
SELECT
CONCAT(vishi,vislo),
SUM(
CASE WHEN searchengine = 'google'
THEN 1
ELSE 0
END
) AS google_searches
FROM omniture
WHERE
year(hittime) = 2011 AND
month(hittime) = 8 AND
is_search = 'Y'
GROUP BY CONCAT(vishi,vislo);
5. “It’s just like Oracle!”
Analysts will be very happy
At least until they join with that 30 billion-record table
Pro tip: explain MapReduce, and then MAPJOIN
set hive.mapjoin.smalltable.filesize=xxx;
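A MAPJOIN hint looks roughly like this (table and column names here are illustrative, not from the talk); the small table is loaded into memory on every mapper, so the join skips the shuffle entirely:

-- d is the small table: it must fit under hive.mapjoin.smalltable.filesize
SELECT /*+ MAPJOIN(d) */ f.vishi, d.name
FROM facts f
JOIN dim_small d ON (f.dim_id = d.id);

Later Hive versions can pick this plan automatically with set hive.auto.convert.join=true;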
6. Your first interview question
“Explain the difference between CREATE TABLE and CREATE EXTERNAL TABLE”
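The short answer: Hive owns the data of a managed table, but only the metadata of an external one. Illustrative DDL (hypothetical table names):

-- Managed: DROP TABLE deletes metadata AND the data in the warehouse directory
CREATE TABLE managed_logs (line STRING);

-- External: DROP TABLE deletes only the metadata; /data/logs is left intact
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/data/logs';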
7. Dynamic partitions
Partitions are the poor person’s indexes
Unstructured data is full of surprises
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
Plan your partitions ahead
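A minimal dynamic-partition load, with hypothetical table and column names; note the partition column comes last in the SELECT:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- ds is derived from the data itself; one partition is created per distinct value
INSERT OVERWRITE TABLE events PARTITION (ds)
SELECT user_id, event_name, event_date AS ds
FROM raw_events;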
8. Multi-vitamins
You can minimise input scans by using multi-table INSERTs:
FROM input
INSERT INTO TABLE output1 SELECT foo
INSERT INTO TABLE output2 SELECT bar;
9. Persistence, do you speak it?
External Hive metastore
Avoid the pain of cluster set-up
Use an RDS metastore if on AWS, an RDBMS otherwise
10 GB will get you a long way; this thing is tiny
10. Now you have 2 problems
Regular expressions are great, if you’re using a real programming language.
WHERE foo RLIKE '(a|b|c)' will hurt
WHERE foo='a' OR foo='b' OR foo='c'
Generate these statements if need be; it will pay off.
11. Avro
Serialisation framework (think Thrift/Protocol Buffers).
Avro container files are SequenceFile-like and splittable.
Built-in support for Snappy.
If using the LinkedIn SerDe, the table creation syntax changes.
12. Avro
CREATE EXTERNAL TABLE IF NOT EXISTS mytable
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE
'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/hadoop/avro/myschema.avsc')
STORED AS
INPUTFORMAT
'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT
'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION '/data/mytable'
;
13. MAKE! MONEY! FAST!
Use spot instances in EMR
Usually stick around until America wakes up
Brilliant for worker nodes
14. Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
set hive.exec.compress.intermediate=true;
set hive.exec.parallel=true;
20. To be or not to be
“Consider a traditional RDBMS”
At what size should we do this?
Hive is not an end, it’s the means
Data on HDFS/S3 is simply available, not “available to Hive”
Hive isn’t suitable for near-real-time
21. Hive != MapReduce
Don’t use Hive instead of native/streaming MapReduce
“I know, I’ll just stream this bit through a shell script!”
IMO, Hive excels at analysis and aggregation, so use it for that
https://www.facebook.com/note.php?note_id=470667928919
“Currently, if the total size of small tables is larger than 25MB, then the conditional task will choose the original common join to run. 25MB is a very conservative number and you can change this number with set hive.smalltable.filesize=30000000”
SELECT /* +mapjoin(f,b,g) */
set hive.auto.convert.join = true;
hive.smalltable.filesize, depending on version
set hive.mapjoin.localtask.max.memory.usage = 0.999;
Also, there’s no UPDATE; you can only overwrite a whole table, so use partitions
e.g., 20 games with 40 events with 5 attrs on average, per day (date=/game=/event=/attr=): 1.46M partitions per year (4000/day)
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
Avoid RECOVER PARTITIONS; generate a partition list and add them statically, or use a persistent metastore
Or INSERT OVERWRITE. Append (INSERT INTO) is only available from 0.8 onwards
Obviously works with partitions, static (with the value in the INSERT statement) or dynamic, but:
The dynamic partition columns must be specified last among the columns in the SELECT statement, and in the same order in which they appear in the PARTITION() clause
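A sketch of both forms, with hypothetical table and column names:

-- Static: the partition values are given in the statement
INSERT OVERWRITE TABLE events PARTITION (ds='2011-08-01', game='foo')
SELECT user_id, event_name FROM staging;

-- Dynamic: partition columns go last in the SELECT,
-- in the same order as in the PARTITION() clause
INSERT OVERWRITE TABLE events PARTITION (ds, game)
SELECT user_id, event_name, event_date, game_id FROM staging;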
Untagged data: since the schema is present when data is read, considerably less type information need be encoded with the data, resulting in smaller serialization size.
No manually-assigned field IDs: when a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
The schema (defined in JSON) is included in the data files
Hive >= 0.9.1
The new SerDe uses TBLPROPERTIES and avro.schema.url / avro.schema.literal; the SerDe class is org.apache.hadoop.hive.serde2.avro.AvroSerDe
Also, the statement order is important!
One more thing: 1.6.x won’t read files created with 1.7.x. CDH3 up to u3 comes with 1.6.0, so be conservative
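With the built-in SerDe, the table from slide 12 would look roughly like this (reusing the same hypothetical schema path; class names are the stock Hive ones, so check them against your version):

CREATE EXTERNAL TABLE IF NOT EXISTS mytable
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/mytable'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hadoop/avro/myschema.avsc');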
Look at the historical prices and bid above them
Regular price: $0.38, spot: $0.03
These give you the number of slots per node; adjust the above accordingly:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
Watch the memory you give the JVM if you change these.
mapred.output.compress.*
hive.exec.parallel.thread.number
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
When using an RDBMS, it’s much harder to get at your data from other tools