This document discusses strategies for handling mutable data in Hive's immutable data model. It presents several approaches including full refresh, full merge and replace, and partition-level merge and replace. The partition-level strategies allow merging incremental data updates into existing partitions in Hive tables. The document provides examples using Pig to filter, join, and load data to demonstrate performing incremental updates at the partition level. It evaluates the tradeoffs of different strategies based on data volumes and change rates.
1. Page1
Mutable Data in Hive’s Immutable World
Lester Martin – Hortonworks 2015 Hadoop Summit
2. Page2
Connection before Content
Lester Martin – Hortonworks Professional Services
lmartin@hortonworks.com || lester.martin@gmail.com
http://lester.website (links to blog, twitter,
github, LI, FB, etc)
3. Page3
“Traditional” Hadoop Data
Time-Series Immutable (TSI) Data – Hive’s sweet spot
Going beyond web logs to more exotic data such as:
Vehicle sensors (ground, air, above/below water – space!)
Patient data (to include the atmosphere around them)
Smart phone/watch (TONS of info)
SOURCES: Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
4. Page4
Good TSI Solutions Exist
Hive partitions
•Store as much as you want
•Only read the files you need
Hive Streaming Data Ingest from Flume or Storm
Sqoop’s --incremental mode of append
•Use appropriate --check-column
•“Saved Job” remembering --last-value
5. Page5
Use Case for an Active Archive
Evolving Domain Data – Hive likes immutable data
Need exact copy of mutating tables refreshed periodically
•Structural replica of multiple RDBMS tables
•The data in these tables are being updated
•Don’t need every change; just “as of” content
SOURCES: Existing Systems (ERP, CRM, SCM, eComm)
6. Page6
Start With a Full Refresh Strategy
The epitome of the KISS principle
•Ingest & load new data
•Drop the existing table
•Rename the newly created table
Surely not elegant, but solves the problem until the reload
takes longer than the refresh period
7. Page7
Then Evolve to a Merge & Replace Strategy
Typically, deltas are…
•Small % of existing data
•Plus, some totally new records
In practice, the difference in size between the two circles
is often much more pronounced
8. Page8
Requirements for Merge & Replace
An immutable unique key
•To determine if an addition or a change
•The source table’s (natural or surrogate) PK is perfect
A last-updated timestamp to find the deltas
Leverage Sqoop’s --incremental mode of lastmodified to identify the deltas
•Use appropriate --check-column
•“Saved Job” remembering --last-value
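The role of the last-updated timestamp can be shown with a minimal Python sketch. It is not Sqoop's implementation: the rows, the `find_deltas` helper, and the way the next high-water mark is chosen (newest timestamp seen) are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical source rows: (primary key, last-updated timestamp, payload)
source_rows = [
    (11, datetime(2014, 9, 17, 8, 0), "base"),
    (12, datetime(2014, 9, 20, 9, 30), "CHANGED"),
    (20, datetime(2014, 9, 20, 10, 0), "new"),
]

def find_deltas(rows, last_value):
    """Return rows whose last-updated timestamp is later than last_value."""
    return [row for row in rows if row[1] > last_value]

# High-water mark remembered from the previous run (assumed value).
last_value = datetime(2014, 9, 19, 0, 0)
deltas = find_deltas(source_rows, last_value)

# Assumption: the next run starts from the newest timestamp seen in this run.
new_last_value = max((row[1] for row in deltas), default=last_value)

print([row[0] for row in deltas])
```

Key 12 is a change to an existing row and key 20 is an addition; only the immutable key lets the merge tell the two apart.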
9. Page9
Processing Steps for Merge & Replace
See blog at http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/,
but note that merge can be done in multiple technologies, not just Hive
Ingest: bring over the incremental data
Reconcile: perform the merge
Compact: replace the existing data with the newly merged content
Purge: clean up & prepare to repeat
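The Reconcile step can be sketched in plain Python, assuming each row carries an `id` key and a `modified` timestamp (both field names are hypothetical) and that the newest row per key wins. This is one possible reading of the merge, not the code from the referenced blog.

```python
def reconcile(base_rows, incremental_rows):
    """Keep, for each key, the row with the newest modified timestamp."""
    merged = {}
    for row in list(base_rows) + list(incremental_rows):
        current = merged.get(row["id"])
        # ISO-8601 date strings compare correctly as plain strings.
        if current is None or row["modified"] >= current["modified"]:
            merged[row["id"]] = row
    return sorted(merged.values(), key=lambda r: r["id"])

base = [
    {"id": 1, "modified": "2014-09-17", "value": "base"},
    {"id": 2, "modified": "2014-09-17", "value": "base"},
]
incremental = [
    {"id": 2, "modified": "2014-09-20", "value": "CHANGED"},
    {"id": 3, "modified": "2014-09-20", "value": "new"},
]

reconciled = reconcile(base, incremental)
print([(r["id"], r["value"]) for r in reconciled])
```

Compact then amounts to swapping `reconciled` in for the existing data, and Purge to discarding `incremental` before the next run.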
10. Page10
Full Merge & Replace Will NOT Scale
The “elephant” eventually gets too big
and merging it with the “mouse” takes
too long!
Example: A Hive structure with 100
billion rows, but only 100,000 delta
records
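The arithmetic behind the example is worth spelling out. The short sketch below uses only the two numbers from the slide and assumes a full merge has to read and rewrite every existing row.

```python
existing_rows = 100_000_000_000   # 100 billion rows already in Hive
delta_rows = 100_000              # rows in the incoming delta

# Share of the table that actually changed.
delta_fraction = delta_rows / existing_rows

# Existing rows rewritten for every delta row in a full merge.
rows_rewritten_per_delta = existing_rows / delta_rows

print(f"changed share of table: {delta_fraction:.6%}")
print(f"rows rewritten per delta row: {rows_rewritten_per_delta:,.0f}")
```

Roughly a million untouched rows get rewritten for every row that changed, which is the "elephant and mouse" problem in numbers.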
12. Page12
But… One Size Does NOT Fit All…
Not everything is “big” – in fact, most operational apps’
tables are NOT too big for a simple Full Refresh
Divide & Conquer requires additional per-table research
to ensure the best partitioning strategy is decided upon
13. Page13
Criteria for Active Archive Partition Values
Non-nullable & immutable
Ensures sliding scale growth with new records generally
creating new partitions
Supports delta records being skewed such that the
percentage of partitions needing merge & replace
operations is relatively small
Classic value is (still) “Date Created”
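One way to compare candidate partition columns is to measure what share of partitions a typical delta would touch. The sketch below uses made-up data (100 daily partitions, five city names) purely to illustrate the criterion; `touched_fraction` is a hypothetical helper.

```python
from datetime import date, timedelta

def touched_fraction(all_partitions, delta_partitions):
    """Fraction of existing partitions that a delta would force us to rewrite."""
    existing = set(all_partitions)
    return len(existing & set(delta_partitions)) / len(existing)

# Made-up example: 100 daily partitions, deltas landing in the last 3 days.
days = [date(2014, 6, 12) + timedelta(days=i) for i in range(100)]
by_date = touched_fraction(days, days[-3:])

# Made-up example: 5 city partitions, deltas spread across every city.
cities = ["Atlanta", "Boston", "Chicago", "Dallas", "Denver"]
by_city = touched_fraction(cities, cities)

print(by_date, by_city)
```

A date-created key keeps the touched share small, while a key such as city tends to spread deltas over every partition and removes the benefit.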
15. Page15
Partition-Level Merge & Replace Steps
Generate the delta file
Create list of affected partitions
Perform merge & replace operations for affected partitions
1. Filter the delta file for the current partition
2. Load the Hive table’s current partition
3. Merge the two datasets
4. Delete the existing partition
5. Recreate the partition with the merged content
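The steps above can be sketched in memory before looking at the Pig and Hive version. In this Python sketch a dict stands in for the Hive table (partition value mapped to rows keyed by id), and the delta rows are hypothetical; it shows the control flow only, not how Hive stores data.

```python
from collections import defaultdict

def partition_merge_replace(table, delta_rows):
    """table maps partition value -> {id: row}; row = (id, date_created, f1, f2, f3)."""
    delta_by_partition = defaultdict(list)
    for row in delta_rows:
        delta_by_partition[row[1]].append(row)
    affected = sorted(delta_by_partition)       # list of affected partitions

    for part in affected:
        delta_part = delta_by_partition[part]   # 1. filter the delta
        current = table.get(part, {})           # 2. load the current partition
        merged = dict(current)                  # 3. merge: delta rows win
        for row in delta_part:
            merged[row[0]] = row
        table.pop(part, None)                   # 4. delete the existing partition
        table[part] = merged                    # 5. recreate it with merged content
    return affected

table = {
    "2014-09-17": {11: (11, "2014-09-17", "base", "base", "base"),
                   12: (12, "2014-09-17", "base", "base", "base")},
    "2014-09-18": {14: (14, "2014-09-18", "base", "base", "base")},
}
delta = [
    (12, "2014-09-17", "base", "CHANGED", "base"),   # update to an existing row
    (20, "2014-09-20", "base", "base", "base"),      # new row, new partition
]
untouched = table["2014-09-18"]
affected = partition_merge_replace(table, delta)
print(affected)
```

The 2014-09-18 partition is never read or rewritten, which is the whole point of working at the partition level.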
16. Page16
What Does This Approach Look Like?
A Lightning-Fast Review of an Indicative Hybrid Pig-Hive Example
17. Page17
One-Time: Create the Table
CREATE TABLE bogus_info(
bogus_id int,
field_one string,
field_two string,
field_three string)
PARTITIONED BY (date_created STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");
18. Page18
One-Time: Get Content from the Source
11,2014-09-17,base,base,base
12,2014-09-17,base,base,base
13,2014-09-17,base,base,base
14,2014-09-18,base,base,base
15,2014-09-18,base,base,base
16,2014-09-18,base,base,base
17,2014-09-19,base,base,base
18,2014-09-19,base,base,base
19,2014-09-19,base,base,base
19. Page19
One-Time: Read Content from HDFS
as_recd = LOAD '/user/fred/original.txt'
USING PigStorage(',') AS
(
bogus_id:int,
date_created:chararray,
field_one:chararray,
field_two:chararray,
field_three:chararray
);
20. Page20
One-Time: Sort and Insert into Hive Table
sorted_as_recd = ORDER as_recd BY
date_created, bogus_id;
STORE sorted_as_recd INTO 'bogus_info'
USING
org.apache.hcatalog.pig.HCatStorer();
21. Page21
One-Time: Verify Data are Present
hive> select * from bogus_info;
11 base base base 2014-09-17
12 base base base 2014-09-17
13 base base base 2014-09-17
14 base base base 2014-09-18
15 base base base 2014-09-18
16 base base base 2014-09-18
17 base base base 2014-09-19
18 base base base 2014-09-19
19 base base base 2014-09-19
24. Page24
Read Delta File from HDFS
delta_recd = LOAD '/user/fred/delta1.txt'
USING PigStorage(',') AS
(
bogus_id:int,
date_created:chararray,
field_one:chararray,
field_two:chararray,
field_three:chararray
);
25. Page25
Create List of Affected Partitions
by_grp = GROUP delta_recd BY date_created;
part_names = FOREACH by_grp GENERATE group;
srtd_part_names = ORDER part_names BY group;
STORE srtd_part_names INTO
'/user/fred/affected_partitions';
26. Page26
Loop/Multithread Through Affected Partitions
Pig doesn’t really help you with this problem
This indicative example could be implemented as:
•A simple script that loops through the partitions
•A Java program that multi-threads the partition-aligned processing
Multiple “Control Structures” options exist as described at
http://pig.apache.org/docs/r0.14.0/cont.html
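A "simple script that loops through the partitions" could look like the Python sketch below. The script names `merge_partition.pig` and `reload_partition.pig` are hypothetical, as is splitting the loop steps into two Pig scripts around a Hive call; the sketch only builds the command lines and leaves out any HCatalog classpath setup.

```python
import shlex

# Hypothetical script names; the deck does not name its Pig scripts.
MERGE_SCRIPT = "merge_partition.pig"
RELOAD_SCRIPT = "reload_partition.pig"

def build_commands(partition_key):
    """Return the three commands for one partition, in execution order."""
    param = f"partition_key={partition_key}"
    drop = (
        "ALTER TABLE bogus_info DROP IF EXISTS "
        f"PARTITION (date_created='{partition_key}')"
    )
    return [
        ["pig", "-param", param, "-f", MERGE_SCRIPT],    # filter, load, merge, store temp
        ["hive", "-e", drop],                            # delete the partition
        ["pig", "-param", param, "-f", RELOAD_SCRIPT],   # recreate the partition
    ]

def plan(partition_lines):
    """Turn lines of the affected-partitions output into a flat command list."""
    keys = [line.strip() for line in partition_lines if line.strip()]
    return [cmd for key in keys for cmd in build_commands(key)]

commands = plan(["2014-09-17\n", "2014-09-18\n", "2014-09-20\n"])
for cmd in commands:
    print(shlex.join(cmd))
```

Each entry can be handed to `subprocess.run(cmd, check=True)` so that a failed step stops the loop before a partition is dropped without its replacement being ready.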
27. Page27
Loop Step: Filter on the Current Partition
delta_recd = LOAD '/user/fred/delta1.txt'
USING PigStorage(',') AS
( bogus_id:int, date_created:chararray,
field_one:chararray,
field_two:chararray,
field_three:chararray );
deltaP = FILTER delta_recd BY date_created
== '$partition_key';
28. Page28
Loop Step: Retrieve Hive’s Current Partition
all_bogus_info = LOAD 'bogus_info' USING
org.apache.hcatalog.pig.HCatLoader();
tblP = FILTER all_bogus_info
BY date_created == '$partition_key';
29. Page29
Loop Step: Merge the Datasets
partJ = JOIN tblP BY bogus_id FULL OUTER,
deltaP BY bogus_id;
combined_part = FOREACH partJ GENERATE
((deltaP::bogus_id is not null) ?
deltaP::bogus_id : tblP::bogus_id) as
bogus_id, /* do for all fields
and end with ";" */
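The join-and-pick logic can be traced in plain Python. This sketch mirrors the per-field "delta value if not null, else table value" rule of the FOREACH on dict rows; the sample rows are hypothetical and the function is an illustration, not a replacement for the Pig script.

```python
FIELDS = ("bogus_id", "date_created", "field_one", "field_two", "field_three")

def full_outer_merge(tbl_rows, delta_rows):
    """Full outer join on bogus_id, preferring non-null delta values per field."""
    tbl = {row["bogus_id"]: row for row in tbl_rows}
    delta = {row["bogus_id"]: row for row in delta_rows}
    merged = []
    for key in sorted(tbl.keys() | delta.keys()):
        t = tbl.get(key, {})
        d = delta.get(key, {})
        merged.append(
            {f: d[f] if d.get(f) is not None else t.get(f) for f in FIELDS}
        )
    return merged

def make_row(bogus_id, field_two):
    return {"bogus_id": bogus_id, "date_created": "2014-09-17",
            "field_one": "base", "field_two": field_two, "field_three": "base"}

tblP = [make_row(11, "base"), make_row(12, "base"), make_row(13, "base")]
deltaP = [make_row(12, "CHANGED"), make_row(13, None)]

combined = full_outer_merge(tblP, deltaP)
print([(r["bogus_id"], r["field_two"]) for r in combined])
```

One consequence of the per-field null check is visible for id 13: a delta that sets a column to NULL does not overwrite the existing value. If source columns can legitimately become NULL, choosing the whole delta row whenever its key is present may be the safer rule.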
30. Page30
Loop Step: Sort and Save the Merged Data
s_combined_part = ORDER combined_part BY
date_created, bogus_id;
STORE s_combined_part INTO
'/user/fred/temp_$partition_key' USING PigStorage(',');
hdfs dfs -cat temp_2014-09-17/part-r-00000
11,2014-09-17,base,base,base
12,2014-09-17,base,CHANGED,base
13,2014-09-17,base,base,base
31. Page31
Loop Step: Delete the Partition
ALTER TABLE bogus_info DROP IF EXISTS
PARTITION (date_created='2014-09-17');
32. Page32
Loop Step: Recreate the Partition
to_load = LOAD '/user/fred/temp_$partition_key'
USING PigStorage(',') AS
( bogus_id:int, date_created:chararray,
field_one:chararray,
field_two:chararray,
field_three:chararray );
STORE to_load INTO 'bogus_info' using
org.apache.hcatalog.pig.HCatStorer();
33. Page33
Verify the Loop Step Updates
select * from bogus_info
where date_created = '2014-09-17';
11 base base base 2014-09-17
12 base CHANGED base 2014-09-17
13 base base base 2014-09-17
34. Page34
My Head Hurts, Too!
As Promised, We Flew Through That – Take Another Look Later
35. Page35
What Does Merge & Replace Miss?
Hard deletes: rows removed from the source never show up in a delta
If critical, you have options
•Create a delete table sourced by a trigger
•At some infrequent interval, start all over with a Full Refresh
Fortunately, ~most~ enterprises don’t delete anything
Marking items “inactive” is popular
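If a trigger-fed delete table is available, applying it after the merge is a simple key filter. The Python sketch below assumes the delete table boils down to a list of deleted primary keys; the names and rows are hypothetical.

```python
def apply_deletes(merged_rows, deleted_keys):
    """Drop merged rows whose key appears in the delete table."""
    deleted = set(deleted_keys)
    return [row for row in merged_rows if row[0] not in deleted]

merged_rows = [
    (11, "2014-09-17", "base", "base", "base"),
    (12, "2014-09-17", "base", "CHANGED", "base"),
    (13, "2014-09-17", "base", "base", "base"),
]
deleted_keys = [13]   # keys captured by the source-side delete trigger

remaining = apply_deletes(merged_rows, deleted_keys)
print([row[0] for row in remaining])
```

For the partition-level strategy, the delete table would also need the partition value so that partitions containing only deletes still make it onto the affected list.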
36. Page36
Hybrid: Partition-Level Refresh
If most of the partition is modified, just replace it entirely
Especially if the changes are only recent (or highly skewed)
Use a configured number of partitions to refresh and
assume the rest of the data is static
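Picking the "configured number of partitions" is straightforward when the partition value sorts chronologically, as ISO-formatted dates do. A minimal Python sketch, with a hypothetical helper name and sample partition values:

```python
def partitions_to_refresh(partition_values, refresh_count):
    """Return the newest refresh_count partitions, oldest first."""
    if refresh_count <= 0:
        return []
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return sorted(partition_values)[-refresh_count:]

existing = ["2014-09-19", "2014-09-17", "2014-09-20", "2014-09-18"]
recent = partitions_to_refresh(existing, 2)
print(recent)
```

Everything outside the returned list is assumed static, so the refresh count has to cover the longest period over which source rows are still expected to change.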
37. Page37
Active Archive Strategy Review
strategy                        | # of rows   | % of chg | chg skew | handles deletes | complexity
Full Refresh                    | <= millions | any      | any      | yes             | simple
Full Merge & Replace            | <= millions | any      | any      | no              | moderate
Partition-Level Merge & Replace | billions +  | < 5%     | < 5%     | no              | complex
Partition-Level Refresh         | billions +  | < 5%     | < 5%     | yes             | complex
38. Page38
Isn’t There Anything Easier?
HIVE-5317 brought us Insert, Update & Delete
•Alan Gates presented Monday
•More tightly-coupled w/o the same “hazard windows”
•“Driver” logic shifts to be delta-only & row-focused
Thoughts & attempts at true DB replication
•Some COTS solutions have been tried
•Ideally, an open-source alternative is best such as enhancing the
Streaming Data Ingest framework
39. Page39
Considerations for HIVE-5317
On performance & scalability; your mileage may vary
Does NOT make Hive a RDBMS
Available in Hive 0.14 onwards
DDL requirements
•Must utilize bucketing; partitioning is optional
•Initially, only supports ORC
40. Page40
Recommendations
Take another look at this topic once back at “your desk”
As with all things Hadoop…
•Know your data & workloads
•Try several approaches & evaluate results in earnest
•Stick with the KISS principle whenever possible
Share your findings via blogs and local user groups
Expect (even more!) great things from Hive
41. Page41
Questions?
THANKS FOR YOUR TIME!!
Editor's Notes
This diagram shows the typical use case where the deltas represent a small percentage of the existing data as well as the addition of some records that are only present in the delta dataset. In practice, the differences in size of these two circles would be much more pronounced as the existing data includes historical records that have not been modified in a long time and most likely will not be modified again.
If no last-updated timestamp is present, an ALTER TABLE could add such a column with a DEFAULT of the current timestamp so that new records populate it, and an “ON UPDATE” trigger could be created to keep the timestamp current whenever a row changes.
You can use other tools than just Hive, such as Pig, to perform these operations.
Pig approach could be:
Ingest the data (like above)
In a single Pig script: read old & new, merge them, load into “resolved” table
Drop old table, rename new table, and recreate “resolved” table (for next run)
When there is a small number of actual changes compared with a large number of unchanged records, the incremental data ingest step limits the amount of data that needs to be transferred across the network as well as the amount of raw data that needs to be persisted in HDFS. There will be a point when merging a single delta file with a much larger file of existing records takes longer than just getting the full copy or, at least, when the merge & replace processing becomes too lengthy to be useful.
An example of this could be a model where the existing data is around 100 billion rows, but there are only 100,000 delta records.
The classic 80/20 rule applies (maybe even 95/5) to the split between tables that don’t need partitioning to benefit from the merge & replace strategy and those that do.
In fact, nothing would prevent a table that uses Full Refresh or comprehensive Merge & Replace from having partitions.
Bad examples would include a building identifier or the city or zip code that buildings are located within. We could get a wide spread of the data, but delta records would likely cover most, if not all, of the partitions.
The goal is to break down the delta processing at the partition level and to be able to focus only on a subset of the overall partitions. In many ways, the processing is much like the basic merge & replace approach, except that several much smaller iterations will occur while leaving the vast majority of the partitions alone.
We’ll go VERY FAST through this during the presentation, but it will be useful to review later.
2014-09-17
2014-09-18
2014-09-20
Here it is in its entirety… (simple db metadata reading & template builders could be created to automate this gnarly big FOREACH statement)
combined_part = FOREACH partJ GENERATE
((deltaP::bogus_id is not null) ? deltaP::bogus_id : tblP::bogus_id) as bogus_id,
((deltaP::date_created is not null) ? deltaP::date_created : tblP::date_created) as date_created,
((deltaP::field_one is not null) ? deltaP::field_one : tblP::field_one) as field_one,
((deltaP::field_two is not null) ? deltaP::field_two : tblP::field_two) as field_two,
((deltaP::field_three is not null) ? deltaP::field_three : tblP::field_three) as field_three;
We’ll go VERY FAST through this during the presentation, but it will be useful to review later.
Also, this is all rather generalizable so individual projects can build a simple framework to drive via metadata.
Not a calculator, rule, or formula – just to drive the conversation of where each strategy might work best
COTS solutions exist from folks like Oracle, Dell & SAP, but none that I have evaluated
Not picking on the CRUD operations performance & scalability opportunities, just pointing out that many variables are at play which could make things better or worse