EPAM. Hadoop MR streaming in Hive

Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process. This presentation shows how to use the Streaming feature in Hive to reduce code complexity, with a real-life example.

  1. 1. Hadoop MR Streaming in Hive: use case with Hive and Python from real life. Yauheni Yushyn, EPAM Systems – September 2014
  2. 2. Agenda • Intro • Pros and Cons • Hive reference • Use case from Real Life • Possible solutions • Hive Streaming: Architecture • Hive Streaming: Realization • Hive Streaming: Source code • Hive Streaming: Debug • Hive Streaming: Pitfalls • Hive Streaming: Benchmarks
  3. 3. SECTION Hadoop MR Streaming in Hive CONCEPTS
  4. 4. Intro Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process. Unix-like interface: • Streaming API opens an I/O pipe to an external process • Process reads data from standard input and writes the results to standard output By default, INPUT for the user script: • columns are transformed to STRING • delimited by TAB • NULL values are converted to the literal string \N (to differentiate NULL values from empty strings) OUTPUT of the user script: • treated as TAB-separated STRING columns • \N is re-interpreted as NULL • each resulting STRING column is cast to the data type specified in the table declaration These defaults can be overridden with ROW FORMAT
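The default contract above can be sketched in a few lines of Python 3. This is a minimal illustration, not part of the original deck: the helper names (`decode_row`, `encode_row`, `transform`) and the upper-casing rule are hypothetical, and in-memory streams stand in for the real stdin/stdout pipe.

```python
import io

HIVE_NULL = r"\N"  # default textual marker Hive uses for NULL


def decode_row(line, separator="\t"):
    """Split one input line into fields, mapping the NULL marker to None."""
    fields = line.rstrip("\n").split(separator)
    return [None if field == HIVE_NULL else field for field in fields]


def encode_row(fields, separator="\t"):
    """Join fields into one output line, mapping None back to the NULL marker."""
    return separator.join(HIVE_NULL if field is None else str(field) for field in fields)


def transform(stdin, stdout):
    """Hypothetical transform: upper-case the first column, pass the rest through."""
    for line in stdin:
        row = decode_row(line)
        if row[0] is not None:
            row[0] = row[0].upper()
        stdout.write(encode_row(row) + "\n")


# Local demonstration with in-memory streams instead of the real pipe.
sample = io.StringIO("iah\t\\N\t\t520.99\n")
result = io.StringIO()
transform(sample, result)
print(repr(result.getvalue()))
```

Note that the NULL marker and the empty string stay distinct through the round trip, which is the reason Hive uses `\N` in the first place.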
  5. 5. Pros and Cons Pros: • Simplicity for the developer, who deals only with stdin/stdout • Schema-less model, treat values as needed • Non-Java interface Cons: • Serialization/deserialization overhead between processes • Disallowed when "SQL standard based authorization" is configured (Hive 0.13.0 and later releases)
  6. 6. Hive reference • MAP() • REDUCE() • TRANSFORM() Hive provides several clauses to use streaming: MAP(), REDUCE(), and TRANSFORM(). Note: MAP() does not actually force streaming during the map phase, nor does REDUCE() force streaming to happen in the reduce phase. For this reason, the functionally equivalent yet more generic TRANSFORM() clause is suggested, to avoid misleading the reader of the query.
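To show what a TRANSFORM() call looks like, here is a small Python 3 helper that assembles the HiveQL text; the query is kept in a Python string so that all examples stay in one language. Everything concrete in it is an assumption made for illustration: the script path, the table name `shop_results`, and the column names are hypothetical.

```python
import os


def build_transform_query(script_path, source_table, in_columns, out_columns):
    """Return HiveQL text that registers the script and streams rows through it.

    out_columns is a list of (name, hive_type) pairs describing the script output.
    """
    script_name = os.path.basename(script_path)
    return (
        "ADD FILE {path};\n"
        "SELECT TRANSFORM ({cols})\n"
        "  USING 'python {name}'\n"
        "  AS ({schema})\n"
        "FROM {table};"
    ).format(
        path=script_path,
        cols=", ".join(in_columns),
        name=script_name,
        schema=", ".join("{} {}".format(name, hive_type) for name, hive_type in out_columns),
        table=source_table,
    )


# Hypothetical names, for illustration only.
query = build_transform_query(
    "/tmp/flag_mapper.py",
    "shop_results",
    ["origin", "destination", "split_ticket_flag"],
    [("origin", "STRING"), ("destination", "STRING"), ("reason", "STRING")],
)
print(query)
```

`ADD FILE` ships the script to the cluster nodes, and the `AS (...)` clause declares the names and types that Hive applies to the script's TAB-separated output.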
  7. 7. SECTION Hadoop MR Streaming in Hive USE CASE
  8. 8. Use case from Real Life Requirements: There are 14 flags in the source Hive table, which control the output values of 4 new fields in the target table Solutions: • Hive "case … when" clause • User Defined Function (UDF) • Custom MR Job • Hive Streaming
  9. 9. Use case from Real Life: Requirements
  10. 10. Hive "case … when" clause • More than 1,500 lines of code to map flags to the new fields (the statement repeats for every new output field) • Hard to debug • Fast execution • SQL-like syntax • All logic in one place (HQL script)
  11. 11. UDF • You are the single consumer of the UDF (in this particular case, custom logic for a single DataMart) • Java code • Fast execution • Pass only the needed flags into the UDF (in contrast with Hive Streaming) • In the end: SQL-like syntax, all logic in one place • Java code
  12. 12. Hive Streaming • Slower execution (time spent on SerDe) • Deals with all fields, not only flags (in contrast with UDF) • Reduced code complexity thanks to a scripting language • Small amount of code • Fast development • Wide choice of programming languages
  13. 13. SECTION Hadoop MR Streaming in Hive REALIZATION
  14. 14. Hive Streaming: Architecture
  15. 15. Hive Streaming: Realization Python snippets: • Create a matrix (e.g., a list of tuples) of flags and the related field values • Loop through INPUT • Split INPUT by TAB • Separate data fields from flags • Compare the flags with the matrix and take the best possible match • Emit the data with the new fields as TAB-separated text
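The steps above can be sketched on a toy matrix. This Python 3 example is a reduced illustration under stated assumptions, not the production script: it uses 3 flags and 2 output fields instead of 14 and 4, and the names `match_score`, `best_match`, `transform_line` and `TOY_MATRIX` are hypothetical.

```python
WILDCARD = "-"  # a matrix position that accepts any flag value


def match_score(flags, pattern):
    """Return the number of exactly matched positions, or None on any mismatch."""
    score = 0
    for flag, expected in zip(flags, pattern):
        if expected == WILDCARD:
            continue
        if flag != expected:
            return None
        score += 1
    return score


def best_match(flags, matrix, default):
    """Pick the labels of the pattern that matches the most non-wildcard positions."""
    best_labels, best_score = default, 0
    for labels, pattern in matrix:
        score = match_score(flags, pattern)
        if score is not None and score > best_score:
            best_labels, best_score = labels, score
    return best_labels


# Toy vocabulary: (output field values, flag pattern).
TOY_MATRIX = [
    (["Inventory", "Unknown"], ["0", "-", "-"]),
    (["Price", "Unknown"], ["1", "0", "-"]),
    (["Price", "Fees or charges"], ["1", "0", "1"]),
]


def transform_line(line, n_flags=3, separator="\t"):
    """Replace the trailing flag columns of one TAB-separated line with output fields."""
    fields = line.rstrip("\n").split(separator)
    data, flags = fields[:-n_flags], fields[-n_flags:]
    labels = best_match(flags, TOY_MATRIX, ["DEFAULT", "DEFAULT"])
    return separator.join(data + labels)


print(transform_line("IAH\tCUN\t1\t0\t1"))
```

For the flags `1, 0, 1` both "Price" patterns match, and the one with three exact matches wins over the one with two, which is the "best possible match" rule from the slide.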
  16. 16. Hive Streaming: Source code

          #!/usr/bin/env python
          """Mapper for Hive Streaming, using Python iterators and generators.
          Spill out new fields in accordance with input flags."""

          import sys
          import logging


          def read_input(file):
              """Read data from STDIN using python generator"""
              #yield "IAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEpam.COM"
              #yield "IAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t\N1\t\N1\t\N\t\N\t\N\t\N\t\N\t1\t\N\t\N\t\N\t1\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t2014-01-01\tEpam.COM"
              for line in file:
                  yield line.strip()


          def compare_flags(source, target):
              """Compare flags from source and target lists. Src/trg should have the same size"""
              size = len(source)
              out = list()
              # Go through elements, add 0 to OUT list if src/trg elements are equal
              for i in range(size):
                  if target[i] != '-':
                      if target[i] == source[i]:
                          out.append(0)
                      else:
                          #logging.debug("Position: %i. Values of src/trg not equals, skip: %s,%s" % (i, source[i], target[i]))
                          return None
                          #out.append(1)
                  else:
                      out.append('-')
              return out


          def main(separator='\t'):
              column_list = ["ORIGIN","DESTINATION","OND","CARRIER","LOS","BKG_WINDOW","LOCAL_CURRENCY","LOWEST_PRICE","PAGE","POSITION",
                             "XP_RANK","XP_PRICE","XP_COMPETED","XP_PRICE_DIFF","BML","NUMBER_SELLERS","XP_IS_HERO","ECPC_LOSS","PRICE_LOSS",
                             "OTA_1","OTA_1_PRICE","OTA_2","OTA_2_PRICE","OTA_3","OTA_3_PRICE","OTA_4","OTA_4_PRICE","OTA_5","OTA_5_PRICE",
                             "OTA_6","OTA_6_PRICE","OTA_7","OTA_7_PRICE","OTA_8","OTA_8_PRICE","OTA_9","OTA_9_PRICE","OTA_10","OTA_10_PRICE",
                             "OTA_11","OTA_11_PRICE","OTA_12","OTA_12_PRICE","OTA_13","OTA_13_PRICE","OTA_14","OTA_14_PRICE","OTA_15","OTA_15_PRICE",
                             "PARTNER_NAME","RCXR","DCXR","SPLIT_TICKET","DEPARTURE_DURATION","RETURN_DURATION","DEPARTURE_STOPS","RETURN_STOPS"]
              flag_list = ["exp_listed_on_route_flag","exp_listed_on_carr_flag","exp_lst_on_itin_flag","carr_is_seller_flag",
                           "more_than_1_seller_flag","split_ticket_flag","exp_in_hero_flag","ota_in_hero_flag","meta_in_hero_flag",
                           "carr_in_hero_flag","cheapest_prc_is_unique_flag","exp_prc_match_carr_flag","exp_prc_match_cheapest_flag",
                           "cheapest_ota_meta_prc_match_carr_flag"]
              partition_list = ["SHOP_DATE", "PARTNER_POS"]

              logging.debug("Start specifying vocabulary matrix")
              # Each entry: (values of the 4 new fields, pattern of the 14 flags; '-' matches any value)
              target = [
                  (["Inventory","Epam not showing route","Epam Lost","Unknown"],["0","-","0","-","-","-","-","-","-","-","-","-","-","-"]),
                  (["Inventory","Epam not showing carrier","Epam Lost","Unknown"],["1","0","0","0","-","0","-","-","-","-","-","-","-","-"]),
                  (["Inventory","Epam not showing carrier","Epam Lost","Restricted carrier for Epam"],["1","0","0","1","1","0","-","-","-","-","-","-","-","-"]),
                  (["Inventory","Epam not showing carrier","Epam Lost","Restricted carrier on Meta"],["1","0","0","1","0","0","-","-","-","-","-","-","-","-"]),
                  (["Inventory","Epam not showing carrier","Epam Lost","Unknown"],["1","0","0","0","-","-","-","-","-","-","-","-","-","-"]),
                  (["Inventory","Epam not showing itinerary","Epam Lost","Unknown"],["1","1","0","-","1","0","-","-","-","-","-","-","-","-"]),
                  (["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","0","0","0","0","1","-","-","-","-","-","-","-","-"]),
                  (["Inventory","Unique Inventory","Split Ticket","Epam Won"],["1","1","1","0","-","1","1","0","0","0","-","-","-","-"]),
                  (["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","1","0","0","-","1","-","1","0","0","-","-","-","-"]),
                  (["Inventory","Unique Inventory","Epam Lost","Split Ticket"],["1","1","0","0","-","1","-","0","1","0","-","-","-","-"]),
                  (["Inventory","Unique Inventory","Unknown","Epam Won"],["1","1","1","0","0","0","1","0","0","0","-","-","-","-"]),
                  (["Inventory","Unique Inventory","Epam Lost","Suspected Carrier Restricted Content"],["1","1","0","1","0","0","0","0","0","1","-","-","-","-"]),
                  (["Inventory","Unique Inventory","Epam Lost","Unknown"],["1","1","0","0","0","0","-","-","-","-","-","-","-","-"]),
                  (["Price","Carrier more expensive","Undercutting carrier","Epam Won"],["1","1","1","1","-","0","1","0","0","0","-","0","-","-"]),
                  (["Price","Carrier more expensive","Epam Lost","Undercutting carrier"],["1","1","1","1","1","0","0","0","1","0","-","-","0","0"]),
                  (["Price","Carrier more expensive","Epam Lost","Undercutting carrier"],["1","1","1","1","1","0","0","1","0","0","-","-","0","0"]),
                  (["Price","Carrier cheapest","Epam Lost","Unknown"],["1","1","1","1","1","0","0","-","-","1","-","0","0","0"]),
                  (["Price","Carrier cheapest","Epam Lost","Carrier controlled pricing"],["1","1","1","1","-","0","0","0","0","1","1","0","-","-"]),
                  (["Price","Fees or charges","Epam Lost","Split Ticket"],["1","1","1","0","-","1","0","0","1","0","-","-","0","-"]),
                  (["Price","Fees or charges","Epam Lost","Split Ticket"],["1","1","1","0","-","1","0","1","0","0","-","-","0","-"]),
                  (["Price","Fees or charges","Split Ticket","Epam Won"],["1","1","1","0","-","1","1","0","0","0","-","-","-","-"]),
                  (["Price","Fees or charges","Epam Lost","Unknown"],["1","1","1","0","-","0","0","0","1","0","-","-","0","-"]),
                  (["Price","Fees or charges","Epam Lost","Unknown"],["1","1","1","0","-","0","0","1","0","0","-","-","0","-"]),
                  (["Price","Fees or charges","Unknown","Epam Won"],["1","1","1","0","-","0","1","0","0","0","1","-","-","-"]),
                  (["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","1","0","0","0","-","0","1"]),
                  (["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","0","1","0","0","-","0","1"]),
                  (["Price","Fees or charges","Epam Lost","Fees or charges"],["1","1","1","1","1","0","0","0","0","1","0","0","-","1"]),
                  (["Rank","Rank","ECPC","Epam Won"],["1","1","1","-","1","-","1","0","0","0","0","-","1","-"]),
                  (["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","1","0","0","0","-","1","-"]),
                  (["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","0","1","0","0","-","1","-"]),
                  (["Rank","Rank","Epam Lost","ECPC"],["1","1","1","-","1","-","0","0","0","1","0","-","1","-"]),
              ]

              # Input comes from STDIN
              data = read_input(sys.stdin)
              header_list = column_list + flag_list + partition_list
              logging.debug("Header for input data: %s" % header_list)
              logging.debug("Start reading from STDIN")

              # Loop through STDIN
              for words in data:
              #for words in sys.stdin:
                  logging.debug("-----------")
                  current_flags = list()
                  #words = words.strip()
                  words = words.split('\t')
                  logging.debug("Input values from external process (STDIN): %s" % words)
                  logging.debug("Input length: %s" % len(words))
                  if (len(header_list) != len(words)):
                      logging.error("Length of IN data (%i) not equal Header length (%i)! Exit" % (len(words), len(header_list)))
                      sys.exit(1)
                  data_set = dict(zip(header_list, words))
                  logging.debug("Parsing of STDIN: %s" % data_set)

                  # Get flags
                  for flag in flag_list:
                      current_flags.append(data_set[flag])
                  logging.debug("Find flags: %s" % current_flags)

                  # Get list with result of comparison src/trg
                  compared_list = list()
                  logging.debug("Comparing flags with vocabulary...")
                  for k, v in target:
                      #logging.debug("key, value: %s,%s" % (k, v))
                      temp_out = compare_flags(current_flags, v)
                      if not temp_out:
                          continue
                      logging.debug("Match is found: %s" % temp_out)
                      compared_list.append((k, temp_out))
                      temp_out = list()
                  logging.debug("Comparing flags with vocabulary finished. List of matches: %s" % (compared_list))

                  # Find max occurrence of src in trg (find max-occurrence of zeros)
                  max_zeros = 0
                  out_fields = list()
                  max_flag_from_trg = list()
                  for k, v in compared_list:
                      #logging.debug("key, value: %s,%s" % (k, v))
                      count_zero = v.count(0)
                      if count_zero > max_zeros:
                          # Remember the best score so far, so that only a better match replaces it
                          max_zeros = count_zero
                          out_fields = k
                          max_flag_from_trg = v
                  if (not out_fields) or (not max_flag_from_trg):
                      logging.warning("Can't find values in vocabulary. Set values for DEFAULT")
                      logging.warning("Fields: %s" % out_fields)
                      logging.warning("Flags: %s" % max_flag_from_trg)
                      out_fields = ["DEFAULT" for x in range(len(target[0][0]))]
                  else:
                      logging.debug("Output fields found")
                      logging.debug("Fields: %s" % out_fields)
                      logging.debug("Flags: %s" % max_flag_from_trg)

                  # Output fields with flags in STDOUT
                  field_data = [data_set[x] for x in column_list]
                  partition_date = [data_set[x] for x in partition_list]
                  out_row = separator.join(field_data) + separator + separator.join(out_fields) + separator + separator.join(partition_date)
                  logging.debug("Output string: %s" % out_row)
                  print(out_row)
                  #print "%s%s%s%s%s" % (separator.join(field_data), separator, separator.join(out_fields), separator, separator.join(partition_date))


          if __name__ == "__main__":
              logging.basicConfig(level=logging.DEBUG, stream=sys.stderr,
                                  #format='%(filename)s[LINE:%(lineno)d]# %(levelname)-8s [%(asctime)s] %(message)s'
                                  format='[%(asctime)s][%(filename)s][%(levelname)s] %(message)s'
                                  )
              main()
  17. 17. Hive Streaming: Debug

          echo -e "val11\tval12\t…val1N\nval21\tval22\t…val2N" | ./script_name.py

      Example: Put 2 lines (TSV) in stdin

          echo -e "IAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEPAM.COM\nIAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEPAM.COM" | ./script_name.py

      Get 2 lines with new fields (without flags) in stdout (TAB-separated; tabs are rendered as spaces here)

          IAH CUN IAH-CUN 01 14 USD 520.99 4 19 \N \N 0 \N DID 2 DID DID DID CHEAPTICKETS 520.99 ORBITZ 520.99 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N TRIPADVISOR - US 01 01 0 0 0 0 0 Inventory Epam not showing carrier Epam Lost Unknown 2014-01-01 EPAM.COM
          IAH CUN IAH-CUN 01 14 USD 520.99 4 19 \N \N 0 \N DID 2 DID DID DID CHEAPTICKETS 520.99 ORBITZ 520.99 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N TRIPADVISOR - US 01 01 0 0 0 0 0 Inventory Epam not showing carrier Epam Lost Unknown 2014-01-01 EPAM.COM
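The same stdin/stdout check can be scripted, which is convenient for regression tests. The Python 3 sketch below is an assumption-laden illustration: `run_streaming_script` is a hypothetical helper, and a tiny inline stand-in script (it appends the field count to every row) replaces the real mapper so the example is self-contained. To test a real script, pass a command such as `["./script_name.py"]` instead.

```python
import subprocess
import sys

# Stand-in for the real streaming script (hypothetical): appends the number
# of fields to each row it reads from stdin.
STAND_IN_SCRIPT = r'''
import sys
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    print("\t".join(fields + [str(len(fields))]))
'''


def run_streaming_script(command, rows, separator="\t"):
    """Pipe rows (lists of strings) into `command` and return the parsed output rows."""
    payload = "".join(separator.join(row) + "\n" for row in rows)
    completed = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return [line.split(separator) for line in completed.stdout.splitlines()]


rows = [["IAH", "CUN", r"\N"], ["MSQ", "", "520.99"]]
output = run_streaming_script([sys.executable, "-c", STAND_IN_SCRIPT], rows)
for row in output:
    print(row)
```

With `check=True` a non-zero exit code of the script (for example the `sys.exit(1)` on a column-count mismatch) surfaces as an exception instead of silently empty output.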
  18. 18. Hive Streaming: Pitfalls • Add the script to the Distributed Cache before running a query with Hive Streaming • Put the partition columns last in the SELECT statement for Dynamic Partitioning • Use a separator more robust than the default (TAB) to prevent data inconsistency Note: always use iterator/generator functions (Python idiom) instead of explicit reading from stdin! It saves system resources and runs the script much faster (more than 10 times) Example:

          def read_input(file):
              for line in file:
                  # strip surrounding whitespace from the line
                  yield line.strip()

          data = read_input(sys.stdin)
          for words in data:
              …

      rather than:

          for words in sys.stdin:
              …
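The separator pitfall is easy to reproduce locally. In this Python 3 sketch the sample values and the choice of the control character `\x01` are assumptions made for illustration; on the Hive side the same separator would have to be declared through ROW FORMAT, which the deck mentions only as the way to override the defaults.

```python
def split_row(line, separator):
    """Split one streamed line into fields using the given separator."""
    return line.rstrip("\n").split(separator)


def check_width(fields, expected):
    """Raise ValueError when a row does not have the expected number of columns."""
    if len(fields) != expected:
        raise ValueError("expected %d columns, got %d" % (expected, len(fields)))
    return fields


# Hypothetical row whose second value contains an embedded TAB.
values = ["IAH", "TRIPADVISOR\t- US", "520.99"]

tab_fields = split_row("\t".join(values) + "\n", "\t")
ctrl_fields = split_row("\x01".join(values) + "\n", "\x01")

print(len(tab_fields))   # the embedded TAB produces an extra column
print(len(ctrl_fields))  # the control-character separator keeps the row intact
check_width(ctrl_fields, 3)
```

A width check like `check_width` mirrors the header-length test in the mapper: it turns a silently shifted row into an explicit failure.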
  19. 19. SECTION Hadoop MR Streaming in Hive BENCHMARKS
  20. 20. Hive Streaming: Benchmarks. Hive "case … when" clause. Source: MANAGED, non-partitioned, 2M rows. Target: MANAGED, non-partitioned. Time spent: 2m39s
  21. 21. Hive Streaming: Benchmarks. Hive Streaming. Source: MANAGED, non-partitioned, 2M rows. Target: MANAGED, non-partitioned. Time spent: 4m53s. Note: the output is not compressed, so "Number of bytes written" is much larger
  22. 22. Hive Streaming: Benchmarks. Hive "case … when" clause. Source: MANAGED, non-partitioned, 2M rows. Target: MANAGED, partitioned by 2 columns. Time spent: 2m44s
  23. 23. Hive Streaming: Benchmarks. Hive Streaming. Source: MANAGED, non-partitioned, 2M rows. Target: MANAGED, partitioned by 2 columns. Time spent: 5m12s
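To put the four timings side by side, the short Python 3 snippet below converts them to seconds and computes how much longer the streaming runs took. Only the timings come from the slides; the helper `to_seconds` and the dictionary layout are illustrative.

```python
def to_seconds(text):
    """Convert a duration such as '4m53s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


# (case-when time, streaming time) per target layout, as reported on the slides.
BENCHMARKS = {
    "non-partitioned target": ("2m39s", "4m53s"),
    "target partitioned by 2 columns": ("2m44s", "5m12s"),
}

ratios = {}
for name, (case_when, streaming) in BENCHMARKS.items():
    ratios[name] = to_seconds(streaming) / to_seconds(case_when)
    print("%s: streaming took %.2fx the case-when time" % (name, ratios[name]))
```

On these 2M-row runs streaming took roughly 1.84 times (293 s vs 159 s) and 1.90 times (312 s vs 164 s) as long as the "case … when" query.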
  24. 24. Thanks! Join us at https://www.linkedin.com/groups/Belarus-Hadoop-User-Group-BHUG-8104884 yauheni_yushyn@epam.com skype: ushin.evgenij