IEEE International Conference on Data Engineering 2015

Hadoop DW SK Telecom Usecase

  1. SKT Hadoop DW
     SK Telecom Corporate R&D Center
     Yousun Jeong
  2. Table of Contents
     1. Big Data in SKT
     2. What is Hadoop DW?
     3. SQL on Hadoop - TAJO
     4. Hadoop DW Commercialization Cases
     Copyright © 2015 by SK Telecom. All rights reserved.
  3. Section 1: Big Data in SKT
     [High TCO for Data Management]
       - 250 TB/day (91.25 PB/year)
       - 4 Hadoop clusters with various commercial MPP databases for analytics
       - Diagram labels: Operational Systems, Integration Layer, Data Warehouse, Marts; Marketing, Sales, ERP, SCM; ODS, Staging Area; Mart A, Mart B, Mart C, Mart D; Hadoop+Hive; MPP DBMS
       - High TCO for data management (too much data is loaded into the MPP DBMS)
     [One Unified Solution]
       - 30 PB+ (compressed) on 1000+ nodes
       - 10+ Hadoop clusters with Tajo & Spark for all purposes
       - Diagram labels: Operational Systems, Integration Layer, Data Warehouse, Marts; Marketing, Sales, ERP, SCM; ODS, Staging Area; Mart A, Mart B, Mart C, Mart D; Hadoop+Tajo+Spark
       - Affordable & faster (unified framework for Big Data)
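
The yearly figure on this slide follows directly from the daily one. A minimal sketch of the arithmetic, assuming decimal units (1 PB = 1000 TB) and a 365-day year; the function and constant names are illustrative only:

```python
# Assumption: decimal units (1 PB = 1000 TB) and a 365-day year.
TB_PER_DAY = 250
DAYS_PER_YEAR = 365
TB_PER_PB = 1000


def yearly_volume_pb(tb_per_day: float) -> float:
    """Return the yearly data volume in PB for a daily volume given in TB."""
    return tb_per_day * DAYS_PER_YEAR / TB_PER_PB


yearly_pb = yearly_volume_pb(TB_PER_DAY)
print(f"{TB_PER_DAY} TB/day is {yearly_pb} PB/year")  # prints "250 TB/day is 91.25 PB/year"
```
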
  4. Section 1: Big Data in SKT, SKT Hadoop Clusters
       ✓ Optimized configuration of a large-scale cluster
       ✓ Operation know-how of managing 1000+ nodes
       ✓ Fault-tolerant and effective resource management system
     Diagram labels (T-Hadoop): Data Collector (data collection & pre-processing); Main Cluster; Analysis R&D Cluster; ~250 TB/day; 700+ nodes; Service Logic Repository; 200+ nodes; 100+ nodes; Service Cluster (150+ nodes); App. 1 … App. N; Data Feeding; Commercialize; Develop
  5. Section 2: What is Hadoop DW?
     [Hadoop DW Infrastructure] "Hadoop S/W and commodity H/W based cost-effective IT infrastructure system"
     [Legacy IT Infrastructure] "High-price, high-performance proprietary IT infrastructure system"

              Hadoop DW Infrastructure                       Legacy IT Infrastructure
     Data     Structured/unstructured data;                  Structured data;
              scale-out structure (petabyte, exabyte)        scale-up structure (terabyte)
     Cost     Low price ($200 ~ $1,000 / TB)                 High price ($5,000 ~ $50,000 / TB)
     H/W      Commodity H/W (x86 server)                     High-performance H/W (MPP, fabric switch, etc.)
     S/W      Hadoop architecture: SQL on Hadoop,            Proprietary S/W (RDBMS, etc.);
              Hadoop File System                             transaction/batch processing (SQL)

     Solution: SKT Hadoop DW. The Hadoop DW provides a data warehouse based on the Hadoop architecture for enterprise environments, so that users can accommodate a massive and growing amount of data at low cost.
     ※ MPP: Massively Parallel Processing; SAN: Storage Area Network; NAS: Network Attached Storage; RDBMS: Relational DB Management System; SQL: Structured Query Language
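
To make the price ranges on this slide concrete, the sketch below scales them to 1 PB of capacity. It assumes decimal units (1 PB = 1000 TB) and that cost grows linearly with capacity, which the slide does not state; all names are illustrative.

```python
from typing import Tuple

TB_PER_PB = 1000  # assumption: decimal units

# Price ranges per TB as quoted on the slide (USD).
HADOOP_DW_USD_PER_TB = (200, 1_000)
LEGACY_USD_PER_TB = (5_000, 50_000)


def cost_range_usd(price_per_tb: Tuple[int, int], capacity_tb: int) -> Tuple[int, int]:
    """Scale a (low, high) price-per-TB range to a capacity, assuming linear cost."""
    low, high = price_per_tb
    return low * capacity_tb, high * capacity_tb


hadoop_cost = cost_range_usd(HADOOP_DW_USD_PER_TB, TB_PER_PB)
legacy_cost = cost_range_usd(LEGACY_USD_PER_TB, TB_PER_PB)
print(f"1 PB on Hadoop DW: ${hadoop_cost[0]:,} to ${hadoop_cost[1]:,}")
print(f"1 PB on legacy infrastructure: ${legacy_cost[0]:,} to ${legacy_cost[1]:,}")
```

Under these assumptions the per-TB price gap between the two ranges lies anywhere between 5x ($5,000 vs. $1,000) and 250x ($50,000 vs. $200).
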
  6. Section 3: SQL on Hadoop - TAJO
     High-speed SQL-on-Hadoop processing engine
       • 3~5x improvement in processing speed over Hive under the TPC-H procedure
       • 80~100% of Impala's response speed, without a data size limit
       • Full ANSI SQL support for easy RDBMS migration
     [Legacy Approach (MR)] Hive on MapReduce: partially distributed, sequential process; HDFS on a Hadoop cluster
     [Tajo Approach] Tajo: fully distributed, vector process; HDFS on a Hadoop cluster + Tajo
     Processing speed: process more data on the same clusters with improved processing speed
     Response speed: up to 10x min per query on the Hadoop cluster vs. a few sec ~ min with Tajo; try more queries for analysis with improved response speed
  7. Section 3: SQL on Hadoop - TAJO
     [Tajo Features]
       ▪ SQL Support: ANSI SQL support, Partition Type, Meta Store
       ▪ Service Stability: High Availability, Resource Manager, Fair Scheduler
       ▪ Performance: High-speed processing, Shuffling, Dynamic Query Optimizer, Query Rewriting
       ▪ System Integration: BI Connector, Proxy Support, Tajo-R
       ▪ Function Support: Analytic Function, Hive Function
     [Performance Comparison]
     [Apache Top-Level Project]
  8. Section 3.1: Tajo Architecture
     Diagram labels:
       - Clients: JDBC Driver, ODBC Driver, Tajo CLI
       - Tajo Master: Client Service Handler, SQL Parser, Query Rewriter, Query Manager, Logical Planner, Logical Optimizer, Resource Manager, Tajo Catalog, HCatalog
       - Persistent Storage: Derby Store, MySQL Store, PostgreSQL Store
       - Worker: 1. Query Master (Global Planner, Client Service Handler); 2. TaskRunner (Local Query Engine, Storage Manager)
       - Storage: Local, HDFS/HBase, S3/Swift
  9. Section 3.1: Technical Characteristic - Logical Flow
     Data Processing
     Tajo Master:
       - SQL Parser: query parsing
       - Logical/Global Planner: decomposition of a work unit
       - Resource Manager: work units delivered to the server
     Tajo Workers:
       - Physical Planner: decomposing the task operation unit
       - Query Engine: unit operation
       - Storage Manager: disk data I/O control
  10. Section 3.1: Technical Characteristic - JIT Query Engine
     Behavior depends on the operand characteristic:
       - 1 + 2 = 3
       - "a" + "b" = "ab"
       - {1,2} + {3,4} = {4,6}
       - 1 + {1,2} = {2,3}
     <Existing methods>
       - Implemented as a binary that has to consider every possible case -> performance degradation (call, if, switch below 50%)
       - switch(operand): case numeric: add numeric; case string: add string
     <JIT methods>
       - Real-time code generation based on operand type
       - Combined operations can be processed by compiler optimization
       - Four functions in a single operation (+ twice, - once, * once): Result = A x (1-B) + (1+C)
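
The contrast on this slide can be sketched in Python. This is only an analogy under stated assumptions, not Tajo's implementation: `add_interpreted` re-checks operand types on every call (the "existing method"), while `compile_expression`, a hypothetical helper, generates one function for the whole expression up front, so all four operators run as a single compiled unit.

```python
from typing import Callable, Sequence


def add_interpreted(a, b):
    """Existing method: branch on operand type every time the operator runs."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a + b                             # 1 + 2 = 3
    if isinstance(a, str) and isinstance(b, str):
        return a + b                             # "a" + "b" = "ab"
    if isinstance(a, list) and isinstance(b, list):
        return [x + y for x, y in zip(a, b)]     # {1,2} + {3,4} = {4,6}
    if isinstance(a, (int, float)) and isinstance(b, list):
        return [a + y for y in b]                # 1 + {1,2} = {2,3}
    raise TypeError(f"unsupported operands: {type(a)} and {type(b)}")


def compile_expression(expr: str, columns: Sequence[str]) -> Callable:
    """JIT-style method: generate code for the whole expression once.

    Hypothetical helper; only pass trusted expression strings, since the
    generated source is evaluated as Python code.
    """
    source = f"lambda {', '.join(columns)}: {expr}"
    return eval(compile(source, "<generated>", "eval"))


# Result = A x (1-B) + (1+C), compiled once and reused for every row.
result_fn = compile_expression("A * (1 - B) + (1 + C)", ["A", "B", "C"])
print(result_fn(2.0, 0.5, 3.0))  # 2.0 * 0.5 + 4.0
```

The generated function is created once per query expression and then called per row, so the type and operator dispatch cost is paid at generation time rather than on every evaluation.
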
  11. Section 3.1: Technical Characteristic - Vectorized Query Engine
     <Tuple at a time>
       - DB
       - 1 operation/record: a1 + b1, a2 + b2, a3 + b3, a4 + b4, a5 + b5, a6 + b6
     <Vectorized engine>
       - Vectorized data
       - 1 operation/vector:
           A[] = {a1, a2, a3, a4, a5, a6}
           B[] = {b1, b2, b3, b4, b5, b6}
           C[] = A[] + B[]
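
A small Python sketch of the difference, counting how often the engine has to invoke the add operator. It is an illustration only: the function names and the `vector_size` parameter are assumptions, and a real vectorized engine would run the inner loop as tight native code rather than a Python list comprehension.

```python
from typing import List, Tuple


def add_tuple_at_a_time(a: List[float], b: List[float]) -> Tuple[List[float], int]:
    """Invoke the add operator once per record; return (result, operator calls)."""
    calls = 0
    out = []

    def add_record(x: float, y: float) -> float:
        nonlocal calls
        calls += 1
        return x + y

    for x, y in zip(a, b):
        out.append(add_record(x, y))
    return out, calls


def add_vectorized(a: List[float], b: List[float],
                   vector_size: int = 1024) -> Tuple[List[float], int]:
    """Invoke the add operator once per vector; return (result, operator calls)."""
    calls = 0
    out = []

    def add_vector(xs: List[float], ys: List[float]) -> List[float]:
        nonlocal calls
        calls += 1
        return [x + y for x, y in zip(xs, ys)]  # tight inner loop over one vector

    for start in range(0, len(a), vector_size):
        out.extend(add_vector(a[start:start + vector_size],
                              b[start:start + vector_size]))
    return out, calls


A = [1, 2, 3, 4, 5, 6]
B = [10, 20, 30, 40, 50, 60]
print(add_tuple_at_a_time(A, B))  # six operator calls for six records
print(add_vectorized(A, B))       # one operator call for the whole vector
```

Both functions produce the same C[]; only the number of operator invocations differs, which is the per-record overhead the vectorized design amortizes.
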
  12. Section 3.1: Technical Characteristic - Storage Manager
     Diagram labels: Tajo Worker, Tajo Worker, Tajo Worker (scan); Storage Manager; Scan Scheduler; Request queue (x3); Disk Scanner (x3); Pre-fetching Buffer; Bulk Read; Fine granularity; File request
  13. Section 4: Hadoop DW Commercialization Cases, Telco [SK Telecom]
     Business Challenge
       • Explosion of log data with LTE service
       • Increase in the types of data to be analyzed
       • Insufficient DW capacity due to high cost
     How SKT Hadoop DW Helped
       ✓ 3x storage expansion at the same price, or 80% reduction in unit price
       ✓ Enabled ad-hoc analysis of unstructured text data sets for daily
       ✓ Reduced content-based analysis processing time from a few hours to 20 minutes at most

     Category          MPP DBMS             Hadoop DW
     Raw Data Size     0.5 TB/day           4 TB/day
     Total ETL Time    Average of 3 hours   Average of 6 hours
     DW Creation       30 minutes           40 minutes
     Mart Creation     1 hour               1 hour 40 minutes
     Report Creation   1 hour 30 minutes    2 hours 4 minutes
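
The table's first two rows can be combined into a rough throughput comparison. The sketch below assumes that "Total ETL Time" is the average time needed to process one day's raw data, which the slide does not state explicitly; the variable names are illustrative.

```python
# Figures from the slide's table.
MPP_RAW_TB_PER_DAY = 0.5
MPP_ETL_HOURS = 3.0
HADOOP_RAW_TB_PER_DAY = 4.0
HADOOP_ETL_HOURS = 6.0

# Assumption: total ETL time covers exactly one day's raw data.
data_ratio = HADOOP_RAW_TB_PER_DAY / MPP_RAW_TB_PER_DAY   # how much more data
time_ratio = HADOOP_ETL_HOURS / MPP_ETL_HOURS             # how much more time
throughput_ratio = data_ratio / time_ratio                # data per hour, relative

print(f"Data volume: {data_ratio:.0f}x, ETL time: {time_ratio:.0f}x, "
      f"throughput: {throughput_ratio:.0f}x")
```

Read this way, the Hadoop DW handles 8x the raw data in 2x the time, or about 4x the ETL throughput of the MPP DBMS setup.
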
  14. Section 4: Hadoop DW Commercialization Cases, Manufacturer [Global Top-5 Semiconductor Player]
     Business Challenge
       • An immense amount of unstructured measurement data is collected during manufacturing
       • RDBMS and BI tools cannot handle this type of data
       • Data loading alone can take up to 20 min
     How SKT Hadoop DW Helped
       ✓ Support for unstructured data through a variable column schema
       ✓ 100x increase in data processing capacity
       ✓ Decreased data loading time by 10x (to 2 min)
       ✓ Minimized user action for pivot/unpivot
  15. Thank you.
