
Big Data Developers Moscow Meetup 1 - SQL on Hadoop

Slides from the first meetup of the BigData Developers Moscow group


  1. Big Data Developers Meetup #1, Aug 2014 •Andrey.vykhodtsev@ru.ibm.com •Central & Eastern Europe BigData Tech Sales
  2. First Meetup, 2014 •About SQL on Hadoop •An overview that aims to be objective, and a constructive dialogue •Based on respect for other technologies, including competing ones •No holy wars •Modest snacks – help yourselves •From 19:00 to 22:00; the program ends at 21:00, and the building must be vacated by 22:00
  3. Agenda •What is this Hadoop thing? •Why SQL on Hadoop? •What is Hive? •SQL-on-Hadoop landscape •InfoSphere BigInsights for Hadoop with Big SQL •What is it? •SQL capabilities •Architecture •Application portability and integration •Enterprise capabilities •Performance •Conclusion
  4. Big Data Scenarios Span Many Industries – and rely on Hadoop •Data Warehouse Modernization: optimize the existing EDW environment (size, performance, and TCO); capture, off-load, and analyze massive amounts of data to get new insights •360 View of the Customer: text analytics on social media commentary around life events; link social media profiles to actual customers •Cyber Security: analyze massive volumes of data that existing SIEM systems can't handle; monitor web and email traffic to identify potential threats such as Internet drug trafficking and prostitution
  5. The Goal of Hadoop •Manage large volumes of data: scalable to any volume; off-load from the warehouse; identify unique customers •Reduce costs: commodity hardware; common tools; in-house skills •Analyze new data types: improve business decisions; understand sentiment; analyze data-in-motion
  6. What is Hadoop? [Diagram: input files are split; a master coordinates Map tasks over the splits, intermediate files feed Reduce tasks, and the Reduce tasks write the output files] •Framework to process big data in parallel on a cluster •What's new/different? •Free, open source •Uses commodity hardware •“Move programs to the data” •Scale both processing and storage by simply adding nodes •Makes big data processing accessible to everyone •Two key things to understand Hadoop: •How files are stored •How files are processed
  7. How files are stored: HDFS •Key ideas: •Divide big files into blocks and store the blocks randomly across the cluster •Provide an API to ask: where are the pieces of this file? •=> Programs can be shipped to nodes for parallel distributed processing [Diagram: a logical file divided into blocks 1–4, each block replicated across the nodes of a cluster]
  8. How Files are Processed: MapReduce •Common pattern in data processing: apply a function, then aggregate: grep "World Cup" *.txt | wc -l •The user simply writes two pieces of code: a “mapper” and a “reducer” •Mapper code executes on every split of every file •The reducer consumes/aggregates mapper outputs •The Hadoop MR framework takes care of the rest (resource allocation, scheduling, coordination, temping of intermediate results, storage of the final result on HDFS) [Diagram: a logical file divided into splits, each processed by a Map task on the cluster; a Reduce task combines their outputs into the result]
  9. SQL on Hadoop and Hive •Hadoop can process data of any kind (as long as it's splittable, etc.) •A very common scenario: •Tabular data •Programs that “query” the data •The Java Hadoop APIs are the wrong tool for this •Too low-level, steep learning curve •Require strong programming expertise •Universally accepted solution: SQL •Enter Hive ... 1. Impose relational structure on plain files 2. Translate SELECT statements to MapReduce jobs 3. Hide all the low-level details
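As a sketch of that idea, the shell pipeline from the previous slide becomes a one-line query once Hive imposes a table on the text files; the table and column names here are hypothetical:

```sql
-- Hypothetical Hive table over the *.txt files, one row per line.
-- Hive compiles this SELECT into a MapReduce job: mappers scan
-- and filter each file split, a reducer sums the counts.
SELECT COUNT(*)
FROM articles
WHERE line LIKE '%World Cup%';
```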
  10. Why SQL on Hadoop? Hadoop stores large volumes and varieties of data. SQL gets information and insight out of Hadoop. SQL leverages existing IT skills, resulting in quicker time to value and lower cost.
  11. Hive •One of the most popular Hadoop-related technologies •Ships with all major Hadoop distributions •Hive opens up Hadoop to anyone with SQL skills •Simplified and shortened development cycle •Little Java/MapReduce knowledge required •Three key concepts: •Hive SerDe •Hive Table •Hive Metastore
  12. Hive SerDes •SerDe = Serializer + Deserializer •Deserializer = Java code that implements the mapping from a Hadoop “record” to a Hive “row” •A Hadoop record is just a byte array •A Hive row has columns with names and data types •The serializer maps a Hive row to a Hadoop record (for writing) •Many built-in SerDes: •Delimited text files •JSON •XML •REGEX •AVRO •You can also add your own custom SerDes
  13. Hive Tables •A Hive table imposes a relational “schema” (a list of column names and types) on a file •The schema is purely logical •Data in the file is not altered in any way •“Schema on read” (as opposed to the schema-on-write of traditional RDBMSs) •Hive table = metadata + data •CREATE TABLE statement (metadata) •A directory containing one or more files (data) CREATE TABLE logEvents (ipaddress STRING, eventtime TIMESTAMP, message STRING) ROW FORMAT SERDE 'org.apache.hive…LazySimpleSerde' WITH SERDEPROPERTIES ( 'field.delim' = '|' ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapred.TextOutputFormat' LOCATION '/user/hive/warehouse/sample.db/logevents';
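A short usage sketch of schema-on-read: files placed under the table's location become queryable immediately, with nothing converted or validated at load time (the path and query are illustrative):

```sql
-- Move an existing HDFS file into the table's directory;
-- no parsing happens here, only a file move.
LOAD DATA INPATH '/tmp/events.log' INTO TABLE logEvents;

-- The SerDe interprets the bytes only when the data is read.
SELECT ipaddress, COUNT(*) AS hits
FROM logEvents
GROUP BY ipaddress;
```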
  14. Hive Metastore •The Hive metastore stores metadata about all the tables •Usually backed by a conventional relational database (not on HDFS) •Default: Derby •MySQL, DB2, Oracle •Table metadata: •Schema (column names and types) •Location (directory on HDFS) •SerDe •Hadoop InputFormat/OutputFormat •Partition information •Properties (column and row delimiters, etc.) •Security (access control)
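To see what the metastore holds for a given table, Hive's DESCRIBE command is enough; for example, for the table defined on the previous slide:

```sql
-- Prints the columns and types plus the stored metadata:
-- location, SerDe, input/output formats, and table properties.
DESCRIBE FORMATTED logEvents;
```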
  15. Hadoop Latency and Hive SQL Features •Hive was not designed to be an RDBMS, but to hide the low-level details of MapReduce •But the inevitable questions came up… •Hadoop latency: •Why is my query so slow compared to XYZ? •Why does it take so long to retrieve a few rows? •Hive SQL features: •How do I define a view, stored procedure, …? •What's wrong with this subquery? •No DATE, DECIMAL, VARCHAR data types?
  16. SQL-on-Hadoop landscape •The SQL-on-Hadoop landscape changes constantly! •Being relatively new to the SQL game, these offerings have all generally meant compromising on one or more of: •Speed •Robust SQL •Enterprise features •Interoperability with the Hadoop ecosystem •IBM InfoSphere BigInsights for Hadoop with Big SQL is based upon tried-and-true IBM relational technology, addressing all of these areas
  17. Introducing Big SQL 3.0 •Goal: bring SQL on Hadoop to the next level •Low-latency HDFS-based parallelism •Move programs to the data •No MapReduce => MPP engine •Avoid unnecessary temping => message passing •Avoid process startup/teardown => daemon processes •Full SQL support [Diagram: a SQL-based application connects through the IBM data server client to the Big SQL engine, a SQL MPP run-time over HDFS reading CSV, Seq, Parquet, RC, ORC, Avro, JSON, and custom formats]
  18. Big SQL 3.0 – Not just a faster, richer Hive
  19. Big SQL highlights •Full support for subqueries •In SELECT, FROM, WHERE and HAVING clauses •Correlated and uncorrelated •Equality, non-equality subqueries •EXISTS, NOT EXISTS, IN, ANY, SOME, etc. •All standard join operations •Standard and ANSI join syntax •Inner, outer, and full outer joins •Equality, non-equality, cross join support •Multi-value join •UNION, INTERSECT, EXCEPT Example: SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey ) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate ) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name;
  20. Big SQL in the Hadoop Ecosystem •Fully integrated with the ecosystem – Hive Metastore – Hive Tables – Hive SerDes – Hive partitioning – Hive statistics – Columnar formats •ORC •Parquet •RCFile •Completely open, without compromises •No proprietary storage format [Diagram: Pig, Sqoop, and Big SQL all access the Hadoop cluster and the Hive metastore through the same Hive APIs]
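As an illustrative sketch of that sharing, a table created from the Hive shell is immediately visible to Big SQL, because both read the same metastore, table definitions, and SerDes (names are hypothetical):

```sql
-- Created in Hive...
CREATE TABLE web_logs (ip STRING, url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- ...and queried unchanged from Big SQL, with no export or
-- conversion step in between.
SELECT url, SUM(hits) AS total_hits
FROM web_logs
GROUP BY url;
```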
  21. Architected for performance •Architected from the ground up for low latency and high throughput •MapReduce replaced with a modern MPP architecture •Compiler and runtime are native code (not Java) •Big SQL worker daemons live directly on the cluster •Continuously running (no startup latency) •Processing happens locally at the data •Message passing allows data to flow directly between nodes •Operations occur in memory with the ability to spill to disk •Supports aggregations and sorts larger than available RAM [Diagram: a head node running Big SQL and the Hive metastore, plus compute nodes each running a task tracker, data node, and Big SQL worker over HDFS/GPFS]
  22. Extreme parallelism •Massively parallel SQL engine that replaces MR •Shared-nothing architecture that eliminates scalability and networking issues •The engine pushes processing out to the data nodes to maximize data locality; Hadoop data is accessed natively via C++ and Java readers and writers •Inter- and intra-node parallelism: work is distributed to multiple worker nodes, and on each node multiple worker threads collaborate on the I/O and data processing (scale out horizontally and scale up vertically) •Intelligent data partition elimination based on SQL predicates (see the sketch below) •Fault tolerance through active health monitoring and management of parallel data and worker nodes
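A sketch of partition elimination, assuming a Hive-style partitioned table (all names hypothetical):

```sql
-- Each sale_date value becomes its own HDFS directory.
CREATE TABLE sales (item STRING, amount DOUBLE)
PARTITIONED BY (sale_date STRING);

-- The predicate on the partitioning column lets the engine
-- skip every partition outside May 2014 without reading it.
SELECT item, SUM(amount) AS total
FROM sales
WHERE sale_date BETWEEN '2014-05-01' AND '2014-05-31'
GROUP BY item;
```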
  23. A process model view of Big SQL 3.0
  24. Big SQL 3.0 – Architecture (cont.) •Big SQL's runtime execution engine is all native code •For common table formats, a native I/O engine is utilized •e.g. delimited, RC, SEQ, Parquet, … •For all others, a Java I/O engine is used •Maximizes compatibility with existing tables •Allows for custom file formats and SerDes •All Big SQL built-in functions are native code •Customer-built UDxs can be developed in C++ or Java •Maximize performance without sacrificing extensibility [Diagram: a management node plus compute nodes; each Big SQL worker hosts the native I/O engine, the Java I/O engine (SerDes, I/O formats), and a runtime with native and Java UDFs]
  25. Resource management •Big SQL doesn't run in isolation •Nodes tend to be shared with a variety of Hadoop services •Task tracker •Data node •HBase region servers •MapReduce jobs •etc. •Big SQL can be constrained to limit its footprint on the cluster •% of CPU utilization •% of memory utilization •Resources are automatically adjusted based upon workload •Always fitting within constraints •Self-tuning memory manager that re-distributes resources across components dynamically •Default WLM concurrency control for heavy queries [Diagram: a compute node shared by the task tracker, data node, Big SQL, HBase, and MapReduce tasks]
  26. Performance •Query rewrites •Exhaustive query rewrite capabilities •Leverages additional metadata such as constraints and nullability •Optimization •Statistics- and heuristics-driven query optimization •Query optimizer based upon decades of IBM RDBMS experience •Tools and metrics •Highly detailed explain plans and query diagnostic tools •Extensive number of available performance metrics SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST) FROM PERIOD, DAILY_SALES, PRODUCT, STORE WHERE PERIOD.PERKEY=DAILY_SALES.PERKEY AND PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND STORE.STOREKEY=DAILY_SALES.STOREKEY AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012' AND STORE_NUMBER='03' AND CATEGORY=72 GROUP BY ITEM_DESC [Diagram: query transformation applies dozens of rewrites, then access plan generation weighs hundreds or thousands of plan options, e.g. different join orders and join methods (NLJOIN, HSJOIN) over Store, Product, Period, and Daily Sales]
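A hedged sketch of how one would inspect the chosen plan, assuming Big SQL inherits the DB2-style EXPLAIN facility that the slide's tooling implies (the explain tables must already exist):

```sql
-- Capture the access plan for the slide's query into the
-- explain tables, without executing it...
EXPLAIN PLAN FOR
SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE PERIOD.PERKEY = DAILY_SALES.PERKEY
  AND PRODUCT.PRODKEY = DAILY_SALES.PRODKEY
  AND STORE.STOREKEY = DAILY_SALES.STOREKEY
  AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012'
  AND STORE_NUMBER = '03'
  AND CATEGORY = 72
GROUP BY ITEM_DESC;
-- ...then format the captured plan with a tool such as db2exfmt.
```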
  27. Statistics are key to performance •Table statistics: •Cardinality (count) •Number of files •Total file size •Column statistics (this applies to column-group stats also): •Minimum value •Maximum value •Cardinality (non-nulls) •Distribution (number of distinct values) •Number of null values •Average length of the column value (for string columns) •Histogram •Most Frequent Values (MFV)
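A sketch of how such statistics are gathered, assuming Big SQL's Hive-style ANALYZE TABLE command (table and column names are hypothetical):

```sql
-- Collect table-level statistics plus per-column min/max,
-- distinct counts, null counts, and histograms for the
-- optimizer's cardinality estimates.
ANALYZE TABLE sales COMPUTE STATISTICS
FOR COLUMNS item, amount, sale_date;
```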
  28. Application portability and integration •Big SQL 3.0 adopts IBM's standard Data Server Client Drivers •Robust, standards-compliant ODBC, JDBC, and .NET drivers •The same driver used for DB2 LUW, DB2/z and Informix •Expands support to numerous languages (Python, Ruby, Perl, etc.) •Putting the story together: •Big SQL shares a common SQL dialect with DB2 •Big SQL shares the same client drivers with DB2 •Data warehouse augmentation just got significantly easier [Diagram: compatible SQL + compatible drivers => portable application]
  29. Application portability and integration (cont.) •This compatibility extends beyond your own applications •Open integration across business analytics tools •IBM Optim Data Studio performance tool portfolio •Superior enablement for IBM software – e.g. Cognos •Enhanced support by 3rd-party software – e.g. MicroStrategy
  30. Query federation •Data never lives in isolation •Whether as a landing zone or a queryable archive, it is desirable to query data across Hadoop and active data warehouses •Big SQL provides the ability to query heterogeneous systems •Join Hadoop to other relational databases •The query optimizer understands the capabilities of the external system •Including available statistics •As much work as possible is pushed to each system to process [Diagram: the Big SQL head node and compute nodes alongside an external data source queried through federation]
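A hedged sketch of what federation could look like, assuming Big SQL exposes DB2-style federation objects (the server, schema, and table names are hypothetical, and connection options are omitted):

```sql
-- Register a remote DB2 warehouse as a federated data source
-- (real setups also supply connection options here).
CREATE SERVER dwh TYPE DB2/UDB VERSION '10.5' WRAPPER DRDA;

-- Expose one of its tables under a local name.
CREATE NICKNAME dwh_customers FOR dwh.SALES.CUSTOMERS;

-- Join Hadoop-resident data against the warehouse table;
-- eligible work is pushed down to the remote system.
SELECT c.region, SUM(s.amount) AS revenue
FROM sales s JOIN dwh_customers c ON s.cust_id = c.cust_id
GROUP BY c.region;
```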
  31. Enterprise security •Users may be authenticated via •Operating system •Lightweight Directory Access Protocol (LDAP) •Kerberos •User authorization mechanisms include •Full GRANT/REVOKE-based security •Group- and role-based hierarchical security •Object-level, column-level, or row-level (fine-grained) access controls •Auditing •You may define audit policies and track user activity •Transport layer security (TLS) •Protects the integrity and confidentiality of data between the client and Big SQL
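A small sketch of the GRANT/REVOKE and role model (the user, role, and table names are hypothetical):

```sql
-- Role-based, hierarchical authorization: users inherit the
-- privileges of the roles they are granted.
CREATE ROLE analysts;
GRANT SELECT ON TABLE sales TO ROLE analysts;
GRANT ROLE analysts TO USER alice;

-- Revocation works symmetrically.
REVOKE SELECT ON TABLE sales FROM ROLE analysts;
```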
  32. Monitoring •Comprehensive runtime monitoring infrastructure that helps answer the question: what is going on in my system? •SQL interfaces to the monitoring data via table functions •Ability to drill down into more granular metrics for problem determination and/or detailed performance analysis •Runtime statistics collected during the execution of the section for a (SQL) access plan •Support for event monitors to track specific types of operations and activities •Protect against and discover unknown or unacceptable behaviors by monitoring data access via the audit facility [Diagram: worker threads collect metrics locally in connection control blocks and push data up incrementally; a monitor query extracts it directly from the reporting level, e.g. a service class]
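A heavily hedged sketch of the table-function interface, assuming the DB2-style MON_GET functions carry over to Big SQL 3.0 (the function name and columns are assumptions taken from DB2 LUW):

```sql
-- Per-connection activity metrics; the second argument of -1
-- asks for data from all members of the cluster.
SELECT application_handle, rows_read, total_cpu_time
FROM TABLE(MON_GET_CONNECTION(NULL, -1)) AS t
ORDER BY total_cpu_time DESC;
```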
  33. Performance, Benchmarking, Benchmarketing •Performance matters to customers •Benchmarking appeals to engineers and drives product innovation •Benchmarketing is used to convey performance in a memorable and appealing way •SQL over Hadoop is in the “Wild West” of benchmarketing •100x claims! Compared to what? Conforming to what rules? •The TPC (Transaction Processing Performance Council) is the grand-daddy of all multi-vendor SQL-oriented organizations •Formed in August 1988 •TPC-H and TPC-DS are the most relevant to SQL over Hadoop •The read/write nature of the workloads is not suitable for HDFS •The Big Data Benchmarking Community (BDBC) has been formed
  34. Power of Standard SQL •Everyone loves performance numbers, but that's not the whole story •How much work do you have to do to achieve those numbers? •A portion of our internal performance numbers are based upon industry-standard benchmarks •Big SQL is capable of executing •All 22 TPC-H queries without modification •All 99 TPC-DS queries without modification Original query: SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name Re-written for Hive: SELECT s_name, count(1) AS numwait FROM (SELECT s_name FROM (SELECT s_name, t2.l_orderkey, l_suppkey, count_suppkey, max_suppkey FROM (SELECT l_orderkey, count(distinct l_suppkey) as count_suppkey, max(l_suppkey) as max_suppkey FROM lineitem WHERE l_receiptdate > l_commitdate GROUP BY l_orderkey) t2 RIGHT OUTER JOIN (SELECT s_name, l_orderkey, l_suppkey FROM (SELECT s_name, t1.l_orderkey, l_suppkey, count_suppkey, max_suppkey FROM (SELECT l_orderkey, count(distinct l_suppkey) as count_suppkey, max(l_suppkey) as max_suppkey FROM lineitem GROUP BY l_orderkey) t1 JOIN (SELECT s_name, l_orderkey, l_suppkey FROM orders o JOIN (SELECT s_name, l_orderkey, l_suppkey FROM nation n JOIN supplier s ON s.s_nationkey = n.n_nationkey AND n.n_name = 'INDONESIA' JOIN lineitem l ON s.s_suppkey = l.l_suppkey WHERE l.l_receiptdate > l.l_commitdate) l1 ON o.o_orderkey = l1.l_orderkey AND o.o_orderstatus = 'F') l2 ON l2.l_orderkey = t1.l_orderkey) a WHERE (count_suppkey > 1) or ((count_suppkey=1) AND (l_suppkey <> max_suppkey))) l3 ON l3.l_orderkey = t2.l_orderkey) b WHERE (count_suppkey is null) OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c GROUP BY s_name ORDER BY numwait DESC, s_name
  35. Comparing Big SQL and Hive 0.12 for Ad-Hoc Queries *Based on IBM internal tests comparing IBM Infosphere Biginsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014
  36. Comparing Big SQL and Hive 0.12 for Decision Support Queries Big SQL is 10x faster than Hive 0.12 (total workload elapsed time). *Based on IBM internal tests comparing IBM Infosphere Biginsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014
  37. How many times faster is Big SQL than Hive 0.12? [Chart: queries sorted by speed-up ratio, worst to best; average speedup of 20x, maximum speedup of 74x] *Based on IBM internal tests comparing IBM Infosphere Biginsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014
  38. Conclusion •Today, it seems, performance numbers are the name of the game •But in reality there is so much more… •How rich is the SQL? •How difficult is it to (re-)use your existing SQL? •How secure is your data? •Is your data still open for other uses on Hadoop? •Can your queries span your enterprise? •Can other Hadoop workloads co-exist in harmony? •… •With Big SQL 3.0, performance doesn't mean compromise
  39. Try it now! InfoSphere BigInsights Quick Start Edition •A free, no-limit, non-production version of BigInsights •Features Big SQL, BigSheets, Text Analytics, Big R, the management console, and development tools •Tutorials and education available •ibm.co/QuickStart
  40. Please Note IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  41. Topics for upcoming meetups •R on Hadoop •File systems •MapReduce/Spark/etc. engines •Hadoop security •Spreadsheet analysis •Text analysis •?
