“Project Panthera”: Better Analytics
   with SQL, MapReduce and HBase
                                               Jason Dai
                                       Principal Engineer
                 Intel SSG (Software and Services Group)




                                        Software and Services Group
My Background and Bias

                                                                      [Figure: Intel IXP2800]
Years of development on parallel compilers
•   Lead architect of the Intel network processor
    compiler
    – Auto-partitioning & parallelizing for many-core,
      many-thread (128 HW threads in 2002) CPUs


Currently Principal Engineer in Intel SSG
•   Leading the open source Hadoop engineering team
    – HiBench, HiTune, “Project Panthera”, etc.




Agenda

Overview of “Project Panthera”

Analytical SQL engine for MapReduce

Document store for better query processing on HBase

Summary




Project Panthera

Our open source efforts to enable better analytics capabilities on
Hadoop/HBase
•   Better integration with existing infrastructure using SQL
•   Better query processing on HBase
•   Efficiently utilizing new HW platform technologies
•   Etc.




           https://github.com/intel-hadoop/project-panthera




Current Work under Project Panthera

An analytical SQL engine for MapReduce
•   Built on top of Hive
•   Provide full SQL support for OLAP

A document store for better query processing on HBase
•   A coprocessor application for HBase
•   Provide document semantics & significantly speed up query processing




Agenda

Overview of “Project Panthera”

Analytical SQL engine for MapReduce

Document store for better query processing on HBase

Summary




Full SQL Support for Hadoop Needed

Full SQL support for OLAP
•   Required in modern business application environment
    –   Business users
    –   Enterprise analytics applications
    –   Third-party tools (such as query builders and BI applications)

Hive – THE Data Warehouse for Hadoop
•   HiveQL: a SQL-like query language (subset of SQL with extensions)
    –   Significantly lowers the barrier to MapReduce
•   Still large gaps w.r.t. full analytic SQL support
    –   Multiple-table SELECT statement, subquery in WHERE clauses, etc.




An analytical SQL engine for MapReduce

   The anatomy of a query processing engine

    Query -> Parser -> AST (Abstract Syntax Tree) -> Semantic Analyzer (Optimizer) -> Execution Plan -> Execution

   Our SQL engine for MapReduce

    Query -> Driver
      SQL:    (Open Source) SQL Parser* -> SQL-AST -> SQL-AST Analyzer & Translator -> Hive-AST -> Hive Semantic Analyzer -> Hadoop MR
      HiveQL: Hive Parser -> Hive-AST -> Hive Semantic Analyzer -> Hadoop MR

    SQL-AST Analyzer & Translator: Subquery Unnesting, Multi-Table SELECT, …
    Hive Semantic Analyzer:        INTERSECT Support, MINUS Support, …

                                                                       *https://github.com/porcelli/plsql-parser



Current Status

Enable complex SQL queries (not supported by Hive today), such as,
• Subquery in WHERE clauses (using ALL, ANY, IN, EXISTS, SOME keywords)
      select * from t1 where t1.d > ALL (select z from t2 where t2.z!=9);


• Correlated subquery (i.e., a subquery referring to a column of a table not in its FROM clause)
      select * from t1 where exists ( select * from t2 where t1.b = t2.y );


• Scalar subquery (i.e., a subquery that returns exactly one column value from one row)
      select a,b,c,d,e,(select z from t2 where t2.y = t1.b and z != 99 ) from t1;


• Top-level subquery
      (select * from t1) union all (select * from t2) union all (select * from t3 order by 1);


• Multiple-table SELECT statement
      select * from t1,t2 where t1.c > t2.z;




        https://github.com/intel-hadoop/hive-0.9-panthera
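
One way to read the correlated subquery example above: "select * from t1 where exists (select * from t2 where t1.b = t2.y)" returns the same t1 rows as a left semi-join of t1 with t2 on t1.b = t2.y. The sketch below illustrates that equivalence over in-memory lists, with each t1 row reduced to its b column. It is plain Java for illustration only, not the Panthera translator; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: nested evaluation of a correlated EXISTS versus an
// unnested (semi-join) evaluation over in-memory column values.
class ExistsUnnesting {

    // Nested form: re-evaluate the subquery for every row of t1.
    static List<Integer> nested(List<Integer> t1b, List<Integer> t2y) {
        List<Integer> out = new ArrayList<>();
        for (Integer b : t1b) {
            boolean exists = false;
            for (Integer y : t2y) {
                if (b != null && b.equals(y)) {
                    exists = true;
                    break;
                }
            }
            if (exists) {
                out.add(b);
            }
        }
        return out;
    }

    // Unnested form: collect t2.y once, then probe it for each t1 row.
    static List<Integer> semiJoin(List<Integer> t1b, List<Integer> t2y) {
        Set<Integer> keys = new HashSet<>();
        for (Integer y : t2y) {
            if (y != null) {
                keys.add(y); // NULL never satisfies t1.b = t2.y
            }
        }
        List<Integer> out = new ArrayList<>();
        for (Integer b : t1b) {
            if (b != null && keys.contains(b)) {
                out.add(b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> t1b = Arrays.asList(1, 2, 2, 5, null);
        List<Integer> t2y = Arrays.asList(2, 2, 3, null, 5);
        System.out.println(nested(t1b, t2y));
        System.out.println(semiJoin(t1b, t2y));
    }
}
```

Both methods keep duplicate t1 rows and never duplicate a t1 row because of repeated matches in t2, which is what distinguishes a semi-join from an ordinary inner join.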


Current Status

NIST SQL Test Suite Version 6.0
•   http://www.itl.nist.gov/div897/ctg/sql_form.htm
•   A widely used SQL-92 conformance test suite
•   Ported to run under both Hive and the SQL engine
    –   SELECT statements only
    –   Run against Hive/SQL engine and an RDBMS to verify the results

                                Ported queries       Hive 0.9                  SQL Engine
                                from NIST         Passed   Pass rate        Passed   Pass rate
    All queries                     1015            777      76.6%            900      88.7%
    Subquery related queries          87              0         0%             72      82.8%
    Multiple-table select queries     31              0         0%             27      87.1%
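
The pass rates in the table follow directly from the passed and ported query counts. A minimal sketch of that arithmetic, using only the counts shown above (the class and method names are hypothetical):

```java
import java.util.Locale;

// Recomputes the pass-rate column from the passed / ported counts in the table.
class PassRates {

    // Pass rate as a percentage, rounded to one decimal place.
    static String passRate(int passed, int ported) {
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * passed / ported);
    }

    public static void main(String[] args) {
        System.out.println("All queries, Hive 0.9:      " + passRate(777, 1015));
        System.out.println("All queries, SQL engine:    " + passRate(900, 1015));
        System.out.println("Subquery, SQL engine:       " + passRate(72, 87));
        System.out.println("Multiple-table, SQL engine: " + passRate(27, 31));
    }
}
```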




The Path to Full SQL support for OLAP

A SQL compatible parser
•     E.g., HIVE-3561

Multiple-table SELECT statement
•     E.g., HIVE-3578

Full subquery support & optimizations
•     E.g., subquery unnesting (HIVE-3577)

Complete SQL data type system
•     E.g., DateTime types and functions (HIVE-1269)

...


                   See the umbrella JIRA HIVE-3472



Agenda

Overview of “Project Panthera”

Analytical SQL engine for MapReduce

Document store for better query processing on HBase

Summary




Query Processing on HBase

Hive (or SQL engine) over HBase
•   Store data (Hive table) in HBase
•   Query data using HiveQL or SQL
    –   Series of MapReduce jobs scanning HBase

Motivations
•   Stream new data into HBase in near realtime
•   Support high update rate workloads (to keep the warehouse always up to date)
•   Allow very low latency, online data serving
•   Etc.




Overheads of Query Processing on HBase

Space overhead
•   Fully qualified, multi-dimensional map in HBase vs. relational table
    –   2~3x space overhead (an 18-column table)

           HBase Table
    (r1,   cf1:C1, ts)   v1
    (r1,   cf1:C2, ts)   v2
    …                    …
    (r1,   cf1:Cn, ts)   vn
    (r2,   cf1:C1, ts)   vn+1
    …                    …

           Relational (Hive) Table
    Row Key   C1     C2     …    Cn
    r1        v1     v2     …    vn
    r2        vn+1   vn+2   …    v2n
    …         …      …      …    …

Performance overhead
•   ~6x performance overhead (full 18-column table scan)
•   Among many reasons
    –   Highly concurrent read/write accesses in HBase vs. read-mostly
        analytical queries




A Document Store on HBase

DOT (Document Oriented Table) on HBase
•   Each row contains a collection of documents (as well as row key)
•   Each document contains a collection of fields
•   A document is mapped to an HBase column and serialized using Avro, PB, etc.

Mapping relational table to DOT
•   Each column mapped to a field
•   Schema stored just once
•   Read overheads amortized across different fields in a document

    Row Key   C1     C2     …    Cn
    r1        v1     v2     …    vn
    r2        vn+1   vn+2   …    v2n
    …         …      …      …    …

           Implemented as an HBase Coprocessor Application
        https://github.com/intel-hadoop/hbase-0.94-panthera
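
To see why packing fields into one document can save space, recall that every HBase cell (KeyValue) repeats its own row key, column family, qualifier and timestamp. The sketch below compares the raw bytes of one cell per column against one cell per document. All lengths passed in are illustrative assumptions, serializer framing (Avro, PB) and HFile compression are ignored, and the result is not meant to reproduce the measured numbers in this deck. The class and method names are hypothetical.

```java
// Toy byte count: one KeyValue per column versus one KeyValue per document.
class DotSpaceSketch {

    // Fixed bytes per serialized KeyValue: key length (4) + value length (4)
    // + row length (2) + family length (1) + timestamp (8) + key type (1).
    static final int FIXED = 4 + 4 + 2 + 1 + 8 + 1;

    static long cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED + rowLen + familyLen + qualifierLen + valueLen;
    }

    // Plain HBase layout: each relational column becomes its own cell.
    static long perColumnLayout(int rowLen, int familyLen, int qualifierLen,
                                int valueLen, int columns) {
        return columns * cellSize(rowLen, familyLen, qualifierLen, valueLen);
    }

    // Document layout: all field values share a single cell and a single key.
    static long documentLayout(int rowLen, int familyLen, int qualifierLen,
                               int valueLen, int columns) {
        return cellSize(rowLen, familyLen, qualifierLen, columns * valueLen);
    }

    public static void main(String[] args) {
        // Assumed sizes: 8-byte row key, 1-byte family, 2-byte qualifier,
        // 8-byte values, 18 columns.
        System.out.println(perColumnLayout(8, 1, 2, 8, 18));
        System.out.println(documentLayout(8, 1, 2, 8, 18));
    }
}
```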

Working with DOT

Hive/SQL queries on DOT
•   Similar to running Hive with HBase today
    –   Create a DOT in HBase
    –   Create external Hive table with the DOT
        • Use “doc.field” in place of “column qualifier” when specifying “hbase.columns.mapping”
    –   Transparent to DML queries
        • No changes to the query or the HBase storage handler



    CREATE EXTERNAL TABLE table_dot (key INT, C1 STRING, C2 STRING, C3 DOUBLE)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:d.c1,f:d.c2,f:d.c3")
    TBLPROPERTIES ("hbase.table.name" = "table_dot");
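
The mapping string in the statement above uses "family:doc.field" where plain HBase tables use "family:qualifier". As a rough illustration of how such an entry splits into its parts, here is a small standalone parser. It is a hypothetical sketch, not the code used by the Hive storage handler or by the DOT coprocessors.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for DOT-style column mappings such as ":key,f:d.c1".
class DotColumnMapping {

    // One parsed entry; all three parts are null for the ":key" entry.
    static final class Entry {
        final String family;
        final String document;
        final String field;

        Entry(String family, String document, String field) {
            this.family = family;
            this.document = document;
            this.field = field;
        }
    }

    static List<Entry> parse(String mapping) {
        List<Entry> out = new ArrayList<>();
        for (String raw : mapping.split(",")) {
            String spec = raw.trim();
            if (spec.equals(":key")) {
                out.add(new Entry(null, null, null));
                continue;
            }
            int colon = spec.indexOf(':');
            int dot = spec.indexOf('.', colon + 1);
            // Require non-empty family, document and field parts.
            if (colon <= 0 || dot < 0 || dot == colon + 1 || dot == spec.length() - 1) {
                throw new IllegalArgumentException("expected family:doc.field, got: " + spec);
            }
            out.add(new Entry(spec.substring(0, colon),
                              spec.substring(colon + 1, dot),
                              spec.substring(dot + 1)));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Entry e : parse(":key,f:d.c1,f:d.c2,f:d.c3")) {
            System.out.println(e.family + " / " + e.document + " / " + e.field);
        }
    }
}
```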




Working with DOT

Create a DOT in HBase
•   Required to specify the schema and serializer (e.g., Avro) for each document
    –   Stored in table metadata by the preCreateTable co-processor
•   I.e., the table schema is fixed and predetermined at table creation time
    –   OK for Hive/SQL queries

HTableDescriptor desc = new HTableDescriptor("t1");
// Specify a DOT table
desc.setValue("hbase.dot.enable", "true");
desc.setValue("hbase.dot.type", "ANALYTICAL");
…
HColumnDescriptor cf2 = new HColumnDescriptor(Bytes.toBytes("cf2"));
cf2.setValue("hbase.dot.columnfamily.doc.element", "d3");    // Specify the contained document
String doc3Schema = "{\n" + "  \"name\": \"d3\",\n"
  + "  \"type\": \"record\",\n" + "  \"fields\": [\n"
  + "    {\"name\": \"f1\", \"type\": \"bytes\"},\n"
  + "    {\"name\": \"f2\", \"type\": \"bytes\"},\n"
  + "    {\"name\": \"f3\", \"type\": \"bytes\"} ]\n" + "}";
cf2.setValue("hbase.dot.columnfamily.doc.schema.d3", doc3Schema); // Specify the schema for d3
desc.addFamily(cf2);
admin.createTable(desc);




Working with DOT

Data access in HBase
•   Transparent to the user
    –   Just specify “doc.field” in place of “column qualifier”
    –   Mapping between “document”, “field” & “column qualifier” handled
        by coprocessors automatically

    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("d1.f1"))
        .addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("d3.f1"));
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("cf1"), Bytes.toBytes("d1.f1"),
        CompareFilter.CompareOp.EQUAL,
        new SubstringComparator("row1_fd1"));
    scan.setFilter(filter);
    HTable table = new HTable(conf, "t1");
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
        System.out.println(result);
    }



•   Additional check for Put/Delete today
    –   All fields in a document expected to be updated together; otherwise:
        • Warning for Put (missing field set to NULL value)
        • Error for DELETE
    –   OK for Hive queries
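
The Put/Delete rule above can be pictured as a validation step against the document schema. The following is a hypothetical sketch of that rule only (fill missing fields with NULL and warn on Put, reject a partial Delete); it does not use the HBase API and is not the coprocessor code from the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the "all fields updated together" check for a document.
class DotWriteCheck {

    // Put: every schema field is written; missing ones become null and are
    // reported as warnings.
    static Map<String, byte[]> checkPut(List<String> schemaFields,
                                        Map<String, byte[]> supplied,
                                        List<String> warnings) {
        Map<String, byte[]> full = new LinkedHashMap<>();
        for (String field : schemaFields) {
            if (!supplied.containsKey(field)) {
                warnings.add("field '" + field + "' missing in Put; set to NULL");
            }
            full.put(field, supplied.get(field));
        }
        return full;
    }

    // Delete: deleting only some fields of a document is an error.
    static void checkDelete(List<String> schemaFields, Set<String> deletedFields) {
        if (!deletedFields.containsAll(schemaFields)) {
            throw new IllegalStateException("Delete must cover all fields of the document");
        }
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("f1", "f2", "f3");
        Map<String, byte[]> supplied = new HashMap<>();
        supplied.put("f1", new byte[] {1});
        List<String> warnings = new ArrayList<>();
        Map<String, byte[]> full = checkPut(schema, supplied, warnings);
        System.out.println(full.keySet());
        System.out.println(warnings);
    }
}
```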




Some Results

Benchmarks
•   Create an 18-column table in Hive (on HBase) and load ~567 million rows



           Table storage
           • 1.7~3x space
             reduction w/ DOT




           Data loading
           • ~1.9x speedup for
             bulk load w/ DOT
           • 3~4x speedup for
             insert w/ DOT




Some Results

Benchmarks
•   Select various numbers of columns from the table
          select count (col1, col2, …, coln) from table



          SELECT performance: up to 2x speedup w/ DOT




Summary

“Project Panthera”
•   Our open source efforts to enable better analytics capabilities on Hadoop/HBase
    –   https://github.com/intel-hadoop/project-panthera/
•   An analytical SQL engine for MapReduce
    –   Provide full SQL support for OLAP
        • Complex subquery, multiple-table SELECT, etc.
    –   Umbrella JIRA HIVE-3472
•   A document store for better query processing on HBase
    –   Provide document semantics & significantly speed up query processing
        • Up to 3x storage reduction, up to 2x performance speedup
    –   Umbrella JIRA HBASE-6800




Thank You!


This slide deck and other related information will be available at
         http://software.intel.com/user/335224/track




                         Any questions?





More Related Content

What's hot

Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera, Inc.
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopChicago Hadoop Users Group
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisFelicia Haggarty
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARNWangda Tan
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopJan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopYahoo Developer Network
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 
Bdm hadoop ecosystem
Bdm hadoop ecosystemBdm hadoop ecosystem
Bdm hadoop ecosystemAmit Bhardwaj
 
HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017larsgeorge
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera, Inc.
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 
Hadoop summit-diverse-workload
Hadoop summit-diverse-workloadHadoop summit-diverse-workload
Hadoop summit-diverse-workloadWangda Tan
 
Hivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveHivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveDataWorks Summit
 
Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016Wangda Tan
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillDataWorks Summit
 

What's hot (20)

Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache HadoopJan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Bdm hadoop ecosystem
Bdm hadoop ecosystemBdm hadoop ecosystem
Bdm hadoop ecosystem
 
HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
Hadoop summit-diverse-workload
Hadoop summit-diverse-workloadHadoop summit-diverse-workload
Hadoop summit-diverse-workload
 
Hivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveHivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache Hive
 
Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache Drill
 

Viewers also liked

Videreutdanning 160610
Videreutdanning 160610Videreutdanning 160610
Videreutdanning 160610hakva
 
“Feria del Conocimiento América Latina y el Caribe: Casos destacados en agric...
“Feria del Conocimiento América Latina y el Caribe: Casos destacados en agric...“Feria del Conocimiento América Latina y el Caribe: Casos destacados en agric...
“Feria del Conocimiento América Latina y el Caribe: Casos destacados en agric...CIAT
 
Presentación Proyecto Crónicas de Jóvenes Emprendedores (AJE Granada)
Presentación Proyecto Crónicas de Jóvenes Emprendedores (AJE Granada)Presentación Proyecto Crónicas de Jóvenes Emprendedores (AJE Granada)
Presentación Proyecto Crónicas de Jóvenes Emprendedores (AJE Granada)Esco Granada
 
Conferencia: Herramientas Estratégicas de Gestión: Responsabilidad Social Cor...
Conferencia: Herramientas Estratégicas de Gestión: Responsabilidad Social Cor...Conferencia: Herramientas Estratégicas de Gestión: Responsabilidad Social Cor...
Conferencia: Herramientas Estratégicas de Gestión: Responsabilidad Social Cor...María Dolores Sánchez-Fernández, PhD.
 
EL MALPARIDO -NOVELA EN VERSIÓN DIGITAL GRATUITA-
EL MALPARIDO -NOVELA EN VERSIÓN DIGITAL GRATUITA-EL MALPARIDO -NOVELA EN VERSIÓN DIGITAL GRATUITA-
EL MALPARIDO -NOVELA EN VERSIÓN DIGITAL GRATUITA-Iten Mario Mendoza Camacho
 
Las comunidades de aprendizaje
Las comunidades de aprendizajeLas comunidades de aprendizaje
Las comunidades de aprendizajeEzequiel-Tarazona
 
Copernica - Cross Mail Presentatie
Copernica - Cross Mail PresentatieCopernica - Cross Mail Presentatie
Copernica - Cross Mail PresentatieCopernica BV
 
Introduction to terrastore
Introduction to terrastoreIntroduction to terrastore
Introduction to terrastoresvjson
 
Applus IAT BUMP
Applus IAT BUMPApplus IAT BUMP
Applus IAT BUMPwichyfly
 
Tutorial / Manual / how-to-set-up: ’Remember the Milk’ as your task managemen...
Tutorial / Manual / how-to-set-up: ’Remember the Milk’ as your task managemen...Tutorial / Manual / how-to-set-up: ’Remember the Milk’ as your task managemen...
Tutorial / Manual / how-to-set-up: ’Remember the Milk’ as your task managemen...patrick werkt slimmer
 
Análisis de vibraciones de un tren de maquinaria
Análisis de vibraciones de un tren de maquinariaAnálisis de vibraciones de un tren de maquinaria
Análisis de vibraciones de un tren de maquinariaHipolito Condori
 
Presentacion Malaga CF Pedro Jimenez
Presentacion Malaga CF Pedro JimenezPresentacion Malaga CF Pedro Jimenez
Presentacion Malaga CF Pedro JimenezPedroJmnz
 
OpenStack at EBSCO
OpenStack at EBSCOOpenStack at EBSCO
OpenStack at EBSCOTesora
 
Historias y cuentos online
Historias y cuentos onlineHistorias y cuentos online
Historias y cuentos onlineJuan Quintana
 
El tren de Arganda: trayecto ferroviario entre la estación del Niño Jesús y l...
El tren de Arganda: trayecto ferroviario entre la estación del Niño Jesús y l...El tren de Arganda: trayecto ferroviario entre la estación del Niño Jesús y l...
El tren de Arganda: trayecto ferroviario entre la estación del Niño Jesús y l...aljubarrota
 
Pizza Point Of Sale System - SpeedLine Overview
Pizza Point Of Sale System - SpeedLine OverviewPizza Point Of Sale System - SpeedLine Overview
Pizza Point Of Sale System - SpeedLine Overviewcarmensadie
 
Web Hooks and the Programmable World of Tomorrow
Web Hooks and the Programmable World of TomorrowWeb Hooks and the Programmable World of Tomorrow
Web Hooks and the Programmable World of TomorrowJeff Lindsay
 

Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase

  • 1. “Project Panthera”: Better Analytics with SQL, MapReduce and HBase Jason Dai Principal Engineer Intel SSG (Software and Services Group) Software and Services Group
  • 2. My Background and Bias
    [Figure: Intel IXP2800]
    Years of development on parallel compilers
    • Lead architect of the Intel network processor compiler
      – Auto-partitioning & parallelizing for a many-core, many-thread CPU (128 HW threads in 2002)
    Currently Principal Engineer in Intel SSG
    • Leading the open source Hadoop engineering team
      – HiBench, HiTune, “Project Panthera”, etc.
  • 3. Agenda
    – Overview of “Project Panthera”
    – Analytical SQL engine for MapReduce
    – Document store for better query processing on HBase
    – Summary
  • 4. Project Panthera
    Our open source efforts to enable better analytics capabilities on Hadoop/HBase
    • Better integration with existing infrastructure using SQL
    • Better query processing on HBase
    • Efficiently utilizing new HW platform technologies
    • Etc.
    https://github.com/intel-hadoop/project-panthera
  • 5. Current Work under Project Panthera
    An analytical SQL engine for MapReduce
    • Built on top of Hive
    • Provides full SQL support for OLAP
    A document store for better query processing on HBase
    • A co-processor application for HBase
    • Provides document semantics and significantly speeds up query processing
  • 6. Agenda
    – Overview of “Project Panthera”
    – Analytical SQL engine for MapReduce
    – Document store for better query processing on HBase
    – Summary
  • 7. Full SQL Support for Hadoop Needed
    Full SQL support for OLAP
    • Required in modern business application environments
      – Business users
      – Enterprise analytics applications
      – Third-party tools (such as query builders and BI applications)
    Hive – THE Data Warehouse for Hadoop
    • HiveQL: a SQL-like query language (subset of SQL with extensions)
      – Significantly lowers the barrier to MapReduce
    • Still large gaps w.r.t. full analytic SQL support
      – Multiple-table SELECT statement, subquery in WHERE clauses, etc.
  • 8. An analytical SQL engine for MapReduce
    The anatomy of a query processing engine
    [Diagram, reconstructed from its labels: Query → Parser → AST (Abstract Syntax Tree) → Semantic Analyzer (Optimizer) → Execution Plan → Execution]
    Our SQL engine for MapReduce
    [Diagram, reconstructed from its labels: SQL → (Open Source) SQL Parser* → SQL-AST → SQL-AST Analyzer & Translator → Hive-AST → Hive Semantic Analyzer → Hive Query Driver → Hadoop MR. HiveQL → Hive Parser → Hive-AST. Boxes inside the SQL-AST Analyzer & Translator: Subquery Unnesting, Multi-Table SELECT Support, INTERSECT Support, MINUS Support, …]
    *https://github.com/porcelli/plsql-parser
  • 9. Current Status
    Enables complex SQL queries (not supported by Hive today), such as:
    • Subquery in WHERE clauses (using the ALL, ANY, IN, EXISTS, SOME keywords)
        select * from t1 where t1.d > ALL (select z from t2 where t2.z!=9);
    • Correlated subquery (i.e., a subquery referring to a column of a table not in its FROM clause)
        select * from t1 where exists ( select * from t2 where t1.b = t2.y );
    • Scalar subquery (i.e., a subquery that returns exactly one column value from one row)
        select a,b,c,d,e,(select z from t2 where t2.y = t1.b and z != 99 ) from t1;
    • Top-level subquery
        (select * from t1) union all (select * from t2) union all (select * from t3 order by 1);
    • Multiple-table SELECT statement
        select * from t1,t2 where t1.c > t2.z;
    https://github.com/intel-hadoop/hive-0.9-panthera
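    To make the query classes on slide 9 concrete, here is a minimal sketch that runs several of them against SQLite through Python's standard `sqlite3` module. It is not Panthera and does not run on Hive or MapReduce. The table contents, the column `x`, and the column types are assumptions made for illustration. SQLite does not accept the `ALL` quantifier, so that query is written in an equivalent `NOT EXISTS` form, which matches `> ALL` only when the compared columns contain no NULLs. The join form of the correlated `EXISTS` query gives one flavor of what subquery unnesting can look like; it is an illustrative rewrite, not necessarily the one the engine performs.

```python
import sqlite3

# In-memory database with toy data (values and column x are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, b INTEGER, c INTEGER, d INTEGER, e INTEGER);
    CREATE TABLE t2 (x INTEGER, y INTEGER, z INTEGER);
    INSERT INTO t1 VALUES (1, 10, 5, 7, 0), (2, 20, 1, 3, 0), (3, 30, 9, 5, 0);
    INSERT INTO t2 VALUES (1, 10, 4), (2, 30, 9), (3, 40, 6);
""")


def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Correlated subquery: the inner query refers to t1.b from the outer query.
exists_rows = rows("""
    SELECT a FROM t1
    WHERE EXISTS (SELECT * FROM t2 WHERE t1.b = t2.y)
    ORDER BY a
""")

# The same result expressed as a join (valid here because t1.a is unique).
join_rows = rows("""
    SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.b = t2.y
    ORDER BY t1.a
""")

# "t1.d > ALL (select z from t2 where t2.z != 9)" rewritten for SQLite:
# no qualifying t2 row may violate the comparison (assumes no NULLs).
all_rows = rows("""
    SELECT a FROM t1
    WHERE NOT EXISTS (
        SELECT 1 FROM t2 WHERE t2.z != 9 AND NOT (t1.d > t2.z)
    )
    ORDER BY a
""")

# Scalar subquery in the select list; yields NULL when no row matches.
scalar_rows = rows("""
    SELECT a, (SELECT z FROM t2 WHERE t2.y = t1.b AND z != 99)
    FROM t1 ORDER BY a
""")

# Multiple-table SELECT with a non-equality join condition.
multi_rows = rows("""
    SELECT t1.a, t2.x FROM t1, t2
    WHERE t1.c > t2.z
    ORDER BY t1.a, t2.x
""")

assert exists_rows == join_rows  # the two formulations agree on this data
for name, result in [("exists", exists_rows), ("all", all_rows),
                     ("scalar", scalar_rows), ("multi", multi_rows)]:
    print(name, result)
```

    Comparing results against a conventional database in this way mirrors, in miniature, the verification approach described on slide 10.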
  • 10. Current Status
    NIST SQL Test Suite Version 6.0
    • http://www.itl.nist.gov/div897/ctg/sql_form.htm
    • A widely used SQL-92 conformance test suite
    • Ported to run under both Hive and the SQL engine
      – SELECT statements only
      – Run against Hive/SQL engine and an RDBMS to verify the results

                                     Ported from NIST   Hive 0.9              SQL Engine
                                     Query#             Passed#   Pass Rate   Passed#   Pass Rate
      All queries                    1015               777       76.6%       900       88.7%
      Subquery related queries       87                 0         0%          72        82.8%
      Multiple-table select queries  31                 0         0%          27        87.1%
  • 11. The Path to Full SQL Support for OLAP
    A SQL compatible parser
    • E.g., HIVE-3561
    Multiple-table SELECT statement
    • E.g., HIVE-3578
    Full subquery support & optimizations
    • E.g., subquery unnesting (HIVE-3577)
    Complete SQL data type system
    • E.g., DateTime types and functions (HIVE-1269)
    ...
    See the umbrella JIRA HIVE-3472
  • 12. Agenda: Overview of “Project Panthera”; Analytical SQL engine for MapReduce; Document store for better query processing on HBase; Summary
  • 13. Query Processing on HBase
    Hive (or SQL engine) over HBase
    • Store data (Hive table) in HBase
    • Query data using HiveQL or SQL
      – Series of MapReduce jobs scanning HBase
    Motivations
    • Stream new data into HBase in near real time
    • Support high-update-rate workloads (to keep the warehouse always up to date)
    • Allow very low latency, online data serving
    • Etc.
  • 14. Overheads of Query Processing on HBase
    Space overhead
    • Fully qualified, multi-dimensional map in HBase vs. relational table
    • 2~3x space overhead (an 18-column table)

      HBase Table                    Relational (Hive) Table
      (r1, cf1:C1, ts) → v1          Row Key   C1     C2     …   Cn
      (r1, cf1:C2, ts) → v2          r1        v1     v2     …   vn
      …                              r2        vn+1   vn+2   …   v2n
      (r1, cf1:Cn, ts) → vn          …         …      …      …   …
      (r2, cf1:C1, ts) → vn+1
      …

    Performance overhead
    • ~6x performance overhead (full 18-column table scan)
    • Among many reasons
      – Highly concurrent read/write accesses in HBase vs. read-mostly analytical queries
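The space overhead on slide 14 follows from HBase repeating the full (row key, column family, qualifier, timestamp) key in every cell. A rough sketch of that arithmetic, assuming the uncompressed KeyValue layout of HBase 0.94-era releases (without tags) and made-up key and value sizes rather than the benchmark's. It ignores compression, block encoding, and serializer overhead, so it is not expected to reproduce the measured 2~3x. The class and method names are hypothetical.

```java
// Hypothetical back-of-the-envelope estimate; not HBase or Panthera code.
public class HBaseCellSizeSketch {

    // Fixed bytes per KeyValue: key length (4) + value length (4)
    // + row length (2) + family length (1) + timestamp (8) + key type (1).
    public static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    // Bytes for a single cell, including its fully qualified key.
    public static long cellBytes(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    // One logical row stored as one cell per column (plain HBase table).
    public static long rowBytesPerColumn(int columns, int rowLen, int familyLen,
                                         int qualifierLen, int valueLen) {
        return (long) columns * cellBytes(rowLen, familyLen, qualifierLen, valueLen);
    }

    // The same row stored as a single cell holding all column values
    // (one document); serializer framing is deliberately ignored.
    public static long rowBytesAsDocument(int columns, int rowLen, int familyLen,
                                          int qualifierLen, int valueLen) {
        return cellBytes(rowLen, familyLen, qualifierLen, columns * valueLen);
    }

    public static void main(String[] args) {
        // Assumed sizes: 16-byte row key, 3-byte family, 2-byte qualifier, 8-byte values.
        int columns = 18, rowLen = 16, familyLen = 3, qualifierLen = 2, valueLen = 8;
        long perColumn = rowBytesPerColumn(columns, rowLen, familyLen, qualifierLen, valueLen);
        long asDocument = rowBytesAsDocument(columns, rowLen, familyLen, qualifierLen, valueLen);
        System.out.println("one cell per column: " + perColumn + " bytes");
        System.out.println("one document cell:   " + asDocument + " bytes");
    }
}
```

With these assumed sizes the key is paid 18 times in the first layout and once in the second, which is the effect the document store on the following slides targets.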
  • 15. A Document Store on HBase
    DOT (Document Oriented Table) on HBase
    • Each row contains a collection of documents (as well as the row key)
    • Each document contains a collection of fields
    • A document is mapped to an HBase column and serialized using Avro, PB, etc.
    Mapping relational table to DOT
    • Each column mapped to a field
    • Schema stored just once
    • Read overheads amortized across different fields in a document

      Row Key   C1     C2     …   Cn
      r1        v1     v2     …   vn
      r2        vn+1   vn+2   …   v2n
      …         …      …      …   …

    Implemented as an HBase Coprocessor Application
    https://github.com/intel-hadoop/hbase-0.94-panthera
  • 16. Working with DOT
    Hive/SQL queries on DOT
    • Similar to running Hive with HBase today
      – Create a DOT in HBase
      – Create external Hive table with the DOT
        • Use “doc.field” in place of “column qualifier” when specifying “hbase.columns.mapping”
      – Transparent to DML queries
        • No changes to the query or the HBase storage handler

      CREATE EXTERNAL TABLE table_dot (key INT, C1 STRING, C2 STRING, C3 DOUBLE)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:d.c1,f:d.c2,f:d.c3")
      TBLPROPERTIES ("hbase.table.name" = "table_dot");
  • 17. Working with DOT
    Create a DOT in HBase
    • Required to specify the schema and serializer (e.g., Avro) for each document
      – Stored in table metadata by the preCreateTable coprocessor
    • I.e., the table schema is fixed and predetermined at table creation time
      – OK for Hive/SQL queries

      HTableDescriptor desc = new HTableDescriptor("t1");
      // Specify a DOT table
      desc.setValue("hbase.dot.enable", "true");
      desc.setValue("hbase.dot.type", "ANALYTICAL");
      …
      HColumnDescriptor cf2 = new HColumnDescriptor(Bytes.toBytes("cf2"));
      // Specify the contained document
      cf2.setValue("hbase.dot.columnfamily.doc.element", "d3");
      // Avro record schema for document d3
      String doc3 = "{\n"
          + "  \"name\": \"d3\",\n"
          + "  \"type\": \"record\",\n"
          + "  \"fields\": [\n"
          + "    {\"name\": \"f1\", \"type\": \"bytes\"},\n"
          + "    {\"name\": \"f2\", \"type\": \"bytes\"},\n"
          + "    {\"name\": \"f3\", \"type\": \"bytes\"} ]\n"
          + "}";
      // Specify the schema for d3
      cf2.setValue("hbase.dot.columnfamily.doc.schema.d3", doc3);
      desc.addFamily(cf2);
      admin.createTable(desc);
  • 18. Working with DOT
    Data access in HBase
    • Transparent to the user
      – Just specify “doc.field” in place of “column qualifier”
      – Mapping between “document”, “field” & “column qualifier” handled by coprocessors automatically
    • Additional check for Put/Delete today
      – All fields in a document expected to be updated together; otherwise:
        • Warning for Put (missing field set to NULL value)
        • Error for Delete
      – OK for Hive queries

      Scan scan = new Scan();
      // Qualifiers are given as "doc.field"
      scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("d1.f1"))
          .addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("d3.f1"));
      SingleColumnValueFilter filter = new SingleColumnValueFilter(
          Bytes.toBytes("cf1"), Bytes.toBytes("d1.f1"),
          CompareFilter.CompareOp.EQUAL, new SubstringComparator("row1_fd1"));
      scan.setFilter(filter);
      HTable table = new HTable(conf, "t1");
      ResultScanner scanner = table.getScanner(scan);
      for (Result result : scanner) {
        System.out.println(result);
      }
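The “doc.field” convention on slide 18 implies that requested qualifiers can be grouped by document, so that each stored document is read once and then projected onto the requested fields. The sketch below shows one way such grouping could look. It is an assumption for illustration, not the Panthera coprocessor code: the class and method names are hypothetical, and it assumes the first dot separates the document name from the field name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "doc.field" handling; not Panthera code.
public class DocFieldMappingSketch {

    // Groups qualifiers such as "d1.f1" by document name, preserving
    // the order in which documents and fields were requested.
    public static Map<String, List<String>> groupByDocument(List<String> qualifiers) {
        Map<String, List<String>> byDoc = new LinkedHashMap<>();
        for (String q : qualifiers) {
            int dot = q.indexOf('.');
            // Reject qualifiers with no dot, an empty document, or an empty field.
            if (dot <= 0 || dot == q.length() - 1) {
                throw new IllegalArgumentException("expected doc.field, got: " + q);
            }
            String doc = q.substring(0, dot);
            String field = q.substring(dot + 1);
            byDoc.computeIfAbsent(doc, k -> new ArrayList<>()).add(field);
        }
        return byDoc;
    }

    public static void main(String[] args) {
        // Two fields of d1 resolve to one stored document, d3 to another.
        System.out.println(groupByDocument(Arrays.asList("d1.f1", "d1.f2", "d3.f1")));
    }
}
```

Grouping this way is what lets read overhead be amortized across the fields of a document: three requested fields here touch only two stored columns.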
  • 19. Some Results
    Benchmarks
    • Create an 18-column table in Hive (on HBase) and load ~567 million rows
    Table storage
    • 1.7~3x space reduction w/ DOT
    Data loading
    • ~1.9x speedup for bulk load w/ DOT
    • 3~4x speedup for insert w/ DOT
  • 20. Some Results
    Benchmarks
    • Select various numbers of columns from the table
        select count (col1, col2, …, coln) from table
    SELECT performance: up to 2x speedup w/ DOT
  • 21. Summary
    “Project Panthera”
    • Our open source efforts to enable better analytics capabilities on Hadoop/HBase
      – https://github.com/intel-hadoop/project-panthera/
    • An analytical SQL engine for MapReduce
      – Provide full SQL support for OLAP
        • Complex subquery, multiple-table SELECT, etc.
      – Umbrella JIRA HIVE-3472
    • A document store for better query processing on HBase
      – Provide document semantics & significantly speed up query processing
        • Up to 3x storage reduction, up to 2x performance speedup
      – Umbrella JIRA HBASE-6800
  • 22. Thank You!
    This slide deck and other related information will be available at http://software.intel.com/user/335224/track
    Any questions?