Integration of Apache Hive and HBase
Enis Soztutar
enis [at] apache [dot] org
@enissoz




Architecting the Future of Big Data
 © Hortonworks Inc. 2011              Page 1
About Me

•  User and committer of Hadoop since 2007
•  Contributor to Apache Hadoop, HBase, Hive and Gora
•  Joined Hortonworks as Member of Technical Staff
•  Twitter: @enissoz




Agenda

•  Overview of Hive and HBase
•  Hive + HBase Features and Improvements
•  Future of Hive and HBase
•  Q&A




Apache Hive Overview
• Apache Hive is a data warehouse system for Hadoop
• SQL-like query language called HiveQL
• Built for PB scale data
• Main purpose is analysis and ad hoc querying
• Database / table / partition / bucket – DDL Operations
• SQL Types + Complex Types (ARRAY, MAP, etc)
• Very extensible
• Not for : small data sets, low latency queries, OLTP



Apache Hive Architecture
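[Diagram: Hive architecture. Front ends: CLI, Hive Thrift Server (accessed via JDBC/ODBC), Hive Web Interface. Driver: Parser, Planner, Optimizer, Execution. The Driver reaches the Metastore (backed by an RDBMS) through the MSClient. Execution runs on MapReduce over HDFS.]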

Overview of Apache HBase
• Apache HBase is the Hadoop database
• Modeled after Google’s BigTable
• A sparse, distributed, persistent multi- dimensional sorted
  map
• The map is indexed by a row key, column key, and a
  timestamp
• Each value in the map is an un-interpreted array of bytes
• Low latency random data access
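
As a rough illustration of the data model above, the sorted map can be mimicked in plain Python. This is a toy sketch, not how HBase is implemented; the `ToyTable` class and its method names are hypothetical and exist only for this example.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of a (row key, column key, timestamp) -> bytes map."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        # Values are stored as uninterpreted bytes.
        self._rows[row][column][timestamp] = bytes(value)

    def get(self, row, column):
        """Return the value with the newest timestamp, or None."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row=b""):
        """Yield (row, column, value) in row-key order from start_row on."""
        for row in sorted(self._rows):
            if row < start_row:
                continue
            for column in sorted(self._rows[row]):
                yield row, column, self.get(row, column)
```

Rows come back in byte order of the row key, which is why key design matters for range scans.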




Overview of Apache HBase
• Logical view:




                                               From: Bigtable: A Distributed Storage System for Structured Data, Chang, et al.




Apache HBase Architecture

[Diagram: HBase architecture. A Client, the HMaster and Zookeeper sit above three Region servers; each Region server hosts several Regions, and all of them store their data on HDFS.]

Hive + HBase Features and Improvements




Hive + HBase Motivation
• Hive and HBase have different characteristics:
  Hive                         vs.            HBase
  High latency                                Low latency
  Structured                                  Unstructured
  Analysts                                    Programmers

• Hive data warehouses on Hadoop are high latency
  – Long ETL times
  – No access to real-time data
• Analyzing HBase data with MapReduce requires custom
  coding
• Many analysts already know Hive and SQL

Use Case 1: HBase as ETL Data Sink




From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010


Use Case 2: HBase as Data Source




From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010


Use Case 3: Low Latency Warehouse




From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010


Example: Hive + HBase (HBase table)
hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME => 's'}

hbase(main):014:0> scan 'short_urls'

ROW                   COLUMN+CELL
 bit.ly/aaaa          column=s:hits, value=100
 bit.ly/aaaa          column=u:url, value=hbase.apache.org/
 bit.ly/abcd          column=s:hits, value=123
 bit.ly/abcd          column=u:url, value=example.com/foo
Example: Hive + HBase (Hive table)
CREATE TABLE short_urls(
   short_url string,
   url string,
   hit_count int
)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,u:url,s:hits")

TBLPROPERTIES
("hbase.table.name" = "short_urls");
Storage Handler
• Hive defines the HiveStorageHandler interface for different storage
  backends: HBase / Cassandra / MongoDB / etc.
• Storage Handler has hooks for
  –  Getting input / output formats
  –  Metadata operations: CREATE TABLE, DROP TABLE, etc.
• Storage Handler is a table-level concept
  –  Does not support Hive partitions and buckets




Apache Hive + HBase Architecture
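[Diagram: Hive + HBase architecture. Front ends: CLI, Hive Thrift Server, Hive Web Interface. Driver: Parser, Planner, Optimizer, Execution, with the MSClient connecting to the Metastore (backed by an RDBMS). A StorageHandler layer sits between the Driver and HBase; MapReduce and HBase both sit on HDFS.]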

Hive + HBase Integration
• For Input/OutputFormat, getSplits(), etc., the underlying HBase
  classes are used
• Column selection and certain filters can be pushed down
• HBase tables can be used with other (Hadoop native) tables
  and SQL constructs
• Hive DDL operations are converted to HBase DDL
  operations via the client hook
  – All operations are performed by the client
  – No two-phase commit




Schema / Type Mapping




Schema Mapping
•  Hive table + columns + column types <=> HBase table + column
   families (+ column qualifiers)
•  Every field in Hive table is mapped in order to either
   – The table key (using :key as selector)
   – A column family (cf:) -> MAP fields in Hive
   – A column (cf:cq)
•  The Hive table does not need to include all columns in HBase
•  CREATE TABLE short_urls(
       short_url string,
       url string,
       hit_count int,
       props map<string,string>
   )
   WITH SERDEPROPERTIES
   ("hbase.columns.mapping" = ":key,u:url,s:hits,p:")
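
The in-order pairing described above can be sketched in a few lines of Python. This is an illustrative parser only, not Hive's actual code; the function name `parse_columns_mapping` and the tuple shapes it returns are assumptions made for this example, and the whitespace stripping is a convenience of the sketch rather than documented Hive behavior.

```python
def parse_columns_mapping(hive_columns, mapping):
    """Pair Hive column names with HBase targets, in order.

    Each target is ("key",), ("family", cf) or ("column", cf, cq).
    """
    entries = [e.strip() for e in mapping.split(",")]
    if len(entries) != len(hive_columns):
        raise ValueError("mapping must have one entry per Hive column")
    result = {}
    for name, entry in zip(hive_columns, entries):
        if entry == ":key":
            # The row key selector.
            result[name] = ("key",)
            continue
        family, sep, qualifier = entry.partition(":")
        if not sep or not family:
            raise ValueError(f"bad mapping entry: {entry!r}")
        if qualifier:
            # cf:cq maps to a single HBase column.
            result[name] = ("column", family, qualifier)
        else:
            # cf: maps a whole column family to a Hive MAP field.
            result[name] = ("family", family)
    return result
```

For the table above, `props` resolves to the whole `p` family while `url` resolves to the single column `u:url`.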

Type Mapping
• Recently added to Hive (0.9.0)
• Previously all types were being converted to strings in HBase
• Hive has:
  – Primitive types: INT, STRING, BINARY, DATE, etc
  – ARRAY<Type>
  – MAP<PrimitiveType, Type>
  – STRUCT<a:INT, b:STRING, c:STRING>
• HBase does not have types
  – Bytes.toBytes()
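
To make the difference between string and binary storage concrete, here is a small Python sketch of the two encodings for an INT value. It only mirrors the byte layouts (HBase's `Bytes.toBytes(int)` produces 4 big-endian bytes); the helper names are hypothetical.

```python
import struct


def encode_int_as_string(value):
    # String storage: the decimal text of the number, UTF-8 encoded.
    return str(value).encode("utf-8")


def encode_int_as_binary(value):
    # Binary storage: 4-byte big-endian two's complement, the same
    # layout that Bytes.toBytes(int) produces.
    return struct.pack(">i", value)


def decode_int_from_binary(raw):
    # Inverse of encode_int_as_binary.
    return struct.unpack(">i", raw)[0]
```

The string form varies in length with the number of digits, while the binary form is always 4 bytes.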




Type Mapping
• Table level property
  "hbase.table.default.storage.type" = "binary"
• Type mapping can be given per column after #
  – Any prefix of "binary", e.g. u:url#b
  – Any prefix of "string", e.g. u:url#s
  – The dash char "-" (use the table default), e.g. u:url#-

CREATE TABLE short_urls(
   short_url string,
   url string,
   hit_count int,
   props map<string,string>
)
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s")
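
The per-column suffix rules can be sketched as follows. This is a simplified illustration, not Hive's parser: the function name is hypothetical, and it assumes that an entry with no suffix, like the dash, falls back to the table-level default and that the default is "string" when the table property is not set.

```python
def resolve_storage_types(mapping, table_default="string"):
    """Return {mapping entry: "binary" | "string"} for each entry."""
    resolved = {}
    for entry in mapping.split(","):
        target, sep, suffix = entry.strip().partition("#")
        if not sep or suffix == "-":
            # No suffix, or an explicit dash: use the table default
            # (an assumption of this sketch).
            storage = table_default
        elif suffix and "binary".startswith(suffix):
            storage = "binary"
        elif suffix and "string".startswith(suffix):
            storage = "string"
        else:
            raise ValueError(f"unknown storage type: {suffix!r}")
        resolved[target] = storage
    return resolved
```

Applied to the mapping above, every entry resolves to binary except the `p:` family, which stays string.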

Type Mapping
• If the type is not a primitive or Map, it is converted to a JSON
  string and serialized
• Still a few rough edges for schema and type mapping:
   – No Hive BINARY support in HBase mapping
   – No mapping of HBase timestamp (can only provide put
     timestamp)
   – No arbitrary mapping of Structs / Arrays into HBase schema




Bulk Load
• Steps to bulk load:
   – Sample source data for range partitioning
   – Save sampling results to a file
   – Run CLUSTER BY query using HiveHFileOutputFormat and
     TotalOrderPartitioner
   – Import HFiles into HBase table
• Ideal setup should be
   SET hive.hbase.bulk=true
   INSERT OVERWRITE TABLE web_table SELECT ….
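
The sampling and range-partitioning steps can be illustrated with a small Python sketch. It only models the idea behind a total-order partitioner (sorted split points plus a binary search); it is not the Hadoop or Hive code, and the function names and parameters are hypothetical.

```python
import bisect
import random


def sample_split_points(keys, num_partitions, sample_size=1000, seed=0):
    """Pick num_partitions - 1 split keys from a random sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    # Evenly spaced positions in the sorted sample become the boundaries.
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Index of the range partition that key falls into."""
    # Keys below the first split point go to partition 0, and so on.
    return bisect.bisect_right(split_points, key)
```

Because every key in partition i sorts before every key in partition i + 1, each reducer can write a sorted file covering one contiguous key range, which is what the import step needs.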




Filter Pushdown




Filter Pushdown
• The idea is to pass filter expressions down to the storage layer to
  minimize scanned data
• Lets the storage layer use its indexes (HDFS or HBase)
• Example:
   CREATE EXTERNAL TABLE users (userid BIGINT, email STRING, … )
   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
   WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…");


   SELECT ... FROM users WHERE userid > 1000000 AND email LIKE '%@gmail.com';



-> scan.setStartRow(Bytes.toBytes(1000000))
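
Why a start row helps can be shown with a toy Python model of a table sorted by row key. This is a sketch under simplifying assumptions (non-negative ids, 8-byte big-endian keys, an in-memory list standing in for the table); the helper names are hypothetical and nothing here calls HBase.

```python
import bisect
import struct


def encode_key(userid):
    # 8-byte big-endian encoding; for non-negative values the byte
    # order matches the numeric order.
    return struct.pack(">q", userid)


def decode_key(raw):
    # Inverse of encode_key.
    return struct.unpack(">q", raw)[0]


def scan_from(sorted_rows, start_key):
    """Return the rows whose key is >= start_key, found by binary search."""
    keys = [key for key, _ in sorted_rows]
    first = bisect.bisect_left(keys, start_key)
    return sorted_rows[first:]
```

The start row is inclusive in this sketch, so the strict `userid > 1000000` comparison and the `LIKE` condition still have to be re-checked on the rows that come back.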

Filter Decomposition
• Optimizer pushes down the predicates to the query plan
• Storage handlers can negotiate with the Hive optimizer to
  decompose the filter
  x > 3 AND upper(y) = 'XYZ'
• Handle x > 3, send upper(y) = 'XYZ' as residual for Hive
• Works with:
  key = 3, key > 3, etc
  key > 3 AND key < 100
• Only works against constant expressions
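
The decomposition can be sketched as splitting a list of AND-ed predicates into a pushed part and a residual part. This is a simplified model, not the Hive optimizer API: the tuple representation, the function name and the exact set of pushable operators are assumptions for illustration.

```python
def decompose(conjuncts, key_column="key"):
    """Split AND-ed predicates into (pushed, residual) lists.

    Each conjunct is a (column, operator, operand) tuple. Only
    comparisons of the key column against a constant are pushed.
    """
    pushable_ops = {"=", "<", "<=", ">", ">="}
    pushed, residual = [], []
    for column, op, operand in conjuncts:
        is_constant = isinstance(operand, (int, float, str, bytes))
        if column == key_column and op in pushable_ops and is_constant:
            pushed.append((column, op, operand))
        else:
            # Anything else stays with Hive as the residual filter.
            residual.append((column, op, operand))
    return pushed, residual
```

With `x` as the key column, `x > 3` is pushed to the storage handler and `upper(y) = 'XYZ'` is left for Hive to evaluate on the returned rows.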


Security Aspects
Towards fully secure deployments




Security – Big Picture
• Security becomes more important to support enterprise level
  and multi tenant applications
• 5 Different Components to ensure / impose security
  – HDFS
  – MapReduce
  – HBase
  – Zookeeper
  – Hive
• Each component has:
  – Authentication
  – Authorization


HBase Security – Closer look
• Released with HBase 0.92
• Fully optional module, disabled by default
• Needs an underlying secure Hadoop release
• SecureRPCEngine: optional engine enforcing SASL
  authentication
  – Kerberos
  – DIGEST-MD5 based tokens
  – TokenProvider coprocessor
• Access control is implemented as a Coprocessor:
  AccessController
• Stores and distributes ACL data via Zookeeper
  – Sensitive data is only accessible by HBase daemons
  – Client does not need to authenticate to zk
Hive Security – Closer look
• Hive has different deployment options; security considerations
  should take the different deployments into account
• Authentication is only supported at Metastore, not on
  HiveServer, web interface, JDBC
• Authorization is enforced at the query layer (Driver)
• Pluggable authorization providers. Default one stores global/
  table/partition/column permissions in Metastore

GRANT ALTER ON TABLE web_table TO USER bob;
CREATE ROLE db_reader;
GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader;

Hive Deployment Option 1
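[Diagram: Client, CLI, Driver (Parser, Planner, Optimizer, Execution), MSClient, Metastore, MapReduce, HBase, HDFS, RDBMS. Authorization is marked at the Driver and Authentication at the Metastore; authentication/authorization markers (A/A, A12n/A11N) appear next to MapReduce, HBase, HDFS and the RDBMS.]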
Hive Deployment Option 2
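[Diagram: Client, CLI, Driver (Parser, Planner, Optimizer, Execution), MSClient, Metastore, MapReduce, HBase, HDFS, RDBMS. Authorization is marked at the Driver and Authentication at the Metastore; authentication/authorization markers (A/A, A12n/A11N) appear next to MapReduce, HBase, HDFS and the RDBMS.]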
Hive Deployment Option 3
  Client
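[Diagram: Client with JDBC/ODBC, CLI, Hive Thrift Server, Hive Web Interface, Driver (Parser, Planner, Optimizer, Execution), MSClient, Metastore, MapReduce, HBase, HDFS, RDBMS. Authorization is marked at the Driver and Authentication at the Metastore; authentication/authorization markers (A/A, A12n/A11N) appear next to MapReduce, HBase, HDFS and the RDBMS.]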
Hive + HBase + Hadoop Security
• Regardless of Hive’s own security, for Hive to work on
  secure Hadoop and HBase, we should:
  – Obtain delegation tokens for Hadoop and HBase jobs
  – Obey the storage level (HDFS, HBase) permission checks
  – In HiveServer deployments, authenticate and impersonate the user

• Delegation tokens for Hadoop are already working
• Obtaining HBase delegation tokens is released in Hive
  0.9.0




Future of Hive + HBase
• Improve on schema / type mapping
• Fully secure Hive deployment options
• HBase bulk import improvements
• Sortable signed numeric types in HBase
• Filter pushdown: non key column filters
• Hive random access support for HBase
  – https://cwiki.apache.org/HCATALOG/random-access-framework.html
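
The "sortable signed numeric types" item can be motivated with a short Python sketch. Plain two's complement bytes sort negative numbers after positive ones when compared as raw bytes; flipping the sign bit is one common way to fix that. This illustrates the general technique only and is not a description of what Hive or HBase ended up implementing; the function names are hypothetical.

```python
import struct


def twos_complement(value):
    # Plain 8-byte big-endian two's complement encoding.
    return struct.pack(">q", value)


def sortable(value):
    # Flip the sign bit so negatives sort before non-negatives when
    # the bytes are compared as unsigned values.
    return struct.pack(">Q", (value + (1 << 63)) & 0xFFFFFFFFFFFFFFFF)
```

Sorting by the `sortable` bytes gives the same order as sorting the numbers themselves, which is the property a byte-ordered store needs for range scans over signed keys.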




References
• Security
  – https://issues.apache.org/jira/browse/HIVE-2764
  – https://issues.apache.org/jira/browse/HBASE-5371
  – https://issues.apache.org/jira/browse/HCATALOG-245
  – https://issues.apache.org/jira/browse/HCATALOG-260
  – https://issues.apache.org/jira/browse/HCATALOG-244
  – https://cwiki.apache.org/confluence/display/HCATALOG/Hcat+Security+Design
• Type mapping / Filter Pushdown
  – https://issues.apache.org/jira/browse/HIVE-1634
  – https://issues.apache.org/jira/browse/HIVE-1226
  – https://issues.apache.org/jira/browse/HIVE-1643
  – https://issues.apache.org/jira/browse/HIVE-2815
Other Resources

• Hadoop Summit
  – June 13-14
  – San Jose, California
  – www.Hadoopsummit.org

• Hadoop Training and Certification
  – Developing Solutions Using Apache Hadoop
  – Administering Apache Hadoop
  – Online classes available US, India, EMEA
  – http://hortonworks.com/training/




        © Hortonworks Inc. 2012
Thanks
Questions?





 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsHortonworks
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeHortonworks
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidHortonworks
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATAHortonworks
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Hortonworks
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseHortonworks
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseHortonworks
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationHortonworks
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementHortonworks
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCHortonworks
 

Mais de Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Último

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 

Último (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Integration of Hive and HBase

  • 1. Integration of Apache Hive and HBase
      Enis Soztutar, enis [at] apache [dot] org, @enissoz
      Architecting the Future of Big Data, © Hortonworks Inc. 2011
  • 2. About Me
      • User and committer of Hadoop since 2007
      • Contributor to Apache Hadoop, HBase, Hive and Gora
      • Joined Hortonworks as Member of Technical Staff
      • Twitter: @enissoz
  • 3. Agenda
      • Overview of Hive and HBase
      • Hive + HBase Features and Improvements
      • Future of Hive and HBase
      • Q&A
  • 4. Apache Hive Overview
      • Apache Hive is a data warehouse system for Hadoop
      • SQL-like query language called HiveQL
      • Built for PB scale data
      • Main purpose is analysis and ad hoc querying
      • Database / table / partition / bucket – DDL operations
      • SQL types + complex types (ARRAY, MAP, etc.)
      • Very extensible
      • Not for: small data sets, low latency queries, OLTP
  • 5. Apache Hive Architecture
      [Diagram: CLI, Hive Thrift Server (JDBC/ODBC) and Hive Web Interface on top of the Driver (Parser, Planner, Optimizer, Execution); MS Client connecting to the Metastore; MapReduce, HDFS and RDBMS at the bottom]
  • 6. Overview of Apache HBase
      • Apache HBase is the Hadoop database
      • Modeled after Google's BigTable
      • A sparse, distributed, persistent multi-dimensional sorted map
      • The map is indexed by a row key, column key, and a timestamp
      • Each value in the map is an uninterpreted array of bytes
      • Low latency random data access
  • 7. Overview of Apache HBase
      • Logical view: [Figure] From: Bigtable: A Distributed Storage System for Structured Data, Chang, et al.
  • 8. Apache HBase Architecture
      [Diagram: Client, HMaster and ZooKeeper; three region servers, each hosting several regions; HDFS underneath]
  • 9. Hive + HBase Features and Improvements
  • 10. Hive + HBase Motivation
      • Hive and HBase have different characteristics (Hive vs. HBase):
          – High latency vs. low latency
          – Structured vs. unstructured
          – Analysts vs. programmers
      • Hive data warehouses on Hadoop are high latency
          – Long ETL times
          – Access to real time data
      • Analyzing HBase data with MapReduce requires custom coding
      • Hive and SQL are already known by many analysts
  • 11. Use Case 1: HBase as ETL Data Sink
      [Figure] From: HUG, "Hive/HBase Integration or, MaybeSQL?", April 2010, John Sichi, Facebook, http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
  • 12. Use Case 2: HBase as Data Source
      [Figure] From: HUG, "Hive/HBase Integration or, MaybeSQL?", April 2010, John Sichi, Facebook, http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
  • 13. Use Case 3: Low Latency Warehouse
      [Figure] From: HUG, "Hive/HBase Integration or, MaybeSQL?", April 2010, John Sichi, Facebook, http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
  • 14. Example: Hive + HBase (HBase table)
        hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME => 's'}
        hbase(main):014:0> scan 'short_urls'
        ROW            COLUMN+CELL
        bit.ly/aaaa    column=s:hits, value=100
        bit.ly/aaaa    column=u:url, value=hbase.apache.org/
        bit.ly/abcd    column=s:hits, value=123
        bit.ly/abcd    column=u:url, value=example.com/foo
  • 15. Example: Hive + HBase (Hive table)
        CREATE TABLE short_urls(
          short_url string,
          url string,
          hit_count int
        )
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, u:url, s:hits")
        TBLPROPERTIES ("hbase.table.name" = "short_urls");
  • 16. Storage Handler
      • Hive defines the HiveStorageHandler class for different storage backends: HBase / Cassandra / MongoDB / etc.
      • Storage Handler has hooks for
          – Getting input / output formats
          – Metadata operations hook: CREATE TABLE, DROP TABLE, etc.
      • Storage Handler is a table-level concept
          – Does not support Hive partitions and buckets
  • 17. Apache Hive + HBase Architecture
      [Diagram: CLI, Hive Thrift Server and Hive Web Interface on top of the Driver (Parser, Planner, Optimizer, Execution); MS Client connecting to the Metastore; StorageHandler between the Driver and MapReduce / HBase; HDFS and RDBMS at the bottom]
  • 18. Hive + HBase Integration
      • For Input/OutputFormat, getSplits(), etc., the underlying HBase classes are used
      • Column selection and certain filters can be pushed down
      • HBase tables can be used with other (Hadoop native) tables and SQL constructs
      • Hive DDL operations are converted to HBase DDL operations via the client hook
          – All operations are performed by the client
          – No two-phase commit
  • 19. Schema / Type Mapping
  • 20. Schema Mapping
      • Hive table + columns + column types <=> HBase table + column families (+ column qualifiers)
      • Every field in the Hive table is mapped, in order, to either
          – The table key (using :key as selector)
          – A column family (cf:) -> MAP fields in Hive
          – A column (cf:cq)
      • The Hive table does not need to include all columns in HBase
      • Example:
        CREATE TABLE short_urls(
          short_url string,
          url string,
          hit_count int,
          props map<string,string>
        )
        WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, u:url, s:hits, p:")
  • 21. Type Mapping
      • Recently added to Hive (0.9.0)
      • Previously all types were converted to strings in HBase
      • Hive has:
          – Primitive types: INT, STRING, BINARY, DATE, etc.
          – ARRAY<Type>
          – MAP<PrimitiveType, Type>
          – STRUCT<a:INT, b:STRING, c:STRING>
      • HBase does not have types
          – Bytes.toBytes()
  • 22. Type Mapping
      • Table-level property: "hbase.table.default.storage.type" = "binary"
      • Type mapping can be given per column after #
          – Any prefix of "binary", e.g. u:url#b
          – Any prefix of "string", e.g. u:url#s
          – The dash char "-", e.g. u:url#-
      • Example:
        CREATE TABLE short_urls(
          short_url string,
          url string,
          hit_count int,
          props map<string,string>
        )
        WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s")
  • 23. Type Mapping
      • If the type is not a primitive or a Map, it is converted to a JSON string and serialized
      • Still a few rough edges for schema and type mapping:
          – No Hive BINARY support in HBase mapping
          – No mapping of HBase timestamp (can only provide put timestamp)
          – No arbitrary mapping of Structs / Arrays into HBase schema
  • 24. Bulk Load
      • Steps to bulk load:
          – Sample source data for range partitioning
          – Save sampling results to a file
          – Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner
          – Import HFiles into HBase table
      • Ideal setup should be:
        SET hive.hbase.bulk=true;
        INSERT OVERWRITE TABLE web_table SELECT …
  • 25. Filter Pushdown
  • 26. Filter Pushdown
      • Idea is to pass down filter expressions to the storage layer to minimize scanned data
      • To access indexes at HDFS or HBase
      • Example:
        CREATE EXTERNAL TABLE users (userid BIGINT, email STRING, … )
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…")

        SELECT ... FROM users WHERE userid > 1000000 AND email LIKE '%@gmail.com';
        -> scan.setStartRow(Bytes.toBytes(1000000))
  • 27. Filter Decomposition
      • Optimizer pushes down the predicates to the query plan
      • Storage handlers can negotiate with the Hive optimizer to decompose the filter, e.g. x > 3 AND upper(y) = 'XYZ'
      • Handle x > 3, send upper(y) = 'XYZ' as residual for Hive
      • Works with:
          – key = 3, key > 3, etc.
          – key > 3 AND key < 100
      • Only works against constant expressions
  • 28. Security Aspects: Towards fully secure deployments
  • 29. Security – Big Picture
      • Security becomes more important to support enterprise-level and multi-tenant applications
      • 5 different components to ensure / impose security
          – HDFS
          – MapReduce
          – HBase
          – ZooKeeper
          – Hive
      • Each component has:
          – Authentication
          – Authorization
  • 30. HBase Security – Closer look
      • Released with HBase 0.92
      • Fully optional module, disabled by default
      • Needs an underlying secure Hadoop release
      • SecureRPCEngine: optional engine enforcing SASL authentication
          – Kerberos
          – DIGEST-MD5 based tokens
          – TokenProvider coprocessor
      • Access control is implemented as a coprocessor: AccessController
      • Stores and distributes ACL data via ZooKeeper
          – Sensitive data is only accessible by HBase daemons
          – Client does not need to authenticate to ZooKeeper
  • 31. Hive Security – Closer look
      • Hive has different deployment options; security considerations should take the different deployments into account
      • Authentication is only supported at the Metastore, not on HiveServer, the web interface, or JDBC
      • Authorization is enforced at the query layer (Driver)
      • Pluggable authorization providers. The default one stores global / table / partition / column permissions in the Metastore
        GRANT ALTER ON TABLE web_table TO USER bob;
        CREATE ROLE db_reader;
        GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader;
  • 32. Hive Deployment Option 1
      [Diagram: Client with CLI, Driver (Parser, Planner, Optimizer, Execution) and MS Client; Metastore; MapReduce and HBase; HDFS and RDBMS. Labels: Authorization (Driver), Authentication (Metastore), A/A on MapReduce and HBase, A12n/A11N on HDFS and RDBMS]
  • 33. Hive Deployment Option 2
      [Diagram: Client with CLI, Driver (Parser, Planner, Optimizer, Execution) and MS Client; Metastore; MapReduce and HBase; HDFS and RDBMS. Labels: Authorization (Driver), Authentication (Metastore), A/A on MapReduce and HBase, A12n/A11N on HDFS and RDBMS]
  • 34. Hive Deployment Option 3
      [Diagram: Client with JDBC/ODBC, Hive Thrift Server, Hive Web Interface and CLI; Driver (Parser, Planner, Optimizer, Execution) and MS Client; Metastore; MapReduce and HBase; HDFS and RDBMS. Labels: Authorization (Driver), Authentication (Metastore), A/A on MapReduce and HBase, A12n/A11N on HDFS and RDBMS]
  • 35. Hive + HBase + Hadoop Security
      • Regardless of Hive's own security, for Hive to work on secure Hadoop and HBase, we should:
          – Obtain delegation tokens for Hadoop and HBase jobs
          – Obey the storage-level (HDFS, HBase) permission checks
          – In HiveServer deployments, authenticate and impersonate the user
      • Delegation tokens for Hadoop are already working
      • Support for obtaining HBase delegation tokens was released in Hive 0.9.0
  • 36. Future of Hive + HBase
      • Improve on schema / type mapping
      • Fully secure Hive deployment options
      • HBase bulk import improvements
      • Sortable signed numeric types in HBase
      • Filter pushdown: non-key column filters
      • Hive random access support for HBase
          – https://cwiki.apache.org/HCATALOG/random-access-framework.html
  • 37. References
      • Security
          – https://issues.apache.org/jira/browse/HIVE-2764
          – https://issues.apache.org/jira/browse/HBASE-5371
          – https://issues.apache.org/jira/browse/HCATALOG-245
          – https://issues.apache.org/jira/browse/HCATALOG-260
          – https://issues.apache.org/jira/browse/HCATALOG-244
          – https://cwiki.apache.org/confluence/display/HCATALOG/Hcat+Security+Design
      • Type mapping / Filter Pushdown
          – https://issues.apache.org/jira/browse/HIVE-1634
          – https://issues.apache.org/jira/browse/HIVE-1226
          – https://issues.apache.org/jira/browse/HIVE-1643
          – https://issues.apache.org/jira/browse/HIVE-2815
  • 38. Other Resources
      • Hadoop Summit
          – June 13-14
          – San Jose, California
          – www.Hadoopsummit.org
      • Hadoop Training and Certification
          – Developing Solutions Using Apache Hadoop
          – Administering Apache Hadoop
          – Online classes available US, India, EMEA
          – http://hortonworks.com/training/
      © Hortonworks Inc. 2012
  • 39. Thanks. Questions?
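
The column mapping syntax from slides 20 and 22 can be illustrated with a small parser. This is only a sketch in Python, not the code Hive uses: the names `ColumnMapping`, `parse_storage` and `parse_mapping` are hypothetical, and treating a missing suffix or the dash as "use the table default" is an assumption based on the slides' description of `hbase.table.default.storage.type`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnMapping:
    kind: str                  # "key", "family" (maps to a Hive MAP) or "column"
    family: Optional[str]      # HBase column family, None for the row key
    qualifier: Optional[str]   # HBase column qualifier, None for key/family
    storage: str               # "string", "binary" or "default"


def parse_storage(suffix: str) -> str:
    """Resolve the text after '#': any prefix of 'binary' or 'string', or '-'."""
    if suffix == "-":
        return "default"       # assumption: dash means the table-level default
    for name in ("binary", "string"):
        if suffix and name.startswith(suffix):
            return name
    raise ValueError(f"unknown storage type suffix: {suffix!r}")


def parse_mapping(spec: str) -> List[ColumnMapping]:
    """Parse a value such as ':key#b,u:url#b,s:hits#b,p:#s' entry by entry."""
    result = []
    for entry in spec.split(","):
        selector, _, suffix = entry.strip().partition("#")
        storage = parse_storage(suffix) if suffix else "default"
        if selector == ":key":
            result.append(ColumnMapping("key", None, None, storage))
            continue
        family, sep, qualifier = selector.partition(":")
        if not sep or not family:
            raise ValueError(f"expected ':key', 'cf:' or 'cf:cq', got {selector!r}")
        if qualifier:
            result.append(ColumnMapping("column", family, qualifier, storage))
        else:
            # 'cf:' with no qualifier selects the whole family (a MAP field in Hive)
            result.append(ColumnMapping("family", family, None, storage))
    return result
```

Because entries are matched to Hive fields in order, the list returned for the slide 22 example lines up with `short_url`, `url`, `hit_count` and `props` respectively.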
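
The filter decomposition described on slide 27 can also be sketched in a few lines. The Python below is a simplified model under stated assumptions, not Hive's storage handler interface: `Predicate`, `decompose` and `scan_bounds` are hypothetical names, the filter is assumed to be a flat AND of comparisons, and row keys are assumed to be integers so that a half-open scan range can be computed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Predicate:
    column: str               # column name, or an expression such as "upper(y)"
    op: str                   # "=", ">", ">=", "<" or "<="
    value: object             # right-hand operand
    is_constant: bool = True  # False when the operand is not a constant


PUSHABLE_OPS = {"=", ">", ">=", "<", "<="}


def decompose(conjuncts: List[Predicate], key_column: str):
    """Split an AND-ed filter into (pushed to storage, residual kept by Hive)."""
    pushed, residual = [], []
    for p in conjuncts:
        if p.column == key_column and p.op in PUSHABLE_OPS and p.is_constant:
            pushed.append(p)
        else:
            residual.append(p)
    return pushed, residual


def scan_bounds(pushed: List[Predicate]) -> Tuple[Optional[int], Optional[int]]:
    """Turn pushed key predicates into a half-open [start, stop) key range."""
    start, stop = None, None
    for p in pushed:
        lo = p.value if p.op in ("=", ">=") else p.value + 1 if p.op == ">" else None
        hi = p.value + 1 if p.op in ("=", "<=") else p.value if p.op == "<" else None
        if lo is not None:
            start = lo if start is None else max(start, lo)
        if hi is not None:
            stop = hi if stop is None else min(stop, hi)
    return start, stop
```

With `x` as the key column, the slide's filter `x > 3 AND upper(y) = 'XYZ'` splits into one pushed predicate and one residual predicate, and `key > 3 AND key < 100` yields a start of 4 and a stop of 100 in this model.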