Hive - A Warehousing Solution Over a Map-Reduce Framework

Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy
Facebook Data Infrastructure Team

1. INTRODUCTION

   The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse.

   In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into map-reduce jobs executed on Hadoop. In addition, HiveQL supports custom map-reduce scripts to be plugged into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, the Hive-Metastore, containing schemas and statistics, which is useful in data exploration and query optimization. At Facebook, the Hive warehouse contains several thousand tables with over 700 terabytes of data and is being used extensively for both reporting and ad-hoc analyses by more than 100 users.

   The rest of the paper is organized as follows. Section 2 describes the Hive data model and the HiveQL language with an example. Section 3 describes the Hive system architecture and gives an overview of the query life cycle. Section 4 provides a walk-through of the demonstration. We conclude with future work in Section 5.

2. HIVE DATABASE

2.1 Data Model

   Data in Hive is organized into:

   • Tables - These are analogous to tables in relational databases. Each table has a corresponding HDFS directory. The data in a table is serialized and stored in files within that directory. Users can associate tables with the serialization format of the underlying data. Hive provides builtin serialization formats which exploit compression and lazy de-serialization. Users can also add support for new data formats by defining custom serialize and de-serialize methods (called SerDe's) written in Java. The serialization format of each table is stored in the system catalog and is automatically used by Hive during query compilation and execution. Hive also supports external tables on data stored in HDFS, NFS or local directories.

   • Partitions - Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory. Suppose data for table T is in the directory /wh/T. If T is partitioned on columns ds and ctry, then data with a particular ds value 20090101 and ctry value US will be stored in files within the directory /wh/T/ds=20090101/ctry=US.

   • Buckets - Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.

   Hive supports primitive column types (integers, floating point numbers, generic strings, dates and booleans) and nestable collection types — array and map. Users can also define their own types programmatically.
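   For concreteness, the following is a sketch (not from the original paper) of how a table like the status_updates table of Section 2.3 might be declared with a partitioning column, hash buckets and a builtin delimited-text serialization format; the bucket count and field delimiter are illustrative choices only.

-- Managed table with a partition column, hash buckets and a builtin SerDe:
CREATE TABLE status_updates(userid int, status string)
PARTITIONED BY (ds string)                       -- one sub-directory per ds value
CLUSTERED BY (userid) INTO 32 BUCKETS            -- hash(userid) selects the bucket file
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'   -- builtin delimited-text serialization
STORED AS TEXTFILE;

-- An external table reads data in place instead of moving it into the
-- warehouse directory (the path is illustrative):
CREATE EXTERNAL TABLE raw_logs(line string)
LOCATION '/data/raw_logs';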
2.2 Query Language

   Hive provides a SQL-like query language called HiveQL which supports select, project, join, aggregate, union all and sub-queries in the from clause. HiveQL supports data definition (DDL) statements to create tables with specific serialization formats, and partitioning and bucketing columns. Users can load data from external sources and insert query results into Hive tables via the load and insert data manipulation (DML) statements respectively. HiveQL currently does not support updating and deleting rows in existing tables.
   HiveQL supports multi-table insert, where users can perform multiple queries on the same input data using a single HiveQL statement. Hive optimizes these queries by sharing the scan of the input data.
   HiveQL is also very extensible. It supports user defined column transformation (UDF) and aggregation (UDAF) functions implemented in Java. In addition, users can embed custom map-reduce scripts written in any language using a simple row-based streaming interface, i.e., read rows from
standard input and write out rows to standard output. This
flexibility does come at a cost of converting rows from and
to strings.
  We omit more details due to lack of space. For a complete
description of HiveQL see the language manual [5].
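   As an illustration of this extensibility, the following sketch registers and invokes a user defined function; the jar path, class name and function name are hypothetical, not taken from the paper.

-- Register a UDF implemented in Java (hypothetical jar and class):
ADD JAR /tmp/example_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.udf.Lower';

-- The registered function can then be used like a builtin:
SELECT my_lower(status) FROM status_updates;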

2.3 Running Example: StatusMeme

   We now present a highly simplified application, StatusMeme, inspired by Facebook Lexicon [6]. When Facebook users update their status, the updates are logged into flat files in an NFS directory /logs/status_updates which are rotated every day. We load this data into Hive on a daily basis into a table status_updates(userid int, status string, ds string) using a load statement like the one below.

LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates PARTITION (ds='2009-03-20')

   Each status update record contains the user identifier (userid), the actual status string (status), and the date (ds) when the status update occurred. This table is partitioned on the ds column. Detailed user profile information, like the gender of the user and the school the user is attending, is available in the profiles(userid int, school string, gender int) table.
   We first want to compute daily statistics on the frequency of status updates based on gender and the school which the user attends. The following multi-table insert statement generates the daily counts of status updates by school (into school_summary(school string, cnt int, ds string)) and gender (into gender_summary(gender int, cnt int, ds string)) using a single scan of the join of the status_updates and profiles tables. Note that the output tables are also partitioned on the ds column, and HiveQL allows users to insert query results into a specific partition of the output table.
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
           ON (a.userid = b.userid and
               a.ds='2009-03-20')
     ) subq1
INSERT OVERWRITE TABLE gender_summary
                       PARTITION(ds='2009-03-20')
SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
                       PARTITION(ds='2009-03-20')
SELECT subq1.school, COUNT(1) GROUP BY subq1.school
   Next, we want to display the ten most popular memes per school as determined by status updates by users who attend that school. We now show how this computation can be done using HiveQL's map-reduce constructs. We parse the result of the join between the status_updates and profiles tables by plugging in a custom Python mapper script, meme-extractor.py, which uses sophisticated natural language processing techniques to extract memes from status strings. Since Hive does not yet support the rank aggregation function, the top 10 memes per school can then be computed by a simple custom Python reduce script, top10.py:
REDUCE subq2.school, subq2.meme, subq2.cnt
       USING 'top10.py' AS (school,meme,cnt)
FROM (SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
      FROM (MAP b.school, a.status
                USING 'meme-extractor.py' AS (school,meme)
            FROM status_updates a JOIN profiles b
                 ON (a.userid = b.userid)
           ) subq1
      GROUP BY subq1.school, subq1.meme
      DISTRIBUTE BY school, meme
      SORT BY school, meme, cnt desc
     ) subq2;
[Figure 1: Hive Architecture]

3. HIVE ARCHITECTURE

   Figure 1 shows the major components of Hive and its interactions with Hadoop. The main components of Hive are:

   • External Interfaces - Hive provides both user interfaces, like the command line (CLI) and web UI, and application programming interfaces (API) like JDBC and ODBC.

   • The Hive Thrift Server exposes a very simple client API to execute HiveQL statements. Thrift [8] is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages. The Thrift Hive clients generated in different languages are used to build common drivers like JDBC (Java), ODBC (C++), and scripting drivers written in PHP, Perl, Python etc.

   • The Metastore is the system catalog. All other components of Hive interact with the metastore. For more details see Section 3.1.

   • The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution. On receiving the HiveQL statement from the Thrift server or other interfaces, it creates a session handle which is later used to keep track of statistics like execution time, number of output rows, etc.

   • The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates this statement into a plan which consists of a DAG of map-reduce jobs. For more details see Section 3.2.

   • The driver submits the individual map-reduce jobs from the DAG to the Execution Engine in topological order. Hive currently uses Hadoop as its execution engine.

   We next describe the metastore and compiler in detail.

3.1 Metastore

   The metastore is the system catalog which contains metadata about the tables stored in Hive. This metadata is specified during table creation and reused every time the table is referenced in HiveQL. The metastore distinguishes Hive as a traditional warehousing solution (a la Oracle or DB2) when compared with similar data processing systems built on top of map-reduce-like architectures, such as Pig [7] and Scope [2].
   The metastore contains the following objects:

   • Database - is a namespace for tables. The database 'default' is used for tables with no user-supplied database name.

   • Table - Metadata for a table contains the list of columns and their types, owner, storage and SerDe information. It can also contain any user-supplied key and value data; this facility can be used to store table statistics in the future. Storage information includes the location of the table's data in the underlying file system, data formats and bucketing information. SerDe metadata includes the implementation class of the serializer and deserializer methods and any supporting information required by that implementation. All this information can be provided during the creation of the table.

   • Partition - Each partition can have its own columns and SerDe and storage information. This can be used in the future to support schema evolution in a Hive warehouse.

   The storage system for the metastore should be optimized for online transactions with random accesses and updates. A file system like HDFS is not suited since it is optimized for sequential scans and not for random access. So, the metastore uses either a traditional relational database (like MySQL or Oracle) or a file system (like local, NFS, AFS) and not HDFS. As a result, HiveQL statements which only access metadata objects are executed with very low latency. However, Hive has to explicitly maintain consistency between metadata and data.
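   For example, statements like the following are answered from the metastore alone and do not launch any map-reduce jobs (an illustrative sketch, not from the paper):

SHOW TABLES;                          -- tables registered in the metastore
SHOW PARTITIONS status_updates;       -- partitions such as ds=2009-03-20
DESCRIBE EXTENDED status_updates;     -- columns plus storage and SerDe metadata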

3.2 Compiler

   The driver invokes the compiler with the HiveQL string, which can be one of DDL, DML or query statements. The compiler converts the string to a plan. The plan consists only of metadata operations in the case of DDL statements, and of HDFS operations in the case of LOAD statements. For insert statements and queries, the plan consists of a directed acyclic graph (DAG) of map-reduce jobs.

   • The Parser transforms a query string to a parse tree representation.

   • The Semantic Analyzer transforms the parse tree to a block-based internal query representation. It retrieves schema information of the input tables from the metastore. Using this information it verifies column names, expands select * and does type-checking, including the addition of implicit type conversions.

   • The Logical Plan Generator converts the internal query representation to a logical plan, which consists of a tree of logical operators.

   • The Optimizer performs multiple passes over the logical plan and rewrites it in several ways:

      • Combines multiple joins which share the join key into a single multi-way join, and hence a single map-reduce job.

      • Adds repartition operators (also known as ReduceSinkOperators) for join, group-by and custom map-reduce operators. These repartition operators mark the boundary between the map phase and the reduce phase during physical plan generation.

      • Prunes columns early and pushes predicates closer to the table scan operators in order to minimize the amount of data transferred between operators.
      • In the case of partitioned tables, prunes partitions that are not needed by the query.

      • In the case of sampling queries, prunes buckets that are not needed.

   Users can also provide hints to the optimizer to:

      • add partial aggregation operators to handle large cardinality grouped aggregations;

      • add repartition operators to handle skew in grouped aggregations;

      • perform joins in the map phase instead of the reduce phase (see the sketch following this list).

   • The Physical Plan Generator converts the logical plan into a physical plan, consisting of a DAG of map-reduce jobs. It creates a new map-reduce job for each of the marker operators – repartition and union all – in the logical plan. It then assigns the portions of the logical plan enclosed between the markers to the mappers and reducers of the map-reduce jobs.
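   To illustrate two of these mechanisms, the sketch below uses Hive's MAPJOIN hint to request that the profiles table be replicated to the mappers so the join avoids a reduce phase, and a TABLESAMPLE clause whose bucket predicate enables bucket pruning; the bucket numbers are illustrative and assume the bucketed table declaration sketched in Section 2.1.

-- Map-side join: replicate the (small) profiles table to every mapper:
SELECT /*+ MAPJOIN(b) */ a.status, b.school
FROM status_updates a JOIN profiles b ON (a.userid = b.userid);

-- Sampling: read 1 bucket out of 32, letting the optimizer prune the rest:
SELECT COUNT(1)
FROM status_updates TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);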
[Figure 2: Query plan with 3 map-reduce jobs for the multi-table insert query in Section 2.3]

   In Figure 2, we illustrate the plan of the multi-table insert query in Section 2.3. The nodes in the plan are physical operators and the edges represent the flow of data between operators. The last line in each node represents the output schema of that operator. For lack of space, we do not describe the parameters specified within each operator node. The plan has three map-reduce jobs. Within the same map-reduce job, the portion of the operator tree below the repartition operator (ReduceSinkOperator) is executed by the mapper and the portion above by the reducer. The repartitioning itself is performed by the execution engine.
   Notice that the first map-reduce job writes two temporary files to HDFS, tmp1 and tmp2, which are consumed by the second and third map-reduce jobs respectively. Thus, the second and third map-reduce jobs wait for the first map-reduce job to finish.
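   Plans like the one in Figure 2 can also be inspected textually: prefixing a query with the EXPLAIN keyword makes Hive print the plan instead of executing it (a usage sketch; the printed output format is not reproduced here).

EXPLAIN
SELECT b.school, COUNT(1)
FROM status_updates a JOIN profiles b ON (a.userid = b.userid)
GROUP BY b.school;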
                                                                       • We are exploring methods for multi-query optimiza-
4. DEMONSTRATION DESCRIPTION

   The demonstration consists of the following:

   • Functionality — We demonstrate HiveQL constructs via the StatusMeme application described in Section 2.3. We expand the application to include queries which use more HiveQL constructs and showcase the rule-based optimizer.

   • Tuning — We also demonstrate our query plan viewer, which shows how HiveQL queries are translated into physical plans of map-reduce jobs. We show how hints can be used to modify the plans generated by the optimizer.

   • User Interface — We show our graphical user interface, which allows users to explore a Hive database, author HiveQL queries, and monitor query execution.

   • Scalability — We illustrate the scalability of the system by increasing the sizes of the input data and the complexity of the queries.
5. FUTURE WORK

   Hive is a first step in building an open-source warehouse over a web-scale map-reduce data processing system (Hadoop). The distinct characteristics of the underlying storage and execution engines have forced us to revisit techniques for query processing. We have discovered that we have to either modify or rewrite several query processing algorithms to perform efficiently in our setting.
   Hive is an Apache sub-project, with an active user and developer community both within and outside Facebook. The Hive warehouse instance at Facebook contains over 700 terabytes of usable data and supports over 5000 queries on a daily basis. This demonstration showcases the current capabilities of Hive. There are many important avenues of future work:

   • HiveQL currently accepts only a subset of SQL as valid queries. We are working towards making HiveQL subsume SQL syntax.

   • Hive currently has a naïve rule-based optimizer with a small number of simple rules. We plan to build a cost-based optimizer and adaptive optimization techniques to come up with more efficient plans.

   • We are exploring columnar storage and more intelligent data placement to improve scan performance.

   • We are running performance benchmarks based on [1] to measure our progress as well as to compare against other systems [4]. In our preliminary experiments, we have been able to improve the performance of Hadoop itself by 20% compared to [1]. The improvements involved using faster Hadoop data structures to process the data, for example, using Text instead of String. The same queries expressed easily in HiveQL had 20% overhead compared to our optimized Hadoop implementation, i.e., Hive's performance is on par with the Hadoop code from [1]. Based on these experiments, we have identified several areas for performance improvement and have begun working on them. More details are available in [4].

   • We are enhancing the JDBC and ODBC drivers for Hive for integration with commercial BI tools which only work with traditional relational warehouses.

   • We are exploring multi-query optimization techniques and performing generic n-way joins in a single map-reduce job.

   We would like to thank our user and developer community for their contributions, with special thanks to Yuntao Jia, Yongqiang He, Dhruba Borthakur and Jeff Hammerbacher.

6. REFERENCES

[1] A. Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. Proc. ACM SIGMOD, 2009.
[2] R. Chaiken et al. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. Proc. VLDB Endow., 1(2):1265–1276, 2008.
[3] Apache Hadoop. Available at http://wiki.apache.org/hadoop.
[4] Hive Performance Benchmark. Available at https://issues.apache.org/jira/browse/HIVE-396.
[5] Hive Language Manual. Available at http://wiki.apache.org/hadoop/Hive/LanguageManual.
[6] Facebook Lexicon. Available at http://www.facebook.com/lexicon.
[7] Apache Pig. http://wiki.apache.org/pig.
[8] Apache Thrift. http://incubator.apache.org/thrift.
