2. Tajo: A Data Warehouse System
• Apache Top-Level Project
• Distributed and scalable data warehouse system on Hadoop
• Low-latency and long-running batch queries in a single system
• Features:
ANSI/ISO SQL compliance
Mature SQL features: joins, GROUP BY, ORDER BY, aggregation, and window functions
Partitioned tables
Java and Python user-defined functions (UDFs)
JDBC driver and a Java-based asynchronous API
Reads and writes CSV, JSON, RCFile, SequenceFile, Parquet, and ORC
4. Where to use Tajo
• Extraction, Transformation, Loading (ETL)
• Interactive BI/analytics on web-scale big data
• Data discovery and exploratory analysis with R and existing SQL tools
• Query federation
• Customers who want a unified system for batch and interactive queries on Hadoop, Amazon S3, or HBase
• Customers who want to mix a Hadoop-based DW with an RDBMS-based DW, or to replace the RDBMS DW
• Customers who want to use existing SQL tools on a Hadoop DW
5. HBase Storage Support
• You can use SQL to access HBase tables.
• Tajo supports HBase as a storage type.
• Supported statements: CREATE [EXTERNAL] TABLE, DROP TABLE, INSERT (OVERWRITE) INTO
• Example:
CREATE TABLE hbase_t1 (Key TEXT, Col1 TEXT, Col2 INT) USING hbase WITH (
'table' = 't1',
'columns' = ':key,cf1:col1,cf2:col2',
'hbase.zookeeper.quorum' = 'host1:2181,host2:2181'
);
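As a rough illustration of the SQL access described on this slide, the statements below load rows into the HBase-backed table `hbase_t1` and read one back by row key. The source table `staging_t1` and its columns are hypothetical, and the sketch assumes the HBase table `t1` with column families `cf1` and `cf2` is reachable through the configured ZooKeeper quorum.

```sql
-- Hypothetical: staging_t1 (id TEXT, name TEXT, score INT) is an ordinary Tajo table.
-- Each selected row becomes one HBase row: id -> row key, name -> cf1:col1, score -> cf2:col2.
INSERT INTO hbase_t1 SELECT id, name, score FROM staging_t1;

-- Read a single row back through SQL; 'row-001' is a placeholder row key.
SELECT key, col1, col2 FROM hbase_t1 WHERE key = 'row-001';
```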
6. Tajo Shell (TSQL)
Tajo provides a shell utility named Tsql. It is a command-line interface
(CLI) in which users can create or drop tables, inspect schemas, and
query tables.
• Meta Commands
• Executing HDFS commands
• Session Variables
• Administration Commands
• Introduction to TSQL
• Executing a single command
• Executing Queries from Files
• Executing as background process
Refer: http://tajo.apache.org/docs/current/index.html
7. Tajo SQL Language (DDL)
CREATE DATABASE
CREATE DATABASE [IF NOT EXISTS] <database_name>
DROP DATABASE
DROP DATABASE [IF EXISTS] <database_name>
CREATE TABLE
CREATE TABLE [IF NOT EXISTS] <table_name> [(<column_name> <data_type>, ... )]
[using <storage_type> [with (<key> = <value>, ...)]] [AS <select_statement>]
CREATE EXTERNAL TABLE [IF NOT EXISTS] <table_name> (<column_name> <data_type>, ... )
using <storage_type> [with (<key> = <value>, ...)] LOCATION '<path>'
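To make the CREATE TABLE synopsis above concrete, here is a minimal sketch of the managed and external forms. The table names, columns, and HDFS path are hypothetical placeholders, not taken from the slides.

```sql
-- Managed table: Tajo stores the data in its own warehouse directory.
CREATE TABLE IF NOT EXISTS orders (
  o_orderkey   BIGINT,
  o_custkey    BIGINT,
  o_totalprice FLOAT8,
  o_comment    TEXT
) USING PARQUET;

-- External table: the schema is laid over files that already exist at LOCATION.
-- The path below is a placeholder; 'text.delimiter' sets the field separator.
CREATE EXTERNAL TABLE IF NOT EXISTS orders_raw (
  o_orderkey   BIGINT,
  o_custkey    BIGINT,
  o_totalprice FLOAT8,
  o_comment    TEXT
) USING TEXT WITH ('text.delimiter'='|')
LOCATION 'hdfs://localhost:9010/data/orders_raw';
```

Dropping the external table would normally leave the files at the location in place, whereas the managed table's data lives under Tajo's warehouse directory.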
Compression
CREATE EXTERNAL TABLE lineitem (
L_ORDERKEY bigint,
L_PARTKEY bigint,
...
L_COMMENT text)
USING TEXT WITH ('text.delimiter'='|','compression.codec'='org.apache.hadoop.io.compress.DeflateCodec')
LOCATION 'hdfs://localhost:9010/tajo/warehouse/lineitem_100_snappy';
DROP TABLE
DROP TABLE [IF EXISTS] <table_name> [PURGE]
CREATE INDEX
CREATE INDEX [ name ] ON table_name [ USING method ]
( { column_name | ( expression ) } [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
[ WHERE predicate ]
DROP INDEX
DROP INDEX name
8. INSERT (OVERWRITE) INTO
The INSERT OVERWRITE statement overwrites the data of an existing table or the data in a given
directory. Tajo’s INSERT OVERWRITE statement follows the INSERT INTO ... SELECT form of
SQL. Examples:
create table t1 (col1 int8, col2 int4, col3 float8);
-- when the target table schema and the output schema are equivalent
INSERT OVERWRITE INTO t1 SELECT l_orderkey, l_partkey, l_quantity FROM lineitem;
-- or
INSERT OVERWRITE INTO t1 SELECT * FROM lineitem;
-- when the output schema has fewer columns than the target table schema
INSERT OVERWRITE INTO t1 SELECT l_orderkey FROM lineitem;
-- when you want to specify certain target columns
INSERT OVERWRITE INTO t1 (col1, col3) SELECT l_orderkey, l_quantity FROM lineitem;
INSERT OVERWRITE can also write query results to a specific directory instead of a table:
INSERT OVERWRITE INTO LOCATION '/dir/subdir' SELECT l_orderkey, l_quantity FROM lineitem;
9. Tajo Queries
Query Synopsis
SELECT [DISTINCT | ALL] * | <expression> [[AS] <alias>] [, ...]
[FROM <table reference> [[AS] <table alias name>] [, ...]]
[WHERE <condition>]
[GROUP BY <expression> [, ...]]
[HAVING <condition>]
[ORDER BY <expression> [ASC|DESC] [NULL FIRST|NULL LAST] [, ...]]
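A concrete query that exercises most clauses of the synopsis above might look like the following sketch. It assumes the `lineitem` table used elsewhere in these slides, with its `l_orderkey` and `l_quantity` columns; the thresholds are arbitrary.

```sql
-- Per-order line-item count and average quantity,
-- keeping only orders that have more than three line items.
SELECT l_orderkey,
       count(*)        AS item_count,
       avg(l_quantity) AS avg_quantity
FROM lineitem
WHERE l_quantity > 0          -- row-level filter, applied before grouping
GROUP BY l_orderkey
HAVING count(*) > 3           -- group-level filter, applied after aggregation
ORDER BY l_orderkey ASC;
```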
Table and Table Aliases
A table or a complex table reference, such as a derived table, can be given a temporary name that
is used to refer to it in the rest of the query. This name is called a table alias.
FROM table_reference AS alias
or
FROM table_reference alias
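Both alias forms, and an alias on a derived table, can be sketched as follows. The `orders` table and its columns are hypothetical; `lineitem` is the table used in the earlier examples.

```sql
-- Aliases on plain tables: one with AS, one without.
SELECT l.l_orderkey, o.o_totalprice
FROM lineitem AS l, orders o
WHERE l.l_orderkey = o.o_orderkey;

-- Alias on a derived table: the subquery is referred to as t in the outer query.
SELECT t.l_orderkey, t.total_quantity
FROM (
  SELECT l_orderkey, sum(l_quantity) AS total_quantity
  FROM lineitem
  GROUP BY l_orderkey
) AS t
WHERE t.total_quantity > 100;
```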
Window Functions
A window function performs a calculation across a set of table rows that belong to a window
frame.
SELECT ..., func(param) OVER ([PARTITION BY partition-expr [, ...]] [ORDER BY sort-expr [, ...]]), ... FROM ...
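As an illustration of the OVER clause above, the sketch below ranks line items within each order and attaches the per-order total to every row. It assumes the `lineitem` table from the earlier examples; unlike GROUP BY, the window functions keep one output row per input row.

```sql
SELECT l_orderkey,
       l_partkey,
       l_quantity,
       -- position of this line item within its order, largest quantity first
       rank() OVER (PARTITION BY l_orderkey ORDER BY l_quantity DESC) AS qty_rank,
       -- total quantity of the whole order, repeated on each of its rows
       sum(l_quantity) OVER (PARTITION BY l_orderkey) AS order_quantity
FROM lineitem;
```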
10. Better SQL support via thin JDBC
[Diagram: ETL tools, BI tools, and reporting tools connect through Tajo JDBC to the Tajo cluster, which reads from HDFS, HBase, S3, and Swift.]