Hadoop Fundamentals I

© 2013 IBM Corporation
AVNET – Hadoop Fundamentals I
Romeo Kienzler
IBM Innovation Center Zurich

Agenda
1) Welcome
2) What is big data?
3) Introduction to Hadoop
4) BigInsights
5) Hadoop architecture
6) Lab 1 – Core Hadoop
7) MapReduce
8) Lab 2 – MapReduce
9) Pig, Jaql, Hive, BigSQL, SystemT/AQL
10) Lab 3 – Pig, Hive, and Jaql
11) Certification on BigDataUniversity

What is BIG data?

Traditional Business Intelligence / Data
Warehousing
...60 percent, were unsatisfied with their data warehousing system.¹
¹http://www.information-management.com/issues/20010601/3494-1.html

What is BIG data?
Big Data
Hadoop

What is BIG data?
Business Intelligence
Data Warehouse

Map-Reduce → Hadoop → BigInsights

Why is Big Data important?
[Chart: the gap between the data AVAILABLE to an organization and the data an organization can PROCESS is a missed opportunity]
Organizations are able to process less and less of the available data.
Enterprises are “more blind” to new opportunities.
100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80% of them spam and viruses. => Prefiltering is becoming more and more important.

What is BIG data?
Volume
– Terabytes, petabytes, even exabytes
Variety
– All kinds of data
– All kinds of analytics
Velocity
Agility
– Analyze data in hours instead of days, days instead of weeks
– Dynamically responsive
– Rapid data exploration
Traditional / non-traditional data sources
Store / Analyze / Explore
Volume * Variety * Velocity = Value

BigData Analytics

BigData Analytics – Predictive Analytics

BigData Analytics – Correlation / Text / NLP

BigData Analytics – Feature Extraction
Feature extraction involves simplifying the amount of resources
required to describe a large set of data accurately¹
¹: Wikipedia

Storage / Data
CPUs / Algorithm
Business Value / Insight

"sometimes it's not
who has the best
algorithm that wins;
it's who has the most
data."
(C) Google Inc.
The Unreasonable Effectiveness of Data¹
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
No Sampling => Work with full dataset => Long Tail Distributions

Realtime / In-Memory Computing:
InfoSphere Streams / Watson

The Paris Hilton Problem
Watson Workshop: What is Watson?

Introduction to Hadoop

BigInsights

BigInsights Demonstration

Hadoop Architecture

HDFS – Hadoop File System

Lab 1 – Hadoop Architecture
1) Start from chapter 1.2
2) Replace /home/biadmin with /home/biadminX, where X is your user ID
3) In chapter 1.3, skip task 1.3.1._1 and go to http://10.199.20.51:8080 instead
4) Skip 1.3.5
5) In chapter 1.3.6._30, use any file you like on your desktop computer

Map-Reduce

Data Parallelism

Aggregate Bandwidth between CPU, Main Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
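
The numbers above follow from dividing the data size by the combined bandwidth of all nodes. A minimal sketch of that arithmetic, assuming the 10 GByte/s figure is per node, the data is split evenly, and scaling is perfectly linear (real clusters lose some of this to coordination and skew). The function name scan_seconds is made up for illustration:

```python
def scan_seconds(data_bytes: float, per_node_bandwidth: float, nodes: int) -> float:
    """Ideal time to read data_bytes when it is split evenly across nodes."""
    # Assumption: every node contributes its full bandwidth in parallel.
    return data_bytes / (per_node_bandwidth * nodes)


TB = 10**12  # 1 terabyte in bytes (decimal units, as on the slide)
GB = 10**9   # 1 gigabyte in bytes

for nodes in (1, 10, 100, 1000):
    seconds = scan_seconds(1 * TB, 10 * GB, nodes)
    print(f"{nodes:>5} nodes: {seconds:g} s")
```

Each tenfold increase in nodes cuts the ideal scan time by a factor of ten, which is the core argument for data parallelism.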

Lab 2 - MapReduce
1) Skip task 1.1._1; use PuTTY to connect to biadmin@10.199.20.51 instead
2) Replace /home/biadmin with /home/biadminX, where X is your user ID
3) In 1.1._4 to 1.1._6, replace output with /home/biadminX/output, where X is your user ID
4) Skip chapter 1.2
5) Chapter 1.3 is optional (using your local virtual machine), maybe during lunch break :)

Pig, Jaql, Hive, BigSQL, SystemT/AQL

SQL for BigInsights
▪ Data warehouse augmentation is a very common use case for Hadoop
▪ While highly scalable, MapReduce is notoriously difficult to use
– Java API is tedious and requires programming expertise
– Unfamiliar languages (e.g. Pig) also require expertise
– Many different file formats, storage mechanisms, configuration options, etc.
– Joins, grouping, and sorting are tedious to orchestrate
▪ SQL support opens the data to a much wider audience
– Familiar, widely known syntax
– Common catalog for identifying data and structure
– Clear separation of defining the what (you want) vs. the how (to get it)
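
To illustrate the "what vs. how" point, here is a small in-memory sketch of a grouped count written as explicit map, shuffle, and reduce steps. It is plain Python, not the Hadoop API, and the sample records and the table name in the closing comment are made up for illustration:

```python
from collections import defaultdict

# Hypothetical sample records: (newsgroup, message_id)
records = [("comp.lang", 1), ("sci.math", 2), ("comp.lang", 3)]


def map_phase(record):
    """Emit one (key, 1) pair per input record."""
    group, _message_id = record
    yield group, 1


def shuffle(pairs):
    """Collect all values that share a key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts for one key."""
    return key, sum(values)


mapped = [pair for record in records for pair in map_phase(record)]
result = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(result)

# The declarative equivalent states only the "what":
#   SELECT newsgroup, COUNT(*) FROM messages GROUP BY newsgroup
```

The hand-written version has to spell out every step that the single SQL statement leaves to the engine.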

Query Processing
▪ Big SQL consists of two query processing engines
– The SQL optimization engine
– Jaql as the query execution engine
[Diagram labels: Client, SQL Engine, Jaql, Jaql SQL, Optimizer, Runtime]

Big SQL vs. Alternatives
▪ There are a number of SQL solutions; where does Big SQL fit in?
▪ Hive
– Open source
• Established Hadoop component
• Active development community
– Restrictive SQL syntax
• No subqueries (Hive 0.11 adds non-correlated subquery support)
• No windowed aggregates (Hive 0.11 adds windowed aggregate support)
• ANSI join syntax only
– Limited type support
• No varchar(n), decimal(p,s), etc.
– Poor client support
• Limited JDBC and ODBC drivers
– Poor low-latency query support (via local MapReduce)

Big SQL vs. Alternatives (cont.)
▪ Impala
– Recently open sourced
– Achieves low latency by bypassing MapReduce infrastructure
• Installs a completely separate execution infrastructure
• Can lead to resource scheduling conflicts
– Execution engine is C++
• Great for performance, but makes extending difficult (e.g. UDFs and UDAs)
• Support for a limited set of file formats
– Currently limited to broadcast joins
• All tables must fit in memory (aggregate cluster memory)
• Scalability limitation for larger clusters
– Uses Hive 0.9 query syntax (more limitations than the current Hive)
– Uses Hive 0.9 type system (more limitations than the current Hive)
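
The memory limitation of broadcast joins is easier to see in code. The following is a generic sketch of the broadcast hash join technique in plain Python, not Impala's actual implementation. The table contents and the function name are hypothetical:

```python
def broadcast_hash_join(large_partitions, small_table, key_large, key_small):
    """Join every partition of a large table against a full copy of a small table."""
    # Build the hash table once. In a cluster, a copy of it would be shipped to
    # every node, so the whole small table has to fit in each node's memory.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key_small], []).append(row)

    joined = []
    for partition in large_partitions:  # each partition would run on its own node
        for row in partition:
            for match in lookup.get(row[key_large], []):
                joined.append({**row, **match})
    return joined


# Hypothetical data: orders are partitioned, customers are broadcast.
orders = [
    [{"order_id": 1, "cust_id": 10}, {"order_id": 2, "cust_id": 20}],
    [{"order_id": 3, "cust_id": 10}],
]
customers = [{"cust_id": 10, "name": "Ada"}, {"cust_id": 30, "name": "Bob"}]

rows = broadcast_hash_join(orders, customers, "cust_id", "cust_id")
print(rows)
```

Because the lookup table is replicated rather than partitioned, the join needs no data exchange between nodes, but the replicated side cannot grow beyond what memory allows.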

Lab 3 – Querying Data with Pig, Hive, Jaql
1) Use PuTTY to connect to biadmin@10.199.20.51
2) Skip task 1.1._2; start the Jaql shell with the command /opt/ibm/biginsights/jaql/bin/jaqlshell
3) In 1.1._5, replace biadmin with biadminX, where X is your user ID
4) Skip chapter 1.2 (optional, using the virtual machine)
5) In 1.3._2, replace biadmin with biadminX, where X is your user ID
6) Instead of task 1.3._2, type /opt/ibm/biginsights/pig/bin/pig
7) In 1.3._4, replace sampleData/NewsGroups.csv with /user/biadminX/sampleData/NewsGroups.csv
8) Skip chapter 1.4 (optional, using the virtual machine)
9) Skip 1.5._12 and _13 and type /opt/ibm/biginsights/hive/bin/hive instead
10) Type "use biadminX", where X is your user ID
11) Continue with task _14

NoSQL Databases
▪ Column Store
– Hadoop / HBase
– Cassandra
– Amazon SimpleDB
▪ JSON / Document Store
– MongoDB
– CouchDB
▪ Key / Value Store
– Amazon DynamoDB
– Voldemort
▪ Graph DBs
– DB2 SPARQL Extension
– Neo4j
▪ MP RDBMS
– DB2 DPF, DB2 pureScale, PureData for Operational Analytics
– Oracle RAC
– Greenplum
More than 150 NoSQL databases are listed at http://nosql-database.org/

CAP Theorem / Brewer's Theorem
▪ It is impossible for a distributed computer system to simultaneously guarantee all three properties:
– Consistency (all nodes see the same data at the same time)
– Availability (every request receives a response indicating whether it succeeded or failed)
– Partition tolerance (the system continues to operate despite failure of part of the system)
▪ What about ACID?
– Atomicity
– Consistency
– Isolation
– Durability
▪ BASE, the new ACID
– Basically Available
– Soft state
– Eventual consistency
• Monotonic Read Consistency
• Monotonic Write Consistency
• Read Your Own Writes
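
Eventual consistency and "read your own writes" can be illustrated with a toy model. This is a simplified, single-process sketch with one primary and one lagging replica. All class and function names are invented for illustration and do not correspond to any particular database:

```python
class Replica:
    """Stores key -> (version, value); higher versions win."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key, (0, None))

    def put(self, key, version, value):
        if version > self.get(key)[0]:
            self.data[key] = (version, value)


class Session:
    """Client session that enforces read-your-own-writes."""

    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica
        self.last_written = {}  # key -> version this session wrote

    def write(self, key, value):
        version = self.primary.get(key)[0] + 1
        self.primary.put(key, version, value)
        self.last_written[key] = version

    def read(self, key):
        version, value = self.replica.get(key)
        if version < self.last_written.get(key, 0):
            # The replica is stale relative to our own write: ask the primary.
            version, value = self.primary.get(key)
        return value


def replicate(primary, replica):
    """Stand-in for asynchronous replication catching up."""
    for key, (version, value) in primary.data.items():
        replica.put(key, version, value)


primary, replica = Replica(), Replica()
alice, bob = Session(primary, replica), Session(primary, replica)

alice.write("status", "online")
alice_sees = alice.read("status")  # her own write is visible to her immediately
bob_before = bob.read("status")    # the replica has not caught up yet
replicate(primary, replica)
bob_after = bob.read("status")     # eventually every reader sees the write
print(alice_sees, bob_before, bob_after)
```

Bob's stale read is the "soft state" that BASE tolerates; the session-level version check is one simple way to give each client a consistent view of its own updates.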

Certification
▪ Go to www.bigdatauniversity.com
▪ Search for “hadoop fundamentals”
▪ Choose “Hadoop Fundamentals I – Version 2”
▪ Sign up
▪ Log in with an existing account or one of the following:
▪ Take the test:

Questions?
