SlideShare uma empresa Scribd logo
1 de 198
Baixar para ler offline
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 1
Big Data using Hadoop
Hands On Workshop
Dr.Thanachart Numnonda
Certified Java Programmer
thanachart@imcinstitute.com
Danairat T.
Certified Java Programmer, TOGAF – Silver
danairat@gmail.com, +66-81-559-1446
YARN
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Big Data Introduction
Volume
Variety Velocity
DB Table
Delimited Text
XML, HTML
Free Form Text
Image, Music, VDO, Binary
Batch
Near real time
Real time
GB
TB
PB
XB
ZB
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Big Data Introduction
Volume
Variety Velocity
DB Table
Delimited Text
XML, HTML
Free Form Text
Image, Music, VDO, Binary
Batch
Near real time
Real time
GB
TB
PB
XB
ZB
1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 4
Big Data
Players
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Big Data Architecture
Big Data InfrastructureBig Data Infrastructure
BI/Report
Next Best
Action
Distributed Data Processing
Integration and Metadata Framework
Distributed Data Store and DWH
Monitoring
and
Management
Framework
Security
Framework
Predictive
Analytics
Descriptive
Analytics
Prescriptive
Analytics
Big Data Platform
Big Data Applications
Hardware, Storage, Network
Fraud
Analysis
Cyber
Security
Talent
Search
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Lecture: Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hadoop Timeline
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Apache Hadoop Core Technology
j2eedev.org/ecosystem-hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Apache Hadoop Ecosystem
j2eedev.org/ecosystem-hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Big Data Platform & Big Data Analytics
Hadoop Technology
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Block Size = 64MB
Replication Factor = 3
HDFS: Hadoop Distributed File System
Cost/GB is a few
¢/month vs $/month
apache.org/hadoop/
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hadoop 1.0 vs Hadoop 2.0
Hortonwork.com
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hadoop 1.0 vs Hadoop 2.0
Hortonwork.com
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hadoop 2
Hortonworks.com
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hadoop Symbols and Reasons Behind
1
9
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Opportunity and Market Outlook
Source: Machina Research
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hands-On: Launch a virtual server
on EC2 Amazon Web Services
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hadoop Installation
Hadoop provides three installation choices:
1. Local mode: This is an unzip and run mode to
get you started right away where allparts of
Hadoop run within the same JVM
2. Pseudo distributed mode: This mode will be
run on different parts of Hadoop as different
Java processors, but within a single machine
3. Distributed mode: This is the real setup that
spans multiple machines
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Virtual Server
This lab will use a EC2 virtual server to install a
Hadoop server using the following features:
● Ubuntu Server 14.04 LTS
● m3.mediun 1vCPU, 3.75 GB memory
● Security group: default
● Keypair: imchadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Select a EC2 service and click on Lunch Instance
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Select an Amazon Machine Image (AMI) and
Ubuntu Server 14.04 LTS (PV)
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Choose m3.medium Type virtual server
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Leave configuration details as default
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Add Storage: 20 GB
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Name the instance
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Select an existing security group > Select Security
Group Name: default
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Click Launch and choose imchadoop as a key pair
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Review an instance / click Connect for
an instruction to connect to the instance
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Connect to an instance from Mac/Linux
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Connect to an instance from Windows using Putty
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Connect to an instance from Windows using Putty
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Connect to an instance from Windows using Putty
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Connect to an instance from Windows using Putty
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Login as: ubuntu
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hands-On: Installing Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Installing Hadoop and Ecosystem
1. Update System Software Repository
2. Configuring SSH
3. Installing Java
4. Download/Extract Hadoop
5. Installing Hadoop
6. Configure Hadoop
7. Formatting Namenode
8. Starting Hadoop
9. Accessing Hadoop Web Console
10. Stopping Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
1. Update System Software Repository
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
sudo apt-get update
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
2. Configuring SSH
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Install SSH, Create ssh-key
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Exit from ssh connection
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
3. Installing Java
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Installing Java
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
4. Download/Extract Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Download/Extract Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
5. Installing Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Installing Hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Execute environment variables: source ~/.bashrc
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Edit hadoop shell script
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Edit hadoop shell script – log location to another directory
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Edit Yarn shell script – log location to another directory
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
6. Configuring Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Edit hadoop - core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ip-172-31-4-165:9000</value>
</property>
</configuration>
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Creating directories for namenode and datanode
sudo mkdir -p /var/hadoop_data/namenode
sudo mkdir -p /var/hadoop_data/datanode
sudo chown ubuntu:ubuntu -R /var/hadoop_data
Edit hdfs-site.xml
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Edit yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ip-172-31-4-165</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>ip-172-31-4-165:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>ip-172-31-4-165:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>ip-172-31-4-165:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>ip-172-31-4-165:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>ip-172-31-4-165:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Edit mapred-site.xml
cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
7. Formatting Namenode
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Formatting Namenode
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Formatting Namenode
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
8. Starting Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Starting Namenode and Datanode
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Starting Yarn
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
9. Accessing Hadoop Web Console
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Checking Public IT Address
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Accessing Hadoop Web Console
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hadoop Port Numbers
Daemon Default Port Configuration Parameter in conf/*-
site.xml
HDFS Namenode Web 50070 dfs.http.address
Datanodes 50075 dfs.datanode.http.address
Secondarynamenode 50090 dfs.secondary.http.address
Yarn ResourceManager Web 8088 yarn.resourcemanager.webapp.address
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
10. Stopping Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Stop Hadoop Yarn and HDFS
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hands-On: Importing Data to HDFS
using Hadoop Command Line
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Creating Hadoop HDFS Directories and
Importing file to Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hands-On: Traversing, Retrieving
Data from HDFS
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Review data in Hadoop HDFS
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Deleting file from HDFS
rm
Usage: hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]
Delete files specified as args.
Options:
The -f option will not display a diagnostic message or modify the
exit status to reflect an error if the file does not exist.
The -R option deletes the directory and any content under it
recursively.
The -r option is equivalent to -R.
The -skipTrash option will bypass trash, if enabled, and delete
the specified file(s) immediately. This can be useful when it is
necessary to delete files from an over-quota directory.
Example:
hdfs dfs -rm hdfs://nn.example.com/file /user/hadoop/mydir
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Lecture: YARN
YARN
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
YARN: Yet Another Resource Negotiator
Hadoop.apache.org
MRV2 maintains API compatibility with previous stable release
(hadoop-1.x). This means that all Map-Reduce jobs should still run
unchanged on top of MRv2 with just a recompile.
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Lecture: Map Reduce
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
MapReduce
Hadoop Environment
Hadoop.apache.org
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
MapReduce Framework
map: (K1, V1) -> list(K2, V2))
reduce: (K2, list(V2)) -> list(K3, V3)
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
How does the MapReduce work?
Output in a list of (Key, List of Values)
in the intermediate file
Sorting
Partitioning
Output in a list of (Key, Value)
in the intermediate file
InputSplit
RecordReade
r
RecordWriter
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
How does the MapReduce work?
Sorting
Partitioning
Combining
Car, 2
Car, 2
Bear, {1,1}
Car, {2,1}
River, {1,1}
Deer, {1,1}
Output in a list of (Key, List of Values)
in the intermediate file
Output in a list of (Key, Value)
in the intermediate file
InputSplit
RecordReade
r
RecordWriter
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
MapReduce Processing – The Data
flow
1. InputFormat, InputSplits, RecordReader
2. Mapper - your focus is here
3. Partition, Shuffle & Sort
4. Reducer - your focus is here
5. OutputFormat, RecordWriter
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
InputFormat
InputFormat: Description: Key: Value:
TextInputFormat
Default format; reads
lines of text files
The byte offset of the
line
The line contents
KeyValueInputFormat
Parses lines into key,
val pairs
Everything up to the
first tab character
The remainder of the
line
SequenceFileInputFor
mat
A Hadoop-specific
high-performance
binary format
user-defined user-defined
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
InputSplit
An InputSplit describes a unit of work that comprises a single map
task.
InputSplit presents a byte-oriented view of the input.
You can control this value by setting the mapred.min.split.size
parameter in core-site.xml, or by overriding the parameter in the
JobConf object used to submit a particular MapReduce job.
RecordReader
RecordReader reads <key, value> pairs from an InputSplit.
Typically the RecordReader converts the byte-oriented view of
the input, provided by the InputSplit, and presents a record-
oriented to the Mapper
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Mapper
Mapper: The Mapper performs the user-defined logic to the input a
key, value and emits (key, value) pair(s) which are forwarded to the
Reducers.
Partition, Shuffle & Sort
After the first map tasks have completed, the nodes may still be
performing several more map tasks each. But they also begin
exchanging the intermediate outputs from the map tasks to where they
are required by the reducers.
Partitioner controls the partitioning of map-outputs to assign to reduce
task . he total number of partitions is the same as the number of reduce
tasks for the job
The set of intermediate keys on a single node is automatically sorted
by internal Hadoop before they are presented to the Reducer
This process of moving map outputs to the reducers is known as
shuffling.
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Reducer
This is an instance of user-provided code that performs read each
key, iterator of values in the partition assigned. The OutputCollector
object in Reducer phase has a method named collect() which will
collect a (key, value) output.
OutputFormat, Record Writer
OutputFormat governs the writing format in OutputCollector and
RecordWriter writes output into HDFS.
OutputFormat: Description
TextOutputFormat
Default; writes lines in "key t value"
form
SequenceFileOutputFormat
Writes binary files suitable for
reading into subsequent MapReduce
jobs
NullOutputFormat generates no output files
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hands-On: Writing Your Own Map Reduce
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Writing Your Own Map Reduce
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Writing Your Own Map Reduce
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
Continue (มีต่อ)..
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Writing Your Own Map Reduce
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
Continue (มีต่อ)..
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Writing Your Own Map Reduce
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hands-On: Packaging Map Reduce
and Deploying to Hadoop Runtime
Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
mkdir wordcount_classes
javac -classpath /usr/local/hadoop/share/hadoop/common/hadoop-common-
2.6.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-
core-2.6.0.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar -d
wordcount_classes WordCount.java
jar -cvf ./wordcount.jar -C wordcount_classes/ .
yarn jar ./wordcount.jar WordCount /inputs/* /outputs/wordcount_output_dir01
hdfs dfs -cat /outputs/wordcount_output_dir01/part-r-00000
Create java classes directory (สร้าง directory สําหรับ java class ที่ถูก compile แล้ว
Typing in one single line (พิมพต่อเนื่องเลย อย่ากดแป้ นขึ้นบรรทัดใหม่)
Create jar file (สร้าง jar file อย่าลืมใส่เครื่องหมายจุดด้านหลังด้วย)
Execute yarn (สั่งให้ yarn ทํางาน)
Review the result (สั่งให้พิมพ์ผลลัพท์ออกหน้าจอ)
Please find next page for screen step by step (โปรดดูหน้าถัดไปสําหรับหน้าจอในแต่ละขั้นตอน)...
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Packaging Map Reduce and Deploying to
Hadoop Runtime Environment
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
(Optional) Review OS File System
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
(Optional) Review OS File System
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
(Optional) Review OS File System
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Lecture
Understanding Hive
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Introduction
A Petabyte Scale Data Warehouse Using Hadoop
Hive is developed by Facebook, designed to enable easy data
summarization, ad-hoc querying and analysis of large
volumes of data. It provides a simple query language called
Hive QL, which is based on SQL
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
What Hive is NOT
Hive is not designed for online transaction processing and
does not offer real-time queries and row level updates. It is
best used for batch jobs over large sets of immutable data
(like web logs, etc.).
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Sample HiveQL
The Query compiler uses the information stored in the metastore to
convert SQL queries into a sequence of map/reduce jobs, e.g. the
following query
SELECT * FROM t where t.c = 'xyz'
SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)
SELECT t1.c1, count(1) from t1 group by t1.c1
Hive.apache.or
g
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Sample HiveQL
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
System Architecture and Components
• Metastore: To store the meta data.
• Query compiler and execution engine: To convert SQL queries to a
sequence of map/reduce jobs that are then executed on Hadoop.
• SerDe and ObjectInspectors: Programmable interfaces and
implementations of common data formats and types.
A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or
binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will
take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another
supported system.
• UDF and UDAF: Programmable interfaces and implementations for
user defined functions (scalar and aggregate functions).
• Clients: Command line client similar to Mysql command line.
hive.apache.or
g
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Architecture Overview
HDFS
Hive CLI
Querie
s
Browsing
Map Reduce
MetaStore
Thrift API
SerDe
Thrif
t
Jute JSON..
Execution
Hive QL
Parser
Planner
Mgmt.
WebUI
HDFS
DDL
Hive
Hive.apache.org
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hive Metastore
Hive Metastore is a repository to keep all Hive
metadata; Tables and Partitions definition.
By default, Hive will store its metadata in
Derby DB
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hive Built in Functions
Return Type Function Name (Signature) Description
BIGINT round(double a) returns the rounded BIGINT value of the double
BIGINT floor(double a) returns the maximum BIGINT value that is equal or less than the double
BIGINT ceil(double a) returns the minimum BIGINT value that is equal or greater than the double
double rand(), rand(int seed)
returns a random number (that changes from row to row). Specifiying the seed will make sure the generated
random number sequence is deterministic.
string concat(string A, string B,...)
returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This
function accepts arbitrary number of arguments and return the concatenation of all of them.
string substr(string A, int start)
returns the substring of A starting from start position till the end of string A. For example, substr('foobar', 4) results in
'bar'
string substr(string A, int start, int length) returns the substring of A starting from start position with the given length e.g. substr('foobar', 4, 2) results in 'ba'
string upper(string A)
returns the string resulting from converting all characters of A to upper case e.g. upper('fOoBaR') results in
'FOOBAR'
string ucase(string A) Same as upper
string lower(string A) returns the string resulting from converting all characters of B to lower case e.g. lower('fOoBaR') results in 'foobar'
string lcase(string A) Same as lower
string trim(string A) returns the string resulting from trimming spaces from both ends of A e.g. trim(' foobar ') results in 'foobar'
string ltrim(string A)
returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar
') results in 'foobar '
string rtrim(string A)
returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ')
results in ' foobar'
string
regexp_replace(string A, string B,
string C)
returns the string resulting from replacing all substrings in B that match the Java regular expression syntax(See
Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo|ar', ) returns 'fb'
string from_unixtime(int unixtime)
convert the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp
of that moment in the current system time zone in the format of "1970-01-01 00:00:00"
string to_date(string timestamp) Return the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01"
int year(string date)
Return the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") =
1970
int month(string date)
Return the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") =
11
int day(string date) Return the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
string
get_json_object(string json_string,
string path)
Extract json object from a json string based on json path specified, and return json string of the extracted json
object. It will return null if the input json string is invalid
hive.apache.org
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hive Aggregate Functions
Return Type
Aggregation Function
Name (Signature)
Description
BIGINT
count(*), count(expr),
count(DISTINCT expr[,
expr_.])
count(*) - Returns the total number of retrieved rows,
including rows containing NULL values; count(expr) - Returns
the number of rows for which the supplied expression is non-
NULL; count(DISTINCT expr[, expr]) - Returns the number of
rows for which the supplied expression(s) are unique and
non-NULL.
DOUBLE
sum(col), sum(DISTINCT
col)
returns the sum of the elements in the group or the sum of
the distinct values of the column in the group
DOUBLE avg(col), avg(DISTINCT col)
returns the average of the elements in the group or the
average of the distinct values of the column in the group
DOUBLE min(col) returns the minimum value of the column in the group
DOUBLE max(col) returns the maximum value of the column in the group
hive.apache.org
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Running Hive
Hive Shell
● Interactive
hive
● Script
hive -f myscript
● Inline
hive -e 'SELECT * FROM mytable'
Hive.apache.or
g
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hive Commands
ortonworks.com
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hive Tables
● Managed- CREATE TABLE
− LOAD- File moved into Hive's data warehouse directory
− DROP- Both data and metadata are deleted.
● External- CREATE EXTERNAL TABLE
− LOAD- No file moved
− DROP- Only metadata deleted
− Use when sharing data between Hive and Hadoop applications
or you want to use multiple schema on the same data
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hive External Table
Dropping External Table using Hive:-
- Hive will delete metadata from metastore
- Hive will NOT delete the HDFS file
- You need to manually delete the HDFS file
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Java JDBC for Hive
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveJdbcClient {
private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
public static void main(String[] args) throws SQLException {
try {
Class.forName(driverName);
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
System.exit(1);
}
Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
Statement stmt = con.createStatement();
String tableName = "testHiveDriverTable";
stmt.executeQuery("drop table " + tableName);
ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
if (res.next()) {
System.out.println(res.getString(1));
}
// describe table
sql = "describe " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1) + "t" + res.getString(2));
}
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Java JDBC for Hive
// load data into table
// NOTE: filepath has to be local to the hive server
// NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
String filepath = "/tmp/a.txt";
sql = "load data local inpath '" + filepath + "' into table " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
// select * query
sql = "select * from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(String.valueOf(res.getInt(1)) + "t" + res.getString(2));
}
// regular hive query
sql = "select count(1) from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1));
}
}
}
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
HiveQL and MySQL Comparison
ortonworks.com
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
HiveQL and MySQL Query Comparison
ortonworks.com
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hands-On: Creating Table and
Retrieving Data using Hive
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hive Hands-On Labs
1. Installing Hive
2. Configuring / Starting Hive
3. Creating Hive Table
4. Reviewing Hive Table in HDFS
5. Alter and Drop Hive Table
6. Loading Data to Hive Table
7. Querying Data from Hive Table
8. Reviewing Hive Table Content from HDFS Command
and WebUI
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
1. Installing Hive
$ wget http://www.us.apache.org/dist/hive/hive-
1.0.0/apache-hive-1.0.0-bin.tar.gz# tar -xvzf apache-hive-
1.1.0-bin.tar.gz
$ tar -xvf apache-hive-1.0.0-bin.tar.gz
$ mv apache-hive-1.0.0-bin apache-hive
$ sudo mv ./apache-hive /usr/local/
$ chown -R ubuntu:ubuntu /usr/local/apache-hive
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
1. Installing Hive
Edit $HOME ./bashrc
$ vi /home/ubuntu/.bashrc
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
1. Installing Hive
Execute environment variable
Create hive directories
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
2. Configuring Hive
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
2. Starting Hive
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Show Hive Table
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
3. Creating Hive Table
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 4.069 seconds
hive (default)> show tables;
OK
test_tbl
Time taken: 0.138 seconds
hive (default)> describe test_tbl;
OK
id int
country string
Time taken: 0.147 seconds
hive (default)>
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
3. Creating Hive Table
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
4. Reviewing Hive Table in HDFS
[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51
/user/hive/warehouse/test_tbl
[hdadmin@localhost hdadmin]$
Review Hive Table from
HDFS WebUI
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
5. Alter and Drop Hive Table
hive (default)> alter table test_tbl add columns (remarks STRING);
hive (default)> describe test_tbl;
OK
id int
country string
remarks string
Time taken: 0.077 seconds
hive (default)> drop table test_tbl;
OK
Time taken: 0.9 seconds
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
6. Loading Data to Hive Table
Please get the country_data.csv from your instructor.
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
6. Loading Data to Hive Table
Please get the country_data.csv from your instructor.
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
7. Querying Data from Hive Table
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
8. Reviewing Hive Table Content from HDFS
Command and WebUI
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Lecture: Pig
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Introduction
A high-level platform for creating MapReduce programs Using Hadoop
Pig is a platform for analyzing large data sets that consists of
a high-level language for expressing data analysis programs,
coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is
amenable to substantial parallelization, which in turns enables
them to handle very large data sets.
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Pig Components
● Two Compnents
● Language (Pig Latin)
● Compiler
● Two Execution Environments
● Local
pig -x local
● Distributed
pig -x mapreduce
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Running Pig
● Script
pig myscript
● Command line (Grunt)
pig
● Embedded
Writing a java program
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Pig Dataflow Organization
1. A LOAD statement reads data from the file system.
2. A series of "transformation" statements process the data.
3. A STORE statement writes output to the file system; or, a DUMP
statement displays output to the screen.
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Pig Latin
A = load 'hdi-data.csv' using PigStorage(',') AS
(id:int, country:chararray, hdi:float, lifeex:int,
mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Pig Command Language
Cloudera, 2009
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Pig Execution Stages
Source Introduction to Apache Hadoop-Pig: PrashantKommireddi
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Why Pig?
● Makes writing Hadoop jobs easier
● You don't need to be a programmer to write Pig scripts
● Provide major functionality required for
DatawareHouse and Analytics
● Load, Filter, Join, Group By, Order, Transform
● User can write custom UDFs (User Defined Function)
Source Introduction to Apache Hadoop-Pig: PrashantKommireddi
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Pig v.s. Hive
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Pig v.s. Hive
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hands-On: Running a Pig script
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Installing Pig
$ wget http://apache.cs.utah.edu/pig/pig-0.14.0/pig-0.14.0.tar.gz
$ tar -xvf pig-0.14.0.tar.gz
$ mv pig-0.14.0 pig
$ sudo mv pig /usr/local/
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Edit System Environment Variables
vi ./.bashrc
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Execute System Environment
source ~/.bashrc
Download sample data
Import sample data to HDFS
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Write your own Pig script
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Running Pig shell
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Execute Pig Script
run countryFilter.pig
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Quit from Pig Shell
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Lecture: Sqoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Introduction
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line
tool with the following capabilities:
• Imports individual tables or entire databases to files in
HDFS
• Generates Java classes to allow you to interact with your
imported data
• Provides the ability to import from SQL databases straight
into your Hive data warehouse
See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Architecture Overview
Hive.apache.or
g
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Hands-On: Loading Data from DBMS
to Hadoop HDFS
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Sqoop Hands-On Labs
1. Loading Data into MySQL DB
2. Installing Sqoop
3. Configuring Sqoop
4. Installing DB driver for Sqoop
5. Importing data from MySQL to Hadoop
6. Reviewing HDFS data
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
1. Loading Data into MySQL DB
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Create sample sql scripts
Please get the script from your instructor
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Install MySQL
sudo apt-get install mysql-server
Leave black for password in our class room testing
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Install MySQL
Done installation
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Start MySQL service
Press enter for password
Create database name: countrydb
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Select database name: countrydb
Create table name: country_tbl
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Import your sample data to MySQL
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Review your sample data from MySQL
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Review your sample data from MySQL
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
2. Installing Sqoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Install Sqoop
wget http://mirror.cc.columbia.edu/pub/software/apache/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Edit System Environment Variables
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Execute System Environment Variables
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
3. Configuring Sqoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Configuring Sqoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Configuring Sqoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
4. Installing DB driver for Sqoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Install MySQL DB Driver
wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Install MySQL DB Driver
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
5. Importing data from MySQL to
Hadoop
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Sqoop Import Command
sqoop import -connect jdbc:mysql://localhost/countrydb -username root -table country_tbl -m 1
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
6. Reviewing HDFS data
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Review HDFS result data
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Review HDFS result data
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Review HDFS result data
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Review HDFS result data
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Stop Hadoop Yarn and HDFS
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Recommendation to Further Study
Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
Thank you very much

Mais conteúdo relacionado

Mais procurados

Performance of Microservice frameworks on different JVMs
Performance of Microservice frameworks on different JVMsPerformance of Microservice frameworks on different JVMs
Performance of Microservice frameworks on different JVMs
Maarten Smeets
 

Mais procurados (20)

Spark access control on Amazon EMR with AWS Lake Formation
Spark access control on Amazon EMR with AWS Lake FormationSpark access control on Amazon EMR with AWS Lake Formation
Spark access control on Amazon EMR with AWS Lake Formation
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
Performance of Microservice frameworks on different JVMs
Performance of Microservice frameworks on different JVMsPerformance of Microservice frameworks on different JVMs
Performance of Microservice frameworks on different JVMs
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Fall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using GrafanaFall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using Grafana
 
Deep Dive on Amazon Aurora MySQL Performance Tuning (DAT429-R1) - AWS re:Inve...
Deep Dive on Amazon Aurora MySQL Performance Tuning (DAT429-R1) - AWS re:Inve...Deep Dive on Amazon Aurora MySQL Performance Tuning (DAT429-R1) - AWS re:Inve...
Deep Dive on Amazon Aurora MySQL Performance Tuning (DAT429-R1) - AWS re:Inve...
 
AWS Webinar 201: Designing scalable, available & resilient cloud applications
AWS Webinar 201: Designing scalable, available & resilient cloud applicationsAWS Webinar 201: Designing scalable, available & resilient cloud applications
AWS Webinar 201: Designing scalable, available & resilient cloud applications
 
Rapid Application Development on AWS
Rapid Application Development on AWSRapid Application Development on AWS
Rapid Application Development on AWS
 
Deep dive into flink interval join
Deep dive into flink interval joinDeep dive into flink interval join
Deep dive into flink interval join
 
Cloud Network Virtualization with Juniper Contrail
Cloud Network Virtualization with Juniper ContrailCloud Network Virtualization with Juniper Contrail
Cloud Network Virtualization with Juniper Contrail
 
NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?
 
Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Introduction to Open Telemetry as Observability Library
Introduction to Open  Telemetry as Observability LibraryIntroduction to Open  Telemetry as Observability Library
Introduction to Open Telemetry as Observability Library
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Developer’s guide to contributing code to Kafka with Mickael Maison and Tom B...
Developer’s guide to contributing code to Kafka with Mickael Maison and Tom B...Developer’s guide to contributing code to Kafka with Mickael Maison and Tom B...
Developer’s guide to contributing code to Kafka with Mickael Maison and Tom B...
 
Kafka Overview
Kafka OverviewKafka Overview
Kafka Overview
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 

Destaque

Set up Hadoop Cluster on Amazon EC2
Set up Hadoop Cluster on Amazon EC2Set up Hadoop Cluster on Amazon EC2
Set up Hadoop Cluster on Amazon EC2
IMC Institute
 

Destaque (20)

Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and HiveAnalyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
 
Introduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIntroduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data Science
 
Big Data Maturity Model and Governance
Big Data Maturity Model and GovernanceBig Data Maturity Model and Governance
Big Data Maturity Model and Governance
 
บทความ Big Data School ใน IMC e-Magazine
บทความ Big Data School ใน IMC e-Magazineบทความ Big Data School ใน IMC e-Magazine
บทความ Big Data School ใน IMC e-Magazine
 
Big Data on Public Cloud
Big Data on Public CloudBig Data on Public Cloud
Big Data on Public Cloud
 
Big Data as a Service
Big Data as a ServiceBig Data as a Service
Big Data as a Service
 
Mobile User and App Analytics in China
Mobile User and App Analytics in ChinaMobile User and App Analytics in China
Mobile User and App Analytics in China
 
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
 
Big data project management
Big data project managementBig data project management
Big data project management
 
Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015
 
Big data processing using Cloudera Quickstart
Big data processing using Cloudera QuickstartBig data processing using Cloudera Quickstart
Big data processing using Cloudera Quickstart
 
Machine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlibMachine Learning using Apache Spark MLlib
Machine Learning using Apache Spark MLlib
 
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษา
เทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษาเทคโนโลยี  Cloud Computing  สำหรับงานสถาบันการศึกษา
เทคโนโลยี Cloud Computing สำหรับงานสถาบันการศึกษา
 
Apache Spark in Action
Apache Spark in ActionApache Spark in Action
Apache Spark in Action
 
ITSS Overview
ITSS OverviewITSS Overview
ITSS Overview
 
บทความสรุปงานสัมมนา IT Trends 2017 ของ IMC Institute
บทความสรุปงานสัมมนา IT Trends 2017 ของ  IMC Instituteบทความสรุปงานสัมมนา IT Trends 2017 ของ  IMC Institute
บทความสรุปงานสัมมนา IT Trends 2017 ของ IMC Institute
 
Install Apache Hadoop for Development/Production
Install Apache Hadoop for  Development/ProductionInstall Apache Hadoop for  Development/Production
Install Apache Hadoop for Development/Production
 
Apache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainerApache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainer
 
Set up Hadoop Cluster on Amazon EC2
Set up Hadoop Cluster on Amazon EC2Set up Hadoop Cluster on Amazon EC2
Set up Hadoop Cluster on Amazon EC2
 
Big Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopBig Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop Workshop
 

Semelhante a Hadoop Hand-on Lab: Installing Hadoop 2

Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil Jadhav
Swapnil (Neil) Jadhav
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013
Dataiku
 
Hadoop uk user group meeting final
Hadoop uk user group meeting finalHadoop uk user group meeting final
Hadoop uk user group meeting final
Skills Matter
 

Semelhante a Hadoop Hand-on Lab: Installing Hadoop 2 (20)

Analyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveAnalyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and Hive
 
Setting up Hadoop YARN Clustering
Setting up Hadoop YARN ClusteringSetting up Hadoop YARN Clustering
Setting up Hadoop YARN Clustering
 
Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015
 
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMRBig Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
 
Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)
 
Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil Jadhav
 
Adventure in Data: A tour of visualization projects at Twitter
Adventure in Data: A tour of visualization projects at TwitterAdventure in Data: A tour of visualization projects at Twitter
Adventure in Data: A tour of visualization projects at Twitter
 
Level Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop AccelerationLevel Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop Acceleration
 
What to expect when you are visualizing
What to expect when you are visualizingWhat to expect when you are visualizing
What to expect when you are visualizing
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guide
 
Pinax Long Tutorial Slides
Pinax Long Tutorial SlidesPinax Long Tutorial Slides
Pinax Long Tutorial Slides
 
Using apache mx net in production deep learning streaming pipelines
Using apache mx net in production deep learning streaming pipelinesUsing apache mx net in production deep learning streaming pipelines
Using apache mx net in production deep learning streaming pipelines
 
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4) Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4)
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013
 
Hadoop uk user group meeting final
Hadoop uk user group meeting finalHadoop uk user group meeting final
Hadoop uk user group meeting final
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
 

Mais de IMC Institute

Mais de IMC Institute (20)

นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14
 
Digital trends Vol 4 No. 13 Sep-Dec 2019
Digital trends Vol 4 No. 13  Sep-Dec 2019Digital trends Vol 4 No. 13  Sep-Dec 2019
Digital trends Vol 4 No. 13 Sep-Dec 2019
 
บทความ The evolution of AI
บทความ The evolution of AIบทความ The evolution of AI
บทความ The evolution of AI
 
IT Trends eMagazine Vol 4. No.12
IT Trends eMagazine  Vol 4. No.12IT Trends eMagazine  Vol 4. No.12
IT Trends eMagazine Vol 4. No.12
 
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformationเพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
 
IT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to WorkIT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to Work
 
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรมมูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
 
IT Trends eMagazine Vol 4. No.11
IT Trends eMagazine  Vol 4. No.11IT Trends eMagazine  Vol 4. No.11
IT Trends eMagazine Vol 4. No.11
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
 
บทความ The New Silicon Valley
บทความ The New Silicon Valleyบทความ The New Silicon Valley
บทความ The New Silicon Valley
 
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
 
The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)
 
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
 
IT Trends eMagazine Vol 3. No.9
IT Trends eMagazine  Vol 3. No.9 IT Trends eMagazine  Vol 3. No.9
IT Trends eMagazine Vol 3. No.9
 
Thailand software & software market survey 2016
Thailand software & software market survey 2016Thailand software & software market survey 2016
Thailand software & software market survey 2016
 
Developing Business Blockchain Applications on Hyperledger
Developing Business  Blockchain Applications on Hyperledger Developing Business  Blockchain Applications on Hyperledger
Developing Business Blockchain Applications on Hyperledger
 
Digital transformation @thanachart.org
Digital transformation @thanachart.orgDigital transformation @thanachart.org
Digital transformation @thanachart.org
 
บทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.orgบทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.org
 
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformationกลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
 

Último

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Hadoop Hand-on Lab: Installing Hadoop 2

  • 1. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 1 Big Data using Hadoop Hands On Workshop Dr.Thanachart Numnonda Certified Java Programmer thanachart@imcinstitute.com Danairat T. Certified Java Programmer, TOGAF – Silver danairat@gmail.com, +66-81-559-1446 YARN
  • 2. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Big Data Introduction Volume Variety Velocity DB Table Delimited Text XML, HTML Free Form Text Image, Music, VDO, Binary Batch Near real time Real time GB TB PB XB ZB
  • 3. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Big Data Introduction Volume Variety Velocity DB Table Delimited Text XML, HTML Free Form Text Image, Music, VDO, Binary Batch Near real time Real time GB TB PB XB ZB 1000 Kilobytes = 1 Megabyte 1000 Megabytes = 1 Gigabyte 1000 Gigabytes = 1 Terabyte 1000 Terabytes = 1 Petabyte 1000 Petabytes = 1 Exabyte 1000 Exabytes = 1 Zettabyte 1000 Zettabytes = 1 Yottabyte 1000 Yottabytes = 1 Brontobyte 1000 Brontobytes = 1 Geopbyte
  • 4. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 4 Big Data Players
  • 5. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Big Data Architecture Big Data InfrastructureBig Data Infrastructure BI/Report Next Best Action Distributed Data Processing Integration and Metadata Framework Distributed Data Store and DWH Monitoring and Management Framework Security Framework Predictive Analytics Descriptive Analytics Prescriptive Analytics Big Data Platform Big Data Applications Hardware, Storage, Network Fraud Analysis Cyber Security Talent Search
  • 6. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Lecture: Hadoop
  • 7. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hadoop Timeline
  • 8. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Apache Hadoop Core Technology j2eedev.org/ecosystem-hadoop
  • 9. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Apache Hadoop Ecosystem j2eedev.org/ecosystem-hadoop
  • 10. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Big Data Platform & Big Data Analytics Hadoop Technology
  • 11. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Block Size = 64MB Replication Factor = 3 HDFS: Hadoop Distributed File System Cost/GB is a few ¢/month vs $/month apache.org/hadoop/
  • 12. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hadoop 1.0 vs Hadoop 2.0 Hortonwork.com
  • 13. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hadoop 1.0 vs Hadoop 2.0 Hortonwork.com
  • 14. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hadoop 2 Hortonworks.com
  • 15. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hadoop Symbols and Reasons Behind 1 9
  • 16. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Opportunity and Market Outlook Source: Machina Research
  • 17. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hands-On: Launch a virtual server on EC2 Amazon Web Services
  • 18. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop
  • 19. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hadoop Installation Hadoop provides three installation choices: 1. Local mode: This is an unzip and run mode to get you started right away where allparts of Hadoop run within the same JVM 2. Pseudo distributed mode: This mode will be run on different parts of Hadoop as different Java processors, but within a single machine 3. Distributed mode: This is the real setup that spans multiple machines
  • 20. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Virtual Server This lab will use a EC2 virtual server to install a Hadoop server using the following features: ● Ubuntu Server 14.04 LTS ● m3.mediun 1vCPU, 3.75 GB memory ● Security group: default ● Keypair: imchadoop
  • 21. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Select a EC2 service and click on Lunch Instance
  • 22. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Select an Amazon Machine Image (AMI) and Ubuntu Server 14.04 LTS (PV)
  • 23. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Choose m3.medium Type virtual server
  • 24. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Leave configuration details as default
  • 25. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Add Storage: 20 GB
  • 26. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Name the instance
  • 27. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Select an existing security group > Select Security Group Name: default
  • 28. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Click Launch and choose imchadoop as a key pair
  • 29. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Review an instance / click Connect for an instruction to connect to the instance
  • 30. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Connect to an instance from Mac/Linux
  • 31. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Connect to an instance from Windows using Putty
  • 32. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Connect to an instance from Windows using Putty
  • 33. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Connect to an instance from Windows using Putty
  • 34. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Connect to an instance from Windows using Putty
  • 35. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Login as: ubuntu
  • 36. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hands-On: Installing Hadoop
  • 37. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Installing Hadoop and Ecosystem 1. Update System Software Repository 2. Configuring SSH 3. Installing Java 4. Download/Extract Hadoop 5. Installing Hadoop 6. Configure Hadoop 7. Formatting Namenode 8. Starting Hadoop 9. Accessing Hadoop Web Console 10. Stopping Hadoop
  • 38. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 1. Update System Software Repository
  • 39. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop sudo apt-get update
  • 40. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 2. Configuring SSH
  • 41. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Install SSH, Create ssh-key
  • 42. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  • 43. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Exit from ssh connection
  • 44. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 3. Installing Java
  • 45. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Installing Java
  • 46. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 4. Download/Extract Hadoop
  • 47. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Download/Extract Hadoop
  • 48. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 5. Installing Hadoop
  • 49. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Installing Hadoop export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 export PATH=$PATH:$JAVA_HOME/bin export HADOOP_HOME=/usr/local/hadoop export PATH=$PATH:$HADOOP_HOME/bin export PATH=$PATH:$HADOOP_HOME/sbin
  • 50. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Execute environment variables: source ~/.bashrc
  • 51. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Edit hadoop shell script
  • 52. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Edit hadoop shell script – log location to another directory
  • 53. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Edit Yarn shell script – log location to another directory
  • 54. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 6. Configuring Hadoop
  • 55. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Edit hadoop - core-site.xml <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://ip-172-31-4-165:9000</value> </property> </configuration>
  • 56. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Creating directories for namenode and datanode sudo mkdir -p /var/hadoop_data/namenode sudo mkdir -p /var/hadoop_data/datanode sudo chown ubuntu:ubuntu -R /var/hadoop_data Edit hdfs-site.xml
  • 57. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Edit yarn-site.xml <configuration> <property> <name>yarn.resourcemanager.hostname</name> <value>ip-172-31-4-165</value> </property> <property> <name>yarn.resourcemanager.scheduler.address</name> <value>ip-172-31-4-165:8030</value> </property> <property> <name>yarn.resourcemanager.resource-tracker.address</name> <value>ip-172-31-4-165:8031</value> </property> <property> <name>yarn.resourcemanager.address</name> <value>ip-172-31-4-165:8032</value> </property> <property> <name>yarn.resourcemanager.admin.address</name> <value>ip-172-31-4-165:8033</value> </property> <property> <name>yarn.resourcemanager.webapp.address</name> <value>ip-172-31-4-165:8088</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name> <value>org.apache.hadoop.mapred.ShuffleHandler</value> </property> </configuration>
  • 58. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Edit mapred-site.xml cp mapred-site.xml.template mapred-site.xml vi mapred-site.xml
  • 59. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 7. Formatting Namenode
  • 60. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Formatting Namenode
  • 61. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Formatting Namenode
  • 62. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 8. Starting Hadoop
  • 63. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Starting Namenode and Datanode
  • 64. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Starting Yarn
  • 65. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 9. Accessing Hadoop Web Console
  • 66. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Checking Public IT Address
  • 67. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Accessing Hadoop Web Console
  • 68. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hadoop Port Numbers Daemon Default Port Configuration Parameter in conf/*- site.xml HDFS Namenode Web 50070 dfs.http.address Datanodes 50075 dfs.datanode.http.address Secondarynamenode 50090 dfs.secondary.http.address Yarn ResourceManager Web 8088 yarn.resourcemanager.webapp.address
  • 69. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 10. Stopping Hadoop
  • 70. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Stop Hadoop Yarn and HDFS
  • 71. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hands-On: Importing Data to HDFS using Hadoop Command Line
  • 72. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Creating Hadoop HDFS Directories and Importing file to Hadoop
  • 73. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hands-On: Traversing, Retrieving Data from HDFS
  • 74. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Review data in Hadoop HDFS
  • 75. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Deleting file from HDFS rm Usage: hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...] Delete files specified as args. Options: The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist. The -R option deletes the directory and any content under it recursively. The -r option is equivalent to -R. The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory. Example: hdfs dfs -rm hdfs://nn.example.com/file /user/hadoop/mydir
  • 76. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Lecture: YARN YARN
  • 77. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop YARN: Yet Another Resource Negotiator Hadoop.apache.org MRV2 maintains API compatibility with previous stable release (hadoop-1.x). This means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just a recompile.
  • 78. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Lecture: Map Reduce
  • 79. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop MapReduce Hadoop Environment Hadoop.apache.org
  • 80. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop MapReduce Framework map: (K1, V1) -> list(K2, V2)) reduce: (K2, list(V2)) -> list(K3, V3)
  • 81. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop How does the MapReduce work? Output in a list of (Key, List of Values) in the intermediate file Sorting Partitioning Output in a list of (Key, Value) in the intermediate file InputSplit RecordReade r RecordWriter
  • 82. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop How does the MapReduce work? Sorting Partitioning Combining Car, 2 Car, 2 Bear, {1,1} Car, {2,1} River, {1,1} Deer, {1,1} Output in a list of (Key, List of Values) in the intermediate file Output in a list of (Key, Value) in the intermediate file InputSplit RecordReade r RecordWriter
  • 83. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop MapReduce Processing – The Data flow 1. InputFormat, InputSplits, RecordReader 2. Mapper - your focus is here 3. Partition, Shuffle & Sort 4. Reducer - your focus is here 5. OutputFormat, RecordWriter
  • 84. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop InputFormat InputFormat: Description: Key: Value: TextInputFormat Default format; reads lines of text files The byte offset of the line The line contents KeyValueInputFormat Parses lines into key, val pairs Everything up to the first tab character The remainder of the line SequenceFileInputFor mat A Hadoop-specific high-performance binary format user-defined user-defined
  • 85. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop InputSplit An InputSplit describes a unit of work that comprises a single map task. InputSplit presents a byte-oriented view of the input. You can control this value by setting the mapred.min.split.size parameter in core-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job. RecordReader RecordReader reads <key, value> pairs from an InputSplit. Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record- oriented to the Mapper
  • 86. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Mapper Mapper: The Mapper performs the user-defined logic to the input a key, value and emits (key, value) pair(s) which are forwarded to the Reducers. Partition, Shuffle & Sort After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. Partitioner controls the partitioning of map-outputs to assign to reduce task . he total number of partitions is the same as the number of reduce tasks for the job The set of intermediate keys on a single node is automatically sorted by internal Hadoop before they are presented to the Reducer This process of moving map outputs to the reducers is known as shuffling.
  • 87. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Reducer This is an instance of user-provided code that performs read each key, iterator of values in the partition assigned. The OutputCollector object in Reducer phase has a method named collect() which will collect a (key, value) output. OutputFormat, Record Writer OutputFormat governs the writing format in OutputCollector and RecordWriter writes output into HDFS. OutputFormat: Description TextOutputFormat Default; writes lines in "key t value" form SequenceFileOutputFormat Writes binary files suitable for reading into subsequent MapReduce jobs NullOutputFormat generates no output files
  • 88. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hands-On: Writing Your Own Map Reduce
  • 89. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Writing Your Own Map Reduce
  • 90. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Writing Your Own Map Reduce import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class WordCount { public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } Continue (มีต่อ)..
  • 91. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Writing Your Own Map Reduce public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } Continue (มีต่อ)..
  • 92. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Writing Your Own Map Reduce public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } }
  • 93. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 94. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment mkdir wordcount_classes javac -classpath /usr/local/hadoop/share/hadoop/common/hadoop-common- 2.6.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client- core-2.6.0.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar -d wordcount_classes WordCount.java jar -cvf ./wordcount.jar -C wordcount_classes/ . yarn jar ./wordcount.jar WordCount /inputs/* /outputs/wordcount_output_dir01 hdfs dfs -cat /outputs/wordcount_output_dir01/part-r-00000 Create java classes directory (สร้าง directory สําหรับ java class ที่ถูก compile แล้ว Typing in one single line (พิมพต่อเนื่องเลย อย่ากดแป้ นขึ้นบรรทัดใหม่) Create jar file (สร้าง jar file อย่าลืมใส่เครื่องหมายจุดด้านหลังด้วย) Execute yarn (สั่งให้ yarn ทํางาน) Review the result (สั่งให้พิมพ์ผลลัพท์ออกหน้าจอ) Please find next page for screen step by step (โปรดดูหน้าถัดไปสําหรับหน้าจอในแต่ละขั้นตอน)...
  • 95. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 96. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 97. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 98. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 99. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 100. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 101. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 102. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 103. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 104. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 105. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 106. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 107. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 108. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Packaging Map Reduce and Deploying to Hadoop Runtime Environment
  • 109. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop (Optional) Review OS File System
  • 110. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop (Optional) Review OS File System
  • 111. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop (Optional) Review OS File System
  • 112. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Lecture Understanding Hive
  • 113. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Introduction A Petabyte Scale Data Warehouse Using Hadoop Hive is developed by Facebook, designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL
  • 114. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop What Hive is NOT Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).
  • 115. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Sample HiveQL The Query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following query SELECT * FROM t where t.c = 'xyz' SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1) SELECT t1.c1, count(1) from t1 group by t1.c1 Hive.apache.or g
  • 116. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Sample HiveQL
  • 117. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop System Architecture and Components • Metastore: To store the meta data. • Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop. • SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. • UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions). • Clients: Command line client similar to Mysql command line. hive.apache.or g
  • 118. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Architecture Overview HDFS Hive CLI Querie s Browsing Map Reduce MetaStore Thrift API SerDe Thrif t Jute JSON.. Execution Hive QL Parser Planner Mgmt. WebUI HDFS DDL Hive Hive.apache.org
  • 119. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hive Metastore Hive Metastore is a repository to keep all Hive metadata; Tables and Partitions definition. By default, Hive will store its metadata in Derby DB
  • 120. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hive Built in Functions Return Type Function Name (Signature) Description BIGINT round(double a) returns the rounded BIGINT value of the double BIGINT floor(double a) returns the maximum BIGINT value that is equal or less than the double BIGINT ceil(double a) returns the minimum BIGINT value that is equal or greater than the double double rand(), rand(int seed) returns a random number (that changes from row to row). Specifiying the seed will make sure the generated random number sequence is deterministic. string concat(string A, string B,...) returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts arbitrary number of arguments and return the concatenation of all of them. string substr(string A, int start) returns the substring of A starting from start position till the end of string A. For example, substr('foobar', 4) results in 'bar' string substr(string A, int start, int length) returns the substring of A starting from start position with the given length e.g. substr('foobar', 4, 2) results in 'ba' string upper(string A) returns the string resulting from converting all characters of A to upper case e.g. upper('fOoBaR') results in 'FOOBAR' string ucase(string A) Same as upper string lower(string A) returns the string resulting from converting all characters of B to lower case e.g. lower('fOoBaR') results in 'foobar' string lcase(string A) Same as lower string trim(string A) returns the string resulting from trimming spaces from both ends of A e.g. trim(' foobar ') results in 'foobar' string ltrim(string A) returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar ') results in 'foobar ' string rtrim(string A) returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar' string regexp_replace(string A, string B, string C) returns the string resulting from replacing all substrings in B that match the Java regular expression syntax(See Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo|ar', ) returns 'fb' string from_unixtime(int unixtime) convert the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00" string to_date(string timestamp) Return the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" int year(string date) Return the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970 int month(string date) Return the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11 int day(string date) Return the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1 string get_json_object(string json_string, string path) Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input json string is invalid hive.apache.org
  • 121. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hive Aggregate Functions Return Type Aggregation Function Name (Signature) Description BIGINT count(*), count(expr), count(DISTINCT expr[, expr_.]) count(*) - Returns the total number of retrieved rows, including rows containing NULL values; count(expr) - Returns the number of rows for which the supplied expression is non- NULL; count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. DOUBLE sum(col), sum(DISTINCT col) returns the sum of the elements in the group or the sum of the distinct values of the column in the group DOUBLE avg(col), avg(DISTINCT col) returns the average of the elements in the group or the average of the distinct values of the column in the group DOUBLE min(col) returns the minimum value of the column in the group DOUBLE max(col) returns the maximum value of the column in the group hive.apache.org
  • 122. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Running Hive Hive Shell ● Interactive hive ● Script hive -f myscript ● Inline hive -e 'SELECT * FROM mytable' Hive.apache.or g
  • 123. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hive Commands ortonworks.com
  • 124. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hive Tables ● Managed- CREATE TABLE − LOAD- File moved into Hive's data warehouse directory − DROP- Both data and metadata are deleted. ● External- CREATE EXTERNAL TABLE − LOAD- No file moved − DROP- Only metadata deleted − Use when sharing data between Hive and Hadoop applications or you want to use multiple schema on the same data
  • 125. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hive External Table Dropping External Table using Hive:- - Hive will delete metadata from metastore - Hive will NOT delete the HDFS file - You need to manually delete the HDFS file
  • 126. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Java JDBC for Hive import java.sql.SQLException; import java.sql.Connection; import java.sql.ResultSet; import java.sql.Statement; import java.sql.DriverManager; public class HiveJdbcClient { private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver"; public static void main(String[] args) throws SQLException { try { Class.forName(driverName); } catch (ClassNotFoundException e) { // TODO Auto-generated catch block e.printStackTrace(); System.exit(1); } Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", ""); Statement stmt = con.createStatement(); String tableName = "testHiveDriverTable"; stmt.executeQuery("drop table " + tableName); ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)"); // show tables String sql = "show tables '" + tableName + "'"; System.out.println("Running: " + sql); res = stmt.executeQuery(sql); if (res.next()) { System.out.println(res.getString(1)); } // describe table sql = "describe " + tableName; System.out.println("Running: " + sql); res = stmt.executeQuery(sql); while (res.next()) { System.out.println(res.getString(1) + "t" + res.getString(2)); }
  • 127. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Java JDBC for Hive // load data into table // NOTE: filepath has to be local to the hive server // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line String filepath = "/tmp/a.txt"; sql = "load data local inpath '" + filepath + "' into table " + tableName; System.out.println("Running: " + sql); res = stmt.executeQuery(sql); // select * query sql = "select * from " + tableName; System.out.println("Running: " + sql); res = stmt.executeQuery(sql); while (res.next()) { System.out.println(String.valueOf(res.getInt(1)) + "t" + res.getString(2)); } // regular hive query sql = "select count(1) from " + tableName; System.out.println("Running: " + sql); res = stmt.executeQuery(sql); while (res.next()) { System.out.println(res.getString(1)); } } }
  • 128. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop HiveQL and MySQL Comparison ortonworks.com
  • 129. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop HiveQL and MySQL Query Comparison ortonworks.com
  • 130. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hands-On: Creating Table and Retrieving Data using Hive
  • 131. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hive Hands-On Labs 1. Installing Hive 2. Configuring / Starting Hive 3. Creating Hive Table 4. Reviewing Hive Table in HDFS 5. Alter and Drop Hive Table 6. Loading Data to Hive Table 7. Querying Data from Hive Table 8. Reviewing Hive Table Content from HDFS Command and WebUI
  • 132. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 1. Installing Hive $ wget http://www.us.apache.org/dist/hive/hive- 1.0.0/apache-hive-1.0.0-bin.tar.gz# tar -xvzf apache-hive- 1.1.0-bin.tar.gz $ tar -xvf apache-hive-1.0.0-bin.tar.gz $ mv apache-hive-1.0.0-bin apache-hive $ sudo mv ./apache-hive /usr/local/ $ chown -R ubuntu:ubuntu /usr/local/apache-hive
  • 133. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 1. Installing Hive Edit $HOME ./bashrc $ vi /home/ubuntu/.bashrc
  • 134. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 1. Installing Hive Execute environment variable Create hive directories
  • 135. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 2. Configuring Hive
  • 136. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 2. Starting Hive
  • 137. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Show Hive Table
  • 138. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 3. Creating Hive Table hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; OK Time taken: 4.069 seconds hive (default)> show tables; OK test_tbl Time taken: 0.138 seconds hive (default)> describe test_tbl; OK id int country string Time taken: 0.147 seconds hive (default)> See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
  • 139. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 3. Creating Hive Table
  • 140. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 4. Reviewing Hive Table in HDFS [hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse Found 1 items drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl [hdadmin@localhost hdadmin]$ Review Hive Table from HDFS WebUI
  • 141. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 5. Alter and Drop Hive Table hive (default)> alter table test_tbl add columns (remarks STRING); hive (default)> describe test_tbl; OK id int country string remarks string Time taken: 0.077 seconds hive (default)> drop table test_tbl; OK Time taken: 0.9 seconds See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
  • 142. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 6. Loading Data to Hive Table Please get the country_data.csv from your instructor.
  • 143. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 6. Loading Data to Hive Table Please get the country_data.csv from your instructor.
  • 144. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 7. Querying Data from Hive Table
  • 145. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 8. Reviewing Hive Table Content from HDFS Command and WebUI
  • 146. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Lecture: Pig
  • 147. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Introduction A high-level platform for creating MapReduce programs Using Hadoop Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
  • 148. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Pig Components ● Two Compnents ● Language (Pig Latin) ● Compiler ● Two Execution Environments ● Local pig -x local ● Distributed pig -x mapreduce
  • 149. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Running Pig ● Script pig myscript ● Command line (Grunt) pig ● Embedded Writing a java program
  • 150. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Pig Dataflow Organization 1. A LOAD statement reads data from the file system. 2. A series of "transformation" statements process the data. 3. A STORE statement writes output to the file system; or, a DUMP statement displays output to the screen.
  • 151. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Pig Latin A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int); B = FILTER A BY gni > 2000; C = ORDER B BY gni; dump C;
  • 152. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Pig Command Language Cloudera, 2009
  • 153. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Pig Execution Stages Source Introduction to Apache Hadoop-Pig: PrashantKommireddi
  • 154. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Why Pig? ● Makes writing Hadoop jobs easier ● You don't need to be a programmer to write Pig scripts ● Provide major functionality required for DatawareHouse and Analytics ● Load, Filter, Join, Group By, Order, Transform ● User can write custom UDFs (User Defined Function) Source Introduction to Apache Hadoop-Pig: PrashantKommireddi
  • 155. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Pig v.s. Hive
  • 156. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Pig v.s. Hive
  • 157. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hands-On: Running a Pig script
  • 158. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Installing Pig $ wget http://apache.cs.utah.edu/pig/pig-0.14.0/pig-0.14.0.tar.gz $ tar -xvf pig-0.14.0.tar.gz $ mv pig-0.14.0 pig $ sudo mv pig /usr/local/
  • 159. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Edit System Environment Variables vi ./.bashrc
  • 160. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Execute System Environment source ~/.bashrc Download sample data Import sample data to HDFS
  • 161. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Write your own Pig script
  • 162. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Running Pig shell
  • 163. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Execute Pig Script run countryFilter.pig
  • 164. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Quit from Pig Shell
  • 165. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Lecture: Sqoop
  • 166. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Introduction Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities: • Imports individual tables or entire databases to files in HDFS • Generates Java classes to allow you to interact with your imported data • Provides the ability to import from SQL databases straight into your Hive data warehouse See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
  • 167. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Architecture Overview Hive.apache.or g
  • 168. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Hands-On: Loading Data from DBMS to Hadoop HDFS
  • 169. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Sqoop Hands-On Labs 1. Loading Data into MySQL DB 2. Installing Sqoop 3. Configuring Sqoop 4. Installing DB driver for Sqoop 5. Importing data from MySQL to Hadoop 6. Reviewing HDFS data
  • 170. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 1. Loading Data into MySQL DB
  • 171. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Create sample sql scripts Please get the script from your instructor
  • 172. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Install MySQL sudo apt-get install mysql-server Leave black for password in our class room testing
  • 173. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Install MySQL Done installation
  • 174. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Start MySQL service Press enter for password Create database name: countrydb
  • 175. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Select database name: countrydb Create table name: country_tbl
  • 176. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Import your sample data to MySQL
  • 177. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Review your sample data from MySQL
  • 178. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Review your sample data from MySQL
  • 179. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 2. Installing Sqoop
  • 180. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Install Sqoop wget http://mirror.cc.columbia.edu/pub/software/apache/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz
  • 181. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Edit System Environment Variables
  • 182. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Execute System Environment Variables
  • 183. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 3. Configuring Sqoop
  • 184. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Configuring Sqoop
  • 185. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Configuring Sqoop
  • 186. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 4. Installing DB driver for Sqoop
  • 187. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Install MySQL DB Driver wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar
  • 188. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Install MySQL DB Driver
  • 189. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 5. Importing data from MySQL to Hadoop
  • 190. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Sqoop Import Command sqoop import -connect jdbc:mysql://localhost/countrydb -username root -table country_tbl -m 1
  • 191. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop 6. Reviewing HDFS data
  • 192. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Review HDFS result data
  • 193. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Review HDFS result data
  • 194. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Review HDFS result data
  • 195. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Review HDFS result data
  • 196. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Stop Hadoop Yarn and HDFS
  • 197. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Recommendation to Further Study
  • 198. Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015Big Data Hadoop Workshop Thank you very much