SlideShare uma empresa Scribd logo
1 de 74
Baixar para ler offline
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 1
Big Data Analytics
Using Hadoop Cluster
On Amazon EMR
February 2015
Dr.Thanachart Numnonda
IMC Institute
thanachart@imcinstitute.com
Modifiy from Original Version by Danairat T.
Certified Java Programmer, TOGAF – Silver
danairat@gmail.com
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Running this lab using Amazon EMR
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Create an EMR cluster
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Architecture Overview of Amazon EMR
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Amazon EMR Cluster
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating an AWS account
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Signing up for the necessary services
●
Simple Storage Service (S3)
●
Elastic Compute Cloud (EC2)
●
Elastic MapReduce (EMR)
Caution! This costs real money!
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating Amazon S3 bucket
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Create access key using Security Credentials
in the AWS Management Console
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Select EMR
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating a cluster in EMR
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating a cluster in EMR (cont.)
Name the cluster and also specify Log folder
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating a cluster in EMR (cont.)
Leave the Software Configuration as default
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating a cluster in EMR (cont.)
Leave the Hardware Configuration as default
Choose an exisitng EC2 key pair
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating a cluster in EMR (cont.)
Leave the others as default
Select Create Cluster
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
EMR Cluster Details
Note on the Master public DNS:
To see the details on how to connect to the Master Node using SSH click at SSH
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
SSH Instruction
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Connect to the Master Node
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Web Interface Host on EMR Cluster
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 21
Launch the Hue Web Interface
●
Set Up an SSH Tunnel to the Master Node
– See instruction at
– http://docs.aws.amazon.com/ElasticMapReduce/latest/Developer
Guide/emr-ssh-tunnel.html
●
Configure Proxy Settings to View Websites
– See instruction at
– http://docs.aws.amazon.com/ElasticMapReduce/latest/Developer
Guide/emr-connect-master-node-proxy.html
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 22
Launch the Hue Web Interface (Cont.)
●
http://master-public-dns-name:8888/
●
Create your own Hue account
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 23
Launch the Hue Web Interface (Cont.)
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Importing/Exporting
Data to HDFS
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Importing Data to Hadoop
Download War and Peace Full Text
www.gutenberg.org/ebooks/2600
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Review file in Hadoop HDFS using
File Browse
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Create new directory
Create two new directory name: input and output
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Upload Files
Upload file: pg2600.txt into input directory
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Lecture: Understanding Map Reduce
Processing
Client
Name Node Job Tracker
Data Node
Task Tracker
Data Node
Task Tracker
Data Node
Task Tracker
Map Reduce
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
High Level Architecture of MapReduce
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Writing you own Map
Reduce Program
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Wordcount (HelloWord in Hadoop)
1. package org.myorg;
2.
3. import java.io.IOException;
4. import java.util.*;
5.
6. import org.apache.hadoop.fs.Path;
7. import org.apache.hadoop.conf.*;
8. import org.apache.hadoop.io.*;
9. import org.apache.hadoop.mapred.*;
10. import org.apache.hadoop.util.*;
11.
12. public class WordCount {
13.
14.
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text,
IntWritable> {
15. private final static IntWritable one = new IntWritable(1);
16. private Text word = new Text();
17.
18.
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
19. String line = value.toString();
20. StringTokenizer tokenizer = new StringTokenizer(line);
21. while (tokenizer.hasMoreTokens()) {
22. word.set(tokenizer.nextToken());
23. output.collect(word, one);
24. }
25. }
26. }
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Wordcount (HelloWord in Hadoop)
27.
28. public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text,
IntWritable> {
29.
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException {
30. int sum = 0;
31. while (values.hasNext()) {
32. sum += values.next().get();
33. }
34. output.collect(key, new IntWritable(sum));
35. }
36. }
37.
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Wordcount (HelloWord in Hadoop)
38. public static void main(String[] args) throws Exception {
39. JobConf conf = new JobConf(WordCount.class);
40. conf.setJobName("wordcount");
41.
42. conf.setOutputKeyClass(Text.class);
43. conf.setOutputValueClass(IntWritable.class);
44.
45. conf.setMapperClass(Map.class);
46.
47. conf.setReducerClass(Reduce.class);
48.
49. conf.setInputFormat(TextInputFormat.class);
50. conf.setOutputFormat(TextOutputFormat.class);
51.
52. FileInputFormat.setInputPaths(conf, new Path(args[1]));
53. FileOutputFormat.setOutputPath(conf, new Path(args[2]));
54.
55. JobClient.runJob(conf);
57. }
58. }
59.
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Writing Map/Reduce
Program on Eclipse
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting Eclipse in a local machine
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Create a Java Project
Let's name it HadoopWordCount
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 38
Add dependencies to the project
●
Note you may need to download Hadoop-core-jar.zip
●
Add the following two JARs to your build path
●
hadoop-common.jar and hadoop-mapreduce-client-core.jar.
●
By perform the following steps
– Add a folder named lib to the project
– Copy the mentioned JARs in this folder
– Right-click on the project name >> select Build Path >> then
Configure Build Path
– Click on Add Jars, select these two JARs from the lib folder
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 39
Add dependencies to the project
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 40
Writing a source code
●
Right click the project, the select New >> Package
●
Name the package as org.myorg
●
Right click at org.myorg, the select New >> Class
●
Name the package as WordCount
●
Writing a source code as shown in previoud slides
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 41
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 42
Building a Jar file
●
Right click the project, the select Export
●
Select Java and then JAR file
●
Provide the JAR name, as wordcount.jar
●
Leave the JAR package options as default
●
In the JAR Manifest Specification section, in the botton, specify the Main
class
●
In this case, select WordCount
●
Click on Finish
●
The JAR file will be build and will be located at cloudera/workspace
Note: you may need to re-size the dialog font size by select
Windows >> Preferences >> Appearance >> Colors and Fonts
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 43
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Running Map Reduce and
Deploying to Hadoop Runtime
Environment
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 45
Running a Jar file
●
Create a folder applications on Amazon S3
●
Upload wordcount.jar to s3://imcbucket/apps
Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 46
Running a Jar file (cont)
●
Open the Master node using SSH command
– ssh hadoop@ec2-54-213-220-37.us-west-2.compute.amazonaws.com
-i imckey.pem
●
Run the following commands
– $ mkdir apps
– $ hadoop fs -get s3://imcbucket/applications/wordcount.jar apps
– $ hadoop jar apps/wordcount.jar org.myorg.WordCount
s3://imcbucket/input/* s3://imcbucket/output/wordcount_result
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing MapReduce Output Result
Browse to the s3://imcbucket/output/wordcount_result
Open part-xxxx files
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing MapReduce Output Result
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Lecture
Understanding Hive
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Introduction
A Petabyte Scale Data Warehouse Using Hadoop
Hive is developed by Facebook, designed to enable easy data
summarization, ad-hoc querying and analysis of large
volumes of data. It provides a simple query language called
Hive QL, which is based on SQL
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Creating Table and
Retrieving Data using Hive
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Running Hive from the Master node
Starting Hive
hive> quit;
Quit from Hive
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting Hive Editor from Hue
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting Hive Editor from Hue
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating Hive Table
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 4.069 seconds
hive (default)> show tables;
OK
test_tbl
Time taken: 0.138 seconds
hive (default)> describe test_tbl;
OK
id int
country string
Time taken: 0.147 seconds
hive (default)>
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Using Hue Query Editor
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Using Hue Query Editor
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing Hive Table in HDFS
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Alter and Drop Hive Table
hive (default)> alter table test_tbl add columns (remarks STRING);
hive (default)> describe test_tbl;
OK
id int
country string
remarks string
Time taken: 0.077 seconds
hive (default)> drop table test_tbl;
OK
Time taken: 0.9 seconds
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Loading Data to Hive Table
$ hive
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Creating Hive table
hive (default)> LOAD DATA LOCAL INPATH '/tmp/country.csv' INTO TABLE test_tbl;
Copying data from file:/tmp/test_tbl_data.csv
Copying file: file:/tmp/test_tbl_data.csv
Loading data to table default.test_tbl
OK
Time taken: 0.241 seconds
hive (default)>
Loading data to Hive table
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Querying Data from Hive Table
hive (default)> select * from test_tbl;
OK
1 USA
62 Indonesia
63 Philippines
65 Singapore
66 Thailand
Time taken: 0.287 seconds
hive (default)>
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Insert Overwriting the Hive Table
hive (default)> LOAD DATA LOCAL INPATH
'/home/cloudera/Downloads/test_tbl_data_updated.csv' overwrite INTO
TABLE test_tbl;
Copying data from file:/tmp/test_tbl_data_updated.csv
Copying file: file:/tmp/test_tbl_data_updated.csv
Loading data to table default.test_tbl
Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl
OK
Time taken: 0.204 seconds
hive (default)>
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
MovieLens
http://grouplens.org/datasets/movielens/
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Create the Hive Table for movielen
hive (default)> CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 't'
STORED AS TEXTFILE;
hive (default)> LOAD DATA LOCAL INPATH
'/home/cloudera/Downloads/u.data' overwrite INTO TABLE u_data;
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Create the Hive Table for Apache LOf
hive (default)> CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^]*) ([^]*) ([^]*) (-|[^]*])
([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*")
([^ "]*|".*"))?"
)
STORED AS TEXTFILE;
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Lecture
Understanding Pig
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Introduction
A high-level platform for creating MapReduce programs Using Hadoop
Pig is a platform for analyzing large data sets that consists of
a high-level language for expressing data analysis programs,
coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is
amenable to substantial parallelization, which in turns enables
them to handle very large data sets.
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Running a Pig script
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting Pig Command Line
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting Pig from Hue
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
countryFilter.pig
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float,
lifeex:int, mysch:i
nt, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;
Writing a Pig Script
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
[hdadmin@localhost ~]$ cd Downloads
[hdadmin@localhost ~]$ pig -x local
grunt > run countryFilter.pig
....
(150,Cameroon,0.482,51,5,10,2031)
(126,Kyrgyzstan,0.615,67,9,12,2036)
(156,Nigeria,0.459,51,5,8,2069)
(154,Yemen,0.462,65,2,8,2213)
(138,Lao People's Democratic Republic,0.524,67,4,9,2242)
(153,Papua New Guinea,0.466,62,4,5,2271)
(165,Djibouti,0.43,57,3,5,2335)
(129,Nicaragua,0.589,74,5,10,2430)
(145,Pakistan,0.504,65,4,6,2550)
Running a Pig Script
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
Thank you
www.imcinstitute.com
www.facebook.com/imcinstitute

Mais conteúdo relacionado

Mais procurados

Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Simplilearn
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resources
DataWorks Summit
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
An Attacker's View of Serverless and GraphQL Apps - Abhay Bhargav - AppSec Ca...
An Attacker's View of Serverless and GraphQL Apps - Abhay Bhargav - AppSec Ca...An Attacker's View of Serverless and GraphQL Apps - Abhay Bhargav - AppSec Ca...
An Attacker's View of Serverless and GraphQL Apps - Abhay Bhargav - AppSec Ca...
Abhay Bhargav
 

Mais procurados (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Apache Kafka and ksqlDB in Action: Let's Build a Streaming Data Pipeline! (Ro...
Apache Kafka and ksqlDB in Action: Let's Build a Streaming Data Pipeline! (Ro...Apache Kafka and ksqlDB in Action: Let's Build a Streaming Data Pipeline! (Ro...
Apache Kafka and ksqlDB in Action: Let's Build a Streaming Data Pipeline! (Ro...
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache Sqoop
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resources
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
An Attacker's View of Serverless and GraphQL Apps - Abhay Bhargav - AppSec Ca...
An Attacker's View of Serverless and GraphQL Apps - Abhay Bhargav - AppSec Ca...An Attacker's View of Serverless and GraphQL Apps - Abhay Bhargav - AppSec Ca...
An Attacker's View of Serverless and GraphQL Apps - Abhay Bhargav - AppSec Ca...
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
PyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web ApplicationsPyCon AU 2012 - Debugging Live Python Web Applications
PyCon AU 2012 - Debugging Live Python Web Applications
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive Tutorial
 

Destaque

Set up Hadoop Cluster on Amazon EC2
Set up Hadoop Cluster on Amazon EC2Set up Hadoop Cluster on Amazon EC2
Set up Hadoop Cluster on Amazon EC2
IMC Institute
 
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัลCloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
IMC Institute
 

Destaque (20)

Set up Hadoop Cluster on Amazon EC2
Set up Hadoop Cluster on Amazon EC2Set up Hadoop Cluster on Amazon EC2
Set up Hadoop Cluster on Amazon EC2
 
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMRBig Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2
 
Big data
Big dataBig data
Big data
 
Service generated big data and big data-as-a-service
Service generated big data and big data-as-a-serviceService generated big data and big data-as-a-service
Service generated big data and big data-as-a-service
 
Big Data Analytics Using Hadoop
Big Data Analytics Using HadoopBig Data Analytics Using Hadoop
Big Data Analytics Using Hadoop
 
Big Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop WorkshopBig Data Programming Using Hadoop Workshop
Big Data Programming Using Hadoop Workshop
 
Thailand Hadoop Big Data Challenge #1
Thailand Hadoop Big Data Challenge #1Thailand Hadoop Big Data Challenge #1
Thailand Hadoop Big Data Challenge #1
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015
 
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัลCloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
 
Big Data on Public Cloud
Big Data on Public CloudBig Data on Public Cloud
Big Data on Public Cloud
 
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SME
การบริหารจัดการระบบ  Cloud Computing  สำหรับองค์กรธุรกิจ SMEการบริหารจัดการระบบ  Cloud Computing  สำหรับองค์กรธุรกิจ SME
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SME
 
Mahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformMahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud Platform
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Introduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIntroduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data Science
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Thailand ICT Review 2014
Thailand ICT Review 2014Thailand ICT Review 2014
Thailand ICT Review 2014
 
Analyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveAnalyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and Hive
 
Big Data as a Service
Big Data as a ServiceBig Data as a Service
Big Data as a Service
 

Semelhante a Big Data Analytics Using Hadoop Cluster On Amazon EMR

Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014
rpbrehm
 
Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster
Amazon-style shopping cart analysis using MapReduce on a Hadoop clusterAmazon-style shopping cart analysis using MapReduce on a Hadoop cluster
Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster
Asociatia ProLinux
 
Vipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentation
Vipul Divyanshu
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
Ajay Ohri
 

Semelhante a Big Data Analytics Using Hadoop Cluster On Amazon EMR (20)

Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)Big Data Hadoop Local and Public Cloud (Amazon EMR)
Big Data Hadoop Local and Public Cloud (Amazon EMR)
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
 
Setting up Hadoop YARN Clustering
Setting up Hadoop YARN ClusteringSetting up Hadoop YARN Clustering
Setting up Hadoop YARN Clustering
 
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and HiveAnalyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Big data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera QuickstartBig data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera Quickstart
 
Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014
 
Ben ford intro
Ben ford introBen ford intro
Ben ford intro
 
Telemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordTelemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben Ford
 
Automating hadoop jobs using rundeck
Automating hadoop jobs using rundeck  Automating hadoop jobs using rundeck
Automating hadoop jobs using rundeck
 
Big data processing using Cloudera Quickstart
Big data processing using Cloudera QuickstartBig data processing using Cloudera Quickstart
Big data processing using Cloudera Quickstart
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Serverless Data Architecture at scale on Google Cloud Platform
Serverless Data Architecture at scale on Google Cloud PlatformServerless Data Architecture at scale on Google Cloud Platform
Serverless Data Architecture at scale on Google Cloud Platform
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster
Amazon-style shopping cart analysis using MapReduce on a Hadoop clusterAmazon-style shopping cart analysis using MapReduce on a Hadoop cluster
Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster
 
Automated Multiplatform Compilation and Validation of a Collaborative Reposit...
Automated Multiplatform Compilation and Validation of a Collaborative Reposit...Automated Multiplatform Compilation and Validation of a Collaborative Reposit...
Automated Multiplatform Compilation and Validation of a Collaborative Reposit...
 
Vipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentation
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
Kubernetes - State of the Union (Q1-2016)
Kubernetes - State of the Union (Q1-2016)Kubernetes - State of the Union (Q1-2016)
Kubernetes - State of the Union (Q1-2016)
 
Data Science
Data ScienceData Science
Data Science
 

Mais de IMC Institute

Mais de IMC Institute (20)

นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14
 
Digital trends Vol 4 No. 13 Sep-Dec 2019
Digital trends Vol 4 No. 13  Sep-Dec 2019Digital trends Vol 4 No. 13  Sep-Dec 2019
Digital trends Vol 4 No. 13 Sep-Dec 2019
 
บทความ The evolution of AI
บทความ The evolution of AIบทความ The evolution of AI
บทความ The evolution of AI
 
IT Trends eMagazine Vol 4. No.12
IT Trends eMagazine  Vol 4. No.12IT Trends eMagazine  Vol 4. No.12
IT Trends eMagazine Vol 4. No.12
 
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformationเพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
 
IT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to WorkIT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to Work
 
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรมมูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
 
IT Trends eMagazine Vol 4. No.11
IT Trends eMagazine  Vol 4. No.11IT Trends eMagazine  Vol 4. No.11
IT Trends eMagazine Vol 4. No.11
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
 
บทความ The New Silicon Valley
บทความ The New Silicon Valleyบทความ The New Silicon Valley
บทความ The New Silicon Valley
 
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
 
The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)
 
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
 
IT Trends eMagazine Vol 3. No.9
IT Trends eMagazine  Vol 3. No.9 IT Trends eMagazine  Vol 3. No.9
IT Trends eMagazine Vol 3. No.9
 
Thailand software & software market survey 2016
Thailand software & software market survey 2016Thailand software & software market survey 2016
Thailand software & software market survey 2016
 
Developing Business Blockchain Applications on Hyperledger
Developing Business  Blockchain Applications on Hyperledger Developing Business  Blockchain Applications on Hyperledger
Developing Business Blockchain Applications on Hyperledger
 
Digital transformation @thanachart.org
Digital transformation @thanachart.orgDigital transformation @thanachart.org
Digital transformation @thanachart.org
 
บทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.orgบทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.org
 
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformationกลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Big Data Analytics Using Hadoop Cluster On Amazon EMR

  • 1. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 1 Big Data Analytics Using Hadoop Cluster On Amazon EMR February 2015 Dr.Thanachart Numnonda IMC Institute thanachart@imcinstitute.com Modifiy from Original Version by Danairat T. Certified Java Programmer, TOGAF – Silver danairat@gmail.com
  • 2. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Running this lab using Amazon EMR
  • 3. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Create an EMR cluster
  • 4. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Architecture Overview of Amazon EMR
  • 5. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Amazon EMR Cluster
  • 6. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Creating an AWS account
  • 7. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Signing up for the necessary services ● Simple Storage Service (S3) ● Elastic Compute Cloud (EC2) ● Elastic MapReduce (EMR) Caution! This costs real money!
  • 8. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Creating Amazon S3 bucket
  • 9. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Create access key using Security Credentials in the AWS Management Console
  • 10. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
  • 11. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Select EMR
  • 12. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Creating a cluster in EMR
  • 13. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Creating a cluster in EMR (cont.) Name the cluster and also specify Log folder
  • 14. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Creating a cluster in EMR (cont.) Leave the Software Configuration as default
  • 15. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Creating a cluster in EMR (cont.) Leave the Hardware Configuration as default Choose an exisitng EC2 key pair
  • 16. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Creating a cluster in EMR (cont.) Leave the others as default Select Create Cluster
  • 17. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop EMR Cluster Details Note on the Master public DNS: To see the details on how to connect to the Master Node using SSH click at SSH
  • 18. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop SSH Instruction
  • 19. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Connect to the Master Node
  • 20. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Web Interface Host on EMR Cluster
  • 21. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 21 Launch the Hue Web Interface ● Set Up an SSH Tunnel to the Master Node – See instruction at – http://docs.aws.amazon.com/ElasticMapReduce/latest/Developer Guide/emr-ssh-tunnel.html ● Configure Proxy Settings to View Websites – See instruction at – http://docs.aws.amazon.com/ElasticMapReduce/latest/Developer Guide/emr-connect-master-node-proxy.html
  • 22. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 22 Launch the Hue Web Interface (Cont.) ● http://master-public-dns-name:8888/ ● Create your own Hue account
  • 23. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 23 Launch the Hue Web Interface (Cont.)
  • 24. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Importing/Exporting Data to HDFS
  • 25. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Importing Data to Hadoop Download War and Peace Full Text www.gutenberg.org/ebooks/2600
  • 26. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Review file in Hadoop HDFS using File Browse
  • 27. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Create new directory Create two new directory name: input and output
  • 28. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Upload Files Upload file: pg2600.txt into input directory
  • 29. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Lecture: Understanding Map Reduce Processing Client Name Node Job Tracker Data Node Task Tracker Data Node Task Tracker Data Node Task Tracker Map Reduce
  • 30. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop High Level Architecture of MapReduce
  • 31. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Writing you own Map Reduce Program
  • 32. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Wordcount (HelloWord in Hadoop) 1. package org.myorg; 2. 3. import java.io.IOException; 4. import java.util.*; 5. 6. import org.apache.hadoop.fs.Path; 7. import org.apache.hadoop.conf.*; 8. import org.apache.hadoop.io.*; 9. import org.apache.hadoop.mapred.*; 10. import org.apache.hadoop.util.*; 11. 12. public class WordCount { 13. 14. public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { 15. private final static IntWritable one = new IntWritable(1); 16. private Text word = new Text(); 17. 18. public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 19. String line = value.toString(); 20. StringTokenizer tokenizer = new StringTokenizer(line); 21. while (tokenizer.hasMoreTokens()) { 22. word.set(tokenizer.nextToken()); 23. output.collect(word, one); 24. } 25. } 26. }
  • 33. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Wordcount (HelloWord in Hadoop) 27. 28. public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { 29. public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 30. int sum = 0; 31. while (values.hasNext()) { 32. sum += values.next().get(); 33. } 34. output.collect(key, new IntWritable(sum)); 35. } 36. } 37.
  • 34. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Wordcount (HelloWord in Hadoop) 38. public static void main(String[] args) throws Exception { 39. JobConf conf = new JobConf(WordCount.class); 40. conf.setJobName("wordcount"); 41. 42. conf.setOutputKeyClass(Text.class); 43. conf.setOutputValueClass(IntWritable.class); 44. 45. conf.setMapperClass(Map.class); 46. 47. conf.setReducerClass(Reduce.class); 48. 49. conf.setInputFormat(TextInputFormat.class); 50. conf.setOutputFormat(TextOutputFormat.class); 51. 52. FileInputFormat.setInputPaths(conf, new Path(args[1])); 53. FileOutputFormat.setOutputPath(conf, new Path(args[2])); 54. 55. JobClient.runJob(conf); 57. } 58. } 59.
  • 35. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Writing Map/Reduce Program on Eclipse
  • 36. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Starting Eclipse in a local machine
  • 37. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Create a Java Project Let's name it HadoopWordCount
  • 38. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 38 Add dependencies to the project ● Note you may need to download Hadoop-core-jar.zip ● Add the following two JARs to your build path ● hadoop-common.jar and hadoop-mapreduce-client-core.jar. ● By perform the following steps – Add a folder named lib to the project – Copy the mentioned JARs in this folder – Right-click on the project name >> select Build Path >> then Configure Build Path – Click on Add Jars, select these two JARs from the lib folder
  • 39. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 39 Add dependencies to the project
  • 40. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 40 Writing a source code ● Right click the project, the select New >> Package ● Name the package as org.myorg ● Right click at org.myorg, the select New >> Class ● Name the package as WordCount ● Writing a source code as shown in previoud slides
  • 41. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 41
  • 42. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 42 Building a Jar file ● Right click the project, the select Export ● Select Java and then JAR file ● Provide the JAR name, as wordcount.jar ● Leave the JAR package options as default ● In the JAR Manifest Specification section, in the botton, specify the Main class ● In this case, select WordCount ● Click on Finish ● The JAR file will be build and will be located at cloudera/workspace Note: you may need to re-size the dialog font size by select Windows >> Preferences >> Appearance >> Colors and Fonts
  • 43. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 43
  • 44. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Running Map Reduce and Deploying to Hadoop Runtime Environment
  • 45. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 45 Running a Jar file ● Create a folder applications on Amazon S3 ● Upload wordcount.jar to s3://imcbucket/apps
  • 46. Danairat T., 2013, danairat@gmail.comBig Data Hadoop – Hands On Workshop 46 Running a Jar file (cont) ● Open the Master node using SSH command – ssh hadoop@ec2-54-213-220-37.us-west-2.compute.amazonaws.com -i imckey.pem ● Run the following commands – $ mkdir apps – $ hadoop fs -get s3://imcbucket/applications/wordcount.jar apps – $ hadoop jar apps/wordcount.jar org.myorg.WordCount s3://imcbucket/input/* s3://imcbucket/output/wordcount_result
  • 47. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing MapReduce Output Result Browse to the s3://imcbucket/output/wordcount_result Open part-xxxx files
  • 48. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing MapReduce Output Result
  • 49. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Lecture Understanding Hive
  • 50. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Introduction A Petabyte Scale Data Warehouse Using Hadoop Hive is developed by Facebook, designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL
  • 51. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Creating Table and Retrieving Data using Hive
  • 52. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Running Hive from the Master node Starting Hive hive> quit; Quit from Hive
  • 53. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Starting Hive Editor from Hue
  • 54. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Starting Hive Editor from Hue
  • 55. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Creating Hive Table hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; OK Time taken: 4.069 seconds hive (default)> show tables; OK test_tbl Time taken: 0.138 seconds hive (default)> describe test_tbl; OK id int country string Time taken: 0.147 seconds hive (default)> See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
  • 56. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Using Hue Query Editor See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
  • 57. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Using Hue Query Editor See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
  • 58. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Reviewing Hive Table in HDFS
  • 59. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Alter and Drop Hive Table hive (default)> alter table test_tbl add columns (remarks STRING); hive (default)> describe test_tbl; OK id int country string remarks string Time taken: 0.077 seconds hive (default)> drop table test_tbl; OK Time taken: 0.9 seconds See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
  • 60. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Loading Data to Hive Table $ hive hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; Creating Hive table hive (default)> LOAD DATA LOCAL INPATH '/tmp/country.csv' INTO TABLE test_tbl; Copying data from file:/tmp/test_tbl_data.csv Copying file: file:/tmp/test_tbl_data.csv Loading data to table default.test_tbl OK Time taken: 0.241 seconds hive (default)> Loading data to Hive table
  • 61. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Querying Data from Hive Table hive (default)> select * from test_tbl; OK 1 USA 62 Indonesia 63 Philippines 65 Singapore 66 Thailand Time taken: 0.287 seconds hive (default)>
  • 62. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Insert Overwriting the Hive Table hive (default)> LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl; Copying data from file:/tmp/test_tbl_data_updated.csv Copying file: file:/tmp/test_tbl_data_updated.csv Loading data to table default.test_tbl Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl OK Time taken: 0.204 seconds hive (default)>
  • 63. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop MovieLens http://grouplens.org/datasets/movielens/
  • 64. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Create the Hive Table for movielen hive (default)> CREATE TABLE u_data ( userid INT, movieid INT, rating INT, unixtime STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE; hive (default)> LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/u.data' overwrite INTO TABLE u_data;
  • 65. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Create the Hive Table for Apache LOf hive (default)> CREATE TABLE apachelog ( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^]*) ([^]*) ([^]*) (-|[^]*]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?" ) STORED AS TEXTFILE;
  • 66. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Lecture Understanding Pig
  • 67. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Introduction A high-level platform for creating MapReduce programs Using Hadoop Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
  • 68. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Hands-On: Running a Pig script
  • 69. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Starting Pig Command Line
  • 70. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Starting Pig from Hue
  • 71. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop countryFilter.pig A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:i nt, eysch:int, gni:int); B = FILTER A BY gni > 2000; C = ORDER B BY gni; dump C; Writing a Pig Script
  • 72. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop [hdadmin@localhost ~]$ cd Downloads [hdadmin@localhost ~]$ pig -x local grunt > run countryFilter.pig .... (150,Cameroon,0.482,51,5,10,2031) (126,Kyrgyzstan,0.615,67,9,12,2036) (156,Nigeria,0.459,51,5,8,2069) (154,Yemen,0.462,65,2,8,2213) (138,Lao People's Democratic Republic,0.524,67,4,9,2242) (153,Papua New Guinea,0.466,62,4,5,2271) (165,Djibouti,0.43,57,3,5,2335) (129,Nicaragua,0.589,74,5,10,2430) (145,Pakistan,0.504,65,4,6,2550) Running a Pig Script
  • 73. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop
  • 74. hanachart Numnonda, thanachart@imcinstitute.com Feb 2015Big Data Hadoop on Amazon EMR – Hands On Workshop Thank you www.imcinstitute.com www.facebook.com/imcinstitute