BIG DATA INFRASTRUCTURE –
INTRODUCTION TO HADOOP WITH
MAP REDUCE, PIG, AND HIVE
Gil Benghiat
Eric Estabrooks
Chris Bergh
Open Data Science Conference, Boston 2015
@opendatasci
Agenda
Presentation
• Introductions
• Hadoop Overview & Comparisons
• What do I use when?
Doing
• AWS EMR
• Hive
• Pig
• Impala
6/1/2015 2
Introductions
Meet DataKitchen
Chris Bergh (Head Chef)
Gil Benghiat (VP Product)
Eric Estabrooks (VP Cloud and Data Services)
Software development and executive experience delivering enterprise software focused on the Marketing and Health Care sectors.
Deep analytic experience: spent the past decade solving analytic challenges.
New approach to data preparation and production: focused on Data Analysts and Data Scientists.
Analysts And Their Teams Are Spending 60-80% Of Their Time On Data Preparation And Production
This creates an expectation gap
[Diagram: Business Customer Expectation vs. Analyst Reality, each divided among Analyze, Prepare Data, and Communicate]
The business does not think that Analysts are preparing data
Analysts don’t want to prepare data
DataKitchen is on a mission to integrate and organize data to make analysts and data scientists super-powered.
Meet the Audience: A few questions
• Who considers themselves
• Data scientist
• Data analyst
• Programmer / Scripter
• On the Business side
• Who knows SQL – can write a select statement?
• Who has used AWS before today?
Hadoop Overview
What Is Apache Hadoop?
• Software framework
• Distributed processing of large scale datasets
• Cluster of commodity hardware
• Promise of lower cost
• Has many frameworks, modules and projects
http://hadoop.apache.org/
Hadoop ecosystem frameworks
[Diagram: processing frameworks layered over storage (HDFS, Cassandra, HBase, S3); asterisks mark the frameworks covered in the talk and in the hands-on session]
Source: Mark Grover, http://radar.oreilly.com/2015/02/processing-frameworks-for-hadoop.html
Hadoop has been evolving
[Chart: Google Trends search interest, 2005 to 2015, for Map Reduce, Impala, Hadoop, Pig, and “Big Data”]
What is Hadoop good for?
• Problems that are huge and can be run in parallel over immutable data
• NOT OLTP (e.g. backend to e-commerce site)
• Providing frameworks to build software
• Map Reduce
• Spark
• Tez
• A backend for visualization tools
Map Reduce
http://www.cs.berkeley.edu/~matei/talks/2010/amp_mapreduce.pdf
Test your system in the small
1. Make a small data set
2. Test like this:
$ cat data.txt | map | sort | reduce
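The pipeline above can be rehearsed in a single process before touching a cluster. The following is a minimal sketch, assuming a word count job and the tab-separated key/value text lines that Hadoop Streaming uses by default; the names mapper, reducer, and run_local are illustrative and not part of the talk.

```python
import io

def mapper(lines, out):
    """Map step: write one 'word<TAB>1' line per word read."""
    for line in lines:
        for word in line.lower().split():
            out.write(f"{word}\t1\n")

def reducer(sorted_lines, out):
    """Reduce step: sum the counts of each run of identical keys."""
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                out.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        out.write(f"{current}\t{total}\n")

def run_local(text):
    """Mimic `cat data.txt | map | sort | reduce` inside one process."""
    mapped = io.StringIO()
    mapper(io.StringIO(text), mapped)
    # The sort stands in for the shuffle: it puts identical keys next to each other.
    shuffled = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    reducer(shuffled, reduced)
    return reduced.getvalue()

print(run_local("big data big cluster\ndata data\n"), end="")
```

In a real streaming job the mapper and the reducer would each be a separate script reading standard input and writing standard output; keeping them as plain functions here only makes the small test easy to run.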
You can write map reduce jobs in your favorite language
Streaming Interface
• Lets you specify mappers and reducers
• Supports
• Java
• Python
• Ruby
• Unix Shell
• R
• Any executable
Map Reduce “generators”
• Results in map reduce jobs
• Pig
• Hive
Applications that lend themselves to map reduce
• Word Count
• PDF Generation (NY Times 11,000,000 articles)
• Analysis of stock market historical data (ROI and standard deviation)
• Geographical Data (Finding intersections, rendering map files)
• Log file querying and analysis
• Statistical machine translation
• Analyzing Tweets
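As one illustration of the list above, the stock market case can be sketched as a map step that emits a return per symbol and a reduce step that summarizes each symbol. This is a hedged sketch only: the row layout (symbol, buy price, sell price), the function names, and the prices are made up for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

def map_returns(rows):
    """Map step: emit (symbol, simple return) for each (symbol, buy, sell) row."""
    for symbol, buy_price, sell_price in rows:
        yield symbol, (sell_price - buy_price) / buy_price

def reduce_stats(pairs):
    """Reduce step: gather each symbol's returns, then compute mean and std dev."""
    grouped = defaultdict(list)
    for symbol, value in pairs:
        grouped[symbol].append(value)
    return {symbol: (mean(values), pstdev(values))
            for symbol, values in grouped.items()}

# Made-up prices for two hypothetical symbols.
rows = [("AAA", 100.0, 110.0), ("AAA", 100.0, 90.0), ("BBB", 50.0, 55.0)]
stats = reduce_stats(map_returns(rows))
for symbol, (avg, spread) in sorted(stats.items()):
    print(f"{symbol}: mean return {avg:.2f}, std dev {spread:.2f}")
```

The work parallelizes because each symbol's statistics depend only on that symbol's rows, which is the property the other applications in the list share.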
Pig
• Pig Latin - the scripting language
• Grunt – Shell for executing Pig Commands
http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
This is what it would be in Java
http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
Hive
You write SQL! Well, almost, it is HiveQL
SELECT *
FROM user
WHERE active = 1;
[Diagram labels: JDBC, SQL Workbench, HUE, AWS S3]
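To see what the SELECT on this slide returns without a cluster, it can be tried against a tiny table. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Hive; the column names and rows are hypothetical, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER, name TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [(1, "ana", 1), (2, "bo", 0), (3, "cy", 1)],  # made-up rows
)

# Same shape as the slide's query; ORDER BY added for a stable result.
active_users = conn.execute(
    "SELECT * FROM user WHERE active = 1 ORDER BY id"
).fetchall()
print(active_users)
conn.close()
```

HiveQL and SQLite differ in many details (types, functions, reserved words), so this only checks the logic of a simple filter, not Hive compatibility.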
Impala
• Uses SQL very similar to HiveQL
• Runs 10-100x faster than Hive Map Reduce
• Runs in memory so it may not scale up as well
• Some batch jobs may run faster on Impala than Hive
• Great for developing your code on a small data set
• Can use interactively with Tableau and other BI tools
Spark
• Had a version of SQL called Shark
• Shark has been replaced by Spark SQL
• Hive on Spark is under development
• Spark SQL is faster than Shark
• Runs 100x faster than Hive Map Reduce
• Can use interactively with Tableau and other BI tools
Performance Comparisons
Performance comparison (3. Join Query Feb 2014)
Source: https://amplab.cs.berkeley.edu/benchmark/
[Chart: benchmark results, in seconds]
Performance comparison (TPC-DS April 2015)
Performance comparison (Single User Sep 2014)
Amazon EMR
Today, we will use EMR to run Hadoop
• EMR = Elastic Map Reduce
• Amazon does almost all of the work to create a cluster
• Offers a subset of modules and projects
OR
m3.xlarge
What to use when
What Type of Database to Use?
• Capturing Transactions? Use RDBMS
• Capturing Logs? Use File System
• Back End To Website? NoSQL Database (MongoDB) or Cache (Redis)
• Doing Analytics?
  • Small Data? Desktop Tools (Excel, Tableau)
  • Building Models? R, Python, SAS Miner
  • Big-ish Data? Columnar Database (Redshift)
  • ‘Big Data’ Database (like Hadoop)
Which Tool Should I Use?
Project Goal
• Want Experience In Coolest Tech? Spark is Hot Tech now
• Just Want To Get Job Done? Choose Hadoop Distributions
  • Mainly Structured Data?
    • Want Fast Response? SQL / Impala, SQL / Redshift
  • Mainly Unstructured Data?
    • Developer? Write Map-Reduce Job
    • Not Developer? SQL / Hive
How Should I Use It?
Use Case
• Development: Use Cloud or Use Virtual Machine
• Production
  • Fixed Workload: Do ROI on buying up front, or Use Cloud
  • Variable Workload: Use Cloud
Hands on
Form groups of 3
Let’s Do This!
What do we need?
• AWS Account
• Key (.pem file)
• The data file in the S3 bucket
What will we do?
• Start Cluster
• MR Hive
• MR Pig
• Impala
• Sum county-level census data by state
Prerequisites and scripts are located at http://www.datakitchen.io/blog
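The goal of the exercise, summing county-level figures by state, is a plain group-and-sum. The following is a minimal sketch of that logic only, not the workshop's Hive or Pig script; the row layout (state, county, value) and the numbers are made up for illustration.

```python
from collections import defaultdict

def sum_by_state(county_rows):
    """Total a numeric measure per state from (state, county, value) rows."""
    totals = defaultdict(int)
    for state, _county, value in county_rows:
        totals[state] += value
    return dict(totals)

# Toy rows with invented values, not real census figures.
rows = [("MA", "Suffolk", 10), ("MA", "Middlesex", 20), ("NH", "Grafton", 5)]
print(sum_by_state(rows))
```

In Hive this corresponds to a GROUP BY on the state column with a SUM, and in Pig to a GROUP followed by a FOREACH that sums each group.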
AWS Console
• Just google “aws console”
• Log in
Click Here
Where’s EMR?
Create Cluster
OR
Cluster Options
(mod = modify, defaults = keep the defaults)
• Cluster Configuration: mod
• Tags: defaults
• Software Configuration: mod
• File System Configuration: defaults
• Hardware Configuration: mod
• Security and Access: mod
• IAM Roles: defaults
• Bootstrap Actions: defaults
• Steps: defaults
Cluster Configuration
mod
Tags
defaults
Software Configuration
Pick Impala here!
Hopefully we’ll have time to get to this.
mod
Don’t forget to click Add!
File System Configuration
defaults
Hardware Configuration
$ 0.35 / hour
Set Core and Task to 0
mod
Security and Access
Finally we get to use our keys!
mod
IAM Roles
Just defaults, please
More JSON in here
defaults
Bootstrap Actions
defaults
• Tweak configuration
• Install custom application
(Apache Drill, Mahout, etc.)
• Shell scripts
Can use this to set up Spark
Steps
defaults
Steps
Steps: Hive Program
Provisioning
Bootstrapping
Monitor Startup Progress
Instructions to Connect
Here’s your hostname
SSH Info
We’ll follow these instructions
Post ODSC Update: An easier way to access Hue
(foxyproxy slowed us down)
For Windows, Unix, and Mac, use ssh to establish a tunnel
$ ssh -i datakitchen-training.pem -L 8888:localhost:8888 hadoop@ec2-54-152-244-88.compute-1.amazonaws.com
From the browser, go to
http://localhost:8888
You may need to fix the permissions on the .pem file:
$ chmod 400 datakitchen-training.pem
With the cygwin version of ssh, you may have to fix the group of the .pem file before the chmod command.
$ chgrp Users datakitchen-training.pem
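Since every attendee's cluster has a different master hostname, the tunnel command above can be assembled from its parts. This is a small sketch under the assumptions stated on the slide (user hadoop, Hue on port 8888); the helper name tunnel_command is hypothetical.

```python
import shlex

def tunnel_command(key_file, master_dns, port=8888, user="hadoop"):
    """Build the ssh arguments that forward a local port to the same port on the master node."""
    return [
        "ssh",
        "-i", key_file,                     # private key used to launch the cluster
        "-L", f"{port}:localhost:{port}",   # local port forward
        f"{user}@{master_dns}",
    ]

cmd = tunnel_command(
    "datakitchen-training.pem",
    "ec2-54-152-244-88.compute-1.amazonaws.com",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal; passing the list to subprocess.run would start the tunnel directly, provided the ssh client is installed.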
Post ODSC Update: On Windows, you can use PuTTY to establish a tunnel
1. Download PuTTY.exe to your computer from:
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
2. Start PuTTY.
3. In the Category list, click Session
4. In the Host Name field, type hadoop@ec2-54-152-244-88.compute-1.amazonaws.com
5. In the Category list, expand Connection > SSH > Auth
6. For Private key file for authentication, click Browse and select the private key file (datakitchen-training.ppk) used to launch the cluster.
7. In the Category list, expand Connection > SSH, and then click Tunnels.
8. In the Source port field, type 8888.
9. In the Destination field, type localhost:8888.
10. Verify the Local and Auto options are selected.
11. Click Add.
12. Click Open.
13. Click Yes to dismiss the security alert.
Now this will work
http://localhost:8888
Setup Web Connection – Linux/Mac
Port Forwarding (Mac/Linux)
ssh -i ~/.ec2/emr-training.pem -L 8888:localhost:8888 hadoop@ec2-54-173-219-156.compute-1.amazonaws.com
Setup Web Connection – Windows
Setup Web Connection - Chrome (Windows and Mac are identical)
Setup Web Connection - Firefox (Windows and Mac are identical)
Start Hue: in the browser, type
http://<master public DNS>:8888
http://ec2-52-5-91-114.compute-1.amazonaws.com:8888
Note: no hadoop@
Sign in
First time Other times
HIVE: Load Data from S3
Familiar SQL
Describe file format
Pull from S3 bucket
UPDATE with your bucket name
HIVE: Run the summary interactively
HIVE: Export Our Data
Define CSV output
Write out data
You can look at the data in S3
UPDATE with your bucket name
PIG: Load Data from S3
Readable syntax
Describe file format
Pull from S3 bucket
UPDATE with your bucket name
PIG: Transform the data
PIG Export Our Data
UPDATE with your bucket name
IMPALA: From the shell window
Type: impala-shell
> invalidate metadata;
> show tables;
>
> quit;
You can type “pig” or “hive” at the command line and run the scripts here, without Hue.
Terminate!
Remember to shut down your clusters
Recap
Presentation
• Hadoop is an evolving ecosystem of projects
• It is well suited for big data
• Use something else for medium or small data
Doing
• Started a Hadoop cluster via the AWS Console (Web UI)
• Loaded Data
• Wrote some queries
Thank you!
To continue the discussion, contact us at
info@datakitchen.io
gil@datakitchen.io
eestabrooks@datakitchen.io
cbergh@datakitchen.io

Mais conteúdo relacionado

Mais procurados

Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcaderoIasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcaderoCodecamp Romania
 
Big Data – A New Testing Challenge
Big Data – A New Testing ChallengeBig Data – A New Testing Challenge
Big Data – A New Testing ChallengeTEST Huddle
 
Understanding DataOps and Its Impact on Application Quality
Understanding DataOps and Its Impact on Application QualityUnderstanding DataOps and Its Impact on Application Quality
Understanding DataOps and Its Impact on Application QualityDevOps.com
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataDataWorks Summit/Hadoop Summit
 
Harnessing the Power of Apache Hadoop
Harnessing the Power of Apache Hadoop Harnessing the Power of Apache Hadoop
Harnessing the Power of Apache Hadoop Cloudera, Inc.
 
Offload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data IntegrationOffload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data Integrationgluent.
 
Deploying Big Data Platforms
Deploying Big Data PlatformsDeploying Big Data Platforms
Deploying Big Data PlatformsChris Kernaghan
 
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...Databricks
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
 
Big Data Testing
Big Data TestingBig Data Testing
Big Data TestingQA InfoTech
 
Spark in the Wild: An In-Depth Analysis of 50+ Production Deployments-(Arsala...
Spark in the Wild: An In-Depth Analysis of 50+ Production Deployments-(Arsala...Spark in the Wild: An In-Depth Analysis of 50+ Production Deployments-(Arsala...
Spark in the Wild: An In-Depth Analysis of 50+ Production Deployments-(Arsala...Spark Summit
 
Talend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech OverviewTalend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech OverviewTalend
 
Bdf16 big-data-warehouse-case-study-data kitchen
Bdf16 big-data-warehouse-case-study-data kitchenBdf16 big-data-warehouse-case-study-data kitchen
Bdf16 big-data-warehouse-case-study-data kitchenChristopher Bergh
 
Witsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingWitsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingMark Kerzner
 
Unleash the Power of Big Data and Machine Learning
Unleash the Power of Big Data and Machine LearningUnleash the Power of Big Data and Machine Learning
Unleash the Power of Big Data and Machine LearningTalend
 
Big Data Expo 2015 - Talend Delivering Real Time
Big Data Expo 2015 - Talend Delivering Real TimeBig Data Expo 2015 - Talend Delivering Real Time
Big Data Expo 2015 - Talend Delivering Real TimeBigDataExpo
 
Hadoop Hadoop & Spark meetup - Altiscale
Hadoop Hadoop & Spark meetup - AltiscaleHadoop Hadoop & Spark meetup - Altiscale
Hadoop Hadoop & Spark meetup - AltiscaleMark Kerzner
 
Data Warehousing Patterns for Hadoop
Data Warehousing Patterns for HadoopData Warehousing Patterns for Hadoop
Data Warehousing Patterns for HadoopMichelle Ufford
 

Mais procurados (20)

Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcaderoIasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
 
Big Data – A New Testing Challenge
Big Data – A New Testing ChallengeBig Data – A New Testing Challenge
Big Data – A New Testing Challenge
 
Understanding DataOps and Its Impact on Application Quality
Understanding DataOps and Its Impact on Application QualityUnderstanding DataOps and Its Impact on Application Quality
Understanding DataOps and Its Impact on Application Quality
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
Harnessing the Power of Apache Hadoop
Harnessing the Power of Apache Hadoop Harnessing the Power of Apache Hadoop
Harnessing the Power of Apache Hadoop
 
Offload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data IntegrationOffload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data Integration
 
Deploying Big Data Platforms
Deploying Big Data PlatformsDeploying Big Data Platforms
Deploying Big Data Platforms
 
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
Big Data Testing
Big Data TestingBig Data Testing
Big Data Testing
 
Spark in the Wild: An In-Depth Analysis of 50+ Production Deployments-(Arsala...
Spark in the Wild: An In-Depth Analysis of 50+ Production Deployments-(Arsala...Spark in the Wild: An In-Depth Analysis of 50+ Production Deployments-(Arsala...
Spark in the Wild: An In-Depth Analysis of 50+ Production Deployments-(Arsala...
 
Talend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech OverviewTalend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech Overview
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Bdf16 big-data-warehouse-case-study-data kitchen
Bdf16 big-data-warehouse-case-study-data kitchenBdf16 big-data-warehouse-case-study-data kitchen
Bdf16 big-data-warehouse-case-study-data kitchen
 
Witsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingWitsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streaming
 
Unleash the Power of Big Data and Machine Learning
Unleash the Power of Big Data and Machine LearningUnleash the Power of Big Data and Machine Learning
Unleash the Power of Big Data and Machine Learning
 
Big Data Expo 2015 - Talend Delivering Real Time
Big Data Expo 2015 - Talend Delivering Real TimeBig Data Expo 2015 - Talend Delivering Real Time
Big Data Expo 2015 - Talend Delivering Real Time
 
Hadoop Hadoop & Spark meetup - Altiscale
Hadoop Hadoop & Spark meetup - AltiscaleHadoop Hadoop & Spark meetup - Altiscale
Hadoop Hadoop & Spark meetup - Altiscale
 
Data Warehousing Patterns for Hadoop
Data Warehousing Patterns for HadoopData Warehousing Patterns for Hadoop
Data Warehousing Patterns for Hadoop
 

Destaque

Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Big data introduction - Big Data from a Consulting perspective - Sogeti
Big data introduction - Big Data from a Consulting perspective - SogetiBig data introduction - Big Data from a Consulting perspective - Sogeti
Big data introduction - Big Data from a Consulting perspective - SogetiEdzo Botjes
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hiveDavid Kaiser
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Big Data Spain
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataKaran Desai
 
Data analytics and analysis trends in 2015 - Webinar
Data analytics and analysis trends in 2015 - WebinarData analytics and analysis trends in 2015 - Webinar
Data analytics and analysis trends in 2015 - WebinarAli Zeeshan
 
MongoDB- Crud Operation
MongoDB- Crud OperationMongoDB- Crud Operation
MongoDB- Crud OperationEdureka!
 
Big data analytics in banking sector
Big data analytics in banking sectorBig data analytics in banking sector
Big data analytics in banking sectorAnil Rana
 
Understand Your Customer Buying Journey with Big Data
Understand Your Customer Buying Journey with Big Data Understand Your Customer Buying Journey with Big Data
Understand Your Customer Buying Journey with Big Data Datameer
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduceRyan Tabora
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Lucas Jellema
 
NoSQL - Cassandra & MongoDB.pptx
NoSQL -  Cassandra & MongoDB.pptxNoSQL -  Cassandra & MongoDB.pptx
NoSQL - Cassandra & MongoDB.pptxNaveen Kumar
 

Destaque (20)

Overview of the Hive Stinger Initiative
Overview of the Hive Stinger InitiativeOverview of the Hive Stinger Initiative
Overview of the Hive Stinger Initiative
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Hive Now Sparks
Hive Now SparksHive Now Sparks
Hive Now Sparks
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Big data introduction - Big Data from a Consulting perspective - Sogeti
Big data introduction - Big Data from a Consulting perspective - SogetiBig data introduction - Big Data from a Consulting perspective - Sogeti
Big data introduction - Big Data from a Consulting perspective - Sogeti
 
Hadoop-2.6.0 Slides
Hadoop-2.6.0 SlidesHadoop-2.6.0 Slides
Hadoop-2.6.0 Slides
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Data analytics and analysis trends in 2015 - Webinar
Data analytics and analysis trends in 2015 - WebinarData analytics and analysis trends in 2015 - Webinar
Data analytics and analysis trends in 2015 - Webinar
 
MongoDB- Crud Operation
MongoDB- Crud OperationMongoDB- Crud Operation
MongoDB- Crud Operation
 
Big data Introduction by Mohan
Big data Introduction by MohanBig data Introduction by Mohan
Big data Introduction by Mohan
 
Big data analytics in banking sector
Big data analytics in banking sectorBig data analytics in banking sector
Big data analytics in banking sector
 
Understand Your Customer Buying Journey with Big Data
Understand Your Customer Buying Journey with Big Data Understand Your Customer Buying Journey with Big Data
Understand Your Customer Buying Journey with Big Data
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Hbase hive pig
Hbase hive pigHbase hive pig
Hbase hive pig
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
 
NoSQL - Cassandra & MongoDB.pptx
NoSQL -  Cassandra & MongoDB.pptxNoSQL -  Cassandra & MongoDB.pptx
NoSQL - Cassandra & MongoDB.pptx
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 

Semelhante a Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop with Map Reduce, Pig, and Hive

Big Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataBig Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataPentaho
 
Pivotal - Advanced Analytics for Telecommunications
Pivotal - Advanced Analytics for Telecommunications Pivotal - Advanced Analytics for Telecommunications
Pivotal - Advanced Analytics for Telecommunications Hortonworks
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning ProductsAndrew Musselman
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsScyllaDB
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...All Things Open
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)dtz001
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsAnant Corporation
 
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoGimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoRomit Mehta
 
How to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow
How to build unified Batch & Streaming Pipelines with Apache Beam and DataflowHow to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow
How to build unified Batch & Streaming Pipelines with Apache Beam and DataflowDaniel Zivkovic
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreDataStax Academy
 
InfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceWilfried Hoge
 
Overview and Walkthrough of the Application Programming Model with SAP Cloud ...
Overview and Walkthrough of the Application Programming Model with SAP Cloud ...Overview and Walkthrough of the Application Programming Model with SAP Cloud ...
Overview and Walkthrough of the Application Programming Model with SAP Cloud ...SAP Cloud Platform
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureSkillspeed
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to ProductionMostafa Majidpour
 
Unlocking Big Data Insights with MySQL
Unlocking Big Data Insights with MySQLUnlocking Big Data Insights with MySQL
Unlocking Big Data Insights with MySQLMatt Lord
 
Presto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop MeetupPresto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop MeetupWojciech Biela
 

Semelhante a Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop with Map Reduce, Pig, and Hive (20)

Big Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataBig Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big Data
 
Pivotal - Advanced Analytics for Telecommunications
Pivotal - Advanced Analytics for Telecommunications Pivotal - Advanced Analytics for Telecommunications
Pivotal - Advanced Analytics for Telecommunications
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning Products
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop with Map Reduce, Pig, and Hive

  • 1. BIG DATA INFRASTRUCTURE – INTRODUCTION TO HADOOP WITH MAP REDUCE, PIG, AND HIVE. Gil Benghiat, Eric Estabrooks, Chris Bergh. OPEN DATA SCIENCE CONFERENCE, BOSTON 2015 @opendatasci
  • 2. Agenda Introductions Hadoop Overview & Comparisons What do I use when? AWS EMR Hive Pig Impala Hive 6/1/2015 2 Doing Presentation
  • 4. Meet DataKitchen Chris Bergh (Head Chef) 4 Gil Benghiat (VP Product) Eric Estabrooks (VP Cloud and Data Services) Software development and executive experience delivering enterprise software focused on Marketing and Health Care sectors. Deep Analytic Experience: Spent past decade solving analytic challenges New Approach To Data Preparation and Production: focused on the Data Analysts and Data Scientists
  • 5. 5 Analysts And Their Teams Are Spending 60-80% Of Their Time On Data Preparation And Production
  • 6. This creates an expectation gap 6 Analyze Prepare Data C Analyze Prepare Data Business Customer Expectation Analyst Reality Communicate The business does not think that Analysts are preparing data Analysts don’t want to prepare data
  • 7. 7 DataKitchen is on a mission to integrate and organize data to make analysts and data scientists super-powered.
  • 8. Meet the Audience: A few questions • Who considers themselves • Data scientist • Data analyst • Programmer / Scripter • On the Business side • Who knows SQL – can write a select statement? • Who used AWS before today? 6/1/2015 8
  • 10. What Is Apache Hadoop? • Software framework • Distributed processing of large scale datasets • Cluster of commodity hardware • Promise of lower cost • Has many frameworks, modules and projects 6/1/2015 10 http://hadoop.apache.org/
  • 11. 6/1/2015 11 Mark Grover http://radar.oreilly.com/2015/02/processing-frameworks-for-hadoop.html Hadoop ecosystem frameworks *** * *Covered in talk Hands on* * (HDFS, Cassandra, HBase, S3)
  • 12. Hadoop has been evolving 6/1/2015 12 Map Reduce Impala Hadoop Pig 2005 2007 2009 2011 2013 2015 Google Trends “Big Data”
  • 13. What is Hadoop good for? • Problems that are huge, and can be run in parallel over immutable data • NOT OLTP (e.g. backend to e-commerce site) • Providing frameworks to build software • Map Reduce • Spark • Tez • A backend for visualization tools 6/1/2015 13
  • 16. Test your system in the small 1. Make a small data set 2. Test like this: $ cat data.txt | map | sort | reduce 6/1/2015 16
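The small-scale test on slide 16 can be imitated in a single Python process. This is only a sketch under stated assumptions: word count is used as the example job (it appears in the deck's list of applications), and the names `map_words`, `reduce_counts`, and `run_local` are hypothetical, not taken from the talk. The point is that the reducer only needs its input sorted by key, which is what the `sort` stage of the pipe provides.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit one "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_counts(sorted_records):
    """Reducer: sum the counts of adjacent records that share a key."""
    keyed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(keyed, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def run_local(lines):
    """Mimic `cat data.txt | map | sort | reduce` without a cluster."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    # Tiny hand-made data set, as the slide recommends.
    sample = ["big data big cluster", "data pipeline"]
    for word, count in run_local(sample):
        print(f"{word}\t{count}")
```

If the mapper and reducer were instead written as two scripts that read standard input, the same pair could plausibly be handed to the streaming interface described on the next slide.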
  • 17. You can write map reduce jobs in your favorite language. Streaming Interface • Lets you specify mappers and reducers • Supports • Java • Python • Ruby • Unix Shell • R • Any executable. Map Reduce “generators” • Result in map reduce jobs • PIG • Hive
  • 18. Applications that lend themselves to map reduce • Word Count • PDF Generation (NY Times 11,000,000 articles) • Analysis of stock market historical data (ROI and standard deviation) • Geographical Data (Finding intersections, rendering map files) • Log file querying and analysis • Statistical machine translation • Analyzing Tweets 6/1/2015 18
  • 19. Pig • Pig Latin - the scripting language • Grunt – Shell for executing Pig Commands 6/1/2015 19 http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
  • 20. This is what it would be in Java 6/1/2015 20 http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
  • 21. Hive You write SQL! Well, almost, it is HiveQL 6/1/2015 21 SELECT * FROM user WHERE active = 1; JDBC SQL Workbench HUE AWS S3
  • 22. Impala • Uses SQL very similar to HiveQL • Runs 10-100x faster than Hive Map Reduce • Runs in memory so it may not scale up as well • Some batch jobs may run faster on Impala than Hive • Great for developing your code on a small data set • Can use interactively with Tableau and other BI tools 6/1/2015 22
  • 23. • Had a version of SQL called Shark • Shark has been replaced by Spark SQL • Hive on Spark is under development • Spark SQL is faster than Shark • Runs 100x faster than Hive Map Reduce • Can use interactively with Tableau and other BI tools 6/1/2015 23
  • 25. Performance comparison (3. Join Query Feb 2014) Source: https://amplab.cs.berkeley.edu/benchmark/ What’s this? (in seconds)
  • 26. Performance comparison (TPC-DS April 2015) 6/1/2015 26 Source:
  • 27. Performance comparison (Single User Sep 2014) 6/1/2015 27 Source:
  • 29. Today, we will use EMR to run Hadoop • EMR = Elastic Map Reduce • Amazon does almost all of the work to create a cluster • Offers a subset of modules and projects 6/1/2015 29 OR
  • 31. What to use when
  • 32. What Type of Database to Use? Capturing Transactions? Use RDBMS. Capturing Logs? Use File System. Back End To Website? NoSQL Database (MongoDB), Cache (Redis). Doing Analytics? Small Data? Desktop Tools (Excel, Tableau). Building Models? R, Python, SAS Miner. Big-ish Data? Columnar Database (Redshift), ‘Big Data’ Database (like Hadoop).
  • 33. Which Tool Should I Use? Project Goal: Want Experience In Coolest Tech? Spark is Hot Tech now. Just Want To Get Job Done? Choose Hadoop Distributions. Mainly Structured Data? Want Fast Response? SQL / Impala, SQL / Redshift. Mainly Unstructured Data? Developer? Write Map-Reduce Job. Not Developer? SQL / Hive.
  • 34. How Should I Use It? Use Case: Development: Use Cloud, Use Virtual Machine. Production: Fixed Workload: Do ROI on buying up front, Use Cloud. Variable Workload: Use Cloud.
  • 36. Form groups of 3 6/1/2015 36
  • 37. Let’s Do This! 6/1/2015 37 What do we need? • AWS Account • Key (.pem file) • The data file in the S3 bucket What will we do? • Start Cluster • MR Hive • MR Pig • Impala • Sum county level census data by state. Prerequisites and scripts are located at http://www.datakitchen.io/blog
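The exercise goal on slide 37, summing county-level census data by state, is a plain group-and-sum. The sketch below shows that shape on a tiny in-memory sample so the expected result is easy to check by hand before running the job in Hive or Pig. The column names (`state`, `county`, `population`) and the figures are invented for illustration; the real census file's layout may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows, NOT real census figures.
SAMPLE_CSV = """state,county,population
MA,Suffolk,700
MA,Middlesex,1500
NH,Hillsborough,400
"""


def sum_by_state(csv_text):
    """Group county rows by state and sum the population column."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["state"]] += int(row["population"])
    return dict(totals)


if __name__ == "__main__":
    for state, total in sorted(sum_by_state(SAMPLE_CSV).items()):
        print(state, total)
```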
  • 38. AWS Console 6/1/2015 38 • Just google “aws console” • Log in
  • 41. Cluster Options 6/1/2015 41 Cluster Configuration mod Tags defaults Software Configuration mod File System Configuration defaults Hardware Configuration mod Security and Access mod IAM Roles defaults Bootstrap Actions defaults Steps defaults
  • 44. Software Configuration (mod) Pick Impala here! Hopefully we’ll have time to get to this. Don’t forget to click Add!
  • 46. Hardware Configuration 6/1/2015 46 $ 0.35 / hour Set Core and Task to 0 mod
  • 47. Security and Access 6/1/2015 47 Finally we get to use our keys! mod
  • 48. IAM Roles 6/1/2015 48 Just defaults, please More JSON in here defaults
  • 49. Bootstrap Actions (defaults) • Tweak configuration • Install custom application (Apache Drill, Mahout, etc.) • Shell scripts. Can use this to set up Spark.
  • 56. Instructions to Connect 6/1/2015 56 Here’s your hostname SSH Info We’ll follow these instructions
  • 57. Post ODSC Update: An easier way to access Hue (foxyproxy slowed us down). For Windows, Unix, and Mac, use ssh to establish a tunnel: $ ssh -i datakitchen-training.pem -L 8888:localhost:8888 hadoop@ec2-54-152-244-88.compute-1.amazonaws.com From the browser, go to http://localhost:8888 You may need to fix the permissions on the .pem file: $ chmod 400 datakitchen-training.pem With the cygwin version of ssh, you may have to fix the group of the .pem file before the chmod command: $ chgrp Users datakitchen-training.pem
  • 58. Post ODSC Update: On Windows, you can use putty to establish a tunnel 1. Download PuTTY.exe to your computer from: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html 2. Start PuTTY. 3. In the Category list, click Session 4. In the Host Name field, type hadoop@ec2-54-152-244-88.compute-1.amazonaws.com 5. In the Category list, expand Connection > SSH > Auth 6. For Private key file for authentication, click Browse and select the private key file (datakitchen-training.ppk) used to launch the cluster. 7. In the Category list, expand Connection > SSH, and then click Tunnels. 8. In the Source port field, type 8888. 9. In the Destination type localhost:8888 10. Verify the Local and Auto options are selected. 11. Click Add. 12. Click Open. 13. Click Yes to dismiss the security alert. 6/1/2015 58 Now this will work http://localhost:8888
  • 59. Setup Web Connection – Linux/Mac 6/1/2015 59
  • 60. Port Forwarding (Mac/Linux): ssh -i ~/.ec2/emr-training.pem -L 8888:localhost:8888 hadoop@ec2-54-173-219-156.compute-1.amazonaws.com
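Because the master node's public DNS name changes with every cluster, it can help to assemble the tunnel command from its parts rather than edit a pasted line. The helper below is a hypothetical convenience, not part of the workshop material; it only builds the argument list for the same `ssh -i ... -L 8888:localhost:8888 hadoop@host` form shown on the slides and does not open a connection. The key file name and `MASTER_PUBLIC_DNS` in the usage lines are placeholders.

```python
import shlex


def tunnel_command(key_file, host, port=8888, user="hadoop"):
    """Build the ssh arguments that forward a local port to Hue on the master node."""
    return [
        "ssh",
        "-i", key_file,                    # private key used to launch the cluster
        "-L", f"{port}:localhost:{port}",  # local port -> same port on the master
        f"{user}@{host}",
    ]


if __name__ == "__main__":
    # Placeholder values: substitute your own key file and master public DNS.
    command = tunnel_command("emr-training.pem", "MASTER_PUBLIC_DNS")
    print(shlex.join(command))
```

The returned list could be passed to `subprocess.run` to start the tunnel, after which Hue should be reachable at http://localhost:8888 as described above.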
  • 61. Setup Web Connection – Windows 6/1/2015 61
  • 64. Start Hue, in browser type http://master public DNS:8888 http://ec2-52-5-91-114.compute-1.amazonaws.com:8888 6/1/2015 64 Note: no hadoop@
  • 65. Sign in 6/1/2015 65 First time Other times
  • 67. HIVE: Load Data from S3 6/1/2015 67 Familiar SQL Describe file format Pull from S3 bucket UPDATE with your bucket name
  • 68. HIVE: Run the summary interactively 6/1/2015 68
  • 69. HIVE: Export Our Data 6/1/2015 69 Define CSV output Write out data You can look at the data in s3 UPDATE with your bucket name
  • 70. PIG: Load Data from S3 6/1/2015 70 Readable syntax Describe file format Pull from S3 bucket UPDATE with your bucket name
  • 71. PIG: Transform the data 6/1/2015 71
  • 72. PIG Export Our Data 6/1/2015 72 UPDATE with your bucket name
  • 73. IMPALA: From the shell window Type: impala-shell >invalidate metadata >show tables; > > quit You can type “pig” or “hive” at the command line and run the scripts here, without Hue. 6/1/2015 73
  • 75. Remember to shut down your clusters
  • 76. Recap Presentation • Hadoop is an evolving ecosystem of projects • It is well suited for big data • Use something else for medium or small data Doing • Started a Hadoop cluster via the AWS Console (Web UI) • Loaded Data • Wrote some queries 6/1/2015 76
  • 77. 77 Thank you! To continue the discussion, contact us at info@datakitchen.io gil@datakitchen.io eestabrooks@datakitchen.io cbergh@datakitchen.io