Introduction to Hadoop and
Big Data
Who am I
• Joe Alex
– Software Architect / Data Scientist
Loves to code in Java, Scala
– Areas of Interest: Big Data, Data Analytics,
Machine Learning, Hadoop, Cassandra
– Currently working as Team Lead for Managed
Security Services Portal at Verizon
New kind of data
• Social - messages, posts, blogs, photos, videos,
maps, graphs, friends
• Machine – sensors, firewalls, routers, logs,
metrics, health monitoring, cell phones, credit
card transactions
New kind of data
• Volume – massive, TB → PB
– Convert 350 billion annual meter readings to better predict power
consumption
– Turn 12 terabytes of Tweets created each day into improved product
sentiment analysis
• Types – structured, semi/un-structured
– Text, audio, video, click streams, logs, machine
– Monitor hundreds of live video feeds from surveillance cameras to target points of
interest
• Velocity (time sensitive)
– ideally processed as it is streaming in (real-time, near-real-time, or batch)
– Scrutinize 5 million trade events created each day to identify potential fraud
– Analyze 500 million daily call detail records in real-time to predict customer
churn faster
What is Big Data about
• We are drowning in a sea of data, and
sometimes we throw a lot of it away
• Still, we can’t make much sense of it
• We consider data as a cost
• But Data is an opportunity
• This is what Big Data is about
– New Insights
– New Business
Big Data Domains
• Digital marketing
• Data discovery – patterns, trends
• Fraud detection
• Machine generated Data Analytics
– Remote device insight, sensing, location based
intel
• Social
• Data retention
Big Data Architecture
• Traditional
– High Availability, RDBMS, Structured data
• Big Data
– High scalability/availability/flexibility,
Compute/Storage on same nodes,
Structured/Semi/Un-Structured data
How to tackle Big Data
• Layered Architecture
– Speed Layer
– Batch Layer
Apache Hadoop
• Open source project under Apache Software Foundation
• Based on papers published by Google
– MapReduce:
http://research.google.com/archive/mapreduce.html
– GFS:
http://research.google.com/archive/gfs.html
Reliability
• "Failure is the defining difference between
distributed and local programming“
• - Ken Arnold, CORBA Designer
Why Hadoop
• Data processed by Google every month: 400 PB… in
2007
• Average job size: 180 GB
• Time 180 GB of data would take to read sequentially off
a single disk drive: roughly 40–45 minutes
(180 GB ÷ 75 MB/sec ≈ 2,400 seconds)
• Solution: parallel reads
– 1 HDD = 75 MB/sec
– 1,000 HDDs = 75 GB/sec (far more acceptable)
• Data access speed is the bottleneck
• We can process data very quickly, but we can only
read/write it very slowly
Core Components
• Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
• There are many other projects based around core
Hadoop
– Often referred to as the “Hadoop Ecosystem”
– Pig, Hive, HBase, Flume, Oozie, Sqoop etc
• A set of machines running HDFS and MapReduce is
known as a Hadoop Cluster
• Individual machines are known as nodes. A cluster can
have as few as one node or as many as several thousand
• More nodes = better performance
System Requirements
• System should support partial failure
– Failure of one part of the system should result in a
graceful decline in performance, not a full halt
• System should support data recoverability
– If components fail, their workload should be picked up
by still functioning units
• System should support individual recoverability
– Nodes that fail and restart should be able to rejoin the
group activity without a full group restart
System Requirements (cont’d)
• System should be consistent
– Concurrent operations or partial internal failures
should not cause the results of the job to change
• System should be scalable
– Adding increased load to a system should not
cause outright failure; instead, it should result in a
graceful decline
• Increasing resources should support a
proportional increase in load capacity
Hadoop’s radical approach
• Hadoop provides a radical approach to these issues:
– Nodes talk to each other as little as possible, ideally
never
– This is known as a “shared nothing” architecture
– Programmers should not write code that explicitly
communicates between nodes
• Data is spread throughout machines in the cluster
– Data distribution happens when data is loaded on to the
cluster
• Instead of bringing data to the processors, Hadoop
brings the processing to the data
Hadoop’s radical approach
• Batch Oriented
• Data Locality (code is shipped around)
• Heavy Parallelization
• Process Management
• Append-only files
• Express your computation in MapReduce, and get
parallelism and scalability for free
Core Hadoop Daemons
• Each node in a Hadoop installation runs one or
more daemons executing MapReduce code or
HDFS commands. Each daemon’s responsibilities
in the cluster are:
– NameNode: manages HDFS and communicates with
every DataNode daemon in the cluster
– JobTracker: dispatches jobs and assigns splits to
mappers or reducers as each stage completes
– TaskTracker: executes tasks sent by the JobTracker and
reports status
– DataNode: manages HDFS data on the node and
reports status to the NameNode
Config files
• hadoop-env.sh — environmental configuration,
JVM configuration etc
• core-site.xml — site wide configuration
• hdfs-site.xml — HDFS block size, Name and Data
node directories
• mapred-site.xml — total MapReduce tasks,
JobTracker address
• masters, slaves files — NameNode, JobTracker,
DataNodes, and TaskTrackers addresses, as
appropriate
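These files are read automatically by Hadoop’s Java Configuration class. A minimal sketch of inspecting the loaded settings (the two property keys shown are the classic pre-YARN names):

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
  public static void main(String[] args) {
    // Loads core-site.xml, hdfs-site.xml, mapred-site.xml from the classpath
    Configuration conf = new Configuration();
    System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    System.out.println("dfs.block.size  = " + conf.get("dfs.block.size"));
  }
}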
HDFS: Hadoop Distributed File System
• Based on Google’s GFS (Google File System)
• Provides redundant storage of massive
amounts of data
– Using cheap, unreliable computers
• At load time, data is distributed across all
nodes
– Provides for efficient MapReduce processing
HDFS Assumptions
• High component failure rates
– Inexpensive components fail all the time
• “Modest” number of HUGE files
– Just a few million
– Each file likely to be 100 MB or larger
– Multi-gigabyte files are typical
• Large streaming reads
– Not random access
• High sustained throughput should be favored
over low latency
HDFS Features
• Operates ‘on top of’ an existing filesystem
• Files are stored as ‘blocks’
– Much larger than for most filesystems
– Default is 64 MB
• Provides reliability through replication
– Each block is replicated across three or more DataNodes
• Single NameNode stores metadata and co-ordinates access
– Provides simple, centralized management
• No data caching
– Would provide little benefit due to large datasets, streaming reads
• Familiar file-system interface, but a customized API
– Simplifies the problem and focuses on distributed applications
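As a concrete illustration, a minimal sketch of writing a file through the HDFS Java API (the NameNode URI and path are assumptions; normally they come from core-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:8020"); // illustrative address
    FileSystem fs = FileSystem.get(conf);
    // Files are split into blocks and replicated across DataNodes transparently
    FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"));
    out.writeUTF("hello, HDFS");
    out.close();
    fs.close();
  }
}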
HDFS Block diagram
MapReduce
• MapReduce is a method for distributing a task
across multiple nodes in the Hadoop cluster
• Consists of two phases: Map, and then Reduce
– Between the two is a stage known as the shuffle and
sort
• Each Map task operates on a discrete portion of
the overall dataset. Typically one HDFS block of
data
• After all Maps are complete, the MapReduce
system distributes the intermediate data to nodes
which perform the Reduce phase
Features of MapReduce
• Automatic parallelization and distribution
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
– MapReduce programs are usually written in Java
– Can be written in any scripting language using Hadoop
Streaming
– All of Hadoop is written in Java
• MapReduce abstracts all the “housekeeping” away
from the developer
– Developer can concentrate simply on writing the Map
and Reduce functions
• Map
    // assume input is a set of text files;
    // k is a line offset, v is the line at that offset
    let map(k, v) =
      for each word in v:
        emit(word, 1)
• Reduce
    // k is a word, vals is a list of 1s
    let reduce(k, vals) =
      emit(k, vals.length())
MapReduce High Level (diagram)
Map Process
• map (in_key, in_value) -> (out_key, out_value) list
Reduce Process
• reduce (out_key, out_value list) -> (final_key, final_value) list
MapReduce (diagram)
Streaming API
• Many organizations have developers skilled in
languages other than Java
– Perl, Ruby, Python, etc.
• The Streaming API allows developers to use any
language they wish to write Mappers and Reducers
– As long as the language can read from standard input and
write to standard output
• Advantages of the Streaming API:
– No need for non-Java coders to learn Java
– Fast development time
– Ability to use existing code libraries
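To make the contract concrete: a Streaming mapper is just a program that reads lines on standard input and writes tab-separated key/value pairs on standard output. A word-count sketch (written in Java here only for consistency with the rest of the deck; in practice you would use Perl, Ruby, Python, etc.):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingWordMapper {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      // Emit "word <TAB> 1" for every word on the line
      for (String word : line.split("\\s+")) {
        if (!word.isEmpty()) {
          System.out.println(word + "\t1");
        }
      }
    }
  }
}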
Job Driver
• Job Driver
// Configure the job and point it at the input/output paths
JobConf conf = new JobConf(WordCount.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
// Set the Mapper and the types of its intermediate output
conf.setMapperClass(WordMapper.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
// Set the Reducer and the job's final output types
conf.setReducerClass(SumReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
// Submit the job to the cluster and wait for it to finish
JobClient.runJob(conf);
• Driver is submitted to the Hadoop cluster for processing, along with
the rest of the code in a .jar file.
Mapper
• The basic Java code implementation for the mapper has the
form:
public class WordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> collector,
      Reporter reporter) throws IOException {
    /* implementation here */
  }
}
• The implementation itself uses standard Java text
manipulation tools; you can use regular expressions,
scanners, whatever is necessary.
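For word counting, the body might look like the following; a sketch that follows the earlier pseudocode (the StringTokenizer-based splitting is one simple choice):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> collector,
      Reporter reporter) throws IOException {
    // Split the line into words and emit (word, 1) for each one
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      collector.collect(word, ONE);
    }
  }
}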
Reducer
• Reducer
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> collector,
      Reporter reporter) throws IOException {
    /* implementation */
  }
}
• The reducer iterates over the values generated for each key in the
previous step and sums up the occurrences of the word
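A sketch of that summing logic, again following the earlier pseudocode:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> collector,
      Reporter reporter) throws IOException {
    // Add up the 1s emitted by the mappers for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    collector.collect(key, new IntWritable(sum));
  }
}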
Input/Output Formats
• Input Formats
– KeyValueTextInputFormat — Each line represents
a key and value delimited by a separator
– TextInputFormat — The key is the byte offset, the
value is the text itself for each line
– SequenceFileInputFormat — serialized key/value pairs
in Hadoop’s binary SequenceFile format
• Output Formats
– Specify final output
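Continuing the Job Driver shown earlier, the formats are selected on the JobConf; a sketch using the old-API class names:

// Classes are from org.apache.hadoop.mapred (old API)
conf.setInputFormat(KeyValueTextInputFormat.class); // line = "key <TAB> value"
// or: conf.setInputFormat(TextInputFormat.class);  // key = byte offset, value = line
conf.setOutputFormat(TextOutputFormat.class);       // final output as "key <TAB> value" lines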
Hadoop Eco System : Hive
• Hive
– SQL-based data warehousing app
– Data analysts are often far more familiar with SQL than with Java
– Hive allows users to query data using HiveQL, a language
very similar to standard SQL
– Hive turns HiveQL queries into standard MapReduce jobs
– Automatically runs the jobs, and displays the results to the
user
– Note that Hive is not an RDBMS
• Results take many seconds, minutes, or even hours to be produced
• Not possible to modify the data using HiveQL
– Features for analyzing very large data sets
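A sketch of querying Hive from Java over the old HiveServer JDBC driver (the host, port, and words table are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // HiveQL looks like SQL; Hive compiles it into MapReduce jobs
    ResultSet rs = stmt.executeQuery(
        "SELECT word, COUNT(1) FROM words GROUP BY word");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}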
Hadoop Eco System: Pig
• Pig
– Data-flow oriented language
– Pig can be used as an alternative to writing
MapReduce jobs in Java (or some other language)
– Provides a scripting language known as Pig Latin
– Abstracts MapReduce details away from the user
– Made up of a set of operations that are applied to the
input data to produce output
– Fairly easy to write complex tasks such as joins of
multiple datasets
– Under the covers, Pig Latin scripts are converted to
MapReduce jobs
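Pig Latin is normally run from the grunt shell or a script, but it can also be embedded in Java via PigServer; a word-count sketch (the paths are assumptions):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    // Each statement below is compiled into MapReduce under the covers
    pig.registerQuery("lines = LOAD '/tmp/input' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "/tmp/output");
  }
}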
Hadoop Eco System: HBase
• HBase
– Distributed, sparse, column-oriented datastore
• Distributed: designed to use multiple machines to store and
serve data
• Sparse: each row may or may not have values for all columns
• Column-oriented: Data is stored grouped by column, rather
than by row. Columns are grouped into ‘column families’,
which define what columns are physically stored together
– Leverages HDFS
– Modeled after Google’s BigTable datastore
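A sketch of basic reads and writes with the (pre-1.0) HBase Java client; the table and column names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics"); // illustrative table name
    // Write: row key + column family:qualifier -> value
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes("42"));
    table.put(put);
    // Read the cell back
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count"));
    System.out.println(Bytes.toString(value));
    table.close();
  }
}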
Hadoop Eco System: Others
• Flume
– Flume is a distributed, reliable, available service for efficiently
moving large amounts of data as it is produced.
– Ideally suited to gathering logs from multiple systems and
inserting them into HDFS as they are generated
• Sqoop
– Sqoop is “the SQL-to-Hadoop database import tool”
– Designed to import data from RDBMS into Hadoop
– Can also send data the other way, from Hadoop to an RDBMS
– Uses JDBC to connect to the RDBMS
• Oozie
– Workflow scheduler for chaining and coordinating Hadoop jobs
Hadoop Eco System: Others
• Zookeeper
– Distributed consensus engine
– Provides well-defined concurrent access semantics:
• Leader election
• Service discovery
• Distributed locking / mutual exclusion
• Avro
– Serialization and RPC framework
• Mahout
– Machine learning library
Next Gen
• Storm – distributed realtime computation.
Makes it easy to reliably process unbounded
streams of data, doing for realtime processing
what Hadoop did for batch processing
• Spark – an open source cluster computing
system that aims to make data analytics fast
• Impala – low-latency (near real-time) SQL queries on data in HDFS and HBase
Questions
Twitter @joealex
Email joe.m.alex@gmail.com