Basics of Big Data Analytics & Hadoop
Ambuj Kumar
Ambuj_kumar@aol.com
http://ambuj4bigdata.blogspot.in
http://ambujworld.wordpress.com
Agenda
• Big Data
  • Concepts overview
• Analytics
  • Concepts overview
• Hadoop
  • Concepts overview
• HDFS
  • Concepts overview
  • Data Flow - Read & Write Operation
• MapReduce
  • Concepts overview
  • WordCount Program
• Use Cases
• Landscape
• Hadoop Features & Summary
What is Big Data?
Big data is data that is too large, complex, and dynamic for any conventional data tools to capture, store, manage, and analyze.
Challenges of Big Data
1. Storage (~ petabytes)
2. Processing (in a timely manner)
3. Variety of data (structured, semi-structured, unstructured)
4. Cost
Big Data Analytics
Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations, and other useful information.
Big Data Analytics Solutions
There are many different Big Data Analytics solutions on the market.
• Tableau – visualization tools
• SAS – statistical computing
• IBM and Oracle – a range of tools for Big Data analysis
• Revolution – statistical computing
• R – open-source tool for statistical computing
What is Hadoop?
• Open-source data storage and processing API
• Massively scalable, automatically parallelizable
• Based on work from Google
  • GFS + MapReduce + BigTable
• Current distributions based on open source and vendor work
  • Apache Hadoop
  • Cloudera – CDH4
  • Hortonworks
  • MapR
  • AWS
  • Windows Azure HDInsight
Why Use Hadoop?
• Cheaper: scales to petabytes or more
• Faster: parallel data processing
• Better: suited for particular types of Big Data problems
Hadoop History
In 2008, Hadoop became an Apache Top-Level Project.
Comparing: RDBMS vs. Hadoop
                     Traditional RDBMS          Hadoop / MapReduce
Data Size            Gigabytes (Terabytes)      Petabytes (Exabytes)
Access               Interactive and Batch      Batch – NOT Interactive
Updates              Read / Write many times    Write once, Read many times
Structure            Static Schema              Dynamic Schema
Integrity            High (ACID)                Low
Scaling              Nonlinear                  Linear
Query Response Time  Can be near immediate      Has latency (due to batch processing)
Where is Hadoop used?
Industry         Use Cases
Technology       Search; People you may know; Movie recommendations
Banks            Fraud detection; Regulatory; Risk management
Media, Retail    Marketing analytics; Customer service; Product recommendations
Manufacturing    Preventive maintenance
Companies Using Hadoop
• Search: Yahoo, Amazon, Zvents
• Log Processing: Facebook, Yahoo, ContextWeb, Joost, Last.fm
• Recommendation Systems: Facebook, LinkedIn
• Data Warehouse: Facebook, AOL
• Video & Image Analysis: New York Times, Eyealike
Almost in every domain!
Hadoop is a set of Apache Frameworks and more…
• Data storage (HDFS)
  • Runs on commodity hardware (usually Linux)
  • Horizontally scalable
• Processing (MapReduce)
  • Parallelized (scalable) processing
  • Fault tolerant
• Other Tools / Frameworks
  • Data Access: HBase, Hive, Pig, Mahout
  • Tools: Hue, Sqoop
  • Monitoring: Greenplum, Cloudera
[Diagram: stack layers labeled Hadoop Core - HDFS, MapReduce API, Monitoring & Alerting, Tools & Libraries, Data Access]
Core parts of Hadoop distribution
HDFS Storage
• Redundant (3 copies)
• For large files – large blocks
• 64 or 128 MB / block
• Can scale to 1000s of nodes
MapReduce API
• Batch (Job) processing
• Distributed and localized to clusters (Map)
• Auto-parallelizable for huge amounts of data
• Fault-tolerant (auto retries)
• Adds high availability and more
Other Libraries
• Pig
• Hive
• HBase
• Others
Hadoop Cluster HDFS (Physical) Storage
[Diagram: Name Node, Secondary Name Node, Data Node 1, Data Node 2, Data Node 3]
One Name Node
• Contains web site to view cluster information
• V2 Hadoop uses multiple Name Nodes for HA
Many Data Nodes
• 3 copies of each block by default
Work with data in HDFS
• Using common Linux shell commands
• Block size is 64 or 128 MB
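To make the block-size and replication figures concrete, here is a minimal sketch of the arithmetic in plain Java. The class and method names (HdfsBlockMath, blockCount, rawStorageBytes) are hypothetical helpers for illustration and are not part of the Hadoop API; the 1 GB file size is an assumed example.

```java
// Hypothetical helper for illustration only; not part of the Hadoop API.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    /** Number of blocks a file is split into (ceiling division). */
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    /** Raw bytes stored across the cluster once every block is replicated. */
    static long rawStorageBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 1024 * MB; // assumed example: a 1 GB file
        System.out.println("Blocks at 64 MB:  " + blockCount(fileSize, 64 * MB));
        System.out.println("Blocks at 128 MB: " + blockCount(fileSize, 128 * MB));
        System.out.println("Raw MB at 3 copies: " + rawStorageBytes(fileSize, 3) / MB);
    }
}
```

Under these assumptions a 1 GB file occupies 16 blocks at 64 MB or 8 blocks at 128 MB, and three copies consume about 3 GB of raw disk across the Data Nodes.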
[Figure: MapReduce Job – Logical View]
[Figure: Hadoop Ecosystem]
Common Hadoop Distributions
• Open Source: Apache
• Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft HDInsight
HDFS: Architecture
• Master: NameNode
• Slave: bunch of DataNodes
HDFS Layers
• Name Space (NS): NameNode
• Block Storage: Block Management (NameNode) and Storage (DataNodes)
HDFS: Basic Features
• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware
HDFS Write (1/2)
[Diagram: Client with blocks A1, A2, A3, A4; Name Node; Data Nodes A, B, C, D]
1. Client contacts NameNode to write data
2. NameNode says write it to these nodes
3. Client sequentially writes blocks to DataNode
HDFS Write (2/2)
DataNodes replicate data blocks, orchestrated by the NameNode
[Diagram: Client; Name Node; Data Node A holds A1, A2, A4; Data Node B holds A2, A1, A3; Data Node C holds A3, A2, A4; Data Node D holds A4, A1, A3]
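The replication step above can be sketched with a toy placement routine in plain Java. This is an illustration under loud assumptions: ReplicaPlacementSketch and place are hypothetical names, and the round-robin rule is a stand-in chosen for simplicity. It is not HDFS's actual placement policy, which takes rack topology into account, and it does not reproduce the exact layout drawn on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: round-robin placement is an assumption, not HDFS's real policy.
final class ReplicaPlacementSketch {

    /** Assigns each block to `replication` distinct DataNodes, round-robin. */
    static Map<String, List<String>> place(List<String> blocks,
                                           List<String> dataNodes,
                                           int replication) {
        if (replication < 1 || replication > dataNodes.size()) {
            throw new IllegalArgumentException("replication must be 1..number of DataNodes");
        }
        Map<String, List<String>> placement = new LinkedHashMap<>();
        for (int i = 0; i < blocks.size(); i++) {
            List<String> targets = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // Consecutive nodes, wrapping around, so replicas never share a node.
                targets.add(dataNodes.get((i + r) % dataNodes.size()));
            }
            placement.put(blocks.get(i), targets);
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, List<String>> placement = place(
                Arrays.asList("A1", "A2", "A3", "A4"),
                Arrays.asList("A", "B", "C", "D"),
                3);
        for (Map.Entry<String, List<String>> e : placement.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

With four blocks, four DataNodes, and three copies, every block lands on three different nodes, so losing any single DataNode still leaves two copies of each block.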
HDFS Read
[Diagram: Client; Name Node; Data Node A holds A1, A2, A4; Data Node B holds A2, A1, A3; Data Node C holds A3, A2, A4; Data Node D holds A4, A1, A3]
1. Client contacts NameNode to read data
2. NameNode says you can find it here
3. Client sequentially reads blocks from DataNode
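The three read steps above can be mimicked with a small in-memory model. This is only a sketch: HdfsReadSketch, store, and readFile are hypothetical names, block contents are plain strings, and the "first listed replica" choice is an assumption (a real client prefers the closest replica and talks to the cluster over the network through the Hadoop FileSystem API).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory toy model of an HDFS read; not the Hadoop client API.
final class HdfsReadSketch {
    // NameNode metadata: block id -> DataNodes holding a replica, in file order.
    private final Map<String, List<String>> blockLocations = new LinkedHashMap<>();
    // DataNode storage: node name -> (block id -> block contents).
    private final Map<String, Map<String, String>> dataNodes = new HashMap<>();

    /** Records a block and copies its contents to each listed DataNode. */
    void store(String block, String contents, String... nodes) {
        if (nodes.length == 0) {
            throw new IllegalArgumentException("a block needs at least one DataNode");
        }
        blockLocations.put(block, Arrays.asList(nodes));
        for (String node : nodes) {
            dataNodes.computeIfAbsent(node, n -> new HashMap<>()).put(block, contents);
        }
    }

    /** Reads the whole file by walking its blocks in order. */
    String readFile() {
        StringBuilder out = new StringBuilder();
        // Steps 1-2: ask the NameNode metadata where each block lives.
        for (Map.Entry<String, List<String>> e : blockLocations.entrySet()) {
            // Step 3: fetch the block from the first listed replica.
            String node = e.getValue().get(0);
            out.append(dataNodes.get(node).get(e.getKey()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        HdfsReadSketch cluster = new HdfsReadSketch();
        cluster.store("A1", "Deer Bear River\n", "A", "B", "D");
        cluster.store("A2", "Car Car River\n", "A", "B", "C");
        cluster.store("A3", "Deer Car Bear\n", "B", "C", "D");
        System.out.print(cluster.readFile());
    }
}
```

The point of the model is the division of labor: the NameNode side holds only metadata, while the block contents come from the DataNodes.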
HA (High Availability) for NameNode
[Diagram: NameNode (Active), NameNode (Standby), and the DataNodes]
Active NameNode
• Performs the normal NameNode operations
Standby NameNode
• Maintains the NameNode's data
• Ready to become the active NameNode
MapReduce
A MapReduce job consists of two tasks:
• Map Task
• Reduce Task
• Blocks of data distributed across several machines are processed by map tasks in parallel
• Results are aggregated in the reducer
• Works only on KEY/VALUE pairs
MapReduce: Word Count
Can we do word count in parallel?
Input:
Deer Bear River
Car Car River
Deer Car Bear
Map output:
Deer 1, Bear 1, River 1
Car 1, Car 1, River 1
Deer 1, Car 1, Bear 1
Reduce output:
Bear 2
Car 3
Deer 2
River 2
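The word-count example above can be traced end to end without a cluster. The following is a minimal single-process sketch in plain Java: WordCountSketch and its map, shuffle, and reduce methods are hypothetical names that only imitate the three phases, and nothing here uses the Hadoop API or runs in parallel.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-process imitation of map -> shuffle -> reduce; not Hadoop code.
final class WordCountSketch {

    /** Map phase: emit (word, 1) for every token in one input line. */
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1L));
        }
        return out;
    }

    /** Shuffle phase: group values by key, with keys in sorted order. */
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one key. */
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Long> wordCount(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line)); // on a cluster, each line could go to a different map task
        }
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(wordCount(input)); // {Bear=2, Car=3, Deer=2, River=2}
    }
}
```

Because each call to map depends only on its own line, the map phase is the part that can be spread across machines; the grouping by key is what lets each reducer sum one word independently.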
[Figure: MapReduce: Word Count Program]
[Figure: Data Flow in a MapReduce Program in Hadoop]
Mapper Class
package ambuj.com.wc;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every whitespace-separated token in each input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private final static LongWritable one = new LongWritable(1);
    // Reused for every token so the mapper does not allocate a new Text each time.
    private Text word = new Text();

    @Override
    public void map(LongWritable inputKey, Text inputVal, Context context)
            throws IOException, InterruptedException {
        String line = inputVal.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Reducer Class
package ambuj.com.wc;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted for each word and writes (word, total).
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> listOfValues, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable val : listOfValues) {
            sum = sum + val.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
Driver Class
package ambuj.com.wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Configures and submits the WordCount job: args[0] is the input path, args[1] the output path.
public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Use the configuration ToolRunner prepared, so generic options (-D, -conf) take effect.
        Configuration conf = getConf();
        // Job.getInstance replaces the "new Job(conf, name)" constructor, deprecated in Hadoop 2.
        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Report failure to the caller instead of always returning 0.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}
A view of Hadoop
[Diagram: a Client submits a Job to the Master, which runs the Job Tracker (MapReduce) and the Name Node (HDFS); each Slave is a Data Node that stores blocks and runs a Task Tracker with several Tasks]
Use Cases
• Utilities want to predict power consumption
• Banks and insurance companies want to understand risk
• Fraud detection
• Marketing departments want to understand customers
• Recommendations
• Location-Based Ad Targeting
• Threat Analysis
[Figure: Big Data Landscape]
Hadoop Features & Summary
Distributed framework for processing and storing data, generally on commodity hardware. Completely open source and written in Java.
• Store anything
  • Unstructured or semi-structured data
• Storage capacity
  • Scales linearly; cost is not exponential
• Data locality and processing your way
  • Code moves to data
  • In MR you specify the actual steps in processing the data and drive the output
  • Stream access: process data in any language
• Failure and fault tolerance
  • Detects failure and heals itself
  • Reliable: data is replicated and failed tasks are rerun, so there is no need to maintain a backup of the data
• Cost effective: Hadoop is designed to be a scale-out architecture operating on a cluster of commodity PC machines
The Hadoop framework transparently provides applications with both reliability and data motion, and allows for customization and adaptation.
Primarily used for batch processing, not real-time/transactional user applications.
References - Hadoop
• Tom White, Hadoop: The Definitive Guide, Third Edition
• http://hadoop.apache.org
• http://www.cloudera.com
• http://ambuj4bigdata.blogspot.com
• http://ambujworld.wordpress.com
Thank You

Mais conteúdo relacionado

Mais procurados

Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 
Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoopguest27e6764
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handyPraveen Sripati
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoopmarklpollack
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillMapR Technologies
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopEdureka!
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataDataWorks Summit
 
Big Data Summer training presentation
Big Data Summer training presentationBig Data Summer training presentation
Big Data Summer training presentationHarshitaKamboj
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011Hortonworks
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1Giovanna Roda
 

Mais procurados (20)

Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Large Scale Data With Hadoop
Large Scale Data With HadoopLarge Scale Data With Hadoop
Large Scale Data With Hadoop
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Enabling R on Hadoop
Enabling R on HadoopEnabling R on Hadoop
Enabling R on Hadoop
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoop
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Understanding hdfs
Understanding hdfsUnderstanding hdfs
Understanding hdfs
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your Data
 
Big Data Summer training presentation
Big Data Summer training presentationBig Data Summer training presentation
Big Data Summer training presentation
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011
 
Hadoop
HadoopHadoop
Hadoop
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1
 

Semelhante a Basic of Big Data

Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pigSudar Muthu
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune amrutupre
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabsSiva Sankar
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introductiondewang_mistry
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 

Semelhante a Basic of Big Data (20)

Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabs
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introduction
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 

Último

Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxsomshekarkn64
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptJasonTagapanGulla
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 

Último (20)

Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptx
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.ppt
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 

Basic of Big Data

  • 1. Basics of Big Data Analytics & Hadoo p Ambuj Kumar Ambuj_kumar@aol.com http://ambuj4bigdata.blogspot.in http://ambujworld.wordpress.com
  • 2. Agend a Big Data –  Concepts overview  Analytics –  Concepts overview  Hadoop –  Concepts overview  HDFS  Concepts overview  Data Flow - Read & Write Operation  MapReduce  Concepts overview  WordCount Program  Use Cases  Landscape  Hadoop Features & Summary
  • 3. What is Big Data?Big data is data which is too large, complex and dynamic for any conventional data tools to capture, store, manage and analyze.
  • 4. Challenges of Big Data • Storage (~ Petabytes) 1 • Processing (Timely manner) • Variety of Data (Structured, Semi Structured,Un-structured) • Cos t 2 3 4
  • 5. Big Data AnalyticsBig data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information. Big Data AnalyticsSolutions There are many different Big Data Analytics Solutions out in the market.  Tableau – visualization tools  SAS – Statistical computing  IBM and Oracle –They have a range of tools for Big Data Analysis  Revolution – Statistical computing  R – Open source tool for Statisticalcomputing
  • 6. What is Hadoop?  Open-source data storage and processingAPI  Massively scalable, automaticallyparallelizable  Based on work from Google  GFS + MapReduce + BigTable  Current Distributions based on Open Source and VendorWork  Apache Hadoop  Cloudera – CDH4  Hortonworks  MapR  AWS  Windows Azure HDInsight
  • 7. Why Use Hadoop? Cheaper Scales to Petabytes or more Faster Parallel data processing Better Suited for particular types of BigData problems
  • 8. Hadoop HistoryIn 2008, Hadoop becameApache Top Level Project
  • 9. Comparing:RDBMS vs. HadoopTraditional RDBMS Hadoop / MapReduce Data Size Gigabytes (Terabytes) Petabytes (Hexabytes) Access Interactive and Batch Batch – NOT Interactive Updates Read / Write many times Write once, Read many times Structure Static Schema Dynamic Schema Integrity High (ACID) Low Scaling Nonlinear Linear Query ResponseTime Can be near immediate Has latency (due tobatch processing)
  • 10. Where is Hadoop used? Technology Industry Use Cases Search People you may know Movie recommendations Banks Fraud Detection Regulatory Risk management Media Retail Marketing analytics Customer service Product recommendations Manufacturing Preventive maintenance
  • 11. Companies Using Hadoop  Search Yahoo,Amazon,Zvents  Log Processing Facebook,Yahoo, ContextWeb.Joost,Last.fm  Recommendation Systems Facebook,Linkedin  DataWarehouse Facebook,AOL  Video & ImageAnalysis NewYorkTimes,Eyealike ------- Almost in every domain!
  • 12. Hadoop is a set of Apache Frameworks and more…  Data storage (HDFS)  Runs on commodity hardware (usually Linux)  Horizontally scalable  Processing (MapReduce)  Parallelized (scalable) processing Fault Tolerant  Other Tools / Frameworks  Data Access  HBase, Hive, Pig, Mahout  Tools  Hue, Sqoop  Monitoring  Greenplum, Cloudera Hadoop Core - HDFS MapReduceAPI Monitoring &Alerting Tools & Libraries DataAccess
  • 13. Core parts of Hadoop distribution HDFS Storage Redundant (3copies) For large files – large blocks 64 or 128 MB / block Can scale to 1000s of nodes MapReduce API Batch (Job) processing Distributed and Localized to clusters (Map) Auto-Parallelizable for huge amounts of data Fault-tolerant (auto retries) Adds high availability and more Other Libraries Pig Hive HBase Others
  • 14. Hadoop Cluster HDFS (Physical) Storage Name Node Data Node 1 Data Node 2 Data Node 3 Secondary Name Node • Contains web site to view cluster information • V2 Hadoop uses multiple Name Nodes for HA One Name Node Many Data Nodes • 3 copies of each node by default Work with data in HDFS • Using common Linux shell commands • Block size is 64 or 128 MB
  • 15. MapReduce Job – Logical View
  • 18. HDFS :Architecture Master NameNode Slave Bunch of DataNodes HDFS Layers NameNode Storage ………… NS Block Management NameNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode Name Space Block Storage
  • 19. HDFS : Basic Features Highly fault- tolerant High throughput Suitable for applications with large data sets Streaming access to file system data Can be built out of commodity hardware
  • 20. HDFS Write (1/2) Client Name Node 1 2 Data Node A Data Node B Data Node C Data Node D A2 A3 A4A1 3 Client contacts NameNode to write data NameNode says write it to thesenodes Client sequentiallywrites blocks to DataNode
  • 21. HDFS Write (2/2) Client Name Node Data Node A Data Node B Data Node C Data Node D A1 DataNodes replicatedata blocks, orchestrated by the NameNode A2 A4 A2 A1 A3 A3 A2 A4 A4 A1 A3
  • 22. HDFS Read. Step 1: the client contacts the NameNode to read data. Step 2: the NameNode tells the client where the blocks can be found. Step 3: the client reads blocks sequentially from the DataNodes. (Diagram: each of the blocks A1 to A4 is stored on three of the four DataNodes A to D.)
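The read path above, together with block replication, can be sketched as a toy model in plain Java. This is not the HDFS client API: the class `ToyHdfsRead` and its methods are hypothetical, the block-to-node layout is an assumption read off the slide diagram, and the toy simply picks the first live replica, whereas real HDFS prefers the replica closest to the reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ToyHdfsRead {

    // Stand-in for NameNode metadata: block id -> DataNodes holding a replica.
    // Assumed layout: A holds A1,A2,A4; B holds A2,A1,A3; C holds A3,A2,A4; D holds A4,A1,A3.
    static Map<String, List<String>> blockLocations() {
        Map<String, List<String>> locations = new LinkedHashMap<>();
        locations.put("A1", Arrays.asList("A", "B", "D"));
        locations.put("A2", Arrays.asList("A", "B", "C"));
        locations.put("A3", Arrays.asList("B", "C", "D"));
        locations.put("A4", Arrays.asList("A", "C", "D"));
        return locations;
    }

    // For each block, in file order, choose the first DataNode that is still alive.
    static List<String> planRead(Map<String, List<String>> locations, Set<String> deadNodes) {
        List<String> plan = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : locations.entrySet()) {
            String chosen = null;
            for (String node : entry.getValue()) {
                if (!deadNodes.contains(node)) {
                    chosen = node;
                    break;
                }
            }
            if (chosen == null) {
                throw new IllegalStateException("No live replica for block " + entry.getKey());
            }
            plan.add(entry.getKey() + "@" + chosen);
        }
        return plan;
    }

    public static void main(String[] args) {
        // All DataNodes healthy.
        System.out.println(planRead(blockLocations(), new HashSet<String>()));
        // prints [A1@A, A2@A, A3@B, A4@A]

        // DataNode A has failed: every block is still readable from another replica.
        System.out.println(planRead(blockLocations(), Collections.singleton("A")));
        // prints [A1@B, A2@B, A3@B, A4@C]
    }
}
```

With three replicas per block on four nodes, losing any single DataNode in this toy layout leaves at least two copies of every block, which is the property the fault-tolerance claims on the earlier slides rely on.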
  • 23. HA (High Availability) for NameNode. Active NameNode: performs the normal NameNode operations. Standby NameNode: maintains a copy of the NameNode's data and is ready to become the active NameNode. (Diagram: active and standby NameNodes above a set of DataNodes.)
  • 24. MapReduce. A MapReduce job consists of two kinds of tasks: map tasks and reduce tasks. Blocks of data distributed across several machines are processed by map tasks in parallel. Results are aggregated in the reducer. Works only on key/value pairs.
  • 25. MapReduce: Word Count. Can we do word count in parallel? Input: Deer Bear River Car Car River Deer Car Bear. Map output: (Deer, 1), (Bear, 1), (River, 1), (Car, 1), (Car, 1), (River, 1), (Deer, 1), (Car, 1), (Bear, 1). Reduce output: (Bear, 2), (Car, 3), (Deer, 2), (River, 2).
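The word count example above can be traced end to end without a cluster. The following is a minimal single-process sketch in plain Java that imitates the map, shuffle, and reduce steps. The class `InMemoryWordCount`, its method names, and the division of the input into three splits are assumptions for illustration; a real Hadoop job distributes these steps across machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class InMemoryWordCount {

    // Map step: emit a (word, 1) pair for every token in one input split.
    static List<Map.Entry<String, Long>> map(String split) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(split);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1L));
        }
        return pairs;
    }

    // Shuffle step: group all emitted values by key, with keys in sorted order.
    static SortedMap<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        SortedMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the values collected for one key.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    // Run map over every split, shuffle the combined output, then reduce each key.
    static SortedMap<String, Long> wordCount(List<String> splits) {
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.addAll(map(split));
        }
        SortedMap<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : shuffle(mapped).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed split of the slide's input into three pieces, one per map task.
        List<String> splits = Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(wordCount(splits));
        // prints {Bear=2, Car=3, Deer=2, River=2}
    }
}
```

Because each call to `map` depends only on its own split, the map calls could run on different machines at the same time; only the shuffle needs to bring pairs with the same key together before `reduce` runs.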
  • 27. Data Flow in a MapReduce Program in Hadoop
  • 28. Mapper Class
        package ambuj.com.wc;

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Input: (byte offset of the line, line text). Output: (word, 1).
        public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

            // Reused across calls to avoid allocating a new object per token.
            private final static LongWritable one = new LongWritable(1);
            private Text word = new Text();

            @Override
            public void map(LongWritable inputKey, Text inputVal, Context context)
                    throws IOException, InterruptedException {
                String line = inputVal.toString();
                // Split the line on whitespace and emit (word, 1) for each token.
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);
                }
            }
        }
  • 29. Reducer Class
        package ambuj.com.wc;

        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // Input: (word, list of counts). Output: (word, total count).
        public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

            @Override
            public void reduce(Text key, Iterable<LongWritable> listOfValues, Context context)
                    throws IOException, InterruptedException {
                // Sum every count emitted for this word.
                long sum = 0;
                for (LongWritable val : listOfValues) {
                    sum = sum + val.get();
                }
                context.write(key, new LongWritable(sum));
            }
        }
  • 30. Driver Class
        package ambuj.com.wc;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;

        public class WordCountDriver extends Configured implements Tool {

            @Override
            public int run(String[] args) throws Exception {
                // Use the configuration set by ToolRunner so generic options (-D, -conf) take effect.
                Configuration conf = getConf();
                // Job.getInstance replaces the deprecated Job constructor (Hadoop 2 and later).
                Job job = Job.getInstance(conf, "WordCount");
                job.setJarByClass(WordCountDriver.class);

                job.setMapperClass(WordCountMapper.class);
                job.setReducerClass(WordCountReducer.class);
                job.setInputFormatClass(TextInputFormat.class);

                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(LongWritable.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(LongWritable.class);

                // args[0] is the input path, args[1] the output path (which must not already exist).
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                // Report job failure through the return code.
                return job.waitForCompletion(true) ? 0 : 1;
            }

            public static void main(String[] args) throws Exception {
                System.exit(ToolRunner.run(new WordCountDriver(), args));
            }
        }
  • 31. A view of Hadoop. The client submits a job. Master: Job Tracker (MapReduce) and Name Node (HDFS). Slaves: each runs a Data Node (HDFS, storing blocks) and a Task Tracker (MapReduce, running tasks).
  • 32. Use Cases. Utilities want to predict power consumption; banks and insurance companies want to understand risk; fraud detection; marketing departments want to understand customers; recommendations; location-based ad targeting; threat analysis.
  • 34. Hadoop Features & Summary. A distributed framework for processing and storing data, generally on commodity hardware. Completely open source and written in Java. Store anything: unstructured or semi-structured data. Storage capacity: scales linearly; cost is not exponential. Data locality and processing your way: code moves to the data; in MapReduce you specify the actual steps for processing the data and drive the output. Stream access: process data in any language. Failure and fault tolerance: detects failures and heals itself; reliable, since data is replicated and failed tasks are rerun, with no need to maintain a separate backup of the data. Cost-effective: Hadoop is designed as a scale-out architecture operating on a cluster of commodity PC machines. The Hadoop framework is open to customization and transparently provides applications with reliability, adaptation, and data motion. Primarily used for batch processing, not for real-time or transactional user applications.