2. CONTENTS
Big data
WHAT IS BIG DATA
VOLUME
VELOCITY
VARIETY
SOURCES OF BIG DATA
CHALLENGES WITH BIG DATA
TECHNOLOGIES TO MEET BIG DATA
Hadoop
HISTORY OF HADOOP
BEFORE HADOOP
ARCHITECTURE
COMPANIES WHICH USE HADOOP
BIG DATA JOB ROLES
3. NAME SYMBOL VALUE
KILOBYTE KB 10^3
MEGABYTE MB 10^6
GIGABYTE GB 10^9
TERABYTE TB 10^12
PETABYTE PB 10^15
EXABYTE EB 10^18
ZETTABYTE ZB 10^21
YOTTABYTE YB 10^24
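The decimal (SI) units in the table can be expressed as powers of ten; a minimal sketch (the `UNITS` table and `to_bytes` helper are illustrative names, not from the deck):

```python
# Decimal (SI) byte units from the table above, as powers of 10.
UNITS = {"KB": 3, "MB": 6, "GB": 9, "TB": 12,
         "PB": 15, "EB": 18, "ZB": 21, "YB": 24}

def to_bytes(value, unit):
    """Convert a value in the given decimal unit to bytes."""
    return value * 10 ** UNITS[unit]

print(to_bytes(2, "TB"))  # 2 TB = 2000000000000 bytes
```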
4. WHAT IS “BIG DATA”?
“BIG DATA” is data that is big in
Volume
Velocity and
Variety
“TODAY’S BIG MAY BE TOMORROW’S NORMAL”
3 V’S
6. VARIETY
Variety covers a wide range of data types
Structured data – RDBMS
Semi-structured data – HTML, XML
Unstructured data – audio, video, emails, photos,
PDFs, social media
7. SOURCES OF “BIG DATA”
Social media
Machine log data
Public web
Docs
Business apps
Data storage
8. CHALLENGES WITH “BIG DATA”
CAPTURE
STORAGE
CURATION
SEARCH
ANALYSIS
TRANSFER
VISUALIZATION
PRIVACY VIOLATIONS
9. WHAT KIND OF TECHNOLOGIES ARE NEEDED TO
MEET THE CHALLENGES POSED BY “BIG DATA”?
Cheap and abundant storage
Faster processors to help with quicker processing of
big data
Affordable open-source, distributed big data
platforms, such as Hadoop
Cloud computing and other flexible
resource-allocation arrangements
12. HISTORY OF HADOOP
Hadoop was created by DOUG CUTTING and MICHAEL
CAFARELLA in 2005
2002–2003 – Nutch, an open-source search engine
built on Lucene (in the era of engines like Sphinx)
Google published papers on its distributed file
system GFS (2003) and on MapReduce (2004)
Yahoo! then took the initiative and backed the
project, and Hadoop grew out of Nutch
Hadoop 0.1.0 was released in April 2006
As of this presentation, Hadoop 2.8 is the latest release
15. BEFORE HADOOP
Suppose you have 100 TB of data in a data
center
At some point you want to retrieve some 2 TB of that
data, and you wrote a program to do it – say, 100 KB
of code
To process the data, you had to move it to the
system running your program
i.e. wherever the program runs, the data must be
fetched to that system
“COMPUTATION HAPPENS WHERE THE PROCESSOR IS” –
so, before Hadoop, the data always followed the code
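The cost of shipping data to code can be sketched with back-of-the-envelope arithmetic; the 1 Gb/s link speed below is my assumption, the 2 TB and 100 KB figures come from the slide:

```python
# Assumption: a 1 Gb/s network link between storage and compute.
LINK_BPS = 1_000_000_000 / 8        # link speed in bytes per second

data_bytes = 2 * 10 ** 12           # 2 TB of data to retrieve (from the slide)
code_bytes = 100 * 10 ** 3          # 100 KB program (from the slide)

# Pre-Hadoop: move the data to the program.
data_transfer_s = data_bytes / LINK_BPS
# Hadoop's idea: move the program to the data.
code_transfer_s = code_bytes / LINK_BPS

print(f"ship data to code: {data_transfer_s / 3600:.1f} hours")  # ~4.4 hours
print(f"ship code to data: {code_transfer_s * 1000:.1f} ms")     # ~0.8 ms
```

The asymmetry is the whole motivation: moving 100 KB of code is millions of times cheaper than moving 2 TB of data.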
17. Contd…
HDFS (Hadoop Distributed File System) –
a distributed file system that stores data on
commodity machines, providing very high aggregate
bandwidth across the cluster (storage)
MAPREDUCE – a system for parallel processing of
large data sets (processing)
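The MapReduce model above can be sketched in plain Python — an in-memory illustration of the programming model (map emits key–value pairs, a shuffle groups them by key, reduce aggregates), not the actual Hadoop API; all function names here are mine:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)

def word_count(lines):
    """Run map -> shuffle -> reduce over a list of lines."""
    grouped = defaultdict(list)          # shuffle: group values by key
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(word_count(["big data big", "data"]))  # {'big': 2, 'data': 2}
```

In real Hadoop, the map and reduce functions run in parallel across the cluster and the shuffle happens over the network; the logic per key is the same.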
18. Contd…
HDFS daemons – NameNode, Secondary NameNode,
DataNode
MapReduce daemons – JobTracker, TaskTracker
Master node runs: NameNode, Secondary NameNode,
JobTracker
Slave nodes run: DataNode, TaskTracker
20. PROCESS…
File -> NameNode -> division into blocks ->
each block replicated three times -> the NameNode
records the address of every replica of every block
If a hardware failure occurs on a DataNode, the
NameNode learns of it (via missed heartbeats) and
keeps the replica count constant
by replicating the affected blocks again to restore
three copies, and recording the new replicas’
addresses, so processing continues without error
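The write path above can be sketched as a toy placement function — a simplified model, not HDFS itself: real HDFS placement is rack-aware, while this sketch just round-robins replicas across nodes; the node names and helper are mine. The 128 MB block size and replication factor of 3 are common HDFS defaults:

```python
BLOCK_SIZE = 128 * 1024 * 1024      # 128 MB, a common HDFS default
REPLICATION = 3                     # three copies of every block

def place_blocks(file_size, data_nodes):
    """Return a NameNode-style map: block index -> list of replica nodes."""
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    block_map = {}
    for b in range(n_blocks):
        # Round-robin placement onto distinct nodes (real HDFS is rack-aware).
        block_map[b] = [data_nodes[(b + r) % len(data_nodes)]
                        for r in range(REPLICATION)]
    return block_map

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_blocks(300 * 1024 * 1024, nodes))
# {0: ['dn1', 'dn2', 'dn3'], 1: ['dn2', 'dn3', 'dn4'], 2: ['dn3', 'dn4', 'dn1']}
```

If a node dies, the NameNode re-runs placement for the blocks it held, which is exactly the re-replication step the slide describes.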
23. “BIG DATA” JOB ROLES
CHIEF DATA OFFICER
BIG DATA SCIENTIST
BIG DATA ANALYST
BIG DATA VISUALIZER
BIG DATA MANAGER
BIG DATA SOLUTIONS ARCHITECT
BIG DATA ENGINEER
BIG DATA RESEARCHER
BIG DATA CONSULTANT