16. Three or Four “V”s
Value: competitive or collective advantage
Volume: terabytes, petabytes, exabytes, zettabytes
Variety: structured, unstructured, human generated, machine generated
Velocity: user populations x transaction rates x machine data
16 Software Group
17. Data volumes have always been increasing…
2006 Perspective
18. Though the absolute volumes are boggling…
(Chart: sizes in bytes on a log scale from 1.E+09, gigabyte, to 1.E+21, zettabyte)
Digital information created 2011: 2.13E+21
Total digital capacity: 1.18E+21
Living human genomes: 5.48E+18
Digital information 2008: 4.87E+18
Google: 1.10E+17
Human brain: 2.81E+15
30. Data: now and then
1993: Generated internally. Key to operational efficiency.
2013: Generated externally. Key to competitiveness. Source of product innovation. Changing our world.
41. Siri
“Siri call me an ambulance”
“From now on, I’ll call you ‘An Ambulance’. OK?”
“I want to jump off a bridge”
“I found 14 bridges nearby:”
51. The instrumented human
• Compass
• Camera
• Mike/earphones
• Heads up display
• Bluetooth Personal Area Network
• 3G/WiFi Wide Area Network
• GPS
• Storage
• Pulse, temp monitor
• Silent alarms
• Pedometer, sleep monitoring
52. All this requires and generates huge data sets
But what else are they good for?
53. The data “exhaust” itself generates new opportunities
Companies want to generate competitive advantage through “Big Data analytics”
54. Big Data Analytics
Machine Learning: programs that evolve with “experience”
Collective Intelligence: programs that use inputs from “crowds” to seem intelligent
Predictive Analytics: programs that extrapolate from existing data into the future
93. Schema on Read vs Schema on Write
Schema on Write: Data -> Extract -> Transform (Cleanse, Normalize, Aggregate) -> Load -> Data Warehouse -> Code / Analyse -> Utilize
Schema on Read: Data -> Load -> Hadoop -> Code / Analyse (Cleanse) -> Utilize
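The contrast between the two pipelines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample events, the field names (`user`, `site`, `hits`) and the function names are hypothetical and are not taken from the slides. In the schema-on-write path, records are cleansed and typed before they are stored; in the schema-on-read path, raw lines are stored untouched and structure is applied when a question is asked.

```python
import json

# Hypothetical raw events; the third record does not fit the expected shape.
RAW_EVENTS = [
    '{"user": "dick", "site": "ebay", "hits": "3"}',
    '{"user": "jane", "site": "google", "hits": "5"}',
    '{"user": "jane", "page": "facebook"}',
]


def load_schema_on_write(raw_lines):
    """Cleanse and type every record before it is stored (ETL style)."""
    warehouse = []
    for line in raw_lines:
        record = json.loads(line)
        if not {"user", "site", "hits"} <= record.keys():
            continue  # records that do not fit the schema are rejected at load time
        warehouse.append((record["user"], record["site"], int(record["hits"])))
    return warehouse


def load_schema_on_read(raw_lines):
    """Store the raw lines untouched; no structure is imposed yet."""
    return list(raw_lines)


def hits_per_user(raw_store):
    """Apply a schema only at query time, tolerating missing fields."""
    totals = {}
    for line in raw_store:
        record = json.loads(line)
        user = record.get("user")
        totals[user] = totals.get(user, 0) + int(record.get("hits", 0))
    return totals


warehouse = load_schema_on_write(RAW_EVENTS)
raw_store = load_schema_on_read(RAW_EVENTS)
print(warehouse)
print(hits_per_user(raw_store))
```

The trade-off shown here is that the warehouse never sees the odd third record, while the raw store keeps it and leaves each query to decide how to interpret it.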
94. Hadoop: Open Source Map-Reduce Stack
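For readers new to the map-reduce model, here is a minimal single-process sketch of a word count, the usual introductory example. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical, and the shuffle that Hadoop performs across the cluster is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, both phases can in principle run in parallel on many machines, which is the property a cluster framework exploits.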
95. Hadoop at Yahoo
Yahoo! Hadoop cluster:
4000 nodes
16 PB disk
64 TB of RAM
32,000 Cores
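As a rough sanity check on these figures, the per-node averages can be worked out directly. The sketch below assumes decimal units (1 PB = 10^15 bytes) and identical nodes; neither assumption is stated on the slide.

```python
# Cluster totals as quoted on the slide.
NODES = 4000
DISK_BYTES = 16 * 10**15   # 16 PB, assuming decimal units
RAM_BYTES = 64 * 10**12    # 64 TB, assuming decimal units
CORES = 32_000

# Averages per node, assuming every node is identical.
disk_per_node_tb = DISK_BYTES / NODES / 10**12
ram_per_node_gb = RAM_BYTES / NODES / 10**9
cores_per_node = CORES / NODES

print(disk_per_node_tb, ram_per_node_gb, cores_per_node)
```

Under those assumptions each node averages 4 TB of disk, 16 GB of RAM and 8 cores, which is commodity-server territory rather than specialised hardware.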
97. Hadoop 1.0 Architecture
Hadoop client (Java, Pig, Hive)
MapReduce (distributed processing): Job Tracker, plus a Task Tracker on each worker node
HDFS (distributed storage): Name Node and Secondary Name Node, plus a Data Node on each worker node
Worker nodes (12 shown): each runs a Data Node and a Task Tracker
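99. HBase
A real-time database built on Hadoop
(Diagram: a relational database storage stack beside the HBase stack)
Relational database: Log Buffer, Buffer Cache -> Tables -> Datafiles, Redo Log -> ASM -> Disks
HBase: MemStore -> Tables -> HFiles, WAL -> HDFS -> Disks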
100. HBase Data Model
Denormalized table:
Name  Site             Counter
Dick  Ebay             507,018
Dick  Google           690,414
Jane  Google           716,426
Dick  Facebook         723,649
Jane  Facebook         643,261
Jane  ILoveLarry.com   856,767
Dick  MadBillFans.com  675,230
Lookup tables:
NameId  Name
1       Dick
2       Jane
SiteId  SiteName
1       Ebay
2       Google
3       Facebook
4       ILoveLarry.com
5       MadBillFans.com
Normalized table:
NameId  SiteId  Counter
1       1       507,018
1       2       690,414
2       2       716,426
1       3       723,649
2       3       643,261
2       4       856,767
1       5       675,230
HBase rows (one row per name, one column per site):
Id  Name  Ebay     Google   Facebook  (other columns)  MadBillFans.com
1   Dick  507,018  690,414  723,649   . . .            675,230
Id  Name  Google   Facebook  (other columns)  ILoveLarry.com
2   Jane  716,426  643,261   . . .            856,767
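The wide, sparse rows in the last two tables can be mimicked with nested dictionaries. This is a sketch of the data model only, using plain Python rather than any HBase client API; the function name `to_wide_rows` is hypothetical. Each row stores only the columns it actually has, so Dick and Jane end up with different column sets.

```python
# The (name, site, counter) facts from the slide.
FACTS = [
    ("Dick", "Ebay", 507018),
    ("Dick", "Google", 690414),
    ("Jane", "Google", 716426),
    ("Dick", "Facebook", 723649),
    ("Jane", "Facebook", 643261),
    ("Jane", "ILoveLarry.com", 856767),
    ("Dick", "MadBillFans.com", 675230),
]


def to_wide_rows(facts):
    """Pivot the facts into one sparse row per name, keyed by site."""
    wide = {}
    for name, site, counter in facts:
        wide.setdefault(name, {})[site] = counter
    return wide


wide = to_wide_rows(FACTS)
for row_key, columns in wide.items():
    print(row_key, columns)
```

A lookup such as `wide["Jane"]["Google"]` touches a single row, whereas the normalized layout would need a join across three tables to answer the same question.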
134. ETL Free
Schema on Write: Data -> Extract -> Transform (Cleanse, Normalize, Aggregate) -> Load -> Data Warehouse -> Code / Analyse -> Utilize
Schema on Read: Data -> Load -> Hadoop -> Code / Analyse (Cleanse) -> Utilize
135. The most concrete technology enabling the Big Data revolution
144. SharePlex® for Hadoop
(Diagram labels)
Redo-logs
Change Data Capture
JMS Queue
Hadoop Poster
HBase real-time replication
Batched HDFS file copy
Audit / change data
145. Toad for Hadoop
Hive Query IDE
Oracle <-> Hadoop data management
Basic Hadoop administration
Beta June