DATA TRENDS
- Facebook has a data warehouse of around 60 PB, and it is constantly growing.
- Twitter messages are up to 140 characters each, generating 8 TB of data per day.
- Data is more than doubling every year.
- Almost 80% of data will be unstructured.
- Amazon: 35% of product sales come from product recommendations.
New Types of Data
• Sentiment: Understand how your customers feel about your products / company.
• Sensor/Machine: Discover patterns in data streaming automatically from sensors and machines.
• Unstructured: Text, video, pictures.
• Server Logs: Search logs to find patterns.
• Geographic: Analyze location-based data.
• Clickstream: Capture and analyze website visitors' data.
Capacity vs Cost
Year   Capacity (GB)   Cost per GB (USD)
1990   0.10            $4,000
1997   2               $150
2002   80              $3.75
2007   750             $0.35
2012   3,000           $0.05
2015   10,000          $0.02
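The table can be turned into two quick calculations: the approximate price of one drive in each year (capacity times cost per GB), and how far the cost per GB fell between the first and last rows. This is a minimal sketch; the capacities for 2012 and 2015 are taken to be 3,000 GB and 10,000 GB, and the names `DRIVES`, `drive_price` and `cost_drop_factor` are made up for illustration.

```python
# (year, capacity in GB, cost per GB in USD), copied from the table above.
DRIVES = [
    (1990, 0.10, 4000.00),
    (1997, 2, 150.00),
    (2002, 80, 3.75),
    (2007, 750, 0.35),
    (2012, 3000, 0.05),
    (2015, 10000, 0.02),
]


def drive_price(capacity_gb, cost_per_gb):
    """Approximate price of one drive: capacity times cost per GB."""
    return capacity_gb * cost_per_gb


def cost_drop_factor(rows):
    """How many times cheaper one GB is in the last row than in the first."""
    return rows[0][2] / rows[-1][2]


for year, capacity_gb, cost_per_gb in DRIVES:
    price = drive_price(capacity_gb, cost_per_gb)
    print(f"{year}: {capacity_gb:>8} GB, about ${price:,.2f} per drive")

print(f"Cost per GB fell by a factor of about {cost_drop_factor(DRIVES):,.0f}")
```

The price of a drive stays within a few hundred dollars across the whole table, while capacity grows by five orders of magnitude.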
What is Big Data?
• Big Data is when the volume, velocity, and variety of data reach the point where it is too difficult or too expensive for traditional systems to work with.
Traditional Large-Scale Computing System Problems
• Computation has been processor-bound
• Relatively small amounts of data
• Complex processing
• Need bigger computers
• More memory, more and faster processors
Better Solution
• Distributed systems: multiple machines work together on a single job
Problems of Distributed Systems
• Data is stored in a central location
• Data is copied to the processors at runtime
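To see why copying data to the processors at runtime becomes a bottleneck, here is a rough back-of-the-envelope calculation. It is only a sketch: the 1 Gbit/s link speed is an assumption chosen for illustration, not a figure from these slides, and `transfer_seconds` is a hypothetical helper.

```python
TB = 10**12  # bytes in a terabyte (decimal)
PB = 10**15  # bytes in a petabyte (decimal)

# Assumption for illustration only: a single 1 Gbit/s network link.
ASSUMED_LINK_BITS_PER_SEC = 10**9


def transfer_seconds(num_bytes, link_bits_per_sec=ASSUMED_LINK_BITS_PER_SEC):
    """Seconds needed to move num_bytes over a link of the given speed."""
    return num_bytes * 8 / link_bits_per_sec


print(f"1 TB: {transfer_seconds(TB) / 3600:.1f} hours")
print(f"1 PB: {transfer_seconds(PB) / 86400:.1f} days")
```

Under that assumption, moving 1 TB takes about 2.2 hours and moving 1 PB takes about 93 days, before any processing has started.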
Today
• Total data size: petabytes
• Daily data: terabytes
We Need a New Solution
HADOOP
HDFS
• Hadoop Distributed File System: stores the data
• Data is split into blocks. 64 MB…
• Each block is replicated, e.g. 3 times. Replicas are stored on different nodes.
• Based on the Google File System
• Sits on top of a native file system: ext3, ext4, xfs
• No random writes allowed. Prefers large streaming reads.
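The block splitting and replication described above can be sketched in a few lines. This is a simulation under stated assumptions, not the HDFS API: `split_into_blocks` and `place_replicas` are hypothetical names, the node names are made up, and the round-robin placement is a stand-in for the real placement policy, which also takes racks into account.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size mentioned above
REPLICATION = 3                # each block is stored on 3 different nodes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [
            nodes[(start + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


# A 200 MB file becomes three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block_id, holders in placement.items():
    print(f"block {block_id} ({blocks[block_id] // (1024 * 1024)} MB): {holders}")
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.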
Hadoop Ecosystem
• HIVE
  - SQL-like query language
  - Users query data in the Hadoop cluster without knowing Java or MapReduce.
• PIG
  - Uses a dataflow scripting language
• IMPALA
  - Open source project created by Cloudera
  - Query language is very similar to HiveQL. Produces results much faster.
Hadoop Ecosystem
• FLUME
  - Imports data into HDFS as it is generated
  - Example: log files from a web server
• Sqoop
  - Imports data from tables in an OLTP database into HDFS
  - Populates database tables from files in HDFS
• Oozie
  - Developers create a workflow of MapReduce jobs
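For readers who have not seen one of the MapReduce jobs mentioned above, the shape of such a job can be sketched in plain Python. This is not Hadoop code: it only imitates the map, shuffle and reduce phases in a single process, and the log line format (`METHOD URL`) and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Map: emit a (url, 1) pair for every well-formed request line."""
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2:
            yield parts[1], 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values of each key to get hits per URL."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(log_lines):
    """Run the three phases in order, as one job."""
    return reduce_phase(shuffle_phase(map_phase(log_lines)))


sample_log = ["GET /index.html", "GET /cart", "GET /index.html"]
print(run_job(sample_log))  # {'/index.html': 2, '/cart': 1}
```

A workflow in the Oozie sense would chain several such jobs, feeding the output of one into the next.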
Hadoop Ecosystem
• HBASE
  - The Hadoop database
  - NoSQL datastore
  - Huge data store: GB, TB, PB
  - Query language: get/put/scan
  - Read/write throughput: millions of queries per second, versus thousands of queries per second for an RDBMS
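The get/put/scan model can be illustrated with a tiny in-memory table. This is only a sketch of the semantics, not the HBase client API: `TinyTable` is a hypothetical class, and it keeps row keys sorted so that a scan can return a key range in order, with the start key included and the stop key excluded.

```python
import bisect


class TinyTable:
    """Toy key-ordered store with get/put/scan, for illustration only."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        """Write one cell, creating the row if it does not exist yet."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of the row's columns, or an empty dict if absent."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key."""
        low = bisect.bisect_left(self._keys, start_key)
        high = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[low:high]:
            yield key, dict(self._rows[key])


table = TinyTable()
table.put("user2", "name", "Bob")
table.put("user1", "name", "Alice")
table.put("user3", "name", "Carol")

print(table.get("user1"))
print([key for key, _ in table.scan("user1", "user3")])
```

There are no joins or ad hoc queries here: every access goes through a row key or a range of row keys.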
Big Data
• Finance: fraud detection, customer risk analysis
• Retail: product recommendations, purchases and discounts
• Advertising: more effective web ads
• Defense
• Telco
• Healthcare