2. AGENDA
• What is Big Data?
• Why do we have to talk about it?
• Paradigm shift in information management
• Technology and efficiency
3. WHAT IS BIG DATA?
• Data volume that cannot be handled by traditional
solutions (e.g., relational databases)
• More than 100 million data rows, typically
multi-billion
5. GLOBAL RATE OF DATA
PRODUCTION (PER SECOND)
• 30 TB/sec (22000 films)
• Digital media
• 2 hours of YouTube video
• Communication
• 3000 business emails
• 300000 SMS
• Web
• Half a million page views
• Logs
• Billions
8. WHY NOW?
• Long-term trends
• The size of stored data has doubled every 40 months
since the 1980s
• Moore's law: the number of transistors on integrated
circuits doubles every 18 months
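Compounding these doubling rates makes the gap between data growth and hardware growth concrete; a quick back-of-the-envelope sketch (plain Python, figures taken from the rates above):

```python
# Growth factor for a quantity that doubles every `period` months,
# observed after `months` months: 2 ** (months / period).

def growth(months, period):
    return 2 ** (months / period)

# Stored data (doubles every 40 months) over one decade (120 months):
data_growth = growth(120, 40)        # 2^3 = 8x

# Transistor counts (Moore's law, 18 months) over the same decade:
transistor_growth = growth(120, 18)  # 2^(120/18) ≈ 102x

print(data_growth, round(transistor_growth))
```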
13. GOOGLE’S HARDWARE IN 2013
• 12 data centers worldwide
• More than a million nodes
• A data center costs $600 million to build
• Oregon data center
• 15,000 m²
• consumes as much power as 30,000 homes
14. GOOGLE’S HARDWARE IN 2013
• Cheap commodity hardware
• each has its own battery!
• Modular data centers
• Standard container
• 1160 servers per container
• Efficiency: only 11% overhead
(power conversion, cooling)
19. HADOOP
• Who uses Hadoop?
• Facebook: 100 PB
• Yahoo: 4000 nodes
• More than half of Fortune 50 companies!
• History
• Replica of Google architecture (GFS, BigTable) in
Java under Apache licence
• Hadoop 2.0
• Full High Availability
• Advanced resource management (YARN)
20. GOOGLE BIGQUERY
• SQL queries on terabytes of data in seconds
• Data is distributed over thousands of nodes
• Each node processes one part of the dataset
• Thousands of nodes work for us for a few
milliseconds
SELECT year,
       SUM(mother_age * record_weight) / SUM(record_weight) AS age
FROM publicdata:samples.natality
WHERE ever_born = 1
GROUP BY year
ORDER BY year;
22. CLOUDERA IMPALA
• Same idea as BigQuery, built on top of Hadoop
• Standard SQL on Big Data
• On a cluster costing about 10 million HUF, terabytes
of data can be analyzed interactively
• Scales to thousands of nodes
• Technical highlights
• Run-time code generation with LLVM
• Parquet format (column-oriented)
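The column-oriented idea behind Parquet can be shown with a toy sketch (plain Python with made-up records, not the Parquet API): storing each column contiguously lets an analytical query scan only the columns it touches.

```python
# Toy comparison of row-oriented vs column-oriented layout.
# Row-oriented: one record per entry (what a classic RDBMS stores).
rows = [
    {"year": 2011, "method": "GET",  "bytes": 22957},
    {"year": 2011, "method": "GET",  "bytes": 2957},
    {"year": 2011, "method": "POST", "bytes": 4353},
]

# Column-oriented: one list per column; repetitive columns like
# `year` also compress very well in this layout.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# A query like SUM(bytes) scans only the `bytes` column and never
# reads `year` or `method` from storage.
total = sum(columns["bytes"])
print(total)
```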
23. APACHE SPARK
• UC Berkeley
• Achieves up to a 100× speedup over Hadoop on
certain tasks
• In-memory computation across the cluster
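Spark's edge on iterative workloads comes from parsing the dataset once and keeping it in cluster memory, instead of re-reading it from disk on every pass as classic MapReduce does. A minimal single-machine sketch of that idea (plain Python, not the Spark API; the iterative "algorithm" is a made-up example):

```python
def parse(raw):
    # Expensive step standing in for disk I/O + deserialization.
    return [int(x) for x in raw]

raw_data = ["1", "2", "3", "4"]

# "Cached" approach: parse once, then run many iterations in memory.
cached = parse(raw_data)

threshold = 0
results = []
for _ in range(3):  # three iterations of some iterative algorithm
    threshold = sum(x for x in cached if x > threshold) / len(cached)
    results.append(threshold)

# Without caching, parse(raw_data) would run once per iteration.
print(results)
```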
24. INEFFICIENCY CAN WASTE A
HUGE AMOUNT OF RESOURCES
• A 300-node cluster running Hadoop + Hive performs
about the same as a single node running Vectorwise
• Vectorwise holds the world speed record for
analytical database queries on a single node
25. CLEVER WAYS TO IMPROVE
EFFICIENCY
• Lossless data compression (up to 50×!)
• Clever lossy compression of data (e.g., OLAP
cubes)
• Cache-aware implementations (asymmetric hardware
trends make memory access the bottleneck)
26. LOSSLESS DATA COMPRESSION
• Compression can boost sequential data access
by up to 50×! (100 MB/sec -> 5 GB/sec)
• Less data -> fewer I/O operations
• One CPU core can decompress data at up to 5 GB/sec
• gzip decompression is very slow
• snappy, LZO, and LZ4 reach about 1 GB/sec
decompression speed
• Light-weight schemes used by column-oriented
databases (e.g., PFOR) can reach 5 GB/sec
• That is two billion integers per second (almost one
integer per clock cycle!!!)
27. EXAMPLE: LOGDRILL
Raw access-log lines:
2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
Aggregated per-minute counts (minute, method, status, count):
2011-01-08 00:00 GET 200 2
2011-01-08 00:01 GET 200 2
2011-01-08 00:02 GET 404 1
2011-01-08 00:02 POST 200 1
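The roll-up above, raw log lines aggregated to per-minute (method, status) counts, can be sketched in plain Python (field positions assumed from the sample lines):

```python
from collections import Counter

lines = [
    "2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562",
    "2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321",
    "2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522",
    "2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425",
    "2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432",
    "2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134",
]

counts = Counter()
for line in lines:
    f = line.split()
    minute = f[0] + " " + f[1][:5]      # truncate the time to the minute
    counts[(minute, f[5], f[11])] += 1  # key: (minute, method, status)

for (minute, method, status), n in sorted(counts.items()):
    print(minute, method, status, n)
```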
28. CACHE-AWARE PROGRAMMING
• CPU speed has increased by about 60% a year
• Memory speed has increased by only 10% a year
• The widening gap is bridged with multi-level
cache memories
• The cache is under-exploited: exploiting it well can
bring a 100× speedup!!!
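A classic cache-aware technique is blocking (tiling): processing data in small tiles whose working set fits in cache. The sketch below shows the access pattern for a blocked matrix transpose in plain Python; the actual speedup only materializes in a compiled language where memory access dominates.

```python
# Blocked (tiled) transpose: visit the matrix in B x B tiles so the
# source rows and destination rows of each tile stay resident in cache.

def transpose_blocked(a, B=2):
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i0 in range(0, n, B):            # tile row start
        for j0 in range(0, n, B):        # tile column start
            for i in range(i0, min(i0 + B, n)):
                for j in range(j0, min(j0 + B, n)):
                    out[j][i] = a[i][j]
    return out

m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(transpose_blocked(m))
```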
29. LESSONS LEARNED
• Big Data is not just hype, at least from a
technological viewpoint
• Modern technologies (Impala, Spark) can approach
the theoretical limits of the cluster hardware
configuration
• A deep understanding of both the problem and
the technologies is required to build
efficient Big Data solutions