2. AGENDA
• What is Big Data?
• Why do we have to talk about it?
• Paradigm shift in information management
• Technology and efficiency
3. WHAT IS BIG DATA?
• Data volume that cannot be handled by traditional
solutions (e.g., relational databases)
• More than 100 million data rows, typically
multi-billion
5. GLOBAL RATE OF DATA
PRODUCTION (PER SECOND)
• 30 TB/sec (22000 films)
• Digital media
• 2 hours of YouTube video
• Communication
• 3000 business emails
• 300000 SMS
• Web
• Half a million page views
• Logs
• Billions
8. WHY NOW?
• Long-term trends
• The size of stored data has doubled every 40 months
since the 1980s
• Moore's law: the number of transistors on integrated
circuits doubles every 18 months
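Compounding these doubling rates makes the gap between data growth and hardware growth concrete; a quick back-of-the-envelope sketch (plain Python, figures taken from the rates above):

```python
# Growth factor for a quantity that doubles every `period` months,
# observed after `months` months: 2 ** (months / period).

def growth(months, period):
    return 2 ** (months / period)

# Stored data (doubles every 40 months) over one decade (120 months):
data_growth = growth(120, 40)        # 2^3 = 8x

# Transistor counts (Moore's law, 18 months) over the same decade:
transistor_growth = growth(120, 18)  # 2^(120/18) ≈ 102x

print(data_growth, round(transistor_growth))
```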
13. GOOGLE’S HARDWARE IN 2013
• 12 data centers worldwide
• More than a million nodes
• A data center costs $600 million to build
• Oregon data center
• 15,000 m²
• consumes as much power as 30,000 homes
14. GOOGLE’S HARDWARE IN 2013
• Cheap commodity hardware
• each has its own battery!
• Modular data centers
• Standard container
• 1160 servers per container
• Efficiency: only 11% overhead
(power conversion, cooling)
19. HADOOP
• Who uses Hadoop?
• Facebook: 100 PB
• Yahoo: 4000 nodes
• More than half of Fortune 50 companies!
• History
• Replica of Google architecture (GFS, BigTable) in
Java under Apache licence
• Hadoop 2.0
• Full High Availability
• Advanced resource management (YARN)
20. GOOGLE BIGQUERY
• SQL queries on terabytes of data in seconds
• Data is distributed over thousands of nodes
• Each node processes one part of the dataset
• Thousands of nodes work for us for a few
milliseconds
SELECT year,
       SUM(mother_age * record_weight) / SUM(record_weight) AS age
FROM publicdata:samples.natality
WHERE ever_born = 1
GROUP BY year
ORDER BY year;
22. CLOUDERA IMPALA
• Same idea as BigQuery, built on top of Hadoop
• Standard SQL on Big Data
• On a cluster costing about 10 million HUF, terabytes
of data can be analyzed interactively
• Scales to thousands of nodes
• Technical highlights
• Run-time code generation with LLVM
• Parquet format (column-oriented)
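The column-oriented idea behind Parquet can be shown with a toy sketch (plain Python with made-up records, not the Parquet API): storing each column contiguously lets an analytical query scan only the columns it touches.

```python
# Toy comparison of row-oriented vs column-oriented layout.
# Row-oriented: one record per entry (what a classic RDBMS stores).
rows = [
    {"year": 2011, "method": "GET",  "bytes": 22957},
    {"year": 2011, "method": "GET",  "bytes": 2957},
    {"year": 2011, "method": "POST", "bytes": 4353},
]

# Column-oriented: one list per column; repetitive columns like
# `year` also compress very well in this layout.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# A query like SUM(bytes) scans only the `bytes` column and never
# reads `year` or `method` from storage.
total = sum(columns["bytes"])
print(total)
```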
23. APACHE SPARK
• UC Berkeley
• Achieves up to a 100× speedup over Hadoop on
certain tasks
• In-memory computation across the cluster
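Spark's edge on iterative workloads comes from parsing the dataset once and keeping it in cluster memory, instead of re-reading it from disk on every pass as classic MapReduce does. A minimal single-machine sketch of that idea (plain Python, not the Spark API; the iterative "algorithm" is a made-up example):

```python
def parse(raw):
    # Expensive step standing in for disk I/O + deserialization.
    return [int(x) for x in raw]

raw_data = ["1", "2", "3", "4"]

# "Cached" approach: parse once, then run many iterations in memory.
cached = parse(raw_data)

threshold = 0
results = []
for _ in range(3):  # three iterations of some iterative algorithm
    threshold = sum(x for x in cached if x > threshold) / len(cached)
    results.append(threshold)

# Without caching, parse(raw_data) would run once per iteration.
print(results)
```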
24. INEFFICIENCY CAN WASTE A
HUGE AMOUNT OF RESOURCES
• A 300-node cluster running Hadoop + Hive performs
about the same as a single node running Vectorwise
• Vectorwise holds the world speed record for
analytical database queries on a single node
25. CLEVER WAYS TO IMPROVE
EFFICIENCY
• Lossless data compression (up to 50×!)
• Clever lossy compression of data (e.g., OLAP
cubes)
• Cache-aware implementations (asymmetric hardware
trends make memory access the bottleneck)
26. LOSSLESS DATA COMPRESSION
• Compression can boost sequential data access
by up to 50×! (100 MB/sec -> 5 GB/sec)
• Less data -> fewer I/O operations
• One CPU core can decompress data at up to 5 GB/sec
• gzip decompression is very slow
• snappy, LZO, and LZ4 reach about 1 GB/sec
decompression speed
• Light-weight schemes used by column-oriented
databases (e.g., PFOR) can reach 5 GB/sec
• That is two billion integers per second (almost one
integer per clock cycle!!!)
27. EXAMPLE: LOGDRILL
Raw access-log lines:
2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
Aggregated per-minute counts (minute, method, status, count):
2011-01-08 00:00 GET 200 2
2011-01-08 00:01 GET 200 2
2011-01-08 00:02 GET 404 1
2011-01-08 00:02 POST 200 1
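The roll-up above, raw log lines aggregated to per-minute (method, status) counts, can be sketched in plain Python (field positions assumed from the sample lines):

```python
from collections import Counter

lines = [
    "2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562",
    "2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321",
    "2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522",
    "2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425",
    "2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432",
    "2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134",
]

counts = Counter()
for line in lines:
    f = line.split()
    minute = f[0] + " " + f[1][:5]      # truncate the time to the minute
    counts[(minute, f[5], f[11])] += 1  # key: (minute, method, status)

for (minute, method, status), n in sorted(counts.items()):
    print(minute, method, status, n)
```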
28. CACHE-AWARE PROGRAMMING
• CPU speed has increased by about 60% a year
• Memory speed has increased by only 10% a year
• The widening gap is bridged with multi-level
cache memories
• The cache is under-exploited: exploiting it well can
bring a 100× speedup!!!
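A classic cache-aware technique is blocking (tiling): processing data in small tiles whose working set fits in cache. The sketch below shows the access pattern for a blocked matrix transpose in plain Python; the actual speedup only materializes in a compiled language where memory access dominates.

```python
# Blocked (tiled) transpose: visit the matrix in B x B tiles so the
# source rows and destination rows of each tile stay resident in cache.

def transpose_blocked(a, B=2):
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i0 in range(0, n, B):            # tile row start
        for j0 in range(0, n, B):        # tile column start
            for i in range(i0, min(i0 + B, n)):
                for j in range(j0, min(j0 + B, n)):
                    out[j][i] = a[i][j]
    return out

m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(transpose_blocked(m))
```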
29. LESSONS LEARNED
• Big Data is not just hype, at least from a
technological viewpoint
• Modern technologies (Impala, Spark) can approach
the theoretical limits of the cluster hardware
configuration
• A deep understanding of both the problem and
the technologies is required to build
efficient Big Data solutions