Definition of Big Data
"Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate."
"Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information."
Data is growing far faster than computation speeds
A single machine can no longer process or even store all this data!
The Big Data problem
Where does Big Data come from?
Online recorded content:
.. everything that happens online can potentially be recorded
User generated content (Facebook, Twitter, Instagram, etc)
Smartphone users reach for their phones about 150 times a day (2013)
Health and scientific computing
The Large Hadron Collider produces roughly twice as much data per year as Twitter
Internet of Things (IoT)
smart thermostat systems
automobiles with built-in sensors
all kinds of "smart" devices of various sizes
Example scales of Big Data
EIR communication logs: 1.4 TB / day
Facebook logs: 60 TB / day
Google total web index: ~10+ PB (10,000 TB)
Facebook total data: 300 PB with an incoming rate of 600 TB / day (2014)
..as a reminder..
time to read 1 TB from disk: ~3 hours (at 100 MB/s)
Google's web index, read serially from a single disk, would take ~3.4 years
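The figures above follow from simple arithmetic; a quick sanity check (using an exactly 10 PB index and 100 MB/s sequential reads, both illustrative round numbers):

```python
# Back-of-the-envelope check of the read times above.
TB = 10**12               # bytes
disk_speed = 100 * 10**6  # 100 MB/s sequential read

seconds_per_tb = TB / disk_speed           # 10,000 s
hours_per_tb = seconds_per_tb / 3600       # ~2.8 hours

index_bytes = 10 * 10**15                  # ~10 PB web index
years = index_bytes / disk_speed / (3600 * 24 * 365)

print(round(hours_per_tb, 1))  # 2.8
print(round(years, 1))         # 3.2
```

With a "10+" PB index the serial read time lands in the ~3-3.4 year range quoted above.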
Let’s design a simple web tracker from scratch
Register and count each page view for a number of clients
“Keep simple things simple”
Huge number of page views => massive DB load on concurrent updates => DB
timeouts => FAIL
Why write each count?!
Let’s introduce a queue and buffer updates
# of page views and # of clients keep increasing => DB overload => FAIL
The bottleneck is the write-heavy DB
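The buffering idea above can be sketched as follows (names like `flush` and the batch size of 100 are illustrative, not a real framework): accumulate counts in memory and write them out in batches, so many small updates become one bulk write.

```python
# Buffer page views in memory; flush to the DB in batches.
from collections import Counter

buffer = Counter()
BATCH = 100      # flush after this many buffered views (illustrative)
db = Counter()   # stand-in for the database

def flush():
    db.update(buffer)  # one bulk write instead of many single updates
    buffer.clear()

def track(client_id, url):
    buffer[(client_id, url)] += 1
    if sum(buffer.values()) >= BATCH:
        flush()

for _ in range(250):
    track("acme", "/home")
flush()  # drain the remainder
print(db[("acme", "/home")])  # 250
```

Batching relieves the DB for a while, but as the slide notes, sustained growth still overloads a single write-heavy database.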
Let’s shard the database!
Have to keep adding new servers and re-sharding existing databases
Re-sharding online is tricky (maybe introduce pending queues?)
A single code failure corrupts a huge set of data collected over years
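A minimal hash-based shard router illustrates why re-sharding is painful (the key format and shard counts are hypothetical): growing from 4 to 5 shards remaps most keys, so nearly all rows have to move while the system stays online.

```python
# Naive modulo-hash shard routing: changing the shard count moves most keys.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards

keys = [f"client-{i}" for i in range(1000)]
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(moved)  # roughly 800 of 1000 keys change shard
```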
Is there a way out?
We need new tools which handle:
automatic sharding and re-sharding
automatic replication and rebalancing
effortless horizontal scaling
But we need to adapt ourselves as well. We need:
a new definition of “data” (data ≠ information)
new architectures (Lambda Architecture)
immutable data (for scaling and fault tolerance)
functional programming concepts
No, writing 25-year-old-style structured code in this year's favorite language won't cut it anymore
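The immutability point can be sketched concretely (names like `record` and `page_view_counts` are illustrative): store raw events append-only, and derive the counts as a pure function of the log. A bug in the view logic never corrupts the source data; you fix the function and recompute, which is the core idea behind the Lambda Architecture's batch layer.

```python
# Immutable event log + derived view: events are only ever appended.
from collections import Counter

event_log = []  # append-only; events are never updated or deleted

def record(client_id, url, ts):
    event_log.append((client_id, url, ts))  # the only write ever performed

def page_view_counts(log):
    """Batch view: a pure function from the full log to the derived table."""
    return Counter((c, u) for c, u, _ in log)

record("acme", "/home", 1)
record("acme", "/home", 2)
record("beta", "/docs", 3)
views = page_view_counts(event_log)
print(views[("acme", "/home")])  # 2
```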
Big Data tooling
Apache Hadoop distributed filesystem (HDFS)
Distributed, scalable, portable filesystem written in Java
Open source, 10 years old (!) project
Handles files in the gigabytes-terabytes range
Manages automatic replication and rebalancing of data
Facebook had 21 PB of storage on HDFS in 2010
Yahoo had a cluster of 10 000 Hadoop nodes in 2008
Apache Spark
Next-generation data processing engine written in Scala
Open source, 5 years old project
Up to 100 times faster than Hadoop MapReduce
Uses functional programming techniques to process data
Can scale down to run in an IDE!
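The functional style Spark popularized can be imitated with plain Python builtins (this is not the PySpark API, just an illustration of the concepts): transform an immutable collection with map-like steps and collapse it with reduce, with no in-place mutation.

```python
# Word count in the map/reduce style Spark uses, with Python builtins.
from functools import reduce

lines = ["big data big tools", "data tools"]

words = [w for line in lines for w in line.split()]  # flatMap-like step
pairs = map(lambda w: (w, 1), words)                 # map
counts = reduce(                                     # reduceByKey-like step
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
print(counts["data"])  # 2
```

In Spark the same pipeline runs unchanged on one laptop core or a thousand-node cluster, which is what "scale down to run in an IDE" refers to.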