51. Firehose
Engineering
designing
high-volume
data collection
systems
Josh Berkus
HiLoad, Moscow
October 2011
I created this talk because, for some reason, I've been
working on a bunch of the same kind of data systems
lately.
52. Firehose Database
Applications (FDA)
(1) very high volume of data
input from many producers
(2) continuous processing of
incoming data
I call these “Firehose Database Applications”. They're
characterized by two factors, and have a lot of things
in common regardless of which industry they're used
in.
1. These applications are focused entirely on the
collection of large amounts of data coming in from
many high-volume producers, and
2. they process data into some more useful form, such
as summaries, continuously 24 hours a day.
Some examples ...
53. Mozilla Socorro
Mozilla's Socorro system, which collects Firefox &
Thunderbird crash data from around the world.
54. Upwind
Upwind, which collects a huge amount of metrics from
large wind turbine farms in order to monitor and
analyze them.
55. Fraud Detection System
And numerous other systems, such as a fraud
detection system which receives data from over 100
different transaction processing systems and
analyzes it to look for patterns which would indicate
fraud.
56. Firehose Challenges
regardless of how these firehose systems are used,
they share the same four challenges, all of which
need to be overcome in some way to get the system
running and keep it running.
57. 1. Volume
● 100s to 1000s of facts/second
● GB/hour
The first and most obvious is volume. A firehose
system is collecting many, many facts per second,
which may add up to gigabytes per hour.
58. 1. Volume
● spikes in volume
● multiple uncoordinated sources
You also have to plan for unpredicted spikes in volume
which may be 10X normal.
Also your data sources may come in at different rates,
and with interruptions or out-of-order data.
But, the biggest challenge of volume always is ...
59. 1. Volume
volume always grows over time
… there is always, always, always more of it than there
used to be.
60. 2. Constant flow
since data arrives 24/7 …
while the user interface can be
down, data collection can never
be down
the second challenge is dealing with the all-day, all-
week, all-year nature of the incoming data.
for example, it is often permissible for the user
interface or reporting system to be out of service, but
data collection must go on at all times without
interruption. Otherwise, data would be lost.
this makes for different downtime plans than standard
web applications, which can have “read-only
downtimes”. Firehose systems can't; they must
arrange for alternate storage if the primary
collection is down.
61. 2. Constant flow
● can't stop receiving to process
● data can arrive out of order
it also means that conventional Extract Transform Load
(ETL) approaches to data warehousing don't work.
You can't wait for the down period to process the
data. Instead, data must be processed continuously
while still collecting.
Data is also never at rest; you must be prepared for
interruptions and out-of-order data.
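As a minimal sketch of the alternative (illustrative
only, not any project's actual code): process small,
fixed time windows continuously, and close each window
only after a lag allowance, so that late or
out-of-order data still lands in a window that hasn't
been processed yet.

import datetime

WINDOW = datetime.timedelta(minutes=1)
LAG = datetime.timedelta(minutes=5)  # allowance for late/out-of-order arrivals

def advance(high_water_mark, now, process_window):
    # Process the next window only once it is safely in
    # the past; collection keeps writing newer data meanwhile.
    window_end = high_water_mark + WINDOW
    if window_end + LAG <= now:
        process_window(high_water_mark, window_end)
        return window_end        # move the high-water mark forward
    return high_water_mark       # window still open; try again later

Called in a loop, this keeps processing caught up to
within WINDOW + LAG of real time instead of waiting
for a nightly batch.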
62. 3. Database size
● terabytes to petabytes
● lots of hardware
● single-node DBMSes aren't enough
● difficult backups, redundancy,
migration
● analytics are resource-intensive
another obvious challenge is the sheer size of the data.
once you start accumulating data 24x7, you will rapidly
find yourself dealing with terabytes or even petabytes
of data. This requires you to deal with lots of
hardware and large clustered systems or server
farms, including clustered or distributed databases.
this makes backups and redundancy either very
difficult or very expensive.
performing analytics on the data and reporting on it
becomes very resource-intensive and slow at large
data sizes.
63. 3. Database size
● database growth
● size grows quickly
● need to expand storage
● estimate target data size
● create data ageing policies
the database is also growing rapidly, which means that
solutions which work today might not work next
month. You need to estimate the target size of your
data and engineer for that, not the amount of data
you have now.
you will also have to create a data ageing or expiration
policy so that you can eventually purge data and limit
the database to some finite size. this is the hardest
thing to do because ...
64. 3. Database size
“We will decide on a data
retention policy when we run out
of disk space.”
– every business user everywhere
… users never ever ever want to “throw away” data.
No matter how old it is or how infrequently it's
accessed.
This isn't special to firehose systems except that they
accumulate data faster than any other kind of
system.
65. 4. [Socorro architecture diagram]
problem #4 is illustrated by this architecture diagram of
the Socorro system. Firehose systems have lots and
lots of components, often using dozens of servers
and hundreds of application instances. In addition to
cost and administrative overhead, what having that
much stuff means is ...
66. many components
= many failures
… every day of the week something is going to be
crashing on you. You simply can't run a system with
90+ servers and expect them to be up all the time.
67. 4. Component failure
● all components fail
● or need scheduled downtime
● including the network
● collection must continue
● collection & processing must
recover
Even when components don't fail unexpectedly, they
have to be taken down for upgrades, maintenance,
replacement, or migration to new locations and
software. But the data collection and processing has
to go on while components are unavailable or offline.
Even when processing has to stop, components
must recover quickly when the system is ready
again, and must do so without losing data.
68. solving firehose problems
So let's talk about how a couple of different teams
solved some of these firehose problems.
69. socorro project
First we're going to talk about Mozilla's Socorro
system. I'll bet a lot of you have seen this Firefox
crash screen. The data it produces gets sent to the
socorro system at Mozilla, where it's collected and
processed.
70. http://crash-stats.mozilla.com
At the end of the processing we get pretty reports like
this, which let the mozilla developers know when
releases are stable and ready, and where frequent
crashing and hanging issues are. You can visit it
right now: It's http://crash-stats.mozilla.com
71. [crash-stats screenshot]
There's quite a rich set of data there collecting
everything about every aspect of what's making
Firefox and other Mozilla products crash, individually
and in the aggregate.
72. Mozilla Socorro
[diagram labels: webservers, collectors, processors, reports]
this is a simplified architecture diagram showing the
very basic data flow of Socorro.
1. Crashes come in through the internet,
2. are harvested by a cluster of collectors
3. raw crashes are stored in Hbase
4. a cluster of processors pull raw crashes from Hbase,
process them, and write processed crashes and
metadata to Hbase and Postgres
5. Postgres populates views which are used to support
the web interface and reports used internally by
mozilla developers.
73. [full architecture diagram]
a full architecture diagram would look like this one, but
even this doesn't show all components.
74. socorro data volume
● 3000 crashes/minute
● avg. size 150K
● 40TB accumulated raw data
● 500GB accumulated metadata /
reports
socorro receives up to 3000 crashes every minute from
around the world. each of these crashes comes with
50K to 1000K of binary crash data which needs to be
processed.
within the target data lifetime of 6-12 months, socorro
has accumulated nearly 40 terabytes of raw data and
half a terabyte of metadata, views and reports.
75. dealing with volume
[diagram: load balancers → collectors]
how does Mozilla deal with this large data volume? In
one word:
parallelism
1. crashes come in from the internet
2. they hit a set of round-robin load balancers
3. the load balancers randomly assign them to one of
six crash collectors
4. the crash collectors receive the crash information,
and then write the raw crash to a 30-server Hbase
cluster.
76. dealing with volume
[diagram: monitor and processors]
5. the raw crashes in HBase are pulled out in parallel
by twelve processor servers running 10 processor
instances each.
6. these processors do a lot of work to process the
binary crashes and then write the processed crash
results to Hbase an the processed crash metadata
and summary data to Postgres.
7. for scheduling and preventing duplication, the
processors are coordinated by a monitor server,
which runs a queue and monitoring system backed
by Postgres.
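The talk doesn't show the monitor's internals, but here
is a minimal sketch of the general technique: parallel
processors atomically claiming work from a
Postgres-backed queue. Table and column names are
invented, and FOR UPDATE SKIP LOCKED is a modern
PostgreSQL feature that postdates this 2011 talk.

import psycopg2

CLAIM_ONE = """
UPDATE crash_queue
   SET claimed_by = %s, claimed_at = now()
 WHERE id = (SELECT id
               FROM crash_queue
              WHERE claimed_by IS NULL
              ORDER BY queued_at
              LIMIT 1
                FOR UPDATE SKIP LOCKED)
RETURNING id, raw_crash_key
"""

def claim_next_crash(conn, processor_name):
    # Atomically claim one unprocessed crash; returns None
    # when the queue is empty. SKIP LOCKED lets many
    # processors poll concurrently without blocking or
    # double-claiming.
    with conn, conn.cursor() as cur:
        cur.execute(CLAIM_ONE, (processor_name,))
        return cur.fetchone()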
77. dealing with size
[diagram: HBase holds raw data, 40TB, expandable;
PostgreSQL holds metadata and views, 500GB, fixed size]
Why have both Hbase and PostgreSQL? Isn't one
database enough?
Well, the 30-server Hbase cluster holds 40 terabytes of
data. It's good at doing this in a redundant, scalable
way and is very fast for pulling individual crashes out
of the billion or so which are stored there.
However Hbase is very poor at ad-hoc queries, or
simple browsing of data over the web. It also often
requires downtime for maintenance.
So we use Postgres which holds metadata and holds
summary “materialized views” which are used by the
web application and the reports to present analytics
to users.
Postgres can't store the raw data though because it
can't easily expand to the sizes needed.
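To make that division of labor concrete, here is a
hedged sketch of the two access patterns, using
happybase and psycopg2 as client libraries, with
invented hostnames and a simplified schema; Socorro's
real code differs.

import happybase
import psycopg2

# HBase is built for this: fetch one raw crash by key
# out of billions, quickly and redundantly.
hbase = happybase.Connection("hbase-gateway.example.com")
raw_crash = hbase.table("crash_reports").row(b"20111024-example-crash-id")

# Postgres is built for this: ad-hoc analytics over the
# much smaller metadata.
pg = psycopg2.connect("dbname=breakpad")
cur = pg.cursor()
cur.execute("""
    SELECT product, version, count(*) AS crashes
      FROM reports
     WHERE date_processed > now() - interval '7 days'
     GROUP BY product, version
     ORDER BY crashes DESC
     LIMIT 20
""")
top_crashers = cur.fetchall()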
78. dealing with
component failure
Lots of hardware ...
● 30 Hbase nodes
● 2 PostgreSQL servers
● 6 load balancers
● 3 ES servers
● 6 collectors
● 12 processors
● 8 middleware & web servers
… lots of failures
Of course, with the more than 60 servers we're dealing
with, something is always going down. That's simply
reality. Some of the worst failures have involved the
entire network going out and all components losing
communication with each other.
This means that we need to be prepared to have
failures and to recover from them. How do we do
that?
79. load balancing &
redundancy
[diagram: load balancers → collectors]
One obvious way is to have lots of redundant
components for each role in the application. As we
already saw, for incoming data there is more than
one load balancer and there are 6 collectors. This means
that we can lose several machines and still be
receiving user data, and that in a catastrophic failure
we only lose a minority of the data.
80. elastic connections
● components queue their data
● retain it if other nodes are down
● components resume work
automatically
● when other nodes come back up
A more important strategy we've learned through hard
experience is the requirement to have “elastic
connections”. That is, each component is prepared
to lose communication with the components on either
side of it. They queue data through several
mechanisms, and resume work when the next
component in line comes back up.
81. elastic connections
[diagram, collector internals: receiver → local file
queue → crashmover]
For example, let's look at the collectors. They receive
data from the web and write to Hbase. But it's more
complex than that.
1. The collectors can lose communication to Hbase for
a variety of reasons. It can be down for
maintenance, or there can be network issues.
2. The crashes are received by receiver processes
3. which write them to local file storage in a primitive,
filename-based queue.
4. A separate “crashmover” process looks for files in
the queue
5. When Hbase is available, the crashmover saves the
files to Hbase.
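A minimal sketch of that receiver/crashmover split
(paths and names are hypothetical; the real collector
is more elaborate):

import os
import time
import uuid

QUEUE_DIR = "/var/spool/collector"  # hypothetical local queue directory

def receive_crash(raw_bytes):
    # Receiver: always succeeds locally, even when HBase
    # is unreachable.
    crash_id = uuid.uuid4().hex
    tmp = os.path.join(QUEUE_DIR, crash_id + ".tmp")
    with open(tmp, "wb") as f:
        f.write(raw_bytes)
    # atomic rename: the mover only ever sees complete files
    os.rename(tmp, os.path.join(QUEUE_DIR, crash_id + ".crash"))

def crashmover(store_crash):
    # Mover: drains the queue whenever the downstream
    # store is reachable; otherwise data stays safely queued.
    while True:
        for name in sorted(os.listdir(QUEUE_DIR)):
            if not name.endswith(".crash"):
                continue
            path = os.path.join(QUEUE_DIR, name)
            try:
                with open(path, "rb") as f:
                    store_crash(f.read())  # e.g. write to HBase
            except Exception:
                break                      # store is down; retry later
            os.remove(path)                # delete only after a confirmed write
        time.sleep(1)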
82. server management
● puppet
● controls configuration of all servers
● makes sure servers recover
● allows rapid deployment of
replacement nodes
Another thing which has been essential to managing
component failure is comprehensive server
management through Puppet. Having servers under
central management means that we can make sure
services are up and configured correctly when they
are supposed to be. And it also means that if we
permanently lose nodes we can replace them quickly
with identically configured nodes.
83. Upwind
let's move on to a different team and different methods
of facing firehose challenges. Upwind.
84. Upwind
● speed
● wind speed
● heat
● vibration
● noise
● direction
power-generating wind turbines have onboard
computers which produce a lot of data about the
status and operation of the wind turbine. Upwind
takes that data, collects it, summarizes it, and turns it
into graphical summary reports for customers who
own wind farms.
85. Upwind
1. maximize
power
generation
2. make sure
turbine isn't
damaged
the customers need this data for two reasons:
1. wind turbines are very expensive and they need to
make sure they're maximizing power generation from
them.
2. to intervene and keep the turbines from catching fire
or otherwise becoming permanently damaged.
86. dealing with volume
each turbine:
90 to 700 facts/second
windmills per farm: up to 100
number of farms: 40+
est. total: 300,000 facts/second
(will grow)
turbines come with a lot of different monitors, between
90 and 700 of them depending on the model. each
of these monitors produces a reading every second.
an individual “wind farm” can have up to 100 turbines,
and Upwind expects more than 40 farms to make
use of the service once it's fully deployed. Plus they
want to get more customers.
this means an estimated 300,000 metrics per second
total by next year.
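A back-of-the-envelope check (my arithmetic, not
Upwind's):

# even at the low end of 90 facts/second/turbine:
print(90 * 100 * 40)  # 360,000 facts/second with 40 full farms
# so the quoted 300,000/second estimate is consistent
# with farms averaging somewhat fewer than 100 turbines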
87. dealing with volume
[diagram, per-farm toolchain: local storage →
historian → analytic database → reports, duplicated
for each wind farm]
Upwind chose a very different route than Mozilla for
dealing with this volume. Let's explain the data flow
first:
1. Wind farms write metrics to local storage, which can
hold data for a couple hours.
2. This data is then read by an industry-specific
application called the historian which interprets the
raw data and stores each second of data.
3. that data is then read from the historian by Postgres
4. which produces nice summaries and analytics for
the reports.
As it turns out, data is not shared between wind farms
most of the time. So Upwind can scale simply by
adding a whole additional service toolchain for each
wind farm it brings online.
88. dealing with volume
[diagram: two farm toolchains (local storage →
historian → analytic database) feeding a master
database]
In the limited cases where we need to do inter-farm
reports, we will have a master Postgres node which
will use pl/proxy and/or foreign data wrappers to
summarize data from multiple wind farms.
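The talk predates foreign data wrappers for
Postgres-to-Postgres queries (postgres_fdw shipped in
PostgreSQL 9.3), but a hedged sketch of that half of
the plan might look like this, with one foreign table
per farm. All names are invented.

import psycopg2

FDW_SETUP = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER farm_01 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'farm01.internal', dbname 'metrics');

CREATE USER MAPPING FOR CURRENT_USER SERVER farm_01
    OPTIONS (user 'reporter', password 'secret');

CREATE FOREIGN TABLE farm_01_daily (
    turbine_id int,
    day        date,
    avg_output numeric
) SERVER farm_01 OPTIONS (table_name 'readings_daily');
-- repeat the SERVER / FOREIGN TABLE definitions per farm
"""

# an inter-farm report then just UNIONs the per-farm tables
CROSS_FARM = """
SELECT day, avg(avg_output) AS fleet_avg
  FROM (SELECT day, avg_output FROM farm_01_daily
        UNION ALL
        SELECT day, avg_output FROM farm_02_daily) AS all_farms
 GROUP BY day
 ORDER BY day
"""

conn = psycopg2.connect("dbname=master")
with conn, conn.cursor() as cur:
    cur.execute(FDW_SETUP)
    cur.execute(CROSS_FARM)
    report = cur.fetchall()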
89. multi-tenant partitioning
● partition the whole application
● each customer gets their own
toolchain
● allows scaling with the
number of customers
● lowers efficiency
● more efficient with virtualization
this is what's known as multi-tenant partitioning, where
you give each customer, or “tenant”, a full stack of
their own. It's not very resource-efficient, but it does
easily scale as you add customers, and eliminates
the need to use complex clustered storage or
databases.
cloud hosting and virtualization are also making this
strategy easier to manage.
90. dealing with:
constant flow and size
[diagram: historian → buffer → rollup tables for
minutes, hours, days, months, years, shown as a
progressive build]
In order to deal with the constant flow of data and the
large database size, we continously produce a series
of time-based summaries of the data:
1. a buffer holding several minutes of data. this also
accounts for lag and out-of-order data.
2. hourly summaries
3. daily summaries
4. monthly summaries
5. annual summaries
91. time-based rollups
● continuously accumulate
levels of rollup
● each is based on the level below it
● data is always appended, never
updated
● small windows == small resources
For efficiency, each level builds on the level beneath it.
This means that no analytic operation ever has to
work with a very large set, and so can be very fast.
Even better, the time-based summaries can be run in
parallel, so that we can make maximum use of the
processors on the server.
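A minimal sketch of one such step, under invented table
names (the talk doesn't show Upwind's schema): the
hourly level is appended once per closed hour, built
only from the minute-level buffer beneath it.

import psycopg2

HOURLY_ROLLUP = """
INSERT INTO readings_hourly (turbine_id, hour, avg_value, max_value, samples)
SELECT turbine_id,
       date_trunc('hour', read_at),
       avg(value), max(value), count(*)
  FROM readings_buffer
 WHERE read_at >= %(start)s AND read_at < %(end)s
 GROUP BY turbine_id, date_trunc('hour', read_at)
"""

def roll_up_hour(conn, start, end):
    # Append one closed hour. Rows are never updated
    # afterwards, so this can run alongside collection
    # and the daily/monthly rollups.
    with conn, conn.cursor() as cur:
        cur.execute(HOURLY_ROLLUP, {"start": start, "end": end})

The daily rollup reads readings_hourly the same way,
and so on up the stack, so each step only ever touches
a small window of rows.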
92. time-based rollups
● allows:
● very rapid summary reports for
different windows
● retaining different summaries for
different levels of time
● batch/out-of-order processing
● summarization in parallel
this allows us to have a set of “views” which
summarize the data at the time windows users look
at. This means almost “instant” reports for users.
93. firehose tips
based on my experiences with several of these
firehose projects, here are some general rules for
building this kind of application
94. data collection must be:
● continuous
● parallel
● fault-tolerant
data collection has to be 24x7. It must be done in
parallel by multiple processes/servers, and above all
must be fault-tolerant.
95. data processing must be:
● continuous
● parallel
● fault-tolerant
data processing has to be built the same way.
96. every component
must be able to fail
● including the network
● without too much data loss
● other components must
continue
the lesson most people don't seem to be prepared for
is that every component must be able to fail without
permanently downing the system, or causing
unacceptable data loss.
97. 5 tools to use
1. queueing software
2. buffering techniques
3. materialized views
4. configuration management
5. comprehensive monitoring
here are some classes of tools you should be using in
any firehose application.
queuing and buffering for fault-tolerance
materialized views to deal with volume and size
configuration management and comprehensive
monitoring to keep the system up.
98. 4 don'ts
1. use cutting-edge technology
2. use untested hardware
3. run components to capacity
4. do hot patching
some things you will be tempted to do and shouldn't
* don't use cutting edge hardware or software because
it's not reliable and will lead to repeated
unacceptable downtimes
* don't put hardware into service without testing it first
because it will be unreliable or perform poorly,
dragging the system down
* never run components or layers of your stack at
capacity, because then you have no headroom for
when you have a surge in data
* and never do hot patching of production services
because you can't manage it and it will lead to
mistakes and taking the system down.
99. firehose mastered?
so now that we know how to do it, it's time for you to
build your own firehose applications. it's challenging
and fun.
100. Contact
● Josh Berkus: josh@pgexperts.com
● blog: blogs.ittoolbox.com/database/soup
● PostgreSQL: www.postgresql.org
● pgexperts: www.pgexperts.com
● Upcoming Events
● PostgreSQL Europe: http://2011.pgconf.eu/
● PostgreSQL Italy: http://2011.pgday.it/
The text and diagrams in this talk are copyright 2011 Josh Berkus and are licensed under the Creative
Commons Attribution license. Title slide image is licensed from iStockPhoto and may not be
reproduced or redistributed. Socorro images are copyright 2011 Mozilla Inc.
contact and copyright information.
Questions?