Percona Live NYC 2012
MySQL/Hadoop Hybrid Datawarehouse
Who are Palomino?
› Bespoke Services: we work with and like you.
› Production Experienced: senior DBAs, admins, engineers.
› 24x7: globally-distributed on-call staff.
› One-Month Contracts: not more.
› Professional Services:
› ETLs,
› Cluster tooling.
› Configuration management (DevOps):
› Chef,
› Puppet,
› Ansible.
› Big Data Cluster Administration (OpsDev)
› MySQL, PostgreSQL,
› Cassandra, HBase,
› MongoDB, Couchbase.
Who am I?
Tim Ellis
CTO/Principal Architect, Palomino
Achievements:
› Palomino Big Data Strategy.
› Datawarehouse Cluster at Riot Games.
› Designed/built back-end for Firefox Sync.
› Led DB team at Digg.com.
› Harassed the Reddit team at a party.
Helped ensure business success for:
› Digg,
› Friendster,
› Mozilla,
› StumbleUpon,
› Riot Games (League of Legends).
What Is This Talk?
Experiences of a High-Volume DBA
I've built high-volume Datawarehouses, but am not
well-versed in traditional Datawarehouse theory. Cube?
Snowflake? Star?
I'll win a bar bet, but would be fired from Oracle.
I've administered high-volume Datawarehouses and
managed a large ETL rollout, but haven't written
extensive ETLs or reports.
By necessity, a high-volume Datawarehouse is designed
differently from a low-volume one: typically simpler
schemas and more complex queries.
Why OSS?
Freedom at Scale == Economical Sense
Selling OSS to Management used to be hard...
› My query tools are limited.
› The business users know DBMSx.
› The documentation is lacking.
...but then terascale happened one day.
› Adding 20TB costs HOW MUCH?!
› Adding 30 machines costs HOW MUCH?!
› How many sales calls before I push the release?
› I'll hire an entire team and still be more efficient.
How to begin?
Take stock of the current system
Establish a data flow:
› Who's sending me data?
› How much?
› What are the bottlenecks?
› What's the current ETL process?
We're looking for typical data flow characteristics:
› Log data, write-mostly, free-form.
› Looks tabular, “select * from table.”
› Size: MB, GB or TB per hour?
› Who queries this data? How often?
What is Hadoop?
The Hadoop Ecosystem
Hadoop Components:
› HDFS: A filesystem across the whole cluster.
› Hadoop: A map/reduce implementation.
› Hive: SQL→Map/Reduce converter.
› HBase: A column store (and more).
Most-interesting bits:
› Hive lets business users formulate SQL!
› HBase provides a distributed column store!
› HDFS provides massive I/O and redundancy.
Should You Use Hadoop?
Hadoop Strengths and Weaknesses
Hadoop/HBase is good for:
› Scanning large chunks of your data on every query.
› Applying a lot of cluster resources to a query.
› Very large datasets: multiple terabytes to petabytes.
› With HBase, a column-store engine.
Where Hadoop/HBase falls short:
› Query iteration time is typically minutes.
› Administration is new and unusual.
› Hadoop is still immature (some say “beta”).
› Documentation is bad or non-existent.
Should You Use MySQL?
MySQL Strengths and Weaknesses
MySQL is good for:
› Smaller datasets, typically gigabytes.
› Indexing data automatically and quickly.
› Short query iteration, even milliseconds.
› Quick dataloads and processing with MyISAM.
Where MySQL falls short:
› It has no column-store engine.
› Documentation for datawarehousing is minimal.
› You probably know better than I. Trust the DBA.
› Be honest with management. If Vertica is better...
MySQL/Hadoop Hybrid
Common Weaknesses
So if you combine the weaknesses of these two
technologies... what have you got?
› No built-in end-user-friendly query tools.
› Immature technology that can crash sometimes.
› Not much documentation.
You'll need buy-in, savvy, and resilience from:
› ETL/Datawarehouse developers,
› Business Users,
› Systems Administrators,
› Management.
Building a Hadoop Cluster
The NameNode
Typical Reasons Clusters Fail:
› Cascading failure (distributed fail)
› Network outage (distributed fail)
› Bad query executed (distributed fail)
› NameNode dies? (single point of failure)
A failing NameNode is not a common case. Still, it's
good to plan for it (see the sketch after this list):
› All critical filesystems on RAID 1+0
› Redundant PSU
› Redundant NICs to independent routers
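One more belt-and-braces option, as a minimal hdfs-site.xml sketch assuming Hadoop 1.x property names: the NameNode writes its metadata to every directory listed in dfs.name.dir, so pointing it at two local volumes plus an NFS mount means no single loss destroys the filesystem image. The paths here are illustrative assumptions.
  <!-- hdfs-site.xml: redundant NameNode metadata directories
       (illustrative paths) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data1/namenode,/data2/namenode,/mnt/nfs/namenode</value>
  </property>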
Basic Cluster Node Configuration
So much for the specialised hardware. All non-NameNode
nodes in your cluster:
› RAID-0 or even JBOD.
› More spindles: linux-1u.net has 8 HDDs in 1U.
› 7200rpm SATA is fine; 15Krpm is overkill.
› Multiple TB of storage. ← lots of this!
› 8-24GB RAM.
› Good/fast network cards!
A DBA thinks “Database” == RAM. Likewise,
“Hadoop Node” == disk spindles, disk storage, and
network. You lose 2-3x of raw storage to data replication.
Network and Rack Layout
Network within a rack (top-of-rack switching):
› Bandwidth for 30 machines going full-tilt.
› Multiple TOR switches for redundancy.
› Consider bridging.
Network between racks (datacentre switching):
› Inter-rack switches: better than 2Gbit is desirable.
› Hadoop rack awareness reduces inter-rack traffic.
You need sharp networking staff on board to help build
the cluster. Network instability can cause crashes.
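Rack awareness is just a script you point Hadoop at. A minimal sketch, assuming the Hadoop 1.x topology.script.file.name property in core-site.xml points at this file; the subnets and rack names are illustrative assumptions:
  #!/bin/sh
  # rack-topology.sh: Hadoop passes node IPs as arguments and
  # expects one rack path per IP on stdout.
  for ip in "$@"; do
    case "$ip" in
      10.1.1.*) echo "/dc1/rack1" ;;
      10.1.2.*) echo "/dc1/rack2" ;;
      *)        echo "/default-rack" ;;
    esac
  done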
Monitoring: Trending and Alerting
Pick your graphing solution and put stats into it. Not sure
which stats to graph? Try all of them.
› Every Hadoop stat exposed via JMX.
› Every HBase stat exposed via JMX.
› All disk, CPU, RAM, network stats.
A possible solution:
› Use collectd's JMX plugin to collect stats.
› Put stats into Graphite.
› Or Ganglia if you know how.
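Whatever collects the stats, Graphite's last mile is trivial: anything that can open a TCP socket can submit a metric to its plaintext listener. A minimal shell sketch; the metric name, value, and Graphite hostname are illustrative assumptions:
  # Send one datapoint ("metric value unix-timestamp") to
  # Graphite's plaintext port (2003).
  echo "hadoop.dn42.jmx.BytesReadPerSec 12345 $(date +%s)" \
    | nc graphite.example.com 2003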
Palomino Cluster Tool
Use Configuration Management to build your cluster:
› Ansible – easiest and quickest.
› Opscode Chef – most popular, must love Ruby.
› Puppet – most mature.
The Palomino Cluster Tool (open source on Github)
uses the above tools to build a cluster for you:
› Pre-written Configuration Management scripts.
› Sets up HDFS, Hadoop, HBase, Monitoring.
› In the future, will also set up alerting and backups.
› Also sets up MySQL+MHA, which may be relevant here.
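For flavor, a minimal Ansible playbook sketch of the kind of step such tools automate. This is not the Palomino Cluster Tool itself, and the host group, package, and file names are illustrative assumptions:
  # site.yml: install and configure DataNodes (illustrative).
  - hosts: datanodes
    user: root
    tasks:
    - name: install the Hadoop DataNode package
      apt: pkg=hadoop-datanode state=present
    - name: push the cluster-wide HDFS configuration
      copy: src=files/hdfs-site.xml dest=/etc/hadoop/conf/hdfs-site.xml
    - name: ensure the DataNode is running
      service: name=hadoop-datanode state=started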
Running the Hadoop Cluster
Typical Problems
Hadoop Clusters are Distributed Systems.
› Network stressed? Reduce-heavy workload.
› CPUs stressed? Map-heavy workload.
› Disks stressed? Map-heavy workload.
› RAM stressed? This is a DBMS after all!
Watch your storage subsystems.
› 120TB is a lot of disk space.
› Until you put in 120TB of data.
› 400 spindles is a lot of IOPS.
› Until you query everything. Ten times.
Administration by Scientific Method
Hadoop Clusters are Distributed Systems!
› Instability on system X? Could be Y's fault.
› Look for temporal correlation of ERRORs across nodes.
› Look for correlation of WARNINGs and ERRORs.
› Do log events correlate with graph anomalies?
The Procedure:
1. Problems occurring on the cluster?
2. Formulate hypothesis from input (graphs/logs).
3. Test hypothesis (tweak configurations).
4. Go to 1. You're graphing EVERYTHING, right?
Graphing your Logs
You need to graph everything. How about graphing
your logs?
› grep ERROR | cut <date/hour part> | uniq -c
2012-07-29 06 15692
2012-07-29 07 30432
2012-07-29 08 76943
2012-07-29 09 54955
2012-07-29 10 15652
That's close, but what if it's hundreds of lines? You can
put the data into LibreOffice Calc, but that slows down
the iteration cycle.
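For reference, here is that schematic pipeline spelled out concretely; the log file name is an illustrative assumption, and note that uniq -c prints the count before the date/hour key:
  grep ERROR hadoop-hadoop-datanode-dn42.log | cut -c1-13 | uniq -c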
Graphing logs (terminal output) is easier with Palomino's
terminal tool “distribution,” OSS on GitHub:
› grep ERROR | cut <date/hour part> | distribution
2012-07-29 06|15692 ++++++++++
2012-07-29 07|30432 +++++++++++++++++++
2012-07-29 08|76943 ++++++++++++++++++++++++++++++++++++++++++++++++
2012-07-29 09|54955 ++++++++++++++++++++++++++++++++++
2012-07-29 10|15652 ++++++++++
On a quick iteration cycle in the terminal, this is very
useful. For a presentation to the suits later, you can
import the data into a prettier tool.
A real-life (MySQL) example:
root@db49:/var/log/mysql# grep -i error error.log \
  | cut -c 1-9 \
  | distribution \
  | sort -n
(error.log was about 2.5GB; “cut -c 1-9” keeps just the
date/hour portion; distribution sorts by key frequency by
default, but we want date/hour ordering, hence “sort -n”.)
Val |Ct (Pct) Histogram
120601 12|60 (46.15%) █████████████████████████████████████████████████████████▏
120601 17|10 (7.69%) █████████▋
120601 14|4 (3.08%) ███▉
120602 14|2 (1.54%) ██
120602 21|4 (3.08%) ███▉
120610 13|2 (1.54%) ██
120610 14|4 (3.08%) ███▉
120611 14|2 (1.54%) ██
120612 14|2 (1.54%) ██
120613 14|2 (1.54%) ██
120616 13|2 (1.54%) ██
120630 14|5 (3.85%) ████▉
Obvious: Noon on June 1st was ugly.
But also: What keeps happening at 2pm?
Building the MySQL Datawarehouse
Hardware Spec and Layout
This is a typical OLAP role.
› Fast non-transactional engine: MyISAM.
› Data typically time-related: partition by date.
› Data write-only or read-all? Archive engine.
› Index-everything schemas.
Typically beefier hardware is better.
› Many spindles, many CPUs, much RAM.
› Reasonably-fast network cards.
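To make the schema bullets concrete, a minimal sketch of a date-partitioned, index-heavy MyISAM fact table; the table, columns, and partition boundaries are illustrative assumptions:
  -- Date-partitioned MyISAM fact table (illustrative).
  CREATE TABLE fact_events (
    event_date DATE NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    bytes_sent BIGINT UNSIGNED NOT NULL,
    KEY (event_date),
    KEY (user_id)
  ) ENGINE=MyISAM
  PARTITION BY RANGE (TO_DAYS(event_date)) (
    PARTITION p201207 VALUES LESS THAN (TO_DAYS('2012-08-01')),
    PARTITION p201208 VALUES LESS THAN (TO_DAYS('2012-09-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );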
ETL Framework
Getting Data into Hadoop
Hadoop HDFS at its core is simply a filesystem.
› Copy straight in: “cat file | hdfs put <filename>”
› From the network: “scp file | hdfs put <filename>”
› Streaming: (Logs?)→Flume→HDFS.
› Table loads: Sqoop (“select * into <hdfsFile>”).
HBase is not as simple, but can be worth it.
› Flume→HBase.
› HBase column family == columnar scans.
› Beware: no secondary indexes.
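The quoted commands above are schematic. In real syntax, the first two load paths might look like this; the paths, hostname, and connect string are illustrative assumptions:
  # Copy a local file straight into HDFS.
  hadoop fs -put /var/log/app/events.log /warehouse/raw/events.log

  # Table load: Sqoop pulls a whole table out of MySQL into HDFS.
  sqoop import --connect jdbc:mysql://db49/prod \
    --username etl -P --table orders \
    --target-dir /warehouse/raw/orders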
Notice when something is wrong
Don't skimp on ETL alerting! Start with the obvious:
› Yesterday TableX delta == 150k rows. Today 5k.
› Yesterday data loads were 120GB. Today 15GB.
› Yesterday “grep -ci error” == 1k. Today 20k.
› Yesterday “wc -l etllogs” == 700k. Today 10k.
› Yesterday ETL process == 8hrs. Today 1hr.
If you have time, get a bit more sophisticated:
› Yesterday TableX.ColY was int. Today varchar.
› Yesterday TableX.ColY compressed at 8x, today
it compresses at 2x (or 32x?).
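A minimal shell sketch of the first check, assuming a load_date column and an arbitrary 50% drop threshold; the table, column, and addresses are illustrative assumptions:
  #!/bin/sh
  # Alert if today's TableX delta is under half of yesterday's.
  today=$(mysql -N -e "SELECT COUNT(*) FROM dw.TableX
                       WHERE load_date = CURDATE()")
  yday=$(mysql -N -e "SELECT COUNT(*) FROM dw.TableX
                      WHERE load_date = CURDATE() - INTERVAL 1 DAY")
  if [ "$today" -lt $(( yday / 2 )) ]; then
    echo "TableX delta dropped: $yday -> $today" \
      | mail -s "ETL alert: TableX" dba@example.com
  fi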
Getting Data Out
Hadoop Reporting Tools
The oldschool method of retrieving data:
› select f(col) from table where ... group by ...
The NoSQL method of retrieving data:
› select f(col) from table where ... group by ...
Hadoop includes Hive (SQL→Map/Reduce Converter).
In my experience, dedicated business users can learn to
use Hive with little extra training.
But there is extra training!
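For instance, a short HiveQL sketch of the kind of query a business user might run; the table and columns are illustrative assumptions:
  -- Daily event counts and volume for July 2012 (illustrative).
  SELECT dt, COUNT(*) AS events, SUM(bytes_sent) AS total_bytes
  FROM raw_events
  WHERE dt BETWEEN '2012-07-01' AND '2012-07-31'
  GROUP BY dt;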
It's best if your business users have analytical mindsets,
technical backgrounds, and no fear of the command
line. Hadoop reporting:
› Tools that submit SQL and receive tabular data.
› Tableau has a Hadoop connector.
Most of Hadoop's power is in Map/Reduce:
› Hive == SQL→Map/Reduce.
› RHadoop == R→Map/Reduce.
› HadoopStreaming == Anything→Map/Reduce
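A minimal Hadoop Streaming sketch: any executable can serve as mapper or reducer. The jar path and HDFS paths are illustrative assumptions (and vary by Hadoop version):
  # Count ERROR lines per unique line text, using plain shell
  # tools as the mapper and reducer.
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input  /warehouse/raw/events \
    -output /warehouse/out/error_counts \
    -mapper  'grep ERROR' \
    -reducer 'uniq -c'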
The Hybrid Datawarehouse
Putting it All Together
The Way I've Always Done It:
1. Identify a data flow overloading current DW.
› Typical == raw data into DW then summarised.
2. New parallel ETL into Hadoop.
3. Build ETLs Hadoop→current DW.
› Typical == equivalent summaries from #1.
› Once that works, shut off old data flow.
4. Give everyone access to Hadoop.
› They will think of cool new uses for the data.
5. Work through The Pain of #4.
› It doesn't come free, but is worth the price.
6. Go to #1.
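Step 3 above, Hadoop back into the current DW, can be as small as one Sqoop export of a Hive-built summary; the table, directory, and connect string are illustrative assumptions (Hive's default field delimiter is \001):
  # Push a summary table built in Hive back into the MySQL DW.
  sqoop export --connect jdbc:mysql://db49/dw \
    --username etl -P --table daily_summary \
    --export-dir /user/hive/warehouse/daily_summary \
    --input-fields-terminated-by '\001'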
Q&A
Questions? Some suggestions:
› What is the average airspeed of a laden sparrow?
› How can I hire you?
› No really, I have money, you have skills.
Let's make this happen.
› Where's the coffee?
I never thought I could be so sleepy.
Thank you! Email me if you desire.
domain: palominodb.com – username: time
Percona Live NYC 2012