Percona Live NYC 2012
MySQL/Hadoop Hybrid Datawarehouse
Who are Palomino?
› Bespoke Services: we work with and like you.
› Production Experienced: senior DBAs, admins, engineers.
› 24x7: globally-distributed on-call staff.
› One-Month Contracts: not more.
› Professional Services:
› ETLs,
› Cluster tooling.
› Configuration management (DevOps):
› Chef,
› Puppet,
› Ansible.
› Big Data Cluster Administration (OpsDev)
› MySQL, PostgreSQL,
› Cassandra, HBase,
› MongoDB, Couchbase.
Who am I?
Tim Ellis
CTO/Principal Architect, Palomino
Achievements:
› Palomino Big Data Strategy.
› Datawarehouse Cluster at Riot Games.
› Designed/built back-end for Firefox Sync.
› Led DB team at Digg.com.
› Harassed the Reddit team at a party.
Helped ensure business success for:
› Digg,
› Friendster,
› Mozilla,
› StumbleUpon,
› Riot Games (League of Legends).
What Is This Talk?
Experiences of a High-Volume DBA
I've built high-volume Datawarehouses, but am not
well-versed in traditional Datawarehouse theory. Cube?
Snowflake? Star?
I'll win a bar bet, but would be fired from Oracle.
I've administered high-volume Datawarehouses and
managed a large ETL rollout, but haven't written
extensive ETLs or reports.
By necessity, a high-volume Datawarehouse is designed
differently from a low-volume one: typically simpler
schemas and more complex queries.
Why OSS?
Freedom at Scale == Economical Sense
Selling OSS to Management used to be hard...
› My query tools are limited.
› The business users know DBMSx.
› The documentation is lacking.
...but then terascale happened one day.
› Adding 20TB costs HOW MUCH?!
› Adding 30 machines costs HOW MUCH?!
› How many sales calls before I push the release?
› I'll hire an entire team and still be more efficient.
How to begin?
Take stock of the current system
Establish a data flow:
› Who's sending me data?
› How much?
› What are the bottlenecks?
› What's the current ETL process?
We're looking for typical data flow characteristics:
› Log data, write-mostly, free-form.
› Looks tabular, “select * from table.”
› Size: MB, GB or TB per hour?
› Who queries this data? How often?
What is Hadoop?
The Hadoop Ecosystem
Hadoop Components:
› HDFS: A filesystem across the whole cluster.
› Hadoop: A map/reduce implementation.
› Hive: SQL→Map/Reduce converter.
› HBase: A column store (and more).
Most-interesting bits:
› Hive lets business users formulate SQL!
› HBase provides a distributed column store!
› HDFS provides massive I/O and redundancy.
Should You Use Hadoop?
Hadoop Strengths and Weaknesses
Hadoop/HBase is good for:
› Scanning large chunks of your data on every query.
› Applying a lot of cluster resources to a query.
› Very large datasets: multiple terabytes to petabytes.
› With HBase, a column-store engine.
Where Hadoop/HBase falls short:
› Query iteration time is typically minutes.
› Administration is new and unusual.
› Hadoop is still immature (some say “beta”).
› Documentation is bad or non-existent.
Should You Use MySQL?
MySQL Strengths and Weaknesses
MySQL is good for:
› Smaller datasets, typically gigabytes.
› Indexing data automatically and quickly.
› Short query iteration, even milliseconds.
› Quick dataloads and processing with MyISAM.
Where MySQL falls short:
› It has no column-store engine.
› Documentation for datawarehousing is minimal.
› You probably know better than I. Trust the DBA.
› Be honest with management. If Vertica is better...
MySQL/Hadoop Hybrid
Common Weaknesses
So if you combine the weaknesses of these two
technologies... what have you got?
› No built-in end-user-friendly query tools.
› Immature technology that can crash sometimes.
› Not much documentation.
You'll need buy-in, savvy, and resilience from:
› ETL/Datawarehouse developers,
› Business Users,
› Systems Administrators,
› Management.
Building a Hadoop Cluster
The NameNode
Typical Reasons Clusters Fail:
› Cascading failure (distributed fail)
› Network outage (distributed fail)
› Bad query executed (distributed fail)
› NameNode dies? (single point of failure)
A failing NameNode is not a common case. Still, it's
good to plan for it (see the sketch after this list):
› All critical filesystems on RAID 1+0
› Redundant PSU
› Redundant NICs to independent routers
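One more belt-and-braces option, as a minimal hdfs-site.xml sketch assuming Hadoop 1.x property names: the NameNode writes its metadata to every directory listed in dfs.name.dir, so pointing it at two local volumes plus an NFS mount means no single loss destroys the filesystem image. The paths here are illustrative assumptions.
  <!-- hdfs-site.xml: redundant NameNode metadata directories
       (illustrative paths) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data1/namenode,/data2/namenode,/mnt/nfs/namenode</value>
  </property>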
Basic Cluster Node Configuration
So much for the specialised hardware. All non-NameNode
nodes in your cluster:
› RAID-0 or even JBOD.
› More spindles: linux-1u.net has 8 HDDs in 1U.
› 7200rpm SATA is fine; 15Krpm is overkill.
› Multiple TB of storage. ← lots of this!
› 8-24GB RAM.
› Good/fast network cards!
A DBA thinks “Database” == RAM. Likewise,
“Hadoop Node” == disk spindles, disk storage, and
network. You lose 2-3x of raw storage to data replication.
Network and Rack Layout
Network within a rack (top-of-rack switching):
› Bandwidth for 30 machines going full-tilt.
› Multiple TOR switches for redundancy.
› Consider bridging.
Network between racks (datacentre switching):
› Inter-rack switches: better than 2Gbit is desirable.
› Hadoop rack awareness reduces inter-rack traffic.
You need sharp networking staff on board to help build
the cluster. Network instability can cause crashes.
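Rack awareness is just a script you point Hadoop at. A minimal sketch, assuming the Hadoop 1.x topology.script.file.name property in core-site.xml points at this file; the subnets and rack names are illustrative assumptions:
  #!/bin/sh
  # rack-topology.sh: Hadoop passes node IPs as arguments and
  # expects one rack path per IP on stdout.
  for ip in "$@"; do
    case "$ip" in
      10.1.1.*) echo "/dc1/rack1" ;;
      10.1.2.*) echo "/dc1/rack2" ;;
      *)        echo "/default-rack" ;;
    esac
  done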
Monitoring: Trending and Alerting
Pick your graphing solution and put stats into it. Not sure
which stats to graph? Try all of them.
› Every Hadoop stat exposed via JMX.
› Every HBase stat exposed via JMX.
› All disk, CPU, RAM, network stats.
A possible solution:
› Use collectd's JMX plugin to collect stats.
› Put stats into Graphite.
› Or Ganglia if you know how.
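Whatever collects the stats, Graphite's last mile is trivial: anything that can open a TCP socket can submit a metric to its plaintext listener. A minimal shell sketch; the metric name, value, and Graphite hostname are illustrative assumptions:
  # Send one datapoint ("metric value unix-timestamp") to
  # Graphite's plaintext port (2003).
  echo "hadoop.dn42.jmx.BytesReadPerSec 12345 $(date +%s)" \
    | nc graphite.example.com 2003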
Palomino Cluster Tool
Use Configuration Management to build your cluster:
› Ansible – easiest and quickest.
› Opscode Chef – most popular, must love Ruby.
› Puppet – most mature.
The Palomino Cluster Tool (open source on Github)
uses the above tools to build a cluster for you:
› Pre-written Configuration Management scripts.
› Sets up HDFS, Hadoop, HBase, Monitoring.
› In the future, will also set up alerting and backups.
› Also sets up MySQL+MHA, which may be relevant here.
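For flavor, a minimal Ansible playbook sketch of the kind of step such tools automate. This is not the Palomino Cluster Tool itself, and the host group, package, and file names are illustrative assumptions:
  # site.yml: install and configure DataNodes (illustrative).
  - hosts: datanodes
    user: root
    tasks:
    - name: install the Hadoop DataNode package
      apt: pkg=hadoop-datanode state=present
    - name: push the cluster-wide HDFS configuration
      copy: src=files/hdfs-site.xml dest=/etc/hadoop/conf/hdfs-site.xml
    - name: ensure the DataNode is running
      service: name=hadoop-datanode state=started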
Running the Hadoop Cluster
Typical Problems
Hadoop Clusters are Distributed Systems.
› Network stressed? Reduce-heavy workload.
› CPUs stressed? Map-heavy workload.
› Disks stressed? Map-heavy workload.
› RAM stressed? This is a DBMS after all!
Watch your storage subsystems.
› 120TB is a lot of disk space.
› Until you put in 120TB of data.
› 400 spindles is a lot of IOPS.
› Until you query everything. Ten times.
Administration by Scientific Method
Hadoop Clusters are Distributed Systems!
› Instability on system X? Could be Y's fault.
› Look for temporal correlation of ERRORs across nodes.
› Look for correlation of WARNINGs and ERRORs.
› Do log events correlate with graph anomalies?
The Procedure:
1. Problems occurring on the cluster?
2. Formulate hypothesis from input (graphs/logs).
3. Test hypothesis (tweak configurations).
4. Go to 1. You're graphing EVERYTHING, right?
Graphing your Logs
You need to graph everything. How about graphing
your logs?
› grep ERROR | cut <date/hour part> | uniq -c
2012-07-29 06 15692
2012-07-29 07 30432
2012-07-29 08 76943
2012-07-29 09 54955
2012-07-29 10 15652
That's close, but what if it's hundreds of lines? You can
put the data into LibreOffice Calc, but that slows down
the iteration cycle.
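For reference, here is that schematic pipeline spelled out concretely; the log file name is an illustrative assumption, and note that uniq -c prints the count before the date/hour key:
  grep ERROR hadoop-hadoop-datanode-dn42.log | cut -c1-13 | uniq -c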
Graphing logs (terminal output) is easier with Palomino's
terminal tool “distribution,” OSS on GitHub:
› grep ERROR | cut <date/hour part> | distribution
2012-07-29 06|15692 ++++++++++
2012-07-29 07|30432 +++++++++++++++++++
2012-07-29 08|76943 ++++++++++++++++++++++++++++++++++++++++++++++++
2012-07-29 09|54955 ++++++++++++++++++++++++++++++++++
2012-07-29 10|15652 ++++++++++
On a quick iteration cycle in the terminal, this is very
useful. For a presentation to the suits later, you can
import the data into a prettier tool.
A real-life (MySQL) example:
root@db49:/var/log/mysql# grep -i error error.log \
  | cut -c 1-9 \
  | distribution \
  | sort -n
(error.log was about 2.5GB; “cut -c 1-9” keeps just the
date/hour portion; distribution sorts by key frequency by
default, but we want date/hour ordering, hence “sort -n”.)
Val |Ct (Pct) Histogram
120601 12|60 (46.15%) █████████████████████████████████████████████████████████▏
120601 17|10 (7.69%) █████████▋
120601 14|4 (3.08%) ███▉
120602 14|2 (1.54%) ██
120602 21|4 (3.08%) ███▉
120610 13|2 (1.54%) ██
120610 14|4 (3.08%) ███▉
120611 14|2 (1.54%) ██
120612 14|2 (1.54%) ██
120613 14|2 (1.54%) ██
120616 13|2 (1.54%) ██
120630 14|5 (3.85%) ████▉
Obvious: Noon on June 1st was ugly.
But also: What keeps happening at 2pm?
Building the MySQL Datawarehouse
Hardware Spec and Layout
This is a typical OLAP role.
› Fast non-transactional engine: MyISAM.
› Data typically time-related: partition by date.
› Data write-only or read-all? Archive engine.
› Index-everything schemas.
Typically beefier hardware is better.
› Many spindles, many CPUs, much RAM.
› Reasonably-fast network cards.
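To make the schema bullets concrete, a minimal sketch of a date-partitioned, index-heavy MyISAM fact table; the table, columns, and partition boundaries are illustrative assumptions:
  -- Date-partitioned MyISAM fact table (illustrative).
  CREATE TABLE fact_events (
    event_date DATE NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    bytes_sent BIGINT UNSIGNED NOT NULL,
    KEY (event_date),
    KEY (user_id)
  ) ENGINE=MyISAM
  PARTITION BY RANGE (TO_DAYS(event_date)) (
    PARTITION p201207 VALUES LESS THAN (TO_DAYS('2012-08-01')),
    PARTITION p201208 VALUES LESS THAN (TO_DAYS('2012-09-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );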
ETL Framework
Getting Data into Hadoop
Hadoop HDFS at its core is simply a filesystem.
› Copy straight in: “cat file | hdfs put <filename>”
› From the network: “scp file | hdfs put <filename>”
› Streaming: (Logs?)→Flume→HDFS.
› Table loads: Sqoop (“select * into <hdfsFile>”).
HBase is not as simple, but can be worth it.
› Flume→HBase.
› HBase column family == columnar scans.
› Beware: no secondary indexes.
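The quoted commands above are schematic. In real syntax, the first two load paths might look like this; the paths, hostname, and connect string are illustrative assumptions:
  # Copy a local file straight into HDFS.
  hadoop fs -put /var/log/app/events.log /warehouse/raw/events.log

  # Table load: Sqoop pulls a whole table out of MySQL into HDFS.
  sqoop import --connect jdbc:mysql://db49/prod \
    --username etl -P --table orders \
    --target-dir /warehouse/raw/orders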
Notice when something is wrong
Don't skimp on ETL alerting! Start with the obvious:
› Yesterday TableX delta == 150k rows. Today 5k.
› Yesterday data loads were 120GB. Today 15GB.
› Yesterday “grep -ci error” == 1k. Today 20k.
› Yesterday “wc -l etllogs” == 700k. Today 10k.
› Yesterday ETL process == 8hrs. Today 1hr.
If you have time, get a bit more sophisticated:
› Yesterday TableX.ColY was int. Today varchar.
› Yesterday TableX.ColY compressed at 8x, today
it compresses at 2x (or 32x?).
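A minimal shell sketch of the first check, assuming a load_date column and an arbitrary 50% drop threshold; the table, column, and addresses are illustrative assumptions:
  #!/bin/sh
  # Alert if today's TableX delta is under half of yesterday's.
  today=$(mysql -N -e "SELECT COUNT(*) FROM dw.TableX
                       WHERE load_date = CURDATE()")
  yday=$(mysql -N -e "SELECT COUNT(*) FROM dw.TableX
                      WHERE load_date = CURDATE() - INTERVAL 1 DAY")
  if [ "$today" -lt $(( yday / 2 )) ]; then
    echo "TableX delta dropped: $yday -> $today" \
      | mail -s "ETL alert: TableX" dba@example.com
  fi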
Getting Data Out
Hadoop Reporting Tools
The oldschool method of retrieving data:
› select f(col) from table where ... group by ...
The NoSQL method of retrieving data:
› select f(col) from table where ... group by ...
Hadoop includes Hive (SQL→Map/Reduce Converter).
In my experience, dedicated business users can learn to
use Hive with little extra training.
But there is extra training!
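For instance, a short HiveQL sketch of the kind of query a business user might run; the table and columns are illustrative assumptions:
  -- Daily event counts and volume for July 2012 (illustrative).
  SELECT dt, COUNT(*) AS events, SUM(bytes_sent) AS total_bytes
  FROM raw_events
  WHERE dt BETWEEN '2012-07-01' AND '2012-07-31'
  GROUP BY dt;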
It's best if your business users have analytical mindsets,
technical backgrounds, and no fear of the command
line. Hadoop reporting:
› Tools that submit SQL and receive tabular data.
› Tableau has a Hadoop connector.
Most of Hadoop's power is in Map/Reduce:
› Hive == SQL→Map/Reduce.
› RHadoop == R→Map/Reduce.
› HadoopStreaming == Anything→Map/Reduce
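A minimal Hadoop Streaming sketch: any executable can serve as mapper or reducer. The jar path and HDFS paths are illustrative assumptions (and vary by Hadoop version):
  # Count ERROR lines per unique line text, using plain shell
  # tools as the mapper and reducer.
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input  /warehouse/raw/events \
    -output /warehouse/out/error_counts \
    -mapper  'grep ERROR' \
    -reducer 'uniq -c'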
The Hybrid Datawarehouse
Putting it All Together
The Way I've Always Done It:
1. Identify a data flow overloading current DW.
› Typical == raw data into DW then summarised.
2. New parallel ETL into Hadoop.
3. Build ETLs Hadoop→current DW.
› Typical == equivalent summaries from #1.
› Once that works, shut off old data flow.
4. Give everyone access to Hadoop.
› They will think of cool new uses for the data.
5. Work through The Pain of #4.
› It doesn't come free, but is worth the price.
6. Go to #1.
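Step 3 above, Hadoop back into the current DW, can be as small as one Sqoop export of a Hive-built summary; the table, directory, and connect string are illustrative assumptions (Hive's default field delimiter is \001):
  # Push a summary table built in Hive back into the MySQL DW.
  sqoop export --connect jdbc:mysql://db49/dw \
    --username etl -P --table daily_summary \
    --export-dir /user/hive/warehouse/daily_summary \
    --input-fields-terminated-by '\001'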
Q&A
Questions? Some suggestions:
› What is the average airspeed of a laden sparrow?
› How can I hire you?
› No really, I have money, you have skills.
Let's make this happen.
› Where's the coffee?
I never thought I could be so sleepy.
Thank you! Email me if you desire.
domain: palominodb.com – username: time
Percona Live NYC 2012