MySQL/Hadoop Hybrid Datawarehouse
Percona Live NYC 2012

Who are Palomino?
› Bespoke Services: we work with and like you.
› Production Experienced: senior DBAs, admins, engineers.
› 24x7: globally-distributed on-call staff.
› One-Month Contracts: not more.
› Professional Services:
    › ETLs,
    › Cluster tooling.
› Configuration management (DevOps)
    › Chef,
    › Puppet,
    › Ansible.
› Big Data Cluster Administration (OpsDev)
    › MySQL, PostgreSQL,
    › Cassandra, HBase,
    › MongoDB, Couchbase.

Who am I?
Tim Ellis
CTO/Principal Architect, Palomino

Achievements:
 › Palomino Big Data Strategy.
 › Datawarehouse Cluster at Riot Games.
 › Designed/built back-end for Firefox Sync.
 › Led DB team at Digg.com.
 › Harassed the Reddit team at a party.

Ensured successful business for:
 › Digg,
 › Friendster,
 › Mozilla,
 › StumbleUpon,
 › Riot Games (League of Legends).

What Is This Talk?
Experiences of a High-Volume DBA

I've built high-volume Datawarehouses, but am not
well-versed in traditional Datawarehouse theory. Cube?
Snowflake? Star?

I'll win a bar bet, but would be fired from Oracle.

I've administered high-volume Datawarehouses and
managed a large ETL rollout, but haven't written
extensive ETLs or reports.

By necessity, a high-volume Datawarehouse is designed
differently from a low-volume one: typically simpler
schemas and more complex queries.

Why OSS?
Freedom at Scale == Economical Sense

Selling OSS to Management used to be hard...

  › My query tools are limited.
  › The business users know DBMSx.
  › The documentation is lacking.

...but then terascale happened one day.

  › Adding 20TB costs HOW MUCH?!
  › Adding 30 machines costs HOW MUCH?!
  › How many sales calls before I push the release?
  › I'll hire an entire team and still be more efficient.

How to begin?
Take stock of the current system

Establish a data flow:

  › Who's sending me data?
  › How much?
  › What are the bottlenecks?
  › What's the current ETL process?

We're looking for typical data flow characteristics:

  › Log data, write-mostly, free-form.
  › Looks tabular, “select * from table.”
  › Size: MB, GB or TB per hour?
  › Who queries this data? How often?
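
To put numbers on “how much,” a minimal shell sketch (the path and the
daily directory layout are assumptions):

  # Sum today's inbound volume in GB:
  du -sb /data/incoming/$(date +%F)/ \
    | awk '{printf "%.1f GB so far today\n", $1/1024/1024/1024}'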

What is Hadoop?
The Hadoop Ecosystem

Hadoop Components:

  › HDFS: A filesystem across the whole cluster.
  › Hadoop: A map/reduce implementation.
  › Hive: SQL→Map/Reduce converter.
  › HBase: A column store (and more).

Most-interesting bits:

  › Hive lets business users formulate SQL!
  › HBase provides a distributed column store!
  › HDFS provides massive I/O and redundancy.
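
A taste of Hive, as a minimal sketch (the table name “weblogs” is
hypothetical):

  hive -e "SELECT dt, COUNT(*) AS hits
           FROM weblogs
           GROUP BY dt
           ORDER BY dt;"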

Should You Use Hadoop?
Hadoop Strengths and Weaknesses

Hadoop/HBase is good for:

  › Scan large chunks of your data every time.
  › Apply a lot of cluster resource to a query.
  › Very large datasets, multiple tera/petabytes.
  › With HBase, column store engine.

Where Hadoop/HBase falls short:

  › Query iteration is typically minutes.
  › Administration is new and unusual.
  › Hadoop still immature (some say “beta”).
  › Documentation is bad or non-existent.

Should You Use MySQL?
MySQL Strengths and Weaknesses

MySQL is good for:

  › Smaller datasets, typically gigabytes.
  › Indexing data automatically and quickly.
  › Short query iteration, even milliseconds.
  › Quick dataloads and processing with MyISAM.

Where MySQL falls short:

  › Has no column store engine.
  › Documentation for datawarehousing minimal.
  › You probably know better than I. Trust the DBA.
  › Be honest with management. If Vertica is better...

MySQL/Hadoop Hybrid
Common Weaknesses

So if you combine the weaknesses of these two
technologies... what have you got?

  › No built-in end-user-friendly query tools.
  › Immature technology – can crash sometimes.
  › Not too much documentation.

You'll need buy-in, savvy, and resilience from:

  › ETL/Datawarehouse developers,
  › Business Users,
  › Systems Administrators,
  › Management.

Building a Hadoop Cluster
The NameNode

Typical Reasons Clusters Fail:

  › Cascading failure (distributed fail)
  › Network outage (distributed fail)
  › Bad query executed (distributed fail)

NameNode failing is not a common failure case. Still,
it's good to plan for it:

  › All critical filesystems on RAID 1+0
  › Redundant PSU
  › Redundant NICs to independent routers
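
Two cheap additions to that plan (paths and hostnames are hypothetical):
Hadoop 1.x lets dfs.name.dir list several directories, and an off-host
copy of the NameNode metadata is easy insurance:

  # Periodic off-host copy of the NameNode metadata directory:
  rsync -a --delete /data/dfs/name/ nn-standby:/backups/dfs-name/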

Building a Hadoop Cluster
Basic Cluster Node Configuration

So much for the specialised hardware. All non-
NameNode nodes in your cluster:

  › RAID-0 or even JBOD.
  › More spindles: linux-1u.net has 8HDD in 1U.
  › 7200rpm SATA nice, 15Krpm overkill.
  › Multiple TB of storage. ←lots of this!!!
  › 8-24GB RAM.
  › Good/fast network cards!

A DBA thinks “Database” == RAM. Likewise,
“Hadoop Node” == disk spindles, disk storage, and
network. You lose 2-3x storage to data replication.

Building a Hadoop Cluster
Network and Rack Layout

Network within a rack (top-of-rack switching):

  › Bandwidth for 30 machines going full-tilt.
  › Multiple TOR switches for redundancy.
  › Consider bridging.

Network between racks (datacentre switching):

  › Inter-rack switches: better than 2Gbit desirable.
  › Hadoop rack awareness reduces inter-rack traffic.

You need sharp networking engineers on board to help
build the cluster. Network instability can cause crashes.
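
Rack awareness is driven by a script you point Hadoop at (the
topology.script.file.name property in Hadoop 1.x); a minimal sketch with
a hypothetical IP-to-rack mapping:

  #!/bin/bash
  # Hadoop passes IPs/hostnames as arguments; print one rack path each.
  for host in "$@"; do
    case "$host" in
      10.1.1.*) echo /dc1/rack1 ;;
      10.1.2.*) echo /dc1/rack2 ;;
      *)        echo /default-rack ;;
    esac
  done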

Building a Hadoop Cluster
Monitoring: Trending and Alerting

Pick your graphing solution, and put stats into it. In
doubt about which stats to graph? Try all of them.

  › Every Hadoop stat exposed via JMX.
  › Every HBase stat exposed via JMX.
  › All disk, CPU, RAM, network stats.

A possible solution:

   › Use collectd's JMX plugin to collect stats.
   › Put stats into Graphite.
   › Or Ganglia if you know how.
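
Graphite makes the “put stats into it” step nearly trivial; a minimal
sketch (the host and metric names are hypothetical):

  # Graphite's plaintext protocol: one "metric.path value timestamp" per line.
  echo "hadoop.dn12.jvm.heap_used 3482719232 $(date +%s)" \
    | nc graphite.example.com 2003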

Building a Hadoop Cluster
Palomino Cluster Tool

Use Configuration Management to build your cluster:

  › Ansible – easiest and quickest.
  › Opscode Chef – most popular, must love Ruby.
  › Puppet – most mature.

The Palomino Cluster Tool (open source on Github)
uses the above tools to build a cluster for you:

  › Pre-written Configuration Management scripts.
  › Sets up HDFS, Hadoop, HBase, Monitoring.
  › In the future, will also set up alerting and backups.
  › Also sets up MySQL+MHA, may be relevant?

Running the Hadoop Cluster
Typical Problems

Hadoop Clusters are Distributed Systems.

  › Network stressed? Reduce-heavy workload.
  › CPUs stressed? Map-heavy workload.
  › Disks stressed? Map-heavy workload.
  › RAM stressed? This is a DBMS after all!

Watch your storage subsystems.

  › 120TB is a lot of disk space.
  › Until you put in 120TB of data.
  › 400 spindles is a lot of IOPS.
  › Until you query everything. Ten times.
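
Two habits that help (the grep patterns match Hadoop 1.x dfsadmin
output; iostat comes from the sysstat package):

  # Cluster-wide HDFS capacity at a glance:
  hadoop dfsadmin -report | grep -E 'Configured Capacity|DFS Used|DFS Remaining'
  # Per-node spindle pressure, refreshed every five seconds:
  iostat -x 5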

Running the Hadoop Cluster
Administration by Scientific Method

What did we just learn...?

Hadoop Clusters are Distributed Systems!

  › Instability on system X? Could be Y's fault.
  › Temporal correlation of ERRORs across nodes.
  › Correlation of WARNINGs and ERRORs.
  › Do log events correlate to graph anomalies?

The Procedure:

  1. Problems occurring on the cluster?
  2. Formulate hypothesis from input (graphs/logs).
  3. Test hypothesis (tweak configurations).
  4. Go to 1. You're graphing EVERYTHING, right?
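
A minimal sketch of step 2's input-gathering (hostnames and log paths
are hypothetical):

  # Count ERRORs per node to see whether the pain is local or cluster-wide:
  for h in dn01 dn02 dn03; do
    echo "$h $(ssh $h 'cat /var/log/hadoop/*.log | grep -c ERROR')"
  done | sort -k2 -rn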

Running the Hadoop Cluster
Graphing your Logs

You need to graph everything. How about graphing
your logs?

  › grep ERROR | cut <date/hour part> | uniq -c
  2012-07-29   06   15692
  2012-07-29   07   30432
  2012-07-29   08   76943
  2012-07-29   09   54955
  2012-07-29   10   15652

That's close, but what if it's hundreds of lines? You can put the data
into LibreOffice Calc, but that slows down the iteration cycle.
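
A concrete form of that pipeline (the log path and timestamp columns
are assumptions; adjust cut to your log format). No sort is needed
before uniq -c because the timestamps already arrive in order:

  grep ERROR /var/log/hadoop/datanode.log | cut -c 1-13 | uniq -c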

Graphing logs (terminal output) is easier with Palomino's terminal tool
“distribution,” OSS on GitHub:

  › grep ERROR | cut <date/hour part> | distribution
  2012-07-29   06|15692   ++++++++++
  2012-07-29   07|30432   +++++++++++++++++++
  2012-07-29   08|76943   ++++++++++++++++++++++++++++++++++++++++++++++++
  2012-07-29   09|54955   ++++++++++++++++++++++++++++++++++
  2012-07-29   10|15652   ++++++++++

On a quick iteration cycle in the terminal, this is very useful. For a
later presentation to the suits, you can import the data into a prettier
tool.

A real-life (MySQL) example:

  root@db49:/var/log/mysql# grep -i error error.log |  # this file was about 2.5GB
    cut -c 1-9 |                                       # just the date/hour portion
    distribution |                                     # sorts by key frequency by default...
    sort -n                                            # ...but we want date/hour ordering
  Val      |Ct (Pct)    Histogram
  120601 12|60 (46.15%) █████████████████████████████████████████████████████████▏
  120601 17|10 (7.69%) █████████▋
  120601 14|4 (3.08%)   ███▉
  120602 14|2 (1.54%)   ██
  120602 21|4 (3.08%)   ███▉
  120610 13|2 (1.54%)   ██
  120610 14|4 (3.08%)   ███▉
  120611 14|2 (1.54%)   ██
  120612 14|2 (1.54%)   ██
  120613 14|2 (1.54%)   ██
  120616 13|2 (1.54%)   ██
  120630 14|5 (3.85%)   ████▉

Obvious: Noon on June 1st was ugly.

But also: What keeps happening at 2pm?

Building the MySQL Datawarehouse
Hardware Spec and Layout

This is a typical OLAP role.

  › Fast non-transactional engine: MyISAM.
  › Data typically time-related: partition by date.
  › Data write-only or read-all? Archive engine.
  › Index-everything schemas.

Typically beefier hardware is better.

  › Many spindles, many CPUs, much RAM.
  › Reasonably-fast network cards.
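
A minimal DDL sketch of those points (the table and column names are
hypothetical):

  mysql -e "
    CREATE TABLE fact_events (
      dt      DATE   NOT NULL,
      user_id INT    NOT NULL,
      bytes   BIGINT NOT NULL,
      KEY (dt), KEY (user_id), KEY (bytes)
    ) ENGINE=MyISAM
    PARTITION BY RANGE (TO_DAYS(dt)) (
      PARTITION p201207 VALUES LESS THAN (TO_DAYS('2012-08-01')),
      PARTITION p201208 VALUES LESS THAN (TO_DAYS('2012-09-01')),
      PARTITION pmax    VALUES LESS THAN MAXVALUE
    );" dw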

ETL Framework
Getting Data into Hadoop

Hadoop HDFS at its core is simply a filesystem.

  › Copy straight in: “cat file | hdfs put <filename>”
  › From the network: “scp file | hdfs put <filename>”
  › Streaming: (Logs?)→Flume→HDFS.
  › Table loads: Sqoop (“select * into <hdfsFile>”).

HBase is not as simple, but can be worth it.

  › Flume→HBase.
  › HBase column family == columnar scans.
  › Beware: no secondary indexes.
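
Concrete forms of those loads (paths, hosts, and credentials are
hypothetical, and exact flags vary by version):

  # Local file straight into HDFS:
  hadoop fs -put events-20120729.log /dw/raw/events/
  # From the network, without landing on local disk first:
  ssh loghost cat /var/log/app.log | hadoop fs -put - /dw/raw/app.log
  # Table load with Sqoop:
  sqoop import --connect jdbc:mysql://db49/prod --username etl -P \
    --table orders --target-dir /dw/raw/orders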

ETL Framework
Notice when something is wrong

Don't skimp on ETL alerting! Start with the obvious:

  › Yesterday TableX delta == 150k rows. Today 5k.
  › Yesterday data loads were 120GB. Today 15GB.
  › Yesterday “grep -ci error” == 1k. Today 20k.
  › Yesterday “wc -l etllogs” == 700k. Today 10k.
  › Yesterday ETL process == 8hrs. Today 1hr.

If you have time, get a bit more sophisticated:

  › Yesterday TableX.ColY was int. Today varchar.
  › Yesterday TableX.ColY compressed at 8x, today
    it compresses at 2x (or 32x?).
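
The first check as a minimal sketch (paths, the one-fifth threshold, and
the alert address are hypothetical):

  today=$(du -sb /dw/loads/$(date +%F) | cut -f1)
  yest=$(du -sb /dw/loads/$(date -d yesterday +%F) | cut -f1)
  # Alert if today's load volume is under a fifth of yesterday's:
  if [ "$today" -lt $((yest / 5)) ]; then
    echo "ETL volume dropped: $yest -> $today bytes" \
      | mail -s 'ETL alert' dba@example.com
  fi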

Getting Data Out
Hadoop Reporting Tools

The oldschool method of retrieving data:

   › select f(col) from table where ... group by ...

The NoSQL method of retrieving data:

   › select f(col) from table where ... group by …

Hadoop includes Hive (SQL→Map/Reduce Converter).
In my experience, dedicated business users can learn to
use Hive with little extra training.

But there is extra training!

It's best if your business users have analytical mindsets,
technical backgrounds, and no fear of the command
line. Hadoop reporting:

  › Tools that submit SQL and receive tabular data.
  › Tableau has Hadoop connector.

Most of Hadoop's power is in Map/Reduce:

  › Hive == SQL→Map/Reduce.
  › RHadoop == R→Map/Reduce.
  › HadoopStreaming == Anything→Map/Reduce
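
HadoopStreaming in its smallest form (paths are hypothetical; the
streaming jar's location varies by distribution):

  hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -input /dw/raw/events -output /tmp/event_counts \
    -mapper /bin/cat -reducer /usr/bin/wc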

The Hybrid Datawarehouse
Putting it All Together

The Way I've Always Done It:

  1. Identify a data flow overloading current DW.
     › Typical == raw data into DW then summarised.
  2. New parallel ETL into Hadoop.
  3. Build ETLs Hadoop→current DW.
     › Typical == equivalent summaries from #1.
     › Once that works, shut off old data flow.
  4. Give everyone access to Hadoop.
     › They will think of cool new uses for the data.
  5. Work through The Pain of #4.
     › It doesn't come free, but is worth the price.
  6. Go to #1.
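
Steps 2 and 3 as a sketch (table and path names are hypothetical):
summarise in Hadoop, then push the summary into the existing DW:

  hive -e "INSERT OVERWRITE DIRECTORY '/dw/out/daily_summary'
           SELECT dt, COUNT(*), SUM(bytes) FROM raw_events GROUP BY dt;"
  sqoop export --connect jdbc:mysql://db49/dw --username etl -P \
    --table daily_summary --export-dir /dw/out/daily_summary \
    --input-fields-terminated-by '\001'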

The Hybrid Datawarehouse
Q&A

Questions? Some suggestions:

   › What is the average airspeed
     of a laden sparrow?
   › How can I hire you?
   › No really, I have money, you have skills.
     Let's make this happen.
   › Where's the coffee?
     I never thought I could be so sleepy.

Thank you! Email me if you desire.
domain: palominodb.com – username: time
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 

Semelhante a Hybrid my sql_hadoop_datawarehouse

Why Hadoop is important to Syncsort
Why Hadoop is important to SyncsortWhy Hadoop is important to Syncsort
Why Hadoop is important to Syncsorthuguk
 
Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!Roman Nikitchenko
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemZohar Elkayam
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointInside Analysis
 
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...BigDataEverywhere
 
Presentation big dataappliance-overview_oow_v3
Presentation   big dataappliance-overview_oow_v3Presentation   big dataappliance-overview_oow_v3
Presentation big dataappliance-overview_oow_v3xKinAnx
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureChristos Charmatzis
 
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemThings Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemZohar Elkayam
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talksyhadoop
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & HadoopBlackvard
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game ChangerCaserta
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopCaserta
 
Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)guest0f8e278
 
The Rise of Digital Audio (AdsWizz, DevTalks Bucharest, 2015)
The Rise of Digital Audio (AdsWizz, DevTalks Bucharest, 2015)The Rise of Digital Audio (AdsWizz, DevTalks Bucharest, 2015)
The Rise of Digital Audio (AdsWizz, DevTalks Bucharest, 2015)Bogdan Bocse
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 

Semelhante a Hybrid my sql_hadoop_datawarehouse (20)

Why Hadoop is important to Syncsort
Why Hadoop is important to SyncsortWhy Hadoop is important to Syncsort
Why Hadoop is important to Syncsort
 
Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!Big Data: fall seven times, stand up eight!
Big Data: fall seven times, stand up eight!
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
 
Presentation big dataappliance-overview_oow_v3
Presentation   big dataappliance-overview_oow_v3Presentation   big dataappliance-overview_oow_v3
Presentation big dataappliance-overview_oow_v3
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with Azure
 
Final deck
Final deckFinal deck
Final deck
 
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemThings Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & Hadoop
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
 
Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)
 
The Rise of Digital Audio (AdsWizz, DevTalks Bucharest, 2015)
The Rise of Digital Audio (AdsWizz, DevTalks Bucharest, 2015)The Rise of Digital Audio (AdsWizz, DevTalks Bucharest, 2015)
The Rise of Digital Audio (AdsWizz, DevTalks Bucharest, 2015)
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 

Mais de Laine Campbell

Pythian operational visibility
Pythian operational visibilityPythian operational visibility
Pythian operational visibilityLaine Campbell
 
Scaling MySQL in Amazon Web Services
Scaling MySQL in Amazon Web ServicesScaling MySQL in Amazon Web Services
Scaling MySQL in Amazon Web ServicesLaine Campbell
 
RDS for MySQL, No BS Operations and Patterns
RDS for MySQL, No BS Operations and PatternsRDS for MySQL, No BS Operations and Patterns
RDS for MySQL, No BS Operations and PatternsLaine Campbell
 
An Introduction To Palomino
An Introduction To PalominoAn Introduction To Palomino
An Introduction To PalominoLaine Campbell
 
Methods of Sharding MySQL
Methods of Sharding MySQLMethods of Sharding MySQL
Methods of Sharding MySQLLaine Campbell
 
CouchConf SF 2012 Lightning Talk - Operational Excellence
CouchConf SF 2012 Lightning Talk - Operational ExcellenceCouchConf SF 2012 Lightning Talk - Operational Excellence
CouchConf SF 2012 Lightning Talk - Operational ExcellenceLaine Campbell
 
Understanding MySQL Performance through Benchmarking
Understanding MySQL Performance through BenchmarkingUnderstanding MySQL Performance through Benchmarking
Understanding MySQL Performance through BenchmarkingLaine Campbell
 

Mais de Laine Campbell (8)

Pythian operational visibility
Pythian operational visibilityPythian operational visibility
Pythian operational visibility
 
Scaling MySQL in Amazon Web Services
Scaling MySQL in Amazon Web ServicesScaling MySQL in Amazon Web Services
Scaling MySQL in Amazon Web Services
 
RDS for MySQL, No BS Operations and Patterns
RDS for MySQL, No BS Operations and PatternsRDS for MySQL, No BS Operations and Patterns
RDS for MySQL, No BS Operations and Patterns
 
Running MySQL in AWS
Running MySQL in AWSRunning MySQL in AWS
Running MySQL in AWS
 
An Introduction To Palomino
An Introduction To PalominoAn Introduction To Palomino
An Introduction To Palomino
 
Methods of Sharding MySQL
Methods of Sharding MySQLMethods of Sharding MySQL
Methods of Sharding MySQL
 
CouchConf SF 2012 Lightning Talk - Operational Excellence
CouchConf SF 2012 Lightning Talk - Operational ExcellenceCouchConf SF 2012 Lightning Talk - Operational Excellence
CouchConf SF 2012 Lightning Talk - Operational Excellence
 
Understanding MySQL Performance through Benchmarking
Understanding MySQL Performance through BenchmarkingUnderstanding MySQL Performance through Benchmarking
Understanding MySQL Performance through Benchmarking
 

Último

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Último (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Hybrid MySQL/Hadoop Datawarehouse

  • 1. Percona Live NYC 2012 1 MySQL/Hadoop Hybrid Datawarehouse Who are Palomino? › Bespoke Services: we work with and like you. › Production Experienced: senior DBAs, admins, engineers. › 24x7: globally-distributed on-call staff. › One-Month Contracts: not more.
  • 2. Percona Live NYC 2012 2 MySQL/Hadoop Hybrid Datawarehouse Who are Palomino? › Bespoke Services: we work with and like you. › Production Experienced: senior DBAs, admins, engineers. › 24x7: globally-distributed on-call staff. › One-Month Contracts: not more. › Professional Services: › ETLs, › Cluster tooling.
  • 3. Percona Live NYC 2012 3 MySQL/Hadoop Hybrid Datawarehouse Who are Palomino? › Bespoke Services: we work with and like you. › Production Experienced: senior DBAs, admins, engineers. › 24x7: globally-distributed on-call staff. › One-Month Contracts: not more. › Professional Services: › ETLs, › Cluster tooling. › Configuration management (DevOps) › Chef, › Puppet, › Ansible.
  • 4. Percona Live NYC 2012 4 MySQL/Hadoop Hybrid Datawarehouse Who are Palomino? › Bespoke Services: we work with and like you. › Production Experienced: senior DBAs, admins, engineers. › 24x7: globally-distributed on-call staff. › One-Month Contracts: not more. › Professional Services: › ETLs, › Cluster tooling. › Configuration management (DevOps) › Chef, › Puppet, › Ansible. › Big Data Cluster Administration (OpsDev) › MySQL, PostgreSQL, › Cassandra, HBase, › MongoDB, Couchbase.
  • 5. Percona Live NYC 2012 5 MySQL/Hadoop Hybrid Datawarehouse Who am I? Tim Ellis CTO/Principal Architect, Palomino Achievements: › Palomino Big Data Strategy. › Datawarehouse Cluster at Riot Games. › Designed/built back-end for Firefox Sync.
  • 6. Percona Live NYC 2012 6 MySQL/Hadoop Hybrid Datawarehouse Who am I? Tim Ellis CTO/Principal Architect, Palomino Achievements: › Palomino Big Data Strategy. › Datawarehouse Cluster at Riot Games. › Designed/built back-end for Firefox Sync. › Led DB team at Digg.com. › Harassed the Reddit team at a party.
  • 7. Percona Live NYC 2012 7 MySQL/Hadoop Hybrid Datawarehouse Who am I? Tim Ellis CTO/Principal Architect, Palomino Achievements: › Palomino Big Data Strategy. › Datawarehouse Cluster at Riot Games. › Designed/built back-end for Firefox Sync. › Led DB team at Digg.com. › Harassed the Reddit team at a party. Ensured successful business for: › Digg, › Friendster,
  • 8. Percona Live NYC 2012 8 MySQL/Hadoop Hybrid Datawarehouse Who am I? Tim Ellis CTO/Principal Architect, Palomino Achievements: › Palomino Big Data Strategy. › Datawarehouse Cluster at Riot Games. › Designed/built back-end for Firefox Sync. › Led DB team at Digg.com. › Harassed the Reddit team at a party. Ensured successful business for: › Digg, › Friendster, › Mozilla, › StumbleUpon, › Riot Games (League of Legends).
  • 9. What Is This Talk? 9 Experiences of a High-Volume DBA I've built high-volume Datawarehouses, but am not well-versed in traditional Datawarehouse theory. Cube? Snowflake? Star?
  • 10. What Is This Talk? 10 Experiences of a High-Volume DBA I've built high-volume Datawarehouses, but am not well-versed in traditional Datawarehouse theory. Cube? Snowflake? Star? I'll win a bar bet, but would be fired from Oracle.
  • 11. What Is This Talk? 11 Experiences of a High-Volume DBA I've built high-volume Datawarehouses, but am not well-versed in traditional Datawarehouse theory. Cube? Snowflake? Star? I'll win a bar bet, but would be fired from Oracle. I've administered high-volume Datawarehouses and managed a large ETL rollout, but haven't written extensive ETLs or reports.
  • 12. What Is This Talk? 12 Experiences of a High-Volume DBA I've built high-volume Datawarehouses, but am not well-versed in traditional Datawarehouse theory. Cube? Snowflake? Star? I'll win a bar bet, but would be fired from Oracle. I've administered high-volume Datawarehouses and managed a large ETL rollout, but haven't written extensive ETLs or reports. A high-volume Datawarehouse necessarily differs in design from a low-volume one: typically simpler schemas, more complex queries.
  • 13. Why OSS? 13 Freedom at Scale == Economical Sense Selling OSS to Management used to be hard... › My query tools are limited. › The business users know DBMSx. › The documentation is lacking.
  • 14. Why OSS? 14 Freedom at Scale == Economical Sense Selling OSS to Management used to be hard... › My query tools are limited. › The business users know DBMSx. › The documentation is lacking. ...but then terascale happened one day.
  • 15. Why OSS? 15 Freedom at Scale == Economical Sense Selling OSS to Management used to be hard... › My query tools are limited. › The business users know DBMSx. › The documentation is lacking. ...but then terascale happened one day. › Adding 20TB costs HOW MUCH?! › Adding 30 machines costs HOW MUCH?!
  • 16. Why OSS? 16 Freedom at Scale == Economical Sense Selling OSS to Management used to be hard... › My query tools are limited. › The business users know DBMSx. › The documentation is lacking. ...but then terascale happened one day. › Adding 20TB costs HOW MUCH?! › Adding 30 machines costs HOW MUCH?! › How many sales calls before I push the release?
  • 17. Why OSS? 17 Freedom at Scale == Economical Sense Selling OSS to Management used to be hard... › My query tools are limited. › The business users know DBMSx. › The documentation is lacking. ...but then terascale happened one day. › Adding 20TB costs HOW MUCH?! › Adding 30 machines costs HOW MUCH?! › How many sales calls before I push the release? › I'll hire an entire team and still be more efficient.
  • 18. How to begin? 18 Take stock of the current system Establish a data flow: › Who's sending me data? › How much?
  • 19. How to begin? 19 Take stock of the current system Establish a data flow: › Who's sending me data? › How much? › What are the bottlenecks? › What's the current ETL process?
  • 20. How to begin? 20 Take stock of the current system Establish a data flow: › Who's sending me data? › How much? › What are the bottlenecks? › What's the current ETL process? We're looking for typical data flow characteristics: › Log data, write-mostly, free-form. › Looks tabular, “select * from table.” › Size: MB, GB or TB per hour?
  • 21. How to begin? 21 Take stock of the current system Establish a data flow: › Who's sending me data? › How much? › What are the bottlenecks? › What's the current ETL process? We're looking for typical data flow characteristics: › Log data, write-mostly, free-form. › Looks tabular, “select * from table.” › Size: MB, GB or TB per hour? › Who queries this data? How often?
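A minimal shell sketch of how one might size that inbound flow before committing to a design; the log path and file-naming pattern here are assumptions, not from the talk:

    # Rough MB-for-the-current-hour estimate for one log source (hypothetical path/pattern).
    du -sm /var/log/app/app-$(date +%Y%m%d-%H)*.log 2>/dev/null |
      awk '{mb += $1} END {printf "~%d MB so far this hour\n", mb}'

Run it a few times across the day and you know whether you are in MB, GB, or TB per hour territory.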
  • 22. What is Hadoop? 22 The Hadoop Ecosystem Hadoop Components: › HDFS: A filesystem across the whole cluster.
  • 23. What is Hadoop? 23 The Hadoop Ecosystem Hadoop Components: › HDFS: A filesystem across the whole cluster. › Hadoop: A map/reduce implementation.
  • 24. What is Hadoop? 24 The Hadoop Ecosystem Hadoop Components: › HDFS: A filesystem across the whole cluster. › Hadoop: A map/reduce implementation. › Hive: SQL→Map/Reduce converter.
  • 25. What is Hadoop? 25 The Hadoop Ecosystem Hadoop Components: › HDFS: A filesystem across the whole cluster. › Hadoop: A map/reduce implementation. › Hive: SQL→Map/Reduce converter. › HBase: A column store (and more).
  • 26. What is Hadoop? 26 The Hadoop Ecosystem Hadoop Components: › HDFS: A filesystem across the whole cluster. › Hadoop: A map/reduce implementation. › Hive: SQL→Map/Reduce converter. › HBase: A column store (and more). Most-interesting bits: › Hive lets business users formulate SQL!
  • 27. What is Hadoop? 27 The Hadoop Ecosystem Hadoop Components: › HDFS: A filesystem across the whole cluster. › Hadoop: A map/reduce implementation. › Hive: SQL→Map/Reduce converter. › HBase: A column store (and more). Most-interesting bits: › Hive lets business users formulate SQL! › HBase provides a distributed column store!
  • 28. What is Hadoop? 28 The Hadoop Ecosystem Hadoop Components: › HDFS: A filesystem across the whole cluster. › Hadoop: A map/reduce implementation. › Hive: SQL→Map/Reduce converter. › HBase: A column store (and more). Most-interesting bits: › Hive lets business users formulate SQL! › HBase provides a distributed column store! › HDFS provides massive I/O and redundancy.
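To make the HBase bullet concrete, here is a minimal hbase shell session; the table and column-family names are invented for illustration:

    # Create a table with one column family, write a cell, and scan it back.
    hbase shell <<'EOF'
    create 'events', 'd'
    put 'events', 'row1', 'd:clicks', '42'
    scan 'events'
    EOF

Everything in a column family is stored together, which is what makes columnar scans cheap.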
  • 29. Should You Use Hadoop? 29 Hadoop Strengths and Weaknesses Hadoop/HBase is good for: › Scan large chunks of your data every time.
  • 30. Should You Use Hadoop? 30 Hadoop Strengths and Weaknesses Hadoop/HBase is good for: › Scan large chunks of your data every time. › Apply a lot of cluster resource to a query.
  • 31. Should You Use Hadoop? 31 Hadoop Strengths and Weaknesses Hadoop/HBase is good for: › Scan large chunks of your data every time. › Apply a lot of cluster resource to a query. › Very large datasets, multiple tera/petabytes.
  • 32. Should You Use Hadoop? 32 Hadoop Strengths and Weaknesses Hadoop/HBase is good for: › Scan large chunks of your data every time. › Apply a lot of cluster resource to a query. › Very large datasets, multiple tera/petabytes. › With HBase, column store engine.
  • 33. Should You Use Hadoop? 33 Hadoop Strengths and Weaknesses Hadoop/HBase is good for: › Scan large chunks of your data every time. › Apply a lot of cluster resource to a query. › Very large datasets, multiple tera/petabytes. › With HBase, column store engine. Where Hadoop/HBase falls short: › Query iteration is typically minutes.
  • 34. Should You Use Hadoop? 34 Hadoop Strengths and Weaknesses Hadoop/HBase is good for: › Scan large chunks of your data every time. › Apply a lot of cluster resource to a query. › Very large datasets, multiple tera/petabytes. › With HBase, column store engine. Where Hadoop/HBase falls short: › Query iteration is typically minutes. › Administration is new and unusual.
  • 35. Should You Use Hadoop? 35 Hadoop Strengths and Weaknesses Hadoop/HBase is good for: › Scan large chunks of your data every time. › Apply a lot of cluster resource to a query. › Very large datasets, multiple tera/petabytes. › With HBase, column store engine. Where Hadoop/HBase falls short: › Query iteration is typically minutes. › Administration is new and unusual. › Hadoop still immature (some say “beta”).
  • 36. Should You Use Hadoop? 36 Hadoop Strengths and Weaknesses Hadoop/HBase is good for: › Scan large chunks of your data every time. › Apply a lot of cluster resource to a query. › Very large datasets, multiple tera/petabytes. › With HBase, column store engine. Where Hadoop/HBase falls short: › Query iteration is typically minutes. › Administration is new and unusual. › Hadoop still immature (some say “beta”). › Documentation is bad or non-existent.
  • 37. Should You Use MySQL? 37 MySQL Strengths and Weaknesses MySQL is good for: › Smaller datasets, typically gigabytes. › Indexing data automatically and quickly. › Short query iteration, even milliseconds. › Quick dataloads and processing with MyISAM.
  • 38. Should You Use MySQL? 38 MySQL Strengths and Weaknesses MySQL is good for: › Smaller datasets, typically gigabytes. › Indexing data automatically and quickly. › Short query iteration, even milliseconds. › Quick dataloads and processing with MyISAM. Where MySQL falls short: › Has no column store engine. › Documentation for datawarehousing minimal.
  • 39. Should You Use MySQL? 39 MySQL Strengths and Weaknesses MySQL is good for: › Smaller datasets, typically gigabytes. › Indexing data automatically and quickly. › Short query iteration, even milliseconds. › Quick dataloads and processing with MyISAM. Where MySQL falls short: › Has no column store engine. › Documentation for datawarehousing minimal. › You probably know better than I. Trust the DBA.
  • 40. Should You Use MySQL? 40 MySQL Strengths and Weaknesses MySQL is good for: › Smaller datasets, typically gigabytes. › Indexing data automatically and quickly. › Short query iteration, even milliseconds. › Quick dataloads and processing with MyISAM. Where MySQL falls short: › Has no column store engine. › Documentation for datawarehousing minimal. › You probably know better than I. Trust the DBA. › Be honest with management. If Vertica is better...
  • 41. MySQL/Hadoop Hybrid 41 Common Weaknesses So if you combine the weaknesses of these two technologies... what have you got? › No built-in end-user-friendly query tools.
  • 42. MySQL/Hadoop Hybrid 42 Common Weaknesses So if you combine the weaknesses of these two technologies... what have you got? › No built-in end-user-friendly query tools. › Immature technology – can crash sometimes.
  • 43. MySQL/Hadoop Hybrid 43 Common Weaknesses So if you combine the weaknesses of these two technologies... what have you got? › No built-in end-user-friendly query tools. › Immature technology – can crash sometimes. › Not too much documentation.
  • 44. MySQL/Hadoop Hybrid 44 Common Weaknesses So if you combine the weaknesses of these two technologies... what have you got? › No built-in end-user-friendly query tools. › Immature technology – can crash sometimes. › Not too much documentation. You'll need buy-in, savvy, and resilience from: › ETL/Datawarehouse developers,
  • 45. MySQL/Hadoop Hybrid 45 Common Weaknesses So if you combine the weaknesses of these two technologies... what have you got? › No built-in end-user-friendly query tools. › Immature technology – can crash sometimes. › Not too much documentation. You'll need buy-in, savvy, and resilience from: › ETL/Datawarehouse developers, › Business Users,
  • 46. MySQL/Hadoop Hybrid 46 Common Weaknesses So if you combine the weaknesses of these two technologies... what have you got? › No built-in end-user-friendly query tools. › Immature technology – can crash sometimes. › Not too much documentation. You'll need buy-in, savvy, and resilience from: › ETL/Datawarehouse developers, › Business Users, › Systems Administrators,
  • 47. MySQL/Hadoop Hybrid 47 Common Weaknesses So if you combine the weaknesses of these two technologies... what have you got? › No built-in end-user-friendly query tools. › Immature technology – can crash sometimes. › Not too much documentation. You'll need buy-in, savvy, and resilience from: › ETL/Datawarehouse developers, › Business Users, › Systems Administrators, › Management.
  • 48. Building a Hadoop Cluster 48 The NameNode Typical Reasons Clusters Fail: › Cascading failure (distributed fail) › Network outage (distributed fail) › Bad query executed (distributed fail)
  • 49. Building a Hadoop Cluster 49 The NameNode Typical Reasons Clusters Fail: › Cascading failure (distributed fail) › Network outage (distributed fail) › Bad query executed (distributed fail) › NameNode dies? (single point of failure)
  • 50. Building a Hadoop Cluster 50 The NameNode Typical Reasons Clusters Fail: › Cascading failure (distributed fail) › Network outage (distributed fail) › Bad query executed (distributed fail) NameNode failing is not a common failure case. Still, it's good to plan for it: › All critical filesystems on RAID 1+0
  • 51. Building a Hadoop Cluster 51 The NameNode Typical Reasons Clusters Fail: › Cascading failure (distributed fail) › Network outage (distributed fail) › Bad query executed (distributed fail) NameNode failing is not a common failure case. Still, it's good to plan for it: › All critical filesystems on RAID 1+0 › Redundant PSU
  • 52. Building a Hadoop Cluster 52 The NameNode Typical Reasons Clusters Fail: › Cascading failure (distributed fail) › Network outage (distributed fail) › Bad query executed (distributed fail) NameNode failing is not a common failure case. Still, it's good to plan for it: › All critical filesystems on RAID 1+0 › Redundant PSU › Redundant NICs to independent routers
  • 53. Building a Hadoop Cluster 53 Basic Cluster Node Configuration So much for the specialised hardware. All non-NameNode nodes in your cluster: › RAID-0 or even JBOD.
  • 54. Building a Hadoop Cluster 54 Basic Cluster Node Configuration So much for the specialised hardware. All non-NameNode nodes in your cluster: › RAID-0 or even JBOD. › More spindles: linux-1u.net has 8HDD in 1U.
  • 55. Building a Hadoop Cluster 55 Basic Cluster Node Configuration So much for the specialised hardware. All non-NameNode nodes in your cluster: › RAID-0 or even JBOD. › More spindles: linux-1u.net has 8HDD in 1U. › 7200rpm SATA nice, 15Krpm overkill.
  • 56. Building a Hadoop Cluster 56 Basic Cluster Node Configuration So much for the specialised hardware. All non-NameNode nodes in your cluster: › RAID-0 or even JBOD. › More spindles: linux-1u.net has 8HDD in 1U. › 7200rpm SATA nice, 15Krpm overkill. › Multiple TB of storage.
  • 57. Building a Hadoop Cluster 57 Basic Cluster Node Configuration So much for the specialised hardware. All non-NameNode nodes in your cluster: › RAID-0 or even JBOD. › More spindles: linux-1u.net has 8HDD in 1U. › 7200rpm SATA nice, 15Krpm overkill. › Multiple TB of storage. › 8-24GB RAM.
  • 58. Building a Hadoop Cluster 58 Basic Cluster Node Configuration So much for the specialised hardware. All non-NameNode nodes in your cluster: › RAID-0 or even JBOD. › More spindles: linux-1u.net has 8HDD in 1U. › 7200rpm SATA nice, 15Krpm overkill. › Multiple TB of storage. › 8-24GB RAM. › Good/fast network cards!
  • 59. Building a Hadoop Cluster 59 Basic Cluster Node Configuration So much for the specialised hardware. All non-NameNode nodes in your cluster: › RAID-0 or even JBOD. › More spindles: linux-1u.net has 8HDD in 1U. › 7200rpm SATA nice, 15Krpm overkill. › Multiple TB of storage. ←lots of this!!! › 8-24GB RAM. › Good/fast network cards! A DBA thinks “Database” == RAM. Likewise, “Hadoop Node” == disk spindles, disk storage, and network. You lose 2-3x storage to data replication.
  • 60. Building a Hadoop Cluster 60 Network and Rack Layout Network within a rack (top-of-rack switching): › Bandwidth for 30 machines going full-tilt.
  • 61. Building a Hadoop Cluster 61 Network and Rack Layout Network within a rack (top-of-rack switching): › Bandwidth for 30 machines going full-tilt. › Multiple TOR switches for redundancy. › Consider bridging.
  • 62. Building a Hadoop Cluster 62 Network and Rack Layout Network within a rack (top-of-rack switching): › Bandwidth for 30 machines going full-tilt. › Multiple TOR switches for redundancy. › Consider bridging. Network between racks (datacentre switching): › Inter-rack switches: better than 2Gbit desirable. › Hadoop rack awareness reduces inter-rack traffic.
  • 63. Building a Hadoop Cluster 63 Network and Rack Layout Network within a rack (top-of-rack switching): › Bandwidth for 30 machines going full-tilt. › Multiple TOR switches for redundancy. › Consider bridging. Network between racks (datacentre switching): › Inter-rack switches: better than 2Gbit desirable. › Hadoop rack awareness reduces inter-rack traffic. You need sharp networking employees on board to help build the cluster. Network instability can cause crashes.
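Rack awareness itself is cheap to set up: Hadoop calls a script you provide (wired up via the topology.script.file.name property in that era's configs) with node addresses and expects one rack path per address. A minimal sketch, with an invented subnet-to-rack mapping:

    #!/bin/sh
    # Hadoop passes one or more IPs/hostnames; print one rack path per argument.
    for node in "$@"; do
      case "$node" in
        10.0.1.*) echo "/dc1/rack1" ;;
        10.0.2.*) echo "/dc1/rack2" ;;
        *)        echo "/default-rack" ;;
      esac
    done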
  • 64. Building a Hadoop Cluster 64 Monitoring: Trending and Alerting Pick your graphing solution, and put stats into it. In doubt about which stats to graph?
  • 65. Building a Hadoop Cluster 65 Monitoring: Trending and Alerting Pick your graphing solution, and put stats into it. In doubt about which stats to graph? Try all of them. › Every Hadoop stat exposed via JMX. › Every HBase stat exposed via JMX. › All disk, CPU, RAM, network stats.
  • 66. Building a Hadoop Cluster 66 Monitoring: Trending and Alerting Pick your graphing solution, and put stats into it. In doubt about which stats to graph? Try all of them. › Every Hadoop stat exposed via JMX. › Every HBase stat exposed via JMX. › All disk, CPU, RAM, network stats. A possible solution: › Use collectd's JMX plugin to collect stats.
  • 67. Building a Hadoop Cluster 67 Monitoring: Trending and Alerting Pick your graphing solution, and put stats into it. In doubt about which stats to graph? Try all of them. › Every Hadoop stat exposed via JMX. › Every HBase stat exposed via JMX. › All disk, CPU, RAM, network stats. A possible solution: › Use collectd's JMX plugin to collect stats. › Put stats into Graphite. › Or Ganglia if you know how.
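If collectd is not an option, Hadoop daemons of this era also expose the same JMX stats over HTTP via a /jmx servlet, which a cron job can scrape into Graphite's plaintext port. A crude sketch; the host names are assumptions and the bean/attribute names vary by Hadoop version:

    # Scrape one HDFS stat from the NameNode and feed it to Graphite (port 2003).
    ts=$(date +%s)
    used=$(curl -s 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState' |
           sed -n 's/.*"CapacityUsed" *: *\([0-9]*\).*/\1/p')
    echo "hadoop.hdfs.capacity_used $used $ts" | nc graphite 2003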
  • 68. Building a Hadoop Cluster 68 Palomino Cluster Tool Use Configuration Management to build your cluster: › Ansible – easiest and quickest. › Opscode Chef – most popular, must love Ruby. › Puppet – most mature.
  • 69. Building a Hadoop Cluster 69 Palomino Cluster Tool Use Configuration Management to build your cluster: › Ansible – easiest and quickest. › Opscode Chef – most popular, must love Ruby. › Puppet – most mature. The Palomino Cluster Tool (open source on Github) uses the above tools to build a cluster for you: › Pre-written Configuration Management scripts.
  • 70. Building a Hadoop Cluster 70 Palomino Cluster Tool Use Configuration Management to build your cluster: › Ansible – easiest and quickest. › Opscode Chef – most popular, must love Ruby. › Puppet – most mature. The Palomino Cluster Tool (open source on Github) uses the above tools to build a cluster for you: › Pre-written Configuration Management scripts. › Sets up HDFS, Hadoop, HBase, Monitoring.
  • 71. Building a Hadoop Cluster 71 Palomino Cluster Tool Use Configuration Management to build your cluster: › Ansible – easiest and quickest. › Opscode Chef – most popular, must love Ruby. › Puppet – most mature. The Palomino Cluster Tool (open source on Github) uses the above tools to build a cluster for you: › Pre-written Configuration Management scripts. › Sets up HDFS, Hadoop, HBase, Monitoring. › In the future, will also set up alerting and backups.
  • 72. Building a Hadoop Cluster 72 Palomino Cluster Tool Use Configuration Management to build your cluster: › Ansible – easiest and quickest. › Opscode Chef – most popular, must love Ruby. › Puppet – most mature. The Palomino Cluster Tool (open source on Github) uses the above tools to build a cluster for you: › Pre-written Configuration Management scripts. › Sets up HDFS, Hadoop, HBase, Monitoring. › In the future, will also set up alerting and backups. › Also sets up MySQL+MHA, may be relevant?
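Purely as illustration of the workflow (the repository URL, inventory, and playbook names here are hypothetical, not the tool's documented interface):

    # Hypothetical invocation: fetch the tool, then apply a Hadoop playbook
    # to a host inventory with Ansible.
    git clone https://github.com/palominodb/PalominoClusterTool.git
    cd PalominoClusterTool
    ansible-playbook -i production.ini hadoop.yml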
  • 73. Running the Hadoop Cluster 73 Typical Problems Hadoop Clusters are Distributed Systems. › Network stressed? Reduce-heavy workload.
  • 74. Running the Hadoop Cluster 74 Typical Problems Hadoop Clusters are Distributed Systems. › Network stressed? Reduce-heavy workload. › CPUs stressed? Map-heavy workload.
  • 75. Running the Hadoop Cluster 75 Typical Problems Hadoop Clusters are Distributed Systems. › Network stressed? Reduce-heavy workload. › CPUs stressed? Map-heavy workload. › Disks stressed? Map-heavy workload.
  • 76. Running the Hadoop Cluster 76 Typical Problems Hadoop Clusters are Distributed Systems. › Network stressed? Reduce-heavy workload. › CPUs stressed? Map-heavy workload. › Disks stressed? Map-heavy workload. › RAM stressed? This is a DBMS after all!
  • 77. Running the Hadoop Cluster 77 Typical Problems Hadoop Clusters are Distributed Systems. › Network stressed? Reduce-heavy workload. › CPUs stressed? Map-heavy workload. › Disks stressed? Map-heavy workload. › RAM stressed? This is a DBMS after all! Watch your storage subsystems. › 120TB is a lot of disk space.
  • 78. Running the Hadoop Cluster 78 Typical Problems Hadoop Clusters are Distributed Systems. › Network stressed? Reduce-heavy workload. › CPUs stressed? Map-heavy workload. › Disks stressed? Map-heavy workload. › RAM stressed? This is a DBMS after all! Watch your storage subsystems. › 120TB is a lot of disk space. › Until you put in 120TB of data.
  • 79. Running the Hadoop Cluster 79 Typical Problems Hadoop Clusters are Distributed Systems. › Network stressed? Reduce-heavy workload. › CPUs stressed? Map-heavy workload. › Disks stressed? Map-heavy workload. › RAM stressed? This is a DBMS after all! Watch your storage subsystems. › 120TB is a lot of disk space. › Until you put in 120TB of data. › 400 spindles is a lot of IOPS.
  • 80. Running the Hadoop Cluster 80 Typical Problems Hadoop Clusters are Distributed Systems. › Network stressed? Reduce-heavy workload. › CPUs stressed? Map-heavy workload. › Disks stressed? Map-heavy workload. › RAM stressed? This is a DBMS after all! Watch your storage subsystems. › 120TB is a lot of disk space. › Until you put in 120TB of data. › 400 spindles is a lot of IOPS. › Until you query everything. Ten times.
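Two commands from that era's toolkit give a first look at both the space and the spindle problem; the paths are examples:

    # Cluster-wide capacity, used, and remaining, plus per-datanode detail.
    hadoop dfsadmin -report | head -n 20
    # Size of each loaded dataset, to see what is actually eating the 120TB.
    hadoop fs -du /user/etl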
  • 81. Running the Hadoop Cluster 81 Administration by Scientific Method What did we just learn...?
  • 82. Running the Hadoop Cluster 82 Administration by Scientific Method Hadoop Clusters are Distributed Systems! › Instability on system X? Could be Y's fault.
  • 83. Running the Hadoop Cluster 83 Administration by Scientific Method Hadoop Clusters are Distributed Systems! › Instability on system X? Could be Y's fault. › Temporal correlation of ERRORs across nodes.
  • 84. Running the Hadoop Cluster 84 Administration by Scientific Method Hadoop Clusters are Distributed Systems! › Instability on system X? Could be Y's fault. › Temporal correlation of ERRORs across nodes. › Correlation of WARNINGs and ERRORs.
  • 85. Running the Hadoop Cluster 85 Administration by Scientific Method Hadoop Clusters are Distributed Systems! › Instability on system X? Could be Y's fault. › Temporal correlation of ERRORs across nodes. › Correlation of WARNINGs and ERRORs. › Do log events correlate to graph anomalies?
  • 86. Running the Hadoop Cluster 86 Administration by Scientific Method Hadoop Clusters are Distributed Systems! › Instability on system X? Could be Y's fault. › Temporal correlation of ERRORs across nodes. › Correlation of WARNINGs and ERRORs. › Do log events correlate to graph anomalies? The Procedure: 1. Problems occurring on the cluster? 2. Formulate hypothesis from input (graphs/logs).
  • 87. Running the Hadoop Cluster 87 Administration by Scientific Method Hadoop Clusters are Distributed Systems! › Instability on system X? Could be Y's fault. › Temporal correlation of ERRORs across nodes. › Correlation of WARNINGs and ERRORs. › Do log events correlate to graph anomalies? The Procedure: 1. Problems occurring on the cluster? 2. Formulate hypothesis from input (graphs/logs). 3. Test hypothesis (tweak configurations).
  • 88. Running the Hadoop Cluster 88 Administration by Scientific Method Hadoop Clusters are Distributed Systems! › Instability on system X? Could be Y's fault. › Temporal correlation of ERRORs across nodes. › Correlation of WARNINGs and ERRORs. › Do log events correlate to graph anomalies? The Procedure: 1. Problems occurring on the cluster? 2. Formulate hypothesis from input (graphs/logs). 3. Test hypothesis (tweak configurations). 4. Go to 1. You're graphing EVERYTHING, right?
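A hedged sketch of the correlation step in shell, bucketing ERROR lines by host and hour so spikes line up visibly; the host list and log path are assumptions, and cut -c1-13 assumes the "YYYY-MM-DD HH" timestamp format shown on the next slides:

    # Count (host, hour) ERROR buckets across datanodes.
    for h in dn01 dn02 dn03; do
      ssh "$h" 'grep ERROR /var/log/hadoop/*.log' | cut -c1-13 | sed "s/^/$h /"
    done | sort | uniq -c | sort -rn | head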
  • 89. Running the Hadoop Cluster 89 Graphing your Logs You need to graph everything. How about graphing your logs?
  • 90. Running the Hadoop Cluster 90 Graphing your Logs You need to graph everything. How about graphing your logs? › grep ERROR | cut <date/hour part> | uniq -c 2012-07-29 06 15692 2012-07-29 07 30432 2012-07-29 08 76943 2012-07-29 09 54955 2012-07-29 10 15652
  • 91. Running the Hadoop Cluster 91 Graphing your Logs You need to graph everything. How about graphing your logs? › grep ERROR | cut <date/hour part> | uniq -c 2012-07-29 06 15692 2012-07-29 07 30432 2012-07-29 08 76943 2012-07-29 09 54955 2012-07-29 10 15652 That's close, but what if that's hundreds of lines? You can put the data into LibreOffice Calc, but that slows down the iteration cycle.
  • 92. Running the Hadoop Cluster 92 Graphing your Logs Graphing logs (terminal output) is easier with Palomino's terminal tool “distribution,” OSS on Github:
  • 93. Running the Hadoop Cluster 93 Graphing your Logs Graphing logs (terminal output) is easier with Palomino's terminal tool “distribution,” OSS on Github: › grep ERROR | cut <date/hour part> | distribution 2012-07-29 06|15692 ++++++++++ 2012-07-29 07|30432 +++++++++++++++++++ 2012-07-29 08|76943 ++++++++++++++++++++++++++++++++++++++++++++++++ 2012-07-29 09|54955 ++++++++++++++++++++++++++++++++++ 2012-07-29 10|15652 ++++++++++
  • 94. Running the Hadoop Cluster 94 Graphing your Logs Graphing logs (terminal output) is easier with Palomino's terminal tool “distribution,” OSS on Github: › grep ERROR | cut <date/hour part> | distribution 2012-07-29 06|15692 ++++++++++ 2012-07-29 07|30432 +++++++++++++++++++ 2012-07-29 08|76943 ++++++++++++++++++++++++++++++++++++++++++++++++ 2012-07-29 09|54955 ++++++++++++++++++++++++++++++++++ 2012-07-29 10|15652 ++++++++++ On a quick iteration cycle in the terminal, this is very useful. For presentation to the suits later, you can import the data into a prettier tool.
  • 95. Running the Hadoop Cluster 95 Graphing your Logs A real-life (MySQL) example: root@db49:/var/log/mysql# grep -i error error.log | cut -c 1-9 | distribution | sort -n (The error.log was about 2.5GB in size; cut keeps just the date/hour portion; distribution sorts by key frequency by default, but we want date/hour ordering, hence the sort -n.)
  • 96. Running the Hadoop Cluster 96 Graphing your Logs A real-life (MySQL) example: root@db49:/var/log/mysql# grep -i error error.log | cut -c 1-9 | distribution | sort -n Val |Ct (Pct) Histogram 120601 12|60 (46.15%) █████████████████████████████████████████████████████████▏ 120601 17|10 (7.69%) █████████▋ 120601 14|4 (3.08%) ███▉ 120602 14|2 (1.54%) ██ 120602 21|4 (3.08%) ███▉ 120610 13|2 (1.54%) ██ 120610 14|4 (3.08%) ███▉ 120611 14|2 (1.54%) ██ 120612 14|2 (1.54%) ██ 120613 14|2 (1.54%) ██ 120616 13|2 (1.54%) ██ 120630 14|5 (3.85%) ████▉ Obvious: Noon on June 1st was ugly.
  • 97. Running the Hadoop Cluster 97 Graphing your Logs A real-life (MySQL) example: root@db49:/var/log/mysql# grep -i error error.log | cut -c 1-9 | distribution | sort -n Val |Ct (Pct) Histogram 120601 12|60 (46.15%) █████████████████████████████████████████████████████████▏ 120601 17|10 (7.69%) █████████▋ 120601 14|4 (3.08%) ███▉ 120602 14|2 (1.54%) ██ 120602 21|4 (3.08%) ███▉ 120610 13|2 (1.54%) ██ 120610 14|4 (3.08%) ███▉ 120611 14|2 (1.54%) ██ 120612 14|2 (1.54%) ██ 120613 14|2 (1.54%) ██ 120616 13|2 (1.54%) ██ 120630 14|5 (3.85%) ████▉ Obvious: Noon on June 1st was ugly. But also: What keeps happening at 2pm?
  • 98. Building the MySQL Datawarehouse 98 Hardware Spec and Layout This is a typical OLAP role. › Fast non-transactional engine: MyISAM. › Data typically time-related: partition by date. › Data write-only or read-all? Archive engine. › Index-everything schemas.
  • 99. Building the MySQL Datawarehouse 99 Hardware Spec and Layout This is a typical OLAP role. › Fast non-transactional engine: MyISAM. › Data typically time-related: partition by date. › Data write-only or read-all? Archive engine. › Index-everything schemas. Typically beefier hardware is better. › Many spindles, many CPUs, much RAM. › Reasonably-fast network cards.
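A minimal sketch of such a table; all names and the partition grid are invented:

    # Date-partitioned, index-heavy MyISAM fact table, created via a heredoc.
    mysql dw <<'SQL'
    CREATE TABLE fact_events (
      event_date DATE NOT NULL,
      user_id    INT NOT NULL,
      action     VARCHAR(32) NOT NULL,
      KEY (user_id),
      KEY (action)
    ) ENGINE=MyISAM
    PARTITION BY RANGE (TO_DAYS(event_date)) (
      PARTITION p20120901 VALUES LESS THAN (TO_DAYS('2012-09-02')),
      PARTITION p20120902 VALUES LESS THAN (TO_DAYS('2012-09-03')),
      PARTITION pmax      VALUES LESS THAN MAXVALUE
    );
    SQL

Dropping an old partition is then a metadata operation rather than a giant DELETE.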
  • 100. ETL Framework 100 Getting Data into Hadoop Hadoop HDFS at its core is simply a filesystem.
  • 101. ETL Framework 101 Getting Data into Hadoop Hadoop HDFS at its core is simply a filesystem. › Copy straight in: “cat file | hdfs put <filename>”
  • 102. ETL Framework 102 Getting Data into Hadoop Hadoop HDFS at its core is simply a filesystem. › Copy straight in: “cat file | hdfs put <filename>” › From the network: “scp file | hdfs put <filename>”
  • 103. ETL Framework 103 Getting Data into Hadoop Hadoop HDFS at its core is simply a filesystem. › Copy straight in: “cat file | hdfs put <filename>” › From the network: “scp file | hdfs put <filename>” › Streaming: (Logs?)→Flume→HDFS.
  • 104. ETL Framework 104 Getting Data into Hadoop Hadoop HDFS at its core is simply a filesystem. › Copy straight in: “cat file | hdfs put <filename>” › From the network: “scp file | hdfs put <filename>” › Streaming: (Logs?)→Flume→HDFS. › Table loads: Sqoop (“select * into <hdfsFile>”).
  • 105. ETL Framework 105 Getting Data into Hadoop Hadoop HDFS at its core is simply a filesystem. › Copy straight in: “cat file | hdfs put <filename>” › From the network: “scp file | hdfs put <filename>” › Streaming: (Logs?)→Flume→HDFS. › Table loads: Sqoop (“select * into <hdfsFile>”). HBase is not as simple, but can be worth it. › Flume→HBase. › HBase column family == columnar scans. › Beware: no secondary indexes.
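Concrete, era-appropriate versions of those schematic commands (paths, hosts, and table names are examples; note that scp cannot actually pipe, so ssh plus cat handles the network case):

    # Local file into HDFS.
    hadoop fs -put /data/events.log /etl/raw/events.log
    # Remote file into HDFS without landing it locally ('-' reads stdin).
    ssh source-host 'cat /data/events.log' | hadoop fs -put - /etl/raw/events.log
    # Whole MySQL table into HDFS via Sqoop.
    sqoop import --connect jdbc:mysql://db1/dw --table orders --target-dir /etl/raw/orders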
• 113. ETL Framework: Notice when something is wrong
Don't skimp on ETL alerting! Start with the obvious:
› Yesterday TableX delta == 150k rows. Today 5k.
› Yesterday data loads were 120GB. Today 15GB.
› Yesterday “grep -ci error” == 1k. Today 20k.
› Yesterday “wc -l etllogs” == 700k. Today 10k.
› Yesterday ETL process == 8hrs. Today 1hr.
If you have time, get a bit more sophisticated:
› Yesterday TableX.ColY was int. Today varchar.
› Yesterday TableX.ColY compressed at 8x; today it compresses at 2x (or 32x?).
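Checks like these need nothing fancier than cron and coreutils. A minimal sketch of the wc -l comparison, assuming one ETL log per day and GNU date; the paths, naming scheme, and 10x threshold are all illustrative:

    #!/usr/bin/env bash
    # Compare today's ETL log volume against yesterday's; page on a big drop.
    today=$(wc -l < "/var/log/etl/etl-$(date +%F).log")
    yesterday=$(wc -l < "/var/log/etl/etl-$(date -d yesterday +%F).log")
    if [ "$today" -lt $((yesterday / 10)) ]; then
      echo "ETL log lines fell from $yesterday to $today" \
        | mail -s "ETL alert: log volume collapsed" oncall@example.com
    fi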
• 118. Getting Data Out: Hadoop Reporting Tools
The old-school method of retrieving data:
› select f(col) from table where ... group by ...
The NoSQL method of retrieving data:
› select f(col) from table where ... group by …
Hadoop includes Hive (a SQL→Map/Reduce converter). In my experience, dedicated business users can learn to use Hive with little extra training. But there is extra training!
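In practice that means HiveQL submitted from the shell and compiled to Map/Reduce jobs behind the scenes; the table and column names here are hypothetical:

    hive -e "
      SELECT dt, COUNT(*) AS errors
      FROM   etl_log
      WHERE  level = 'ERROR'
      GROUP  BY dt
      ORDER  BY dt;"

The extra training is mostly about what's missing: no indexes, full scans by default, and latency measured in minutes rather than milliseconds.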
• 122. Getting Data Out: Hadoop Reporting Tools
It's best if your business users have analytical mindsets, technical backgrounds, and no fear of the command line.
Hadoop reporting:
› Tools that submit SQL and receive tabular data.
› Tableau has a Hadoop connector.
Most of Hadoop's power is in Map/Reduce:
› Hive == SQL→Map/Reduce.
› RHadoop == R→Map/Reduce.
› HadoopStreaming == Anything→Map/Reduce.
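“Anything→Map/Reduce” is literal: Hadoop Streaming runs any executables that read stdin and write stdout as the mapper and reducer. A minimal sketch of a distributed word count built from coreutils; the streaming jar path varies by install and is an assumption:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -input  /etl/raw/app.log \
      -output /etl/out/wordcount \
      -mapper  'tr " " "\n"' \
      -reducer 'uniq -c'

The shuffle phase sorts mapper output by key, which is exactly what uniq -c needs: repeated words arrive adjacently.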
• 129. The Hybrid Datawarehouse: Putting it All Together
The Way I've Always Done It:
1. Identify a data flow overloading the current DW.
   › Typical == raw data into DW, then summarised.
2. Build a new parallel ETL into Hadoop.
3. Build ETLs Hadoop→current DW.
   › Typical == equivalent summaries to #1.
   › Once that works, shut off the old data flow.
4. Give everyone access to Hadoop.
   › They will think of cool new uses for the data.
5. Work through The Pain of #4.
   › It doesn't come free, but is worth the price.
6. Go to #1.
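For step 3, Sqoop works in the other direction too. A hedged sketch of pushing a Hive-built summary back into the MySQL DW; the connection details, table names, and delimiter are placeholders:

    sqoop export --connect jdbc:mysql://db49/dw --username etl -P \
      --table daily_summary --export-dir /etl/out/daily_summary \
      --input-fields-terminated-by '\t'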
• 134. The Hybrid Datawarehouse: Q&A
Questions? Some suggestions:
› What is the average airspeed of a laden sparrow?
› How can I hire you?
› No really, I have money, you have skills. Let's make this happen.
› Where's the coffee? I never thought I could be so sleepy.
Thank you! Email me if you desire. domain: palominodb.com – username: time
Percona Live NYC 2012