SlideShare uma empresa Scribd logo
1 de 33
Baixar para ler offline
Big Bad PostgreSQL: A Case Study


                 Moving a
                 “large,”
          “complicated,” and
           mission-critical
           datawarehouse
              from Oracle
            to PostgreSQL
           for cost control.



    1
Monday, August 2, 2010
Who am I? @postwait on twitter


                         Author of “Scalable Internet Architectures”
                         Pearson, ISBN: 067232699X

                         CEO of OmniTI
                         We build scalable and secure web applications

                         I am an Engineer
                         A practitioner of academic computing.
                         IEEE member and Senior ACM member.
                         On the Editorial Board of ACM’s Queue magazine.

                         I work on/with a lot of Open Source software:
                         Apache, perl, Linux, Solaris, PostgreSQL,
                         Varnish, Spread, Reconnoiter, etc.

                         I have experience.
                         I’ve had the unique opportunity to watch a great many catastrophes.
                         I enjoy immersing myself in the pathology of architecture failures.




Monday, August 2, 2010
Overall Architecture


                                                    Oracle 8i
                                                                                          OLTP instance:
                                          0.5 TB              0.25 TB
                                                                                          drives the site
                                          Hitachi              JBOD


                                                                  OLTP




       Log import and                                                                           Oracle 8i




       processing
                                                              Oracle 8i

                                                                                                    0.75 TB
                                                                                                     JBOD
                           MySQL
                         log importer               0.5 TB                1.5 TB
                                                    Hitachi                MTI              OLTP warm backup


                                                                                                               Warm spare
                               1.2 TB
                               SATA                             Datawarehouse
                               RAID


                           Log Importer


                                                                           MySQL 4.1




                                                                                 1.2 TB
                                                                               IDE RAID


                                                                            Data Exporter




                                 bulk selects / data exports
Monday, August 2, 2010
Database Situation

         •      The problems:
               •  The database is growing.
               •  The OLTP and ODS/warehouse are too slow.
               •  A lot of application code against the OLTP system.
               •  Minimal application code against the ODS system.
         •      Oracle:
               •  Licensed per processor.
               •  Really, really, really expensive on a large scale.
         •      PostgreSQL:
               •  No licensing costs.
               •  Good support for complex queries.


Monday, August 2, 2010
Database Choices



                  •      Must keep Oracle on OLTP
                    •      Complex, Oracle-specific web application.
                    •      Need more processors.
                  •      ODS: Oracle not required.
                    •      Complex queries from limited sources.
                    •      Needs more space and power.
                  •      Result:
                    •      Move ODS Oracle licenses to OLTP
                    •      Run PostgreSQL on ODS




Monday, August 2, 2010
PostgreSQL gotchas



                    •    For an OLTP system that does thousands of
                         updates per second, vacuuming is a hassle.

                    •    No upgrades?!

                         •   pg_migrator is coming along.

                    •    Less community experience with large
                         databases.

                    •    Replication features less evolved

                         •   though PostgreSQL 9.0 ups the ante



Monday, August 2, 2010
PostgreSQL ♥ ODS




                  •      Mostly inserts.

                  •      Updates/Deletes controlled, not real-time.

                  •      pl/perl (leverage DBI/DBD for remote
                         database connectivity).

                  •      Monster queries.

                  •      Extensible.




Monday, August 2, 2010
Choosing Linux



                    •    Popular, liked, good community support.

                    •    Chronic problems:

                         •   kernel panics

                         •   filesystems remounting read-only

                         •   filesystems don’t support snapshots

                         •   LVM is clunky on enterprise storage

                         •   20 outages in 4 months



Monday, August 2, 2010
Choosing Solaris 10

                    •    Switched to Solaris 10

                         •   No crashes, better system-level tools.

                             •   prstat, iostat, vmstat, smf, fault-
                                 management.

                         •   ZFS

                             •   snapshots (persistent), BLI backups.

                         •   Excellent support for enterprise storage.

                         •   DTrace.

                         •   Free (too).


Monday, August 2, 2010
Oracle features we need




                    •    Partitioning

                    •    Statistics and Aggregations

                         •   rank over partition, lead, lag, etc.

                    •    Large selects (100GB)

                    •    Autonomous transactions

                    •    Replication from Oracle (to Oracle)




Monday, August 2, 2010
Partitioning



                 For large data sets:
                   pgods=# select count(1) from ods.ods_tblpick_super;
                      count
                   ------------
                    3247286017
                   (1 row)




                    • Next biggest tables: billions
                    • Allows us to cluster data over specific ranges
                      (by date in our case)
                    • Simple, cheap archiving and removal of data.
                    • Can put ranges used less often in different
                         tablespaces (slower, cheaper storage)


Monday, August 2, 2010
Partitioning PostgreSQL style



           •       PostgreSQL doesn’t support partition...

           •       It supports inheritance... (what’s this?)

                 •       some crazy object-relation paradigm.

           •       We can use it to implement partitioning:

                 •       One master table with no rows.

                 •       Child tables that have our partition constraints.

                 •       Rules on the master table for insert/update/delete.



Monday, August 2, 2010
Partitioning PostgreSQL realized

             •       Cheaply add new empty partitions

             •       Cheaply remove old partitions

             •       Migrate less-often-accessed partitions to slower
                     storage

             •       Different indexes strategies per partition

             •       PostgreSQL >8.1 supports constraint checking on
                     inherited tables.
                   •     smarter planning

                   •     smarter executing



Monday, August 2, 2010
RANK OVER PARTITION



                    • In Oracle:
              select userid, email from (
              !   !    select u.userid, u.email,
              !   !    row_number() over
                               (partition by u.email order by userid desc) as position
              !   !    from (...)) where position = 1



                    • In PostgreSQL:
              FOR v_row IN select u.userid, u.email from (...) order by email, userid desc
              LOOP
              !    IF v_row.email != v_last_email THEN
              !    !   RETURN NEXT v_row;
              !    !   v_last_email := v_row.email;
              !    !   v_rownum := v_rownum + 1;
              !    END IF;
              END LOOP;

                                 With 8.4, we have windowing functions

Monday, August 2, 2010
Large SELECTs


                 • Application code does:
              select u.*, b.browser, m.lastmess
                from ods.ods_users u,
                     ods.ods_browsers b,
                     ( select userid, min(senddate) as senddate
                          from ods.ods_maillog
                      group by userid ) m,
                     ods.ods_maillog l
               where u.userid = b.userid
                 and u.userid = m.userid
                 and u.userid = l.userid
                 and l.senddate = m.senddate;




                    •    The width of these rows is about 2k

                    •    50 million row return set

                    •    > 100 GB of data

Monday, August 2, 2010
The Large SELECT Problem


                 •       libpq will buffer the entire result in memory.

                         •   This affects language bindings (DBD::Pg).

                         •   This is an utterly deficient default behavior.

                 •       This can be avoided by using cursors

                         •   Requires the app to be PostgreSQL specific.

                         •   You open a cursor.

                         •   Then FETCH the row count you desire.




Monday, August 2, 2010
Big SELECTs the Postgres way



           The previous “big” query becomes:
              DECLARE CURSOR bigdump FOR
              select u.*, b.browser, m.lastmess
                from ods.ods_users u,
                     ods.ods_browsers b,
                     ( select userid, min(senddate) as senddate
                          from ods.ods_maillog
                      group by userid ) m,
                     ods.ods_maillog l
               where u.userid = b.userid
                 and u.userid = m.userid
                 and u.userid = l.userid
                 and l.senddate = m.senddate;


           Then, in a loop:
              FETCH FORWARD 10000 FROM bigdump;




Monday, August 2, 2010
Autonomous Transactions



         • In Oracle we have over 2000 custom stored procedures.
         • During these procedures, we like to:
           • COMMIT incrementally
                    Useful for long transactions (update/delete) that
                    need not be atomic -- incremental COMMITs.

               • start a new top-level txn that can COMMIT
                    Useful for logging progress in a stored procedure so
                    that you know how far you progessed and how long
                    each step took even if it rolls back.



Monday, August 2, 2010
PostgreSQL shortcoming




                    •    PostgreSQL simply does not support
                         Autonomous transactions and to quote core
                         developers “that would be hard.”

                    •    When in doubt, use brute force.

                    •    Use pl/perl to use DBD::Pg to connect to
                         ourselves (a new backend) and execute a new
                         top-level transaction.




Monday, August 2, 2010
Replication


             • Cross vendor database replication isn’t too difficult.
             • Helps a lot when you can do it inside the database.
             • Using dbi-link (based on pl/perl and DBI) we can.
               • We can connect to any remote database.
               • INSERT into local tables directly from remote
                         SELECT statements.
                         [snapshots]
                   • LOOP over remote SELECT statements and
                         process them row-by-row.
                         [replaying remote DML logs]


Monday, August 2, 2010
Replication (really)



          •      Through a combination of snapshotting and DML
                 replay we:

                •        replicate over into over 2000 tables in PostgreSQL
                         from Oracle

                     •     snapshot replication of 200

                     •     DML replay logs for 1800

          •      PostgreSQL to Oracle is a bit harder

                •        out-of-band export and imports



Monday, August 2, 2010
New Architecture


            •      Master: Sun v890 and Hitachi AMS + warm standby
                     running Oracle
                     (1TB)

            •      Logs: several customs
                     running MySQL instances
                     (2TB each)

            •      ODS BI: 2x Sun v40
                     running PostgreSQL 8.3
                     (6TB on Sun JBODs on ZFS each)

            •      ODS archive: 2x custom
                     running PostgreSQL 8.3
                     (14TB internal storage on ZFS each)




Monday, August 2, 2010
PostgreSQL is Lacking




             •      pg_dump is too intrusive.

             •      Poor system-level instrumentation.

             •      Poor methods to determine specific contention.

             •      It relies on the operating system’s filesystem cache.
                    (which make PostgreSQL inconsistent across it’s
                    supported OS base)




Monday, August 2, 2010
Enter Solaris

        •       Solaris is a UNIX from Sun Microsystems.

        •       Is it different than other UNIX/UNIX-like systems?

              •          Mostly it isn’t different (hence the term UNIX)

              •          It does have extremely strong ABI backward
                         compatibility.

              •          It’s stable and works well on large machines.

        •       Solaris 10 shakes things up a bit:
              •          DTrace

              •          ZFS

              •          Zones


Monday, August 2, 2010
Solaris / ZFS


                    •    ZFS: Zettaback Filesystem.

                         •   264 snapshots, 248 files/directory, 264 bytes/filesystem,
                             278 (256 ZiB) bytes in a pool, 264 devices/pool, 264 pools/system

                    •    Extremely cheap differential backups.

                         •   I have a 5 TB database, I need a backup!

                    •    No rollback in your database? What is this? MySQL?

                    •    No rollback in your filesystem?

                         •   ZFS has snapshots, rollback, clone and promote.

                         •   OMG! Life altering features.

                    •    Caveat: ZFS is slower than alternatives, by about 10% with tuning.




Monday, August 2, 2010
Solaris / Zones



                    •    Zones: Virtual Environments.

                    •    Shared kernel.

                    •    Can share filesystems.

                    •    Segregated processes and privileges.

                    •    No big deal for databases, right?


                                          But Wait!


Monday, August 2, 2010
Solaris / ZFS + Zones = Magic Juju
             https://labs.omniti.com/trac/pgsoltools/browser/trunk/pitr_clone/clonedb_startclone.sh

    •       ZFS snapshot, clone, delegate to zone, boot and run.

    •       When done, halt zone, destroy clone.

    •       We get a point-in-time copy of our PostgreSQL database:

          •      read-write,

          •      low disk-space requirements,

          •      NO LOCKS! Welcome back pg_dump,
                 you don’t suck (as much) anymore.

          •      Fast snapshot to usable copy time:

                •        On our 20 GB database: 1 minute.

                •        On our 1.2 TB database: 2 minutes.

Monday, August 2, 2010
ZFS: how I saved my soul.

        •      Database crash. Bad. 1.2 TB of data... busted.
               The reason Robert Treat looks a bit older than he
               should.

        •      xlogs corrupted. catalog indexes corrupted.

        •      Fault? PostgreSQL bug? Bad memory? Who knows?

        •      Trial & error on a 1.2 TB data set is a cruel experience.

             •       In real-life, most recovery actions are destructive
                     actions.

             •       PostgreSQL is no different.

        •      Rollback to last checkpoint (ZFS), hack postgres code,
               try, fail, repeat.

Monday, August 2, 2010
Let DTrace open your eyes

       •       DTrace: Dynamic Tracing

       •       Dynamically instrument “stuff” in the system:
             •      system calls (like strace/truss/ktrace).

             •      process/scheduler activity (on/off cpu, semaphores, conditions).

             •      see signals sent and received.

             •      trace kernel functions, networking.

             •      watch I/O down to the disk.

             •      user-space processes, each function... each machine instruction!

             •      Add probes into apps where it makes sense to you.



Monday, August 2, 2010
Can you see what I see?

                    •    There is EXPLAIN... when that isn’t enough...

                    •    There is EXPLAIN ANALYZE... when that isn’t enough.

                    •    There is DTrace.

                         ; dtrace -q -n ‘
                         postgresql*:::statement-start
                         {
                            self->query = copyinstr(arg0);
                            self->ok=1;
                         }
                         io:::start
                         /self->ok/
                         {
                            @[self->query,
                              args[0]->b_flags & B_READ ? "read" : "write",
                              args[1]->dev_statname] = sum(args[0]->b_bcount);
                         }’
                         dtrace: description 'postgres*:::statement-start' matched 14 probes
                         ^C

                         select count(1) from c2w_ods.tblusers where zipcode between 10000 and 11000;
                             read sd1 16384
                         select division, sum(amount), avg(amount) from ods.billings where txn_timestamp
                         between ‘2006-01-01 00:00:00’ and ‘2006-04-01 00:00:00’ group by division;
                             read sd2 71647232




Monday, August 2, 2010
OmniTI Labs / pgtreats

                    •    https://labs.omniti.com/labs/pgtreats

                         •   Where we stick out PostgreSQL goodies...

                         •   like pg_file_stress


                         FILENAME/DBOBJECT                             READS                    WRITES
                                                             #   min    avg    max      #   min    avg   max
            alldata1__idx_remove_domain_external             1    12     12     12    398     0      0     0
            slowdata1__pg_rewrite                            1    12     12     12      0     0      0     0
            slowdata1__pg_class_oid_index                    1     0      0      0      0     0      0     0
            slowdata1__pg_attribute                          2     0      0      0      0     0      0     0
            alldata1__mv_users                               0     0      0      0      4     0      0     0
            slowdata1__pg_statistic                          1     0      0      0      0     0      0     0
            slowdata1__pg_index                              1     0      0      0      0     0      0     0
            slowdata1__pg_index_indexrelid_index             1     0      0      0      0     0      0     0
            alldata1__remove_domain_external                 0     0      0      0    502     0      0     0
            alldata1__promo_15_tb_full_2                    19     0      0      0     11     0      0     0
            slowdata1__pg_class_relname_nsp_index            2     0      0      0      0     0      0     0
            alldata1__promo_177intaoltest_tb                 0     0      0      0   1053     0      0     0
            slowdata1__pg_attribute_relid_attnum_index       2     0      0      0      0     0      0     0
            alldata1__promo_15_tb_full_2_pk                  2     0      0      0      0     0      0     0
            alldata1__all_mailable_2                      1403     0      0    423      0     0      0     0
            alldata1__mv_users_pkey                          0     0      0      0      4     0      0     0




Monday, August 2, 2010
Results




                    •    Move ODS Oracle licenses to OLTP

                    •    Run PostgreSQL on ODS

                    •    Save $800k in license costs.

                    •    Spend $100k in labor costs.

                    •    Learn a lot.




Monday, August 2, 2010
Thanks!



                    •    Thank you.

                    •    http://omniti.com/does/postgresql

                    •    We’re hiring, but only if you love:

                         •   lots of data on lots of disks on lots of big boxes

                         •   smart people

                         •   hard problems

                         •   more than one database technology (including PostgreSQL)

                         •   responsibility




Monday, August 2, 2010

Mais conteúdo relacionado

Destaque

Innovation Through “Trusted” Open Source Solutions
Innovation Through “Trusted” Open Source SolutionsInnovation Through “Trusted” Open Source Solutions
Innovation Through “Trusted” Open Source Solutions
Joshua L. Davis
 
Mil-OSS @ 47th Annual AOC Convention
Mil-OSS @ 47th Annual AOC ConventionMil-OSS @ 47th Annual AOC Convention
Mil-OSS @ 47th Annual AOC Convention
Joshua L. Davis
 

Destaque (14)

Ignite: Improving Performance on Federal Contracts Using Scrum & Agile
Ignite: Improving Performance on Federal Contracts Using Scrum & AgileIgnite: Improving Performance on Federal Contracts Using Scrum & Agile
Ignite: Improving Performance on Federal Contracts Using Scrum & Agile
 
Ignite: YSANAOYOA
Ignite: YSANAOYOAIgnite: YSANAOYOA
Ignite: YSANAOYOA
 
The Enterprise Guide to Drupal for Gov 2.0
The Enterprise Guide to Drupal for Gov 2.0The Enterprise Guide to Drupal for Gov 2.0
The Enterprise Guide to Drupal for Gov 2.0
 
The Open Source Movement
The Open Source MovementThe Open Source Movement
The Open Source Movement
 
Homeland Open Security Technologies (HOST)
Homeland Open Security Technologies (HOST)Homeland Open Security Technologies (HOST)
Homeland Open Security Technologies (HOST)
 
Ignite: Hackin' Excel with Ruby
Ignite: Hackin' Excel with RubyIgnite: Hackin' Excel with Ruby
Ignite: Hackin' Excel with Ruby
 
OSSIM and OMAR in the DoD/IC
OSSIM and OMAR in the DoD/ICOSSIM and OMAR in the DoD/IC
OSSIM and OMAR in the DoD/IC
 
Senior Leaders Adapting to Social Technologies
Senior Leaders Adapting to Social TechnologiesSenior Leaders Adapting to Social Technologies
Senior Leaders Adapting to Social Technologies
 
The Next Generation Open IDS Engine Suricata and Emerging Threats
The Next Generation Open IDS Engine Suricata and Emerging ThreatsThe Next Generation Open IDS Engine Suricata and Emerging Threats
The Next Generation Open IDS Engine Suricata and Emerging Threats
 
Barcamp: Open Source and Security
Barcamp: Open Source and SecurityBarcamp: Open Source and Security
Barcamp: Open Source and Security
 
Innovation Through “Trusted” Open Source Solutions
Innovation Through “Trusted” Open Source SolutionsInnovation Through “Trusted” Open Source Solutions
Innovation Through “Trusted” Open Source Solutions
 
Advancing open source geospatial software for the do d ic edward pickle openg...
Advancing open source geospatial software for the do d ic edward pickle openg...Advancing open source geospatial software for the do d ic edward pickle openg...
Advancing open source geospatial software for the do d ic edward pickle openg...
 
Mil-OSS @ 47th Annual AOC Convention
Mil-OSS @ 47th Annual AOC ConventionMil-OSS @ 47th Annual AOC Convention
Mil-OSS @ 47th Annual AOC Convention
 
DISA's Open Source Corporate Management Information System (OSCMIS)
DISA's Open Source Corporate Management Information System (OSCMIS)DISA's Open Source Corporate Management Information System (OSCMIS)
DISA's Open Source Corporate Management Information System (OSCMIS)
 

Semelhante a Big Bad PostgreSQL: BI on a Budget

Gaelyk - SpringOne2GX - 2010 - Guillaume Laforge
Gaelyk - SpringOne2GX - 2010 - Guillaume LaforgeGaelyk - SpringOne2GX - 2010 - Guillaume Laforge
Gaelyk - SpringOne2GX - 2010 - Guillaume Laforge
Guillaume Laforge
 
mogpres
mogpresmogpres
mogpres
xlight
 
MySQL overview
MySQL overviewMySQL overview
MySQL overview
Marco Tusa
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
Alluxio, Inc.
 
Analysis Software Benchmark
Analysis Software BenchmarkAnalysis Software Benchmark
Analysis Software Benchmark
Akira Shibata
 
Vizuri Exadata East Coast Users Conference
Vizuri Exadata East Coast Users ConferenceVizuri Exadata East Coast Users Conference
Vizuri Exadata East Coast Users Conference
Isaac Christoffersen
 

Semelhante a Big Bad PostgreSQL: BI on a Budget (20)

Gaelyk - SpringOne2GX - 2010 - Guillaume Laforge
Gaelyk - SpringOne2GX - 2010 - Guillaume LaforgeGaelyk - SpringOne2GX - 2010 - Guillaume Laforge
Gaelyk - SpringOne2GX - 2010 - Guillaume Laforge
 
Oracleonoracle dec112012
Oracleonoracle dec112012Oracleonoracle dec112012
Oracleonoracle dec112012
 
Alluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata ServicesAlluxio - Scalable Filesystem Metadata Services
Alluxio - Scalable Filesystem Metadata Services
 
mogpres
mogpresmogpres
mogpres
 
Long and winding road - 2014
Long and winding road  - 2014Long and winding road  - 2014
Long and winding road - 2014
 
MySQL overview
MySQL overviewMySQL overview
MySQL overview
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
 
Oracle en Entel Summit 2010
Oracle en Entel Summit 2010Oracle en Entel Summit 2010
Oracle en Entel Summit 2010
 
All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...
All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...
All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...
 
Exadata 12c New Features RMOUG
Exadata 12c New Features RMOUGExadata 12c New Features RMOUG
Exadata 12c New Features RMOUG
 
Scalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBScalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDB
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
 
Analysis Software Benchmark
Analysis Software BenchmarkAnalysis Software Benchmark
Analysis Software Benchmark
 
PDoolan Oracle Overview
PDoolan Oracle OverviewPDoolan Oracle Overview
PDoolan Oracle Overview
 
mogpres
mogpresmogpres
mogpres
 
Introduction to TokuDB v7.5 and Read Free Replication
Introduction to TokuDB v7.5 and Read Free ReplicationIntroduction to TokuDB v7.5 and Read Free Replication
Introduction to TokuDB v7.5 and Read Free Replication
 
Vizuri Exadata East Coast Users Conference
Vizuri Exadata East Coast Users ConferenceVizuri Exadata East Coast Users Conference
Vizuri Exadata East Coast Users Conference
 
Sanger HPC infrastructure Report (2007)
Sanger HPC infrastructure  Report (2007)Sanger HPC infrastructure  Report (2007)
Sanger HPC infrastructure Report (2007)
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 

Mais de Joshua L. Davis

Mais de Joshua L. Davis (16)

Ignite: Devops - Why Should You Care
Ignite: Devops - Why Should You CareIgnite: Devops - Why Should You Care
Ignite: Devops - Why Should You Care
 
Using the Joomla CMI in the Army Hosting Environment
Using the Joomla CMI in the Army Hosting EnvironmentUsing the Joomla CMI in the Army Hosting Environment
Using the Joomla CMI in the Army Hosting Environment
 
Open Source Software (OSS/FLOSS) and Security
Open Source Software (OSS/FLOSS) and SecurityOpen Source Software (OSS/FLOSS) and Security
Open Source Software (OSS/FLOSS) and Security
 
SOSCOE Overview
SOSCOE OverviewSOSCOE Overview
SOSCOE Overview
 
milSuite
milSuitemilSuite
milSuite
 
Importance of WS-Addressing and WS-Reliability in DoD Enterprises
Importance of WS-Addressing and WS-Reliability in DoD EnterprisesImportance of WS-Addressing and WS-Reliability in DoD Enterprises
Importance of WS-Addressing and WS-Reliability in DoD Enterprises
 
OZONE & OWF: A Community-wide GOTS initiative and its transition to GOSS
OZONE & OWF: A Community-wide GOTS initiative and its transition to GOSSOZONE & OWF: A Community-wide GOTS initiative and its transition to GOSS
OZONE & OWF: A Community-wide GOTS initiative and its transition to GOSS
 
Title TBD: "18 hundred seconds"
Title TBD: "18 hundred seconds"Title TBD: "18 hundred seconds"
Title TBD: "18 hundred seconds"
 
Reaching It's Potential: How to Make Government-Developed OSS A Major Player
Reaching It's Potential: How to Make Government-Developed OSS A Major PlayerReaching It's Potential: How to Make Government-Developed OSS A Major Player
Reaching It's Potential: How to Make Government-Developed OSS A Major Player
 
USIP Open Simulation Platform
USIP Open Simulation PlatformUSIP Open Simulation Platform
USIP Open Simulation Platform
 
CONNECT: An Open Source Platform for Promoting Military Health
CONNECT: An Open Source Platform for Promoting Military HealthCONNECT: An Open Source Platform for Promoting Military Health
CONNECT: An Open Source Platform for Promoting Military Health
 
CompanyCommand & PlatoonLeader Forums and MilSuite
CompanyCommand & PlatoonLeader Forums and MilSuiteCompanyCommand & PlatoonLeader Forums and MilSuite
CompanyCommand & PlatoonLeader Forums and MilSuite
 
.org to .com: Going from Project to Product
.org to .com: Going from Project to Product.org to .com: Going from Project to Product
.org to .com: Going from Project to Product
 
Cyber Challenges in a Hierarchical Culture
Cyber Challenges in a Hierarchical CultureCyber Challenges in a Hierarchical Culture
Cyber Challenges in a Hierarchical Culture
 
An Approach to Building & Maintaining a STIG'D RHEL Server
An Approach to Building & Maintaining a STIG'D RHEL ServerAn Approach to Building & Maintaining a STIG'D RHEL Server
An Approach to Building & Maintaining a STIG'D RHEL Server
 
How and Why Python is Used in the Model of Real-World Battlefield Scenarios
How and Why Python is Used in the Model of Real-World Battlefield ScenariosHow and Why Python is Used in the Model of Real-World Battlefield Scenarios
How and Why Python is Used in the Model of Real-World Battlefield Scenarios
 

Último

Último (20)

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

Big Bad PostgreSQL: BI on a Budget

  • 1. Big Bad PostgreSQL: A Case Study Moving a “large,” “complicated,” and mission-critical datawarehouse from Oracle to PostgreSQL for cost control. 1 Monday, August 2, 2010
  • 2. Who am I? @postwait on twitter Author of “Scalable Internet Architectures” Pearson, ISBN: 067232699X CEO of OmniTI We build scalable and secure web applications I am an Engineer A practitioner of academic computing. IEEE member and Senior ACM member. On the Editorial Board of ACM’s Queue magazine. I work on/with a lot of Open Source software: Apache, perl, Linux, Solaris, PostgreSQL, Varnish, Spread, Reconnoiter, etc. I have experience. I’ve had the unique opportunity to watch a great many catastrophes. I enjoy immersing myself in the pathology of architecture failures. Monday, August 2, 2010
  • 3. Overall Architecture Oracle 8i OLTP instance: 0.5 TB 0.25 TB drives the site Hitachi JBOD OLTP Log import and Oracle 8i processing Oracle 8i 0.75 TB JBOD MySQL log importer 0.5 TB 1.5 TB Hitachi MTI OLTP warm backup Warm spare 1.2 TB SATA Datawarehouse RAID Log Importer MySQL 4.1 1.2 TB IDE RAID Data Exporter bulk selects / data exports Monday, August 2, 2010
  • 4. Database Situation • The problems: • The database is growing. • The OLTP and ODS/warehouse are too slow. • A lot of application code against the OLTP system. • Minimal application code against the ODS system. • Oracle: • Licensed per processor. • Really, really, really expensive on a large scale. • PostgreSQL: • No licensing costs. • Good support for complex queries. Monday, August 2, 2010
  • 5. Database Choices • Must keep Oracle on OLTP • Complex, Oracle-specific web application. • Need more processors. • ODS: Oracle not required. • Complex queries from limited sources. • Needs more space and power. • Result: • Move ODS Oracle licenses to OLTP • Run PostgreSQL on ODS Monday, August 2, 2010
  • 6. PostgreSQL gotchas • For an OLTP system that does thousands of updates per second, vacuuming is a hassle. • No upgrades?! • pg_migrator is coming along. • Less community experience with large databases. • Replication features less evolved • though PostgreSQL 9.0 ups the ante Monday, August 2, 2010
  • 7. PostgreSQL ♥ ODS • Mostly inserts. • Updates/Deletes controlled, not real-time. • pl/perl (leverage DBI/DBD for remote database connectivity). • Monster queries. • Extensible. Monday, August 2, 2010
  • 8. Choosing Linux • Popular, liked, good community support. • Chronic problems: • kernel panics • filesystems remounting read-only • filesystems don’t support snapshots • LVM is clunky on enterprise storage • 20 outages in 4 months Monday, August 2, 2010
  • 9. Choosing Solaris 10 • Switched to Solaris 10 • No crashes, better system-level tools. • prstat, iostat, vmstat, smf, fault- management. • ZFS • snapshots (persistent), BLI backups. • Excellent support for enterprise storage. • DTrace. • Free (too). Monday, August 2, 2010
  • 10. Oracle features we need • Partitioning • Statistics and Aggregations • rank over partition, lead, lag, etc. • Large selects (100GB) • Autonomous transactions • Replication from Oracle (to Oracle) Monday, August 2, 2010
  • 11. Partitioning For large data sets: pgods=# select count(1) from ods.ods_tblpick_super; count ------------ 3247286017 (1 row) • Next biggest tables: billions • Allows us to cluster data over specific ranges (by date in our case) • Simple, cheap archiving and removal of data. • Can put ranges used less often in different tablespaces (slower, cheaper storage) Monday, August 2, 2010
  • 12. Partitioning PostgreSQL style • PostgreSQL doesn’t support partition... • It supports inheritance... (what’s this?) • some crazy object-relation paradigm. • We can use it to implement partitioning: • One master table with no rows. • Child tables that have our partition constraints. • Rules on the master table for insert/update/delete. Monday, August 2, 2010
  • 13. Partitioning PostgreSQL realized • Cheaply add new empty partitions • Cheaply remove old partitions • Migrate less-often-accessed partitions to slower storage • Different indexes strategies per partition • PostgreSQL >8.1 supports constraint checking on inherited tables. • smarter planning • smarter executing Monday, August 2, 2010
  • 14. RANK OVER PARTITION • In Oracle: select userid, email from ( ! ! select u.userid, u.email, ! ! row_number() over (partition by u.email order by userid desc) as position ! ! from (...)) where position = 1 • In PostgreSQL: FOR v_row IN select u.userid, u.email from (...) order by email, userid desc LOOP ! IF v_row.email != v_last_email THEN ! ! RETURN NEXT v_row; ! ! v_last_email := v_row.email; ! ! v_rownum := v_rownum + 1; ! END IF; END LOOP; With 8.4, we have windowing functions Monday, August 2, 2010
  • 15. Large SELECTs • Application code does: select u.*, b.browser, m.lastmess from ods.ods_users u, ods.ods_browsers b, ( select userid, min(senddate) as senddate from ods.ods_maillog group by userid ) m, ods.ods_maillog l where u.userid = b.userid and u.userid = m.userid and u.userid = l.userid and l.senddate = m.senddate; • The width of these rows is about 2k • 50 million row return set • > 100 GB of data Monday, August 2, 2010
  • 16. The Large SELECT Problem • libpq will buffer the entire result in memory. • This affects language bindings (DBD::Pg). • This is an utterly deficient default behavior. • This can be avoided by using cursors • Requires the app to be PostgreSQL specific. • You open a cursor. • Then FETCH the row count you desire. Monday, August 2, 2010
  • 17. Big SELECTs the Postgres way The previous “big” query becomes: DECLARE CURSOR bigdump FOR select u.*, b.browser, m.lastmess from ods.ods_users u, ods.ods_browsers b, ( select userid, min(senddate) as senddate from ods.ods_maillog group by userid ) m, ods.ods_maillog l where u.userid = b.userid and u.userid = m.userid and u.userid = l.userid and l.senddate = m.senddate; Then, in a loop: FETCH FORWARD 10000 FROM bigdump; Monday, August 2, 2010
  • 18. Autonomous Transactions • In Oracle we have over 2000 custom stored procedures. • During these procedures, we like to: • COMMIT incrementally Useful for long transactions (update/delete) that need not be atomic -- incremental COMMITs. • start a new top-level txn that can COMMIT Useful for logging progress in a stored procedure so that you know how far you progessed and how long each step took even if it rolls back. Monday, August 2, 2010
  • 19. PostgreSQL shortcoming • PostgreSQL simply does not support Autonomous transactions and to quote core developers “that would be hard.” • When in doubt, use brute force. • Use pl/perl to use DBD::Pg to connect to ourselves (a new backend) and execute a new top-level transaction. Monday, August 2, 2010
  • 20. Replication • Cross vendor database replication isn’t too difficult. • Helps a lot when you can do it inside the database. • Using dbi-link (based on pl/perl and DBI) we can. • We can connect to any remote database. • INSERT into local tables directly from remote SELECT statements. [snapshots] • LOOP over remote SELECT statements and process them row-by-row. [replaying remote DML logs] Monday, August 2, 2010
  • 21. Replication (really) • Through a combination of snapshotting and DML replay we: • replicate over into over 2000 tables in PostgreSQL from Oracle • snapshot replication of 200 • DML replay logs for 1800 • PostgreSQL to Oracle is a bit harder • out-of-band export and imports Monday, August 2, 2010
  • 22. New Architecture • Master: Sun v890 and Hitachi AMS + warm standby running Oracle (1TB) • Logs: several customs running MySQL instances (2TB each) • ODS BI: 2x Sun v40 running PostgreSQL 8.3 (6TB on Sun JBODs on ZFS each) • ODS archive: 2x custom running PostgreSQL 8.3 (14TB internal storage on ZFS each) Monday, August 2, 2010
  • 23. PostgreSQL is Lacking • pg_dump is too intrusive. • Poor system-level instrumentation. • Poor methods to determine specific contention. • It relies on the operating system’s filesystem cache. (which make PostgreSQL inconsistent across it’s supported OS base) Monday, August 2, 2010
  • 24. Enter Solaris • Solaris is a UNIX from Sun Microsystems. • Is it different than other UNIX/UNIX-like systems? • Mostly it isn’t different (hence the term UNIX) • It does have extremely strong ABI backward compatibility. • It’s stable and works well on large machines. • Solaris 10 shakes things up a bit: • DTrace • ZFS • Zones Monday, August 2, 2010
  • 25. Solaris / ZFS • ZFS: Zettaback Filesystem. • 264 snapshots, 248 files/directory, 264 bytes/filesystem, 278 (256 ZiB) bytes in a pool, 264 devices/pool, 264 pools/system • Extremely cheap differential backups. • I have a 5 TB database, I need a backup! • No rollback in your database? What is this? MySQL? • No rollback in your filesystem? • ZFS has snapshots, rollback, clone and promote. • OMG! Life altering features. • Caveat: ZFS is slower than alternatives, by about 10% with tuning. Monday, August 2, 2010
  • 26. Solaris / Zones • Zones: Virtual Environments. • Shared kernel. • Can share filesystems. • Segregated processes and privileges. • No big deal for databases, right? But Wait! Monday, August 2, 2010
  • 27. Solaris / ZFS + Zones = Magic Juju https://labs.omniti.com/trac/pgsoltools/browser/trunk/pitr_clone/clonedb_startclone.sh • ZFS snapshot, clone, delegate to zone, boot and run. • When done, halt zone, destroy clone. • We get a point-in-time copy of our PostgreSQL database: • read-write, • low disk-space requirements, • NO LOCKS! Welcome back pg_dump, you don’t suck (as much) anymore. • Fast snapshot to usable copy time: • On our 20 GB database: 1 minute. • On our 1.2 TB database: 2 minutes. Monday, August 2, 2010
  • 28. ZFS: how I saved my soul. • Database crash. Bad. 1.2 TB of data... busted. The reason Robert Treat looks a bit older than he should. • xlogs corrupted. catalog indexes corrupted. • Fault? PostgreSQL bug? Bad memory? Who knows? • Trial & error on a 1.2 TB data set is a cruel experience. • In real-life, most recovery actions are destructive actions. • PostgreSQL is no different. • Rollback to last checkpoint (ZFS), hack postgres code, try, fail, repeat. Monday, August 2, 2010
  • 29. Let DTrace open your eyes • DTrace: Dynamic Tracing • Dynamically instrument “stuff” in the system: • system calls (like strace/truss/ktrace). • process/scheduler activity (on/off cpu, semaphores, conditions). • see signals sent and received. • trace kernel functions, networking. • watch I/O down to the disk. • user-space processes, each function... each machine instruction! • Add probes into apps where it makes sense to you. Monday, August 2, 2010
  • 30. Can you see what I see? • There is EXPLAIN... when that isn’t enough... • There is EXPLAIN ANALYZE... when that isn’t enough. • There is DTrace. ; dtrace -q -n ‘ postgresql*:::statement-start { self->query = copyinstr(arg0); self->ok=1; } io:::start /self->ok/ { @[self->query, args[0]->b_flags & B_READ ? "read" : "write", args[1]->dev_statname] = sum(args[0]->b_bcount); }’ dtrace: description 'postgres*:::statement-start' matched 14 probes ^C select count(1) from c2w_ods.tblusers where zipcode between 10000 and 11000; read sd1 16384 select division, sum(amount), avg(amount) from ods.billings where txn_timestamp between ‘2006-01-01 00:00:00’ and ‘2006-04-01 00:00:00’ group by division; read sd2 71647232 Monday, August 2, 2010
  • 31. OmniTI Labs / pgtreats • https://labs.omniti.com/labs/pgtreats • Where we stick out PostgreSQL goodies... • like pg_file_stress FILENAME/DBOBJECT READS WRITES # min avg max # min avg max alldata1__idx_remove_domain_external 1 12 12 12 398 0 0 0 slowdata1__pg_rewrite 1 12 12 12 0 0 0 0 slowdata1__pg_class_oid_index 1 0 0 0 0 0 0 0 slowdata1__pg_attribute 2 0 0 0 0 0 0 0 alldata1__mv_users 0 0 0 0 4 0 0 0 slowdata1__pg_statistic 1 0 0 0 0 0 0 0 slowdata1__pg_index 1 0 0 0 0 0 0 0 slowdata1__pg_index_indexrelid_index 1 0 0 0 0 0 0 0 alldata1__remove_domain_external 0 0 0 0 502 0 0 0 alldata1__promo_15_tb_full_2 19 0 0 0 11 0 0 0 slowdata1__pg_class_relname_nsp_index 2 0 0 0 0 0 0 0 alldata1__promo_177intaoltest_tb 0 0 0 0 1053 0 0 0 slowdata1__pg_attribute_relid_attnum_index 2 0 0 0 0 0 0 0 alldata1__promo_15_tb_full_2_pk 2 0 0 0 0 0 0 0 alldata1__all_mailable_2 1403 0 0 423 0 0 0 0 alldata1__mv_users_pkey 0 0 0 0 4 0 0 0 Monday, August 2, 2010
  • 32. Results • Move ODS Oracle licenses to OLTP • Run PostgreSQL on ODS • Save $800k in license costs. • Spend $100k in labor costs. • Learn a lot. Monday, August 2, 2010
  • 33. Thanks! • Thank you. • http://omniti.com/does/postgresql • We’re hiring, but only if you love: • lots of data on lots of disks on lots of big boxes • smart people • hard problems • more than one database technology (including PostgreSQL) • responsibility Monday, August 2, 2010