Honey, I Shrunk the Database
For Test and Development Environments
Vanessa Hurst, Paperless Post
Postgres Open, September 2011
User Data
Why Shrink?
  Accuracy: You don’t truly know how your app will behave in production unless you use real data. Production data is the ultimate in accuracy.
  Freshness: New data should be available regularly. Full database refreshes should be timely.
  Resource Limitations: Staging and developer machines cannot handle production load.
  Data Protection: Limit spread of sensitive user or client data.
Case Study: Paperless Post
  Requirements
    Freshness: daily, and on command for non-developers
    Shrinkage: slices, mutations
  Resources
    Source: extra disk space, RAM, and CPUs
    Destination: limited, often entirely un-optimized
    Development: constrained DBA resources
Shrink Strategies
  Copies: Restored backups or live replicas of entire production database
  Slices: Select portions of exact data
  Mutations: Sanitized, anonymized, or otherwise changed data
  Assumptions: Seed databases, fixtures, test data
Slices
  Vertical Slice: Difficult to obtain a valid, useful subset of data.
    Example: Include some entire tables, exclude others
  Horizontal Slice: Difficult to write and maintain.
    Example: SQL or application code to determine subset of data
PG Tools: Vertical Slice
Flexibility at Source (Production): pg_dump
  Include data only: -a, --data-only
  Include table schema only: -s, --schema-only
  Select tables: -t table, --table=table (repeat the option for each table)
  Select schemas: -n schema, --schema=schema
  Exclude schemas: -N schema, --exclude-schema=schema
PG Tools: Vertical Slice
Flexibility at Destination (Staging, Development): pg_restore
  Include data only: -a, --data-only
  Select indexes: -I index, --index=index
  Tune processing: -j number-of-jobs, --jobs=number-of-jobs
  Select schemas: -n schema, --schema=schema
  Select triggers: -T trigger, --trigger=trigger
  Exclude privileges: -x, --no-privileges, --no-acl
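The pg_dump and pg_restore options listed here can be combined into a dump-then-restore round trip. The following is a minimal sketch, not the talk's own script. It assumes a custom-format archive (`-Fc`, which is not on the slides but is what pg_restore reads), a hypothetical destination database `db-staging`, a hypothetical archive name `slice.dump`, and a `DRY_RUN` switch added here so that the commands are only printed unless it is set to 0.

```shell
#!/usr/bin/env bash
set -euo pipefail

# db-01 is the source database named in the case study.
# db-staging and slice.dump are hypothetical names for illustration.
SOURCE_DB="db-01"
DEST_DB="db-staging"
ARCHIVE="slice.dump"

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Source side: custom-format archive of the public schema.
run pg_dump -Fc --schema=public --file="$ARCHIVE" "$SOURCE_DB"

# Destination side: restore with 4 parallel jobs and skip GRANT/REVOKE.
run pg_restore --jobs=4 --schema=public --no-privileges --dbname="$DEST_DB" "$ARCHIVE"
```

Running the script as written only prints the two commands. Setting `DRY_RUN=0` executes them, which requires the PostgreSQL client tools and both databases to exist.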
Mutations
  External Data Protection
    HIPAA Regulations
    PCI Compliance
    API Terms of Use
  Internal Data Protection
    Protecting your users’ personal data
    Protecting your users from accidents, e.g. staging emails
    Your Terms of Service
User Data
Case Study: Paperless Post
Composite Slice including
  Vertical Slice: All application object schemas
    pg_dump --clean --schema-only --schema public db-01 > slice.sql
  Vertical Slice: Entire tables of static content
    pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  Horizontal Slice: Subset of users and their data
  Mutation: Changed user email addresses
Case Study: Paperless Post
Horizontal Slice
  CREATE SCHEMA staging;
  Custom SQL
    SELECT * INTO staging.users FROM users WHERE EXISTS (subset of users);
  Dynamic relative to full data set or newly created slice
    SELECT * INTO staging.stuff FROM stuff WHERE EXISTS (stuff per staging.users);
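The slides leave the selection criteria as placeholders. As a hedged illustration only, the script below writes a concrete version of the two queries to a file. The 30-day rule, the `created_at` column, and the `stuff.user_id` foreign key are assumptions made for this sketch and do not come from the talk.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the slice queries to a file; nothing is run against a database here.
cat > horizontal_slice.sql <<'SQL'
-- Hypothetical selection rule: keep users created in the last 30 days.
-- The created_at column is an assumption, not taken from the talk.
CREATE SCHEMA staging;

SELECT u.*
INTO staging.users
FROM users u
WHERE u.created_at >= now() - interval '30 days';

-- Dependent table, filtered against the new slice rather than the full set.
-- stuff.user_id is an assumed foreign key to users.id.
SELECT s.*
INTO staging.stuff
FROM stuff s
WHERE EXISTS (SELECT 1 FROM staging.users su WHERE su.id = s.user_id);
SQL

echo "wrote horizontal_slice.sql"
```

The generated file could then be applied on the source with `psql db-01 -f horizontal_slice.sql`. Filtering dependent tables against `staging.users` keeps every copied row attached to a user that is actually in the slice.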
Case Study: Paperless Post
Mutations
  Email Addresses: Use regular expressions to clean non-admin addresses, e.g. dude@gmail.com => staging+dudegmailcom@paperlesspost.com
  Cached Data: Clear cached short link from link-shortening API
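The email mutation can be sketched with a pair of regular-expression substitutions. This is a hypothetical helper, not the code used at Paperless Post. In particular, treating addresses already at paperlesspost.com as the "admin" addresses that are left alone is an assumption made for this example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: rewrite a user address into a staging-safe alias.
# Assumption: addresses already at paperlesspost.com count as admin
# addresses and pass through unchanged.
mutate_email() {
  case "$1" in
    *@paperlesspost.com)
      printf '%s\n' "$1"
      ;;
    *)
      # Drop every non-alphanumeric character, then wrap the remainder.
      printf '%s\n' "$1" |
        sed -E -e 's/[^A-Za-z0-9]//g' \
               -e 's/^(.*)$/staging+\1@paperlesspost.com/'
      ;;
  esac
}

mutate_email 'dude@gmail.com'   # prints staging+dudegmailcom@paperlesspost.com
```

In the pipeline itself the same substitution would more likely run in SQL against staging.users, for instance with PostgreSQL's regexp_replace, but the slides do not show that statement.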
Case Study: Paperless Post
Composite Slice, final step
  Horizontal Slice and Mutation: Subset of users and their data, with changed user email addresses
    pg_dump --data-only --schema staging db-01 >> slice.sql
Case Study: Paperless Post
  Rebuild
    Prepare new database as standby
    Gracefully close connections
    Rotate by renaming databases
  Security
    Dedicated database build user
    Membership in application user role
    Application user role & privileges remain
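The rotation step can be sketched as a short SQL file. The database names `db-staging` and `db-old` are hypothetical (`db-new` follows the rebuild command in the talk). Terminating leftover sessions is an assumption standing in for the "gracefully close connections" step, which the slides do not spell out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the rotation statements to a file; nothing is executed here.
cat > rotate.sql <<'SQL'
-- Run while connected to a different database (for example postgres):
-- a database cannot be renamed while sessions are connected to it.

-- Assumption: leftover sessions are terminated here. Pausing a connection
-- pooler first would be the gentler option.
-- The column is named pid on PostgreSQL 9.2 and later.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname IN ('db-staging', 'db-new')
  AND pid <> pg_backend_pid();

-- Swap the freshly built database into place.
ALTER DATABASE "db-staging" RENAME TO "db-old";
ALTER DATABASE "db-new" RENAME TO "db-staging";
SQL

echo "wrote rotate.sql"
```

Renaming rather than reloading in place keeps the old database available as `db-old` until the new one has been checked.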
Case Study: Paperless Post
Rebuild
  $ bzcat slice.sql.bz2 | psql db-new
  Staging schema has not been created, so all data loads to default schema
Case Study: Paperless Post
We hacked our rebuild by importing across schemas! Now our sequences are wrong, causing duplicate data errors every time we try to insert into tables.
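Secret Weapon

    -- Updates all serial sequences for ID columns only.
    -- The CREATE FUNCTION wrapper, return type, and DECLARE section are
    -- assumed; the slide shows only the body. The function name matches the
    -- update_id_sequences() call used in the rebuild step.
    CREATE OR REPLACE FUNCTION update_id_sequences() RETURNS void AS $$
    DECLARE
        table_record RECORD;
        table_name   text;
    BEGIN
        -- Every ordinary table that has a column named "id".
        FOR table_record IN
            SELECT pc.relname
            FROM pg_class pc
            WHERE pc.relkind = 'r'
              AND EXISTS (SELECT 1 FROM pg_attribute pa
                          WHERE pa.attname = 'id' AND pa.attrelid = pc.oid)
        LOOP
            table_name := table_record.relname::text;
            -- Move the sequence behind id to the current MAX(id).
            EXECUTE 'SELECT setval(pg_get_serial_sequence('
                || quote_literal(table_name) || ', ' || quote_literal('id')
                || '), MAX(id)) FROM ' || quote_ident(table_name)
                || ' WHERE EXISTS (SELECT 1 FROM ' || quote_ident(table_name) || ')';
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;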
Case Study: Paperless Post
Rebuild
  $ bzcat slice.sql.bz2 | psql db-new
  Staging schema has not been created, so all data loads to default schema
  echo "select 1 from update_id_sequences();" >> slice.sql
  Vacuum
  Reindex
Case Study: Paperless Post
Security
  Database build user
  CREATE DB privileges
  Member of Application user role
  Application user remains database owner
  Application user privileges remain limited
  Build only works in predetermined environments
Questions? Vanessa Hurst Paperless Post @DBNess Postgres Open, September 2011
More Tools
Copies: LVM Snapshots
  See talk by Jon Erdman at PG Conf EU
  Great for all reads
  Data stays virtualized & doesn’t take up space until changed
  Ideal for DDL changes without actual data changes
More Tools
Copies, Slices: pg_staging by dimitri
  http://github.com/dimitri/pg_staging
  Simple: pauses pgbouncer & restores backup
  Efficient: leverages bulk loading
  Flexible: supports varying psql files
  Custom: limited
Slices: replicate by rtomayko of GitHub
  http://github.com/rtomayko/replicate
  Simple: preserves object relations via ActiveRecord
  Inefficient: creates text-based .dump
  Inflexible: corrupts id sequences on data insert
  Custom: highly

Honey I Shrunk the Database

  • 1. Honey, I Shrunk the Database For Test and Development Environments Vanessa Hurst Paperless Post Postgres Open, September 2011
  • 2.
  • 4. Why Shrink? Accuracy You don’t truly know how your app will behave in production unless you use real data. Production data is the ultimate in accuracy.
  • 5. Why Shrink? Accuracy Freshness New data should be available regularly. Full database refreshes should be timely.
  • 6. Why Shrink? Accuracy Freshness Resource Limitations Staging and developer machines cannot handle production load.
  • 7. Why Shrink? Accuracy Freshness Resource Limitations Data Protection Limit spread of sensitive user or client data.
  • 8. Why Shrink? Accuracy Freshness Resource Limitations Data Protection
  • 9. Case Study: Paperless Post Requirements Freshness – Daily, On command for non-developers Shrinkage – Slices, Mutations
  • 10. Case Study: Paperless Post Requirements Freshness – Daily, On command for non-developers Shrinkage – Slices, Mutations Resources Source – extra disk space, RAM, and CPUs Destination – limited, often entirely un-optimized Development -- constrained DBA resources
  • 11. Shrink Strategies Copies Restored backups or live replicas of entire production database
  • 12. Shrink Strategies Copies Slices Select portions of exact data
  • 13. Shrink Strategies Copies Slices Mutations Sanitized, anonymized, or otherwise changed data
  • 14. Shrink Strategies Copies Slices Mutations Assumptions Seed databases, fixtures, test data
  • 15. Shrink Strategies Copies Slices Mutations Assumptions
  • 16. Slices Vertical Slice Difficult to obtain a valid, useful subset of data. Example: Include some entire tables, exclude others
  • 17. Slices Vertical Slice Difficult to obtain a valid, useful subset of data. Example: Include some entire tables, exclude others Horizontal Slice Difficult to write and maintain. Example: SQL or application code to determine subset of data
  • 18. PG Tools – Vertical Slice: Flexibility at Source (Production) with pg_dump
      Include data only [-a, --data-only]
      Include table schema only [-s, --schema-only]
      Select tables [-t table, --table=table] (repeat -t for each table)
      Select schemas [-n schema, --schema=schema]
      Exclude schemas [-N schema, --exclude-schema=schema]
  • 19. PG Tools – Vertical Slice: Flexibility at Destination (Staging, Development) with pg_restore
      Include data only [-a, --data-only]
      Select indexes [-I index, --index=index]
      Tune processing [-j number-of-jobs, --jobs=number-of-jobs]
      Select schemas [-n schema, --schema=schema]
      Select triggers [-T trigger, --trigger=trigger]
      Exclude privileges [-x, --no-privileges, --no-acl]
  • 20.
  • 21. Mutations External Data Protection HIPAA Regulations PCI Compliance API Terms of Use
  • 22. Mutations External Data Protection HIPAA Regulations PCI Compliance API Terms of Use Internal Data Protection Protecting your users’ personal data Protecting your users from accidents, e.g. staging emails Your Terms of Service
  • 24. Case Study: Paperless Post Composite Slice including
      Vertical Slice – All application object schemas
      Vertical Slice – Entire tables of static content
      Horizontal Slice – Subset of users and their data
      Mutation – Changed user email addresses
  • 25. Case Study: Paperless Post Composite Slice including
      Vertical Slice – All application object schemas
        pg_dump --clean --schema-only --schema public db-01 > slice.sql
  • 26. Case Study: Paperless Post Composite Slice including
      Vertical Slice – All application object schemas
        pg_dump --clean --schema-only --schema public db-01 > slice.sql
      Vertical Slice – Entire tables of static content
        pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  • 27. Case Study: Paperless Post Composite Slice including
      Vertical Slice – All application object schemas
        pg_dump --clean --schema-only --schema public db-01 > slice.sql
      Vertical Slice – Entire tables of static content
        pg_dump --data-only --schema public -t cards db-01 >> slice.sql
      Horizontal Slice – Subset of users and their data
      Mutation – Changed user email addresses
  • 28. Case Study: Paperless Post CREATE SCHEMA staging;
  • 29. Case Study: Paperless Post Horizontal Slice: Custom SQL
        SELECT * INTO staging.users
        FROM users
        WHERE EXISTS (subset of users);
  • 30. Case Study: Paperless Post Horizontal Slice: Custom SQL
        SELECT * INTO staging.users
        FROM users
        WHERE EXISTS (subset of users);
      Dynamic relative to full data set or newly created slice
        SELECT * INTO staging.stuff
        FROM stuff
        WHERE EXISTS (stuff per staging.users);
  • 31. Case Study: Paperless Post
      Horizontal Slice: Custom SQL, dynamic relative to full data set or newly created slice
      Mutations
        Email Addresses: Use regular expressions to clean non-admin addresses, e.g. dude@gmail.com => staging+dudegmailcom@paperlesspost.com
        Cached Data: Clear cached short link from link-shortening API
  • 32. Case Study: Paperless Post Composite Slice including
      Vertical Slice – All application object schemas
        pg_dump --clean --schema-only --schema public db-01 > slice.sql
      Vertical Slice – Entire tables of static content
        pg_dump --data-only --schema public -t cards db-01 >> slice.sql
      Horizontal Slice – Subset of users and their data
      Mutation – Changed user email addresses
        pg_dump --data-only --schema staging db-01 >> slice.sql
  • 33. Case Study: Paperless Post Rebuild Prepare new database as standby Gracefully close connections Rotate by renaming databases Security Dedicated database build user Membership in application user role Application user role & privileges remain
  • 34. Case Study: Paperless Post Rebuild
        $ bzcat slice.sql.bz2 | psql db-new
      Staging schema has not been created, so all data loads to default schema
  • 35. Case Study: Paperless Post We hacked our rebuild by importing across schemas! Now our sequences are wrong, causing duplicate data errors every time we try to insert into tables.
  • 36. Secret Weapon
        -- Updates all serial sequences for ID columns only
        BEGIN
          FOR table_record IN
            SELECT pc.relname FROM pg_class pc
            WHERE pc.relkind = 'r'
              AND EXISTS (SELECT 1 FROM pg_attribute pa
                          WHERE pa.attname = 'id' AND pa.attrelid = pc.oid)
          LOOP
            table_name = table_record.relname::text;
            EXECUTE 'SELECT setval(pg_get_serial_sequence('
              || quote_literal(table_name) || ', ' || quote_literal('id')::text
              || '), MAX(id)) FROM ' || table_name
              || ' WHERE EXISTS (SELECT 1 FROM ' || table_name || ')';
          END LOOP;
  • 37. Case Study: Paperless Post Rebuild
        $ bzcat slice.sql.bz2 | psql db-new
      Staging schema has not been created, so all data loads to default schema
        echo "select 1 from update_id_sequences();" >> slice.sql
      Vacuum
      Reindex
  • 38. Case Study: Paperless Post Security Database build user CREATE DB privileges Member of Application user role Application user remains database owner Application user privileges remain limited Build only works in predetermined environments
  • 39. Case Study: Paperless Post Requirements Freshness – Daily, On command for non-developers Shrinkage – Slices, Mutations Resources Source – extra disk space, RAM, and CPUs Destination – limited, often entirely un-optimized Development -- constrained DBA resources
  • 40. Questions? Vanessa Hurst Paperless Post @DBNess Postgres Open, September 2011
  • 41. More Tools: Copies -- LVM Snapshots
      See talk by Jon Erdman at PG Conf EU
      Great for all reads
      Data stays virtualized & doesn’t take up space until changed
      Ideal for DDL changes without actual data changes
  • 42. More Tools
      Copies, Slices -- pg_staging by dimitri, http://github.com/dimitri/pg_staging
        Simple -- pauses pgbouncer & restores backup
        Efficient -- leverages bulk loading
        Flexible -- supports varying psql files
        Custom -- limited
      Slices -- replicate by rtomayko of GitHub, http://github.com/rtomayko/replicate
        Simple -- preserves object relations via ActiveRecord
        Inefficient -- creates text-based .dump
        Inflexible -- corrupts id sequences on data insert
        Custom -- highly
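
The horizontal slice and email mutation from slides 29 to 31 can be sketched end to end. This is a minimal illustration, not the Paperless Post build: it uses Python with an in-memory SQLite database instead of Postgres, so `staging_users` and `staging_stuff` stand in for the `staging` schema, an id cutoff stands in for "subset of users", and the `is_admin` and `user_id` columns are assumptions made for the example. Only the email mapping itself (dude@gmail.com => staging+dudegmailcom@paperlesspost.com) comes from the slides.

```python
import re
import sqlite3


def mutate_email(address: str) -> str:
    """Map a real address to a staging-safe one, e.g.
    dude@gmail.com -> staging+dudegmailcom@paperlesspost.com."""
    # Keep only letters and digits, as in the slide's example.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", address)
    return f"staging+{cleaned}@paperlesspost.com"


def build_slice(conn: sqlite3.Connection, max_user_id: int) -> None:
    """Copy a subset of users and their dependent rows, then mutate emails."""
    conn.create_function("mutate_email", 1, mutate_email)
    # Empty copies with the same columns (SQLite has no SELECT ... INTO).
    conn.execute("CREATE TABLE staging_users AS SELECT * FROM users WHERE 0")
    conn.execute("CREATE TABLE staging_stuff AS SELECT * FROM stuff WHERE 0")
    # Horizontal slice: the id cutoff is a hypothetical "subset of users".
    conn.execute(
        "INSERT INTO staging_users SELECT * FROM users WHERE id <= ?",
        (max_user_id,),
    )
    # Dependent data is chosen relative to the new slice, not the full set.
    conn.execute(
        "INSERT INTO staging_stuff SELECT * FROM stuff "
        "WHERE EXISTS (SELECT 1 FROM staging_users u WHERE u.id = stuff.user_id)"
    )
    # Mutation: rewrite non-admin addresses only.
    conn.execute(
        "UPDATE staging_users SET email = mutate_email(email) WHERE is_admin = 0"
    )


def demo() -> sqlite3.Connection:
    """Build a tiny made-up data set and slice it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, is_admin INTEGER);
        CREATE TABLE stuff (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
        INSERT INTO users VALUES
            (1, 'admin@paperlesspost.com', 1),
            (2, 'dude@gmail.com', 0),
            (3, 'other@example.com', 0);
        INSERT INTO stuff VALUES (10, 2, 'card'), (11, 3, 'card'), (12, 1, 'card');
        """
    )
    build_slice(conn, max_user_id=2)
    return conn


if __name__ == "__main__":
    conn = demo()
    for row in conn.execute("SELECT id, email FROM staging_users ORDER BY id"):
        print(row)
```

Selecting dependent rows against the already-sliced users table keeps the slice internally consistent, which is the point of the "dynamic relative to newly created slice" note on slide 30. In Postgres the same shape would use the `staging` schema and `SELECT ... INTO` as shown on the slides.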

Editor's Notes

  1. I am Vanessa Hurst and I lead Data and Analytics at Paperless Post, a customizable online stationery startup in New York. I studied Computer Science and Systems and Information Engineering at the University of Virginia. I have experience in databases ranging from CMSes of a few hundred megabytes for non-profits to terabytes of financial data and high-traffic consumer websites. I've worked in data processing, product development, and business intelligence. I am a happy open-source convert and the lone data wrangler in a land of web developers using Ruby on Rails.
  2. Static Data
  3. This may include external, legal regulations or internal regulations such as terms of service. Data protection can also include mitigating risk or proactively screening before data is even available.
      HIPAA Regulations
      PCI Compliance
      API Terms of Use
  4. Any other reasons?
  5. Requirements
      Slice -- significantly less space, power, & memory in staging and dev environments, need smaller data set
      Mutation -- protect user data in highly personal communications, prevent staging use of customer emails
      Daily Refresh
     Resources
      Source -- production server w/ample space, power, & memory
      Destination -- weak, shared staging infrastructure across several servers, local machine development infrastructure
      Expertise -- flexible infrastructure automation tools, many application developers, limited DBA time
  7. Quick vocabulary. Backup & restore, trigger-based replication: there are plenty of options that are all straightforward, but they don’t give you a lot of leeway on resources.
  8. Most common case
  9. If you’re doing Business Intelligence, you need a copy of your production database. Figure it out.
  10. Vertical -- difficult to keep data valid & usable; valid units of space are not always valid in an application, e.g. WAL logs, Pages 1-16 => smaller, finite size, not usable
      Horizontal -- requires application logic, highly customized but usable, e.g. Users with ids 1-50, Users who joined before July 4, Users who are admins, any SQL logic
  12. http://www.postgresql.org/docs/current/static/app-pgdump.html
      Options to:
        Dump OIDs in case your app uses them
        Leave out ownership commands (if staging environments run as different users)
  14. Static Data
  15. Dedicated schema preserves all table, index, sequence names, etc
  16. Only the build process is staging-specific, all other privileges and settings match production
  20. Requirements
      Slice -- significantly less space, power, & memory in staging and dev environments, need smaller data set
      Mutation -- protect user data in highly personal communications, prevent staging use of customer emails
      Daily Refresh
      Resources
      Source -- production server w/ample space, power, & memory
      Destination -- weak, shared staging infrastructure across several servers, local machine development infrastructure
      Expertise -- flexible infrastructure automation tools, many application developers, limited DBA time
  21. http://github.com/rtomayko/replicate