SlideShare uma empresa Scribd logo
1 de 43
Baixar para ler offline
Yandex.Mail success story
Vladimir Borodin, DBA
One of the largest internet companies in Europe
57+% of all search traffic in Russia
Ukraine, Kazakhstan, Belarus and Turkey
https://yandex.com/company/technologies
About 6000 employees all over the world
3
About Yandex
Launched in 2000
10+ million users daily
200.000 RPS to web/mobile/imap backends
150+ million incoming letters daily
20+ PB of data
4
About Yandex.Mail
Migration from Oracle to PostgreSQL
300+ TB of metadata without redundancy
250k requests per second
OLTP with 80% reads, 20% writes
Previous attempts
MySQL
Self-written DBMS
5
About this talk
6
What is mail metadata?
7
Back in 2012
Everything stored in Oracle
Lots of PL/SQL logic
Efficient hardware usage
10+ TB per shard
Working LA 100
Lots of manual operations
Warm (SSD) and cold (SATA) databases for different users
75% SSD, 25% SATA
9
Yandex.Mail metadata
10
Sharding and fault tolerance
11
Inside the backend
12
Reality
PL/SQL deploy
Library cache
Lots of manual operations
Switchover, new DB setup, data transfer between shards
Only synchronous interface in OCCI
Problems with development environments
Not very responsive support
13
Most common problems
The main reason
shop.oracle.com
Timeline
Oct 2012 — the willful decision
Get rid of Oracle in 3 years
Apr 2013 — first experiments with different DBMS
PostgreSQL
Lots on NoSQL stores
Self-written solution on base of search backend
Jul 2013 — Jun 2014 — collectors experiment
16
Experiments
17
About collectors
https://simply.name/video-pg-meetup-yandex.html
Our first experience with PostgreSQL
Monitoring/graphs/deploy
PL/Proxy for sharding
Self-written tools for switchovers and read-only degradation
Plenty of initial problems
2 TB of metadata (15+ billion records)
40k RPS
18
Experiment with collectors
Aug 2014 — Dec 2014
Storing all production stream of letters to PostgreSQL
Asynchronously
Initial schema decisions
Important for abstraction library
Load testing with our workload
Choosing hardware
Lots of other PostgreSQL related experience
https://simply.name/postgresql-and-systemtap.html
19
Full mail prototype
Jan 2015 — Jan 2016 — development
Jun 2015 — dog fooding
Accelerated development
Sep 2015 — start of inactive users migration
Fixing bugs of transfer code
Reverse transfer (plan B)
Jan 2016 — Apr 2016 — migration
20
Main work
Time to rewrite all software to support Oracle and PostgreSQL
10 man-years
22
Migration
23
Completion
Main changes
25
macs
26
Sharding and fault tolerance
27
Hardware
Warm DBs (SSD) for most active users
Cold DBs (SATA) for all inactive users
Hot DBs for super active users
2% of users generate 50% of workload
Automation to move users between different shard types
TBD: moving old letters of one user from SSD to SATA
28
Hardware
In Oracle all IDs (mid, fid, lid, tid) were globally unique
Sequences ranges for every shard in special DB
NUMBER(20, 0) — 20 bytes
In PostgreSQL IDs are unique inside particular user
Globally unique mid changed to globally unique (uid, mid)
Biginit + bigint — 16 bytes
29
Identifiers
Less contention for single index page
Normal B-Tree instead of reversed indexes
Revisions for all objects
Ability to read only actual data from standbys
Incremental diffs for IMAP and mobile apps
Denormalized some data
Arrays and GIN
Composite types
30
Schema changes
xdb01g/maildb M # dS mail.box
Table "mail.box"
Column | Type | Modifiers
---------------+--------------------------+------------------------
uid | bigint | not null
mid | bigint | not null
lids | integer[] | not null
<...>
Indexes:
"pk_box" PRIMARY KEY, btree (uid, mid)
"i_box_uid_lids" gin (mail.ulids(uid, lids)) WITH (fastupdate=off)
<...>
xdb01g/maildb M #
31
Example
PL/pgSQL is awesome
Greatly reduced code size
Only to ensure data consistency
Greatly increased test coverage
The cost of failure is high
Easy deploy since no library cache locks
32
Stored logic
SaltStack
Detailed diff between current and desired state
All schema and code changes through migrations
All common tasks are automated
Representative testing environments
33
Maintenance approach
Problems
Problem with ExclusiveLock on inserts
Checkpoint distribution
ExclusiveLock on extension of relation with huge shared_buffers
Hanging startup process on the replica after vacuuming on master
Replication slots and isolation levels
Segfault in BackendIdGetTransactionIds
A lot more solved without community help
35
Before main migration
Oracle DBA
In any unclear situation
autovacuum is to blame
https://simply.name/pg-stat-wait.html
Wait_event in pg_stat_activity (9.6)
https://simply.name/ru/slides-pgday2015.html (RUS)
37
Diagnostics
Our retention policy is 7 days
In Oracle backups (inc0 + 6 * inc1) and archive logs ≈ DB size
In PostgreSQL with barman ≈ N* DB size, where N > 5
WALs compressed but backups not
File-level increments don’t work properly
All operations are single-threaded and very slow
For 300 TB we needed ≈ 2 PB for backups
https://github.com/2ndquadrant-it/barman/issues/21
38
Backups
Not PostgreSQL problems
Data problems
A lot of legacy for 10+ years
Bugs in transfer code
39
During migration
Conclusion
Declarative partitioning
Good recovery manager
Parallelism/compression/page-level increments
Partial online recovery (i.e. single table)
Future development of wait interface
Huge shared buffers, O_DIRECT and async I/O
Quorum commit
41
Our wishlist for PostgreSQL
1 PB with redundancy (100+ billion records)
250k TPS
Three calendar years / 10+ man-years
Faster deployment / more efficient human time usage
All backend refactoring
3x more hardware
No major fuckups yet :)
Linux, nginx, postfix, PostgreSQL
42
Summary
Vladimir Borodin
DBA
Questions?
@man_brain
https://simply.name
+7 (495) 739 70 00, ext.: 7255
d0uble@yandex-team.ru

Mais conteúdo relacionado

Mais procurados

Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Ontico
 
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
Ontico
 
MongoDB Basic Concepts
MongoDB Basic ConceptsMongoDB Basic Concepts
MongoDB Basic Concepts
MongoDB
 

Mais procurados (20)

Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)
Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)
Масштабируемая конфигурация Nginx, Игорь Сысоев (Nginx)
 
Lessons Learned Migrating 2+ Billion Documents at Craigslist
Lessons Learned Migrating 2+ Billion Documents at CraigslistLessons Learned Migrating 2+ Billion Documents at Craigslist
Lessons Learned Migrating 2+ Billion Documents at Craigslist
 
Базы данных. HDFS
Базы данных. HDFSБазы данных. HDFS
Базы данных. HDFS
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at Craigslist
 
Sql 2016
Sql 2016Sql 2016
Sql 2016
 
MongoUK - Approaching 1 billion documents with MongoDB1 Billion Documents
MongoUK - Approaching 1 billion documents with MongoDB1 Billion DocumentsMongoUK - Approaching 1 billion documents with MongoDB1 Billion Documents
MongoUK - Approaching 1 billion documents with MongoDB1 Billion Documents
 
Open Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second eraOpen Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second era
 
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
 
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
Tarantool: как сэкономить миллион долларов на базе данных на высоконагруженно...
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
 
Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012Sphinx at Craigslist in 2012
Sphinx at Craigslist in 2012
 
WiredTiger In-Memory vs WiredTiger B-Tree
WiredTiger In-Memory vs WiredTiger B-TreeWiredTiger In-Memory vs WiredTiger B-Tree
WiredTiger In-Memory vs WiredTiger B-Tree
 
MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...
MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...
MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...
 
Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...
 
Как PostgreSQL работает с диском
Как PostgreSQL работает с дискомКак PostgreSQL работает с диском
Как PostgreSQL работает с диском
 
MyRocks introduction and production deployment
MyRocks introduction and production deploymentMyRocks introduction and production deployment
MyRocks introduction and production deployment
 
MongoDB Basic Concepts
MongoDB Basic ConceptsMongoDB Basic Concepts
MongoDB Basic Concepts
 
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)Realtime Search Infrastructure at Craigslist (OpenWest 2014)
Realtime Search Infrastructure at Craigslist (OpenWest 2014)
 
Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...
Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...
Как мы сделали PHP 7 в два раза быстрее PHP 5 / Дмитрий Стогов (Zend Technolo...
 

Semelhante a Yandex.Mail success story

Troubleshooting SQL Server
Troubleshooting SQL ServerTroubleshooting SQL Server
Troubleshooting SQL Server
Stephen Rose
 
Gavin M
Gavin MGavin M
Gavin M
Ontico
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 

Semelhante a Yandex.Mail success story (20)

Comparing sql and nosql dbs
Comparing sql and nosql dbsComparing sql and nosql dbs
Comparing sql and nosql dbs
 
PSSUG Nov 2012: Big Data with SQL Server
PSSUG Nov 2012: Big Data with SQL ServerPSSUG Nov 2012: Big Data with SQL Server
PSSUG Nov 2012: Big Data with SQL Server
 
Sql server 2016 it just runs faster sql bits 2017 edition
Sql server 2016 it just runs faster   sql bits 2017 editionSql server 2016 it just runs faster   sql bits 2017 edition
Sql server 2016 it just runs faster sql bits 2017 edition
 
NewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACIDNewSQL - Deliverance from BASE and back to SQL and ACID
NewSQL - Deliverance from BASE and back to SQL and ACID
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Troubleshooting SQL Server
Troubleshooting SQL ServerTroubleshooting SQL Server
Troubleshooting SQL Server
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
 
Gavin M
Gavin MGavin M
Gavin M
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
PostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolPostgreSQL as a Strategic Tool
PostgreSQL as a Strategic Tool
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL Server
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
 
SPL_ALL_EN.pptx
SPL_ALL_EN.pptxSPL_ALL_EN.pptx
SPL_ALL_EN.pptx
 
Doing More with Postgres - Yesterday's Vision Becomes Today's Reality
Doing More with Postgres - Yesterday's Vision Becomes Today's RealityDoing More with Postgres - Yesterday's Vision Becomes Today's Reality
Doing More with Postgres - Yesterday's Vision Becomes Today's Reality
 
[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud
 
MOUG17 Keynote: Oracle OpenWorld Major Announcements
MOUG17 Keynote: Oracle OpenWorld Major AnnouncementsMOUG17 Keynote: Oracle OpenWorld Major Announcements
MOUG17 Keynote: Oracle OpenWorld Major Announcements
 
PostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLPostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQL
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 

Mais de dev1ant (6)

История успеха Яндекс.Почты
История успеха Яндекс.ПочтыИстория успеха Яндекс.Почты
История успеха Яндекс.Почты
 
2015.10.14 Как спать спокойно
2015.10.14 Как спать спокойно2015.10.14 Как спать спокойно
2015.10.14 Как спать спокойно
 
2015.07.16 Способы диагностики PostgreSQL
2015.07.16 Способы диагностики PostgreSQL2015.07.16 Способы диагностики PostgreSQL
2015.07.16 Способы диагностики PostgreSQL
 
2015.02.06 PostgreSQL в Яндексе: история успеха №2
2015.02.06 PostgreSQL в Яндексе: история успеха №22015.02.06 PostgreSQL в Яндексе: история успеха №2
2015.02.06 PostgreSQL в Яндексе: история успеха №2
 
PostgreSQL
PostgreSQLPostgreSQL
PostgreSQL
 
История небольшого успеха с PostgreSQL
История небольшого успеха с PostgreSQLИстория небольшого успеха с PostgreSQL
История небольшого успеха с PostgreSQL
 

Último

Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
jaanualu31
 

Último (20)

Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 

Yandex.Mail success story

  • 1.
  • 3. One of the largest internet companies in Europe 57+% of all search traffic in Russia Ukraine, Kazakhstan, Belarus and Turkey https://yandex.com/company/technologies About 6000 employees all over the world 3 About Yandex
  • 4. Launched in 2000 10+ million users daily 200.000 RPS to web/mobile/imap backends 150+ million incoming letters daily 20+ PB of data 4 About Yandex.Mail
  • 5. Migration from Oracle to PostgreSQL 300+ TB of metadata without redundancy 250k requests per second OLTP with 80% reads, 20% writes Previous attempts MySQL Self-written DBMS 5 About this talk
  • 6. 6 What is mail metadata?
  • 7. 7
  • 9. Everything stored in Oracle Lots of PL/SQL logic Efficient hardware usage 10+ TB per shard Working LA 100 Lots of manual operations Warm (SSD) and cold (SATA) databases for different users 75% SSD, 25% SATA 9 Yandex.Mail metadata
  • 13. PL/SQL deploy Library cache Lots of manual operations Switchover, new DB setup, data transfer between shards Only synchronous interface in OCCI Problems with development environments Not very responsive support 13 Most common problems
  • 16. Oct 2012 — the willful decision Get rid of Oracle in 3 years Apr 2013 — first experiments with different DBMS PostgreSQL Lots on NoSQL stores Self-written solution on base of search backend Jul 2013 — Jun 2014 — collectors experiment 16 Experiments
  • 18. https://simply.name/video-pg-meetup-yandex.html Our first experience with PostgreSQL Monitoring/graphs/deploy PL/Proxy for sharding Self-written tools for switchovers and read-only degradation Plenty of initial problems 2 TB of metadata (15+ billion records) 40k RPS 18 Experiment with collectors
  • 19. Aug 2014 — Dec 2014 Storing all production stream of letters to PostgreSQL Asynchronously Initial schema decisions Important for abstraction library Load testing with our workload Choosing hardware Lots of other PostgreSQL related experience https://simply.name/postgresql-and-systemtap.html 19 Full mail prototype
  • 20. Jan 2015 — Jan 2016 — development Jun 2015 — dog fooding Accelerated development Sep 2015 — start of inactive users migration Fixing bugs of transfer code Reverse transfer (plan B) Jan 2016 — Apr 2016 — migration 20 Main work
  • 21. Time to rewrite all software to support Oracle and PostgreSQL 10 man-years
  • 28. Warm DBs (SSD) for most active users Cold DBs (SATA) for all inactive users Hot DBs for super active users 2% of users generate 50% of workload Automation to move users between different shard types TBD: moving old letters of one user from SSD to SATA 28 Hardware
  • 29. In Oracle all IDs (mid, fid, lid, tid) were globally unique Sequences ranges for every shard in special DB NUMBER(20, 0) — 20 bytes In PostgreSQL IDs are unique inside particular user Globally unique mid changed to globally unique (uid, mid) Biginit + bigint — 16 bytes 29 Identifiers
  • 30. Less contention for single index page Normal B-Tree instead of reversed indexes Revisions for all objects Ability to read only actual data from standbys Incremental diffs for IMAP and mobile apps Denormalized some data Arrays and GIN Composite types 30 Schema changes
  • 31. xdb01g/maildb M # dS mail.box Table "mail.box" Column | Type | Modifiers ---------------+--------------------------+------------------------ uid | bigint | not null mid | bigint | not null lids | integer[] | not null <...> Indexes: "pk_box" PRIMARY KEY, btree (uid, mid) "i_box_uid_lids" gin (mail.ulids(uid, lids)) WITH (fastupdate=off) <...> xdb01g/maildb M # 31 Example
  • 32. PL/pgSQL is awesome Greatly reduced code size Only to ensure data consistency Greatly increased test coverage The cost of failure is high Easy deploy since no library cache locks 32 Stored logic
  • 33. SaltStack Detailed diff between current and desired state All schema and code changes through migrations All common tasks are automated Representative testing environments 33 Maintenance approach
  • 35. Problem with ExclusiveLock on inserts Checkpoint distribution ExclusiveLock on extension of relation with huge shared_buffers Hanging startup process on the replica after vacuuming on master Replication slots and isolation levels Segfault in BackendIdGetTransactionIds A lot more solved without community help 35 Before main migration
  • 36. Oracle DBA In any unclear situation autovacuum is to blame
  • 37. https://simply.name/pg-stat-wait.html Wait_event in pg_stat_activity (9.6) https://simply.name/ru/slides-pgday2015.html (RUS) 37 Diagnostics
  • 38. Our retention policy is 7 days In Oracle backups (inc0 + 6 * inc1) and archive logs ≈ DB size In PostgreSQL with barman ≈ N* DB size, where N > 5 WALs compressed but backups not File-level increments don’t work properly All operations are single-threaded and very slow For 300 TB we needed ≈ 2 PB for backups https://github.com/2ndquadrant-it/barman/issues/21 38 Backups
  • 39. Not PostgreSQL problems Data problems A lot of legacy for 10+ years Bugs in transfer code 39 During migration
  • 41. Declarative partitioning Good recovery manager Parallelism/compression/page-level increments Partial online recovery (i.e. single table) Future development of wait interface Huge shared buffers, O_DIRECT and async I/O Quorum commit 41 Our wishlist for PostgreSQL
  • 42. 1 PB with redundancy (100+ billion records) 250k TPS Three calendar years / 10+ man-years Faster deployment / more efficient human time usage All backend refactoring 3x more hardware No major fuckups yet :) Linux, nginx, postfix, PostgreSQL 42 Summary
  • 43. Vladimir Borodin DBA Questions? @man_brain https://simply.name +7 (495) 739 70 00, ext.: 7255 d0uble@yandex-team.ru