HANY FAHIM
DevOps throughout time:
the evolution of troubleshooting
March 15, 2016
• Customer reported a problem with one of their async worker clusters.
• Apparently workers had stopped processing tasks unexpectedly.
• Cluster has been in place for over 2 years at this point.
Problem Reported
What changed?
Worker Setup
1. Scheduler regularly loads tasks onto the queue.
2. Queue server
3. Worker servers consume tasks from the queue...
4. ...and interact with the DB.
• Investigation revealed Workers were still processing results.
• Queue size was decreasing as the Workers were consuming tasks.
• Worker CPUs were busy.
• Customer confirmed queue was being processed, but throughput
dropped.
Investigation
3 servers × 4 workers each.
Task Throughput
Tried turning it off and on again...
Task Throughput
A few days pass...
Problem strikes again.
• Throughput had dropped again, though it affected a different server.
• Looking closer at the Worker software, it was several versions behind.
• The CHANGELOG showed many fixes; however, none matched the situation.
• Advised customer to upgrade worker version.
• Also upped the log level.
Stuck worker
More waiting...
Problem was back.
• Logs were clean.
• Deep diving into Worker documentation, discovered we were not
running the workers correctly.
• Should give each worker a unique ID.
• Long shot, but we were out of ideas.
More mystery
At least for multi-worker setups.
Again, why now?
Waited yet again.
The problem came back
the very next day.
• The latest version of the Worker allowed responding to signals for
debugging.
• While the issue was ongoing, sent it a USR1 signal.
• Process dumped what it was doing.
Signals?
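The idea behind the debug signal can be sketched generically. The snippet below is an illustrative sketch (not Celery's actual implementation): it installs a SIGUSR1 handler that dumps the live stack of every thread, so an operator can ask a seemingly stuck process what it is doing without stopping it.

```python
import os
import signal
import sys
import traceback

DUMPS = []  # captured stack dumps, kept so the effect is observable


def dump_stacks(signum, frame):
    """On SIGUSR1, record and print a stack trace for every thread."""
    for thread_id, stack in sys._current_frames().items():
        text = "".join(traceback.format_stack(stack))
        DUMPS.append(text)
        print(f"--- thread {thread_id} ---\n{text}")


signal.signal(signal.SIGUSR1, dump_stacks)

# Simulate an operator running `kill -USR1 <pid>` against the worker.
os.kill(os.getpid(), signal.SIGUSR1)
```

Because the handler only reads and prints state, it is safe to fire against a live process, which is exactly what made it useful here.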
Worker Debug Output
-> celery@prod-wrkr-02-celeryd-batch_01: OK
* {u'args': u'(149747,)', u'time_start': 75436993.58697978,
u'name': u'tasks.RefreshTask', u'delivery_info': {u'priority': 0,
u'redelivered': False, u'routing_key': u'background', u'exchange':
u'batch'}, u'hostname': u'celery@prod-wrkr-02-celeryd-batch_01',
u'acknowledged': True, u'kwargs': u'{}', u'id':
u'648b5805-1a49-4118-97b3-a235f173230e', u'worker_pid': 21325}
Made note of this. Was it some user ID?
Used this PID to inspect the process.
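Since the dump is a Python repr, fields like worker_pid can be pulled out programmatically. A small sketch, using a record abridged from the output above:

```python
import ast

# One task record, abridged from the USR1 dump above.
record = ("{u'args': u'(149747,)', u'name': u'tasks.RefreshTask', "
          "u'id': u'648b5805-1a49-4118-97b3-a235f173230e', "
          "u'worker_pid': 21325}")

# literal_eval safely evaluates the repr (u'' literals are valid Python 3).
task = ast.literal_eval(record)
print(task["worker_pid"], task["args"])  # → 21325 (149747,)
```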
strace the worker
# strace -fp 21325
…
[pid 21325] sendto(25, "Q\0\0\t1\n SELECT\n"..., 2354, MSG_NOSIGNAL, NULL, 0) = 2354
[pid 21325] poll([{fd=25, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=25, revents=POLLIN}])
Tracing the PID
This is the file descriptor
This looks like an SQL query
The process waits here
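The trace shows a classic send-then-block pattern: write the query, then sit in poll() on the same fd until the server answers. A minimal sketch of those two system calls, with a local socket pair standing in for the Postgres connection:

```python
import select
import socket

# A connected socket pair stands in for the worker <-> Postgres connection.
worker_end, db_end = socket.socketpair()

# Like the sendto() in the trace: the worker writes its query...
worker_end.sendall(b"SELECT ...")

# ...then blocks in poll() on the same fd, just like the strace output.
poller = select.poll()
poller.register(worker_end.fileno(), select.POLLIN | select.POLLERR)

db_end.sendall(b"1 row")        # the "database" finally replies
events = poller.poll(1000)      # wakes with POLLIN once data arrives
reply = worker_end.recv(1024)
print(events, reply)
```

Until that reply arrives, poll() simply never returns, which is why the worker looked stuck from the outside.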
lsof the worker
# lsof -p 21325 | grep 25u
…
python 21325 deploy 25u IPv4 42259 0t0 TCP prod-wrkr-02:43776->prod-db-01:postgres (ESTABLISHED)
Using same PID, and FD 25.
Made note of the source port.
FD 25 is a database connection.
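What lsof does from the outside (mapping fd 25 to a local:port->remote:port pair) a process can also do from the inside with getsockname()/getpeername(). A self-contained sketch using a throwaway local connection:

```python
import socket

# A throwaway TCP server gives us a real ESTABLISHED connection to inspect.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname())
conn, _ = server.accept()

fd = client.fileno()
local = client.getsockname()   # local address and ephemeral "source port"
peer = client.getpeername()    # remote endpoint (the "database" side)
print(f"fd {fd}: {local[0]}:{local[1]} -> {peer[0]}:{peer[1]}")
```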
On the DB server
# ps auxwf | grep 43776
…
postgres 30108 8.9 13.1 4504052 4294416 ? Ds 07:11 30:19 \_ postgres: user
database prod-wrkr-02(43776) SELECT
This is the source port of the worker.
The query is using 13% of memory,
or > 4GB on a 32GB server.
The query has been running since 7am.
It was currently 12:30pm.
The worker was actually not stuck!
It was waiting for a query to finish!
• Customer was able to identify the query based on the user ID.
• Normally, this query should return a single row.
• However this particular user had over 4 million rows.
• Likely left over from a bad data migration.
• The large database server masked the performance impact of the
large query.
Root-cause
Data was actually
duplicated.
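The failure mode is easy to reproduce in miniature. Below, a hypothetical results table has one row per user, except for one user whose rows were duplicated by a bad migration (100k rows here standing in for the 4 million):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (user_id INTEGER, payload TEXT)")

# Normal users: one row each.
db.execute("INSERT INTO results VALUES (1, 'a')")
db.execute("INSERT INTO results VALUES (2, 'b')")

# User 42: rows duplicated by a (hypothetical) bad data migration.
db.executemany("INSERT INTO results VALUES (42, ?)", [("dup",)] * 100_000)

# The query "normally returns a single row"...
normal = db.execute("SELECT COUNT(*) FROM results WHERE user_id = 1").fetchone()[0]
# ...but for the affected user it returns a huge result set instead.
bad = db.execute("SELECT COUNT(*) FROM results WHERE user_id = 42").fetchone()[0]
print(normal, bad)  # → 1 100000
```

On a large, well-provisioned server such a query still completes eventually, which is exactly how the problem hid for so long.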
This story is a typical troubleshooting
scenario of the past (5-10 yrs ago).
This was the Age of the Toolbox.
• With full access, you had a whole toolbox (OS layer and up) at your
fingertips.
• It allowed you to “pop open the hood” and inspect the lower layer
components.
• Particularly helpful when you don’t know what you’re looking for.
Age of the Toolbox
• It required advanced knowledge of core components to operate
complex systems.
• Systems were usually custom-built for a business' needs, which made it difficult to hire and train.
• Difficult from a security standpoint (who gets access? Everyone?)
Downsides to this era
Times have changed.
October 18, 2019
• Customer wanted to migrate their replicated MySQL database from
self-hosted to Amazon’s RDS.
• Fairly large database in production (~1TB).
• Should be straightforward.
• The plan was to use one of the slaves for the migration to avoid extra load on the master.
DB Migration
• DMS was a natural choice for migrating the DB (it’s specifically designed
for this).
• DMS operates on a per-table basis allowing selective migrations.
• First shot was to migrate their staging database.
• Same table structure, smaller dataset (11GB vs. 1TB).
DMS
DMS Migration
Amazon DMS Amazon RDS
Master MySQL
Slave MySQL
AWS VPC Peering
MySQL Replication
• Seemed to work. Only a few issues encountered.
• Issues included:
• Converting dates such as 0000-00-00 to NULL.
• Approx. 30 date records were shifted 3 hours into the future (300k+ other similar records were unaffected).
• All were caught via DMS validation.
Success (mostly)
Why did it do this?
Why did it do this?
On to production!
• Everything seemed to be going well, albeit slow.
• After 36 hours, we were 61% complete.
• However on the 4th night, the RDS instance ran out of memory.
• Quadrupled the size of the instance, picked up where it crashed.
• However DMS validation errors began to spring up.
Production Migration
Manual Validation
mysql> SELECT count(*) FROM users;
+----------+
| count(*) |
+----------+
| 4968305 |
+----------+
MySQL [production]> SELECT count(*) FROM users;
+----------+
| count(*) |
+----------+
| 4974044 |
+----------+
Source
RDS
• Tables were unrecoverable. Many missing or invalid records.
• 3 tables were affected, and they were the largest.
• Had to truncate the tables and restart the process.
Tables Truncated
More corruption.
• More tables reported validation errors (different from the original).
• Truncated those tables and reloaded again.
• Also had to increase the binlog_retention_days to 10 days on the
source.
More corruption found
Got to 98%. So close…
More corruption.
• Many reports that v3.3.0 (current) is full of bugs.
• At this point, decided to switch to an older version of DMS.
• Fell back on previous version v3.1.4.
• Deleted everything and started over again to run over weekend.
Changed DMS version
Corruption.
• Older version caused even more corruption.
• At this point, 2 weeks had been sunk trying to wrestle with DMS.
• Had to cut our losses and find an alternative.
• Went through documentation and found that RDS supports loading
databases via backups.
• xtrabackup was a supported option, which customer already used!
Older Version
xtrabackup Method
Amazon RDS
Master MySQL
Slave MySQL
1. xtrabackup files were created and shipped to S3.
Amazon S3
2. RDS was directed to pick up backup from S3.
• Loaded previous night’s backup into S3 bucket.
• Recorded current replication position (via xtrabackup --prepare).
• Create RDS instance and point to bucket.
xtrabackup Method
Waited...
• The RDS instance was stuck in the “creating” state.
• Via the console, we could see that the backup had been downloaded to the instance.
• No errors in the RDS logs.
• Decided to leave it another day just in case (as data set was very large).
18 hours in…
• After nearly 36 hours, RDS was still stuck on “creating”.
• Something was wrong.
• Was it the specific data set?
Still stuck
• Used https://github.com/datacharmer/test_db as a test data
source (~167MB).
• Kept everything else the same.
• RDS also hung at “creating”.
• It was not the data set.
• Was it some version issue?
Test Data Set
From https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.html
• We were using the right versions!
• DB was MySQL 5.7, and xtrabackup was 2.4.
• Checked Percona’s documentation, and found a blog post from April
2018...
Verified versions
From https://www.percona.com/blog/2018/04/02/migrate-to-amazon-rds-with-percona-xtrabackup/
• Is 5.7 actually not supported via this method?
• Loaded the test database on a 5.6 instance.
• Took a backup with xtrabackup, copied to S3.
• Pointed RDS to it, and…
Are AWS’ docs wrong?
It worked!
• Import worked within a few minutes.
• Looks like AWS’ documentation is incorrect. 5.7 is not supported!
• However this did not help us.
• Source was 5.7 and downgrading was not possible.
Lesson: docs are not always right.
Another 3 days wasted.
• Last option was to use a tried and tested method.
• Good ol’ mysqldump.
• Not the best method:
• Non-transactional table types may not be consistent (MyISAM).
• Slowest of all methods.
• Can cause significant load on source.
Third time’s a charm
mysqldump Method
Master MySQL
Slave MySQL
Amazon EC2 Amazon RDS
VPC
1. Backup was created via mysqldump and shipped to an EC2 instance.
2. Dump file was loaded into MySQL via mysql client.
• Created an EC2 instance within the same VPC as RDS.
• Use --single-transaction and --quick.
• Modify output with sed due to DEFINERs.
• Record replication position.
• Disable Multi-AZ and backups on RDS to increase speed.
Method utilized
Wanted it as close to the
destination as possible.
mysqldump
# mysqldump \
    --single-transaction \
    --quick \
    --master-data=2 \
    --databases production | \
  sed 's/\sDEFINER=`[^`]*`@`[^`]*`//g' | \
  gzip > production_$(date -I).sql.gz
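The sed stage strips DEFINER clauses, which reference users that won't exist on RDS, before the dump is loaded. The same substitution, sketched in Python for illustration (the user/host values are hypothetical):

```python
import re

# Equivalent of the sed expression: remove DEFINER=`user`@`host` clauses.
DEFINER_RE = re.compile(r"\sDEFINER=`[^`]*`@`[^`]*`")

# A typical line from a mysqldump, with hypothetical user/host values.
line = "/*!50013 DEFINER=`admin`@`10.0.0.%` SQL SECURITY DEFINER */"
cleaned = DEFINER_RE.sub("", line)
print(cleaned)  # → /*!50013 SQL SECURITY DEFINER */
```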
Waited...
• Import succeeded!
• Luckily MyISAM tables are rarely written to.
• RDS instance was almost 4 days behind (Seconds_Behind_Master).
• Took about 24 hours to catch up.
After 81 hours...
Would have caused major
issues if they were highly active.
What these stories highlight is the evolution
of DevOps and nature of troubleshooting.
We are now in the Age of Levers.
• As systems and businesses grow and advance, technical teams build their own levers (automation, abstraction) to meet the needs of the organization. E.g.:
• System images, config management, deployment scripts and tools.
• Building dev/staging environments that look like prod.
• CI/CD systems.
• This enabled unprecedented flexibility, but came with its own risks.
Natural Evolution
• Take on the burden of maintenance, security, compliance, etc...
• The bigger risk is employee turnover.
• Over an extended period of time, those who built the system will likely
move on, which creates a void.
• Hiring replacements comes at a high cost (no prior knowledge of your
systems, may not be familiar with your tech stack).
Past Risks
• The prevalence of cloud-based architectures and the flurry of managed services address these risks.
• It’s much easier to hire someone with experience in a common cloud's
ecosystem.
• Easier to train (read their docs*).
• Resources are abundant.
• Communities are well developed.
Evolution to common levers
*read with scrutiny.
While the marketing for managed services
claims to save time and money,
this is only true when things go right.
Toolbox vs. Levers
Toolbox:
• Access to lower layers (“popping the hood”).
• Systems typically built by the same team operating them.
• Great flexibility and agility.
• Requires high levels of expertise.
• Higher operating risks.
Levers:
• Quicker to get started.
• Easier to scale.
• Lower initial risks.
• Abundance of resources.
• Less flexibility, limited agility.
• Costs skyrocket when things go wrong.
• Lots of trial and error.
Lesson to be learned:
Embrace the evolution,
but be wary and budget for failure.
THE
END
… psst we’re hiring
—
301-675 King Street West
Toronto, ON, M5V 1M9
1 (866) 278 0021
@stackdotio
/stackdotio
/company/stackdotio

Mais conteúdo relacionado

Mais procurados

Rails Conf Europe 2007 Notes
Rails Conf  Europe 2007  NotesRails Conf  Europe 2007  Notes
Rails Conf Europe 2007 Notes
Ross Lawley
 
Cassandra summit 2013 how not to use cassandra
Cassandra summit 2013  how not to use cassandraCassandra summit 2013  how not to use cassandra
Cassandra summit 2013 how not to use cassandra
Axel Liljencrantz
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
DataStax
 

Mais procurados (20)

Laying down the smack on your data pipelines
Laying down the smack on your data pipelinesLaying down the smack on your data pipelines
Laying down the smack on your data pipelines
 
Store stream data on Data Lake
Store stream data on Data LakeStore stream data on Data Lake
Store stream data on Data Lake
 
Webinar: Getting Started with Apache Cassandra
Webinar: Getting Started with Apache CassandraWebinar: Getting Started with Apache Cassandra
Webinar: Getting Started with Apache Cassandra
 
Rails Conf Europe 2007 Notes
Rails Conf  Europe 2007  NotesRails Conf  Europe 2007  Notes
Rails Conf Europe 2007 Notes
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
 
Successful Architectures for Fast Data
Successful Architectures for Fast DataSuccessful Architectures for Fast Data
Successful Architectures for Fast Data
 
Cassandra summit 2013 how not to use cassandra
Cassandra summit 2013  how not to use cassandraCassandra summit 2013  how not to use cassandra
Cassandra summit 2013 how not to use cassandra
 
C* Summit 2013: Can't we all just get along? MariaDB and Cassandra by Colin C...
C* Summit 2013: Can't we all just get along? MariaDB and Cassandra by Colin C...C* Summit 2013: Can't we all just get along? MariaDB and Cassandra by Colin C...
C* Summit 2013: Can't we all just get along? MariaDB and Cassandra by Colin C...
 
Cassandra Community Webinar | Data Model on Fire
Cassandra Community Webinar | Data Model on FireCassandra Community Webinar | Data Model on Fire
Cassandra Community Webinar | Data Model on Fire
 
Openstack meetup lyon_2017-09-28
Openstack meetup lyon_2017-09-28Openstack meetup lyon_2017-09-28
Openstack meetup lyon_2017-09-28
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
Cassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassCassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break Glass
 
How to tune Kafka® for production
How to tune Kafka® for productionHow to tune Kafka® for production
How to tune Kafka® for production
 
Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015
Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015
Mitigating Security Threats with Fastly - Joe Williams at Fastly Altitude 2015
 
Cassandra EU - Data model on fire
Cassandra EU - Data model on fireCassandra EU - Data model on fire
Cassandra EU - Data model on fire
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
 
Monitoring Cassandra with Riemann
Monitoring Cassandra with RiemannMonitoring Cassandra with Riemann
Monitoring Cassandra with Riemann
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
RedisConf18 - Ad serving platform using Redis
RedisConf18 - Ad serving platform using Redis RedisConf18 - Ad serving platform using Redis
RedisConf18 - Ad serving platform using Redis
 

Semelhante a DevOps throughout time

Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Kristofferson A
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
Edward Capriolo
 

Semelhante a DevOps throughout time (20)

How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Rubyslava + PyVo #48
Rubyslava + PyVo #48Rubyslava + PyVo #48
Rubyslava + PyVo #48
 
MongoDB Days UK: Tales from the Field
MongoDB Days UK: Tales from the FieldMongoDB Days UK: Tales from the Field
MongoDB Days UK: Tales from the Field
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Druid Summit 2023 : Changing Druid Ingestion from 3 hours to 5 minutes
Druid Summit 2023 : Changing Druid Ingestion from 3 hours to 5 minutesDruid Summit 2023 : Changing Druid Ingestion from 3 hours to 5 minutes
Druid Summit 2023 : Changing Druid Ingestion from 3 hours to 5 minutes
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of Data
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.com
 
HBaseCon 2013: How to Get the MTTR Below 1 Minute and More
HBaseCon 2013: How to Get the MTTR Below 1 Minute and MoreHBaseCon 2013: How to Get the MTTR Below 1 Minute and More
HBaseCon 2013: How to Get the MTTR Below 1 Minute and More
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
Performance & Scalability Improvements in Perforce
Performance & Scalability Improvements in PerforcePerformance & Scalability Improvements in Perforce
Performance & Scalability Improvements in Perforce
 
S3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using sparkS3 cassandra or outer space? dumping time series data using spark
S3 cassandra or outer space? dumping time series data using spark
 
HBase: How to get MTTR below 1 minute
HBase: How to get MTTR below 1 minuteHBase: How to get MTTR below 1 minute
HBase: How to get MTTR below 1 minute
 
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
Best And Worst Practices Deploying IBM Connections
Best And Worst Practices Deploying IBM ConnectionsBest And Worst Practices Deploying IBM Connections
Best And Worst Practices Deploying IBM Connections
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
System Design.pdf
System Design.pdfSystem Design.pdf
System Design.pdf
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
 
A study of our DNS full-resolvers
A study of our DNS full-resolversA study of our DNS full-resolvers
A study of our DNS full-resolvers
 

Último

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Último (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

DevOps throughout time

  • 1. HANY FAHIM DevOps throughout time: the evolution of troubleshooting
  • 3. • Customer reported a problem with one of their async worker clusters. • Apparently workers had stopped processing tasks unexpectedly. • Cluster has been in place for over 2 years at this point. Problem Reported What changed?
  • 4. Worker Setup 1. Scheduler regularly loads tasks unto the queue. 2. Queue server 3. Worker servers consume tasks from the queue... 4. and interact with the DB.
  • 5. • Investigation revealed Workers were still processing results. • Queue size was decreasing as the Workers were consuming tasks. • Worker CPUs were busy. • Customer confirmed queue was being processed, but throughput dropped. Investigation 3 servers X 4 workers each.
  • 7. Tried turning it off and on again...
  • 9. A few days pass...
  • 11. • Throughput had dropped again, although affected different server. • Looking closer at Worker software, they were several versions behind. • CHANGELOG showed many fixes, however none matched the situation. • Advised customer to upgrade worker version. • Also upped the log level. Stuck worker
  • 14. • Logs were clean. • Deep diving into Worker documentation, discovered we were not running the workers correctly. • Should give each worker a unique ID. • Long shot, but we were out of ideas. More mystery At least for multi-worker setups. Again, why now?
  • 16. The problem came back the very next day.
  • 17. • The latest version of the Worker allowed responding to signals for debugging. • While the issue was ongoing, sent it a USR1 signal. • Process dumped what it was doing. Signals?
  • 18. Worker Debug Output -> celery@prod-wrkr-02-celeryd-batch_01: OK * {u'args': u'(149747,)', u'time_start': 75436993.58697978, u'name': u'tasks.RefreshTask', u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'background', u'exchange': u'batch'}, u'hostname': u'celery@prod-wrkr-02-celeryd-batch_01', u'acknowledged': True, u'kwargs': u'{}', u'id': u'648b5805-1a49-4118-97b3-a235f173230e', u'worker_pid': 21325} -> celery@prod-wrkr-02-celeryd-batch_01: OK * {u'args': u'(149747,)', u'time_start': 75436993.58697978, u'name': u'tasks.RefreshTask', u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'background', u'exchange': u'batch'}, u'hostname': u'celery@prod-wrkr-02-celeryd-batch_01', u'acknowledged': True, u'kwargs': u'{}', u'id': u'648b5805-1a49-4118-97b3-a235f173230e', u'worker_pid': 21325} -> celery@prod-wrkr-02-celeryd-batch_01: OK * {u'args': u'(149747,)', u'time_start': 75436993.58697978, u'name': u'tasks.RefreshTask', u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'background', u'exchange': u'batch'}, u'hostname': u'celery@prod-wrkr-02-celeryd-batch_01', u'acknowledged': True, u'kwargs': u'{}', u'id': u'648b5805-1a49-4118-97b3-a235f173230e', u'worker_pid': 21325} Made note of this. Was it some user ID? Used this PID to inspect the process.
  • 19. strace the worker # strace -fp 21325# strace -fp 21325 … [pid 21325] sendto(25, “Q00t1n SELECTn”..., 2354, MSG_NOSIGNAL, NULL, 0) = 2354 [pid 21325] poll([{fd=25, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=25, revents=POLLIN}]) # strace -fp 21325 … [pid 21325] sendto(25, “Q00t1n SELECTn”..., 2354, MSG_NOSIGNAL, NULL, 0) = 2354 [pid 21325] poll([{fd=25, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=25, revents=POLLIN}]) # strace -fp 21325 … [pid 21325] sendto(25, “Q00t1n SELECTn”..., 2354, MSG_NOSIGNAL, NULL, 0) = 2354 [pid 21325] poll([{fd=25, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=25, revents=POLLIN}]) # strace -fp 21325 … [pid 21325] sendto(25, “Q00t1n SELECTn”..., 2354, MSG_NOSIGNAL, NULL, 0) = 2354 [pid 21325] poll([{fd=25, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=25, revents=POLLIN}]) Tracing the PID This is the file descriptor This looks like an SQL query The process waits here
  • 20. lsof the worker # lsof -p 21325 | grep 25u … python 21325 deploy 25u IPv4 42259 0t0 TCP prod-wrkr-02:43776->prod-db-01:postgres (ESTABLISHED) Using the same PID and FD 25: the descriptor is a database connection. Made note of the source port (43776).
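The source port can be pulled out of the lsof line programmatically; a minimal sketch over the output shown above:

```python
# Sketch: parse the TCP endpoint field of the lsof line above to recover the
# worker's source port, which identifies the matching backend on the DB server.
import re

lsof_line = ("python 21325 deploy 25u IPv4 42259 0t0 TCP "
             "prod-wrkr-02:43776->prod-db-01:postgres (ESTABLISHED)")

m = re.search(r'TCP (\S+):(\d+)->(\S+):(\S+)', lsof_line)
src_host, src_port, dst_host, dst_svc = m.groups()

print(src_port)  # the value to grep for in `ps` output on the DB server
```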
  • 21. On the DB server # ps auxwf | grep 43776 … postgres 30108 8.9 13.1 4504052 4294416 ? Ds 07:11 30:19 _ postgres: user database prod-wrkr-02(43776) SELECT 43776 is the source port of the worker's connection. The query is using 13% of memory, or > 4GB on a 32GB server, and had been running since 7am. It was currently 12:30pm. The worker was not actually stuck! It was waiting for a query to finish!
  • 22. • Customer was able to identify the query based on the user ID. • Normally, this query should return a single row. • However this particular user had over 4 million rows. • Likely left over from a bad data migration. • The large database server masked the performance impact of the large query. Root-cause Data was actually duplicated.
  • 23. This story is a typical troubleshooting scenario of the past (5-10 yrs ago). This was the Age of the Toolbox.
  • 24. • With full access, you had a whole toolbox (OS layer and up) at your fingertips. • It allowed you to “pop open the hood” and inspect the lower layer components. • Particularly helpful when you don’t know what you’re looking for. Age of the Toolbox
  • 25. • It required advanced knowledge of core components to operate complex systems. • Systems were usually custom-built for a business' needs, which made it difficult to hire and train. • Difficult from a security standpoint (who gets access? Everyone?) Downsides to this era
  • 28. • Customer wanted to migrate their replicated MySQL database from self-hosted to Amazon's RDS. • Fairly large database in production (~1TB). • Should be straight-forward. • Plan was to use one of the slaves as the migration source, to avoid extra load on the master. DB Migration
  • 29. • DMS was a natural choice for migrating the DB (it’s specifically designed for this). • DMS operates on a per-table basis allowing selective migrations. • First shot was to migrate their staging database. • Same table structure, smaller dataset (11GB vs. 1TB). DMS
  • 30. DMS Migration Amazon DMS Amazon RDS Master MySQL Slave MySQL AWS VPC Peering MySQL Replication
  • 31. • Seemed to work. Only a few issues encountered. • Issues included: • Converting dates such as 0000-00-00 to NULL. • Approx 30 date records were shifted 3 hours into the future (the other 300k+ similar records were unaffected). • All were caught via DMS validation. Success (mostly) Why did it do this?
  • 33. • Everything seemed to be going well, albeit slow. • After 36 hours, we were 61% complete. • However on the 4th night, the RDS instance ran out of memory. • Quadrupled the size of the instance, picked up where it crashed. • However DMS validation errors began to spring up. Production Migration
  • 34. Manual Validation
Source:
mysql> SELECT count(*) FROM users;
+----------+
| count(*) |
+----------+
| 4968305 |
+----------+
RDS:
MySQL [production]> SELECT count(*) FROM users;
+----------+
| count(*) |
+----------+
| 4974044 |
+----------+
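Spot-checking like this can be scripted across all tables; a minimal sketch (the second table and its counts are illustrative — in practice the numbers would come from `SELECT count(*)` against each server):

```python
# Sketch: compare per-table row counts from source vs. target and report
# mismatches. Counts are hard-coded here for illustration only.
source = {'users': 4968305, 'orders': 120000}
target = {'users': 4974044, 'orders': 120000}

mismatches = {t: (source[t], target.get(t))
              for t in source
              if source[t] != target.get(t)}

print(mismatches)  # tables whose counts diverge are candidates for reloading
```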
  • 35. • Tables were unrecoverable. Many missing or invalid records. • 3 tables were affected, and they were the largest. • Had to truncate the tables and restart the process. Tables Truncated
  • 37. • More tables reported validation errors (different tables from the original). • Truncated those tables and reloaded again. • Also had to increase binlog retention to 10 days on the source. More corruption found
  • 38. Got to 98%. So close…
  • 40. • Many reports that v3.3.0 (current) is full of bugs. • At this point, decided to switch to an older version of DMS. • Fell back on previous version v3.1.4. • Deleted everything and started over again to run over weekend. Changed DMS version
  • 42. • Older version caused even more corruption. • At this point, 2 weeks had been sunk trying to wrestle with DMS. • Had to cut our losses and find an alternative. • Went through the documentation and found that RDS supports loading databases via physical backups. • xtrabackup was a supported option, which the customer already used! Older Version
  • 43. xtrabackup Method Amazon RDS Master MySQL Slave MySQL 1. xtrabackup files were created and shipped to S3. Amazon S3 2. RDS was directed to pick up backup from S3.
  • 44. • Loaded the previous night's backup into an S3 bucket. • Recorded the current replication position (from the xtrabackup --prepare output). • Created an RDS instance and pointed it at the bucket. xtrabackup Method
  • 46. • RDS instance was stuck on “creating” state. • Via console we can see that backup was downloaded to the instance. • No errors in the RDS logs. • Decided to leave it another day just in case (as data set was very large). 18 hours in…
  • 47. • After nearly 36 hours, RDS was still stuck on “creating”. • Something was wrong. • Was it the specific data set? Still stuck
  • 48. • Used https://github.com/datacharmer/test_db as a test data source (~167MB). • Kept everything else the same. • RDS also hung at “creating”. • It was not the data set. • Was it some version issue? Test Data Set
  • 50. • We were using the right versions! • DB was MySQL 5.7, and xtrabackup was 2.4. • Checked Percona’s documentation, and found a blog post from April 2018... Verified versions
  • 52. • Is 5.7 actually not supported via this method? • Loaded the test database on a 5.6 instance. • Took a backup with xtrabackup, copied to S3. • Pointed RDS to it, and… Are AWS’ docs wrong?
  • 54. • Import worked within a few minutes. • Looks like AWS’ documentation is incorrect. 5.7 is not supported! • However this did not help us. • Source was 5.7 and downgrading was not possible. Lesson: docs are not always right.
  • 55. Another 3 days wasted.
  • 56. • Last option was to use a tried and tested method. • Good ol' mysqldump. • Not the best method: • Non-transactional table types (e.g. MyISAM) may not be dumped consistently. • Slowest of all methods. • Can cause significant load on the source. Third time's a charm
  • 57. mysqldump Method Master MySQL Slave MySQL Amazon EC2 Amazon RDS VPC 1. Backup was created via mysqldump and shipped to an EC2 instance. 2. Dump file was loaded into MySQL via mysql client.
  • 58. • Created an EC2 instance within the same VPC as RDS. • Used --single-transaction and --quick. • Modified the output with sed to strip DEFINER clauses. • Recorded the replication position. • Disabled Multi-AZ and backups on RDS to increase speed. Method utilized Wanted it as close to the destination as possible.
  • 59. mysqldump # mysqldump --single-transaction --quick --master-data=2 --databases production | sed 's/\sDEFINER=`[^`]*`@`[^`]*`//g' | gzip > production_$(date -I).sql.gz
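The sed expression above strips DEFINER clauses, which RDS rejects for users that don't exist on the target; the same transform expressed in Python, for illustration (the sample line is a typical dump fragment, not from this migration):

```python
# Sketch: the same DEFINER-stripping transform as the sed expression above,
# expressed with Python's re module. Sample input line is illustrative.
import re

line = "/*!50013 DEFINER=`admin`@`localhost` SQL SECURITY DEFINER */"
cleaned = re.sub(r'\sDEFINER=`[^`]*`@`[^`]*`', '', line)

print(cleaned)  # DEFINER=... clause removed; the rest of the line is intact
```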
  • 61. • Import succeeded! • Luckily MyISAM tables are rarely written to. • RDS instance was almost 4 days behind (Seconds_Behind_Master). • Took about 24 hours to catch up. After 81 hours... Would have caused major issues if they were highly active.
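Catch-up progress like this can be estimated from two samples of Seconds_Behind_Master; a minimal sketch (the sample numbers are illustrative, not from this migration):

```python
# Sketch: estimate time-to-catch-up from two Seconds_Behind_Master samples
# taken `interval` wall-clock seconds apart. Numbers below are illustrative.
def eta_seconds(lag_then, lag_now, interval):
    """Estimated seconds until lag hits zero, or None if lag is not shrinking."""
    drop = lag_then - lag_now          # lag eliminated during the interval
    if drop <= 0:
        return None                    # replica is falling behind or holding
    return lag_now * interval / drop   # extrapolate the current drain rate

# ~4 days behind; an hour later, lag has dropped by 5400s
print(eta_seconds(345600, 340200, 3600))  # ~226800s, i.e. ~2.6 days to go
```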
  • 62. What these stories highlight is the evolution of DevOps and nature of troubleshooting.
  • 63. We are now in the Age of Levers.
  • 64. • As systems and businesses grow and advance, technical teams build their own levers (automation, abstraction) to meet the needs of the organization. Eg: • System images, config management, deployment scripts and tools. • Building dev/staging environments that look like prod. • CI/CD systems. • This enabled unprecedented flexibility, but came with its own risks. Natural Evolution
  • 65. • Teams take on the burden of maintenance, security, compliance, etc. • The bigger risk is employee turnover. • Over an extended period of time, those who built the system will likely move on, which creates a void. • Hiring replacements comes at a high cost (no prior knowledge of your systems, and they may not be familiar with your tech stack). Past Risks
  • 66. • The prevalence of cloud-based architecture, and the flurry of managed services addresses these risks. • It’s much easier to hire someone with experience in a common cloud's ecosystem. • Easier to train (read their docs*). • Resources are abundant. • Communities are well developed. Evolution to common levers *read with scrutiny.
  • 67. While the marketing for managed services claims to save time and money, this is only true when things go right.
  • 68. Toolbox vs. Levers
Toolbox:
• Access to lower layers (“popping the hood”).
• Systems typically built by the same team operating them.
• Great flexibility and agility.
• Requires high levels of expertise.
• Higher operating risks.
Levers:
• Quicker to get started.
• Easier to scale.
• Lower initial risks.
• Abundance of resources.
• Less flexibility, limited agility.
• Costs skyrocket when things go wrong.
• Lots of trial and error.
  • 69. Lesson to be learned: Embrace the evolution, but be wary and budget for failure.
  • 71. — 301-675 King Street West Toronto, ON, M5V 1M9 1 (866) 278 0021 @stackdotio /stackdotio /company/stackdotio