DevOps throughout time

HANY FAHIM
DevOps throughout time:
the evolution of troubleshooting

• Customer reported a problem with one of their async worker clusters.
• Apparently workers had stopped processing tasks unexpectedly.
• Cluster has been in place for over 2 years at this point.
Problem Reported
What changed?

Worker Setup
1. Scheduler
regularly loads
tasks unto the
queue.
2. Queue server
3. Worker servers
consume tasks
from the queue...
4. and interact
with the DB.

• Investigation revealed Workers were still processing results.
• Queue size was decreasing as the Workers were consuming tasks.
• Worker CPUs were busy.
• Customer confirmed queue was being processed, but throughput
dropped.
Investigation
3 servers X 4 workers each.

Tried turning it off and on again...

• Throughput had dropped again, although affected different server.
• Looking closer at Worker software, they were several versions behind.
• CHANGELOG showed many fixes, however none matched the situation.
• Advised customer to upgrade worker version.
• Also upped the log level.
Stuck worker

• Logs were clean.
• Deep diving into Worker documentation, discovered we were not
running the workers correctly.
• Should give each worker a unique ID.
• Long shot, but we were out of ideas.
More mystery
At least for multi-worker setups.
Again, why now?

The problem came back
the very next day.

• The latest version of the Worker allowed responding to signals for
debugging.
• While the issue was ongoing, sent it a USR1 signal.
• Process dumped what it was doing.
Signals?

Worker Debug Output
-> celery@prod-wrkr-02-celeryd-batch_01: OK
* {u'args': u'(149747,)', u'time_start': 75436993.58697978,
u'name': u'tasks.RefreshTask', u'delivery_info': {u'priority': 0,
u'redelivered': False, u'routing_key': u'background', u'exchange':
u'batch'}, u'hostname': u'celery@prod-wrkr-02-celeryd-batch_01',
u'acknowledged': True, u'kwargs': u'{}', u'id':
u'648b5805-1a49-4118-97b3-a235f173230e', u'worker_pid': 21325}
Made note of this. Was it some user ID?
Used this PID to inspect the process.

strace the worker
# strace -fp 21325# strace -fp 21325
…
[pid 21325] sendto(25, “Q00t1n SELECTn”..., 2354, MSG_NOSIGNAL, NULL, 0) = 2354
[pid 21325] poll([{fd=25, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=25, revents=POLLIN}])
# strace -fp 21325
…
# strace -fp 21325
…
# strace -fp 21325
…
Tracing the PID
This is the file descriptor
This looks like an SQL query
The process waits here

lsof the worker
# lsof -p 21325 | grep 25u# lsof -p 21325 | grep 25u
…
python 21325 deploy 25u IPv4 42259 0t0 TCP prod-wrkr-02:43776->prod-db-01:postgres
(ESTABLISHED)
# lsof -p 21325 | grep 25u
…
(ESTABLISHED)
# lsof -p 21325 | grep 25u
…
(ESTABLISHED)
Using same PID, and FD 25.
Made note of the source port.
FD 25 is a database connection.

On the DB server
# ps auxwf | grep 43776# ps auxwf | grep 43776
…
postgres 30108 8.9 13.1 4504052 4294416 ? Ds 07:11 30:19 _ postgres: user
database prod-wrkr-02(43776) SELECT
# ps auxwf | grep 43776
…
…
…
This is the source port of the worker.
The query is using 13% of memory,
or > 4GB on a 32GB server.
The query has been running since 7am.
It was currently 12:30pm.
The worker was actually not stuck!
It was waiting for a query to finish!

• Customer was able to identify the query based on the user ID.
• Normally, this query should return a single row.
• However this particular user had over 4 million rows.
• Likely left over from a bad data migration.
• The large database server masked the performance impact of the
large query.
Root-cause
Data was actually
duplicated.

This story is a typical troubleshooting
scenario of the past (5-10 yrs ago).
This was the Age of the Toolbox.

• With full access, you had a whole toolbox (OS layer and up) at your
fingertips.
• It allowed you to “pop open the hood” and inspect the lower layer
components.
• Particularly helpful when you don’t know what you’re looking for.
Age of the Toolbox

• It required advanced knowledge of core components to operate
complex systems.
• Systems were usually custom-built for a business’ needs. This makes it
difficult to hire and train.
• Difficult from a security standpoint (who gets access? Everyone?)
Downsides to this era

• Customer wanted to migrate their replicated MySQL database from
self-hosted to Amazon’s RDS.
• Fairly large database in production (~1TB).
• Should be straight-forward.
• Plan was use one of the slaves to migrate to avoid extra load on the
master.
DB Migration

• DMS was a natural choice for migrating the DB (it’s specifically designed
for this).
• DMS operates on a per-table basis allowing selective migrations.
• First shot was to migrate their staging database.
• Same table structure, smaller dataset (11GB vs. 1TB).
DMS

DMS Migration
Amazon DMS Amazon RDS
Master MySQL
Slave MySQL
AWS VPC Peering
MySQL Replication

• Seemed to work. Only a few issues encountered.
• Issues included:
• Converting dates such as 0000-00-00 to NULL.
• Approx 30 date records were shifted 3 hours into the future (other
300k+ similar records we unaffected).
• All were caught via DMS validation.
Success (mostly)
Why did it do this?
Why did it do this?

• Everything seemed to be going well, albeit slow.
• After 36 hours, we were 61% complete.
• However on the 4th night, the RDS instance ran out of memory.
• Quadrupled the size of the instance, picked up where it crashed.
• However DMS validation errors began to spring up.
Production Migration

Manual Validation
mysql> SELECT count(*) FROM users;
+----------+
| count(*) |
+----------+
| 4968305 |
+----------+
MySQL [production]> SELECT count(*) FROM users;
+----------+
| count(*) |
+----------+
| 4974044 |
+----------+
Source
RDS

• Tables were unrecoverable. Many missing or invalid records.
• 3 tables were affected, and they were the largest.
• Had to truncate the tables and restart the process.
Tables Truncated

• More tables reported validation errors (different from the original).
• Truncated those tables and reloaded again.
• Also had to increase the binlog_retention_days to 10 days on the
source.
More corruption found

• Many reports that v3.3.0 (current) is full of bugs.
• At this point, decided to switch to an older version of DMS.
• Fell back on previous version v3.1.4.
• Deleted everything and started over again to run over weekend.
Changed DMS version

• Older version caused even more corruption.
• At this point, 2 weeks have been sunk trying to wrestle with DMS.
• Had to cut our losses and find an alternative.
• Went through documentation and found that RDS supports loading
databases via backups.
• xtrabackup was a supported option, which customer already used!
Older Version

xtrabackup Method
Amazon RDS
Master MySQL
Slave MySQL
1. xtrabackup files were created and shipped to S3.
Amazon S3
2. RDS was directed to pick up backup from S3.

• Loaded previous night’s backup into S3 bucket.
• Recorded current replication position (via xtrabackup --prepare).
• Create RDS instance and point to bucket.
xtrabackup Method

• RDS instance was stuck on “creating” state.
• Via console we can see that backup was downloaded to the instance.
• No errors in the RDS logs.
• Decided to leave it another day just in case (as data set was very large).
18 hours in…

• After nearly 36 hours, RDS was still stuck on “creating”.
• Something was wrong.
• Was it the specific data set?
Still stuck

• Used https://github.com/datacharmer/test_db as a test data
source (~167MB).
• Kept everything else the same.
• RDS also hung at “creating”.
• It was not the data set.
• Was it some version issue?
Test Data Set

From https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.html

• We were using the right versions!
• DB was MySQL 5.7, and xtrabackup was 2.4.
• Checked Percona’s documentation, and found a blog post from April
2018...
Verified versions

From https://www.percona.com/blog/2018/04/02/migrate-to-amazon-rds-with-percona-xtrabackup/

• Is 5.7 actually not supported via this method?
• Loaded the test database on a 5.6 instance.
• Took a backup with xtrabackup, copied to S3.
• Pointed RDS to it, and…
Are AWS’ docs wrong?

• Import worked within a few minutes.
• Looks like AWS’ documentation is incorrect. 5.7 is not supported!
• However this did not help us.
• Source was 5.7 and downgrading was not possible.
Lesson: docs are not always right.

• Last option was to use a tried and tested method.
• Good ol’ mysqldump.
• Not the best method:
• Non-transactional table types may not be consistent (MyISAM).
• Slowest of all methods.
• Can cause significant load on source.
3 time’s a charm

mysqldump Method
Master MySQL
Slave MySQL
Amazon EC2 Amazon RDS
VPC
1. Backup was created via mysqldump and shipped to an EC2 instance.
2. Dump file was loaded into MySQL via mysql client.

• Created an EC2 instance within the same VPC as RDS.
• Use --single-transaction and --quick.
• Modify output with sed due to DEFINERs.
• Record replication position.
• Disable Multi-AZ and backups on RDS to increase speed.
Method utilized
Wanted it as close to the
destination as possible.

mysqldump
# mysqldump
--single-transaction
--quick
--master-data=2
--databases production |
sed 's/sDEFINER=`[^`]*`@`[^`]*`//g' |
gzip > production_$(date -I).sql.gz

• Import succeeded!
• Luckily MyISAM tables are rarely written to.
• RDS instance was almost 4 days behind (Seconds_Behind_Master).
• Took about 24 hours to catch up.
After 81 hours...
Would have caused major
issues if they were highly active.

What these stories highlight is the evolution
of DevOps and nature of troubleshooting.

We are now in the Age of Levers.

• As a system and businesses grow and advance, technical teams would
build their own levers (automation, abstraction) to meet the needs of
the organization. Eg:
• System images, config management, deployment scripts and tools.
• Building dev/staging environments that look like prod.
• CI/CD systems.
• This enabled unprecedented flexibility, but came with its own risks.
Natural Evolution

• Take on the burden of maintenance, security, compliance, etc...
• The bigger risk is employee turnover.
• Over an extended period of time, those who built the system will likely
move on, which creates a void.
• Hiring replacements comes at a high cost (no prior knowledge of your
systems, may not be familiar with your tech stack).
Past Risks

• The prevalence of cloud-based architecture, and the flurry of managed
services addresses these risks.
• It’s much easier to hire someone with experience in a common cloud's
ecosystem.
• Easier to train (read their docs*).
• Resources are abundant.
• Communities are well developed.
Evolution to common levers
*read with scrutiny.

While the marketing for managed services
claims to save time and money,
this is only true when things go right.

• Access to lower layers (“popping
the hood”).
• Systems typically built by the
same team operating it.
• Great flexibility and agility.
• Requires high levels of expertise.
• Higher operating risks.
Toolbox vs. Levers
• Quicker to get started.
• Easier to scale.
• Lower initial risks.
• Abundance of resources.
• Less flexibility, limited agility.
• Costs skyrocket when things go wrong.
• Lots of trial and error.

Lesson to be learned:
Embrace the evolution,
but be wary and budget for failure.

THE
END
… psst we’re hiring

—
301-675 King Street West
Toronto, ON, M5V 1M9
1 (866) 278 0021
@stackdotio
/stackdotio
/company/stackdotio

DevOps throughout time

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a DevOps throughout time

Semelhante a DevOps throughout time (20)

Último

Último (20)

DevOps throughout time