This document provides best practices for fixing data issues that occur in production databases. It recommends treating data fixes as code by checking fixes into source control, testing them, and conducting code reviews. It also advises logging all data fix executions, changes, and exceptions. Developers should make fixes idempotent and reversible when possible, be fault-tolerant of exceptions, and optimize for bottlenecks like CPU, memory, and database usage. Database snapshots should be used for testing and reverting changes.
21. Track execution.
● Log when a script is executed.
● Log everything that changed.
● Log what did not change.
● Centralize your logging.
● Track the script’s progress.
26.
    for user_id in list_of_user_ids:
        try:
            toggle = FeatureToggle.objects.get(user_id=user_id)
        except FeatureToggle.DoesNotExist:
            # Log the "nothing changed" path too, then move on to the next user.
            logger.info('FeatureToggle does not exist for User {}'.format(user_id))
            continue
        toggle.orientation_videos = True
        toggle.save()
34.
    feature_toggles = []
    for user_id in user_ids_to_backfill:
        feature_toggles.append(
            FeatureToggle(user_id=user_id, orientation_videos=True)
        )
    # One bulk insert instead of one query per object.
    FeatureToggle.objects.bulk_create(feature_toggles)
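35.
    import psycopg2
    import psycopg2.extras
    from tqdm import tqdm

    BATCH_SIZE = 1000  # assumption: the batch size is not given on the slide

    def backfill_activity_progresses(batch_range):
        conn = psycopg2.connect("some_credentials")
        cursor = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
        data_to_replicate = []
        for index in tqdm(batch_range):
            # Fetch one batch per iteration (assumes batch_range counts batches from 0).
            cursor.execute(
                "SELECT user_id, type, correct_answers, total_answers, is_complete "
                "FROM legacy_activity ORDER BY id LIMIT %s OFFSET %s;",
                (BATCH_SIZE, index * BATCH_SIZE),
            )
            data_to_replicate.extend(cursor.fetchall())
        conn.close()
        add_to_new_activity_table(data_to_replicate)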
37. Execute at the right level of abstraction.
● Use existing functions and the ORM when you can afford to.
● Use SQL when execution time becomes significant.
3. This is a talk about how to deal with problems that arise with your data in production.
Imagine that you’re working on a web application. It has an admin interface where admins can toggle different features on and off for different users.
Somehow, those feature toggles got messed up. Now the orientation_videos flag is set to False for a large number of users.
Fortunately, you have a way to recover which users are supposed to have that feature enabled so you can go into your database and fix the problem.
If that is your response to updating production data in your web application, I’m going to suggest that you stop and think about the variety of ways in which such an operation can go wrong.
5. The right approach is to never, ever change production data.
But of course that’s not how things work in the real world.
6. So instead, let’s talk about some realistic ways to fix your data safely. The first bit of advice should be obvious: don’t just start executing SQL queries or shell commands off the cuff. Treat these changes as code. That means checking them in...
7. Testing them. This might seem like a waste of time since you’re probably going to throw this code away after you run it. But it’s better to waste a little time writing a test than a lot of time trying to reverse a catastrophic mistake in your data.
Whatever process your team has for code review, do that.
10. Secondly, when you execute one of these scripts, the last thing that you want is code that runs silently for an indeterminate period of time, and may or may not have had the desired effect. So be very generous with your logging.
Here’s an example script. Note that we’re logging at the beginning of the function, at the end, and for every code path in between.
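As a rough illustration of that pattern, here is a minimal sketch that logs at the start, at the end, and on every path in between. It uses a plain dict in place of the FeatureToggle table so that it runs without Django. The function name `enable_orientation_videos` and the dict layout are assumptions made for this example, not the script from the talk.

```python
import logging

logger = logging.getLogger("data_fix")


def enable_orientation_videos(toggles, user_ids):
    """Set orientation_videos to True for each user, logging every path.

    `toggles` is a hypothetical stand-in for the table:
    {user_id: {"orientation_videos": bool}}.
    Returns (changed, unchanged, missing) lists of user ids.
    """
    logger.info("Starting fix for %d users", len(user_ids))
    changed, unchanged, missing = [], [], []
    for user_id in user_ids:
        toggle = toggles.get(user_id)
        if toggle is None:
            # Path 1: nothing to update for this user.
            logger.info("No toggle for user %s, skipping", user_id)
            missing.append(user_id)
        elif toggle["orientation_videos"]:
            # Path 2: already correct, so log what did not change.
            logger.info("User %s already enabled, no change", user_id)
            unchanged.append(user_id)
        else:
            # Path 3: the actual change, logged with old and new values.
            toggle["orientation_videos"] = True
            logger.info("User %s: orientation_videos False -> True", user_id)
            changed.append(user_id)
    logger.info(
        "Finished: %d changed, %d unchanged, %d missing",
        len(changed), len(unchanged), len(missing),
    )
    return changed, unchanged, missing
```

Returning the three lists alongside the log lines makes it easy to assert in a test that each user took the path you expected.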
Another consideration is that you should centralize your logging. These logs need to be accessible to anyone on your team who may need to look back and see what happened.
Any tool that works for you is fine, but what we use is Amazon Kinesis Firehose. We use it to write logs from our scripts to Amazon S3. What’s great about this service is the simplicity of using it.
You set up a firehose in AWS, and then writing a log to an S3 bucket is just as simple as this. Note that I have a cute little function that gets the filename of the script that’s calling this function.
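The slide itself is not reproduced in these notes, so the following is only a sketch of what such a helper could look like, assuming boto3 and an existing delivery stream. The stream name `data-fix-logs`, the record fields, and the function names are all hypothetical. Only `put_record` with `DeliveryStreamName` and `Record` is the actual boto3 Firehose call.

```python
import inspect
import json
import os
from datetime import datetime, timezone


def calling_script_name():
    """Return the file name of the script that called log_to_firehose."""
    # stack()[0] is this function, [1] is log_to_firehose, [2] is its caller.
    return os.path.basename(inspect.stack()[2].filename)


def log_to_firehose(message, stream_name="data-fix-logs", client=None):
    """Send one JSON log record to a Firehose delivery stream.

    `stream_name` is a hypothetical name. `client` can be injected for
    testing; otherwise a boto3 Firehose client is created.
    """
    if client is None:
        import boto3  # imported lazily so the sketch loads without AWS set up
        client = boto3.client("firehose")
    record = {
        "script": calling_script_name(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
    }
    # Firehose does not add delimiters, so end each record with a newline.
    data = (json.dumps(record) + "\n").encode("utf-8")
    client.put_record(DeliveryStreamName=stream_name, Record={"Data": data})
    return record
```

Passing the client in as an argument keeps the helper testable without touching AWS.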
Since I imagine that you’ll still be running these scripts manually, it’s pretty important to know how far along you are.
For that, we use tqdm.
You pass an iterable to the tqdm function, along with a total (the expected item count) if the length of your iterable is expensive to compute, and it gives you a small progress bar with time estimates.
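A minimal sketch of that usage, assuming tqdm is installed. The generator `user_ids` is a made-up example of an iterable with no length, which is the case where passing `total` matters. The fallback import is only there so the sketch still runs where tqdm is missing.

```python
try:
    from tqdm import tqdm
except ImportError:  # fallback so the sketch still runs without tqdm installed
    def tqdm(iterable, total=None):
        return iterable


def user_ids():
    """Hypothetical generator; generators have no len(), so tqdm needs total."""
    for user_id in range(1, 1001):
        yield user_id


processed = 0
for user_id in tqdm(user_ids(), total=1000):
    processed += 1  # the real fix for one user would go here
```

Without `total`, tqdm still counts iterations and shows the rate, but it cannot estimate the remaining time.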
15. Speaking of time, your script might take a long time to run. And won’t it be annoying if it runs for 3 hours and then fails halfway through?
It’s always easy to forget to handle error conditions.
So it’s a good idea to practice defensive programming in this case. Think about possible exceptions and catch them, making sure to log.
Another thing to think about is, when your script does break 2 hours in, can you safely run it again and get the desired results?
This might not always be possible or necessary, but in some extreme cases we have resorted to adding a new field to a model to track which items have been backfilled.
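One way to picture the tracking-field approach is the sketch below, which uses an in-memory SQLite database in place of the real one. The table name, the `backfilled` column, and the `backfill` function are assumptions for illustration only.

```python
import sqlite3


def backfill(conn):
    """Enable the flag only for rows not yet marked as backfilled.

    Table and column names are hypothetical. Returns the number of rows
    changed in this run.
    """
    rows = conn.execute(
        "SELECT user_id FROM feature_toggle WHERE backfilled = 0"
    ).fetchall()
    for (user_id,) in rows:
        # Set the flag and the marker in one statement and commit per row,
        # so a crash never leaves a row half-done.
        conn.execute(
            "UPDATE feature_toggle SET orientation_videos = 1, backfilled = 1 "
            "WHERE user_id = ?",
            (user_id,),
        )
        conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_toggle ("
    "user_id INTEGER PRIMARY KEY, "
    "orientation_videos INTEGER NOT NULL DEFAULT 0, "
    "backfilled INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO feature_toggle (user_id) VALUES (?)", [(1,), (2,), (3,)]
)
first_run = backfill(conn)
second_run = backfill(conn)  # safe to rerun: nothing is left to change
```

Because the second run finds no unmarked rows, rerunning the script after a failure only picks up where the previous run stopped.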
Another nice feature is reversibility. If you screwed something up, is it possible to figure out what the previous state of the data was?
This is where really detailed logging comes into play. If your logs contain all of the necessary information, you could conceivably parse them to get the original state of your data and reverse the damage.
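To make that concrete, here is a small sketch in which each change is logged as one JSON line that includes the old value, and a second function replays those lines to undo the fix. The log format and both function names are assumptions. A dict stands in for the database table.

```python
import json


def apply_fix(toggles, user_ids, log_lines):
    """Set the flag and append one JSON log line per change, with old value."""
    for user_id in user_ids:
        old = toggles[user_id]
        toggles[user_id] = True
        log_lines.append(
            json.dumps({"user_id": user_id, "old": old, "new": True})
        )


def revert_from_logs(toggles, log_lines):
    """Restore each logged row to its old value, newest entry first."""
    for line in reversed(log_lines):
        entry = json.loads(line)
        toggles[entry["user_id"]] = entry["old"]


toggles = {1: False, 2: True, 3: False}
log_lines = []
apply_fix(toggles, [1, 2, 3], log_lines)
after_fix = dict(toggles)
revert_from_logs(toggles, log_lines)
```

Replaying newest first matters when the same row is touched more than once, since the oldest entry holds the true original value.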
20. If you’re dealing with a lot of data that needs to be fixed, you’re going to start needing to do some actual engineering. You’re going to need to think about what bottlenecks you might encounter. Maybe you’re doing something computationally intensive, in which case you might need to think about how to parallelize the job.
Or maybe the naive version of your script is going to load several GB of data into memory. In that case, you might need to rewrite to use a generator or something.
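A sketch of the generator idea: instead of building one huge list, yield fixed-size batches so that only one batch is in memory at a time. The helper name `in_batches` is made up for this example.

```python
from itertools import islice


def in_batches(iterable, batch_size):
    """Yield lists of up to batch_size items without loading everything."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batch_sizes = [len(batch) for batch in in_batches(range(10), 4)]
```

The same helper works with any lazy source, such as a database cursor or a file being read line by line.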
More likely, the database is going to be your big issue.
In that case, it’s time to explore some of the features of your ORM. For example, here’s a construct in the Django ORM that I’ve used to bulk create objects instead of creating them one by one. That can be a huge time saver.
If your script is still too slow, you might want to drop down to the SQL level and skip object instantiation and other ORM overhead. You can get huge performance gains this way.
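As an illustration of dropping to SQL, the sketch below updates a whole set of rows with a single statement. It uses an in-memory SQLite database as a stand-in, and the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_toggle ("
    "user_id INTEGER PRIMARY KEY, orientation_videos INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO feature_toggle VALUES (?, 0)", [(i,) for i in range(1, 6)]
)

user_ids_to_fix = [1, 3, 5]
placeholders = ", ".join("?" for _ in user_ids_to_fix)
# One UPDATE statement for the whole set; no per-row objects are built.
cursor = conn.execute(
    "UPDATE feature_toggle SET orientation_videos = 1 "
    "WHERE user_id IN ({})".format(placeholders),
    user_ids_to_fix,
)
conn.commit()
rows_updated = cursor.rowcount
enabled = [
    row[0]
    for row in conn.execute(
        "SELECT user_id FROM feature_toggle "
        "WHERE orientation_videos = 1 ORDER BY user_id"
    )
]
```

Only the placeholders are formatted into the query string. The ids themselves are passed as parameters, and logging `rows_updated` gives a quick check that the statement touched as many rows as expected.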
But please, please don’t do these things if you don’t need to. I’ve gotten into PR debates with coworkers who wanted to over-optimize a script that takes 15 minutes to run. Don’t get fancy.
So my general rule of thumb is to use the ORM when you can, and use SQL or the equivalent when you need to.
24. Let’s talk about another issue that should be obvious but needs to be said: backups are your friend. First of all, if it’s at all feasible, set up a snapshot of your data and run your script against that before running it against the real thing.
If you’re going to be doing something risky, please back up your data before you execute your script. Right before.
If you’re going to be making significant changes fairly often, consider automating your database backups so that a snapshot is taken right before your scripts run.
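The idea of snapshotting right before a script runs can be sketched with SQLite, whose Python connection object has a `backup` method. This is only a stand-in: a production database would use its own snapshot tooling, and the `snapshot` and `risky_fix` functions here are hypothetical.

```python
import sqlite3


def snapshot(source):
    """Copy the whole database into a new in-memory connection."""
    copy = sqlite3.connect(":memory:")
    source.backup(copy)
    return copy


def risky_fix(conn):
    """Hypothetical data fix that touches every row."""
    conn.execute("UPDATE feature_toggle SET orientation_videos = 1")
    conn.commit()


live = sqlite3.connect(":memory:")
live.execute(
    "CREATE TABLE feature_toggle (user_id INTEGER, orientation_videos INTEGER)"
)
live.execute("INSERT INTO feature_toggle VALUES (1, 0)")
live.commit()

backup = snapshot(live)  # taken right before the script runs
risky_fix(live)

query = "SELECT orientation_videos FROM feature_toggle"
live_value = live.execute(query).fetchone()[0]
backup_value = backup.execute(query).fetchone()[0]
```

The same copy can also serve as the test target: run the fix against the snapshot first, inspect the result, and only then run it against the live data.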
There are lots of other considerations, but those are some of the big ones.