Fixing Web Data in Production

Best practices for bad situations
Aaron Knight, Full Stack Engineer at Voxy

I am
Aaron Knight (@iamaaronknight)
Full Stack Engineer
Voxy.com

Voxy is
● A Django web app
● 8 years old
● 12 engineers
● 10+ data stores

Oops.
There’s a problem with the data!
(It was probably my fault)

> SELECT * FROM feature_toggles LIMIT 3;
+------+---------+--------------------+
| id   | user_id | orientation_videos |
|------+---------+--------------------|
| 1234 | 8923123 | f                  |
| 1235 | 9213483 | f                  |
| 1236 | 2136935 | f                  |
+------+---------+--------------------+

> UPDATE feature_toggles SET
orientation_videos = 't' WHERE...

Hold up!
What could go wrong?
● You bring down the site.
● You make the problem worse.
● You forget what you did.

Never change data in prod
● Never introduce any bugs.
● Make all the right architecture decisions the first time.

Data fixes are code.
● Check them in to source control.
● Test them.
● Code review them.
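
As a minimal sketch of what testing a data fix can look like, the example below keeps the decision logic in plain functions so it can be unit tested without touching a database. The names `needs_fix` and `apply_fix`, and the dictionary rows standing in for model instances, are hypothetical and not from the talk.

```python
import unittest


def needs_fix(row):
    """Return True when a row still has the bad value (hypothetical rule)."""
    return not row["orientation_videos"]


def apply_fix(rows):
    """Fix rows in place and return the ids that were changed."""
    changed = []
    for row in rows:
        if needs_fix(row):
            row["orientation_videos"] = True
            changed.append(row["id"])
    return changed


class ApplyFixTest(unittest.TestCase):
    def test_only_bad_rows_change(self):
        rows = [
            {"id": 1234, "orientation_videos": False},
            {"id": 1235, "orientation_videos": True},
        ]
        self.assertEqual(apply_fix(rows), [1234])
        self.assertTrue(all(r["orientation_videos"] for r in rows))

    def test_second_run_changes_nothing(self):
        rows = [{"id": 1234, "orientation_videos": False}]
        apply_fix(rows)
        self.assertEqual(apply_fix(rows), [])
```

Returning the list of changed ids gives both the tests and the execution log something concrete to check.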

Track execution.
● Log when a script is executed.
● Log everything that changed.
● Log what did not change.

import logging

logger = logging.getLogger(__name__)

# FeatureToggle and get_correct_value are defined elsewhere in the app.
def fix_feature_toggles():
    logger.info('Starting fix_feature_toggles script')
    for toggle in FeatureToggle.objects.all():
        if toggle.orientation_videos:
            # Log what did not change, too.
            logger.info(
                'FeatureToggle {} orientation_videos already exists; '
                'skipping'.format(toggle.id))
        else:
            toggle.orientation_videos = get_correct_value(toggle)
            toggle.save()
            logger.info(
                'FeatureToggle {} orientation_videos updated to {}'.format(
                    toggle.id, toggle.orientation_videos))
    logger.info('Finished fix_feature_toggles script')

Track execution.
● Centralize your logging.

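import datetime
import json
from collections import OrderedDict

import boto3
import pytz
from django.conf import settings

firehose = boto3.client('firehose')

# get_filename_of_caller() is a helper defined elsewhere.
def log_to_kinesis(message):
    data = OrderedDict([
        ('script_name', get_filename_of_caller()),
        ('environment', settings.ENVIRONMENT),
        # Take the current time in UTC directly; localizing a naive
        # local now() as UTC would record the wrong instant.
        ('ts', str(datetime.datetime.now(pytz.utc))),
        ('message', message),
    ])
    firehose.put_record(
        DeliveryStreamName='backfill-logs',
        # Newline-delimit the records so they can be split downstream.
        Record={'Data': json.dumps(data, sort_keys=False) + '\n'}
    )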

Track execution.
● Track the script’s progress.

from tqdm import tqdm

def backfill_toggles():
    count = FeatureToggle.objects.count()
    # tqdm takes the expected number of iterations as total.
    for toggle in tqdm(FeatureToggle.objects.all(), total=count):
        ...

Be fault-tolerant.
● Think of possible exceptions.

for user_id in list_of_user_ids:
    try:
        toggle = FeatureToggle.objects.get(user_id=user_id)
    except FeatureToggle.DoesNotExist:
        # Skip missing rows instead of crashing halfway through the list.
        logger.info(
            'FeatureToggle does not exist for User {}'.format(user_id))
        continue
    toggle.orientation_videos = True
    toggle.save()

Be fault-tolerant.
● Make your scripts idempotent, if possible.

for user_id in list_of_user_ids:
    try:
        # Only fetch rows that an earlier run has not already fixed.
        toggle = FeatureToggle.objects.get(
            user_id=user_id, backfilled=False)
    except FeatureToggle.DoesNotExist:
        continue
    toggle.orientation_videos = True
    toggle.backfilled = True
    toggle.save()

Be fault-tolerant.
● Make your changes reversible, if possible.

{"environment": "production", "ts": "2017-10-02 18:33:08.805645+00:00", "message": "unit_id 117 resource_id: None > 597ca7531ce6856f34607de9"}
{"environment": "production", "ts": "2017-10-02 18:33:08.878832+00:00", "message": "unit_id 28 resource_id: None > 54ca763ca8615a76184dd4a9"}
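
Log records like these carry both the old and the new value, which is what makes a change reversible. The sketch below shows one way a revert plan might be rebuilt from such logs. It assumes one JSON object per line and the "unit_id N resource_id: OLD > NEW" message shape seen above; `parse_change` and `build_revert_plan` are hypothetical helpers, not part of the talk.

```python
import json
import re

# Matches messages shaped like:
# "unit_id 117 resource_id: None > 597ca7531ce6856f34607de9"
MESSAGE_RE = re.compile(r"unit_id (\d+) resource_id: (\S+) > (\S+)")


def parse_change(log_line):
    """Return (unit_id, old_value, new_value), or None if the line differs."""
    message = json.loads(log_line)["message"]
    match = MESSAGE_RE.fullmatch(message)
    if match is None:
        return None
    unit_id, old, new = match.groups()
    # The log writes a missing value as the string "None".
    return int(unit_id), (None if old == "None" else old), new


def build_revert_plan(log_lines):
    """Map each unit_id to the value it held before the script ran."""
    plan = {}
    for line in log_lines:
        change = parse_change(line)
        if change is not None:
            unit_id, old, _new = change
            # Keep the earliest logged value if a unit appears twice.
            plan.setdefault(unit_id, old)
    return plan
```

A revert script could then walk the plan and write each old value back, logging as it goes.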

Know your bottlenecks.
● CPU?
● Memory?
● Database?

feature_toggles = []
for user_id in user_ids_to_backfill:
    feature_toggles.append(FeatureToggle(
        user_id=user_id,
        orientation_videos=True,
    ))
# One bulk INSERT instead of one query per row.
FeatureToggle.objects.bulk_create(feature_toggles)
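
When memory is the bottleneck, building one list for every row may be too much. A minimal sketch of batching, assuming the ids arrive as any iterable; `chunked` is a hypothetical helper, not something shown in the talk.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items without materializing the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Possible use with the snippet above (not run here):
# for batch in chunked(user_ids_to_backfill, 1000):
#     FeatureToggle.objects.bulk_create(
#         [FeatureToggle(user_id=u, orientation_videos=True) for u in batch])
```

Django's `bulk_create` also accepts a `batch_size` argument, which limits the size of each query but still needs the full list of objects in memory.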

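import psycopg2
import psycopg2.extras
from tqdm import tqdm

BATCH_SIZE = 1000  # assumption: the slide does not show the batch size

def backfill_activity_progresses():
    conn = psycopg2.connect("some_credentials")
    cursor = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    data_to_replicate = []
    # batch_range is defined elsewhere; assumed here to yield row offsets.
    for offset in tqdm(batch_range):
        # Assumption: page through the table so each query reads one batch.
        cursor.execute(
            "SELECT user_id, type, correct_answers, total_answers, is_complete "
            "FROM legacy_activity ORDER BY id LIMIT %s OFFSET %s;",
            (BATCH_SIZE, offset))
        data_to_replicate.append(cursor.fetchall())
    conn.close()
    add_to_new_activity_table(data_to_replicate)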

Know your bottlenecks.
● Developer time?
● Cognitive overhead?

Execute at the right level of abstraction.
● Use existing functions and the ORM when you can afford to.
● Use SQL when execution time becomes significant.
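
To illustrate the trade-off, here is a small sketch contrasting a row-by-row fix, which is roughly what an ORM loop does, with a single set-based UPDATE. It uses an in-memory SQLite table as a stand-in for the real database; the sample rows and the function names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_toggles ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, orientation_videos INTEGER)"
)
conn.executemany(
    "INSERT INTO feature_toggles (id, user_id, orientation_videos) "
    "VALUES (?, ?, ?)",
    [(1234, 8923123, 0), (1235, 9213483, 0), (1236, 2136935, 1)],
)


def fix_row_by_row(conn):
    """ORM-style loop: one UPDATE statement per row that needs fixing."""
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM feature_toggles WHERE orientation_videos = 0")]
    for toggle_id in ids:
        conn.execute(
            "UPDATE feature_toggles SET orientation_videos = 1 WHERE id = ?",
            (toggle_id,))
    return len(ids)


def fix_set_based(conn):
    """Set-based fix: a single UPDATE covering every affected row."""
    cursor = conn.execute(
        "UPDATE feature_toggles SET orientation_videos = 1 "
        "WHERE orientation_videos = 0")
    return cursor.rowcount
```

The loop makes it easy to log each row and reuse application logic. The single statement does the same work in one round trip, at the cost of less detailed logging.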

Use database snapshots.
● Test your script on a backup from production.
● Take snapshots before you make changes.
● Automate your backups.
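
The "snapshot first, then change" habit can be wrapped in a small helper so it is hard to forget. The sketch below uses SQLite's online backup API purely as a stand-in; with a hosted production database the snapshot step would be whatever mechanism that service provides. `run_with_snapshot` is a hypothetical name.

```python
import sqlite3


def run_with_snapshot(conn, fix):
    """Copy the database, run `fix`, and return the snapshot connection."""
    snapshot = sqlite3.connect(":memory:")
    conn.commit()           # make sure pending work is part of the snapshot
    conn.backup(snapshot)   # sqlite3 online backup API (Python 3.7+)
    fix(conn)
    conn.commit()
    return snapshot
```

If the fix turns out to be wrong, the returned snapshot still holds the data as it was before the script ran.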

Thanks!
Aaron Knight (@iamaaronknight)
Full Stack Engineer
Voxy.com
