Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2nwSwEh.
Marco Bonzanini discusses the process of building data pipelines, e.g. extraction, cleaning, integration, pre-processing of data; in general, all the steps necessary to prepare data for a data-driven product. In particular, he focuses on data plumbing and on the practice of going from prototype to production. Filmed at qconlondon.com.
Marco Bonzanini is Data Scientist and co-organizer of PyData London Meetup.
2. InfoQ.com: News & Community Site
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations
/data-pipelines-python
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
3. Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
24. (Unit) Testing
Unit tests in three easy steps:
• import unittest
• Write your tests
• Quit complaining about lack of time to write tests
25. Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think
28. Testing: not convinced yet?
f1 = fscore(p, r)
min_bound, max_bound = sorted([p, r])
assert min_bound <= f1 <= max_bound
29. Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• The Python ecosystem is rich:
py.test, nosetests, hypothesis, coverage.py, …
31. Towards Good Data Pipelines (f)
Orchestration
Don’t re-invent the wheel
32. You need a workflow manager
Think:
GNU Make + Unix pipes + Steroids
33. Intro to Luigi
• Task dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
$ pip install luigi
34. Luigi Task: unit of execution
class MyTask(luigi.Task):
def requires(self):
return [SomeTask()]
def output(self):
return luigi.LocalTarget(…)
def run(self):
mylib.run()
35. Luigi Target: output of a task
class MyTarget(luigi.Target):
def exists(self):
... # return bool
Great off the shelf support
local file system, S3, Elasticsearch, RDBMS
(also via luigi.contrib)
36.
37. Intro to Airflow
• Like Luigi, just younger
• Nicer (?) GUI
• Scheduling
• Apache Project
38. Towards Good Data Pipelines (g)
When things go wrong
The Joy of debugging
40. Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications (built-in in Luigi)
• Slack notifications
$ pip install luigi_slack # WIP
41. Towards Good Data Pipelines (h)
Static Analysis
The Joy of Duck Typing
42. If it looks like a duck,
swims like a duck,
and quacks like a duck,
then it probably is a duck.
— somebody on the Web
44. >>> '1' * 2
'11'
>>> '1' + 2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object
to str implicitly
45. def do_stuff(a: int,
b: int) -> str:
...
return something
PEP 3107 — Function Annotations
(since Python 3.0)
(annotations are ignored by the interpreter)
46. typing module: semantically coherent
PEP 484 — Type Hints
(since Python 3.5)
(still ignored by the interpreter)