Metaflow (Ville Tuulos)
Data scientists at Netflix are expected to develop and operate large machine learning workflows autonomously. However, we do not expect that all our scientists are deeply experienced with distributed systems and data engineering. Metaflow was created to make it delightfully easy to build and operate ML workflows in the cloud using idiomatic Python and off-the-shelf ML libraries, covering the whole lifecycle of an ML project from prototype to production.
Polynote (Jeremy Smith)
Polynote is a new notebook tool we created from scratch to address some of the pain points we've run into while using Scala in machine-learning notebooks at Netflix. It provides essential code-editing features that other tools lack, such as interactive autocompletion and support for mixing multiple languages and sharing data between them within a single notebook, and it encourages reproducible notebooks with its immutable data model.
Papermill (Matthew Seal)
nteract is an open source organization under which several libraries and applications are developed, with contributions from Netflix and many other companies and individuals. One of these libraries is Papermill, a library used to programmatically parameterize and execute Jupyter Notebooks. Papermill provides a CLI and a Python interface, which we'll explore during the session to see how they can be used and what value they add. We'll also briefly discuss how we've integrated Papermill at Netflix and how it interfaces with other Jupyter and nteract services.
33. Polynote is a polyglot notebook environment, built from scratch.
It supports mixing Scala, Python, SQL, and Vega in a single notebook.
Data is shared seamlessly* between languages.
35. Why did we build it?
Scientists were avoiding Scala notebooks for experimentation.
It was just a pain to use Scala and Spark in a notebook.
36. Scala + Spark pain points
● Interactive autocomplete is practically a necessity
● Difficult to find compiler errors
● Dependencies are many and varied
● Spark clashes with dependencies – constantly building shaded JARs
40. Visibility
See what the kernel's up to with the symbol table, task list, and executing-expression highlight.
41. Data Visualization
Use the built-in data inspector to browse tabular data and inspect schemas. Plot data with the plot editor, or use Vega or matplotlib directly.
44. Polyglot
Scala cells and Python cells together in one notebook. Variables from each language are available to the other.
Example use case: data prep in Scala+Spark, model training in Python with TensorFlow/PyTorch/etc.
49. Things to preserve:
● Results linked to code
● Good visuals
● Easy to share
Things to improve:
● Not versioned
● Mutable state
● Templating
Focus points to extend uses.
50. Jupyter Notebooks: A REPL Protocol + UIs
[Architecture diagram: users develop and share notebooks through Jupyter UIs; the UIs save/load .ipynb files via a Jupyter Server, which forwards requests to a Jupyter Kernel; the kernel executes code and the UIs receive the outputs. It's more complex than this in reality.]
51. A simple library for executing notebooks.
[Diagram: an input notebook (template.ipynb, read from an input store such as efs://users/mseal/notebooks on EFS) is parameterized and run by Papermill; the resulting output notebooks (run_1.ipynb, run_2.ipynb, run_3.ipynb, run_4.ipynb) are written to an output store such as s3://output/mseal/ on S3.]
52. Choose an output location.
import papermill as pm
pm.execute_notebook('input_nb.ipynb', 'outputs/20190402_run.ipynb')
…
# Each run can be placed in a unique / sortable path
pprint(files_in_directory('outputs'))
outputs/
    ...
    20190401_run.ipynb
    20190402_run.ipynb
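The date-stamped, sortable path shown above can be generated in plain Python; a minimal sketch (the `dated_output_path` helper and its directory layout are illustrative, not part of Papermill):

```python
from datetime import datetime, timezone

def dated_output_path(base_dir, name, when=None):
    """Build a unique, lexically sortable output path such as
    'outputs/20190402_run.ipynb' for a notebook run."""
    when = when or datetime.now(timezone.utc)
    return f"{base_dir}/{when:%Y%m%d}_{name}.ipynb"

path = dated_output_path("outputs", "run", datetime(2019, 4, 2))
# path == 'outputs/20190402_run.ipynb'
```

Because the date prefix sorts lexically, listing the output directory shows runs in chronological order, as on the slide.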
54. Also available as a CLI.
# Same example as last slide
pm.execute_notebook('input_nb.ipynb', 'outputs/20190402_run.ipynb',
                    {'region': 'ca', 'devices': ['phone', 'tablet']})
…
# Bash version of that input
papermill input_nb.ipynb outputs/20190402_run.ipynb -p region ca -y '{"devices": ["phone", "tablet"]}'
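Papermill applies these parameters by injecting a new cell after the notebook's tagged `parameters` cell, so supplied values override the defaults declared there. A rough pure-Python illustration of that override semantics (not Papermill's actual implementation):

```python
def merge_parameters(defaults, overrides):
    """Mimic the effect of parameter injection: values passed to
    execute_notebook (or -p/-y on the CLI) override the defaults
    declared in the notebook's 'parameters' cell, while untouched
    defaults are kept."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged

# Defaults as declared in the notebook's parameters cell...
defaults = {"region": "us", "devices": ["tv"]}
# ...overridden by the values passed at execution time.
merged = merge_parameters(defaults,
                          {"region": "ca", "devices": ["phone", "tablet"]})
# merged == {"region": "ca", "devices": ["phone", "tablet"]}
```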
57. Entire library is component based.
# To add SFTP support you'd add this class
class SFTPHandler:
    def read(self, file_path):
        ...
    def write(self, file_contents, file_path):
        ...

# Then add an entry_point for the handler
from setuptools import setup, find_packages
setup(
    # all the usual setup arguments ...
    entry_points={'papermill.io':
                  ['sftp://=papermill_sftp:SFTPHandler']})

# Use the new prefix to read/write from that location
pm.execute_notebook('sftp://my_ftp_server.co.uk/input.ipynb',
                    'sftp://my_ftp_server.co.uk/output.ipynb')
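The `papermill.io` entry point group effectively builds a registry mapping path prefixes to handler classes, which is how the `sftp://` prefix above gets routed to `SFTPHandler`. A hedged sketch of that style of prefix dispatch (the handler classes and registry here are illustrative, not Papermill's internals):

```python
class LocalHandler:
    """Fallback handler for plain filesystem paths."""
    def read(self, path):
        with open(path) as f:
            return f.read()

class S3Handler:
    """Placeholder for a cloud-storage handler."""
    def read(self, path):
        raise NotImplementedError("an S3 client call would go here")

# Registry from scheme prefix to handler instance, analogous to what
# the 'papermill.io' entry points register.
HANDLERS = {"": LocalHandler(), "s3://": S3Handler()}

def get_handler(path):
    """Pick the handler whose prefix matches the path; the longest
    matching prefix wins, with '' acting as the local fallback."""
    best = max((p for p in HANDLERS if path.startswith(p)), key=len)
    return HANDLERS[best]
```

Registering a new scheme is then just adding one entry to the registry, which is what the entry_point declaration on the slide does for an installed package.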
60. Failed outputs are useful.
Output notebooks are the place to look for failures. They have:
● Stack traces
● Re-runnable code
● Execution logs
● Same interface as input
61. Find the issue. Test the fix. Update the notebook.
62. Changes to the notebook experience.
Adds notebook isolation:
● Immutable inputs
● Immutable outputs
● Parameterization of notebook runs
● Configurable sourcing / sinking
and gives better control of notebook flows via library calls.
63. Jupyter Notebooks @Netflix
● The platform scheduler uses Jupyter Notebooks for all templates.
● Notebooks are used to run integration tests, monitor systems, execute ETL, and wrap ML flows.