Here in the DS team at Wix we want to help people create stunning sites by applying recent achievements of AI research in production. Since Data Science engineering practices are still not fully formed, we found it crucial to bring in the best practices from software engineering: give Data Scientists the ability to deliver models fast, without losing quality or computational efficiency, so we stay competitive in this overhyped market. To achieve this we are developing our own infrastructure for creating pipelines and deploying them to production with minimal (to no) engineering involvement.
This talk covers the initial motivation, the technical issues we solved, and the lessons learned while building such an ML delivery system.
Website: https://fwdays.com/en/event/data-science-fwdays-2019/review/continuous-delivery-of-ml-pipelines-to-production
2. In the late 1990s
software development was:
● An invaluable capability for improving business outcomes
● Transforming from solo practitioners/small teams to large collaborative teams
● Undergoing a rapid evolution of best practices and tooling
● Experiencing heavy demand for competent practitioners
Source: https://blog.dominodatalab.com/joel-test-data-science/
3. Joel Test
to rate the quality of a software team (August 2000)
1. Do you use source control?
2. Can you make a build in one step?
3. Do you make daily builds?
4. Do you have a bug database?
5. Do you fix bugs before writing new code?
6. Do you have an up-to-date schedule?
7. Do you have a spec?
8. Do programmers have quiet working conditions?
9. Do you use the best tools money can buy?
10. Do you have testers?
11. Do new candidates write code during their interview?
12. Do you do hallway usability testing?
Source: https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/
4. Joel Test for Data Science
1. Are results reproducible?
2. Do you create a data pipeline that you can rebuild with one command?
3. Do you rebuild pipelines frequently?
4. Can Data Scientists deploy their models with minimal dependencies on
engineering and infrastructure?
5. Do you use source control?
6. Do you track bugs in your models and your pipeline code?
7. Do you translate model performance to commercial KPIs?
Source: http://guerrilla-analytics.net/joel-test-of-data-science-maturity/
5. Joel Test for Data Science (edited version)
1. Are results reproducible?
2. Do you create a data pipeline that you can rebuild with one command?
3. Do you rebuild pipelines frequently?
4. Can Data Scientists deploy their models with minimal dependencies on engineering and infrastructure?
5. Do you use version control?
6. Do you write auto-tests for your models and your pipeline code?
7. Do you translate model efficiency to commercial KPIs?
6. Code Reuse
What we usually do first when setting up a new project:
1. Create an environment
2. Get data
3. Copy/paste data processing code from the previous project
As a result, each Data Scientist has their own unique,
carefully crafted environment, which is also not aligned
with production.
(not exactly)
8. Code Reuse
preprocessing_pipeline = ImagePipeline(
    pipe=[
        ops.validation.SizeLimit(max_pixels=9000000),
        ops.image.Resize(target=(500, 500)),
        ops.scale.Scale0to1(),
    ]
)

inputs = preprocessing_pipeline.tf_generator(input_path="raw/*.jpg")
labels = preprocessing_pipeline.tf_generator(input_path="labels/*.jpg")

dataset = tf.data.Dataset.zip((
    tf.data.Dataset.from_generator(inputs),
    tf.data.Dataset.from_generator(labels),
))
● Recreate your prod environment in research
● Choose efficient dependencies
● Make data processing serializable
● Universal solution for PyTorch and TensorFlow
9. Meet the BA guild
Question 2: Do you create a data pipeline you can rebuild with one command?
10. A pipeline consists of
● Pre-processing (Python) code
(usually shared between models / maintained by engineers)
● Estimator - e.g. a TensorFlow model
(usually stored as a binary file with weights and graph)
● Post-processing (Python) code
(usually custom code that is frequently updated by the Researcher)
11. Code Publishing
1. Organize your project as a Python package
2. You now have a unified API to install dependencies / run tests / etc.
3. Your code can be installed anywhere from PyPI / GitHub / a local Python Package Index (PyPI)
12. Declare a serializable Pipeline
● Good example: Spark ML Pipeline
● Bad example: scikit-learn serialization
(avoid pickling at any cost)
● Better: a custom solution,
Python -> YAML -> Python

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
from ds_auto_enhance.ops import increase_brightness

pipeline = Pipeline(
    name='auto-enhance',
    pipe=[
        preprocessing_pipeline,
        TensorflowEstimator.from_storage(model_path),
        increase_brightness(),
        ScaleToUInt(),
    ]
)
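The Python -> YAML -> Python idea can be sketched roughly as follows. All names here are illustrative rather than the talk's real API, and JSON stands in for YAML only to keep the sketch standard-library-only; the mechanism is identical:

```python
# Illustrative sketch of a serializable pipeline step (hypothetical names).
# A registry maps step names to classes so a text spec can be turned back
# into live Python objects without pickling.
import json

REGISTRY = {}

def register(cls):
    """Register a step class under its name so it can be rebuilt from a spec."""
    REGISTRY[cls.__name__] = cls
    return cls

@register
class Scale0to1:
    def __init__(self, factor=255.0):
        self.factor = factor

    def __call__(self, x):
        return x / self.factor

    def to_spec(self):
        return {"func": self.__class__.__name__, "params": {"factor": self.factor}}

def dump_pipeline(steps):
    # Python -> spec text (YAML in the talk; JSON here to stay stdlib-only)
    return json.dumps({"steps": [s.to_spec() for s in steps]})

def load_pipeline(text):
    # spec text -> Python, by looking each step up in the registry
    return [REGISTRY[s["func"]](**s["params"]) for s in json.loads(text)["steps"]]
```

Because only plain parameters are stored, the spec survives library upgrades far better than a pickled object graph.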
14. Meet the BA guild
Questions 3/4: Do you rebuild pipelines frequently?
Can Data Scientists deploy their models by themselves?
15. Model Registry
pipeline = Pipeline(
    name='auto-enhance',
    pipe=[
        preprocessing_pipeline,
        TensorflowEstimator.from_storage(model_path),
        increase_brightness(),
        ScaleToUInt(),
    ]
)

pipeline.predict(test_image)
pipeline.deploy()

(diagram: pipeline.deploy() -> Model Registry -> Evaluator)
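A minimal sketch of the registry that deploy() could publish to, assuming each deploy bumps the version automatically; the class and method names are hypothetical, not the team's actual implementation:

```python
# Hypothetical in-memory model registry: each deploy of a named pipeline
# stores its serialized spec under an auto-incremented version.
class ModelRegistry:
    def __init__(self):
        self._store = {}  # name -> list of (version, spec)

    def register(self, name, spec):
        """Store a new version of the pipeline spec; returns the version."""
        versions = self._store.setdefault(name, [])
        version = len(versions) + 1  # each deploy bumps the version
        versions.append((version, spec))
        return version

    def latest(self, name):
        """Return (version, spec) of the most recently deployed pipeline."""
        return self._store[name][-1]
```

In production this would be backed by shared storage, and an Evaluator could pull each newly registered version for validation before serving.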
16. Meet the BA guild
Question 5: Do you use version control?
17. Each deploy bumps version automatically
model:
name: object-segmentation
version: 1.0
steps:
- ...
- func: ds_object_segmentation.ops.PostProcess
lib_name: ds_object_segmentation
lib_version: 2.1.1
● Model versioning enables A/B tests
● The pipeline version must be tied to the code version
● Manage different library versions with pkg_resources
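One way to pin the installed library version next to each step, using pkg_resources as the slide suggests; the helper function itself is illustrative:

```python
# Sketch: record the version of the library that provides a step, so the
# pipeline spec pins func + lib_name + lib_version as in the YAML above.
import pkg_resources

def step_spec(func, lib_name):
    """Build a spec entry for one pipeline step from a callable."""
    return {
        "func": f"{func.__module__}.{func.__qualname__}",
        "lib_name": lib_name,
        "lib_version": pkg_resources.get_distribution(lib_name).version,
    }
```

(`importlib.metadata.version` is the modern stdlib equivalent of `pkg_resources.get_distribution(...).version`.)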
18. Meet the BA guild
Question 6: Do you write auto-tests for your models and your pipeline code?
19. Testing approaches
● Smoke testing based on the pipeline type
(ImageToImage, ImageToBoundingBox, etc.)
We collect a test set for each type
● Unit tests for custom operations written by Researchers
(sounds like utopia)
● “Behavior-Driven Development” style - divide the job between
Researchers and Engineers
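The per-type smoke test idea can be sketched like this; the check below is an illustrative invariant test for ImageToImage pipelines, not the team's actual harness:

```python
# Smoke check for any ImageToImage pipeline: run the shared test set through
# it and verify only basic output invariants (same shape, values in [0, 1]),
# so one generic test covers every model of this type.
import numpy as np

def smoke_check_image_to_image(pipeline, images):
    for img in images:
        out = pipeline(img)
        assert out.shape == img.shape, "output must keep the input shape"
        assert out.min() >= 0.0 and out.max() <= 1.0, "output must be in [0, 1]"
    return True
```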
20. BDD for e2e tests
example 1

Feature: Face Detector
  A service that returns coordinates of
  bounding boxes around faces on a given image

  Scenario: Simple Check
    Given loaded pipeline face detector-1.0
    And image stored at test_data/1.jpg
    When I call pipeline
    Then I should receive response with 1 bounding-box

from pytest_bdd import given, when, then, parsers
from PIL import Image

@given(parsers.parse("image stored at {image_path}"))
def input_image(image_path):
    return Image.open('tests/' + image_path)

@given(parsers.parse("loaded pipeline {name_and_version}"))
def pipeline(name_and_version):
    name, version = name_and_version.rsplit('-', 1)
    return find_pipeline(name, version)

@when(parsers.parse('I call {pipeline_name}'))
def call(input_image, pipeline):
    pipeline(input_image)

@then(parsers.re(r'I should receive response with (?P<expected>\d+) (?P<object_type>\S+)'),
      converters={'expected': int})
def verify(pipeline, expected, object_type):
    assert len(pipeline.result['faces']) == expected
21. BDD for e2e tests
example 2

Feature: Semantic Search
  A service that finds the closest
  (semantically) image to a given paragraph

  Scenario: Search
    Given loaded pipeline semantic-search-1.0
    When image 1.jpg with labels train,mountain
    And image 2.jpg with labels car,highway
    And I search for railway station
    Then I should receive list that starts with 1.jpg

@pytest.fixture
def database():
    return {}

@when(parsers.cfparse("image {key} with labels {labels:Label+}",
                      {'Label': str}))
def input_image(database, key, labels):
    database[key] = labels

@given(parsers.parse("loaded pipeline {name_and_version}"))
def pipeline(name_and_version):
    name, version = name_and_version.rsplit('-', 1)
    return find_pipeline(name, version)

@when(parsers.re(r'I search for (?P<search>.+)'))
def call(database, pipeline, search):
    pipeline(search, context=database)

@then(parsers.parse('I should receive list that starts with {expected}'))
def verify(pipeline, expected):
    assert pipeline.result[0] == expected
22. Behavior-Driven Development
for Data Science
● Good for end-to-end testing
● Separation of concerns between Researcher and Engineer
● Code can be easily reused across various models
● Easy to write scenarios & read reports
25. Get maximum from your hardware
(obvious steps)
● Compile your kernels on the target hardware (get the benefits of SIMD, FMA)
● Fixed shapes enable graph compilation (e.g. XLA)
It is sometimes much more efficient to have bucketing + padding + several fixed-shape models
● Always (like, seriously, always) have separate graphs for training & inference
(especially if you want to compile the model)
● (For TF) Use NCHW instead of NHWC where C > 16
It will actually be converted to a blocked NCHW[c] layout
Reason: CPU cache and vector operations
To fully support NCHW you need to compile TensorFlow with MKL
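The bucketing + padding idea from the second bullet, sketched with hypothetical bucket sizes: each image is routed to the smallest fixed shape it fits, so every bucket can use its own shape-specialized, compilable model.

```python
# Pad each image up to the nearest fixed "bucket" shape; each bucket shape
# gets a dedicated model compiled for that exact input size.
import numpy as np

BUCKETS = [(256, 256), (512, 512), (1024, 1024)]  # illustrative fixed shapes

def pad_to_bucket(img):
    """Return the image zero-padded to the smallest bucket that fits it."""
    h, w = img.shape[:2]
    for bh, bw in BUCKETS:
        if h <= bh and w <= bw:
            padded = np.zeros((bh, bw) + img.shape[2:], dtype=img.dtype)
            padded[:h, :w] = img
            return padded
    raise ValueError("image larger than the biggest bucket")
```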
26. Get maximum from your hardware
(advanced steps)
● Quantization:
○ TensorRT offers float16 for GPUs
○ Intel offers int8 for CPUs
○ Google offers bfloat16 for TPUs
● TVM - compile your model to low-level code with LLVM
TF/PyTorch/ONNX -> Relay -> Bytecode
(the open-sourced technology behind Amazon SageMaker Neo)
27. Joel Test for Data Science
✓ Are results reproducible?
✓ Do you create a data pipeline that you can rebuild with one command?
✓ Do you rebuild pipelines frequently?
✓ Can Data Scientists deploy their models with minimal dependencies on
engineering and infrastructure?
✓ Do you use version control?
✓ Do you write auto-tests for your models and your pipeline code?
✓ Do you translate model efficiency to commercial KPIs?