The advances in machine learning are great, yet to deliver real value within a company, data scientists must be able to go from a research project to a reproducible process. A common problem is that the code is intrinsically linked to the data it was developed against, so it is critically important to track, trace and validate the input data used to train and test the algorithm. This talk reviews several of the tools available for data versioning and processing.
4. Reproducibility crisis
• Dark ages when it comes to tracking changes and building models
• ML lacks the abstractions that software developers have built up
• Creates problems for yourself, your team and public projects
• Partly because you can't commit large files into git
5. Data Scientist Manifesto
• Reproducibility – the ability to reconstruct any previous state of your data analysis (data and execution)
• Provenance – the ability to track any result and link it to the input data and code used
• Collaboration – the ability to easily collaborate with team members
• Environment agnostic – the ability to deploy a process in different environments without much hindrance
Adapted from http://www.pachyderm.io/dsbor.html
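Provenance can be approximated even without a dedicated tool: record a content hash of the input data, the code and the parameters alongside every result. A minimal sketch of the idea (file names and the manifest layout are made up for illustration):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path):
    """Content hash of a file: any change to the bytes changes the id."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def record_provenance(result_path, input_paths, code_path, params):
    """Write a manifest linking a result to the exact inputs, code and params."""
    manifest = {
        "result": sha256_of(result_path),
        "inputs": {p: sha256_of(p) for p in input_paths},
        "code": sha256_of(code_path),
        "params": params,
    }
    Path(str(result_path) + ".provenance.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True))
    return manifest

# Demo with throwaway files standing in for real data, code and output
Path("input.csv").write_text("x,y\n1,2\n")
Path("model.py").write_text("print('train')\n")
Path("output.csv").write_text("pred\n2\n")
m = record_provenance("output.csv", ["input.csv"], "model.py", {"alpha": 0.7})
print(m["params"])
```

Committing the small manifest to git gives you the "link any result to its inputs" property even though the data itself never enters the repository.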
10. Git LFS
• A Git extension
• Lets you commit large files into Git
• Uses a custom transfer protocol and store
• No concept of pipelining
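Git LFS works by committing a small text pointer in place of each large file, while the file's actual bytes go to a separate store. A rough sketch of that pointer-file idea (a simplified imitation, not the real LFS pointer format or protocol; paths are invented):

```python
import hashlib
from pathlib import Path

STORE = Path("lfs-store")  # stand-in for the remote large-file store

def clean(path):
    """On 'commit': move the large file's bytes to the store, return a tiny pointer."""
    data = Path(path).read_bytes()
    oid = hashlib.sha256(data).hexdigest()
    STORE.mkdir(exist_ok=True)
    (STORE / oid).write_bytes(data)
    return f"version lfs-sketch/v1\noid sha256:{oid}\nsize {len(data)}\n"

def smudge(pointer):
    """On 'checkout': given a pointer, fetch the real bytes back from the store."""
    oid = pointer.splitlines()[1].split("sha256:")[1]
    return (STORE / oid).read_bytes()

Path("weights.bin").write_bytes(b"\x00" * 1024)  # pretend this is a large model file
ptr = clean("weights.bin")
print(ptr)  # the pointer is all git ever sees
```

The repository stays small because only the pointer text is versioned; the store holds one blob per content hash.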
11. What's similar
Data pipelines
• Version controls data and pipelines, similar to what Git does with code
• Two main abstractions:
12. Version control all
[Diagram: an ML image model pipeline with an input, a parameter alpha and an output – v1 is the original run (alpha=0.7); v2 changes the input (alpha=0.7); v3 changes the model parameter (alpha=0.1); input, pipeline and output are all version controlled]
• Version controls data and pipelines, similar to what Git does with code
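The diagram's point – a new input or a new parameter yields a new version of the output – can be made concrete by deriving a version id from everything a run depends on. A sketch (the hash inputs below are placeholder strings):

```python
import hashlib
import json

def run_version(input_hash, code_hash, params):
    """Version id of a pipeline run: changes if input, code or params change."""
    key = json.dumps(
        {"input": input_hash, "code": code_hash, "params": params},
        sort_keys=True)
    return hashlib.sha256(key.encode()).hexdigest()[:12]

v1 = run_version("img-abc", "model-1", {"alpha": 0.7})  # original run
v2 = run_version("img-def", "model-1", {"alpha": 0.7})  # change the input
v3 = run_version("img-abc", "model-1", {"alpha": 0.1})  # change the model parameter
print(v1, v2, v3)
```

Identical inputs, code and params always reproduce the same id, which is what lets a tool skip re-running unchanged stages.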
14. Pachyderm
👍 I like:
• Interlinked data-pipeline-output version control
• Automatic output generation
• Parallelisation and distribution
• Semi-mature project – started 2014
👎 I dislike:
• Not environment agnostic
• Bloated tool:
• Not generic – highly integrated with Kubernetes and S3
• Installation is complicated
• Not portable
• Not integrated with git
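For context on the Kubernetes coupling: Pachyderm pipelines are declared as JSON specs and submitted with `pachctl`, which then runs the transform as containers in the cluster. A minimal spec looks roughly like this (repo name, image and command are placeholders):

```json
{
  "pipeline": { "name": "ml-image-model" },
  "transform": {
    "image": "example/model:latest",
    "cmd": ["python", "/model.py", "--alpha", "0.7"]
  },
  "input": {
    "pfs": { "repo": "images", "glob": "/*" }
  }
}
```

Whenever new data lands in the `images` repo, Pachyderm re-runs the transform and versions the output automatically – the "automatic output generation" above.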
15. dvc
• "Git extension for data scientists – manage your code and data together"
• Same git workflow with extra commands:
dvc add
dvc run -d input.csv -o output.csv python model.py alpha=0.1
[Diagram: dvc add puts data into the repo; dvc run links the pipeline to its output]
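What dvc actually commits to git is not the data but a small metafile per tracked stage, holding content hashes and the dependency graph; the data itself lives in a cache outside git. The stage file looks roughly like this (hashes and paths invented for illustration; the exact format varies by dvc version):

```yaml
# output.csv.dvc – committed to git in place of the data
cmd: python model.py alpha=0.1
deps:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  path: input.csv
outs:
- md5: 9f8e7d6c5b4a39281706f5e4d3c2b1a0
  path: output.csv
```

This is why the workflow stays environment agnostic: git versions the tiny metafiles, and any remote storage can hold the cached data.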
17. dvc
👍 I like:
• Integration with Git
• Interlinked data-pipeline-output version control
• Easy to install
• Environment agnostic
👎 I dislike:
• Double the actions required compared to plain Git – easy to get lost in the workflow
• Terrible name
• Immature project – started 2017
18. Round up
• No solution is quite there yet
• Data Version Control (dvc) is the best contender
🤔
Editor's Notes
These can be version controlled
And the output
Pachyderm is the most developed
Track experimentation – even the failed tries
Imagine this is a model that's doing some ML on the images
Going to become a marginal tool, but not going to be the next git