Reproducible data science
Lightning review
Intro
• Data Science Lead at Outra
• 5 years in Data Science
• Main focus has been social media,
marketing and retail
Josh Levy-Kramer
Outline
1. Reproducibility crisis
2. Possible solutions
3. Comparison and roundup
Reproducibility crisis
• Tracking changes while building models is still in the dark ages
• ML lacks the abstractions that software developers have built for themselves
• Creates problems for yourself, your team and public projects
• Partly because large files can't be committed to Git
Data Scientist Manifesto
• Reproducibility – the ability to reconstruct any previous state of your
data analysis (data and execution)
• Provenance – the ability to track any result and link it to the input
and code used
• Collaboration – the ability to easily collaborate with team members
• Environment agnostic – the ability to deploy a process to different
environments without much hindrance
Adapted from http://www.pachyderm.io/dsbor.html
Data – Code – Environment → Output
[Diagram: Data and Code, run inside an Environment, produce the Output]
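The idea on this slide, that an output is determined by data, code and environment together, can be sketched as a single fingerprint. This is a minimal illustration and not taken from any of the tools reviewed; `run_fingerprint` and its arguments are hypothetical names.

```python
import hashlib
from typing import Sequence


def run_fingerprint(data: bytes, code: bytes, environment: Sequence[str]) -> str:
    """Return one SHA-256 hex digest covering data, code and environment.

    Hypothetical helper: `environment` is assumed to be a list of pinned
    requirement strings such as "numpy==1.13.3".
    """
    digest = hashlib.sha256()
    env_bytes = "\n".join(sorted(environment)).encode("utf-8")
    # Hash each part separately first so the boundaries between parts
    # cannot blur when the pieces are combined.
    for part in (data, code, env_bytes):
        digest.update(hashlib.sha256(part).digest())
    return digest.hexdigest()


if __name__ == "__main__":
    env = ["numpy==1.13.3", "pandas==0.21.1"]
    base = run_fingerprint(b"a,b\n1,2\n", b"print('model')", env)
    changed = run_fingerprint(b"a,b\n1,3\n", b"print('model')", env)
    # A change in any of the three inputs yields a different fingerprint.
    print(base != changed)
```

Storing such a fingerprint next to each output is one simple way to link a result back to what produced it.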
Code – Environment
• Code: Git
• Python environment: pip requirements.txt
    numpy==1.13.3
    pandas==0.21.1
    scikit-learn==0.19.1
    psutil==5.4.0
    pyyaml==3.1
• OS environment: Dockerfile
    FROM continuumio/miniconda3:4.3.27
    # Install the pinned Python packages
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    # Copy the model script into the image and run it with Python
    COPY model.py .
    ENTRYPOINT ["python", "model.py"]
• Hardware environment: AWS CloudFormation template
    AWSTemplateFormatVersion: "2010-09-09"
    Resources:
      WebInstance:
        Type: AWS::EC2::Instance
        Properties:
          InstanceType: r4.8xlarge
          ImageId: ami-80861296
          KeyName: my-key
          SecurityGroupIds:
            - sg-abc01234
          SubnetId: subnet-acb01234
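Pinned requirements only help if the running environment actually matches them. The sketch below checks that, assuming exact `name==version` pins as on the slide; `parse_pins` and `find_mismatches` are hypothetical helper names, and other requirement syntaxes are not handled.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip().lower()] = version.strip()
    return pins


def _installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, installed_version=None):
    """Return {name: (pinned, installed)} for every package that differs."""
    lookup = installed_version or _installed_version
    mismatches = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


if __name__ == "__main__":
    pins = parse_pins("numpy==1.13.3\npandas==0.21.1\n")
    for name, (pinned, found) in find_mismatches(pins).items():
        print(f"{name}: pinned {pinned}, installed {found}")
```

The optional `installed_version` argument exists only so the check can be exercised without touching the real environment.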
Data?
[Diagram: Data → Output]
Emerging solutions
Data Version Control
Git LFS
• Git extension
• Allows you to commit large files into Git
• Uses its own protocol and a separate file store
• No concept of pipelining
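To make the Git LFS bullets concrete: Git itself stores only a small text pointer, while the large file lives in the separate store. The sketch below builds such a pointer in the published pointer-file format; `lfs_pointer` is a hypothetical helper for illustration, not part of Git LFS.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer file for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    # The pointer stays a few lines long no matter how large the file is.
    print(lfs_pointer(b"x" * 10_000_000), end="")
```

The pointer records what the file was, but nothing about how it was produced, which matches the "no concept of pipelining" point.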
What's similar
• Version controls data and pipelines, similar to what Git does with
code
• Two main abstractions: data and pipelines
Version control all
[Diagram: an ML image model turns Input into Output]
• v1: alpha=0.7 → Output
• v2: change input, alpha=0.7 → Output
• v3: change model, alpha=0.1 → Output
• Version controls data and pipelines, similar to what Git does with
code
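The v1/v2/v3 progression can be sketched as a toy pipeline that records a new version whenever the input or the model parameters change. This is an in-memory illustration, not how Pachyderm or dvc are implemented; `VersionedPipeline` and the toy model are hypothetical.

```python
import hashlib
import json


class VersionedPipeline:
    """Toy pipeline that versions every distinct (input, parameters) run."""

    def __init__(self, model):
        self.model = model    # callable: (data, **params) -> output
        self.versions = []    # one record per distinct run, oldest first

    def run(self, data: bytes, **params):
        # The key identifies a run by its input bytes and its parameters.
        key = hashlib.sha256(
            data + json.dumps(params, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for version in self.versions:
            if version["key"] == key:
                return version    # nothing changed: reuse the recorded output
        version = {
            "id": f"v{len(self.versions) + 1}",
            "key": key,
            "params": params,
            "output": self.model(data, **params),
        }
        self.versions.append(version)
        return version


if __name__ == "__main__":
    # Stand-in for the image model: any deterministic function will do.
    pipeline = VersionedPipeline(lambda data, alpha: len(data) * alpha)
    print(pipeline.run(b"image-a", alpha=0.7)["id"])   # first run
    print(pipeline.run(b"image-bb", alpha=0.7)["id"])  # input changed
    print(pipeline.run(b"image-bb", alpha=0.1)["id"])  # model changed
```

Re-running with an earlier input and parameter set returns the earlier record instead of creating a new version.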
Pachyderm - workflow
• Data repo: pachctl put-file
• Pipeline → Output: pachctl create-pipeline
Pachyderm
👍 I like:
• Interlinked data-pipeline-output version control
• Automatic output generation
• Parallelisation and distribution
• Semi-mature project, started 2014
👎 I dislike:
• Not environment agnostic
• Bloated tool:
    • Not generic, highly integrated with Kubernetes and S3
    • Installation is complicated
    • Not portable
• Not integrated with Git
dvc
• “Git extension for data scientists – manage your code and data together”
• Same Git workflow with extra commands
Workflow
• Data repo: dvc add
• Pipeline → Output: dvc run -d input.csv -o output.csv model.py alpha=0.1
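The general idea behind adding a data file to a Git-side tool can be sketched as follows: the large file goes into a content-addressed cache, and only a small metadata file is left to commit. This is a conceptual sketch under that assumption, not dvc's actual implementation or file format; `add_to_cache` and the `.meta.json` suffix are hypothetical.

```python
import hashlib
import json
import shutil
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy a data file into a content-addressed cache and write a small
    metadata file next to it that is cheap to commit to Git."""
    # Reads the whole file into memory, which is fine for a small example.
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    meta = data_file.with_name(data_file.name + ".meta.json")
    record = {"path": data_file.name, "sha256": digest}
    meta.write_text(json.dumps(record, indent=2) + "\n")
    return meta


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "input.csv"
        source.write_bytes(b"a,b\n1,2\n")
        print(add_to_cache(source, root / "cache").read_text(), end="")
```

Because the cache path is derived from the content, identical files are stored once and a changed file gets a new entry.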
dvc
👍 I like:
• Integration with Git
• Interlinked data-pipeline-output version control
• Easy to install
• Environment agnostic
👎 I dislike:
• Double the actions required compared to plain Git, so it is easy to get lost in the workflow
• Terrible name
• Immature project, started 2017
Round up
• No solution quite there yet
• Data Version Control (dvc) is the best contender
🤔

Reproducible data science: review of Pachyderm, Data Version Control and GIT LFS tools


Editor's notes

  1. These can be version controlled
  2. And the output
  3. Pachyderm is the most developed
  4. Track experimentation, even the failed attempts
  5. Imagine this is a model that's doing some ML on the images
  6. Going to become a marginal tool, but not going to be the next Git
  7. Going to become a marginal tool, but not going to be the next Git