The advances in machine learning are great, yet to deliver real value within a company, data scientists must be able to go from a research project to a reproducible process. A common problem is that the code is intrinsically linked to the data it was developed against, so it is critically important to track, trace and validate the input data used to train and test the algorithm. This talk reviews several of the tools available for data versioning and processing.
4. Reproducibility crisis
• Dark ages when it comes to tracking changes and building models
• ML lacks the abstractions that software developers have built up
• Creates problems for yourself, your team and public projects
• Partly because you can't commit large files into git
5. Data Scientist Manifesto
• Reproducibility – the ability to reconstruct any previous state of your data analysis (data and execution)
• Provenance – the ability to track any result and link it to the input data and code used
• Collaboration – the ability to easily collaborate with team members
• Environment agnostic – the ability to deploy a process in different environments without much hindrance
Adapted from http://www.pachyderm.io/dsbor.html
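Provenance can be approximated even without a dedicated tool: record a content hash of the input data, the code and the parameters alongside every result. A minimal sketch of the idea (file names and the manifest layout are made up for illustration):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path):
    """Content hash of a file: any change to the bytes changes the id."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def record_provenance(result_path, input_paths, code_path, params):
    """Write a manifest linking a result to the exact inputs, code and params."""
    manifest = {
        "result": sha256_of(result_path),
        "inputs": {p: sha256_of(p) for p in input_paths},
        "code": sha256_of(code_path),
        "params": params,
    }
    Path(str(result_path) + ".provenance.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True))
    return manifest

# Demo with throwaway files standing in for real data, code and output
Path("input.csv").write_text("x,y\n1,2\n")
Path("model.py").write_text("print('train')\n")
Path("output.csv").write_text("pred\n2\n")
m = record_provenance("output.csv", ["input.csv"], "model.py", {"alpha": 0.7})
print(m["params"])
```

Committing the small manifest to git gives you the "link any result to its inputs" property even though the data itself never enters the repository.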
10. Git LFS
• A Git extension
• Lets you commit large files into Git
• Uses a custom transfer protocol and store
• No concept of pipelining
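Git LFS works by committing a small text pointer in place of each large file, while the file's actual bytes go to a separate store. A rough sketch of that pointer-file idea (a simplified imitation, not the real LFS pointer format or protocol; paths are invented):

```python
import hashlib
from pathlib import Path

STORE = Path("lfs-store")  # stand-in for the remote large-file store

def clean(path):
    """On 'commit': move the large file's bytes to the store, return a tiny pointer."""
    data = Path(path).read_bytes()
    oid = hashlib.sha256(data).hexdigest()
    STORE.mkdir(exist_ok=True)
    (STORE / oid).write_bytes(data)
    return f"version lfs-sketch/v1\noid sha256:{oid}\nsize {len(data)}\n"

def smudge(pointer):
    """On 'checkout': given a pointer, fetch the real bytes back from the store."""
    oid = pointer.splitlines()[1].split("sha256:")[1]
    return (STORE / oid).read_bytes()

Path("weights.bin").write_bytes(b"\x00" * 1024)  # pretend this is a large model file
ptr = clean("weights.bin")
print(ptr)  # the pointer is all git ever sees
```

The repository stays small because only the pointer text is versioned; the store holds one blob per content hash.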
11. What's similar
Data pipelines
• Version controls data and pipelines, similar to what Git does with code
• Two main abstractions:
12. Version control all
[Diagram: an ML image model pipeline with an input, a parameter alpha and an output – v1 is the original run (alpha=0.7); v2 changes the input (alpha=0.7); v3 changes the model parameter (alpha=0.1); input, pipeline and output are all version controlled]
• Version controls data and pipelines, similar to what Git does with code
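The diagram's point – a new input or a new parameter yields a new version of the output – can be made concrete by deriving a version id from everything a run depends on. A sketch (the hash inputs below are placeholder strings):

```python
import hashlib
import json

def run_version(input_hash, code_hash, params):
    """Version id of a pipeline run: changes if input, code or params change."""
    key = json.dumps(
        {"input": input_hash, "code": code_hash, "params": params},
        sort_keys=True)
    return hashlib.sha256(key.encode()).hexdigest()[:12]

v1 = run_version("img-abc", "model-1", {"alpha": 0.7})  # original run
v2 = run_version("img-def", "model-1", {"alpha": 0.7})  # change the input
v3 = run_version("img-abc", "model-1", {"alpha": 0.1})  # change the model parameter
print(v1, v2, v3)
```

Identical inputs, code and params always reproduce the same id, which is what lets a tool skip re-running unchanged stages.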
14. Pachyderm
👍 I like:
• Interlinked data-pipeline-output version control
• Automatic output generation
• Parallelisation and distribution
• Semi-mature project – started 2014
👎 I dislike:
• Not environment agnostic
• Bloated tool:
• Not generic – highly integrated with Kubernetes and S3
• Installation is complicated
• Not portable
• Not integrated with git
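For context on the Kubernetes coupling: Pachyderm pipelines are declared as JSON specs and submitted with `pachctl`, which then runs the transform as containers in the cluster. A minimal spec looks roughly like this (repo name, image and command are placeholders):

```json
{
  "pipeline": { "name": "ml-image-model" },
  "transform": {
    "image": "example/model:latest",
    "cmd": ["python", "/model.py", "--alpha", "0.7"]
  },
  "input": {
    "pfs": { "repo": "images", "glob": "/*" }
  }
}
```

Whenever new data lands in the `images` repo, Pachyderm re-runs the transform and versions the output automatically – the "automatic output generation" above.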
15. dvc
• "Git extension for data scientists – manage your code and data together"
• Same git workflow with extra commands:
dvc add
dvc run -d input.csv -o output.csv python model.py alpha=0.1
[Diagram: dvc add puts data into the repo; dvc run links the pipeline to its output]
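What dvc actually commits to git is not the data but a small metafile per tracked stage, holding content hashes and the dependency graph; the data itself lives in a cache outside git. The stage file looks roughly like this (hashes and paths invented for illustration; the exact format varies by dvc version):

```yaml
# output.csv.dvc – committed to git in place of the data
cmd: python model.py alpha=0.1
deps:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  path: input.csv
outs:
- md5: 9f8e7d6c5b4a39281706f5e4d3c2b1a0
  path: output.csv
```

This is why the workflow stays environment agnostic: git versions the tiny metafiles, and any remote storage can hold the cached data.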
17. dvc
👍 I like:
• Integration with Git
• Interlinked data-pipeline-output version control
• Easy to install
• Environment agnostic
👎 I dislike:
• Double the actions required compared to plain Git – easy to get lost in the workflow
• Terrible name
• Immature project – started 2017
18. Round up
• No solution is quite there yet
• Data Version Control (dvc) is the best contender
🤔
Editor's Notes
These can be version controlled
And the output
Pachyderm is the most developed
Track experimentation – even the failed tries
Imagine this is a model that's doing some ML on the images
Going to become a marginal tool, but not going to be the next git