3. Process challenges in data science
o “Intelligent” application (ML/AI) development has unique complexity not always encountered in other software development scenarios
Organization
Collaboration
Quality
Knowledge Accumulation
Agility
Global Teams
• Geographic locations
Team Growth
• Onboard new members rapidly
Varied Use Cases
• Industries and use cases
Diverse DS Backgrounds
• Data scientists (DS) have diverse backgrounds and varied experience with tools and languages
4. DevOps: the three-stage conversation
DevOps is the union of people (culture), process, and products to continuously deliver value to users
6. TDSP objective
Integrate DevOps with the data science workflow to improve the collaboration, quality, and productivity of data science teams
o Infrastructure as Code (IaC)
o Automated Testing
o …
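As a rough illustration of the automated testing item above, the sketch below checks a small data-preparation step with Python's built-in `unittest` module. The function `fill_missing_with_median`, the helper `run_tests`, and the test data are hypothetical, invented for this example; they are not part of TDSP.

```python
import statistics
import unittest


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values.

    Hypothetical data-preparation helper, used only for illustration.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


class FillMissingTests(unittest.TestCase):
    def test_missing_values_are_replaced(self):
        self.assertEqual(fill_missing_with_median([1, None, 3]), [1, 2, 3])

    def test_all_missing_is_rejected(self):
        with self.assertRaises(ValueError):
            fill_missing_with_median([None, None])


def run_tests():
    """Run the test case programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FillMissingTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A team would typically wire a test run like this into its build so that it executes on every commit, which is one way the DevOps practices listed above can carry over to data science code.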
7. TDSP components for data science teams
Standardized Data Science Lifecycle
Project Structure, Templates & Roles
Infrastructure
Re-usable Data Science Utilities
9. Project structure and templates
Business understanding & problem scope definition
Working directory template
Documentation template (example)
Structure, templates, roles
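A working directory template can be instantiated with a few lines of code. The sketch below is one possible way to do it; the folder names in `TEMPLATE_DIRS` and the function name `create_project_skeleton` are assumptions made for this example and may not match the actual TDSP template.

```python
from pathlib import Path

# Assumed folder layout, for illustration only.
TEMPLATE_DIRS = [
    "Code/Data_Acquisition_and_Understanding",
    "Code/Modeling",
    "Code/Deployment",
    "Docs/Project",
    "Docs/Data_Report",
    "Docs/Model",
    "Sample_Data",
]


def create_project_skeleton(root):
    """Create the assumed folder layout under `root` and return the created paths."""
    root = Path(root)
    created = []
    for relative in TEMPLATE_DIRS:
        target = root / relative
        target.mkdir(parents=True, exist_ok=True)
        # A placeholder file keeps otherwise empty folders under version control.
        (target / ".gitkeep").touch()
        created.append(target)
    return created
```

Calling `create_project_skeleton("my_project")` would give every new project the same starting layout, which is the point of a shared template: team members know where to find code and documents without asking.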
10. Project roles & tasks
Governance and Project Management
Data Science and Engineering
11. Agile work planning and execution template
o Use the Agile work planning & execution template (data science specific)
12. Tracking progress with Power BI dashboards
Power BI content pack for Visual Studio Team Services (VSTS): a tool for project management (PM)
13. Shared and distributed infrastructure
Virtual machines (VMs) or clusters are disposable compute, added to projects as needed
A many-to-many relationship between data scientists, VMs, and projects is possible
Data is typically stored in cloud stores, such as blob storage or a database
Project artifacts & code are permanently stored in central Git (version control) repositories
Infrastructure
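To make the many-to-many relationship concrete, the sketch below records assignments as (data scientist, VM, project) triples and queries them from each side. All names here (`Assignment`, `Registry`, and the sample people, VMs, and projects) are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assignment:
    """One data scientist using one VM for one project."""
    scientist: str
    vm: str
    project: str


@dataclass
class Registry:
    assignments: set = field(default_factory=set)

    def assign(self, scientist, vm, project):
        self.assignments.add(Assignment(scientist, vm, project))

    def release_vm(self, vm):
        """Drop a disposable VM; code and data are stored elsewhere."""
        self.assignments = {a for a in self.assignments if a.vm != vm}

    def vms_for_project(self, project):
        return sorted({a.vm for a in self.assignments if a.project == project})

    def projects_for_scientist(self, scientist):
        return sorted({a.project for a in self.assignments if a.scientist == scientist})


registry = Registry()
registry.assign("alice", "vm-1", "churn")
registry.assign("alice", "vm-2", "forecast")
registry.assign("bob", "vm-1", "churn")
registry.assign("bob", "vm-2", "churn")
```

In this sample data, one VM serves several people and projects, and one person works across several VMs. Because code lives in Git and data in cloud stores, `release_vm` only removes the compute assignment and nothing else needs to be preserved.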
14. DSVM (Data Science Virtual Machine)
Example cloud compute resource
Azure virtual machine (VM) image pre-installed and configured with data science tools
Spark
Microsoft R Server Developer Edition
Anaconda Python distribution
Jupyter notebook (with R, Python kernels)
Visual Studio Community Edition
Power BI Desktop
SQL Server 2016 Developer Edition
Machine learning and data analytics tools
Deep learning toolkits
15. Collaborative development guidelines
Version control and review
o Git is a version control system
o Each repo contains the full change history
o Used in a distributed way, with a single remote repo and several local repos (on a local machine or a VM)
[Diagram: one remote repo connected to four local repos]
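The distributed setup described above can be pictured with a toy in-memory model. This is not Git and does not use any Git API; `Repo`, `clone`, `commit`, `push`, and `pull` are simplified stand-ins written for this example, and the model ignores branches and merges.

```python
class Repo:
    """Toy repository: an ordered list of commit messages."""

    def __init__(self, name, history=None):
        self.name = name
        # Every repo holds its own complete copy of the history.
        self.history = list(history or [])

    def clone(self, name):
        return Repo(name, self.history)

    def commit(self, message):
        self.history.append(message)

    def push(self, remote):
        """Send local commits to the remote; refuse if the remote has moved on."""
        if remote.history != self.history[:len(remote.history)]:
            raise RuntimeError(f"{self.name}: pull before pushing")
        remote.history = list(self.history)

    def pull(self, remote):
        """Fast-forward to the remote; only valid with no unpushed local commits."""
        if self.history != remote.history[:len(self.history)]:
            raise RuntimeError(f"{self.name}: local commits would be lost")
        self.history = list(remote.history)


remote = Repo("remote")
laptop = remote.clone("laptop")   # local repo on a local machine
vm = remote.clone("vm")           # local repo on a VM

laptop.commit("add data report")
laptop.push(remote)
vm.pull(remote)
vm.commit("train baseline model")
vm.push(remote)
laptop.pull(remote)
```

After this sequence all three repos hold the same two commits, which mirrors the point that each repo contains the full change history while the single remote acts as the shared exchange point.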
TDSP Git template
Integrated Agile planning & code development
16. Summary
TDSP components, guidelines, and end-to-end (E2E) samples ease process challenges in data science solutions
Standardized Data Science Lifecycle
Project Structure, Templates & Roles
Infrastructure
Re-usable Data Science Utilities
[Diagram: TDSP components mapped to data science challenges]