This presentation was delivered by Adrien Ball at the Open Science in Practice (OSIP) summer school at EPFL Lausanne in September 2019 (http://osip2019.epfl.ch/).
It presents some lessons learned in the process of open sourcing the Snips NLU Python library.
9. PACKAGING FOR OPEN SOURCE
OBJECTIVES FOR THE COMMUNITY
▸ Understand
▸ Use
▸ Contribute
10. PACKAGING FOR OPEN SOURCE
REQUIREMENTS
▸ Documentation
▸ Continuous Integration and build automation
▸ APIs and versioning
11. PACKAGING FOR OPEN SOURCE
GOOD PRACTICES
▸ Documentation
▸ Continuous Integration and build automation
▸ APIs and versioning
12. PACKAGING FOR OPEN SOURCE
DOCUMENTATION
▸ Hard and painful to maintain
▸ More documentation => More outdated documentation
▸ Less documentation => Fewer explanations
19. PACKAGING FOR OPEN SOURCE
CONTINUOUS INTEGRATION AND BUILD AUTOMATION
▸ Continuous integration:
▸ Always be merging into a branch
▸ Merge frequently
▸ Build automation:
▸ Enforce tests and checks to pass before merging
20. PACKAGING FOR OPEN SOURCE
BUILD AUTOMATION, WHAT FOR?
▸ the project can be installed or built on the targeted platforms
▸ the code is doing what it is expected to do
▸ you haven't introduced regressions
▸ the documentation is not outdated
▸ automate whatever is error prone, and can be automated
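One way to automate the "documentation is not outdated" check is to execute the examples embedded in docstrings as part of the build. The following is a minimal sketch using Python's standard `doctest` module; the `normalize` function and the `count_doc_failures` helper are hypothetical names chosen for illustration and are not taken from Snips NLU.

```python
import doctest


def normalize(text):
    """Lowercase a string and collapse repeated whitespace.

    >>> normalize("  Turn ON   the lights ")
    'turn on the lights'
    """
    return " ".join(text.lower().split())


def count_doc_failures(func):
    """Run the examples in func's docstring and return how many failed."""
    parser = doctest.DocTestParser()
    # Build a doctest from the docstring, exposing func under its own name.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test).failed


if __name__ == "__main__":
    # A build script could fail the build when this count is non-zero.
    print(count_doc_failures(normalize))
```

If the behavior of `normalize` changes and the docstring is not updated, the count becomes non-zero, so the stale example is caught before merging.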
22. PACKAGING FOR OPEN SOURCE
APIS AND VERSIONING
▸ Python: everything is public!
▸ Public API = Conventions + Doc
24. PACKAGING FOR OPEN SOURCE
SEMANTIC VERSIONING
Example version: 1.3.2 (major = 1, minor = 3, patch = 2)
▸ Major: bump when you make incompatible API changes
▸ Examples: removed function, additional mandatory param, changed returned type
▸ Impact on client code: no longer works
▸ Minor: bump when you add functionality in a backwards compatible manner
▸ Examples: new API, new optional param
▸ Impact on client code: additional capabilities
▸ Patch: bump when you make backwards compatible bug fixes
▸ Examples: internal bugs
▸ Impact on client code: improved behavior
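The bump rules can be sketched as a small function. This is an illustration only: the `bump` helper is a hypothetical name, and it handles plain `major.minor.patch` strings without pre-release or build suffixes.

```python
def bump(version, change):
    """Return the next version string for a 'major', 'minor' or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        # Incompatible API change: lower components reset to zero.
        return f"{major + 1}.0.0"
    if change == "minor":
        # Backwards compatible feature: patch resets to zero.
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        # Backwards compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

Starting from 1.3.2, removing a function gives 2.0.0, adding an optional parameter gives 1.4.0, and fixing an internal bug gives 1.3.3.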
26. MACHINE LEARNING AND OPEN SOURCE
SPECIFIC CHALLENGES
▸ Managing resources
▸ Testing a Machine Learning pipeline
▸ Reproducibility
▸ Modularity and Extensibility
28. MACHINE LEARNING AND OPEN SOURCE
MANAGING RESOURCES
[Diagram: an ML Pipeline turns Input into Output and relies on resources]
29. MACHINE LEARNING AND OPEN SOURCE
MANAGING RESOURCES
[Diagram: Input, ML Pipeline, Output]
▸ Heavier library
▸ Updating the resources requires a release
▸ No user-defined resources
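The drawbacks listed above read as those of shipping resources inside the library itself; that reading is an assumption, since the slide diagram is not recoverable. Under that assumption, one alternative is to load resources from a directory chosen by the user at runtime. The sketch below uses only the standard library; the `load_resources` and `demo` functions and the JSON file layout are hypothetical and do not describe the Snips NLU API.

```python
import json
import tempfile
from pathlib import Path


def load_resources(resources_dir):
    """Load every *.json file in resources_dir into a dict keyed by file stem."""
    resources = {}
    for path in sorted(Path(resources_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            resources[path.stem] = json.load(f)
    return resources


def demo():
    """Write a user-defined resource to a temporary directory, then load it."""
    with tempfile.TemporaryDirectory() as tmp:
        stop_words_path = Path(tmp) / "stop_words.json"
        stop_words_path.write_text(json.dumps(["the", "a"]), encoding="utf-8")
        return load_resources(tmp)
```

With this layout the resources can be updated or replaced without publishing a new release of the library, and users can supply their own files.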
35. MACHINE LEARNING AND OPEN SOURCE
TESTING A MACHINE LEARNING PIPELINE
▸ Traditional testing:
▸ Testing in ML?
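The slide content contrasting the two kinds of testing is not recoverable. As one possible illustration, a test for a learned component can assert on properties of the output rather than on an exact accuracy figure. The sketch below is an assumption about what such tests might look like; `OverlapClassifier` is a toy classifier invented for the example and the intent labels are hypothetical.

```python
import unittest


class OverlapClassifier:
    """Toy intent classifier: picks the label whose training words overlap most."""

    def fit(self, texts, labels):
        self.words_by_label = {}
        for text, label in zip(texts, labels):
            words = self.words_by_label.setdefault(label, set())
            words.update(text.lower().split())
        return self

    def predict(self, text):
        words = set(text.lower().split())
        # Sorting the labels makes tie-breaking deterministic.
        return max(
            sorted(self.words_by_label),
            key=lambda label: len(words & self.words_by_label[label]),
        )


class ToyPipelineTest(unittest.TestCase):
    def setUp(self):
        self.texts = ["turn on the light", "switch off the light", "play some jazz"]
        self.labels = ["light_on", "light_off", "play_music"]
        self.model = OverlapClassifier().fit(self.texts, self.labels)

    def test_prediction_is_a_known_label(self):
        # Property: whatever the input, the output is one of the trained labels.
        self.assertIn(self.model.predict("completely unseen words"), set(self.labels))

    def test_training_examples_are_recovered(self):
        # Sanity check: the model can fit a tiny, unambiguous dataset.
        for text, label in zip(self.texts, self.labels):
            self.assertEqual(self.model.predict(text), label)


if __name__ == "__main__":
    unittest.main()
```

Neither test depends on a specific score, so both stay valid when the model or its training data is improved.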
44. MACHINE LEARNING AND OPEN SOURCE
REPRODUCIBILITY FROM A PRODUCT PERSPECTIVE
[Diagram: Data, Training, Evaluation, selected data]
48. MACHINE LEARNING AND OPEN SOURCE
REPRODUCIBILITY THROUGH CONFIGURATIONS
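[Diagram: code containing the values 56, 3.0 and True]
49. MACHINE LEARNING AND OPEN SOURCE
REPRODUCIBILITY THROUGH CONFIGURATIONS
[Diagram: code containing the values 42, 1.5 and False]
50. MACHINE LEARNING AND OPEN SOURCE
REPRODUCIBILITY THROUGH CONFIGURATIONS
[Diagram: code referring to x, y and z, next to a config holding param_1: 42, param_2: 1.5, param_3: False]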
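The idea of moving parameter values out of the code and into a config can be sketched with a dataclass that is serialized alongside the results. The parameter names and values come from the slide; the `PipelineConfig` class and the `save_config` and `load_config` helpers are hypothetical names used for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """All tunable values live here instead of being scattered in the code."""

    param_1: int = 42
    param_2: float = 1.5
    param_3: bool = False


def save_config(config):
    """Serialize the config to a JSON string with a stable key order."""
    return json.dumps(asdict(config), sort_keys=True)


def load_config(text):
    """Rebuild a config from the JSON produced by save_config."""
    return PipelineConfig(**json.loads(text))
```

Storing the serialized config next to each result records which parameter values produced it, and the same values can be reloaded later to rerun the experiment.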
51. MACHINE LEARNING AND OPEN SOURCE
REPRODUCIBILITY THROUGH CONFIGURATIONS
▸ Data + Code + Config
[Figure: three 🤓 icons and a 3×3 grid of scores: 0.95 0.87 0.92 / 0.98 0.91 0.88 / 0.89 0.83 0.92]
54. MACHINE LEARNING AND OPEN SOURCE
MODULARITY AND EXTENSIBILITY
[Diagram: a PIPELINE from Input to Output, with LogReg and SVM among the AVAILABLE COMPONENTS]
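One common way to obtain this kind of modularity is an abstract component interface combined with a registry, so that a pipeline selects its component by name and third parties can add their own. The sketch below is an assumption about how this could look, not the Snips NLU implementation; the stub classes do not train any model and every name in it is hypothetical.

```python
from abc import ABC, abstractmethod

# Maps a component name to the class implementing it.
REGISTRY = {}


def register(name):
    """Class decorator that makes a component available under the given name."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator


class Classifier(ABC):
    """Interface every classifier component has to implement."""

    @abstractmethod
    def predict(self, text):
        ...


@register("logreg")
class LogRegStub(Classifier):
    def predict(self, text):
        # Placeholder for a real logistic regression model.
        return "logreg saw: " + text


@register("svm")
class SVMStub(Classifier):
    def predict(self, text):
        # Placeholder for a real SVM model.
        return "svm saw: " + text


class Pipeline:
    """Preprocesses the input, then delegates to the selected component."""

    def __init__(self, classifier_name):
        self.classifier = REGISTRY[classifier_name]()

    def run(self, text):
        return self.classifier.predict(text.strip().lower())
```

Swapping `Pipeline("logreg")` for `Pipeline("svm")` changes the component without touching the pipeline code, and a user-defined class decorated with `register` becomes selectable in the same way.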
65. EXPERIENCES FROM GOING OPEN SOURCE
TAKEAWAYS
▸ writing tests saves you time, not the opposite
▸ test the right things
▸ make your outputs reproducible
▸ use abstractions to improve modularity and clarity