Machine learning research has increased significantly across many domains of science, but the processes for publishing results and making the resulting models and code available for reuse have been lacking. In this talk, we discuss FAIR data principles applied to machine learning models and how the Data and Learning Hub for Science (DLHub) can help make models more easily discoverable and usable in common scientific workflows. Visit https://www.dlhub.org for more information.
A FAIR Approach to Publishing and Sharing Machine Learning Models
1. Funding: 2018 Argonne Advanced Computing LDRD
Collaborators: Ryan Chard, Logan Ward, Marcus Schwarting, Kyle Chard, Zhuozhao Li, Anna Woodard, Yadu Babuji, Steve Tuecke, Mike Franklin, Ian Foster
Blue – also presenting at this workshop
Data and Learning Hub for Science
https://www.dlhub.org
A FAIR Approach to Publishing and
Sharing Machine Learning Models
Ben Blaiszik (blaiszik@uchicago.edu)
2. Quick Polls
• How many of you have trained a machine learning model?
• How many of you have published papers using machine learning?
• How many of you have tried to reuse models from others?
3. State of Machine Learning in Science
Highs
• Rapid increase in number of journal publications
• Advances across the scientific domains
• Achievements on par with experts or best-in-class methods in many domains
• Funding agencies are coalescing around ML (AI Initiative etc.)
Chart Source and Method:
https://github.com/blaiszik/ml_publication_charts
4. State of Machine Learning in Science
Lows
For a given model:
• Where is the code?
• Where are the trained models?
• Where is the training data?
• How can I reproduce these results?
Without all of these pieces, progress is drastically slowed
Location of many ML models after a paper is finished
GitHub is another location…
5. FAIR Data Principles
• Findable
• Accessible
• Interoperable
• Reusable
https://www.force11.org/group/fairgroup/fairprinciples
Set of principles to help make data as useful as possible to the community
6. FAIR Data Principles
Findable
• Data have an identifier
• Data are registered in a searchable resource
Accessible
• Data accessible via identifier
• Data retrievable by open protocols
7. FAIR Data Principles
Interoperable
• Data leverage formalized shared vocabularies
• Vocabularies themselves follow FAIR principles
Reusable
• Clear licensing
• Descriptive metadata is sufficient to promote reuse
8. What Would FAIR Look Like in ML?
(1) Find Interesting Science Paper
• Links to code repository (GitHub/DOI)
• Links to data repository (DOI)
• Publication describes the model and its uses and limitations
9. What Would FAIR Look Like in ML?
(2) Find Code
• Has unique identifier (DOI)
• Links back to publication (DOI)
• Has well-documented code
• Tagged with metadata to aid discovery
• Registered in a search index
• Open license
10. What Would FAIR Look Like in ML?
(3) Find and Run Model
• Model has identifier (DOI)
• Model has links to data (DOI)
• Model has links to the code (DOI/GitHub)
• Model has links to publication (DOI)
• Data are accessible
• Inference run from the cloud - no installation necessary!
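The linking requirements on slides 8 to 10 can be turned into a simple completeness check. The sketch below is only an illustration: the field names (`model_doi`, `data_doi`, `code_url`, `publication_doi`) and the placeholder identifiers are assumptions made for this example and do not come from DLHub or from any metadata standard.

```python
from typing import Dict, List, Optional

# Hypothetical field names for the links a FAIR model record should carry.
REQUIRED_LINKS = ("model_doi", "data_doi", "code_url", "publication_doi")


def missing_links(record: Dict[str, Optional[str]]) -> List[str]:
    """Return the required links that are absent or empty in a record."""
    return [key for key in REQUIRED_LINKS if not record.get(key)]


# Placeholder values only; none of these identifiers resolve to real objects.
record = {
    "model_doi": "10.0000/placeholder-model",
    "data_doi": None,  # training data not yet deposited
    "code_url": "https://example.org/placeholder-code",
    "publication_doi": "10.0000/placeholder-paper",
}

gaps = missing_links(record)
print("Missing links:", gaps)
```

A check like this could run before a model is shared, so that a missing data or code link is caught while the authors still have the files at hand.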
11. Data and Learning Hub for Science (DLHub)
Goal: Deliver FAIR for ML
• Collect, publish, categorize models and pre/post processing code
• Operate models as a service to simplify sharing, consumption, and access
• Identify models with unique and persistent identifiers (e.g., DOI)
• Implement versioning, search, access controls etc.
12. DLHub: Key Concepts
Run()
• Servables are containers with defined inputs and outputs
• Servables may represent machine learning models or other data transformations
• Outputs can be cached for inputs
13. DLHub: Key Concepts
Pipeline of servables, each invoked with Run(): Preprocess 1 → Preprocess 2 → Model predict
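The servable concept on slides 12 and 13 can be sketched in plain Python. This is a minimal illustration and not DLHub code: the `Servable` class, its `run` method, and the toy preprocessing and model functions are hypothetical names. A real servable runs inside a container; here an ordinary function stands in for it.

```python
from typing import Any, Callable, Dict, Hashable, List


class Servable:
    """Hypothetical stand-in for a servable: a named function with an output cache."""

    def __init__(self, name: str, func: Callable[[Any], Any]) -> None:
        self.name = name
        self.func = func
        self.cache: Dict[Hashable, Any] = {}
        self.calls = 0  # number of times the wrapped function actually ran

    def run(self, inputs: Hashable) -> Any:
        # Compute only when this input has not been seen before.
        if inputs not in self.cache:
            self.calls += 1
            self.cache[inputs] = self.func(inputs)
        return self.cache[inputs]


def run_pipeline(steps: List[Servable], inputs: Hashable) -> Any:
    """Feed the output of each servable into the next one."""
    value = inputs
    for step in steps:
        value = step.run(value)
    return value


# Toy pipeline: two preprocessing steps followed by a "model".
strip = Servable("preprocess_1", lambda text: text.strip())
lower = Servable("preprocess_2", lambda text: text.lower())
length_model = Servable("model_predict", lambda text: len(text))

pipeline = [strip, lower, length_model]
first = run_pipeline(pipeline, "  NaCl  ")
second = run_pipeline(pipeline, "  NaCl  ")  # every step is served from its cache
```

Caching per servable, rather than per pipeline, means a shared preprocessing step is reused even when the downstream model changes.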
17. Marking up a Model – Python SDK
Existing Model
User Mark Up with SDK (SDK Extracts Metadata for Known Model Types)
Send to DLHub (via Globus or HTTPS)
DLHub: Containerization; Populate Search Index / Mint Identifiers
18. Python SDK – Automated Metadata Generation
• Citation Metadata (following DataCite)
• DLHub Metadata
• Servable Metadata
Access Control
• Public
• Globus users
• Globus groups
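The three metadata blocks and the access control setting on this slide could be combined into one record along the following lines. This is a sketch under stated assumptions: the function `build_record`, the key names, and the `visible_to` values are illustrative, loosely modeled on DataCite-style citation fields, and are not the schema the DLHub SDK actually generates.

```python
import json
from typing import Any, Dict, List, Optional


def build_record(
    title: str,
    creators: List[str],
    year: int,
    input_type: str,
    output_type: str,
    visible_to: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble a three-part metadata record (all field names are illustrative)."""
    return {
        # Citation block, loosely modeled on DataCite-style fields.
        "datacite": {
            "titles": [{"title": title}],
            "creators": [{"creatorName": name} for name in creators],
            "publicationYear": str(year),
        },
        # Hub-level block: who may see and run the servable (public by default).
        "dlhub": {"visible_to": visible_to or ["public"]},
        # Servable block: what goes in and what comes out.
        "servable": {"input": {"type": input_type}, "output": {"type": output_type}},
    }


record = build_record(
    title="Example band gap model",  # placeholder title
    creators=["Doe, Jane"],          # placeholder author
    year=2018,
    input_type="ndarray",
    output_type="float",
)
print(json.dumps(record, indent=2))
```

Passing a list of user or group identifiers as `visible_to` would restrict access in this toy record; how the real service encodes Globus users and groups is not shown on the slide.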
19. Using DLHub is Easy!
Python SDK
$ pip install dlhub_sdk
Step 1: Describe
• Specify the model files
• Mark up the model with information to make it discoverable and usable
Step 2: Publish
• Publish to DLHub
• DLHub service creates containers
• DLHub service creates unique endpoint for servable
20. Using DLHub is Easy!
Step 3: Discover
• Discover servables with advanced search capabilities through Python SDK or web UI (under construction)
Step 4: Run
• Make predictions by sending data to DLHub and specifying the servable to use
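The four steps on slides 19 and 20 (describe, publish, discover, run) can be walked through with a small in-memory stand-in. To be clear about assumptions: `ToyHub` and its methods are hypothetical and exist only for this example; they are not the `dlhub_sdk` API, and no containers or network calls are involved.

```python
import uuid
from typing import Any, Callable, Dict, List


class ToyHub:
    """In-memory stand-in for a model hub; not the DLHub service or its SDK."""

    def __init__(self) -> None:
        self._servables: Dict[str, Dict[str, Any]] = {}

    def publish(self, description: Dict[str, Any], func: Callable[[Any], Any]) -> str:
        # Step 2: register the servable and mint a unique identifier for it.
        identifier = f"toy:{uuid.uuid4()}"
        self._servables[identifier] = {"description": description, "func": func}
        return identifier

    def search(self, keyword: str) -> List[str]:
        # Step 3: find servables whose title or domains mention the keyword.
        keyword = keyword.lower()
        hits = []
        for identifier, entry in self._servables.items():
            desc = entry["description"]
            text = " ".join([desc["title"], *desc["domains"]]).lower()
            if keyword in text:
                hits.append(identifier)
        return hits

    def run(self, identifier: str, data: Any) -> Any:
        # Step 4: send data to the named servable and return its output.
        return self._servables[identifier]["func"](data)


# Step 1: describe the model (the toy "model" is the mean of a list of numbers).
description = {"title": "Mean predictor", "domains": ["materials science"]}

hub = ToyHub()
servable_id = hub.publish(description, lambda xs: sum(xs) / len(xs))
found = hub.search("materials")
prediction = hub.run(found[0], [1.0, 2.0, 3.0])
```

The point of the sketch is the division of labor: the description drives discovery, while the identifier returned at publication is all a user needs to run the model.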
22. Combining DLHub with Data Repositories
Get Data
Run Model
• Using high-throughput optical imaging to predict material bandgap
24. Select DLHub Use Cases
Model-in-the-Loop Science
• Metallic glass discovery [active learning] (Ward, ANL/UC)
• XRD applications
  XRD image tagging (Yager, BNL)
  XRD intensity → structure/phase (Cherukara, Argonne)
Community Model Benchmarking
• Crystal structure (Ward, ANL/UC)
• NIST PFHub (Wheeler, Warren, Heinonen; NIST/UC/Argonne/NU)
Automated Model Retraining with New Data
• Models linked to dynamic data sources
(Center for Hierarchical Materials Design, CHiMaD; NIST/UC/Argonne/NU)
25. More Examples Available In Our Repositories
• Energy Storage: Ward et al.
• Tomography: TomoGAN, Liu et al.
• X-Ray Science: Cherukara et al.
26. DLHub Architecture and Performance
• Task Managers (TM) to support execution on various compute resources
• Executors chosen by TM to invoke a given servable
• Caching at TM
• Data staging with Globus
• Batch submissions
• Scalability through deployment of model replicas
https://arxiv.org/abs/1811.11213
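Three of the mechanisms listed above (caching at the Task Manager, batch submissions, and model replicas) can be illustrated with a toy dispatcher. This is a sketch only: `ToyTaskManager` is a hypothetical class, and round-robin replica selection is an assumption made for the example, since the slide does not say how replicas are chosen.

```python
from itertools import cycle
from typing import Any, Callable, Dict, Hashable, List


class ToyTaskManager:
    """Hypothetical task manager: caches results and spreads work over replicas."""

    def __init__(self, replicas: List[Callable[[Any], Any]]) -> None:
        self._replicas = cycle(replicas)  # assumed policy: round-robin
        self._cache: Dict[Hashable, Any] = {}
        self.invocations = 0  # how many times a replica was actually called

    def submit(self, item: Hashable) -> Any:
        # Serve repeated inputs from the cache without invoking a replica.
        if item in self._cache:
            return self._cache[item]
        replica = next(self._replicas)
        self.invocations += 1
        result = replica(item)
        self._cache[item] = result
        return result

    def submit_batch(self, items: List[Hashable]) -> List[Any]:
        # A batch is one request carrying many inputs; results keep input order.
        return [self.submit(item) for item in items]


def square(x: int) -> int:
    """Toy model replica."""
    return x * x


tm = ToyTaskManager([square, square])  # two replicas of the same model
results = tm.submit_batch([1, 2, 2, 3])
```

In this toy run the repeated input is answered from the cache, so four inputs cost three model invocations.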
[Figure: DLHub architecture diagram. Labels: CLI, SDK, REST, DLHub Management Service, Model Repository, zmq, Task Manager, Executors (TF Serving, SageMaker, Parsl), Model Serving, Servable, Node, Key.]
Ryan Chard Zhuozhao Li
27. Open Source Opportunities
https://www.dlhub.org
https://github.com/DLHub-Argonne
• Deposit models from the community
• Help build client functionality
• Build examples using existing servables
• Be you!
Contact: Ben Blaiszik (blaiszik@uchicago.edu)
28. Thanks to our sponsors!
U.S. DEPARTMENT OF ENERGY
ALCF DF
Parsl Globus IMaD
DLHub Argonne
LDRD