Open Source Tools for Materials Informatics
1. Open Source Tools for Materials Informatics
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
MRS Fall Meeting 2019
Slides (already) posted to hackingmaterials.lbl.gov
2. Staffing interdisciplinary research
I find a recurring dilemma and asymmetry in staffing materials informatics research
[Diagram labels: Machine learning, Materials Science, Materials Informatics]
3. Who has a tougher job to get started?
MS&E major:
• Already has background in the materials science aspects of the project
• But needs to learn the machine learning and software engineering aspects
CS major:
• Already has background in software engineering and appropriate machine learning
• But needs to learn the materials science aspects
5. Who has a tougher job to get started?
MS&E major | CS major
My experience is that the CS major typically has the tougher road ahead of them
Easier to pick up / self-learn random forests & neural networks than phase diagrams & crystal structures
6. There is an asymmetry in resources available
MS&E major (learning machine learning and software engineering):
• Hands-on code and examples to run and modify
• Hundreds of YouTube videos and online courses
• Code reviews from collaborators
• And the standard books, etc.
CS major (learning materials science):
• Books and research articles
• Conversations with colleagues, impromptu lectures
• Practice problems? Worked examples? Interactive code?
7. Outline
① Matminer: data and descriptors for producing ML structure-property relationships
② Matscholar – applying natural language processing to materials science information retrieval
8. How can we make it easy to develop and test ML models for composition-structure-property relationships?
• How can we quickly represent chemistry and structure as vectors?
• How do we get labeled training/test data?
• How do we know if our ML model is extraordinary?
9. How can we make it easy to develop and test ML models for composition-structure-property relationships?
• How can we quickly represent chemistry and structure as vectors?
10. Matminer contains a library of descriptors for various materials science entities
>60 featurizer classes can generate thousands of potential descriptors that are described in the literature
feat = EwaldEnergy([options])
y = feat.featurize([input_data])
• compatible with scikit-learn pipelining
• automatically deploy multiprocessing to parallelize over data
• include citations to methodology papers
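The featurizer pattern on this slide (construct an object, call featurize, get a vector back) can be sketched in plain Python. This is a toy illustration only: the MeanAtomicMass class, the parse_formula helper, and the small mass table are hypothetical and are not part of matminer. The method names simply mirror the featurize / labels / citations idea described above.

```python
from collections import Counter
import re

# Standard atomic masses (u) for a few elements; extend as needed.
ATOMIC_MASS = {"H": 1.008, "O": 15.999, "Na": 22.990, "Cl": 35.45, "Fe": 55.845}


def parse_formula(formula):
    """Parse a simple formula such as 'Fe2O3' into {'Fe': 2.0, 'O': 3.0}.

    Handles element symbols followed by optional numeric amounts only
    (no parentheses or charges).
    """
    amounts = Counter()
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        amounts[symbol] += float(amount) if amount else 1.0
    return dict(amounts)


class MeanAtomicMass:
    """Hypothetical toy featurizer: one formula in, one feature vector out."""

    def featurize(self, formula):
        amounts = parse_formula(formula)
        total = sum(amounts.values())
        mean_mass = sum(ATOMIC_MASS[el] * n for el, n in amounts.items()) / total
        return [mean_mass]

    def feature_labels(self):
        return ["mean atomic mass (u)"]

    def citations(self):
        # A real featurizer would return references to its methodology papers.
        return []


feat = MeanAtomicMass()
X = [feat.featurize(f) for f in ["NaCl", "Fe2O3", "H2O"]]
```

Each row of X is a fixed-length numeric vector, which is the form that standard ML libraries expect as input.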
11. How can we make it easy to develop and test ML models for composition-structure-property relationships?
• How do we get labeled training/test data?
12. What about data?
• Typically, a lot of attention is given to advanced algorithms for machine learning
– e.g., deep neural networks versus standard ML
• But perhaps there is not enough emphasis on developing the appropriate data sets
– with enough information to train ML algorithms
– with sufficient data quality
– easy enough for anyone to at least get started without specialized knowledge
13. The importance of data
https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
14. What is ImageNet?
The ImageNet data set was collected and hand-labeled (e.g., via Amazon Mechanical Turk). The latest version has over 14 million hand-annotated images, organized into ~20,000 categories
16. How data stimulates new algorithms
How can we create an ImageNet for materials science?
17. An “ImageNet” for materials science
• We want a test set that contains a diverse array of problems
– Smaller data versus larger data
– Different applications (electronic, mechanical, etc.)
– Composition-only or structure information available
– Classification or regression
• We also want a cross-validation metric that gives reliable error estimates
– i.e., less dependent on specific choice of splits
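One common way to make an error estimate less dependent on a particular split is to repeat k-fold cross-validation over several random shuffles and average the results. The sketch below illustrates that idea with a trivial mean-value predictor on synthetic data. It is an assumption for illustration, not the benchmark's actual protocol, and all function names are hypothetical.

```python
import random
import statistics


def k_fold_indices(n_samples, n_folds, rng):
    """Shuffle sample indices and deal them into n_folds disjoint test folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def mean_predictor_mae(y_train, y_test):
    """MAE of a trivial model that always predicts the training mean."""
    prediction = statistics.fmean(y_train)
    return statistics.fmean([abs(y - prediction) for y in y_test])


def repeated_k_fold_mae(y, n_folds=5, n_repeats=10, seed=0):
    """Mean and spread of test MAE over n_repeats shuffled k-fold splits."""
    rng = random.Random(seed)
    fold_errors = []
    for _ in range(n_repeats):
        for test_idx in k_fold_indices(len(y), n_folds, rng):
            test_set = set(test_idx)
            y_train = [v for i, v in enumerate(y) if i not in test_set]
            y_test = [y[i] for i in test_idx]
            fold_errors.append(mean_predictor_mae(y_train, y_test))
    return statistics.fmean(fold_errors), statistics.stdev(fold_errors)


# Synthetic target values, for illustration only.
data_rng = random.Random(42)
y = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
mae, spread = repeated_k_fold_mae(y)
```

Reporting the spread alongside the mean shows how much a single split could have misled; a real model would replace mean_predictor_mae.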
18. Overview of Matbench test set
Target Property | Data Source | Samples | Method
Bulk Modulus | Materials Project | 10,987 | DFT-GGA
Shear Modulus | Materials Project | 10,987 | DFT-GGA
Band Gap | Materials Project | 106,113 | DFT-GGA
Metallicity | Materials Project | 106,113 | DFT-GGA
Band Gap | Zhuo et al. [1] | 6,354 | Experiment
Metallicity | Zhuo et al. [1] | 6,354 | Experiment
Bulk Metallic Glass formation | Landolt-Börnstein | 7,190 | Experiment
Refractive index | Materials Project | 4,764 | DFPT-GGA
Formation Energy | Materials Project | 132,752 | DFT-GGA
Perovskite Formation Energy | Castelli et al. [2] | 18,928 | DFT-GGA
Freq. at Last Phonon PhDOS Peak | Materials Project | 1,296 | DFPT-GGA
Exfoliation Energy | JARVIS-2D | 636 | DFT-vdW-DF
Steel yield strength | Citrine Informatics | 312 | Experiment
[1] doi.org/10.1021/acs.jpclett.8b00124
[2] doi.org/10.1039/C2EE22341D
19. Diversity of benchmark suite
[Figure: breakdown of the benchmark tasks along four axes]
• application: mechanical, electronic, stability, optical, thermal
• data size: <1K, 1K-10K, 10K-100K, >100K
• problem type: classification, regression
• data type: experiment (composition only), DFT (structure)
20. How can we make it easy to develop and test ML models for composition-structure-property relationships?
• How do we know if our ML model is extraordinary?
21. How about a benchmark algorithm?
Automatminer is a “black box” machine learning model
Give it any data set with either composition or structure inputs, and automatminer will train an ML model (no researcher intervention)
22. Automatminer develops an ML model automatically given raw data (structures or compositions plus output properties)
Featurizer: MagPie, SOAP, Sine Coulomb Matrix, + many, many more
• Dropping features with many errors
• Missing value imputation
• One-hot encoding
• PCA-based
• Correlation
• Model-based (tree)
Uses genetic algorithms to find the best machine learning model + hyperparameters
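The data-cleaning steps listed on this slide (dropping features with many errors, imputing missing values, one-hot encoding) can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not automatminer's implementation: the function names, the 0.5 missing-value threshold, the choice of mean imputation, and the example rows are all hypothetical.

```python
import statistics


def drop_sparse_features(rows, max_missing_fraction=0.5):
    """Drop columns where more than max_missing_fraction of values are None."""
    keep = []
    for name in rows[0]:
        missing = sum(1 for row in rows if row[name] is None)
        if missing / len(rows) <= max_missing_fraction:
            keep.append(name)
    return [{name: row[name] for name in keep} for row in rows]


def impute_mean(rows):
    """Replace remaining None values in numeric columns with the column mean."""
    means = {}
    for name in rows[0]:
        values = [row[name] for row in rows if isinstance(row[name], (int, float))]
        if values:
            means[name] = statistics.fmean(values)
    return [
        {name: means[name] if value is None and name in means else value
         for name, value in row.items()}
        for row in rows
    ]


def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical featurized rows; None marks a featurizer error or missing value.
rows = [
    {"density": 5.2, "ewald": None, "mean_mass": 31.9, "crystal_system": "trigonal"},
    {"density": None, "ewald": None, "mean_mass": 29.2, "crystal_system": "cubic"},
    {"density": 2.2, "ewald": -3.1, "mean_mass": 6.0, "crystal_system": "cubic"},
    {"density": 4.0, "ewald": None, "mean_mass": 20.0, "crystal_system": "cubic"},
]
cleaned = one_hot_encode(impute_mean(drop_sparse_features(rows)), "crystal_system")
```

Here the mostly-empty "ewald" column is dropped, the single missing density is filled with the column mean, and "crystal_system" becomes two indicator columns, so every remaining value is numeric.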
24. If we can get a well-established “benchmark”, perhaps interdisciplinary teams can start hammering on accuracy
A lower barrier to entry in the field means more ideas can be tested from more researchers
[Plot: Matbench test set average error over time (today, 5 years, 10 years)]
25. Matminer, matbench, and automatminer can all be accessed, used, and modified by anyone
Code / Examples all on GitHub
• github.com/hackingmaterials/matminer
• github.com/hackingmaterials/matminer_examples
• github.com/hackingmaterials/automatminer
Matbench data on Figshare
• (coming soon, still finalizing)
Free support via Discourse
• https://discuss.matsci.org
26. Outline
① Matminer: data and descriptors for producing ML structure-property relationships
② Matscholar – applying natural language processing to materials science information retrieval
27. Goal: collect and organize knowledge embedded in the materials science literature
We have extracted ~2 million abstracts of relevant scientific articles
We use natural language processing algorithms to try to extract knowledge from all this data
31. Using a web site as a “gateway” into the algorithms
• How do we get more people benefitting from this work and involved in improving it?
• One solution: expose an easy-to-use web frontend, with links to all the backend codes in case people want to dive further
– New tools like Plotly Dash make this easier than ever
[Diagram labels: frontend, backend]
36. Concluding thoughts
• We need more resources to help computer scientists learn about materials science topics through hands-on examples and interactive demos
• Some things that can help:
– Open-source implementations of materials science methods
– Interactive examples (e.g., Jupyter)
– Documentation and support(!)
– Labeled data sets
– Front-ends for easy exploration
37. Funding acknowledgements
• Matminer
– U.S. Department of Energy, Materials Science Division
• Matscholar
– Toyota Research Institute