1. Software tools, crystal descriptors, and
machine learning applied to materials design
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
TMS 2019
Slides (already) posted to hackingmaterials.lbl.gov
2. This talk is centered around open-source software that you
can use to accelerate your own materials design efforts
• High-throughput computing and simulations
• Machine learning
• Interpretable crystal structure representations
4. We know that high-throughput DFT is useful for generating
large data sets, e.g., for materials screening
>10,000 elastic tensors
M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. Van Der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009.
>48,000 Seebeck coefficients + cRTA transport
Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier, Sci. Data, 2017, 4, 170085.
5. Atomate’s goal: make high-throughput easy and
scalable for everyone
6. A “black-box” view of performing a calculation
Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → “something” → Results!
7. Unfortunately, the inside of the “black box”
is usually tedious and “low-level”
Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → lots of tedious, low-level work… → Results!
(Input file flags, SLURM format, how to fix ZPOTRF?)
☐ set up the structure coordinates
☐ write input files, double-check all the flags
☐ copy to supercomputer
☐ submit job to queue
☐ deal with supercomputer headaches
☐ monitor job
☐ fix error jobs, resubmit to queue, wait again
☐ repeat process for subsequent calculations in workflow
☐ parse output files to obtain results
☐ copy and organize results, e.g., into Excel
9. What would be a better way?
Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → Results!
Workflows to run
☐ band structure
☐ surface energies
☑ elastic tensor
☐ Raman spectrum
☐ QH thermal expansion
10. Ideally the method should scale to millions of calculations
Researcher: “Start with all binary oxides, replace O → S, run several different properties” → Results!
Workflows to run
☑ band structure
☑ surface energies
☑ elastic tensor
☐ Raman spectrum
☐ QH thermal expansion
☐ spin-orbit coupling
11. Atomate tries to make it easy, automatic, and flexible to
generate data with existing simulation packages
Researcher: “Run many different properties of many different materials!” → Results!
12. Each simulation procedure translates high-level instructions
into a series of low-level tasks
Quickly and automatically translate high-level (minimal) specifications into well-defined FireWorks workflows
Example request: “What is the GGA-PBE elastic tensor of GaAs?”
M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, et al., Charting the complete elastic properties of inorganic crystalline compounds, Sci. Data 2, 150009 (2015).
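The idea of expanding a one-line request into ordered low-level tasks can be sketched in a few lines of plain Python. This is only an illustration: the `Task` class, the task names, and the default of six deformations are assumptions made for the example, and none of it is the atomate or FireWorks API.

```python
# Illustrative only: a toy "workflow" builder, NOT the atomate or FireWorks API.
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    depends_on: list = field(default_factory=list)


def elastic_tensor_workflow(formula, n_deformations=6):
    """Expand a one-line request into a set of hypothetical tasks."""
    relax = Task(f"{formula}: structure optimization")
    deformations = [
        Task(f"{formula}: static calc, deformation {i + 1}", [relax.name])
        for i in range(n_deformations)
    ]
    fit = Task(f"{formula}: fit elastic tensor", [d.name for d in deformations])
    # Returned in a deliberately scrambled order; the scheduler sorts it out.
    return [fit, *deformations, relax]


def execution_order(tasks):
    """Return task names so that every task comes after its dependencies."""
    by_name = {t.name: t for t in tasks}
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in by_name[name].depends_on:
            visit(dep)
        ordered.append(name)

    for t in tasks:
        visit(t.name)
    return ordered


order = execution_order(elastic_tensor_workflow("GaAs"))
for step in order:
    print(step)
```

The researcher only supplies the formula; the dependency bookkeeping (relax first, deformations next, tensor fit last) is what a workflow library would typically take over.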
13. Atomate contains a library of simulation procedures
VASP-based
• band structure
• spin-orbit coupling
• hybrid functional calcs
• elastic tensor
• piezoelectric tensor
• Raman spectra
• NEB
• GIBBS method
• QH thermal expansion
• AIMD
• ferroelectric
• surface adsorption
• work functions
• NMR spectra*
• Bader charges*
• Magnetic orderings*
• SCAN functionals*
Other
• BoltzTraP
• FEFF method
• Q-Chem*
* = added / major updates in past year
Mathew, K. et al. Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows, Comput. Mater. Sci. 139 (2017) 140–152.
14. Full operation diagram
structure → workflow (job 1, job 2, job 3, job 4) → database of all workflows → automatically submit + execute → output files + database
15. A web-based interface is in progress to give atomate users a
“personal Materials Project” of their own calculations
16. Atomate now powers the Materials Project
• Online resource of density
functional theory simulation data
for ~85,000 inorganic materials
• Includes band structures, elastic
tensors, piezoelectric tensors,
battery properties and more
• >75,000 registered users
• Free
• www.materialsproject.org
Jain et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
17. Getting started with atomate
Paper: Mathew, K. et al. Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows. Comput. Mater. Sci. 139, 140–152 (2017).
Docs: hackingmaterials.github.io/atomate
Support: https://groups.google.com/forum/#!forum/atomate
19. Rocketsled uses adaptive design to suggest the best
computations to optimize some metric
• With atomate/FireWorks, the user must decide which calculations to perform
– E.g., which materials to calculate
• Rocketsled is an extension to FireWorks that lets the computer decide what the next best calculation is based on the results of previous calculations
• Works for materials design or any other “inverse computational problem”
20. Given a search domain, Rocketsled uses an optimization
engine to select calculations and submit to supercomputers
Optimization engine includes 4 built-in regressors (e.g., RandomForest, Gaussian Process) and 5 acquisition functions (e.g., Expected Improvement). Can bootstrap uncertainty estimates. Or use your own!
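The select-evaluate-update loop described above can be sketched with the standard library alone. Everything here is a stand-in chosen for brevity: a 1-nearest-neighbour surrogate replaces the regressors named on the slide, an upper-confidence-bound-style score replaces Expected Improvement, and `expensive_calculation` is a hypothetical objective rather than a real simulation. It is not the Rocketsled API.

```python
# Toy adaptive-design loop. NOT the Rocketsled API: the surrogate, the
# acquisition score, and the objective below are all stand-ins.

def expensive_calculation(x):
    """Hypothetical stand-in for a costly simulation; its peak is at x = 70."""
    return -(x - 70) ** 2


def predict(x, observed):
    """1-nearest-neighbour surrogate: (predicted value, crude uncertainty)."""
    nearest = min(observed, key=lambda seen: abs(seen - x))
    return observed[nearest], abs(nearest - x)


def acquisition(x, observed, kappa=50.0):
    """Upper-confidence-bound-style score: exploit predictions, explore gaps."""
    mean, uncertainty = predict(x, observed)
    return mean + kappa * uncertainty


def optimize(domain, initial, budget):
    """Evaluate `budget` new candidates, each chosen by the acquisition score."""
    observed = {x: expensive_calculation(x) for x in initial}
    for _ in range(budget):
        candidates = [x for x in domain if x not in observed]
        if not candidates:
            break
        best_next = max(candidates, key=lambda x: acquisition(x, observed))
        observed[best_next] = expensive_calculation(best_next)
    return observed


domain = range(0, 101)
observed = optimize(domain, initial=[0, 100], budget=15)
best_x = max(observed, key=observed.get)
print(f"evaluated {len(observed)} of {len(domain)} candidates; best x = {best_x}")
```

The `kappa` weight is the knob that trades exploration against exploitation in this sketch; a production optimizer would use a proper regressor with calibrated uncertainty instead of a distance heuristic.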
21. Results of using optimization can be dramatic!
In the problem of finding materials with high bulk modulus (K) and high shear modulus (G) for superhard materials (7394 possibilities), Rocketsled finds solutions ~30–60× faster than randomly computing the space.
Can use pure ML approaches or use matminer featurizations for materials science (the latter helps give such good performance).
22. Results of using optimization can be dramatic!
Even after just 200 calculations of the 7394 possibilities, all solutions are almost certain to be found with Rocketsled.
23. Getting started with rocketsled
Paper: Dunn, A.R., et al. Rocketsled: a software library for optimizing high-throughput computational searches. J. Phys. Mater. https://doi.org/10.1088/2515-7639/ab0c3d
Docs: hackingmaterials.github.io/rocketsled
Support: https://groups.google.com/forum/#!forum/fireworkflows
25. What is needed to do machine learning on materials?
• How can we represent chemistry and structure as vectors?
• How do we get enough output data for training?
26. Matminer connects materials data with data mining
algorithms and data visualization libraries
Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
27. Matminer contains a library of descriptors for various
materials science entities
>60 featurizer classes can generate thousands of potential descriptors that are described in the literature
feat = EwaldEnergy([options])
y = feat.featurize([input_data])
• compatible with scikit-learn pipelining
• automatically deploy multiprocessing to parallelize over data
• include citations to methodology papers
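The "construct a featurizer, then call featurize" pattern can be illustrated with a toy class that needs nothing beyond the standard library. The class name, its two features, and the small element tables are assumptions made for this sketch; it is not a matminer class and does not reproduce any matminer descriptor.

```python
# Toy featurizer. NOT a matminer class; it only mimics the
# "construct, then call featurize" pattern shown on the slide.

ATOMIC_NUMBER = {"O": 8, "Ti": 22, "Ga": 31, "As": 33}
PAULING_EN = {"O": 3.44, "Ti": 1.54, "Ga": 1.81, "As": 2.18}


class SimpleCompositionFeaturizer:
    """Turn a composition such as {"Ga": 1, "As": 1} into a numeric vector."""

    def feature_labels(self):
        return ["mean atomic number", "electronegativity range"]

    def featurize(self, composition):
        total = sum(composition.values())
        mean_z = sum(ATOMIC_NUMBER[el] * n for el, n in composition.items()) / total
        en = [PAULING_EN[el] for el in composition]
        return [mean_z, max(en) - min(en)]

    def featurize_many(self, compositions):
        # One feature vector per entry: the row layout ML libraries expect.
        return [self.featurize(c) for c in compositions]


feat = SimpleCompositionFeaturizer()
X = feat.featurize_many([{"Ga": 1, "As": 1}, {"Ti": 1, "O": 2}])
for row in X:
    print(dict(zip(feat.feature_labels(), row)))
```

Keeping labels and values in separate methods makes it straightforward to assemble a labelled table of descriptors from many featurizers.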
28. Interactive Jupyter notebooks demonstrate use cases
https://github.com/hackingmaterials/matminer_examples
Many examples available:
• Retrieving data from various databases
• Predicting bulk / shear modulus
• Predicting formation energies:
– from composition alone
– with Voronoi-based structure features included
– with Coulomb matrix and Orbital Field matrix descriptors (reproducing previous studies in the literature)
• Making interactive visualizations
• Creating an ML pipeline
29. Getting started with matminer
Paper: Ward et al. Matminer: An open source toolkit for materials data mining. Computational Materials Science, 152, 60–69 (2018).
Docs: hackingmaterials.github.io/matminer
Support: https://groups.google.com/forum/#!forum/matminer
31. Typically several steps of machine learning are performed by
a human researcher – can these be automated?
• Descriptors developed and chosen by a researcher
• ML model developed and chosen by a researcher
Why can’t we just give the computer some raw input data (compositions, crystal structures) and output properties and get back an ML model?
32. Automatminer develops an ML model automatically given
raw data (structures or compositions plus output properties)
Featurizer:
• MagPie
• SOAP
• Sine Coulomb Matrix
• + many, many more
Data cleaning:
• Missing value imputation
• Scaling
• One-hot encoding
Feature reduction:
• PCA-based
• Correlation
• Relief-based (MultiSURF)
Model selection:
• Uses genetic algorithms to find the best machine learning model + hyperparameters
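Three of the steps listed above (missing value imputation, scaling, and correlation-based feature removal) can be sketched in plain Python. The function names, the 0.95 threshold, and the tiny data table are assumptions for illustration; this is not automatminer code, and the genetic-algorithm model search is not shown.

```python
# Toy preprocessing steps. NOT automatminer code; names and the
# correlation threshold are illustrative assumptions.
from statistics import mean, pstdev


def impute_mean(rows):
    """Replace None entries with the mean of the known values in that column."""
    cols = list(zip(*rows))
    fills = [mean(v for v in col if v is not None) for col in cols]
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def standardize(rows):
    """Scale every column to zero mean and unit (population) variance."""
    cols = list(zip(*rows))
    stats = [(mean(col), pstdev(col) or 1.0) for col in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return cov / norm if norm else 0.0


def drop_correlated(rows, threshold=0.95):
    """Keep a column only if it is not strongly correlated with one already kept."""
    cols = list(zip(*rows))
    kept = []
    for j, col in enumerate(cols):
        if all(abs(pearson(col, cols[k])) < threshold for k in kept):
            kept.append(j)
    return kept, [[row[j] for j in kept] for row in rows]


raw = [
    [1.0, 2.0, 10.0],
    [2.0, 4.0, None],  # missing value to impute
    [3.0, 6.0, 30.0],
    [4.0, 8.0, 5.0],
]
clean = standardize(impute_mean(raw))
kept, reduced = drop_correlated(clean)
print("kept feature columns:", kept)
```

In this table the second column is exactly twice the first, so the correlation filter discards it and keeps the other two.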
33. We are benchmarking automatminer against the current state of the
art on 11 problems intended to be a standard test set
Dataset                        Target(s)                                       Samples
Elastic Tensor                 KVRH (GPa), GVRH (GPa)                          10,987
Dielectric Tensor              Refractive index                                4,765
JARVIS 2D                      Exfoliation energy (meV/atom)                   636
Materials Project phonons      Highest LO phonon frequency (last PhDOS peak)   1,265
Materials Project (stable)     Band gap (eV), Is metallic? (classification)    106,113
Perovskites                    Formation energy (eV/atom)                      18,928
Experimental Band Gaps         Is metallic? (classification)                   6,354
Experimental Metallic Glasses  Glass forms? (classification)                   7,190
Materials Project (all)        Formation energy (eV/atom)                      132,752
34. Usually, automatminer does very well
Usually, automatminer outperforms both state-of-the-art graph-based models AND human-generated models!
But …
35. Graph-based approaches work better in some problems
Hypothesis: automatminer approaches are better for smaller data sets; graph-based approaches are better for larger data sets.
Unfortunately, it can be difficult to train some of the graph models on large data sets, particularly without GPUs, so the results are not in yet!
36. Getting started with automatminer
Paper: In preparation …
Docs: hackingmaterials.github.io/automatminer
Support: https://groups.google.com/forum/#!forum/matminer
40. Example of fully automated robocrystallographer output
GaAs is zincblende structured and crystallizes in the cubic F4̅3m space group. Ga³⁺ is bonded to four equivalent As³⁻ atoms to form corner-sharing GaAs₄ tetrahedra. All Ga–As bond lengths are 2.49 Å. As³⁻ is bonded in a tetrahedral geometry to four equivalent Ga³⁺ atoms.
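One way to picture how structured information becomes sentences like those above is a simple template renderer. This sketch assumes a hand-filled dictionary with hypothetical field names; it is not the robocrystallographer API, and it skips the hard part, which is deriving that information from the crystal structure in the first place.

```python
# Toy template renderer. NOT the robocrystallographer API; the dictionary
# layout and field names are hypothetical.

NUMBER_WORDS = {2: "two", 3: "three", 4: "four", 6: "six"}


def describe(record):
    """Render a hand-filled structural record as short English sentences."""
    sentences = [
        f"{record['formula']} is {record['prototype']} structured and "
        f"crystallizes in the {record['crystal_system']} "
        f"{record['space_group']} space group."
    ]
    for site in record["sites"]:
        count = NUMBER_WORDS.get(site["coordination"], str(site["coordination"]))
        sentences.append(
            f"{site['ion']} is bonded to {count} equivalent "
            f"{site['neighbor']} atoms."
        )
    return " ".join(sentences)


# Values taken from the GaAs description on the slide.
gaas = {
    "formula": "GaAs",
    "prototype": "zincblende",
    "crystal_system": "cubic",
    "space_group": "F-43m",
    "sites": [
        {"ion": "Ga3+", "coordination": 4, "neighbor": "As3-"},
        {"ion": "As3-", "coordination": 4, "neighbor": "Ga3+"},
    ],
}
text = describe(gaas)
print(text)
```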
42. Example of fully automated robocrystallographer output
BiOCuSe is parent of FeAs superconductors structured and crystallizes in the tetragonal P4/nmm space group. The structure is two-dimensional and consists of one BiO sheet oriented in the (0, 0, 1) direction and one CuSe sheet oriented in the (0, 0, 1) direction. In the BiO sheet, Bi³⁺ is bonded in a 4-coordinate geometry to four equivalent O²⁻ atoms. All Bi–O bond lengths are 2.35 Å. O²⁻ is bonded in a tetrahedral geometry to four equivalent Bi³⁺ atoms. In the CuSe sheet, Cu¹⁺ is bonded to four equivalent Se²⁻ atoms to form a mixture of edge and corner-sharing CuSe₄ tetrahedra. All Cu–Se bond lengths are 2.52 Å. Se²⁻ is bonded in a 4-coordinate geometry to four equivalent Cu¹⁺ atoms.
43. Robocrystallographer is integrated into the Materials Project
• Click the robot icon for Robocrys
• Click the speaker icon to have it talk to you.
TiO₂ is Rutile structured and crystallizes in the tetragonal P4₂/mnm space group. The structure is three-dimensional. Ti⁴⁺ is bonded to six equivalent O²⁻ atoms to form a mixture of corner and edge-sharing TiO₆ octahedra. The corner-sharing octahedral tilt angles are 49°. There are four shorter (1.96 Å) and two longer (2.00 Å) Ti–O bond lengths. O²⁻ is bonded in a distorted trigonal planar geometry to three equivalent Ti⁴⁺ atoms.
44. Getting started with robocrystallographer
Paper: Submitted - waiting for referee report!!
Docs: hackingmaterials.github.io/robocrystallographer
Support: Alex Ganose, aganose@lbl.gov
45. Conclusion: hopefully you’ve found something interesting
or useful for your own work!
• High-throughput computing and simulations
• Machine learning
• Interpretable crystal structure representations
46. Acknowledgements
• Lead developers:
– Atomate: Kiran Mathew
– Rocketsled: Alex Dunn
– Matminer: Logan Ward
– Automatminer: Alex Dunn
– Robocrystallographer: Alex Ganose
• And the dozens of other developers who have contributed to these packages or reported issues!
• Funding: U.S. Department of Energy, Basic Energy Sciences, Early Career Award
• Additional funding from the DOE-funded Materials Project