Automated Machine Learning Applied to Diverse Materials Design Problems
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
MRS Spring Meeting, 2019
Slides (already) posted to hackingmaterials.lbl.gov
There are many algorithms developed for machine learning in materials – new ones are constantly reported!
Q: Which one is the “best” based on all the literature reports?
A: Can’t tell! They are (almost?) all tested on different data sets.
Difficulty of comparing different ML algorithms
• Different data sets
– Source (e.g., OQMD vs. MP)
– Quantity (e.g., MP 2018 vs. MP 2019)
– Subset / data filtering (e.g., ehull < X)
• Different cross-validation metrics
– e.g., what fraction is the test set?
• Often, this can’t be helped
– Usually can’t access training / test data of past works
– Sometimes no runnable version of a published algorithm
– Should referees be tougher on this?
[Figure: the data sets used in study A, study B, and study C]
Outline
• Matbench: a standard test method for materials science problems
– A set of diverse materials data sets for testing
– A consistent cross-validation strategy
• Automatminer: A “black box” materials science ML algorithm
– Materials-specific descriptors using matminer
– AutoML to tune hyperparameters
A standard test method for ML algorithms in materials
• We want a test set that contains a diverse array of problems
– Smaller data versus larger data
– Different applications (electronic, mechanical, etc.)
– Composition-only or structure information available
– Classification or regression
• We also want a cross-validation metric that gives reliable error estimates
– i.e., less dependent on specific choice of splits
Overview of Matbench test set
Target Property                  Data Source          Samples  Method
Bulk Modulus                     Materials Project    10,987   DFT-GGA
Shear Modulus                    Materials Project    10,987   DFT-GGA
Band Gap                         Materials Project    106,113  DFT-GGA
Metallicity                      Materials Project    106,113  DFT-GGA
Band Gap                         Zhuo et al. [1]      6,354    Experiment
Metallicity                      Zhuo et al. [1]      6,354    Experiment
Bulk Metallic Glass Formation    Landolt-Börnstein    7,190    Experiment
Refractive Index                 Materials Project    4,764    DFPT-GGA
Formation Energy                 Materials Project    132,752  DFT-GGA
Perovskite Formation Energy      Castelli et al. [2]  18,928   DFT-GGA
Freq. at Last Phonon PhDOS Peak  Materials Project    1,296    DFPT-GGA
Exfoliation Energy               JARVIS-2D            636      DFT-vdW-DF
Steel Yield Strength             Citrine Informatics  312      Experiment
[1] doi.org/10.1021/acs.jpclett.8b00124
[2] doi.org/10.1039/C2EE22341D
Diversity of benchmark suite
[Figure: breakdown of the benchmark suite along four axes]
• Application: mechanical, electronic, stability, optical, thermal
• Data size: <1K, 1K-10K, 10K-100K, >100K
• Problem type: classification, regression
• Data type: experiment (composition only), DFT (structure)
Most commonly used test procedure
• Training / validation is used for model selection
• Test / hold-out is used only for error estimation
(Test set should not inform model selection, i.e., “final answer”)
Nested CV – like hold-out, but varies the hold-out set
Think of it as N different “universes” – we have a different training of the model in each universe and a different hold-out.
“A nested CV procedure provides an almost unbiased estimate of the true error.”
Varma and Simon, Bias in error estimation when using cross-validation for model selection (2006)
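The nested procedure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are synthetic, the "model" is a toy nearest-neighbor regressor, and the helper names (`kfold_indices`, `select_n_neighbors`, `nested_cv`) are hypothetical and not part of Matbench. The point is the structure: the inner loop picks a hyperparameter using training data only, and each outer hold-out fold is used only for error estimation.

```python
import random
from statistics import mean


def kfold_indices(n, k, rng):
    """Shuffle range(n) and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def split(data, fold):
    """Return (train, test), where test holds the points indexed by fold."""
    held = set(fold)
    train = [p for i, p in enumerate(data) if i not in held]
    test = [data[i] for i in fold]
    return train, test


def knn_predict(train, x, n_neighbors):
    """Predict y at x as the mean y of the nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return mean(y for _, y in nearest)


def mae(train, test, n_neighbors):
    """Mean absolute error of the toy model built on train, scored on test."""
    return mean(abs(knn_predict(train, x, n_neighbors) - y) for x, y in test)


def select_n_neighbors(train, candidates, inner_folds):
    """Inner loop: model selection using the training data only."""
    def inner_error(c):
        return mean(mae(*split(train, fold), c) for fold in inner_folds)
    return min(candidates, key=inner_error)


def nested_cv(data, candidates, outer_k=5, inner_k=3, seed=0):
    """Outer loop: each hold-out fold is used once, only to estimate error."""
    rng = random.Random(seed)
    outer_errors = []
    for fold in kfold_indices(len(data), outer_k, rng):
        train, test = split(data, fold)
        inner_folds = kfold_indices(len(train), inner_k, rng)
        best = select_n_neighbors(train, candidates, inner_folds)
        outer_errors.append(mae(train, test, best))
    return outer_errors


# Synthetic data, y = 2x + noise (illustrative only, not a materials data set).
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(60)]
data = [(x, 2 * x + data_rng.gauss(0, 0.5)) for x in xs]

fold_errors = nested_cv(data, candidates=[1, 3, 5, 9])
print("per-fold MAE:", [round(e, 3) for e in fold_errors])
print("nested CV estimate:", round(mean(fold_errors), 3))
```

Each outer fold corresponds to one of the N “universes”: the selected hyperparameter may differ between folds, and the reported error is the average over the hold-out folds.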
Summary of Matbench
• Matbench is a curated set of data sets that provides a diverse set of problems representative of those found in materials science
• ML developers can work on a consistent set of test problems
• Ideally – consistent reports of error in the literature!
• Matbench v1 will be released soon …
– Let us know if you have feedback / comments / suggestions!
Typically several steps of machine learning are performed by a human researcher – can these be automated?
• Descriptors developed and chosen by a researcher
• ML model developed and chosen by a researcher
Why can’t we just give the computer some raw input data (compositions, crystal structures) and output properties and get back an ML model?
Automatminer is a “black box” machine learning model
Give it any data set with either composition or structure inputs, and automatminer will train an ML model (no researcher intervention)
Automatminer develops an ML model automatically given raw data (structures or compositions plus output properties)
[Figure: automatminer pipeline stages]
• Featurizer: MagPie, SOAP, Sine Coulomb Matrix, + many, many more
• Data cleaning: dropping features with many errors, missing value imputation, one-hot encoding
• Feature reduction: PCA-based, correlation, model-based (tree)
• AutoML: uses genetic algorithms to find the best machine learning model + hyperparameters
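The cleaning and reduction steps can be illustrated with a small stand-alone sketch. It is not automatminer's implementation: the function names, the thresholds (`max_missing=0.5`, `threshold=0.95`), and the tiny feature table are all assumptions made for illustration, and one-hot encoding, PCA, and tree-based reduction are left out for brevity.

```python
from statistics import mean


def drop_sparse_columns(table, max_missing=0.5):
    """Drop features where featurization failed (None) for too many samples."""
    n = len(next(iter(table.values())))
    return {
        name: col
        for name, col in table.items()
        if sum(v is None for v in col) / n <= max_missing
    }


def impute_mean(table):
    """Fill remaining gaps with the column mean (assumes no all-None column)."""
    out = {}
    for name, col in table.items():
        fill = mean(v for v in col if v is not None)
        out[name] = [fill if v is None else v for v in col]
    return out


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either column is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def drop_correlated(table, threshold=0.95):
    """Keep a feature only if it is not nearly collinear with one already kept."""
    kept = {}
    for name, col in table.items():
        if all(abs(pearson(col, other)) < threshold for other in kept.values()):
            kept[name] = col
    return kept


# Hypothetical feature table: four features for four samples.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.0, 6.0, 8.0],      # perfectly correlated with f1
    "f3": [None, None, None, 1.0],   # featurizer failed for most samples
    "f4": [0.5, None, 0.1, 0.9],     # one missing value
}

cleaned = drop_correlated(impute_mean(drop_sparse_columns(features)))
print(list(cleaned))
```

The order matters in this sketch: sparse columns are removed before imputation so that a mostly failed feature is not filled in with a near-constant value, and correlation is computed only after gaps are filled.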
>60 featurizer classes can generate thousands of potential descriptors that are described in the literature
Matminer contains a library of descriptors for various materials science entities
feat = EwaldEnergy([options])
y = feat.featurize([input_data])
• compatible with scikit-learn pipelining
• automatically deploy multiprocessing to parallelize over data
• include citations to methodology papers
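To make the featurizer pattern concrete without depending on matminer itself, here is a toy class that mimics the same shape: a `featurize` call on one input, labels for the outputs, and a citation hook. `ToyCompositionFeaturizer`, `ELEMENT_DATA`, and the two features it computes are hypothetical stand-ins, not matminer code; the real library draws on full elemental data sets and handles parallelization itself.

```python
# Small hard-coded lookup: (atomic number, Pauling electronegativity).
# Illustration only; a real featurizer would use a complete elemental data set.
ELEMENT_DATA = {
    "Li": (3, 0.98),
    "O": (8, 3.44),
    "Na": (11, 0.93),
    "Cl": (17, 3.16),
    "Fe": (26, 1.83),
}


class ToyCompositionFeaturizer:
    """Hypothetical featurizer following a featurize / labels / citations pattern."""

    def featurize(self, composition):
        """Map {element: amount} to fraction-weighted mean properties."""
        total = sum(composition.values())
        mean_z = sum(ELEMENT_DATA[el][0] * amt for el, amt in composition.items()) / total
        mean_en = sum(ELEMENT_DATA[el][1] * amt for el, amt in composition.items()) / total
        return [mean_z, mean_en]

    def feature_labels(self):
        """Names of the values returned by featurize, in the same order."""
        return ["mean atomic number", "mean electronegativity"]

    def citations(self):
        """Placeholder for references to the methodology papers."""
        return ["(placeholder) citation for the descriptor definition"]

    def featurize_many(self, compositions):
        """Featurize a list of inputs (serially here; no multiprocessing)."""
        return [self.featurize(c) for c in compositions]


feat = ToyCompositionFeaturizer()
rows = feat.featurize_many([{"Na": 1, "Cl": 1}, {"Fe": 2, "O": 3}])
for labels, row in zip([feat.feature_labels()] * len(rows), rows):
    print(dict(zip(labels, row)))
```

Keeping labels and citations next to the numerical code is what lets a downstream pipeline build a named feature table and report where each descriptor came from.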
The matminer library is available for open use
• Paper: Ward et al. Matminer: An open source toolkit for materials data mining. Computational Materials Science, 152, 60–69 (2018).
• Docs: hackingmaterials.github.io/matminer
• Support: https://groups.google.com/forum/#!forum/matminer
TPOT for AutoML
• TPOT uses genetic algorithms to determine the best ML model and hyperparameters using the training / validation set
– Also some internal feature reduction, scaling, etc. – a full pipeline of operations
• Menu of ML options is all the algorithms implemented in scikit-learn
– i.e., not neural networks
• Parameters include population size and number of generations for genetic algorithm
– Tradeoff between CPU time and performance
– Auto-convergence or early stop possible
Olson, R. S. & Moore, J. H. TPOT: A Tree-based Pipeline Optimization Tool for Automating Machine Learning. In Proceedings of the Workshop on Automatic Machine Learning (eds. Hutter, F., Kotthoff, L. & Vanschoren, J.) 64, 66–74 (PMLR, 2016).
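The role of population size and number of generations can be seen in a toy genetic search. This is not TPOT: the search space (a tree depth and a number of trees), the `validation_error` function, and all helper names are hypothetical stand-ins, and a real run would train and score a model for every candidate, which is where the CPU time goes.

```python
import random


def validation_error(params):
    """Stand-in for a validation score; assumed smooth with one optimum."""
    depth, n_trees = params
    return (depth - 8) ** 2 + ((n_trees - 120) / 20) ** 2


def mutate(params, rng):
    """Nudge each hyperparameter by one step, staying inside the bounds."""
    depth, n_trees = params
    depth = min(20, max(1, depth + rng.choice([-1, 0, 1])))
    n_trees = min(300, max(10, n_trees + rng.choice([-20, 0, 20])))
    return (depth, n_trees)


def crossover(a, b, rng):
    """Build a child by taking each hyperparameter from either parent."""
    return (rng.choice([a[0], b[0]]), rng.choice([a[1], b[1]]))


def genetic_search(population_size=20, generations=15, seed=0):
    """Evolve a population of hyperparameter settings; return the best one."""
    rng = random.Random(seed)
    population = [
        (rng.randint(1, 20), rng.randrange(10, 301, 10))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        population.sort(key=validation_error)
        parents = population[: population_size // 2]  # the better half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return min(population, key=validation_error)


best = genetic_search()
print("best (depth, n_trees):", best, "error:", validation_error(best))
```

The number of candidate evaluations grows roughly as population size times generations, which is the CPU time versus performance tradeoff; because the better half always survives, the best error in this sketch can never get worse from one generation to the next.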
Comparing automatminer against state-of-the-art
• Comparison 1: CGCNN
• Comparison 2: MEGNET
• Comparison 3: Untuned random forest (“no frills”)
– MAGPIE features for composition
– MAGPIE + Sine Coulomb matrix for structure
Xie, T. & Grossman, J. C. Phys. Rev. Lett. 120, 145301 (2018).
Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P. arXiv:1812.05055 (2018).
Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. npj Computational Materials 2, 16028 (2016).
How does data set size affect performance?
For all structure-based regression problems, divide the mean absolute error (MAE) of the model by the mean absolute deviation (MAD) of the data set.
• Always predicting the mean would yield a value of 1.0
[Figure: MAE/MAD (y-axis, 0 to 0.9) versus data set size (x-axis, log scale from 100 to 1,000,000) for automatminer, CGCNN, and MEGNET]
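The MAE/MAD ratio is simple to compute. The sketch below assumes the MAD is taken about the mean of the same target values used for scoring, which is what makes the "always predict the mean" baseline come out at 1.0; the function names and numbers are illustrative only.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mad(y_true):
    """Mean absolute deviation of the targets about their mean."""
    center = mean(y_true)
    return mean(abs(t - center) for t in y_true)


def scaled_error(y_true, y_pred):
    """MAE / MAD: 1.0 matches a mean-only baseline, lower is better."""
    return mae(y_true, y_pred) / mad(y_true)


y_true = [1.0, 2.0, 3.0, 6.0]                 # made-up target values
baseline = [mean(y_true)] * len(y_true)       # always predict the mean
model = [1.5, 2.0, 2.5, 5.0]                  # made-up model predictions

print("baseline MAE/MAD:", scaled_error(y_true, baseline))
print("model MAE/MAD:", round(scaled_error(y_true, model), 3))
```

Dividing by the MAD removes the units and spread of each target property, so errors on different data sets can be placed on one axis.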
Algorithm training time per fold on 8-16 CPU cores
Data set size  Automatminer     CGCNN       MEGNET
~1K            ~1 hour or less  ~few hours  ~few hours
~10K           ~few hours       ~few days   ~few days
~100K          ~12 hours        ~few weeks  ~few weeks
• Automatminer is much faster / easier to train
– One can adjust training time of all algorithms to some extent
– Note that MEGNET is faster than CGCNN but same order of magnitude
• GPUs might greatly accelerate CGCNN / MEGNET training (no timing available)
Getting started with automatminer
• Paper: In preparation …
• Docs: hackingmaterials.github.io/automatminer
• Support: https://groups.google.com/forum/#!forum/matminer
Conclusions
• We proposed a diverse benchmark test suite of problems to develop and test ML algorithms against
• We presented a black-box ML algorithm, Automatminer, that performs comparably to or better than literature values on small data sets (N<10,000), but performs worse on larger data sets
• Further upgrades to automatminer are in progress!
– See if we can do better on N>10,000 problems
– Although crystal networks might alternatively use transfer learning to tackle N<10,000 problems (e.g., MEGNET)
Acknowledgements
• Alex Dunn (Graduate student)
• Qi Wang (Postdoc)
• Alex Ganose (Postdoc)
• Alireza Faghaninia (Postdoc)
• Samy Cherfaoui (Undergraduate)
• Daniel Dopp (Undergraduate)
Funding: U.S. Department of Energy, Basic Energy Sciences