The Status of ML Algorithms
for Structure-property Relationships
Using Matbench as a Test Protocol
Anubhav Jain
Lawrence Berkeley National Laboratory
TMS Spring 2022, March 2022
Slides (already) posted to hackingmaterials.lbl.gov
ML is quickly becoming a standard tool for
materials screening
[Figure: screening funnel; labels: Millions of candidates, Machine learning, High-throughput DFT, Expensive calculation, Experiment]
There are many new algorithms being published
for ML in materials –
New ones constantly reported!
Q: Which one is the “best”
based on the literature?
A: Can’t tell! They’re nearly
all done on different data.
Difficulty of comparing ML algorithms
[Figure: three separate data sets, labelled "Data set used in study A", "Data set used in study B", "Data set used in study C"]
• Different data sets
• Source (e.g., OQMD vs MP)
• Quantity (e.g., MP 2018 vs MP 2019)
• Subset / data filtering (e.g., ehull<X)
• Different evaluation metrics
• Test set vs. cross validation?
• Different test set fraction?
• Often no runnable version of a
published algorithm.
MAE (5-fold CV) = 0.102 eV vs. RMSE (test set) = 0.098 eV: which is better?
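To make the metric mismatch concrete, here is a minimal sketch that scores one set of predictions with both MAE and RMSE. The target and prediction values are invented for illustration and are not taken from any study; the point is only that the two numbers differ even on identical data, so a reported MAE cannot be ranked against a reported RMSE.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more strongly than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical targets and predictions in eV (illustrative values only).
y_true = [0.10, 0.25, 0.40, 0.55, 0.70]
y_pred = [0.12, 0.20, 0.43, 0.50, 1.00]  # the last prediction is a large outlier

print(f"MAE  = {mae(y_true, y_pred):.3f} eV")
print(f"RMSE = {rmse(y_true, y_pred):.3f} eV")
```

A single outlier is enough to pull RMSE well above MAE here, which is one reason the two metrics should not be compared across papers.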
What’s needed – an “ImageNet” for materials
science
https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
What does a standard
data set do for a field?
One of the reasons computer science
/ machine learning seems to advance
so quickly is that they decouple data
generation from algorithm
development
This allows groups to focus on
algorithm development without all
the data generation, data cleaning,
etc. that often is the majority of an
end-to-end data science project
The ingredients of the Matbench benchmark
☐ Standard data sets
☐ Standard test splits according to nested cross-validation procedure
☐ An online leaderboard that encourages reproducible results
How to design good data sets for materials
science?
• There is no single type of problem that materials scientists are trying
to solve
• For now, focus on materials property prediction (from structure or
composition)
• We want a test set that contains a diverse array of problems
• Smaller data versus larger data
• Different applications (electronic, mechanical, etc.)
• Composition-only or structure information available
• Experimental vs. ab initio
• Classification or regression
Matbench includes 13 different ML tasks
Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference
Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
The tasks encompass a variety of problems
The ingredients of the Matbench benchmark
✓ Standard data sets
☐ Standard test splits according to nested cross-validation procedure
☐ An online leaderboard that encourages reproducible results
The most common method:
a single hold-out test set
• Training/validation is used for
model selection
• Test/hold-out is used only for
error estimation (i.e., final
score)
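The hold-out procedure described above can be sketched as follows. This is an illustrative toy, not code from the talk: the synthetic data, the nearest-neighbour model, and the candidate values of `k` are all assumptions chosen so the example runs with the standard library alone.

```python
import random

def knn_predict(train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mae(train, points, k):
    """Mean absolute error of a k-NN model fitted on `train`, evaluated on `points`."""
    return sum(abs(knn_predict(train, x, k) - y) for x, y in points) / len(points)

# Synthetic 1-D regression data (assumption: y = x^2 plus Gaussian noise).
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
data = [(x, x * x + rng.gauss(0, 0.05)) for x in xs]

# One fixed split: training, validation, and a hold-out test set.
train, val, test = data[:60], data[60:80], data[80:]

# Model selection uses only the training and validation sets.
candidates = (1, 3, 5, 9)
best_k = min(candidates, key=lambda k: mae(train, val, k))

# The test set is touched exactly once, to report the final score.
test_mae = mae(train + val, test, best_k)
print(f"selected k = {best_k}, hold-out MAE = {test_mae:.3f}")
```

The reported score depends on which 20 points happened to land in the test set, which is the weakness that varying the hold-out set addresses.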
Nested CV as a standard scoring metric
Nested CV is like hold-out, but varies the hold-out set.
Think of it as k different “universes”: each universe has its own
training + validation of the model and its own hold-out set.
“A nested CV procedure provides an almost unbiased estimate of the true error.”
Varma and Simon, Bias in error estimation when using cross-validation for model
selection (2006)
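A minimal sketch of nested cross-validation, under stated assumptions: the data are synthetic, the model is a one-parameter ridge regression chosen only for brevity, and the fold counts and candidate regularization strengths are illustrative defaults rather than values given in the talk. The inner loop selects the hyperparameter using outer-training data only; the outer loop scores the refitted model on a hold-out fold that played no part in that selection.

```python
import random
from statistics import mean, stdev

def k_folds(n, k, seed):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for i in range(k):
        test = idx[i::k]
        held = set(test)
        yield [j for j in idx if j not in held], test

def fit_ridge(points, lam):
    """1-D ridge regression without intercept: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in points) / (sum(x * x for x, _ in points) + lam)

def mae(w, points):
    return mean(abs(w * x - y) for x, y in points)

def nested_cv(data, lambdas, outer_k=5, inner_k=5):
    outer_scores = []
    for train_idx, test_idx in k_folds(len(data), outer_k, seed=0):
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]

        def inner_score(lam):
            # Inner CV: estimate how well `lam` generalizes, using outer-training data only.
            scores = []
            for tr, va in k_folds(len(train), inner_k, seed=1):
                w = fit_ridge([train[i] for i in tr], lam)
                scores.append(mae(w, [train[i] for i in va]))
            return mean(scores)

        best_lam = min(lambdas, key=inner_score)
        # Refit on all outer-training data; score once on the untouched hold-out fold.
        outer_scores.append(mae(fit_ridge(train, best_lam), test))
    return outer_scores

# Synthetic data (assumption: y = 2x plus Gaussian noise).
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(200)]
data = [(x, 2.0 * x + rng.gauss(0, 0.1)) for x in xs]

scores = nested_cv(data, lambdas=(0.0, 1.0, 10.0, 100.0))
print(f"nested-CV MAE: {mean(scores):.3f} ± {stdev(scores):.3f}")
```

Each entry of `scores` corresponds to one "universe"; reporting their mean and spread gives an error estimate that does not depend on a single lucky or unlucky split.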
The ingredients of the Matbench benchmark
✓ Standard data sets
✓ Standard test splits according to nested cross-validation procedure
☐ An online leaderboard that encourages reproducible results
Matbench Website – now complete!
https://matbench.materialsproject.org
Matbench compares ML algorithms
[Figure: comparison of ML algorithms across Matbench tasks; annotations: "Bigger datasets", "Better relative performance"]
Access to Datasets/ML tasks
• Interactively, via Materials Project: ml.materialsproject.org
• Programmatically via matbench in Python (2 lines; loads all 13 tasks)
• Programmatically via matminer in Python (2 lines)
• Direct download, via matbench.materialsproject.org
Preferred/easiest method!
https://github.com/hackingmaterials/matminer
Programmatic Access and Analysis of Submissions
• Run a benchmark on your own algorithm in ~10 lines of code
• Run on any combination or all of the 13 existing tasks
• If your entry outperforms an existing entry, submit your algorithm in a pull request!
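The shape of such a benchmark run can be sketched as a loop over tasks and folds. The method names below (`load`, `folds`, `get_train_and_val_data`, `get_test_data`, `record`) are meant to mirror the loop shown in the matbench package documentation as best recalled here, so treat them as an assumption and check the package docs before relying on them. `ToyTask` is a hypothetical stand-in written for this example, not the real matbench class, so the sketch runs without installing anything or downloading data.

```python
from statistics import mean

class ToyTask:
    """Hypothetical stand-in for a benchmark task; NOT the real matbench class."""

    def __init__(self, name, data, n_folds=5):
        self.name = name
        self._data = data                  # list of (input, target) pairs
        self.folds = list(range(n_folds))  # fold identifiers
        self.results = {}                  # fold -> MAE of recorded predictions

    def load(self):
        pass  # the toy data is already in memory

    def _split(self, fold):
        n = len(self.folds)
        test = self._data[fold::n]
        train = [row for i, row in enumerate(self._data) if i % n != fold]
        return train, test

    def get_train_and_val_data(self, fold):
        train, _ = self._split(fold)
        return [x for x, _ in train], [y for _, y in train]

    def get_test_data(self, fold, include_target=False):
        _, test = self._split(fold)
        inputs = [x for x, _ in test]
        return (inputs, [y for _, y in test]) if include_target else inputs

    def record(self, fold, predictions):
        _, targets = self.get_test_data(fold, include_target=True)
        self.results[fold] = mean(abs(p - t) for p, t in zip(predictions, targets))

def run_benchmark(tasks, fit_predict):
    """Train and predict on every fold of every task, recording the predictions."""
    for task in tasks:
        task.load()
        for fold in task.folds:
            train_x, train_y = task.get_train_and_val_data(fold)
            test_x = task.get_test_data(fold, include_target=False)
            task.record(fold, fit_predict(train_x, train_y, test_x))

def mean_baseline(train_x, train_y, test_x):
    """Dummy model: always predict the training-set mean."""
    return [mean(train_y)] * len(test_x)

task = ToyTask("toy_regression", [(i, float(i % 7)) for i in range(70)])
run_benchmark([task], mean_baseline)
print({fold: round(score, 3) for fold, score in task.results.items()})
```

The design point is that the algorithm only ever sees training inputs, training targets, and test inputs; the test targets stay with the task object, which keeps the splits and the scoring identical for every submission.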
• Existing notebooks/code and software requirements for reproducing any benchmark, e.g. {'python': [['crabnet==1.2.1', 'scikit_learn==1.0.2', 'matbench==0.5']]}
• Comprehensive raw data on all benchmarks (accessible via the matbench Python package or any JSON-capable language). Publicly available to anyone!
• In-depth performance metrics for individual ML tasks for all submissions, both visually on the website and programmatically
The ingredients of the Matbench benchmark
✓ Standard data sets
✓ Standard test splits according to nested cross-validation procedure
✓ An online leaderboard that encourages reproducible results
What algorithms have been tested on the
matbench data set so far?
• Magpie + sine Coulomb matrix random forest (feature-based random forests)
  • Ward, L., Agrawal, A., Choudhary, A. et al. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2, 16028 (2016). https://doi.org/10.1038/npjcompumats.2016.28
  • Faber, Felix, et al. "Crystal structure representations for machine learning models of formation energies." International Journal of Quantum Chemistry 115.16 (2015): 1094-1101.
• Automatminer (feature-based AutoML)
  • Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138.
• CGCNN (graph neural network)
  • Xie, T.; Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120 (14), 145301.
• MEGNet (graph neural network)
  • Chen, C.; Ye, W.; Zuo, Y.; Zheng, C.; Ong, S. P. Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals. Chemistry of Materials 2019, 31 (9), 3564–3572.
• MODNet (feature-based neural network)
  • De Breuck, P.-P.; Evans, M. L.; Rignanese, G.-M. Robust Model Benchmarking and Bias-Imbalance in Data-Driven Materials Science: A Case Study on MODNet. arXiv:2102.02263 [cond-mat] 2021.
• CrabNet (attention-based composition neural network)
  • Wang, A.; Kauwe, S.; Murdock, R.; Sparks, T. Compositionally-Restricted Attention-Based Network for Materials Property Prediction; ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.11869026.v1.
• ALIGNN (graph neural network with bond angles)
  • Choudhary, Kamal, and Brian DeCost. "Atomistic Line Graph Neural Network for improved materials property predictions." npj Computational Materials 7.1 (2021): 1-8.
Insights from standardized comparisons
• Originally, we found traditional “hand-crafted” feature models generally performed best when the dataset size n < 10^4
• So it seemed materials science data (typically small datasets, esp. experimental) was best modelled by traditional ML/feature methods, e.g. Random Forest
• Clever developments in neural networks have improved GNN models on smaller datasets, in part powered by competition on the Matbench leaderboard
• A standardized platform has enabled easier identification of techniques that work well for certain problems, and those that do not
Insights from standardized comparisons
Errors Predicting Final Phonon DOS Peak Frequencies

Algorithm               Mean MAE (cm^-1)   Mean RMSE (cm^-1)   Maximum max_error (cm^-1)
ALIGNN (2022)           29.5385            53.501              615.3466
MODNet v0.1.10 (2021)   38.7524            78.222              1031.8168
CrabNet (2021)          55.1114            138.3775            1452.7562
AMMExpress (2020)       56.1706            109.7048            1151.557
CGCNN (2019)            57.7635            141.7018            2504.8743

[Figure: Mean Absolute Error (μ_MAE ± σ_MAE) Predicting Final PhDOS Peaks; annotations: Structural GNN (2022), Composition GNN (2021), SoTA early 2020]
Same data, same test; so, why are some algorithms best?
• ALIGNN: Incorporation of bond angle into crystal graph
• Bond angle/local env importance for vibrational properties?
• Matbench enables these sorts of “instant” ablation studies
Insights from standardized comparisons
Errors Predicting Expt. Band Gap

Algorithm           Mean MAE (eV)   Std. MAE (eV)   Mean RMSE (eV)
CrabNet             0.3463          0.0088          0.8504
MODNet (v0.1.10)    0.347           0.0222          0.7437
CrabNet v1.2.1      0.3757          0.0207          0.8805
AMMExpress v2020    0.4161          0.0194          0.9918

[Figure: Mean Absolute Error (μ_MAE ± σ_MAE) Predicting Expt. Band Gap; annotations: Composition GNN, Traditional Features + Encoding/selection, SoTA early 2020]
Same data, same test; so, why are some algorithms best?
• CrabNet: Importance of attention mechanism for
compositional props.; low variability across folds
• MODNet: Normalized Mutual Information feature selection
results in high performance at risk of higher variability across
folds
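The "Mean MAE" and "Std. MAE" columns summarize per-fold scores, and the variability remark above is about the second of those. A small sketch with invented per-fold values (not leaderboard data) shows how two models can share the same mean MAE while differing in fold-to-fold spread. Whether the leaderboard uses the population or the sample standard deviation is not stated in the slides; the population form is assumed here.

```python
from statistics import mean, pstdev

# Hypothetical per-fold MAEs in eV for two made-up models (not leaderboard data).
fold_mae = {
    "model_A": [0.34, 0.35, 0.35, 0.36, 0.35],  # consistent across folds
    "model_B": [0.31, 0.38, 0.33, 0.37, 0.36],  # same mean, larger spread
}

for name, scores in fold_mae.items():
    # Assumption: population standard deviation across the folds.
    print(f"{name}: {mean(scores):.4f} ± {pstdev(scores):.4f} eV")
```

Ranking by mean alone would call these two models a tie; the standard deviation is what separates a consistently good model from one whose score depends on the split.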
Improvements to Materials ML Benchmarks
Two directions: (1) Standardized Uncertainty Quantification and (2) More Datasets + Better Tasks!
• ML-driven materials design is improved by UQ of each prediction
• Enables adaptive design
• Practical: modern models (e.g., MODNet) produce UQ estimates naturally
• Useful: can analyze UQ to tell us how often samples' true values actually fall outside the UQ range
• In progress: coming soon to the matbench package!
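One simple way to check how often true values fall outside the predicted uncertainty range is an empirical coverage calculation. The sketch below is an illustration under assumptions, not the planned matbench feature: it treats each prediction's uncertainty as a Gaussian standard deviation and uses a 95% interval (z = 1.96), and all numbers are invented.

```python
def coverage(y_true, y_pred, y_std, z=1.96):
    """Fraction of true values that fall inside y_pred ± z * y_std."""
    inside = sum(abs(t - p) <= z * s for t, p, s in zip(y_true, y_pred, y_std))
    return inside / len(y_true)

# Hypothetical targets, predictions, and predicted standard deviations.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.5, 4.0, 4.2]
y_std = [0.1, 0.1, 0.1, 0.2, 0.3]

cov = coverage(y_true, y_pred, y_std)
print(f"empirical coverage of the nominal 95% interval: {cov:.0%}")
```

Coverage well below the nominal level, as in this toy case, indicates overconfident uncertainty estimates; coverage far above it indicates intervals that are too wide to guide adaptive design.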
• Impossible to represent the full field of materials
design in a single set of benchmarks
• However… can we come close? Aim to include a wider
variety of properties and sources:
• Expt. load-dependent Vickers hardness
• Expt. superconductor Tc
• Expt. formation enthalpy (ΔH_f) from crystal structure
• Expt. UV-Vis measurements of metal oxides
• Unique, domain-specific procedures for each task
• For example: segregation of CV samples into clusters
based on structure/composition (LOCOCV)
• Evaluation procedures which most closely resemble
real world usage of these algorithms in the most
computationally feasible fashion
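The cluster-based splitting mentioned above (LOCOCV, leave-one-cluster-out cross-validation) can be sketched as a generator that holds out one whole cluster per fold. The cluster labels here are hypothetical; in practice they would come from some upstream clustering of structures or compositions, which this sketch does not attempt.

```python
from collections import defaultdict

def leave_one_cluster_out(cluster_labels):
    """Yield (held_out_label, train_idx, test_idx), holding out one whole cluster at a time."""
    groups = defaultdict(list)
    for i, label in enumerate(cluster_labels):
        groups[label].append(i)
    for label in sorted(groups):
        test = groups[label]
        train = [i for other in sorted(groups) if other != label for i in groups[other]]
        yield label, train, test

# Hypothetical cluster labels, one per sample (e.g. a chemistry-based grouping).
labels = ["oxide", "oxide", "nitride", "sulfide", "nitride", "oxide"]

splits = list(leave_one_cluster_out(labels))
for label, train, test in splits:
    print(f"hold out {label!r}: train={train}, test={test}")
```

Because no member of the held-out cluster appears in training, the resulting score reflects extrapolation to unseen families of materials, which is closer to how screening models are used than a random split.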
Conclusions and future
• As the community increasingly develops new algorithms for machine
learning materials properties, a standard way to test these algorithms
is needed
• Matbench represents such a standard and allows you to test your
algorithms against others
• Matbench also allows us to measure overall progress in the field
• We hope to see you on the leaderboard!
Acknowledgements
Alex Dunn
Lead developer
Qi Wang
Alex Ganose Daniel Dopp
Slides (already) posted to hackingmaterials.lbl.gov

Mais conteúdo relacionado

Mais procurados

Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignAnubhav Jain
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsAnubhav Jain
 
Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...Anubhav Jain
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAnubhav Jain
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Anubhav Jain
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsAnubhav Jain
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...BrianDeCost
 
Materials Informatics and Python
Materials Informatics and PythonMaterials Informatics and Python
Materials Informatics and PythonShintaro Fukushima
 
Atomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discoveryAtomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discoveryAnubhav Jain
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Anubhav Jain
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Anubhav Jain
 
DuraMat Data Management and Analytics
DuraMat Data Management and AnalyticsDuraMat Data Management and Analytics
DuraMat Data Management and AnalyticsAnubhav Jain
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsAnubhav Jain
 
Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Anubhav Jain
 
Overview of DuraMat software tool development (poster version)
Overview of DuraMat software tool development(poster version)Overview of DuraMat software tool development(poster version)
Overview of DuraMat software tool development (poster version)Anubhav Jain
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningAnubhav Jain
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 
Accelerating materials design through natural language processing
Accelerating materials design through natural language processingAccelerating materials design through natural language processing
Accelerating materials design through natural language processingAnubhav Jain
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFIan Foster
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Designaimsnist
 

Mais procurados (20)

Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials Informatics
 
Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
 
Materials Informatics and Python
Materials Informatics and PythonMaterials Informatics and Python
Materials Informatics and Python
 
Atomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discoveryAtomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discovery
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...
 
DuraMat Data Management and Analytics
DuraMat Data Management and AnalyticsDuraMat Data Management and Analytics
DuraMat Data Management and Analytics
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
 
Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...
 
Overview of DuraMat software tool development (poster version)
Overview of DuraMat software tool development(poster version)Overview of DuraMat software tool development(poster version)
Overview of DuraMat software tool development (poster version)
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 
Accelerating materials design through natural language processing
Accelerating materials design through natural language processingAccelerating materials design through natural language processing
Accelerating materials design through natural language processing
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Design
 

Semelhante a The Status of ML Algorithms for Structure-property Relationships Using Matbench as a Test Protocol

Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...Anubhav Jain
 
Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...Anubhav Jain
 
IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstracttsysglobalsolutions
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningKAMAL CHOUDHARY
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Anubhav Jain
 
Transfer defect learning
Transfer defect learningTransfer defect learning
Transfer defect learningSung Kim
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Anubhav Jain
 
A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...IOSR Journals
 
Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?Manuel Martín
 
A Review on Prediction of Compressive Strength and Slump by Using Different M...
A Review on Prediction of Compressive Strength and Slump by Using Different M...A Review on Prediction of Compressive Strength and Slump by Using Different M...
A Review on Prediction of Compressive Strength and Slump by Using Different M...IRJET Journal
 
Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization CS, NcState
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML modelaimsnist
 
Predicting Fault-Prone Files using Machine Learning
Predicting Fault-Prone Files using Machine LearningPredicting Fault-Prone Files using Machine Learning
Predicting Fault-Prone Files using Machine LearningGuido A. Ciollaro
 
Software Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled DatasetsSoftware Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled DatasetsSung Kim
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Alexander Decker
 
2cee Master Cocomo20071
2cee Master Cocomo200712cee Master Cocomo20071
2cee Master Cocomo20071CS, NcState
 
Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsIRJET Journal
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?Anubhav Jain
 
Combinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learningCombinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learning민재 정
 

Semelhante a The Status of ML Algorithms for Structure-property Relationships Using Matbench as a Test Protocol (20)

Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...
 
Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...
 
IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstract
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learning
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Transfer defect learning
Transfer defect learningTransfer defect learning
Transfer defect learning
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...
 
A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...
 
Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?
 
A Review on Prediction of Compressive Strength and Slump by Using Different M...
A Review on Prediction of Compressive Strength and Slump by Using Different M...A Review on Prediction of Compressive Strength and Slump by Using Different M...
A Review on Prediction of Compressive Strength and Slump by Using Different M...
 
Handling Missing Attributes using Matrix Factorization 

The Status of ML Algorithms for Structure-property Relationships Using Matbench as a Test Protocol

  • 1. The Status of ML Algorithms for Structure-property Relationships Using Matbench as a Test Protocol Anubhav Jain Lawrence Berkeley National Laboratory TMS Spring 2022, March 2022 Slides (already) posted to hackingmaterials.lbl.gov
  • 2. ML is quickly becoming a standard tool for materials screening. [Figure labels: Millions of candidates; Machine learning; High-throughput DFT; Expensive calculation; Experiment]
  • 3. There are many new algorithms being published for ML in materials, with new ones constantly reported!
  • 4. There are many new algorithms being published for ML in materials, with new ones constantly reported! Q: Which one is the “best” based on the literature?
  • 5. There are many new algorithms being published for ML in materials, with new ones constantly reported! Q: Which one is the “best” based on the literature? A: Can’t tell! They’re nearly all done on different data.
  • 6. Difficulty of comparing ML algorithms
      • Different data sets (studies A, B, and C each use their own data set)
          • Source (e.g., OQMD vs. MP)
          • Quantity (e.g., MP 2018 vs. MP 2019)
          • Subset / data filtering (e.g., ehull < X)
      • Different evaluation metrics
          • Test set vs. cross-validation?
          • Different test set fraction?
      • Often no runnable version of a published algorithm.
      • Example: how does “MAE, 5-fold CV = 0.102 eV” compare with “RMSE, test set = 0.098 eV”?
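To make the metric mismatch on this slide concrete, here is a minimal sketch that scores one set of predictions with both MAE and RMSE. The band-gap values are invented for illustration and do not come from the talk; the point is only that the two numbers differ even for identical predictions, so they cannot be compared across papers.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error; penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical band gaps (eV) and predictions, invented for illustration only.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.1, 0.9, 2.1, 3.5]

print(f"MAE  = {mae(y_true, y_pred):.3f} eV")
print(f"RMSE = {rmse(y_true, y_pred):.3f} eV")
```

Because RMSE is never smaller than MAE on the same predictions, a paper reporting RMSE can look worse than one reporting MAE even when the underlying model is identical.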
  • 7. What’s needed: an “ImageNet” for materials science. https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
  • 8. What does a standard data set do for a field? One of the reasons computer science / machine learning seems to advance so quickly is that the field decouples data generation from algorithm development. This allows groups to focus on algorithm development without all the data generation, data cleaning, etc. that often make up the majority of an end-to-end data science project.
  • 9. The ingredients of the Matbench benchmark
      ☐ Standard data sets
      ☐ Standard test splits according to a nested cross-validation procedure
      ☐ An online leaderboard that encourages reproducible results
  • 10. How to design good data sets for materials science?
      • There is no single type of problem that materials scientists are trying to solve
      • For now, focus on materials property prediction (from structure or composition)
      • We want a test set that contains a diverse array of problems
          • Smaller data versus larger data
          • Different applications (electronic, mechanical, etc.)
          • Composition-only or structure information available
          • Experimental vs. ab initio
          • Classification or regression
  • 11. Matbench includes 13 different ML tasks. Source: Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
  • 12. The tasks encompass a variety of problems. Source: Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
  • 13. The ingredients of the Matbench benchmark
      ✓ Standard data sets
      ☐ Standard test splits according to a nested cross-validation procedure
      ☐ An online leaderboard that encourages reproducible results
  • 14. The most common method: a single hold-out test set
      • Training/validation data is used for model selection
      • Test/hold-out data is used only for error estimation (i.e., the final score)
  • 15. Nested CV as a standard scoring metric. Nested CV is like hold-out, but varies the hold-out set. Think of it as k different “universes”: in each universe we have a different training + validation of the model and a different hold-out set.
  • 16. Nested CV as a standard scoring metric. Nested CV is like hold-out, but varies the hold-out set. Think of it as k different “universes”: in each universe we have a different training + validation of the model and a different hold-out set. “A nested CV procedure provides an almost unbiased estimate of the true error.” (Varma and Simon, Bias in error estimation when using cross-validation for model selection, 2006)
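The “k universes” picture can be sketched in a few lines of plain Python. This is an illustrative toy, not Matbench's implementation: the model (ridge regression through the origin), the candidate regularization strengths, the fold counts, and the synthetic data are all assumptions chosen to keep the example self-contained. Matbench itself ships fixed splits so that every submission sees the same folds.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit(x, y, lam):
    """Toy model: ridge regression through the origin, w = sum(xy) / (sum(xx) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def mae(w, x, y):
    return sum(abs(w * a - b) for a, b in zip(x, y)) / len(x)


def take(values, indices):
    return [values[i] for i in indices]


def nested_cv(x, y, lams, outer_k=5, inner_k=4):
    """Return one hold-out MAE per outer fold ("universe")."""
    outer_scores = []
    folds = k_fold_indices(len(x), outer_k)
    for held_out in folds:
        train = [i for f in folds if f is not held_out for i in f]
        x_tr, y_tr = take(x, train), take(y, train)
        inner = k_fold_indices(len(train), inner_k, seed=1)

        def inner_score(lam):
            # Inner loop: model selection sees only training + validation data.
            scores = []
            for val in inner:
                fit_idx = [i for f in inner if f is not val for i in f]
                w = fit(take(x_tr, fit_idx), take(y_tr, fit_idx), lam)
                scores.append(mae(w, take(x_tr, val), take(y_tr, val)))
            return sum(scores) / len(scores)

        best_lam = min(lams, key=inner_score)
        # Refit on all training data, then score once on the untouched hold-out fold.
        w = fit(x_tr, y_tr, best_lam)
        outer_scores.append(mae(w, take(x, held_out), take(y, held_out)))
    return outer_scores


# Synthetic data, invented for illustration: y = 2x + Gaussian noise.
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(50)]
y = [2 * a + rng.gauss(0, 0.5) for a in x]

scores = nested_cv(x, y, lams=[0.0, 1.0, 10.0])
print("per-fold hold-out MAE:", [round(s, 3) for s in scores])
print("mean MAE:", round(sum(scores) / len(scores), 3))
```

The hold-out fold never influences the choice of `lam`, which is what keeps the reported error close to unbiased.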
  • 17. The ingredients of the Matbench benchmark
      ✓ Standard data sets
      ✓ Standard test splits according to a nested cross-validation procedure
      ☐ An online leaderboard that encourages reproducible results
  • 18. Matbench Website – now complete! https://matbench.materialsproject.org
  • 19. Matbench compares ML algorithms. [Figure annotations: Bigger datasets; Better relative performance]
  • 20. Access to datasets / ML tasks
      • Interactively, via Materials Project: ml.materialsproject.org
      • Programmatically via matbench in Python (2 lines; loads all 13 tasks)
      • Programmatically via matminer in Python (2 lines): https://github.com/hackingmaterials/matminer
      • Direct download, via matbench.materialsproject.org
      • Callout: Preferred/easiest method!
  • 21. Programmatic access and analysis of submissions
      • Run a benchmark on your own algorithm in ~10 lines of code
      • Run on any combination or all of the 13 existing tasks
      • If your entry outperforms an existing entry, submit your algorithm in a pull request!
      • Existing notebooks/code and software requirements for reproducing any benchmark, e.g. {'python': [['crabnet==1.2.1', 'scikit_learn==1.0.2', 'matbench==0.5']]}
      • Comprehensive raw data on all benchmarks (accessible via the matbench Python package or any JSON-capable language), publicly available to anyone!
      • In-depth performance metrics for individual ML tasks for all submissions, both visually on the website and programmatically
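The “~10 lines of code” workflow might look roughly like the sketch below. The matbench calls (`MatbenchBenchmark`, `task.load`, `task.folds`, `get_train_and_val_data`, `get_test_data`, `record`, `to_file`) are written from recollection of the package's documented usage and should be checked against the installed version. The `model` argument, with its `fit` and `predict` methods, is a hypothetical placeholder and not part of matbench, and the default task name and output file name are only examples.

```python
def run_matbench(model, subset=("matbench_expt_gap",), out_file="my_benchmark.json.gz"):
    """Run `model` on the chosen Matbench tasks using the package's fixed folds.

    `model` is a hypothetical object exposing fit(inputs, outputs) and
    predict(inputs); it stands in for your own algorithm.
    """
    # Imported lazily so this sketch can be loaded without matbench installed.
    from matbench.bench import MatbenchBenchmark

    mb = MatbenchBenchmark(autoload=False, subset=list(subset))
    for task in mb.tasks:
        task.load()  # fetches the dataset for this task
        for fold in task.folds:
            # Training + validation data for this outer fold.
            train_inputs, train_outputs = task.get_train_and_val_data(fold)
            model.fit(train_inputs, train_outputs)
            # Hold-out inputs only; targets stay hidden from the model.
            test_inputs = task.get_test_data(fold, include_target=False)
            task.record(fold, model.predict(test_inputs))
    mb.to_file(out_file)  # results file that accompanies a leaderboard pull request
    return mb
```

Calling `run_matbench(my_model)` requires matbench to be installed and downloads the selected dataset on first use.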
  • 22. The ingredients of the Matbench benchmark
      ✓ Standard data sets
      ✓ Standard test splits according to a nested cross-validation procedure
      ✓ An online leaderboard that encourages reproducible results
  • 23. What algorithms have been tested on the Matbench data set so far?
      • Magpie + sine Coulomb matrix random forest (feature-based random forests)
          • Ward, L., Agrawal, A., Choudhary, A. et al. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2, 16028 (2016). https://doi.org/10.1038/npjcompumats.2016.28
          • Faber, Felix, et al. "Crystal structure representations for machine learning models of formation energies." International Journal of Quantum Chemistry 115.16 (2015): 1094-1101.
      • Automatminer (feature-based AutoML)
          • Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138.
      • CGCNN (graph neural network)
          • Xie, T.; Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120 (14), 145301.
      • MEGNet (graph neural network)
          • Chen, C.; Ye, W.; Zuo, Y.; Zheng, C.; Ong, S. P. Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals. Chemistry of Materials 2019, 31 (9), 3564–3572.
      • MODNet (feature-based neural network)
          • De Breuck, P.-P.; Evans, M. L.; Rignanese, G.-M. Robust Model Benchmarking and Bias-Imbalance in Data-Driven Materials Science: A Case Study on MODNet. arXiv:2102.02263 [cond-mat] 2021.
      • CrabNet (attention-based composition neural network)
          • Wang, A.; Kauwe, S.; Murdock, R.; Sparks, T. Compositionally-Restricted Attention-Based Network for Materials Property Prediction; ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.11869026.v1.
      • ALIGNN (graph neural network with bond angles)
          • Choudhary, Kamal, and Brian DeCost. "Atomistic Line Graph Neural Network for improved materials property predictions." npj Computational Materials 7.1 (2021): 1-8.
  • 24. Insights from standardized comparisons
      • Originally, we found that traditional “hand-crafted” feature models generally performed best when the dataset size n < 10^4
      • So it seemed that materials science data (typically small datasets, especially experimental ones) was best modelled by traditional ML/feature methods, e.g. random forest
      • Clever developments in neural networks have improved GNN models on smaller datasets, in part powered by competition on the Matbench leaderboard
      • A standardized platform has made it easier to identify techniques that work well for certain problems, and those that do not
  • 25. Insights from standardized comparisons
      Errors predicting final phonon DOS peak frequencies (figure: mean absolute error, μ_MAE ± σ_MAE):
      Algorithm                Mean MAE (cm⁻¹)   Mean RMSE (cm⁻¹)   Maximum max_error (cm⁻¹)
      ALIGNN (2022)            29.5385           53.501             615.3466
      MODNet v0.1.10 (2021)    38.7524           78.222             1031.8168
      CrabNet (2021)           55.1114           138.3775           1452.7562
      AMMExpress (2020)        56.1706           109.7048           1151.557
      CGCNN (2019)             57.7635           141.7018           2504.8743
      Figure annotations: Structural GNN (2022); Composition GNN (2021); SoTA early 2020
      Same data, same test; so, why are some algorithms best?
      • ALIGNN: incorporation of bond angles into the crystal graph
      • Bond angle / local environment importance for vibrational properties?
      • Matbench enables these sorts of “instant” ablation studies
  • 26. Insights from standardized comparisons
      Errors predicting expt. E_gap (figure: mean absolute error, μ_MAE ± σ_MAE):
      Algorithm            Mean MAE (eV)   Std. MAE (eV)   Mean RMSE (eV)
      CrabNet              0.3463          0.0088          0.8504
      MODNet (v0.1.10)     0.347           0.0222          0.7437
      CrabNet v1.2.1       0.3757          0.0207          0.8805
      AMMExpress v2020     0.4161          0.0194          0.9918
      Figure annotations: Composition GNN; Traditional features + encoding/selection; SoTA early 2020
      Same data, same test; so, why are some algorithms best?
      • CrabNet: importance of the attention mechanism for compositional properties; low variability across folds
      • MODNet: normalized mutual information feature selection results in high performance, at the risk of higher variability across folds
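The leaderboard columns above (mean MAE, std. MAE, mean RMSE, maximum max_error) are aggregates over the hold-out folds. A minimal sketch of that aggregation follows; the fold data are invented, and the use of the population standard deviation is an assumption rather than a statement about how Matbench computes its "Std." column.

```python
import math
import statistics


def fold_metrics(y_true, y_pred):
    """Error metrics for a single hold-out fold."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return {
        "mae": sum(errors) / len(errors),
        "rmse": math.sqrt(sum(e * e for e in errors) / len(errors)),
        "max_error": max(errors),
    }


def summarize(folds):
    """Aggregate per-fold metrics into leaderboard-style columns."""
    per_fold = [fold_metrics(t, p) for t, p in folds]
    maes = [m["mae"] for m in per_fold]
    return {
        "mean_mae": statistics.mean(maes),
        "std_mae": statistics.pstdev(maes),  # population std: an assumption
        "mean_rmse": statistics.mean(m["rmse"] for m in per_fold),
        "max_max_error": max(m["max_error"] for m in per_fold),
    }


# Two hypothetical folds of (true values, predictions), invented for illustration.
folds = [
    ([1.0, 2.0], [1.5, 2.5]),
    ([0.0, 4.0], [0.0, 2.0]),
]
summary = summarize(folds)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A low std. MAE, as reported for CrabNet above, means the per-fold MAEs are tightly clustered; a higher value signals more fold-to-fold variability.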
  • 27. Improvements to materials ML benchmarks
      Standardized uncertainty quantification (UQ)
      • ML-driven materials design is improved by UQ of each prediction
      • Enables adaptive design
      • Practical: modern models (e.g., MODNet) produce UQ estimates naturally
      • Useful: we can analyze UQ to tell us how often samples’ true values actually fall outside the UQ range
      • In progress: coming soon to the matbench package!
      More datasets + better tasks!
      • It is impossible to represent the full field of materials design in a single set of benchmarks
      • However, can we come close? Aim to include a wider variety of properties and sources:
          • Expt. load-dependent Vickers hardness
          • Expt. superconductor Tc
          • Expt. ΔH_f from crystal structure
          • Expt. UV-Vis measurements of metal oxides
      • Unique, domain-specific procedures for each task
          • For example: segregation of CV samples into clusters based on structure/composition (LOCO-CV)
          • Evaluation procedures that most closely resemble real-world usage of these algorithms, in the most computationally feasible fashion
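The “how often do true values fall outside the UQ range” check can be sketched as an empirical coverage calculation. This is only an illustration under stated assumptions: each prediction is taken to come with a Gaussian standard deviation, the interval is the nominal 95% band (z = 1.96), and the numbers are invented. It is not the UQ metric that the matbench package adopted.

```python
def interval_coverage(y_true, y_pred, y_std, z=1.96):
    """Fraction of true values inside the interval pred ± z * std."""
    inside = sum(abs(t - p) <= z * s for t, p, s in zip(y_true, y_pred, y_std))
    return inside / len(y_true)


# Hypothetical targets, predictions, and predicted standard deviations.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 2.5, 2.9, 6.0]
y_std = [0.1, 0.2, 0.1, 0.5]

coverage = interval_coverage(y_true, y_pred, y_std)
print(f"empirical coverage: {coverage:.2f} (nominal 0.95)")
```

Coverage well below the nominal level, as in this toy data, indicates overconfident uncertainty estimates; coverage well above it indicates intervals that are wider than necessary.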
  • 28. Conclusions and future
      • As the community increasingly develops new algorithms for machine learning of materials properties, a standard way to test these algorithms is needed
      • Matbench represents such a standard and allows you to test your algorithms against others
      • Matbench also allows us to measure overall progress in the field
      • We hope to see you on the leaderboard!
  • 29. Acknowledgements: Alex Dunn (lead developer), Qi Wang, Alex Ganose, Daniel Dopp. Slides (already) posted to hackingmaterials.lbl.gov