Automated Machine Learning Applied to Diverse Materials Design Problems

Berkeley Lab
Apr 24, 2019

  1. Automated Machine Learning Applied to Diverse Materials Design Problems Anubhav Jain Energy Technologies Area Lawrence Berkeley National Laboratory Berkeley, CA MRS Spring Meeting, 2019 Slides (already) posted to hackingmaterials.lbl.gov
  2. There are many algorithms developed for machine learning in materials – new ones are constantly reported!
  3. Q: Which one is the “best” based on all the literature reports?
  4. Q: Which one is the “best” based on all the literature reports? A: Can’t tell! They are (almost?) all tested on different data sets.
  5. Difficulty of comparing different ML algorithms
• Different data sets
  – Source (e.g., OQMD vs MP)
  – Quantity (e.g., MP 2018 vs MP 2019)
  – Subset / data filtering (e.g., ehull < X)
• Different cross-validation metrics
  – e.g., what fraction is the test set?
• Often, this can’t be helped
  – Usually can’t access the training / test data of past works
  – Sometimes no runnable version of a published algorithm – should referees be tougher on this?
(figure: the data sets used in studies A, B, and C overlap only partially)
  6. Outline
• Matbench: a standard test method for materials science problems
  – A set of diverse materials data sets for testing
  – A consistent cross-validation strategy
• Automatminer: A “black box” materials science ML algorithm
  – Materials-specific descriptors using matminer
  – AutoML to tune hyperparameters
  8. A standard test method for ML algorithms in materials
• We want a test set that contains a diverse array of problems
  – Smaller data versus larger data
  – Different applications (electronic, mechanical, etc.)
  – Composition-only or structure information available
  – Classification or regression
• We also want a cross-validation metric that gives reliable error estimates
  – i.e., less dependent on the specific choice of splits
  9. Overview of Matbench test set
Target Property | Data Source | Samples | Method
Bulk Modulus | Materials Project | 10,987 | DFT-GGA
Shear Modulus | Materials Project | 10,987 | DFT-GGA
Band Gap | Materials Project | 106,113 | DFT-GGA
Metallicity | Materials Project | 106,113 | DFT-GGA
Band Gap | Zhuo et al. [1] | 6,354 | Experiment
Metallicity | Zhuo et al. [1] | 6,354 | Experiment
Bulk Metallic Glass formation | Landolt-Börnstein | 7,190 | Experiment
Refractive index | Materials Project | 4,764 | DFPT-GGA
Formation Energy | Materials Project | 132,752 | DFT-GGA
Perovskite Formation Energy | Castelli et al. [2] | 18,928 | DFT-GGA
Freq. at Last Phonon PhDOS Peak | Materials Project | 1,296 | DFPT-GGA
Exfoliation Energy | JARVIS-2D | 636 | DFT-vDW-DF
Steel yield strength | Citrine Informatics | 312 | Experiment
1. doi.org/10.1021/acs.jpclett.8b00124
2. doi.org/10.1039/C2EE22341D
  10. Diversity of benchmark suite
(figure: the data sets span applications – mechanical, electronic, stability, optical, thermal; problem types – classification and regression; data types – experiment (composition only) and DFT (structure); and data sizes from <1K to >100K)
  12. Most commonly used test procedure
• Training / validation is used for model selection
• Test / hold-out is used only for error estimation
(The test set should not inform model selection, i.e. it is the “final answer”)
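The hold-out procedure above can be sketched in a few lines of plain Python; the `train_val_test_split` helper below is hypothetical (not from any library), with illustrative split fractions:

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then carve off a test set (error estimation only)
    and a validation set (model selection); the rest is training data."""
    data = data[:]
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # → 70 15 15
```

The key discipline is that `test` is touched exactly once, after all model choices are frozen.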
  13. Nested CV – like hold-out, but varies the hold-out set
Think of it as N different “universes” – we have a different training of the model in each universe and a different hold-out.
  14. Nested CV – like hold-out, but varies the hold-out set
“A nested CV procedure provides an almost unbiased estimate of the true error.”
– Varma and Simon, Bias in error estimation when using cross-validation for model selection (2006)
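The nested-CV idea can be sketched in plain Python; a toy 1-D k-nearest-neighbour regressor stands in for the real model, and all names and data here are illustrative (not from Matbench):

```python
import random
from statistics import mean

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, k):
    """Toy 1-D k-nearest-neighbour regression; train is a list of (x, y)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)

def cv_error(data, folds, k):
    """Mean absolute error of k-NN across the given CV folds."""
    errs = []
    for i in range(len(folds)):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in range(len(folds)) if f != i for j in folds[f]]
        errs += [abs(knn_predict(train, x, k) - y) for x, y in test]
    return mean(errs)

def nested_cv(data, k_outer=5, k_inner=5, k_grid=(1, 3, 5)):
    """Outer loop estimates error; inner loop picks the hyperparameter k.
    Each outer fold is a different 'universe' with its own hold-out set."""
    outer = kfold_indices(len(data), k_outer)
    errs = []
    for i in range(k_outer):
        test = [data[j] for j in outer[i]]
        train = [data[j] for f in range(k_outer) if f != i for j in outer[f]]
        inner = kfold_indices(len(train), k_inner, seed=i)
        best_k = min(k_grid, key=lambda k: cv_error(train, inner, k))
        errs += [abs(knn_predict(train, x, best_k) - y) for x, y in test]
    return mean(errs)

data = [(x / 10, (x / 10) ** 2) for x in range(60)]
print(round(nested_cv(data), 3))
```

Because model selection (the inner loop) never sees the outer test fold, the averaged outer error is the nearly unbiased estimate the quote refers to.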
  15. Summary of Matbench
• Matbench is a curated set of data sets that provide a diverse set of problems representative of those found in materials science
• ML developers can work on a consistent set of test problems
• Ideally – consistent reports of error in the literature!
• Matbench v1 will be released soon …
  – Let us know if you have feedback / comments / suggestions!
  17. Typically several steps of machine learning are performed by a human researcher – can these be automated?
(diagram: descriptors developed and chosen by a researcher → ML model developed and chosen by a researcher)
Why can’t we just give the computer some raw input data (compositions, crystal structures) and output properties and get back an ML model?
  18. Automatminer is a “black box” machine learning model
Give it any data set with either composition or structure inputs, and automatminer will train an ML model (no researcher intervention)
  19. Automatminer develops an ML model automatically given raw data (structures or compositions plus output properties)
• Featurization (matminer): MagPie, SOAP, Sine Coulomb Matrix + many, many more
• Data cleaning: dropping features with many errors, missing-value imputation, one-hot encoding
• Feature reduction: PCA-based, correlation, model-based (tree)
• AutoML: uses genetic algorithms to find the best machine learning model + hyperparameters
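The data-cleaning steps named on this slide (dropping error-prone features, imputing missing values) can be sketched as follows; this is an illustrative stand-in in plain Python, not automatminer's actual code:

```python
from statistics import mean

def clean_features(rows, max_error_frac=0.2):
    """rows: list of dicts mapping feature name -> float or None (None marks
    a featurizer error). Drops features that failed on too many samples,
    then fills remaining gaps with the column mean."""
    names = sorted({k for r in rows for k in r})
    # Drop features with many errors
    kept = [n for n in names
            if sum(r.get(n) is None for r in rows) / len(rows) <= max_error_frac]
    # Missing-value imputation with the per-column mean
    means = {n: mean(r[n] for r in rows if r.get(n) is not None) for n in kept}
    return [[r[n] if r.get(n) is not None else means[n] for n in kept]
            for r in rows]

rows = [{"a": 1.0, "b": None}, {"a": 3.0, "b": 2.0}, {"a": None, "b": 4.0}]
print(clean_features(rows, max_error_frac=0.5))
# → [[1.0, 3.0], [3.0, 2.0], [2.0, 4.0]]
```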
  21. Matminer contains a library of descriptors for various materials science entities
>60 featurizer classes can generate thousands of potential descriptors that are described in the literature
feat = EwaldEnergy([options])
y = feat.featurize([input_data])
• compatible with scikit-learn pipelining
• automatically deploys multiprocessing to parallelize over data
• includes citations to methodology papers
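As an illustration of what a composition featurizer computes, here is a toy MagPie-style descriptor in plain Python; the two-property element table is a made-up stand-in for matminer's full elemental-property sets:

```python
# Hypothetical per-element property table: (atomic mass, Pauling
# electronegativity). Matminer's MagPie preset uses many more properties.
ELEM_PROPS = {
    "Fe": (55.85, 1.83),
    "O":  (16.00, 3.44),
    "Si": (28.09, 1.90),
}

def featurize_composition(comp):
    """Turn {element: atomic fraction} into statistics over elemental
    properties, in the spirit of MagPie-style composition descriptors."""
    n_props = len(next(iter(ELEM_PROPS.values())))
    feats = []
    for p in range(n_props):
        vals = {el: ELEM_PROPS[el][p] for el in comp}
        weighted_mean = sum(frac * vals[el] for el, frac in comp.items())
        feats += [weighted_mean, min(vals.values()), max(vals.values())]
    return feats

# Fe2O3 -> atomic fractions 0.4 Fe, 0.6 O
print(featurize_composition({"Fe": 0.4, "O": 0.6}))
```

Each elemental property contributes several statistics (here: fraction-weighted mean, min, max), which is how a handful of property tables expands into hundreds of candidate descriptors.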
  22. The matminer library is available for open use
Paper: Ward et al. Matminer: An open source toolkit for materials data mining. Computational Materials Science 152, 60–69 (2018).
Docs: hackingmaterials.github.io/matminer
Support: https://groups.google.com/forum/#!forum/matminer
  24. TPOT for AutoML
• TPOT uses genetic algorithms to determine the best ML model and hyperparameters using the training / validation set
  – Also some internal feature reduction, scaling, etc. – a full pipeline of operations
• Menu of ML options is all the algorithms implemented in scikit-learn
  – i.e., not neural networks
• Parameters include population size and number of generations for the genetic algorithm
  – Tradeoff between CPU time and performance
  – Auto-convergence or early stopping possible
Olson, R. S. & Moore, J. H. TPOT: A Tree-based Pipeline Optimization Tool for Automating Machine Learning. in Proceedings of the Workshop on Automatic Machine Learning (eds. Hutter, F., Kotthoff, L. & Vanschoren, J.) 64, 66–74 (PMLR, 2016).
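The genetic-algorithm loop underlying this kind of AutoML can be sketched with a toy objective; the two "hyperparameters" and the error surface below are invented for illustration (TPOT's real search space is full scikit-learn pipelines, but the selection/variation loop has the same shape):

```python
import random

rng = random.Random(0)

def val_error(params):
    """Toy stand-in for validation error as a function of a learning rate
    and a tree depth; the optimum is at lr = 0.1, depth = 6."""
    lr, depth = params
    return (lr - 0.1) ** 2 + (depth - 6) ** 2 / 100

def mutate(params):
    """Perturb one candidate: jitter the learning rate, nudge the depth."""
    lr, depth = params
    return (max(1e-4, lr + rng.gauss(0, 0.02)),
            max(1, depth + rng.choice([-1, 0, 1])))

def evolve(pop_size=20, generations=15):
    pop = [(rng.uniform(0, 1), rng.randint(1, 12)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=val_error)                     # rank by fitness
        survivors = pop[: pop_size // 2]            # selection
        children = [mutate(rng.choice(survivors)) for _ in survivors]
        pop = survivors + children                  # next generation
    return min(pop, key=val_error)

best = evolve()
print(best)
```

Population size and generation count trade CPU time for search quality, which is exactly the tradeoff the slide mentions.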
  25. Comparing automatminer against state-of-the-art
• Comparison 1: CGCNN
• Comparison 2: MEGNET
• Comparison 3: Untuned random forest (“no frills”)
  – MAGPIE features for composition
  – MAGPIE + Sine Coulomb matrix for structure
Xie, T. & Grossman, J. C. Phys. Rev. Lett. 120, 145301 (2018).
Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P. arXiv:1812.05055 (2018).
Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. npj Computational Materials 2, 16028 (2016).
  26. Matbench results for all algorithms
  27. How does data set size affect performance?
For all structure-based regression problems, divide the mean absolute error (MAE) of the model by the mean absolute deviation (MAD) of the data set.
• Always predicting the mean would yield a value of 1.0
(plot: MAE/MAD vs. data set size, 100 to 1,000,000 samples, for automatminer, CGCNN, and MEGNET)
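The MAE/MAD normalization is straightforward to compute; by construction, the trivial always-predict-the-mean model scores exactly 1.0:

```python
from statistics import mean

def mae(y_true, y_pred):
    """Mean absolute error between predictions and targets."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def mad(y_true):
    """Mean absolute deviation of the targets about their mean."""
    m = mean(y_true)
    return mean(abs(t - m) for t in y_true)

y = [1.0, 2.0, 3.0, 4.0]
pred_mean = [mean(y)] * len(y)      # the trivial "always predict the mean" model
print(mae(y, pred_mean) / mad(y))   # → 1.0
```

Scores well below 1.0 mean the model has learned structure beyond the data set's spread, which makes the metric comparable across properties with different units.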
  28. Algorithm training time per fold on 8-16 CPU cores
• Automatminer is much faster / easier to train
  – One can adjust the training time of all algorithms to some extent
  – Note that MEGNET is faster than CGCNN but the same order of magnitude
• GPUs might greatly accelerate CGCNN / MEGNET training (no timing available)
Data set size | Automatminer | CGCNN | MEGNET
~1K | ~1 hour or less | ~few hours | ~few hours
~10K | ~few hours | ~few days | ~few days
~100K | ~12 hours | ~few weeks | ~few weeks
  29. Getting started with automatminer
Paper: in preparation …
Docs: hackingmaterials.github.io/automatminer
Support: https://groups.google.com/forum/#!forum/matminer
  30. Conclusions
• We proposed a diverse benchmark test suite of problems to develop and test ML algorithms against
• We presented a black-box ML algorithm, Automatminer, that performs comparably to or outperforms literature values on small data sets (N < 10,000) but performs worse on larger data sets
• Further upgrades to automatminer are in progress!
  – See if we can do better on N > 10,000 problems
  – Although crystal graph networks might alternately use transfer learning to tackle N < 10,000 problems (e.g., MEGNET)
  31. Acknowledgements
Alex Dunn (Graduate student), Qi Wang (Postdoc), Alex Ganose (Postdoc), Alireza Faghaninia (Postdoc), Samy Cherfaoui (Undergraduate), Daniel Dopp (Undergraduate)
Funding: U.S. Department of Energy, Basic Energy Sciences
Slides (already) posted to hackingmaterials.lbl.gov