Accelerating New Materials Design with Supercomputing and Machine Learning

  1. Accelerating New Materials Design with Supercomputing and Machine Learning. Anubhav Jain, Lawrence Berkeley National Laboratory. Alvarez seminar series. Slides (already) posted to
  2. Outline
     Sections:
     • What I did BEFORE the Alvarez fellowship
     • What I did DURING the Alvarez fellowship
     • What I've been doing AFTER the Alvarez fellowship
     • What’s next?
     Topics:
     • High school research program
     • SULI internships
     • PhD research
     • High-throughput workflows for materials science
     • Open source software
     • Launching the Materials Project
     • Growing the Materials Project
     • Screening for functional materials
     • Applying AI / ML to materials science
     • Integrating NLP
     • Future directions and other projects
  3. I began my research career with a high school internship program at Brookhaven National Laboratory
     • The project was supposed to be about how to build large rectifier circuits to ensure a stable power supply to superconducting magnets
     • But mostly, we just goofed around
     Photo captions: spending half a day getting a pen cap out of a colleague’s cast; fun with the Plan9 operating system and circuit diagrams
  4. In college, I did an initial SULI internship on functional MRI (fMRI) research
     • We wanted to see if all fMRI signals were real or if some signals were spurious
     • I also did a lot of manual data cleaning, and I always wondered why they didn’t just write a program to do this kind of grunt work (but was too shy to ask)
     • Sometimes I actually ran live scans on subjects with little assistance, something that would definitely not pass safety checks today … (and maybe didn’t at the time either)
     Figure caption: increasing rotation leads to more spurious signals
  5. I then did another SULI internship, this time staying engaged for a couple of years and publishing!
     • The goal was to have a robotic system for performing protein crystallography, allowing experiments to run autonomously / overnight and at higher rates
     • My portion of the robot system was to automatically align protein crystals in an x-ray beam using computer vision
     • Lots of Java programming and traditional computer vision
  6. Today, I’ve advised a lot of SULI interns myself, many of whom are now PhD students pushing forward their own research boundaries!
  7. For my PhD application, I applied (and was admitted) to do biology / materials science research
     • I was targeting drug delivery materials
     • But I didn’t get a position in any labs related to biomaterials
     • So I ended up taking a position with my thermodynamics teacher (G. Ceder), who seemed smart and also played with dinosaur figurines in class
  8. In the end, I focused on automating calculations for materials screening for my PhD
     • Traditionally, theoretical calculations on materials were performed one at a time, mostly manually
     • We built a complex system to automate these calculations and used it to screen Li-ion battery cathode materials
     • We also began a process of putting the data online …
     Equations: $i\hbar \, \frac{d\Psi(\{r_i\}; t)}{dt} = \hat{H} \, \Psi(\{r_i\}; t)$, with $H = \sum_{i=1}^{N_e} \nabla_i^2 + \sum_{i=1}^{N_e} V_{\mathrm{nuclear}}(r_i) + \sum_{i=1}^{N_e} V_{\mathrm{effective}}(r_i)$
     Calculated outputs: total energy, optimized structure, magnetic ground state, charge density, band structure / DOS
     Reference: Setyawan, Curtarolo, Comp. Mat. Sci. (2010)
     Infrastructure: 384 Xeon cores, 134,000 lines of code, 50 core tables
     Example screening result:
     • Chemistry: Li9V3(P2O7)3(PO4)2
     • Novelty: New
     • Energy density vs. LiFePO4: 20% greater
     • % of theoretical capacity already achieved in the lab: ~65%
     • Origin: V to Fe substitution in Li9Fe3(P2O7)3(PO4)2*
     • Remarks: structure has “layers” and “tunnels”; pyrophosphate-phosphate mixture; potential 2-electron material
  9. My thesis defense talk would hint at my Alvarez postdoc work …
  11. Yet another workflow software …
      • By 2011, the computing infrastructure we used at MIT for battery screening was showing a lot of wear and tear / barnacles. It was also not suitable for running on LBNL’s supercomputers
      • We essentially rebuilt 5 years of work from scratch
      • Part of this was creating new workflow software that also merged with LBNL work called “rockets” for launching jobs. We called the new software “FireWorks” for naming harmony, but it is a name I regret now …
      • This was most of what I did during the Alvarez fellowship
      • Supported by the infinite patience of Kristin Persson, who co-advised my Alvarez fellowship and provided supplemental funding, in addition to D. Bailey, who served as my CRD host
      • Note that in the end, things were better after scrapping almost all of “rockets” and designing FireWorks from scratch
      Note: we did think a lot about whether to use an existing workflow package, but none met our needs for (i) ease of use (could be operated by scientists), (ii) good documentation, and (iii) compatibility with error-prone and dynamic high-throughput workflows
  12. I spent a lot of time developing FireWorks and associated infrastructure for high-throughput computing
      • We did several things that were new at the time
      • Based everything off MongoDB
      • Really planned for job failures and reruns
      • Took into account that duplicate steps of workflows may be submitted, but should run only once
      • Allowed jobs to modify their own workflow graph or create new workflows
      • Spent a lot of time on documentation and support
      • FireWorks continues to be an active project and is now largely community supported
      Caption: I spent a lot of time programming …
      Related projects:
      • Rocketsled: Use FireWorks to perform virtual active learning, even when simulations are expensive and require supercomputers
      • Borealis: Run FireWorks in the cloud via GCP (externally developed and maintained)
      • atomate: Use FireWorks to run materials science calculations
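Two of the design ideas on this slide, planning for failures and running duplicate steps only once, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the FireWorks API: `ToyQueue`, `fingerprint`, and `flaky_relaxation` are hypothetical names, and an in-memory dictionary stands in for the MongoDB database.

```python
import hashlib
import json


class ToyQueue:
    """Hypothetical in-memory job queue with retries and duplicate detection."""

    def __init__(self, max_retries=2):
        self.max_retries = max_retries
        self.results = {}   # fingerprint -> result of a completed job
        self.run_count = 0  # number of times a job function was actually called

    @staticmethod
    def fingerprint(name, params):
        # Identical (name, params) pairs always map to the same key.
        blob = json.dumps({"name": name, "params": params}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def submit(self, name, params, func):
        key = self.fingerprint(name, params)
        if key in self.results:
            # Duplicate step: reuse the stored result instead of rerunning.
            return self.results[key]
        last_error = None
        for _ in range(self.max_retries + 1):
            self.run_count += 1
            try:
                result = func(**params)
            except RuntimeError as err:
                # Expected failure mode: remember the error and rerun.
                last_error = err
                continue
            self.results[key] = result
            return result
        raise RuntimeError(f"{name} failed after retries") from last_error


attempts = {"n": 0}


def flaky_relaxation(formula):
    """Hypothetical job that crashes on its first attempt only."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("node crashed")
    return f"relaxed {formula}"


queue = ToyQueue()
first = queue.submit("relax", {"formula": "LiFePO4"}, flaky_relaxation)
second = queue.submit("relax", {"formula": "LiFePO4"}, flaky_relaxation)
print(first, second, queue.run_count)
```

The first submission fails once and succeeds on the rerun; the second submission is recognized as a duplicate and never calls the job function. A real system would also need persistent storage and the ability for jobs to modify the workflow graph, which this sketch omits.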
  13. Growing the Materials Project
      • Apart from the workflow software itself, I was running a lot of density functional calculations to populate a public database of calculations (The Materials Project, headed by K. Persson)
      • Interest grew steadily in this resource and a few core members
      • Each of us was wearing lots of hats: materials scientists, web developers, workflow programmers, REST API developers
      User feedback (Cymbet): “I am so incredibly happy an effort like this exists now... I have been lamenting for years that despite the importance of materials we have remained relatively unaided by the information age. Please please don't stop growing!”
  14. A continuing challenge has been that every mistake in high-throughput is magnified …
      From a blog post I wrote about MP work: “I’m overly paranoid probably because I (and others on the Materials Project team) spend inordinate amounts of time fixing problems in the Materials Project data. A search for the word “bug” in my email gives ~500 results in the past year (and there are additional “issues”, “problems”, and “errors”). … trying to exterminate the Materials Project’s bugs can be somewhat maddening – the past few years have demonstrated that the infestation will always return, usually based on something that appears innocent at first glance … For example, on multiple occasions, code that incorrectly set (or failed to set) a single input tag ruined tens of thousands of dollars worth of computing and several weeks of work. Currently, we’re struggling to find out whether old bugs in a crystal structure matching code may have affected what we’ve computed and potentially any of the reported results …”
  16. Transitioning to LBNL staff
      • I became staff in 2013 after being hired by K. Persson
      • At first, this mainly meant that I spent more time training new postdocs in some of the things we were doing and helping launch a new project on multivalent batteries
      • The real career game changer was when I got a DOE Early Career Award in 2015, which came with enough funding to make me an independent researcher essentially overnight
      • Nevertheless, I have continued working on past projects like the Materials Project to this day (first as co-PI, now Associate Director)
  17. The Materials Project continues to grow
      • The Materials Project has grown beyond what most of us imagined
      • The team now includes ~3-4 staff dedicated to infrastructure and scaling
      • Staff web developer currently needed!
      • FireWorks is still used to run the calculations
      • We’ve begun new outreach efforts, like the MP seminar series
      • > 180,000 registered users
  18. A new phase of Materials Project: researchers can contribute their own data sets to MP
      Step 1: Google links to Materials Project page
      Step 2: Materials Project links to your contribution
      Step 3: Your data set and paper are linked
  19. Today, the Materials Project has led to many examples of “computer to lab” success stories
      MP for p-type transparent conductors (Ba2BiTaO6)
      • Prediction: Screening based on band gap, transport properties and band alignments.
      • Experiment: Predictions revealed a material with an s–p hybridized valence band (thought to correlate well with dopability). When synthesized, the material has excellent transparency and is readily dopable with K.
      • References:
        ✦ Hautier, G., Miglio, A., Ceder, G., Rignanese, G.-M. & Gonze, X. Identification and design principles of low hole effective mass p-type transparent conducting oxides. Nature Communications 4, (2013)
        ✦ Bhatia, A. et al. High-Mobility Bismuth-based Transparent p-Type Oxide from High-Throughput Material Screening. Chemistry of Materials 28, 30–34 (2015)
        ✦ Ricci, F. et al. An ab initio electronic transport database for inorganic materials. Scientific Data 4, (2017)
      MP for thermoelectrics (TmAgTe2)
      • Prediction: Screening of tens of thousands of materials with predicted electron transport properties revealed a family of promising XYZ2 candidates.
      • Experiment: Several materials made: YCuTe2 (zT = 0.75), TmAgTe2 (zT = 0.47, 1.8 theoretical), novel NiP2 phosphide.
      • References:
        ✦ Aydemir, U. et al. YCuTe2: a member of a new class of thermoelectric materials with CuTe4-based layered structure. Journal of Materials Chemistry A 4, 2461–2472 (2016)
        ✦ Zhu, H. et al. Computational and experimental investigation of TmAgTe2 and XYZ2 compounds, a new group of thermoelectric materials identified by first-principles high-throughput screening. Journal of Materials Chemistry C 3, 10554–10565 (2015)
        ✦ Pöhls, J.-H. et al. Metal phosphides as potential thermoelectric materials. Journal of Materials Chemistry C 5, 12441–12456 (2017)
      MP for phosphors (Sr2LiAlO4)
      • Prediction: Statistical analysis of existing materials that co-occur with the word ‘phosphor’, followed by structure prediction for new materials.
      • Experiment: Predicted the first known Sr-Li-Al-O quaternary, which showed green-yellow/blue emission with quantum efficiency of 25% (Eu), 40% (Ce), 55% (co-activated Eu, Ce).
      • References:
        ✦ Wang, Z. et al. Mining Unexplored Chemistries for Phosphors for High-Color-Quality White-Light-Emitting Diodes. Joule 2, 914–926 (2018)
        ✦ Li, S. et al. Data-Driven Discovery of Full-Visible-Spectrum Phosphor. Chemistry of Materials 31, 6286–6294 (2019)
        ✦ Ha, J. et al. Color tunable single-phase Eu2+ and Ce3+ co-activated Sr2LiAlO4 phosphors. Journal of Materials Chemistry C 7, 7734–7744 (2019)
  20. One of the applications we looked into was thermoelectric materials
      • A thermoelectric material generates a voltage from a thermal gradient
      • Applications
        • Heat to electricity
        • Refrigeration
      • Advantages include:
        • Reliability
        • Easy to scale to different sizes (including compact)
  21. It is difficult to balance trade-offs in thermoelectric properties, so use screening
      Figure of merit: $ZT = \alpha^2 \sigma T / \kappa$
      • Power factor ($\alpha^2 \sigma$): > 2 mW/(m·K²) (PbTe = 10 mW/(m·K²))
      • Seebeck coefficient ($\alpha$): > 100 µV/K (band structure + BoltzTraP)
      • Electrical conductivity ($\sigma$): > 10³ /(ohm·cm) (band structure + BoltzTraP)
      • Thermal conductivity ($\kappa$): < 1 W/(m·K)
        • $\kappa_e$ from BoltzTraP
        • $\kappa_l$ difficult (phonon-phonon scattering)
      Band structure trade-offs (E vs. k diagrams):
      • Heavy band: ✓ large DOS (higher Seebeck and more carriers); ✗ large effective mass (poor mobility)
      • Light band: ✓ small effective mass (improved mobility); ✗ small DOS (lower Seebeck, fewer carriers)
      • Multiple bands, off symmetry: ✓ large DOS with small effective mass; ✗ difficult to design!
      Data:
      • ~50,000 crystal structures and band structures from the Materials Project are used as a source
      • We compute electronic transport properties with BoltzTraP and minimum thermal conductivity (Cahill-Pohl) for some compounds
      • About 300 GB of electronic transport data is generated. All data is available free for download.
      Reference: F. Ricci, et al., An ab initio electronic transport database for inorganic materials, Sci. Data. 4 (2017) 170085.
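As a worked example of the figure of merit, the sketch below plugs the screening thresholds from this slide into $ZT = \alpha^2 \sigma T / \kappa$. The function names are hypothetical, and the temperature of 300 K is an assumption for illustration; the slide does not state one.

```python
def power_factor(seebeck_v_per_k, conductivity_s_per_m):
    """Power factor alpha^2 * sigma, in W/(m K^2)."""
    return seebeck_v_per_k ** 2 * conductivity_s_per_m


def figure_of_merit(seebeck_v_per_k, conductivity_s_per_m,
                    thermal_conductivity_w_per_mk, temperature_k):
    """Dimensionless zT = alpha^2 * sigma * T / kappa."""
    pf = power_factor(seebeck_v_per_k, conductivity_s_per_m)
    return pf * temperature_k / thermal_conductivity_w_per_mk


# Threshold values from the slide, converted to SI units.
seebeck = 100e-6          # 100 microvolts per kelvin, in V/K
conductivity = 1e3 * 100  # 10^3 S/cm, converted to S/m
kappa = 1.0               # W/(m K)
temperature = 300.0       # K (assumed; not given on the slide)

pf = power_factor(seebeck, conductivity)
zt = figure_of_merit(seebeck, conductivity, kappa, temperature)
print(f"power factor = {pf * 1e3:.2f} mW/(m K^2), zT = {zt:.2f}")
```

With these inputs the power factor comes out at about 1 mW/(m·K²) and zT at about 0.3, so a material that only just meets the Seebeck and conductivity thresholds still falls short of the 2 mW/(m·K²) power factor target.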
  22. We found several compounds with promising figure-of-merit, but no breakthroughs
      TmAgTe2
      • Computation: trigonal p-TmAgTe2 could have power factor up to 8 mW/(m·K²)
        • requires 10²⁰/cm³ carriers
      • Experiment: p-type zT only 0.35 despite very low thermal conductivity (~0.25 W/mK)
        • Limitation: carrier concentration (~10¹⁷/cm³)
        • likely limited by TmAg defects, as determined by follow-up calculations
        • Later, we achieved zT ~0.47 using Zn-doping
      YCuTe2
      • Computation: p-YCuTe2 could only reach PF of 0.4 mW/(m·K²)
        • SOC inhibits PF
        • if thermal conductivity is low (e.g., 0.4 W/mK), we get zT ~1
      • Experiment: zT ~0.75, not too far from the calculation limit
        • carrier concentration of 10¹⁹
        • Decent performance, but unlikely to be improved with further optimization
  23. We also developed a new method for more accurately screening electronic transport
      • Old method: BoltzTraP (screening is qualitative, with pitfalls)
      • New method: AMSET (screening is more quantitative)
      Scattering mechanisms and the materials parameters listed for each:
      • acoustic deformation potential (ad): deformation potential, elastic tensor
      • ionized impurity (ii): dielectric tensor
      • piezoelectric (pi): dielectric tensor, piezoelectric tensor
      • polar optical phonon (po): dielectric tensor, polar phonon frequency
      Notes:
      • The method, AMSET, was in development for ~5 years and took a very talented postdoc (A. Ganose) to finalize everything
      • Can calculate electron mobility and Seebeck coefficient much more accurately than standard models
      Reference: Ganose, A. M.; Park, J.; Faghaninia, A.; Woods-Robinson, R.; Persson, K. A.; Jain, A. Efficient Calculation of Carrier Scattering Rates from First Principles. Nat Commun 2021, 12 (1), 2222.
  24. What about machine learning?
      • “Simulation-only” screening is becoming rarer
      • More common now is to integrate machine learning models before performing expensive calculations
      • Our group developed a popular open-source library called “matminer” to help with ML in materials
      • Since then, we’ve been interested in benchmarking methods from the community
      Example feature table:
      | MATERIAL      | FEATURES        | PROPERTY     |
      | TiO2 rutile   | F11 F12 … F1N   | gap = 3.0 eV |
      | C diamond     | F21 F22 … F2N   | gap = 5.5 eV |
      | …             | …               | …            |
      | PbTe rocksalt | FM1 FM2 … FMN   | gap = 0.3 eV |
      Diagram labels: Python ML libraries; data featurization; data retrieval; data visualization; materials databases (MPDS, Citrine, Materials Project)
  25. Proper benchmarking is becoming more of an issue in materials ML
      New algorithms are constantly reported!
  26. But it is very difficult to compare algorithms
      • Different data sets (data set used in study A vs. study B vs. study C)
        • Source (e.g., OQMD vs MP vs JARVIS)
        • Quantity (e.g., MP 2019 vs MP 2022)
        • Subset / data filtering (e.g., ehull<X)
      • Different evaluation metrics
        • Test set vs. cross validation?
        • Different test set fraction?
      • Can be difficult to install and retrain many of these algorithms
      Example: MAE 5-fold CV = 0.102 eV vs. RMSE test set = 0.098 eV: which is better?
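To make the metric mismatch concrete, the sketch below scores one set of predictions with both MAE and RMSE. The band gap values and predictions are made up for illustration, and the helper names are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE does."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical band gaps (eV) and model predictions.
y_true = [3.0, 5.5, 0.3, 1.1]
y_pred = [2.9, 5.0, 0.4, 1.1]

model_mae = mae(y_true, y_pred)
model_rmse = rmse(y_true, y_pred)
print(f"MAE = {model_mae:.3f} eV, RMSE = {model_rmse:.3f} eV")
```

The same predictions give an MAE of about 0.175 eV and an RMSE of about 0.260 eV, so a reported MAE from one study cannot be compared directly with an RMSE from another, even before accounting for differences in data sets or in test set versus cross-validation protocols.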
  27. Matbench includes 13 different ML tasks
      Reference: Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138.
  28. How to read the Matbench leaderboard
      • A scaled error of 0.0 means all predictions are correct
      • A scaled error of 1.0 is equal to always predicting the average value
      Axis labels: bigger datasets; better relative performance
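One way to realize a scaled error with the two properties listed on this slide is to divide the model's MAE by the MAE of a baseline that always predicts the mean. The sketch below assumes that definition for regression tasks; the exact formula used by the Matbench leaderboard may differ, and the function names and data are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def scaled_error(y_true, y_pred):
    """Model MAE divided by the MAE of always predicting the mean of y_true."""
    mean_value = sum(y_true) / len(y_true)
    baseline = mae(y_true, [mean_value] * len(y_true))
    return mae(y_true, y_pred) / baseline


# Hypothetical target values and three sets of predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = scaled_error(y_true, [1.0, 2.0, 3.0, 4.0])
mean_only = scaled_error(y_true, [2.5, 2.5, 2.5, 2.5])
partial = scaled_error(y_true, [1.5, 2.0, 3.0, 3.5])
print(perfect, mean_only, partial)
```

Perfect predictions give 0.0, the mean-only baseline gives 1.0, and the partially correct model lands at 0.25, which lets tasks with different units and data set sizes share one axis.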
  29. Magpie + SCM Model
      • Composition features using chemical descriptors, such as averages / standard deviations of elemental properties (e.g., melting point, electronegativity)
      • Structure features using the sine Coulomb matrix (SCM)
      References:
      • Ward, L., Agrawal, A., Choudhary, A. et al. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2, 16028 (2016).
      • Faber, Felix, et al. "Crystal structure representations for machine learning models of formation energies." International Journal of Quantum Chemistry 115.16 (2015): 1094-1101.
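The composition features described here can be illustrated with a composition-weighted mean and standard deviation of one elemental property. This is a simplified sketch, not the matminer or Magpie implementation: the function name is hypothetical, only Pauling electronegativity is used, and the real feature set covers many more properties and statistics.

```python
import math

# Pauling electronegativities for the two elements used below.
ELECTRONEGATIVITY = {"Ti": 1.54, "O": 3.44}


def composition_stats(composition, property_table):
    """Return the fraction-weighted mean and standard deviation of a property.

    composition maps element symbols to amounts, e.g. {"Ti": 1, "O": 2}.
    """
    total = sum(composition.values())
    fractions = {el: amount / total for el, amount in composition.items()}
    mean = sum(frac * property_table[el] for el, frac in fractions.items())
    variance = sum(
        frac * (property_table[el] - mean) ** 2 for el, frac in fractions.items()
    )
    return mean, math.sqrt(variance)


mean_en, std_en = composition_stats({"Ti": 1, "O": 2}, ELECTRONEGATIVITY)
print(f"TiO2 electronegativity: mean = {mean_en:.3f}, std = {std_en:.3f}")
```

For TiO2 this gives a mean of about 2.807 and a standard deviation of about 0.896. Repeating the calculation over many elemental properties yields one fixed-length feature row per material, which is the kind of row shown in the feature table on slide 24.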
  30. MODNet Model
      Reference: De Breuck, P.-P.; Evans, M. L.; Rignanese, G.-M. Robust Model Benchmarking and Bias-Imbalance in Data-Driven Materials Science: A Case Study on MODNet. Journal of Physics: Condensed Matter, Volume 33, Number 40, 2021
  31. CGCNN Model
      Reference: Xie, T.; Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120 (14), 145301.
  32. ALIGNN Model
      Reference: Choudhary, Kamal, and Brian DeCost. "Atomistic Line Graph Neural Network for improved materials property predictions." npj Computational Materials 7.1 (2021): 1-8.
  33. How much have we improved overall?
      • In some cases (e.g., Ef DFT) we have made a lot of improvement
      • In contrast, for others (e.g., σy steel alloys) we have barely improved
      • Possible reasons
        • Amount of attention paid to certain problems
        • Small vs large data emphasis: there is a lot more room for improvement for small data
  34. How else can machine learning be used?
      Diagram labels: flood of information; important things get missed; useful data, but unstructured; NLP algorithms
  35. The types of features that would be very helpful for materials research
      • Chemistry-aware search: “Zinc oxide”, “ZnO”, and “OZn” count as the same input and give the same results
      • Summary data: physical properties, synthesis information, known applications
      • Searching an application (e.g., “ferroelectrics”) returns all known compositions (PbTiO3, BaTiO3, etc.)
      • Links to computational databases
      • Custom data extraction: a user annotates a small number of example texts, a custom model is trained to complete the annotations, and the model is applied to the entire literature (millions of articles) or an internal text database
      • Question and answer, e.g.
        • What is the band gap of “Si”?
        • What are all the known dopants into GaAs?
        • What are all materials studied as thermoelectrics?
  36. We developed a pipeline to extract data from materials science abstracts
      Reference: Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  37. The resulting model can label abstracts
      Named Entity Recognition
      • Custom machine learning models to extract the most valuable materials-related information.
      • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts.
      • f1 scores of ~0.9. f1 score for inorganic materials extraction is >0.9.
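For readers unfamiliar with the f1 score quoted above, the sketch below computes it for a single hypothetical abstract by comparing hand-annotated entities with model output. The entity strings and label names are made up for illustration and are not taken from the published model.

```python
def f1_score(true_entities, predicted_entities):
    """Harmonic mean of precision and recall over exact (text, label) matches."""
    true_set = set(true_entities)
    predicted_set = set(predicted_entities)
    true_positives = len(true_set & predicted_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (entity text, entity label).
annotated = [("ZnO", "material"), ("thin film", "descriptor"), ("band gap", "property")]
predicted = [("ZnO", "material"), ("band gap", "property"), ("sputtering", "method")]

score = f1_score(annotated, predicted)
print(f"f1 = {score:.3f}")
```

Here the model finds two of three annotated entities and makes one wrong prediction, so precision and recall are both 2/3 and the f1 score is about 0.667. A score near 0.9 therefore means that both missed entities and spurious entities are rare.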
  38. And enables new kinds of searches …
  39. We also found that word embeddings trained on literature have hidden chemical information
      • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector
      • These vectors encode the meaning of each word based on trying to predict context words around the target
      • When we train word2vec on inorganic materials science abstracts, we get representations in line with chemical knowledge
      Figure caption: crystal structures of the elements
  40. This hidden information can be used to predict compounds that might be interesting
      • Dot product of a composition word with the word “thermoelectric” essentially predicts how likely that word is to appear in an abstract with the word thermoelectric
      • Compositions with high dot products are typically known thermoelectrics
      • Sometimes, compositions have a high dot product with “thermoelectric” but have never been studied as a thermoelectric
      • These compositions usually have high computed power factors! (DFT+BoltzTraP)
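The ranking step described on this slide amounts to sorting composition words by their dot product with the vector for “thermoelectric”. The sketch below uses made-up 3-dimensional vectors in place of the 200-dimensional word2vec embeddings from the talk; the function names and all numbers are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_dot_product(target_word, candidates, embeddings):
    """Sort candidate words by dot product with the target word, highest first."""
    target = embeddings[target_word]
    scores = {word: dot(embeddings[word], target) for word in candidates}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up toy embeddings; real ones would come from a trained word2vec model.
embeddings = {
    "thermoelectric": [0.9, 0.1, 0.0],
    "Bi2Te3": [0.8, 0.2, 0.1],
    "PbTe": [0.7, 0.1, 0.3],
    "NaCl": [0.0, 0.2, 0.9],
}

ranking = rank_by_dot_product("thermoelectric", ["NaCl", "PbTe", "Bi2Te3"], embeddings)
print(ranking)
```

With these toy vectors the ranking is Bi2Te3, PbTe, NaCl. In the approach described on the slide, candidates near the top of such a list that have never been reported as thermoelectrics are the ones flagged for further study.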
  41. For example, we can rank which compounds are likely to co-occur with “thermoelectric” in the future
      • For every year since 2001, see which compounds we would have predicted using only literature data until that point in time
      • Make predictions of what materials are the most promising thermoelectrics for data until that year
      • See if those materials were actually studied as thermoelectrics in subsequent years
      Figure legend: investigated as thermoelectrics (independently of our study); investigated by our own collaborators (as a result of our study)
  42. We’ve since also applied NLP to synthesis and are working actively in this area
  44. Moving from the virtual world to the physical world: automated synthesis
      • In operation: XRD, robot, box furnace x 4, tube furnace x 4
      • Arriving soon: SEM/EDS (early June), Labman dosing and mixing
      • Location: LBNL bldg. 30
  45. Lab starting to take shape …
      Courtesy Y. Fei, Ceder Group
  46. And once again, we need workflow software!
      • Monitor the lab and run experiments on different devices
      • Collect data generated in the experiments
      • Handle exceptions in the lab
  47. Conclusions
      • A lot of things can change in 15 years
      • 15 years ago, the idea of high-throughput DFT was scoffed at by many researchers (“too computationally expensive”, “theory not good enough”, “people will be confused”)
      • Today it has become a standard procedure for materials design
      • I got to see the technique grow from being used by a handful of people with the smallest possible conference sessions to now filling large, standing-room-only symposia at large conferences
      • I see similar changes happening, or already happened, in emerging areas
        • Machine learning in materials was a niche subject, now it’s potentially bigger than DFT-based screening
        • NLP in materials is still small, but the trajectory looks on track to become big (at a slower pace)
        • Automated synthesis is still small, but that trajectory is growing very rapidly
      • I’ve been fortunate to be a part of many great projects and teams at the lab and am looking forward to the next iteration of materials design!
  48. Acknowledgements
      • My mentors and advisors, without whom I wouldn’t have a job
        • Vivian Stojanoff (SULI adviser), Gerd Ceder (PhD advisor), Kristin Persson (postdoc advisor + early staff supervisor)
      • Our research group, without whom there’d be no exciting research results
      • Our collaborators
        • Entire Materials Project team
        • J. Snyder and J. Pohls, who took time-consuming experimental leaps on computational screening results for thermoelectrics
      • Our funders
        • DOE BES, DOE EERE, Toyota Research Institute, LBNL LDRD