Accelerating New Materials Design with Supercomputing and Machine Learning
Anubhav Jain
Lawrence Berkeley National Laboratory
Alvarez seminar series
Slides (already) posted to hackingmaterials.lbl.gov
Outline
What I did BEFORE the Alvarez fellowship
What I did DURING the Alvarez fellowship
What I've been doing AFTER the Alvarez fellowship
• High school research program
• SULI internships
• PhD research
• High-throughput workflows for materials science
• Open source software
• Launching the Materials Project
• Growing the Materials Project
• Screening for functional materials
• Applying AI / ML to materials science
• Integrating NLP
• Future directions and other projects
What’s next?
I began my research career with a high school
internship program at Brookhaven NL
• The project was supposed to
be about how to build large
rectifier circuits to ensure a
stable power supply to
superconducting magnets
• But mostly, we just goofed
around
Spending half a day getting a pen cap out of a colleague’s cast
Fun with the Plan9 operating system and circuit diagrams
In college, I did an initial SULI internship on
functional MRI (fMRI) research
• We wanted to see if all fMRI
signals were real or if some
signals were spurious
• I also did a lot of manual data
cleaning, and I always wondered
why they didn’t just write a
program to do this kind of grunt
work (but was too shy to ask)
• Sometimes I actually ran live
scans on subjects with little
assistance, something that would
definitely not pass safety checks
today … (and maybe didn’t at the
time either)
increasing rotation leads to more
spurious signals
I then did another SULI internship, this time
engaging for a couple of years and publishing!
• Goal was to have a robotic system for
performing protein crystallography –
allowing experiments to run
autonomously / overnight and at
higher rates
• My portion of the robot system was
to automatically align protein crystals
in an x-ray using computer vision
• Lots of Java programming and
traditional computer vision
Today, I’ve advised a lot of SULI interns myself – many of whom are now
PhD students pushing forward their own research boundaries!
For my PhD application, I applied (and was
admitted) to do biology / materials science
research
• I was targeting drug delivery materials
• But I didn’t get a position in any labs related to biomaterials
• So I ended up taking a position with my thermodynamics teacher (G.
Ceder), who seemed smart and also played with dinosaur figurines in
class
In the end, I focused on automating
calculations for materials screening for my PhD
• Traditionally, theoretical
calculations on materials
were performed one-at-a-
time, mostly manually
• We built a complex system
to automate these
calculations and used it to
screen Li-ion battery
cathode materials
• We also began a process of
putting the data online …
$H\,\Psi(\{r_i\};t) = i\hbar\,\frac{d}{dt}\Psi(\{r_i\};t)$
Total energy
Optimized structure
Magnetic ground state
Charge density
Band structure / DOS
$H = \sum_{i=1}^{N_e} \nabla_i^2 + \sum_{i=1}^{N_e} V_{\mathrm{nuclear}}(r_i) + \sum_{i=1}^{N_e} V_{\mathrm{effective}}(r_i)$
Setyawan, Curtarolo
Comp. Mat. Sci (2010) 384
Xeon cores
134,000
lines of code
50
core tables
Chemistry: Li9V3(P2O7)3(PO4)2
Novelty: New
Energy density vs. LiFePO4: 20% greater
% of theoretical capacity already achieved in the lab: ~65%
Origin:
V to Fe substitution in Li9Fe3(P2O7)3(PO4)2*
Remarks:
• Structure has “layers” and “tunnels”
• Pyrophosphate-phosphate mixture
• Potential 2-electron material
Yet another workflow software ….
• By 2011, our computing infrastructure at MIT used for battery screening
was showing a lot of wear and tear / barnacles. It was also not suitable
for running at LBNL’s supercomputers
• We essentially rebuilt 5 years of work from scratch
• Part of this was creating new workflow software, which also merged with
LBNL work called “rockets” for launching jobs. We called the new
software “FireWorks” for naming harmony, but it is a name I regret now …
• This was most of what I did during the Alvarez fellowship
• Supported by the infinite patience of Kristin Persson, who co-advised my
Alvarez fellowship and provided supplemental funding, and of D.
Bailey, who served as my CRD host
• Note that in the end, things were better by scrapping almost all of
“rockets” and designing FireWorks from scratch
https://xkcd.com/927/
Note: we did think a lot about whether to use an existing
workflow package, but none met our needs for (i) ease of use
(could be operated by scientists), (ii) good documentation,
and (iii) compatibility with error-prone and dynamic high-
throughput workflows
I spent a lot of time developing FireWorks and associated
infrastructure for high-throughput computing
• We did several things that were new at
the time
• Based everything off MongoDB
• Really planned for job failures and reruns
• Took into account that duplicate steps of
workflows may be submitted, but should
run only once
• Allowed jobs to modify their own workflow
graph or create new workflows
• Spent a lot of time on documentation and
support
• FireWorks continues to be an active
project and is now largely community
supported
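The ideas in the list above (planning for failures and reruns, running duplicate steps only once, and letting jobs extend their own workflow) can be sketched in a few lines. This is a toy, in-memory illustration and not the FireWorks API: the `ToyLaunchPad` class, the `relax_then_static` task, and the spec fields are all hypothetical names, and a plain dict stands in for the MongoDB store.

```python
import hashlib
import json


class ToyLaunchPad:
    """In-memory stand-in for a job database (hypothetical; not FireWorks)."""

    def __init__(self, max_attempts=3):
        self.queue = []      # job specs waiting to run
        self.results = {}    # spec fingerprint -> result, used to detect duplicates
        self.max_attempts = max_attempts

    @staticmethod
    def fingerprint(spec):
        # Identical specs hash to the same key regardless of key order.
        return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

    def add(self, spec):
        self.queue.append(spec)

    def run_all(self, task_fn):
        executed = 0
        while self.queue:
            spec = self.queue.pop(0)
            key = self.fingerprint(spec)
            if key in self.results:
                continue  # duplicate step: already ran once, skip it
            for attempt in range(1, self.max_attempts + 1):
                try:
                    result, new_specs = task_fn(spec, attempt)
                    break
                except RuntimeError:
                    if attempt == self.max_attempts:
                        result, new_specs = "FAILED", []
            executed += 1
            self.results[key] = result
            self.queue.extend(new_specs)  # a job may add follow-up jobs
        return executed


def relax_then_static(spec, attempt):
    # Hypothetical task: the first attempt of a "relax" job fails and is rerun.
    if spec["step"] == "relax":
        if attempt == 1:
            raise RuntimeError("simulated walltime failure")
        # On success, the job extends its own workflow with a follow-up step.
        return "relaxed", [{"material": spec["material"], "step": "static"}]
    return "energy computed", []


pad = ToyLaunchPad()
pad.add({"material": "LiFePO4", "step": "relax"})
pad.add({"step": "relax", "material": "LiFePO4"})  # duplicate submission
n_executed = pad.run_all(relax_then_static)
```

Here the duplicate "relax" submission is skipped, the failed first attempt is rerun, and the "static" step exists only because the "relax" job created it.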
I spent a lot of time programming …
Rocketsled: Use FireWorks to perform virtual active
learning, even when simulations are expensive
and require supercomputers
Borealis: Run FireWorks in the cloud via GCP
[[externally developed and maintained]]
atomate: Use FireWorks to run materials
science calculations
Growing the Materials Project
• Apart from the workflow software
itself, I was running a lot of density
functional calculations to populate a
public database of calculations (The
Materials Project, headed by K.
Persson)
• Interest grew steadily in this resource,
which was run by a few core members
• Each of us was wearing lots of hats –
materials scientists, web developers,
workflow programmers, REST API
developers
“I am so incredibly happy an
effort like this exists now... I
have been lamenting for years
that despite the importance of
materials we have remained
relatively unaided by the
information age. Please please
don't stop growing!” Cymbet
A continuing challenge has been that every
mistake in high-throughput is magnified …
“I’m overly paranoid probably because I (and others on
the Materials Project team) spend inordinate
amounts of time fixing problems in the Materials Project
data. A search for the word “bug” in my email gives ~500
results in the past year (and there are additional
“issues”, “problems”, and “errors”).
… trying to exterminate the Materials Project’s bugs can
be somewhat maddening – the past few years have
demonstrated that the infestation will always return,
usually based on something that appears innocent at
first glance …
For example, on multiple occasions, code that
incorrectly set (or failed to set) a single input tag
ruined tens of thousands of dollars worth of computing
and several weeks of work. Currently, we’re struggling
to find out whether old bugs in a crystal structure
matching code may have affected what we’ve computed and
potentially any of the reported results …”
(myself in a blog post about MP work)
Transitioning to LBNL staff
• I became staff in 2013 after being hired by K. Persson
• At first, this mainly meant that I spent more time training new
postdocs in some of the things we were doing and helping launch a
new project on multivalent batteries
• The real career game changer was when I got a DOE Early Career
Award in 2015, which came with enough funding to make me an
independent researcher essentially overnight
• Nevertheless, I have continued working on past projects like the Materials
Project to this day (first as co-PI, now Associate Director)
The Materials Project continues to grow
• The Materials Project has grown
beyond what most of us imagined
• The team now includes ~3-4 staff
dedicated to infrastructure and
scaling
• Staff web developer currently needed!
• FireWorks is still used to run the
calculations
• We’ve begun new outreach efforts,
like the MP seminar series
• https://materialsproject.org/seminars
> 180,000 registered
users
A new phase of Materials Project: researchers can contribute
their own data sets to MP
1. Google links to Materials Project page
2. Materials Project links to your contribution
3. Your data set and paper are linked
Today, the Materials Project has led to
many examples of “computer to lab”
success stories
MP for p-type transparent conductors
References
✦ Hautier, G., Miglio, A., Ceder, G., Rignanese, G.-M. & Gonze, X. Identification and
design principles of low hole effective mass p-type transparent conducting oxides.
Nature Communications 4, (2013)
✦ Bhatia, A. et al. High-Mobility Bismuth-based Transparent p-Type Oxide from High-
Throughput Material Screening. Chemistry of Materials 28, 30–34 (2015)
✦ Ricci, F. et al. An ab initio electronic transport database for inorganic materials.
Scientific Data 4, (2017)
Prediction
Screening based on band
gap, transport properties
and band alignments.
Experiment
Predictions revealed
material with s–p
hybridized valence band
(thought to correlate
well with dopability).
When synthesized,
material has excellent
transparency and readily
dopable with K.
Ba2BiTaO6
MP for thermoelectrics
References
✦ Aydemir, U. et al. YCuTe2: a member of a new class of thermoelectric materials with
CuTe4-based layered structure. Journal of Materials Chemistry A 4, 2461–2472 (2016)
✦ Zhu, H. et al. Computational and experimental investigation of TmAgTe2 and
XYZ2 compounds, a new group of thermoelectric materials identified by first-principles
high-throughput screening. Journal of Materials Chemistry C 3, 10554–10565 (2015).
✦ Pöhls, J.-H. et al. Metal phosphides as potential thermoelectric materials. Journal of
Materials Chemistry C 5, 12441–12456 (2017).
Prediction
Screening of tens of
thousands of materials
with predicted electron
transport properties
revealed a family of
promising XYZ2
candidates
Experiment
Several materials made:
YCuTe2 (zT = 0.75),
TmAgTe2 (zT = 0.47, 1.8
theoretical), novel NiP2
phosphide
TmAgTe2
MP for phosphors
References
✦ Wang, Z. et al. Mining Unexplored Chemistries for Phosphors for High-Color-
Quality White-Light-Emitting Diodes. Joule 2, 914–926 (2018)
✦ Li, S. et al. Data-Driven Discovery of Full-Visible-Spectrum Phosphor. Chemistry of
Materials 31, 6286–6294 (2019)
✦ Ha, J. et al. Color tunable single-phase Eu2+ and Ce3+ co-activated Sr2LiAlO4
phosphors. Journal of Materials Chemistry C 7, 7734–7744 (2019)
Prediction
Statistical analysis of existing
materials that co-occur with
word ‘phosphor’ followed
by structure prediction for
new materials
Experiment
Predicted first known Sr-Li-
Al-O quaternary, showed
green-yellow/blue emission
with quantum efficiency of
25% (Eu), 40% (Ce), 55%
(co-activated Eu, Ce)
Sr2LiAlO4
One of the applications we looked into was
thermoelectric materials
• A thermoelectric material
generates a voltage based on
thermal gradient
• Applications
• Heat to electricity
• Refrigeration
• Advantages include:
• Reliability
• Easy to scale to different sizes
(including compact)
www.alphabetenergy.com
It is difficult to balance trade-offs in
thermoelectrics properties, so use screening
$ZT = \alpha^2 \sigma T / \kappa$
power factor ($\alpha^2 \sigma$)
> 2 mW/(m·K^2)
(PbTe = 10 mW/(m·K^2))
Seebeck coefficient ($\alpha$)
> 100 µV/K
Band structure + BoltzTraP
electrical conductivity ($\sigma$)
> 10^3 /(ohm·cm)
Band structure + BoltzTraP
thermal conductivity ($\kappa$)
< 1 W/(m·K)
• $\kappa_e$ from BoltzTraP
• $\kappa_l$ difficult (phonon-phonon scattering)
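As a worked example of the figure-of-merit formula, the sketch below converts commonly used lab units to SI and evaluates the power factor and ZT. The function names and the input values (Seebeck coefficient, conductivity, thermal conductivity, temperature) are hypothetical illustrations, not results from the screening.

```python
def power_factor_mW(seebeck_uV_per_K, sigma_S_per_cm):
    """Power factor alpha^2 * sigma, returned in mW/(m*K^2)."""
    alpha = seebeck_uV_per_K * 1e-6   # microvolts/K -> V/K
    sigma = sigma_S_per_cm * 100.0    # S/cm -> S/m
    return alpha**2 * sigma * 1e3     # W/(m*K^2) -> mW/(m*K^2)


def figure_of_merit(seebeck_uV_per_K, sigma_S_per_cm, kappa_W_per_mK, temperature_K):
    """Dimensionless ZT = alpha^2 * sigma * T / kappa."""
    pf_W = power_factor_mW(seebeck_uV_per_K, sigma_S_per_cm) * 1e-3
    return pf_W * temperature_K / kappa_W_per_mK


# Hypothetical material: 200 microvolts/K, 1000 S/cm, 1 W/(m*K), evaluated at 600 K.
pf = power_factor_mW(200.0, 1000.0)               # about 4 mW/(m*K^2)
zt = figure_of_merit(200.0, 1000.0, 1.0, 600.0)   # about 2.4
```

The trade-off is visible in the formula: raising the carrier concentration typically raises the conductivity but lowers the Seebeck coefficient, which enters squared.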
Heavy band:
✓ Large DOS
(higher Seebeck and more carriers)
✗ Large effective mass
(poor mobility)
Light band:
✓ Small effective mass
(improved mobility)
✗ Small DOS
(lower Seebeck, fewer carriers)
Multiple bands, off symmetry:
✓ Large DOS with small effective
mass
✗ Difficult to design!
E
k
~50,000 crystal
structures and
band structures
from Materials
Project are used
as a source
F. Ricci, et al., An ab initio electronic transport
database for inorganic materials, Sci. Data. 4
(2017) 170085.
We compute electronic
transport properties
with BoltzTraP and
minimum thermal
conductivity (Cahill-
Pohl) for some
compounds
About 300GB of
electronic transport
data is generated. All
data is available free
for download.
We found several compounds with promising
figure-of-merit, but no breakthroughs
TmAgTe2
• Calculations: trigonal p-TmAgTe2 could have power factor up to 8 mW/mK2
• requires 10^20/cm^3 carriers
• Expt: p-zT only 0.35 despite very low thermal conductivity (~0.25 W/mK)
• Limitation: carrier concentration (~10^17/cm^3)
• likely limited by Tm_Ag defects, as determined by followup calculations
• Later, we achieved zT ~ 0.47 using Zn-doping
YCuTe2
• Calculations: p-YCuTe2 could only reach PF of 0.4 mW/mK2
• SOC inhibits PF
• if thermal conductivity is low (e.g., 0.4), we get zT ~1
• Expt: zT ~0.75 – not too far from calculation limit
• carrier concentration of 10^19
• Decent performance, but unlikely to be improved with further optimization
[Figure: experiment vs. computation for TmAgTe2 and YCuTe2]
We also developed a new method for more
accurately screening electronic transport
Old method (BoltzTraP – screening is qualitative w/pitfalls)
New method (AMSET – screening is more quantitative)
Ganose, A. M.; Park, J.; Faghaninia, A.; Woods-Robinson, R.; Persson, K. A.; Jain, A. Efficient Calculation of Carrier Scattering Rates from First
Principles. Nat Commun 2021, 12 (1), 2222.
Scattering mechanisms and the material inputs each requires:
acoustic deformation potential (ad): deformation potential, elastic tensor
ionized impurity (ii): dielectric tensor
piezoelectric (pi): dielectric tensor, piezoelectric tensor
polar optical phonon (po): dielectric tensor, polar phonon frequency
• The method, AMSET, was in development for
~5 years and took a very talented postdoc (A.
Ganose) to finalize everything
• Can calculate e- mobility + Seebeck coefficient
much more accurately than standard models
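One way to see why several scattering mechanisms matter is the textbook combination rule (Matthiessen's rule): if mechanisms act independently, their scattering rates add. The sketch below shows only that rule with made-up rates; it is not the AMSET code or API, which evaluates rates for each electronic state from first-principles inputs.

```python
def total_scattering_rate(rates_per_s):
    """Matthiessen's rule: independent scattering rates (1/s) add."""
    return sum(rates_per_s.values())


def limiting_mechanism(rates_per_s):
    """The mechanism with the highest rate limits the carrier lifetime most."""
    return max(rates_per_s, key=rates_per_s.get)


# Hypothetical rates (1/s), keyed by the mechanism abbreviations used on the slide.
rates = {"ad": 1.0e12, "ii": 4.0e12, "pi": 0.5e12, "po": 2.5e12}

total = total_scattering_rate(rates)   # 8.0e12 per second
lifetime_s = 1.0 / total               # 1.25e-13 seconds
dominant = limiting_mechanism(rates)   # "ii" for these made-up numbers
```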
What about machine learning?
• “Simulation-only” screening is
becoming rarer
• More common now is to integrate
machine learning models before
performing expensive calculations
• Our group developed a popular
open-source library called
“matminer” to help with ML in
materials
• Since then, we’ve been interested in
benchmarking methods from the
community
MATERIAL | FEATURES | PROPERTY
TiO2 rutile | F11 F12 … F1N | gap = 3.0 eV
C diamond | F21 F22 … F2N | gap = 5.5 eV
… | … | …
PbTe rocksalt | FM1 FM2 … FMN | gap = 0.3 eV
[Figure: matminer links materials databases (MPDS, Citrine, Materials Project) to Python ML libraries through data retrieval, data featurization, and data visualization]
Proper benchmarking is becoming more of an
issue in materials ML
New algorithms are constantly reported!
But it is very difficult to compare
algorithms
Data set used
in study A
Data set used
in study B
Data set used
in study C
• Different data sets
• Source (e.g., OQMD vs MP vs JARVIS)
• Quantity (e.g., MP 2019 vs MP 2022)
• Subset / data filtering (e.g., ehull<X)
• Different evaluation metrics
• Test set vs. cross validation?
• Different test set fraction?
• Can be difficult to install and retrain
many of these algorithms
MAE 5-Fold CV = 0.102 eV vs. RMSE Test set = 0.098 eV: which is better?
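A small illustration of why differently reported metrics cannot be compared directly: the same set of prediction errors gives different MAE and RMSE values, because RMSE weights outliers more heavily. The error values below are made up for illustration.

```python
import math


def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root-mean-square error; never smaller than the MAE of the same errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical prediction errors in eV: three small misses and one outlier.
errors_eV = [0.05, -0.05, 0.05, -0.25]

model_mae = mae(errors_eV)    # about 0.100 eV
model_rmse = rmse(errors_eV)  # about 0.132 eV
```

So a paper reporting RMSE = 0.098 eV and another reporting MAE = 0.102 eV cannot be ranked from those numbers alone, even before accounting for different data sets and splits.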
Matbench includes 13 different ML tasks
Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference
Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
How to read the Matbench leaderboard
[Figure: Matbench leaderboard; axes: bigger datasets, better relative performance]
• A scaled error of 0.0 means all
predictions are correct
• A scaled error of 1.0 is equal
to always predicting the
average value
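The two anchor points above follow directly from dividing a model's MAE by the MAE of a baseline that always predicts the average. A minimal sketch, assuming for simplicity that the average is taken over the same targets being scored (a real benchmark would use the training-set mean for each fold); the function names and data are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def scaled_error(y_true, y_pred):
    """Model MAE divided by the MAE of always predicting the mean value."""
    mean_value = sum(y_true) / len(y_true)
    baseline = mean_absolute_error(y_true, [mean_value] * len(y_true))
    return mean_absolute_error(y_true, y_pred) / baseline


# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

model_score = scaled_error(y_true, y_pred)                 # 0.25
perfect_score = scaled_error(y_true, y_true)               # 0.0
baseline_score = scaled_error(y_true, [2.5, 2.5, 2.5, 2.5])  # 1.0
```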
Magpie + SCM Model
• Composition features from
chemical descriptors:
averages/stdevs of elemental
properties such as melting
point and electronegativity
• Structure features using sine
Coulomb matrix
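The composition-feature idea can be sketched without any library: weight an elemental property by atomic fraction and take the mean and standard deviation. This is a simplified illustration rather than the Magpie or matminer implementation; the `parse_formula` helper handles only simple formulas without parentheses, and the small property table is limited to a few Pauling electronegativities.

```python
import math
import re

# Pauling electronegativities for a few elements (illustrative subset).
ELECTRONEGATIVITY = {"Ti": 1.54, "O": 3.44, "C": 2.55}


def parse_formula(formula):
    """Parse a simple formula such as 'TiO2' into atomic fractions."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


def composition_features(formula, table):
    """Fraction-weighted mean and standard deviation of one elemental property."""
    fractions = parse_formula(formula)
    mean = sum(f * table[el] for el, f in fractions.items())
    variance = sum(f * (table[el] - mean) ** 2 for el, f in fractions.items())
    return mean, math.sqrt(variance)


tio2_mean, tio2_std = composition_features("TiO2", ELECTRONEGATIVITY)
c_mean, c_std = composition_features("C", ELECTRONEGATIVITY)
```

Repeating this for many elemental properties gives one fixed-length feature row per material, independent of how many elements the formula contains.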
Ward, L., Agrawal, A., Choudhary, A. et al. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2, 16028 (2016).
Faber, Felix, et al. "Crystal structure representations for machine learning models of formation energies." International Journal of Quantum Chemistry 115.16 (2015): 1094-1101.
https://matbench.materialsproject.org
MODNet Model
De Breuck, P.-P.; Evans, M. L.; Rignanese, G.-M. Robust Model Benchmarking and Bias-Imbalance in Data-Driven Materials Science: A Case Study on MODNet. Journal of Physics:
Condensed Matter, Volume 33, Number 40, 2021
CGCNN Model
Xie, T.; Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120 (14), 145301.
ALIGNN Model
Choudhary, Kamal, and Brian DeCost. "Atomistic Line Graph Neural Network for improved materials property predictions." npj Computational Materials 7.1 (2021): 1-8.
How much have we
improved overall?
• In some cases (e.g., Ef DFT) we
have made a lot of
improvement
• In contrast, for others (e.g., σy
steel alloys) we have barely
improved
• Possible reasons
• Amount of attention paid to
certain problems
• Small vs large data emphasis –
there is a lot more room for
improvement for small data
How else can machine learning be used?
Flood of information
Important things get missed
Useful data, but unstructured
NLP algorithms
The types of features that would be very
helpful for materials research
Zinc oxide
ZnO
OZn
Chemistry aware search
(same input, same results)
Summary data
• Physical properties
• Synthesis information
• Known applications
ferroelectrics All known compositions
(PbTiO3, BaTiO3, etc.)
Links to computational databases
User annotates a small
number of example texts for
data extraction
annotation
source text
Train custom model for
completing annotations
Apply to entire literature (millions of
articles) or internal text database
+ question and answer, e.g.
• What is the band gap of
“Si”?
• What are all the known
dopants into GaAs?
• What are all materials
studied as thermoelectrics?
We developed a pipeline to extract data from
materials science abstracts
Weston, L. et al Named Entity
Recognition and Normalization Applied
to Large-Scale Information Extraction
from the Materials Science Literature. J.
Chem. Inf. Model. (2019).
The resulting model can label abstracts
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
• f1 scores of ~0.9. f1 score for inorganic
materials extraction is >0.9.
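For readers unfamiliar with the f1 score quoted above, it is the harmonic mean of precision and recall over the extracted entities. A minimal sketch with made-up entities; the tag names and example spans are hypothetical and not taken from the annotated abstracts.

```python
def precision_recall_f1(predicted, gold):
    """Entity-level scores; an entity counts only if text and label both match."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical hand annotations and model output for one abstract.
gold = {("ZnO", "material"), ("band gap", "property"),
        ("thin film", "descriptor"), ("sputtering", "synthesis")}
predicted = {("ZnO", "material"), ("band gap", "property"),
             ("film", "descriptor"), ("sputtering", "synthesis")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
```

In this example three of four predictions are correct and three of four gold entities are found, so precision, recall, and f1 are all 0.75.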
We also found that word embeddings trained on
literature have hidden chemical information
• We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word,
learned by trying to
predict context words
around the target
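The "predict context words around the target" objective starts from (target, context) training pairs drawn from a sliding window. The sketch below generates those pairs for one tokenized sentence; it illustrates the data preparation only, not the word2vec training itself, and the sentence and window size are made up.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical tokenized sentence from an abstract.
tokens = ["PbTe", "is", "a", "thermoelectric", "material"]
pairs = skipgram_pairs(tokens, window=1)
```

Words that appear in similar contexts end up with similar pairs, which is what pushes their vectors close together during training.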
When we train word2vec on
inorganic materials science
abstracts, we get representations
in-line with chemical knowledge
crystal structures of the elements
This hidden information can be used to
predict compounds that might be interesting
• Dot product of a composition word with
the word “thermoelectric” essentially
predicts how likely that word is to appear
in an abstract with the word
thermoelectric
• Compositions with high dot products are
typically known thermoelectrics
• Sometimes, compositions have a high dot
product with “thermoelectric” but have
never been studied as a thermoelectric
• These compositions usually have high
computed power factors! (DFT+BoltzTraP)
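The ranking step described above can be sketched with toy vectors: score each composition by its normalized dot product with the embedding of "thermoelectric", then look at high-scoring compositions that have not yet been studied. The 3-dimensional vectors, the `candidate_X` placeholder, and the set of known thermoelectrics are all invented for illustration; real embeddings have 200 dimensions and come from training.

```python
import math


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity(u, v):
    """Dot product of the normalized vectors (cosine similarity)."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


# Toy embeddings; the values are made up.
embeddings = {
    "thermoelectric": [1.0, 0.0, 0.0],
    "Bi2Te3": [0.9, 0.1, 0.0],
    "candidate_X": [0.6, 0.8, 0.0],
    "NaCl": [0.0, 0.0, 1.0],
}
known_thermoelectrics = {"Bi2Te3"}

target = embeddings["thermoelectric"]
compositions = [w for w in embeddings if w != "thermoelectric"]
ranked = sorted(compositions, key=lambda w: similarity(embeddings[w], target), reverse=True)

# Predictions: highly ranked compositions not yet studied as thermoelectrics.
predictions = [w for w in ranked if w not in known_thermoelectrics]
```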
For example, we can rank which compounds are
likely to co-occur with “thermoelectric” in the
future
• For every year since 2001,
see which compounds we
would have predicted using
only literature data until that
point in time
• Make predictions of which
materials are the most
promising thermoelectrics
using data up to that year
• See if those materials were
actually studied as
thermoelectrics in
subsequent years
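The retrospective test described above amounts to a simple bookkeeping exercise: freeze the literature at a cutoff year, make predictions, then count how many were first reported as thermoelectrics within some later window. A minimal sketch with placeholder materials and years; none of the values are from the study.

```python
def hit_rate(predictions, first_studied_year, cutoff_year, horizon_years):
    """Fraction of predictions first studied within `horizon_years` after the cutoff."""
    deadline = cutoff_year + horizon_years
    hits = [
        material for material in predictions
        if cutoff_year < first_studied_year.get(material, float("inf")) <= deadline
    ]
    return len(hits) / len(predictions)


# Hypothetical predictions made with literature up to 2005, and the (made-up)
# year each material was first reported as a thermoelectric, if ever.
predictions_2005 = ["A", "B", "C", "D"]
first_studied_year = {"A": 2007, "B": 2012, "D": 2009}

rate_5yr = hit_rate(predictions_2005, first_studied_year, 2005, 5)    # A and D
rate_10yr = hit_rate(predictions_2005, first_studied_year, 2005, 10)  # A, B, and D
```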
Investigated as thermoelectrics
(independently of our study)
Investigated by our own collaborators
(as a result of our study)
We’ve since also applied NLP to synthesis and
are working actively in this area
Moving from the virtual world to the physical
world – automated synthesis
In operation:
XRD
Robot
Box furnace
x 4
Tube furnace
x 4
Arriving soon:
SEM/EDS (Early June)
Labman dosing and mixing
LBNL bldg. 30
Dosing and mixing
And once again, we need workflow software!
• Monitor the lab and run experiments on
different devices
different devices
• Collect data generated in the experiment
• Handle exceptions in the lab
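The three requirements above (run steps on devices, collect the data, handle exceptions) can be sketched as a small loop. This is a toy illustration and not the lab's actual software: the `DeviceError` class, the device functions, and the step names are all hypothetical.

```python
class DeviceError(Exception):
    """Raised by a (hypothetical) device driver when a step cannot be completed."""


def run_experiment(steps, devices):
    """Run steps in order, collect data, and stop cleanly on a device failure."""
    log = []
    for device_name, action in steps:
        try:
            data = devices[device_name](action)
        except DeviceError as err:
            log.append({"device": device_name, "status": "error", "detail": str(err)})
            return False, log
        log.append({"device": device_name, "status": "ok", "data": data})
    return True, log


def furnace(action):
    # Pretend driver: returns made-up process data for the requested action.
    return {"action": action, "peak_temperature_C": 900}


def xrd(action):
    # Pretend driver that always fails, to exercise the exception path.
    raise DeviceError("sample holder not detected")


completed, log = run_experiment(
    [("furnace", "heat"), ("xrd", "scan")],
    {"furnace": furnace, "xrd": xrd},
)
```

The furnace data is kept even though the later XRD step fails, so the record of what physically happened to the sample is not lost.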
Conclusions
• A lot of things can change in 15 years
• 15 years ago, the idea of high-throughput DFT was scoffed at by many researchers (“too
computationally expensive”, “theory not good enough”, “people will be confused”)
• Today it has become a standard procedure for materials design
• I got to see the technique grow from being used by a handful of people in the smallest
possible conference sessions to filling large, standing-room-only symposia at large
conferences
• I see similar changes happening, or already complete, in emerging areas
• Machine learning in materials was a niche subject, now it’s potentially bigger than DFT-based
screening
• NLP in materials is still small, but the trajectory looks on-track to become big (at a slower
pace)
• Automated synthesis is still small, but that trajectory is growing very rapidly
• I’ve been fortunate to be a part of many great projects and teams at the lab and
am looking forward to the next iteration of materials design!
Acknowledgements
• My mentors and advisors, without whom I wouldn’t have a job
• Vivian Stojanoff (SULI adviser), Gerd Ceder (PhD advisor), Kristin Persson
(postdoc advisor + early staff supervisor)
• Our research group, without whom there’d be no exciting research
results
• Our collaborators
• Entire Materials Project team
• J. Snyder and J. Pöhls, who took time-consuming experimental leaps on
computational screening results for thermoelectrics
• Our funders
• DOE BES, DOE EERE, Toyota Research Institute, LBNL LDRD