Atomate and matminer are open-source Python libraries for high-throughput materials data generation and data mining. Atomate makes it easy to automatically generate large datasets by running standardized computational workflows with different simulation packages. Matminer contains tools for featurizing materials data and integrating it with machine learning algorithms and data visualization methods. Both aim to accelerate materials discovery by automating and standardizing computational workflows and data analysis tasks.
Software tools for high-throughput materials data generation and data mining
1. Software Tools for High-throughput Materials
Data Generation and Data Mining
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
2018 TMS Conference
Slides (already) posted to https://hackingmaterials.lbl.gov/
2. A schematic of “materials genomics” approaches to materials science
data
applications
methods (theory, ML)
software implementation
3. Our group builds and maintains several open-source software libraries
Data generation / Data analysis
• run and manage millions of computational tasks over large computing resources
• library of FireWorks-compatible workflows for materials science applications
• materials data retrieval, featurization, and visualization for machine learning
• tools for crystal manipulation, data analysis, and simulation software I/O (*led by Ong group, UCSD)
• tools for inverse optimization / adaptive design – ML chooses what calculations to run
4. This talk will focus on atomate and matminer
6. Today, automated (“high-throughput”) calculations play an
important role in materials data generation
>4500 elastic tensors: M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. Van Der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009.
>48000 Seebeck coefficients + cRTA transport: Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier, Sci. Data, 2017, 4, 170085.
7. Today, automated (“high-throughput”) calculations play an
important role in materials data generation
Atomate’s goal: make it easy to generate comparable data sets on your own
8. A “black-box” view of performing a calculation
Researcher: What is the GGA-PBE elastic tensor of GaAs?
“something”
Results
9. Unfortunately, the inside of the “black box”
is usually tedious and “low-level”
Researcher: What is the GGA-PBE elastic tensor of GaAs?
Lots of tedious, low-level work (input file flags, SLURM format, how to fix ZPOTRF?):
• set up the structure coordinates
• write input files, double-check all the flags
• copy to supercomputer
• submit job to queue
• deal with supercomputer headaches
• monitor job
• fix error jobs, resubmit to queue, wait again
• repeat process for subsequent calculations in workflow
• parse output files to obtain results
• copy and organize results, e.g., into Excel
Results
11. What would be a better way?
Researcher: What is the GGA-PBE elastic tensor of GaAs?
Workflows to run:
☐ band structure
☐ surface energies
✓ elastic tensor
☐ Raman spectrum
☐ QH thermal expansion
Results
12. Ideally the method should scale to millions of calculations
Researcher: Start with all binary oxides, replace O->S, run several different properties
Workflows to run:
✓ band structure
✓ surface energies
✓ elastic tensor
☐ Raman spectrum
☐ QH thermal expansion
☐ spin-orbit coupling
Results
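The screening idea on this slide (take a set of binary oxides, swap O for S, then queue several property workflows for each result) can be sketched in plain Python. This is only an illustration and not atomate code: the three formulas are a hand-picked list, compositions are plain dicts rather than crystal structures, and `substitute_species`, `formula`, and `build_task_list` are hypothetical helper names.

```python
from itertools import product

# Hand-picked binary oxides as {element: amount}. A real study would pull
# full crystal structures from a database instead of bare compositions.
BINARY_OXIDES = [
    {"Mg": 1, "O": 1},
    {"Ti": 1, "O": 2},
    {"Al": 2, "O": 3},
]

# Property workflows ticked on the slide.
PROPERTIES = ["band structure", "surface energies", "elastic tensor"]


def substitute_species(composition, old, new):
    """Return a copy of the composition with element `old` replaced by `new`."""
    return {(new if el == old else el): amt for el, amt in composition.items()}


def formula(composition):
    """Format a composition dict as a simple formula string, e.g. Al2S3."""
    return "".join(f"{el}{amt if amt != 1 else ''}" for el, amt in composition.items())


def build_task_list(compositions, properties):
    """Pair every material with every property workflow to run."""
    return [(formula(c), p) for c, p in product(compositions, properties)]


sulfides = [substitute_species(c, "O", "S") for c in BINARY_OXIDES]
tasks = build_task_list(sulfides, PROPERTIES)

for material, prop in tasks:
    print(f"{material}: {prop}")
```

The number of tasks is the number of materials times the number of properties, which is why this pattern reaches very large task counts quickly.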
13. Atomate tries to make it easy, automatic, and flexible to generate data with existing simulation packages
Researcher: Run many different properties of many different materials
Results
14. Atomate contains a library of simulation procedures
VASP-based
• band structure
• spin-orbit coupling
• hybrid functional calcs
• elastic tensor
• piezoelectric tensor
• Raman spectra
• NEB
• GIBBS method
• QH thermal expansion
• AIMD
• ferroelectric
• surface adsorption
• work functions
Other
• BoltzTraP
• FEFF method
• LAMMPS MD
Mathew, K. et al. Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows, Comput. Mater. Sci. 139 (2017) 140–152.
15. Each simulation procedure translates high-level instructions
into a series of low-level tasks
quickly and automatically translate PI-style (minimal) specifications into well-defined FireWorks workflows
What is the GGA-PBE elastic tensor of GaAs?
M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, et al., Charting the complete elastic properties of inorganic crystalline compounds, Sci. Data. 2 (2015).
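To make the idea of expanding a one-line request into many low-level tasks concrete, here is a minimal pure-Python sketch. It does not use atomate or FireWorks: the `Task` class, the `elastic_tensor_tasks` function, the number of strain modes, and the strain magnitudes are all assumptions chosen for illustration, not the settings of any published workflow.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One low-level step; `parents` names the steps that must finish first."""
    name: str
    parents: list = field(default_factory=list)


def elastic_tensor_tasks(formula, strains=(-0.01, -0.005, 0.005, 0.01), n_modes=6):
    """Expand 'elastic tensor of <formula>' into an ordered list of tasks.

    Assumed recipe: relax the structure, run one calculation per
    (strain mode, strain magnitude) pair, then fit the tensor.
    """
    relax = Task(f"{formula}: structure optimization")
    deformations = [
        Task(f"{formula}: deformation mode {mode}, strain {strain:+.3f}", [relax.name])
        for mode in range(1, n_modes + 1)
        for strain in strains
    ]
    fit = Task(f"{formula}: fit elastic tensor", [d.name for d in deformations])
    return [relax, *deformations, fit]


workflow = elastic_tensor_tasks("GaAs")
print(len(workflow), "tasks")
for task in workflow[:3]:
    print(task.name, "<-", task.parents)
```

The researcher supplies only the material and the property; the parent links are what a workflow manager would use to decide which jobs can run in parallel and which must wait.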
16. Atomate thus encodes and standardizes knowledge about
running various kinds of simulations from domain experts
All past and present knowledge, from everyone in the group, everyone previously in the group, and our collaborators, about how to run calculations
K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong, B. Bocklund, T. Smidt, H. Tang, I.H. Chu, M. Horton, J. Dagdelen, B. Wood, Z.K. Liu, J. Neaton, K. Persson, A. Jain
17. Atomate’s main goal – convert structures to workflows
One can convert a crystal structure to a Workflow object in one line of code – or one can customize the workflow via multiple methods
18. Full operation diagram
structure → workflow (job 1, job 2, job 3, job 4) → database of all workflows → automatically submit + execute → output files + database
19. The atomate database makes it easy to perform various analyses with pymatgen
atomate output database:
• phase diagrams
• Pourbaix diagrams
• diffusivity via MD
• band structure analysis
20. Many research groups have run tens of thousands of materials science workflows with atomate
also used by:
• Persson research group, UC Berkeley
• Ong research group, UC San Diego
• Neaton research group, UC Berkeley
• Liu research group, Penn State
• Groups not developing on atomate!
– e.g., see “Thermal expansion of quaternary nitride coatings” by Tasnadi et al.
atomate now powers the Materials Project and will be used to run hundreds of thousands of simulations in the next year (www.materialsproject.org)
21. Further information on atomate
• Link to code:
– https://www.github.com/hackingmaterials/atomate
• License: BSD
– open-source, can be used with commercial software
– like MIT license but with a clause to not abuse the Berkeley Lab name, e.g. for advertising purposes
• Help and support
– https://groups.google.com/forum/#!forum/atomate
• Citation with further information:
– Mathew, K. et al. Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows. Comput. Mater. Sci. 139, 140–152 (2017).
23. Goal of matminer: connect materials data with data mining algorithms and data visualization libraries
MATERIAL        FEATURES         PROPERTY
TiO2 rutile     F11 F12 … F1N    gap = 3.0 eV
C diamond       F21 F22 … F2N    gap = 5.5 eV
…               …                …
PbTe rocksalt   FM1 FM2 … FMN    gap = 0.3 eV
Materials Databases: MPDS, Citrine, Materials Project
Data Retrieval
Data Featurization
Data Visualization
Python ML Libraries
24. Matminer contains a library of descriptors for various materials science entities
A total of 39 featurizer classes can generate thousands of potential descriptors
feat = EwaldEnergy([options])
y = feat.featurize([input_data])
• compatible with scikit-learn pipelining
• automatically deploy multiprocessing to parallelize over data
• include citations to methodology papers
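The construct-then-`featurize` call pattern shown on this slide can be illustrated with a toy stand-in that needs no installed packages. The `MeanAtomicMass` class below is hypothetical and is not a matminer featurizer; it only mimics the shape of the interface (a `featurize` method returning a list of numbers, plus label and citation methods), and it takes plain composition dicts as input, which is an assumption made to keep the sketch self-contained.

```python
# Standard atomic masses (u) for a few elements, rounded.
ATOMIC_MASS = {"C": 12.011, "O": 15.999, "Ti": 47.867, "Te": 127.60, "Pb": 207.2}


class MeanAtomicMass:
    """Toy featurizer: composition dict -> [mean atomic mass]."""

    def featurize(self, composition):
        total_atoms = sum(composition.values())
        total_mass = sum(ATOMIC_MASS[el] * n for el, n in composition.items())
        return [total_mass / total_atoms]

    def feature_labels(self):
        return ["mean atomic mass"]

    def citations(self):
        # A real featurizer would return references to its methodology papers.
        return []


materials = {
    "TiO2": {"Ti": 1, "O": 2},
    "C": {"C": 1},
    "PbTe": {"Pb": 1, "Te": 1},
}

feat = MeanAtomicMass()
table = {name: feat.featurize(comp) for name, comp in materials.items()}

print(feat.feature_labels())
for name, row in table.items():
    print(name, [round(v, 3) for v in row])
```

Because every featurizer returns a list of numbers with matching labels, the outputs of several featurizers can be concatenated column-wise into the MATERIAL / FEATURES / PROPERTY table used throughout these slides.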
25. matminer also contains easy integration with Plotly for quickly creating interactive, shareable HTML graphs
27. Example 1: combining data from Citrine and MP to plot computed vs. experimental band gap
Materials Databases: Citrine, Materials Project
Data Retrieval
DataFrame
Data Visualization
MATERIAL        PROPERTY
TiO2 rutile     gap = 3.0 eV
C diamond       gap = 5.5 eV
…               …
PbTe rocksalt   gap = 0.3 eV
Run the full Jupyter notebook: https://github.com/hackingmaterials/matminer_examples (experiment_vs_computed_bandgap.ipynb)
28. Example 2: predicting bulk modulus from MP data
MATERIAL        FEATURES         PROPERTY
TiO2 rutile     F11 F12 … F1N    E = 400
C diamond       F21 F22 … F2N    E = 230
…               …                …
PbTe rocksalt   FM1 FM2 … FMN    E = 120
Materials Databases: Materials Project
Data Retrieval
Data Featurization
Python ML libraries
mean RMSE: 20 GPa (10-fold CV)
Run the full Jupyter notebook: https://github.com/hackingmaterials/matminer_examples (intro_predicting_bulk_modulus.ipynb)
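For readers unfamiliar with the metric quoted on this slide, the sketch below shows how a mean RMSE over 10-fold cross-validation is computed. It is not the notebook's code: the data are synthetic (one made-up descriptor and a noisy linear target), the model is a single-feature least-squares line rather than whatever the notebook uses, and the function names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def kfold_rmse(xs, ys, k=10, seed=0):
    """Return the mean of the per-fold RMSE values over k shuffled folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    fold_rmses = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Fit on the other k-1 folds only, then score on the held-out fold.
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        sq_err = [(ys[i] - (a * xs[i] + b)) ** 2 for i in test_idx]
        fold_rmses.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(fold_rmses) / k


# Synthetic data only: one made-up descriptor and a noisy linear "property".
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [20.0 * x + 50.0 + rng.gauss(0.0, 5.0) for x in xs]

print(f"mean RMSE over 10 folds: {kfold_rmse(xs, ys):.2f}")
```

Each material is predicted exactly once by a model that never saw it during fitting, so the averaged RMSE estimates the error on unseen materials rather than the fit quality on the training set.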
29. Example 3: crystal structure similarity
Goal: determine crystal structure “similarity” between all structure pairs in MP database
Example: BCC, CsCl, and Heusler are all orderings into the same essential crystal
Difficulty: different bond lengths, # of atoms, small distortions, etc.
30. Procedure for xtal structure similarity
MATERIAL        FEATURES         PROPERTY
TiO2 rutile     F11 F12 … F1N    E = 400
C diamond       F21 F22 … F2N    E = 230
…               …                …
PbTe rocksalt   FM1 FM2 … FMN    E = 120
Materials Databases: Materials Project
Data Retrieval
Data Featurization
Vector distance between features
Result: matrix of pairwise similarities between all structures in MP
SiteStatsFingerprint based on CrystalSiteFingerprint(“cn”): ~75-element vector
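The "vector distance between features" step can be sketched as follows. The fingerprints here are made-up 4-element vectors with placeholder names (the slide's fingerprints have roughly 75 elements), and Euclidean distance is an assumption, since the slide does not say which metric is used.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(fingerprints):
    """Return names and a symmetric matrix of pairwise distances."""
    names = list(fingerprints)
    matrix = [
        [euclidean(fingerprints[a], fingerprints[b]) for b in names]
        for a in names
    ]
    return names, matrix


def most_similar(target, fingerprints):
    """Rank all other entries by distance to the target (closest first)."""
    return sorted(
        (euclidean(fingerprints[target], vec), name)
        for name, vec in fingerprints.items()
        if name != target
    )


# Made-up fingerprint vectors for three placeholder structures.
fingerprints = {
    "A": [1.0, 0.0, 0.0, 0.5],
    "B": [1.0, 0.0, 0.1, 0.5],
    "C": [0.0, 1.0, 0.0, 0.2],
}

names, matrix = distance_matrix(fingerprints)
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
print(most_similar("A", fingerprints))
```

A distance near 0 means two structures have nearly identical local environments. Comparing every pair scales quadratically with the number of structures, which is why a fixed-length fingerprint per structure is convenient.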
31. Results on MP web site, e.g. for BCC-like structures
https://www.materialsproject.org/materials/mp-91/
Target: W
similar structures (distance near 0): Cs3Sb, TiGaFeCo, CeMg2Cu
32. Further information on matminer
• Link to code:
– https://www.github.com/hackingmaterials/matminer
• License: BSD
– open-source, can be used with commercial software
– like MIT license but with a clause to not misuse the Berkeley Lab name, e.g. for advertising purposes
• Help and support
– https://groups.google.com/forum/#!forum/matminer
• Expected paper submission this month …