Materials Project computation and database infrastructure
1. Materials Project computation and database infrastructure
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
Presentation given to Delaware Energy Institute, 2018
Slides (already) posted to https://hackingmaterials.lbl.gov
2. Outline
① Introduction to the Materials Project
② Materials Project computation infrastructure
③ Database considerations
3. The Materials Project database
• Online resource of density functional theory simulation data for ~85,000 inorganic materials
• Includes band structures, elastic tensors, piezoelectric tensors, battery properties and more
• >60,000 registered users
• Free
• www.materialsproject.org
Jain et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 11002 (2013).
4. Many data sets are available!
M. De Jong et al. Sci. Data, 2015, 2, 150009.
6. Outline
① Introduction to the Materials Project
② Materials Project computation infrastructure
③ Database considerations
7. A “black-box” view of performing a calculation
researcher: “What is the GGA-PBE elastic tensor of GaAs?” → “something” → Results!
8. Unfortunately, the inside of the “black box” is usually tedious and “low-level”
researcher: “What is the GGA-PBE elastic tensor of GaAs?” → lots of tedious, low-level work… → Results!
Input file flags · SLURM format · how to fix ZPOTRF?
☐ set up the structure coordinates
☐ write input files, double-check all the flags
☐ copy to supercomputer
☐ submit job to queue
☐ deal with supercomputer headaches
☐ monitor job
☐ fix error jobs, resubmit to queue, wait again
☐ repeat process for subsequent calculations in workflow
☐ parse output files to obtain results
☐ copy and organize results, e.g., into Excel
9. What would be a better way?
researcher: “What is the GGA-PBE elastic tensor of GaAs?” → “something” → Results!
10. What would be a better way?
researcher: “What is the GGA-PBE elastic tensor of GaAs?” → Results!
Workflows to run
☐ band structure
☐ surface energies
✓ elastic tensor
☐ Raman spectrum
☐ QH thermal expansion
11. Ideally the method should scale to millions of calculations
researcher: “Start with all binary oxides, replace O->S, run several different properties” → Results!
Workflows to run
✓ band structure
✓ surface energies
✓ elastic tensor
☐ Raman spectrum
☐ QH thermal expansion
☐ spin-orbit coupling
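The request above can be read as a substitution step followed by a cross product of materials and properties. The following is a minimal sketch in plain Python, not atomate or pymatgen code: the starting oxides, the helper names `substitute` and `formula`, and the property labels are all assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical starting set of binary oxides, stored as {element: amount}.
binary_oxides = [
    {"Mg": 1, "O": 1},
    {"Zn": 1, "O": 1},
    {"Ti": 1, "O": 2},
]

# Properties requested by the researcher (illustrative labels only).
properties = ["band structure", "surface energies", "elastic tensor"]


def substitute(composition, old, new):
    """Return a copy of the composition with element `old` replaced by `new`."""
    return {new if el == old else el: amt for el, amt in composition.items()}


def formula(composition):
    """Format a composition dict as a simple formula string, e.g. TiS2."""
    return "".join(f"{el}{amt if amt != 1 else ''}" for el, amt in composition.items())


sulfides = [substitute(c, "O", "S") for c in binary_oxides]

# One calculation request per (material, property) pair.
requests = [(formula(c), prop) for c, prop in product(sulfides, properties)]

for material, prop in requests:
    print(f"{material}: {prop}")
```

Three materials and three properties already give nine requests, which is why the bookkeeping has to be automated before it can scale to millions of calculations.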
12. Atomate tries to make it easy, automatic, and flexible to generate data with existing simulation packages
researcher: “Run many different properties of many different materials!” → Results!
13. Atomate contains a library of simulation procedures
VASP-based
• band structure
• spin-orbit coupling
• hybrid functional calcs
• elastic tensor
• piezoelectric tensor
• Raman spectra
• NEB
• GIBBS method
• QH thermal expansion
• AIMD
• ferroelectric
• surface adsorption
• work functions
Other
• BoltzTraP
• FEFF method
• LAMMPS MD
Mathew, K. et al., Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows, Comput. Mater. Sci. 139 (2017) 140–152.
14. Each simulation procedure translates high-level instructions into a series of low-level tasks
Quickly and automatically translate PI-style (minimal) specifications into well-defined FireWorks workflows
“What is the GGA-PBE elastic tensor of GaAs?”
M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, et al., Charting the complete elastic properties of inorganic crystalline compounds, Sci. Data 2 (2015).
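To illustrate what "translating a minimal specification into low-level tasks" can look like, here is a rough sketch in plain Python. It is not atomate's API. It assumes the elastic tensor is fitted from the stresses of strained cells, and the function name, task names, and strain magnitudes are hypothetical values chosen for the example.

```python
def elastic_tensor_tasks(formula, functional="GGA-PBE",
                         strain_magnitudes=(-0.01, -0.005, 0.005, 0.01)):
    """Expand a one-line request into an ordered list of low-level task dicts.

    Assumption: the tensor is fitted from stresses of strained cells, using
    six independent strain components at a few magnitudes each.
    """
    tasks = [{"name": "structure optimization", "formula": formula,
              "functional": functional}]
    strain_components = ["xx", "yy", "zz", "yz", "xz", "xy"]
    for component in strain_components:
        for magnitude in strain_magnitudes:
            tasks.append({
                "name": f"deformation {component} {magnitude:+.3f}",
                "formula": formula,
                "functional": functional,
                "depends_on": "structure optimization",
            })
    tasks.append({"name": "fit elastic tensor", "formula": formula,
                  "depends_on": "all deformations"})
    return tasks


tasks = elastic_tensor_tasks("GaAs")
print(len(tasks), "tasks generated from a one-line request")
for task in tasks[:3]:
    print(task["name"])
```

With these assumed settings, the single question about GaAs becomes 26 tasks: one optimization, 24 deformations, and one fitting step.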
15. Atomate thus encodes and standardizes knowledge about running various kinds of simulations from domain experts
All past and present knowledge, from everyone in the group, everyone previously in the group, and our collaborators, about how to run calculations
K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong, B. Bocklund, T. Smidt, H. Tang, I.H. Chu, M. Horton, J. Dagdelen, B. Wood, Z.K. Liu, J. Neaton, K. Persson, A. Jain
16. Full operation diagram
structure → workflow (job 1, job 2, job 3, job 4) → database of all workflows → automatically submit + execute → output files + database
18. Crystal structure generation via pymatgen
• Pymatgen can retrieve crystal structures from the Materials Project database (MPRester class)
• It can also manipulate crystal structures
– substitutions
– supercell creation
– order-disorder (see example)
– interstitial finding
– surface / slab generation
• A visual interface to many of these tools is available in Materials Project’s “Crystal Toolkit” app
Example: Order-disorder. Resolve partial or mixed occupancies into a fully ordered crystal structure (e.g., mixed oxide-fluoride site into separate oxygen/fluorine)
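To make one of the listed manipulations concrete, the sketch below shows what supercell creation means geometrically. It is plain Python written for illustration and is not pymatgen's implementation; pymatgen's structure objects provide this operation through their own methods. The function name `make_supercell` and the data layout (species list plus fractional coordinates) are assumptions.

```python
from itertools import product


def make_supercell(species, frac_coords, scaling):
    """Repeat a unit cell nx x ny x nz times.

    Returns the species list and fractional coordinates expressed in the
    enlarged cell (whose lattice vectors are nx, ny, nz times longer).
    """
    nx, ny, nz = scaling
    new_species, new_coords = [], []
    for i, j, k in product(range(nx), range(ny), range(nz)):
        for element, (x, y, z) in zip(species, frac_coords):
            new_species.append(element)
            new_coords.append(((x + i) / nx, (y + j) / ny, (z + k) / nz))
    return new_species, new_coords


# Two-atom basis of zinc-blende GaAs in its primitive cell.
species = ["Ga", "As"]
frac_coords = [(0.0, 0.0, 0.0), (0.25, 0.25, 0.25)]

super_species, super_coords = make_supercell(species, frac_coords, (2, 2, 2))
print(len(super_species), "atoms in the 2x2x2 supercell")
```

A substitution would then amount to relabeling selected entries of the species list, which is why supercells are a common starting point for doping and defect studies.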
20. Atomate’s main goal – convert structures to workflows
Workflows consist of a series of jobs (“FireWorks”), each with multiple tasks. Atomate jobs typically (i) run a calculation and (ii) store the results in a database.
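The structure described here (jobs with dependencies, each job running a calculation and then storing its result) can be sketched with the Python standard library. This is not the FireWorks API. The dependency layout of the four jobs, the function names, and the in-memory list standing in for the database are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# In-memory stand-in for the results database.
database = []


def run_calculation(job_name):
    """Task (i): pretend to run a calculation and return its result."""
    return {"job": job_name, "status": "completed"}


def store_result(result):
    """Task (ii): store the result document in the database."""
    database.append(result)


# Each job maps to the set of jobs that must finish before it can start.
workflow = {
    "job 1": set(),
    "job 2": {"job 1"},
    "job 3": {"job 2"},
    "job 4": {"job 2"},
}

# Run jobs in an order that respects the dependencies.
for job_name in TopologicalSorter(workflow).static_order():
    store_result(run_calculation(job_name))

print([doc["job"] for doc in database])
```

Jobs 3 and 4 depend only on job 2, so a real workflow manager could run them in parallel; the sketch simply runs them one after the other.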
22. FireWorks allows you to write your workflow once and execute (almost) anywhere
• Execute workflows locally or at a supercomputing center
• Queue systems supported
– PBS
– SGE
– SLURM
– IBM LoadLeveler
– NEWT (a REST-based API at NERSC)
– Cobalt (Argonne LCF)
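One way to picture "write once, execute anywhere" is that the same job settings are rendered into different queue submission scripts. The sketch below is an illustration only and is not how FireWorks is implemented; FireWorks ships its own queue adapters. The template dictionary and the `render_queue_script` helper are hypothetical, while the `#SBATCH` and `#PBS` directives are standard SLURM and PBS syntax.

```python
from string import Template

# Minimal submission-script templates for two of the queue systems listed.
QUEUE_TEMPLATES = {
    "SLURM": Template(
        "#!/bin/bash\n"
        "#SBATCH --job-name=$job_name\n"
        "#SBATCH --nodes=$nodes\n"
        "#SBATCH --time=$walltime\n"
        "$command\n"
    ),
    "PBS": Template(
        "#!/bin/bash\n"
        "#PBS -N $job_name\n"
        "#PBS -l nodes=$nodes\n"
        "#PBS -l walltime=$walltime\n"
        "$command\n"
    ),
}


def render_queue_script(queue_type, **settings):
    """Fill in the template for the requested queue system."""
    return QUEUE_TEMPLATES[queue_type].substitute(**settings)


# The same settings are reused for both queue systems.
settings = {
    "job_name": "GaAs_elastic",
    "nodes": 1,
    "walltime": "01:00:00",
    # FireWorks' launcher command; any shell command could go here.
    "command": "rlaunch rapidfire",
}

slurm_script = render_queue_script("SLURM", **settings)
pbs_script = render_queue_script("PBS", **settings)
print(slurm_script)
print(pbs_script)
```

Only the template differs between machines, so the workflow definition itself never has to mention the queue system.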
24. Other features
• Job provenance and automatic metadata storage
• Detect and rerun failures
• “Dynamic” workflows that change behavior based on results
• Customize job priorities
• Much more…
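The ideas of failure detection, rerunning, and recording provenance can be sketched together in a few lines. This is a toy illustration in plain Python, not FireWorks code: the job, its failure condition, and the setting names (`algorithm`, `fast`, `conservative`) are invented for the example.

```python
def run_job(settings):
    """Hypothetical job: fails unless a conservative algorithm is selected."""
    if settings["algorithm"] == "fast":
        return {"status": "failed", "error": "did not converge"}
    return {"status": "completed"}


def run_with_recovery(settings, max_attempts=3):
    """Run a job, adjusting settings and rerunning when a failure is detected."""
    history = []  # provenance: every attempt and its outcome is recorded
    for attempt in range(1, max_attempts + 1):
        result = run_job(settings)
        history.append({"attempt": attempt, "settings": dict(settings), **result})
        if result["status"] == "completed":
            break
        # "Dynamic" behavior: the next attempt depends on the previous result.
        settings = {**settings, "algorithm": "conservative"}
    return history


history = run_with_recovery({"algorithm": "fast"})
for record in history:
    print(record["attempt"], record["settings"]["algorithm"], record["status"])
```

Keeping the full attempt history, rather than only the final result, is what makes it possible to audit later how a given number was obtained.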
27. The atomate database makes it easy to perform various analyses with pymatgen
atomate output database(s) →
• phase diagrams
• Pourbaix diagrams
• diffusivity via MD
• band structure analysis
28. Many research groups have run tens of thousands of materials science workflows with atomate
also used by:
• Persson research group, UC Berkeley
• Ong research group, UC San Diego
• Neaton research group, UC Berkeley
• Liu research group, Penn State
• Groups not developing on atomate!
– e.g., see “Thermal expansion of quaternary nitride coatings” by Tasnadi et al.
atomate now powers the Materials Project and will be used to run hundreds of thousands of simulations in the next year (www.materialsproject.org)
29. Outline
① Introduction to the Materials Project
② Materials Project computation infrastructure
③ Database considerations
30. About a decade ago, we were using a SQL infrastructure
Main problems we ran into:
• Too static – every time we wanted to store a new kind of data, the DB master needed to “design and update” the database schema
• Too difficult for newcomers – constructing queries (joins, etc.) was hard. We actually designed a system to help people make queries, which is a common workaround
31. Since then, we have switched to MongoDB – a “NoSQL” database
Major advantages
• Very dynamic – easy to add new data types without interfering with old data types or redesigning everything. No central “database master” needed
• Easy for newcomers – easy syntax, no complex “joins”, easy to visualize results
• Easy object-relational mapping – built our pymatgen code so that any objects (e.g., band structures, crystal structures, etc.) could be exported to a database or imported from a database easily
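The export/import idea in the last bullet can be sketched as a pair of methods that convert an object to and from nested key-value pairs. The class below is a hypothetical stand-in written for illustration, not a pymatgen class, and the lattice parameter is an illustrative value; the `as_dict` / `from_dict` naming follows the convention pymatgen objects use.

```python
import json


class SimpleStructure:
    """Hypothetical stand-in for a crystal-structure object."""

    def __init__(self, formula, lattice_parameter, sites):
        self.formula = formula
        self.lattice_parameter = lattice_parameter
        self.sites = sites

    def as_dict(self):
        """Export the object as nested key-value pairs ready for a document store."""
        return {
            "formula": self.formula,
            "lattice_parameter": self.lattice_parameter,
            "sites": self.sites,
        }

    @classmethod
    def from_dict(cls, d):
        """Rebuild the object from a stored document."""
        return cls(d["formula"], d["lattice_parameter"], d["sites"])


original = SimpleStructure(
    "GaAs", 5.65,
    [{"element": "Ga", "frac_coords": [0.0, 0.0, 0.0]},
     {"element": "As", "frac_coords": [0.25, 0.25, 0.25]}],
)

# Round trip through JSON, similar to how a document database would store it.
stored = json.dumps(original.as_dict())
restored = SimpleStructure.from_dict(json.loads(stored))
print(restored.formula, len(restored.sites))
```

Because the document is just nested dictionaries and lists, adding a new attribute later means adding a key, with no schema migration.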
32. How we store computed data
Data is stored in “collections”. Each collection is a set of documents that can be queried.
Each document consists of nested key-value pairs (“dictionaries”) or arrays.
e.g., one can search for {"tags": "phosphides"} to retrieve all documents tagged with “phosphides”
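The sketch below emulates this kind of query on a plain Python list so it runs without a database server. It reproduces only one MongoDB matching rule (a query value matches a field equal to it, or an array field containing it); the `matches` and `find` helpers and the example documents are assumptions. Against a real MongoDB collection, the equivalent pymongo call would be `collection.find({"tags": "phosphides"})`.

```python
def matches(document, query):
    """Return True if the document satisfies a simple equality query.

    Mimics one MongoDB rule: a query value matches a field that equals it,
    or an array field that contains it.
    """
    for key, wanted in query.items():
        value = document.get(key)
        if isinstance(value, list):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


def find(collection, query):
    """Return all documents in the collection that match the query."""
    return [doc for doc in collection if matches(doc, query)]


# Hypothetical documents in a materials collection.
materials = [
    {"formula": "GaP", "tags": ["phosphides", "semiconductors"]},
    {"formula": "GaAs", "tags": ["arsenides", "semiconductors"]},
    {"formula": "InP", "tags": ["phosphides"]},
]

results = find(materials, {"tags": "phosphides"})
print([doc["formula"] for doc in results])
```

The query is itself a dictionary with the same shape as the documents, which is a large part of why newcomers find it easier than writing joins.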
33. Each collection has a set of standard keys
materials collection – each document represents a material, with keys like “formula” and “band_gap”
tasks collection – each document represents a DFT calculation, with keys like “dir_name” and “input.parameters”
workflows collection – each document represents a calculation workflow, with keys like “nodes” and “links”
Typically, each document within a collection will be of a uniform format, but this is not a hard requirement in MongoDB.
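A key such as “input.parameters” uses dot notation to reach into a nested document. The following sketch shows that lookup in plain Python, along with a document that lacks the key, to illustrate the non-uniform format mentioned above. The `get_nested` helper, the directory names, and the parameter value are hypothetical.

```python
def get_nested(document, dotted_key, default=None):
    """Look up a nested value using dot notation, e.g. "input.parameters"."""
    value = document
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return default
        value = value[part]
    return value


# Two hypothetical task documents; the second lacks the "input" key entirely,
# which a document store tolerates.
tasks = [
    {"dir_name": "/calcs/run_0001", "input": {"parameters": {"ENCUT": 520}}},
    {"dir_name": "/calcs/run_0002"},
]

for task in tasks:
    print(task["dir_name"], get_nested(task, "input.parameters"))
```

Code that reads such collections should therefore handle missing keys gracefully, as the default argument does here.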
34. Two approaches to store data in MongoDB
1. As described previously: for each data type (a “material”, “task”, “workflow”, etc.) decide on a set of fields that describe each instance of that data type. In MongoDB, these fields can easily be changed or added to later if needed.
2. Try to create a single collection and document format that can handle any kind of materials data!
– example 1: “PIF” file format from Citrine [1]
– example 2: MPContribs from Materials Project [2]
[1] J. O’Mara, B. Meredig, K. Michel, Materials Data Infrastructure: A Case Study of the Citrination Platform to Examine Data Import, Storage, and Access, JOM (2016).
[2] P. Huck, D. Gunter, S. Cholia, D. Winston, A.T. N’Diaye, K. Persson, User applications driven by the community contribution framework MPContribs in the Materials Project, Concurr. Comput. Pract. Exp. 22 (2015).
37. Who to talk to next!
The current “Guardians of the MP infrastructure”
Funding: DOE-BES Materials Science Division, Computing: NERSC