Open-source tools for generating and analyzing large materials data sets
1. Open-source tools for generating and
analyzing large materials data sets
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
ACS Spring, April 2017
Slides (already) posted to http://www.slideshare.net/anubhavster
Link is also listed at end of talk
2. 2
“Civilization advances by extending the number of
important operations which we can perform
without thinking about them.”
- Alfred North Whitehead
3. We don’t work on catalysis, but we do write software
• We don’t do research into heterogeneous
catalysis
• We do build software to:
– execute millions of calculations on supercomputing
centers
– make it more straightforward to run density
functional theory calculations (mostly VASP, some
Gaussian/Q-Chem)
– perform structural manipulations
– analyze the results of calculations
4. Software technologies that we contribute to
Base packages:
• pymatgen (materials analysis framework)
• custodian (calculation error recovery)
• FireWorks (workflow definition & execution)
Derived packages:
• atomate (automatic materials science workflows)
• matminer (materials data mining)
These are all open-source:
• FireWorks, atomate, and matminer are led by our group
• pymatgen and custodian are led by Prof. Ong's group (UC San Diego)
• All developed in coordination with the Persson group (UC Berkeley)
5. Applications: The Materials Project database
Jain*, Ong*, Hautier, Chen, Richards, Dacek, Cholia, Gunter, Skinner, Ceder, and
Persson, APL Mater., 2013, 1, 011002. (*equal contributions)
The Materials Project (http://www.materialsproject.org)
free and open
~30,000 registered users
around the world
>65,000 compounds
calculated
Data includes
• thermodynamic props.
• electronic band structure
• aqueous stability (E-pH)
• elasticity tensors
• piezoelectric tensors
>75 million CPU-hours
invested = massive scale!
6. Applications: The Electrolyte Genome
data on ~22,000 molecules
(mainly geometry + IP/EA via
full adiabatic calcs)
Also deployed on the
Materials Project web site
L. Cheng, R.S. Assary, X. Qu, A. Jain, S.P. Ong, N.N. Rajput, et al.,
J. Phys. Chem. Lett. 6 (2015) 283–291.
X. Qu, A. Jain, N.N. Rajput, L. Cheng, Y. Zhang, S.P. Ong, et al.,
Comput. Mater. Sci. 103 (2015) 56–67.
7. Applications: Crystalium (Ong / Persson)
http://crystalium.materialsvirtuallab.org
surface energies for 142 polymorphs of
72 elements + rotatable Wulff shapes
certainly applicable to catalysis
computed & maintained by the Ong
group (UC San Diego) with support
from Persson Group (UC Berkeley)
R. Tran, Z. Xu, B. Radhakrishnan, D. Winston, W. Sun,
K. A. Persson, and S. P. Ong, Sci. Data, 2016, 3, 160080.
8. Applications: Rapid data generation
• >4500 elastic tensors
– M. de Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. Van Der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009.
• >900 piezoelectric tensors
– M. de Jong, W. Chen, H. Geerlings, M. Asta, and K. A. Persson, Sci. Data, 2015, 2, 150053.
• >48000 Seebeck coefficients + cRTA transport
– Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier (in submission)
9. Let’s revisit the libraries
Base packages:
• pymatgen (materials analysis framework)
• custodian (calculation error recovery)
• FireWorks (workflow definition & execution)
Derived packages:
• atomate (automatic materials science workflows)
• matminer (materials data mining)
These are all open-source:
• FireWorks, atomate, and matminer are led by our group
• pymatgen and custodian are led by Prof. Ong's group (UC San Diego)
• All developed in coordination with the Persson group (UC Berkeley)
10. pymatgen – object-oriented materials analysis
www.pymatgen.org
Ong, S. P., Richards, W. D., Jain, A., Hautier, G., Kocher, M., Cholia, S.,
Gunter, D., Chevrier, V. L., Persson, K. A. & Ceder, G. Python Materials
Genomics (pymatgen): A robust, open-source python library for
materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).
11. pymatgen – examples of analyses
• phase diagrams
• Pourbaix diagrams
• diffusivity from MD
• band structure analysis
12. pymatgen - many useful tools made accessible
• Structure Matcher
– analyzes if two periodic structures are equivalent, even if they are in different settings or have minor distortions
• Order-disorder
– resolves partial or mixed occupancies into a fully ordered crystal structure (e.g., mixed oxide-fluoride site into separate oxygen/fluorine)
Many other tools, such as:
• Automatic surface slab generator
• Bond-valence sums to determine valence
• Voronoi coordination as well as 3D coordination polyhedron analysis
• Automatically find and insert interstitial sites
• Powder diffraction pattern generation
• Simple cost and materials availability estimators
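To give a feel for the kind of calculation behind a tool such as powder diffraction pattern generation, the peak positions of a cubic lattice follow directly from Bragg's law. The sketch below uses only the Python standard library and is not pymatgen's API; the function name `cubic_two_theta` and the 4.0 Å lattice parameter are hypothetical, and intensities and systematic absences are ignored.

```python
import math

CU_K_ALPHA = 1.5406  # Cu K-alpha1 wavelength in angstroms


def cubic_two_theta(a, hkl, wavelength=CU_K_ALPHA):
    """Return the Bragg angle 2-theta (degrees) for plane hkl of a cubic cell.

    a is the lattice parameter in angstroms. Returns None when the
    reflection cannot be reached at this wavelength.
    """
    h, k, l = hkl
    d = a / math.sqrt(h * h + k * k + l * l)  # interplanar spacing
    s = wavelength / (2.0 * d)                # sin(theta), first order
    if s > 1.0:
        return None
    return 2.0 * math.degrees(math.asin(s))


# Hypothetical cubic cell with a = 4.0 angstroms (illustrative value only)
peaks = {hkl: cubic_two_theta(4.0, hkl)
         for hkl in [(1, 0, 0), (1, 1, 0), (1, 1, 1)]}
```

A library such as pymatgen additionally handles arbitrary lattices, structure factors, and intensities, which is the part that is tedious to rewrite by hand.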
13. custodian – fixing job errors
• Custodian can wrap
around an executable
(e.g., VASP)
– i.e., run custodian instead of
directly running VASP
• During execution,
custodian will monitor
output files and detect
errors / problems
– If so, it can change input files
and rerun the job
– e.g., if ZPOTRF error
detected, rerun with ISYM=0
– ever-expanding library of
fixes
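The monitor, fix, and rerun cycle described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not custodian's actual API: `run_with_fixes`, `fake_job`, and `HANDLERS` are hypothetical names, the "job" is a stand-in function rather than a real executable, and the ENCUT value is arbitrary. Only the ZPOTRF / ISYM=0 pairing comes from the slide.

```python
def run_with_fixes(run_job, settings, handlers, max_attempts=5):
    """Run a job, apply the first matching fix on failure, and rerun.

    run_job(settings) -> str : returns the job's output log (hypothetical)
    handlers                 : list of (error_marker, settings_update) pairs
    """
    history = []
    for _ in range(max_attempts):
        log = run_job(settings)
        fix = next((update for marker, update in handlers if marker in log), None)
        if fix is None:
            return settings, history  # no known error detected
        history.append(fix)
        settings = {**settings, **fix}  # change the inputs, then rerun
    raise RuntimeError("job still failing after %d attempts" % max_attempts)


# Stand-in for a real executable: reports ZPOTRF unless ISYM is 0.
def fake_job(settings):
    return "done" if settings.get("ISYM") == 0 else "LAPACK: Routine ZPOTRF failed!"


HANDLERS = [("ZPOTRF", {"ISYM": 0})]
final_settings, applied = run_with_fixes(fake_job, {"ISYM": 2, "ENCUT": 520}, HANDLERS)
```

Keeping the error markers and their fixes in a plain list is what lets the library of fixes grow without touching the run loop.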
13
14. FireWorks – scientific workflow software
• FireWorks is open-source scientific
workflow software
• Materials Project, JCESR, and other
projects manage their runs with
FireWorks
– >1 million jobs
– >100 million CPU-hours
– multiple computing clusters
• You can write any kind of workflow
– e.g., FireWorks is used for graphics
processing, machine learning, document
processing, and protein folding
– #1 Google hit for “Python workflow
software”, top 5 for general scientific
workflow software
• Detailed tutorials are available
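At its core, a workflow is a set of tasks plus the dependencies between them, executed in an order that respects those dependencies. The sketch below shows that idea with the standard library's `graphlib` (Python 3.9+). It is not the FireWorks API, which adds a database-backed queue, job tracking, and failure handling; `run_workflow` and the three task names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_workflow(tasks, parents):
    """Execute tasks in dependency order.

    tasks   : dict mapping task name -> callable taking the results dict
    parents : dict mapping task name -> set of task names it depends on
    """
    results = {}
    for name in TopologicalSorter(parents).static_order():
        results[name] = tasks[name](results)
    return results


# Hypothetical three-step workflow: relax -> static -> analysis
tasks = {
    "relax": lambda r: "relaxed structure",
    "static": lambda r: "energy of " + r["relax"],
    "analysis": lambda r: "report on " + r["static"],
}
parents = {"relax": set(), "static": {"relax"}, "analysis": {"static"}}
results = run_workflow(tasks, parents)
```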
Jain, A., Ong, S. P., Chen, W., Medasani, B., Qu, X., Kocher, M., Brafman, M.,
Petretto, G., Rignanese, G.-M., Hautier, G., Gunter, D. & Persson, K. A.
FireWorks: a dynamic workflow system designed for high-throughput
applications. Concurr. Comput. Pract. Exp. 27, 5037–5059 (2015).
www.pythonhosted.org/FireWorks
15. FireWorks – screenshot of jobs status
Live version at http://fireworks.dash.materialsproject.org
16. atomate – our newest code (currently in beta)
Redesigns an older, clunkier code (MPWorks)
Translates minimal specifications into well-defined
FireWorks workflows (FireWorks handles all the
execution and job management details)
“What is the GGA-PBE elastic tensor of GaAs?”
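The idea of expanding a one-line request into a full workflow can be sketched as follows. This is an illustration only and not atomate's API: `elastic_workflow`, the step names, and the default of six deformations are placeholders chosen for the example.

```python
def elastic_workflow(formula, functional="PBE", n_deformations=6):
    """Expand a one-line request into an ordered list of calculation steps.

    Each step records its parents so a workflow engine could schedule it.
    The step names and n_deformations default are illustrative placeholders.
    """
    steps = [{"name": "structure optimization", "formula": formula,
              "functional": functional, "parents": []}]
    for i in range(n_deformations):
        # Every deformed-cell calculation depends on the optimized structure.
        steps.append({"name": "deformation %d" % i, "formula": formula,
                      "functional": functional,
                      "parents": ["structure optimization"]})
    # The final fit needs the results of all deformation calculations.
    steps.append({"name": "fit elastic tensor", "formula": formula,
                  "functional": functional,
                  "parents": [s["name"] for s in steps[1:]]})
    return steps


wf = elastic_workflow("GaAs")
```

The user supplies only the compound and the property of interest; every other setting comes from defaults that encode the group's accumulated practice.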
17. Advantages – reduce specialization
Because computational methods have a steep learning
curve, a single group member is often assigned to
each technique
“Alice knows how to do charged defect calculations.”
“Bob is the one who can properly converge GW runs.”
“Olga has all the scripts for phonon calculations.”
18. Advantages – reduce errors
Let’s take a look at two alternate universes:
• Universe #1: researcher has coffee; copies files from previous simulation; edits 5 lines; runs simulation, creates report
• Universe #2: researcher forgets coffee; copies files from previous simulation; edits 4 lines, forgets LHFCALC=F; creates report, which looks fine at first; in a month discovers it used the wrong functional
Automation reduces your chances of being caught in universe #2!
19. atomate – what’s available?
• band structure
• spin-orbit coupling
• hybrid functional calcs
• elastic tensor
• piezoelectric tensor
• Raman spectra
• NEB
• GIBBS method
• QH thermal expansion
• AIMD
• FEFF method
• LAMMPS MD
All past and present knowledge, from
everyone in the group, everyone previously
in the group, and outside collaborators,
about how to run calculations
K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong, B. Bocklund, T. Smidt, H. Tang
20. matminer (still in alpha)
matminer’s goal: help enable data mining studies
in materials science
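A typical first step in a data mining study is to turn each material into a fixed-length row of numbers. The sketch below does this for chemical formulas using atomic fractions. It uses only the standard library and is not matminer's API; `composition_fractions` and `featurize` are hypothetical names, and the parser handles only simple formulas without parentheses or charges.

```python
import re


def composition_fractions(formula):
    """Parse a simple formula such as 'Fe2O3' into atomic fractions.

    Handles element symbols followed by optional numeric amounts; it does
    not handle parentheses or charges.
    """
    amounts = {}
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*(?:\.\d+)?)", formula):
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(count) if count else 1.0)
    total = sum(amounts.values())
    return {symbol: n / total for symbol, n in amounts.items()}


def featurize(formulas, elements):
    """Turn formulas into fixed-length rows, one column per element."""
    rows = []
    for formula in formulas:
        fractions = composition_fractions(formula)
        rows.append([fractions.get(el, 0.0) for el in elements])
    return rows


table = featurize(["Fe2O3", "GaAs"], elements=["As", "Fe", "Ga", "O"])
```

Rows in this form can be passed directly to standard machine learning tools, which is the gap a data mining library for materials aims to close.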
21. matminer usage
• Examples of usage on the GitHub page:
– https://github.com/hackingmaterials/matminer
• Coming next: new types of crystal structure
descriptors based on local environment
22. Some lessons learned (1)
• In the beginning, strong central coordination from
authority was needed to develop these
– require that people contribute to common code, e.g.
pymatgen, and not write their own detached scripts
• Once a code was “established”, less authority was
needed
– people voluntarily contributed improvements rather than
writing their own code because this benefited them
• Today the process is almost completely
decentralized
– culture has changed
– even for new codes, people rally around it rather than
build independent things
23. Some lessons learned (2)
• It is helpful to have a strong BDFL (benevolent
dictator for life) for each codebase
• Requirements for the BDFL:
– very detail-oriented
– cares about the code itself, not just the application
– cares more about the code quality than about offending
teammates, i.e., will not accept poor quality contributions
– at the same time, able to rally support from people and
convince them to contribute or clean up code
– willing to work overtime to do things like write detailed
docs, answer questions from users, advocate for the code,
review commits, etc.
– derives joy from building and deploying things!
24. Some lessons learned (3)
• Computer scientists are useful for staying up to date in
the fast-moving world of software
– 2006: I took a graduate class in databases at a top CS
university; all SQL, not a single mention of “NoSQL”
– 2007: we use SQL to build a precursor to Materials Project
– 2011: We are designing the framework for Materials Project; I
have lots of experience with SQL and am confident this is the way
to go; a computer scientist casually mentions NoSQL, its
growing prominence, and its potential applicability to our
problem
– 2017: We do almost everything in NoSQL
• Lesson: software moves fast! Much faster than materials
science knowledge or methods. Don’t use “up to date”
data from 5 years ago to inform your decision.
25. Further resources
• The GitHub sites
– www.github.com/materialsproject
– www.github.com/hackingmaterials
• Software Carpentry
– https://software-carpentry.org
26. Acknowledgements
• Research group of Prof. Shyue Ping Ong
• Research group of Prof. Kristin Persson
• Funding: US Dept of Energy, Materials Science
Division
…and all extended collaborators for these various
projects!
Slides (already) posted to http://www.slideshare.net/anubhavster