Software tools to facilitate materials science research
1. Software tools to facilitate materials science research
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
S2I2 Workshop, Feb 2017
Slides (already) posted to http://www.slideshare.net/anubhavster
2. What we work on
• We don’t develop or debut the new and fashionable computational methods
• We adopt methods, standardize the parts that are ready for mass reproduction, and execute them over thousands of materials
3. Our research interests as materials scientists
High-throughput calculations
(each point is a possible battery cathode)
Discovery of new functional materials
(e.g., new bulk thermoelectrics)
4. A user’s perspective of materials simulation
[Diagram: PI asks “What is the GGA-PBE elastic tensor of GaAs?” → “something” → Results]
5. A user’s perspective of materials simulation
[Diagram: PI asks “What is the GGA-PBE elastic tensor of GaAs?” → “something” = student/postdoc → Results]
• Input file flags
• Queue format
• How to fix ZPOTRF?
6. Why this system?
• It works!
• Many aspects of running
simulations seem tailor-
made for assigning to
students/postdocs
– requires specialized
knowledge
– labor intensive
– helpful to have a high pain
threshold
• But there are also
disadvantages…
Nicola Marzari’s “Middle Age Workshop” analogy
7. Staff specialization can get out of control
Because of the steep learning curve of computational methods, there is often a single group member assigned to each technique
“Alice knows how to do charged defect calculations.”
“Bob is the one who can properly converge GW runs.”
“Olga has all the scripts for phonon calculations.”
8. Errors are all too common
Let’s take a look at two alternate universes:
• Universe 1: student has coffee
– copies files from previous simulation
– edits 5 lines
– runs simulation, delivers report
• Universe 2: student forgets coffee
– copies files from previous simulation
– edits 4 lines, forgets LHFCALC=F
– delivers report, which looks fine at first; a month later you discover it was wrong
Which universe are you in?
Are you sure?
9. Takes too long to get results
• Calculations are labor intensive!
– set up the structure coordinates
– write input files, double-check all the flags
– copy to supercomputer
– submit job to queue
– deal with supercomputer headaches
– monitor job
– fix error jobs, resubmit to queue, wait again
– repeat process for subsequent calculations in
workflow
– parse output files to obtain results
– copy and organize results, e.g., into Excel
10. There is a lot of back-and-forth in the analysis
• Student/postdoc presents PowerPoint/Excel of the results
• PI wants to know certain details or follow up based on the data, which are missing from the PowerPoint/Excel
• Student/postdoc says “I will get back to you”, goes back to office, re-processes the data, and prepares a revised report within a few days
• Repeat…
11. What would be a better way?
[Diagram: PI asks “What is the GGA-PBE elastic tensor of GaAs?” → “something” = a computer → Results]
12. Reduce specialization
All past and present knowledge, from everyone in the group, everyone previously in the group, and outside collaborators, about how to run calculations
13. Reduce errors and improve efficiency
• Computers can’t forget to set an input flag
• Computers (in theory) can create, correct,
submit, parse, and deliver the results of
calculations much faster than even the fastest
student
14. Improve analytics / visualization
• Excel and PowerPoint work for a curated view of the results
• But online analytics would allow you to do things like:
– view crystal structures on demand
– generate the plot you want
15. So this is the vision we want – is it achievable?
[Diagram: PI asks “What is the GGA-PBE elastic tensor of GaAs?” → “something” = a computer → Results]
16. Yes! – and it is available on Materials Project
www.materialsproject.org “Crystal Toolkit”: anyone can find, edit, and submit (suggest) structures
[Diagram: custom material → Submit → automated pipeline]
• Input generation (parameter choice)
• Workflow mapping
• Supercomputer submission / monitoring
• Error handling
• File transfer
• File parsing / DB insertion
Currently, this feature is available for:
• structure optimization
• band structures
• elastic tensors
17. Software technologies to enable automation
Base packages:
• pymatgen (materials analysis framework)
• custodian (calculation error recovery)
• FireWorks (workflow framework and supercomputer interface)
Derived package:
• atomate (automatic materials science workflows)
These are all open-source:
• pymatgen and custodian are led by Prof. Ong’s group (UC San Diego)
• Developed in coordination with the Materials Project and the Persson group
18. pymatgen – object-oriented materials analysis
www.pymatgen.org
Ong, S. P., Richards, W. D., Jain, A., Hautier, G., Kocher, M., Cholia, S., Gunter, D., Chevrier, V. L., Persson, K. A. & Ceder, G. Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).
19. pymatgen – examples of analyses
• phase diagrams
• Pourbaix diagrams
• diffusivity from MD
• band structure analysis
20. pymatgen - many useful tools made accessible
• Structure Matcher: analyzes if two periodic structures are equivalent, even if they are in different settings or have minor distortions
• Order-disorder: resolves partial or mixed occupancies into a fully ordered crystal structure (e.g., mixed oxide-fluoride site into separate oxygen/fluorine)
Many other tools, such as:
• Bond-valence sums to determine valence
• Voronoi coordination as well as 3D coordination polyhedron analysis
• Automatically find and insert interstitial sites
• Diffraction pattern modeling
• Simple cost and materials availability estimators
21. custodian – fixing job errors
• Custodian can wrap around an executable (e.g., VASP)
– i.e., run custodian instead of directly running VASP
• During execution, custodian will monitor output files and detect errors / problems
– If it finds one, it can change input files and rerun the job
– e.g., if a ZPOTRF error is detected, rerun with ISYM=0
– ever-expanding library of fixes
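The wrap, monitor, fix, and rerun loop can be sketched in plain Python. This is only an illustration of the idea and does not use custodian's actual API: `run_with_recovery`, `FIXES`, `fake_job`, and the `ENCUT` value are hypothetical. Only the ZPOTRF → ISYM=0 rule comes from the slide.

```python
from typing import Callable, Dict

# Hypothetical fix table: error keyword found in output -> input flags to change.
# Only the ZPOTRF -> ISYM=0 rule is taken from the slide.
FIXES: Dict[str, Dict[str, str]] = {"ZPOTRF": {"ISYM": "0"}}


def run_with_recovery(job: Callable[[Dict[str, str]], str],
                      inputs: Dict[str, str],
                      max_reruns: int = 3) -> Dict[str, str]:
    """Run `job`, scan its output for known errors, patch the inputs, and rerun.

    `job` stands in for launching the real executable (e.g., VASP): it takes
    the input flags and returns the text of the output file.
    Returns the input flags of the first run that shows no known error.
    """
    inputs = dict(inputs)  # work on a copy so the caller's dict is untouched
    for _ in range(max_reruns + 1):
        output = job(inputs)
        detected = [err for err in FIXES if err in output]
        if not detected:
            return inputs
        for err in detected:
            inputs.update(FIXES[err])
    raise RuntimeError(f"job still failing after {max_reruns} reruns")


def fake_job(inputs: Dict[str, str]) -> str:
    """Toy stand-in for a simulation that fails unless ISYM=0."""
    if inputs.get("ISYM") != "0":
        return "LAPACK: Routine ZPOTRF failed!"
    return "reached required accuracy"


final_inputs = run_with_recovery(fake_job, {"ISYM": "2", "ENCUT": "520"})
print(final_inputs)
```

Keeping the fixes in a table rather than in the control flow is what lets the library of fixes keep expanding without touching the loop.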
22. FireWorks – scientific workflow software
• FireWorks is an open-source scientific
workflow software
• Materials Project, JCESR, and other
projects manage their runs with
FireWorks
– >1 million jobs
– >100 million CPU-hours
– multiple computing clusters
• You can write any workflow
– e.g., FireWorks is used for graphics
processing, machine learning, document
processing, and protein folding
– #1 Google hit for “Python workflow
software”, top 5 for general scientific
workflow software
• Detailed tutorials are available
Jain, A., Ong, S. P., Chen, W., Medasani, B., Qu, X., Kocher, M., Brafman, M., Petretto, G., Rignanese, G.-M., Hautier, G., Gunter, D. & Persson, K. A. FireWorks: a dynamic workflow system designed for high-throughput applications. Concurr. Comput. Pract. Exp. 27, 5037–5059 (2015).
www.pythonhosted.org/FireWorks
23. FireWorks – screenshot of jobs status
Live version at http://fireworks.dash.materialsproject.org
24. atomate – our newest code (redesigns our older codes)
Translates PI-style (minimal) specifications into well-defined FireWorks workflows (FireWorks handles all the execution and job management details)
Example specification: “What is the GGA-PBE elastic tensor of GaAs?”
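A rough sketch of what "translating a minimal specification into a workflow" could look like, in plain Python. It does not use the atomate or FireWorks APIs. The `Step` class, the `RECIPES` table, the step names, and `build_workflow` are all hypothetical and only show the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One calculation in a workflow; `parents` lists steps that must finish first."""
    name: str
    settings: Dict[str, str]
    parents: List[str] = field(default_factory=list)


# Hypothetical recipe library: requested property -> ordered step names.
RECIPES: Dict[str, List[str]] = {
    "elastic tensor": ["structure optimization",
                       "deformed structure calculations",
                       "fit elastic tensor"],
    "band structure": ["structure optimization",
                       "static calculation",
                       "band structure calculation"],
}


def build_workflow(formula: str, functional: str, prop: str) -> List[Step]:
    """Expand a minimal, PI-style request into an explicit linear chain of steps."""
    if prop not in RECIPES:
        raise ValueError(f"no recipe for property: {prop!r}")
    steps: List[Step] = []
    for name in RECIPES[prop]:
        # Each step depends on the one before it; the first has no parents.
        parents = [steps[-1].name] if steps else []
        settings = {"formula": formula, "functional": functional}
        steps.append(Step(name, settings, parents))
    return steps


workflow = build_workflow("GaAs", "GGA-PBE", "elastic tensor")
for step in workflow:
    print(step.name, "<-", step.parents)
```

The requester supplies only three strings. Everything else (which calculations to run, in what order, with what settings) lives in the shared recipe table, which is where the group's accumulated knowledge would be recorded.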
25. atomate – what’s available?
K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong
• band structure
• spin-orbit coupling
• hybrid functional calcs
• elastic tensor
• piezoelectric tensor
• Raman spectra
• GIBBS method
• QH thermal expansion
• AIMD
• FEFF method
• LAMMPS MD
All past and present knowledge, from everyone in the group, everyone previously in the group, and outside collaborators, about how to run calculations
26. Further resources
• The GitHub sites
– www.github.com/materialsproject
– www.github.com/hackingmaterials
• Software Carpentry
– https://software-carpentry.org
27. Needed: better way to learn methods
• It can take many months, and perhaps even an internship in a
group with relevant expertise, to learn to use a new method
• Workshops are one way to speed the process
• However, self-serve ways to learn new methods would be
wonderful
– e.g., web tutorials that mix together theory and practice
• Consider: what fraction of people could learn to correctly use
your code/method given only a single web link and no direct
communication with anyone? (they are allowed to find and
use other web resources based on the initial link)
– Example: https://www.youtube.com/user/MaterialsProject
28. Needed: curation of tools and methods
• A place to kick-start discovery and learning of
new codes and tools:
– “Too basic” example: http://materials.sh (Shyue Ping
Ong, UCSD)
– “Too complex/messy” example: Nanohub
29. Needed: standardizing data *containers*
• Different codes will have different inputs and
outputs, so obviously data organization will vary
• But the “container” of the data organization can be
consistent. e.g., you can represent arrays within:
– JSON
– YAML
– XML
– HDF5
– but don’t invent your own format to represent an array!
• Some of these container formats are human-
readable, i.e., easy to edit in a text editor
• No more “code parses custom input file format to
produce custom output file format”
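As a small illustration of the container idea, the sketch below stores an array in JSON using only the Python standard library. The field names and the numbers in `record` are placeholders chosen for the example, not real data or an established schema.

```python
import json

# Illustrative record holding a 3x3 array; the values are placeholders, not real data.
record = {
    "formula": "GaAs",
    "units": "GPa",
    "tensor": [[1.0, 0.5, 0.5],
               [0.5, 1.0, 0.5],
               [0.5, 0.5, 1.0]],
}

text = json.dumps(record, indent=2)  # human-readable, editable in a text editor
restored = json.loads(text)          # any JSON parser can read it back

print(text)
```

The code-specific part is only the choice of keys. Reading and writing the nested array needs no custom parser.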
30. Needed: other ways to improve accuracy
How to improve image quality? Strategy 1
DFT band gap = cheap lens
Some kind of super accurate post-Bethe-Salpeter method
31. Needed: other ways to improve accuracy
How to improve image quality? Strategy 2
Computer algorithms improve image
Software corrects for cheap lens, e.g., distortion, two images to create depth of field
32. Needed: other ways to improve accuracy
Correct and mix cheap/simple calculations to improve output quality
Jain, A., Hautier, G., Ong, S. P., Moore, C. J., Fischer, C. C., Persson, K. A. & Ceder, G. Formation enthalpies by mixing GGA and GGA+U calculations. Phys. Rev. B 84, 045115 (2011).
33. Needed: other ways to improve accuracy
Correcting DFT is necessary to get decent phase diagrams
Almost everyone who practices new materials design does some flavor of post-correction (e.g., gas phase energies)
More effort into comparing, developing, and validating such methods is needed.
Jain, A., Hautier, G., Ong, S. P., Moore, C. J., Fischer, C. C., Persson, K. A. & Ceder, G. Formation enthalpies by mixing GGA and GGA+U calculations. Phys. Rev. B 84, 045115 (2011).
35. Some lessons learned (1)
• In the beginning, strong central coordination from authority was needed to develop these codes
– require that people contribute to common code, e.g., pymatgen, and not write their own detached scripts
• Once a code was “established”, less authority was needed
– people voluntarily contributed improvements rather than writing their own code because this benefited them
• Today the process is almost completely decentralized
– culture has changed
– even for new codes, people rally around them rather than build independent things
36. Some lessons learned (2)
• It is helpful to have a strong BDFL (benevolent
dictator for life) for each codebase
• Requirements for the BDFL:
– very detail-oriented
– cares about the code itself, not just the application
– cares more about the code quality than about offending
teammates, i.e., will not accept poor quality contributions
– at the same time, able to rally support from people and
convince them to contribute or clean up code
– willing to work overtime to do things like write detailed
docs, advocate for the code, review commits, etc.
– derives joy from building and deploying things!
37. Some lessons learned (3)
• Spending time on things like improving code cleanliness, writing unit tests, and writing documentation is not the “noble” and “self-sacrificing” act that people make it out to be
– I’ve referred to my own documentation many times
– I’ve saved myself from a world of trouble by previously writing unit tests to detect bugs
– I’ve been able to write and build large codes much faster due to previous commitments to code cleanliness (and been slowed down in my progress when I’ve relaxed these constraints)
• We don’t like to admit this, but a lack of attention to detail in the past has easily cost us tens of thousands of dollars in wasted computing and countless labor hours – but some of this is inevitable with large projects
38. Some lessons learned (4)
• Computer scientists are useful for staying up to
date in the fast-moving world of software
– 2006: I took a graduate class in databases at MIT; all SQL,
not a single mention of “NoSQL”
– 2011: We are designing the framework for Materials
Project; I have lots of experience with SQL; a computer
scientist casually mentions NoSQL, its growing
prominence, and its potential applicability to our problem
– 2017: We do almost everything in NoSQL
• Lesson: software moves fast! Much faster than
materials science knowledge or methods. Don’t use
data from 5 years ago to inform your decision.