Natural language processing for extracting synthesis recipes and applications to autonomous laboratories
Anubhav Jain
Lawrence Berkeley National Laboratory
COMBI workshop, Sept 2022
Slides (already) posted to hackingmaterials.lbl.gov
Autonomous labs can benefit from access to external data sets
[Diagram: three automated labs (A, B, C), each running a Plan → Synthesize → Characterize → Analyze loop around its own local database, drawing on shared external data sources]
Literature data: + broad coverage; – difficult to parse; – lacks negative examples
Other A-lab data: + structured data formats; + negative examples; – not much out there yet
Theory data: + readily available; – difficult to establish relevance to synthesis
The NLP Solution to Literature Data
• A lot of prior experimental data already exists in the literature and would take untold cost and labor to replicate
• Advantages of this data set include broad coverage of materials and techniques
• Disadvantages include:
• Getting access to the data
• Lack of negative examples in the data
• Missing or unreliable information
• Difficulty of obtaining structured data from unstructured text
• Natural language processing can help with the last part, although considerable
difficulties are still involved
• Named entity recognition
• Identify precursors, amounts, characteristics, etc. (a minimal tagging sketch follows after this list)
• Relationship modeling
• Relate the extracted entities to one another
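As an illustration of the NER step, here is a minimal sketch that tags synthesis-related entities in a single sentence with a Hugging Face token-classification pipeline. The model path is a placeholder for a materials-science NER checkpoint (not a real published model), and the example sentence is invented.

```python
# Minimal sketch: tagging synthesis-related entities with a token-classification
# pipeline. The model name is a placeholder for a hypothetical materials-science
# NER checkpoint, not a real published model.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/materials-ner-model",  # hypothetical fine-tuned checkpoint
    aggregation_strategy="simple",        # merge word pieces into whole entities
)

sentence = (
    "LiCoO2 was synthesized by mixing Li2CO3 and Co3O4, "
    "followed by calcination at 900 C for 12 h in air."
)

for ent in ner(sentence):
    # Each entity carries a label (e.g., precursor, target, temperature) and a
    # span; relating the entities to one another is the separate
    # relationship-modeling step.
    print(ent["entity_group"], ent["word"], ent["score"])
```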
Previous approach for extracting data from text
Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. 2019.
Recently, we also tried BERT variants
Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.
Models were good for labeling entities, but didn’t understand relationships
Named Entity Recognition
• Custom machine learning models extract the most valuable materials-related information.
• Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts (a minimal architecture sketch follows below).
Trewartha, A. et al. Patterns 2022, 3 (4), 100488.
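For intuition only, here is a minimal PyTorch sketch of a BiLSTM token tagger of the kind described above. The vocabulary size, tag set, and layer dimensions are illustrative and do not reproduce the models in the cited papers.

```python
# Minimal sketch of a BiLSTM token tagger for materials NER.
# Dimensions and tag counts are illustrative only.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        x = self.embed(token_ids)
        x, _ = self.lstm(x)
        return self.classifier(x)  # (batch, seq_len, num_tags) tag logits

# Toy batch: 2 sentences of 10 tokens, 8 possible entity tags
model = BiLSTMTagger(vocab_size=20000, num_tags=8)
logits = model(torch.randint(0, 20000, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 8])
```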
A Sequence-to-Sequence Approach
• A language model takes a sequence of tokens as input and outputs a sequence of tokens
• Maximizes the likelihood of the output conditioned on the input
• Additionally includes task conditioning
• Capacity for “understanding” language as well as “world knowledge”
• Task conditioning with an arbitrary seq2seq model provides an extremely flexible framework (illustrated in the sketch below)
• Large seq2seq models can generate text that naturally completes a paragraph
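The seq2seq-with-task-conditioning idea can be illustrated with an off-the-shelf encoder-decoder model such as T5, where an instruction prepended to the input conditions the output. This is only an analogy to the GPT-3 approach used in this work, and an untuned t5-small will not actually return a useful recipe; the sketch just shows the input/output plumbing.

```python
# Sketch of task conditioning with a generic seq2seq model (T5 here, only to
# illustrate the idea; the talk uses GPT-3 with prompt-based conditioning).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The leading instruction acts as the "task" part of the conditioning;
# the paragraph is the input sequence to be transformed.
prompt = (
    "extract synthesis recipe: LiCoO2 was made by mixing Li2CO3 and Co3O4 "
    "and calcining at 900 C for 12 h."
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```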
But it’s not perfect for technical data
[Diagram: a seq2seq model (GPT-3) takes text in (the “prompt”) and produces text out (the “completion”)]
A workflow for fine-tuning GPT-3
1. Initial training set of templates filled via zero-shot Q/A
2. Fine-tune model to fill templates
3. Predict new set of templates
4. Correct the new templates
5. Add the corrected templates to the training set (a data-preparation sketch for this step follows below)
6. Repeat steps 2-5 as necessary
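A hedged sketch of step 5: corrected templates are serialized into the prompt/completion JSONL format that the GPT-3-era OpenAI fine-tuning endpoint consumed. The example paragraph, template fields, separator, and stop token are all illustrative choices, not the project's exact setup.

```python
# Sketch of assembling a fine-tuning file from corrected templates (step 5).
# Prompt/completion JSONL is the legacy GPT-3 fine-tuning format; the field
# contents, "###" separator, and " END" stop token are illustrative only.
import json

corrected_examples = [
    {
        "paragraph": "BaTiO3 powders were prepared from BaCO3 and TiO2 "
                     "fired at 1100 C for 6 h.",
        "template": {
            "target": "BaTiO3",
            "precursors": ["BaCO3", "TiO2"],
            "operations": [{"type": "heating", "temperature_C": 1100,
                            "time_h": 6}],
        },
    },
    # ... more corrected examples accumulated over iterations
]

with open("finetune_train.jsonl", "w") as f:
    for ex in corrected_examples:
        record = {
            "prompt": ex["paragraph"] + "\n\n###\n\n",
            "completion": " " + json.dumps(ex["template"]) + " END",
        }
        f.write(json.dumps(record) + "\n")
```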
Templated extraction of synthesis recipes
• Annotate paragraphs to output structured recipe templates
• JSON format
• Designed using domain knowledge from experimentalists
• The template is a relation graph to be filled in by the model (an illustrative example follows below)
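The dict below is an illustrative guess at what such a JSON recipe template might look like; the field names are assumptions, not the actual schema designed with the experimentalists.

```python
# Illustrative (not the project's actual schema) JSON-style recipe template,
# written as a Python dict. The None fields are what the model is asked to
# fill in from a synthesis paragraph.
recipe_template = {
    "target": {"formula": None, "name": None},
    "precursors": [
        {"formula": None, "amount": None, "unit": None},
    ],
    "operations": [
        {
            "type": None,            # e.g., mixing, heating, cooling
            "temperature": None,
            "time": None,
            "atmosphere": None,
        },
    ],
    "characterization": [],          # e.g., XRD, SEM
}
```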
Performance (work in progress, initial tests)
• Precision: 90%
• Recall: 90%
• F1 Score: 90%
• Transcription: 97%
• Overall: 86%
• F1 measures placing information in the right fields
• Transcription measures putting the right information in those fields (one possible way to compute these is sketched below)
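One plausible way to compute these field-level metrics (not necessarily the exact procedure behind the numbers above) is sketched below: flatten predicted and hand-corrected templates into field paths, score placement with precision/recall/F1, and score transcription as value agreement on the correctly placed fields.

```python
# Sketch of field-level evaluation of extracted templates against
# hand-corrected references. One plausible scoring scheme, for illustration.
def flatten(template, prefix=""):
    """Flatten a nested template into {field_path: value} pairs."""
    items = {}
    if isinstance(template, dict):
        for k, v in template.items():
            items.update(flatten(v, f"{prefix}{k}."))
    elif isinstance(template, list):
        for i, v in enumerate(template):
            items.update(flatten(v, f"{prefix}{i}."))
    elif template is not None:
        items[prefix.rstrip(".")] = template
    return items

def evaluate(predicted, reference):
    pred, ref = flatten(predicted), flatten(reference)
    filled_both = set(pred) & set(ref)
    precision = len(filled_both) / len(pred) if pred else 0.0
    recall = len(filled_both) / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Transcription: of the fields placed correctly, how many hold the right value?
    transcription = (
        sum(pred[k] == ref[k] for k in filled_both) / len(filled_both)
        if filled_both else 0.0
    )
    return {"precision": precision, "recall": recall,
            "f1": f1, "transcription": transcription}
```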
Applied to solid state synthesis / doping
Example input abstract:
We have performed the first-principles calculations onto the structural,
electronic and magnetic properties of seven 3d transition-metal (TM=V, Cr,
Mn, Fe, Co, Ni and Cu) atom substituting cation Zn in both zigzag (10,0) and
armchair (6,6) zinc oxide nanotubes (ZnONTs). The results show that there
exists a structural distortion around 3d TM impurities with respect to the
pristine ZnONTs. The magnetic moment increases for V-, Cr-doped ZnONTs
and reaches maximum for Mn-doped ZnONTs, and then decreases for Fe-, Co-,
Ni- and Cu-doped ZnONTs successively, which is consistent with the
predicted trend of Hund’s rule for maximizing the magnetic moments of the
doped TM ions. However, the values of the magnetic moments are smaller than
the predicted values of Hund’s rule due to strong hybridization between p
orbitals of the nearest neighbor O atoms of ZnONTs and d orbitals of the TM
atoms. Furthermore, the Mn-, Fe-, Co-, Cu-doped (10,0) and (6,6) ZnONTs
with half-metal and thus 100% spin polarization characters seem to be good
candidates for spintronic applications.
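A hedged example of how an abstract like the one above could be turned into an extraction prompt for host/dopant relations; the prompt wording and the output schema are illustrative, not the project's actual prompts.

```python
# Illustrative prompt/completion pair for extracting doping relations from the
# ZnO-nanotube abstract shown above. Wording and schema are examples only.
abstract = "We have performed first-principles calculations on ... ZnONTs."

prompt = (
    f"{abstract}\n\n"
    "Extract the host material and all dopants mentioned above as JSON "
    'with keys "host" and "dopants".\n\nAnswer:'
)

expected_completion = {
    "host": "ZnO nanotubes",
    "dopants": ["V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu"],
}
```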
Use in initial hypothesis generation
• Classifying AuNP morphologies based on precursors used
• Predicting AuNR aspect ratios based on the amount of AgNO3 in the growth solution
• Predicting doping: if a material can be doped with A, can it be doped with B? (a toy sketch follows below)
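The third idea can be made concrete with a toy sketch: given host→dopant pairs mined from the literature, rank unseen dopants for a host by how often they co-occur with that host's known dopants in other systems. The data and scoring rule below are made up for illustration.

```python
# Toy sketch of literature-based doping hypothesis generation: rank candidate
# dopants for a host by co-occurrence with that host's known dopants across
# other hosts. The data below is invented for illustration.
from collections import Counter

host_dopants = {
    "ZnO": {"Mn", "Co", "Fe"},
    "TiO2": {"Mn", "Fe", "N"},
    "GaN": {"Mn", "Mg"},
}

def suggest_dopants(host, known):
    """Score unseen dopants by co-occurrence with the host's known dopants."""
    scores = Counter()
    for other, dopants in host_dopants.items():
        if other == host:
            continue
        overlap = len(known & dopants)
        for d in dopants - known:
            scores[d] += overlap
    return scores.most_common()

print(suggest_dopants("ZnO", host_dopants["ZnO"]))  # [('N', 2), ('Mg', 1)]
```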
Developing an automated lab (“A-lab”) that makes use of literature data is in progress
The A-lab facility is designed to handle inorganic powders
In operation: XRD, robot, box furnaces
Setting up: 4 tube furnaces; dosing and mixing (LBNL bldg. 30)
The facility will handle powder-based synthesis of inorganic materials, with automated characterization and experimental planning.
Collaboration w/ G. Ceder & H. Kim
Timeline:
• April 2022 (hardware development): box furnace, XRD, & robots ready
• July 2022 (platform integration): tube furnaces and SEM ready
• November 2022 (automated synthesis): powder dosing system; first automated syntheses
• Summer 2023: AI-guided synthesis
• Summer 2024: closed-loop materials discovery
The continuing challenge – putting it all together!
Currently we are still working on the various components: historical data, initial hypotheses, and the data API.
Acknowledgements
NLP
• Nick Walker
• John Dagdelen
• Alex Dunn
• Sanghoon Lee
• Amalie Trewartha
A-lab
• Rishi Kumar
• Yuxing Fei
• Haegyeom Kim
• Gerbrand Ceder
Funding provided by:
• U.S. Department of Energy, Basic Energy Sciences, “D2S2” program
• Toyota Research Institute, Accelerated Materials Design program
• Lawrence Berkeley National Laboratory “LDRD” program
Slides (already) posted to hackingmaterials.lbl.gov