Natural language processing for extracting synthesis recipes and applications to autonomous laboratories

  1. Natural language processing for extracting synthesis recipes and applications to autonomous laboratories Anubhav Jain Lawrence Berkeley National Laboratory COMBI workshop, Sept 2022 Slides (already) posted to hackingmaterials.lbl.gov
  2. Autonomous labs can benefit from access to external data sets 2 Plan Synthesize Characterize Analyze local db Automated Lab A Plan Synthesize Characterize Analyze local db Automated Lab B Plan Synthesize Characterize Analyze local db Automated Lab C Literature data + broad coverage – difficult to parse – lack negative examples Other A-lab data + structured data formats + negative examples – not much out there … Theory data + readily available – difficult to establish relevance to synthesis
  4. The NLP Solution to Literature Data • A lot of prior experimental data already exists in the literature and would take untold cost and labor to replicate • Advantages of this data set are broad coverage of materials and techniques • Disadvantages include: • getting access to the data • lack of negative examples in the data • missing / unreliable information • difficulty of obtaining structured data from unstructured text • Natural language processing can help with the last part, although considerable difficulties remain • Named entity recognition • Identify precursors, amounts, characteristics, etc. • Relationship modeling • Relate the extracted entities to one another
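The two steps named on this slide can be illustrated with a toy sketch. This is a hypothetical illustration, not the authors' pipeline: the entity labels and the nearest-operation linking heuristic are assumptions made for the example.

```python
# Toy illustration (hypothetical) of the two NLP steps: named entity
# recognition (tagging tokens) followed by relationship modeling
# (linking the tagged entities into a structured recipe).

# Step 1: NER output -- each token carries an entity label ("O" = no entity).
tokens = ["TiO2", "was", "calcined", "at", "500", "C"]
labels = ["PRECURSOR", "O", "OPERATION", "O", "TEMPERATURE", "UNIT"]

def extract_entities(tokens, labels):
    """Collect (label, token) pairs for non-'O' tags."""
    return [(lab, tok) for tok, lab in zip(tokens, labels) if lab != "O"]

# Step 2: relationship modeling -- a naive heuristic that attaches each
# condition to the most recent operation; anything before the first
# operation is treated as an input material.
def relate(entities):
    recipe = {}
    current_op = None
    for lab, tok in entities:
        if lab == "OPERATION":
            current_op = tok
            recipe[current_op] = {"conditions": []}
        elif current_op is not None:
            recipe[current_op]["conditions"].append((lab, tok))
        else:
            recipe.setdefault("inputs", []).append((lab, tok))
    return recipe

entities = extract_entities(tokens, labels)
recipe = relate(entities)
```

Real systems replace both steps with learned models (the LSTM and BERT NER models on the next slides, and the seq2seq approach for relationships), but the input/output shapes are the same.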
  5. Previous approach for extracting data from text 5 Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. 2019. Recently, we also tried BERT variants: Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.
  6. Models were good for labeling entities, but didn’t understand relationships 6 Named Entity Recognition • Custom machine learning models to extract the most valuable materials-related information. • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts. Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.
  7. A Sequence-to-Sequence Approach • A language model takes a sequence of tokens as input and outputs a sequence of tokens • Maximizes the likelihood of the output conditioned on the input • Additionally includes task conditioning • Capacity for “understanding” language as well as “world knowledge” • Task conditioning with an arbitrary seq2seq model provides an extremely flexible framework • Large seq2seq models can generate text that naturally completes a paragraph
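Task conditioning in this setting is just prompt construction: the task instruction and the input paragraph are concatenated into one token sequence, and the structured answer is expected as the completion. A minimal sketch, with an assumed prompt format (the authors' exact prompt is not shown in the slides):

```python
# Minimal sketch of task conditioning for a seq2seq model: prepend an
# instruction to the input paragraph so the completion performs the task.
# The prompt wording here is an assumption for illustration.

def build_prompt(paragraph, task="Extract the synthesis recipe as JSON:"):
    """Condition the model on a task by prepending an instruction."""
    return f"{task}\n\n{paragraph}\n\nAnswer:"

prompt = build_prompt("BaTiO3 was synthesized by heating BaCO3 and TiO2 ...")
# A hosted seq2seq model such as GPT-3 would then be called with this prompt;
# the API call itself is omitted since it requires credentials.
```

The same mechanism supports the later slides: swapping the instruction (or fine-tuning on instruction/answer pairs) re-targets the same model to a different extraction task.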
  8. How a sequence-to-sequence approach works 8 Seq2Seq model (GPT3) Text in (“prompt”) Text out (“completion”)
  9. Another example 9 Seq2Seq model (GPT3) Text in (“prompt”) Text out (“completion”)
  10. Structured data 10 Seq2Seq model (GPT3) Text in (“prompt”) Text out (“completion”)
  11. But it’s not perfect for technical data 11 Seq2Seq model (GPT3) Text in (“prompt”) Text out (“completion”)
  12. A workflow for fine-tuning GPT-3 1. Initial training set of templates filled via zero-shot Q/A 2. Fine-tune model to fill templates 3. Predict new set of templates 4. Correct the new templates 5. Add the corrected templates to the training set 6. Repeat steps 2-5 as necessary
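The six steps above form a human-in-the-loop bootstrapping loop. The sketch below shows its control flow only; the three stub functions stand in for the real zero-shot extraction, GPT-3 fine-tuning, and manual correction steps, and are assumptions for illustration.

```python
# Sketch of the iterative fine-tuning workflow (steps 1-6). The stubs
# below are placeholders for the real model/API calls and human annotation.

def zero_shot_fill(paragraph):      # stand-in for step 1 (zero-shot Q/A)
    return {"text": paragraph, "template": {}}

def fine_tune(training_set):        # stand-in for step 2 (GPT-3 fine-tune)
    return lambda paragraph: {"text": paragraph, "template": {}}

def human_correct(template):        # stand-in for step 4 (manual fixes)
    template["corrected"] = True
    return template

def bootstrap(paragraphs, n_rounds=2):
    # 1. Seed the training set with zero-shot template extractions.
    training_set = [zero_shot_fill(p) for p in paragraphs]
    model = None
    for _ in range(n_rounds):                      # 6. repeat steps 2-5
        model = fine_tune(training_set)            # 2. fine-tune the model
        predicted = [model(p) for p in paragraphs] # 3. predict new templates
        corrected = [human_correct(t) for t in predicted]  # 4. correct them
        training_set.extend(corrected)             # 5. grow the training set
    return model, training_set

model, data = bootstrap(["paragraph A", "paragraph B"])
```

The point of the loop is that correcting a model prediction is much cheaper than annotating from scratch, so each round lowers the per-example annotation cost.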
  13. Templated extraction of synthesis recipes • Annotate paragraphs to output structured recipe templates • JSON-format • Designed using domain knowledge from experimentalists • Template is relation graph to be filled in by model
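A JSON-format recipe template of the kind described above might look like the following. The field names and example values here are illustrative assumptions; the slides do not show the authors' exact schema.

```python
import json

# Hypothetical JSON recipe template (illustrative field names only):
# the model fills in a relation graph of target, precursors, and
# operations with their conditions.
template = {
    "target": "BaTiO3",
    "precursors": [{"formula": "BaCO3"}, {"formula": "TiO2"}],
    "operations": [
        {"type": "mix", "conditions": {}},
        {"type": "heat", "conditions": {"temperature": "1100 C", "time": "4 h"}},
    ],
}

serialized = json.dumps(template, indent=2)
```

Fixing the schema in advance is what makes the seq2seq output machine-usable: every completion can be parsed and validated against the same structure.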
  14. Example Prediction
  15. Performance (work in progress, initial tests) • Precision: 90% • Recall: 90% • F1 score: 90% • Transcription: 97% • Overall: 86% • The F1 score measures whether information is placed in the right fields • Transcription accuracy measures whether the right information is put in those fields
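The reported F1 score follows directly from the stated precision and recall: F1 is their harmonic mean, so equal precision and recall of 90% give an F1 of 90%.

```python
# F1 is the harmonic mean of precision and recall.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.90, 0.90)  # equal P and R, so F1 equals both
```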
  16. Applied to solid state synthesis / doping We have performed the first-principles calculations onto the structural, electronic and magnetic properties of seven 3d transition-metal (TM=V, Cr, Mn, Fe, Co, Ni and Cu) atom substituting cation Zn in both zigzag (10,0) and armchair (6,6) zinc oxide nanotubes (ZnONTs). The results show that there exists a structural distortion around 3d TM impurities with respect to the pristine ZnONTs. The magnetic moment increases for V-, Cr-doped ZnONTs and reaches maximum for Mn-doped ZnONTs, and then decreases for Fe-, Co-, Ni- and Cu-doped ZnONTs successively, which is consistent with the predicted trend of Hund’s rule for maximizing the magnetic moments of the doped TM ions. However, the values of the magnetic moments are smaller than the predicted values of Hund’s rule due to strong hybridization between p orbitals of the nearest neighbor O atoms of ZnONTs and d orbitals of the TM atoms. Furthermore, the Mn-, Fe-, Co-, Cu-doped (10,0) and (6,6) ZnONTs with half-metal and thus 100% spin polarization characters seem to be good candidates for spintronic applications.
  17. Use in initial hypothesis generation 17 classifying AuNP morphologies based on precursors used predicting AuNR aspect ratios based on amount of AgNO3 in growth solution predicting doping – if a material can be doped with A, can it be doped with B?
  18. Developing an automated lab (“A-lab”) that makes use of literature data is in progress 18 Plan Synthesize Characterize Analyze local db Automated Lab A Plan Synthesize Characterize Analyze local db Automated Lab B Plan Synthesize Characterize Analyze local db Automated Lab C Literature data + broad coverage – difficult to parse – lack negative examples Other A-lab data + structured data formats + negative examples – not much out there … Theory data + readily available – difficult to establish relevance to synthesis
  19. The A-lab facility is designed to handle inorganic powders 19 In operation: XRD, robot, box furnaces. Setting up: tube furnace x 4, dosing and mixing (LBNL bldg. 30). Facility will handle powder-based synthesis of inorganic materials, with automated characterization and experimental planning. Collaboration w/ G. Ceder & H. Kim. Roadmap: April 2022: box furnace, XRD, & robots ready (hardware development); July 2022: tube furnaces and SEM ready (platform integration); November 2022: powder dosing system and first automated syntheses (automated synthesis); Summer 2023: AI-guided synthesis; Summer 2024: closed-loop materials discovery.
  20. Early stages of the facility 20
  21. The continuing challenge: putting it all together! Currently we are still working on connecting the various components (historical data, initial hypotheses, data-api)
  22. Acknowledgements NLP • Nick Walker • John Dagdelen • Alex Dunn • Sanghoon Lee • Amalie Trewartha 22 A-lab • Rishi Kumar • Yuxing Fei • Haegyum Kim • Gerbrand Ceder Funding provided by: • U.S. Department of Energy, Basic Energy Science, “D2S2” program • Toyota Research Institutes, Accelerated Materials Design program • Lawrence Berkeley National Laboratory “LDRD” program Slides (already) posted to hackingmaterials.lbl.gov