Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature

  1. Natural Language Processing for Data Extraction and Synthesizability Prediction from the Energy Materials Literature
     Anubhav Jain, Lawrence Berkeley National Laboratory
     MRS Fall Meeting, Nov 2022
     Slides (already) posted to hackingmaterials.lbl.gov
  2. Literature data can be a key source of materials learning
     [Diagram: Automated Lab A, Conventional Lab B, and Automated Lab C each cycle Plan → Synthesize → Characterize → Analyze, with the automated labs feeding a local db + ML]
     • Literature data: + broad coverage / – difficult to parse / – lack of negative examples / – reproducibility
     • Other A-lab data: + structured data formats / + negative examples / – not much out there …
     • Theory data: + readily available / – difficult to establish relevance to synthesis / – computation time
  3. Several research groups are now attempting to collect data sets from the research literature
     • Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019)
     • Recently, we also tried BERT variants: Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.
  4. Models were good for labeling entities, but didn’t understand relationships
     Named Entity Recognition
     • Custom machine learning models to extract the most valuable materials-related information.
     • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts.
     (Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K. A.; Ceder, G.; Jain, A. Quantifying the Advantage of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Patterns 2022, 3 (4), 100488.)
     • Relationships have usually been extracted via either manual or semi-automated regular expression construction, along with grammar tree analysis (e.g., ChemDataExtractor) – can be tedious!
  5. Outline
     • Using sequence-to-sequence models for combined entity detection and relationship extraction
     • Analyzing synthesis of Au nanorods using literature data
     • Analyzing synthesis of phase-pure BiFeO3 using literature data
  6. A Sequence-to-Sequence Approach
     • A language model takes a sequence of tokens as input and outputs a sequence of tokens
     • Maximizes the likelihood of the output conditioned on the input
     • Additionally includes task conditioning, which can learn the desired format for outputs
     • We’ve now done many explorations with OpenAI’s GPT-3, which has 175 billion parameters
       • We interact with the model through their (paid) API, although costs are relatively modest (a minimal sketch of such a call follows below)
       • Capacity for “understanding” language as well as “world knowledge”
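To make the prompt/completion interaction concrete, here is a minimal sketch of calling GPT-3 through the OpenAI API to extract structured information from an abstract. It assumes the pre-1.0 `openai` Python package and an API key in the environment; the model name, prompt wording, and abstract are illustrative, not the exact ones used in this work.

```python
# Minimal sketch of prompting a seq2seq model (GPT-3) through the OpenAI API.
# Assumes the pre-1.0 `openai` Python package and OPENAI_API_KEY set in the environment;
# the model name, prompt, and abstract are illustrative assumptions.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

abstract = (
    "LiFePO4 thin films were deposited by pulsed laser deposition and "
    "annealed at 600 C, yielding phase-pure olivine cathodes."
)

# Text in ("prompt"): the paragraph plus an instruction describing the desired output format
prompt = (
    "Extract the material, synthesis method, and application from the "
    f"following abstract as JSON:\n\n{abstract}\n\nJSON:"
)

response = openai.Completion.create(
    model="text-davinci-003",   # hypothetical choice of base model
    prompt=prompt,
    max_tokens=256,
    temperature=0.0,            # deterministic output for extraction tasks
)

# Text out ("completion"): to be parsed as JSON downstream
completion = response["choices"][0]["text"]
print(completion)
```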
  7. How a sequence-to-sequence approach works
     [Figure: text in (“prompt”) → Seq2Seq model (GPT-3) → text out (“completion”)]
  8. Another example
     [Figure: another prompt → Seq2Seq model (GPT-3) → completion]
  9. Structured data
     [Figure: prompt requesting structured output → Seq2Seq model (GPT-3) → structured completion]
  10. But it’s not perfect for technical data
      [Figure: prompt → Seq2Seq model (GPT-3) → completion]
  11. A workflow for fine-tuning GPT-3
      1. Build an initial training set of templates, filled mostly manually, as zero-shot GPT is often poor for technical tasks
      2. Fine-tune the model to fill templates, then use the model to assist in annotation
      3. Repeat as necessary until the desired inference accuracy is achieved
      (A sketch of preparing the fine-tuning data follows below.)
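A minimal sketch of what steps 1–2 of this workflow might look like in practice: hand-annotated paragraph/template pairs are serialized into the prompt/completion JSONL format consumed by the OpenAI fine-tuning endpoint at the time. The file name, separator tokens, and template fields are assumptions for illustration, not the authors' exact scheme.

```python
# Sketch: turn (mostly manual) template annotations into prompt/completion pairs
# for fine-tuning. Field names, separators, and the example annotation are assumptions.
import json

annotations = [
    {
        "paragraph": "BiFeO3 powders were prepared by a sol-gel route ...",
        "template": {
            "target": "BiFeO3",
            "method": "sol-gel",
            "precursors": ["Bi(NO3)3", "Fe(NO3)3"],
        },
    },
    # ... more hand-annotated (and later model-assisted) examples
]

with open("finetune_train.jsonl", "w") as f:
    for ex in annotations:
        record = {
            "prompt": ex["paragraph"] + "\n\n###\n\n",                # marker between text and answer
            "completion": " " + json.dumps(ex["template"]) + " END",  # fixed stop token
        }
        f.write(json.dumps(record) + "\n")

# The JSONL file can then be passed to the fine-tuning endpoint, e.g. via the older
# OpenAI CLI: `openai api fine_tunes.create -t finetune_train.jsonl -m davinci`
# (exact command varies by library version).
```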
  12. This procedure can extract complex, hierarchical relationships between entities
  13. Outline
      • Using sequence-to-sequence models for combined entity detection and relationship extraction
      • Analyzing synthesis of Au nanorods using literature data
      • Analyzing synthesis of phase-pure BiFeO3 using literature data
  14. Templated extraction of synthesis recipes
      • Annotate paragraphs to output structured recipe templates
      • JSON format
      • Designed using domain knowledge from experimentalists
      • The template is a relation graph to be filled in by the model (an illustrative example follows below)
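For illustration, a filled template for a seed-mediated Au nanorod synthesis might look roughly like the following JSON-serializable Python dict. The field names, values, and nesting are assumptions meant to convey the relation-graph idea, not the actual template used in this work.

```python
# Illustrative (not verbatim) example of a filled recipe template for a
# seed-mediated Au nanorod synthesis; all fields and values are assumptions.
recipe_template = {
    "target": {"material": "Au nanorods", "morphology": "rod"},
    "seed_solution": {
        "precursors": [
            {"name": "HAuCl4", "concentration": "0.25 mM"},
            {"name": "NaBH4", "concentration": "0.6 mM"},
        ],
        "capping_agent": "CTAB",
    },
    "growth_solution": {
        "precursors": [
            {"name": "HAuCl4", "concentration": "0.5 mM"},
            {"name": "AgNO3", "concentration": "0.1 mM"},
            {"name": "ascorbic acid", "concentration": "0.55 mM"},
        ],
        "capping_agent": "CTAB",
        "temperature": "30 C",
    },
    "operations": ["add seed solution to growth solution", "age 2 h"],
}
```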
  15. Example extraction for Au nanorod synthesis
      Note: we are still formally evaluating performance; there are various issues in getting an accurate evaluation, e.g., predictions that are functionally correct but written differently.
  16. Analyzing the AuNR synthesis data set
      • Note that this data set was collected manually via hand-tuned regular expressions, not NLP or GPT-3, as it was assembled in parallel to that work. We are currently looking at the pros/cons of the manual approach vs. the GPT-3 approach.
      • Representing recipes as precursor vectors for machine learning (see the sketch below)
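A minimal sketch of the "precursor vector" idea: each extracted recipe becomes a dict mapping precursor names to concentrations, which scikit-learn's DictVectorizer turns into a fixed-length feature vector (zero where a precursor is absent). The precursor names and concentration values below are made up for illustration; the actual featurization of the AuNR data set may differ.

```python
# Sketch: represent extracted recipes as fixed-length precursor vectors for ML.
# Values are illustrative, not taken from the real data set.
from sklearn.feature_extraction import DictVectorizer

recipes = [
    {"HAuCl4": 0.5, "CTAB": 100.0, "AgNO3": 0.10, "ascorbic_acid": 0.55},
    {"HAuCl4": 0.5, "CTAB": 100.0, "AgNO3": 0.05, "ascorbic_acid": 0.60, "HCl": 1.0},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(recipes)          # rows: recipes; columns: precursor concentrations (0 if absent)
print(vec.get_feature_names_out())
print(X)
```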
  17. Training a decision tree to predict AuNR shape shows similar conclusions as literature
      [Decision tree figure: leaves labeled Rod, Cube, Bipyramid, Star, or None]
      • Decision tree shows seed capping agent type as first decision boundary for shape determination
      • “Citrate-capped gold seeds form penta-twinned structure, while CTAB-capped seeds are single crystalline, hence former leads to bipyramids and latter leads to rods”1,2
      1 Liu and Guyot-Sionnest, J. Phys. Chem. B, 2005, 109 (47), 22192–22200
      2 Grzelczak et al., Chem. Soc. Rev., 2008, 37, 1783–1791
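A minimal sketch of training such an interpretable decision tree with scikit-learn, assuming seed/precursor features like those above. The toy data, feature names, and labels are illustrative, not the actual AuNR data set; they are chosen only so that the first split falls on the seed capping agent, mirroring the qualitative conclusion above.

```python
# Sketch: fit and print an interpretable decision tree for nanoparticle shape.
# Features and labels are toy assumptions, not the real literature-derived data.
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["citrate_seed", "CTAB_seed", "AgNO3_mM"]  # assumed columns
X = [
    [1.0, 0.0, 0.10],
    [0.0, 1.0, 0.10],
    [0.0, 1.0, 0.00],
    [1.0, 0.0, 0.00],
]
y = ["bipyramid", "rod", "cube", "bipyramid"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Text rendering of the learned tree; with this toy data the first split is the seed type
print(export_text(tree, feature_names=feature_names))
```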
  18. We also see some effect of AgNO3 concentration on AuNR size, but data is noisy
      [Figure panels: AuNR size vs. AgNO3 for growth solutions of HAuCl4, CTAB, AA, AgNO3; the same with an HAuCl4/CTAB < 0.01 filter; and with added HCl]
      N. D. Burrows et al., Langmuir 2017, 33 (8), 1891–1907
  19. Overall thoughts on the AuNR data set
      • The seq2seq method is showing good capabilities in terms of extracting complex nanorod synthesis data
      • We are going to start integrating it into our own pipeline to replace manual regex for relationship extraction
      • Performing machine learning for hypothesis generation on AuNR shape and size is messy
        • Data sets are messy, and not particularly large
      • Nevertheless, it is encouraging that conclusions from the literature can be automatically found by machine learning
  20. Outline
      • Using sequence-to-sequence models for combined entity detection and relationship extraction
      • Analyzing synthesis of Au nanorods using literature data
      • Analyzing synthesis of phase-pure BiFeO3 using literature data
  21. Seq2Seq approach for solid-state synthesis
      Initial tests of the seq2seq method on solid-state synthesis have given encouraging results, but need further testing.
  22. For now, we use manual data extraction to tackle the problem of BiFeO3 synthesis
      340 total synthesis recipes (from 178 articles); 57 features per recipe
  23. Machine learning (decision tree) predictions are in line with common knowledge
  24. Missing synthesis information – can it be recovered / reproduced easily?
      [Figure: recipes categorized as “could not reproduce,” “partially reproducible,” or “reproducible”]
  25. Exploring unexplored portions of synthesis space
      These decision trees are interpretable, but are they physical?
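One hedged sketch of what exploring unexplored portions of synthesis space could look like in code: train a decision tree on extracted recipes, then score a grid of candidate conditions that do not appear in the literature. All feature names, ranges, and labels below are toy assumptions, not the actual BiFeO3 data or the authors' exact procedure.

```python
# Sketch: enumerate candidate synthesis conditions outside the literature coverage
# and score them with a tree trained on extracted recipes. Data are toy assumptions.
import itertools
from sklearn.tree import DecisionTreeClassifier

# Toy training data: (anneal temperature in C, anneal time in h) -> phase purity label
X_train = [[500, 1], [600, 2], [650, 2], [700, 4], [450, 0.5]]
y_train = ["impure", "phase-pure", "phase-pure", "impure", "impure"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Grid of candidate conditions, including combinations absent from the data set
candidates = itertools.product(range(450, 751, 50), [0.5, 1, 2, 4, 8])
for temp_c, time_h in candidates:
    label = tree.predict([[temp_c, time_h]])[0]
    print(f"anneal {temp_c} C for {time_h} h -> predicted {label}")
```

Predictions in the sparsely sampled regions are exactly where the "interpretable but is it physical?" question applies, and would need experimental follow-up.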
  26. Conclusions
      • As large language models grow larger and more capable, they are able to parse increasingly complex scientific text into structured formats
      • Applying NLP + ML to synthesis data sets shows that scientific heuristics can be automatically uncovered, which is promising
      • Nevertheless, issues remain in applying NLP to predictive synthesis:
        • Reproducibility / missing information / conflicting information
        • General lack of negative examples
        • Unknown data quality
      • Thus, results from such techniques will likely need to be treated as initial hypotheses to be complemented by further experiments
  27. Acknowledgements
      NLP (seq2seq): Alex Dunn, John Dagdelen, Nick Walker, Sanghoon Lee, Amalie Trewartha
      AuNR analysis: Sanghoon Lee, Sam Gleason, Kevin Cruse
      BiFeO3 analysis: Kevin Cruse, Viktoriia Baibakova, Maged Abdelsamie, Kootak Hong, Carolin Sutter-Fella, Gerbrand Ceder
      Funding provided by:
      • U.S. Department of Energy, Basic Energy Sciences, “D2S2” program
      • Toyota Research Institute, Accelerated Materials Design program
      Slides (already) posted to hackingmaterials.lbl.gov
  28. Sol-gel synthesis of BiFeO3
