
Natural Language Processing for Materials Design - What Can We Extract From the Research Literature


  1. Natural Language Processing for Materials Design - What Can We Extract From the Research Literature? Anubhav Jain, Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA. MRS Spring, April 18, 2021. Slides (already) posted to hackingmaterials.lbl.gov
  2. Useful information is scattered across papers - how can we make use of this data with NLP? [figure: a pile of papers to read "someday", feeding into NLP algorithms]
  3. Traditional search doesn't answer the questions we want.
     • It is difficult to look up all information on any given material due to the many different ways chemical compositions are written (see the normalization sketch after this slide):
       – a search for "TiNiSn" will give different results than "NiTiSn"
       – a search for "GaSb" won't match text that reads "Ga0.5Sb0.5"
       – a search for "SnBi4Te7" won't match text that reads "we studied SnBi4X7 (X = S, Se, Te)"
       – a search for "AgCrSe2", if it doesn't have any hits, won't suggest "CuCrSe2" as a similar result
     • It is difficult to ask questions or compile summaries, e.g.:
       – What is the band gap of Si?
       – What are all the known dopants in GaAs?
       – What are all materials studied as thermoelectrics?
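
As a minimal illustration (not from the talk), composition strings can be mapped to a canonical form before indexing. The sketch below uses pymatgen's Composition class as one possible tool; the talk does not say what Matscholar actually uses, so treat this as an assumed implementation choice.

    # Hedged sketch: pymatgen is an assumed tool here, not necessarily what
    # Matscholar uses. It maps different spellings of the same compound to
    # one canonical formula string suitable as a search key.
    from pymatgen.core import Composition

    # "TiNiSn" and "NiTiSn" reduce to the same canonical formula.
    assert Composition("TiNiSn").reduced_formula == Composition("NiTiSn").reduced_formula

    # Fractional formulas describing the same stoichiometry also match.
    assert Composition("Ga0.5Sb0.5").reduced_formula == Composition("GaSb").reduced_formula

    print(Composition("NiTiSn").reduced_formula)  # one key for all spellings
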
  4. What is Matscholar?
     • Matscholar is an attempt to organize the world's information on materials science, connecting together topics of study, synthesis and characterization methods, and specific materials compositions.
     • It is also an effort to use state-of-the-art natural language processing to make collective use of the information in millions of articles.
  5. One of our main projects concerns named entity recognition (NER), or automatically labeling text. This enables search and is crucial to downstream tasks.
  6. Data collection: over 4 million papers collected from more than 2,100 journals. Extracted entity mentions so far:
     • 19 million materials
     • 31 million properties
     • 8.8 million characterization methods
     • 7.5 million applications
     • 5 million synthesis methods
     Note: entities are currently extracted only from the abstracts of the papers.
  7. Now we can search! Live on www.matscholar.com
  8. How does this work? A high-level view. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  9. Step 4a: the word2vec algorithm is used to "featurize" words.
     • We use the word2vec algorithm (from Google) to turn each unique word in our corpus into a 200-dimensional vector.
     • These vectors encode the meaning of each word, learned by training the model to predict the context words around a target word.
     Barazza, L. How does Word2Vec's Skip-Gram work? Becominghuman.ai, 2017.
  10. (Same slide, with a quote added:) "You shall know a word by the company it keeps" - John Rupert Firth (1957)
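
A minimal sketch of training such skip-gram embeddings, assuming gensim as the library (the slides do not name one) and a toy tokenized corpus standing in for the millions of real abstracts:

    # Hedged sketch: gensim >= 4.0 assumed. `abstracts` is a tiny stand-in
    # for the real corpus of tokenized materials science abstracts.
    from gensim.models import Word2Vec

    abstracts = [
        ["the", "band", "gap", "of", "GaAs", "is", "1.43", "eV"],
        ["thermoelectric", "performance", "of", "Bi2Te3", "thin", "films"],
    ]

    model = Word2Vec(
        sentences=abstracts,
        vector_size=200,  # 200-dimensional vectors, as stated on the slide
        sg=1,             # skip-gram: predict context words around the target
        window=8,
        min_count=1,      # keep every word in this toy corpus
    )

    print(model.wv["GaAs"].shape)  # (200,) - the word's "feature vector"
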
  11. Word embeddings trained on "normal" text learn relationships between words. The classic example: "king" - "man" + "woman" = ? → "queen"
  12. There seems to be materials knowledge encoded in word vectors trained on materials abstracts. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
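
The analogy arithmetic on these two slides maps directly onto gensim's built-in vector arithmetic. The sketch below assumes a `model` whose vocabulary contains the query words (the toy corpus above does not):

    # Hedged sketch: assumes embeddings trained on a corpus containing these
    # words. For general text, the top result should be "queen".
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

    # For embeddings trained on materials abstracts, Tshitoyan et al. report
    # analogous chemistry relationships (e.g., element-to-oxide analogies).
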
  13. OK, so how does this work? A high-level view. Weston, L. et al., J. Chem. Inf. Model. (2019).
  14. Step 4b: how do we train a model to recognize context? If you read the sentence "The band gap of ___ is 4.5 eV", it is clear that the blank should be filled in with a material word (not a synthesis method, characterization method, etc.). How do we get a neural network to take context into account, as well as properties of the word itself?
  15. Step 4b: an LSTM neural net classifies words by reading word sequences. Weston, L. et al., J. Chem. Inf. Model. (2019). (A minimal sketch follows below.)
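
A minimal sketch of a bidirectional LSTM tagger in PyTorch. This is not the authors' exact architecture (the published system is more elaborate, e.g. with additional word features); it only illustrates how an LSTM reads the whole word sequence so that each word's label depends on its context:

    # Hedged sketch: a toy BiLSTM token classifier, not the published model.
    import torch
    import torch.nn as nn

    class BiLSTMTagger(nn.Module):
        def __init__(self, vocab_size, embed_dim=200, hidden_dim=128, num_tags=7):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                bidirectional=True)
            # One score per entity type for each word in the sequence.
            self.classify = nn.Linear(2 * hidden_dim, num_tags)

        def forward(self, token_ids):             # (batch, seq_len)
            x = self.embed(token_ids)             # (batch, seq_len, embed_dim)
            x, _ = self.lstm(x)                   # reads left and right context
            return self.classify(x)               # (batch, seq_len, num_tags)

    # Toy usage: tag an 8-word sentence like "The band gap of ___ is 4.5 eV".
    tagger = BiLSTMTagger(vocab_size=10000)
    tokens = torch.randint(0, 10000, (1, 8))
    tag_scores = tagger(tokens)
    print(tag_scores.shape)  # torch.Size([1, 8, 7])
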
  16. OK, so how does this work? A high-level view. Weston, L. et al., J. Chem. Inf. Model. (2019).
  17. Step 5: let the model label things for you! Named entity recognition:
     • Custom machine learning models extract the most valuable materials-related information.
     • Utilizes a long short-term memory (LSTM) network trained on ~1,000 hand-annotated abstracts.
     • Overall F1 scores are ~0.9; the F1 score for inorganic-materials extraction is >0.9.
     Weston, L. et al., J. Chem. Inf. Model. (2019).
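
Continuing the BiLSTM sketch above, labeling reduces to taking the best-scoring tag per word. The tag names below are illustrative placeholders loosely matching the entity types on slide 6, not the exact published tag set:

    # Hedged continuation of the earlier sketch: convert per-word tag scores
    # into label strings. Tag set is illustrative only.
    TAGS = ["O", "MAT", "PRO", "APL", "SMT", "CMT", "DSC"]
    best = tag_scores.argmax(dim=-1)[0]       # best tag index for each word
    print([TAGS[i] for i in best.tolist()])   # one label per input word
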
  18. Could these techniques also be used to predict which materials we might want to screen for an application? [same figure: papers to read "someday", feeding into NLP algorithms]
  19. Making predictions: dot products measure the likelihood that words co-occur.
     • The dot product of a composition word with the word "thermoelectric" essentially predicts how likely that word is to appear in an abstract with the word "thermoelectric".
     • Compositions with high dot products are typically known thermoelectrics.
     • Sometimes a composition has a high dot product with "thermoelectric" but has never been studied as a thermoelectric; these compositions usually have high computed power factors (DFT + BoltzTraP). (See the ranking sketch below.)
     Tshitoyan, V. et al., Nature 571, 95–98 (2019).
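
A hedged sketch of that ranking: score candidates by cosine similarity to "thermoelectric" (for unit-normalized vectors this equals the dot product). It assumes the `model` from the earlier word2vec sketch and that every candidate word exists in its vocabulary:

    # Hedged sketch, not the paper's exact pipeline.
    import numpy as np

    def rank_candidates(model, candidates, query="thermoelectric"):
        q = model.wv[query]
        q = q / np.linalg.norm(q)
        scored = [(c, float(np.dot(q, model.wv[c] / np.linalg.norm(model.wv[c]))))
                  for c in candidates]
        # Highest-scoring compositions are the strongest predictions.
        return sorted(scored, key=lambda s: s[1], reverse=True)

    print(rank_candidates(model, ["Bi2Te3", "GaAs"]))
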
  20. Try "going back in time": rank materials using only papers available up to a given year, then follow what happens in later years. Tshitoyan, V. et al., Nature 571, 95–98 (2019).
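
A minimal sketch of that back-testing protocol, reusing the pieces above; the `abstracts_by_year` data layout is an assumption for illustration, not from the talk:

    # Hedged sketch: train embeddings only on abstracts published before a
    # cutoff year, then rank candidates. Papers from later years reveal how
    # many top-ranked compositions were eventually studied as thermoelectrics.
    def backtest(abstracts_by_year, candidates, cutoff_year):
        corpus = [doc for year, docs in abstracts_by_year.items()
                  if year < cutoff_year for doc in docs]
        model = Word2Vec(sentences=corpus, vector_size=200, sg=1, min_count=1)
        return rank_candidates(model, candidates)[:50]  # top-50 predictions
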
  21. We also published a list of potential new thermoelectrics. It is one thing to test retroactively, but perhaps another to see how things go after publication. Tshitoyan, V. et al., Nature 571, 95–98 (2019).
  22. Two of the predictions were studied between submission and publication of the manuscript. Tshitoyan, V. et al., Nature 571, 95–98 (2019).
  23–25. More were studied since then (mainly computationally); see also https://arxiv.org/abs/2010.08461. Tshitoyan, V. et al., Nature 571, 95–98 (2019).
  26. Our collaborators also synthesized one of the predicted compounds, finding a moderate zT of 0.14. Tshitoyan, V. et al., Nature 571, 95–98 (2019).
  27. How is this working? "Context words" link together information from different sources.
  28. Related projects:
     • A doping database, extracting statements such as "undoped, anion-doped (Sb, Bi) and cation-doped (Ca, Zn) solid solutions of Mg10Si2Sn3…".
     • Predicting the shape of nanoparticles (cube, rod, sphere).
     • Data extraction from figures: data snippets are extracted fully automatically from original figures and replotted.
  29. We are creating a comprehensive software library for materials science NLP research (multiple research groups): https://github.com/lbnlp
  30. The Matscholar team: Kristin Persson, Anubhav Jain, Gerbrand Ceder, John Dagdelen, Leigh Weston (now at Medium), Vahe Tshitoyan (now at Google), Amalie Trewartha, Alex Dunn, Viktoriia Baibakova. [Funding acknowledgments] Slides (already) posted to hackingmaterials.lbl.gov
