Materials design using knowledge from millions of journal articles via natural language processing techniques

Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio

Confira estes a seguir

1 de 57 Anúncio

Mais Conteúdo rRelacionado

Diapositivos para si (20)

Semelhante a Materials design using knowledge from millions of journal articles via natural language processing techniques (20)

Anúncio

Mais de Anubhav Jain (18)

Mais recentes (20)

Anúncio


  1. Materials design using knowledge from millions of journal articles via natural language processing techniques. Anubhav Jain, Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA. IMX Virtual Seminar, April 11, 2021. Slides (already) posted to hackingmaterials.lbl.gov
  2. Typically, both new materials discovery and optimization take decades. • Often, materials are known for several decades before their functional applications are recognized – MgB2 sat on lab shelves for 50 years before its identification as a superconductor in 2001 – LiFePO4 was known since 1938 but only identified as a Li-ion battery cathode in 1997 • Even after discovery, optimization and commercialization still take decades • How is this typically done?
  3. What constrains traditional approaches to materials design? “[The Chevrel] discovery resulted from a lot of unsuccessful experiments of Mg ions insertion into well-known hosts for Li+ ions insertion, as well as from the thorough literature analysis concerning the possibility of divalent ions intercalation into inorganic materials.” – Aurbach group, on the discovery of the Chevrel cathode for multivalent (e.g., Mg2+) batteries. Levi, Levi, Chasid, Aurbach, J. Electroceramics (2009)
  4. Researchers are starting to fundamentally rethink how we invent the materials that make up our devices: next-generation materials design via computer-aided materials design, natural language processing, and “self-driving laboratories”.
  5. Outline ① Natural language processing – where are we right now? ② What’s next for the NLP work?
  6. Can ML help us work through the backlog of information we need to assimilate from text sources? [Figure: stack of papers to read “someday” → NLP algorithms]
  7. Traditional search doesn’t answer the questions we want. • It is difficult to look up all information on any given material due to the many different ways chemical compositions are written – a search for “TiNiSn” will give different results than “NiTiSn” – a search for “GaSb” won’t match text that reads “Ga0.5Sb0.5” – a search for “SnBi4Te7” won’t match text that reads “we studied SnBi4X7 (X=S, Se, Te)” – a search for “AgCrSe2”, if it has no hits, won’t suggest “CuCrSe2” as a similar result (see the normalization sketch below) • It is difficult to ask questions or compile summaries, e.g.: – What is the band gap of “Si”? – What are all the known dopants in GaAs? – What are all materials studied as thermoelectrics?
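A minimal sketch of how formula normalization can address the first two lookup problems. The slide does not name a tool; pymatgen's Composition class is used here purely for illustration:

```python
from pymatgen.core import Composition

# "TiNiSn" and "NiTiSn" contain the same elements in the same amounts,
# so both reduce to one canonical formula; an index keyed on
# reduced_formula matches either spelling.
assert Composition("TiNiSn").reduced_formula == Composition("NiTiSn").reduced_formula

# "GaSb" and "Ga0.5Sb0.5" differ as strings but have identical
# fractional compositions, so they match after normalization.
assert (Composition("GaSb").fractional_composition
        == Composition("Ga0.5Sb0.5").fractional_composition)
```

Fuzzier cases on the slide (variable stoichiometry like “SnBi4X7 (X=S, Se, Te)”, or suggesting chemically similar compounds) need more than string normalization, which is where the embedding methods later in the talk come in.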
  8. What is Matscholar? • Matscholar is an attempt to organize the world’s information on materials science, connecting topics of study, synthesis and characterization methods, and specific materials compositions • It is also an effort to use state-of-the-art natural language processing to make collective use of the information in millions of articles
  9. One of our main projects concerns named entity recognition, or automatically labeling text. This allows for search and is crucial to downstream tasks.
  10. Data collection: over 4 million papers collected from more than 2,100 journals. Entities extracted so far: 31 million properties, 19 million materials mentions, 8.8 million characterization methods, 7.5 million applications, and 5 million synthesis methods. Note – entities are currently extracted only from the abstracts of the papers.
  11. Now we can search! Live on www.matscholar.com (live demo).
  12. Limitations (it is not perfect) • The publication data set is not complete • Currently analyzing abstracts only • The algorithms are not perfect • The search interface could be improved further • We would like to hear from you if you try this!
  13. How does this work? High-level view. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  14. Step 1 – data collection. Extracted 4 million abstracts of relevant scientific articles using various APIs from journal publishers. Some are more difficult than others to obtain. Data cleaning is often needed (e.g., stray HTML tags, copyright statements). Abstract collection continues…
  15. How does this work? High-level view. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  16. Step 2 – tokenization. • First split the text into sentences – Seems simple, but remember edge cases: “et al.” or “etc.” does not necessarily signify the end of a sentence despite the period (see the splitting sketch below) • Then split the sentences into words – Tricky parts are detecting and normalizing chemical formulas, selective lowercasing (“Battery” vs. “battery”, or “BaS” vs. “BAs”), homogenizing numbers, etc. • Historically done with ChemDataExtractor* with some custom improvements – We are moving towards a fully custom tokenizer. *http://chemdataextractor.org
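To make the “et al.” edge case concrete, here is a deliberately naive sentence splitter that protects a few common abbreviations. This is not ChemDataExtractor or the custom tokenizer, just an illustration of the problem:

```python
import re

# Abbreviations whose trailing period should not end a sentence.
ABBREVIATIONS = ("et al.", "etc.", "e.g.", "i.e.", "Fig.", "Ref.")

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter that protects common abbreviations."""
    protected = text
    for i, abbr in enumerate(ABBREVIATIONS):
        # Temporarily swap the abbreviation's final period for a placeholder.
        protected = protected.replace(abbr, abbr[:-1] + f"<DOT{i}>")
    sentences = re.split(r"(?<=[.!?])\s+", protected)
    # Restore the original periods.
    restored = []
    for s in sentences:
        for i, abbr in enumerate(ABBREVIATIONS):
            s = s.replace(abbr[:-1] + f"<DOT{i}>", abbr)
        restored.append(s)
    return restored

print(split_sentences("As shown by Weston et al. in 2019, NER works well. Band gaps were measured."))
# -> ['As shown by Weston et al. in 2019, NER works well.', 'Band gaps were measured.']
```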
  17. How does this work? High-level view. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  18. Step 3 – hand-label abstracts. • Part A is marking abstracts as relevant/non-relevant to inorganic materials science • Part B is tediously labeling ~600 abstracts – Largely done by one person – A spot-check of 25 abstracts by a second person gave 87.4% agreement
  19. How does this work? High-level view. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  20. Step 4a: the word2vec algorithm is used to “featurize” words. • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector • These vectors encode the meaning of each word, based on trying to predict the context words around the target word. Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai (2017).
  21. Step 4a (continued). “You shall know a word by the company it keeps” – John Rupert Firth (1957).
  22. Word embeddings trained on “normal” text learn relationships between words. • The classic example is: “king” − “man” + “woman” = ? → “queen” (see the sketch below)
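As a concrete illustration, here is a minimal skip-gram training and analogy query. The gensim library is an assumption here; the slides name only the word2vec algorithm itself:

```python
from gensim.models import Word2Vec

# Toy stand-in corpus: each item is one tokenized sentence. In the real
# pipeline this would be the ~4 million tokenized abstracts.
corpus = [
    ["the", "band", "gap", "of", "si", "is", "1.1", "ev"],
    ["gaas", "is", "a", "direct", "band", "gap", "semiconductor"],
]

# sg=1 selects skip-gram; vector_size=200 matches the slide.
model = Word2Vec(corpus, vector_size=200, window=8, min_count=1, sg=1)
print(model.wv["gaas"].shape)  # (200,) -- one vector per unique word

# On a large general-text corpus, the classic analogy emerges:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
# -> [("queen", ...)]
```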
  23. OK, so how does this work? High-level view. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  24. Step 4b: How do we train a model to recognize context? • If you read this sentence: “The band gap of ___ is 4.5 eV”, it is clear that the blank should be filled in with a material word (not a synthesis method, characterization method, etc.). How do we get a neural network to take into account context (as well as properties of the word itself)?
  25. Step 4b. An LSTM neural net classifies words by reading word sequences (a sketch follows below). Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
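A minimal sketch of a bidirectional LSTM tagger of this kind. tf.keras and the layer sizes are assumptions; the paper's exact architecture and training details differ:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 50_000  # assumed vocabulary size
EMBED_DIM = 200      # matches the 200-d word vectors from step 4a
N_TAGS = 7           # e.g., material, property, application, ... plus "other"

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),  # a padded sequence of word indices
    # Word indices -> 200-d vectors (could be initialized from word2vec).
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    # Reads the sentence forward and backward, so each word's label can
    # depend on context from both sides.
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # One softmax over entity tags per token.
    layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```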
  26. OK, so how does this work? High-level view. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  27. Step 5. Let the model label things for you! Named entity recognition: • Custom machine learning models extract the most valuable materials-related information • Utilizes a long short-term memory (LSTM) network trained on ~1,000 hand-annotated abstracts • F1 scores of ~0.9; the F1 score for inorganic materials extraction is >0.9. Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  28. Could these techniques also be used to predict which materials we might want to screen for an application? [Figure: stack of papers to read “someday” → NLP algorithms]
  29. Remember that word embeddings seem to learn relationships in text. • The classic example is: “king” − “man” + “woman” = ? → “queen”
  30. For scientific text, it learns scientific concepts as well. [Figure: crystal structures of the elements] Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  31. There seems to be materials knowledge encoded in the word vectors. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  32. Note that more data is not always better! We want relevance. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  33. Word embeddings also have the periodic table encoded in them, with no prior knowledge. [Figure: “word embedding” periodic table] Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  34. Making predictions: dot products measure the likelihood for words to co-occur. • The dot product of a composition word with the word “thermoelectric” essentially predicts how likely that word is to appear in an abstract with the word thermoelectric • Compositions with high dot products are typically known thermoelectrics • Sometimes, compositions have a high dot product with “thermoelectric” but have never been studied as a thermoelectric • These compositions usually have high computed power factors! (DFT+BoltzTraP) (See the ranking sketch below.) Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
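A minimal sketch of the ranking step, assuming trained embeddings are available as a word → vector mapping (the function name and normalization choice are assumptions; vectors are L2-normalized so the dot product equals cosine similarity):

```python
import numpy as np

def rank_candidates(wv, candidates, query="thermoelectric", topn=10):
    """Rank composition words by dot product with a query word's vector.

    wv: mapping from word -> embedding vector (e.g., gensim KeyedVectors).
    candidates: composition words to score.
    """
    q = wv[query] / np.linalg.norm(wv[query])
    scores = {}
    for c in candidates:
        v = wv[c] / np.linalg.norm(wv[c])
        scores[c] = float(np.dot(q, v))  # high score = likely to co-occur
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]
```

Candidates that score highly but have never appeared in a thermoelectrics abstract are the interesting predictions on this slide.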
  35. Try “going back in time”: rank materials using only older literature, then follow what happens in later years. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  36. A more comprehensive “back in time” test. – For every year since 2001, predict the most promising thermoelectrics using only literature data available up to that year – See if those materials were actually studied as thermoelectrics in subsequent years. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  37. We also published a list of potential new thermoelectrics. It is one thing to retroactively test, but perhaps another to see how things go after publication. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  38. Two were studied between submission and publication of the manuscript. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  39.–41. More were studied since then (mainly computationally). Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019). https://arxiv.org/abs/2010.08461
  42. Our collaborators also synthesized a prediction, finding a moderate zT of 0.14. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
  43. How is this working? “Context words” link together information from different sources.
  44. Outline ① Natural language processing – where are we right now? ② What’s next for the NLP work?
  45. 1. Automatic creation of structured materials databases from the literature, e.g., a doping database. Example extractions (sentence → base material; dopant; doping concentration — see the pattern-matching sketch below):
  – “…the influence of yttrium doping (0–10 mol%) on BSCF…” → BSCF; yttrium; 0–10 mol%
  – “undoped, anion-doped (Sb, Bi) and cation-doped (Ca, Zn) solid solutions of Mg10Si2Sn3…” → Mg10Si2Sn3; Sb, Bi, Ca, Zn
  – “The zT of As2Cd3 with electron doping is found to be ~ with n=10^20 cm−3” → As2Cd3; electron; n=10^20 cm−3
  – “This leads to zT=0.5 obtained at 500 K (p=10^20 cm−3) in p-type As2Cd3” → As2Cd3; p-type; p=10^20 cm−3
  – “The undoped and 0.25 wt% La doped CdO films show… however, … for doping concentrations greater than 0.50 wt%.” → CdO; La; 0.25 wt%, >0.5 wt%
  This will allow you to answer questions like “what are all the materials known to be doped with Eu3+?”
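Below is a heavily simplified, pattern-based sketch of the extraction idea. The actual pipeline builds on the trained NER model; this regex and its capture groups are purely illustrative and miss most real phrasings:

```python
import re

# Matches phrases like "0.25wt% La doped CdO" or "Eu-doped GaN".
DOPED_PATTERN = re.compile(
    r"(?P<conc>[\d.]+\s*(?:wt%|mol%|at%))?\s*"   # optional concentration
    r"(?P<dopant>[A-Z][a-z]?[a-z]*)[- ]doped\s+"  # dopant word before "doped"
    r"(?P<base>[A-Za-z0-9]+)"                     # base material after it
)

sentence = "The undoped and 0.25wt% La doped CdO films show improved conductivity."
m = DOPED_PATTERN.search(sentence)
if m:
    print(m.group("base"), m.group("dopant"), m.group("conc"))
    # -> CdO La 0.25wt%
```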
  46. Initial results – 10,000 abstracts. [Figure: most frequently extracted dopants and base materials]
  47. 2. Learning representations of materials ● Mat2vec suggested that embeddings contain chemical information ● Can we make embeddings for arbitrary materials as material descriptors, i.e., word embeddings for materials not in the literature? ● Descriptors could be used for direct classification for an application (link prediction) or for quantitative property prediction (regression features)
  48. AnyMat2Vec expands word embeddings beyond materials seen explicitly by the algorithm (one illustrative construction is sketched below).
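The slide does not spell out how AnyMat2Vec builds vectors for unseen materials. The baseline below is an assumption for illustration only: compose a formula's vector from element embeddings weighted by stoichiometry.

```python
import numpy as np
from pymatgen.core import Composition

def composition_vector(formula, wv):
    """Stoichiometry-weighted average of element embeddings.

    wv: mapping from element symbol (e.g., "Ga") to its word vector.
    NOTE: an illustrative baseline, not the actual AnyMat2Vec method.
    """
    comp = Composition(formula).fractional_composition
    return sum(frac * wv[el.symbol] for el, frac in comp.items())
```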
  49. Initial results – predicting experimental band gap from composition (~3,000 data points). (A regression sketch follows below.)
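A sketch of how such embeddings can serve as regression features for band gap prediction. scikit-learn and the random-forest model are assumptions (the slide does not name the model), and the random arrays are stand-ins so the sketch runs:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# X: (n_samples, 200) composition embeddings; y: experimental band gaps (eV).
# Stand-in random data; the real inputs come from the ~3,000-point dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 200))
y = rng.uniform(0, 6, size=3000)

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(f"cross-validated MAE: {-scores.mean():.2f} eV")
```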
  50. 3. Creating a comprehensive software library for materials science NLP research (multiple LBNL research groups): https://github.com/lbnlp
  51. 4. Getting data from figures. Can we automatically extract structured information from figures? [Figure: original figure → data snippet extracted fully automatically → replotted data]
  52. 4a. Detecting the various regions of the plot. (a) & (b): human labeling of axes and legend regions (141 training figures); (c)–(f): model predictions based on the “faster_cnn_inception” model.
  53. 4b. Automatically reading the axis scales. 1. Detect numbers using a custom configuration of the EasyOCR package (see the OCR sketch below) 2. Develop an algorithm to detect exponents (base vs. power) based on size and position: (a) digit detection (previous step); (b) compile coordinates of detected numbers; (c) separate into groups of low and high text height 3. Set tick marks at the center of the text height.
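A minimal sketch of the OCR step using EasyOCR's standard API. The slide's “custom configuration” is not specified, so this shows defaults plus a character allowlist as one plausible restriction; the file name is hypothetical:

```python
import easyocr

# allowlist restricts recognition to characters that appear in axis numbers.
reader = easyocr.Reader(["en"], gpu=False)
results = reader.readtext("axis_region.png", allowlist="0123456789.-eE+")

# Each result is (bounding_box, text, confidence); the box supplies the
# coordinates and text height used downstream for exponent detection.
for bbox, text, conf in results:
    print(text, conf, bbox)
```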
  54. 4c. Getting the data curves (color-based). Starting from the image: 1. Automatically detect distinct colors using iterative k-means clustering 2. Decide which color channels contain relevant data. (A clustering sketch follows below.)
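A sketch of the color-detection idea using a single k-means pass over pixel colors. The slide describes an iterative scheme; the cluster count and the file name here are assumptions:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("plot.png").convert("RGB"))
pixels = img.reshape(-1, 3).astype(float)

# Cluster pixel colors; centers approximate the plot's palette
# (background, axes/text, and one center per data curve).
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_.astype(int)
counts = np.bincount(kmeans.labels_)

# Very large clusters are usually background/axes; smaller clusters
# with distinct colors are candidate data curves.
for color, n in sorted(zip(palette.tolist(), counts.tolist()), key=lambda t: -t[1]):
    print(color, n)
```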
  55. 4d. Initial application: emissivity spectra for nanomaterials (179 curves). [Figure panels: film, 1D grating, 2D grating, 2D cylindrical cavities, bull’s eye, microspheres]
  56. Conclusion • There is a lot of data and knowledge in the historical corpus of scientific journal articles, but extracting that knowledge at scale has been difficult • Machine learning presents a new frontier for making use of this information
  57. The Matscholar team: Kristin Persson, Anubhav Jain, Gerbrand Ceder, John Dagdelen, Leigh Weston, Vahe Tshitoyan, Amalie Trewartha, Alex Dunn, Viktoriia Baibakova (two team alumni are now at Google and at Medium). Funding from [funder logos] + DOE ARPA-E (figure extraction). Slides (already) posted to hackingmaterials.lbl.gov
