PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

PyParis 2017
http://pyparis.org

Published in: Technology

  1. How to prepare data for NLP. Loryfel Nunez, @lorynyc
  2. California Gold Rush
  3. "Extracting actionable information from modern big data sets requires the equivalent processing infrastructure of extracting a nugget of GOLD from a mountain of DIRT." Nikolas Markou (via LinkedIn)
  4. (1) Have an intuition on how things work. (2) Breaking data down. (3) Keep it simple ... if possible.
  5. How does it work, anyway? (1)
  6. The General NLP Problem. dog: 3, 2, 1; red coat: 0, 0, 1; 😋 😭
  7. Controlling the input: document unit; representation of text.
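One plausible reading of the numbers on the "General NLP Problem" slide is that each term is represented by its count in each of three documents. The sketch below illustrates that reading only; the three example sentences and the `term_counts` helper are hypothetical and were made up so that the counts match the slide.

```python
import re

def term_counts(documents, terms):
    """Count case-insensitive whole-word occurrences of each term per document."""
    table = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term.lower()) + r"\b")
        table[term] = [len(pattern.findall(doc.lower())) for doc in documents]
    return table

# Hypothetical documents, chosen only to reproduce the counts on the slide.
docs = [
    "The dog chased another dog while a third dog watched.",
    "One dog barked at the other dog.",
    "A dog in a red coat.",
]

counts = term_counts(docs, ["dog", "red coat"])
for term, row in counts.items():
    print(term, row)
```

The choice of document unit (sentence, paragraph, whole article) changes these vectors, which is why the talk treats it as part of controlling the input.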
  8. Inside the Machine. Smith acquires shares of Novak and Kline for $10.99 per share. (The sentence appears four times on the slide.)
  9. BREAK IT DOWN (2)
  10. Let's Break it Down. á; Novák; Novák and Kline; Smith acquires shares of Novak and Kline for $10.99 per share. Smith acquires shares of Novak and Kline for $10.99 per share. Smith Inc. acquires shares of Novak and Kline for $10.99 per share. Smith acquires common shares of N & K for $10.99/share.
  11. In the real world: <p><b>Smith Buys Novak</b></p> <p></p> <p>by Anna Smith<p> <p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for $10.99/share.</p> <table style="width:100%"> <tr><th>Col1</th><th>Col2</th> </tr> <tr><td>data1</td><td>data2</td></tr> </table>
  12. … if possible (2)
  13. Character: á, &amp;. Do you know the encoding of your input data? ◉ User tells you ◉ Metadata ◉ Figure it out (using chardet, or similar) ◉ Have your own heuristics
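The encoding options on the "Character" slide can be sketched with the standard library alone. This is a simple fallback heuristic, not the chardet-based detection the slide mentions: it tries a declared encoding first (from the user or metadata), then a short list of guesses. The `decode_bytes` helper and its fallback order are assumptions for illustration.

```python
import html

def decode_bytes(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying the declared encoding first."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # latin-1 maps every byte to a character, so this last resort never fails.
    return raw.decode("latin-1"), "latin-1"

utf8_text, utf8_enc = decode_bytes("Novák".encode("utf-8"))
legacy_text, legacy_enc = decode_bytes("Novák".encode("cp1252"))

# HTML character references such as &amp; are a separate layer on top of the
# byte encoding; html.unescape resolves them once the text is decoded.
plain = html.unescape("Novak &amp; Kline")
print(utf8_enc, legacy_enc, plain)
```

A statistical detector is usually more reliable than a fixed fallback list when the input comes from many sources.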
  14. Tokens. Forty-two, 42; Post-colonial, postcolonial; eBay, Ebay, EBAY, ebay; Fed, FED, fed; C.A.T., CAT. Heuristics; mappings; transformations; numToWord, POS (from spaCy or NLTK).
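The token variants above can be collapsed with a small mapping table plus a case-folding rule. The sketch below is one possible arrangement, not the speaker's implementation; the `MAPPINGS` and `PROTECTED` tables and the `normalize_token` function are hypothetical names, and the decision to keep "Fed" distinct from "fed" is an assumption about what a finance-oriented pipeline would want.

```python
# Hypothetical mapping from lower-cased variants to one canonical form.
MAPPINGS = {
    "forty-two": "42",
    "post-colonial": "postcolonial",
    "c.a.t.": "cat",
}

# Hypothetical tokens whose capitalization carries meaning and must survive
# case folding (the institution "Fed" versus the verb "fed").
PROTECTED = {"Fed": "Fed", "FED": "Fed"}

def normalize_token(token):
    """Map a surface form to its canonical token."""
    if token in PROTECTED:
        return PROTECTED[token]
    lowered = token.lower()
    return MAPPINGS.get(lowered, lowered)

variants = ["Forty-two", "42", "Post-colonial", "eBay", "EBAY", "Fed", "fed", "C.A.T."]
normalized = [normalize_token(t) for t in variants]
print(normalized)
```

A part-of-speech tag, as the slide suggests, would be a sturdier way to separate "Fed" from "fed" than a fixed lookup.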
  15. Tokens: STEMMING vs LEMMATIZATION (spaCy, NLTK)

      import spacy
      from nltk.stem.porter import PorterStemmer

      # 'en' is the model shortcut from older spaCy releases; spaCy 3 and later
      # need a full model name such as 'en_core_web_sm'.
      nlp = spacy.load('en')
      stemmer = PorterStemmer()
      doc = nlp(u'She is an intelligence operative.')
      for word in doc:
          stemmed = stemmer.stem(word.text)
          print(word.text, " LEMMA => ", word.lemma_, " STEM => ", stemmed)

      Output shown on the slide:

      She LEMMA => -PRON- STEM => she
      is LEMMA => be STEM => is
      an LEMMA => an STEM => an
      intelligence LEMMA => intelligence STEM => intellig
      operative LEMMA => operative STEM => oper
      . LEMMA => . STEM => .
  16. Entities. Novak and Kline, NK, NYSE:NK, Test Company. June 30, 2017; 06/30/2017; 30/6/2017. Smith acquires shares of Novak and Kline for $10.99 per share. Smith acquires shares of NK for $10.99 per share. ORG acquires shares of ORG for $10.99 per share.
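The entity slide shows two normalizations: different names for the same organization collapse to an ORG placeholder, and different date spellings refer to the same day. A minimal sketch follows, assuming a hand-written alias list rather than a trained entity recognizer; `ORG_ALIASES`, `DATE_FORMATS`, `mask_orgs`, and `to_iso_date` are hypothetical names.

```python
import re
from datetime import datetime

# Hypothetical gazetteer of organization names and their aliases.
ORG_ALIASES = ["Novak and Kline", "NYSE:NK", "NK", "Smith"]

# Formats are tried in order, so an ambiguous date such as 01/02/2017 is read
# US-style (month first). That ordering is an assumption of this sketch.
DATE_FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y"]

def mask_orgs(text, aliases=ORG_ALIASES, label="ORG"):
    """Replace every known alias with a placeholder label."""
    # Longest alias first, so "Novak and Kline" wins over any shorter overlap.
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b")
    return pattern.sub(label, text)

def to_iso_date(text, formats=DATE_FORMATS):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

masked = mask_orgs("Smith acquires shares of NK for $10.99 per share .")
dates = [to_iso_date(d) for d in ["June 30, 2017", "06/30/2017", "30/6/2017"]]
print(masked)
print(dates)
```

Note that `%B` matches English month names only under an English or default C locale.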
  17. Hot or Not. REMOVING: words (emails, dates, URLs, stop words); more than words (tables). HIGHLIGHTING: words (hotwords); more than words (hot patterns). Tool: textacy.
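Removing noise and highlighting hot words can be sketched with regular expressions from the standard library. The slide names textacy for this step; the code below does not use it and is only a simplified stand-in. The patterns, the `STOP_WORDS` and `HOT_WORDS` sets, and `clean_and_flag` are hypothetical, and the patterns are deliberately loose.

```python
import re

# Loose illustrative patterns; real-world emails and URLs need more care.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical word lists for this example.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "or", "to", "at"}
HOT_WORDS = {"acquires", "acquired", "acquisition"}

def clean_and_flag(text):
    """Drop URLs, emails and stop words; report which hot words were seen."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    tokens = [t.strip(".,;:!?") for t in text.split()]
    kept = [t for t in tokens if t and t.lower() not in STOP_WORDS]
    hot = sorted({t.lower() for t in kept if t.lower() in HOT_WORDS})
    return {"tokens": kept, "hot_words": hot}

result = clean_and_flag(
    "Smith acquires shares of Novak and Kline. "
    "Contact press@example.com or see http://example.com/deal"
)
print(result)
```

Hot words are reported rather than removed, so a later step can turn them into a feature or flag on the document.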
  18. In the real world: <p><b>Smith Buys Novak</b></p> <p></p> <p>by Anna Smith<p> <p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for $10.99/share.</p> <table style="width:100%"> <tr><th>Col1</th><th>Col2</th> </tr> <tr><td>data1</td><td>data2</td></tr> </table>
  19. IRL: {'title': 'Smith Buys …', 'original_text': 'LONDON --- Smith..', 'transformed_text': {'text_with_entities': 'LOCATION – ORG acquired …. ', 'lemmatized': 'Smith Inc acquire share..', 'has_acquired': true}, 'table': '<table>….. </table>'}
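Turning the raw HTML into a record like the one on the "IRL" slide can be sketched with Python's built-in `html.parser`. This is an illustration under several assumptions, not the speaker's pipeline: the title is taken from bold text, a paragraph starting with "by " is treated as a byline (a field the slide does not show), and the table is stored as rows of cells instead of raw markup. The `ArticleParser` class and `html_to_record` function are hypothetical.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect the title, byline, body text and table cells from simple HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into &
        self.title, self.byline, self.body, self.rows = [], [], [], []
        self._in_bold = self._in_table = self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_bold = True
        elif tag == "table":
            self._in_table = True
        elif tag == "tr" and self._in_table:
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_bold = False
        elif tag == "table":
            self._in_table = False
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_table:
            if self._in_cell and self.rows:
                self.rows[-1].append(text)
        elif self._in_bold:
            self.title.append(text)
        elif text.lower().startswith("by "):
            self.byline.append(text)  # assumed byline heuristic
        else:
            self.body.append(text)

def html_to_record(markup):
    parser = ArticleParser()
    parser.feed(markup)
    parser.close()
    return {
        "title": " ".join(parser.title),
        "byline": " ".join(parser.byline),
        "original_text": " ".join(parser.body),
        "table": parser.rows,
    }

SAMPLE = (
    "<p><b>Smith Buys Novak</b></p> <p></p> <p>by Anna Smith<p> "
    "<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. "
    "for $10.99/share.</p> <table style=\"width:100%\"> "
    "<tr><th>Col1</th><th>Col2</th> </tr> "
    "<tr><td>data1</td><td>data2</td></tr> </table>"
)
record = html_to_record(SAMPLE)
print(record)
```

Keeping `original_text` next to any transformed versions, as the slide's record does, makes it possible to rerun or audit later normalization steps.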
  20. The General NLP Problem. dog: 3, 2, 1; red coat: 0, 0, 1; 😋 😭
  21. (1) Have an intuition on how things work: how algorithms see text. (2) Breaking data down: from bytes to documents. (3) Keep it simple ... if possible: patterns, normalization, metadata, actions (replace, remove, highlight).
  22. References and other stuff: ◉ Stanford NLP Group ◉ spaCy Documentation ◉ scikit-learn Documentation ◉ The hard knocks of NLP projects
  23. Any questions? You can find me at ◉ @lorynyc ◉ loryn808@gmail.com. Thanks!