Finding Unavailable Data

This talk is about data: where to get it and how to create it if it doesn't exist. I'll take the audience through the process of creating the dataset for my most recent project and show how to view unavailable data as an opportunity rather than an obstacle to answering questions. I'll cover how to get and read data, popular Python libraries for data analysis and processing (NLTK, the Natural Language Toolkit; pandas; and Gensim), and techniques like regular expressions.

Published in: Data and Analytics

  1. Yeli @YellzHeard omayeli.com
  2. Finding Unavailable Data
  3. What is the male equivalent of a nun?
  4. google.com
  5. google.com
  6. quora.com
  7. quora.com
  8. english.stackexchange.com
  9. 1. You can find all the gendered words. 2. You can find the equivalent of a gendered word.
  10. → lady / gentleman → prince / princess → king / queen → father / mother → seamstress / seamster → ministress / minister → iron man → cougar
  11. Where and how to get data
  12. APIs, Static Data, Web Scraping
  13. ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife'] ['man', 'male', 'boy', 'men', 'son', 'father', 'husband'] A gendered word is a word with one of these terms (above) in its definition.
  14. APIs: Application Programming Interfaces
  15. programmableweb.com
  16. wordnik.com
  17. wordnik.com
  18. ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife'] ['man', 'male', 'boy', 'men', 'son', 'father', 'husband']
  19. 400 words
  20. Static Data: .json .txt .csv ...
  21. → lady / gentleman → prince / princess → king / queen → father / mother → seamstress → ministress → iron man
  22. ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife'] ['man', 'male', 'boy', 'men', 'son', 'father', 'husband']
  23. Regular Expressions -> a sequence of characters that defines a search pattern
  24. regextester.com
  25. ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife'] ['man', 'male', 'boy', 'men', 'son', 'father', 'husband']
  26. regextester.com
  27. regextester.com
  28. ~ 8000 words
  29. Patterns
  30. Patterns -> object of a preposition
  31. nltk -> natural language toolkit -> for processing the English language
  32. text-processing.com/demo/tokenize Tokenization -> chopping up a string into pieces (called tokens) -> throwing away certain characters, such as punctuation
  33. Patterns -> object of a preposition -> clothing items
  34. collinsdictionary.com/us/word-list
  35. Web Scraping (icons made by Smashicons from www.flaticon.com/authors/smashicons)
  36. urllib.request -> opening URLs; BeautifulSoup -> parsing HTML documents
  37. ~ 4000 words
  38. → lady / gentleman → prince / princess → king / queen → father / mother → actor / actress
  39. bionlp-www.utu.fi/wv_demo/
  40. Word2Vec -> words to vectors
  41. suriyadeepan.github.io
  42. My meal wasn’t very tasty so I put some maggi on it. My meal wasn’t very tasty so I put some salt on it. My meal wasn’t very tasty so I put some seasoning on it. I sat on the chair to eat my meal.
  43. Gensim -> Google trained word2vec model
  44. Yeli @YellzHeard omayeli.com
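
The slides define a gendered word as one whose definition contains a term from the two lists, and name regular expressions as the search tool. A minimal sketch of that check follows. The term lists are copied from the slides; the function names and the sample definitions are hypothetical, written only for illustration, and the talk's actual patterns are not shown in the transcript. Word boundaries (`\b`) are one way to stop "man" from matching inside "woman", or "son" inside "person".

```python
import re

# Term lists copied from the slides.
FEMALE_TERMS = ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife']
MALE_TERMS = ['man', 'male', 'boy', 'men', 'son', 'father', 'husband']


def build_pattern(terms):
    """Compile a case-insensitive pattern matching any term as a whole word."""
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


FEMALE_PATTERN = build_pattern(FEMALE_TERMS)
MALE_PATTERN = build_pattern(MALE_TERMS)


def classify_definition(definition):
    """Return 'female', 'male', 'both', or None for a definition string."""
    has_female = FEMALE_PATTERN.search(definition) is not None
    has_male = MALE_PATTERN.search(definition) is not None
    if has_female and has_male:
        return "both"
    if has_female:
        return "female"
    if has_male:
        return "male"
    return None


# Hypothetical definitions, written only for this illustration.
SAMPLE_DEFINITIONS = {
    "nun": "A woman who belongs to a religious order.",
    "monk": "A man who belongs to a religious order.",
    "chair": "A seat with a back, for one person.",
}

labels = {word: classify_definition(text) for word, text in SAMPLE_DEFINITIONS.items()}
print(labels)
```

With these sample definitions, "nun" is labelled female, "monk" male, and "chair" gets no label even though "person" contains "son".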
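
The web scraping slides pair `urllib.request` (opening URLs) with BeautifulSoup (parsing HTML). The sketch below shows the same two steps, but swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra installs. The HTML snippet, the `ListItemCollector` class, and the assumption that the words sit in `<li>` elements are all hypothetical; the real markup of the word-list page is not given in the slides.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and decode it as UTF-8 (defined but not called here)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class ListItemCollector(HTMLParser):
    """Collect the text content of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0    # greater than 0 while inside an <li>
        self._buffer = []  # text fragments of the current <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.items.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


# Hypothetical markup standing in for a downloaded word-list page.
SAMPLE_HTML = """
<html><body>
  <h1>Clothing</h1>
  <ul>
    <li>apron</li>
    <li><a href="#">bow tie</a></li>
    <li> skirt </li>
  </ul>
</body></html>
"""

collector = ListItemCollector()
collector.feed(SAMPLE_HTML)
collector.close()
words = collector.items
print(words)
```

For a live page, `collector.feed(fetch_html(url))` would replace the sample string, subject to the site's terms of use.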

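The four example sentences on the Word2Vec slides suggest the intuition that words used in the same context ("maggi", "salt", "seasoning") should end up with similar vectors. The sketch below illustrates only that intuition, using simple neighbour counts and cosine similarity. It is not Word2Vec itself, which learns dense vectors with a neural network, and it does not use Gensim or the Google-trained model. The tokenizer, the window size of 2, and the function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Sentences from the slides (with straight apostrophes).
SENTENCES = [
    "My meal wasn't very tasty so I put some maggi on it.",
    "My meal wasn't very tasty so I put some salt on it.",
    "My meal wasn't very tasty so I put some seasoning on it.",
    "I sat on the chair to eat my meal.",
]


def tokenize(sentence):
    """Lowercase a sentence and keep word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", sentence.lower())


def context_vector(target, sentences, window=2):
    """Count the words appearing within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for index, token in enumerate(tokens):
            if token == target:
                start = max(0, index - window)
                before = tokens[start:index]
                after = tokens[index + 1:index + 1 + window]
                counts.update(before + after)
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


maggi = context_vector("maggi", SENTENCES)
salt = context_vector("salt", SENTENCES)
chair = context_vector("chair", SENTENCES)

sim_maggi_salt = cosine_similarity(maggi, salt)
sim_maggi_chair = cosine_similarity(maggi, chair)
print(sim_maggi_salt, sim_maggi_chair)
```

Because "maggi" and "salt" share exactly the same neighbours here, their similarity is 1.0, while "maggi" and "chair" share only "on" and score 0.25.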