SUNG PARK PREDICT 422 Group Project Presentation


  1. TEXT MINING DATA SCIENCE JOBS IN R
     Sung Park, MSPA Candidate
     August 20, 2015
     Northwestern University
     PREDICT 422-DL Section 55
  2. SUMMARY
     • Introduction
     • Resources
     • Data Source
     • Data Extraction
     • Data Preparation
     • Supervised Learning
  3. INTRODUCTION
     • Exploration of web scraping and text mining capabilities in R
     • Unstructured data
     • Kaggle.com job postings
     • Classification using machine learning algorithms
     • Data scientists vs. non-data scientists
  4. RESOURCES
     • Text Analytics Tutorial in R
       • Timothy D’Auria, Boston Decision, LLC
       • https://www.youtube.com/watch?v=j1V2McKbkLo
     • Web Scraping Tutorial in R
       • Sharon Machlis, Computerworld
       • https://www.youtube.com/watch?v=TPLMQnGw0Vk
     • Data Science in R: A Case Study Approach to Computational Reasoning and Problem Solving
       • Deborah Nolan and Duncan Temple Lang
     • Google and Stack Overflow
  5. DATA SOURCE
     • Kaggle.com/jobs
     • August 17, 2015
     • 1,025 Job Postings
       • Data Scientist
       • Big Data Engineer
       • Data Science Architect
       • Data Analyst
       • Marketing Analyst
       • Statistician
       • Data Science Director
  6. DATA EXTRACTION
     • Extracted job links
       • XML Package
       • xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
     • Extracted job posting text
       • rvest Package
       • html_text(html_nodes(htmlpage, "div.postcontent"))
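The two extraction calls on this slide can be tried offline. The sketch below is only an illustration: the inline HTML strings are invented stand-ins for the Kaggle listing and posting pages (an assumption, not the real markup), while the XPath expression and the CSS selector are the ones quoted on the slide.

```r
library(XML)
library(rvest)

# Hypothetical listing page: two job links and one unrelated link.
listing_html <- paste0(
  "<html><body>",
  "<h3><a href='/jobs/1'>Data Scientist</a></h3>",
  "<h3><a href='/jobs/2'>Data Analyst</a></h3>",
  "<h3><a href='/about'>About</a></h3>",
  "</body></html>"
)

# Parse the string and keep only href attributes that start with '/jobs'.
doc <- htmlParse(listing_html, asText = TRUE)
job_links <- unname(xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]"))

# Hypothetical posting page: the posting body sits in div.postcontent.
posting_html <- "<html><body><div class='postcontent'>Data scientist wanted</div></body></html>"
htmlpage <- read_html(posting_html)
posting_text <- html_text(html_nodes(htmlpage, "div.postcontent"))
```

In a real run, each link in `job_links` would presumably be appended to the site's base URL and fetched in turn; that loop is left out here because it needs network access.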
  7. DATA PREPARATION
     • Cleaned the text data
       • tm Package
       • tm_map()
         • Remove punctuation
         • Remove extra white space
         • Lower-casing
         • Remove stopwords
           • “a”, “the”, “and”, “but”, etc.
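A minimal sketch of the cleaning steps listed on this slide, using `tm_map()` from the tm package. The two sample postings are invented, and the order of the steps is an assumption; the slide does not say in which order they were applied.

```r
library(tm)

# Invented sample postings standing in for the scraped text.
postings <- c(
  "We need a Data Scientist, and FAST!",
  "The   analyst builds reports; but not models."
)

corpus <- VCorpus(VectorSource(postings))
corpus <- tm_map(corpus, removePunctuation)                   # remove punctuation
corpus <- tm_map(corpus, content_transformer(tolower))        # lower-casing
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # remove stopwords
corpus <- tm_map(corpus, stripWhitespace)                     # collapse repeated spaces

# Pull the cleaned text back out as a plain character vector.
cleaned <- vapply(
  seq_along(corpus),
  function(i) trimws(content(corpus[[i]])),
  character(1)
)
```

Lower-casing is done before stopword removal here because the tm stopword list is lower-case, so a capitalised "The" would otherwise survive.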
  8. DATA PREPARATION
     • Created the term document matrix (TDM)
  9. DATA PREPARATION
     • TDM consists of 959 job postings and 73 terms
     • 375 data scientists and 584 non-data scientists
     • Split TDM into training set and test set
       • 864 job postings in training sample
       • 95 job postings in test sample
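The TDM and the train/test split might be built along the following lines. This is a sketch under stated assumptions: the ten postings are invented, and the random 90/10 split with a fixed seed is a guess suggested by the 864/95 counts, not a procedure documented in the slides.

```r
library(tm)

# Ten invented postings standing in for the 959 cleaned job postings.
postings <- c(
  "data scientist python machine learning",
  "marketing analyst excel reports",
  "data scientist hadoop statistics",
  "sales manager excel accounts",
  "big data engineer hadoop python",
  "marketing director campaigns sales",
  "statistician models statistics",
  "account executive sales targets",
  "data science architect machine learning",
  "office analyst excel reports"
)

corpus <- VCorpus(VectorSource(postings))
tdm <- TermDocumentMatrix(corpus)   # rows are terms, columns are postings

# Transpose so that each row is a posting and each column a term count.
features <- as.data.frame(t(as.matrix(tdm)))

# Assumed 90/10 random split.
set.seed(422)
n <- nrow(features)
train_idx <- sample(n, size = floor(0.9 * n))
train_x <- features[train_idx, ]
test_x <- features[-train_idx, ]
```

The transpose matters because the classifiers in the following slides expect one observation per row, whereas a TDM stores one term per row.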
  10. SUPERVISED LEARNING
     • K-Nearest Neighbor
       • Find the K value with the highest classification accuracy
       • K=8 shows the best result with 82.98% accuracy rate
       • Confusion matrix shows the model correctly predicted 22 out of 35 data scientist job postings
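The search for the best K can be sketched with `knn()` from the class package. Everything about the data here is an assumption: the term counts are simulated, the term names and the range K = 1 to 10 are chosen for illustration, and the resulting accuracies have no relation to the figures on the slide.

```r
library(class)

# Simulated term counts for 200 postings (hypothetical terms).
set.seed(422)
n <- 200
terms <- data.frame(
  python = rpois(n, 1),
  hadoop = rpois(n, 1),
  excel  = rpois(n, 1),
  sales  = rpois(n, 1)
)
score <- terms$python + terms$hadoop - terms$excel - terms$sales + rnorm(n, sd = 0.5)
label <- factor(ifelse(score > 0, "ds", "non_ds"))

train_x <- terms[1:180, ]
test_x  <- terms[181:200, ]
train_y <- label[1:180]
test_y  <- label[181:200]

# Classification accuracy on the test set for each candidate K.
k_values <- 1:10
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

best_k <- k_values[which.max(accuracy)]
confusion <- table(
  predicted = knn(train_x, test_x, cl = train_y, k = best_k),
  actual = test_y
)
```

Picking K on the test set, as done here for brevity, gives an optimistic accuracy estimate; a separate validation set or cross-validation would be the more careful choice.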
  11. SUPERVISED LEARNING
     • Classification Decision Tree (Gini index)
       • The classification accuracy rate is 96.8%
       • Confusion matrix shows the model correctly predicted 30 out of 33 data scientist job postings
       • Key terms for tree construction:
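A classification tree with Gini splitting can be fitted with `rpart()`. The slides do not name the package used, so rpart is an assumption, as are the simulated term counts and term names below.

```r
library(rpart)

# Simulated term counts for 200 postings (hypothetical terms).
set.seed(422)
n <- 200
terms <- data.frame(
  python = rpois(n, 1),
  hadoop = rpois(n, 1),
  excel  = rpois(n, 1),
  sales  = rpois(n, 1)
)
score <- terms$python + terms$hadoop - terms$excel - terms$sales + rnorm(n, sd = 0.5)
terms$label <- factor(ifelse(score > 0, "ds", "non_ds"))

train <- terms[1:180, ]
test  <- terms[181:200, ]

# Classification tree; split = "gini" selects the Gini index as the impurity measure.
fit <- rpart(label ~ ., data = train, method = "class",
             parms = list(split = "gini"))

pred <- predict(fit, newdata = test, type = "class")
confusion <- table(predicted = pred, actual = test$label)
accuracy <- sum(diag(confusion)) / sum(confusion)
```

Printing `fit` lists the terms chosen at each split, which is one way to read off the key terms a tree relies on.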
  12. SUPERVISED LEARNING
     • Bagging
       • The classification accuracy rate is 96.8%
       • Confusion matrix shows the same results as the classification tree
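Bagging fits many trees on bootstrap resamples of the training set and classifies by majority vote. The sketch below hand-rolls that idea on top of `rpart()` so it needs no extra package; it is not the implementation used in the presentation, which the slides do not identify, and the simulated data and the choice of 25 trees are assumptions.

```r
library(rpart)

# Simulated term counts for 200 postings (hypothetical terms).
set.seed(422)
n <- 200
terms <- data.frame(
  python = rpois(n, 1),
  hadoop = rpois(n, 1),
  excel  = rpois(n, 1),
  sales  = rpois(n, 1)
)
score <- terms$python + terms$hadoop - terms$excel - terms$sales + rnorm(n, sd = 0.5)
terms$label <- factor(ifelse(score > 0, "ds", "non_ds"))

train <- terms[1:180, ]
test  <- terms[181:200, ]

# Fit one tree per bootstrap resample; each column of `votes` holds one tree's predictions.
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit <- rpart(label ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = test, type = "class"))
})

# Majority vote across trees for each test posting.
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(bagged == as.character(test$label))
```

An odd number of trees avoids tied votes in a two-class problem.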
  13. QUESTIONS? COMMENTS?
     Sung Park, MSPA Candidate
     August 20, 2015
     Northwestern University
     PREDICT 422-DL Section 55