SUNG PARK PREDICT 422 Group Project Presentation

TEXT MINING DATA SCIENCE JOBS IN R

Sung Park, MSPA Candidate

August 20, 2015

Northwestern University

PREDICT 422-DL Section 55

SUMMARY

•  Introduction
•  Resources
•  Data Source
•  Data Extraction
•  Data Preparation
•  Supervised Learning

INTRODUCTION

•  Exploration of web scraping and text mining capabilities in R
    •  Unstructured data
    •  Kaggle.com job postings
•  Classification using machine learning algorithms
    •  Data scientists vs. non-data scientists

RESOURCES

•  Text Analytics Tutorial in R
    •  Timothy D’Auria, Boston Decision, LLC
    •  https://www.youtube.com/watch?v=j1V2McKbkLo
•  Web Scraping Tutorial in R
    •  Sharon Machlis, Computerworld
    •  https://www.youtube.com/watch?v=TPLMQnGw0Vk
•  Data Science in R: A Case Study Approach to Computational Reasoning and Problem Solving
    •  Deborah Nolan and Duncan Temple Lang
•  Google and Stack Overflow

DATA SOURCE

•  Kaggle.com/jobs
•  August 17, 2015
•  1,025 Job Postings
    •  Data Scientist
    •  Big Data Engineer
    •  Data Science Architect
    •  Data Analyst
    •  Marketing Analyst
    •  Statistician
    •  Data Science Director

DATA EXTRACTION

•  Extracted job links
    •  XML Package
    •  xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
•  Extracted job posting text
    •  rvest Package
    •  html_text(html_nodes(htmlpage, "div.postcontent"))

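
The two extraction calls above can be tried on small inline HTML strings, as in the sketch below. The markup is made up for illustration and only mimics the structure implied by the XPath and CSS selectors on the slide; the real Kaggle pages may differ. The variable names listing, posting, links, and posting_text are hypothetical.

```r
library(XML)    # htmlParse(), xpathSApply()
library(rvest)  # read_html(), html_nodes(), html_text()

# Hypothetical listing page (assumption: job links sit inside <h3><a href="/jobs/...">)
listing <- '<html><body>
  <h3><a href="/jobs/101/data-scientist">Data Scientist</a></h3>
  <h3><a href="/jobs/102/data-analyst">Data Analyst</a></h3>
  <h3><a href="/about">About</a></h3>
</body></html>'

# Step 1: keep only href attributes that start with "/jobs"
doc <- htmlParse(listing, asText = TRUE)
links <- unname(xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]"))

# Hypothetical job page (assumption: posting text sits in <div class="postcontent">)
posting <- '<html><body>
  <div class="postcontent">We need a data scientist with R skills.</div>
</body></html>'

# Step 2: pull the text of the posting body
htmlpage <- read_html(posting)
posting_text <- html_text(html_nodes(htmlpage, "div.postcontent"))

links
posting_text
```

In a full run, step 2 would presumably be repeated for each link from step 1, with the site's base URL prepended to each relative path.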

DATA PREPARATION

•  Cleaned the text data
    •  tm Package
    •  tm_map()
        •  Remove punctuation
        •  Remove extra whitespace
        •  Convert to lower case
        •  Remove stopwords
            •  “a”, “the”, “and”, “but”, etc.
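
A minimal sketch of these cleaning steps with tm_map() follows. The two posting strings are invented, and the step order is an assumption: whitespace is stripped last here because removing stopwords leaves gaps behind, which may not match the order used in the project.

```r
library(tm)

# Made-up posting text, for illustration only
postings <- c("We need a Data Scientist, and the role uses R!",
              "The   Marketing Analyst builds reports; but no modeling.")

corpus <- VCorpus(VectorSource(postings))

corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop "a", "the", "and", ...
corpus <- tm_map(corpus, stripWhitespace)                    # collapse runs of spaces

# Pull the cleaned strings back out of the corpus
cleaned <- unname(trimws(sapply(corpus, as.character)))
cleaned
```

Lower-casing before stopword removal matters because the list returned by stopwords("english") is lower case.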

DATA PREPARATION

•  Created the term-document matrix (TDM)
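
As a rough illustration, a term-document matrix can be built from a cleaned corpus with tm's TermDocumentMatrix(). The three strings below are invented, and the transpose at the end is an assumption about how the matrix would be fed to a classifier (one row per posting).

```r
library(tm)

# Made-up, already-cleaned posting text
cleaned <- c("data scientist machine learning python",
             "marketing analyst reports excel",
             "data engineer hadoop spark python")

corpus <- VCorpus(VectorSource(cleaned))
tdm <- TermDocumentMatrix(corpus)   # rows = terms, columns = postings

m <- as.matrix(tdm)
m["data", ]                         # count of "data" in each posting

# One row per posting, one column per term
features <- as.data.frame(t(m))
dim(features)
```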

DATA PREPARATION

•  TDM consists of 959 job postings and 73 terms
    •  375 data scientists and 584 non-data scientists
•  Split TDM into training set and test set
    •  864 job postings in training sample
    •  95 job postings in test sample
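
The split sizes can be reproduced with a simple random draw, sketched below in base R. Random sampling and the seed are assumptions, since the slides do not say how the split was made; the label vector is a placeholder that only carries the class counts from the slide.

```r
set.seed(422)  # arbitrary seed, for reproducibility only

n_postings <- 959
n_test     <- 95

# Placeholder labels with the class counts reported above
label <- factor(c(rep("data scientist", 375),
                  rep("non-data scientist", 584)))

# Draw the test rows at random; everything else is training data
test_idx  <- sample(n_postings, n_test)
train_idx <- setdiff(seq_len(n_postings), test_idx)

length(train_idx)        # 864
length(test_idx)         # 95
table(label[test_idx])   # class mix in the test sample
```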

SUPERVISED LEARNING

•  K-Nearest Neighbors
    •  Find the K value with the highest classification accuracy
    •  K = 8 gives the best result, with an 82.98% accuracy rate
    •  Confusion matrix shows the model correctly predicted 22 out of 35 data scientist job postings
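
The search for K can be sketched with knn() from the class package. The data below are synthetic Poisson counts standing in for term frequencies, and term_a, term_b, and the range of K values are hypothetical choices, so the accuracies will not match the figures on the slide.

```r
library(class)  # knn(); a recommended package shipped with R

set.seed(422)

# Synthetic stand-in for term counts: two term columns, two classes
n <- 200
is_ds <- rep(c(TRUE, FALSE), each = n / 2)
x <- data.frame(
  term_a = rpois(n, ifelse(is_ds, 4, 1)),
  term_b = rpois(n, ifelse(is_ds, 1, 3))
)
y <- factor(ifelse(is_ds, "data scientist", "non-data scientist"))

test_idx <- sample(n, 40)
train_x <- x[-test_idx, ]
test_x  <- x[test_idx, ]
train_y <- y[-test_idx]
test_y  <- y[test_idx]

# Held-out accuracy for each candidate K
ks <- 1:15
acc <- sapply(ks, function(k) {
  pred <- knn(train_x, test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

best_k <- ks[which.max(acc)]
pred <- knn(train_x, test_x, cl = train_y, k = best_k)
table(predicted = pred, actual = test_y)   # confusion matrix
```

Picking K on the same held-out set that reports the final accuracy tends to give an optimistic estimate; cross-validating on the training set (for example with class::knn.cv) is a common alternative. On the slide's numbers, 22 out of 35 means about 62.9% of the data scientist postings in the test set were identified.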

SUPERVISED LEARNING

•  Classification Decision Tree (Gini index)
    •  The classification accuracy rate is 96.8%
    •  Confusion matrix shows the model correctly predicted 30 out of 33 data scientist job postings
    •  Key terms for tree construction:
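
For illustration only, a Gini-based classification tree can be fit with rpart as sketched below. Using rpart is an assumption, since the slides do not name the tree package. The data are synthetic, and the columns machine, learning, and excel are hypothetical terms, not the key terms from the project.

```r
library(rpart)  # recommended package shipped with R

set.seed(422)

# Synthetic term counts for two classes
n <- 300
is_ds <- rep(c(TRUE, FALSE), each = n / 2)
d <- data.frame(
  label    = factor(ifelse(is_ds, "data scientist", "non-data scientist")),
  machine  = rpois(n, ifelse(is_ds, 3, 0.3)),
  learning = rpois(n, ifelse(is_ds, 3, 0.5)),
  excel    = rpois(n, ifelse(is_ds, 0.5, 2))
)

test_idx <- sample(n, 30)
train <- d[-test_idx, ]
test  <- d[test_idx, ]

# Classification tree with Gini impurity as the splitting criterion
fit <- rpart(label ~ ., data = train, method = "class",
             parms = list(split = "gini"))

pred <- predict(fit, newdata = test, type = "class")
accuracy <- mean(pred == test$label)
table(predicted = pred, actual = test$label)   # confusion matrix

# Terms the fitted tree actually split on
split_terms <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
split_terms
```

On the slide's numbers, 30 out of 33 means about 90.9% of the data scientist postings in the test set were identified.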

SUPERVISED LEARNING

•  Bagging
    •  The classification accuracy rate is 96.8%
    •  Confusion matrix shows the same results as the classification tree
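
Bagging can be sketched by hand as a majority vote over trees fit to bootstrap samples. The slides do not say which bagging implementation was used, so the version below, built on rpart with synthetic data and a hypothetical count of 25 trees, is only one way to do it.

```r
library(rpart)

set.seed(422)

# Synthetic term counts for two classes (hypothetical terms)
n <- 300
is_ds <- rep(c(TRUE, FALSE), each = n / 2)
d <- data.frame(
  label    = factor(ifelse(is_ds, "data scientist", "non-data scientist")),
  machine  = rpois(n, ifelse(is_ds, 3, 0.3)),
  learning = rpois(n, ifelse(is_ds, 3, 0.5)),
  excel    = rpois(n, ifelse(is_ds, 0.5, 2))
)

test_idx <- sample(n, 30)
train <- d[-test_idx, ]
test  <- d[test_idx, ]

# Fit one tree per bootstrap sample and collect its test-set predictions
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit <- rpart(label ~ ., data = boot, method = "class",
               parms = list(split = "gini"))
  as.character(predict(fit, newdata = test, type = "class"))
})

# votes has one row per test posting and one column per tree;
# the bagged prediction is the most frequent class in each row
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

accuracy <- mean(bagged == as.character(test$label))
table(predicted = bagged, actual = test$label)   # confusion matrix
```

When a single tree already separates the classes well, averaging over bootstrap trees often changes few predictions, which fits the slide's observation that bagging matched the single tree.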

QUESTIONS?

COMMENTS?

