- The document describes using text mining and machine learning techniques in R to classify job postings from Kaggle.com as data scientist or non-data scientist roles. Term document matrices were created from the text of over 1,000 postings and different supervised learning algorithms like KNN, decision trees, and bagging were used to classify posts in the test dataset with over 80% accuracy.
1. TEXT MINING DATA SCIENCE JOBS IN R
Sung Park, MSPA Candidate
August 20, 2015
Northwestern University
PREDICT 422-DL Section 55
2. SUMMARY
• Introduction
• Resources
• Data Source
• Data Extraction
• Data Preparation
• Supervised Learning
3. INTRODUCTION
• Exploration of web scraping and text mining capabilities in R
• Unstructured data
• Kaggle.com job postings
• Classification using machine learning algorithms
• Data scientists vs. non-data scientists
4. RESOURCES
• Text Analytics Tutorial in R
  • Timothy D’Auria, Boston Decision, LLC
  • https://www.youtube.com/watch?v=j1V2McKbkLo
• Web Scraping Tutorial in R
  • Sharon Machlis, Computerworld
  • https://www.youtube.com/watch?v=TPLMQnGw0Vk
• Data Science in R: A Case Study Approach to Computational Reasoning and Problem Solving
  • Deborah Nolan and Duncan Temple Lang
• Google and Stack Overflow
5. DATA SOURCE
• Kaggle.com/jobs
• August 17, 2015
• 1,025 Job Postings
  • Data Scientist
  • Big Data Engineer
  • Data Science Architect
  • Data Analyst
  • Marketing Analyst
  • Statistician
  • Data Science Director
6. DATA EXTRACTION
• Extracted job links
  • XML Package
  • xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
• Extracted job posting text
  • rvest Package
  • html_text(html_nodes(htmlpage, "div.postcontent"))
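
The two calls on this slide could be exercised roughly as follows. This is a minimal sketch, assuming the XML and rvest packages are installed. The HTML snippets are made-up stand-ins for the Kaggle pages, and `read_html()` is an assumption, since the slide does not show how `doc` or `htmlpage` were created.

```r
# Sketch only: the HTML below is invented and is not the real Kaggle markup.
library(XML)    # htmlParse(), xpathSApply()
library(rvest)  # read_html(), html_nodes(), html_text()

# Hypothetical listing page: two job links and one unrelated link.
listing_html <- paste0(
  "<html><body>",
  "<h3><a href='/jobs/101/data-scientist'>Data Scientist</a></h3>",
  "<h3><a href='/jobs/102/marketing-analyst'>Marketing Analyst</a></h3>",
  "<h3><a href='/about'>About</a></h3>",
  "</body></html>"
)

# Step 1: keep the href of every <h3><a> link that starts with "/jobs".
doc <- htmlParse(listing_html, asText = TRUE)
job_links <- as.character(
  xpathSApply(doc, "//h3/a/@href[starts-with(., '/jobs')]")
)

# Hypothetical posting page with the text inside <div class="postcontent">.
posting_html <- paste0(
  "<html><body>",
  "<div class='postcontent'>We seek a data scientist.</div>",
  "</body></html>"
)

# Step 2: pull the posting text out of the matching <div>.
htmlpage <- read_html(posting_html)
posting_text <- html_text(html_nodes(htmlpage, "div.postcontent"))

print(job_links)
print(posting_text)
```

In a real run, each entry of `job_links` would presumably be appended to the site's base URL and fetched in a loop before the second step is applied.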
7. DATA PREPARATION
• Cleaned the text data
  • tm Package
  • tm_map()
    • Remove punctuation
    • Remove white space
    • Lower-casing
    • Remove stopwords
      • “a”, “the”, “and”, “but”, etc.
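
The cleaning steps listed above might look like this with `tm_map()`. This is a sketch under assumptions: the tm package is installed, the two postings are invented, and the order of the transformations is a choice made here (lower-casing first so that capitalized stopwords are also matched), not necessarily the order used in the study.

```r
# Sketch only: invented postings, illustrative transformation order.
library(tm)

postings <- c(
  "We need a Data Scientist, and the team uses R!",
  "Marketing   analyst wanted: but no coding."
)

corpus <- VCorpus(VectorSource(postings))

# Lower-case first so "The" and "the" are treated alike by later steps.
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation.
corpus <- tm_map(corpus, removePunctuation)
# Remove English stopwords such as "a", "the", "and", "but".
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Collapse the runs of spaces left behind by the removals.
corpus <- tm_map(corpus, stripWhitespace)

# Back to a plain character vector, trimmed at both ends.
cleaned <- trimws(unname(sapply(corpus, as.character)))
print(cleaned)
```

A term document matrix can then be built from the cleaned corpus with `TermDocumentMatrix(corpus)`.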
9. DATA PREPARATION
• TDM consists of 959 job postings and 73 terms
  • 375 data scientists and 584 non-data scientists
• Split TDM into training set and test set
  • 864 job postings in training sample
  • 95 job postings in test sample
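
A split with these sizes can be sketched in base R. The matrix below is synthetic (random counts), it is assumed to be transposed so that each row is a posting, and a simple random split is assumed because the slide does not say how the 95 test postings were chosen.

```r
# Sketch only: random counts stand in for the real term frequencies.
set.seed(422)  # arbitrary seed so the sketch is reproducible

n_docs  <- 959
n_terms <- 73

# One row per posting, one column per term (a transposed TDM).
dtm <- matrix(
  rpois(n_docs * n_terms, lambda = 1),
  nrow = n_docs,
  dimnames = list(NULL, paste0("term", seq_len(n_terms)))
)

# Class labels with the counts reported on the slide.
label <- factor(rep(c("data_scientist", "other"), times = c(375, 584)))

# Hold out 95 randomly chosen postings for testing.
test_idx <- sample(n_docs, 95)

train_x <- dtm[-test_idx, ]
test_x  <- dtm[test_idx, ]
train_y <- label[-test_idx]
test_y  <- label[test_idx]

cat("training postings:", nrow(train_x), "\n")
cat("test postings:", nrow(test_x), "\n")
```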
10. SUPERVISED LEARNING
• K-Nearest Neighbor
  • Find the K value with the highest classification accuracy
  • K=8 shows the best result with 82.98% accuracy rate
  • Confusion matrix shows the model correctly predicted 22 out of 35 data scientist job postings
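
The search for K described above can be sketched with `knn()` from the class package. The data here are synthetic, the column names `term_a`, `term_b`, `term_c` are hypothetical, and the range of K values tried is an assumption, so the numbers it prints are unrelated to the 82.98% reported on the slide.

```r
# Sketch only: synthetic term counts with a built-in class difference.
library(class)
set.seed(1)  # knn() breaks ties at random, so fix the seed

n <- 300
y <- factor(rep(c("data_scientist", "other"), each = n / 2))
x <- cbind(
  term_a = rpois(n, ifelse(y == "data_scientist", 4, 1)),
  term_b = rpois(n, ifelse(y == "data_scientist", 1, 3)),
  term_c = rpois(n, 2)  # pure noise
)

test_idx <- sample(n, 60)
train_x <- x[-test_idx, ]
test_x  <- x[test_idx, ]
train_y <- y[-test_idx]
test_y  <- y[test_idx]

# Try each K and record the share of test postings classified correctly.
ks <- 1:15
acc <- sapply(ks, function(k) {
  pred <- knn(train_x, test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
best_k <- ks[which.max(acc)]

# Confusion matrix for the chosen K.
pred <- knn(train_x, test_x, cl = train_y, k = best_k)
conf <- table(predicted = pred, actual = test_y)

cat("best K:", best_k, "\n")
print(conf)
```

Picking K by test-set accuracy, as done here, tends to give an optimistic estimate. Cross-validating on the training set (for example with `knn.cv()` from the same package) is one alternative.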
11. SUPERVISED LEARNING
• Classification Decision Tree (Gini index)
  • The classification accuracy rate is 96.8%
  • Confusion matrix shows the model correctly predicted 30 out of 33 data scientist job postings
  • Key terms for tree construction:
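
A classification tree with Gini splitting can be fit as sketched below. The rpart package is an assumption (the slides do not name the tree package), and the data and the column names `term_a`, `term_b`, `term_c` are synthetic, so the split terms it reports are not the key terms from the study.

```r
# Sketch only: synthetic term counts, hypothetical column names.
library(rpart)
set.seed(2)

n <- 300
y <- factor(rep(c("data_scientist", "other"), each = n / 2))
dat <- data.frame(
  label  = y,
  term_a = rpois(n, ifelse(y == "data_scientist", 4, 1)),
  term_b = rpois(n, ifelse(y == "data_scientist", 1, 3)),
  term_c = rpois(n, 2)  # pure noise
)

test_idx <- sample(n, 60)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# method = "class" builds a classification tree; split = "gini" selects
# the Gini index as the impurity measure.
fit <- rpart(label ~ ., data = train, method = "class",
             parms = list(split = "gini"))

pred <- predict(fit, newdata = test, type = "class")
conf <- table(predicted = pred, actual = test$label)
accuracy <- mean(pred == test$label)

# Terms the tree actually split on (leaf rows are marked "<leaf>").
split_terms <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")

print(conf)
cat("accuracy:", accuracy, "\n")
cat("terms used in splits:", split_terms, "\n")
```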
12. SUPERVISED LEARNING
• Bagging
  • The classification accuracy rate is 96.8%
  • Confusion matrix shows the same results as the classification tree
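
Bagging fits many trees on bootstrap resamples of the training set and combines them by majority vote. The hand-rolled sketch below illustrates the idea with rpart trees. The data, the column names, and the choice of 25 trees are all assumptions made here, and the study may well have used a packaged implementation instead.

```r
# Sketch only: manual bagging on synthetic data, not the study's code.
library(rpart)
set.seed(3)

n <- 300
y <- factor(rep(c("data_scientist", "other"), each = n / 2))
dat <- data.frame(
  label  = y,
  term_a = rpois(n, ifelse(y == "data_scientist", 4, 1)),
  term_b = rpois(n, ifelse(y == "data_scientist", 1, 3)),
  term_c = rpois(n, 2)  # pure noise
)

test_idx <- sample(n, 60)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# An odd number of trees avoids tied votes in a two-class problem.
n_trees <- 25

# Each column holds one tree's predictions for the test postings.
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]  # bootstrap resample
  fit  <- rpart(label ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = test, type = "class"))
})

# Majority vote across trees for each test posting.
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
bagged <- factor(bagged, levels = levels(test$label))

conf <- table(predicted = bagged, actual = test$label)
accuracy <- mean(bagged == test$label)

print(conf)
cat("accuracy:", accuracy, "\n")
```

When a single tree already separates the classes well, the bagged vote often agrees with it on most test cases, which is consistent with the identical confusion matrix noted on the slide.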
13. QUESTIONS? COMMENTS?