Email Classification Based on Readability
1. CLASSIFYING EMAILS USING THEIR LANGUAGE
AND READABILITY
Rushdi Shams
Computational Linguistics Group
Department of Computer Science
University of Western Ontario,
London, Canada.
rshams@uwo.ca
Supervisor: Prof. Bob Mercer
2. PRESENTATION OUTLINE
• Introduction
• Existing Email Classification Approaches
• ML-based Email Classification Approaches
• Dataset
• Features and Feature Selection
• Classification Algorithms
• Performance Evaluation
• Results
• Performance Comparison
• Conclusions and Future Work
3. INTRODUCTION
• Email spam is one of the major problems of
today's Internet
– Financial loss to institutions (est. $50B in 2005)
– Misuse of network traffic and storage
– Loss of work productivity, etc.
• In addition, spam constitutes 75–80% of
total email traffic.
[Chart: share of spam vs. ham in total emails]
4. EXISTING EMAIL CLASSIFICATION APPROACHES
[Three approaches compared side by side; column headings lost in extraction]
• Approach 1: More stable, fast, wide coverage, better results
• Approach 2: Less stable, fast, small coverage, good results
• Approach 3: Stable, slow, good coverage, good results
5. ML-BASED EMAIL CLASSIFICATION APPROACHES
[Two ML feature paradigms compared; column headings lost in extraction]
• Paradigm 1: Limited features, language independent, less stability
• Paradigm 2: Unbound features, language dependent, more stability
• (Note on slide: contains both the pros and cons of the previous two)
7. DATASET
Email Datasets

Dataset        Messages   Spam Rate   Raw Texts?   Year of Curation
SpamAssassin   6,046      31.36%      Yes          2002
LingSpam       2,893      16.63%      No           2000
CSDMC2010      4,327      31.85%      Yes          2010

• All data are preprocessed where necessary: headers, subjects,
and attachments are removed, along with non-ASCII characters
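The preprocessing step above can be sketched in a few lines; this is an illustrative simplification (the function name and the blank-line header heuristic are assumptions, and real attachment/MIME handling would use a proper parser such as Python's email package):

```python
import re

def preprocess_email(raw):
    """Strip RFC 822-style headers, then drop non-ASCII characters.

    Illustrative only: headers are assumed to end at the first
    blank line, and attachments are not handled.
    """
    # Keep only the body after the header block.
    parts = raw.split("\n\n", 1)
    body = parts[1] if len(parts) == 2 else parts[0]
    # Remove anything outside the ASCII range.
    body = re.sub(r"[^\x00-\x7F]", "", body)
    return body.strip()

msg = "Subject: hello\nFrom: a@b.c\n\nCafé offer – buy now!"
print(preprocess_email(msg))  # headers and non-ASCII characters dropped
```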
8. FEATURES
Message features (39 in total, in 3 groups):

Traditional spam detection features:
Spam Words, Total HTML Tags, Total Anchor Tags, Total Regular Tags

Language-based features:
Alphanumeric Words, Verbs, Stop Words, TF-ISF, TF-IDF,
Grammar Errors, Spell Errors

Readability-based features:
Fog Index (FI), FKRI, Smog Index, FORCAST, FRES, Simple FI,
Inverse FI, Complex Words, Simple Words, Document Length,
Word Length, TF-IDF (Simple Words), TF-IDF (Complex Words)

• We extracted 39 features and grouped them into 3 groups
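Two of the readability-based features above can be sketched as follows. The naive vowel-group syllable counter is an assumption (feature extractors typically use a dictionary or a library), so exact scores will differ from the authors':

```python
import re

def syllables(word):
    # Naive heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    """Gunning Fog: 0.4 * (words/sentence + 100 * complex_words/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / sentences + 100 * len(complex_words) / len(words))

def fres(text):
    """Flesch Reading Ease: 206.835 - 1.015*(W/S) - 84.6*(syllables/W)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) / sentences - 84.6 * syl / len(words)

sample = "Click here to claim your absolutely incredible guaranteed prize."
print(fog_index(sample), fres(sample))
```

Lower FRES and higher Fog values indicate harder text, which is what lets readability scores separate machine-generated spam from ordinary correspondence.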
9. FEATURE SELECTION
• For each dataset, we applied the Boruta feature
selection algorithm to the extracted features
• The outcome shows that all of these features
are important for classifying emails in these
datasets
– Exception: on the LingSpam dataset, the word-length
feature was labeled as unimportant
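Boruta's core idea is to compare each real feature's importance against permuted "shadow" copies of the features. A simplified single-pass version of that idea (not the iterative Boruta algorithm, and not the authors' code; the synthetic data and thresholding rule are assumptions) can be sketched with scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=4, random_state=0)

# Shadow features: column-wise permutations that destroy any real signal.
shadow = rng.permuted(X, axis=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(np.hstack([X, shadow]), y)

real_imp = rf.feature_importances_[:8]
shadow_max = rf.feature_importances_[8:].max()
# A feature is tentatively important if it beats every shadow feature.
important = [i for i, imp in enumerate(real_imp) if imp > shadow_max]
print(important)
```

Full Boruta repeats this comparison over many permutations and uses a statistical test before confirming or rejecting each feature.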
11. IMPORTANCE OF FEATURES
(SNAPSHOT FOR SPAMASSASSIN)
[Figure: Boruta feature-importance plot for SpamAssassin, with features
grouped as readability-based, traditional spam detection, and
language-based features]
15. CLASSIFICATION ALGORITHM
1. Random Forest
[Jarrah et al. (2012), Hu et al. (2010)]
2. Boosted Random Forest with AdaBoost
[Zhang et al. (2004)]
3. Bagged Random Forest
4. Support Vector Machine (SVM)
[Jarrah et al. (2012), Hu et al. (2010), Ye et al.(2008),
Lai and Tsai (2004), Zhang et al. (2004)]
5. Naïve Bayes (NB)
[Hu et al. (2010), Haidar and Rocha (2008),
Metsis et al. (2008), Lai and Tsai (2004)]
16. PERFORMANCE EVALUATION
• False Positive Rate (FPR): ham misclassification
• False Negative Rate (FNR): spam misclassification
• Accuracy: 1 − overall misclassification
• Precision: spam discovery rate
• Recall: spam hit rate
• F1-Score
• Area Under ROC Curve (AUC)
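All of these measures follow from confusion-matrix counts, with spam as the positive class. A pure-Python sketch (the counts in the example are made up):

```python
def spam_metrics(tp, fp, tn, fn):
    """Compute the slide's measures, treating spam as the positive class."""
    fpr = fp / (fp + tn)            # ham misclassified as spam
    fnr = fn / (fn + tp)            # spam misclassified as ham
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)      # spam discovery rate
    recall = tp / (tp + fn)         # spam hit rate
    f1 = 2 * precision * recall / (precision + recall)
    return dict(FPR=fpr, FNR=fnr, Accuracy=accuracy,
                Precision=precision, Recall=recall, F1=f1)

print(spam_metrics(tp=90, fp=5, tn=195, fn=10))
```

(AUC is computed from the ranking of classifier scores rather than from a single confusion matrix, so it is omitted here.)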
17. PERFORMANCE ON SPAMASSASSIN
FPR FNR Accuracy % Precision Recall F1 AUC
RF 0.035 0.093 94.707 0.923 0.907 0.915 0.979
Boosted RF 0.027 0.079 95.700 0.941 0.921 0.931 0.982
Bagged RF 0.023 0.099 95.353 0.948 0.901 0.924 0.986
SVM 0.052 0.292 87.265 0.861 0.708 0.777 0.828
NB 0.104 0.558 75.373 0.660 0.443 0.529 0.847
• Best FPR: Bagged RF
• Best FNR: Boosted RF
• Best ACC: Boosted RF
• Best Precision: Bagged RF
• Best Recall: Boosted RF
• Best F1: Boosted RF
• Best AUC: Bagged RF
18. PERFORMANCE ON LINGSPAM
FPR FNR Accuracy % Precision Recall F1 AUC
RF 0.018 0.162 95.817 0.907 0.838 0.869 0.978
Boosted RF 0.017 0.162 95.886 0.910 0.838 0.871 0.977
Bagged RF 0.010 0.193 95.956 0.944 0.807 0.868 0.986
SVM 0.014 0.341 93.156 0.907 0.659 0.760 0.822
NB 0.219 0.277 77.186 0.402 0.723 0.515 0.831
• Best FPR: Bagged RF
• Best FNR: Boosted RF/RF
• Best ACC: Bagged RF
• Best Precision: Bagged RF
• Best Recall: Boosted RF/RF
• Best F1: Boosted RF
• Best AUC: Bagged RF
19. PERFORMANCE ON CSDMC
FPR FNR Accuracy % Precision Recall F1 AUC
RF 0.040 0.092 94.338 0.914 0.908 0.911 0.980
Boosted RF 0.030 0.089 95.124 0.934 0.912 0.922 0.980
Bagged RF 0.021 0.107 95.193 0.953 0.893 0.922 0.988
SVM 0.028 0.390 85.718 0.913 0.610 0.730 0.792
NB 0.101 0.396 80.471 0.737 0.604 0.662 0.855
• Best FPR: Bagged RF
• Best FNR: Boosted RF
• Best ACC: Bagged RF
• Best Precision: Bagged RF
• Best Recall: Boosted RF
• Best F1: Boosted/Bagged RF
• Best AUC: Bagged RF
21. PERFORMANCE COMPARISON: LINGSPAM
• Basavaraju and Pravakar (2010), BIRCH and K-NNC:
reported Precision 0.698, Recall 0.637, Specificity 0.828, Accuracy 0.755;
ours: Precision 0.944, Recall 0.838, Specificity 0.990, Accuracy 0.960
(P < 0.05: YES)
• Cormack and Bratko (2006), PPM:
reported AUC 0.960; ours: AUC 0.986 (P < 0.05: YES)
• Yang et al. (2011), Naïve Bayes:
reported Precision 0.943, Recall 0.820, AUC 0.992;
ours: Precision 0.944, Recall 0.838, AUC 0.986 (P < 0.05: YES, for Recall)
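The P < 0.05 column reports statistical significance, though the slides do not name the test used. A common choice for comparing two classifiers on the same test set is McNemar's test, sketched here with made-up disagreement counts:

```python
import math

def mcnemar_p(b, c):
    """McNemar's test with continuity correction.

    b: cases classifier A got right and B got wrong; c: the reverse
    (assumes b + c > 0). Returns an approximate two-sided p-value
    via the chi-square(1) survival function.
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For chi-square with 1 dof, sf(x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

print(mcnemar_p(40, 15))  # large asymmetry in errors -> significant
```

The test only looks at the messages the two classifiers disagree on, which is why it suits paired comparisons on a shared test set.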
22. PERFORMANCE COMPARISON: CSDMC
• Jarrah et al. (2012), RF:
reported Precision 0.958, Recall 0.958, F1 0.958, AUC 0.981;
ours: Precision 0.953, Recall 0.912, F1 0.922, AUC 0.988
(P < 0.05: YES, for Recall and F1)
• Yang et al. (2011), Naïve Bayes:
reported Precision 0.935, Recall 1.000, AUC 0.976;
ours: Precision 0.953, Recall 0.912, AUC 0.988 (P < 0.05: YES)
• Yang et al. (2011), SVM:
reported Precision 0.943, Recall 0.965, AUC 0.995;
ours: Precision 0.953, Recall 0.912, AUC 0.988 (P < 0.05: YES)
23. CONCLUSIONS
• Our spam classification approach performed
– best on LingSpam
• Smallest dataset
• Fewest spam messages
• Hams collected from forums
• Easier to achieve a good FPR and accuracy
– better than many others on SpamAssassin, and
comparably on CSDMC2010
• Similar spam-to-ham ratios
• Random ham and spam collection
24. FUTURE WORK
• Use personalized email data rather than
random collections
– e.g., Enron-Spam
• Use probability scores of terms in email
contents from a Naïve Bayes spam filter as an
additional feature