CLASSIFYING EMAILS USING THEIR LANGUAGE
AND READABILITY
Rushdi Shams
Computational Linguistics Group
Department of Computer Science
University of Western Ontario,
London, Canada.
rshams@uwo.ca
Supervisor: Prof. Bob Mercer
PRESENTATION OUTLINE
• Text Denoising
• Keyphrase: What and Why
• Supervised automatic keyphrase indexing
• How they work
• Examples
• Effect of document size
• Objective
• Methods
• Datasets
• Training and Testing
• Performance Measures
• Denoising Threshold
• Results
• Conclusions and Future Work
INTRODUCTION
• Email spam is one of the major problems of
today’s Internet
– Financial loss of institutions ($50B in 2005)
– Misuse of network traffic/storage
– Loss of work productivity, etc.
• In addition, spam emails constitute 75-80% of
total emails.
[Figure: total emails, split into spam and ham]
EXISTING EMAIL CLASSIFICATION APPROACHES
• More stable
• Fast
• Wide coverage
• Better results
• Less stable
• Fast
• Small coverage
• Good results
• Stable
• Slow
• Good coverage
• Good results
ML-BASED EMAIL CLASSIFICATION APPROACHES
• Limited features
• Language independent
• Less stability
• Unbound features
• Language dependent
• More stability
Contains both pros
and cons of the
previous two
PROPOSED APPROACH
[Figure: pipeline connecting the email dataset, the features of message m, the classification algorithm, 10-fold cross-validation (CV) and performance]
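The 10-fold CV step in the pipeline can be sketched as follows. This is a minimal illustration, not the author's code: the slides do not say whether the folds were stratified, so the stratification here, the function names and the toy label list are all assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each message index to one of k folds, keeping class ratios similar.

    Stratification is an assumption; the slides only say "10 fold CV".
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Hypothetical toy data: 30 spam and 70 ham messages.
labels = ["spam"] * 30 + ["ham"] * 70
splits = list(cross_validation_splits(labels))
```

Each message is used for testing exactly once and for training in the other nine rounds; the reported performance would then be aggregated over the ten test folds.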
DATASET
Dataset        Messages   Spam Rate   Raw Texts?   Year of Curation
SpamAssassin   6,046      31.36%      Yes          2002
LingSpam       2,893      16.63%      No           2000
CSDMC2010      4,327      31.85%      Yes          2010
• All data are preprocessed where necessary, e.g. by removing headers,
subjects and attachments, and by removing non-ASCII characters
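The preprocessing described above could look roughly like this for a raw message. It is a sketch using Python's standard `email` package, not the author's tooling; the function name `extract_body_text` and the sample message are hypothetical.

```python
from email import message_from_bytes


def extract_body_text(raw_message: bytes) -> str:
    """Return the text body of a raw email without headers, attachments or non-ASCII characters."""
    msg = message_from_bytes(raw_message)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_disposition() == "attachment":
            continue  # drop attachments
        if part.get_content_maintype() != "text":
            continue  # keep only text parts (plain or HTML)
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name in the message
            parts.append(payload.decode("utf-8", errors="replace"))
    body = "\n".join(parts)
    # Headers (including the subject) are never copied; now drop non-ASCII characters.
    return body.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical sample message.
raw = (
    b"From: sender@example.com\r\n"
    b"Subject: Special offer\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"\r\n"
    b"Caf\xc3\xa9 offer: click here\r\n"
)
cleaned = extract_body_text(raw)
```

HTML markup is deliberately left in place here, since the feature list counts HTML and anchor tags.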
FEATURES
Groups                                 Features
Traditional spam detection features    Spam Words, Total HTML Tags, Total Anchor Tags, Total Regular Tags
Language-based features                Alphanumeric Words, Verbs, Stop Words, TF-ISF, TF-IDF, Grammar and Spell Errors, Grammar Errors, Spell Errors
Readability-based features             Fog Index (FI), FKRI, Smog Index, FORCAST, FRES, Simple FI, Inverse FI, Complex Words, Simple Words, Document Length, Word Length, TF-IDF (Simple Words), TF-IDF (Complex Words)
• We extracted 39 features and grouped them into 3 groups
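Several of the readability features follow standard published formulas. The sketch below computes four of them, taking FKRI to mean the Flesch-Kincaid grade level and FRES the Flesch Reading Ease Score (both readings of the abbreviations are assumptions). The syllable counter is a rough vowel-group heuristic, the threshold of three syllables for a complex word follows the usual Fog Index convention, and the author's own Simple FI and Inverse FI variants are not reproduced because the slides do not define them.

```python
import math
import re


def count_syllables(word):
    """Rough vowel-group heuristic (an assumption, not a dictionary lookup)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing silent "e" as non-syllabic, except in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability_scores(text):
    """Compute a few standard readability indices for a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = [count_syllables(w) for w in words]
    n_sentences, n_words = len(sentences), len(words)
    complex_words = sum(1 for s in syllables if s >= 3)
    words_per_sentence = n_words / n_sentences
    syllables_per_word = sum(syllables) / n_words

    return {
        # Gunning Fog Index
        "fog_index": 0.4 * (words_per_sentence + 100.0 * complex_words / n_words),
        # Flesch Reading Ease Score
        "fres": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        # Flesch-Kincaid grade level
        "fkri": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        # SMOG index (polysyllable count normalised to 30 sentences)
        "smog_index": 1.0430 * math.sqrt(complex_words * 30.0 / n_sentences) + 3.1291,
        "complex_words": complex_words,
        "simple_words": n_words - complex_words,
    }


scores = readability_scores("The cat sat. The dog ran.")
```

FORCAST is left out because it is defined on a 150-word sample (20 minus one tenth of the number of one-syllable words in that sample), which does not fit short toy inputs.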
FEATURE SELECTION
• For each dataset, we applied the Boruta feature
selection algorithm to the extracted features
• The outcome shows that all of these features
are important for classifying emails from the
datasets.
– Exception: on the LingSpam dataset, the word length
feature was labeled as unimportant.
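The core idea behind Boruta is to compare each real feature against shuffled "shadow" copies that cannot carry any signal. The sketch below illustrates only that idea and is heavily simplified: real Boruta scores features with random forest importance and applies a statistical test to the hit counts, whereas this stand-in uses absolute Pearson correlation with the label and a fixed number of rounds. All names and the toy data are hypothetical.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; a simple stand-in for an importance score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_feature_screen(features, labels, rounds=20, seed=0):
    """Count, per feature, how often it beats the best shuffled shadow feature."""
    rng = random.Random(seed)
    importance = {name: abs_correlation(col, labels) for name, col in features.items()}
    hits = {name: 0 for name in features}
    for _ in range(rounds):
        shadow_max = 0.0
        for col in features.values():
            shadow = col[:]      # copy the column ...
            rng.shuffle(shadow)  # ... and break its link to the labels
            shadow_max = max(shadow_max, abs_correlation(shadow, labels))
        for name in features:
            if importance[name] > shadow_max:
                hits[name] += 1
    return hits


# Hypothetical toy data: one informative feature and one unrelated feature.
labels = [1] * 20 + [0] * 20
features = {
    "spam_words": [float(label) for label in labels],
    "noise": [float(i % 7) for i in range(40)],
}
hits = shadow_feature_screen(features, labels)
```

A feature that wins in nearly every round would be treated as important, one that rarely wins as unimportant; the cut-off between the two is where Boruta's statistical test comes in.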
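IMPORTANCE OF FEATURES (SNAPSHOT FOR SPAMASSASSIN)
[Figure: feature importance, with features marked as readability-based, traditional spam detection, and language-based]
IMPORTANCE OF FEATURES (SPAMASSASSIN)
[Figure: feature importance for SpamAssassin]
IMPORTANCE OF FEATURES (LINGSPAM)
[Figure: feature importance for LingSpam]
IMPORTANCE OF FEATURES (CSDMC)
[Figure: feature importance for CSDMC]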
CLASSIFICATION ALGORITHM
1. Random Forest
[Jarrah et al. (2012), Hu et al. (2010)]
2. Boosted Random Forest with AdaBoost
[Zhang et al. (2004)]
3. Bagged Random Forest
4. Support Vector Machine (SVM)
[Jarrah et al. (2012), Hu et al. (2010), Ye et al. (2008),
Lai and Tsai (2004), Zhang et al. (2004)]
5. Naïve Bayes (NB)
[Hu et al. (2010), Haidar and Rocha (2008),
Metsis et al. (2008), Lai and Tsai (2004)]
PERFORMANCE EVALUATION
[Figure: false positives (FP) and false negatives (FN)]
False Positive Rate or Ham Misclassification
False Negative Rate or Spam Misclassification
Accuracy or (1 − Overall Misclassification)
Precision or Spam Discovery Rate
Recall or Spam Hit Rate
F1-Score
Area Under ROC Curve (AUC)
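The measures listed above can be computed from the four confusion-matrix counts, with spam as the positive class. The sketch below is illustrative: the function names and the example counts and scores are assumptions, and AUC is computed with the pairwise ranking definition (the probability that a random spam message scores higher than a random ham message).

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics from confusion-matrix counts, treating spam as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "fpr": fp / (fp + tn),        # ham misclassified as spam
        "fnr": fn / (fn + tp),        # spam misclassified as ham
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,       # spam discovery rate
        "recall": recall,             # spam hit rate
        "f1": 2 * precision * recall / (precision + recall),
    }


def auc_from_scores(spam_scores, ham_scores):
    """AUC as the fraction of (spam, ham) pairs ranked correctly; ties count half."""
    wins = 0.0
    for s in spam_scores:
        for h in ham_scores:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    return wins / (len(spam_scores) * len(ham_scores))


# Hypothetical counts: 100 spam and 100 ham messages.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
# Hypothetical classifier scores (higher means "more likely spam").
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3])
```

Note that recall equals 1 − FNR, so the two columns in the result tables carry the same information.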
PERFORMANCE ON SPAMASSASSIN
Classifier   FPR     FNR     Accuracy %   Precision   Recall   F1      AUC
RF           0.035   0.093   94.707       0.923       0.907    0.915   0.979
Boosted RF   0.027   0.079   95.700       0.941       0.921    0.931   0.982
Bagged RF    0.023   0.099   95.353       0.948       0.901    0.924   0.986
SVM          0.052   0.292   87.265       0.861       0.708    0.777   0.828
NB           0.104   0.558   75.373       0.660       0.443    0.529   0.847
• Best FPR: Bagged RF
• Best FNR: Boosted RF
• Best ACC: Boosted RF
• Best Precision: Bagged RF
• Best Recall: Boosted RF
• Best F1: Boosted RF
• Best AUC: Bagged RF
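The columns of the SpamAssassin table are related: F1 follows from precision and recall, and accuracy follows from FPR, FNR and the spam rate. The check below recomputes both from the rounded table values. It assumes the spam rate during evaluation equals the 31.36% listed for SpamAssassin in the dataset table, so small gaps from rounding are expected.

```python
SPAM_RATE = 0.3136  # SpamAssassin spam rate from the dataset table (assumed unchanged)

# (FPR, FNR, accuracy %, precision, recall, F1) as reported for SpamAssassin.
reported = {
    "RF": (0.035, 0.093, 94.707, 0.923, 0.907, 0.915),
    "Boosted RF": (0.027, 0.079, 95.700, 0.941, 0.921, 0.931),
    "Bagged RF": (0.023, 0.099, 95.353, 0.948, 0.901, 0.924),
    "SVM": (0.052, 0.292, 87.265, 0.861, 0.708, 0.777),
    "NB": (0.104, 0.558, 75.373, 0.660, 0.443, 0.529),
}


def accuracy_from_rates(fpr, fnr, spam_rate):
    """Accuracy in percent: 1 minus the class-weighted misclassification rates."""
    return 100.0 * (1.0 - (fpr * (1.0 - spam_rate) + fnr * spam_rate))


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


accuracy_gaps = {}
f1_gaps = {}
for name, (fpr, fnr, acc, prec, rec, f1) in reported.items():
    accuracy_gaps[name] = abs(accuracy_from_rates(fpr, fnr, SPAM_RATE) - acc)
    f1_gaps[name] = abs(f1_from(prec, rec) - f1)

max_accuracy_gap = max(accuracy_gaps.values())
max_f1_gap = max(f1_gaps.values())
```

For every classifier the recomputed accuracy stays within 0.1 percentage points of the reported value and the recomputed F1 within 0.002, which is consistent with rounding of the inputs.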
PERFORMANCE ON LINGSPAM
Classifier   FPR     FNR     Accuracy %   Precision   Recall   F1      AUC
RF           0.018   0.162   95.817       0.907       0.838    0.869   0.978
Boosted RF   0.017   0.162   95.886       0.910       0.838    0.871   0.977
Bagged RF    0.010   0.193   95.956       0.944       0.807    0.868   0.986
SVM          0.014   0.341   93.156       0.907       0.659    0.760   0.822
NB           0.219   0.277   77.186       0.402       0.723    0.515   0.831
• Best FPR: Bagged RF
• Best FNR: Boosted RF/RF
• Best ACC: Bagged RF
• Best Precision: Bagged RF
• Best Recall: Boosted RF/RF
• Best F1: Boosted RF
• Best AUC: Bagged RF
PERFORMANCE ON CSDMC
Classifier   FPR     FNR     Accuracy %   Precision   Recall   F1      AUC
RF           0.040   0.092   94.338       0.914       0.908    0.911   0.980
Boosted RF   0.030   0.089   95.124       0.934       0.912    0.922   0.980
Bagged RF    0.021   0.107   95.193       0.953       0.893    0.922   0.988
SVM          0.028   0.390   85.718       0.913       0.610    0.730   0.792
NB           0.101   0.396   80.471       0.737       0.604    0.662   0.855
• Best FPR: Bagged RF
• Best FNR: Boosted RF
• Best ACC: Bagged RF
• Best Precision: Bagged RF
• Best Recall: Boosted RF
• Best F1: Boosted/Bagged RF
• Best AUC: Bagged RF
PERFORMANCE COMPARISON: SPAMASSASSIN
Author                          Algorithm     Measure                     Reported   Our approach   P < 0.05?
Ma et al. (2010)                Neural Nets   Precision                   0.920      0.948          YES
                                              Overall Misclassification   0.080      0.043
Srisanyalak and Sornil (2007)   Neural Nets   Accuracy                    0.924      0.957          YES
Bratko et al. (2006)            Statistical   FPR                         0.001      0.023          YES
                                              FNR                         0.012      0.079
                                              AUC                         0.982      0.986
PERFORMANCE COMPARISON: LINGSPAM
Author                           Algorithm         Measure       Reported   Our approach   P < 0.05?
Basavaraju and Pravakar (2010)   BIRCH and K-NNC   Precision     0.698      0.944          YES
                                                   Recall        0.637      0.838
                                                   Specificity   0.828      0.990
                                                   Accuracy      0.755      0.960
Cormack and Bratko (2006)        PPM               AUC           0.960      0.986          YES
Yang et al. (2011)               Naïve Bayes       Precision     0.943      0.944          YES (for Recall)
                                                   Recall        0.820      0.838
                                                   AUC           0.992      0.986
PERFORMANCE COMPARISON: CSDMC
Author                 Algorithm     Measure     Reported   Our approach   P < 0.05?
Jarrah et al. (2012)   RF            Precision   0.958      0.953          YES (for Recall and F1)
                                     Recall      0.958      0.912
                                     F1          0.958      0.922
                                     AUC         0.981      0.988
Yang et al. (2011)     Naïve Bayes   Precision   0.935      0.953          YES
                                     Recall      1.000      0.912
                                     AUC         0.976      0.988
Yang et al. (2011)     SVM           Precision   0.943      0.953          YES
                                     Recall      0.965      0.912
                                     AUC         0.995      0.988
CONCLUSIONS
• Our spam classification approach performed
– best on LingSpam
• Smallest dataset
• Fewest spam messages
• Ham messages are collected from forums
• Easy to achieve better FPR and accuracy
– better than many others on SpamAssassin and
comparably on CSDMC2010
• Similar spam:ham ratio
• Random ham and spam collection
FUTURE WORK
• Using personalized email data rather than
random collection
– Enron-Spam
• Using probability scores of terms in email
contents from a Naïve Bayes spam filter as an
additional feature

Email Classification based on their readability

  • 1. CLASSIFYING EMAILS USING THEIR LANGUAGE AND READABILITY Rushdi Shams Computational Linguistics Group Department of Computer Science University of Western Ontario, London, Canada. rshams@uwo.ca Supervisor: Prof. Bob Mercer
  • 2. PRESENTATION OUTLINE
      • Text Denoising
      • Keyphrase: What and Why
      • Supervised automatic keyphrase indexing
      • How they work
      • Examples
      • Effect of document size
      • Objective
      • Methods
      • Datasets
      • Training and Testing
      • Performance Measures
      • Denoising Threshold
      • Results
      • Conclusions and Future Work
  • 3. INTRODUCTION
      • Email spam is one of the major problems of today's Internet
          – Financial loss to institutions ($50B in 2005)
          – Misuse of network traffic and storage
          – Loss of work productivity, etc.
      • In addition, spam emails constitute 75-80% of total emails.
      [Chart: total emails split into spam and ham]
  • 4. EXISTING EMAIL CLASSIFICATION APPROACHES
      • More stable • Fast • Wide coverage • Better results
      • Less stable • Fast • Small coverage • Good results
      • Stable • Slow • Good coverage • Good results
  • 5. ML-BASED EMAIL CLASSIFICATION APPROACHES
      • Limited features • Language independent • Less stability
      • Unbound features • Language dependent • More stability
      • Contains both pros and cons of the previous two
  • 6. PROPOSED APPROACH
      [Diagram labels: Email Dataset; Message m features; Classification Algorithm; 10-fold CV; Performance]
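The 10-fold cross-validation step named on this slide can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the author's implementation: the function names `k_fold_indices`, `cross_validate`, and `train_majority` are hypothetical, the single shuffle with a fixed seed is an assumption, and the slides do not say whether the folds were stratified.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Split range(n) into k (train, test) index pairs after one shuffle.

    Assumes n >= k so that no test fold is empty.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for m, fold in enumerate(folds) if m != i for j in fold], folds[i])
        for i in range(k)
    ]

def cross_validate(features, labels, train_fn, k=10):
    """Return mean test accuracy over k folds.

    train_fn(X, y) must return a function predict(x) -> label.
    """
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        predict = train_fn([features[i] for i in train],
                           [labels[i] for i in train])
        correct = sum(predict(features[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

def train_majority(X, y):
    """Hypothetical baseline: always predict the most common training label."""
    majority = max(set(y), key=y.count)
    return lambda x: majority
```

Any classifier can be plugged in through `train_fn`; the majority baseline is only there to make the sketch runnable end to end.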
  • 7. DATASET
      Dataset        Messages   Spam Rate   Raw Texts?   Year of Curation
      SpamAssassin   6,046      31.36%      Yes          2002
      LingSpam       2,893      16.63%      No           2000
      CSDMC2010      4,327      31.85%      Yes          2010
      • All data were preprocessed where necessary, for example by removing headers, subjects, and attachments, and by removing non-ASCII characters
  • 8. FEATURES
      Group: Traditional Spam detection Features
          Spam Words; Total HTML Tags; Total Anchor Tags; Total Regular Tags
      Group: Language based Features
          Alphanumeric Words; Verbs; Stop Words; TF-ISF; TF-IDF; Grammar and Spell Errors; Grammar Errors; Spell Errors
      Group: Readability based Features
          Fog Index (FI); FKRI; Smog Index; FORCAST; FRES; Simple FI; Inverse FI; Complex Words; Simple Words; Document Length; Word Length; TF-IDF (Simple Words); TF-IDF (Complex Words)
      • We extracted 39 features and grouped them into 3 groups
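Three of the readability features listed on this slide (Fog Index, FRES, FKRI) can be sketched with their commonly published formulas. The slides do not give the exact definitions or the syllable counter that were used, so the details below are assumptions: the vowel-group syllable heuristic, the three-syllable threshold for "complex" words, and the names `count_syllables` and `readability` are all illustrative. Simple FI and Inverse FI are not defined in the slides and are left out.

```python
import re

def count_syllables(word):
    """Rough heuristic (assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text):
    """Compute a few readability scores for an email body."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")

    syllables = [count_syllables(w) for w in words]
    # Assumption: a "complex" word has three or more syllables.
    complex_words = sum(1 for s in syllables if s >= 3)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)

    return {
        # Gunning Fog Index
        "fog_index": 0.4 * (words_per_sentence
                            + 100 * complex_words / len(words)),
        # Flesch Reading Ease Score
        "fres": 206.835 - 1.015 * words_per_sentence
                - 84.6 * syllables_per_word,
        # Flesch-Kincaid grade level
        "fkri": 0.39 * words_per_sentence
                + 11.8 * syllables_per_word - 15.59,
        "complex_words": complex_words,
        "simple_words": len(words) - complex_words,
    }
```

A dictionary-based syllable counter would be more accurate than the vowel-group heuristic; the heuristic is used here only to keep the sketch self-contained.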
  • 9. FEATURE SELECTION
      • For each dataset, we applied the Boruta feature selection algorithm to the extracted features
      • The outcome shows that all of these features are important for classifying emails from the datasets.
  • 10. FEATURE SELECTION
      • For each dataset, we applied the Boruta feature selection algorithm to the extracted features
      • The outcome shows that all of these features are important for classifying emails from the datasets.
          – Exception: on the LingSpam dataset, the word length feature was labeled as unimportant.
  • 11. IMPORTANCE OF FEATURES (SNAPSHOT FOR SPAMASSASSIN)
      [Figure legend: Readability based Features; Traditional Spam detection Features; Language based Features]
  • 12. IMPORTANCE OF FEATURES (SPAMASSASSIN)
  • 13. IMPORTANCE OF FEATURES (LINGSPAM)
  • 15. CLASSIFICATION ALGORITHM
      1. Random Forest [Jarrah et al. (2012), Hu et al. (2010)]
      2. Boosted Random Forest with AdaBoost [Zhang et al. (2004)]
      3. Bagged Random Forest
      4. Support Vector Machine (SVM) [Jarrah et al. (2012), Hu et al. (2010), Ye et al. (2008), Lai and Tsai (2004), Zhang et al. (2004)]
      5. Naïve Bayes (NB) [Hu et al. (2010), Haidar and Rocha (2008), Metsis et al. (2008), Lai and Tsai (2004)]
  • 16. PERFORMANCE EVALUATION
      • False Positive Rate (FPR), or Ham Misclassification
      • False Negative Rate (FNR), or Spam Misclassification
      • Accuracy, or (1 - Overall Misclassification)
      • Precision, or Spam Discovery Rate
      • Recall, or Spam Hit Rate
      • F1-Score
      • Area Under ROC Curve (AUC)
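The measures on this slide follow from a confusion matrix in which spam is the positive class. The sketch below uses the standard definitions and is not taken from the author's code: the function names `spam_metrics` and `auc` are hypothetical, and the AUC is computed with the pairwise ranking formulation, which may differ from the tool used for the reported results.

```python
def spam_metrics(tp, fp, tn, fn):
    """Confusion-matrix measures with spam as the positive class.

    tp: spam classified as spam      fp: ham classified as spam
    tn: ham classified as ham        fn: spam classified as ham
    Assumes every denominator below is non-zero.
    """
    precision = tp / (tp + fp)   # spam discovery rate
    recall = tp / (tp + fn)      # spam hit rate
    return {
        "fpr": fp / (fp + tn),   # ham misclassification
        "fnr": fn / (fn + tp),   # spam misclassification
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores: classifier scores, higher means "more likely spam"
    labels: 1 for spam, 0 for ham
    Returns the fraction of (spam, ham) pairs ranked correctly,
    counting ties as half. Quadratic time, fine for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With hypothetical counts such as `spam_metrics(tp=90, fp=10, tn=190, fn=10)`, FNR is 10/100 and FPR is 10/200, which shows why the two rates can differ sharply when ham outnumbers spam.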
  • 17. PERFORMANCE ON SPAMASSASSIN
                   FPR     FNR     Accuracy %   Precision   Recall   F1      AUC
      RF           0.035   0.093   94.707       0.923       0.907    0.915   0.979
      Boosted RF   0.027   0.079   95.700       0.941       0.921    0.931   0.982
      Bagged RF    0.023   0.099   95.353       0.948       0.901    0.924   0.986
      SVM          0.052   0.292   87.265       0.861       0.708    0.777   0.828
      NB           0.104   0.558   75.373       0.660       0.443    0.529   0.847
      • Best FPR: Bagged RF
      • Best FNR: Boosted RF
      • Best ACC: Boosted RF
      • Best Precision: Bagged RF
      • Best Recall: Boosted RF
      • Best F1: Boosted RF
      • Best AUC: Bagged RF
  • 18. PERFORMANCE ON LINGSPAM
                   FPR     FNR     Accuracy %   Precision   Recall   F1      AUC
      RF           0.018   0.162   95.817       0.907       0.838    0.869   0.978
      Boosted RF   0.017   0.162   95.886       0.910       0.838    0.871   0.977
      Bagged RF    0.010   0.193   95.956       0.944       0.807    0.868   0.986
      SVM          0.014   0.341   93.156       0.907       0.659    0.760   0.822
      NB           0.219   0.277   77.186       0.402       0.723    0.515   0.831
      • Best FPR: Bagged RF
      • Best FNR: Boosted RF/RF
      • Best ACC: Bagged RF
      • Best Precision: Bagged RF
      • Best Recall: Boosted RF/RF
      • Best F1: Boosted RF
      • Best AUC: Bagged RF
  • 19. PERFORMANCE ON CSDMC
                   FPR     FNR     Accuracy %   Precision   Recall   F1      AUC
      RF           0.040   0.092   94.338       0.914       0.908    0.911   0.980
      Boosted RF   0.030   0.089   95.124       0.934       0.912    0.922   0.980
      Bagged RF    0.021   0.107   95.193       0.953       0.893    0.922   0.988
      SVM          0.028   0.390   85.718       0.913       0.610    0.730   0.792
      NB           0.101   0.396   80.471       0.737       0.604    0.662   0.855
      • Best FPR: Bagged RF
      • Best FNR: Boosted RF
      • Best ACC: Bagged RF
      • Best Precision: Bagged RF
      • Best Recall: Boosted RF
      • Best F1: Boosted/Bagged RF
      • Best AUC: Bagged RF
  • 20. PERFORMANCE COMPARISON: SPAMASSASSIN
      Author                          Algorithm     Measure                     Reported   Our approach   P < 0.05?
      Ma et al. (2010)                Neural Nets   Precision                   0.920      0.948          YES
                                                    Overall Misclassification   0.080      0.043
      Srisanyalak and Sornil (2007)   Neural Nets   Accuracy                    0.924      0.957          YES
      Bratko et al. (2006)            Statistical   FPR                         0.001      0.023          YES
                                                    FNR                         0.012      0.079
                                                    AUC                         0.982      0.986
  • 21. PERFORMANCE COMPARISON: LINGSPAM
      Author                           Algorithm         Measure       Reported   Our approach   P < 0.05?
      Basavaraju and Pravakar (2010)   BIRCH and K-NNC   Precision     0.698      0.944          YES
                                                         Recall        0.637      0.838
                                                         Specificity   0.828      0.990
                                                         Accuracy      0.755      0.960
      Cormack and Bratko (2006)        PPM               AUC           0.960      0.986          YES
      Yang et al. (2011)               Naïve Bayes       Precision     0.943      0.944          YES (for Recall)
                                                         Recall        0.820      0.838
                                                         AUC           0.992      0.986
  • 22. PERFORMANCE COMPARISON: CSDMC
      Author                 Algorithm     Measure     Reported   Our approach   P < 0.05?
      Jarrah et al. (2012)   RF            Precision   0.958      0.953          YES (for Recall and F1)
                                           Recall      0.958      0.912
                                           F1          0.958      0.922
                                           AUC         0.981      0.988
      Yang et al. (2011)     Naïve Bayes   Precision   0.935      0.953          YES
                                           Recall      1.000      0.912
                                           AUC         0.976      0.988
      Yang et al. (2011)     SVM           Precision   0.943      0.953          YES
                                           Recall      0.965      0.912
                                           AUC         0.995      0.988
  • 23. CONCLUSIONS
      • Our spam classification approach performed
          – Best on LingSpam
              • Smallest dataset
              • Fewest spam messages
              • Ham messages are collected from forums
              • Easy to achieve better FPR and Accuracy
          – Better than many others on SpamAssassin, and comparably on CSDMC2010
              • Similar spam:ham ratio
              • Random ham and spam collection
  • 24. CONCLUSIONS
      • Using personalized email data rather than a random collection
          – Enron-Spam
      • Using probability scores of terms in email contents from a Naïve Bayes spam filter as an additional feature