Improving spam detection with automaton

A simple presentation about improving spam detection with automata.

Published in: Internet

  1. Improving SPAM detection, March 1, 2016
  2. Whois
     ● Antonio Costa ("Cooler")
     ● Just another systems analyst
     ● GitHub: CoolerVoid, https://github.com/CoolerVoid
     ● Contact: acosta@conviso.com.br, coolerlair@gmail.com
  3. How it works: anti-spam, the common way
     ● Get e-mails (POP3 / IMAP ...)
     ● Validate
     ● Clean up and tokenize
     ● BoW (bag-of-words), SoW (set-of-words) ...
     ● tf-idf (term frequency-inverse document frequency) ...
     ● Supervised learning
     ● Classification (SVM, KNN, NB, random forest ...)
  4. How it works: anti-spam, the common way
     ● Get e-mails (POP3 / IMAP)
     ● Validate
       - Country-based filtering
       - DNS-based blacklists
       - Enforcing RFC standards
       - SMTP callback verification
  5. DNS-based blacklists
  6. Wake up
  7. How it works: anti-spam, the common way
     ● Get e-mails (POP3 / IMAP ...): the INPUT STRING
     ● Validate
     ● Clean up and tokenize
     ● BoW (bag-of-words), SoW (set-of-words), tf-idf (term frequency-inverse document frequency) ...: create the MATRIX
     ● Supervised learning: USING the MATRIX
     ● Classification (SVM, KNN, NB, random forest ...)
  8. Bag-of-words
     [1] "Luan likes to make hacking. Josimar likes to make hacking too."
     [2] "Luan also likes to web hacking."
     ● Create an array of words (tokenize ...):
       { "Luan", "likes", "to", "make", "hacking", "Josimar", "too", "also", "web" }
       A total of 9 elements.
     ● Count the number of appearances of each word:
       [1] { 1, 2, 2, 2, 2, 1, 1, 0, 0 }
       [2] { 1, 1, 1, 0, 1, 0, 0, 1, 1 }
  9. The common way: look at the following
  10. The common way: why naive Bayes?
     ● Results from my tests:
       Classifier   Accuracy   Performance
       KNN          96%        Slow
       SVM          92%        Medium
       NB           94%        Fast
     ● Naive Bayes is super simple: you're just doing a bunch of counts.
     ● Naive Bayes is an eager learning classifier and it is much faster than KNN.
     ● Nowadays it can be used for prediction in real time.
  11. My way: automata as match rules
     ● Gain accuracy
     ● Gain performance
     ● Because a rule can match SPAM before the classifier is used
     ● www.site.com/www.bank.com/
     ● URL/malware.exe, with a rule like URL/[a-zA-Z]*.exe ...
     ● A rule to detect an IP address in a URL
     ● A deterministic finite automaton does the detection
     ● Use ranking
     ● NB: accuracy 94% +4%, performance fast
  12. My way: automata as match rules
     ● Gain accuracy
     ● Gain performance
     ● Because a rule can match SPAM before the classifier is used
     ● Deterministic finite automata in the rules do the detection
     ● www.site.com/www.bank.com/
     ● URL/malware.exe, with a rule like URL/[a-zA-Z]*.exe ...
     ● A rule to detect an IP address in a URL
     ● A rule to detect phishing
     ● Use ranking
     ● NB: accuracy 94% +4%, performance fast
  13. Why ranking? Automata as match rules
     ● Gain accuracy
     ● NB: accuracy 94% +4%, performance fast
  14. E-mail audit: the project
     ● All source code in C++, 100% open source
     ● IMAP: communication
     ● Blacklists: DNS, bad domains, e-mail addresses ...
     ● Deterministic finite automaton: filters
     ● tf-idf (term frequency-inverse document frequency)
     ● Naive Bayes: classifier
  15. My way: automata as match rules
     ● Gain accuracy
     ● Gain performance
     ● Because a rule can match SPAM before the classifier is used
     ● www.site.com/www.bank.com/
     ● URL/malware.exe, with a rule like URL/[a-zA-Z]*.exe ...
     ● A rule to detect an IP address in a URL
     ● A deterministic finite automaton does the detection
     ● Use ranking
     ● NB: accuracy 94% +4%, performance fast
  16. E-mail audit: the project
     ● In the future: use the GPU for KNN and the automata ...
     ● With a GPU, everything gets fast ...
     ● Next step: 100% accuracy?
     ● https://github.com/CoolerVoid/email_audit
  17. Thanks
     ● https://github.com/CoolerVoid
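
The bag-of-words step from slide 8 can be sketched as follows. This is a minimal illustration in Python, chosen for brevity (the project itself is described as C++). The function names `tokenize`, `build_vocabulary`, and `bag_of_words` are hypothetical and are not taken from email_audit.

```python
import re


def tokenize(text):
    # Keep runs of letters only; case is preserved, as in the slide example.
    return re.findall(r"[A-Za-z]+", text)


def build_vocabulary(documents):
    # Collect each distinct word once, in order of first appearance.
    vocab = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in vocab:
                vocab.append(word)
    return vocab


def bag_of_words(doc, vocab):
    # One count per vocabulary word, in vocabulary order.
    tokens = tokenize(doc)
    return [tokens.count(word) for word in vocab]


docs = [
    "Luan likes to make hacking. Josimar likes to make hacking too.",
    "Luan also likes to web hacking.",
]
vocab = build_vocabulary(docs)
matrix = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row of `matrix` is the count vector for one message, which is the kind of matrix the slides pass on to the supervised learner.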

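The rule pre-filter with ranking from slides 11 to 13 might look roughly like the sketch below. It rests on several assumptions: Python's `re` module is a backtracking regex engine that only stands in for the deterministic finite automaton, and the rule patterns, weights, threshold, and function names are all hypothetical, loosely modelled on the slide examples rather than taken from email_audit.

```python
import re

# Hypothetical rules: (name, pattern, weight). Not the project's real rule set.
RULES = [
    # A URL whose path is an alphabetic name ending in .exe
    ("exe_download", re.compile(r"https?://[^\s/]+/[A-Za-z]*\.exe\b"), 3),
    # A URL whose host is a dotted-quad IP address
    ("ip_in_url", re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}(?:[:/]|\s|$)"), 2),
    # A second domain embedded in the path, as in www.site.com/www.bank.com/
    ("domain_in_path",
     re.compile(r"https?://[^\s/]+/www\.[A-Za-z0-9-]+\.[A-Za-z]{2,}"), 2),
]

# Hypothetical cut-off: at or above this score the message is called spam.
THRESHOLD = 3


def rank(message):
    # Sum the weights of every rule that matches and report which ones hit.
    hits = [(name, weight) for name, pattern, weight in RULES
            if pattern.search(message)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]


def prefilter(message, classifier):
    # Cheap rules run first; the classifier only sees undecided messages.
    score, _ = rank(message)
    if score >= THRESHOLD:
        return "spam"
    return classifier(message)


if __name__ == "__main__":
    always_ham = lambda message: "ham"  # placeholder for a trained classifier
    print(prefilter("Download http://example.com/malware.exe now", always_ham))
    print(prefilter("visit http://192.168.0.1/login", always_ham))
```

Summing weights rather than rejecting on the first hit is one way to read "use ranking": a single weak signal defers to the classifier, while several together decide the message without it.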