10. Best Guess for Gender
[Figure: weight (in pounds) against height (in inches). Regions are labeled "100% male, 0% female" and "0% male, 100% female"; the "Best Guess" line marks "50% male, 50% female".]
11. One Dimension Only
[Figure: density (0.00 to 0.15) against height (in inches, 55 to 75).]
12. Better Features
[Figure: weight (in pounds, 100 to 200) against buttock circumference (800 to 1200).]
Buttock Circumference: “The circumference of the body measured at the level of the maximum posterior protuberance of the buttocks.”
13. Best Guess for Revised Features
[Figure: weight (in pounds) against buttock circumference, with the "Best Guess" line.]
14. Further Improving the Separation
- Signal to Noise: features with very different distributions per class
- Correlation: features with low correlation
- Dimensionality: consider more features at the same time
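The first two criteria above can be made concrete with a small sketch. The slides do not define either measure, so the formulas here are assumptions: `signal_to_noise` is the gap between class means divided by the summed class standard deviations, and `pearson` is the ordinary Pearson correlation coefficient. The sample values are made up purely for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def signal_to_noise(class_a, class_b):
    """Gap between the class means relative to the summed class spreads.

    Assumed definition (not from the slides): larger values mean the
    feature's two per-class distributions overlap less.
    """
    spread = pstdev(class_a) + pstdev(class_b)
    if spread == 0:
        return float("inf")
    return abs(mean(class_a) - mean(class_b)) / spread


def pearson(xs, ys):
    """Pearson correlation between two features measured on the same samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up feature values for two classes, for illustration only.
well_separated = signal_to_noise([1, 2, 3], [11, 12, 13])
overlapping = signal_to_noise([1, 2, 3], [2, 3, 4])
redundant = pearson([1, 2, 3], [2, 4, 6])  # perfectly correlated features
```

Under these assumed definitions, a feature with a high `signal_to_noise` value is a better candidate, and two features whose `pearson` value is close to 1 or -1 add little when used together.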
19. Some English Words
militate, caterwaul, deracinate, arrant, concinnity, imprecation, vertiginous, profuse
20. Some English Explanations
- militate: to have force or influence
- caterwaul: to make a harsh cry or screech
- deracinate: to uproot
- arrant: outright; thoroughgoing
- concinnity: elegance, used chiefly of literary style
- imprecation: a curse
- vertiginous: causing dizziness; also, giddy; dizzy
- profuse: plentiful; copious
Source: http://dictionary.reference.com/
23. Markov Chains
Analysis of recent domain registrations
Using Second Order Markov Chains to detect potentially malicious domain names
[Figure: chains of character pairs with transition probabilities for three names.]
- bnkofpunjab (bn, nk, ko, of, fp, pu, un, nj, ja, ab) is not legitimate
- ferrylines.com (fe, er, rr, ry, yl, li, in, ne, es) is legitimate
- ebay.com (eb, ba, ay) is not determinable
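A minimal sketch of how such a character-level Markov model could score a name follows. The slides do not give the training data, smoothing, or decision thresholds, so everything below is an assumption: the training list is a toy stand-in, add-one smoothing is used, and `alphabet_size=38` assumes letters, digits, hyphen and dot.

```python
from collections import defaultdict
from math import log


def train(names, order=2):
    """Count how often each character follows each `order`-character context."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        for i in range(order, len(name)):
            counts[name[i - order:i]][name[i]] += 1
    return counts


def score(name, counts, order=2, alphabet_size=38):
    """Mean log-probability per transition, with add-one smoothing.

    Higher (closer to zero) means the name looks more like the training data.
    """
    total, steps = 0.0, 0
    for i in range(order, len(name)):
        followers = counts.get(name[i - order:i], {})
        seen = followers.get(name[i], 0)
        total += log((seen + 1) / (sum(followers.values()) + alphabet_size))
        steps += 1
    return total / steps if steps else float("-inf")


# Toy training list (hypothetical); a real model would be trained on a
# large set of known-legitimate domain names.
model = train(["ferry", "ferries", "lines", "airlines", "berry"])
familiar = score("ferrylines", model)
random_looking = score("xqzkvjw", model)
```

With this toy model, `familiar` is higher than `random_looking` because several transitions in "ferrylines" were seen in training. Turning scores into verdicts would need cutoffs calibrated on labeled data; the slides do not state any.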
24. Limitations of the Markov model
- Useful to detect malicious domain names
- Very effective for randomly generated names
- Detects some legitimate domain names as malicious domains
- Malicious names similar to legitimate ones (e.g. ebay.com phishing sites)
- International domain names and punycode
- Solution: add DNS-related features to the classification process
25. DNS Features
- The number of nameservers that hosted or are hosting this domain
- The average time one nameserver hosted this domain
- The maximum time one nameserver hosted this domain
- The minimum time one nameserver hosted this domain
- The number of non-activated nameservers that hosted this domain before
- Whether the domain is an international one
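The feature list above could be computed from a per-domain hosting history roughly as follows. This is a sketch under stated assumptions: the `HostingPeriod` record and its fields are hypothetical, there is one record per nameserver, "non-activated" is read as "no longer active", and a domain counts as international if it has a punycode (`xn--`) label or non-ASCII characters.

```python
from dataclasses import dataclass


@dataclass
class HostingPeriod:
    """Hypothetical record: one nameserver's time hosting a domain."""
    nameserver: str
    days_hosted: float
    active: bool  # assumed meaning: the nameserver is still in service


def dns_features(domain, periods):
    """Build the six features from the slide for one domain."""
    durations = [p.days_hosted for p in periods]
    labels = domain.lower().split(".")
    return {
        "nameserver_count": len({p.nameserver for p in periods}),
        "avg_days": sum(durations) / len(durations) if durations else 0.0,
        "max_days": max(durations, default=0.0),
        "min_days": min(durations, default=0.0),
        "inactive_nameserver_count": len(
            {p.nameserver for p in periods if not p.active}
        ),
        # Punycode labels start with "xn--"; also accept raw Unicode names.
        "is_international": (
            any(label.startswith("xn--") for label in labels)
            or not domain.isascii()
        ),
    }


# Made-up history for illustration only.
history = [
    HostingPeriod("ns1.example.net", 400, True),
    HostingPeriod("ns2.example.net", 20, False),
]
features = dns_features("xn--bcher-kva.example", history)
```

The resulting dictionary can be appended to the Markov score as additional classifier inputs.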
26. Example Feature Density
[Figure: density (0.00 to 0.15) against time of domain on name server (in days, 0 to 600).]
29. IP Blacklist Lookup
- Mail server looks up sender IP over DNS
- Simple classifier modeled on IP blacklist query logs
- Narrow data set: queried IP, source IP, timestamp
- Deep data set: billions of query records monthly
- More complex data can be included
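30. IP Lookups
[Diagram: Sender (IP=Q), Receiver (IP=S), DNS, and Reputation server. The receiver's query "Q?" passes through DNS to the reputation server, the answer "Q=x" passes back, and the reputation server records <Q, S, T>.]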
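The lookup and the logged record <Q, S, T> can be sketched as below. The slides do not specify the query format; this assumes the common DNS blacklist convention of reversing the IPv4 octets and appending the blacklist zone. The zone name `dnsbl.example.org`, the `QueryRecord` type, and the function names are hypothetical, and no DNS query is actually sent.

```python
import ipaddress
from datetime import datetime, timezone
from typing import NamedTuple


class QueryRecord(NamedTuple):
    """One log entry as seen by the reputation server."""
    queried_ip: str      # Q: the sender IP being checked
    source_ip: str       # S: the IP of the mail server that asked
    timestamp: datetime  # T: when the query arrived


def blacklist_query_name(sender_ip, zone="dnsbl.example.org"):
    """Build the DNS name a mail server would look up for a sender IP."""
    # IPv4Address validates the input and raises ValueError if it is malformed.
    octets = str(ipaddress.IPv4Address(sender_ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def log_query(sender_ip, receiver_ip):
    """Create the <Q, S, T> record for one lookup."""
    return QueryRecord(sender_ip, receiver_ip, datetime.now(timezone.utc))


# Documentation-range addresses, for illustration only.
name = blacklist_query_name("192.0.2.99")
record = log_query("192.0.2.99", "198.51.100.7")
```

A classifier built on these logs would aggregate many such records per queried IP, for example counting how many distinct sources asked about it within a time window.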