2. Table of Contents
What is Machine Learning (ML)?
Cybersecurity Fundamentals
Why ML in Cybersecurity?
Application of ML in Cybersecurity
◦ Automatic Intrusion detection using ML
Phishing URL detection
Malware detection
Network behavior anomaly detection (NBAD)
Insider threat detection
DDoS detection (Distributed Denial of Service)
◦ Assessing password strength using ML
◦ Deep steganography for encrypting messages
Conclusion
References
3. Cybersecurity Fundamentals
• Cybersecurity is the protection of computer systems and networks from theft of or damage to hardware, software, or electronic data, and from disruption or misdirection of their services by unauthorized entities (hackers)
• Cybersecurity is becoming more important because of
◦ Increased usage of cloud services
◦ Smartphones
◦ IoT devices
◦ Digitalization of manufacturing industries and oil refineries
4. Why ML in Cybersecurity?
• Traditional systems rely on fixed rules or known signatures to filter malicious content
• Hackers are becoming more sophisticated, changing what they target, how they affect organizations, and how they attack different security systems
• ML systems model behavior rather than fixed rules, which lets them guard against future attacks based on patterns rather than strict rules
• ML systems are well suited to learning behaviors as system usage grows: more usage produces more data, and the number of attacks grows with it
[Figure: Traditional security vs. ML-based security. Source: Kaspersky]
5. What is a Phishing URL?
Phishing is one of the most successful modes of attack for hackers. It usually starts with malicious URLs sent through email
Phishing websites try to obtain user credentials by appearing to be a legitimate website
Phishing URLs sometimes differ from the original website's URL by a single character, especially in places where a typo is likely or where the eye tends to skip over the change
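The single-character trick described above can be checked mechanically. The following is a minimal sketch, not a production detector: it compares a URL's host name against a small allow-list by edit distance. The `KNOWN_DOMAINS` list and the function names are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; a real deployment would load its own.
KNOWN_DOMAINS = ["paypal.com", "google.com", "microsoft.com"]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def lookalike_of(url: str):
    """Return the known domain that the URL's host imitates by one edit, else None."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    for domain in KNOWN_DOMAINS:
        if host != domain and edit_distance(host, domain) == 1:
            return domain
    return None


print(lookalike_of("http://paypa1.com/login"))  # paypal.com
print(lookalike_of("https://www.google.com/"))  # None
```

An edit distance of exactly 1 covers a swapped, missing, or extra character. A wider threshold would catch more variants at the cost of more false alarms.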
6. Phishing URL Detection using ML example
Data file “phishing-dataset.7z” is available at
https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/tree/master/Chapter06
Method to process the data using an ML model
◦ Split the data into train and test sets in an 80%/20% ratio
◦ Import a random forest classifier
◦ Train the ML model on the training data
◦ Evaluate the model on the test data using a confusion matrix
[Figure: confusion matrix on the test data]
Attribute                      Values
Having an IP address           { 1,0 }
Having a long URL              { 1,0,-1 }
Uses Shortening Service        { 0,1 }
Having the '@' symbol          { 0,1 }
Double slash redirecting       { 0,1 }
Having a prefix and suffix     { -1,0,1 }
Having a subdomain             { -1,0,1 }
SSLfinal state                 { -1,1,0 }
Domain registration length     { 0,1,-1 }
Favicon                        { 0,1 }
Is a standard port             { 0,1 }
Uses HTTPS tokens              { 0,1 }
Request_URL                    { 1,-1 }
Abnormal URL anchor            { -1,0,1 }
Links_in_tags                  { 1,-1,0 }
SFH                            { -1,1 }
Submitting to email            { 1,0 }
Abnormal URL                   { 1,0 }
Redirect                       { 0,1 }
On mouseover                   { 0,1 }
Right-click                    { 0,1 }
Pop-up window                  { 0,1 }
Iframe                         { 0,1 }
Age of domain                  { -1,0,1 }
DNS record                     { 1,0 }
Web traffic                    { -1,0,1 }
Page rank                      { -1,0,1 }
Google index                   { 0,1 }
Links pointing to page         { 1,0,-1 }
Statistical report             { 1,0 }
Result                         { 1,-1 }
**Source: Machine Learning for Cybersecurity Cookbook
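Several attributes in this dataset can be derived from the URL string alone. The sketch below shows one plausible way to compute five of them. It assumes a simple encoding (1 = trait present, 0 = absent), which may not match the dataset's own value conventions. The `SHORTENERS` set and the feature names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of URL-shortening hosts, for illustration only.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def url_features(url: str) -> dict:
    """Lexical features of a URL; assumed encoding: 1 = present, 0 = absent."""
    host = (urlparse(url).hostname or "").lower()
    after_scheme = url.split("://", 1)[-1]
    return {
        # Host is a dotted IPv4 address instead of a name.
        "having_ip_address": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "uses_shortening_service": int(host in SHORTENERS),
        "having_at_symbol": int("@" in url),
        # A "//" after the scheme separator suggests an embedded redirect.
        "double_slash_redirecting": int("//" in after_scheme),
        # A "-" in the host is the prefix/suffix trick (e.g. secure-bank.example).
        "having_prefix_suffix": int("-" in host),
    }


print(url_features("http://192.168.10.5/a//b"))
print(url_features("https://secure-login.example.com/@user"))
```

Each dictionary would become one row of the feature matrix passed to the classifier. The remaining attributes (domain age, page rank, DNS record, and so on) need external lookups and are omitted here.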
7. What is Malware?
Malware is any software intentionally designed to cause damage to a computer, server, or computer network. Common examples of malware are Trojan horses, ransomware, spyware, and scareware
A popular way for hackers to sneak malicious files into a network is to conceal the file type/extension
Example:
◦ A system administrator disables execution of all PowerShell scripts with the extension “.ps1”
◦ A hacker changes or removes the “.ps1” extension of the file
◦ Only by examining the content of the file can one identify whether it is malicious
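Examining content rather than the extension can be sketched as below. This is a hand-written heuristic for illustration, not an ML model: it checks well-known leading bytes for two binary formats and counts a few PowerShell keywords. The keyword list `POWERSHELL_HINTS` and the threshold of two matches are assumptions.

```python
# Well-known leading "magic" bytes of two binary formats.
MAGIC = {b"MZ": "Windows executable", b"%PDF": "PDF document"}

# Assumed keyword hints for PowerShell text; real scripts vary widely.
POWERSHELL_HINTS = (b"invoke-expression", b"new-object", b"write-host", b"param(")


def sniff_content(data: bytes) -> str:
    """Guess a file's type from its bytes, ignoring whatever it is named."""
    for magic, label in MAGIC.items():
        if data.startswith(magic):
            return label
    lowered = data.lower()
    if sum(hint in lowered for hint in POWERSHELL_HINTS) >= 2:
        return "PowerShell script (heuristic)"
    return "unknown"


# A script whose ".ps1" extension was removed still gives itself away.
renamed_script = b"$c = New-Object Net.WebClient\nInvoke-Expression $c.DownloadString($u)\n"
print(sniff_content(renamed_script))     # PowerShell script (heuristic)
print(sniff_content(b"MZ\x90\x00\x03"))  # Windows executable
print(sniff_content(b"meeting notes"))   # unknown
```

In practice the bytes would come from `open(path, "rb").read()`. A trained classifier over content features would replace the fixed keyword list.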
8. Malware Detection using ML example
Data files “Benign PE Samples 1.7z” and “Malicious PE Samples 1.7z” are available at
https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook
Method for static malware detection using an ML model on PE (Portable Executable) files
◦ Read the byte sequence of a binary file
◦ Create a list of N-grams from the byte sequence
◦ Select the 100 most frequent 2-grams as features
◦ Create a TF-IDF vectorizer
◦ Split the data into train and test sets
◦ Fit the ML model on the train data
◦ Plot the confusion matrix on the test data
**Source: Machine Learning for Cybersecurity Cookbook
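The N-gram steps in the method above can be sketched with the standard library alone. This is a minimal illustration on toy byte strings, not the cookbook's code. The helper names are hypothetical, and `k` is set to 3 instead of 100 so the output stays readable.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2):
    """Yield every overlapping n-byte window of data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def top_ngrams(samples, n=2, k=100):
    """Return the k most frequent n-grams across all samples."""
    counts = Counter()
    for data in samples:
        counts.update(byte_ngrams(data, n))
    return [gram for gram, _ in counts.most_common(k)]


def to_feature_vector(data, vocabulary, n=2):
    """Count how often each vocabulary n-gram occurs in one sample."""
    counts = Counter(byte_ngrams(data, n))
    return [counts[gram] for gram in vocabulary]


# Toy byte strings standing in for PE files (not real samples).
samples = [b"MZ\x90\x00\x00\x00MZ", b"\x00\x00\x00\x00MZ"]
vocab = top_ngrams(samples, n=2, k=3)
vectors = [to_feature_vector(s, vocab) for s in samples]
print(vocab)    # [b'\x00\x00', b'MZ', b'\x00M']
print(vectors)  # [[2, 2, 1], [3, 1, 1]]
```

The count vectors are what a TF-IDF weighting step and then a classifier would consume in the later steps.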
9. Network Behavior Anomaly Detection (NBAD)
NBAD is the continuous monitoring of a computer network for unusual or suspicious trends or events, raising alarms in real time to highlight threats
NBAD works on characteristics such as traffic volume, bandwidth, and protocol use
Situations in which NBAD can outperform
signature-based detection
◦ New zero-day attacks
◦ When the threat traffic is encrypted
Typical usage scenarios of NBAD
◦ Log analysis
◦ Packet inspection system
◦ Flow monitoring system
◦ Route analytics
[Figure: NBAD system high-level overview. Source: https://www.researchgate.net/figure/Block-diagram-of-Network-based-Anomaly-Detection-System-that-jointly-employs-the-proposed_fig3_220673441]
10. NBAD using ML example
The KDD dataset is used. File “kddcup_dataset.csv” is available at
https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/tree/master/Chapter06
The major types of variables used are
◦ Bytes sent, login attempts, TCP errors, source bytes, and destination bytes
Detecting network anomalies with k-means, using PySpark to handle large volumes of data
◦ One-hot encode (OHE) the categorical features
◦ Normalize both categorical and continuous features
◦ Apply the k-means algorithm to find the best number of clusters
◦ Apply the k-means algorithm to cluster the data and find the anomalies
**Source: Hands-on Machine Learning for Cybersecurity
Attributes
duration             num_root             diff_srv_rate
protocol_type        num_file_creations   srv_diff_host_rate
flag                 num_shells           dst_host_count
src_bytes            num_access_files     dst_host_srv_count
dst_bytes            num_outbound_cmds    dst_host_same_srv_rate
land                 is_host_login        dst_host_diff_srv_rate
wrong_fragment       is_guest_login       dst_host_same_src_port_rate
urgent               count                dst_host_srv_diff_host_rate
hot                  srv_count            dst_host_serror_rate
num_failed_logins    serror_rate          dst_host_srv_serror_rate
logged_in            srv_serror_rate      dst_host_rerror_rate
num_compromised      rerror_rate          dst_host_srv_rerror_rate
root_shell           srv_rerror_rate      label
su_attempted         same_srv_rate
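The encode, normalize, and cluster steps of this example can be sketched in plain Python. This is a small single-machine illustration rather than the PySpark pipeline the slide refers to. The records are made up, `k=2` is assumed rather than searched for, and the initial centroids are picked at evenly spaced indices so the run is deterministic.

```python
import math

# Toy records: (protocol_type, src_bytes, dst_bytes). Values are made up.
records = [("tcp", 200, 1500), ("tcp", 220, 1400), ("tcp", 210, 1600),
           ("udp", 100, 100), ("udp", 120, 90), ("udp", 110, 110),
           ("tcp", 50000, 10)]  # the last row is the planted anomaly

protocols = sorted({r[0] for r in records})


def encode(rec):
    """One-hot encode the protocol and append the numeric fields."""
    return [1.0 if rec[0] == p else 0.0 for p in protocols] + [float(rec[1]), float(rec[2])]


def min_max_scale(rows):
    """Scale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
            for row in rows]


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with evenly spaced initial centroids (an assumption)."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids


points = min_max_scale([encode(r) for r in records])
centroids = kmeans(points, k=2)
# Anomaly score: distance from each point to its nearest centroid.
distances = [min(math.dist(p, c) for c in centroids) for p in points]
suspect = max(range(len(points)), key=distances.__getitem__)
print(records[suspect])  # ('tcp', 50000, 10)
```

Flagging only the single farthest point is a simplification. A real system would set a distance threshold and would choose `k` by comparing clustering quality across several values.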
11. Insider Threat Detection
Insider threat detection is a growing challenge for employers. Insider threats are any actions taken by an employee that are potentially harmful to the organization
Insider threat actions range from unsanctioned data transfer to advanced persistent threats (APT). Typical profiles are
◦ Leaker
◦ Thief
◦ Saboteur
Some high-level indicators of a threat include
◦ Whether an email has been sent to an outsider
◦ Whether a login occurred outside business hours
**Source: https://activtrak.com/insider-threat-detection/
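The two indicators listed above translate directly into binary features. The sketch below is illustrative only. The company domain, the business-hours window, and the choice to count weekends as after hours are all assumptions.

```python
from datetime import datetime

COMPANY_DOMAIN = "example.com"  # hypothetical company domain
BUSINESS_HOURS = range(8, 18)   # assumed window: 08:00 to 17:59 local time


def email_to_outsider(recipients) -> int:
    """1 if any recipient address is outside the company domain, else 0."""
    suffix = "@" + COMPANY_DOMAIN
    return int(any(not r.lower().endswith(suffix) for r in recipients))


def after_hours_login(timestamp: datetime) -> int:
    """1 if the login falls on a weekend or outside business hours, else 0."""
    weekend = timestamp.weekday() >= 5  # Saturday = 5, Sunday = 6
    return int(weekend or timestamp.hour not in BUSINESS_HOURS)


print(email_to_outsider(["ana@example.com", "bob@gmail.com"]))  # 1
print(after_hours_login(datetime(2024, 1, 3, 23, 15)))          # 1 (Wednesday night)
print(after_hours_login(datetime(2024, 1, 3, 10, 0)))           # 0
```

Counting such flags per user per day produces the kind of series an anomaly detector can then score.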
12. Insider Threat Detection using ML example
Data file “r4.2.tar.bz2” is version 4.2 of the CERT insider threat dataset from Carnegie Mellon University.
The dataset covers a few months of activity at a single engineering company, including phone, logon, folder, and system access records
ftp://ftp.sei.cmu.edu/pub/cert-data/r4.2.tar.bz2
Method for anomaly detection on CERT version 4.2
◦ Create features from the raw data for monitoring purposes, such as device, email, file, login, and HTTP activity
◦ Create a series for each user
◦ Split the data into train and test segments
◦ Apply Isolation Forest to the feature values (X)
◦ Apply a threshold to the anomaly scores and plot the confusion matrix
**Source: Hands-on Machine Learning for Cybersecurity
[Figure: Isolation Forest, normal point vs. outlier. Source: https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e]
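To show why Isolation Forest suits this task, here is a compact teaching sketch of the idea in plain Python: random splits isolate unusual points in fewer steps, so a short average path means a high anomaly score. It is a simplified version, not the library implementation a real project would use. The per-user features, the planted outlier, and the 0.7 threshold are all assumptions.

```python
import math
import random


def avg_path_length(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximates H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop instead of trying another feature
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, x, depth=0):
    """Depth at which x lands, plus an estimate for the unsplit leaf."""
    if tree[0] == "leaf":
        return depth + avg_path_length(tree[1])
    _, dim, split, left, right = tree
    return path_length(left if x[dim] < split else right, x, depth + 1)


def anomaly_scores(points, n_trees=100, seed=0):
    """Scores in (0, 1]; values closer to 1 mean easier to isolate."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(points)))
    trees = [build_tree(points, 0, max_depth, rng) for _ in range(n_trees)]
    norm = avg_path_length(len(points))
    return [2 ** (-sum(path_length(t, x) for t in trees) / (n_trees * norm))
            for x in points]


# Hypothetical per-user features: (after-hours logins, emails to outsiders).
data_rng = random.Random(1)
users = [(data_rng.randint(0, 3), data_rng.randint(0, 3)) for _ in range(50)]
users.append((25, 30))  # planted anomalous user at index 50
scores = anomaly_scores(users)
threshold = 0.7         # assumed cut-off; tuned on labelled data in practice
flagged = [i for i, s in enumerate(scores) if s > threshold]
print(flagged)
```

Comparing `flagged` against known insider labels is what yields the confusion matrix mentioned in the last step.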
13. Detecting DDoS (Distributed Denial of Service)
DDoS is an attack in which traffic from different
sources floods a victim, resulting in interruption of
services
DDoS attacks fall into 3 broad categories
◦ Application level
◦ Protocol
◦ Volumetric attacks
Currently DDoS defense is largely manual: blocking certain IP addresses or identified domains
As DDoS bots become more sophisticated, manually blocking domains and addresses is becoming outdated
[Figure: DDoS working principle. Source: https://www.cloudflare.com/en-in/learning/ddos/what-is-a-ddos-attack/]
14. Detecting DDoS (Distributed Denial of Service) using ML
The CIC DoS dataset (2017) consists of 80% benign and 20% DDoS traffic. Download “ddos_dataset.7z” from
https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/tree/master/Chapter06
The following features in the dataset are used to predict the label, “benign” or “DDoS traffic”
◦ Fwd Pkt Len Mean (Mean of forward packet length)
◦ Fwd Seg Size Avg (Average segment size observed in the forward direction)
◦ Fwd Seg Size Min (Minimum segment size observed in the forward direction)
◦ Init Fwd/Bwd Win Byts (Number of bytes sent in the initial window in the forward/backward directions)
Machine learning model steps:
◦ Train a random forest classifier on the training data
◦ Test model accuracy on the test data and plot the confusion matrix
**Source: https://www.cloudflare.com/en-in/learning/ddos/what-is-a-ddos-attack/
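With an 80%/20% class split, accuracy alone can mislead, which is one reason to inspect the confusion matrix. The sketch below computes the matrix by hand for two made-up sets of predictions. It does not train a random forest: the "model" output is hard-coded for illustration.

```python
def confusion_matrix(y_true, y_pred, positive="ddos"):
    """Return (tp, fp, fn, tn) counts for a binary labelling."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up test labels with the 80% benign / 20% DDoS ratio of the dataset.
y_true = ["benign"] * 8 + ["ddos"] * 2
always_benign = ["benign"] * 10                               # trivial baseline
model_output = ["benign"] * 7 + ["ddos", "ddos", "benign"]    # hypothetical classifier

for name, y_pred in [("baseline", always_benign), ("model", model_output)]:
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn)
    print(name, (tp, fp, fn, tn), accuracy, recall)
# baseline (0, 0, 2, 8) 0.8 0.0
# model (1, 1, 1, 7) 0.8 0.5
```

Both rows reach 80% accuracy, yet the baseline detects no attacks at all. The matrix makes that difference visible.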
15. Assessing password Strength using ML
Password cracking is the systematic endeavor of discovering the password of a secure system
Password strength assessment using ML is based on the training dataset “passwordDataset.7z”
https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/tree/master/Chapter07/
ML methodology flow
◦ Break the password string into individual characters
◦ Apply a TF-IDF vectorizer to convert the characters into numeric format
◦ Split the data into train and test sets
◦ Train an XGB classifier on the train data and evaluate the model on the test data
**Source: https://www.infosecurity-magazine.com/blogs/password-strength-meters//
Password dataset (sample)
password            strength
intel1              0
klara-tershina3H    2
czuodhj972          1
Trained model predictions
◦ qwerty -> 0
◦ c9lCwLBFmdLbG6iWla4H -> 2
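The first two steps of the methodology flow, character splitting and TF-IDF weighting, can be sketched as follows. This uses the plain textbook formula (term frequency times the log of inverse document frequency) on toy passwords. Library vectorizers usually add smoothing and normalization, so their numbers will differ. The function name is hypothetical, and passwords are assumed to be non-empty.

```python
import math
from collections import Counter


def char_tfidf(passwords):
    """Character-level TF-IDF: tf = count / length, idf = ln(N / df)."""
    n_docs = len(passwords)
    # Document frequency: in how many passwords each character appears.
    doc_freq = Counter(ch for pw in passwords for ch in set(pw))
    vocab = sorted(doc_freq)
    rows = []
    for pw in passwords:
        counts = Counter(pw)
        rows.append([counts[ch] / len(pw) * math.log(n_docs / doc_freq[ch])
                     for ch in vocab])
    return vocab, rows


vocab, rows = char_tfidf(["aab", "abc", "b1!"])
print(vocab)  # ['!', '1', 'a', 'b', 'c']
for row in rows:
    print([round(v, 3) for v in row])
```

The character "b" appears in every toy password, so its weight is zero everywhere, while rarer characters such as "!" and "1" get the largest weights. Each row would be the input vector for the classifier.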
16. Deep Steganography for encrypting messages
Steganography is the practice of hiding a message (the secret) within another medium (the cover), such as a file, text, image, or video
Secret -> Cover = Container
In deep steganography, the secret is distributed across all bits of the cover, unlike traditional methods, where the secret is encoded in the LSB (Least Significant Bit)
[Figure: the hiding network (H-net) combines Cover and Secret into the Container; the revealing network (R-net) recovers the Secret from the Container]
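For contrast with the deep-learning approach, the traditional LSB method can be sketched in a few lines. This is the classical technique, not deep steganography: each bit of the secret replaces the least significant bit of one cover byte. The cover here is a stand-in byte string rather than real image data, and the function names are hypothetical.

```python
def hide(cover: bytes, secret: bytes) -> bytes:
    """Write the secret's bits, most significant first, into successive cover LSBs."""
    bits = [(byte >> shift) & 1 for byte in secret for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover too small for secret")
    container = bytearray(cover)
    for i, bit in enumerate(bits):
        container[i] = (container[i] & 0xFE) | bit  # clear the LSB, then set it
    return bytes(container)


def reveal(container: bytes, secret_len: int) -> bytes:
    """Read secret_len bytes back out of the container's LSBs."""
    out = bytearray()
    for i in range(secret_len):
        value = 0
        for b in container[i * 8:(i + 1) * 8]:
            value = (value << 1) | (b & 1)
        out.append(value)
    return bytes(out)


cover = bytes(range(100, 132))  # 32 stand-in "pixel" bytes
container = hide(cover, b"hi!")
print(reveal(container, 3))                                # b'hi!'
print(max(abs(a - b) for a, b in zip(cover, container)))   # 1
```

No cover byte changes by more than 1, which is why LSB hiding is visually subtle. The secret sits in a fixed, predictable bit position, unlike the distributed encoding described above.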
17. Conclusion
ML-based techniques can help combat various forms of attack in advance
New zero-day attacks are very difficult to detect with traditional signature-based techniques but can be detected with ML-based models
ML models predict better with higher volumes of data, so their performance tends to improve as more data accumulates over time
18. References
Soma Halder and Sinan Ozdemir, “Hands-On Machine Learning for Cybersecurity”, Packt Publishing
Emmanuel Tsukerman, “Machine Learning for Cybersecurity Cookbook”, Packt Publishing
Chiheb Chebbi, “Mastering Machine Learning for Penetration Testing”, Packt Publishing
Mahdi Zamani et al., “Machine Learning Techniques for Intrusion Detection”, arXiv, 9 May 2015