This talk focuses on how AI can be leveraged to solve some of the subproblems in cybersecurity. The talk will start with a discussion of why there is a surge in data breaches and cybersecurity attacks. Then I will discuss some of the use cases, data pipelines, and architectural details of AI solutions for cybersecurity. Here is a detailed plan for the talk:
(1) The current state of information security and tools (5 mins).
(2) A brief history and current status of using AI for InfoSec (5 mins).
Currently, security data science tools primarily process raw data from multiple sources, such as network flows, authentication logs, firewall logs, and endpoints, and detect anomalous events. These tools generate a large number of false positives, which security analysts must then investigate further. Specifically, I will address the following questions:
- What is the foundation of current security data science tools?
- What are the pros and cons of existing tools?
(3) AI use cases, data pipeline, architecture, and data experiments (15 mins). The following questions will be addressed:
- What are the different use cases that can be enabled by AI?
- How would AI transform incident response?
- What's a typical data pipeline and architecture of a cybersecurity AI solution?
Demo 1: PowerShell Obfuscation Detection using Deep Learning Neural Networks
Demo 2: Malicious URL Detection using Recurrent Neural Networks
(4) Challenges and limitations of using AI alone for cybersecurity (5 mins)
- AI generates too many false positives
- Enterprises can investigate only 2-5% of alerts due to the limited number of security analysts
- Need for an automated response, not just detection
(5) Our approach: fuse deception with AI (10 mins):
A key objective of deception is to mislead attackers and threats already inside the network so that they can be detected, engaged, trapped, and remediated. Deception provides high-fidelity alerts, and AI provides the ability to construct context around each alert. By fusing deception and data science, security analysts can defend proactively. We shall demonstrate our approach with a specific case study:
- Demo 3: Detecting and inferring threats in a high-interaction decoy using an AI engine
(6) Q&A (5 mins)
10. Security Data Sources
Network Logs
• Firewall
• IDS/IPS
• Network flow
• DNS
• Wi-Fi
(Easily runs into a few TBs of data per day)
Endpoint Logs
• File System Changes
• Applications, Process, OS logs
• Antivirus Alerts
Authentication Logs
• Windows Events
• Active Directory User Logs
• Privileged User
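Because these sources arrive in different formats, a pipeline usually has to map them onto one common record before any model sees them. The following is a minimal sketch of that step, not the pipeline used in the talk: the `Event` schema, the field names (`ts`, `src_ip`, `time`, `workstation`, `event`), and the two parser functions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical common schema for events from any log source."""
    timestamp: datetime
    source: str   # "network", "endpoint", or "authentication"
    host: str
    action: str


def from_firewall(record: dict) -> Event:
    # Assumed firewall record: epoch seconds in "ts", plus "src_ip" and "action".
    return Event(
        timestamp=datetime.fromtimestamp(record["ts"], tz=timezone.utc),
        source="network",
        host=record["src_ip"],
        action=record["action"],
    )


def from_auth(record: dict) -> Event:
    # Assumed authentication record: ISO 8601 string with offset in "time".
    return Event(
        timestamp=datetime.fromisoformat(record["time"]),
        source="authentication",
        host=record["workstation"],
        action=record["event"],
    )


def merge_timeline(*streams):
    """Merge several event streams into one list ordered by time."""
    return sorted((e for s in streams for e in s), key=lambda e: e.timestamp)


firewall_events = [from_firewall({"ts": 10, "src_ip": "10.0.0.5", "action": "deny"})]
auth_events = [from_auth({"time": "1970-01-01T00:00:05+00:00",
                          "workstation": "ws-1", "event": "logon_failure"})]
timeline = merge_timeline(firewall_events, auth_events)
```

Once every source shares one schema and one clock, events from the network, endpoint, and authentication logs can be correlated on a single timeline.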
19. Tor-nonTor Traffic - Dataset
Activity          Details
Browsing          HTTP, HTTPS traffic using Chrome and Firefox
Email             Mails delivered via SMTP/S and received via POP3/SSL and IMAP/SSL, Thunderbird client
Chat              Facebook, Hangout, ICQ and IAM chat activities
Audio-streaming   Spotify audio streaming
Video-streaming   YouTube and Vimeo services over Chrome and Firefox
File transfer     Skype file transfers, FTP over SSH, FTP over SSL traffic sessions
VoIP              Facebook, Hangout and Skype
20. Demo Using TensorFlow and Keras
[Figure: classification diagram; labels: Tor Traffic, Non-Tor Traffic, Unknown scripts, Feature f1, Feature f2]
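To make the idea of separating two traffic classes over two features concrete, here is a minimal sketch of a binary classifier. It is a stand-in, not the demo itself: it uses plain-Python logistic regression rather than TensorFlow and Keras, and the data is synthetic (two Gaussian clusters generated below), not the Tor-nonTor dataset. The names `f1` and `f2` follow the slide; everything else is an assumption.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train(samples, labels, epochs=200, lr=0.1):
    """Fit weights for p(Tor) = sigmoid(w1*f1 + w2*f2 + b) by stochastic gradient descent."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (f1, f2), y in zip(samples, labels):
            err = sigmoid(w1 * f1 + w2 * f2 + b) - y  # gradient of log loss w.r.t. z
            w1 -= lr * err * f1
            w2 -= lr * err * f2
            b -= lr * err
    return w1, w2, b


def predict(model, f1, f2) -> int:
    """Return 1 for Tor, 0 for non-Tor."""
    w1, w2, b = model
    return 1 if sigmoid(w1 * f1 + w2 * f2 + b) >= 0.5 else 0


# Synthetic, well-separated clusters (an assumption made for illustration only).
rng = random.Random(0)
tor = [(rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)) for _ in range(50)]
non_tor = [(rng.gauss(-2.0, 0.5), rng.gauss(-2.0, 0.5)) for _ in range(50)]
samples = tor + non_tor
labels = [1] * len(tor) + [0] * len(non_tor)

model = train(samples, labels)
accuracy = sum(predict(model, *s) == y for s, y in zip(samples, labels)) / len(samples)
```

Real flow features are rarely this cleanly separable, which is one reason a deeper model may be preferred over a linear one.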
22. Command and Control Detection
C&C domain examples:
• DGA-based: gvludcvhcrjwmgq.in, uqvwxfrhhwreddf.yt
• Non-DGA-based: thisisyourchangeqq.com, homejobsinstitute.biz
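The difference between the two groups shows up in simple lexical features of the domain name. The sketch below computes two such features; it is an illustration under assumptions, not the detection method from the talk. The function names are hypothetical, and the code assumes each domain has no subdomain, so the text before the first dot is the label of interest.

```python
VOWELS = set("aeiou")


def first_label(domain: str) -> str:
    # Assumes no subdomain: the part before the first dot is the registered label.
    return domain.lower().split(".")[0]


def vowel_ratio(domain: str) -> float:
    """Fraction of letters in the label that are vowels."""
    letters = [c for c in first_label(domain) if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def longest_consonant_run(domain: str) -> int:
    """Length of the longest unbroken run of consonant letters in the label."""
    best = run = 0
    for c in first_label(domain):
        if c.isalpha() and c not in VOWELS:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


examples = [
    "gvludcvhcrjwmgq.in",       # DGA-based
    "uqvwxfrhhwreddf.yt",       # DGA-based
    "thisisyourchangeqq.com",   # non-DGA-based
    "homejobsinstitute.biz",    # non-DGA-based
]
features = {d: (vowel_ratio(d), longest_consonant_run(d)) for d in examples}
```

On these four examples, the DGA-based names have a much lower vowel ratio and much longer consonant runs than the non-DGA-based ones. The non-DGA-based C&C domains read like ordinary words, so lexical features alone would not flag them and other signals would be needed.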
[Figure: command and control attack diagram; labels: Ransomware, Malware, Enterprise Network, Main DB, Webserver, C&C server, Data, Command, Attacker]
31. • Pros of DL in InfoSec:
  • Find hidden patterns in big data - “Needle in the haystack”
  • Able to correlate across events
• Cons of DL in InfoSec:
  • Too many false positives
  • No labels, so using ML and DL becomes difficult
• DL + Deception: a unique solution to find hidden threats
Summary