Discovering bots that automate attacks and perform malicious actions is a key area of cyber security research. In this presentation we scope out the bot detection problem in the context of web API bots, and suggest how sequential neural network models can be used to solve the key problem of differentiating human behaviour from that of a malicious automaton.
Recognising Behavioural Patterns of Web API Bots Using Machine Learning Techniques : Server-side strategies without client-side instrumentation
1. Town Crier
Recognising Behavioural Patterns of Web API Bots Using Machine Learning Techniques: Server-side strategies without client-side instrumentation
Guntur Ravindra
Machine Learning Practitioner
2. Town Crier
Bot traffic is a significant % of API traffic
Traffic share (chart):
• Human: 39%
• Benign non-human: 31%
• Malicious non-human: 20%
• Attack automation: 10%
• About 60% of web API traffic is generated by automatons
• 1/2 of the automated traffic is malicious
3. Town Crier
Bot classification based on high-level behaviour
• Bot classification in terms of their agenda
• Benign bots
• Malicious bots
• Threat automation bots or attack automation bots
• Bot classification based on level of obfuscation
• Human emulators
• Eg: Malicious bots, attack automation bots
• Service bots
• Eg: Benign bots
• Bot classification based on goal oriented functionality
• Eg: OWASP-OAT 21 classification https://owasp.org/www-pdf-archive/Automated-threat-handbook.pdf
4. Town Crier
Chances of detection success
• Approximately 60% of the traffic is bot traffic
• Of this 60%, only 1/2 is due to benign bots, and these bots are unlikely to exhibit human emulation
• Benign bots have nothing to hide from detectors
• They exhibit statistical properties that are easy to recognise
• As a result, bots can be reliably detected in only about 30% of the traffic
• Of the remaining 70% of traffic, less than 10% is due to attack automation bots
• Attack automation bots can be detected at the reconnaissance phase
• They are likely to fuzz variables, inject malicious patterns, make erroneous API calls
• These actions generate HTTP response errors and error messages in response body
• The remaining 60% traffic is confusing for bot detection algorithms
• Detection algorithms are likely to confuse malicious bots with human actions and hence these bots fly under the radar
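The percentages on this slide follow from simple arithmetic. A small sketch using the slide's rounded figures (the variable names are illustrative, and the attack automation share is taken at its stated upper bound of 10% of all traffic):

```python
# Rounded shares from the slide, expressed as fractions of all API traffic.
bot_share = 0.60                  # approximately 60% of traffic is bot traffic
benign_share = bot_share / 2      # half of the bot traffic is benign
attack_automation_share = 0.10    # "less than 10%": upper bound used here

# Benign bots are the part that is easy to recognise.
reliably_detectable = benign_share
remaining = 1.0 - reliably_detectable
# Attack automation can be caught at reconnaissance; the rest is ambiguous.
ambiguous = remaining - attack_automation_share

print(f"reliably detectable: {reliably_detectable:.0%}")
print(f"remaining traffic:   {remaining:.0%}")
print(f"ambiguous traffic:   {ambiguous:.0%}")
```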
5. Town Crier
Approaches to bot detection
Detection techniques (diagram, split into client side and server side):
• Fingerprinting: user agent, OS, device
• Progressive challenge: cookies, JS, CAPTCHA
• Behavioural: mouse movements, access speed, request frequency, order and sequence of requests, bounce rate, content
7. Town Crier
Patterns exhibited by benign Bots
• Service Bots that are benign and are performing automated tasks on behalf of humans
• Perform simple tasks and are goal oriented
• Hashing, MinHash, Embedding, Markov Models
• Exhibit repetitive characteristics
• Skewed Probability distribution on API call sequences
• Exhibit periodicity
• Time-series seasonality
• Perform rapid access
• Skewed probability distribution on inter-API call intervals
• Submit deterministic input data
• Significant part of request body and request headers repeat in every observation
• Data specification errors are very rare
• Hashing, MinHash, fingerprinting on message body
• They have nothing to hide from bot detection security products
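Two of the patterns above, periodicity and repetitive call sequences, can be turned into simple session statistics. The following is a minimal sketch, not the presenter's method: the function names and the thresholds `max_cv` and `min_share` are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def interval_cv(timestamps):
    """Coefficient of variation of inter-call intervals (0 means perfectly periodic)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps or mean(gaps) == 0:
        return 0.0
    return pstdev(gaps) / mean(gaps)


def top_bigram_share(api_calls):
    """Fraction of consecutive API pairs taken up by the single most common pair."""
    pairs = list(zip(api_calls, api_calls[1:]))
    if not pairs:
        return 0.0
    return Counter(pairs).most_common(1)[0][1] / len(pairs)


def looks_like_service_bot(timestamps, api_calls, max_cv=0.1, min_share=0.4):
    """Assumed rule: near-constant intervals and a heavily skewed pair distribution."""
    return (interval_cv(timestamps) <= max_cv
            and top_bigram_share(api_calls) >= min_share)


# A poller calling the same API every 60 seconds versus an irregular session.
print(looks_like_service_bot([0, 60, 120, 180, 240], ["/poll"] * 5))
print(looks_like_service_bot([0, 3, 40, 41, 130],
                             ["/login", "/search", "/item", "/cart", "/logout"]))
```

In practice the thresholds would have to be fitted to observed traffic rather than fixed by hand.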
8. Town Crier
Patterns exhibited by malicious bots
• Everything that benign bots are not
• Obfuscate activity to mimic human behaviour
• No rapid access
• Designed to avoid rate-limiting logic, statistical inference, and probability distribution based detectors
• Periodicity is distributed across multiple APIs, multiple user IDs, and multiple sites, thereby making it difficult to detect at a single site
• Perform complex tasks uncharacteristic of normal human behaviour
9. Town Crier
Patterns exhibited by attack automation bots
• All patterns exhibited by malicious bots
• Display similar characteristics in similar business domains and have targeted functionality
• Discovery of these bots requires data collation from across multiple sites with similar business domains
• Reconnaissance phase of the kill chain is marked by
• API spec violations
• API response errors in response body
• Protocol error messages
• Understand statistical properties of these types of violations and build an anomaly detector
• Business logic abuse
• API access attempts will be targeted at business logic abuse
• Build dominant user behaviours across user cohorts and find anomalies in comparison to these behaviours
• Machine learning models need to combine request-response data with the API call sequence rather than use either in isolation
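As one way to read "understand statistical properties of these types of violations and build an anomaly detector", the sketch below flags users whose HTTP error rate is far above that of the population. It is an illustration under stated assumptions: only status codes are considered (not response bodies or spec violations), and the robust z-score, `threshold` and `min_scale` are choices made here, not taken from the slides.

```python
from statistics import median


def error_rate(statuses):
    """Share of responses with an HTTP status of 400 or above."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s >= 400) / len(statuses)


def robust_z_scores(values, min_scale=0.05):
    """Median/MAD z-scores; min_scale (assumed) avoids division by a zero MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = max(1.4826 * mad, min_scale)
    return [(v - med) / scale for v in values]


def flag_anomalous_users(statuses_by_user, threshold=3.5):
    """Return the users whose error rate is unusually high for the population."""
    users = list(statuses_by_user)
    rates = [error_rate(statuses_by_user[u]) for u in users]
    scores = robust_z_scores(rates)
    return [u for u, z in zip(users, scores) if z > threshold]


traffic = {
    "a": [200] * 10,
    "b": [200] * 9 + [404],
    "c": [200] * 10,
    "d": [200] * 10,
    "e": [400, 404, 500, 422, 200, 404, 400, 200, 404, 400],
}
print(flag_anomalous_users(traffic))
```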
10. Town Crier
Bot detection strategy
Decision flow (diagram):
• Q1: Does a sequence of API calls, or a set of sequences of API calls, exhibit periodicity, rapid access, simplicity, deterministic input data, or skewed statistical distributions?
• YES: Is Bot. Q2: Does the input data have anomalies, does the input data trigger malicious signatures, or does the output data have error messages?
• YES: Attack automation Bot
• NO: Benign/Malicious Bot
• NO: Could be Human or Human-like Bot. Q2: Does the input data have anomalies, does the input data trigger malicious signatures, or does the output data have error messages?
• YES: Could be Attack automation Bot
• NO: Could be Human or malicious bot. Q3: Do the user's API call sequences and data patterns fall into clusters where there are many more users?
• YES: Is Human
• NO: Could be malicious Bot
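The decision flow on this slide can be written down as a small function. This is a sketch of one reading of the diagram; the three boolean inputs are assumed to come from upstream detectors that are not shown, and the parameter names are illustrative.

```python
def classify(shows_bot_statistics, has_data_anomalies, in_populated_cluster):
    """Walk the slide's decision flow.

    shows_bot_statistics: periodicity, rapid access, simplicity, deterministic
        input data, or skewed statistical distributions were observed.
    has_data_anomalies: input anomalies, malicious signatures, or error
        messages in the output were observed.
    in_populated_cluster: the user's sequences and data patterns fall into a
        cluster shared with many other users.
    """
    if shows_bot_statistics:
        if has_data_anomalies:
            return "attack automation bot"
        return "benign or malicious bot"
    if has_data_anomalies:
        return "could be attack automation bot"
    if in_populated_cluster:
        return "human"
    return "could be malicious bot"


print(classify(True, False, False))
print(classify(False, False, True))
```

The cluster check only matters on the last branch, which reflects the slide: crowd behaviour is the only signal left once the statistical and data-level checks come back negative.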
11. Town Crier
Most common Bot action
• Iterate through a list of items in the hope of getting a hit
• Defense: after a predetermined number of failures, terminate the connection, blacklist the user, or challenge with a CAPTCHA
• Obfuscation Attack: iterate through the list using multiple user_ids (logins) on the same site and use very few attempts per user_id
• Condition: attacker can create multiple user_ids on the site via automated means and can create them easily without any extensive verification
• Condition: attacker does not need to create explicit user accounts
• Bot types: OAT-001 (Carding), OAT-002 (Token Cracking)
• Defense: an algorithm that can group multiple user_ids into the same bot actor. For this attributed threat actor, if the number of failures is above a threshold, then all the user_ids under that threat actor are part of a bot attempting an iteration attack
• Challenge: Can we really group multiple user_ids into the same bot actor? -> Clustering on API access behaviour
• Obfuscation Attack: iterate through the list using multiple user_ids (logins) on different sites supporting the same input type
• Condition: attacker can create multiple user_ids on multiple sites via automated means and can easily create them without any extensive verification
• Condition: attacker does not need to create explicit user accounts, but there are multiple sites that can accommodate the attacker
• Bot types: OAT-001 (Carding)
• Defense: bots using this strategy are difficult to detect with server-side detection
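The same-site defense described above, grouping user_ids into one bot actor and thresholding the actor's total failures, can be sketched as follows. The grouping key used here (the exact ordered API sequence) is a deliberately crude stand-in for clustering on API access behaviour, and `behaviour_key`, `find_iteration_bots` and `max_failures` are hypothetical names and values.

```python
from collections import defaultdict


def behaviour_key(api_calls):
    """Hypothetical actor key: the exact ordered sequence of API calls."""
    return tuple(api_calls)


def find_iteration_bots(sessions, max_failures=10):
    """sessions: iterable of (user_id, api_calls, failure_count) tuples.

    Failures are summed per actor rather than per user_id, so an attacker who
    spreads a few attempts over many logins still crosses the threshold.
    """
    actors = defaultdict(lambda: {"users": set(), "failures": 0})
    for user_id, api_calls, failures in sessions:
        actor = actors[behaviour_key(api_calls)]
        actor["users"].add(user_id)
        actor["failures"] += failures

    flagged = set()
    for actor in actors.values():
        if actor["failures"] > max_failures:
            flagged |= actor["users"]
    return flagged


# Six logins, two failed attempts each: no single login looks suspicious.
observed = [(f"user{i}", ["/login", "/pay"], 2) for i in range(6)]
observed.append(("alice", ["/login", "/search", "/pay"], 1))
print(sorted(find_iteration_bots(observed)))
```

Because legitimate users of the same application can also share call sequences, a real grouping key would need more signal than the sequence alone.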
13. Town Crier
Data characteristics
Sequence of API calls observed at the server side in a single session
✴ Inter-API call interval is a random variable
✴ The set of unique APIs called is probably deterministic and depends on application functionality
✴ The order in which APIs are called is a random variable. It could be deterministic for a large population of users, but for others it can vary due to various navigation behaviours
✴ The number of times an API is called in a session is a random variable, owing to activity such as hitting the browser back button, navigating back to the starting point/page, and jumping to another point in an app/page
✴ Input data to each API and the response data are random variables and are user dependent to some extent
Patterns annotated in the figure:
✴ Call to an earlier API after a time delay
✴ Simultaneous calls to 2 APIs
✴ Successive and rapid calls to the same API
✴ Calls to APIs not seen earlier in the session
14. Town Crier
Feature representation
Features at an API call in the session of the user:
• Time feature: interval between the call and the previous API call. Quantized into 4 levels and encoded.
• API identity: name of the API. Encoded.
• API data: values of request header keys. Encoded.
• API data: values of request body keys. Encoded.
• Session summary: session-level count-based statistics. Quantized histogram of bins and encoded.
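A rough sketch of how the per-call features on this slide might be encoded. The slide does not give the quantization boundaries or the encoding scheme, so `INTERVAL_EDGES`, the `Vocabulary` class, and the dictionary layout of a call are all assumptions; the session-level summary feature is left out for brevity.

```python
import bisect

# Assumed boundaries in seconds; three edges give the four levels on the slide.
INTERVAL_EDGES = [0.5, 5.0, 60.0]


def quantize_interval(seconds):
    """Map an inter-call interval to one of four levels (0 to 3)."""
    return bisect.bisect_right(INTERVAL_EDGES, seconds)


class Vocabulary:
    """Assigns a stable integer to each distinct token; 0 is kept for padding."""

    def __init__(self):
        self._ids = {}

    def encode(self, token):
        return self._ids.setdefault(token, len(self._ids) + 1)


def encode_call(call, prev_timestamp, api_vocab, data_vocab):
    """Encode one API call as time, identity, and data features."""
    return {
        "time": quantize_interval(call["timestamp"] - prev_timestamp),
        "api": api_vocab.encode(call["api"]),
        "headers": [data_vocab.encode(f"h:{k}={v}")
                    for k, v in sorted(call["headers"].items())],
        "body": [data_vocab.encode(f"b:{k}={v}")
                 for k, v in sorted(call["body"].items())],
    }


api_vocab, data_vocab = Vocabulary(), Vocabulary()
call = {"timestamp": 12.0, "api": "/search",
        "headers": {"accept": "json"}, "body": {"q": "shoes"}}
print(encode_call(call, prev_timestamp=10.0,
                  api_vocab=api_vocab, data_vocab=data_vocab))
```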
15. Town Crier
Detecting anomalies in sequences of API calls: addressing the human vs bot challenge
Pipeline (diagram):
• Users are clustered into groups based on similarity in API access sequences
• API call sequences from heavily populated clusters are chosen
• Feature representation
• Sequential model: GRU or LSTM network
• Sequence comparison: predicted 'N' steps of API calls vs. actual 'n' steps of API calls
Observations:
• There is little innovation in training a model
• The discriminating features lie in temporal data
• Feature representation is the most crucial step
• Identifying API call sequences of potentially human users is another crucial step
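The step of picking training sequences from heavily populated clusters could look like the sketch below. The slides do not say which similarity measure or clustering algorithm is used; Jaccard similarity over consecutive API pairs with a greedy single pass is an assumption made here, as are `min_similarity` and `min_cluster_size`.

```python
def bigrams(sequence):
    """Set of consecutive API pairs in a call sequence."""
    return set(zip(sequence, sequence[1:]))


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_users(sequences_by_user, min_similarity=0.6):
    """Greedy pass: a user joins the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_bigrams, member_user_ids)
    for user, sequence in sequences_by_user.items():
        grams = bigrams(sequence)
        for seed, members in clusters:
            if jaccard(grams, seed) >= min_similarity:
                members.append(user)
                break
        else:
            clusters.append((grams, [user]))
    return [members for _, members in clusters]


def training_sequences(sequences_by_user, min_cluster_size=3, min_similarity=0.6):
    """Keep only sequences from clusters with at least min_cluster_size users."""
    chosen = []
    for members in cluster_users(sequences_by_user, min_similarity):
        if len(members) >= min_cluster_size:
            chosen.extend(sequences_by_user[u] for u in members)
    return chosen


common = ["/login", "/search", "/item", "/logout"]
users = {"u1": common, "u2": common, "u3": common,
         "odd": ["/login", "/admin", "/export", "/export", "/export"]}
print(len(training_sequences(users)))
```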
16. Town Crier
Issues to address when applying LSTM models for API call sequences
• User sessions can be of varying lengths
• Some sessions are as short as 3 calls, while others are long-running
• The common practice of padding sequences up to the window length does not work
• Padding a session of length 3 with 17 fillers to match a window size of 20 adds more noise than useful information
• User sessions need to be grouped into clusters based on sequence length
• One can build as many models as there are such groups
• Ideally, the clusters should be such that less than 20% of the sequence length needs a filler to match the window size
• For a window size of, say, 20, sessions longer than 16 should form a cluster
• Window size must be adjusted to the session lengths in a cluster
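The 20% filler rule can be made concrete with a small bucketing routine. This is a sketch under assumptions: the candidate window sizes, the pad value 0, and the handling of sessions that fit no window (returned separately here) are not specified on the slide.

```python
def bucket_sessions(sessions, window_sizes=(5, 10, 20),
                    max_pad_fraction=0.2, pad_value=0):
    """Assign each session to the smallest window needing less than 20% filler.

    Returns (buckets, unassigned): buckets maps a window size to the padded
    sessions for that window's model; unassigned holds sessions that are too
    long or would need too much filler for every window.
    """
    buckets = {window: [] for window in window_sizes}
    unassigned = []
    for session in sessions:
        for window in sorted(window_sizes):
            pad = window - len(session)
            if 0 <= pad < max_pad_fraction * window:
                buckets[window].append(list(session) + [pad_value] * pad)
                break
        else:
            unassigned.append(session)
    return buckets, unassigned


sessions = [[1] * 3, [1] * 17, [1] * 16, [1] * 20, [1] * 9]
buckets, unassigned = bucket_sessions(sessions)
print({w: len(s) for w, s in buckets.items()}, len(unassigned))
```

With a window of 20 the rule admits sessions of length 17 to 20, which matches the slide's "sessions longer than 16"; a session of length 3 fits no window and would need a smaller one.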
17. Town Crier
Performance of an LSTM-based anomaly detector
[Figure: expected API sequence and predicted API sequence, shown across different user sessions]
1. APIs in a session are represented as numbers (integers) generated by a proprietary embedding scheme
2. The predicted sequence of API calls is a sequence of floating point numbers
3. The expected API call sequence of integers must vary very little from the predicted API call sequence of floating point numbers
4. Prediction error for each session is measured as the mean absolute difference between the actual API (integer) and the predicted API (float)
           Sessions   Percentage of data used   Prediction error
Train      6818       80%                       0.044
Test       1708       20%                       0.0724
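The per-session prediction error defined above, the mean absolute difference between the actual integer API codes and the predicted floats, can be computed as in this sketch. The example values and the anomaly `threshold` are made up for illustration and are not results from the presentation.

```python
def session_prediction_error(actual, predicted):
    """Mean absolute difference between actual API codes and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("session must contain at least one API call")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def is_anomalous(actual, predicted, threshold=0.5):
    """Assumed rule: flag a session whose error exceeds a chosen threshold."""
    return session_prediction_error(actual, predicted) > threshold


# Hypothetical session ending in API 26 (/logout), with model outputs as floats.
actual = [3, 7, 26]
predicted = [3.1, 6.9, 26.0]
print(round(session_prediction_error(actual, predicted), 4))
print(is_anomalous(actual, predicted))
```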
• The API with number 26 is a /logout API. Most sessions end with a call to this API
• Humans usually access APIs using an application. As a result, they exhibit deterministic API sequences even if the input data varies per user. Hence we see many sequences repeating.
• In the present data set, the API input data is also deterministic in many samples. These data samples were generated by benign bots that perform periodic, repeated actions.
• The very low prediction error could be attributed to the above reasons.
• Malicious bots exhibit an API sequence similar to that of benign bots and humans
• However, the predicted API sequence of floating point numbers will have a significant error compared to the input API sequence of integers