Town Crier
Recognising Behavioural Patterns of Web API Bots Using Machine Learning Techniques:
Server-side strategies without client-side instrumentation
Guntur Ravindra

Machine Learning Practitioner
Bot traffic is a significant share of API traffic
Share of API traffic by source (pie chart):
• Human: 39%
• Benign non-human: 31%
• Malicious non-human: 20%
• Attack automation: 10%
• About 60% of web API traffic is generated by automatons
• About half of the automated traffic is malicious
Bot classification based on high-level behaviour
• Bot classification in terms of their agenda

• Benign bots

• Malicious bots

• Threat automation bots or attack automation bots

• Bot classification based on level of obfuscation

• Human emulators

• Eg: Malicious bots, attack automation bots

• Service bots

• Eg: Benign bots

• Bot classification based on goal oriented functionality

• E.g. the OWASP Automated Threats (OAT) classification of 21 automated threats: https://owasp.org/www-pdf-archive/Automated-threat-handbook.pdf
Chances of detection success
• Approximately 60% of the traffic is bot traffic

• Out of this 60% traffic only 1/2 is due to benign bots and these bots are unlikely to exhibit human emulation

• Benign bots have nothing to hide from detectors

• They exhibit statistical properties that are easy to recognise

• As a result, only about 30% of all traffic can be reliably identified as bot traffic

• Of the remaining 70% of traffic, attack automation bots account for less than 10% of all traffic

• Attack automation bots can be detected at the reconnaissance phase

• They are likely to fuzz variables, inject malicious patterns, make erroneous API calls

• These actions generate HTTP response errors and error messages in response body 

• The remaining 60% of traffic is hard for bot detection algorithms to separate

• Detection algorithms are likely to confuse malicious bots with humans, so these bots fly under the radar
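The arithmetic behind these percentages can be checked with a short sketch. It uses only the rounded shares quoted on this slide; the variable names are illustrative.

```python
# Rounded traffic shares quoted on the slide, as fractions of all API traffic.
bot_share = 0.60                 # traffic generated by bots
benign_share = bot_share / 2     # half of the bot traffic is benign
attack_automation_share = 0.10   # upper bound quoted on the slide

# Benign bots are the only class that is statistically easy to recognise.
reliably_detectable = benign_share
# Attack automation bots can be caught during reconnaissance.
detectable_at_recon = attack_automation_share
# What is left mixes humans with human-emulating malicious bots.
ambiguous = 1.0 - reliably_detectable - detectable_at_recon

print(f"reliably detectable: {reliably_detectable:.0%}")
print(f"detectable at reconnaissance: up to {detectable_at_recon:.0%}")
print(f"ambiguous: {ambiguous:.0%}")
```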
Approaches to bot detection
Detection techniques:
• Fingerprinting: user agent, OS, device
• Progressive challenge: cookies, JS, CAPTCHA
• Behavioural (client side or server side): mouse movements, access speed, request frequency, order and sequence of requests, bounce rate, content
Behavioural patterns: Server-side machine learning techniques without client-side instrumentation
Patterns exhibited by benign Bots
• Service bots that are benign and perform automated tasks on behalf of humans
• Perform simple tasks and are goal oriented
    • Hashing, MinHash, embedding, Markov models
• Exhibit repetitive characteristics
    • Skewed probability distribution on API call sequences
• Exhibit periodicity
    • Time-series seasonality
• Perform rapid access
    • Skewed probability distribution on inter-API call intervals
• Submit deterministic input data
    • A significant part of the request body and request headers repeats in every observation
    • Data specification errors are very rare
    • Hashing, MinHash, fingerprinting on message body
• They have nothing to hide from bot detection security products
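Two of the signals above, periodicity of inter-API call intervals and repeated request bodies, can be sketched as follows. This is a minimal illustration, not the method used in the talk: it uses exact SHA-256 fingerprints instead of MinHash, assumes each session has at least two calls, and the thresholds `cv_max` and `repeat_min` are made-up values.

```python
import hashlib
import statistics

def body_fingerprint(body):
    """Hash a request body so that identical payloads map to the same key."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]

def interval_cv(timestamps):
    """Coefficient of variation of inter-call intervals (0 means perfectly periodic)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean if mean > 0 else 0.0

def repeated_body_ratio(bodies):
    """Fraction of requests whose fingerprint equals the most common fingerprint."""
    prints = [body_fingerprint(b) for b in bodies]
    most_common = max(set(prints), key=prints.count)
    return prints.count(most_common) / len(prints)

def looks_like_service_bot(timestamps, bodies, cv_max=0.1, repeat_min=0.8):
    # cv_max and repeat_min are illustrative thresholds, not values from the talk.
    return interval_cv(timestamps) <= cv_max and repeated_body_ratio(bodies) >= repeat_min

# A job that fires every 60 seconds with the same payload.
bot_seen = looks_like_service_bot([0, 60, 120, 180, 240], ['{"job": "sync"}'] * 5)
# A person clicking around with irregular timing and varying input.
human_seen = looks_like_service_bot(
    [0, 4, 31, 33, 95],
    ['{"q": "shoes"}', '{"q": "boots"}', '{"id": 7}', '{"id": 9}', '{"qty": 1}'],
)
print(bot_seen, human_seen)
```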
Patterns exhibited by malicious bots
• Everything that benign bots are not

• Obfuscate activity to mimic human behaviour

• No rapid access

• Designed to avoid rate-limiting logic, statistical inference, and probability-distribution-based detectors

• Periodicity is distributed across multiple APIs, multiple user IDs, and multiple sites, making it difficult to detect at a single site

• Perform complex tasks uncharacteristic of normal human behaviour
Patterns exhibited by attack automation bots
• All patterns exhibited by malicious bots

• Display similar characteristics in similar business domains and have targeted functionality

• Discovery of these bots requires data collation from across multiple sites with similar business domains

• The reconnaissance phase of the kill chain is marked by

• API spec violations

• API response errors in response body

• Protocol error messages

• Understand statistical properties of these types of violations and build an anomaly detector

• Business logic abuse

• API access attempts will be targeted at business logic abuse

• Build dominant user behaviours across user cohorts and find anomalies in comparison to these behaviours

• Machine learning models need to combine request-response data with the API call sequence rather than use either in isolation
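One simple way to turn the statistics of spec violations and response errors into an anomaly detector is a robust outlier score on each user's error rate. The sketch below is an assumption-laden illustration: it treats any HTTP status of 400 or above as an error, uses a modified z-score based on the median absolute deviation, and the values of `z_threshold` and `min_scale` are invented for the example.

```python
import statistics

def error_rate(statuses):
    """Share of responses with an HTTP status code of 400 or above."""
    return sum(1 for s in statuses if s >= 400) / len(statuses)

def flag_error_outliers(statuses_by_user, z_threshold=3.0, min_scale=0.05):
    """Flag users whose error rate is far above the population median.

    Uses a modified z-score based on the median absolute deviation (MAD).
    z_threshold and min_scale are illustrative values, not from the slides.
    """
    rates = {user: error_rate(s) for user, s in statuses_by_user.items()}
    median = statistics.median(rates.values())
    mad = statistics.median(abs(r - median) for r in rates.values())
    scale = max(mad, min_scale)  # avoid dividing by zero when most users never err
    return sorted(
        user for user, r in rates.items()
        if 0.6745 * (r - median) / scale > z_threshold
    )

traffic = {
    "u1": [200] * 10,
    "u2": [200] * 9 + [404],   # an occasional mistake
    "u3": [200] * 10,
    "u4": [400, 404, 422, 500, 200, 400, 404, 400, 200, 400],  # probing
    "u5": [200] * 10,
}
flagged = flag_error_outliers(traffic)
print(flagged)
```

A production detector would also look at error messages in the response body and at which API parameters were violated; this sketch only counts status codes.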
Bot detection strategy
Decision flow (flowchart):
1. Does a sequence of API calls, or a set of sequences of API calls, exhibit periodicity, rapid access, simplicity, deterministic input data, skewed statistical distributions? YES: Is Bot, go to step 2. NO: Could be Human or Human-like Bot, go to step 3.
2. Is Bot: does the input data have anomalies, does the input data trigger malicious signatures, does the output data have error messages? YES: Attack automation Bot. NO: Benign/Malicious Bot.
3. Could be Human or Human-like Bot: does the input data have anomalies, does the input data trigger malicious signatures, does the output data have error messages? YES: Could be Attack automation Bot. NO: Could be Human or malicious bot, go to step 4.
4. Do the user's API call sequences and data patterns fall into clusters where there are many more users? YES: Is Human. NO: Could be malicious Bot.
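The decision flow on this slide can be sketched as a small function. The three boolean inputs are assumed to come from upstream detectors (statistical checks, signature and error checks, and user clustering); the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SessionSignals:
    # Periodicity, rapid access, simplicity, deterministic input, skewed distributions.
    bot_like_statistics: bool
    # Input anomalies, malicious signatures, or error messages in the output.
    data_anomalies: bool
    # Behaviour falls into a cluster shared by many other users.
    in_populous_cluster: bool

def classify(signals):
    """Walk the decision flow and return a label for the session."""
    if signals.bot_like_statistics:
        if signals.data_anomalies:
            return "attack automation bot"
        return "benign or malicious bot"
    if signals.data_anomalies:
        return "could be attack automation bot"
    if signals.in_populous_cluster:
        return "human"
    return "could be malicious bot"

print(classify(SessionSignals(True, False, False)))
print(classify(SessionSignals(False, False, True)))
```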
Most common Bot action
• Iterate through a list of items in the hope of getting a hit
• Defense: after a predetermined number of failures, terminate the connection, blacklist the user, or challenge with a CAPTCHA
• Obfuscation attack: iterate through the list using multiple user_ids (logins) on the same site, with very few attempts per user_id
• Condition: the attacker can easily create multiple user_ids on the site via automated means, without any extensive verification
• Condition: the attacker does not need to create explicit user accounts
• Bot types: OAT-001 (Carding), OAT-002 (Token Cracking)
• Defense: an algorithm that can group multiple user_ids into the same bot actor. If the number of failures attributed to this threat actor is above a threshold, then all the user_ids under that threat actor are part of a bot attempting an iteration attack
• Challenge: can we really group multiple user_ids into the same bot actor? -> Clustering on API access behaviour
• Obfuscation attack: iterate through the list using multiple user_ids (logins) on different sites supporting the same input type
• Condition: the attacker can easily create multiple user_ids on multiple sites via automated means, without any extensive verification
• Condition: the attacker does not need to create explicit user accounts, but there are multiple sites that can accommodate the attacker
• Bot types: OAT-001 (Carding)
• Defense: bots using this strategy are difficult to detect with server-side detection
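The same-site defense, grouping user_ids into one bot actor and counting failures per actor, can be sketched as below. As a deliberately crude stand-in for clustering on API access behaviour, it treats user_ids with identical API call sequences as one actor; the thresholds and function names are hypothetical.

```python
from collections import defaultdict

def group_into_actors(sessions):
    """Group user_ids whose API call sequences are identical.

    sessions maps user_id -> list of API names. Exact matching is a
    simplification of clustering on API access behaviour.
    """
    actors = defaultdict(list)
    for user_id, calls in sessions.items():
        actors[tuple(calls)].append(user_id)
    return list(actors.values())

def iteration_attack_users(sessions, failures, max_failures_per_actor=5, min_users=2):
    """Return user_ids belonging to actors whose combined failures exceed the threshold."""
    flagged = set()
    for members in group_into_actors(sessions):
        total = sum(failures.get(user_id, 0) for user_id in members)
        if len(members) >= min_users and total > max_failures_per_actor:
            flagged.update(members)
    return flagged

sessions = {
    "u1": ["/login", "/cart", "/pay"],
    "u2": ["/login", "/cart", "/pay"],
    "u3": ["/login", "/cart", "/pay"],
    "u4": ["/login", "/cart", "/pay"],
    "u5": ["/login", "/search", "/item", "/cart", "/pay"],
}
# Each scripted login fails only twice, which a per-user limit would not notice.
failures = {"u1": 2, "u2": 2, "u3": 2, "u4": 2, "u5": 1}
attackers = iteration_attack_users(sessions, failures)
print(sorted(attackers))
```

Counting failures per actor rather than per user_id is what defeats the "very few attempts per user_id" obfuscation, provided the grouping step is reliable.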
LSTM-based models to differentiate between humans and malicious bots
Data characteristics
Sequence of API calls observed at the server side in a single session
✴ The inter-API call interval is a random variable
✴ The set of unique APIs called is probably deterministic and depends on application functionality
✴ The order in which APIs are called is a random variable. It could be deterministic for a large population of users, but for others it can vary due to different navigation behaviours
✴ The number of times an API is called in a session is a random variable, owing to activity such as hitting the browser back button, navigating back to the starting point/page, and jumping to another point in an app/page
✴ The input data to each API and the response data are random variables and are user dependent to some extent
Session diagram annotations:
Call to an earlier API after a time delay
Simultaneous calls to 2 APIs
Successive and rapid calls to the same API
Calls to APIs not seen earlier in the session
Feature representation
At each API call in the session of the user:
• API identity: name of the API; encoded
• Time feature: interval between the call and the previous API call; quantized into 4 levels and encoded
• API data: values of request header keys; encoded
• API data: values of request body keys; encoded
• Session summary: session-level count-based statistics; quantized histogram of bins and encoded
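A per-call feature vector along these lines might be built as in the sketch below. The slides do not say which encodings were used, so everything specific here is an assumption: one-hot encoding, the interval bin edges in `INTERVAL_EDGES`, and hashing header and body values into 8 buckets. The session-summary feature is left out for brevity.

```python
import hashlib

# Hypothetical bin edges in seconds; three edges give the 4 levels named on the slide.
INTERVAL_EDGES = (0.5, 5.0, 60.0)
DATA_BUCKETS = 8  # hypothetical number of hash buckets for header and body values

def quantize_interval(seconds):
    """Map an inter-call interval to one of 4 levels (0 is the fastest)."""
    return sum(seconds > edge for edge in INTERVAL_EDGES)

def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def hash_bucket(value, buckets=DATA_BUCKETS):
    """Stable hash of a header or body value into a small number of buckets."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return digest[0] % buckets

def encode_call(api_name, interval, header_value, body_value, api_vocab):
    """Concatenate the encoded features for one API call."""
    return (
        one_hot(api_vocab[api_name], len(api_vocab))         # API identity
        + one_hot(quantize_interval(interval), 4)            # time feature
        + one_hot(hash_bucket(header_value), DATA_BUCKETS)   # API data: headers
        + one_hot(hash_bucket(body_value), DATA_BUCKETS)     # API data: body
    )

api_vocab = {"/login": 0, "/search": 1, "/logout": 2}
vector = encode_call("/search", 2.0, "Mozilla/5.0", '{"q": "shoes"}', api_vocab)
print(len(vector), vector[:7])
```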
Detecting anomalies in sequences of API calls: addressing the human vs bot challenge
Pipeline (diagram):
1. Users are clustered into groups based on similarity in API access sequences
2. API call sequences from heavily populated clusters are chosen
3. Feature representation
4. Sequential model (GRU or LSTM network)
5. Sequence comparison between the predicted N steps of API calls and the actual N steps of API calls
• There is little innovation in training a model
• The discriminating features lie in temporal data
• Feature representation is the most crucial step
• Identifying API call sequences of potentially human users is another crucial step
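The first two pipeline steps, clustering users by sequence similarity and keeping only heavily populated clusters as likely human training data, could look roughly like this. The slides do not name a clustering algorithm; the greedy scheme, the use of `difflib.SequenceMatcher` as the similarity measure, and both thresholds are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity of two API call sequences in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def cluster_users(sequences, threshold=0.8):
    """Greedy clustering: a user joins the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative sequence, member user_ids)
    for user_id, seq in sequences.items():
        for representative, members in clusters:
            if similarity(representative, seq) >= threshold:
                members.append(user_id)
                break
        else:
            clusters.append((seq, [user_id]))
    return clusters

def training_users(sequences, min_cluster_size=3, threshold=0.8):
    """Keep only users from heavily populated clusters."""
    chosen = []
    for _, members in cluster_users(sequences, threshold):
        if len(members) >= min_cluster_size:
            chosen.extend(members)
    return chosen

sequences = {
    "a": ["/login", "/home", "/search", "/item", "/logout"],
    "b": ["/login", "/home", "/search", "/item", "/logout"],
    "c": ["/login", "/home", "/search", "/item", "/cart", "/logout"],
    "d": ["/login", "/export", "/export", "/export", "/export"],
}
selected = training_users(sequences)
print(selected)
```

The sequences of the selected users would then be encoded and fed to the GRU or LSTM network, which is not shown here.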
Issues to address when applying LSTM models to API call sequences
• User sessions can be of varying lengths
• Some sessions are as short as 3 calls, while others are long-running
• The common practice of padding sequences up to the window length does not work
• Padding a session of length 3 with 17 fillers in order to match a window size of 20 adds more noise than useful information
• User sessions need to be grouped into clusters based on sequence length
• One can build as many models as there are such groups
• Ideally the clusters should be such that less than 20% of the sequence length needs a filler in order to match the window size
• For a window size of, say, 20, sessions longer than 16 should form a cluster
• Window size must be adjusted to the session lengths in each cluster
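The grouping rule can be sketched with a greedy pass over session lengths. This is one possible reading of the slide: each group's window is its longest session, and a session joins a group only if the padding it would need stays strictly below 20% of that window. The function name and the greedy strategy are assumptions.

```python
def group_by_length(lengths, max_pad_fraction=0.2):
    """Greedily group session lengths so padding stays below max_pad_fraction of the window.

    Returns a dict mapping window size -> list of session lengths in that group.
    """
    groups = {}
    window = None
    for n in sorted(lengths, reverse=True):
        # Open a new group when this session would need too many fillers.
        if window is None or (window - n) >= max_pad_fraction * window:
            window = n  # the window is the longest session in the group
        groups.setdefault(window, []).append(n)
    return groups

groups = group_by_length([3, 20, 17, 14, 16, 18, 3])
for window, members in groups.items():
    print(window, members)
```

With a window of 20, lengths 17 and 18 join the group (3 and 2 fillers) while length 16 would need 4 fillers, exactly 20%, and so starts its own group. One model would then be trained per group.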
Performance of an LSTM-based anomaly detector
Figure: expected API sequence vs predicted API sequence, across different user sessions
1. APIs in a session are represented as numbers (integers) generated by a proprietary embedding scheme
2. The predicted sequence of API calls is a sequence of floating-point numbers
3. The expected API call sequence of integers must vary very little from the predicted API call sequence of floating-point numbers
4. Prediction error for each session is measured as the mean absolute difference between the actual API (integer) and the predicted API (float)
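The error measure in point 4 is a plain mean absolute difference. The sketch below computes it for one made-up session; the API numbers and predicted values are invented for illustration and are not taken from the data set.

```python
def prediction_error(actual, predicted):
    """Mean absolute difference between actual API ids (int) and predicted values (float)."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical session ending in API number 26.
actual = [3, 7, 12, 26]
predicted = [3.02, 6.95, 12.10, 25.90]
error = prediction_error(actual, predicted)
print(round(error, 4))
```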
Split | Sessions | Percentage of data used | Prediction error
Train | 6818 | 80% | 0.044
Test | 1708 | 20% | 0.0724
• The API with number 26 is a /logout API. Most sessions end with a call to this API
• Humans usually access APIs using an application. As a result they exhibit deterministic API sequences even if the input data varies per user. Hence we see many sequences repeating
• In the present data set, the API input data is also deterministic in many samples. These data samples were generated by benign bots that perform periodic, repeated actions
• The very low prediction error could be attributed to the above reasons
• Malicious bots exhibit an API sequence similar to that of benign bots and humans
• However, for such bots the predicted API sequence of floating-point numbers will have a significant error compared to the input API sequence of integers
- stay tuned
Work in progress (20th March 2020)

