A Static Approach to Harmful JavaScript Detection
Using Web Crawling
Chandan Sharma, Harsha Manivannan, Joel Wilhite*,
Dibakar Barua, and Garrett Mallory
Georgia Institute of Technology, School of Computer Science
{chandan.sharma24, har23k, jwilhite3, dibakar.barua92, gmallory3}@gatech.edu
Abstract
JavaScript is a lightweight programming language that
is an essential element of the interactive web
pages of contemporary web browsing. It is used
to interact with the web user and to customize
content for a variety of browser types and
patterns of user interaction.
As a richly featured programming language,
JavaScript has been a ripe vehicle for malicious
actions against website users. It has been used by
malicious site operators and third party actors to
launch a variety of malicious actions against a
user's browser and general system. Modern
browsers have made great strides in fixing
security vulnerabilities and including filtering
technologies to limit the ability of malicious web
code to actually execute in the browser. Related
technologies like Google’s search engine include
an extensive blacklist of sites that are known to
host malicious content.
However, browsers will always lag behind the
cycle of vulnerability discovery, patch creation,
and patch application, and modern malicious threats
are designed to infiltrate and propagate within the
time it takes for a malicious web site to be
added to malicious site blacklists.
Much of the malicious JavaScript code is
injected into otherwise benign web servers
through web server vulnerabilities and the
malicious JavaScript is distributed to site visitors
for a significant amount of time before the
intrusion is detected and removed. For static
malicious scripts, standard hash-based signature
checking could be used effectively to check
JavaScript for known malicious behaviors.
However, most malicious scripts include some
portion of dynamic content that defeats any
hash-based signature checking system. In an
effort to combat malicious
scripts that may be injected into web servers, we
propose a JavaScript scanning and classifying
system. This system would scan a website, pull
out all instances of JavaScript, and assign a
maliciousness score to each script.
Introduction
Early websites consisted of mostly static content
and relied on the user interacting with a site to
initiate additional content requests by clicking on
links. As JavaScript and other browser based
scripting languages were introduced, sites started
including more interactive content and offloading
some of the site processing load to the browser,
reducing the amount of server side processing
required. JavaScript has become an integral part
of the modern web landscape; its inclusion allows
large sophisticated web sites to offer a wealth of
user interaction capability. It is also used to speed
up page load times and reduce the total network
bandwidth required by a site. A page load
initiated by the user downloads and renders an
initial page; as the user interacts with the page,
JavaScript can be used to load additional content
customized to that user, without waiting for a
monolithic page to load all data whether or not
it would ever be needed.
As with seemingly every new advancement in
technology, nefarious elements began to use
JavaScript for malicious activities almost as soon
as it was introduced. Every advancement in
browser and JavaScript security is challenged by
new exploit techniques developed by the Blackhat
community. Protecting users from malicious
content is a never ending pursuit for browser
vendors and autonomous system operators. Many
of the more traditional methods of protecting
computing resources from malicious code have
also been applied to protecting browsers. Some of
these methods include signature-based
JavaScript checking [1], blacklisting domains and
IPs from which known malware has been
distributed [12], and disabling JavaScript
altogether. Blacklists of known malicious domains
and IPs are of limited value because purveyors of
malicious JavaScript are constantly finding ways
to inject their code into otherwise reputable sites
that are not on any blacklist, and the injected code
is typically removed once the infection is found
and reported, so blacklists never catch up.
Signature-based checking of JavaScript is also of
limited use considering that most signatures are
based on hashes of known malicious content.
Most malicious JavaScripts contain small
variations such that old signatures become
obsolete almost immediately.
Considering the limitations of traditional
approaches to recognizing malicious JavaScript,
we propose an additional method that does not
depend on blacklists or signatures to determine
whether a JavaScript is malicious. In this
paper we present a lightweight JavaScript
classifier that looks at what a script does and
what it looks like to determine whether it is malicious.
Similar to how a code review analyzer can check
C code for improper string copy, unbounded
arrays, and other coding practices that are known
to be vulnerable, our JavaScript classifier
analyzes what a script does and assigns a
maliciousness score based on what features are
similar to known malicious JavaScripts. We will
also examine related work in this space, give a
detailed system overview, and speak extensively
on how we analyzed our data. We will also cover
limitations of these approaches and conclusions
drawn by this work. It is important to note this
classifier was built as a proof of concept and is not
intended to be a production quality product. The
goal of this project is to investigate if this line of
research holds potential for further research
investment by reaffirming existing literature.
Future Use
A limitation of a JavaScript classifier is that it
is not practical as an inline or transparent filtering
system due to the widespread use of SSL on
modern web sites. A preemptive approach could
be a JavaScript classifier run as a browser plugin.
The plugin combined with a sandbox space could
be used to render, classify, and block malicious
JavaScripts from executing outside the sandbox
space and interacting with the user. A fast
reactive approach could be an analyzer engine
utilized by an autonomous system operator which
is capable of dedicating the resources required to
scan a majority of sites based on DNS resolution
requests and populate a realtime blacklist based
on the maliciousness score. This could reduce but
not completely prevent the spread of JavaScript
based infections.
Related Work
A substantial body of research has been published
on using automated classifiers to identify
malicious content. These papers recognize the
shortcomings of contemporary blacklists,
signature detection, and other reactive methods
and instead propose alternative methods to
actively scan for maliciousness. These methods
are resilient to code obfuscation and are able to
pick up never before seen malicious JavaScripts
or drive-by downloads. Some popular
classification techniques use supervised machine
learning methods such as K-Nearest Neighbors,
Support Vector Machines, or Naive Bayes and
unsupervised methods such as K-Means or
Affinity propagation [2].
Training data often consists of URLs, screenshots
of webpages, textual content, structural tags, page
links (type or count), visual appearance, HTML
contents, JavaScript content, advertisement
delivery infrastructure or web page change
characteristics [4]. Of these, JavaScript was the
most popular subject for analysis, with a multitude
of papers aimed at statically or dynamically
classifying scripts. Because of
JavaScript’s market presence, we chose to focus
on extracting features from scripts scraped from
tens of thousands of sites from the Alexa top 1M
sites. We also chose to pursue static analysis to
support the development of lightweight, end-user
friendly products.
System Overview
The resources allocated to this project were a
single virtual machine running Debian 7.9 with six
2.1 GHz processing cores, 17 GB of RAM, and 1 TB
of storage. With this virtual machine we scraped a
subset of sites from the Alexa top 1M list,
rendered each page, and extracted all scripts
present on the page. Each script was stored as an
individual file, using its SHA-512 hash as the
filename, inside a directory named after the domain.
This allowed us to store each script in a deduplicated
fashion. Metadata about each script was stored in
a MongoDB database each time it was
encountered during a scrape, whether or not the
script had previously been stored on disk. Python
2.7 was used as the programming language for
this project because of its ease of use and the
large number of prebuilt modules available.
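As a rough sketch of this storage scheme, the snippet below (written in current Python for readability, whereas the project itself used Python 2.7) deduplicates a script by its SHA-512 hash and records a metadata document; the directory layout, database name, and field names are illustrative assumptions rather than the project's exact code.

```python
import hashlib
import os

from pymongo import MongoClient  # MongoDB driver used to record script metadata


def store_script(script_text, domain, collection, base_dir="scripts"):
    """Store one script deduplicated by SHA-512 and record a metadata document.

    `collection` is a pymongo collection; the directory layout and field names
    are illustrative, not taken verbatim from the project code.
    """
    digest = hashlib.sha512(script_text.encode("utf-8")).hexdigest()
    domain_dir = os.path.join(base_dir, domain)
    os.makedirs(domain_dir, exist_ok=True)

    path = os.path.join(domain_dir, digest + ".js")
    if not os.path.exists(path):  # write the body only once per unique script
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(script_text)

    # Metadata is recorded on every encounter, even when the body is already on disk.
    collection.insert_one({"domain": domain, "sha512": digest, "length": len(script_text)})
    return path


# Example wiring (assumed local MongoDB instance and database/collection names):
# scripts = MongoClient("mongodb://localhost:27017").jscrawl.scripts
# store_script("alert(1);", "example.com", scripts)
```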
Our JavaScript collection engine is built with a
collection of Python modules glued together with
custom Python scripts. The engine consists of the
module Scrapy crawling a site, rendering the site
in Splash, then passing the rendered site to
BeautifulSoup for JavaScript and feature
extraction. Each script was stored as a distinct file
for later analysis. The virtual machine was able to
run five Scrapy instances and one MongoDB
instance concurrently.
The total list of sites to scrape was divided into
five chunks, with each chunk allocated to a
Scrapy instance which in turn scraped its
allocated sites in manageable 200-site batches.
We discovered that the Scrapy/Splash
infrastructure started to face problems when it
tried to scrape 300 or more sites contiguously. To
alleviate this problem, after each Scrapy instance
scraped 200 sites it was shut down and restarted
with the next 200 sites.
For rendering sites we used the Python module
Splash, a very lightweight, headless HTML
browser, run as a Docker container to help keep
potentially malicious code from escaping the
rendering environment. Docker is a lightweight
virtualization environment that is available on the
three most popular operating systems: Linux,
Windows, and OS X. Docker is similar to well
known hypervisors in that it provides isolation
between simultaneously running applications, yet
it virtualizes only the application and its related
libraries and not the entire OS stack. Because we
are actively trying to find and execute malicious
code that could harm our browser, we do all site
rendering in Docker containers that are isolated
from the OS running on our virtual machine.
After some initial tests using manually created
Splash Docker containers with pre-defined
interface ports for each instance, we switched to
Aquarium, a packaged multi-container Splash
setup that load balances requests between
individual Splash containers. Aquarium uses
docker-compose to build and run multiple Splash
containers and a load-balancing HAProxy
container to distribute render requests to the
Splash containers. Aquarium was periodically
stopped and reloaded to reinitialize all the Splash
rendering containers. This solved the problem of
Splash's memory usage growing until its limit was
reached, and it also cleared out any malicious
scripts that had managed to infect a Splash container.
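A minimal sketch of a render request against the load-balanced Splash endpoint is shown below; the hostname, port, and wait time are assumptions for illustration (render.html is Splash's standard HTTP rendering endpoint), and error handling is reduced to the bare minimum.

```python
import requests

# Assumed Aquarium/HAProxy front end for the Splash containers.
SPLASH_ENDPOINT = "http://localhost:8050/render.html"


def render_page(url, timeout=60):
    """Ask Splash to fully render a page and return the resulting HTML."""
    resp = requests.get(
        SPLASH_ENDPOINT,
        params={"url": url, "wait": 2.0},  # give page scripts a moment to run
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.text
```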
BeautifulSoup is a Python module that
automatically determines a page's encoding and
parses the page into an easily searchable DOM
tree. We use this tree to retrieve all JavaScript
instances from a site, which we then pass on to
feature extraction, maliciousness verification, and
the machine learning classifiers. Figure 1 shows
the flow of information for our scraping system;
the classification system is separate. A short
sketch of the script-extraction step follows.
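The sketch below shows roughly how the rendered HTML can be parsed for inline scripts with BeautifulSoup; it is a simplification of our extractor and ignores external script tags, which would require a separate fetch.

```python
from bs4 import BeautifulSoup


def extract_inline_scripts(rendered_html):
    """Return the body of every inline <script> tag in a rendered page.

    Simplified sketch: external scripts (src="...") are skipped here and would
    need a separate fetch in the real pipeline.
    """
    soup = BeautifulSoup(rendered_html, "html.parser")
    scripts = []
    for tag in soup.find_all("script"):
        if tag.string:  # inline script body; None for <script src=...></script>
            scripts.append(tag.string)
    return scripts
```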
Fig 1: System Architecture Diagram
Data Collection and Training
Building our classifier consisted of two distinct
phases: mass data collection of JavaScripts with a
high certainty classification as malicious or benign,
and scraping a large group of unclassified scripts.
For training our classifier, we verify a script as
either benign or malicious and then label it
accordingly. In order to obtain a set of benign
JavaScripts we scraped the top 100 sites of the
Alexa list and labelled the scripts obtained as
benign.
Well known blacklists were scraped to create a
database of scripts, and each of these scripts was
passed to virustotal.com through the website's
API to confirm its maliciousness. VirusTotal
analyses each script with various anti-virus (AV)
engines and returns the number of AV
engines that found the sample malicious. We
decided to only include scripts which were found
malicious by multiple AV engines.
This ensures that the scripts we labelled
malicious are truly malicious and not a false
prediction by a lone AV engine. We prune our
malicious scripts dataset to remove any
duplicates. The combined dataset is then passed
to the feature extraction module which extracts the
features and produces a CSV file which can be
used to train the classifier.
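The following sketch shows how a script hash might be checked against VirusTotal under the multiple-engine rule described above; the endpoint and response fields follow the v2 public API as we used it, the helper name and threshold are illustrative, and the 15-second pause is there only to respect the public rate limit.

```python
import time

import requests

VT_REPORT_URL = "https://www.virustotal.com/vtapi/v2/file/report"  # public API v2


def is_confirmed_malicious(file_hash, api_key, min_engines=2):
    """Check a script hash against VirusTotal and apply the multiple-engine rule."""
    resp = requests.get(VT_REPORT_URL, params={"apikey": api_key, "resource": file_hash})
    resp.raise_for_status()
    report = resp.json()

    time.sleep(15)  # crude rate limiting for the 4-requests-per-minute public API

    positives = report.get("positives", 0)  # number of AV engines flagging the sample
    return positives >= min_engines
```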
Feature Extraction
Most malicious JavaScripts are obfuscated and
packed in order to make their analysis difficult. We
used several metrics drawn from previous work
[5][10] to inform the static analysis of the
JavaScript code to form our feature vector. In
order to differentiate between malicious and
benign scripts, we decided to look at both the
structural as well as statistical features. We
analyze the structural features and a number of
features based on the analysis of the Abstract
Syntax Tree (AST) using the slimit parser. The
statistical measures computed include whitespace
percentage, string entropy, and average line length.
The complete list of 28 features we extracted is the following:
● number of eval() functions,
● number of setInterval() functions,
● number of JavaScript string link() functions,
● number of JavaScript string search() functions,
● number of exec() functions,
● number of escape() functions,
● number of unescape() functions,
● ratio of the number of JavaScript keywords to the number of normal words (from a predefined list),
● average entropy of all words,
● average entropy of the script,
● number of long strings in the program (with length > 40),
● maximum entropy of the program,
● average string length in the program,
● maximum string length in the program,
● number of functions or variables with long names (with length >= 40),
● number of direct string assignments,
● number of string-modifying functions,
● number of event attachment functions,
● number of DOM-modifying functions in the script,
● number of suspicious strings (lookup from a predefined list),
● whitespace ratio (number of whitespace characters to the actual number of characters in the script),
● number of strings containing only hexadecimal characters,
● maximum number of non-printable characters in strings,
● average line length in the script,
● number of times 'iframe' is present in a string in the script,
● number of tags with malicious names in the script (from a predefined list), and
● total length of the script.
Keywords-to-words ratio is useful to detect
malicious pages because in most exploits the
number of keywords like “var”, “for”, “while” and a
few others is limited while there are usually a large
number of other operations (such as
instantiations, arithmetical operations, function
calls). This usually does not happen in benign
scripts, where the occurrence of keywords is
usually higher. We also check for the presence of
shellcode in the scripts. We analyze the long
strings contained in the script to check if their
structure resembles shellcode. We use two
methods to check this. The first considers the
number of non-printable ASCII characters in the
string. The second detects shellcode composed
only of hexadecimal characters, i.e., it checks
whether the string is a consecutive block of
characters in the ranges a-f, A-F, 0-9. We also
keep a count of the number of direct string
assignments. Malicious scripts tend to have a
large number of string assignments in order to
deobfuscate or decrypt the script. Drive-by
download exploits usually call several of these
functions in order to instantiate vulnerable
components and/or create elements in the page
for the purpose of loading external scripts and
exploit pages.
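To make the statistical features concrete, the sketch below computes a handful of them (script entropy, keyword-to-word ratio, whitespace ratio, average line length, hex-only strings, and non-printable characters in strings); the tokenisation is a simplification of our real extractor and covers only a subset of the 28 features.

```python
import math
import re
import string

# Abbreviated stand-in for the predefined keyword list used by the real extractor.
JS_KEYWORDS = {"var", "for", "while", "if", "else", "function", "return", "new"}


def shannon_entropy(text):
    """Shannon entropy of a string in bits per character."""
    if not text:
        return 0.0
    total = float(len(text))
    return -sum(
        (text.count(c) / total) * math.log(text.count(c) / total, 2) for c in set(text)
    )


def extract_features(script):
    """Compute a subset of the statistical features described above.

    The tokenisation is deliberately simple; the real extractor worked from the
    slimit AST as well as the raw text.
    """
    words = re.findall(r"[A-Za-z_]\w*", script)
    literals = [a or b for a, b in re.findall(r'"([^"]*)"|\'([^\']*)\'', script)]
    lines = script.splitlines() or [""]

    return {
        "script_entropy": shannon_entropy(script),
        "keyword_to_word_ratio": (
            sum(w in JS_KEYWORDS for w in words) / float(len(words)) if words else 0.0
        ),
        "whitespace_ratio": sum(c.isspace() for c in script) / float(len(script) or 1),
        "avg_line_length": sum(len(l) for l in lines) / float(len(lines)),
        "num_hex_only_strings": sum(
            1 for s in literals if s and re.fullmatch(r"[0-9a-fA-F]+", s)
        ),
        "max_nonprintable_in_string": max(
            (sum(c not in string.printable for c in s) for s in literals), default=0
        ),
        "script_length": len(script),
    }
```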
Feature Selection
We also analyzed and ranked the most
important features using feature selection
techniques such as Scikit's ExtraTreesClassifier
(an ensemble of randomized decision trees). A
ranking analysis yielded the maximum scores for
the following five features:
● 1) No. of functions or variables with long
names.
● 2) No. of direct string assignments.
● 3) No. of string modifying functions.
● 4) No. of escape() functions.
● 5) Average entropy of the script.
These features are intuitive in that they cover a
variety of the ways in which we would manually
characterize a malicious script.
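A minimal sketch of this ranking step is shown below; the estimator settings are illustrative defaults rather than the exact configuration we used.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier


def rank_features(X, y, feature_names, n_top=5):
    """Rank features by importance using an ensemble of randomized trees.

    X is the (n_samples, 28) feature matrix and y the benign/malicious labels;
    n_estimators and the other settings are illustrative defaults.
    """
    forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return [(feature_names[i], float(forest.feature_importances_[i])) for i in order[:n_top]]
```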
Model Training and Analysis
Our analysis was performed using several open
source machine learning packages. Weka (based
in Java) and Scikit-Learn (based in Python)
were both evaluated as candidates before we
opted for Scikit-Learn due to its customizability
and ease of integration into our Python
framework. Scikit-Learn is a machine learning
Python module that was used to analyze the bulk
of our data. We also used the module to handle
preprocessing of our data by applying a data
normalization function called the Min-Max Scaler
to all datasets of JavaScript features.
This estimator scales each feature individually
such that the final result is within [0, 1]. Scaling
features individually was important because of the
radical differences in feature ranges and
distribution over the range. This preprocessing
algorithm also has the benefit of maintaining
sparsity in the dataset, a genuine concern as our
collection of JavaScripts did not enumerate a
large portion of the feature space.
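A small sketch of this scaling step, using toy matrices in place of our real train/test splits:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices standing in for the real train/test splits.
X_train = np.array([[0.0, 10.0], [5.0, 200.0], [1.0, 40.0]])
X_test = np.array([[2.0, 30.0]])

# Fit the scaler on the training features only, then apply the same transform
# to the test features so no information leaks between the splits.
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```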
Collecting scripts with a very high probability of
being malicious was more challenging. We
scraped sites that were listed as containing
malicious content by well known blacklists
squidblacklist.org, malwaredomainlist.com,
malwareblacklist.com, and phishtank.com. Scripts
obtained from these blacklisted sites were
checked against virustotal.com to get a
determination of maliciousness. Out of the 10,000
JavaScripts scraped from blacklisted sites, we
found 12 unique scripts confirmed to be
malicious and 8864 unique scripts found to be
benign.
Model Selection
When using machine learning for data analysis, it
is essential to select an analysis model that is
designed to handle the scope and type of the data inputs.
For our data set, we evaluated a variety of models
to get a better understanding of what model would
be most useful. The models tested were: Support
Vector Machines (SVM), Random Forest Classifier
(RFC), Adaboost, Decision Trees, and Multinomial
Naive Bayes (MNB). We first split the data into a
training and testing set using a 75-25 stratified
shuffle split. For each model, we performed a grid
search with 10-fold cross-validation on the training
set to search the parameter space for the values
which give the highest estimator performance.
Figure 2 lists the parameters examined in this
way; all other parameters were set to default
Scikit values.

SVM: C, gamma
RFC: number of estimators
Adaboost: number of estimators, learning rate
Decision Trees: max depth, min samples per leaf, max number of leaves
MNB: alpha
Fig 2: Parameter Tuning
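The sketch below illustrates the split-and-tune procedure for the SVM case on synthetic data; the candidate parameter values are placeholders rather than the exact grid we searched, and a recent scikit-learn (model_selection API) is assumed.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the 28-dimensional feature matrix and labels
# (1 = malicious, 0 = benign); the real data was far more imbalanced.
rng = np.random.RandomState(0)
X = rng.rand(200, 28)
y = np.array([0] * 180 + [1] * 20)

# 75/25 stratified split, preserving the class balance in both halves.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Grid search over the SVM parameters from Fig 2; the candidate values here
# are placeholders, not the exact grid we searched.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```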
After equipping our estimators with the optimal
parameters for our data, we evaluated the
performance of each model on the separate
testing set using receiver operating characteristic
(ROC) curve plotting. Below are the ROC curves
for the two most competitive models, SVM and
RFC. Because of the limitations on the size of the
testing set, we ran this performance evaluation
multiple times and took the average of the ROC
results. All of the models and their respective
areas under the curve are plotted in Figure 5. A
perfect classifier would have an AUC of 1.0, while
a random binary classifier has an AUC of 0.50 and
is represented by the dashed line.
Fig 3: SVM ROC
Fig 4: RFC ROC
Fig 5: All ROC Plot
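A minimal sketch of such an averaged evaluation is shown below; the bootstrap resampling is an illustrative stand-in for the repeated evaluation we actually ran, and the model is assumed to be already fitted and to expose predict_proba (e.g., a RandomForestClassifier).

```python
import numpy as np
from sklearn.metrics import roc_curve, auc


def averaged_auc(model, X_test, y_test, n_runs=10, seed=0):
    """Average the ROC AUC of a fitted model over resamples of a small test set.

    Because the test set contains very few malicious samples, a single ROC
    estimate is noisy; this repeats the evaluation on bootstrap resamples.
    """
    rng = np.random.RandomState(seed)
    aucs = []
    for _ in range(n_runs):
        idx = rng.choice(len(y_test), size=len(y_test), replace=True)
        scores = model.predict_proba(X_test[idx])[:, 1]  # probability of the malicious class
        fpr, tpr, _ = roc_curve(y_test[idx], scores)
        aucs.append(auc(fpr, tpr))
    return float(np.mean(aucs))
```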
Analysis Problems and Limitations
We find that the Random Forest Classifier gives
the best performance for detection of malicious
scripts with a 12/12 correct detection rate. When
we tested our 8864 benign scripts, SVM with RBF
kernel gives the best performance for detection of
benign scripts at 95.5% True Negative Rate. With
a False Negative Rate of 25% with SVM, our
classifier has difficulty when it encounters new
malicious scripts. Due to the small number of
malicious scripts in the test set, we infer that the
seemingly perfect 12/12 prediction from a few
models cannot be trusted; had it been meaningful,
the result should have been consistent across
models. The reason is that we could not provide
the classifier with a sufficient variety of bad scripts
in the training set, due to the limitations of our
ground truth generation and the crawling
infrastructure. With a True Negative Rate above
75% in 4/5 models, our classifier has a good
understanding of benign scripts, thanks to the
large and varied number of them in our training
set and the in-depth handpicking done during selection.
The most significant limitation we encountered
during our analysis was obtaining a substantial
number and diverse set of malicious JavaScripts.
Using virustotal.com as a source of authority for
malicious scripts presented a timing issue, as each
API request takes up to 15 seconds to return a
result and there is a maximum limit of 4 requests
per source IP per minute.
The small number of malicious samples produced
a very imbalanced data set, which caused
overfitting and very skewed results. We tried to
employ Tomek links undersampling of the benign
samples to address this problem. This reduced the
presence of benign scripts in our training set so
as to avoid overfitting to benign examples. The
alternative, over-sampling using SVM-SMOTE,
feeds the training data additional synthetic
examples placed near malicious examples in the
high-dimensional feature space in order to
generate realistic examples of malicious scripts.
Unfortunately, this
“cloning” process resulted in false positive rates in
the double digits for malware detection. It would
appear that the examples we scraped are similar
in features to benign examples, a possible
limitation with static JavaScript analysis.
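The sketch below shows how both resampling strategies can be applied with the imbalanced-learn package on a synthetic imbalanced set; class labels and estimator settings are assumptions, and older imbalanced-learn releases exposed a slightly different API.

```python
import numpy as np
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import SVMSMOTE

# Synthetic imbalanced stand-in for our feature matrix (0 = benign, 1 = malicious).
rng = np.random.RandomState(0)
X_train = np.vstack([rng.rand(500, 28), rng.rand(12, 28) + 0.5])
y_train = np.array([0] * 500 + [1] * 12)

# Undersampling: drop benign points that form Tomek links with malicious ones.
X_under, y_under = TomekLinks().fit_resample(X_train, y_train)

# Oversampling: synthesise new malicious points near the SVM decision boundary.
# Note: older imbalanced-learn releases exposed this as SMOTE(kind='svm') and
# used fit_sample() instead of fit_resample().
X_over, y_over = SVMSMOTE(random_state=0, k_neighbors=5).fit_resample(X_train, y_train)
```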
Even after collection of malicious samples and
their verification with VirusTotal, we manually
analyzed the samples and found many samples
which were very similar. Our sampling techniques
are dependent on the relative weights of the
samples of both classes. To avoid generating too
many points for what is essentially the same
sample, we filtered out all the duplicates by hand.
We also removed many scripts which were
detected to be malicious by only one or two AV
engines. There were many scripts which had
spurious values returned from the feature
extractor. These were also removed.
Discussion and Limitations
There were several difficulties involved with this
project; two of the largest were collecting enough
known malicious scripts and working within the
very limited virtual machine that was allocated to
this project. Collecting known malicious scripts of
any kind is inherently difficult. The difficulties
of the virtual machine stemmed from trying to use
an operating system that was a full version out of
date. There were issues utilizing the latest Python
modules that are usually intended by their
developers to run on only current versions of
operating systems. The second major difficulty
was the speed of the disk storage allocated to the
virtual machine. Data writes to the virtual machine
file system were so slow that, unless carefully
watched, the OS would build up such high I/O
wait times that it would crash and require a hard
reboot by the hypervisor operator. The issue was
partially alleviated by network-mounting storage
from a third-party virtual machine. Fortunately,
the networking resources assigned to the virtual
machine were sufficient that all scraping and
network storage traffic was easily handled within
the allocated bandwidth.
The impact of these two main issues could be
reduced by 1) partnering with another entity that
already has a large collection of confirmed and
recent malicious scripts, and 2) allocating faster
virtual or physical resources that would allow a
much higher rate of site scraping and rendering.
Conclusions
Javascript’s market presence makes it a key
consideration in applications of internet abuse and
security. The dynamic delivery of user content is
essential for an interactive web experience but
also poses considerable risk to users. While there
have been great advances in the development of
filter technologies such as blacklists, it is
commonly accepted in the information security
field that these are inadequate in preventing
delivery of malicious content due to issues such
as code obfuscation, zero-day exploits, and
delayed updating of the blacklists to include
compromised sites. Deep inspection of JavaScript
through dynamic analysis is costly and detracts
from the user experience. Instead, we propose
lightweight static analysis of JavaScript using
machine learning techniques that are capable of
reacting to changes in the malicious content on
the web. This approach can train models offline
and select only those models which best address
the data the user is consuming.
We found that for our data from the Alexa top
1M sites, Support Vector Machines and Random
Forest Classifiers performed best while Naive
Bayes lagged in performance. This is in
agreement with the paper by H.B. Kazemian and
S. Ahmed [2] and the paper by Y.-T. Hou et al. [1]. Our
boosted decision tree (Adaboost) and Decision
Tree performance was on par with these other
results as well, confirming prior literature.
References
[1] Y.-T Hou, Y. Chang, T. Chen, C.-S Laih, C.-M
Chen. Malicious web content detection by
machine learning. In Expert Systems with
Applications, 37 (1) (2010), pp. 55-60.
[2] H.B. Kazemian, S. Ahmed. Comparisons of
machine learning techniques for detecting
malicious webpages. In Expert Systems with
Applications, 42 (1) (2015), pp. 1166-1177.
[3] J. Ma, L. K. Saul, S. Savage, G. M. Voelker.
Beyond blacklists: learning to detect malicious
web sites from suspicious URLs. In Proceedings
of the 15th ACM SIGKDD international conference
on Knowledge discovery and data mining, 2009.
[4] S. N. Bannur, L. K. Saul, S. Savage. Judging a
site by its content: learning the textual, structural,
and visual features of malicious web pages. In
Proceedings of the 4th annual ACM workshop on
Security and artificial intelligence, 2011.
[5] D. Canali, M. Cova, G. Vigna, C. Kruegel.
Prophiler: A fast filter for large-scale detection of
malicious web pages. In International World Wide
Web Conference Committee (IW3C2), 2011.
[6] A. Kadur, J. Du, P. Chawan, R. Mehere, R.
Ding, S. T. Muralidharan, V. Chamrani, V. Dsilva.
Web and search engine crawling for data
discovery. Georgia Institute of Technology, CS
6262.
[7] Z. Li, K. Zhang, Y. Xie, F. Yu, X. Wang.
Knowing your enemy: understanding and
detecting malicious web advertising. In
Proceedings of the 2012 ACM conference on
Computer and communications security, 2012.
[8] P. Ratanaworabhan, B. Livshits, B. Zorn.
Nozzle: A defense against heap-spraying code
injection attacks. USENIX Security Symposium,
2009.
[9] C. Curtsinger, B. Livshits, B. Zorn, C. Seifert.
Zozzle: Low-overhead Mostly Static JavaScript
Malware Detection. In Proceedings of the USENIX
Security Symposium, 2011.
[10] P. Likarish, E. Jung, I. Jo. Obfuscated
malicious JavaScript detection using classification
techniques. In MALWARE, 2009.
[11] E. Adar, J. Teevan, S. T. Dumais, J. L. Elsas.
The web changes everything: Understanding the
dynamics of web content. In Proceedings of the
Second ACM International Conference on Web
Search and Data Mining, 2009.
[12] M. Felegyhazi, C. Kreibich, V. Paxson. On the
Potential of Proactive Domain Blacklisting. In
LEET'10: Proceedings of the 3rd USENIX
Conference on Large-scale Exploits and
Emergent Threats, 2010.
[13] Alexa. The web information company.
http://www.alexa.com/. 2016.
[14] Virus Total. Free Online Virus, Malware and
URL Scanner. http://www.virustotal.com/. 2016.

Mais conteúdo relacionado

Mais procurados

Security Testing - Zap It
Security Testing - Zap ItSecurity Testing - Zap It
Security Testing - Zap It
Manjyot Singh
 
Web crawler with seo analysis
Web crawler with seo analysis Web crawler with seo analysis
Web crawler with seo analysis
Vikram Parmar
 
XSS and CSRF with HTML5
XSS and CSRF with HTML5XSS and CSRF with HTML5
XSS and CSRF with HTML5
Shreeraj Shah
 
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
ijwscjournal
 
Web application security & Testing
Web application security  & TestingWeb application security  & Testing
Web application security & Testing
Deepu S Nath
 
Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsis
Mayur Garg
 

Mais procurados (20)

Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsis
 
AJAX: How to Divert Threats
AJAX:  How to Divert ThreatsAJAX:  How to Divert Threats
AJAX: How to Divert Threats
 
Session7-XSS & CSRF
Session7-XSS & CSRFSession7-XSS & CSRF
Session7-XSS & CSRF
 
Security Testing - Zap It
Security Testing - Zap ItSecurity Testing - Zap It
Security Testing - Zap It
 
NullCon 2012 - Ra.2: blackbox DOM-based XSS scanner
NullCon 2012 - Ra.2: blackbox DOM-based XSS scannerNullCon 2012 - Ra.2: blackbox DOM-based XSS scanner
NullCon 2012 - Ra.2: blackbox DOM-based XSS scanner
 
Web Hacking
Web HackingWeb Hacking
Web Hacking
 
Web crawler with seo analysis
Web crawler with seo analysis Web crawler with seo analysis
Web crawler with seo analysis
 
WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
Cyber ppt
Cyber pptCyber ppt
Cyber ppt
 
2013 OWASP Top 10
2013 OWASP Top 102013 OWASP Top 10
2013 OWASP Top 10
 
RSA Europe 2013 OWASP Training
RSA Europe 2013 OWASP TrainingRSA Europe 2013 OWASP Training
RSA Europe 2013 OWASP Training
 
XSS and CSRF with HTML5
XSS and CSRF with HTML5XSS and CSRF with HTML5
XSS and CSRF with HTML5
 
Source Code Analysis with SAST
Source Code Analysis with SASTSource Code Analysis with SAST
Source Code Analysis with SAST
 
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
 
Hacking web applications
Hacking web applicationsHacking web applications
Hacking web applications
 
Website hacking and prevention (All Tools,Topics & Technique )
Website hacking and prevention (All Tools,Topics & Technique )Website hacking and prevention (All Tools,Topics & Technique )
Website hacking and prevention (All Tools,Topics & Technique )
 
Common Web Application Attacks
Common Web Application Attacks Common Web Application Attacks
Common Web Application Attacks
 
Web application security & Testing
Web application security  & TestingWeb application security  & Testing
Web application security & Testing
 
Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsis
 
Web Application Security
Web Application SecurityWeb Application Security
Web Application Security
 

Destaque

Professores 2012
Professores 2012Professores 2012
Professores 2012
QUEDINHA
 
Student Assignment Presentation Template
Student Assignment Presentation TemplateStudent Assignment Presentation Template
Student Assignment Presentation Template
Brandon S
 

Destaque (10)

E M F, E M R, Our Brains and the Weather - What's the link?
E M F, E M R, Our Brains and the Weather - What's the link?E M F, E M R, Our Brains and the Weather - What's the link?
E M F, E M R, Our Brains and the Weather - What's the link?
 
7.8.16 CV MMJ
7.8.16 CV MMJ7.8.16 CV MMJ
7.8.16 CV MMJ
 
Leviticus 13 commentary
Leviticus 13 commentaryLeviticus 13 commentary
Leviticus 13 commentary
 
Professores 2012
Professores 2012Professores 2012
Professores 2012
 
Planificación del Taller Educativo
Planificación del Taller EducativoPlanificación del Taller Educativo
Planificación del Taller Educativo
 
Student Assignment Presentation Template
Student Assignment Presentation TemplateStudent Assignment Presentation Template
Student Assignment Presentation Template
 
PPT Teknologi Pendidikan
PPT Teknologi PendidikanPPT Teknologi Pendidikan
PPT Teknologi Pendidikan
 
3 d semiconductor packaging
3 d semiconductor packaging3 d semiconductor packaging
3 d semiconductor packaging
 
Public Health/Health Care Partnerships: An Overview of the Landscape
Public Health/Health Care Partnerships: An Overview of the LandscapePublic Health/Health Care Partnerships: An Overview of the Landscape
Public Health/Health Care Partnerships: An Overview of the Landscape
 
Hospital/Community Partnership resource list
Hospital/Community Partnership resource listHospital/Community Partnership resource list
Hospital/Community Partnership resource list
 

Semelhante a CS6262_Group9_FinalReport

Cq3210191021
Cq3210191021Cq3210191021
Cq3210191021
IJMER
 
Top 10 Web Vulnerability Scanners
Top 10 Web Vulnerability ScannersTop 10 Web Vulnerability Scanners
Top 10 Web Vulnerability Scanners
wensheng wei
 
Introduction to Frontend Web Development
Introduction to Frontend Web DevelopmentIntroduction to Frontend Web Development
Introduction to Frontend Web Development
kavsinghta
 
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
ijwscjournal
 
Resume_Sandip_Mohod_Java_9_plus_years_exp
Resume_Sandip_Mohod_Java_9_plus_years_expResume_Sandip_Mohod_Java_9_plus_years_exp
Resume_Sandip_Mohod_Java_9_plus_years_exp
Sandip Mohod
 
dinesh_7.0_years_exp_in_java
dinesh_7.0_years_exp_in_javadinesh_7.0_years_exp_in_java
dinesh_7.0_years_exp_in_java
Dinesh Rajput
 
How to build a Portofino application
How to build a Portofino applicationHow to build a Portofino application
How to build a Portofino application
Giampiero Granatella
 

Semelhante a CS6262_Group9_FinalReport (20)

website vulnerability scanner and reporter research paper
website vulnerability scanner and reporter research paperwebsite vulnerability scanner and reporter research paper
website vulnerability scanner and reporter research paper
 
Project Presentation
Project Presentation Project Presentation
Project Presentation
 
Cq3210191021
Cq3210191021Cq3210191021
Cq3210191021
 
Top 10 Web Vulnerability Scanners
Top 10 Web Vulnerability ScannersTop 10 Web Vulnerability Scanners
Top 10 Web Vulnerability Scanners
 
Introduction to Frontend Web Development
Introduction to Frontend Web DevelopmentIntroduction to Frontend Web Development
Introduction to Frontend Web Development
 
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
 
Web application penetration testing lab setup guide
Web application penetration testing lab setup guideWeb application penetration testing lab setup guide
Web application penetration testing lab setup guide
 
Resume_Sandip_Mohod_Java_9_plus_years_exp
Resume_Sandip_Mohod_Java_9_plus_years_expResume_Sandip_Mohod_Java_9_plus_years_exp
Resume_Sandip_Mohod_Java_9_plus_years_exp
 
A quick guide on Mobile App Backend development
A quick guide on Mobile App Backend developmentA quick guide on Mobile App Backend development
A quick guide on Mobile App Backend development
 
Java script Session No 1
Java script Session No 1Java script Session No 1
Java script Session No 1
 
5 Powerful Backend Frameworks for Web App Development in 2022
5 Powerful Backend Frameworks for Web App Development in 20225 Powerful Backend Frameworks for Web App Development in 2022
5 Powerful Backend Frameworks for Web App Development in 2022
 
Top 5 backend frameworks for web development in.pptx
Top 5 backend frameworks for web development in.pptxTop 5 backend frameworks for web development in.pptx
Top 5 backend frameworks for web development in.pptx
 
Node.js
Node.jsNode.js
Node.js
 
Secure Software Development with 3rd Party Dependencies
Secure Software Development with 3rd Party DependenciesSecure Software Development with 3rd Party Dependencies
Secure Software Development with 3rd Party Dependencies
 
MALICIOUS JAVASCRIPT DETECTION BASED ON CLUSTERING TECHNIQUES
MALICIOUS JAVASCRIPT DETECTION BASED ON CLUSTERING TECHNIQUESMALICIOUS JAVASCRIPT DETECTION BASED ON CLUSTERING TECHNIQUES
MALICIOUS JAVASCRIPT DETECTION BASED ON CLUSTERING TECHNIQUES
 
dinesh_7.0_years_exp_in_java
dinesh_7.0_years_exp_in_javadinesh_7.0_years_exp_in_java
dinesh_7.0_years_exp_in_java
 
Bshield osdi2006
Bshield osdi2006Bshield osdi2006
Bshield osdi2006
 
How to build a Portofino application
How to build a Portofino applicationHow to build a Portofino application
How to build a Portofino application
 
Making Of PHP Based Web Application
Making Of PHP Based Web ApplicationMaking Of PHP Based Web Application
Making Of PHP Based Web Application
 
Detection of Phishing Websites
Detection of Phishing WebsitesDetection of Phishing Websites
Detection of Phishing Websites
 

CS6262_Group9_FinalReport

  • 1. 1 A Static Approach to Harmful JavaScript Detection Using Web Crawling Chandan Sharma, Harsha Manivannan, Joel Wilhite*, Dibakar Barua, and Garrett Mallory Georgia Institute of Technology, School of Computer Science {chandan.sharma24, har23k, jwilhite3, dibakar.barua92, gmallory3}@gatech.edu Abstract JavaScript is a small programming language that is an essential element of the interactive web pages of contemporary web browsing. JavaScript is used to interact with a web user and to customize content based on a variety of browser types and user interaction. As a richly featured programming language, JavaScript has been a ripe target for malicious actions against website users. It has been used by malicious site operators and third party actors to launch a variety of malicious actions against a user's browser and general system. Modern browsers have made great strides forward with fixing security vulnerabilities and including filtering technologies to limit the ability of malicious web code to actually execute in the browser. Related technologies like Google’s search engine include an extensive blacklist of sites that are known to host malicious content. However, browsers will always be behind the vulnerability discovery, patch creation, and patch application cycle and modern malicious threats are designed to infiltrate and propagate within the time it takes for a malicious web site to be included in malicious site blacklists. Much of the malicious JavaScript code is injected into otherwise benign web servers through web server vulnerabilities and the malicious JavaScript is distributed to site visitors for a significant amount of time before the intrusion is detected and removed. For static malicious scripts standard hash based signature checking could be effectively utilized in checking JavaScript for known malicious behaviors. However, most malicious scripts include some portion of dynamic content that significantly changes any hash based signature checking system. In an effort to combat malicious scripts that may be injected into web servers, we propose a JavaScript scanning and classifying system. This system would scan a website, pull out all instances of JavaScript, and assign a maliciousness score to each script. Introduction Earlier websites consisted of mostly static content and relied on the the user interacting with a site to initiate additional content requests by clicking on links. As JavaScript and other browser based scripting languages were introduced, sites started including more interactive content and offloading some of the site processing load to the browser, reducing the amount of server side processing required. JavaScript has become an integral part of the modern web landscape; its inclusion allows large sophisticated web sites to offer a wealth of user interaction capability. It is also used to speed up page load times and reduce the total network bandwidth required by a site. A page load initiated by the user downloads and renders an initial page, as the user interacts with the page, JavaScript can be used to load additional content that is customized to the user; without waiting for a monolithic page to load all data that could display whether the it would be needed or not. As with seemingly every new advancement in the technology, nefarious elements began to use JavaScript for malicious activities almost as soon
  • 2. 2 as it was introduced. Every advancement in browser and JavaScript security is challenged by new exploit techniques developed by the Blackhat community. Protecting users from malicious content is a never ending pursuit for browser vendors and autonomous system operators. Many of the more traditional methods of protecting computing resources from malicious code have also been applied to protecting browsers. Some of these methods include: signature based JavaScript checking[1], blacklisting domains and IPs from which known malware has been distributed [12] and disabling JavaScript all together. Blacklists of known malicious domains and IPs are of limited value because purveyors of malicious JavaScript are constantly finding ways to inject their malicious code into otherwise reputable sites which are not on a blacklist and will be quickly removed when the infection is found and reported. Signature based checking on JavaScripts are also of limited use considering that most signatures are based on hashes of known malicious content. Most malicious JavaScripts contain small variations such that old versions of signatures become obsolete almost immediately. Considering the limitations of traditional approaches to recognizing malicious JavaScript we propose an additional method that does not depend on blacklists or signatures to determine whether a JavaScript is malicious or not. In this paper we present a lightweight JavaScript classifier that looks a what a JavaScript does and looks like, to determine if it is malicious or not. Similar to how a code review analyzer can check C code for improper string copy, unbounded arrays, and other coding practices that are known to be vulnerable, our JavaScript classifier analyzes what a script does and assigns a maliciousness score based on what features are similar to known malicious JavaScripts. We will also examine related work in this space, give a detailed system overview, and speak extensively on how we analyzed our data. We will also cover limitations of these approaches and conclusions drawn by this work. It is important to note this classifier was built as a proof of concept and is not intended to be a production quality product. The goal of this project is to investigate if this line of research holds potential for further research investment by reaffirming existing literature. Future Use The limitations of a JavaScript classifier are that it is not practical as an inline or hidden filtering system due to the widespread use of SSL in modern web sites. A preemptive approach could be a JavaScript classifier run as a browser plugin. The plugin combined with a sandbox space could be used to render, classify, and block malicious JavaScripts from executing outside the sandbox space and interacting with the user. A fast reactive approach could be an analyzer engine utilized by an autonomous system operator which is capable of dedicating the resources required to scan a majority of sites based on DNS resolution requests and populate a realtime blacklist based on the maliciousness score. This could reduce but not completely prevent the spread of JavaScript based infections. Related Work There has been a lot of research published in the space of using automated classifiers to identify malicious content. These papers recognize the shortcomings of contemporary blacklists, signature detection, and other reactive methods and instead propose alternative methods to actively scan for maliciousness. 
These methods are resilient to code obfuscation and are able to pick up never before seen malicious JavaScripts or drive-by downloads. Some popular classification techniques use supervised machine learning methods such as K-Nearest Neighbors, Support Vector Machines, or Naive Bayes and unsupervised methods such as K-Means or Affinity propagation [2]. Training data often consists of URLs, screenshots of webpages, textual content, structural tags, page links (type or count), visual appearance, HTML contents, JavaScript content, advertisement
  • 3. 3 delivery infrastructure or web page change characteristics [4]. Of these, JavaScript was the most popular subject for analysis with a multitude of papers existing aimed to statically or dynamically classify scripts. Because of JavaScript’s market presence, we chose to focus on extracting features from scripts scraped from tens of thousands of sites from the Alexa top 1M sites. We also chose to pursue static analysis to support the development of lightweight, end-user friendly products. System Overview The resources allocated to this project was a single virtual machine running Debian 7.9 with 6 2.1GHz processing cores, 17GB of ram, and 1TB of storage. With this virtual machine we scraped a subset of sites from the Alexa top 1M list, rendered each page and extracted all scripts present on the page. Scripts were stored as individual files using its SHA-512 hash as the filename in a domain-named-directory. This allowed us to store a script in a deduplicated fashion. Metadata about each script was stored in a MongoDB database each time it was encountered during a scrape whether or not the script had been previously stored on disk. Python 2.7 was used as the programming language for this project because of it’s ease of use and the large amount of prebuilt modules available. Our JavaScript collection engine is built with a collection of Python modules glued together with custom Python scripts. The engine consists of the module Scrapy crawling a site, rendering the site in Splash, then passing the rendered site to BeautifulSoup for JavaScript and feature extraction. Each script was stored as a distinct file for later analysis. The virtual machine was able to run 5 scrapy instances and one MongoDB concurrently. The total list of sites to scrape was divided up into five chunks with each chunk allocated to a Scrapy instance which in turn scraped it’s allocated sites in manageable 200 site chunks. We discovered that the scrapy/splash infrastructure started to face problems when it tried to contiguously scrap 300 or more. To alleviate this problem, when each Scrapy instance scraped 200 sites, it was shutdown and restarted with the next 200 sites. For rendering sites we used the Python module Splash which is a very lightweight, headless HTML browser, implemented as a Docker container to help sandbox potential malicious code from escaping the rendering environment. Docker is a lightweight virtualization environment that is available on the three most popular operating systems Linux, Windows, and OS X. Docker is similar to well known hypervisors in that it provides security between simultaneously running applications yet it only virtualized the application and related libraries and not the entire OS stack. Because we are actively trying to find and execute bad things that will harm our browser, we do all site rendering in docker containers that are protected from the OS running on our virtual machine. After some initial tests using manually created Splash Docker containers with pre-defined interface ports for each instance, we switched to Aquarium, a packaged multi container Splash instance that load balances requests between individual Splash containers. Aquarium uses docker-compose to build and run multiple Splash containers and a load balancing HAProxy container to distribute render requests to the Splash containers. Aquarium was periodically stopped and reloaded to reinitialize all the Splash rendering containers. 
This solved the problem of Splash growing in memory usage until it’s memory limit was reached and of removing any malicious scripts that managed to infect a Splash container. Beautifulsoup is a Python module that automatically determines page encoding and parses the page into an easily searchable DOM tree. We use this dictionary to retrieve all JavaScript instances from a site which we then later pass on to feature extraction, maliciousness verification, and machine learning classifiers. The next page includes a diagram showing the flow of information for our scraping system. Separate is the classification system.
  • 4. 4 Fig 1: System Architecture Diagram Data Collection and Training Building our classifier consisted of two distinct phases: mass data collection of JavaScripts with a high certainty classification as malicious or benign, and scraping a large group of unclassified scripts. For training our classifier, we verify a script as either benign or malicious and then label it accordingly. In order to obtain a set of benign JavaScripts we scraped the top 100 sites of the Alexa list and labelled the scripts obtained as benign. Well known blacklists were scrapped to create a database of scripts and each of these scripts were passed to virustotal.com through the website’s API to confirm its maliciouness. Virustotal analyses the scripts with various anti-virus (AV) engines and returns a score of the number of AV engines which found the sample malicious. We decided to only include scripts which were found malicious by multiple AV engines. This ensures that the scripts we labelled malicious are truly malicious and not a false prediction by a lone AV engine. We prune our malicious scripts dataset to remove any duplicates. The combined dataset is then passed to the feature extraction module which extracts the features and produces a CSV file which can be used to train the classifier. Feature Extraction Most malicious JavaScripts are obfuscated and packed in order to make their analysis difficult. We used several metrics drawn from previous work [5][10] to inform the static analysis of the JavaScript code to form our feature vector. In order to differentiate between malicious and benign scripts, we decided to look at both the structural as well as statistical features. We analyze the structural features and a number of feature based on the analysis of the Abstract Syntax Tree (AST) using the slimit parser. The statistical measures
  • 5. 5 computed include white space percentage, string entropy, and average line length. The complete list of 28 features which we extracted are the following: number of eval functions, number of setInterval() functions, number of JavaScript string link() functions, number of Javascript string search() functions,number of exec() functions, number of escape() functions, number of unescape() functions, ratio of number of Javascript keywords to number of normal words (from predefined list), average entropy of all words, average entropy of the script, number of long strings in the program (with length >40), maximum entropy of the program, average string length in the program, maximum string length in the program, number of functions or variables with long names(with length>=40), number of direct string assignments, number of string-modifying functions, number of event attachment functions, number of DOM modifying functions in the script, number of suspicious strings (lookup from predefined list), whitespace ratio (number of whitespace characters to actually number of characters in the script), number of strings containing only hexadecimal characters, maximum number of non printable characters in strings, average line length in the script, number of times 'iframe' is present in a string in the script, number of tags with malicious names in the script (from predefined list) and total length of the script. Keywords-to-words ratio is useful to detect malicious pages because in most exploits the number of keywords like “var”, “for”, “while” and a few others is limited while there are usually a large number of other operations (such as instantiations, arithmetical operations, function calls). This usually does not happen in benign scripts, where the occurrence of keywords is usually higher. We also check for the presence of shellcode in the scripts. We analyze the long strings contained in the script to check if their structure resembles shellcode. We use two methods to confirm the same. The first method considers the number of non-printable ASCII characters in the string. The second one detects shellcode composed only of hexadecimal characters, i.e., it checks if the string is a consecutive block of characters in the ranges a-f, A-F, 0-9. We keep a count of the number of direct string assignments . Malicious scripts tend to have a large number of string assignments in order to deobfuscate and encrypt the script. Drive by download exploits usually call several of these functions in order to instantiate vulnerable components and/or create elements in the page for the purpose of loading external scripts and exploit pages. Feature selection We also tried to analyze and list the more important features using feature selection techniques like Scikit’s ExtraTreesClassifier (which is an ensemble of randomized decision trees). A ranking analysis of our classifier yielded maximum scores for the following 5 features: ● 1) No. of functions or variables with long names. ● 2) No. of direct string assignments. ● 3) No. of string modifying functions. ● 4) No. of escape() functions. ● 5) Average entropy of the script. These features are accurate in that they cover a variety of ways in which we manually characterize a malicious script. Model Training and Analysis Our analysis was performed using several open source machine learning packages. 
Weka (based =-in Java) and Scikit-Learn (based in Python) were both evaluated as candidates before we opted for Scikit-Learn due to its customizability and ease of integration into our Python framework. Scikit is a machine learning Python module that was used to analyze the bulk of our data. We also used the module to handle preprocessing on our data by applying a data normalization function called Min-Max Scaler to all datasets of JavaScripts features. This estimator scales each feature individually such that the final result is within {0, 1}. Scaling
  • 6. 6 eatures individually was important because of the radical differences in feature ranges and distribution over the range. This preprocessing algorithm also has the benefit of maintaining sparsity in the dataset, a genuine concern as our collection of JavaScripts did not enumerate a large portion of the feature space. Collecting scripts with a very high probability of being malicious was more challenging. We scraped sites that were listed as containing malicious content by well known blacklists squidblacklist.org, malwaredomainlist.com, malwareblacklist.com, and phishtank.com. Scripts obtained from these blacklisted sites were checked against virustotal.com to get a determination of maliciousness. Out of 10,000 of the JavaScripts scraped from blacklisted sites, we found 12 unique scripts confirmed to be malevolent and 8864 unique scripts found to be benign. Model Selection When using machine learning for data analysis it is essential to select an analysis model that is designed to handle scope and type of data inputs. For our data set, we evaluated a variety of models to get a better understanding of what model would be most useful. The models tested were: Support Vector Machines (SVM), Random Forest Classifier (RFC), Adaboost, Decision Trees, and Multinomial Naive Bayes (MNB). We first split the data into a training and testing set using a 75-25 stratified shuffle split. For each model, we performed a grid search with 10-fold cross-validation on the training set to search the parameter space for the values SVM C, gamma RFC Number estimators Adaboost Number estimators, learning rate Decision Trees Max depth, min samples per leaf, max number of leaves MNB Alpha Fig 2: Parameter Tuning which give the highest estimator performance. Figure 2 contains the parameters examined in this way. All other parameters were set to default Scikit values. After equipping our estimators with the optimal parameters for our data, we evaluated the performance of each model on the separate testing set using receiver operating characteristic (ROC) curve plotting. Below are the ROC curves for the two most competitive models, SVM and RFC. Because of the limitations on size of the testing set selection, we run this performance evaluation multiple times and take the average of the ROC results. All of the models and their respective areas under the curve are plotted on Figure 3. A perfect classifier would have an AUC of 1.0 while a random binary binary classifier has an AUC of 0.50 and is represented by the dashed line. Fig 3: SVM ROC Fig 4: RFC ROC
Analysis problems and limitations

We find that the Random Forest Classifier gives the best performance for detection of malicious scripts, with a 12/12 correct detection rate. When we tested our 8,864 benign scripts, SVM with an RBF kernel gives the best performance for detection of benign scripts, with a 95.5% true negative rate. With a false negative rate of 25%, however, the SVM classifier has difficulty when it encounters new malicious scripts. Because of the small number of malicious scripts being tested, we infer that the perfect 12/12 detection achieved by a few models cannot be fully trusted; with a representative test set the results should have been consistent across models. The underlying reason is that we could not provide the classifier with a sufficient variety of malicious scripts in the training set, due to the limitations of our ground truth generation and crawling infrastructure. With a true negative rate above 75% in 4 of the 5 models, our classifier has a good understanding of benign scripts, thanks to the large and varied set of them in our training data and the in-depth handpicking done at selection.

The most significant limitation we encountered during our analysis was obtaining a substantial and diverse set of malicious JavaScripts. Using virustotal.com as a source of authority for malicious scripts presented a timing issue, as each API request takes up to 15 seconds to return a result and there is a maximum limit of 4 requests per source IP per minute (a rate-limited lookup sketch is given below). The small number of malicious samples led to a very imbalanced data set, which in turn led to overfitting and skewed results. We tried to employ Tomek links undersampling of the benign samples to address this problem; this reduced the presence of benign scripts in our training set so as to avoid overfitting to benign examples. The alternative, over-sampling using SVM-SMOTE, feeds the training data additional synthetic examples near the existing malicious examples in the high-dimensional feature space in order to generate realistic examples of malicious scripts. Unfortunately, this "cloning" process resulted in false positive rates in the double digits for malware detection. It would appear that the malicious examples we scraped are similar in features to benign examples, a possible limitation of static JavaScript analysis. A resampling sketch is also given below.

Even after collecting malicious samples and verifying them with VirusTotal, we manually analyzed the samples and found many that were very similar. Because our resampling techniques depend on the relative weight of the samples of both classes, we filtered out all duplicates by hand to avoid generating too many points for what is essentially the same sample. We also removed many scripts that were detected as malicious by only one or two AV engines, as well as scripts for which the feature extractor returned spurious values.
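The ground-truth lookups described above can be paced to respect the public quota. The sketch below assumes the VirusTotal v2 public file-report endpoint that was available around the time of this work; the URL, parameter names, response fields, and the fixed 15-second sleep are assumptions and should be verified against current VirusTotal documentation.

```python
import hashlib
import time
import requests

API_KEY = 'YOUR_VT_API_KEY'  # placeholder
VT_URL = 'https://www.virustotal.com/vtapi/v2/file/report'  # assumed v2 public endpoint

def vt_positives(script_text):
    # Look up a script by its SHA-256 hash and return (positives, total) AV verdicts.
    digest = hashlib.sha256(script_text.encode('utf-8')).hexdigest()
    resp = requests.get(VT_URL, params={'apikey': API_KEY, 'resource': digest}, timeout=30)
    report = resp.json()
    if report.get('response_code') != 1:  # unknown hash or request not yet processed
        return None
    return report.get('positives', 0), report.get('total', 0)

def label_scripts(scripts):
    # scripts: mapping from script identifier to source text (assumed).
    labels = {}
    for name, text in scripts.items():
        labels[name] = vt_positives(text)
        time.sleep(15)  # stay under the 4-requests-per-minute public quota
    return labels
```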
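For the class-imbalance experiments, both resampling strategies are available in the imbalanced-learn package (imblearn) as TomekLinks and SVMSMOTE. The sketch below shows their use under that assumption, with parameters left mostly at their defaults.

```python
import numpy as np
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import SVMSMOTE

# X_train, y_train: scaled features and labels for the imbalanced training set (assumed).

# Undersampling: remove benign (majority-class) points that form Tomek links with
# malicious points, thinning the benign class near the decision boundary.
X_tl, y_tl = TomekLinks().fit_resample(X_train, y_train)

# Oversampling: SVM-SMOTE synthesizes new malicious (minority-class) points near the
# SVM-estimated class boundary in feature space. With only a dozen malicious samples,
# k_neighbors may need to be lowered from its default.
X_sm, y_sm = SVMSMOTE(random_state=0).fit_resample(X_train, y_train)

print('benign count after Tomek links:', int(np.sum(y_tl == 0)))
print('malicious count after SVM-SMOTE:', int(np.sum(y_sm == 1)))
```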
Discussion and Limitations

There were several difficulties involved with this project; two of the largest were collecting enough known malicious scripts and attempting to use the very limited virtual machine that was allocated to the project. Collecting known malicious scripts of any kind is difficult by nature. The difficulties with the virtual machine stemmed from using an operating system that was a full version out of date; there were issues using the latest Python modules, which are usually intended by their developers to run only on current versions of operating systems. The second major difficulty was the speed of the disk storage allocated to the virtual machine. Data writes to the virtual machine file system were so slow that, unless watched very carefully, the OS would build up such a high IO wait time that it would crash and require a hard reboot by the hypervisor operator. The issue was partially alleviated by network-mounting storage from a third party virtual machine. Fortunately, the networking resources assigned to the virtual machine were sufficient that all scraping and network storage were easily handled within the allocated bandwidth. The impact of these two main issues could be reduced by 1) partnering with another entity that already has a large collection of confirmed and recent malicious scripts and 2) allocating faster virtual or physical resources that would allow a much higher rate of site scrapes and rendering.

Conclusions

JavaScript's market presence makes it a key consideration in internet abuse and security. The dynamic delivery of user content is essential for an interactive web experience but also poses considerable risk to users. While there have been great advances in the development of filtering technologies such as blacklists, it is commonly accepted in the information security field that these are inadequate for preventing the delivery of malicious content because of issues such as code obfuscation, zero-day exploits, and delayed inclusion of compromised sites in the blacklists. Deep inspection of JavaScript through dynamic analysis is costly and detracts from the user experience. Instead, we propose lightweight static analysis of JavaScript using machine learning techniques that are capable of reacting to changes in the malicious content on the web. This approach can train models offline and select only those models which best address the data the user is consuming. We found that for our data from the Alexa top 1M sites, Support Vector Machines and Random Forest Classifiers performed best while Naive Bayes lagged in performance. This is in agreement with the results of Kazemian and Ahmed [2] and Hou et al. [1]. Our boosted decision tree (Adaboost) and Decision Tree performance was on par with these other results as well, confirming prior literature.

References

[1] Y.-T. Hou, Y. Chang, T. Chen, C.-S. Laih, C.-M. Chen. Malicious web content detection by machine learning. Expert Systems with Applications, 37 (1) (2010), pp. 55-60.
[2] H.B. Kazemian, S. Ahmed. Comparisons of machine learning techniques for detecting malicious webpages. Expert Systems with Applications, 42 (1) (2015), pp. 1166-1177.
[3] J. Ma, L. K. Saul, S. Savage, G. M. Voelker. Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[4] S. N. Bannur, L. K. Saul, S. Savage. Judging a site by its content: learning the textual, structural, and visual features of malicious web pages. In Proceedings of the 4th Annual ACM Workshop on Security and Artificial Intelligence, 2011.
[5] D. Canali, M. Cova, G. Vigna, C. Kruegel. Prophiler: a fast filter for the large-scale detection of malicious web pages. In Proceedings of the 20th International Conference on World Wide Web (WWW), 2011.
[6] A. Kadur, J. Du, P. Chawan, R. Mehere, R. Ding, S. T. Muralidharan, V. Chamrani, V. Dsilva. Web and search engine crawling for data discovery. Georgia Institute of Technology, CS 6262.
[7] Z. Li, K. Zhang, Y. Xie, F. Yu, X. Wang. Knowing your enemy: understanding and detecting malicious web advertising. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, 2012.
[8] P. Ratanaworabhan, B. Livshits, B. Zorn. Nozzle: a defense against heap-spraying code injection attacks. In Proceedings of the USENIX Security Symposium, 2009.
[9] C. Curtsinger, B. Livshits, B. Zorn, C. Seifert. Zozzle: low-overhead mostly static JavaScript malware detection. In Proceedings of the USENIX Security Symposium, 2011.
[10] P. Likarish, E. Jung, I. Jo. Obfuscated malicious JavaScript detection using classification techniques. In Proceedings of the 4th International Conference on Malicious and Unwanted Software (MALWARE), 2009.
[11] E. Adar, J. Teevan, S. T. Dumais, J. L. Elsas. The web changes everything: understanding the dynamics of web content. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009.
[12] M. Felegyhazi, C. Kreibich, V. Paxson. On the potential of proactive domain blacklisting. In Proceedings of the 3rd USENIX Conference on Large-scale Exploits and Emergent Threats (LEET '10), 2010.
[13] Alexa. The web information company. http://www.alexa.com/. 2016.
[14] VirusTotal. Free online virus, malware and URL scanner. http://www.virustotal.com/. 2016.