A Static Approach to Harmful JavaScript Detection
Using Web Crawling
Chandan Sharma, Harsha Manivannan, Joel Wilhite*,
Dibakar Barua, and Garrett Mallory
Georgia Institute of Technology, School of Computer Science
{chandan.sharma24, har23k, jwilhite3, dibakar.barua92, gmallory3}@gatech.edu
Abstract
JavaScript is a lightweight programming language that
is an essential element of the interactive web
pages of contemporary web browsing. It is used
to interact with the user and to customize content
for a variety of browser types and user
interactions.
As a richly featured programming language,
JavaScript has been a ripe vehicle for malicious
actions against website users. It has been used by
malicious site operators and third-party actors to
launch a variety of attacks against users'
browsers and systems. Modern browsers have
made great strides in fixing security vulnerabilities
and adding filtering technologies that limit the
ability of malicious web code to actually execute
in the browser. Related technologies such as
Google's search engine include an extensive
blacklist of sites known to host malicious content.
However, browsers will always lag behind the
cycle of vulnerability discovery, patch creation,
and patch application, and modern threats are
designed to infiltrate and propagate within the
time it takes for a malicious web site to be
added to a blacklist.
Much malicious JavaScript is injected into
otherwise benign web servers through web
server vulnerabilities, and it is distributed to site
visitors for a significant amount of time before
the intrusion is detected and removed. For static
malicious scripts, standard hash-based signature
checking could effectively flag known malicious
behavior. However, most malicious scripts
include some portion of dynamic content, which
defeats any hash-based signature checking
system. To combat malicious scripts injected
into web servers, we propose a JavaScript
scanning and classifying system that scans a
website, extracts all instances of JavaScript, and
assigns a maliciousness score to each script.
Introduction
Earlier websites consisted mostly of static content
and relied on the user interacting with a site to
initiate additional content requests by clicking on
links. As JavaScript and other browser-based
scripting languages were introduced, sites began
including more interactive content and offloading
some of the processing load to the browser,
reducing the amount of server-side processing
required. JavaScript has become an integral part
of the modern web landscape; its inclusion allows
large, sophisticated web sites to offer a wealth of
user interaction capability. It is also used to speed
up page load times and reduce the total network
bandwidth a site requires. A user-initiated page
load downloads and renders an initial page; as
the user interacts with the page, JavaScript can
load additional content customized to the user,
without waiting for a monolithic page to load all
data whether it would be needed or not.
As with seemingly every new advancement in
technology, nefarious elements began to use
JavaScript for malicious activities almost as soon
as it was introduced. Every advancement in
browser and JavaScript security is challenged by
new exploit techniques developed by the blackhat
community. Protecting users from malicious
content is a never-ending pursuit for browser
vendors and autonomous system operators. Many
of the more traditional methods of protecting
computing resources from malicious code have
also been applied to protecting browsers. Some of
these methods include signature-based
JavaScript checking [1], blacklisting domains and
IPs from which known malware has been
distributed [12], and disabling JavaScript
altogether. Blacklists of known malicious domains
and IPs are of limited value because purveyors of
malicious JavaScript are constantly finding ways
to inject their code into otherwise reputable sites
that are not on any blacklist and that, once listed,
are quickly cleaned and delisted. Signature-based
checking of JavaScript is also of limited use,
since most signatures are hashes of known
malicious content; most malicious scripts contain
small variations, so old signatures become
obsolete almost immediately.
Considering the limitations of traditional
approaches to recognizing malicious JavaScript,
we propose an additional method that does not
depend on blacklists or signatures to determine
whether a script is malicious. In this paper we
present a lightweight JavaScript classifier that
examines both what a script does and what it
looks like. Similar to how a code review analyzer
can check C code for improper string copies,
unbounded arrays, and other coding practices
known to be vulnerable, our classifier analyzes a
script's behavior and assigns a maliciousness
score based on which of its features resemble
known malicious JavaScripts. We will
also examine related work in this space, give a
detailed system overview, and speak extensively
on how we analyzed our data. We will also cover
limitations of these approaches and conclusions
drawn by this work. It is important to note this
classifier was built as a proof of concept and is not
intended to be a production quality product. The
goal of this project is to investigate if this line of
research holds potential for further research
investment by reaffirming existing literature.
Future Use
A JavaScript classifier is not practical as an
inline or transparent filtering system due to the
widespread use of SSL on modern web sites. A
preemptive approach could be a JavaScript
classifier run as a browser plugin. Combined with
a sandbox, the plugin could render, classify, and
block malicious JavaScripts from executing
outside the sandbox and interacting with the
user. A fast reactive approach could be an
analyzer engine run by an autonomous system
operator capable of dedicating the resources
required to scan a majority of sites seen in DNS
resolution requests and populate a real-time
blacklist based on maliciousness scores. This
could reduce, but not completely prevent, the
spread of JavaScript-based infections.
Related Work
There has been a lot of research published in the
space of using automated classifiers to identify
malicious content. These papers recognize the
shortcomings of contemporary blacklists,
signature detection, and other reactive methods
and instead propose alternative methods to
actively scan for maliciousness. These methods
are resilient to code obfuscation and are able to
pick up never-before-seen malicious JavaScripts
or drive-by downloads. Some popular
classification techniques use supervised machine
learning methods such as K-Nearest Neighbors,
Support Vector Machines, or Naive Bayes and
unsupervised methods such as K-Means or
Affinity propagation [2].
Training data often consists of URLs, screenshots
of webpages, textual content, structural tags, page
links (type or count), visual appearance, HTML
contents, JavaScript content, advertisement
delivery infrastructure or web page change
characteristics [4]. Of these, JavaScript was the
most popular subject for analysis, with a
multitude of papers aiming to statically or
dynamically classify scripts. Because of
JavaScript's market presence, we chose to focus
on extracting features from scripts scraped from
tens of thousands of sites in the Alexa top 1M
list. We also chose static analysis to support the
development of lightweight, end-user-friendly
products.
System Overview
The resources allocated to this project were a
single virtual machine running Debian 7.9 with six
2.1 GHz processing cores, 17 GB of RAM, and 1 TB
of storage. With this virtual machine we scraped a
subset of sites from the Alexa top 1M list,
rendered each page, and extracted all scripts
present on the page. Scripts were stored as
individual files, each named by its SHA-512 hash
inside a per-domain directory, which let us store
each distinct script exactly once. Metadata about
each script was stored in a MongoDB database
every time it was encountered during a scrape,
whether or not the script had previously been
stored on disk. Python 2.7 was used as the
programming language for this project because
of its ease of use and the large number of
prebuilt modules available.
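The per-domain, hash-named storage scheme can be sketched as follows (function names and the directory layout here are illustrative, not the project's actual code):

```python
import hashlib
import os

def store_script(script_text, domain, root="scripts"):
    """Store a script under <root>/<domain>/<sha512>.js.

    Returns (path, was_new). Identical scripts hash to the same
    filename, so each distinct script is stored exactly once.
    """
    digest = hashlib.sha512(script_text.encode("utf-8")).hexdigest()
    domain_dir = os.path.join(root, domain)
    os.makedirs(domain_dir, exist_ok=True)
    path = os.path.join(domain_dir, digest + ".js")
    if os.path.exists(path):
        return path, False  # deduplicated: identical script seen before
    with open(path, "w", encoding="utf-8") as f:
        f.write(script_text)
    return path, True
```

Metadata (domain, timestamp, hash) would still be written to MongoDB on every encounter, even when `was_new` is False.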
Our JavaScript collection engine is a set of
Python modules glued together with custom
scripts: Scrapy crawls a site, Splash renders it,
and BeautifulSoup extracts the JavaScript and
features from the rendered page. Each script was
stored as a distinct file for later analysis. The
virtual machine was able to run five Scrapy
instances and one MongoDB instance
concurrently.
The total list of sites to scrape was divided into
five chunks, with each chunk allocated to a
Scrapy instance, which in turn scraped its
allocated sites in manageable 200-site batches.
We discovered that the Scrapy/Splash
infrastructure started to face problems when it
tried to scrape 300 or more sites contiguously.
To alleviate this, after each Scrapy instance
scraped 200 sites it was shut down and restarted
with the next 200 sites.
For rendering sites we used Splash, a very
lightweight, headless HTML browser, run as a
Docker container to help keep potentially
malicious code from escaping the rendering
environment. Docker is a lightweight
virtualization environment available on the three
most popular operating systems: Linux, Windows,
and OS X. Docker is similar to well-known
hypervisors in that it provides isolation between
simultaneously running applications, yet it
virtualizes only the application and its libraries
rather than the entire OS stack. Because we are
deliberately trying to find and execute malicious
code that could harm a browser, we do all site
rendering in Docker containers that are isolated
from the OS running on our virtual machine.
After some initial tests using manually created
Splash Docker containers with predefined
interface ports for each instance, we switched to
Aquarium, a packaged multi-container Splash
setup that load-balances requests between
individual Splash containers. Aquarium uses
docker-compose to build and run multiple Splash
containers plus an HAProxy container that
distributes render requests among them.
Aquarium was periodically stopped and reloaded
to reinitialize all the Splash rendering containers;
this solved the problem of Splash's memory
usage growing until its limit was reached, and it
removed any malicious scripts that managed to
infect a Splash container.
BeautifulSoup is a Python module that
automatically determines page encoding and
parses the page into an easily searchable DOM
tree. We use this tree to retrieve all JavaScript
instances from a site, which we then pass on to
feature extraction, maliciousness verification, and
the machine learning classifiers. Figure 1 shows
the flow of information through our scraping
system; the classification system is separate.
Fig 1: System Architecture Diagram
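In our engine BeautifulSoup performs the extraction; as a self-contained illustration using only the Python standard library's html.parser, pulling inline scripts out of a rendered page might look like this (class and function names are ours, not the project's code):

```python
from html.parser import HTMLParser

class ScriptExtractor(HTMLParser):
    """Collect the bodies of all inline <script> tags on a page."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # html.parser treats <script> content as raw CDATA,
        # so this receives the script body unmodified.
        if self.in_script:
            self.scripts[-1] += data

def extract_scripts(html):
    parser = ScriptExtractor()
    parser.feed(html)
    # Drop empty bodies (external scripts referenced via src=...)
    return [s for s in parser.scripts if s.strip()]
```

Each returned script body would then be hashed, stored, and queued for feature extraction.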
Data Collection and Training
Building our classifier consisted of two distinct
phases: mass collection of JavaScripts with a
high-certainty classification as malicious or
benign, and scraping a large group of
unclassified scripts. For training, we verify each
script as either benign or malicious and label it
accordingly. To obtain a set of benign
JavaScripts we scraped the top 100 sites of the
Alexa list and labelled the scripts obtained as
benign.
Well-known blacklists were scraped to create a
database of scripts, and each script was passed
to virustotal.com through the site's API to confirm
its maliciousness. VirusTotal analyses scripts
with various anti-virus (AV) engines and returns
the number of engines that found the sample
malicious. We decided to include only scripts
found malicious by multiple AV engines.
This ensures that the scripts we labelled
malicious are truly malicious rather than a false
prediction by a lone AV engine. We prune our
malicious scripts dataset to remove duplicates.
The combined dataset is then passed to the
feature extraction module, which extracts the
features and produces a CSV file used to train
the classifier.
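The labelling and deduplication step can be sketched as follows; the `MIN_ENGINES` threshold and the shape of the input (script text paired with the engine count from a VirusTotal report) are illustrative assumptions:

```python
import hashlib

MIN_ENGINES = 2  # our rule: malicious only if flagged by multiple AV engines

def label_scripts(reports):
    """Label scripts from (script_text, positives) pairs, where
    `positives` is the number of AV engines that flagged the sample.
    Deduplicates by SHA-512 so each distinct script is labelled once.
    """
    seen = set()
    labelled = []
    for text, positives in reports:
        digest = hashlib.sha512(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate script, already labelled
        seen.add(digest)
        label = "malicious" if positives >= MIN_ENGINES else "benign"
        labelled.append((digest, label))
    return labelled
```

A script flagged by a single engine is treated as benign here, matching our decision to distrust lone-engine detections.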
Feature Extraction
Most malicious JavaScripts are obfuscated and
packed in order to make their analysis difficult. We
used several metrics drawn from previous work
[5][10] to inform the static analysis of the
JavaScript code to form our feature vector. In
order to differentiate between malicious and
benign scripts, we decided to look at both the
structural as well as statistical features. We
analyze the structural features and a number of
feature based on the analysis of the Abstract
Syntax Tree (AST) using the slimit parser. The
statistical measures
computed include white space percentage, string
entropy, and average line length.
The complete list of 28 features we extracted is
as follows:
● number of eval() calls
● number of setInterval() calls
● number of JavaScript string link() calls
● number of JavaScript string search() calls
● number of exec() calls
● number of escape() calls
● number of unescape() calls
● ratio of JavaScript keywords to normal words (from a predefined list)
● average entropy of all words
● average entropy of the script
● number of long strings in the program (length > 40)
● maximum entropy of the program
● average string length in the program
● maximum string length in the program
● number of functions or variables with long names (length >= 40)
● number of direct string assignments
● number of string-modifying functions
● number of event attachment functions
● number of DOM-modifying functions in the script
● number of suspicious strings (lookup from a predefined list)
● whitespace ratio (whitespace characters to total characters in the script)
● number of strings containing only hexadecimal characters
● maximum number of non-printable characters in strings
● average line length in the script
● number of times 'iframe' appears in a string in the script
● number of tags with malicious names in the script (from a predefined list)
● total length of the script
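Several of the statistical features above are straightforward to compute; a minimal sketch of three of them (function names are ours, not the project's actual code):

```python
import math

def shannon_entropy(s):
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def statistical_features(script):
    """Compute three of the statistical features described above."""
    lines = script.split("\n")
    whitespace = sum(1 for ch in script if ch.isspace())
    return {
        "script_entropy": shannon_entropy(script),
        "whitespace_ratio": whitespace / max(len(script), 1),
        "avg_line_length": sum(len(l) for l in lines) / max(len(lines), 1),
    }
```

High entropy and unusual whitespace ratios are both associated with packed or obfuscated scripts, which is why these simple measures carry signal.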
The keywords-to-words ratio is useful for
detecting malicious pages because in most
exploits the number of keywords such as "var",
"for", and "while" is limited, while there is usually
a large number of other operations (such as
instantiations, arithmetic operations, and function
calls). Benign scripts tend to show the opposite
pattern, with a higher occurrence of keywords.
We also check for the presence of
shellcode in the scripts: we analyze the long
strings in a script to check whether their structure
resembles shellcode. We use two methods to
confirm this. The first considers the number of
non-printable ASCII characters in the string. The
second detects shellcode composed only of
hexadecimal characters, i.e., it checks whether
the string is a consecutive block of characters in
the ranges a-f, A-F, 0-9. We also count the
number of direct string assignments; malicious
scripts tend to have a large number of string
assignments used to decrypt and deobfuscate
the script. Drive-by download exploits usually call
several string-modifying and DOM functions in
order to instantiate vulnerable components
and/or create page elements for loading external
scripts and exploit pages.
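The two shellcode checks can be combined into a single heuristic; the length and non-printable thresholds below are illustrative, not the exact values we used:

```python
import re
import string

HEX_RE = re.compile(r'^[0-9a-fA-F]+$')

def looks_like_shellcode(s, min_len=40):
    """Heuristic from the text: a long string is suspicious if it is
    a consecutive block of hexadecimal characters, or if it contains
    many non-printable characters.
    """
    if len(s) < min_len:
        return False  # short strings are ignored
    if HEX_RE.match(s):
        return True   # e.g. "9090909041414141..." style payloads
    nonprintable = sum(1 for ch in s if ch not in string.printable)
    return nonprintable > len(s) * 0.1
```

In the feature vector these checks surface as counts (hex-only strings, maximum non-printable characters) rather than a boolean, but the underlying test is the same.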
Feature selection
We also analyzed and ranked the more
important features using feature selection
techniques such as Scikit's ExtraTreesClassifier
(an ensemble of randomized decision trees). A
ranking analysis of our classifier yielded
maximum scores for the following five features:
● number of functions or variables with long names
● number of direct string assignments
● number of string-modifying functions
● number of escape() functions
● average entropy of the script
These features make intuitive sense in that they
cover a variety of the ways in which we would
manually characterize a malicious script.
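A feature-importance ranking of this kind can be reproduced with Scikit's ExtraTreesClassifier; the sketch below uses synthetic data in place of our 28-dimensional feature vectors, with one column made deliberately informative:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for our 28-dimensional feature matrix:
# column 0 determines the label, the other columns are noise.
rng = np.random.RandomState(0)
X = rng.rand(200, 28)
y = (X[:, 0] > 0.5).astype(int)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

importances = clf.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important features first
```

With real data, the top entries of `ranking` correspond to the five features listed above.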
Model Training and Analysis
Our analysis was performed using several open
source machine learning packages. Weka
(Java-based) and Scikit-Learn (Python-based)
were both evaluated as candidates before we
opted for Scikit-Learn due to its customizability
and ease of integration into our Python
framework. Scikit-Learn was used to analyze the
bulk of our data and to handle preprocessing,
applying a Min-Max scaling normalization to all
datasets of JavaScript features.
This estimator scales each feature individually
such that the final result lies within [0, 1]. Scaling
features individually was important because of the
radical differences in feature ranges and
distribution over the range. This preprocessing
algorithm also has the benefit of maintaining
sparsity in the dataset, a genuine concern as our
collection of JavaScripts did not enumerate a
large portion of the feature space.
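Per-feature Min-Max scaling itself is simple; here is a stdlib sketch of what the Scikit estimator does for one feature column (the constant-column behavior is our simplification):

```python
def min_max_scale(column):
    """Scale a list of values into [0, 1], one feature at a time.

    Constant columns map to 0.0 in this sketch. When a feature's
    minimum is 0 (as for most of our count features), zeros map
    to zero, which is what preserves sparsity.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

Scaling each feature independently matters because, for example, total script length ranges over thousands while function counts range over tens.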
Collecting scripts with a very high probability of
being malicious was more challenging. We
scraped sites listed as containing malicious
content by the well-known blacklists
squidblacklist.org, malwaredomainlist.com,
malwareblacklist.com, and phishtank.com. Scripts
obtained from these blacklisted sites were
checked against virustotal.com for a
determination of maliciousness. Out of the
10,000 JavaScripts scraped from blacklisted
sites, we found 12 unique scripts confirmed to be
malevolent and 8864 unique scripts found to be
benign.
Model Selection
When using machine learning for data analysis it
is essential to select a model suited to the scope
and type of the data inputs. For our data set, we
evaluated a variety of models to get a better
understanding of which would be most useful.
The models tested were: Support Vector
Machines (SVM), Random Forest Classifier
(RFC), Adaboost, Decision Trees, and Multinomial
Naive Bayes (MNB). We first split the data into
training and testing sets using a 75-25 stratified
shuffle split. For each model, we performed a grid
search with 10-fold cross-validation on the
training set to find the parameter values giving
the highest estimator performance. Figure 2 lists
the parameters examined in this way; all other
parameters were left at their default Scikit values.

SVM: C, gamma
RFC: number of estimators
Adaboost: number of estimators, learning rate
Decision Trees: max depth, min samples per leaf, max number of leaves
MNB: alpha
Fig 2: Parameter Tuning
After equipping our estimators with the optimal
parameters for our data, we evaluated the
performance of each model on the separate
testing set using receiver operating characteristic
(ROC) curve plotting. Below are the ROC curves
for the two most competitive models, SVM and
RFC. Because of the small size of the testing
set, we ran this performance evaluation multiple
times and averaged the ROC results. All of the
models and their respective areas under the
curve are plotted in Figure 5. A perfect classifier
would have an AUC of 1.0, while a random binary
classifier has an AUC of 0.50 and is represented
by the dashed line.
Fig 3: SVM ROC
Fig 4: RFC ROC
Fig 5: All ROC Plot
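The split / grid-search / ROC pipeline described above can be sketched with Scikit-learn; synthetic data stands in for our labelled feature vectors, and the parameter grid shown is an illustrative subset of Fig 2:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic stand-in for our labelled 28-dimensional feature vectors.
X, y = make_classification(n_samples=300, n_features=28, random_state=0)

# 75-25 stratified shuffle split, as in the text.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(split.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Grid search with 10-fold CV over the RFC parameter from Fig 2.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [10, 50, 100]}, cv=10)
grid.fit(X_train, y_train)

# Evaluate the refit best estimator on the held-out testing set.
scores = grid.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
```

Plotting the full ROC curve (as in Figs 3-5) only requires `sklearn.metrics.roc_curve` on the same `scores`.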
Analysis problems and limitations
We find that the Random Forest Classifier gives
the best performance for detection of malicious
scripts, with a 12/12 correct detection rate. When
we tested our 8864 benign scripts, SVM with an
RBF kernel gave the best performance for
detection of benign scripts, at a 95.5% True
Negative Rate. With a False Negative Rate of
25% for SVM, our classifier has difficulty when it
encounters new malicious scripts. Given the
small number of malicious scripts tested, we
infer that the perfect 12/12 prediction from a few
models cannot be trusted; with a representative
sample, accuracy should have been consistent
across models. The reason is that we could not
provide the classifier with a sufficient variety of
malicious scripts in the training set, due to the
limitations of our ground truth generation and
crawling infrastructure. With a True Negative Rate
above 75% in 4 of 5 models, our classifier has a
good understanding of benign scripts, thanks to
the large and varied set of them in our training
data and the in-depth handpicking done at
selection.
The most significant limitation we encountered
during our analysis was obtaining a substantial
number and diverse set of malicious JavaScripts.
Using virustotal.com as an authority on malicious
scripts presented a timing issue, as each API
request takes up to 15 seconds to return a result
and there is a limit of 4 requests per source IP
per minute.
The small number of malicious samples
produced a very imbalanced data set, which
caused overfitting and heavily skewed results.
We tried Tomek links undersampling of the
benign samples to address this problem; it
reduced the presence of benign scripts in our
training set so as to avoid overfitting to benign
examples. The alternative, over-sampling with
SVM-SMOTE, feeds the training data additional
synthetic examples near the malicious examples
in the high-dimensional feature space in order to
generate realistic examples of malicious scripts.
Unfortunately, this "cloning" process resulted in
false positive rates in the double digits for
malware detection. It would appear that the
malicious examples we scraped are similar in
features to benign examples, a possible
limitation of static JavaScript analysis.
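The idea behind Tomek-link undersampling — dropping the majority-class member of every pair of opposite-class points that are each other's nearest neighbor — can be sketched in pure Python as follows (a naive O(n²) illustration, not the implementation we used):

```python
import math

def nearest(i, X):
    """Index of the nearest other point to X[i] (squared Euclidean)."""
    best, best_d = None, math.inf
    for j, p in enumerate(X):
        if j == i:
            continue
        d = sum((a - b) ** 2 for a, b in zip(X[i], p))
        if d < best_d:
            best, best_d = j, d
    return best

def tomek_undersample(X, y, majority=0):
    """Remove majority-class members of Tomek links: pairs of
    opposite-class points that are mutual nearest neighbors.
    """
    drop = set()
    for i in range(len(X)):
        j = nearest(i, X)
        if j is not None and y[i] != y[j] and nearest(j, X) == i:
            drop.add(i if y[i] == majority else j)
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]
```

Removing these borderline majority points cleans the class boundary without touching the scarce malicious samples.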
Even after collecting malicious samples and
verifying them with VirusTotal, we manually
analyzed the samples and found many that were
very similar. Our sampling techniques depend on
the relative weight of the samples in both
classes, so to avoid generating too many points
for what is essentially the same sample, we
filtered out duplicates by hand. We also removed
many scripts detected as malicious by only one
or two AV engines, as well as scripts for which
the feature extractor returned spurious values.
Discussion and Limitations
There were several difficulties involved with this
project; two of the largest were collecting enough
known malicious scripts and working within the
very limited virtual machine allocated to the
project. Collecting known malicious scripts of
any kind is difficult by nature. The difficulties
of the virtual machine stemmed from using an
operating system that was a full version out of
date; there were issues using the latest Python
modules, which their developers usually intend
to run only on current operating systems. The
second major difficulty
was the speed of the disk storage allocated to the
virtual machine. Data writes to the virtual
machine file system were so slow that, unless
carefully monitored, the OS would accumulate
such high IO wait times that it would crash and
require a hard reboot by the hypervisor operator.
The issue was partially alleviated by
network-mounting storage from a third-party
virtual machine. Fortunately, the networking
resources assigned to the virtual machine were
sufficient that all scraping and network storage
fit easily within the allocated bandwidth.
The impact of these two issues could be reduced
by 1) partnering with an entity that already has a
large collection of confirmed, recent malicious
scripts and 2) allocating faster virtual or physical
resources that would allow a much higher rate of
site scraping and rendering.
Conclusions
JavaScript's market presence makes it a key
consideration in internet abuse and security. The
dynamic delivery of user content is essential for
an interactive web experience but also poses
considerable risk to users. While there
have been great advances in the development of
filter technologies such as blacklists, it is
commonly accepted in the information security
field that these are inadequate in preventing
delivery of malicious content due to issues such
as code obfuscation, zero-day exploits, and
delayed updating of the blacklists to include
compromised sites. Deep inspection of JavaScript
through dynamic analysis is costly and detracts
from the user experience. Instead, we propose
lightweight static analysis of JavaScript using
machine learning techniques that are capable of
reacting to changes in the malicious content on
the web. This approach can train models offline
and select only those models which best address
the data the user is consuming.
We found that for our data from the Alexa top
1M sites, Support Vector Machines and Random
Forest Classifiers performed best while Naive
Bayes lagged in performance. This is in
agreement with the paper by H.B. Kazemain and
S. Ahmed [2] and the paper by Y.-T. Hou et al.
[1]. Our boosted decision tree (Adaboost) and
Decision Tree performance was on par with
these results as well, confirming prior literature.
References
[1] Y.-T Hou, Y. Chang, T. Chen, C.-S Laih, C.-M
Chen. Malicious web content detection by
machine learning. In Expert Systems with
Applications, 37 (1) (2010), pp. 55-60.
[2] H.B. Kazemain, S.Ahmed. Comparisons of
machine learning techniques for detecting
malicious webpages. In Expert Systems with
Applications, 42 (1) (2015), pp. 1166-1177.
[3] J. Ma, L. K. Saul, S. Savage, G. M. Voelker.
Beyond blacklists: learning to detect malicious
web sites from suspicious URLs. In Proceedings
of the 15th ACM SIGKDD international conference
on Knowledge discovery and data mining, 2009.
[4] S. N. Bannur, L. K. Saul, S. Savage. Judging a
site by its content: learning the textual, structural,
and visual features of malicious web pages. In
Proceedings of the 4th annual ACM workshop on
Security and artificial intelligence, 2011.
[5] D. Canali, M. Cova, G. Vigna, C. Kruegel.
Prophiler: A fast filter for large-scale detection of
malicious web pages. In Proceedings of the
International World Wide Web Conference
(IW3C2), 2011.
[6] A. Kadur, J. Du, P. Chawan, R. Mehere, R.
Ding, S. T. Muralidharan, V. Chamrani, V. Dsilva.
Web and search engine crawling for data
discovery. Georgia Institute of Technology, CS
6262.
[7] Z. Li, K. Zhang, Y. Xie, F. Yu, X. Wang.
Knowing your enemy: understanding and
detecting malicious web advertising. In
Proceedings of the 2012 ACM conference on
Computer and communications security, 2012.
[8] P. Ratanaworabhan, B. Livshits, B. Zorn.
Nozzle: A defense against heap-spraying code
injection attacks. USENIX Security Symposium,
2009.
[9] C. Curtsinger, B. Livshits, B. Zorn, C. Seifert.
Zozzle: Low-overhead Mostly Static JavaScript
Malware Detection. In Proceedings of the USENIX
Security Symposium, 2011.
[10] P. Likarish, E. Jung, I. Jo. Obfuscated
malicious JavaScript detection using classification
techniques. In MALWARE, 2009.
[11] E. Adar, J. Teevan, S. T. Dumais, J. L. Elsas.
The web changes everything: Understanding the
dynamics of web content. In Proceedings of the
Second ACM International Conference on Web
Search and Data Mining, 2009.
[12] M. Felegyhazi, C. Kreibich, V. Paxson. On
the Potential of Proactive Domain Blacklisting. In
LEET'10: Proceedings of the 3rd USENIX
conference on Large-scale exploits and
emergent threats, 2010.
[13] Alexa. The web information company.
http://www.alexa.com/. 2016.
[14] Virus Total. Free Online Virus, Malware and
URL Scanner. http://www.virustotal.com/. 2016.