SMalL - Semantic Malware Log Based Reporter
Stefan Ceriu, Stefan Prutianu
Faculty of Computer Science, „Al. I. Cuza“ University, Iasi, Romania
{ stefan.ceriu, stefan.prutianu}@info.uaic.ro
Abstract. In this paper we present the SMalL ontology for malicious software
classification, the SMalL Java application for comparing antivirus systems, and
the SMalL knowledge-based file format for malware-related attacks. We believe
that our ontology can aid the development of malware prevention software
by offering a common knowledge base and a clear classification of existing
malicious software. The application is a prototype showing how this ontology
might be used in conjunction with known antivirus capabilities to offer a
comprehensive comparison.
Keywords: malware, semantic web, jena, owl, protégé, ontology, virus, worm,
Trojan, spyware, crimeware;
1 Introduction
Malware, also known as malicious code and malicious software, refers to a
program that is inserted into a system, usually covertly, with the intent of
compromising the confidentiality, integrity, or availability of the victim's data,
applications, or operating system or otherwise annoying or disrupting the victim.
Malware has become the most significant external threat to most systems, causing
widespread damage and disruption, and necessitating extensive recovery efforts
within most organizations. Spyware, malware intended to violate a user's privacy, has
also become a major concern to organizations. Although privacy-violating malware
has been in use for many years, it has become much more widespread recently, with
spyware invading many systems to monitor personal activities and conduct financial
fraud. Organizations also face similar threats from a few forms of non-malware
threats that are often associated with malware. One of these forms that has become
commonplace is phishing, which is using deceptive computer-based means to trick
individuals into disclosing sensitive information. Another common form is virus
hoaxes, which are false warnings of new malware threats.
We will further look into ways by which to classify the different types of
malware by means of a new ontology and an application designed to work with it,
with the aim of comparing the available antivirus systems.
2 Ontologies and OWL
2.1 Overview
The term ontology originates from philosophy. In that context, it is used as
the name of a subfield of philosophy, namely, the study of the nature of existence, the
branch of metaphysics concerned with identifying, in the most general terms, the
kinds of things that actually exist, and how to describe them. For example, the
observation that the world is made up of specific objects that can be grouped into
abstract classes based on shared properties is a typical ontological commitment.
However, in more recent years, ontology has become one of the many words hijacked
by computer science and given a specific technical meaning that is rather different
from the original one. Instead of "ontology" we now speak of "an ontology." In
general, an ontology describes formally a domain of discourse. Typically, an ontology
consists of a finite list of terms and the relationships between these terms. The terms
denote important concepts (classes of objects) of the domain. For example, in a
university setting, staff members, students, courses, lecture theaters, and disciplines
are some important concepts. The relationships typically include hierarchies of
classes. A hierarchy specifies a class C to be a subclass of another class S if every
object in C is also included in S. For example, all faculty members are staff members.
Apart from subclass relationships, ontologies may include information such as:
• properties (X teaches Y)
• value restrictions (only faculty members may teach courses)
• disjointness statements (faculty and general staff are disjoint)
• specifications of logical relationships between objects (every
department must include at least ten faculty members).
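The kinds of information listed above can be made concrete with a small sketch. The Python fragment below is only an illustration built on the university example; the dictionaries and the helpers `is_subclass_of` and `may_teach` are hypothetical names introduced here, not part of any ontology library.

```python
# Toy encoding of the university example; every name here is illustrative.
SUBCLASS_OF = {
    "Faculty": "Staff",
    "GeneralStaff": "Staff",
}

# Disjointness statement: faculty and general staff share no members.
DISJOINT = {frozenset({"Faculty", "GeneralStaff"})}

# Property name -> (domain class, range class).
# The domain encodes the restriction "only faculty members may teach courses".
PROPERTIES = {"teaches": ("Faculty", "Course")}


def is_subclass_of(cls, ancestor):
    """Walk the single-parent hierarchy upward, starting at cls."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def may_teach(cls):
    """Return True if members of cls fall inside the domain of 'teaches'."""
    domain, _range = PROPERTIES["teaches"]
    return is_subclass_of(cls, domain)
```

For instance, `is_subclass_of("Faculty", "Staff")` holds, while `may_teach("GeneralStaff")` does not.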
In the context of the Web, ontologies provide a shared understanding of a
domain. Such a shared understanding is necessary to overcome differences in
terminology. One application's zip code may be the same as another application's
area code. Another problem is that two applications may use the same term with
different meanings. In university A, a course may refer to a degree (like computer
science), while in university B it may mean a single subject (CS 101). Such
differences can be overcome by mapping the particular terminology to a shared
ontology or by defining direct mappings between the ontologies. In either case, it is
easy to see that ontologies support semantic interoperability.
Ontologies are useful for the organization and navigation of Web sites. Many
web sites today expose on the left-hand side of the page the top levels of a concept
hierarchy of terms. The user may click on one of them to expand the subcategories.
Also, ontologies are useful for improving the accuracy of Web searches. The search
engines can look for pages that refer to a precise concept in an ontology instead of
collecting all pages in which certain, generally ambiguous, keywords occur. In this
way, differences in terminology between Web pages and the queries can be
overcome. In addition, Web searches can exploit generalization/specialization
information. If a query fails to find any relevant documents, the search engine may
suggest to the user a more general query. It is even conceivable for the engine to run
such queries proactively to reduce the reaction time in case the user adopts a
suggestion. Or if too many answers are retrieved, the search engine may suggest to
the user some specializations.
The Web Ontology Working Group of W3C identified a number of
characteristic use cases for the Semantic Web that would require much more
expressiveness than RDF and RDF Schema offer. A number of research groups in
both the United States and Europe had already identified the need for a more powerful
ontology modeling language. This led to a joint initiative to define a richer language,
called DAML+OIL (the name combines the names of the U.S. proposal DAML-ONT
and the European language OIL). DAML+OIL in turn was taken as the starting
point for the W3C Web Ontology Working Group in defining OWL, the language that
is intended to be the standardized and broadly accepted ontology language of the
Semantic Web.
Ontology languages allow users to write explicit, formal conceptualizations
of domain models. The main requirements are a well-defined syntax, efficient
reasoning support, a formal semantics, sufficient expressive power and convenience
of expression. The importance of a well-defined syntax is clear and known from the
area of programming languages; it is a necessary condition for machine processing of
information. All the languages we have presented so far have a well defined syntax.
DAML+OIL and OWL build upon RDF and RDFS and have the same kind of syntax.
Of course, it is questionable whether the XML-based RDF syntax is very user-
friendly; there are alternatives better suited to human users (for example, see the OIL
syntax). However, this drawback is not very significant because ultimately users will
be developing their own ontologies using authoring tools, or more generally, ontology
development tools, instead of writing them directly in DAML+OIL or OWL.
A formal semantics describes the meaning of knowledge precisely. Precisely here
means that the semantics does not refer to subjective intuitions, nor is it open to
different interpretations by different people (or machines). The importance of a
formal semantics is well-established in the domain of mathematical logic, for
instance. One use of a formal semantics is to allow people to reason about the
knowledge. For ontological knowledge, we may reason about the following:
• Class membership. If x is an instance of a class C, and C is a subclass of D,
then we can infer that x is an instance of D.
• Equivalence of classes. If class A is equivalent to class B, and class B is
equivalent to class C, then A is equivalent to C, too.
• Consistency. Suppose we have declared x to be an instance of the class A
and that A is a subclass of B ∩ C, A is a subclass of D, and B and D are
disjoint. Then we have an inconsistency because A should be empty but has
the instance x. This is an indication of an error in the ontology.
• Classification. If we have declared that certain property-value pairs are a
sufficient condition for membership in a class A, then if an individual x
satisfies such conditions, we can conclude that x must be an instance of A.
Semantics is a prerequisite for reasoning support. Derivations such as the
preceding ones can be made mechanically instead of being made by hand.
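As a minimal sketch of how such derivations can be mechanized, the Python fragment below infers class membership through subclass links and flags the inconsistency from the example above (x in A, with A a subclass of B, C and D, and B disjoint from D). The function names and the dictionary layout are assumptions made for illustration; a production reasoner works on a far richer logic.

```python
def superclasses(cls, subclass_of):
    """Return cls together with every class reachable via subclass links."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def inferred_types(individual, asserted, subclass_of):
    """Class membership: asserted classes plus all of their superclasses."""
    types = set()
    for cls in asserted.get(individual, ()):
        types |= superclasses(cls, subclass_of)
    return types


def find_inconsistencies(asserted, subclass_of, disjoint):
    """Report individuals that end up in two classes declared disjoint."""
    problems = []
    for individual in sorted(asserted):
        types = inferred_types(individual, asserted, subclass_of)
        for pair in disjoint:
            if pair <= types:
                problems.append((individual, tuple(sorted(pair))))
    return problems


# The consistency example: A is a subclass of B and C (B ∩ C) and of D.
subclass_of = {"A": ["B", "C", "D"]}
disjoint = [frozenset({"B", "D"})]
asserted = {"x": ["A"]}
```

With these inputs, `find_inconsistencies(asserted, subclass_of, disjoint)` reports that x belongs to both B and D.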
Reasoning support is important because it allows one to:
• check the consistency of the ontology and the knowledge
• check for unintended relationships between classes
• automatically classify instances in classes
Automated reasoning support allows one to check many more cases than
could be checked manually. Checks like the preceding ones are valuable for designing
large ontologies, where multiple authors are involved, and for integrating and sharing
ontologies from various sources. A formal semantics and reasoning support are
usually provided by mapping an ontology language to a known logical formalism, and
by using automated reasoners that already exist for those formalisms. OWL is
(partially) mapped onto a description logic, and makes use of existing reasoners such as
FaCT and RACER. Description logics are a subset of predicate logic for which
efficient reasoning support is possible.
RDF and RDFS allow the representation of some ontological knowledge.
The main modeling primitives of RDF/RDFS concern the organization of
vocabularies in typed hierarchies: subclass and sub-property relationships, domain
and range restrictions, and instances of classes. However, a number of other features
are missing. Here we list a few:
• Local scope of properties. rdfs:range defines the range of a property,
say eats, for all classes. Thus in RDF Schema we cannot declare range
restrictions that apply to some classes only. For example, we cannot say that
cows eat only plants, while other animals may eat meat too.
• Disjointness of classes. Sometimes we wish to say that classes are disjoint.
For example, male and female are disjoint. But in RDF Schema we can only
state subclass relationships, e.g., female is a subclass of person.
• Boolean combinations of classes. Sometimes we wish to build new classes
by combining other classes using union, intersection, and complement. For
example, we may wish to define the class person to be the disjoint union of
the classes male and female. RDF Schema does not allow such definitions.
• Cardinality restrictions. Sometimes we wish to place restrictions on how
many distinct values a property may or must take. For example, we would
like to say that a person has exactly two parents, or that a course is taught by
at least one lecturer. Again, such restrictions are impossible to express in
RDF Schema.
• Special characteristics of properties. Sometimes it is useful to say that a
property is transitive (like "greater than"), unique (like "is mother of"), or
the inverse of another property (like "eats" and "is eaten by").
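To illustrate what declaring such property characteristics buys a reasoner, the following Python sketch derives the extra facts implied by a transitive property and by an inverse property. It is a naive fixed-point computation written for clarity, with hypothetical function names and sample data; it is not how an OWL reasoner is implemented.

```python
def transitive_closure(pairs):
    """Add (a, d) whenever (a, b) and (b, d) are present, until nothing changes."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def inverse(pairs):
    """Derive the inverse property by swapping subject and object."""
    return {(obj, subj) for subj, obj in pairs}


# "greater than" is transitive; "is eaten by" is the inverse of "eats".
greater_than = {(3, 2), (2, 1)}
eats = {("cow", "grass")}
```

Here `transitive_closure(greater_than)` adds the pair (3, 1), and `inverse(eats)` yields the single pair ("grass", "cow").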
Thus we need an ontology language that is richer than RDF Schema, a
language that offers these features and more. In designing such a language one should
be aware of the trade-off between expressive power and efficient reasoning support.
Generally speaking, the richer the language, the more inefficient the reasoning
support becomes, often crossing the border of non-computability. Thus we need a
compromise, a language that can be supported by reasonably efficient reasoners while
being sufficiently expressive to express large classes of ontologies and knowledge.
2.2 Protégé
Knowledge about the application domain is one of the most important
cornerstones of successful software projects. We must gather at least a basic
understanding of the concepts relevant to our customers before we can begin coding.
For example, we need to know how our customer's business processes work before
we can develop a warehouse management system; we need to know that users who
buy cat food might also be interested in cat litter before we can implement purchase
recommendations for an online shop.
We acquire such knowledge from domain experts and capture it in some kind
of domain model. In simple cases, we can scribble these models on paper. This
approach works fine for small projects and when the experts help us decipher their
handwriting. But it's better to have models that directly translate into a Java program.
For instance, we can use Unified Modeling Language (UML) to sketch the domain
models with class diagrams and use cases. UML is quite good for quickly getting to
an implementation, but it is basically a language for object-oriented programming that
few domain experts fully understand. And it consists of a fixed set of modeling
constructs (such as classes and attributes) that are not very useful when domain
experts would rather talk about specific business processes and products.
The Protégé-OWL editor is an extension of Protégé that supports the Web
Ontology Language (OWL). OWL is the most recent development in standard
ontology languages, endorsed by the World Wide Web Consortium (W3C) to
promote the Semantic Web vision. An OWL ontology may include descriptions of
classes, properties and their instances. Given such an ontology, the OWL formal
semantics specifies how to derive its logical consequences, i.e. facts not literally
present in the ontology, but entailed by the semantics. These entailments may be
based on a single document or multiple distributed documents that have been
combined using defined OWL mechanisms.
The Protégé-OWL editor enables users to:
• Load and save OWL and RDF ontologies.
• Edit and visualize classes, properties, and SWRL rules.
• Define logical class characteristics as OWL expressions.
• Execute reasoners such as description logic classifiers.
• Edit OWL individuals for Semantic Web markup.
Protégé-OWL's flexible architecture makes it easy to configure and extend
the tool. It is tightly integrated with Jena and has an open-source Java API for the
development of custom-tailored user interface components or arbitrary Semantic Web
services.
From a programmer's perspective, one of Protégé's most attractive features is
that it provides an open source API to plug in your own Java components and access
the domain models from your application. As a result, you can develop systems very
rapidly: just start with the underlying domain model, let Protégé generate the basic
user interface, and then gradually write widgets and plug-ins to customize look-and-
feel and behavior.
Individuals represent objects in the domain in which we are interested. An
important difference between Protégé and OWL is that OWL does not use the Unique
Name Assumption (UNA). This means that two different names could actually refer
to the same individual. For example, "Queen Elizabeth", "The Queen" and "Elizabeth
Windsor" might all refer to the same individual. In OWL, it must be explicitly stated
that individuals are the same as each other, or different from each other; otherwise
they might be the same as each other, or they might be different from each other.
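The consequence of dropping the Unique Name Assumption is that an identity question has three possible answers rather than two. The Python sketch below models this with a small union-find structure; the class `Identity` and its methods are hypothetical names chosen for this illustration, and the sketch does not detect contradictory assertions.

```python
class Identity:
    """Tracks explicit 'same as' and 'different from' statements between names."""

    def __init__(self):
        self.parent = {}      # union-find forest over names
        self.different = []   # explicitly stated different-from pairs

    def _find(self, name):
        """Return the representative name of the group containing name."""
        self.parent.setdefault(name, name)
        while self.parent[name] != name:
            name = self.parent[name]
        return name

    def same_as(self, a, b):
        """State that two names denote the same individual."""
        root_b = self._find(b)
        self.parent[self._find(a)] = root_b

    def different_from(self, a, b):
        """State that two names denote different individuals."""
        self.different.append((a, b))

    def compare(self, a, b):
        """Answer 'same', 'different', or 'unknown' for two names."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a == root_b:
            return "same"
        for x, y in self.different:
            if {self._find(x), self._find(y)} == {root_a, root_b}:
                return "different"
        return "unknown"


ids = Identity()
ids.same_as("Queen Elizabeth", "The Queen")
ids.same_as("The Queen", "Elizabeth Windsor")
ids.different_from("The Queen", "Matthew")
```

Names that have never been related by an explicit statement, such as "Matthew" and "Gemma" here, compare as "unknown" rather than "different".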
Properties are binary relations on individuals - i.e. properties link two
individuals together. For example, the property hasSibling might link the individual
Matthew to the individual Gemma, or the property hasChild might link the individual
Peter to the individual Matthew. Properties can have inverses. For example, the
inverse of hasOwner is isOwnedBy. Properties can be limited to having a single value
– i.e. to being functional. They can also be either transitive or symmetric.
OWL classes are interpreted as sets that contain individuals. They are
described using formal (mathematical) descriptions that state precisely the
requirements for membership of the class. For example, the class Cat would contain
all the individuals that are cats in our domain of interest. Classes may be organised
into a superclass-subclass hierarchy, which is also known as a taxonomy. Subclasses
specialize ('are subsumed by') their superclasses. For example consider the classes
Animal and Cat – Cat might be a subclass of Animal (so Animal is the superclass of
Cat). This says that, 'All cats are animals', 'All members of the class Cat are
members of the class Animal', 'Being a Cat implies that you're an Animal', and 'Cat
is subsumed by Animal'. One of the key features of OWL-DL is that these
superclass-subclass relationships (subsumption relationships) can be computed
automatically by a reasoner. In OWL, classes are built up of descriptions that specify
the conditions that must be satisfied by an individual for it to be a member of the class.
OWL classes are assumed to 'overlap'. We therefore cannot assume that an
individual is not a member of a particular class simply because it has not been
asserted to be a member of that class. In order to 'separate' a group of classes we
must make them disjoint from one another. This ensures that an individual who has
been asserted to be a member of one of the classes in the group cannot be a member
of any other classes in that group.
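A short Python sketch can show the effect of disjointness on membership questions. It assumes a hypothetical helper `membership` and toy data; only asserted types and disjoint groups are considered, with no subclass reasoning.

```python
def membership(individual, cls, asserted, disjoint_groups):
    """Answer 'member', 'not a member', or 'unknown' for individual and cls."""
    types = asserted.get(individual, set())
    if cls in types:
        return "member"
    for group in disjoint_groups:
        # Membership in any other class of a disjoint group rules cls out.
        if cls in group and types & (group - {cls}):
            return "not a member"
    # Absence of an assertion alone proves nothing.
    return "unknown"


asserted = {"felix": {"Cat"}}
disjoint_groups = [{"Cat", "Dog", "Fish"}]
```

Without the disjoint group, asking whether felix is a Dog would give "unknown"; with it, the answer becomes "not a member".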
One of the key features of ontologies that are described using OWL-DL is
that they can be processed by a reasoner. One of the main services offered by a
reasoner is to test whether or not one class is a subclass of another class. By
performing such tests on the classes in an ontology it is possible for a reasoner to
compute the inferred ontology class hierarchy. Another standard service that is
offered by reasoners is consistency checking. Based on the description (conditions) of
a class the reasoner can check whether or not it is possible for the class to have any
instances. A class is deemed to be inconsistent if it cannot possibly have any
instances.
Protégé allows different OWL reasoners to be plugged in; the reasoner
shipped with Protégé is called FaCT++. The ontology can be 'sent to the reasoner' to
automatically compute the classification hierarchy and also to check the logical
consistency of the ontology. In Protégé the 'manually constructed' class hierarchy is
called the asserted hierarchy. The class hierarchy that is automatically computed by
the reasoner is called the inferred hierarchy. Being able to use a reasoner to
automatically compute the class hierarchy is one of the major benefits of building an
ontology using the OWL-DL sub-language. When constructing very large ontologies
(with upwards of several thousand classes in them) the use of a reasoner to compute
subclass-superclass relationships between classes becomes almost vital. Without a
reasoner it is very difficult to keep large ontologies in a maintainable and logically
correct state. In cases where ontologies can have classes that have many superclasses
(multiple inheritance) it is nearly always a good idea to construct the class hierarchy
as a simple tree. Classes in the asserted hierarchy (manually constructed hierarchy)
therefore have no more than one superclass. Computing and maintaining multiple
inheritance is the job of the reasoner. This technique helps to keep the ontology in a
maintainable and modular state. Not only does this promote the reuse of the ontology
by other ontologies and applications, it also minimizes human errors that are inherent
in maintaining a multiple inheritance hierarchy.
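The division of labour between the asserted and the inferred hierarchy can be sketched as follows. Reasoners such as FaCT++ rely on much more sophisticated algorithms; this Python fragment swaps in a deliberately simple technique, in which each class is defined by a set of required features and one class subsumes another when its requirements are a subset of the other's. The class names, the features, and the function `inferred_superclasses` are all hypothetical.

```python
def inferred_superclasses(definitions):
    """Map each class to the classes whose requirements it fully satisfies."""
    result = {}
    for name, required in definitions.items():
        result[name] = sorted(
            other
            for other, other_required in definitions.items()
            if other != name and other_required <= required
        )
    return result


# Asserted hierarchy kept as a simple tree: Worm has one asserted parent.
asserted_parent = {"Worm": "Malware"}

# Class definitions as sets of required features (illustrative only).
definitions = {
    "Malware": {"malicious"},
    "SelfReplicating": {"replicates"},
    "Worm": {"malicious", "replicates", "uses_network"},
}
```

In this toy model Worm has a single asserted parent, yet `inferred_superclasses(definitions)["Worm"]` lists both Malware and SelfReplicating, so the multiple inheritance is computed, not maintained by hand.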
3 Malware
3.1 Overview
Malware, short for malicious software, is software designed to infiltrate a
computer system without the owner's informed consent. The expression is a general
term used by computer professionals to mean a variety of forms of hostile, intrusive,
or annoying software or program code. The term "computer virus" is sometimes used
as a catch-all phrase to include all types of malware, including true viruses. Software
is considered malware based on the perceived intent of the creator rather than any
particular features. Malware includes computer viruses, worms, Trojan horses, most
rootkits, spyware, dishonest adware, crimeware and other malicious and unwanted
software. In law, malware is sometimes known as a computer contaminant, for
instance in the legal codes of several U.S. states, including California and West
Virginia.
Malware is not the same as defective software, that is, software that has a
legitimate purpose but contains harmful bugs. Preliminary results from Symantec
published in 2008 suggested that “the release rate of malicious code and other
unwanted programs may be exceeding that of legitimate software applications”.
According to F-Secure, "as much malware [was] produced in 2007 as in the previous
20 years altogether." Malware's most common pathway from criminals to users is
through the Internet: primarily by e-mail and the World Wide Web.
The prevalence of malware as a vehicle for organized Internet crime, along
with the general inability of traditional anti-malware protection platforms to protect
against the continuous stream of unique and newly produced professional malware,
has seen the adoption of a new mindset for businesses operating on the Internet - the
acknowledgment that some sizable percentage of Internet customers will always be
infected for some reason or other, and that they need to continue doing business with
infected customers. The result is a greater emphasis on back-office systems designed
to spot fraudulent activities associated with advanced malware operating on
customers' computers.
Many early infectious programs, including the first Internet Worm and a
number of MS-DOS viruses, were written as experiments or pranks generally
intended to be harmless or merely annoying rather than to cause serious damage to
computers. In some cases the perpetrators did not realize how much harm their
creations could do. Young programmers learning about viruses and the techniques
used to write them created them simply to prove that they could, or to see how far
they could spread. As
late as 1999, widespread viruses such as the Melissa virus appear to have been written
chiefly as pranks.
Hostile intent related to vandalism can be found in programs designed to
cause harm or data loss. Many DOS viruses, and the Windows ExploreZip worm,
were designed to destroy files on a hard disk, or to corrupt the file system by writing
invalid data. Network-borne worms such as the 2001 Code Red worm or the Ramen
worm fall into the same category. Designed to vandalize web pages, worms may seem
like the online equivalent to graffiti tagging, with the author's alias or affinity group
appearing everywhere the worm goes.
However, since the rise of widespread broadband Internet access, malicious
software has come to be designed for a profit motive, either more or less legal (forced
advertising) or criminal. For instance, since 2003, the majority of widespread viruses
and worms have been designed to take control of users' computers for black-market
exploitation. Infected "zombie computers" are used to send email
spam, to host contraband data such as child pornography, or to engage in distributed
denial-of-service attacks as a form of extortion.
Another strictly for-profit category of malware has emerged in spyware -
programs designed to monitor users' web browsing, display unsolicited
advertisements, or redirect affiliate marketing revenues to the spyware creator.
Spyware programs do not spread like viruses; they are, in general, installed by
exploiting security holes or are packaged with user-installed software, such as peer-
to-peer applications.
The best-known types of malware, viruses and worms, are known for the
manner in which they spread, rather than any other particular behavior. The term
computer virus is used for a program that has infected some executable software and
that causes that software, when run, to spread the virus to other executable software.
Viruses may also contain a payload that performs other actions, often malicious. A
worm, on the other hand, is a program that actively transmits itself over a network to
infect other computers. It too may carry a payload.
These definitions lead to the observation that a virus requires user
intervention to spread, whereas a worm spreads automatically. Using this distinction,
infections transmitted by email or Microsoft Word documents, which rely on the
recipient opening a file or email to infect the system, would be classified as viruses
rather than worms. Some writers in the trade and popular press appear to
misunderstand this distinction, and use the terms interchangeably.
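The distinction drawn above can be expressed as a small decision rule. The Python sketch below is an assumption-laden illustration: the `Sample` fields and the function `classify_by_spread` are hypothetical, and the choice to label a sample that shows both behaviours as a worm is ours, since the definitions above do not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    name: str
    infects_host_files: bool  # attaches itself to other executables or documents
    self_propagates: bool     # transmits itself over a network without user action


def classify_by_spread(sample):
    """Label a sample by its spreading mechanism only, ignoring any payload."""
    if sample.self_propagates:
        return "worm"    # assumption: network self-propagation takes precedence
    if sample.infects_host_files:
        return "virus"   # spreads only when a user runs or opens the host
    return "other"


macro_document = Sample("macro-document", infects_host_files=True, self_propagates=False)
network_scanner = Sample("network-scanner", infects_host_files=False, self_propagates=True)
```

Under this rule an infection that relies on the recipient opening a document is labelled a virus, in line with the classification given in the text.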
For a malicious program to accomplish its goals, it must be able to do so
without being shut down, or deleted by the user or administrator of the computer on
which it is running. Concealment can also help get the malware installed in the first
place. When a malicious program is disguised as something innocuous or desirable,
users may be tempted to install it without knowing what it does. This is the technique
of the Trojan horse or Trojan.
In broad terms, a Trojan horse is any program that invites the user to run it,
concealing a harmful or malicious payload. The payload may take effect immediately
and can lead to many undesirable effects, such as deleting the user's files or further
installing malicious or undesirable software. Trojan horses known as droppers are
used to start off a worm outbreak, by injecting the worm into users' local networks.
One of the most common ways that spyware is distributed is as a Trojan horse,
bundled with a piece of desirable software that the user downloads from the Internet.
When the user installs the software, the spyware is installed alongside. Spyware
authors who attempt to act in a legal fashion may include an end-user license
agreement that states the behavior of the spyware in loose terms, which the users are
unlikely to read or understand.
Once a malicious program is installed on a system, it is essential that it stay
concealed, to avoid detection and disinfection. The same is true when a human
attacker breaks into a computer directly. Techniques known as rootkits allow this
concealment, by modifying the host operating system so that the malware is hidden
from the user. Rootkits can prevent a malicious process from being visible in the
system's list of processes, or keep its files from being read. Originally, a rootkit was a
set of tools installed by a human attacker on a Unix system where the attacker had
gained administrator (root) access. Today, the term is used more generally for
concealment routines in a malicious program.
Some malicious programs contain routines to defend against removal, not
merely to hide themselves, but to repel attempts to remove them. An early example of
this behavior is recorded in the Jargon File tale of a pair of programs infesting a
Xerox CP-V timesharing system. Each ghost-job would detect the fact that the other
had been killed, and would start a new copy of the recently slain program within a
few milliseconds. The only way to kill both ghosts was to kill them simultaneously
(very difficult) or to deliberately crash the system. Similar techniques are used by
some modern malware, wherein the malware starts a number of processes that
monitor and restore one another as needed.
A backdoor is a method of bypassing normal authentication procedures.
Once a system has been compromised (by one of the above methods, or in some other
way), one or more backdoors may be installed in order to allow easier access in the
future. Backdoors may also be installed prior to malicious software, to allow attackers
entry.
The idea has often been suggested that computer manufacturers preinstall
backdoors on their systems to provide technical support for customers, but this has
never been reliably verified. Crackers typically use backdoors to secure remote access
to a computer, while attempting to remain hidden from casual inspection. To install
backdoors crackers may use Trojan horses, worms, or other methods.
During the 1980s and 1990s, it was usually taken for granted that malicious
programs were created as a form of vandalism or prank. More recently, the greater
share of malware programs have been written with a financial or profit motive in
mind. This can be taken as the malware authors' choice to monetize their control over
infected systems: to turn that control into a source of revenue.
Spyware programs are commercially produced for the purpose of gathering
information about computer users, showing them pop-up ads, or altering web-browser
behavior for the financial benefit of the spyware creator. For instance, some spyware
programs redirect search engine results to paid advertisements. Others, often called
"stealware" by the media, overwrite affiliate marketing codes so that revenue is
redirected to the spyware creator rather than the intended recipient.
Spyware programs are sometimes installed as Trojan horses of one sort or
another. They differ in that their creators present themselves openly as businesses, for
instance by selling advertising space on the pop-ups created by the malware. Most
such programs present the user with an end-user license agreement that purportedly
protects the creator from prosecution under computer contaminant laws. However,
spyware EULAs have not yet been upheld in court.
Another way that financially-motivated malware creators can profit from
their infections is to directly use the infected computers to do work for the creator.
The infected computers are used as proxies to send out spam messages. A computer
left in this state is often known as a zombie computer. The advantage to spammers of
using infected computers is they provide anonymity, protecting the spammer from
prosecution. Spammers have also used infected PCs to target anti-spam organizations
with distributed denial-of-service attacks.
In order to coordinate the activity of many infected computers, attackers
have used coordinating systems known as botnets. In a botnet, the malware or malbot
logs in to an Internet Relay Chat channel or other chat system. The attacker can then
give instructions to all the infected systems simultaneously. Botnets can also be used
to push upgraded malware to the infected systems, keeping them resistant to antivirus
software or other security measures.
It is possible for a malware creator to profit by stealing sensitive information
from a victim. Some malware programs install a keylogger, which intercepts the
user's keystrokes when entering a password, credit card number, or other information
that may be exploited. This is then transmitted to the malware creator automatically,
enabling credit card fraud and other theft. Similarly, malware may copy the CD key
or password for online games, allowing the creator to steal accounts or virtual items.
Another way of stealing money from the infected PC owner is to take control
of a dial-up modem and dial an expensive toll call. Dialer (or porn dialer) software
dials up a premium-rate telephone number such as a U.S. "900 number" and leaves the
line open, charging the toll to the infected user.
Data-stealing malware is a web threat that divests victims of personal and
proprietary information with the intent of monetizing stolen data through direct use or
underground distribution. Content security threats that fall under this umbrella include
keyloggers, screen scrapers, spyware, adware, backdoors, and bots. The term does not
refer to activities such as spam, phishing, DNS poisoning, SEO abuse, etc. However,
when these threats result in file download or direct installation, as most hybrid attacks
do, files that act as agents to proxy information will fall into the data-stealing malware
category.
3.2 SMalL Ontology
The SMalL Ontology is designed to aid the development of malware
prevention software by offering a common knowledge base and a clear classification
of the existing malicious software. It covers all the different categories and
subcategories of malware and is organized based on behavior, propagation methods,
payload, motivation, etc.
The ontology is divided into five main categories based on the major
malicious software threats: Crimeware, Spyware, Trojans, Viruses and Worms.
A virus replicates by attaching its program instructions to an ordinary "host"
program or document, so that the virus instructions are executed when the host
program is executed. There are five main virus categories:
File virus - uses the file system of a given OS (or more than one) to
propagate. File viruses include viruses that infect executable files,
companion viruses that create duplicates of files, viruses that copy
themselves into various directories, and link viruses that exploit file system
features.
Boot sector virus - infects the boot sector or the master boot record, or
displaces the active boot sector, of a hard drive. Once the hard drive is
booted up, boot sector viruses load themselves into the computer's memory.
Many boot sector viruses, once executed, prevent the OS from booting. Boot
sector viruses were widespread in the 1990s, but have almost disappeared
since the introduction of 32-bit processors and the near-disappearance of
floppy disks as a storage medium for executables.
Macro virus - written in the macro scripting languages of word processing,
accounting, editing, or project applications, it propagates by exploiting the
macro language's properties in order to transfer itself from the infected file
containing the macro script to another file. The most widespread macro
viruses are for Microsoft Office applications (Word, Excel, PowerPoint,
Access). Because they are written in the code of application software, macro
viruses are platform independent and can spread between Mac, Windows,
Linux, and any other system running the targeted application.
Email virus - refers to the delivery mechanism rather than the infection target
or behavior. Email can be used to transmit any of the above types of virus: the
virus copies and emails itself to every address in the victim's email address
book, usually within an email attachment. Each time a recipient opens the
infected attachment, the virus harvests that victim's email address book and
repeats its propagation process.
Multi-variant virus - the same core virus but implemented with slight
variations, so that an anti-virus scanner that can detect one variant will not be
able to detect the other variants.
A worm is a self-propagating program that spreads over a network, usually
the Internet. Unlike viruses, worms may not depend on other programs or victim actions
(such as opening an infected email attachment or clicking on the Web link for a
malware Web site) for replication, dissemination, or execution. Worms spread by
locating other vulnerable potential hosts on the network (e.g., via scanning or
topological analysis), then copying their program instructions to those hosts. There
are five main categories of computer worms:
Email worm - spreads via infected email attachments.
Instant messaging worm - spreads via infected attachments to IM messages or
reader access to Uniform Resource Locators (URL) in IM messages that
point to malicious Web sites from which the worm is downloaded.
IRC worm - comparable to IM worms, but exploits IRC rather than IM
channels.
P2P worm - copies itself into a shared folder, then uses P2P mechanisms to
announce its existence in hopes that other P2P users will download and
execute it.
Web worm - spreads via user access to a Web page, File Transfer Protocol
(FTP) site, or other Internet resources.
A Trojan Horse is a destructive program that masquerades as a benign
program. Stealthware such as spyware, rootkits, keyloggers, trapdoors, and certain
adware represents a subset of Trojans that is intentionally designed to be hard to
detect or undetectable. Trojan horse software installs itself on the victim's computer
when the victim opens an email attachment or computer file containing the Trojan, or
clicks on a Web link that directs the victim's browser to a Web site from which the
Trojan is automatically downloaded. Once installed, the software can be controlled
remotely by hackers for criminal or other malicious purposes, such as extracting
money, passwords, or other sensitive information, or to create a zombie from which to
disseminate spam, phishing emails, the same Trojan, or other malware to other
computers on the network/Internet. Trojan horses are classified in six categories:
Backdoor Trojan (also known as Trapdoor Trojan or Remote-Access Trojan) -
acts as a remote administration utility that enables control of the infected
machine by a remote host.
Data-collecting Trojan - surreptitiously collects and sends back information
from the victim's machine. The surreptitious nature of such software has led
to it being referred to as "stealthware."
Downloader or Dropper - downloads, installs, and in the case of the
Downloader, launches additional malware on the victim's machine.
Proxy Trojan - turns the victim's computer into a proxy server (i.e., a
zombie) that operates on behalf of the remote attacker. If the attacker's
activities are detected and tracked, the trail leads back to the victim rather
than to the attacker.
Rootkit - a collection of programs used by a hacker to evade detection while
trying to gain unauthorized access to the victim's computer. Rootkits are
designed to hide processes, files or Windows Registry entries. Rootkits are
used by hackers to hide their tracks or to insert threats surreptitiously on
compromised computers. Various types of malware use rootkits to hide
themselves on a computer.
Bot - any type of malware (e.g., Trojan, worm, spyware bots or spybots) that
enables the attacker to surreptitiously gain complete control of the infected
machine. A computer that has been infected by a bot is referred to as a
zombie or, sometimes, a drone. Bots may be further subcategorized
according to their delivery mechanism. For example, a Spam bot is similar to
an email virus or mass-mailing worm in that it relies on the intended victim's
action to activate it, either by opening an attachment affixed to a spam email,
or by clicking on a Web link within a spam email which points to a Web site
from which the bot is downloaded to the victim's computer.
Spyware represents non-Trojan stealthware that has the same objectives and
performs the same types of actions as spyware Trojans. A number of bots have
spyware capabilities, and are referred to as spybots. Spyware falls into two main
categories:
Adware - software that automatically displays advertising material to the
user, resulting in an unpleasant user experience. If malicious, adware usually
exhibits the behaviors and/or infection techniques used by viruses, worms,
and/or spyware.
Tracking cookie - a cookie is a data structure that stores information about a
user's browser session state. While cookies are a necessary component of
how many Web sites operate, tracking cookies are specifically designed to
track a user's behavior across multiple sites. Spyware sites routinely use
tracking cookies to monitor a user's browsing behavior and associate it with
the user's personal data such as name, credit card number, and other private
information, which can then be harvested and sold to illicit marketers or
cybercriminals.
Crimeware is malware used in aid of criminal activities. There are
specific types of malware used predominantly or exclusively as crimeware. The five
main types are:
Email redirector - used to intercept and relay outgoing emails to the
attacker's system.
IM redirector - used to intercept and relay outgoing instant messages to the
attacker's system.
Clicker - redirects the victim to a Web site or Internet resource by sending
the necessary commands to the victim's browser or replacing the system
file(s) in which standard Internet URLs are stored (e.g., the Microsoft
Windows hosts file).
Transaction generator - targets not the end-user computer but the computer of
a corporate or financial institution's computer center. The software generates
fraudulent transactions on behalf of the attacker within the victim
organization's payment processing or other financial systems. In some
instances, transaction generators are used to intercept credit card data for
abuse by the attacker.
Session hijacker - usually a malicious browser component that, after the
victim logs in or begins a browser session, takes over that session to enable a
hacker to exploit it, usually to perform criminal actions, such as transferring
money from the victim's bank account.
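To make the hierarchy above easier to follow, the sketch below encodes the categories and subcategories named in this section as a plain Java map. It is only an illustration: the class name MalwareTaxonomySketch and its methods are hypothetical, and the real SMalL Ontology is an OWL model rather than an in-memory map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the class and method names are hypothetical and are
// not part of the SMalL Ontology or the SMalL Java Application.
class MalwareTaxonomySketch {

    // Maps each main category to the subcategories listed in this section.
    static Map<String, List<String>> taxonomy() {
        Map<String, List<String>> t = new LinkedHashMap<>();
        t.put("Virus", List.of("File virus", "Boot sector virus", "Macro virus",
                "Email virus", "Multi-variant virus"));
        t.put("Worm", List.of("Email worm", "Instant messaging worm", "IRC worm",
                "P2P worm", "Web worm"));
        t.put("Trojan", List.of("Backdoor Trojan", "Data-collecting Trojan",
                "Downloader or Dropper", "Proxy Trojan", "Rootkit", "Bot"));
        t.put("Spyware", List.of("Adware", "Tracking cookie"));
        t.put("Crimeware", List.of("Email redirector", "IM redirector", "Clicker",
                "Transaction generator", "Session hijacker"));
        return t;
    }

    // Returns the main category of a subcategory, or null if it is not listed.
    static String categoryOf(String subcategory) {
        for (Map.Entry<String, List<String>> e : taxonomy().entrySet()) {
            if (e.getValue().contains(subcategory)) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("Rootkit belongs to: " + categoryOf("Rootkit"));
        System.out.println("Worm subcategories: " + taxonomy().get("Worm"));
    }
}
```

A lookup such as categoryOf("Clicker") walks the map and returns the enclosing category, which roughly corresponds to asking the ontology for the superclass of a given malware class.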
3.3 SMalL Java Application
The SMalL Java Application is a tool designed to compare available
software security systems. It works in conjunction with the SMalL ontology to
provide better ways by which users can examine similarities and differences between
antivirus solutions.
The application allows the user to add a new antivirus to the ontology and
link its properties to the available malware knowledge base. The user can afterwards
compare the security systems and see exactly which ones protect against a given type
of malware and which do not, which operating systems they run on, etc. The
main windows of the application are presented in Figure 2.1, Figure 2.2 and Figure 2.3.
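The comparison described above can be sketched in a few lines of plain Java. This is a minimal illustration under stated assumptions, not the application's actual code: the Antivirus class, its fields and the compare method are hypothetical names, and the product names in the example are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: all names here are hypothetical, not taken from the
// SMalL Java Application.
class AntivirusComparisonSketch {

    // A security product linked to the malware types it prevents.
    static final class Antivirus {
        final String name;
        final Set<String> operatingSystems;
        final Set<String> preventedMalware;

        Antivirus(String name, Set<String> operatingSystems, Set<String> preventedMalware) {
            this.name = name;
            this.operatingSystems = operatingSystems;
            this.preventedMalware = preventedMalware;
        }
    }

    // For every malware type prevented by at least one of the two products,
    // lists the names of the products that prevent it.
    static Map<String, Set<String>> compare(Antivirus a, Antivirus b) {
        Set<String> allTypes = new TreeSet<>(a.preventedMalware);
        allTypes.addAll(b.preventedMalware);
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String type : allTypes) {
            Set<String> products = new TreeSet<>();
            if (a.preventedMalware.contains(type)) products.add(a.name);
            if (b.preventedMalware.contains(type)) products.add(b.name);
            result.put(type, products);
        }
        return result;
    }

    public static void main(String[] args) {
        Antivirus first = new Antivirus("ProductA", Set.of("Windows"),
                Set.of("Adware", "Rootkit"));
        Antivirus second = new Antivirus("ProductB", Set.of("Windows", "Linux"),
                Set.of("Adware", "Email worm"));
        compare(first, second).forEach((type, products) ->
                System.out.println(type + " -> " + products));
    }
}
```

In the application itself this information would come from the ontology rather than from hard-coded sets; the sketch only shows the shape of the comparison.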
3.4 SMalL File Format
We believe that the file format for malware related attacks can be an OWL
file created by extracting data relevant to the given attack directly from the SMalL
Ontology. For example, in the case of an adware attack the file could contain the
antivirus used, the operating system it runs on, and a note that the system might also be
infected with a Trojan. If this is the case and the antivirus did not manage to find the
Trojan, then supplementary scans are required to find the problem. If a
system is infected by multiple malware programs, a custom file can be created
that relates the problems to one another, so that on other occasions the antivirus can
check for all of them when one appears.
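The idea of relating co-occurring problems can be sketched as follows. This is a hedged, in-memory illustration only: the class AttackReportSketch and its methods are hypothetical, the example values are placeholders, and serialization of the report to an OWL file is deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a hypothetical in-memory stand-in for the data that a
// SMalL report file could carry. Writing it out as OWL is not shown.
class AttackReportSketch {
    final String antivirus;
    final String operatingSystem;
    // Malware type -> other malware types that were found together with it.
    private final Map<String, Set<String>> related = new LinkedHashMap<>();

    AttackReportSketch(String antivirus, String operatingSystem) {
        this.antivirus = antivirus;
        this.operatingSystem = operatingSystem;
    }

    // Records that two malware types were observed on the same system.
    // The relation is stored in both directions.
    void relate(String first, String second) {
        related.computeIfAbsent(first, k -> new LinkedHashSet<>()).add(second);
        related.computeIfAbsent(second, k -> new LinkedHashSet<>()).add(first);
    }

    // Malware types a scanner should also look for once 'detected' is found.
    List<String> supplementaryScans(String detected) {
        return new ArrayList<>(related.getOrDefault(detected, Set.of()));
    }

    public static void main(String[] args) {
        AttackReportSketch report = new AttackReportSketch("ProductA", "Windows");
        report.relate("Adware", "Backdoor Trojan");
        System.out.println("Also scan for: " + report.supplementaryScans("Adware"));
    }
}
```

When adware is reported on a later occasion, a scanner consulting such a report would know to look for the related Trojan as well.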
3.5 Conclusions
We created an ontology for malicious software classification which is able to
aid the development of malware prevention software by offering a common
knowledge base and a clear classification of the existing security issues. We presented
an application prototype which handles antivirus software comparison based on the
information available in the ontology and user-entered data. We also proposed the
SMalL file format, which is a comprehensive way to report software security issues
and opens new possibilities for scanning for software security problems.