Layered approach

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 1, JANUARY-MARCH 2010 35

Layered Approach Using Conditional
Random Fields for Intrusion Detection
Kapil Kumar Gupta, Baikunth Nath, Senior Member, IEEE, and
Ramamohanarao Kotagiri, Member, IEEE

Abstract—Intrusion detection faces a number of challenges; an intrusion detection system must reliably detect malicious activities
in a network and must perform efficiently to cope with the large amount of network traffic. In this paper, we address these two issues
of Accuracy and Efficiency using Conditional Random Fields and Layered Approach. We demonstrate that high attack detection
accuracy can be achieved by using Conditional Random Fields and high efficiency by implementing the Layered Approach.
Experimental results on the benchmark KDD ’99 intrusion data set show that our proposed system based on Layered Conditional
Random Fields outperforms other well-known methods such as the decision trees and the naive Bayes. The improvement in attack
detection accuracy is very high, particularly, for the U2R attacks (34.8 percent improvement) and the R2L attacks (34.5 percent
improvement). Statistical Tests also demonstrate higher confidence in detection accuracy for our method. Finally, we show that our
system is robust and is able to handle noisy data without compromising performance.

Index Terms—Intrusion detection, Layered Approach, Conditional Random Fields, network security, decision trees, naive Bayes.

Ç

1 INTRODUCTION
The signature-based systems are trained by extracting specific
I NTRUSION detection as defined by the SysAdmin, Audit,
Networking, and Security (SANS) Institute is the art of
detecting inappropriate, inaccurate, or anomalous activity
patterns (or signatures) from previously known attacks while
the anomaly-based systems learn from the normal data
[6]. Today, intrusion detection is one of the high priority and collected when there is no anomalous activity [11].
challenging tasks for network administrators and security Another approach for detecting intrusions is to consider
professionals. More sophisticated security tools mean that the both the normal and the known anomalous patterns for
attackers come up with newer and more advanced penetra- training a system and then performing classification on the
tion methods to defeat the installed security systems [4] and test data. Such a system incorporates the advantages of both
[24]. Thus, there is a need to safeguard the networks from the signature-based and the anomaly-based systems and is
known vulnerabilities and at the same time take steps to known as the Hybrid System. Hybrid systems can be very
detect new and unseen, but possible, system abuses by efficient, subject to the classification method used, and can
developing more reliable and efficient intrusion detection also be used to label unseen or new instances as they assign
systems. Any intrusion detection system has some inherent one of the known classes to every test instance. This is
requirements. Its prime purpose is to detect as many attacks possible because during training the system learns features
as possible with minimum number of false alarms, i.e., the from all the classes. The only concern with the hybrid method
system must be accurate in detecting attacks. However, an is the availability of labeled data. However, data requirement
accurate system that cannot handle large amount of network is also a concern for the signature- and the anomaly-based
traffic and is slow in decision making will not fulfill the systems as they require completely anomalous and attack-
purpose of an intrusion detection system. We desire a system free data, respectively, which are not easy to ensure.
that detects most of the attacks, gives very few false alarms, The rest of this paper is organized as follows: In Section 2,
copes with large amount of data, and is fast enough to make we discuss the related work with emphasis on various
real-time decisions. methods and frameworks used for intrusion detection. We
Intrusion detection started in around 1980s after the describe the use of Conditional Random Fields (CRFs) for
influential paper from Anderson [10]. Intrusion detection intrusion detection [23] in Section 3 and the Layered
systems are classified as network based, host based, or
Approach [22] in Section 4. We then describe how to
application based depending on their mode of deployment
integrate the Layered Approach and the CRFs in Section 5.
and data used for analysis [11]. Additionally, intrusion
In Section 6, we give our experimental results and compare
detection systems can also be classified as signature based or
our method with other approaches that are known to
anomaly based depending upon the attack detection method.
perform well. We observe that our proposed system,
Layered CRFs, performs significantly better than other
. The authors are with the Department of Computer Science and Software systems. We study the robustness of our method in Section 7
Engineering, and NICTA Victoria Research Laboratory, The University of by introducing noise in the system. We discuss feature
Melbourne, Parkville 3010, Australia. selection in Section 8 and draw conclusions in Section 9.
E-mail: {kgupta, bnath, rao}@csse.unimelb.edu.au.
Manuscript received 6 Mar. 2007; revised 11 Dec. 2007; accepted 28 Jan.
2008; published online 12 Mar. 2008. 2 RELATED WORK
For information on obtaining reprints of this article, please send e-mail to:
tdsc@computer.org, and reference IEEECS Log Number TDSC-2007-03-0031. The field of intrusion detection and network security has
Digital Object Identifier no. 10.1109/TDSC.2008.20. been around since late 1980s. Since then, a number of
1545-5971/10/$26.00 ß 2010 IEEE Published by the IEEE Computer Society
Authorized licensed use limited to: Asha Das. Downloaded on August 10,2010 at 06:59:52 UTC from IEEE Xplore. Restrictions apply.

36 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 1, JANUARY-MARCH 2010

methods and frameworks have been proposed and many often hard to select the best possible architecture for a neural
systems have been built to detect intrusions. Various network. Support vector machines have also been used for
techniques such as association rules, clustering, naive Bayes detecting intrusions [26]. Support vector machines map real-
classifier, support vector machines, genetic algorithms, valued input feature vector to a higher dimensional feature
artificial neural networks, and others have been applied to space through nonlinear mapping and can provide real-time
detect intrusions. In this section, we briefly discuss these detection capability, deal with large dimensionality of
techniques and frameworks. data, and can be used for binary-class as well as multiclass
Lee et al. introduced data mining approaches for classification. Other approaches for detecting intrusion
detecting intrusions in [30], [31], and [32]. Data mining include the use of genetic algorithm and autonomous and
approaches for intrusion detection include association rules probabilistic agents for intrusion detection [1] and [5]. These
and frequent episodes, which are based on building methods are generally aimed at developing a distributed
classifiers by discovering relevant patterns of program intrusion detection system.
and user behavior. Association rules [8] and frequent To overcome the weakness of a single intrusion detection
episodes are used to learn the record patterns that describe system, a number of frameworks have been proposed, which
user behavior. These methods can deal with symbolic data, describe the collaborative use of network-based and host-
and the features can be defined in the form of packet and based systems [45]. Systems that employ both signature-
connection details. However, mining of features is limited based and behavior-based techniques are discussed in [19]
to entry level of the packet and requires the number of and [41]. In [32], the authors describe a data mining frame-
records to be large and sparsely populated; otherwise, they work for building adaptive intrusion detection models. A
tend to produce a large number of rules that increase the distributed intrusion detection framework based on mobile
complexity of the system [7]. agents is discussed in [12].
Data clustering methods such as the k-means and the fuzzy The most closely related work, to our work, is of Lee et al.
c-means have also been applied extensively for intrusion [30], [31], and [32]. They, however, consider a data mining
detection [36] and [39]. One of the main drawbacks of the approach for mining association rules and finding frequent
clustering technique is that it is based on calculating numeric episodes in order to calculate the support and confidence of
distance between the observations, and hence, the observa- the rules separately. Instead, in our work, we define features
tions must be numeric. Observations with symbolic features from the observations as well as from the observations and
cannot be easily used for clustering, resulting in inaccuracy. the previous labels and perform sequence labeling via the
In addition, the clustering methods consider the features CRFs to label every feature in the observation. This setting is
independently and are unable to capture the relationship sufficient for modeling the correlation between different
between different features of a single record, which further features of an observation. We also compare our work with
degrades attack detection accuracy. [21], which describes the use of maximum entropy principle
Naive Bayes classifiers have also been used for intrusion for detecting anomalies in the network traffic. The key
detection [9]. However, they make strict independence difference between [21] and our work is that the authors in
assumption between the features in an observation result- [21] use only the normal data during training and build a
ing in lower attack detection accuracy when the features are baseline system, i.e., a behavior-based system, while we train
correlated, which is often the case for intrusion detection. our system with both the normal and the anomalous data,
Bayesian network can also be used for intrusion detection i.e., we build a hybrid system. Second, the system in [21] fails
[28]. However, they tend to be attack specific and build a
to model long-range dependencies in the observations,
decision network based on special characteristics of
which can be easily represented in our model. We also
individual attacks. Thus, the size of a Bayesian network
integrate the Layered Approach with the CRFs to gain the
increases rapidly as the number of features and the type of
attacks modeled by a Bayesian network increases. benefits of computational efficiency and high accuracy of
To detect anomalous traces of system calls in privileged detection in a single system.
processes [20], hidden Markov models (HMMs) have been We compare the Layered Approach with the works in [18],
applied in [17], [42], and [43]. However, modeling the system [25], and [41]. The authors in [18] describe the combination of
calls alone may not always provide accurate classification as “strong” classifiers using stacking, where the decision tress,
in such cases various connection level features are ignored. naive Bayes, and a number of other classification methods are
Further, HMMs are generative systems and fail to model used as base classifiers. The authors show that the output
long-range dependencies between the observations [29]. We from these classifiers can be combined to generate a better
further discuss this in detail in Section 3. classifier rather than selecting the best one. In [25], the authors
Decision trees have also been used for intrusion use a combination of “weak” classifiers. The individual
detection [9]. The decision trees select the best features for classification power of weak classifiers is slightly better than
each decision node during the construction of the tree based random guessing. The authors show that a number of such
on some well-defined criteria. One such criterion is to use classifiers when combined using simple majority voting
the information gain ratio, which is used in C4.5. Decision mechanism, provide good classification. In [41], the authors
trees generally have very high speed of operation and high- apply a combination of anomaly and misuse detectors for
attack detection accuracy. better qualification of analyzed events. However, our work is
Debar et al. [14] and Zhang et al. [46] discuss the use of not based upon classifier combination. Combination of
artificial neural networks for network intrusion detection. classifiers is expensive with regard to the processing time
Though the neural networks can work effectively with noisy and decision making. The purpose of classifier combination is
data, they require large amount of data for training and it is to improve accuracy. Rather, our system is based upon serial


GUPTA ET AL.: LAYERED APPROACH USING CONDITIONAL RANDOM FIELDS FOR INTRUSION DETECTION 37

layering of multiple hybrid detectors. From our experiments edges in subgraph S. In addition, the features fk and gk
in Section 6, we show that the Layered CRFs perform better are assumed to be given and fixed. For example, a
than individual classifiers and they are more efficient and Boolean edge feature fk might be true if the observation Xi
accurate than a system based on classifier combination. The is “protocol ¼ tcp,” tag YiÀ1 is “normal,” and tag Yi is
results from individual classifiers at a layer are not combined “normal.” Similarly, a Boolean vertex feature gk might be true
at any later stage in the Layered Approach, and hence, an if the observation Xi is “service ¼ ftp” and tag Yi is “attack.”
attack can be blocked at the layer where it is detected. There is Further, the parameter estimation problem is to find the
no communication overhead among the layers and the central parameters ¼ ð1 ; 2 ; . . . ; 1 ; 2 ; . . .Þ from the training data
N
decision-maker. In addition, since the layers are independent D ¼ ðxi ; yi Þi¼1 with the empirical distribution pðx; yÞ [29].
~
they can be trained separately and deployed at critical CRFs are undirected graphical models used for sequence
locations in a network depending upon the specific require- tagging. The prime difference between CRF and other
ments of a network. Using a stacked system will not give us graphical models such as the HMM is that the HMM, being
the advantage of reduced processing when an attack is generative, models the joint distribution pðy; xÞ, whereas the
detected at the initial layers in the sequential model. CRF are discriminative models and directly model the
In this paper, we show the effectiveness of CRFs for conditional distribution pðyjxÞ, which is the distribution of
intrusion detection. Motivated by our results in [23], we interest for the task of classification and sequence labeling.
perform detailed analysis and show that CRFs are a strong Similar to HMM, the naive Bayes is also generative and
candidate for building robust intrusion detection systems. models the joint distribution. Modeling the joint distribution
We then show that high efficiency can be achieved by has two disadvantages. First, it is not the distribution of
implementing the Layered Approach. Finally, we integrate interest, since the observations are completely visible and the
the Layered Approach and the CRFs to develop a system interest is in finding the correct class for the observations,
that is accurate and performs efficiently. which is the conditional distribution pðyjxÞ. Second, inferring
the conditional probability pðyjxÞ from the modeled joint
distribution, using the Bayes rule, requires the marginal
3 CONDITIONAL RANDOM FIELDS FOR INTRUSION
distribution pðxÞ. To estimate this marginal distribution is
DETECTION difficult since the amount of training data is often limited and
Conditional models are probabilistic systems that are used the observation x contains highly dependent features that are
to model the conditional distribution over a set of random difficult to model and therefore strong independence
variables. Such models have been extensively used in the assumptions are made among the features of an observation.
natural language processing tasks. Conditional models offer This results in reduced accuracy [40]. CRFs, however, predict
a better framework as they do not make any unwarranted the label sequence y given the observation sequence x. This
assumptions on the observations and can be used to model allows them to model arbitrary relationship among different
rich overlapping features among the visible observations. features in an observation x [15]. CRFs also avoid the
Maxent classifiers [37], maximum entropy Markov models observation bias and the label bias problem, which are
[34], and CRFs [29] are such conditional models. The present in other discriminative models, such as the maximum
advantage of CRFs is that they are undirected and are, thus, entropy Markov models. This is because the maximum
free from the Label Bias and the Observation Bias [27]. The entropy Markov models have a per-state exponential model
simplest conditional classifier is the Maxent classifier based for the conditional probabilities of the next state given the
upon maximum entropy classification, which estimates the current state and the observation, whereas the CRFs have a
conditional distribution of every class given the observations single exponential model for the joint probability of the entire
[37]. The training data is used to constrain this conditional sequence of labels given the observation sequence [29].
distribution while ensuring maximum entropy and hence The task of intrusion detection can be compared to many
maximum uniformity. We now give a brief description of the problems in machine learning, natural language processing,
CRFs, which is motivated from the work in [29]. and bioinformatics. The CRFs have proven to be very
Let X be the random variable over data sequence to be successful in such tasks, as they do not make any unwar-
labeled and Y the corresponding label sequence. In ranted assumptions about the data. Hence, we explore the
addition, let G ¼ ðV ; EÞ be a graph such that Y ¼ ðYv Þv2ðV Þ , suitability of CRFs for intrusion detection.
so that Y is indexed by the vertices of G. Then, ðX; Y Þ is a
CRF, when conditioned on X, the random variables Yv 3.1 Motivating Example
obey the Markov property with respect to the graph: The data analyzed by the intrusion detection system for
pðYv jX; Yw ; w 6¼ vÞ ¼ pðYv jX; Yw ; w $ vÞ, where w $ v means classification often has a number of features that are highly
that w and v are neighbors in G, i.e., a CRF is a random field correlated and complex relationships exist between them.
globally conditioned on X. For a simple sequence (or chain) For example, when classifying network connections as
modeling, as in our case, the joint distribution over the label either normal or as attack, a system may consider features
sequence Y given X has the following form: such as “logged in” and “number of file creations.” When
! these features are analyzed individually, they do not
X X provide any information that can aid in detecting attacks.
p ðyjxÞ / exp k fk ðe; yje ; xÞ þ k gk ðv; yjv ; xÞ ; ð1Þ
However, when these features are analyzed together, they
e2E;k v2V ;k
can provide meaningful information, which can be helpful
where x is the data sequence, y is a label sequence, and yjs is for the classification task. Taking another example, the
the set of components of y associated with the vertices or connection level feature such as the “service invoked” at the



Fig. 2. Layered representation.

Our first goal is to improve the attack detection accuracy.
We first compare the accuracy of CRFs for detecting attacks
Fig. 1. Graphical representation of a CRF.
with other methods in Section 6. We consider all the
41 features in the data set for each of the four attack groups
destination provides some information about the class label
separately. As we shall observe, the CRFs outperform other
(in case an attacker sends request to a service that is not
methods for detecting “Unauthorized access to Root” (U2R)
available). This information becomes more concrete and
attacks. They are also effective in detecting the Probe,
aids in classification when analyzed with other features
“Remote to Local” (R2L), and “Denial of Service” (DoS)
such as “protocol type” and “amount of data transferred”
attacks. However, CRFs can be expensive during training
between source and destination (in case the client connects
and testing. For a simple linear chain structure, the time
to an available service such as the ftp and performs data
complexity for training a CRF is OðT L2 NIÞ, where T is the
transfer). These relationships, between different features in
length of the sequence, L is the number of labels, N is the
the observed data, if considered during classification can
number of training instances, and I is the number of
significantly decrease classification error. The CRFs do not
iterations. During inference, the Viterbi algorithm is
consider features to be independent and hence perform
employed, which has a complexity of OðT L2 Þ. The quad-
better when compared with other methods.
ratic complexity is significant when the number of labels is
The data set used in our experiments represents features
large as in language tasks. However, for intrusion detection,
of every session in relational form with only one label for
there are only two labels “normal” and “attack,” and thus,
the entire record. In this case, using a conditional model
the system is very efficient. We further improve the overall
would result in a simple maximum entropy classifier [40].
system performance by using the Layered Approach, which
However, we represent the data in the form of a sequence
decreases T , i.e., the length of the sequence. The Layered
and assign a label to every feature in the sequence using the
Approach is described next.
first-order Markov assumption instead of assigning a single
label to the entire observation. Though, this increases the
complexity but it also increases the attack detection accuracy. 4 LAYERED APPROACH FOR INTRUSION DETECTION
Each record represents a separate connection, and hence, We now describe the Layer-based Intrusion Detection
we consider every record as a separate sequence. We aim to System (LIDS) in detail. The LIDS draws its motivation
model the relationships among features of individual from what we call as the Airport Security model, where a
connections using a CRF, as shown in Fig. 1. In the figure, number of security checks are performed one after the other
features such as duration, protocol, service, flag, and in a sequence. Similar to this model, the LIDS represents a
src_bytes take some possible value for every connection. sequential Layered Approach and is based on ensuring
During training, feature weights are learnt, and during availability, confidentiality, and integrity of data and (or)
testing, features are evaluated for the given observation, services over a network. Fig. 2 gives a generic representa-
which is then labeled accordingly. tion of the framework.
As it is evident from the figure, every label is connected The goal of using a layered model is to reduce computation
to every input feature, which indicates that all the features and the overall time required to detect anomalous events. The
in an observation help in labeling, and thus, a CRF can time required to detect an intrusive event is significant and
model dependencies among the features in an observation. can be reduced by eliminating the communication overhead
Present intrusion detection systems do not consider such among different layers. This can be achieved by making the
relationships among the features in the observations. They
layers autonomous and self-sufficient to block an attack
either consider only one feature, such as in the case of
without the need of a central decision-maker. Every layer in
system call modeling, or assume conditional independence
the LIDS framework is trained separately and then deployed
among different features in the observation as in the case of
a naive Bayes classifier. As we will show from our sequentially. We define four layers that correspond to the
experimental results, the CRFs can effectively model such four attack groups mentioned in the data set. They are Probe
relationships among different features of an observation layer, DoS layer, R2L layer, and U2R layer. Each layer is then
resulting in higher attack detection accuracy. Another separately trained with a small set of relevant features.
advantage of using CRFs is that every element in the Feature selection is significant for Layered Approach and
sequence is labeled such that the probability of the entire discussed in the next section. In order to make the layers
labeling is maximized, i.e., all the features in the observa- independent, some features may be present in more than one
tion collectively determine the final labels. Hence, even if layer. The layers essentially act as filters that block any
some data is missing, the observation sequence can still be anomalous connection, thereby eliminating the need of
labeled with less number of features. further processing at subsequent layers enabling quick



response to intrusion. The effect of such a sequence of layers is illegitimate requests. Hence, for the DoS layer, traffic
that the anomalous events are identified and blocked as soon features such as the “percentage of connections having same
as they are detected. destination host and same service” and packet level features
Our second goal is to improve the speed of operation of such as the “source bytes” and “percentage of packets with
the system. Hence, we implement the LIDS and select a errors” are significant. To detect DoS attacks, it may not be
small set of features for every layer rather than using all the important to know whether a user is “logged in or not.”
41 features. This results in significant performance im-
provement during both the training and the testing of the 5.1.3 R2L Layer
system. In many situations, there is a trade-off between The R2L attacks are one of the most difficult to detect as
efficiency and accuracy of the system and there can be they involve the network level and the host level features.
various avenues to improve system performance. Methods We therefore selected both the network level features such
such as naive Bayes assume independence among the as the “duration of connection” and “service requested”
observed data. This certainly increases system efficiency, and the host level features such as the “number of failed
but it may severely affect the accuracy. To balance this login attempts” among others for detecting R2L attacks.
trade-off, we use the CRFs that are more accurate, though
expensive, but we implement the Layered Approach to 5.1.4 U2R Layer
improve overall system performance. The performance of The U2R attacks involve the semantic details that are very
our proposed system, Layered CRFs, is comparable to that difficult to capture at an early stage. Such attacks are often
of the decision trees and the naive Bayes, and our system content based and target an application. Hence, for U2R
has higher attack detection accuracy. attacks, we selected features such as “number of file
creations” and “number of shell prompts invoked,” while
5 INTEGRATING LAYERED APPROACH WITH we ignored features such as “protocol” and “source bytes.”
CONDITIONAL RANDOM FIELD We used domain knowledge together with the practical
significance and the feasibility of each feature before
In Section 1, we discussed two main requirements for an selecting it for a particular layer. Thus, from the total
intrusion detection system; accuracy of detection and 41 features, we selected only 5 features for Probe layer,
efficiency in operation. As discussed in Sections 3 and 4, 9 features for DoS layer, 14 features for R2L layer, and
respectively, the CRFs can be effective in improving the 8 features for U2R layer. Since each layer is independent of
attack detection accuracy by reducing the number of false every other layer, the feature set for the layers is not
alarms, while the Layered Approach can be implemented to disjoint. The selected features for all the four layers are
improve the overall system efficiency. Hence, a natural presented in Appendix A. We then use the CRFs for attack
choice is to integrate them to build a single system that is detection as discussed in Section 3. However, the difference
accurate in detecting attacks and efficient in operation. Given is that we use only the selected features for each layer rather
the data, we first select four layers corresponding to the four than using all the 41 features. We now give the algorithm
attack groups (Probe, DoS, R2L, and U2R) and perform for integrating CRFs with the Layered Approach.
feature selection for each layer, which is described next. Algorithm
5.1 Feature Selection Training
Step 1: Select the number of layers, n, for the complete
Ideally, we would like to perform feature selection auto-
matically. However, as will be discussed later in Section 8, system.
the methods for automatic feature selection were not found Step 2: Separately perform features selection for each layer.
to be effective. In this section, we describe our approach for Step 3: Train a separate model with CRFs for each layer
selecting features for every layer and why some features using the features selected from Step 2.
were chosen over others. In our system, every layer is Step 4: Plug in the trained models sequentially such that
separately trained to detect a single type of attack category. only the connections labeled as normal are passed
We observe that the attack groups are different in their to the next layer.
impact, and hence, it becomes necessary to treat them Testing
differently. Hence, we select features for each layer based Step 5: For each (next) test instance perform Steps 6
upon the type of attacks that the layer is trained to detect. through 9.
Step 6: Test the instance and label it either as attack or
5.1.1 Probe Layer
normal.
The probe attacks are aimed at acquiring information about Step 7: If the instance is labeled as attack, block it and
the target network from a source that is often external to the
identify it as an attack represented by the layer
network. Hence, basic connection level features such as the
name at which it is detected and go to Step 5. Else
“duration of connection” and “source bytes” are significant
while features like “number of files creations” and “number pass the sequence to the next layer.
of files accessed” are not expected to provide information Step 8: If the current layer is not the last layer in the system,
for detecting probes. test the instance and go to Step 7. Else go to Step 9.
Step 9: Test the instance and label it either as normal or as
5.1.2 DoS Layer an attack. If the instance is labeled as an attack,
The DoS attacks are meant to force the target to stop the block it and identify it as an attack corresponding
service(s) that is (are) provided by flooding it with to the layer name.



TABLE 1
desktop running with Intel(R) Core(TM) 2, CPU 2.4 GHz,
Data Set and 2-Gbyte RAM under exactly the same conditions. We are
mainly interested in the test time efficiency and not in the
time required for training of the model as the real-time
performance of the system depend upon the test efficiency
alone. We note that our system is very efficient during testing.
When we considered all the 41 features, the time taken to test
all the 250,436 attacks was 57 seconds, which reduced to
17 seconds when we performed feature selection and
implemented the Layered Approach. More details will be
presented when we give the detailed results for the
Our final goal is to improve both the attack detection experiments.
accuracy and the efficiency of the system. Hence, we For our results, we give the Precision, Recall, and F -Value
integrate the CRFs and the Layered Approach to build a and not the accuracy alone as with the given data set, it is
single system. We perform detailed experiments and show easy to achieve very high accuracy by carefully selecting the
that our integrated system has dual advantage. First, as sample size. From Table 1, we note that the number of
expected, the efficiency of the system increases signifi- instances for the U2R, Probes, and R2L attacks is very low.
cantly. Second, since we select significant features for each Hence, if we use accuracy as a measure for testing the
layer, the accuracy of the system further increases. This is
performance of the system, the system can be biased and can
because all the 41 features are not required for detecting
attain an accuracy of more than 99 percent for U2R attacks
attacks belonging to a particular attack group. Using more
[16]. However, Precision, Recall, and F -Value are not
features than required can result in fitting irregularities in
dependent on the size of the training and the test samples.
the data, which has a negative effect on the attack detection
They are defined as follows:
accuracy of the system.
TP
Precision ¼
6 EXPERIMENTS TP þ FP
TP
For our experiments, we use the benchmark KDD ’99 Recall ¼
TP þ FN
intrusion data set [3]. This data set is a version of the original
ð1 þ

2 Þ Ã Recall Ã Precision
1998 DARPA intrusion detection evaluation program, which F -Value ¼ ;
is prepared and managed by the MIT Lincoln Laboratory.

2 Ã ðRecall þ PrecisionÞ
The data set contains about five million connection records where TP, FP, and FN are the number of True Positives,
as the training data and about two million connection False Positives, and False Negatives, respectively, and

records as the test data. In our experiments, we use corresponds to the relative importance of precision versus
10 percent of the total training data and 10 percent of the recall and is usually set to 1.
test data (with corrected labels), which are provided We divide the training data into different groups;
separately. This leads to 494,020 training and 311,029 test Normal, Probe, DoS, R2L, and U2R. Similarly, we divide
instances. Each record in the data set represents a connection the test data. We perform 10 experiments for each attack
between two IP addresses, starting and ending at some well- class by randomly selecting data corresponding to that
defined times with a well-defined protocol. Further, every attack class and normal data only. For example, to detect
record is represented by 41 different features. Each record Probe attacks, we train and test the system with Probe
represents a separate connection and is hence considered to attacks and normal data only. We do not add the DoS, R2L,
be independent of any other record. and U2R data when detecting Probes. Not including these
The training data is either labeled as normal or as one of attacks while training allows the system to better learn the
the 24 different kinds of attack. These 24 attacks can be features for Probe attacks and normal events. When such a
grouped into four classes; Probing, DoS, R2L, and U2R. system is deployed online, other attacks such as DoS can
Similarly, the test data is also labeled as either normal or as
either be seen as normal or as Probes. If DoS attacks are
one of the attacks belonging to the four attack groups. It is
detected as normal, we expect them to be detected as attack
important to note that the test data is not from the same
at other layers in the system. However, if the DoS attacks are
probability distribution as the training data, and it includes
detected as Probe, it must be considered as an advantage
specific attack types not present in the training data. This
since the attack is detected at an early stage. Similarly, if
makes the intrusion detection task more realistic [3]. Table 1
some Probe attacks are not detected at the Probe layer, they
gives the number of instances for each group of attack in the
may be detected at subsequent layers. Hence, for four attack
data set.
For our experiments with CRFs, we use the CRF toolkit, classes, we have four independent models, which are
CRF++ [2]. We use the Weka tool [44] to perform experiments trained separately with specific features to detect attacks
with the decision trees and the naive Bayes classifier. We belonging to that particular group. For our experiments, we
develop python and shell scripts for data formatting and report the best, the average, and the worst cases.
implementing the Layered Approach. For all our experi- We represent a single layer, for example the Probe layer,
ments, we perform hybrid detection, as discussed in Section 1, in Fig. 3. Other layers can be constructed similarly.
and use both the normal and the anomalous connections for In Section 6.1, we perform experiments with individual
training the model. We perform our experiments on a layers in the system. In Section 6.2, we represent how to



TABLE 3
Normal and Probes (with Feature Selection)

Fig. 3. Representation of a single layer (e.g., probe layer).

implement the system in real scenario and compare our
results with other systems in Section 6.3. In Section 6.4, we
discuss the significance of our results.

6.1 Building Individual Layers of the System
We perform two sets of experiments. From the first
experiment, we wish to examine the accuracy of CRFs for
intrusion detection. The objective is to see how CRFs perform feature selection, our system achieves much higher
compare with other techniques, which are known to accuracy and there is significant improvement in efficiency.
perform well. We do not consider feature selection, and
the systems are trained using all the 41 features. From this 6.1.2 Detecting Probe Attacks with Feature Selection
experiment, we observe that CRFs perform much better for We used the same set of instances for this experiment as
U2R attacks while the decision trees achieve higher attack used in the previous experiment. However, we perform
detection for Probes and R2L. The difference in attack feature selection for this experiment. Table 3 gives the
detection accuracy for DoS is not significant. We note that results for this experiment.
the reason for better performance of decision trees is that We observe that the system takes only 2.04 seconds to
they perform feature selection. This motivates us to perform label all the 64,759 test instances. The Layered CRFs
our second experiment where we perform feature selection perform better and faster than our previous experiment
by selecting a small set of features for every attack group and are the best choice for detecting Probes. We also note
instead of using all the 41 features. We perform the same that there is no significant advantage with respect to time
experiment with decision trees and naive Bayes and for the layered decision trees as the number of features used
compare the results. We call the integrated models as in normal decision trees and in the layered decision trees is
Layered CRFs, layered decision trees, and layered naive approximately the same, resulting in similar efficiency. We
Bayes, respectively. For better comparison and readability, further note that the Recall and hence the F -Value for the
we give the results for both the experiments together. layered naive Bayes decreases drastically. This can be
explained as follows: The classification accuracy with naive
6.1.1 Detecting Probe Attacks with All 41 Features Bayes generally improves as the number of features
We randomly select about 10,000 normal records and all the increases. However, if the number of features increases to
a very large extent, the estimation tends to become
Probe records from the training data as the training data for
unreliable. As a result, when we use all the 41 features,
detecting Probe attacks. We then use all the normal and
the naive Bayes performs well but when we decrease the
Probe records from the test data for testing. Hence, we have
number of features to five, its classification accuracy
15,000 training instances and 64,759 test instances. Table 2 decreases. From this experiment, we conclude that the
gives the results for the experiments. Layered CRFs are a better choice for detecting Probe
In Table 2, the testing time of 14.53 seconds represents attacks.
the total time taken to label 64,759 test instances. The results
show that the decision trees are more efficient than the CRFs 6.1.3 Detecting DoS Attacks with All 41 Features
and the naive Bayes. This is because they have a small tree
We randomly select about 20,000 normal records and about
structure, often with very few decision nodes, which is very
4,000 DoS records from the training data as the training data
efficient. The attack detection accuracy is also higher for the
for detecting DoS attacks. We then use all the normal and
decision trees as they select the best possible features during
tree construction. However, as we show next, once we DoS records from the test data for testing. Hence, we have
24,000 training instances and 290,446 test instances. Table 4
gives the results for the experiments.
TABLE 2 In Table 4, the testing time of 64.42 seconds represents
Normal and Probes (All 41 Features) the time taken to label all the 290,446 test instances. The
results show that all the three methods considered have
similar attack detection accuracy; however, decision trees
give a slight advantage with regard to test time efficiency.

6.1.4 Detecting DoS Attacks with Feature Selection
We used the same data for this experiment as used in the
previous experiment. However, we perform feature selec-
tion. Table 5 gives the results.
We observe that the system now takes only 15.17 seconds
to label all the 290,446 test instances. The results follow the



TABLE 4 TABLE 7
Normal and DoS (All 41 Features) Normal and R2L (with Feature Selection)

TABLE 5
Normal and DoS (with Feature Selection)
6.1.6 Detecting R2L Attacks with Feature Selection
Table 7 gives the results when we performed feature
selection for detecting R2L attacks.
We observe that the time taken to test all the
76,942 instances is only 5.96 seconds. Further, the
Layered CRFs perform much better than the CRFs (an
increase of about 60 percent), layered decision trees (an
increase of about 125 percent), decision trees (an increase
of about 17 percent), layered naive Bayes (an increase of
about 250 percent), and naive Bayes (an increase of
about 250 percent) and are the best choice for detecting
same trend as in the previous experiment with only a slight the R2L attacks. The Layered CRFs take slightly more
improvement. However, if we consider the testing time we time, which is acceptable since we achieve much higher
find that layered decision trees are a better choice. We also detection accuracy.
note that there is slight increase in the detection accuracy
6.1.7 Detecting U2R Attacks with All 41 Features
when we perform feature selection, but this increase is not
significant. The real advantage is seen in the reduced time We randomly select about 1,000 normal records and all the
for testing, which decreases four folds. U2R records from the training data as the training data for
detecting the User to Root attacks. We then use all the
6.1.5 Detecting R2L Attacks with All 41 Features normal and U2R records from the test data for testing.
We randomly select about 1,000 normal records and all the Hence, we have 1,000 training instances and 60,661 test
R2L records from the training data as the training data for instances. Table 8 gives the results.
detecting R2L attacks. We then use all the normal and In Table 8, the testing time of 13.45 seconds represents
R2L records from the test data for testing. Hence, we have the time taken to label all the 60,661 test instances. In this
2,000 training instances and 76,942 test instances. Table 6 gives experiment, we find that the CRFs are far better than the
the results. other two methods. The F -Value for CRFs is more than
In Table 6, the testing time of 17.16 seconds represents 150 percent with respect to the decision trees and more than
the time taken to label all the 76,942 test instances. We 600 percent with respect to the naive Bayes. The U2R attacks
observe that the decision trees have a higher F -Value, but if are very difficult to detect and most of the present intrusion
we look at the number of false alarms, we find that the CRFs detection systems fail to detect such attacks with acceptable
perform better and have high Precision compared to the reliability. We find that the CRFs can be used to reliably
decision trees and the naive Bayes. detect such attacks.

TABLE 6 TABLE 8
Normal and R2L (All 41 Features) Normal and U2R (All 41 Features)



TABLE 9 TABLE 10
Normal and U2R (with Feature Selection) Confusion Matrix

TABLE 11
Attack Detection at Each Layer (Case 1)

6.1.8 Detecting U2R Attacks with Feature Selection
In this experiment, we used exactly the same set of instances
as we used in the previous experiment. We also perform
feature selection. Table 9 gives the results for this experiment.
We observe that the system takes only 2.67 seconds to label
all the 60,661 test instances. The Layered CRFs are the best category once the system detects an event as anomalous.
choice for detecting the U2R attacks and are far better than Layered Approach not only improves the attack detection,
CRFs (an increase of about 8 percent), layered decision trees but it also helps identify the type of attack once detected
(an increase of about 30 percent), decision trees (an increase because every layer is trained to detect only a particular
of about 184 percent), layered naive Bayes (an increase of category of attack. Hence, if an attack is detected at the U2R
about 38 percent), and naive Bayes (an increase of about layer, it is very likely that the attack is of “U2R” type. This
675 percent). We observe that the attack detection capability enables to perform quick recovery and prevent similar
also increases for the decision trees and the naive Bayes. attacks. Fig. 4 gives the real-time system representation.
It is evident from the results that the accuracy of Layered We integrate the four models (with feature selection) from
CRFs is significantly higher for the U2R, R2L, and the Probe Section 6.1 to develop the final system. In this experiment,
attacks. The difference in accuracy is, however, not sig- we use the same data for training the individual models as
nificant for the DoS attacks. Further, regardless of the used in our previous experiments. However, the data in the
method considered and particularly for the CRFs, the time test set is relabeled either as normal or as attack and all the
required for training and testing the system is drastically data from the test set is passed though the system starting
reduced once we perform feature selection. We also note from the first layer. If layer 1 detects any connection as an
that the increase in detection accuracy is not significant for attack, it is blocked and labeled as “Probe.” Only the events
the layered decision trees and the layered naive Bayes for labeled as “Normal” are allowed to go to the next layer. The
the DoS group of attack. Their accuracy of detection same process is repeated at the next layers where an attack is
decreases for the Probe and R2L attacks while it increases blocked and labeled as “DoS,” “R2L,” or “U2R” at layer 2,
for the U2R attacks. However, we find that in all the cases, layer 3, and layer 4, respectively. We perform all the
the Layered CRFs perform significantly better and can better experiments 10 times and report their average. We give the
learn a model when we use a small set of specific features for results for this experiment in Tables 10, 11, and 12. Table 10
training. gives the confusion matrix where the values represent the
percent detection with respect to each of the five classes.
6.2 Implementing the System in Real Life From the table, we observe that our system can detect
In real scenario, we are not aware of the category of an most of the “Probe” (98.62 percent), “DoS” (97.40 percent),
attack. Rather, we are interested in identifying the attack and “U2R” (86.33 percent) attacks while giving very few
false alarms at each layer. The system can also detect “R2L”
attacks with much higher reliability (29.62 percent) when
compared with the previously reported systems, as we will
discuss later in Section 6.3. The confusion matrix shows that

TABLE 12
Attack Detection at Each Layer (Case 2)

Fig. 4. Real-time representation of the system.



only 71.90 percent of DoS attacks are labeled as DoS during TABLE 13
testing. However, it is very important to note that the Layered versus Nonlayered Approach
accuracy for detecting DoS attacks is not 71.90 percent, but
it is 25:50 þ 71:90 þ 0:00 þ 0:00 ¼ 97:40 percent. This is
because 25.50 percent of the DoS attacks are already
detected at the first layer, though our system identifies
them as probes since they are detected at the first layer. This
is acceptable because in the real environment it is critical to
detect an attack as early as possible to minimize its impact.
It is also important to note that most of the “U2R” attacks
are detected in the third layer itself and hence labeled as
“R2L.” However, if we remove the third layer, the fourth
layer can detect these attacks with similar accuracy. Further,
looking at the “R2L” and “U2R” columns in Table 10, it is accurate in detecting attacks particularly the U2R, the R2L,
natural to think that the two layers can be merged. However, and the Probes.
this has two disadvantages. First, merging the two layers It is important to note that the time should be read in
results in increasing the number of features, which reduces relative terms rather than absolute, as for ease of experi-
efficiency. The merged layer performs poorly with regard to ments we used scripts for implementation. In real environ-
the total time taken when compared with both the ment, high speed can be achieved by implementing the
unmerged layers together. Second, when the layers are complete system in languages with efficient compilers such
merged, the “U2R” attacks are not detected effectively and
as the “C Language.” Further, pipelining can be implemen-
their individual attack detection accuracy decreases. This is
ted in multicore processors, where each core may represent a
because the number of “U2R” instances is very low in the
training data and the system simply learns the features that single layer, and due to pipelining, multiple I/O operations
are specific to the “R2L” attacks. Hence, we prefer separate can be replaced by a single I/O operation providing very
layers for the two attack groups. Using our approach, we can high speed of operation.
hope that any attack, even though its category is unknown,
6.3 Comparison of Results
can be detected at any one of the layers in the system. We
can also increase or decrease the number of layers depend- In this section, we compare our work with other well-known
ing upon the environment where the system is deployed. methods based on the anomaly intrusion detection principle.
We evaluate the performance of each layer in the The anomaly-based systems primarily detect deviations
system in Table 11. From the table, we observe that out of from the learnt normal data by using statistical methods,
all the 250,436 attack instances in the test data set, more machine learning, or data mining approaches. Standard
than 25 percent of the attacks are blocked at layer 1, and techniques such as the decision trees and naive Bayes are
more than 90 percent of all the attacks have been blocked known to perform well. However, our experiments show
by the end of layer 2. Thus, the Layered Approach is very that the Layered CRFs perform far better than these
effective in reducing the attack traffic at each layer in the techniques. The main reason for this is that the CRFs do not
system. The configuration takes 21 seconds to classify all consider the observation features to be independent. In [38],
the 250,436 attacks. the authors present a comparative study of various classifiers
We can do further optimization by putting the DoS layer when applied to the KDD ’99 data set, and in [13], the authors
before the Probe layer. We can do this because the data is propose the use of Principle Component Analysis (PCA)
relational and each layer in our system is independent. before applying a machine learning algorithm. Use of
Putting the DoS layer before the Probe layer serves dual support vector machines is discussed in [26]. We compare
advantage as most of the attacks are detected at the first our results from the results presented in these papers in
layer itself and the overall system performs efficiently. This Table 14. The table represents the Probability of Detection
optimization becomes significant in severe attack situations (PD) and False Alarm Rate (FAR) in percent for various
when the target is overwhelmed with illegitimate connec- methods including the KDD ’99 cup winners.
tions. The results are presented in Table 12. From the table, we observe that the Layered CRFs
We observe that the Layered Approach can be very perform significantly better than the previously reported
effective in restricting the attack traffic to the initial layers in results including the winner of the KDD ’99 cup and
the system. We also performed experiments when we do various other methods applied to this data set. The most
not implement the Layered Approach, i.e., we consider only impressive part of the Layered CRFs is the margin of improve-
a single system that is trained with two classes (normal and ment as compared with other methods. Layered CRFs have very
attack). In this system, all the Probes, DoS, R2L, and U2R high attack detection of 98.6 percent for Probes (5.8 percent
attacks are labeled as “attack.” We perform experiments improvement) and 97.40 percent detection for DoS. They
both with and without feature selection. For feature outperform by a significant percentage for the R2L (34.5 percent
selection, we consider 21 features, which are selected by improvement) and the U2R (34.8 percent improvement) attacks.
applying the union operation on the feature sets of all the
four attack types. We compare these results with the 6.4 Discussion and Issues
Layered Approach in Table 13. From our experiments and the comparison in Table 14, we
We observe that a system implementing the Layered conclude that the Layered CRFs can be very effective in
Approach with feature selection is more efficient and more detecting the Probe, the U2R, and the R2L attacks as well as



TABLE 14 TABLE 15
Comparison of Results Ranking for the Six Methods

scores better. Our integrated system also has the advantage
that any method can be used in the layers of the system. This
gives flexibility to the user to decide between the time and
accuracy trade-off. Furthermore, we can increase or decrease
the number of layers in the system depending upon the task
requirement. Finally, our system can be used for performing
analysis on attacks because the attack category can be
inferred from the layer at which the attack is detected.
To determine the statistical significance of our results, we
compare our proposed method (Layered CRFs) with others
for detecting Probes, DoS, R2L, and U2R attacks. We use the
Wilcoxon sum rank test with 95 percent confidence interval
to discriminate the performance of these methods. Table 15
gives the ranking for various methods compared, where a
system with rank 1 is the best.
The results of the Wilcoxon test indicate that the Layered
CRFs are much better (or equal) for detecting attacks. Thus,
we conclude that the Layered CRFs are a strong candidate
the DoS attacks. However, if we consider all the 41 features for building robust and efficient intrusion detection systems.
given in the data set, we find that the time required to train
and test the model is high. To address this, we performed 7 EFFECT OF NOISE
experiments with our integrated system by implementing a
four-layer system. The four layers correspond to Probe, Ideally, we would like to perform similar experiments with
DoS, R2L, and U2R. For each layer, we then selected a set of a large number of data sets. However, given the domain of
features that is sufficient to detect attacks at that particular the problem, there are no other data sets that are freely
layer. Feature selection for each layer enhances the available, which can be used for our experiments. To
performance of the entire system. The runtime (testing) ameliorate this problem to some extent, we add substantial
amount of noise in the training data and perform similar
performance of our model is comparable with other
experiments to study the robustness of these systems. By
methods; however, the time required to train the model is
experimenting with noisy data, we want to determine the
slightly higher. We also observe that feature selection not
sensitivity of the proposed scheme with respect to noise. If
only decreases the time required to test an instance, but it
the system performs poorly with noisy data, the results
also increases the accuracy of attack detection. This is
could be an artifact of the data set.
because using more features than required can generate
superfluous rules often resulting in fitting irregularities in 7.1 Addition of Noise to Data
the data, which can misguide classification. From our Addition of noise was controlled by two parameters, the
experimental results, we conclude that the main strength probability of adding noise to a feature, p, and the scaling
of our method lies in detecting the R2L and the U2R attacks, factor, s, for a feature. We performed four sets of experi-
which are not satisfactorily detected by other methods. Our ments with noisy data, separately, one for each layer. For
method gives slight improvement for detecting Probe each layer, we varied the parameter p between 0 and 0.95 (by
attacks and was similar in accuracy when compared with keeping it at values 0.10, 0.20, 0.33, 0.50, 0.75, 0.90, and 0.95)
other methods for detecting the DoS attacks. and varied the parameter s between À1,000 and þ1,000. In
The prime reason for better detection accuracy for the case when the original feature was “0,” noise was added to
CRFs is that they do not consider the observation features to any feature by using an additive function (random value
be independent. CRFs evaluate all the rules together, which between À1,000 and þ1,000) instead of scaling. Figs. 5, 6, 7,
are applicable for a given observation. This results in and 8 represent the effect of noise on each layer separately.
capturing the correlation among different features of the We find that our integrated system is robust to noise in
observation resulting in higher accuracy. Considering both the training data and performs better than other methods
the accuracy and the time required for testing, our system for all of the four attack groups.



Fig. 5. Effect of noise on Probe layer. Fig. 8. Effect of noise on U2R layer.

We then used PCA for dimensionality reduction [13].
However, the main drawback of using PCA in our task is
that PCA transforms a large number of possibly correlated
features into a small number of uncorrelated features known
as the principle components. Hence, when we applied PCA
followed by CRFs in the newly transformed feature space,
the method did not provide significant advantage as the
strength of our approach is to model correlation among
features and the features in the new space are independent.
We also note that, to construct a decision tree, the
C4.5 algorithm performs feature selection. We selected the
Fig. 6. Effect of noise on DoS layer. same features as selected by the C4.5 algorithm and then
performed experiments with only those features. However,
8 AUTOMATIC FEATURE SELECTION there was no significant improvement in the results.
From our experiments in the previous sections, we showed We also performed experiments with the method
the advantages of performing feature selection and im- proposed in [33] for efficiently inducing features for a
plementing the Layered Approach for attack detection. We CRF. The method is based upon iteratively constructing
performed our experiments by manually selecting features feature conjunctions that would significantly increase the
for different layers. However, we want to compare the conditional log-likelihood if added to the model. We used
results of manual feature selection with the results of the Mallet tool [35] for performing these experiments and
automatic feature selection (and no feature selection) for all compare the results with our previous results based on
the layers. Hence, we investigated various methods for Layered CRFs with manual feature selection in Table 16. We
automatic feature selection. observed that both the systems, with automatic and manual
We experimented with a feed forward neural network to feature selections, had similar test time performance, but
determine the weights for all the 41 features. Features with the accuracy of detection when features were induced
weights close to zero were discarded. As a result, only a automatically was significantly lower than our system
small set of features was selected for each layer. However, based upon manual feature selection.
when we performed the experiments on the reduced set of It was not surprising that manual feature selection
features and compared the results, there was no significant performed better than automatic feature selection. How-
improvement in the detection accuracy, though there was ever, we note that automatic feature selection for Layered
reduction in training and testing time. CRFs performed better than the decision trees, particularly
for the R2L and the U2R attacks. This suggests that Layered

TABLE 16
Feature Selection

Fig. 7. Effect of noise on R2L layer.


Layered approach

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (17)

Semelhante a Layered approach

Semelhante a Layered approach (20)

Mais de ingenioustech

Mais de ingenioustech (20)

Último

Último (20)

Layered approach