1. `
Traffic Classification based on Machine Learning
using Flow-level Information
Jong Gun Lee (jglee@an.kaist.ac.kr)
Advanced Networking Lab.
2. `
Table of Contents
• Motivation of this work
• Background about machine learning
• Our approach using machine learning
• Experiment (dataset and result)
• Conclusion
3. `
Motivation
• We cannot effectively classify the traffic of newly
emerging applications,
– such as online games and streaming applications
– because they expose no application information, such as a
well-known port number or a common byte sequence in the payload
We propose a methodology to classify Internet traffic
with supervised and unsupervised learning
4. `
Basic Terminologies of Machine Learning
• Classifier
is a mapping from unlabeled instances to classes
• Instance
is a single object of the world
• Attribute
is a quantity describing an instance
• Feature
is the specification of an attribute and its value
• Feature vector
is a list of features describing an instance
5. `
Unsupervised and Supervised Learning
• Supervised learning (with answer/teacher)
– From a labeled training set, a classifier learns the characteristics
of each class. When a new instance arrives, the classifier predicts
the class of that instance.
• Unsupervised learning (without answer/teacher)
– Given only a set of unlabeled data (feature vectors), the learner
groups the data into a set of clusters.
6. `
K-Means
• One of the unsupervised learning methods
• K is the number of clusters and is given as an input
parameter
• Procedure
– First, the classifier randomly chooses K points as the centers of
K subspaces
– Second, it divides the overall vector space into K subspaces
according to the centers
– Third, it picks a new center for each of the K subspaces
– Then it repeats the 2nd and 3rd steps until no center changes,
or until every center moves by less than a threshold value
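The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `kmeans`, the parameters `threshold`, `max_iter` and `seed`, the use of Euclidean distance, and the choice of the mean as the new center are all assumptions.

```python
import random


def kmeans(points, k, threshold=1e-6, max_iter=100, seed=0):
    """Cluster `points` (lists of floats) into k groups.

    Returns (centers, labels), where labels[i] is the index of the
    cluster that points[i] was assigned to in the last iteration.
    """
    rng = random.Random(seed)
    # Step 1: pick K of the input points at random as the initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center
        # (squared Euclidean distance; assumed metric).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Step 4: stop once no center moved farther than the threshold.
        shift = max(
            sum((x - y) ** 2 for x, y in zip(old, new)) ** 0.5
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift <= threshold:
            break
    return centers, labels
```

For example, `kmeans(vectors, 2)` on a list of two-dimensional feature vectors returns two centers and one cluster index per vector.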
8. `
Overall Process of Our Method
[Figure: process diagram. Box labels: Feature Extraction, Unsupervised Learning, Supervised Learning, Classification Method. Data labels: N packets, N feature vectors, K Clusters, Classifier.]
9. `
Flow-level Feature Information
• Protocol number: 6(TCP) or 17(UDP)
• Duration: seconds
• Number of packets per second (PPS)
• Mean size of all packets
• Mean size of non-ACK packets
• Rate of ACK packets
• Interaction Information
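As a rough sketch, the scalar features listed above could be computed from the packets of one flow as follows. The `Packet` record, its field names, and the function `flow_features` are hypothetical. The slides do not say how ACK packets are identified or how "rate of ACK packets" is defined, so the `is_ack` flag and the reading of the rate as the fraction of packets that are ACKs are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical per-packet record; field names are assumptions."""
    timestamp: float  # capture time in seconds
    size: int         # packet size in bytes
    is_ack: bool      # True for an ACK packet (identification rule assumed)


def flow_features(protocol, packets):
    """Return the scalar flow-level features for one flow as a dict."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    pkts = sorted(packets, key=lambda p: p.timestamp)
    n = len(pkts)
    duration = pkts[-1].timestamp - pkts[0].timestamp
    non_ack_sizes = [p.size for p in pkts if not p.is_ack]
    return {
        "protocol": protocol,  # 6 (TCP) or 17 (UDP)
        "duration": duration,  # seconds
        # Packets per second; defined as 0 for a zero-length flow (assumption).
        "pps": n / duration if duration > 0 else 0.0,
        "mean_size": sum(p.size for p in pkts) / n,
        "mean_size_non_ack": (sum(non_ack_sizes) / len(non_ack_sizes)
                              if non_ack_sizes else 0.0),
        # Fraction of packets that are ACKs (assumed meaning of "rate").
        "ack_rate": (n - len(non_ack_sizes)) / n,
    }
```

The resulting values, together with the interaction histogram, would form the feature vector of the flow.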
10. `
Feature Extraction (Interaction Information)
• Interaction Information
– H: 2-dimensional histogram, 16x16
– p_1, p_2, p_3, …, p_n
• the sequence of packet sizes of a flow and its partner flow,
ordered by timestamp
For i = 1 : n-1
    H[floor(p_i / 100)][floor(p_(i+1) / 100)]++
Sequence of packet sizes: 40, 80, 1500, …, 40, 1500
Pair-wise representation: [40, 80], [80, 1500], …, [40, 1500]
Histogram bins: [40/100, 80/100], [80/100, 1500/100], …, [40/100, 1500/100]
= [0, 0], [0, 15], …, [0, 15]
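The histogram construction above can be written as a short Python function. This is a sketch: the name `interaction_histogram` and its parameters are illustrative, and clamping sizes of 1600 bytes or more into the last bin is an assumption, since the slides do not say how such packets are handled.

```python
def interaction_histogram(sizes, bins=16, bin_width=100):
    """Build the bins x bins histogram of consecutive packet-size pairs.

    `sizes` is the timestamp-ordered sequence of packet sizes (bytes)
    of a flow and its partner flow.
    """
    hist = [[0] * bins for _ in range(bins)]
    # Walk over consecutive pairs (p_i, p_(i+1)).
    for a, b in zip(sizes, sizes[1:]):
        # Integer division maps a size to its bin; oversized packets are
        # clamped into the last bin (assumption, not stated in the slides).
        i = min(a // bin_width, bins - 1)
        j = min(b // bin_width, bins - 1)
        hist[i][j] += 1
    return hist
```

With the sizes 40, 80, 1500, 40, 1500, the pairs [40, 80], [80, 1500], [1500, 40] and [40, 1500] fall into bins [0, 0], [0, 15], [15, 0] and [0, 15].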
19. `
To do list
• Direction
– Rx and Tx, Rx only, and Tx only
• Dynamic bin size
• Initial N packets or all the packets
• Different (un)supervised learning methods
• Different feature extraction methods