Document clustering for forensic analysis

PROJECT INTERNAL GUIDE:
J. JAGADEESWAR REDDY, M.TECH,
ASSISTANT PROFESSOR,
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING,
A.S.C.E.T.
PRESENTED BY:
T. SANJAY HARSHA (10G21A05A5),
P. JAYAKUMAR (10G21A0579),
B. SREENIVASA TEJA (11G25A0501),
SK. ZAHID (10G21A05B9).
DOCUMENT CLUSTERING FOR FORENSIC ANALYSIS:
AN APPROACH FOR IMPROVING COMPUTER INSPECTION

AGENDA
1. Abstract
2. Introduction
3. Existing systems and Disadvantages
4. Proposed systems and Advantages
5. System Architecture
6. UML Diagrams
7. Modules and their explanation
8. System Configurations
9. I/O Screen Shots
10. Conclusion
11. Future enhancements

ABSTRACT
• In computer forensic analysis, hundreds of thousands of files are
usually examined. Much of the data in those files consists of
unstructured text, which is difficult for computer examiners to analyze.
• Algorithms for clustering documents can facilitate the
discovery of new and useful knowledge from the documents under
analysis.
• We present an approach that applies document clustering algorithms
to the forensic analysis of computers seized in police investigations.

FORENSIC COMPUTING -
INTRODUCTION
• Digital forensics is a branch of forensic science encompassing the recovery and
investigation of material found in digital devices, often in relation to computer
crime.
• Computer forensics is the application of investigation and analysis techniques to
gather and preserve evidence from computing devices.
• The volume of data in the digital world increased from 161 exabytes in 2006 to 988
exabytes in 2010.
• This growth has a direct impact on computer forensics.

EXISTING SYSTEM
• From a more technical viewpoint, our datasets consist of unlabeled
objects: the classes or categories of documents that can be found are
a priori unknown.
• Even if labeled data were available from earlier cases, a new data sample
would likely come from a different population.
• In this context, clustering algorithms are used to find latent patterns in
text documents found in seized computers.

DISADVANTAGES OF EXISTING SYSTEM
• The literature on computer forensics only reports the use of algorithms
that assume that the number of clusters is known and fixed a priori by the
user.
• This assumption is often unrealistic in practical applications. To relax it,
a common approach in other domains is to estimate
the number of clusters from the data.

Example of unstructured data

Examples of structured and unstructured data

PROPOSED SYSTEM
• We chose a set of six representative algorithms to
show the potential of the proposed approach, namely:
  • the partitional K-means and K-medoids,
  • the classical hierarchical algorithms Single/Complete/Average Link, and
  • the cluster ensemble algorithm known as CSPA.
• These algorithms were run with different combinations of their
parameters, resulting in sixteen different algorithmic instantiations.

ADVANTAGES OF PROPOSED SYSTEM
• Most importantly, we observed that clustering algorithms indeed tend to
induce clusters formed by either relevant or irrelevant documents, thus
making the expert examiner’s job easier.
• Furthermore, our evaluation of the proposed approach in applications
shows that it has the potential to speed up the computer inspection process.

UML Diagrams
• Use Case Diagram of the system. Nodes: Documents, Preprocessing, Term Frequency,
Similarity Calculation, Cluster Formation, Evaluating Query Results.

• Activity Diagram of the system. Nodes: Document, Preprocessing, Valid (decision: YES / NO),
Similarity Computation, Term Frequency, Unconsidered, Cluster Formation, Query Results.

Data Flow Diagram
Nodes: Documents, Preprocessing, Term Frequency, Similarity Calculation,
Cluster Formation, Query Processing, Evaluating Results.

MODULES
• Preprocessing Module
• Calculating the number of clusters
• Clustering techniques
• Removing Outliers

Preprocessing:
• Stopwords (prepositions, pronouns, articles, and irrelevant document
metadata) are removed.
• The Snowball stemming algorithm for Portuguese words is applied.
• Documents are represented in a vector space model.
• Term Variance (TV) is used to select terms, which increases the effectiveness
and efficiency of the clustering algorithms.
• Two distance measures are used: cosine-based distance and Levenshtein-based distance.
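
The preprocessing steps above can be illustrated with a minimal Python sketch. It is not the project's Java implementation. The stopword list is a tiny hypothetical English one, stemming is skipped (a Snowball stemmer would need an external library), and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny English stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for"}

def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tf_vectors(docs):
    """Vector space model: return (vocabulary, one term-frequency vector per document)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

def term_variance(vectors):
    """Variance of each term's frequency across all documents."""
    n = len(vectors)
    variances = []
    for column in zip(*vectors):
        mean = sum(column) / n
        variances.append(sum((x - mean) ** 2 for x in column) / n)
    return variances

def select_top_terms(vocab, vectors, keep):
    """Keep only the `keep` terms with the highest variance."""
    tv = term_variance(vectors)
    top = sorted(range(len(vocab)), key=lambda j: tv[j], reverse=True)[:keep]
    top.sort()  # preserve the original term order
    return [vocab[j] for j in top], [[v[j] for j in top] for v in vectors]

def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def levenshtein(s, t):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]
```

Here the cosine-based distance compares term-frequency vectors, while the Levenshtein distance compares short strings. Which inputs each measure is applied to is a design decision that the slides do not spell out.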

Calculating the number of Clusters:
• A widely used approach consists of generating a set of data partitions with different
numbers of clusters and then selecting the partition that provides
the best result according to a specific quality criterion (e.g., a relative validity
index such as the silhouette).

• Both classical hierarchical clustering algorithms and partitional
algorithms like K-means are used.
• Let us assume that a set of data partitions with different numbers of clusters
is available, from which we want to choose the best one.
• For an object i belonging to cluster A, the average dissimilarity of i to all
other objects of A is denoted a(i).
• The average dissimilarity of i to all objects of another cluster C is called d(i, C).
• The smallest one is selected, i.e., b(i) = min d(i, C) over all clusters C other than A.
• b(i) is the average dissimilarity of i to its neighboring cluster, and the silhouette for a
given object, s(i), is given by:
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \]
• Thus, the higher s(i), the better the assignment of object i to its cluster. The best
partition is the one that has the maximum average silhouette.
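
As a sketch of how the silhouette criterion can be computed, the following Python code takes a precomputed distance matrix and a list of cluster labels. The function names are assumptions made for illustration. Setting s(i) = 0 for an object that is alone in its cluster is a common convention assumed here, not something stated on the slide.

```python
def silhouette_values(dist, labels):
    """Return s(i) for every object, given a distance matrix and one label per object."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            # Singleton cluster, or only one cluster overall: use the neutral value 0.
            values.append(0.0)
            continue
        # a(i): average dissimilarity to the other objects of i's own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b(i): smallest average dissimilarity d(i, C) to any other cluster C.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append(0.0 if denom == 0 else (b - a) / denom)
    return values

def average_silhouette(dist, labels):
    """Quality of a whole partition: the mean of s(i) over all objects."""
    s = silhouette_values(dist, labels)
    return sum(s) / len(s)

def best_partition(dist, candidate_partitions):
    """Pick the candidate labeling with the maximum average silhouette."""
    return max(candidate_partitions, key=lambda labels: average_silhouette(dist, labels))
```

In use, `candidate_partitions` would hold the labelings produced by running a clustering algorithm with different numbers of clusters, and `best_partition` would return the one to keep.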

Clustering techniques:
• The partitional K-means and K-medoids, the hierarchical
Single/Complete/Average Link, and the cluster ensemble based algorithm
known as CSPA are popular in the machine learning and data mining fields,
and therefore they have been used in our study.
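
To make the hierarchical family concrete, here is a minimal Python sketch of agglomerative clustering with a selectable linkage (single, complete, or average) over a precomputed distance matrix. It is a naive, unoptimized illustration with assumed names. K-means, K-medoids, and CSPA are not shown.

```python
def agglomerative(dist, k, linkage="average"):
    """Merge the two closest clusters until only k remain; return one label per object."""
    if linkage not in ("single", "complete", "average"):
        raise ValueError("linkage must be 'single', 'complete', or 'average'")

    # Start with every object in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(a, b):
        pairwise = [dist[i][j] for i in a for j in b]
        if linkage == "single":      # distance between the closest pair
            return min(pairwise)
        if linkage == "complete":    # distance between the farthest pair
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # average over all pairs

    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected

    labels = [0] * len(dist)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels
```

Running this for several values of `k` yields the set of candidate partitions from which a validity index such as the silhouette can select one.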

Removing Outliers:
• We assess a simple approach to remove outliers. This approach makes
recursive use of the silhouette.
• Fundamentally, if the best partition chosen by the silhouette has singletons
(i.e., clusters formed by a single object only), these are removed, and the
clustering is repeated on the remaining objects until no singletons are left.
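
The singleton-removal loop can be sketched in Python as follows. The `cluster_fn` parameter is a hypothetical callable that stands in for the whole step of running the clustering algorithms and choosing the best partition by silhouette. It takes a list of objects and returns one label per object.

```python
from collections import Counter

def remove_singletons(objects, cluster_fn):
    """Cluster repeatedly, setting aside members of single-object clusters each round.

    Returns (kept_objects, labels_for_kept_objects, outliers).
    """
    remaining = list(objects)
    outliers = []
    while remaining:
        labels = cluster_fn(remaining)
        sizes = Counter(labels)
        singles = [obj for obj, lab in zip(remaining, labels) if sizes[lab] == 1]
        if not singles:
            # The chosen partition has no singletons, so stop here.
            return remaining, labels, outliers
        outliers.extend(singles)
        remaining = [obj for obj, lab in zip(remaining, labels) if sizes[lab] > 1]
    # Every object ended up as an outlier.
    return remaining, [], outliers
```

The outliers are returned rather than discarded, so an examiner can still inspect them separately.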

HARDWARE REQUIREMENTS
• Processor - Pentium IV
• Speed - 1.1 GHz
• RAM - 256 MB (min)
• Hard Disk - 20 GB
• Keyboard - Standard Windows Keyboard
• Mouse - Two or Three Button Mouse
• Monitor - SVGA

SOFTWARE REQUIREMENTS
• Operating System : Windows XP
• Programming Language : Java
• Java Version : JDK 1.6 and above
• Database : MySQL 4.5
• IDE : NetBeans 7.4

An iterative semi-supervised machine learning approach

CONCLUSION
• The proposed approach can become an ideal application of
document clustering to the forensic analysis of computers, laptops, and hard disks
seized during police investigations.
• Several practical results of this work are useful
for experts working in forensic computing departments.

FUTURE ENHANCEMENTS
• Aimed at further leveraging the use of data clustering algorithms in
similar applications, a promising avenue for future work involves
investigating automatic approaches for cluster labelling.
• Investigating the possibility of adding semantics to the cluster calibration
procedure.
• Another line of research could consist of using supervised learning tools
to assign data to already defined categories for investigative
purposes.

REFERENCE
• Luís Filipe da Cruz Nassif and Eduardo Raul Hruschka,
“Document Clustering for Forensic Analysis: An Approach for
Improving Computer Inspection,” IEEE Transactions on
Information Forensics and Security, vol. 8,
no. 1, January 2013.
