Document clustering for forensic analysis
1. PROJECT INTERNAL GUIDE:
J. JAGADEESWAR REDDY, M.TECH,
ASSISTANT PROFESSOR,
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING,
A.S.C.E.T.
PRESENTED BY:
T. SANJAY HARSHA (10G21A05A5),
P. JAYAKUMAR (10G21A0579),
B. SREENIVASA TEJA (11G25A0501),
SK. ZAHID (10G21A05B9).
DOCUMENT CLUSTERING FOR FORENSIC
ANALYSIS: AN APPROACH FOR IMPROVING
COMPUTER INSPECTION
2. AGENDA
1. Abstract
2. Introduction
3. Existing systems and Disadvantages
4. Proposed systems and Advantages
5. System Architecture
6. UML Diagrams
7. Modules and their explanation
8. System Configurations
9. I/O Screen Shots
10. Conclusion
11. Future enhancements
3. ABSTRACT
In computer forensic analysis, hundreds of thousands of files are
usually examined. Much of the data in those files consists of
unstructured text, which is difficult for computer examiners to
analyze.
In particular, algorithms for clustering documents can facilitate the
discovery of new and useful knowledge from the documents under
analysis.
We present an approach that applies document clustering algorithms
to the forensic analysis of computers seized in police investigations.
4. FORENSIC COMPUTING -
INTRODUCTION
• Digital forensics is a branch of forensic science encompassing the recovery and
investigation of material found in digital devices, often in relation to computer
crime.
• Computer forensics is the application of investigation and analysis techniques to
gather and preserve evidence.
• The volume of data in the digital world increased from 161 exabytes in 2006 to 988
exabytes in 2010.
• This growth has a direct impact on computer forensics.
5. EXISTING SYSTEM
From a more technical viewpoint, our datasets consist of unlabeled
objects: the classes or categories of documents that can be found are a
priori unknown.
A new data sample is likely to come from a different population.
In this context, clustering algorithms are used to find latent patterns in
text documents found in seized computers.
6. DISADVANTAGES OF EXISTING SYSTEM
The literature on Computer Forensics only reports the use of algorithms
that assume that the number of clusters is known and fixed a priori by the
user.
Aimed at relaxing this assumption, which is often unrealistic in practical
applications, a common approach in other domains involves estimating
the number of clusters from data.
10. PROPOSED SYSTEM
Here, we chose a set of six representative algorithms in order
to show the potential of the proposed approach, namely:
the partitional K-means and K-medoids,
the classical hierarchical Single/Complete/Average Link, and
the cluster ensemble algorithm known as CSPA.
These algorithms were run with different combinations of their
parameters, resulting in sixteen different algorithmic instantiations.
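As an illustration of one of the partitional algorithms named above, the following is a minimal Python sketch of K-medoids working on a precomputed distance matrix. It is not the project's own code: the function names, the explicit `initial_medoids` argument, and the one-dimensional toy data are assumptions made for the example.

```python
def assign(dist, medoids):
    """Label each object with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if its cluster became empty.
                new_medoids.append(old)
                continue
            # The new medoid is the member with the smallest total distance
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data (assumption): six points on a line forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = k_medoids(dist, initial_medoids=[0, 1])
print(medoids, labels)
```

Because the sketch only needs a distance matrix, the same function can be used with any of the document distances discussed later in the deck.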
11. ADVANTAGES OF PROPOSED SYSTEM
Most importantly, we observed that clustering algorithms indeed tend to
induce clusters formed by either relevant or irrelevant documents, thus
helping the expert examiner's job.
Furthermore, our evaluation of the proposed approach in applications
shows that it has the potential to speed up the computer inspection process.
18. Preprocessing:
In this step, stopwords (prepositions, pronouns, articles, and irrelevant document
metadata) are removed.
Also, the Snowball stemming algorithm for Portuguese words is used.
Documents are then represented in a vector space model.
Term Variance (TV) is used to reduce the number of terms, which increases the
effectiveness and efficiency of the clustering algorithms.
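The preprocessing steps above can be sketched as follows. This is a simplified Python illustration, not the project's implementation: the tiny English stopword list, the whitespace tokenizer, the raw term counts, and the helper names are all assumptions, and stemming is omitted.

```python
from collections import Counter

# Hypothetical toy stopword list; a real run would use a full list
# for the language of the corpus.
STOPWORDS = {"the", "a", "of", "in", "to", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def term_frequency_matrix(docs):
    """Return (vocab, rows) where rows[d][t] counts term t in document d."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[t] for t in vocab] for c in counts]
    return vocab, rows


def term_variance(rows):
    """Variance of each term's frequency across all documents."""
    n = len(rows)
    variances = []
    for col in zip(*rows):
        mean = sum(col) / n
        variances.append(sum((x - mean) ** 2 for x in col) / n)
    return variances


def select_top_terms(vocab, rows, k):
    """Keep the k terms with the highest variance (ties broken alphabetically)."""
    tv = term_variance(rows)
    ranked = sorted(range(len(vocab)), key=lambda j: (-tv[j], vocab[j]))
    keep = sorted(ranked[:k])
    return [vocab[j] for j in keep], [[r[j] for j in keep] for r in rows]


docs = [
    "the invoice of the bank transfer",
    "bank transfer and bank transfer invoice",
    "holiday photo in the album",
]
vocab, rows = term_frequency_matrix(docs)
top_vocab, reduced = select_top_terms(vocab, rows, 2)
print(top_vocab, reduced)
```

Terms whose counts barely change from one document to the next carry little information for separating clusters, which is why the sketch keeps only the highest-variance columns.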
Two distance measures are used, namely:
cosine-based distance and Levenshtein-based distance.
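The two measures can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, treating an all-zero vector as maximally distant is an assumption, and the length normalization of the Levenshtein distance is one common choice rather than something the slides specify.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an empty vector is maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def levenshtein(s, t):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein(s, t):
    """Edit distance divided by the longer length, giving a value in [0, 1]."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


print(cosine_distance([1, 2, 0], [2, 4, 0]))
print(levenshtein("report.doc", "report.docx"))
print(normalized_levenshtein("kitten", "sitting"))
```

Cosine distance ignores vector length, so a long and a short document with the same term proportions are treated as identical, while the edit distance compares short strings character by character.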
19. Calculating the number of Clusters:
A widely used approach consists of getting a set of data partitions with different
numbers of clusters and then selecting that particular partition that provides
the best result according to a specific quality criterion (e.g., a relative validity
index called Silhouettes).
20. • Both classical hierarchical clustering algorithms and partitional
algorithms like K-means are used.
• Let us assume that a set of data partitions with different numbers of clusters
is available, from which we want to choose the best one.
• For an object i belonging to cluster A, the average dissimilarity of i to all
other objects of A is denoted a(i).
• For any other cluster C, the average dissimilarity of i to all objects of C
is denoted d(i, C).
• The smallest one is selected, i.e., b(i) = min d(i, C) over all C ≠ A. This is
the average dissimilarity of i to its neighboring cluster.
• The silhouette for a given object i is given by:
s(i) = (b(i) - a(i)) / max{a(i), b(i)}
• Thus, the higher s(i), the better the assignment of object i to its cluster.
The best partition is the one that has the maximum average silhouette.
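The silhouette computation and the selection of the best partition can be sketched in Python as follows. The function names and the toy data are assumptions made for the example, and assigning a silhouette of 0 to objects in singleton clusters is a common convention rather than something stated on the slides.

```python
def silhouette_values(dist, labels):
    """Silhouette s(i) of every object, from a distance matrix and labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    values = []
    for i, lab_i in enumerate(labels):
        own = clusters[lab_i]
        if len(own) == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        # a(i): average dissimilarity to the other members of its own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b(i): smallest average dissimilarity to any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != lab_i
        )
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


def average_silhouette(dist, labels):
    values = silhouette_values(dist, labels)
    return sum(values) / len(values)


def best_partition(dist, candidates):
    """Pick the candidate partition with the highest average silhouette."""
    return max(candidates, key=lambda labels: average_silhouette(dist, labels))


# Toy data (assumption): four points on a line, two natural groups.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
candidates = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 2]]
best = best_partition(dist, candidates)
print(best, average_silhouette(dist, best))
```

In this toy case the two-cluster partition that matches the natural groups wins, while the candidates that isolate a single point score lower.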
21. Clustering techniques:
The partitional K-means and K-medoids, the hierarchical
Single/Complete/Average Link, and the cluster ensemble based algorithm
known as CSPA are popular in the machine learning and data mining fields,
and therefore they have been used in our study.
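The three hierarchical variants differ only in how the distance between two clusters is measured. A minimal Python sketch, assuming a precomputed distance matrix and a hypothetical `agglomerative` function (a naive implementation written for readability, not efficiency), could look like this:

```python
def agglomerative(dist, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(a, b):
        pairs = [dist[i][j] for i in a for j in b]
        if linkage == "single":      # closest pair of objects
            return min(pairs)
        if linkage == "complete":    # farthest pair of objects
            return max(pairs)
        if linkage == "average":     # mean over all pairs
            return sum(pairs) / len(pairs)
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        p, q = min(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: cluster_distance(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(dist)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels


# Toy data (assumption): six points on a line forming two groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
for linkage in ("single", "complete", "average"):
    print(linkage, agglomerative(dist, 2, linkage))
```

On well-separated data such as this, all three linkages agree; they tend to diverge when clusters are elongated or touch each other.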
22. Removing Outliers:
We assess a simple approach to remove outliers. This approach makes
recursive use of the silhouette.
Fundamentally, if the best partition chosen by the silhouette has singletons
(i.e., clusters formed by a single object only), these are removed.
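The repeated removal of singletons can be sketched in Python as below. This is an illustration under stated assumptions: `choose_partition` is a hypothetical callback standing in for "cluster the data and pick the best partition by silhouette", and `gap_partition` is only a toy stand-in used to make the example runnable.

```python
def split_singletons(labels):
    """Return (kept_indices, singleton_indices) for one partition."""
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    kept = [i for i, lab in enumerate(labels) if sizes[lab] > 1]
    removed = [i for i, lab in enumerate(labels) if sizes[lab] == 1]
    return kept, removed


def remove_outliers(objects, choose_partition):
    """Cluster, drop singleton clusters, and re-cluster until none remain."""
    remaining = list(objects)
    outliers = []
    while remaining:
        labels = choose_partition(remaining)
        kept, removed = split_singletons(labels)
        if not removed:
            return remaining, labels, outliers
        outliers.extend(remaining[i] for i in removed)
        remaining = [remaining[i] for i in kept]
    return [], [], outliers


def gap_partition(values, gap=3):
    """Toy stand-in: start a new cluster wherever sorted values jump by > gap."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for pos, i in enumerate(order):
        if pos > 0 and values[i] - values[order[pos - 1]] > gap:
            current += 1
        labels[i] = current
    return labels


data = [1, 2, 3, 50, 10, 11]
remaining, labels, outliers = remove_outliers(data, gap_partition)
print(remaining, labels, outliers)
```

Re-clustering after each removal matters because the best partition of the remaining objects can change once the outliers are gone.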
23. HARDWARE REQUIREMENTS
Processor - Pentium IV
Speed - 1.1 GHz
RAM - 256 MB (min)
Hard Disk - 20 GB
Keyboard - Standard Windows Keyboard
Mouse - Two or Three Button Mouse
Monitor - SVGA
24. SOFTWARE REQUIREMENTS
Operating System : Windows XP
Programming Language : JAVA
Java Version : JDK 1.6 or above
Database : MySQL 4.5
IDE : NetBeans 7.4
29. CONCLUSION
• The proposed approach can become an ideal application of document clustering
to the forensic analysis of computers, laptops, and hard disks seized during
police investigations.
• Several practical results of our work are extremely useful for experts
working in forensic computing departments.
30. FUTURE ENHANCEMENTS
• Aimed at further leveraging the use of data clustering algorithms in
similar applications, a promising avenue for future work involves
investigating automatic approaches for cluster labelling.
• Investigating the possibility of incorporating semantics into the cluster
calibration procedure.
• Another line of research could consist of using supervised learning tools
to categorize data into already defined categories for investigative
purposes.
31. REFERENCE
Luís Filipe da Cruz Nassif and Eduardo Raul Hruschka,
"Document Clustering for Forensic Analysis: An Approach for
Improving Computer Inspection," IEEE Transactions on
Information Forensics and Security, vol. 8, no. 1,
January 2013.