PROJECT INTERNAL GUIDE:
J. JAGADEESWAR REDDY, M.TECH,
ASSISTANT PROFESSOR,
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING,
A.S.C.E.T.
PRESENTED BY:
T. SANJAY HARSHA (10G21A05A5),
P. JAYAKUMAR (10G21A0579),
B. SREENIVASA TEJA (11G25A0501),
SK. ZAHID (10G21A05B9).
DOCUMENT CLUSTERING FOR FORENSIC
ANALYSIS: AN APPROACH FOR IMPROVING
COMPUTER INSPECTION
AGENDA
1. Abstract
2. Introduction
3. Existing systems and Disadvantages
4. Proposed systems and Advantages
5. System Architecture
6. UML Diagrams
7. Modules and their explanation
8. System Configurations
9. I/O Screen Shots
10. Conclusion
11. Future enhancements
ABSTRACT
• In computer forensic analysis, hundreds of thousands of files are
usually examined. Much of the data in those files consists of
unstructured text, which is difficult for computer examiners to analyze.
• In particular, algorithms for clustering documents can facilitate the
discovery of new and useful knowledge from the documents under
analysis.
• We present an approach that applies document clustering algorithms
to the forensic analysis of computers seized in police investigations.
FORENSIC COMPUTING -
INTRODUCTION
• Digital forensics is a branch of forensic science encompassing the recovery and
investigation of material found in digital devices, often in relation to computer
crime.
• Computer forensics is the application of investigation and analysis techniques to
gather and preserve evidence from computing devices.
• The volume of data in the digital world increased from 161 exabytes in 2006 to 988
exabytes in 2010.
• This growth has a direct impact on computer forensics.
EXISTING SYSTEM
• From a more technical viewpoint, our datasets consist of unlabeled
objects: the classes or categories of documents that can be found are
unknown a priori.
• A new data sample would likely come from a different population, so
classes learned from earlier cases may not carry over.
• In this context, clustering algorithms are used to find latent patterns in
text documents found in seized computers.
DISADVANTAGES OF EXISTING SYSTEM
• The literature on Computer Forensics only reports the use of algorithms
that assume that the number of clusters is known and fixed a priori by the
user.
• Aimed at relaxing this assumption, which is often unrealistic in practical
applications, a common approach in other domains involves estimating
the number of clusters from data.
Example of unstructured data
Examples of structured and unstructured data
PROPOSED SYSTEM
• Here, we chose a set of six representative algorithms in order
to show the potential of the proposed approach, namely:
• Partitional K-means and K-medoids,
• Classical hierarchical algorithms like Single/Complete/Average Link, and
• the cluster ensemble algorithm known as CSPA.
• These algorithms were run with different combinations of their
parameters, resulting in sixteen different algorithmic instantiations.
ADVANTAGES OF PROPOSED SYSTEM
• Most importantly, we observed that clustering algorithms indeed tend to
induce clusters formed by either relevant or irrelevant documents, which
makes the expert examiner's job easier.
• Furthermore, our evaluation of the proposed approach in applications
shows that it has the potential to speed up the computer inspection process.
System Architecture
UML Diagrams
• Use Case Diagram of the system. Labels: Documents, Preprocessing, Term Frequency, Similarity Calculation, Cluster Formation, Evaluating Query Results.
• Activity Diagram of the system. Labels: Document, Preprocessing, Valid (branches: YES, NO), Similarity Computation, Term Frequency, Unconsidered, Cluster Formation, Query Results.
Data Flow Diagram
Documents → Preprocessing → Term Frequency → Similarity Calculation → Cluster Formation → Query Processing → Evaluating Results
MODULES
• Preprocessing Module
• Calculating the number of clusters
• Clustering techniques
• Removing Outliers
Preprocessing:
• Stopwords (prepositions, pronouns, articles, and irrelevant document
metadata) are removed.
• The Snowball stemming algorithm for Portuguese words is then applied.
• Documents are represented in a vector space model.
• Term Variance (TV) is used to reduce the number of terms, which increases the
effectiveness and efficiency of the clustering algorithms.
• Two distance measures are used:
• Cosine-based distance (between document contents) and Levenshtein-based distance (between file names).
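The preprocessing steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's implementation: the stopword list is a tiny hypothetical English one, stemming is omitted, and the function names (`tokenize`, `term_frequency_vectors`, `top_variance_terms`, `cosine_distance`, `levenshtein`) are assumptions introduced here.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list for illustration; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is"}

def tokenize(text):
    """Lowercase, keep alphabetic words, drop stopwords (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def term_frequency_vectors(docs):
    """Vector space model: one term-frequency vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

def top_variance_terms(vocab, vectors, k):
    """Term Variance: keep the k terms whose frequency varies most across documents."""
    n = len(vectors)

    def variance(j):
        col = [v[j] for v in vectors]
        mean = sum(col) / n
        return sum((x - mean) ** 2 for x in col) / n

    keep = sorted(sorted(range(len(vocab)), key=variance, reverse=True)[:k])
    return [vocab[j] for j in keep], [[v[j] for j in keep] for v in vectors]

def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def levenshtein(s, t):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]
```

Selecting terms by variance shrinks the vectors before any distances are computed, which is where the efficiency gain mentioned above would come from.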
Calculating the number of Clusters:
• A widely used approach consists of getting a set of data partitions with different
numbers of clusters and then selecting the partition that provides
the best result according to a specific quality criterion (e.g., a relative validity
index called the silhouette).
• This procedure is used with both classical hierarchical clustering algorithms and
partitional algorithms like K-means.
• Let us assume that a set of data partitions with different numbers of clusters
is available, from which we want to choose the best one.
• For an object i belonging to cluster A, a(i) denotes the average dissimilarity of i
to all other objects of A.
• For any other cluster C, d(i, C) denotes the average dissimilarity of i to all
objects of C.
• The smallest one is selected, i.e., b(i) = min d(i, C) over all C ≠ A; this is the
average dissimilarity of i to its neighboring cluster.
• The silhouette for a given object i is then given by:
s(i) = (b(i) - a(i)) / max{a(i), b(i)}
• Thus, the higher s(i), the better the assignment of object i to its cluster;
the partition that has the maximum average silhouette is selected.
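As a rough illustration of this selection step, the sketch below computes the average silhouette of a partition from a precomputed dissimilarity matrix and picks the best of several candidate partitions. The names `silhouette` and `best_partition` are hypothetical, and giving singleton objects a silhouette of 0 is a common convention assumed here, not something stated in the slides.

```python
def silhouette(dist, labels):
    """Average silhouette of a partition.

    dist   -- square matrix (list of lists) of pairwise dissimilarities
    labels -- labels[i] is the cluster label of object i
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) taken as 0 (assumed convention)
        # a(i): average dissimilarity to the other objects of i's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b(i): smallest average dissimilarity to any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += 0.0 if denom == 0 else (b - a) / denom
    return total / len(labels)

def best_partition(dist, candidate_labelings):
    """Return the candidate labeling with the highest average silhouette."""
    return max(candidate_labelings, key=lambda labels: silhouette(dist, labels))

# Example: four points on a line at positions 0, 1, 10, 11.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
candidates = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 1, 2]]
chosen = best_partition(dist, candidates)  # the two-cluster split {0, 1} / {10, 11}
```

In practice the candidate partitions would come from running a clustering algorithm with different numbers of clusters, so the number of clusters is estimated from the data rather than fixed by the user.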
Clustering techniques:
• The partitional K-means and K-medoids, the hierarchical
Single/Complete/Average Link, and the cluster ensemble based algorithm
known as CSPA are popular in the machine learning and data mining fields,
and therefore they have been used in our study.
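To make one of these algorithms concrete, here is a minimal sketch of K-medoids working directly on a precomputed distance matrix, which suits cosine- or Levenshtein-based distances. It is a simple alternating variant written for illustration, not the implementation used in the study; the function names and the choice of the first k objects as initial medoids are assumptions made for reproducibility.

```python
def assign(dist, medoids):
    """Label each object with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]

def k_medoids(dist, k, max_iter=100):
    """Alternating K-medoids on a square distance matrix.

    Initial medoids are simply the first k objects (an arbitrary choice;
    random or repeated initialisation is more common in practice).
    Returns (medoids, labels), where labels[i] is an index into medoids.
    """
    medoids = list(range(k))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

# Example: four points on a line at positions 0, 1, 10, 11.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = k_medoids(dist, 2)
```

Unlike K-means, the cluster representatives here are actual documents, so no averaging of vectors is needed and any dissimilarity measure can be plugged in.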
Removing Outliers:
• We assess a simple approach to remove outliers. This approach makes
recursive use of the silhouette.
• Fundamentally, if the best partition chosen by the silhouette has singletons
(i.e., clusters formed by a single object only), these are removed and the
clustering is repeated until a partition without singletons is found.
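One way to read the "recursive use" described above is as a loop that re-clusters after each removal of singletons. The sketch below follows that reading. The `cluster_fn` parameter is a hypothetical callable standing in for "run the clustering algorithm and return the best partition according to the silhouette"; the toy function used in the example only groups numbers by tens.

```python
from collections import Counter

def cluster_without_singletons(objects, cluster_fn):
    """Repeat clustering, dropping singleton clusters, until none remain.

    objects    -- list of hashable object identifiers
    cluster_fn -- hypothetical callable: takes a list of objects and returns
                  one cluster label per object
    Returns (labels_by_object, outliers).
    """
    remaining = list(objects)
    outliers = []
    while remaining:
        labels = cluster_fn(remaining)
        sizes = Counter(labels)
        singles = [o for o, lab in zip(remaining, labels) if sizes[lab] == 1]
        if not singles:
            return dict(zip(remaining, labels)), outliers
        outliers.extend(singles)  # set aside objects that sit alone in a cluster
        remaining = [o for o in remaining if o not in singles]
    return {}, outliers

# Toy stand-in for a real clustering step: group numbers by their tens digit.
def toy_cluster(xs):
    return [x // 10 for x in xs]

result, removed = cluster_without_singletons([0, 1, 10, 11, 50], toy_cluster)
```

The loop always terminates because each pass either finds no singletons or removes at least one object.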
HARDWARE REQUIREMENTS
• Processor - Pentium IV
• Speed - 1.1 GHz
• RAM - 256 MB (min)
• Hard Disk - 20 GB
• Keyboard - Standard Windows Keyboard
• Mouse - Two or Three Button Mouse
• Monitor - SVGA
SOFTWARE REQUIREMENTS
• Operating System : Windows XP
• Programming Language : Java
• Java Version : JDK 1.6 and above
• Database : MySQL 4.5
• IDE : NetBeans 7.4
An iterative semi-supervised machine learning
approach
CONCLUSION
• The proposed approach can become an ideal application of document clustering
to the forensic analysis of computers, laptops, and hard disks
seized from criminals during police investigations.
• Our work yields several practical results that are useful
to experts working in forensic computing departments.
FUTURE ENHANCEMENTS
• Aimed at further leveraging the use of data clustering algorithms in
similar applications, a promising avenue for future work involves
investigating automatic approaches for cluster labelling.
• Investigating the possibility of endowing the cluster calibration procedure
with semantic information.
• Another line of research could consist of using supervised learning tools
to categorize data into already defined categories for investigative
purposes.
REFERENCE
• Luís Filipe da Cruz Nassif and Eduardo Raul Hruschka,
"Document Clustering for Forensic Analysis: An Approach for
Improving Computer Inspection," IEEE Transactions on
Information Forensics and Security, vol. 8,
no. 1, January 2013.