SlideShare uma empresa Scribd logo
1 de 20
Prepared by: Mahmoud Rafeek Alfarra
Seminar Program
Document
Clustering and Classification
Out Line
 Classification and its techniques
 Clustering its techniques
 Document clustering !!
 Comparison
Classification: Definition
 Given a collection of records (training set )
– Each record contains a set of attributes, one of the
attributes is the class.
 Find a model for class attribute as a function of
the values of other attributes.
 Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
Classification: Definition
Apply
Model
Induction
Deduction
Learn
Model
Model
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes
10
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
10
Test Set
Learning
algorithm
Training Set
Classification Techniques
 Decision Tree based Methods
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines
Artificial Neural Networks (ANN)
X1 X2 X3 Y
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1
0 0 1 0
0 1 0 0
0 1 1 1
0 0 0 0
X1
X2
X3
Y
Black box
Output
Input
Output Y is 1 if at least two of the three inputs are equal
to 1.
Artificial Neural Networks (ANN)
X1 X2 X3 Y
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1
0 0 1 0
0 1 0 0
0 1 1 1
0 0 0 0

X1
X2
X3
Y
Black box
0.3
0.3
0.3 t=0.4
Output
node
Input
nodes





otherwise0
trueisif1
)(where
)04.03.03.03.0( 321
z
zI
XXXIY
Artificial Neural Networks (ANN)
 Model is an assembly of
inter-connected nodes
and weighted links
 Output node sums up
each of its input value
according to the weights
of its links
 Compare output node
against some threshold t

X1
X2
X3
Y
Black box
w1
t
Output
node
Input
nodes
w2
w3
)( tXwIY
i
ii  
Perceptron Model
)( tXwsignY
i
ii  
or
Clustering Definition
 Clustering is a division of data into groups of
similar objects.
 Each group is called cluster and consists of
objects that are similar between themselves and
dissimilar to objects of other groups .
ClusteringDefinition
C3
C2 C1
Document clustering
 Document clustering is an automatic grouping
of text documents into clusters so that
documents within a cluster have high similarity
in comparison to one another, but are dissimilar
to documents in other clusters.
The challenge
 The problem of Document clustering is how to organize
a large set of documents of various topics and reach
satisfy organization. It can display as follow:
 Given: A huge set of documents of various topics (shared,
related, totally different).
 Required: Group the documents into a number of clusters
such that the intra-cluster similarity is maximized, and the
inter-cluster similarity is minimized.
The challenge
Document cluster Document cluster
Document cluster
Inter-Cluster
Sim.
Intra-Cluster
Sim.
Inter-Cluster Sim. < Intra-Cluster Sim.
Clustering’s Process
Knowledge
Document Data Model
Representation
•Document Cleaning
•Feature Selection or Extraction.
Documents samples
Clustering Algorithm
• Similarity Measure
• Criterion of Clustering
Cluster Validation
• External Indices
• Internal Indices
Results Interpretation
Clusters
1 2
3
4
Clustering Techniques
 Clustering methods in general can be viewed
from different perspectives, the most widely
applied to text domain are:
 Hierarchical Clustering
 Partitioning Clustering
 Neural Network based Clustering
Clustering Techniques
 Suffix Tree Clustering algorithm
2015-09-26 16
D1: cat ate cheese
D2: mouse ate cheese too
D3: cat ate mouse too
and
Clustering Techniques
 Document Index Graph for clustering (DIG)
Clustering Techniques
 Graph based growing hierarchal SOM
Comparison
Thanks

Mais conteúdo relacionado

Mais procurados

The Needleman Wunsch algorithm
The Needleman Wunsch algorithmThe Needleman Wunsch algorithm
The Needleman Wunsch algorithm
avrilcoghlan
 
Kegg database resources
Kegg database resources Kegg database resources
Kegg database resources
innocent87
 

Mais procurados (20)

Ontologies for life sciences: examples from the gene ontology
Ontologies for life sciences: examples from the gene ontologyOntologies for life sciences: examples from the gene ontology
Ontologies for life sciences: examples from the gene ontology
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
The Needleman Wunsch algorithm
The Needleman Wunsch algorithmThe Needleman Wunsch algorithm
The Needleman Wunsch algorithm
 
Bioinformatics data mining
Bioinformatics data miningBioinformatics data mining
Bioinformatics data mining
 
Machine learning in biology
Machine learning in biologyMachine learning in biology
Machine learning in biology
 
Sequence Analysis
Sequence AnalysisSequence Analysis
Sequence Analysis
 
sequence of file formats in bioinformatics
sequence of file formats in bioinformaticssequence of file formats in bioinformatics
sequence of file formats in bioinformatics
 
Data retrieval
Data retrievalData retrieval
Data retrieval
 
System biology and its tools
System biology and its toolsSystem biology and its tools
System biology and its tools
 
The STRING database
The STRING databaseThe STRING database
The STRING database
 
Kegg database resources
Kegg database resources Kegg database resources
Kegg database resources
 
Data retriveal ,srg and dbget
Data retriveal ,srg and dbgetData retriveal ,srg and dbget
Data retriveal ,srg and dbget
 
Prosite
PrositeProsite
Prosite
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
Genomic databases
Genomic databasesGenomic databases
Genomic databases
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
dot plot analysis
dot plot analysisdot plot analysis
dot plot analysis
 
Text categorization
Text categorizationText categorization
Text categorization
 
Ab Initio Protein Structure Prediction
Ab Initio Protein Structure PredictionAb Initio Protein Structure Prediction
Ab Initio Protein Structure Prediction
 
Uses of Artificial Intelligence in Bioinformatics
Uses of Artificial Intelligence in BioinformaticsUses of Artificial Intelligence in Bioinformatics
Uses of Artificial Intelligence in Bioinformatics
 

Destaque

Scaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudScaling Document Clustering in the Cloud
Scaling Document Clustering in the Cloud
Rob Gillen
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
Edureka!
 
semantic text doc clustering
semantic text doc clusteringsemantic text doc clustering
semantic text doc clustering
Souvik Roy
 
Neutrosophic sets and fuzzy c means clustering for improving ct liver image s...
Neutrosophic sets and fuzzy c means clustering for improving ct liver image s...Neutrosophic sets and fuzzy c means clustering for improving ct liver image s...
Neutrosophic sets and fuzzy c means clustering for improving ct liver image s...
Aboul Ella Hassanien
 
Tdm recent trends
Tdm recent trendsTdm recent trends
Tdm recent trends
KU Leuven
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
KU Leuven
 

Destaque (20)

Clustering
ClusteringClustering
Clustering
 
Document clustering for forensic analysis
Document clustering for forensic analysisDocument clustering for forensic analysis
Document clustering for forensic analysis
 
Scaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudScaling Document Clustering in the Cloud
Scaling Document Clustering in the Cloud
 
Document clustering for forensic analysis an approach for improving compute...
Document clustering for forensic   analysis an approach for improving compute...Document clustering for forensic   analysis an approach for improving compute...
Document clustering for forensic analysis an approach for improving compute...
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
semantic text doc clustering
semantic text doc clusteringsemantic text doc clustering
semantic text doc clustering
 
FUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis ToolFUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis Tool
 
Clustering &amp; classification
Clustering &amp; classificationClustering &amp; classification
Clustering &amp; classification
 
A la croisée des Graphes
A la croisée des GraphesA la croisée des Graphes
A la croisée des Graphes
 
Facebook Conférence "Ne vous limitez pas à la Fan page et aux Like"
Facebook Conférence "Ne vous limitez pas à la Fan page et aux Like"Facebook Conférence "Ne vous limitez pas à la Fan page et aux Like"
Facebook Conférence "Ne vous limitez pas à la Fan page et aux Like"
 
Neutrosophic sets and fuzzy c means clustering for improving ct liver image s...
Neutrosophic sets and fuzzy c means clustering for improving ct liver image s...Neutrosophic sets and fuzzy c means clustering for improving ct liver image s...
Neutrosophic sets and fuzzy c means clustering for improving ct liver image s...
 
Graph Clustering and cluster
Graph Clustering and clusterGraph Clustering and cluster
Graph Clustering and cluster
 
Data clustering
Data clustering Data clustering
Data clustering
 
البرمجة الهدفية بلغة جافا - مفاهيم أساسية
البرمجة الهدفية بلغة جافا - مفاهيم أساسية البرمجة الهدفية بلغة جافا - مفاهيم أساسية
البرمجة الهدفية بلغة جافا - مفاهيم أساسية
 
Tdm recent trends
Tdm recent trendsTdm recent trends
Tdm recent trends
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
 

Semelhante a Document clustering and classification

PPT file
PPT filePPT file
PPT file
butest
 

Semelhante a Document clustering and classification (20)

EssentialsOfMachineLearning.pdf
EssentialsOfMachineLearning.pdfEssentialsOfMachineLearning.pdf
EssentialsOfMachineLearning.pdf
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 
Deep learning
Deep learningDeep learning
Deep learning
 
iiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfiiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdf
 
PPT file
PPT filePPT file
PPT file
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
Data mining chapter04and5-best
Data mining chapter04and5-bestData mining chapter04and5-best
Data mining chapter04and5-best
 
Clusterix at VDS 2016
Clusterix at VDS 2016Clusterix at VDS 2016
Clusterix at VDS 2016
 
Neural Networks in Data Mining - “An Overview”
Neural Networks  in Data Mining -   “An Overview”Neural Networks  in Data Mining -   “An Overview”
Neural Networks in Data Mining - “An Overview”
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 
Blinkdb
BlinkdbBlinkdb
Blinkdb
 
Data clustring
Data clustring Data clustring
Data clustring
 
Classification
ClassificationClassification
Classification
 
Classification
ClassificationClassification
Classification
 
Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019 Islamic University Pattern Recognition & Neural Network 2019
Islamic University Pattern Recognition & Neural Network 2019
 
Identifying and classifying unknown Network Disruption
Identifying and classifying unknown Network DisruptionIdentifying and classifying unknown Network Disruption
Identifying and classifying unknown Network Disruption
 
Intro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data ScientistsIntro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data Scientists
 
Introduction to Multi-Objective Clustering Ensemble
Introduction to Multi-Objective Clustering EnsembleIntroduction to Multi-Objective Clustering Ensemble
Introduction to Multi-Objective Clustering Ensemble
 

Mais de Mahmoud Alfarra

8 programming-using-java decision-making practices 20102011
8 programming-using-java decision-making practices 201020118 programming-using-java decision-making practices 20102011
8 programming-using-java decision-making practices 20102011
Mahmoud Alfarra
 

Mais de Mahmoud Alfarra (20)

Computer Programming, Loops using Java - part 2
Computer Programming, Loops using Java - part 2Computer Programming, Loops using Java - part 2
Computer Programming, Loops using Java - part 2
 
Computer Programming, Loops using Java
Computer Programming, Loops using JavaComputer Programming, Loops using Java
Computer Programming, Loops using Java
 
Chapter 10: hashing data structure
Chapter 10:  hashing data structureChapter 10:  hashing data structure
Chapter 10: hashing data structure
 
Chapter9 graph data structure
Chapter9  graph data structureChapter9  graph data structure
Chapter9 graph data structure
 
Chapter 8: tree data structure
Chapter 8:  tree data structureChapter 8:  tree data structure
Chapter 8: tree data structure
 
Chapter 7: Queue data structure
Chapter 7:  Queue data structureChapter 7:  Queue data structure
Chapter 7: Queue data structure
 
Chapter 6: stack data structure
Chapter 6:  stack data structureChapter 6:  stack data structure
Chapter 6: stack data structure
 
Chapter 5: linked list data structure
Chapter 5: linked list data structureChapter 5: linked list data structure
Chapter 5: linked list data structure
 
Chapter 4: basic search algorithms data structure
Chapter 4: basic search algorithms data structureChapter 4: basic search algorithms data structure
Chapter 4: basic search algorithms data structure
 
Chapter 3: basic sorting algorithms data structure
Chapter 3: basic sorting algorithms data structureChapter 3: basic sorting algorithms data structure
Chapter 3: basic sorting algorithms data structure
 
Chapter 2: array and array list data structure
Chapter 2: array and array list  data structureChapter 2: array and array list  data structure
Chapter 2: array and array list data structure
 
Chapter1 intro toprincipleofc#_datastructure_b_cs
Chapter1  intro toprincipleofc#_datastructure_b_csChapter1  intro toprincipleofc#_datastructure_b_cs
Chapter1 intro toprincipleofc#_datastructure_b_cs
 
Chapter 0: introduction to data structure
Chapter 0: introduction to data structureChapter 0: introduction to data structure
Chapter 0: introduction to data structure
 
3 classification
3  classification3  classification
3 classification
 
8 programming-using-java decision-making practices 20102011
8 programming-using-java decision-making practices 201020118 programming-using-java decision-making practices 20102011
8 programming-using-java decision-making practices 20102011
 
7 programming-using-java decision-making220102011
7 programming-using-java decision-making2201020117 programming-using-java decision-making220102011
7 programming-using-java decision-making220102011
 
6 programming-using-java decision-making20102011-
6 programming-using-java decision-making20102011-6 programming-using-java decision-making20102011-
6 programming-using-java decision-making20102011-
 
5 programming-using-java intro-tooop20102011
5 programming-using-java intro-tooop201020115 programming-using-java intro-tooop20102011
5 programming-using-java intro-tooop20102011
 
4 programming-using-java intro-tojava20102011
4 programming-using-java intro-tojava201020114 programming-using-java intro-tojava20102011
4 programming-using-java intro-tojava20102011
 
3 programming-using-java introduction-to computer
3 programming-using-java introduction-to computer3 programming-using-java introduction-to computer
3 programming-using-java introduction-to computer
 

Último

Último (20)

Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 

Document clustering and classification

  • 1. Prepared by: Mahmoud Rafeek Alfarra Seminar Program Document Clustering and Classification
  • 2. Out Line  Classification and its techniques  Clustering its techniques  Document clustering !!  Comparison
  • 3. Classification: Definition  Given a collection of records (training set ) – Each record contains a set of attributes, one of the attributes is the class.  Find a model for class attribute as a function of the values of other attributes.  Goal: previously unseen records should be assigned a class as accurately as possible. – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
  • 4. Classification: Definition Apply Model Induction Deduction Learn Model Model Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K No 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes 10 Tid Attrib1 Attrib2 Attrib3 Class 11 No Small 55K ? 12 Yes Medium 80K ? 13 Yes Large 110K ? 14 No Small 95K ? 15 No Large 67K ? 10 Test Set Learning algorithm Training Set
  • 5. Classification Techniques  Decision Tree based Methods  Rule-based Methods  Memory based reasoning  Neural Networks  Naïve Bayes and Bayesian Belief Networks  Support Vector Machines
  • 6. Artificial Neural Networks (ANN) X1 X2 X3 Y 1 0 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0 X1 X2 X3 Y Black box Output Input Output Y is 1 if at least two of the three inputs are equal to 1.
  • 7. Artificial Neural Networks (ANN) X1 X2 X3 Y 1 0 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0  X1 X2 X3 Y Black box 0.3 0.3 0.3 t=0.4 Output node Input nodes      otherwise0 trueisif1 )(where )04.03.03.03.0( 321 z zI XXXIY
  • 8. Artificial Neural Networks (ANN)  Model is an assembly of inter-connected nodes and weighted links  Output node sums up each of its input value according to the weights of its links  Compare output node against some threshold t  X1 X2 X3 Y Black box w1 t Output node Input nodes w2 w3 )( tXwIY i ii   Perceptron Model )( tXwsignY i ii   or
  • 9. Clustering Definition  Clustering is a division of data into groups of similar objects.  Each group is called cluster and consists of objects that are similar between themselves and dissimilar to objects of other groups .
  • 11. Document clustering  Document clustering is an automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters.
  • 12. The challenge  The problem of Document clustering is how to organize a large set of documents of various topics and reach satisfy organization. It can display as follow:  Given: A huge set of documents of various topics (shared, related, totally different).  Required: Group the documents into a number of clusters such that the intra-cluster similarity is maximized, and the inter-cluster similarity is minimized.
  • 13. The challenge Document cluster Document cluster Document cluster Inter-Cluster Sim. Intra-Cluster Sim. Inter-Cluster Sim. < Intra-Cluster Sim.
  • 14. Clustering’s Process Knowledge Document Data Model Representation •Document Cleaning •Feature Selection or Extraction. Documents samples Clustering Algorithm • Similarity Measure • Criterion of Clustering Cluster Validation • External Indices • Internal Indices Results Interpretation Clusters 1 2 3 4
  • 15. Clustering Techniques  Clustering methods in general can be viewed from different perspectives, the most widely applied to text domain are:  Hierarchical Clustering  Partitioning Clustering  Neural Network based Clustering
  • 16. Clustering Techniques  Suffix Tree Clustering algorithm 2015-09-26 16 D1: cat ate cheese D2: mouse ate cheese too D3: cat ate mouse too and
  • 17. Clustering Techniques  Document Index Graph for clustering (DIG)
  • 18. Clustering Techniques  Graph based growing hierarchal SOM