Text Mining
IS698
Min Song
 The Needs:
- Find people as well as documents that can
address my information need.
- Promote collaboration and knowledge
sharing
- Leverage existing information access
system
- The Information Sources:
- Email, groupware, online reports, …
Example 1: KM People Finder
Example 1: Simple KM People Finder
[Figure: pipeline diagram; labels: Query, Search or Navigation System, Relevant Docs, Name Extractor, Authority List, Ranked People Names]
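One way the people-finder pipeline could work is sketched below. It is only an illustration under stated assumptions: the name extractor is taken to be a plain substring match against the authority list, people are ranked by how many relevant documents mention them, and the function name, sample documents, and sample names are all hypothetical.

```python
from collections import Counter

def rank_people(relevant_docs, authority_list):
    """Rank known people by how many relevant documents mention them."""
    counts = Counter()
    for doc in relevant_docs:
        text = doc.lower()
        for name in authority_list:
            # Assumed extractor: case-insensitive substring match.
            if name.lower() in text:
                counts[name] += 1
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical documents returned by the search or navigation system.
relevant_docs = [
    "Report on text clustering, prepared by Ana Ruiz and Ben Cho.",
    "Email from Ana Ruiz about clustering evaluation.",
]
authority_list = ["Ana Ruiz", "Ben Cho", "Chen Li"]

ranked = rank_people(relevant_docs, authority_list)
print(ranked)
```

People on the authority list who appear in no relevant document are simply left out of the ranking.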
• An exploration and analysis of textual (natural-language) data by
automatic and semi-automatic means to discover new knowledge.
Text Mining Definition
 Many definitions in the literature
“The nontrivial extraction of implicit, previously unknown,
and potentially useful information from (large amounts of)
textual data”.
 What is “previously unknown”
information?
 Strict definition
 Information that not even the writer knows.
 e.g., Discovering a new hair-growth method that is
described as a side effect of a different
procedure
 Lenient definition
 Rediscover the information that the author
encoded in the text
 e.g., Automatically extracting a product’s name
from a web-page.
Text Mining Definition
Outline
 Text mining applications
 Text characteristics
 Text mining process
 Learning methods
Text Mining Applications
 Marketing: Discover distinct
groups of potential buyers
according to a user text based
profile
 e.g., Amazon
 Industry: Identify groups
of competitors' web pages
 e.g., competing products
and their prices
 Job seeking: Identify
parameters in searching for
jobs
 e.g., www.flipdog.com
 Information Retrieval
 Indexing and retrieval of textual documents
 Information Extraction
 Extraction of partial knowledge in the text
 Web Mining
 Indexing and retrieval of textual documents
and extraction of partial knowledge using the
web
 Clustering
 Generating collections of similar text
documents
Text Mining Methods
Information Retrieval
 Given:
 A source of textual
documents
 A user query (text
based)
• Find:
• A set (ranked) of documents that
are relevant to the query
[Figure: IR System diagram; labels: Query, "E.g. Spam / Text", Documents source, IR System, Ranked Documents (Document, Document, Document)]
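The slide does not say how the IR system scores documents. The sketch below assumes a simple TF-IDF sum over the query terms, which is one common choice and not necessarily the one the author had in mind; the function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank_documents(query, documents):
    """Return (score, doc_index) pairs, most relevant first."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    scores = []
    for idx, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        # Sum of term frequency times inverse document frequency.
        score = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in tokenize(query)
            if term in tf
        )
        scores.append((score, idx))
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))

documents = [
    "cheap pills buy now",
    "text mining finds patterns in text",
    "buy a book about data mining",
]
ranking = rank_documents("text mining", documents)
for score, idx in ranking:
    print(round(score, 3), documents[idx])
```

A document that shares no term with the query gets a score of zero and ends up at the bottom of the ranked list.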
Intelligent Information Retrieval
 meaning of words
 Synonyms “buy” / “purchase”
 Ambiguity “bat” (baseball vs. mammal)
 order of words in the query
 hot dog stand in the amusement park
 hot amusement stand in the dog park
 user dependency for the data
 direct feedback
 indirect feedback
 authority of the source
 IBM is more likely to be an authoritative source than
my distant cousin
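The word-order point can be checked directly on the two example queries. The sketch below assumes whitespace tokenization; `bag_of_words` and `bigrams` are illustrative helper names. A bag-of-words view cannot tell the queries apart, while adjacent word pairs can.

```python
from collections import Counter

def bag_of_words(text):
    """Word counts with order discarded."""
    return Counter(text.lower().split())

def bigrams(text):
    """Set of adjacent word pairs, which keeps local word order."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

same_bag = bag_of_words(q1) == bag_of_words(q2)
shared_bigrams = bigrams(q1) & bigrams(q2)

print(same_bag)
print(sorted(shared_bigrams))
```

Only the pairs around "stand in the" survive in both queries, so a system that indexes word pairs has a way to separate a "hot dog stand" from a "dog park".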
 Given:
 A source of textual documents
 A well defined limited query (text based)
 Find:
 Sentences with relevant information
 Extract the relevant information and
ignore non-relevant information
(important!)
 Link related information and output in a
predetermined format
What is Information Extraction?
Information Extraction: Example
 Salvadoran President-elect Alfredo Cristiani condemned the
terrorist killing of Attorney General Roberto Garcia Alvarado
and accused the Farabundo Marti National Liberation Front
(FMLN) of the crime. … Garcia Alvarado, 56, was killed when
a bomb placed by urban guerillas on his vehicle exploded as
it came to a halt at an intersection in downtown San
Salvador. … According to the police and Garcia Alvarado’s
driver, who escaped unscathed, the attorney general was
traveling with two bodyguards. One of them was injured.
 Incident Date: 19 Apr 89
 Incident Type: Bombing
 Perpetrator Individual ID: “urban guerillas”
 Human Target Name: “Roberto Garcia Alvarado”
 ...
What is Information Extraction?
[Figure: Extraction System diagram; labels: Documents source, Query 1 (e.g., job title), Query 2 (e.g., salary), Extraction System, Ranked Documents, Relevant Info 1, Relevant Info 2, Relevant Info 3, Combine Query Results]
Why Mine the Web?
 Enormous wealth of textual information on the Web.
 Book/CD/Video stores (e.g., Amazon)
 Restaurant information (e.g., Zagats)
 Car prices (e.g., Carpoint)
 Lots of data on user access patterns
 Web logs contain sequence of URLs accessed by users
 Possible to retrieve “previously unknown”
information
 People who ski also frequently break their legs.
 Restaurants that serve seafood in California are likely
to be outside San Francisco
Mining the Web
[Figure: Web mining diagram; labels: Web Spider, Documents source, Query, IR / IE System, Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
 The Web is a huge collection of
documents where many contain:
 Hyper-link information
 Access and usage information
 The Web is very dynamic
 Web pages are constantly being
generated (and removed)
Unique Features of the Web
Challenge: Develop new Web mining algorithms to . . .
• Exploit hyper-links and access patterns.
• Be adaptable to its document sources
 Combine the intelligent IR tools
 meaning of words
 order of words in the query
 user dependency for the data
 authority of the source
 With the unique web features
 retrieve Hyper-link information
 utilize Hyper-link as input
Intelligent Web Search
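One simple way to use hyper-links as input is to treat the number of distinct pages linking to a page as a rough authority signal and mix it into a text-based score. The sketch below assumes exactly that; the link graph, the text scores, and the `weight` parameter are hypothetical, and real systems use more elaborate link analysis.

```python
from collections import Counter

def inlink_counts(links):
    """For every page, count the distinct other pages that link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts

def rerank(text_scores, links, weight=0.5):
    """Order pages by text score plus weight times in-link count."""
    authority = inlink_counts(links)
    combined = {
        page: score + weight * authority[page]
        for page, score in text_scores.items()
    }
    return sorted(combined, key=lambda page: (-combined[page], page))

# Hypothetical link graph: page -> pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}
# Hypothetical text-only relevance scores from an IR system.
text_scores = {"a.html": 1.0, "b.html": 1.2, "c.html": 0.9, "d.html": 1.1}

order = rerank(text_scores, links)
print(order)
```

Here `c.html` has the weakest text score but three in-links, so it moves to the top once link information is taken into account.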
What is Clustering ?
 Given:
 A source of
textual documents
 Similarity
measure
 e.g., how many
words are
common in these
documents
• Find:
• Several clusters of documents
that are relevant to each other
[Figure: Clustering System diagram; inputs: Documents source and Similarity measure; output: groups of Doc items]
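The similarity measure mentioned on the slide, how many words two documents have in common, can be sketched in a couple of lines. Whitespace tokenization and the three sample documents are assumptions made for illustration.

```python
def common_words(doc_a, doc_b):
    """Number of distinct words that occur in both documents."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

docs = [
    "the english football fan",
    "the italian football fan",
    "stock prices fell sharply",
]

print(common_words(docs[0], docs[1]))
print(common_words(docs[0], docs[2]))
```

A clustering system would place the first two documents together because they share three words, while the third shares none with either.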
Outline
 Text mining applications
 Text characteristics
 Text mining process
 Learning methods
Text characteristics: Outline
 Large textual data base
 High dimensionality
 Several input modes
 Dependency
 Ambiguity
 Noisy data
 Not well structured text
Text characteristics
 Large textual data base
 Efficiency consideration
 over 2,000,000,000 web pages
 almost all publications are also in electronic form
 High dimensionality (Sparse input)
 Consider each word/phrase as a dimension
 Several input modes
 e.g., Web mining: information about the user is
derived from semantics, browsing patterns, and an outside
knowledge base.
Text characteristics
 Dependency
 relevant information is a complex conjunction of
words/phrases
 e.g., Document categorization.
Pronoun disambiguation.
 Ambiguity
 Word ambiguity
 Pronouns (he, she …)
 “buy”, “purchase”
 Semantic ambiguity
 The king saw the rabbit with his glasses.
Text characteristics
 Noisy data
 Example: Spelling mistakes
 Not well structured text
 Chat rooms
 “r u available ?”
 “Hey whazzzzzz up”
 Speech
Outline
 Text mining applications
 Text characteristics
 Text mining process
 Learning methods
Text mining process
 Text preprocessing
 Syntactic/Semantic
text analysis
 Features Generation
 Bag of words
 Features Selection
 Simple counting
 Statistics
 Text/Data Mining
 Classification
(supervised learning)
 Clustering
(unsupervised learning)
 Analyzing results
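The features generation and features selection steps can be sketched as follows, assuming whitespace tokenization and a selection rule that keeps a word only if it occurs in at least `min_docs` documents. That threshold, the function names, and the sample documents are assumptions, since the slide only says "simple counting".

```python
from collections import Counter

def bag_of_words(documents):
    """Features generation: one word-count bag per document."""
    return [Counter(doc.lower().split()) for doc in documents]

def select_features(bags, min_docs=2):
    """Features selection by simple counting of document frequency."""
    doc_freq = Counter(word for bag in bags for word in bag)
    return sorted(word for word, n in doc_freq.items() if n >= min_docs)

def to_vectors(bags, vocabulary):
    """One count vector per document over the selected vocabulary."""
    return [[bag[word] for word in vocabulary] for bag in bags]

documents = [
    "football fans were cheering",
    "football fans were fighting",
    "the salesman earns a salary",
]
bags = bag_of_words(documents)
vocabulary = select_features(bags)
vectors = to_vectors(bags, vocabulary)

print(vocabulary)
print(vectors)
```

The resulting vectors are what a classification or clustering step would take as input.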
 Part-of-speech (POS) tagging
 Find the corresponding POS for each word
e.g., John (noun) gave (verb) the (det) ball (noun)
 ~98% accurate.
 Word sense disambiguation
 Context based or proximity based
 Very accurate
 Parsing
 Generates a parse tree (graph) for each sentence
 Each sentence is a stand-alone graph
Syntactic / Semantic text analysis
 Given: a collection of labeled records (training set)
 Each record contains a set of features (attributes),
and the true class (label)
 Find: a model for the class as a function of the values
of the features
 Goal: previously unseen records should be assigned a
class as accurately as possible
 A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it
Text Mining: Classification definition
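The split into training and test sets can be sketched with a deliberately trivial model. The (Country, Hooligan) pairs below are taken from the deck's ten-row example table; the choice to hold out every third record and the "most common label per country" model are assumptions made only to keep the example short.

```python
from collections import Counter, defaultdict

# (Country, Hooligan) pairs from the deck's ten-row example table.
records = [
    ("England", "Yes"), ("England", "Yes"), ("England", "Yes"),
    ("Italy", "No"), ("USA", "No"), ("England", "Yes"),
    ("England", "Yes"), ("Italy", "Yes"), ("France", "No"),
    ("Denmark", "No"),
]

def split(records, every=3):
    """Hold out every `every`-th record as the test set."""
    numbered = list(enumerate(records, start=1))
    train = [r for i, r in numbered if i % every != 0]
    test = [r for i, r in numbered if i % every == 0]
    return train, test

def learn(train):
    """Model: most common label per feature value, global fallback."""
    by_value = defaultdict(Counter)
    for value, label in train:
        by_value[value][label] += 1
    default = Counter(label for _, label in train).most_common(1)[0][0]
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return lambda value: table.get(value, default)

def accuracy(model, test):
    """Fraction of test records whose predicted label is correct."""
    return sum(model(v) == label for v, label in test) / len(test)

train, test = split(records)
model = learn(train)
test_accuracy = accuracy(model, test)
print(len(train), len(test), round(test_accuracy, 2))
```

The model never sees the held-out rows while it is being built, so the accuracy on them estimates how it handles previously unseen records. The French record is misclassified because "France" never appears in the training set.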
Similarity Measures:
• Euclidean Distance if attributes are continuous
• Other Problem-specific Measures
• e.g., how many words are common in these documents
 Given: a set of documents and a similarity measure
among documents
 Find: clusters such that:
 Documents in one cluster are more similar to one
another
 Documents in separate clusters are less similar to
one another
 Goal:
 Finding a correct set of documents
Text Mining: Clustering definition
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc.,
the aim is to establish the existence of
classes or clusters in the data
Supervised vs. Unsupervised Learning
 Correct classification: The known label of a test
sample is identical to the class result from
the classification model
 Accuracy ratio: the percentage of test set
samples that are correctly classified by the
model
 A distance measure between classes can be
used
 e.g., classifying a “football” document as a
“basketball” document is not as bad as
classifying it as “crime”.
Evaluation: What Is Good Classification?
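The difference between the accuracy ratio and a distance-aware evaluation can be sketched as below. The class distances in `CLASS_DISTANCE` are made-up numbers chosen only to say that football is closer to basketball than to crime; the labels and both sets of predictions are hypothetical as well.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical distances between classes (0 = same, 1 = unrelated).
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 0.2,
    frozenset(["football", "crime"]): 1.0,
    frozenset(["basketball", "crime"]): 1.0,
}

def mean_error_distance(true_labels, predicted_labels):
    """Average class distance of the predictions (0 for correct ones)."""
    total = 0.0
    for t, p in zip(true_labels, predicted_labels):
        if t != p:
            total += CLASS_DISTANCE[frozenset([t, p])]
    return total / len(true_labels)

truth = ["football", "football", "crime", "basketball"]
model_a = ["basketball", "football", "crime", "basketball"]
model_b = ["crime", "football", "crime", "basketball"]

print(accuracy(truth, model_a), accuracy(truth, model_b))
print(mean_error_distance(truth, model_a), mean_error_distance(truth, model_b))
```

Both models make one mistake and therefore have the same accuracy, but the distance-aware score penalizes the football-as-crime error more heavily than the football-as-basketball one.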
 A good clustering method produces
high-quality clusters with . . .
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering method is
also measured by its ability to discover
some or all of the hidden patterns
Evaluation: What Is Good Clustering?
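Intra-class and inter-class similarity can be compared for two candidate clusterings as sketched below. Jaccard similarity over word sets is an assumed measure, since the slide does not fix one, and the four short documents are hypothetical.

```python
from itertools import combinations

def similarity(doc_a, doc_b):
    """Jaccard similarity of the two documents' word sets (assumed measure)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

def average(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def intra_and_inter(clusters):
    """Mean similarity within clusters and mean similarity across clusters."""
    intra = average(
        similarity(x, y)
        for cluster in clusters
        for x, y in combinations(cluster, 2)
    )
    inter = average(
        similarity(x, y)
        for c1, c2 in combinations(clusters, 2)
        for x in c1
        for y in c2
    )
    return intra, inter

good = [["football fan cheers", "football fan sings"],
        ["stock price falls", "stock price rises"]]
bad = [["football fan cheers", "stock price falls"],
       ["football fan sings", "stock price rises"]]

print(intra_and_inter(good))
print(intra_and_inter(bad))
```

The first clustering has high similarity inside clusters and none between them, which matches the criterion above; the second mixes the topics and shows the opposite pattern.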
Outline
 Text mining applications
 Text characteristics
 Text mining process
 Learning methods
 Classification
 Clustering
Classification: An Example
Training Set (Country: categorical; Marital Status: categorical; Income: continuous; Hooligan: class)

Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married                 Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England                  20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No

[Figure: Training Set → Learn Classifier → Model; the Model is applied to the Test Set]

Test Set

Country  Marital Status  Income  Hooligan
England  Single          75K     ?
Turkey   Married         50K     ?
England  Married         150K    ?
         Divorced        90K     ?
         Single          40K     ?
Italy    Married         80K     ?
Text Classification: An Example
Training Set (Text: text; Hooligan: class)

Ex#  Text                                               Hooligan
1    An English football fan …                          Yes
2    During a game in Italy …                           Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester City is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

[Figure: Training Set → Learn Classifier → Model; the Model is applied to the Test Set]

Test Set

Text                                              Hooligan
A Danish football fan                             ?
Turkey is playing vs. France. The Turkish fans …  ?
Classification Techniques
 Instance-Based Methods
 Decision trees
 Neural networks
 Bayesian classification
 Instance-based (memory-based)
learning
 Store training examples and delay the
processing (“lazy evaluation”) until a new
instance must be classified
 k-nearest neighbor approach
 Instances (examples) are represented
as points in a Euclidean space
Instance-based Methods
[Figure: two documents plotted as points in a space with axes "football" and "Italian": "The English football fan is a hooligan." and "Similar to his English equivalent, the Italian football fan is a hooligan."]
Text Examples in Euclidean
Space
 All instances correspond to points in the n-D space
 The nearest neighbors are defined in terms
of Euclidean distance
[Figure: scatter plot of "+" and "_" training examples around a query point "?", shown next to the Voronoi diagram for the same examples]
• The k-NN returns the most common value among the k nearest training
examples
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of
training examples
K-Nearest Neighbor Algorithm
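A minimal k-NN sketch following the description above: examples are points, neighbors are found by Euclidean distance, and the most common label among the k nearest is returned. The six two-dimensional training points are hypothetical, and ties between labels are broken by whichever label was counted first.

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(query, examples, k=3):
    """examples: list of (point, label); majority label of the k nearest."""
    nearest = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training examples in a 2-D space.
examples = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.5), "+"),
    ((5.0, 5.0), "-"), ((6.0, 5.5), "-"), ((5.5, 6.5), "-"),
]

print(knn_classify((2.0, 2.0), examples, k=3))
print(knn_classify((5.0, 6.0), examples, k=3))
print(knn_classify((4.0, 4.0), examples, k=1))
```

All the work happens at query time; nothing is learned beyond storing the examples, which is what "lazy evaluation" refers to.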
Classification Techniques
 Instance-Based Methods
 Decision trees
 Neural networks
 Bayesian classification
Decision Tree: An Example

Training Set (Country: categorical; Marital Status: categorical; Income: continuous; Hooligan: class)

Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married         100K    Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England  Divorced        20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No

[Figure: decision tree with splitting attributes English, MarSt, and Income]
English?
  Yes → YES
  No → MarSt?
    Married → NO
    Single, Divorced → Income?
      > 80K → YES
      < 80K → NO
The splitting attribute at a node is
determined based on a specific
attribute selection algorithm
Decision Tree: A Text Example

Training Set (Text: text; Hooligan: class)

Ex#  Text                                               Hooligan
1    An English football fan …                          Yes
2    During a game in Italy …                           Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester City is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

[Figure: decision tree with splitting attributes English, MarSt, and Income]
English?
  Yes → YES
  No → MarSt?
    Married → NO
    Single, Divorced → Income?
      > 80K → YES
      < 80K → NO
The splitting attribute at a node is
determined based on a specific
attribute selection algorithm
 Decision tree
 A flow-chart-like tree structure
 Internal node denotes a test on an attribute
 Branch represents an outcome of the test
 Leaf nodes represent class labels or class distribution
 Decision tree generation consists of two phases:
 Tree construction
 Tree pruning
 Identify and remove branches that reflect noise or
outliers
 Use of decision tree: Classifying an unknown sample
 Test the attribute of the sample against the decision
tree
Classification by DT Induction
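Classifying an unknown sample means testing its attributes against the tree from the root down. The sketch below hard-codes the deck's example tree as it is read here: English at the root, then MarSt, then Income with an 80K threshold. The dictionary keys are hypothetical, and sending an income of exactly 80K to "No" is an assumption, since the slide leaves that case open.

```python
def classify(sample):
    """Walk the example tree: English? -> MarSt -> Income (in thousands)."""
    # Root test: is the country England?
    if sample["country"] == "England":
        return "Yes"
    # Second test: marital status.
    if sample["marital_status"] == "Married":
        return "No"
    # Single or Divorced: test income. Exactly 80K goes to "No" here,
    # which is an assumption; the slide only shows "> 80K" and "< 80K".
    return "Yes" if sample["income_k"] > 80 else "No"

samples = [
    {"country": "England", "marital_status": "Single", "income_k": 75},
    {"country": "Turkey", "marital_status": "Married", "income_k": 50},
    {"country": "Italy", "marital_status": "Single", "income_k": 85},
    {"country": "Denmark", "marital_status": "Single", "income_k": 50},
]
predictions = [classify(s) for s in samples]
print(predictions)
```

Each `if` corresponds to an internal node, each comparison outcome to a branch, and each `return` to a leaf with a class label.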
 Partitioning Methods
 Hierarchical Methods
Clustering Techniques
 Partitioning method: Construct a partition of n
documents into a set of k clusters
 Given: a set of documents and the number k
 Find: a partition of k clusters that optimizes the
chosen partitioning criterion
 Global optimum: exhaustively enumerate all
partitions
 Heuristic methods: k-means and k-medoids
algorithms
 k-means: Each cluster is represented by the center
of the cluster
Partitioning Algorithms
 The k-means algorithm is implemented in
4 steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the
clusters of the current partition. The
centroid is the center (mean point) of the
cluster.
3. Assign each object to the cluster with the
nearest seed point.
4. Go back to Step 2; stop when no assignment
changes.
The K-means Clustering Method
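The four steps can be sketched in plain Python as follows. The round-robin initial partition, the squared Euclidean distance, the rule of keeping the old centroid when a cluster becomes empty, and the sample points are all assumptions; the slide does not prescribe them.

```python
def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=100):
    # Step 1: initial partition into k nonempty subsets. Round-robin is an
    # assumption and needs len(points) >= k.
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: centroids (mean points) of the current clusters.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no assignment changes, else repeat from Step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

The result depends on the initial partition, which is why k-means is listed among the heuristic methods and does not guarantee the global optimum.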
The K-means Clustering:
Example
[Figure: five scatter plots, each with both axes running from 0 to 10]
 Partitioning Methods
 Hierarchical Methods
Clustering Techniques
 Agglomerative:
 Start with each document being a single cluster.
 Eventually all documents belong to the same
cluster.
 Divisive:
 Start with all documents belonging to the same
cluster.
 Eventually each node forms a cluster on its own.
 Does not require the number of clusters k in advance
 Needs a termination condition
 The final node in both agglomerative and divisive
clustering is of no use.
Hierarchical Clustering
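The agglomerative direction can be sketched as repeated merging of the two closest clusters. Single-link distance on one-dimensional positions is an assumption made to keep the example small, the positions of the five objects are hypothetical, and `stop_at` plays the role of the termination condition.

```python
from itertools import combinations

def single_link(c1, c2, position):
    """Distance between the closest pair of members of two clusters."""
    return min(abs(position[x] - position[y]) for x in c1 for y in c2)

def agglomerative(position, stop_at=1):
    """Merge the two closest clusters until `stop_at` clusters remain."""
    # Start with each object being a single cluster.
    clusters = [frozenset([name]) for name in sorted(position)]
    merges = []
    while len(clusters) > stop_at:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: single_link(pair[0], pair[1], position),
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append("".join(sorted(c1 | c2)))
    return merges, clusters

# Hypothetical 1-D positions for five objects.
position = {"a": 1.0, "b": 2.0, "c": 5.0, "d": 7.0, "e": 8.5}

merges, _ = agglomerative(position)
print(merges)

partial_merges, remaining = agglomerative(position, stop_at=2)
print(sorted("".join(sorted(c)) for c in remaining))
```

Stopping at two clusters instead of one leaves the groups `ab` and `cde`, which is more useful than the single all-inclusive cluster reached at the very end.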
Hierarchical Clustering:
Example
[Figure: agglomerative clustering of objects a, b, c, d, e from Step 0 to Step 4, merging {a, b}, then {d, e}, then {c, d, e}, then {a, b, c, d, e}; divisive clustering runs the same steps in reverse, from Step 4 to Step 0]
• Dendrogram: Decomposes data
objects into several levels of
nested partitioning (tree of
clusters).
• Clustering of the data objects is
obtained by cutting the
dendrogram at the desired level;
then each connected component
forms a cluster.
A Dendrogram: Hierarchical
Clustering
Demo
Commercial Tools
 IBM Intelligent Miner for Text
 Semio Map
 InXight LinguistX / ThingFinder
 LexiQuest
 ClearForest
 Teragram
 SRA NetOwl Extractor
 Autonomy
 Text is tricky to process, but “ok” results are easily
achieved
 There exist several text mining systems
 e.g., D2K - Data to Knowledge
 http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
 Additional intelligence can be integrated with text
mining
 One may play with any phase of the text mining
process
Summary
Summary
 There are many other scientific and statistical text mining
methods developed but not covered in this talk.
 http://www.cs.utexas.edu/users/pebronia/text-mining/
 http://filebox.vt.edu/users/wfan/text_mining.html
 Also, it is important to study the theoretical foundations of data
mining.
 Data Mining: Concepts and Techniques / J. Han &
M. Kamber
 Machine Learning / T. Mitchell
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 

Último (20)

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 

Week12

  • 2. Example 1: KM People Finder. The needs: find people as well as documents that can address my information need; promote collaboration and knowledge sharing; leverage the existing information access system. The information sources: email, groupware, online reports, …
  • 3. Example 1: Simple KM People Finder [diagram; components: Query, Search or Navigation System, Relevant Docs, Name Extractor, Authority List, Ranked People Names]
  • 4. Example 1: KM People Finder
  • 5. Text Mining Definition. An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge. Many definitions appear in the literature, e.g., “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data”.
  • 6. Text Mining Definition. What is “previously unknown” information? Strict definition: information that not even the writer knows, e.g., discovering a new method for hair growth that is described as a side effect of a different procedure. Lenient definition: rediscovering information that the author encoded in the text, e.g., automatically extracting a product’s name from a web page.
  • 7. Outline: text mining applications; text characteristics; text mining process; learning methods
  • 8. Text Mining Applications. Marketing: discover distinct groups of potential buyers according to a user's text-based profile (e.g., Amazon). Industry: identify groups of competitors' web pages (e.g., competing products and their prices). Job seeking: identify parameters in searching for jobs (e.g., www.flipdog.com).
  • 9. Text Mining Methods. Information retrieval: indexing and retrieval of textual documents. Information extraction: extraction of partial knowledge in the text. Web mining: indexing and retrieval of textual documents and extraction of partial knowledge using the web. Clustering: generating collections of similar text documents.
  • 10. Information Retrieval. Given: a source of textual documents and a user query (text based). Find: a ranked set of documents that are relevant to the query. [Diagram; labels: Query (e.g., Spam / Text), Documents source, IR System, Ranked Documents]
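The retrieval task on this slide (a query in, a ranked set of documents out) can be sketched in a few lines. The slides do not name a ranking function; TF-IDF term weighting is assumed here as one common choice, and the function names and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; a real system would also strip punctuation.
    return text.lower().split()

def rank(query, docs):
    """Return (score, document) pairs, best match first (TF-IDF, assumed)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scored = []
    for doc, toks in zip(docs, tokenized):
        tf = Counter(toks)
        score = sum(tf[t] * math.log(n / df[t]) for t in tokenize(query) if t in tf)
        scored.append((score, doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical document source.
docs = ["cheap football tickets", "football fans in Italy", "stock market report"]
ranking = rank("football Italy", docs)
```

A document that shares no term with the query gets a score of zero and falls to the bottom of the ranking.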
  • 11. Intelligent Information Retrieval. Meaning of words: synonyms (“buy” / “purchase”) and ambiguity (“bat”: baseball vs. mammal). Order of words in the query: “hot dog stand in the amusement park” vs. “hot amusement stand in the dog park”. User dependency for the data: direct feedback and indirect feedback. Authority of the source: IBM is more likely to be an authoritative source than my distant second cousin.
  • 12. What is Information Extraction? Given: a source of textual documents and a well-defined, limited query (text based). Find: sentences with relevant information; extract the relevant information and ignore non-relevant information (important!); link related information and output it in a predetermined format.
  • 13. Information Extraction: Example. Text: “Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerrillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … According to the police and Garcia Alvarado’s driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.” Extracted: Incident Date: 19 Apr 89; Incident Type: Bombing; Perpetrator Individual ID: “urban guerrillas”; Human Target Name: “Roberto Garcia Alvarado”; ...
  • 14. What is Information Extraction? [Diagram; labels: Documents source, Query 1 (e.g., job title), Query 2 (e.g., salary), Extraction System, Relevant Info 1, Relevant Info 2, Relevant Info 3, Combine Query Results, Ranked Documents]
  • 15. Why Mine the Web? Enormous wealth of textual information on the Web: book/CD/video stores (e.g., Amazon), restaurant information (e.g., Zagats), car prices (e.g., Carpoint). Lots of data on user access patterns: Web logs contain the sequence of URLs accessed by users. Possible to retrieve “previously unknown” information: people who ski also frequently break their legs; restaurants that serve seafood in California are likely to be outside San Francisco.
  • 16. Mining the Web [diagram; labels: Query, Web Spider, Documents source, IR / IE System, Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, …)]
  • 17. Unique Features of the Web. The Web is a huge collection of documents, many of which contain hyperlink information and access and usage information. The Web is very dynamic: web pages are constantly being generated (and removed). Challenge: develop new Web mining algorithms that exploit hyperlinks and access patterns and are adaptable to their document source.
  • 18. Intelligent Web Search. Combine the intelligent IR tools (meaning of words, order of words in the query, user dependency for the data, authority of the source) with the unique Web features (retrieve hyperlink information, utilize hyperlinks as input).
  • 19. What is Clustering? Given: a source of textual documents and a similarity measure (e.g., how many words are common in these documents). Find: several clusters of documents that are relevant to each other. [Diagram; labels: Documents source, Similarity measure, Clustering System, clusters of Docs]
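The similarity measure this slide gives as an example, the number of words two documents have in common, is small enough to write out directly. This is a minimal sketch; the function name and the three sample strings are illustrative only.

```python
def common_words(doc_a, doc_b):
    """Similarity = number of distinct words the two documents share."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    return len(words_a & words_b)

# Hypothetical documents.
a = "the English football fan"
b = "the Italian football fan"
c = "stock market report"

sim_ab = common_words(a, b)  # shares "the", "football", "fan"
sim_ac = common_words(a, c)  # shares nothing
```

Note that function words such as "the" count as much as content words under this measure, which is one reason real systems usually weight or filter terms.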
  • 20. Outline: text mining applications; text characteristics; text mining process; learning methods
  • 21. Text characteristics: Outline. Large textual database; high dimensionality; several input modes; dependency; ambiguity; noisy data; not well-structured text
  • 22. Text characteristics. Large textual database: efficiency considerations; over 2,000,000,000 web pages; almost all publications are also in electronic form. High dimensionality (sparse input): consider each word/phrase as a dimension. Several input modes: e.g., in Web mining, information about the user is generated by semantics, browsing pattern, and an outside knowledge base.
  • 23. Text characteristics. Dependency: relevant information is a complex conjunction of words/phrases (e.g., document categorization, pronoun disambiguation). Ambiguity: word ambiguity (pronouns such as he, she, …; “buy”, “purchase”); semantic ambiguity (“The king saw the rabbit with his glasses.”).
  • 24. Text characteristics. Noisy data: e.g., spelling mistakes. Not well-structured text: chat rooms (“r u available ?”, “Hey whazzzzzz up”); speech.
  • 25. Outline: text mining applications; text characteristics; text mining process; learning methods
  • 27. Text mining process. Text preprocessing: syntactic/semantic text analysis. Features generation: bag of words. Features selection: simple counting; statistics. Text/data mining: classification (supervised learning); clustering (unsupervised learning). Analyzing results.
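The feature generation and feature selection steps named on this slide can be illustrated as follows. This is a sketch under assumptions: tokenization is plain whitespace splitting, and "simple counting" is read as keeping terms that occur in at least `min_docs` documents; that threshold and the sample documents are not from the slides.

```python
from collections import Counter

def bag_of_words(text):
    # Features generation: a document becomes a multiset of its words.
    return Counter(text.lower().split())

def select_features(bags, min_docs=2):
    # Features selection by simple counting: keep terms seen in >= min_docs documents.
    doc_freq = Counter(term for bag in bags for term in bag)
    return sorted(term for term, count in doc_freq.items() if count >= min_docs)

# Hypothetical documents.
docs = ["The English football fan", "The Italian football fan", "An average salesman"]
bags = [bag_of_words(d) for d in docs]
vocabulary = select_features(bags)
# One count vector per document, one dimension per selected term.
vectors = [[bag[term] for term in vocabulary] for bag in bags]
```

Each selected term becomes one dimension, which is why text data is high-dimensional and sparse: most documents have a zero in most dimensions.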
  • 28. Syntactic / Semantic text analysis. Part-of-speech (POS) tagging: find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun); ~98% accurate. Word sense disambiguation: context based or proximity based; very accurate. Parsing: generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph.
  • 29. Text Mining: Classification definition. Given: a collection of labeled records (training set); each record contains a set of features (attributes) and the true class (label). Find: a model for the class as a function of the values of the features. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model: usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
  • 30. Text Mining: Clustering definition. Given: a set of documents and a similarity measure among documents. Find: clusters such that documents in one cluster are more similar to one another and documents in separate clusters are less similar to one another. Goal: finding a correct set of documents. Similarity measures: Euclidean distance if attributes are continuous; other problem-specific measures, e.g., how many words are common in these documents.
  • 31. Supervised vs. Unsupervised Learning. Supervised learning (classification): the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data is classified based on the training set. Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
  • 32. Evaluation: What Is Good Classification? Correct classification: the known label of a test sample is identical to the class result from the classification model. Accuracy ratio: the percentage of test set samples that are correctly classified by the model. A distance measure between classes can be used, e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”.
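Both evaluation ideas on this slide, the accuracy ratio and a distance measure between classes, can be shown on a toy test set. The class-distance values below are invented for illustration (the slides only say that football is closer to basketball than to crime), and the function names are hypothetical.

```python
def accuracy(true_labels, predicted):
    """Fraction of test samples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)

# Hypothetical distances between classes; unlisted pairs default to 1.0.
CLASS_DISTANCE = {("football", "basketball"): 0.2, ("football", "crime"): 1.0}

def mean_error_distance(true_labels, predicted, distance=CLASS_DISTANCE):
    """Average class distance of the errors (0 for a correct prediction)."""
    total = 0.0
    for t, p in zip(true_labels, predicted):
        if t != p:
            total += distance.get((t, p), distance.get((p, t), 1.0))
    return total / len(true_labels)

truth = ["football", "football", "crime", "basketball"]
model_a = ["football", "basketball", "crime", "basketball"]  # near miss
model_b = ["football", "crime", "crime", "basketball"]       # far miss
```

Here both models have the same accuracy ratio, but the distance-weighted error separates them: model_a's single mistake is the less harmful one.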
  • 33. Evaluation: What Is Good Clustering? A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
  • 34. Outline: text mining applications; text characteristics; text mining process; learning methods (classification, clustering)
  • 35. Classification: An Example. Country and Marital Status are categorical, Income is continuous, Hooligan is the class; “-” marks a cell that is blank in the slide. [Diagram; labels: Training Set, Learn Classifier, Model, Test Set]
      Training set:
      Ex#  Country  Marital Status  Income  Hooligan
      1    England  Single          125K    Yes
      2    England  Married         -       Yes
      3    England  Single          70K     Yes
      4    Italy    Married         40K     No
      5    USA      Divorced        95K     No
      6    England  Married         60K     Yes
      7    England  -               20K     Yes
      8    Italy    Single          85K     Yes
      9    France   Married         75K     No
      10   Denmark  Single          50K     No
      Test set:
      Country  Marital Status  Income  Hooligan
      England  Single          75K     ?
      Turkey   Married         50K     ?
      England  Married         150K    ?
      -        Divorced        90K     ?
      -        Single          40K     ?
      Italy    Married         80K     ?
  • 36. Text Classification: An Example. The feature is the text; Hooligan is the class. [Diagram; labels: Training Set, Learn Classifier, Model, Test Set]
      Training set:
      Ex#  Text                                                Hooligan
      1    An English football fan …                           Yes
      2    During a game in Italy …                            Yes
      3    England has been beating France …                   Yes
      4    Italian football fans were cheering …               No
      5    An average USA salesman earns 75K                   No
      6    The game in London was horrific                     Yes
      7    Manchester city is likely to win the championship   Yes
      8    Rome is taking the lead in the football league      Yes
      Test set:
      Text                                                  Hooligan
      A Danish football fan                                 ?
      Turkey is playing vs. France. The Turkish fans …      ?
  • 37. Classification Techniques: instance-based methods; decision trees; neural networks; Bayesian classification
  • 38. Instance-based Methods. Instance-based (memory-based) learning: store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified. k-nearest neighbor approach: instances (examples) are represented as points in a Euclidean space.
  • 39. Text Examples in Euclidean Space [figure: two plots with axes “football” and “Italian”, one for each example text: “The English football fan is a hooligan…” and “Similar to his English equivalent, the Italian football fan is a hooligan…”]
  • 40. K-Nearest Neighbor Algorithm. All instances correspond to points in the n-D space. The nearest neighbors are defined in terms of Euclidean distance. The k-NN returns the most common value among the k nearest training examples. Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples. [Figure: scatter of + and _ training points around a query point ?]
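The k-NN rule described on this slide (find the k nearest training examples by Euclidean distance, return their most common class) might be sketched like this. The training points and their "+" / "-" labels are invented 2-D examples, standing in for points like those in the slide's scatter plot.

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """examples: list of (point, label); returns the majority label of the k nearest."""
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training examples in a 2-D feature space.
training = [
    ((2, 0), "+"), ((3, 1), "+"), ((2, 1), "+"),
    ((0, 3), "-"), ((1, 4), "-"), ((0, 4), "-"),
]

label_a = knn_classify((2.5, 0.5), training, k=3)
label_b = knn_classify((0.5, 3.5), training, k=3)
```

Nothing is learned ahead of time: all work happens when a query arrives, which is the "lazy evaluation" of instance-based methods. An odd k avoids ties when there are two classes.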
  • 41. Classification Techniques: instance-based methods; decision trees; neural networks; Bayesian classification
  • 42. Decision Tree: An Example. Country and Marital Status are categorical, Income is continuous, Hooligan is the class. The splitting attribute at a node is determined based on a specific attribute selection algorithm. [Decision tree diagram; splitting attributes: English (Yes / No), MarSt (Married / Single, Divorced), Income (> 80K / < 80K); leaves: YES, NO]
      Ex#  Country  Marital Status  Income  Hooligan
      1    England  Single          125K    Yes
      2    England  Married         100K    Yes
      3    England  Single          70K     Yes
      4    Italy    Married         40K     No
      5    USA      Divorced        95K     No
      6    England  Married         60K     Yes
      7    England  Divorced        20K     Yes
      8    Italy    Single          85K     Yes
      9    France   Married         75K     No
      10   Denmark  Single          50K     No
  • 43. Decision Tree: A Text Example. The feature is the text; Hooligan is the class. The splitting attribute at a node is determined based on a specific attribute selection algorithm. [Decision tree diagram; splitting attributes: English (Yes / No), MarSt (Married / Single, Divorced), Income (> 80K / < 80K); leaves: YES, NO]
      Ex#  Text                                                Hooligan
      1    An English football fan …                           Yes
      2    During a game in Italy …                            Yes
      3    England has been beating France …                   Yes
      4    Italian football fans were cheering …               No
      5    An average USA salesman earns 75K                   No
      6    The game in London was horrific                     Yes
      7    Manchester city is likely to win the championship   Yes
      8    Rome is taking the lead in the football league      Yes
  • 44. Classification by DT Induction. Decision tree: a flow-chart-like tree structure; an internal node denotes a test on an attribute; a branch represents an outcome of the test; leaf nodes represent class labels or class distributions. Decision tree generation consists of two phases: tree construction, and tree pruning (identify and remove branches that reflect noise or outliers). Use of a decision tree: classify an unknown sample by testing the attributes of the sample against the decision tree.
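The slides say the splitting attribute is chosen by "a specific attribute selection algorithm" without naming one. Information gain (the entropy-based criterion used by ID3) is assumed here as one common choice, applied to the ten-row hooligan table from the decision tree example. The three candidate tests mirror the attribute names in the slide's tree; the 80K threshold is taken from the diagram labels.

```python
import math
from collections import Counter, defaultdict

# (country, marital status, income in K, hooligan) from the slide's table.
ROWS = [
    ("England", "Single", 125, "Yes"), ("England", "Married", 100, "Yes"),
    ("England", "Single", 70, "Yes"), ("Italy", "Married", 40, "No"),
    ("USA", "Divorced", 95, "No"), ("England", "Married", 60, "Yes"),
    ("England", "Divorced", 20, "Yes"), ("Italy", "Single", 85, "Yes"),
    ("France", "Married", 75, "No"), ("Denmark", "Single", 50, "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute):
    """Entropy of the class before the split minus the weighted entropy after it."""
    groups = defaultdict(list)
    for row in rows:
        groups[attribute(row)].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

gain_english = information_gain(ROWS, lambda r: r[0] == "England")
gain_marital = information_gain(ROWS, lambda r: r[1])
gain_income = information_gain(ROWS, lambda r: r[2] > 80)
```

On this table the English test separates the classes far better than the other two candidates, because every English row is labeled Yes. Other criteria (for example the Gini index) could rank attributes differently on other data.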
  • 45. Clustering Techniques: partitioning methods; hierarchical methods
  • 46. Partitioning Algorithms. Partitioning method: construct a partition of n documents into a set of k clusters. Given: a set of documents and the number k. Find: a partition into k clusters that optimizes the chosen partitioning criterion. Global optimum: exhaustively enumerate all partitions. Heuristic methods: k-means and k-medoids algorithms. k-means: each cluster is represented by the center of the cluster.
  • 47. The K-means Clustering Method. The k-means algorithm is implemented in 4 steps: (1) Partition objects into k nonempty subsets. (2) Compute seed points as the centroids of the clusters of the current partition; the centroid is the center (mean point) of the cluster. (3) Assign each object to the cluster with the nearest seed point. (4) Go back to step 2; stop when there are no more new assignments.
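The four steps on this slide translate almost line for line into code. This is a minimal sketch: the initial partition is a round-robin assignment (an arbitrary, deterministic choice made here so the example is reproducible), an emptied cluster keeps its previous seed, and the six 2-D points are invented.

```python
import math

def centroid(points):
    # Mean point of a nonempty list of equal-length tuples.
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iter=100):
    """Requires len(points) >= k. Returns (cluster index per point, seed points)."""
    # Step 1: partition objects into k nonempty subsets (round-robin, assumed).
    assignment = [i % k for i in range(len(points))]
    seeds = [None] * k
    for _ in range(max_iter):
        # Step 2: seed points are the centroids of the current partition.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an emptied cluster keeps its previous seed
                seeds[j] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, seeds[j])) for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat from step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, seeds

# Hypothetical data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, seeds = k_means(points, k=2)
```

k-means is a heuristic: it converges to a local optimum that depends on the initial partition, so practical implementations usually try several random starts.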
  • 48. The K-means Clustering: Example [figure: scatter plots on axes running from 0 to 10]
  • 49. Clustering Techniques: partitioning methods; hierarchical methods
  • 50. Hierarchical Clustering. Agglomerative: start with each document being a single cluster; eventually all documents belong to the same cluster. Divisive: start with all documents belonging to the same cluster; eventually each node forms a cluster on its own. Does not require the number of clusters k in advance. Needs a termination condition: the final state in both agglomerative and divisive clustering is of no use.
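  • 51. Hierarchical Clustering: Example [figure: agglomerative clustering of a, b, c, d, e from step 0 to step 4: {a}, {b}, {c}, {d}, {e}; then {a, b}; then {d, e}; then {c, d, e}; then {a, b, c, d, e}. Divisive clustering runs the same sequence in the opposite direction, from step 4 back to step 0]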
  • 52. A Dendrogram: Hierarchical Clustering. Dendrogram: decomposes data objects into several levels of nested partitioning (a tree of clusters). A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
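Agglomerative clustering and cutting the dendrogram can be sketched together. The slides do not say how the distance between two clusters is measured; single-link (distance between the closest members) is assumed here, and the 1-D positions for the objects a to e are invented so that the merge order matches the a, b, c, d, e example.

```python
def agglomerate(positions):
    """Single-link agglomerative clustering; returns merges as (members, distance)."""
    clusters = [frozenset([name]) for name in positions]
    merges = []

    def link(c1, c2):
        # Single-link: distance between the two closest members (assumed criterion).
        return min(abs(positions[x] - positions[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        dist, i, j = min(
            (link(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        merges.append((sorted(merged), dist))
    return merges

def cut(positions, merges, level):
    """Cut the dendrogram: apply only the merges made at distance <= level."""
    cluster_of = {name: frozenset([name]) for name in positions}
    for members, dist in merges:
        if dist > level:
            break
        for name in members:
            cluster_of[name] = frozenset(members)
    return sorted(sorted(c) for c in set(cluster_of.values()))

# Hypothetical 1-D positions for the five objects.
positions = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 7.5, "e": 9.0}
merges = agglomerate(positions)
clusters_at_3 = cut(positions, merges, level=3.0)
```

The cut level plays the role of the termination condition: without it the procedure always ends with everything in one cluster.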
  • 53. Demo
  • 54. Commercial Tools: IBM Intelligent Miner for Text; Semio Map; InXight LinguistX / ThingFinder; LexiQuest; ClearForest; Teragram; SRA NetOwl Extractor; Autonomy
  • 55. Summary. Text is tricky to process, but “ok” results are easily achieved. There exist several text mining systems, e.g., D2K - Data to Knowledge (http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/). Additional intelligence can be integrated with text mining: one may play with any phase of the text mining process.
  • 56. Summary. There are many other scientific and statistical text mining methods developed but not covered in this talk: http://www.cs.utexas.edu/users/pebronia/text-mining/ and http://filebox.vt.edu/users/wfan/text_mining.html. Also, it is important to study the theoretical foundations of data mining: Data Mining: Concepts and Techniques (J. Han & M. Kamber); Machine Learning (T. Mitchell).