1. Large-scale information extraction and integration infrastructure
for supporting financial decision making (FP7-ICT-257928)
http://project-first.eu
Text Mining and Text Stream
Mining Tutorial
Miha Grčar
miha.grcar@ijs.si
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana
http://kt.ijs.si
2. Text and text stream mining
tutorial
• Part I: Text mining
• Part II: Text stream mining
Lucca, Oct 2012 Miha Grčar: Text and text stream mining 2
4. PART I • PART II
INTRO • BOW • ML • EVAL • APP
What is text mining?
• Text mining provides a set of methodologies and tools for
discovering, presenting, and evaluating knowledge from
large collections of textual documents
• Text mining adopts and adapts methodologies and
tools from …
– Data mining (DM)
– Machine learning (ML)
– Information retrieval (IR)
– Natural language processing (NLP)
– Visualization
– Social network analysis and graph mining
– Knowledge management
– …
5. Typical text mining process
Stages (connected by a feedback loop):
• Data acquisition: acquisition, cleaning
• Text preprocessing: transformation
• Modeling: discover, extract, and organize knowledge
• Evaluation / validation: performance and utility assessment, feedback loop
• Application: presentation, interaction
6. What do we cover in Part 1?
Stages (connected by a feedback loop):
• Data acquisition
• Text preprocessing: vector space model (bags-of-words)
• Modeling: machine learning (classification, clustering)
• Evaluation / validation: cross validation, precision, recall, …
• Application: search & browse, categorization, recommendation, advertising, spam detection, summarization, visualization, …
7. Bags of words
• Tokenize • Remove stop words
Example: "The quick brown dog jumps over the lazy dog."
Tokens: the | quick | brown | dog | jumps | over | the | lazy | dog
8. Bags of words
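• Tokenize • Remove stop words • Lemmatize • Compute weights
Example: "The quick brown dog jumps over the lazy dog."
Tokens: the | quick | brown | dog | jumps | over | the | lazy | dog
After stop word removal and lemmatization: quick | brown | dog | jump | lazy
Weights: quick 1, brown 1, dog 2, jump 1, lazy 1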
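The four steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the tutorial's implementation: the stop word set and the lemma lookup table are toy assumptions chosen to fit the example sentence, and a real system would use a full stop word list and a proper lemmatizer.

```python
import re
from collections import Counter

# Toy resources, assumed for this example only.
STOP_WORDS = {"the", "over"}
LEMMAS = {"jumps": "jump"}

def bag_of_words(text):
    """Tokenize, remove stop words, lemmatize, and count terms."""
    tokens = re.findall(r"[a-z']+", text.lower())          # tokenize
    content = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    lemmas = [LEMMAS.get(t, t) for t in content]           # lemmatize (lookup)
    return Counter(lemmas)                                 # term counts

bow = bag_of_words("The quick brown dog jumps over the lazy dog.")
print(dict(bow))
```

The resulting counts are the raw term frequencies that the weighting step then turns into TF-IDF weights.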
9. Bags of words
Tokenization & stop word removal
Original text:
After ripping 14% higher from June until the first week of October, stocks ran headfirst into a wall of worry seemingly too large to climb. Europe, China, the fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real. Investors suddenly care and are behaving accordingly, selling some of their more aggressive names and rotating into defensives.
Simple tokenizer (alphanumeric strings only):
After | ripping | 14 | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren | t | new | concerns | but | that | doesn | t | mean | they | aren | t | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives
10. Bags of words
Tokenization & stop word removal
Original text:
After ripping 14% higher from June until the first week of October, stocks ran headfirst into a wall of worry seemingly too large to climb. Europe, China, the fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real. Investors suddenly care and are behaving accordingly, selling some of their more aggressive names and rotating into defensives.
Regex tokenizer ([\p{L}']+):
After | ripping | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concerns | but | that | doesn't | mean | they | aren't | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives
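The difference between the two tokenizers can be reproduced with Python's standard `re` module. This is a sketch under one assumption: `re` does not support the `\p{L}` letter class, so the second function approximates it with `[^\W\d_]` (a word character that is neither a digit nor an underscore). The function names are illustrative.

```python
import re

def simple_tokenize(text):
    """Alphanumeric strings only (ASCII letters and digits)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def letter_tokenize(text):
    """Runs of letters and apostrophes, approximating the letter-class regex."""
    # [^\W\d_] matches any Unicode letter in Python 3 string patterns.
    return re.findall(r"(?:[^\W\d_]|')+", text)

sample = "After ripping 14% higher, concerns aren't new."
print(simple_tokenize(sample))   # splits "aren't" and keeps "14"
print(letter_tokenize(sample))   # keeps "aren't" and drops "14"
```

Keeping the apostrophe inside the token preserves contractions such as "aren't", which matters for later steps like negation handling.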
11. Bags of words
Lemmatization
Original text:
After ripping 14% higher from June until the first week of October, stocks ran headfirst into a wall of worry seemingly too large to climb. Europe, China, the fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real. Investors suddenly care and are behaving accordingly, selling some of their more aggressive names and rotating into defensives.
Lemmatized:
After | rip | high | from | June | until | the | first | week | of | October | stock | run | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concern | but | that | doesn't | mean | they | aren't | real | Investor | suddenly | care | and | are | behave | accordingly | sell | some | of | their | more | aggressive | name | and | rotate | into | defensive
12. Bags of words
Lemmatization
Original text (Italian):
È uno dei punti più contestati della legge di Stabilità approvata da poco dal governo: il taglio alle detrazioni fiscali, ossia gli "sconti" che ogni contribuente può vantare sulla propria dichiarazione dei redditi. Secondo una bozza aggiornata del disegno di legge, il taglio si applicherebbe a decorrere dal periodo di imposta al 31 dicembre 2012. Un dettaglio che aveva creato, nei giorni scorsi, non poche polemiche.
Lemmatized:
E | uno | dei | puntare | più | contestato | della | legge | di | Stabilità | approvare | da | poco | dal | governo | il | tagliare | alle | detrazione | fiscale | ossia | gli | scontare | che | ogni | contribuire | può | vantare | sulla | proprio | dichiarazione | dei | reddito | Secondo | una | bozzare | aggiornare | del | disegnare | di | legge | il | tagliare | si | applicare | a | decorrere | dal | periodare | di | impostare | al | dicembre | Un | dettagliare | che | aveva | creare | nei | giorno | scorrere | non | poca | polemico
13. Computing weights
• TF
– Term Frequency
– The number of times a lemma (stem) occurs in a document
• DF
– Document Frequency
– The number of documents in which a lemma (stem) occurs at least
once
• TFIDF
• Higher TF means higher TFIDF
• Higher DF means lower TFIDF
14. Computing weights
Document: "The quick brown dog jumps over the lazy dog."
Term    TF   DF   IDF   TFIDF
quick   1    1    0     0
brown   1    1    0     0
dog     2    1    0     0
jump    1    1    0     0
lazy    1    1    0     0
15. Computing weights
Documents: "The quick brown dog jumps over the lazy dog." and a second document containing "jump"
Term    TF   DF   IDF    TFIDF
quick   1    1    0.69   0.69
brown   1    1    0.69   0.69
dog     2    1    0.69   1.39
jump    1    2    0      0
lazy    1    1    0.69   0.69
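The values in this table can be reproduced with a short sketch. It assumes that IDF = ln(N / DF), where N is the number of documents, and that TFIDF = TF × IDF; the slides do not spell out the formula, but this choice matches the numbers shown (ln 2 ≈ 0.69). The second document is assumed to consist of the single lemma "jump".

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (docs are lemma lists)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # each document counts once per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = [
    ["quick", "brown", "dog", "jump", "lazy", "dog"],
    ["jump"],                          # assumed second document
]
vecs = tfidf(docs)
for term, weight in vecs[0].items():
    print(f"{term:6s} {weight:.2f}")
```

A term that occurs in every document (here "jump") gets weight 0, which is why the single-document corpus on the previous slide yields all zeros.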
16. Cosine similarity
[Figure: document vectors d1 and d2 drawn from the origin 0]
17. Cosine similarity
[Figure: document vectors d1 and d2 and their unit-length counterparts d1' and d2']
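Cosine similarity is the dot product of two vectors after scaling both to unit length. A minimal sketch for sparse vectors follows; storing vectors as `{term: weight}` dictionaries is an assumption made for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two sparse vectors ({term: weight})."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                     # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

d1 = {"dog": 1.39, "quick": 0.69}
d2 = {"dog": 0.69, "lazy": 0.69}
print(round(cosine(d1, d2), 3))
```

If all document vectors are normalized once up front, the similarity reduces to a plain dot product, which is why normalized sparse vectors are convenient as learning examples.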
18. Centroids
• Determine characteristic
words in a cluster
• Nearest centroid classifier
• k-means clustering
• …
20. Machine learning
• Machine learning is concerned with the development of
algorithms that allow computer programs to learn from past
experience [Mitchell]
• Machine learning refers to a collection of algorithms that take
as input empirical data (e.g., from databases or sensors) and
try to discover some characteristics (rules, constraints,
patterns, features) of the process that generated the data
[Wikipedia]
• Learning from past experience = learning from past examples
• Examples (instances) = document vectors (normalized sparse
vectors)
21. Machine learning
• We will look at two commonly used
machine learning techniques
– Classification
• Assigning instances (documents) to two or
more predefined (discrete) classes
• Supervised learning method
– Clustering
• Arranging instances (documents) into
groups (clusters) so that instances in the
same group are more similar to each other
than to those in other groups
• Unsupervised learning method
22. Classification
• Labeled documents
Mergers & Acquisitions • Ingram Wraps Up Brightpoint Buyout
Mergers & Acquisitions • State Street completes acquisition of Goldman Sachs Administration Services
Economy & Government • Gasoline fuels inflation, but Fed policy seen steady
Economy & Government • Euro Leads Majors Higher as Spanish Bailout Looks Increasingly Likely
...
Investing Picks • Smith & Wesson Holding Corp. Enters Oversold Territory
Investing Picks • The Fresh Market: A Strong Buy
• Learn to classify
Labeled dataset → Training algorithm → Classification model
• Classify unlabeled documents
Unlabeled dataset → Classification algorithm (using the classification model) → Predictions (labels)
Example: "Fresh Del Monte Produce Inc. Enters Oversold Territory" → Investing Picks
23. Classification with k-Nearest Neighbors
[Figure: a new document and its nearest labeled neighbors from the classes Investing Picks, Mergers & Acquisitions, and Economy & Government]
Neighbor votes:
Investing Picks: 4
Mergers & Acquisitions: 1
Economy & Government: 0
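The voting scheme in the figure can be sketched as follows. The example assumes cosine similarity over sparse `{term: weight}` vectors and a simple majority vote; the tiny training set and its weights are made up for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, labeled, k=5):
    """labeled is a list of (vector, label); returns (winner, votes)."""
    # Rank all training examples by similarity to the query.
    ranked = sorted(labeled, key=lambda ex: cosine(query, ex[0]), reverse=True)
    # Let the k most similar examples vote with their labels.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], votes

train = [
    ({"buy": 1.0}, "Investing Picks"),
    ({"buy": 1.0, "stock": 1.0}, "Investing Picks"),
    ({"merger": 1.0}, "Mergers & Acquisitions"),
]
label, votes = knn_classify({"buy": 1.0}, train, k=3)
print(label, dict(votes))
```

Note that there is no training step: the "model" is the labeled dataset itself, which is why classification is slow and the model is large.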
24. Classification with Nearest Centroid Classifier
[Figure: similarities s1, s2, and s3 between a new document and the centroids of the three classes]
Similarity s2 > s1 > s3
s2: Mergers & Acquisitions
s1: Investing Picks
s3: Economy & Government
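A nearest centroid classifier can be sketched in a few functions. The sketch assumes unit-length sparse vectors and cosine similarity; because only the direction of a centroid matters for cosine similarity, it normalizes the sum of the class vectors instead of dividing by the class size. The function names and the toy data are illustrative.

```python
import math
from collections import defaultdict

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def train_centroids(labeled):
    """Compute one unit-length centroid per class from (vector, label) pairs."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in labeled:
        for term, weight in normalize(vec).items():
            sums[label][term] += weight
    return {label: normalize(total) for label, total in sums.items()}

def classify(vec, centroids):
    """Return the label of the most similar centroid and all similarities."""
    query = normalize(vec)
    sims = {label: dot(query, c) for label, c in centroids.items()}
    return max(sims, key=sims.get), sims

train = [
    ({"buy": 1.0}, "Investing Picks"),
    ({"buy": 1.0, "stock": 1.0}, "Investing Picks"),
    ({"merger": 1.0}, "Mergers & Acquisitions"),
]
centroids = train_centroids(train)
label, sims = classify({"merger": 1.0, "deal": 1.0}, centroids)
print(label, sims)
```

The heaviest terms of a centroid double as a description of its class, which is how centroids help determine characteristic words.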
25. Classification with Support Vector Machine (SVM)
[Figure: separating hyperplane with margin w between the classes Investing Picks and Mergers & Acquisitions]
• Maximize w
• Minimize tradeoff
26. Classification algorithms
                       k-NN          Nearest centroid   SVM (linear kernel)
Multiclass?            yes           yes                no
Explains decisions?    no            yes                yes
Explains model?        no            yes                yes
Number of parameters   1             0                  1
Model size             big           small              small
Training speed         no training   fast               slow
Classification speed   slow          fast               fast
Accuracy (on texts)    low           medium             high
28. Clustering
• k-means clustering
• Agglomerative hierarchical clustering
29. k-means clustering
Input: k
Output: k clusters (and their centroids)
1. Randomly select k instances for initial centroids
2. Assign step
Assign each instance to the nearest centroid
3. If the assignments did not change, end the
algorithm
4. Update step
Recompute (update) centroids
5. Repeat from Step 2
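The five steps can be sketched directly in Python. For brevity this sketch assumes dense vectors (tuples of floats) and squared Euclidean distance rather than sparse vectors with cosine similarity; the `seed` parameter is an addition that makes the random initialization repeatable.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly select k instances as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    while True:
        # Step 2 (assign): attach each instance to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Step 3: stop when the assignments no longer change.
        if new_assignment == assignment:
            return assignment, centroids
        assignment = new_assignment
        # Step 4 (update): recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Step 5: repeat from Step 2.

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```

Different random initializations can lead to different clusterings on less clear-cut data, so the algorithm is often run several times.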
30. k-means clustering
This video is available at http://first.ijs.si/tutorial/video/kmeans.html
31. Agglomerative hierarchical clustering
1. Find the two most similar instances
2. Connect them
3. Replace them with their centroid
4. Repeat …
“Dendrogram”
33. Evaluation
• Cross validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29)
– 10-fold cross validation
– Stratified
• Accuracy
• Precision, recall, F1 (http://en.wikipedia.org/wiki/Precision_and_recall | http://en.wikipedia.org/wiki/F1_Score)
• Micro and macro-averaging (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html | http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization)
• Statistical tests (http://en.wikipedia.org/wiki/Statistical_hypothesis_testing)
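The measures listed above can be computed from per-class counts of true positives, false positives, and false negatives. The sketch below assumes single-label classification and uses illustrative names; macro-averaging averages the per-class F1 scores, while micro-averaging pools the counts over all classes first.

```python
from collections import Counter

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def evaluate(gold, pred):
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1                 # predicted class gains a false positive
            fn[g] += 1                 # true class gains a false negative
    per_class = {l: prf(tp[l], fp[l], fn[l]) for l in labels}
    macro_f1 = sum(scores[2] for scores in per_class.values()) / len(labels)
    micro_f1 = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
    return per_class, macro_f1, micro_f1

gold = ["a", "a", "a", "b"]
pred = ["a", "a", "b", "b"]
per_class, macro_f1, micro_f1 = evaluate(gold, pred)
print(per_class, macro_f1, micro_f1)
```

In the single-label case micro-averaged F1 equals accuracy, whereas macro-averaging gives small classes the same influence as large ones.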
35. Applications
• Enhanced Web search (SearchPoint)
• Social browsing (LiveNetLife)
• Content categorization
• Content-based recommender systems
• Advertising
• Blogging assistance (Zemanta)
• Spam detection
• Visualization / summarization of large corpora
• Text summarization
Leskovec et al. (2005): Extracting Summary Sentences Based on the Document Semantic Graph. Microsoft Research Technical Report MSR-TR-2005-07.
• Sentiment analysis (demo later)
• News aggregation
http://emm.newsexplorer.eu
• Knowledge engineering
http://ontogen.ijs.si
• …
36. Enhanced Web search (http://www.searchpoint.com)
37. Social browsing (http://www.livenetlife.com) @ http://videolectures.net
38. Content categorization @ http://videolectures.net
39. Recommender system @ http://videolectures.net
41. Blogging assistant (http://www.zemanta.com)
42. Pump & dump
Siering, Muntermann, Grčar (2012)
43. Visualizations
• Document space
visualization
• Canyon flows
• Tag clouds
http://www.jasondavies.com/wordcloud/
44. Recap
• Basics
– What is text mining?
– TF-IDF bag-of-words vectors
– Cosine similarity
– Centroids
• Machine learning
– k-NN
– Nearest centroid classifier
– SVM
– k-means
– Agglomerative clustering
• Applications
– Enhanced Web search (SearchPoint)
– Social browsing (LiveNetLife)
– Content categorization
– Content-based recommender systems
– Advertising
– Writing assistance (Zemanta)
– Spam detection
– Visualization / summarization of large corpora
– …
46. PART I • PART II
INTRO • DACQ • BOW • ML • APP
What is text stream mining?
Same as text mining but on streams
Text stream mining provides a set of
methodologies and tools for discovering,
presenting, and evaluating knowledge from
streams of textual documents
48. Typical text stream mining process
Stages (connected by a feedback loop):
• Stream data acquisition: acquisition, cleaning
• Text preprocessing: transformation
• Modeling: discover, extract, and organize knowledge
• Evaluation / validation: performance and utility assessment, obtaining new labels, feedback loop
• Application: presentation, interaction
49. Text stream mining pipelines
• Pipelining and parallelization
– Enables concurrent processing
– Increases throughput
– Enables distributed execution (cluster)
• Near-realtime online systems
– Stream cannot be paused or slowed down
(e.g., newsfeeds)
– [Near-realtime] Time between reception and
utilization of data should be as short as possible
– [Online] Stream is infinite and (sooner or later)
outdated data needs to be deleted
50. What do we cover in Part II?
Stages (connected by a feedback loop):
• Stream data acquisition: RSS feeds, boilerplate remover, language detection
• Text preprocessing: online BOW
• Modeling: online ML (incr. NCC, incr. k-means, incr. SVM)
• Evaluation / validation
• Application: online document space visualization, online Twitter sentiment classification
51. Text stream acquisition and preprocessing
[Figure: load balancing across parallel preprocessing pipelines (RSS reader, boilerplate remover, language detector), synchronized into the online BOW component]
53. RSS (Really Simple Syndication)
<rss version="2.0">
  <channel>
    <generator>NFE/1.0</generator>
    <title>Top Stories - Google News</title>
    <link>http://news.google.com/news?pz=1&amp;ned=us&amp;hl=en</link>
    <language>en</language>
    <webMaster>news-feedback@google.com</webMaster>
    <copyright>&amp;copy;2011 Google</copyright>
    <item>
      <title>Egypt Analysts Comment on Next Steps After Mubarak’s Ouster - Bloomberg</title>
      <link>http://news.google.com/news/url?sa=t&amp;fd=R&amp;usg=AFQjCNEF9B7Q8C7_TBDKPEMFjb83fcuNfQ&amp;url=http://www.bloomberg.com/news/2011-02-11/egypt-analysts-comment-on-next-steps-after-mubarak-s-ouster.html</link>
      <category>Top Stories</category>
      <pubDate>Fri, 11 Feb 2011 20:15:40 GMT+00:00</pubDate>
      <description>The ouster of Hosni Mubarak from Egypt’s presidency today, after protests that started Jan. 25, prompted the following comments from analysts: “The army needs to move quickly to remove obstacles to ...</description>
    </item>
    ...
  </channel>
</rss>
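An RSS reader component mainly has to pull the title, link, date, and description out of each item. The sketch below uses Python's standard `xml.etree.ElementTree` on a small made-up feed; the feed content, the `read_items` name, and the output field names are assumptions for illustration, and downloading the feed is left out.

```python
import xml.etree.ElementTree as ET

# A small, made-up feed with the same structure as the example above.
FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/news?id=1&amp;lang=en</link>
      <pubDate>Fri, 11 Feb 2011 20:15:40 GMT</pubDate>
      <description>Short summary of the first story.</description>
    </item>
  </channel>
</rss>"""

def read_items(rss_text):
    """Extract the basic fields of every item in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "summary": item.findtext("description", default=""),
        })
    return items

items = read_items(FEED)
print(items[0]["title"], items[0]["link"])
```

The description usually holds only a teaser, so the full article still has to be fetched from the link and cleaned by the boilerplate remover.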
55. Boilerplate removal
http://www.bbc.co.uk/news/world-us-canada-15051554
56. Boilerplate removal
URL tree
URL parts: protocol :// domain / path / file ? query
Example: http://kt.ijs.si/a/b/c.html?pg=0
Tree branch (root, domain, path): # → si → ijs → kt → a → b
Example: http://www.bbc.co.uk/news/world-us-canada-15051554
Tree branch: # → uk → co → bbc → www → news
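The mapping from a URL to its tree branch can be sketched with `urllib.parse`. The sketch assumes that the last path segment is the file (and is dropped) unless the path ends with a slash; the function name `url_branch` is illustrative.

```python
from urllib.parse import urlparse

def url_branch(url):
    """Return the tree branch for a URL: root, reversed domain, path folders."""
    parts = urlparse(url)
    domain = parts.hostname.split(".")[::-1]        # kt.ijs.si -> si, ijs, kt
    segments = [s for s in parts.path.split("/") if s]
    # Treat the last segment as the file unless the path ends with "/".
    folders = segments if parts.path.endswith("/") else segments[:-1]
    return ["#"] + domain + folders

print(url_branch("http://kt.ijs.si/a/b/c.html?pg=0"))
print(url_branch("http://www.bbc.co.uk/news/world-us-canada-15051554"))
```

Reversing the domain puts the most general part first, so pages from the same site and the same section share a long common prefix in the tree.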
57. Boilerplate removal
URL tree
How many times did I see “About Us” in this part of the tree?
[Figure: documents from the stream pass through a URL tree with the levels root (#), domain, and path]
This method is …
• Unsupervised
• Online
• Incremental (consumes one document at a time)
59. Language detection
• Motivation: language-specific text analysis
components and applications
• Solutions based on word lists and word or
character sequences (n-grams)
• Character n-gram model
– Build character n-gram histograms for many
languages (language models)
– Compare text document histogram to language
models
60. Language detection
Character n-grams by rank:
Rank   English   German
1      E         E
2      T         N
3      O         R
4      A         I
5      N         T
6      I         S
7      H         A
8      S         D
9      R         U
10     D         EN
11     E_        G
12     L         ER
13     _T        H
14     TH        L
15     HE        N_
16     U         O
17     W         M
18     C         _D
19     M         C
...    ...       ...
Highlighted words: THE (English); DER, DEN (German)
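Rank lists like the ones above can be built and compared with a short sketch. How the histograms are compared is not specified on the slides; this example assumes the common "out-of-place" rank distance, uses "_" for spaces, and trains on two tiny made-up sample texts, so it only illustrates the mechanics.

```python
from collections import Counter

def ngram_ranks(text, max_n=2, top=300):
    """Map each character n-gram (n = 1..max_n) to its frequency rank."""
    padded = "_" + "_".join(text.upper().split()) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            if gram.strip("_"):        # skip n-grams made only of "_"
                counts[gram] += 1
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top))}

def out_of_place(doc_ranks, model_ranks):
    """Sum of rank differences; unseen n-grams get the maximum penalty."""
    penalty = len(model_ranks)
    return sum(
        abs(rank - model_ranks[g]) if g in model_ranks else penalty
        for g, rank in doc_ranks.items()
    )

def detect_language(text, models):
    doc_ranks = ngram_ranks(text)
    return min(models, key=lambda lang: out_of_place(doc_ranks, models[lang]))

EN_SAMPLE = "the quick brown dog jumps over the lazy dog and then the dog sleeps"
DE_SAMPLE = "der schnelle braune hund springt über den faulen hund und dann schläft der hund"
models = {"en": ngram_ranks(EN_SAMPLE), "de": ngram_ranks(DE_SAMPLE)}
print(detect_language(EN_SAMPLE, models))
```

Real language models would be built from much larger corpora; with samples this small the ranks are too noisy to classify unseen text reliably.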
61. Language detection
Article “Egypt rejoices at Mubarak departure”
[Figure: two scatter plots of n-gram ranks in the English article (vertical axis) against n-gram ranks in the English language model (left) and in the German language model (right)]
63. Online BOW
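[Figure: documents from the stream are added to a queue of TF vectors and outdated documents are removed from it; each addition and removal updates the DF values]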
64. Online BOW
[Figure: TF vectors from the queue are combined with the current DF values to compute TF-IDF vectors]
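The queue of TF vectors and the DF values can be kept in sync with a small class. This sketch makes two assumptions that the slides leave open: documents become outdated once the queue exceeds a fixed size (`max_docs`), and IDF is ln(N / DF) over the documents currently in the queue. The class name `OnlineBow` is illustrative.

```python
import math
from collections import Counter, deque

class OnlineBow:
    def __init__(self, max_docs):
        self.max_docs = max_docs
        self.queue = deque()           # TF vectors of the documents in the window
        self.df = Counter()            # document frequencies over the window

    def add(self, tokens):
        """Add a document, drop an outdated one, return the TF-IDF vector."""
        tf = Counter(tokens)
        self.queue.append(tf)
        self.df.update(tf.keys())      # "Add": increase DF of each distinct term
        if len(self.queue) > self.max_docs:
            outdated = self.queue.popleft()
            for term in outdated:      # "Remove": decrease DF of each term
                self.df[term] -= 1
                if self.df[term] == 0:
                    del self.df[term]
        return self.tfidf(tf)

    def tfidf(self, tf):
        n_docs = len(self.queue)
        return {t: c * math.log(n_docs / self.df[t])
                for t, c in tf.items() if self.df[t] > 0}

bow = OnlineBow(max_docs=2)
bow.add(["a", "b"])
bow.add(["a"])
latest = bow.add(["c"])                # pushes out the document ["a", "b"]
print(dict(bow.df), latest)
```

Because only the affected DF counts change, each addition or removal costs time proportional to the size of one document rather than to the size of the window.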
66. Batch, incremental, offline, online
• Batch learning
Consuming all training examples at once
• Incremental learning
Consuming one example at a time
• Mini-batch learning
Consuming several examples at a time
• Offline learning (for datasets/finite streams)
All data is stored and can be accessed repeatedly
• Online learning (for infinite streams)
Each example is discarded after being processed
67. Incremental nearest centroid classifier
[Figure: a class centroid is updated when a new instance arrives and when an outdated instance is removed]
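One way to update a centroid without recomputing it from scratch is to store the sum of the class vectors and their count. The sketch below assumes sparse `{term: weight}` vectors; the class name and the small-value cleanup threshold are illustrative choices.

```python
from collections import defaultdict

class IncrementalCentroid:
    def __init__(self):
        self.total = defaultdict(float)   # sum of all current instance vectors
        self.count = 0

    def add(self, vec):
        """A new instance arrives."""
        for term, weight in vec.items():
            self.total[term] += weight
        self.count += 1

    def remove(self, vec):
        """An outdated instance leaves."""
        for term, weight in vec.items():
            self.total[term] -= weight
            if abs(self.total[term]) < 1e-12:
                del self.total[term]      # keep the sum sparse
        self.count -= 1

    def centroid(self):
        if self.count == 0:
            return {}
        return {t: w / self.count for t, w in self.total.items()}

c = IncrementalCentroid()
first = {"a": 1.0}
c.add(first)
c.add({"a": 3.0, "b": 2.0})
before = c.centroid()
c.remove(first)
after = c.centroid()
print(before, after)
```

Each update touches only the terms of the affected instance, so the cost does not depend on how many instances the class currently holds.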
68. Incremental k-means clustering
Converges in only a few iterations (warm start)
69. Other incremental methods
• Incremental SVM
A. Bordes, S. Ertekin, J. Weston, and L. Bottou (2005): Fast Kernel Classifiers with Online and Active Learning, Journal of Machine Learning Research, vol. 6, pp. 1579–1619
• Incremental perceptron
www.cs.columbia.edu/~jebara/4771/tutorials/perceptron.pdf
• Incremental winnow
http://en.wikipedia.org/wiki/Winnow_%28algorithm%29
71. Document space visualization
[Figure: projection of documents from several thousand dimensions to 2D]
72. Document space visualization
[Figure: visualization pipeline with the components document corpus, corpus preprocessing, k-means clustering, stress majorization, neighborhoods computation, least-squares interpolation, and layout]
74. Document space visualization
[Figure: the visualization pipeline adapted to streams, with the components document corpus, online BOW, k-means clustering, stress majorization, neighborhoods computation, least-squares interpolation, and layout. Annotations: maintaining sorted lists, warm start (on three components), parallelization, pipelining]
75. Document space visualization
This video is available at http://first.ijs.si/tutorial/video/ameba.html
76. Twitter
• Platform for sending
short messages
(similar to SMS)
• Est. 225 million users
• 100 million accounts
added in 2010
• 65 million tweets per day
77. Financial tweets
• Informal $ sign convention
• Some examples (March 19):
– User#1: $AAPL is making an announcement at 9am
on what it plans to do with its 97 billion in cash. We
expect a dividend announcement
– User#2: $AAPL over 600.00 a share in the pre-market
on news of a dividend.
– User#3: Will there be any other news besides $AAPL
dividend?
• We acquire ~13,000 tweets per
weekday, for ~1,800 NASDAQ/NYSE
stocks ($GOOG, $MSFT…)
• We analyze tweets to determine
whether they contain positive or
negative vocabulary
78. Sentiment classification
• Labeled documents
POS Financial markets are now officially open :)
POS market intelligence GMI Interactive and Mintel Win ARF Great Minds Award for Quality in Research
POS $AAPL : trust me -- AAPL will soar tomorrow
NEG Oh how I miss the days with GBP was at least 2 times the AUD. Sterling forecast to hit all-time lows soon
NEG omg! did you know BORDERS closed?! they went bankrupt last month and closed!! awww, too bad! i love borders!!
NEG @aekins that's just too bad
...
• Learn to classify
Labeled dataset → Training algorithm → Classification model
• Classify unlabeled documents
Unlabeled dataset → Classification algorithm (using the classification model) → Predictions (labels)
Example: "So Nickelodeon filed for bankruptcy and announced that the next Kids Choice Awards will be it's last." → NEG
79. Sentiment classification
• Emoticons & SVM classifier
• Neutral zone
• Explanations
• Accuracy
Example tweets with positive emoticons:
Goodnight everyoneeee :) Love yall
I have a good feeling about today ;)
ooo the ice cream van is here... yaaaaaay :D
in the garden in the sun! Just about to fill the pool! happy days! :D
Finally got JSON in #processing to work. More playing around coming :)
Example tweets with negative emoticons:
@oanhLove I hate when that happens... :-/
No jobs, no money. how in the hell is min wage here 4 f'n clams an hour? :(
I hate when I have to call and wake people up :(
I don't have any chalk! :-/ MY CHALKBOARD IS USELESS
UGHHHHHHHHHHHHHHH.. life is NOT good all the time!!!!!! ;(
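One plausible reading of "Emoticons & SVM classifier" is that emoticons serve as automatic labels for the training tweets. The sketch below illustrates that idea under stated assumptions: the emoticon lists are taken from the example tweets above, tweets with mixed or no emoticons are skipped, and emoticons are stripped from the text so that a classifier cannot simply learn them. None of these choices is confirmed by the slides.

```python
# Emoticon sets assumed from the example tweets.
POSITIVE = (":)", ";)", ":D")
NEGATIVE = (":(", ";(", ":-/")

def emoticon_label(tweet):
    """Return "POS", "NEG", or None when the emoticons give no clear signal."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:             # no emoticon at all, or mixed signals
        return None
    return "POS" if has_pos else "NEG"

def strip_emoticons(tweet):
    """Remove the label-carrying emoticons before training."""
    for emoticon in POSITIVE + NEGATIVE:
        tweet = tweet.replace(emoticon, "")
    return " ".join(tweet.split())

tweets = [
    "I have a good feeling about today ;)",
    "I hate when I have to call and wake people up :(",
]
training = [(strip_emoticons(t), emoticon_label(t)) for t in tweets]
print(training)
```

The labeled pairs would then be converted to TF-IDF vectors and passed to the classifier's training algorithm.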
80. Sentiment classification
• Emoticons & SVM classifier
• Neutral zone
• Explanations
• Accuracy
[Figure: positive (+) and negative (–) examples on the two sides of the SVM hyperplane]
81. Sentiment classification
• Emoticons & SVM classifier
• Neutral zone
• Explanations
• Accuracy
[Figure: the same examples with a neutral zone around the hyperplane; examples inside the zone are marked 0]
82. Sentiment classification
• Emoticons & SVM classifier
• Neutral zone
• Explanations
• Accuracy
Example: “Sovereign debt and unemployment are big issues in EU.”
unemployed, issues, debt, eu
sovereign, big
83. Sentiment classification
• Emoticons & SVM classifier
• Neutral zone
• Explanations
• Accuracy
Preprocessing options (an X marks an enabled option): replace usernames with a token; replace URLs with a token; replace negations with a token; remove letter repetition; replace exclamation marks with a token; replace question marks with a token
[Table: the column position of each X is not recoverable; only the marks per row are shown]
Options        Accuracy   Precision/recall   Average accuracy (10-fold cross validation)
X X            81.06%     81.32%/81.32%      76.98%
X X X X X X    80.22%     82.08%/78.02%      77.43%
X X X          79.94%     77.78%/84.62%      77.10%
X X X          79.94%     76.70%/86.81%      77.53%
X X X          79.67%     80.79%/78.57%      76.85%
X              78.83%     77.60%/81.87%      77.29%
X X            78.55%     75.86%/84.62%      76.91%
               78.55%     77.78%/80.77%      76.93%
X X X X        78.27%     80.23%/75.82%      76.93%
X X X          78.27%     76.53%/82.42%      77.04%
X X X X X      77.44%     75.12%/82.97%      76.86%
84. [Figure: Netflix stock closing price and tweet sentiment over time]
Grey: The Netflix stock closing price
Blue: The number of positive tweets
Red: The number of negative tweets
Yellow: The difference between the positive and negative tweets
Green dots: Relevant events concerning Netflix
85. Volume peaks likely represent important events
• First-quarter earnings release
• Plans to launch in 43 countries in Latin America and the Caribbean
• Netflix loses TV shows and films, Netflix loses the Starz deal
86. Sentiment cross-over happens before price plunge
87. Presidential elections http://predsedniskevolitve.si
88. Recap
• Basics
– What is text stream mining?
– Pipelining, parallelization
– Web data acquisition
– Online BOWs
• Machine learning
– Batch, incremental, offline, online
– Incremental nearest centroid classifier
– Incremental k-means
– Warm start
• Applications
– Online document space visualization
– Online Twitter sentiment classifier
• Stock sentiment monitoring
• Presidential elections
Editor's notes
Applet at http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedoids/Kmeans_Kmedoids.html
- Vegas77 Entertainment SE
- Spam normally sent on weekends, lines drawn at Fridays; exceptions 28.3. and 28.4.
- Price on Monday higher in many cases