
Computational Framework for Generating Visual Summaries of Topical Clusters in Twitter Streams

Based on: http://link.springer.com/chapter/10.1007%2F978-3-319-02993-1_9

  1. Semantic Modeling: Computational Framework for Generating Visual Summaries of Topical Clusters in Twitter Streams* • Authors: Miray Kas, Bongwon Suh • Presenter: Sebastian Alfers - HTW Berlin * http://link.springer.com/chapter/10.1007%2F978-3-319-02993-1_9
  2. Visual Summaries of Twitter Streams • Example images: http://flowingdata.com/wp-content/uploads/2010/02/treemap-revised1.gif http://www.infobarrel.com/media/image/54054.jpg
  3. Overview • Step 1: get & pre-process data • Step 2: construct graph & clustering • Step 3: extract keywords & summarize • Pipeline stages, in order: Keywords, Stream Tweets, Preprocessing/Cleaning, Construct Graph, Clustering, Select Relevant Clusters, Extract Topical Keywords, Visual Cluster Summary
  4. Input: Keywords • initial set of keywords • similar to Twitter Search
  6. Step 1: Stream Tweets • HTTP-based API - JSON, REST
  7. Step 1: Stream Tweets • OAuth + HTTP • here: Java library with Scala and Play Framework
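The keyword input and the JSON stream from the slides above can be illustrated with a small filter. This is only a sketch: it assumes the stream arrives as one JSON object per line with a `text` field, and the helper name `matching_tweets` is hypothetical. It does not call the Twitter API and leaves out OAuth entirely.

```python
import json

def matching_tweets(lines, keywords):
    """Yield the text of each tweet that mentions at least one keyword."""
    wanted = [k.lower() for k in keywords]
    for line in lines:
        line = line.strip()
        if not line:  # assumption: blank lines may appear in the stream and are skipped
            continue
        tweet = json.loads(line)
        text = tweet.get("text", "")
        if any(k in text.lower() for k in wanted):
            yield text

# Hypothetical sample data standing in for a live stream.
sample_stream = [
    '{"text": "Reactive programming tools in Scala"}',
    '',
    '{"text": "Lunch in New York"}',
]
topical = list(matching_tweets(sample_stream, ["scala", "reactive"]))
```

Matching on lowercased text keeps the filter case-insensitive, similar to what a search box would do.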
  8. Step 1: Preprocessing • transform tweets - easy-to-analyze / clean format • Process of cleaning: 1. lowercase 2. remove URLs, user mentions and stop words • like @user, "a" or "123" 3. remove special characters (#, .)
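The three cleaning steps listed above could look roughly like this. It is a minimal sketch: the stop-word list is a tiny illustrative set and the regular expressions are assumptions, not the rules used in the talk.

```python
import re

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "in", "on", "of", "to", "is", "with"}

def clean_tweet(text):
    """Return the cleaned tokens of one tweet."""
    text = text.lower()                          # 1. lowercase
    text = re.sub(r"https?://\S+", " ", text)    # 2. remove URLs
    text = re.sub(r"@\w+", " ", text)            # 2. remove user mentions
    text = re.sub(r"[^a-z0-9\s-]", " ", text)    # 3. remove special characters such as # and .
    tokens = [t.strip("-") for t in text.split()]
    # 2. drop stop words and bare numbers like "123"
    return [t for t in tokens if t and t not in STOP_WORDS and not t.isdigit()]

tokens = clean_tweet("Akka-HTTP based Reactive Streams with #Scala http://t.co/abc @user 123.")
```

Hyphens inside a word are kept so that a token like `akka-http` survives as one unit.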
  9. Step 1: Preprocessing • Example keywords: - SCALA - Scala - scala - #scala (all reduced to "scala") • LingPipe library* - remove tense and plurals * http://alias-i.com/lingpipe/
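To show how the keyword variants above collapse to one form, here is a deliberately naive stand-in. It is not LingPipe and does not use its API: the plural rule (strip one trailing "s") is an assumption chosen for illustration, and tense handling is left out.

```python
def normalize(word):
    """Lowercase a word, drop a leading '#', and strip a simple plural 's'."""
    word = word.lower().lstrip("#")
    # Naive plural rule (assumption): "streams" -> "stream", but keep "class".
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

variants = ["SCALA", "Scala", "scala", "#scala"]
normalized = {normalize(v) for v in variants}
```

A real stemmer or lemmatizer handles irregular forms that this rule gets wrong.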
  10. Step 1: Preprocessing • Example tweets - new york time reactive programming tool scala scale techrepublic - akka-http based reactive stream scala scaladay
  12. Step 2: Graph • Word Co-Occurrence Graph - Word = Node (Unigrams) - Tweet = Link between Nodes • Example: akka-http based stream reactive scala scaladay
  14. Step 2: Graph • Word Co-Occurrence Graph - Word = Node (Unigrams) - Tweet = Link between Nodes • Example (node labels): based, akka-http, reactive, stream, scaladay, scala
  15. Step 2: Graph • Word Co-Occurrence Graph - Word = Node (Unigrams) - Tweet = Link between Nodes • Example (node labels): based, akka-http, reactive, stream, scaladay, scala • Figure labels: Nodes, Links
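The rule "word = node, tweet = link between nodes" can be sketched as follows, using the example tweet from the slides. The function name `cooccurrence_edges` is hypothetical, and an undirected edge is represented as an alphabetically sorted pair.

```python
from itertools import combinations

def cooccurrence_edges(tweets):
    """Return the set of undirected edges between words that share a tweet."""
    edges = set()
    for tokens in tweets:
        # Every pair of distinct words in the same tweet becomes one link.
        for a, b in combinations(sorted(set(tokens)), 2):
            edges.add((a, b))
    return edges

tweet = ["akka-http", "based", "stream", "reactive", "scala", "scaladay"]
edges = cooccurrence_edges([tweet])
```

A tweet with six distinct words yields 6 * 5 / 2 = 15 links, so each tweet forms a fully connected subgraph.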
  19. Step 2: Graph • Co-Occurrence Graph - connect nodes (words) within and between tweets - add strength (weight) and cost (distance) • More frequent words - increase the strength - decrease the cost
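Strength and cost can be sketched by counting how often each word pair co-occurs. The formula cost = 1 / strength is an assumption; it is chosen because it reproduces the values 0.5 and 1 shown on the clustering slides, but the talk does not state it. The split of the example text into two tweets is also an assumption.

```python
from collections import Counter
from itertools import combinations

def weighted_graph(tweets):
    """Return (strength, cost) dictionaries keyed by sorted word pairs."""
    strength = Counter()
    for tokens in tweets:
        for pair in combinations(sorted(set(tokens)), 2):
            strength[pair] += 1  # each shared tweet strengthens the link
    # Assumed relation: stronger links are cheaper (shorter distance).
    cost = {pair: 1.0 / weight for pair, weight in strength.items()}
    return strength, cost

tweets = [
    ["new", "york", "time", "reactive", "programming", "tool", "scala", "scale", "techrepublic"],
    ["akka-http", "based", "reactive", "stream", "scala", "scaladay"],
]
strength, cost = weighted_graph(tweets)
```

Here "reactive" and "scala" appear together in both tweets, so their link has strength 2 and cost 0.5, while pairs seen once have cost 1.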
  20. Step 2: Graph • Summary (graph + graph = combined graph): reactive, scala, stream, based, …, uses, programming, …
  21. Step 2: Clustering • Here: "complete link (max) clustering" algorithm* - hierarchical clustering algorithm that forms clusters by merging subgroups • Group words from tweets - frequently appear on topic - cluster = topic * http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html
  22. Step 2: Clustering • Here: "complete link (max) clustering" algorithm • each node starts as an individual cluster - Clusters = Nodes = Words in tweet • close clusters are successively merged together - distance between two clusters = highest cost between their words; the pair with the lowest such distance is merged
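The merging procedure described above can be written as a small, unoptimized agglomerative loop. This is a sketch of textbook complete-link clustering, not the talk's implementation; the cost table and the default cost of 1.0 for unlisted pairs are illustrative assumptions.

```python
def complete_link(items, dist):
    """Merge clusters bottom-up; return a list of (cluster_a, cluster_b, distance)."""
    clusters = [frozenset([x]) for x in items]  # each word starts as its own cluster
    merges = []

    def linkage(c1, c2):
        # Complete link: the distance of two clusters is their farthest pair.
        return max(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical costs: only the reactive/scala link is stronger than the rest.
costs = {("reactive", "scala"): 0.5}

def dist(a, b):
    return costs.get(tuple(sorted((a, b))), 1.0)

merges = complete_link(["reactive", "scala", "stream", "based"], dist)
```

The first merge joins "reactive" and "scala" at distance 0.5; the remaining merges happen at distance 1.0. The recorded merge list is exactly the information a dendrogram draws.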
  23. Step 2: Clustering • Graph Representation and Cluster Representation of reactive, scala, stream, based, … • Figure labels: cost = distance = 0.5; cost = distance = 1; 1; 1
  25. Step 2: Clustering • Figure labels: distance = 0.5
  26. Step 2: Clustering • Figure labels: distance = 1; distance = 0.5; distance = 1
  27. Step 2: Clustering • Figure labels: distance = 1; distance = 0.5; distance = 1; 1; 1
  28. Step 2: Clustering • Figure labels: distance = 1; distance = 0.5; distance = 1; distance = 2; 1; 1
  30. Step 2: Clustering • Final step: Dendrogram - tree diagram - represents the arrangement of hierarchical clusters • why? - easy to apply threshold metrics
  31. Step 2: Clustering • Final step: Dendrogram - closer to the root = lower similarity • Figure labels: root; first cluster: reactive, scala
  32. Step 2: Clustering • Final step: Dendrogram - closer to the root = lower similarity • Figure labels: root; new york programming … akka-http based stream scaladay; reactive scala
  33. Step 2: Clustering • Final step: Dendrogram - closer to the root = lower similarity • Figure labels: root; new york programming … akka-http based stream scaladay; reactive scala; thresholds
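Applying a threshold to the dendrogram amounts to replaying the merges up to a chosen distance. The sketch below assumes the merges are given in order of increasing distance; the merge list and its distances are hypothetical values, not results from the talk.

```python
def cut(items, merges, threshold):
    """Return the clusters that exist once all merges up to `threshold` are applied."""
    cluster_of = {x: frozenset([x]) for x in items}
    for a, b, distance in merges:  # assumed sorted by increasing distance
        if distance > threshold:
            break
        merged = a | b
        for x in merged:
            cluster_of[x] = merged
    return set(cluster_of.values())

items = ["reactive", "scala", "stream", "based"]
# Hypothetical dendrogram: (left cluster, right cluster, merge distance).
merges = [
    (frozenset({"reactive"}), frozenset({"scala"}), 0.5),
    (frozenset({"stream"}), frozenset({"based"}), 1.0),
    (frozenset({"reactive", "scala"}), frozenset({"stream", "based"}), 2.0),
]
tight = cut(items, merges, 0.75)
loose = cut(items, merges, 1.0)
```

A low threshold keeps only the most similar words together; raising it moves the cut toward the root and produces fewer, larger clusters.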
  35. Step 3: Extract topical keywords • Preprocessing/Cleaning • Construct Graph • Extract Topical Keywords
  36. Step 3: Extract topical keywords • keywords - express a topic - frequently used - summarize tweet content • Questions - "What are the relevant keywords?" - "In what clusters do they appear?"
  37. Step 3: Extract topical keywords • How? - "topical tweets" vs. "general tweets" • frequent in topical tweets - search keywords "reactive scala" • not frequent in general tweets - general Twitter stream (all tweets)
  38. Step 3: Extract topical keywords • Strength of a word - is a word relevant for that topical cluster? • Matrix: Topical Tweets (Low Frequency / High Frequency) vs. General Tweets (Low Frequency / High Frequency)
  39. Step 3: Extract topical keywords • Strength of a word - is a word relevant for that topical cluster? • Matrix: Topical Tweets (Low Frequency / High Frequency) vs. General Tweets (Low Frequency / High Frequency) • ✔ relevant for topic / cluster
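One way to turn "frequent in topical tweets, not frequent in general tweets" into a number is a ratio of relative frequencies. This scoring rule, including the add-one smoothing, is an assumption made for illustration; the slides do not give the actual formula.

```python
from collections import Counter

def topical_strength(topical_tokens, general_tokens):
    """Score each topical word by relative frequency in topical vs. general tweets."""
    topical = Counter(topical_tokens)
    general = Counter(general_tokens)
    n_topical = sum(topical.values())
    n_general = sum(general.values())
    vocab = len(set(topical) | set(general))
    scores = {}
    for word, count in topical.items():
        p_topical = count / n_topical
        # Add-one smoothing (assumption) avoids division by zero for unseen words.
        p_general = (general[word] + 1) / (n_general + vocab)
        scores[word] = p_topical / p_general
    return scores

# Hypothetical token lists standing in for the two streams.
scores = topical_strength(
    ["scala", "reactive", "scala", "new"],
    ["new", "new", "day", "new", "scala"],
)
```

Words such as "new" that are common in the general stream score low even if they also show up in topical tweets.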
  40. Step 3: Extract topical keywords • Result - topical strength for each keyword - sort them by relevancy - select top 20 keywords • choose clusters that contain these words
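The selection step above (sort by strength, keep the top keywords, keep the clusters that contain them) might be sketched like this. The scores and clusters are hypothetical inputs, and `k` defaults to the 20 named on the slide.

```python
def select_keywords_and_clusters(scores, clusters, k=20):
    """Return the k strongest keywords and the clusters that contain any of them."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    chosen = [c for c in clusters if any(word in c for word in top)]
    return top, chosen

# Hypothetical topical strengths and word clusters.
scores = {"scala": 2.25, "reactive": 2.0, "new": 0.5, "york": 0.4}
clusters = [{"reactive", "scala"}, {"new", "york"}]
top, chosen = select_keywords_and_clusters(scores, clusters, k=2)
```

With `k=2` only the "reactive"/"scala" cluster survives, because no top keyword falls into the other cluster.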
  41. Final Step • Combine clusters and keywords • create visual summary
  42. Final Step • Keyword1 • Keyword2 • Keyword3 • Keyword4 • … (ordered from high relevancy to low relevancy)
  44. Final Step • Treemap Visualisation - color = cluster - area of word = frequency of word
  45. Final Step • Wordcloud Visualisation - color = cluster - size of word = frequency of word
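The mapping "size = frequency, color = cluster" can be sketched without any plotting library. The linear scaling between a minimum and maximum font size, the palette, and all names here are assumptions for illustration; the same frequency values could equally drive tile areas in a treemap.

```python
def cloud_layout(freq, cluster_of, palette, min_pt=10, max_pt=40):
    """Map each word to (font size in points, color of its cluster)."""
    lo, hi = min(freq.values()), max(freq.values())
    span = (hi - lo) or 1  # avoid division by zero when all frequencies are equal
    layout = {}
    for word, f in freq.items():
        size = min_pt + (max_pt - min_pt) * (f - lo) / span
        color = palette[cluster_of[word] % len(palette)]
        layout[word] = (round(size), color)
    return layout

# Hypothetical frequencies and cluster assignments.
layout = cloud_layout(
    freq={"scala": 10, "reactive": 5, "york": 0},
    cluster_of={"scala": 0, "reactive": 0, "york": 1},
    palette=["blue", "orange"],
)
```

Words from the same cluster share a color, so topical groups stay visible even when the layout scatters them.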
  46. Final Notes • 4. Million Topical Tweets • 15 Days • User Study - Treemap vs. Word Cloud
  47. Thank You! • Discussion - Losing precision while cleaning tweets - Losing sense while removing stop words like "not" (negation) - Unigram vs. Multigram? - ?
