Conor Hayes - Topics, tags and trends in the blogosphere
1. WebCamp 07 Social Networks
Topics, Tags and Trends in the Blogosphere
Conor Hayes
DERI, Galway
2. Outline
• Blogs and the Blogosphere
• Linking Blogs - Content vs. Tags
• Topics and Bloggers
  ◦ User entropy
  ◦ Topic drift
• Blog reactivity
• Identifying consistent, topic-relevant blogs
3. Blogs
• Web site with journal-style entries
  ◦ Dated in reverse chronological order
  ◦ Generally written by a single user
  ◦ Regularly updated
• Distributed publishing
  ◦ Easy to maintain and update
• Exponential growth:
  ◦ Technorati: 50 million blogs (July 2006), doubling in size every 6 months
• Increasingly important indicator of public opinion on politics, technology and current affairs
4. Blogs vs. Usenet
• The blogosphere is user-centred
  ◦ Distributed architecture
  ◦ Topic organisation is locally defined by tags
  ◦ Not easy to find relevant posts related to the same topic
• Usenet is topic-centred
  ◦ Logically centralised architecture
  ◦ Topic organisation is defined a priori by newsgroup heading and subject headers
  ◦ Users know where to go to find information on a particular topic
5. A topic-centred blogosphere?
• Semantic Web
  ◦ Link blogs using machine-readable metadata: SIOC
• Tagging
  ◦ Tags: simple propositional entities, locally defined
• Link analysis:
  ◦ Majority of blogs have few or no inward connections
• Blog Roll, Comment List
  ◦ Relatively static group
• "Conventional" knowledge discovery techniques
  ◦ Clustering + online recommender systems
6. Nearest Neighbour Recommender
• Method:
  1. Periodically, identify a set of nearest neighbour blogs, one set for each topic the user is interested in.
  2. Select matching posts from these neighbours.
• What are the implications of user drift?
  ◦ How quickly does the neighbourhood set change?
• What is the relationship between users and topics?
  ◦ How consistently are bloggers attached to topics?
• Which neighbours consistently provide the most topic-relevant information?
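The two-step method above can be sketched roughly as follows. This is a minimal illustration, not the recommender used in the talk: it assumes blogs, posts and topic profiles are sparse term-weight dictionaries compared by cosine similarity, and the function names, the neighbourhood size `k` and the `threshold` value are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_neighbours(topic_profile, blogs, k=2):
    """Step 1: rank candidate blogs by similarity to one topic profile."""
    ranked = sorted(
        blogs,
        key=lambda name: cosine(topic_profile, blogs[name]),
        reverse=True,
    )
    return ranked[:k]

def matching_posts(topic_profile, neighbours, posts_by_blog, threshold=0.3):
    """Step 2: keep the neighbours' posts that are similar enough to the topic."""
    selected = []
    for blog in neighbours:
        for title, vector in posts_by_blog.get(blog, []):
            if cosine(topic_profile, vector) >= threshold:
                selected.append((blog, title))
    return selected
```

Because step 1 is repeated periodically, the questions on this slide amount to asking how much the output of `nearest_neighbours` changes from one run to the next.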
7. Experiments
• We cluster blog data over different time periods
• User entropy: measures whether bloggers remain together over time
• Topic drift: measures blogger behaviour in relation to topic growth and drift
• We identify the most relevant blogs in each cluster using tag analysis
8. Data
• We collected blog data from 16 January to 27 February 2006
  ◦ 7200 blogs in total
• We created 6 data sets, one for each week
  ◦ Mean of 4250 blogs per week
  ◦ 70% overlap between consecutive weeks
• Each instance in each data set contains the posts from a single tag, from a single blogger
• An instance is included in the data set for a week only if the user has posted in that week
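The weekly data sets described above could be assembled along these lines. This is a sketch under stated assumptions: posts are taken to be `(blogger, tag, date, text)` tuples, an instance is assumed to hold only the posts made in that week, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date

def build_weekly_instances(posts, start, n_weeks=6):
    """Return one data set per week, each mapping (blogger, tag) to post texts."""
    weeks = [defaultdict(list) for _ in range(n_weeks)]
    for blogger, tag, posted_on, text in posts:
        index = (posted_on - start).days // 7
        # Posts outside the collection period are ignored.
        if 0 <= index < n_weeks:
            weeks[index][(blogger, tag)].append(text)
    # An instance exists for a week only if something was posted in that week.
    return [dict(week) for week in weeks]
```

With `start=date(2006, 1, 16)` and `n_weeks=6`, the windows cover the 42-day collection period stated on the slide.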
9. Clustering
• Goals: uncover latent structures reflecting topics in the collection and provide a means of summarisation
• Spherical k-means: partitions the document corpus into k disjoint groups of documents
• Produces an interpretable concept summary for each group
• Clustering quality:
  ◦ Blogs in the same cluster should be similar;
  ◦ Blogs in different clusters should be dissimilar.
• $H_r$: ratio of intra- to inter-cluster similarity
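The slide gives only a verbal definition of the quality ratio. A minimal sketch, assuming intra-cluster similarity is the mean cosine similarity of a cluster's members to their own centroid, and inter-cluster similarity is their mean cosine similarity to the centroid of the whole collection (both definitions are assumptions, and the function names are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cluster_quality(cluster, collection):
    """Ratio of intra- to inter-cluster similarity for one cluster."""
    own = centroid(cluster)
    overall = centroid(collection)
    intra = sum(cosine(d, own) for d in cluster) / len(cluster)
    inter = sum(cosine(d, overall) for d in cluster) / len(cluster)
    return intra / inter
```

Under these assumptions a ratio well above 1 indicates a tight cluster that is clearly separated from the rest of the collection, while a ratio near 1 indicates a cluster that is hard to tell apart from the collection as a whole.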
13. Methodology
• We cluster each data set in date order at different values of k
• We reuse the cluster centroids from window t to seed the clusters in window t+1
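The seeding step can be sketched with a small spherical k-means written from scratch. This is an illustrative implementation, not the code used for the experiments: it assumes dense document vectors present in the same feature space in every window, and the function names and the iteration limit are hypothetical.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(docs, seeds, iterations=20):
    """Cluster unit-length documents, starting from the given seed centroids."""
    docs = [normalise(d) for d in docs]
    centroids = [normalise(s) for s in seeds]
    labels = []
    for _ in range(iterations):
        # Assign each document to the centroid with the highest cosine similarity.
        labels = [
            max(range(len(centroids)), key=lambda j: dot(d, centroids[j]))
            for d in docs
        ]
        updated = []
        for j, old in enumerate(centroids):
            members = [d for d, label in zip(docs, labels) if label == j]
            if members:
                updated.append(normalise([sum(col) for col in zip(*members)]))
            else:
                updated.append(old)  # an empty cluster keeps its previous centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

def cluster_windows(windows, initial_seeds):
    """Cluster windows in date order, seeding window t+1 with the centroids of window t."""
    seeds = initial_seeds
    results = []
    for docs in windows:
        labels, seeds = spherical_kmeans(docs, seeds)
        results.append(labels)
    return results
```

Seeding in this way keeps cluster index r aligned from one window to the next, which is what makes it possible to speak of the "corresponding" cluster in a later window.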
14. User Drift
• We define user entropy $U_r$: a measure of the degree of user dispersion between windows $win_t$ and $win_{t+n}$
  ◦ $q$: number of clusters at $win_{t+n}$ containing users from cluster r
  ◦ $n_r^i$: number of users from cluster r contained in cluster i at $win_{t+n}$
  ◦ $n_r$: number of users from cluster r available at $win_{t+n}$
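The formula itself did not survive on this slide. One plausible reading of the definitions above, assuming the standard Shannon entropy over the proportions n_r^i / n_r with no further normalisation (an assumption, as are the function and argument names), can be sketched as:

```python
import math
from collections import Counter

def user_entropy(members_r, later_assignment):
    """Entropy of the spread of cluster r's users over the clusters of a later window.

    members_r: users belonging to cluster r at the earlier window.
    later_assignment: dict mapping user -> cluster id at the later window.
    """
    # Only users still available at the later window are counted (n_r).
    destinations = Counter(
        later_assignment[user] for user in members_r if user in later_assignment
    )
    n_r = sum(destinations.values())
    if n_r == 0:
        return 0.0
    # One term per destination cluster i, with count n_r^i.
    return -sum((n / n_r) * math.log(n / n_r) for n in destinations.values())
```

Under this reading the value is 0 when all of cluster r's users land in the same later cluster, and it grows as they disperse over more clusters.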
16. Proportion of Users
[Figure legend: one marker denotes the mean fraction of the data set contained in the top 20% of clusters; the other denotes the mean fraction contained in the bottom 20% of clusters.]
17. User Drift vs. Cluster Strength
[Figure: mean Pearson correlation between $H_r$ and $U_r$ (y-axis, -0.9 to 0) plotted against k (x-axis, 0 to 100).]
18. User Drift Conclusions
• Even where n is low, user dispersion occurs
• As n increases, user entropy also increases, suggesting that the "relationship" between users based on shared topics is short-lived
• User dispersion is related to cluster strength
  ◦ Strong clusters experience less user drift than weak clusters
• However, the fraction of data from strong clusters is smaller than the fraction from weak clusters, by at least a factor of 2
• We will return to user entropy later in the talk
19. Topic drift
• Inter-window similarity $W_r^{t+n}$
• $W_r^{t+n}$ for a cluster r at $win_t$ is the similarity between the centroid of cluster r at $win_t$ and the centroid of the corresponding cluster r at $win_{t+n}$:
  $W_r^{t+n} = \cos(C_{r,t}, C_{r,t+n})$
• Intuitively, $W_r^{t+n}$ is a measure of the drift of the concept centroid $C_r$ from $win_t$ to $win_{t+n}$
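The inter-window similarity is a plain cosine between two centroid vectors. A minimal sketch, assuming centroids are dense vectors in a shared feature space and that corresponding clusters share the same list index in both windows (the function names are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def inter_window_similarity(centroids_t, centroids_tn):
    """Return the similarity for every cluster r, pairing centroids by index."""
    return [cosine(c_t, c_tn) for c_t, c_tn in zip(centroids_t, centroids_tn)]
```

A value close to 1 means the concept centroid has barely moved between the two windows; lower values indicate stronger topic drift.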
20. Topic Drift vs. User Drift
[Figure: mean Pearson correlation between $W_r$ and $U_r$ (y-axis, -0.9 to 0) plotted against k (x-axis, 0 to 100).]
21. A Model of Topic Drift
• Topic drift is related to user drift, but topics may be more stable than users
• By observation: the rate of topic change is less than the rate of user drift
• Our analysis so far suggests that users are not firmly fixed to topics; rather, they drift between topics over time.
[Figure labels: Type a = {X, Y, Z}; Type b = {P, Q, R}]
23. Observations
• The blogosphere responds quickly to breaking news stories
• The relationship between topic and user drift is pronounced where topic drift is extreme
• Otherwise, there is steady turnover of users around relatively stable concepts
• Users "float" between topics
25. Tag clouds
• In previous work we showed that tags perform badly at grouping similar blog posts together
26. Tag clouds
• Clustering followed by tag analysis allowed us to determine which clusters contain strong concepts
• It also allowed us to fragment the global tag space to produce local tag clouds
28. A-bloggers are
  1. more similar to each other than c-bloggers
  2. more similar to the cluster centroid (topic definition) than c-bloggers
  3. more similar to pages retrieved from Google using the topic description
29. Entropy: a-blogs vs c-blogs
• A-blog entropy is lower
• As the interval increases, a-blogs experience smaller increases in entropy
• Suggests that a-bloggers tend to write consistently about the same things over time
30. A-blogs Example
• Cluster 28 in Win5; k = 50
• Cluster description: mobile, internet, weblog, web, patent
A-blogs
1) "Comunications: technology, economic and social issues at the intersection of telecom, mobility and the Internet"
2) "IP Blawg": technology and intellectual property blog
3) "Small business IP management blog: Patent, Trademark, Copyright, Internet, and Technology Law"
4) "Open Gardens: Wireless mobility, Digital convergence - Mobile web 2.0"
5) "Mobile Enterprise Weblog: the voice of enterprise mobility management"
C-blogs
1) "Digital Music Den: Digital Music, online music marketing"
2) "icarusindie.com - blog about nothing": general computing and technology
3) "Dunkie's Saga" - personal blog: personal, cars, games, quizzes, some technology
4) "Complex Christ - a vision for church that is organic, networked, decentralized, bottom-up, emergent, communal, flexible, always evolving"
5) "Philips Brooks patent infringement updates": legal blog on general patent issues (pharmaceutical as well as technological)
31. Conclusions
• We have accumulated empirical evidence to suggest that a-bloggers are topic authorities
  ◦ Tend to form tight subgroups close to the cluster topic definition
  ◦ Consistently more similar to pages ranked by Google using the cluster topic definition
  ◦ Tend to stay together across different clusterings over time. In other words, they tend to write regularly about the same topic
• What characteristics does the a-blogger have?
  ◦ A blogger who is aware of a wider potential readership and chooses his/her tags so that they can be understood easily by others
  ◦ Writes regularly and in depth about fairly narrowly defined subjects
  ◦ New professional bloggers
32. Future Work
• Produce tag hierarchies using hierarchical clustering
• Combine this with the work of Hak Lae Kim and Dr. Suk-Hyung Hwang:
• Formal concept model of a blog social network using tags and content analysis
• Tag recommender: enrich SIOC topic descriptions with tag cloud metadata
• Develop a set of style features to classify blogs
33. References
• Hayes, C., Avesani, P. (2007) Using Tags and Clustering to Identify Topic Relevant Blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 07)
• Hayes, C., Avesani, P., Veeramachaneni, S. (2007) An Analysis of the Use of Tags in a Blog Recommender System. In Proceedings of IJCAI-07, the International Joint Conference on Artificial Intelligence
• Hayes, C., Avesani, P., Veeramachaneni, S. (2006) An Analysis of Bloggers and Topics for a Blog Recommender System. In Proceedings of the Workshop on Web Mining (Webmine 06), 17th European Conference on Machine Learning and the 10th European Conference on Principles