Finding bursty topics from microblogs
1. FINDING BURSTY TOPICS FROM MICROBLOGS
Qiming Diao, Jing Jiang, Feida Zhu, Ee-Peng Lim
Living Analytics Research Centre
School of Information Systems
Singapore Management University
2. Abstract
Goal: to find topics that have bursty patterns on microblogs.
Two observations:
1. Posts published around the same time are more likely to have the same topic.
2. Posts published by the same user are more likely to have the same topic.
3. Introduction
Retrospective bursty event detection:
Burst detection: state machine
Topic discovery: LDA
Two assumptions:
1. If a post is about a global event, it is likely to follow a global topic distribution that is time-dependent.
2. If a post is about a personal topic, it is likely to follow a personal topic distribution that is more or less stable over time.
4. Method
Preliminaries:
Each post d_i is associated with a user u_i, a timestamp t_i, and words w_{i,j}.
A bursty topic b is a word distribution coupled with a bursty interval, denoted as (ϕ_b, t_b^s, t_b^e).
Our task: to find meaningful bursty topics from the input text stream.
Our method: a topic discovery step followed by a burst detection step.
5. Our Topic Model
Assume:
1. There are C (latent) topics in the text stream, where each topic c has a word distribution ϕ_c.
2. There is a background word distribution ϕ_B.
3. A single post is most likely to be about a single topic.
4. There is a global topic distribution θ_t for each time point t.
6.
Since our focus is on finding popular global events, we need to separate out “personal” posts. We therefore add a time-independent topic distribution η_u for each user to capture her long-term topical interests.
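Slides 5 and 6 together sketch a generative story: each post picks its single topic either from the global, time-dependent distribution θ_t or from its author's personal distribution η_u, and each word comes either from the chosen topic or from the background distribution ϕ_B. A minimal forward-sampling sketch, assuming a per-user global-vs-personal weight pi_u and a topic-vs-background weight rho (both names are ours; the paper instead infers the latent variables with Gibbs sampling):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_post(user, t, theta, eta, phi, phi_B, pi_u, rho, n_words):
    """Forward-sample one post under the assumptions of slides 5-6.

    theta[t]  : global topic distribution at time point t
    eta[user] : time-independent personal topic distribution of the user
    phi[c]    : word distribution of topic c; phi_B: background distribution
    pi_u      : prob. the post follows the global (not personal) distribution
    rho       : prob. a word is drawn from the topic rather than the background
    """
    is_global = rng.random() < pi_u
    dist = theta[t] if is_global else eta[user]
    z = rng.choice(len(dist), p=dist)            # a single topic per post
    words = []
    for _ in range(n_words):
        source = phi[z] if rng.random() < rho else phi_B
        words.append(rng.choice(len(source), p=source))
    return z, is_global, words
```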
11. Burst Detection
Assume:
A series of counts (m_{c,1}, m_{c,2}, …, m_{c,T}) represents the intensity of topic c at different time points.
These counts are generated by two Poisson distributions, corresponding to a bursty state and a normal state.
12. Burst Detection
We set σ_0 = 0.9 and σ_1 = 0.6 for all topics.
Finally, a burst is marked by a consecutive subsequence of bursty states.
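The slides do not spell out how the state sequence is decoded or how the Poisson rates are set; the sketch below uses Viterbi decoding and treats σ_0 and σ_1 as the self-transition probabilities of the normal and bursty states (one plausible reading). The rates mu0 and mu1 would come from the counts themselves, e.g. the empirical mean and a multiple of it:

```python
import numpy as np
from scipy.stats import poisson

def mark_bursts(m, mu0, mu1, sigma0=0.9, sigma1=0.6):
    """Decode bursty/normal states for one topic's count series m_{c,1..T}.

    m        : per-time-point counts of the topic
    mu0, mu1 : Poisson rates of the normal and bursty states (mu1 > mu0)
    Returns a boolean array; each run of consecutive True values is a burst.
    """
    m = np.asarray(m)
    T = len(m)
    log_em = np.stack([poisson.logpmf(m, mu0),   # emissions: normal state
                       poisson.logpmf(m, mu1)])  # emissions: bursty state
    log_tr = np.log([[sigma0, 1 - sigma0],       # transitions from normal
                     [1 - sigma1, sigma1]])      # transitions from bursty
    score = log_em[:, 0].copy()                  # uniform prior over start state
    back = np.zeros((2, T), dtype=int)
    for t in range(1, T):
        new = np.empty(2)
        for s in (0, 1):                         # best predecessor of state s
            cand = score + log_tr[:, s]
            back[s, t] = cand.argmax()
            new[s] = cand.max() + log_em[s, t]
        score = new
    states = np.empty(T, dtype=int)              # backtrack the best path
    states[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[states[t], t]
    return states == 1
```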
13. Experiments
Data Set:
We sampled 2,892 users from this dataset and extracted their tweets between September 1 and November 30, 2011 (91 days in total). The final dataset contains 3,967,927 tweets and 24,280,638 tokens.
14.
Ground Truth Generation:
We took the top-30 bursty topics from each model and asked two human judges to judge their quality by assigning a score of either 0 or 1.
Evaluation:
We set the number of topics C to 80, α to 50/C, and β to 0.01. Each model was run for 500 iterations of Gibbs sampling.
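These are the standard LDA hyperparameters. As a point of reference, the same settings plugged into an off-the-shelf collapsed-Gibbs LDA (a sketch using the third-party `lda` package; the models compared in the paper are custom extensions of LDA, so this only illustrates the shared configuration):

```python
import lda   # pip install lda -- plain LDA via collapsed Gibbs sampling

C = 80                           # number of topics
model = lda.LDA(n_topics=C,
                n_iter=500,      # 500 Gibbs sampling iterations
                alpha=50.0 / C,  # symmetric Dirichlet prior on doc-topic dists
                eta=0.01,        # beta: symmetric Dirichlet prior on topics
                random_state=0)
# X is a (n_docs x vocab_size) document-term count matrix:
# model.fit(X)
```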
18. Two case studies to demonstrate the effectiveness of our model
Effectiveness of Temporal Models: both TimeLDA and TimeUserLDA tend to group posts published on the same day into the same topic.
19. Two case studies to demonstrate the effectiveness of our model
Effectiveness of User Models: it is important to filter out users’ “personal” posts in order to find meaningful global events.
20. Conclusions
A new topic model that considers both the temporal information of microblog posts and users’ personal interests.
A Poisson-based state machine to identify bursty periods from the topics discovered by our model.
22. ABSTRACT (TM-LDA)
TM-LDA learns the transition parameters among topics by minimizing the prediction error on topic distributions in subsequent postings.
We develop an efficient updating algorithm to adjust the transition parameters as new documents stream in.
23.
Challenges:
1. To model and analyze latent topics in social textual data;
2. To adaptively update the models as massive social content streams in;
3. To facilitate temporal-aware applications of social media.
24. Contributions
First, we propose a novel temporally-aware topic language model, TM-LDA, which captures the latent topic transitions in temporally-sequenced documents.
Second, we design an efficient algorithm to update TM-LDA, which enables it to be performed on large-scale data.
Finally, we evaluate TM-LDA against the static topic modeling method (LDA).
25. METHODOLOGY
TM-LDA Algorithm:
If we define the space of topic distributions as X = {x ∈ ℝ₊ⁿ : ‖x‖₁ = 1}, TM-LDA can be considered as a function f : X → X, mapping the topic distribution of one document to the predicted topic distribution of the next. The transition parameters are learned by minimizing the prediction error of this mapping, and TM-LDA is modeled as a non-linear mapping (see the sketch below).
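The slide leaves the mapping implicit. Assuming, as in the TM-LDA paper, that the prediction is the product of the current topic distribution with the transition matrix T, re-normalized back onto the simplex X (the normalization is what makes f non-linear), a sketch:

```python
import numpy as np

def predict_next(x, T):
    """TM-LDA mapping f : X -> X (sketch).

    x : topic distribution of the current document (x >= 0, sum(x) == 1)
    T : learned topic-transition matrix, shape (n_topics, n_topics)
    """
    y = x @ T
    return y / y.sum()   # L1-normalize so the output stays in X
```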
33. Updating Transition Parameters with QR-factorization
Suppose the QR-factorization of matrix A is A = QR, where QᵀQ = I and R is an upper triangular matrix. The transition matrix T then satisfies RT = QᵀB.
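In this setup, each row of A is the topic distribution of a tweet and the corresponding row of B is the topic distribution of the tweet that follows it (our reading of the paper), and T solves the least-squares problem min_T ‖AT − B‖_F. A sketch of the batch solve via the slide's factorization:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def fit_transition(A, B):
    """Solve min_T ||A T - B||_F using the QR route from the slide.

    A : (n_pairs x n_topics) topic distributions of earlier tweets
    B : (n_pairs x n_topics) topic distributions of the following tweets
    """
    Q, R = qr(A, mode='economic')        # A = QR, Q'Q = I, R upper triangular
    return solve_triangular(R, Q.T @ B)  # back-substitute R T = Q'B
```

The efficiency claim in the abstract comes from updating the Q and R factors incrementally as new document pairs stream in, rather than refactorizing A from scratch; the sketch above shows only the batch solve.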
35. Predicting Future Tweets
TM-LDA first trains LDA on 7 days of historical tweets and computes the transition parameter matrix accordingly. Then, for each new tweet generated on the 8th day, it predicts the topic distribution of the following tweet.
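End to end, the experiment on this slide might look like the following sketch, reusing fit_transition and predict_next from above (A_train, B_train, and infer_topics are illustrative placeholders for the LDA outputs):

```python
# A_train, B_train: (n_pairs x n_topics) topic distributions of consecutive
# tweet pairs, inferred by LDA on the 7-day training window
T = fit_transition(A_train, B_train)

# for each new tweet on the 8th day, predict the following tweet's topics
x_new = infer_topics(new_tweet)   # hypothetical LDA inference step
x_next = predict_next(x_new, T)
```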
36.
Three topic distributions are compared:
Estimated Topic Distributions of “Future” Tweets: the topic distribution of tweet b as predicted by TM-LDA.
LDA Topic Distributions of “Future” Tweets: the inferred topic distribution of tweet b.
LDA Topic Distributions of “Previous” Tweets: the inferred topic distribution of tweet a.
38. Properties of Transition Parameters
T is a square matrix whose size is determined by the number of topics trained in LDA.
Each row of T sums to 1, which means that the overall weight emitted from a topic is 1.
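A quick way to check, or enforce, this row-stochasticity on a learned T (illustrative; whether the paper clips negative entries of the least-squares solution this way is our assumption):

```python
import numpy as np

def row_normalize(T):
    """Make every row of T a distribution over next topics."""
    T = np.maximum(T, 0.0)                  # assumption: clip negative entries
    return T / T.sum(axis=1, keepdims=True)

# sanity check on the property stated above:
# assert np.allclose(row_normalize(T).sum(axis=1), 1.0)
```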
43. CONCLUSIONS
A novel temporally-aware language model, TM-LDA, for efficiently modeling streams of social text, such as the Twitter stream of an author.
An efficient model-updating algorithm for TM-LDA.