News Article Ranking : Leveraging the Wisdom of Bloggers

1. News Article Ranking: Leveraging the Wisdom of Bloggers. Richard McCreadie, Craig Macdonald & Iadh Ounis

3. Thelwall explored how bloggers reacted to the London bombings

4. 30% of bloggers blog on news-related topics (Technorati poll 2008)

5. Hence, the blogosphere is valuable as a source of news-related information

6. König et al. & Sayyadi et al. have exploited the blogosphere for event detection

[Figure: number of blog posts per day (November 2008), with a peak labelled "Obama Victory"]

References: M. Thelwall, WWW’06; König et al., SIGIR’09; Sayyadi et al., ICWSM’09

8. Every day newspaper editors select articles for placement within their newspapers.

9. This can be seen as a ranking problem.

10. Rank articles by readership interest

[Figure: a newspaper editor assigning articles to the Front Page, Page 2, ...]

We investigate how such a ranking can be approximated using evidence from the blogosphere

12. The News Article Ranking Problem

13. The Votes Approach

14. Evaluating Votes

15. Temporal Promotion

16. News Article Representation

17. Conclusions

[Slide: Talk Outline]

19. Given a day of interest dQ, we wish to score each news article a by its predicted importance, score(a,dQ), using evidence from the blogosphere.

[Figure: a News Article Ranker takes day dQ and outputs importance scores for the articles, e.g. 29, 23, 14, 13, 4, 4]

21. Score by blog post volume. Approach: two stages.

Stage 1: Score each news article a for all days d based on related blog post volume for day d. News articles are represented by their headlines.

Stage 2: Given a query day dQ, rank the articles A based on the score for each news article on day dQ, i.e. score(a, dQ). This is a voting process.

[Slide: The Votes Approach]

22. Votes Approach: Stage 1. Score days for each news story. For each news article a:

1) Use its representation (headline) as a query

2) Select the top 1000 blog posts for a

3) Each post votes for a day

4) Rank days by votes received

Retrieval uses Terrier. Votes voting model: Count (Craig Macdonald, PhD thesis, 2009)

[Figure: a blog post ranking is converted into a ranking of days; example vote counts per day: 2, 1, 2, 2 and 0, 1, 2, 0]

23. Votes Approach: Stage 2. Rank news articles for day dQ.

[Figure: the per-day vote counts produced by Stage 1 for each news article a are looked up for the query day, and the articles are ordered by those votes to give the Ranking of Articles]
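The two stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the retrieval step (done with Terrier and DPH in the talk) is assumed to have already produced a ranked list of post ids per article, and the names `score_days`, `rank_articles`, `post_day` and the toy data are hypothetical, not taken from the slides.

```python
from collections import Counter

def score_days(ranked_posts, post_day, top_k=1000):
    """Stage 1: each of the top_k retrieved posts casts one vote for its day."""
    votes = Counter()
    for post_id in ranked_posts[:top_k]:
        votes[post_day[post_id]] += 1
    return votes

def rank_articles(day_votes_by_article, query_day):
    """Stage 2: order articles by the votes they received on query_day."""
    scores = {a: v.get(query_day, 0) for a, v in day_votes_by_article.items()}
    # Sort by descending votes; ties are broken by article id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy data: which day each blog post was published on.
post_day = {"p1": 1, "p2": 2, "p3": 2, "p4": 4, "p5": 2, "p6": 3}

# Hypothetical retrieval output: ranked post ids per article headline query.
retrieved = {
    "article_a": ["p1", "p2", "p3"],
    "article_b": ["p5", "p4"],
    "article_c": ["p6"],
}

day_votes = {a: score_days(posts, post_day) for a, posts in retrieved.items()}
ranking = rank_articles(day_votes, query_day=2)
print(ranking)  # [('article_a', 2), ('article_b', 1), ('article_c', 0)]
```

Counting one vote per post corresponds to the Count voting model named on the Stage 1 slide; other voting models would replace the `+= 1` with a score-dependent increment.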

33. Rank news articles by predicted importance

34. Evidence mined from Blogs08

35. 100k Articles provided by the New York Times

37. Inlinks (hyperlink evidence vs Votes textual evidence)

39. 100k news headlines from the New York Times to represent articles

40. E.g. ‘In a Decisive Victory, Obama Reshapes the Electoral Map’

41. Uses blog posts from the Blogs08 blog post corpus (28 million posts)

42. Judgments for 50 days of interest (dQ’s)

45. Top stories identification task

46. Blogs08 blog post corpus

48. Criteria:

49. Timing : Favour stories that cover ‘live’ events

50. Significance : Favour stories that affect many people

51. Proximity : Favour stories that are local to the reader (USA)

54. Secondary index holds blog post -> day relations

55. Retrieve 1000 blog posts for each headline.

56. DPH (DFR)

58. Inlinks : hyperlink evidence

59. TREC 2009 best systems

[Slide: Experimental Setup]

60. Votes Performance. Results:

Better performance than the TREC 2009 best systems

BM25 < DPH (DFR)

Hyperlink evidence is of less value than textual evidence

[Figure: the Votes approach (Votes + extras) compared with the TREC 2009 best systems]

62. Can be effectively leveraged to rank news articles by their importance

63. However, there is still room for improvement (0.17 MAP)

[Slide: Votes Performance]

How can we improve Votes performance?

73. Two Techniques

74. NDayBoost

75. GaussBoost

[Slide: Temporal Promotion]

80. ∆d is the distance (in days) from dQ

81. w is the width of the Gaussian curve

83. Weights the score for each day downward, depending on w.

ScoreGaussBoost(A,4) = (1*4)+(0.79*4)+(0.18*3) = 7.700

ScoreGaussBoost(B,4) = (1*4)+(0.79*1)+(0.18*1) = 4.970

[Figure: number of votes per day around dQ, with n = -2; labels: Score = 7.700, Score = 11, Score = 4.970, Score = 6]
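The worked GaussBoost scores above are weighted sums of per-day vote counts, which the following sketch reproduces. The weights 1, 0.79 and 0.18 are copied from the slide rather than computed from a Gaussian formula (the transcript does not give one), and the assignment of votes to offsets 0, -1 and -2 days from dQ is an assumption based on n = -2. The function and variable names are hypothetical.

```python
def boosted_score(votes_by_offset, weights):
    """Weighted sum of vote counts, one weight per day offset from dQ."""
    return sum(weights[offset] * votes for offset, votes in votes_by_offset.items())

# Weights as printed on the slide (offset 0 is the query day dQ itself).
slide_weights = {0: 1.0, -1: 0.79, -2: 0.18}

# Vote counts for articles A and B, assumed to be ordered dQ, dQ-1, dQ-2.
votes_a = {0: 4, -1: 4, -2: 3}
votes_b = {0: 4, -1: 1, -2: 1}

gauss_a = boosted_score(votes_a, slide_weights)
gauss_b = boosted_score(votes_b, slide_weights)
print(f"{gauss_a:.3f} {gauss_b:.3f}")  # 7.700 4.970

# With every weight set to 1 the same window gives plain vote totals,
# which match the "Score = 11" and "Score = 6" labels in the figure.
flat_weights = {0: 1.0, -1: 1.0, -2: 1.0}
flat_a = boosted_score(votes_a, flat_weights)
flat_b = boosted_score(votes_b, flat_weights)
print(flat_a, flat_b)  # 11.0 6.0
```

Both articles have 4 votes on dQ itself, so the ordering of A over B comes entirely from the neighbouring days; the weighting only controls how much those days contribute.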

85. Does the quality of evidence decrease as distance from dQ increases?

86. Is historical or future (before or after dQ) blog post evidence more useful?

[Slide: Research Questions]

88. The parameter w determines the width of the Gaussian curve and, as such, the weight applied to each day at distance ∆d from dQ.

(n = -2, w = 0.5)

ScoreGaussBoost(A,4) = (1*4)+(0.38*4)+(0.01*3) = 5.550

ScoreGaussBoost(B,4) = (1*4)+(0.38*1)+(0.01*1) = 4.390

(n = -2, w = 1)

ScoreGaussBoost(A,4) = (1*4)+(0.79*4)+(0.18*3) = 7.700

ScoreGaussBoost(B,4) = (1*4)+(0.79*1)+(0.18*1) = 4.970

[Slide: Temporal Promotion]

89. NDayBoost Performance. Future blog postings do provide useful evidence. Historical evidence is not useful for NDayBoost.

[Figure: MAP against n value (days), with the DPH+Votes baseline]

90. GaussBoost Performance. Future blog postings provide stronger evidence than historical postings. Historical blog postings are useful for days close to dQ.

[Figure: MAP against w value (not days!), with the DPH+Votes baseline]

92. Both historical and future evidence are useful for improving Votes ranking performance

93. Can use this evidence to generate a better ranking for editors if the data is available

94. Future evidence is more powerful than historical evidence

95. Not too useful if we want to rank in real-time though

96. NDayBoost is only effective for future evidence

97. GaussBoost is effective for both future and historical evidence

98. The most effective of the techniques

99. Does not over-emphasise evidence from days distant from dQ

[Slide: Temporal Promotion]

106. Conclusions

[Slide: Talk Outline]

Can we improve upon the news article representation?

108. e.g. ‘In a Decisive Victory, Obama Reshapes the Electoral Map’

109. Headlines are a sparse representation of an article

110. Many headlines are not ‘news-worthy’

111. Editors don’t even consider these

113. Prune headlines less likely to be news-worthy

[Slide: Improving the Article Representation]

115. Add related terms (counter sparsity). Approach: using DPH (DFR), retrieve the top 3 documents from either Blogs08 (query expansion; K. L. Kwok and M. S. Chan, SIGIR 1998) or Wikipedia (collection enrichment; F. Diaz and D. Metzler, SIGIR 2006). Expand the query with the top 10 terms identified using Bo1 (G. Amati, Thesis 2003) from those documents.

[Figure: news article a is run through Terrier (DPH, then Bo1) against Blogs08 or Wikipedia to obtain the top terms; Blogs08 gives query expansion, Wikipedia gives external query expansion (collection enrichment)]
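The expansion step can be sketched as follows, as an illustration only. It assumes the commonly cited form of the Bo1 term weight, w(t) = tf_x · log2((1 + P_n) / P_n) + log2(1 + P_n) with P_n = F / N, where tf_x is the term's frequency in the feedback documents, F its frequency in the whole collection and N the number of documents in the collection. The function names, the toy statistics and the use of 3 expansion terms instead of 10 are hypothetical; a real system such as Terrier also reweights the original query terms, which is omitted here.

```python
import math
from collections import Counter

def bo1_weights(feedback_docs, collection_tf, num_docs):
    """Assumed Bo1 weight for every term seen in the feedback documents."""
    tf_x = Counter(term for doc in feedback_docs for term in doc)
    weights = {}
    for term, freq in tf_x.items():
        p_n = collection_tf[term] / num_docs
        weights[term] = freq * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
    return weights

def expand_query(query_terms, feedback_docs, collection_tf, num_docs, top_terms=10):
    """Append the highest-weighted feedback terms not already in the query."""
    weights = bo1_weights(feedback_docs, collection_tf, num_docs)
    best = sorted(weights, key=lambda t: (-weights[t], t))[:top_terms]
    return list(query_terms) + [t for t in best if t not in query_terms]

# Hypothetical tokenised headline query and top-3 feedback documents.
query = ["obama", "victory"]
feedback_docs = [
    ["obama", "election", "victory", "electoral"],
    ["obama", "election", "president"],
    ["election", "map", "the"],
]
# Hypothetical collection statistics (term frequency in the collection).
collection_tf = {"obama": 50, "election": 40, "victory": 30, "electoral": 5,
                 "president": 60, "map": 20, "the": 5000}
num_docs = 1000

weights = bo1_weights(feedback_docs, collection_tf, num_docs)
expanded = expand_query(query, feedback_docs, collection_tf, num_docs, top_terms=3)
print(expanded)
```

Terms that are frequent in the feedback documents but rare in the collection receive the highest weights, which is why a very common word such as "the" ranks last in this toy example.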

116. Related but generic terms Case specific terms

118. Collection enrichment helps find the blog posts that are related.

[Slide: Article Improvement Performance]

Collection enrichment with Wikipedia significantly increases performance

[Figure: MAP]

120. Try simulating this within the system

123. Corrections for the Record

124. Comments of the Week

125. Inside the Times

126. Best Sellers

127. The Week Ahead

128. Movie Review

129. Arts Briefly

130. The Listings

131. Dance Review

132. Whats on Today

133. Critics Choice

134. Book of the Times

135. Music Review

E.g. ‘Inside the Times, November 6, 2008’

E.g. ‘N.F.L. ROUNDUP; Giants Shut Down Tyree for Season; Raiders Cut Hall’
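One way to apply the list above is as a simple headline filter. The sketch below assumes that pruning means dropping any headline that starts with one of the listed recurring column titles, compared case-insensitively; the slides do not say how the matching is done, so the prefix rule and the names `PRUNE_PREFIXES` and `is_prunable` are hypothetical. The ‘N.F.L. ROUNDUP’ example is not covered by this list and would need a separate rule.

```python
# Recurring column titles taken from the list above.
PRUNE_PREFIXES = (
    "Corrections for the Record",
    "Comments of the Week",
    "Inside the Times",
    "Best Sellers",
    "The Week Ahead",
    "Movie Review",
    "Arts Briefly",
    "The Listings",
    "Dance Review",
    "Whats on Today",
    "Critics Choice",
    "Book of the Times",
    "Music Review",
)

# Case-folded copies so the comparison ignores capitalisation.
_FOLDED_PREFIXES = tuple(p.casefold() for p in PRUNE_PREFIXES)

def is_prunable(headline):
    """True if the headline starts with one of the recurring column titles."""
    return headline.strip().casefold().startswith(_FOLDED_PREFIXES)

headlines = [
    "Inside the Times, November 6, 2008",
    "In a Decisive Victory, Obama Reshapes the Electoral Map",
]
kept = [h for h in headlines if not is_prunable(h)]
print(kept)  # ['In a Decisive Victory, Obama Reshapes the Electoral Map']
```

Pruning happens before any retrieval, so pruned headlines never reach the voting stage.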

138. Temporal promotion (GaussBoost)

139. Headline pruning (All Heuristics)

141. DPH+Votes

143. Can be used to automatically rank news stories for a newspaper editor

145. More useful to look at tomorrow's blog posts than yesterday's blog posts

147. i.e. they can disregard whole classes of articles as not being news-worthy

148. By pruning away such articles a priori, ranking performance is improved

149. Headlines are sparse representations of news articles

150. Enrichment with terms from Wikipedia can help find more representative blog posts

[Slide: Conclusions]

152. Focus on real-time ranking of news (no future evidence)

153. Use a larger news article collection from Reuters

[Slide: Future Work]

Questions?
