Comments-Oriented Blog Summarization by Sentence Extraction
1. Comments-Oriented Blog Summarization by Sentence Extraction. Authors: Meishan Hu, Aixin Sun, Ee-Peng Lim. Publication: CIKM’07. Presenter: Jhih-Ming Chen.
2. Outline: Introduction; Problem Definition; ReQuT Model (Reader-, Quotation- and Topic- Measures; Word Representativeness Score; Sentence Selection); User Study and Experiments (User Study; Experimental Results); Conclusion.
3. Introduction: Readers treat the comments associated with a post as an inherent part of the post, yet existing research largely ignores comments and focuses on blog posts only. This paper reports a user study in which human summarizers labeled representative sentences in blog posts, to find out whether reading the comments changes a reader’s understanding of the post.
4. Introduction: This paper focuses on the problem of comments-oriented blog post summarization: summarizing a blog post by extracting representative sentences from the post, using information hidden in its comments.
5. Introduction: Sentence Detection splits the blog post content into sentences. Word Representativeness Measure weighs the words appearing in comments. Sentence Selection computes a representativeness score for each sentence based on the representativeness of the words it contains.
6. Problem Definition: Given a blog post P = {s1, s2, ..., sn}, where each sentence si = {w1, w2, ..., wm} is a set of words, and the set of comments C = {c1, c2, ..., ck} associated with P, the task of comments-oriented blog summarization is to extract a subset of sentences from P, denoted by Sr (Sr ⊂ P), that best represents the discussion in C.
7. Problem Definition: One straightforward approach is to compute a representativeness score for each sentence si, denoted by Rep(si), and select the sentences whose scores are above a given threshold. Intuitively, word representativeness can be measured by counting the occurrences of a word in comments. Binary: Rep(wk) = 1 if wk appears in at least one comment, and Rep(wk) = 0 otherwise. Comment Frequency (CF): Rep(wk) is the number of comments containing word wk. Term Frequency (TF): Rep(wk) is the number of occurrences of wk in all comments associated with a blog post.
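The three baseline measures defined on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming comments are already tokenized into lists of lowercase words; the function name `baseline_word_scores` and the sample comments are hypothetical.

```python
from collections import Counter

def baseline_word_scores(comments):
    """Return (binary, cf, tf) dictionaries for tokenized comments.

    comments: list of token lists, one list per comment (assumed input shape).
    """
    tf = Counter()  # total occurrences of each word across all comments
    cf = Counter()  # number of comments that contain each word
    for tokens in comments:
        tf.update(tokens)
        cf.update(set(tokens))  # count each word at most once per comment
    binary = {word: 1 for word in cf}  # absent words are implicitly 0
    return binary, dict(cf), dict(tf)

# Hypothetical comments, for illustration only.
comments = [["dark", "matter", "dark"], ["matter", "halo"]]
binary, cf, tf = baseline_word_scores(comments)
```

Words that never occur in a comment are simply missing from the dictionaries, so a lookup such as `binary.get(word, 0)` yields the Rep(wk) = 0 case.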
8. Problem Definition: Binary captures minimum information; CF and TF capture slightly more. Other potentially very useful information available in comments is ignored, e.g., the authors of comments and quotations among comments. All three measures suffer from spam comments.
9. ReQuT Model: A comment, other than its content, is often associated with an author, a time-stamp, and even a permalink. Three observations provide guidelines for measuring word representativeness. (1) A reader often mentions another reader’s name to indicate that the current comment is a reply to previous comment(s). (2) A comment may contain quoted sentences from one or more comments. (3) Discussion in comments often branches into several topics.
10. Reader-, Quotation- and Topic- Measures: Based on the three observations, we believe that a word is representative if it is written by authoritative readers, appears in widely quoted comments, and represents hotly discussed topics.
11. Reader-, Quotation- and Topic- Measures: With Observation 1, given the full set of comments to a blog, we construct a directed reader graph GR = (VR, ER). Each node ra ∈ VR is a reader. An edge eR(rb, ra) ∈ ER exists if rb mentions ra in one of rb’s comments. The edge weight WR(rb, ra) is the ratio of the number of times rb mentions ra to the number of times rb mentions any reader (including ra).
12. Reader-, Quotation- and Topic- Measures: With Observation 1, we compute reader authority, where |R| denotes the total number of readers of the blog and d is the damping factor (usually set to 0.85). The reader measure of a word wk, denoted by RM(wk), is defined in terms of tf(wk, ci), the term frequency of word wk in comment ci, and ci ← ra, which means that ci is authored by reader ra.
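The slide's own equations are not preserved in this transcript, so the following is only a sketch of how reader authority and RM(wk) might be computed. It assumes the standard PageRank recurrence A(ra) = (1 − d)/|R| + d · Σ WR(rb, ra) · A(rb), and it assumes RM(wk) is the sum of tf(wk, ci) · A(ra) over comments ci authored by ra; both forms are assumptions, as are the function names and sample data.

```python
from collections import defaultdict

def reader_authority(readers, mentions, d=0.85, iters=50):
    """PageRank-style authority over the reader graph (assumed recurrence).

    readers:  list of reader names (the node set VR).
    mentions: dict mapping (rb, ra) to how often reader rb mentions reader ra.
    """
    out_total = defaultdict(int)
    for (rb, _ra), count in mentions.items():
        out_total[rb] += count
    # WR(rb, ra): share of rb's mentions that are directed at ra.
    weight = {(rb, ra): count / out_total[rb]
              for (rb, ra), count in mentions.items()}
    n = len(readers)
    auth = {r: 1.0 / n for r in readers}
    for _ in range(iters):
        new = {r: (1 - d) / n for r in readers}
        for (rb, ra), w in weight.items():
            new[ra] += d * w * auth[rb]
        auth = new
    return auth

def reader_measure(word, comments, auth):
    """Assumed RM(wk): sum of tf(wk, ci) * A(ra) over comments ci authored by ra.

    comments: list of (author, tokens) pairs.
    """
    return sum(tokens.count(word) * auth[author] for author, tokens in comments)

# Hypothetical data, for illustration only.
readers = ["ann", "bob", "cat"]
mentions = {("bob", "ann"): 2, ("cat", "ann"): 1, ("cat", "bob"): 1}
auth = reader_authority(readers, mentions)
rm_dark = reader_measure("dark", [("ann", ["dark", "matter"]), ("cat", ["dark"])], auth)
```

In this toy graph "ann" is mentioned by both other readers and ends up with the highest authority, so words in her comments receive a larger reader measure.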
13. Reader-, Quotation- and Topic- Measures: With Observation 2, for the set of comments associated with each blog post, we construct a directed acyclic quotation graph GQ = (VQ, EQ). Each node ci ∈ VQ is a comment. An edge eQ(cj, ci) ∈ EQ indicates that cj quoted sentences from ci. The edge weight WQ(cj, ci) is 1 over the number of comments that cj ever quoted.
14. Reader-, Quotation- and Topic- Measures: With Observation 2, we derive the quotation degree D(ci) of a comment ci, where |C| is the number of comments associated with the given post. A comment that is not quoted by any other comment receives a quotation degree of 1/|C|. The quotation measure of a word wk, denoted by QM(wk), is defined over the comments containing it, where wk ∈ ci means that word wk appears in comment ci.
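Since the equations for D(ci) and QM(wk) are missing from this transcript, the sketch below assumes D(ci) = 1/|C| + Σ WQ(cj, ci) · D(cj) over the comments cj that quote ci, which is consistent with an unquoted comment receiving 1/|C|. It also assumes QM(wk) is the sum of D(ci) over comments containing wk. The function names and sample data are hypothetical.

```python
from functools import lru_cache

def quotation_degrees(num_comments, quotes):
    """Quotation degree of every comment in a quotation DAG (assumed recurrence).

    quotes: dict mapping comment index j to the set of comment indices it quotes.
    """
    quoted_by = {i: [] for i in range(num_comments)}
    for j, targets in quotes.items():
        for i in targets:
            quoted_by[i].append(j)

    @lru_cache(maxsize=None)
    def degree(i):
        # WQ(cj, ci) = 1 / (number of comments that cj quotes)
        return 1.0 / num_comments + sum(
            degree(j) / len(quotes[j]) for j in quoted_by[i])

    return [degree(i) for i in range(num_comments)]

def quotation_measure(word, comments, degrees):
    """Assumed QM(wk): sum of D(ci) over comments ci that contain the word."""
    return sum(d for tokens, d in zip(comments, degrees) if word in tokens)

# Hypothetical thread: comment 1 quotes comment 0; comment 2 quotes 0 and 1.
degrees = quotation_degrees(3, {1: {0}, 2: {0, 1}})
```

The recursion terminates because the quotation graph is acyclic: a comment can only be quoted by comments posted after it.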
15. Reader-, Quotation- and Topic- Measures: With Observation 3, given the set of comments associated with each blog post, we group these comments into topic clusters using a Single-Pass Incremental Clustering algorithm. The similarity threshold for clustering comments was empirically set to 0.4. A hotly discussed topic has a large number of comments, all close to the topic cluster centroid.
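A generic single-pass incremental clustering loop, using the 0.4 threshold from this slide, might look like the sketch below. It assumes raw term-count vectors and a centroid kept as the sum of member vectors (cosine similarity is unaffected by that scaling); the slide does not state the term weighting actually used, and the function names and sample comments are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_pass_cluster(comments, threshold=0.4):
    """Assign each comment, in order, to the most similar cluster or a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for idx, tokens in enumerate(comments):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # fold the comment into the centroid
            best["members"].append(idx)
        else:
            clusters.append({"centroid": Counter(vec), "members": [idx]})
    return clusters

# Hypothetical comments, for illustration only.
comments = [["dark", "matter", "halo"], ["dark", "matter", "mass"], ["css", "layout", "bug"]]
clusters = single_pass_cluster(comments)
```

Each comment is visited exactly once, so the result depends on comment order; that is the usual trade-off of single-pass clustering.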
16. Reader-, Quotation- and Topic- Measures: With Observation 3, we compute the importance of a topic cluster, where |ci| is the length of comment ci in number of words, C is the set of comments, and sim(ci, tu) is the cosine similarity between comment ci and the centroid of topic cluster tu. The topic measure of a word wk is denoted by TM(wk); ci ∈ tu denotes that comment ci is clustered into topic cluster tu.
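The equations for topic importance and TM(wk) are not preserved here. One form that matches the symbols listed on this slide is T(tu) = Σ_{ci ∈ tu} |ci| · sim(ci, tu) / Σ_{cj ∈ C} |cj|, with TM(wk) summing T(tu) over the comments ci ∈ tu that contain wk. The sketch below implements that assumed form; the function names, similarities and sample data are hypothetical.

```python
def topic_importance(cluster_members, lengths, sims):
    """Assumed T(tu): length-weighted similarity mass of each cluster.

    cluster_members: list of lists of comment indices, one list per cluster.
    lengths[i]: |ci|, the length of comment i in words.
    sims[i]: cosine similarity of comment i to its own cluster centroid.
    """
    total_words = sum(lengths)
    return [sum(lengths[i] * sims[i] for i in members) / total_words
            for members in cluster_members]

def topic_measure(word, comments, cluster_members, importance):
    """Assumed TM(wk): add T(tu) once for every comment in tu containing the word."""
    score = 0.0
    for u, members in enumerate(cluster_members):
        for i in members:
            if word in comments[i]:
                score += importance[u]
    return score

# Hypothetical clustering result, for illustration only.
comments = [["dark", "matter"], ["dark", "halo"], ["css"]]
cluster_members = [[0, 1], [2]]
importance = topic_importance(cluster_members, lengths=[2, 2, 1], sims=[0.9, 0.8, 1.0])
tm_dark = topic_measure("dark", comments, cluster_members, importance)
```

Under this form a cluster scores highly when it holds many long comments that sit close to its centroid, which matches the slide's description of a hotly discussed topic.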
17. Word Representativeness Score: Rep(wk) is the combination of the reader, quotation and topic measures in the ReQuT model: Rep(wk) = α ∙ RM(wk) + β ∙ QM(wk) + γ ∙ TM(wk), where 0 ≤ α, β, γ ≤ 1.0 and α + β + γ = 1.0.
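The linear combination on this slide translates directly into code. The sketch below checks the stated constraints on α, β and γ and defaults to equal weights of 1/3; the function name `rep_score` is hypothetical, and whether the three measures are normalized to a common range before mixing is not stated on the slide.

```python
def rep_score(rm, qm, tm, alpha=1/3, beta=1/3, gamma=1/3):
    """Rep(wk) = alpha * RM(wk) + beta * QM(wk) + gamma * TM(wk)."""
    weights = (alpha, beta, gamma)
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("alpha, beta and gamma must each lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * rm + beta * qm + gamma * tm

# Hypothetical measure values for a single word.
score = rep_score(rm=0.3, qm=0.6, tm=0.9)
```

Setting one weight to 1.0 and the others to 0.0 isolates a single measure, which is a convenient way to compare the reader, quotation and topic measures individually.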
18. Sentence Selection: Density-based selection (DBS) treats words appearing in comments as keywords and the rest as non-keywords. K is the total number of keywords contained in si, Score(wj) is the score of keyword wj, and distance(wj, wj+1) is the number of non-keywords (including stopwords) between the two adjacent keywords wj and wj+1 in si.
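The DBS equation itself is missing from this transcript. A common density-based form built from the same symbols is Rep(si) = 1/(K · (K + 1)) · Σ_{j=1}^{K−1} Score(wj) · Score(wj+1) / distance(wj, wj+1)², and the sketch below assumes that form. It also assumes the distance is clamped to at least 1 so that directly adjacent keywords do not cause a division by zero. The function name and sample data are hypothetical.

```python
def dbs_score(sentence, word_scores):
    """Density-based sentence score (assumed form; see the lead-in).

    sentence:    list of tokens, stopwords included.
    word_scores: dict mapping each keyword to Score(w); other words are non-keywords.
    """
    positions = [i for i, w in enumerate(sentence) if w in word_scores]
    k = len(positions)
    if k < 2:
        return 0.0  # no adjacent keyword pair to score
    total = 0.0
    for p, q in zip(positions, positions[1:]):
        # Non-keywords between the two keywords, clamped to 1 (assumption).
        gap = max(1, q - p - 1)
        total += word_scores[sentence[p]] * word_scores[sentence[q]] / gap ** 2
    return total / (k * (k + 1))

# Hypothetical sentence and keyword scores.
sentence = ["dark", "matter", "is", "not", "visible"]
scores = {"dark": 2.0, "matter": 3.0, "visible": 1.0}
rep = dbs_score(sentence, scores)
```

In this example the adjacent pair ("dark", "matter") contributes 6.0 and the pair ("matter", "visible"), separated by two non-keywords, contributes 0.75, so tightly packed keywords dominate the score.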
19. Sentence Selection: Summation-based selection (SBS) gives a higher representativeness score to a sentence if it contains more representative words. |si| is the length of sentence si in number of words (including stopwords), and τ > 0 is a parameter to flexibly control the contribution of a word’s representativeness score.
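The SBS equation is likewise not preserved. Given the symbols described, one plausible reading is Rep(si) = (1/|si|) · Σ_{wk ∈ si} Rep(wk)^τ, where τ is applied to each word's score before summing. The sketch below assumes that reading, and also assumes that repeated words are counted once per occurrence; the function name and sample data are hypothetical.

```python
def sbs_score(sentence, word_scores, tau=0.2):
    """Summation-based sentence score (assumed form; see the lead-in).

    sentence:    list of tokens, stopwords included, so len(sentence) is |si|.
    word_scores: dict mapping words to Rep(wk); missing words contribute nothing.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    if not sentence:
        return 0.0
    total = sum(word_scores[w] ** tau for w in sentence if w in word_scores)
    return total / len(sentence)

# Hypothetical sentence and word scores.
rep = sbs_score(["dark", "matter", "is"], {"dark": 4.0, "matter": 9.0}, tau=0.5)
```

A small τ flattens the differences between word scores, so the sentence score depends more on how many representative words it contains than on how large their individual scores are.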
20. User Study and Experiments: We collected data from two famous blogs, Cosmic Variance and IEBlog, both of which have a relatively large readership and are widely commented on. Cosmic Variance has fewer but more loyal readers, and its posts cover very diverse topics. IEBlog has more but less loyal readers, with topics mainly in Web development.
21. User Study: Our hypothesis is that one’s understanding of a blog post does not change after reading the comments associated with the post. The user study was conducted in two phases, with 3 human summarizers, 20 blog posts, and nearly 1000 comments associated with the 20 posts. Phase 1: select approximately 30% of the sentences from each post, without reading comments, as its summary; the selected sentences served as a labeled dataset known as RefSet-1. Phase 2: select approximately 30% of the sentences from each post, after reading the post and its comments, as its summary; the selected sentences served as a labeled dataset known as RefSet-2.
22. User Study: For each human summarizer, we computed the level of self-agreement, shown in the table. The self-agreement level is defined as the percentage of sentences labeled in both reference sets against the sentences in RefSet-1 by the same summarizer. The results indicate that reading comments does change one’s understanding of blog posts.
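The self-agreement definition on this slide is a simple set ratio. A minimal sketch, assuming each reference set is given as a collection of sentence identifiers chosen by one summarizer; the function name and the sample identifiers are hypothetical and do not reproduce the study's data.

```python
def self_agreement(refset1, refset2):
    """Percentage of RefSet-1 sentences that the same summarizer also put in RefSet-2."""
    refset1, refset2 = set(refset1), set(refset2)
    if not refset1:
        raise ValueError("RefSet-1 must contain at least one sentence")
    return 100.0 * len(refset1 & refset2) / len(refset1)

# Hypothetical sentence ids picked by one summarizer in each phase.
agreement = self_agreement({0, 2, 5, 7}, {2, 3, 7, 9})
```

A value well below 100% means the summarizer picked different sentences after reading the comments, which is the evidence this slide uses against the hypothesis.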
23. Experimental Results: RefSet-2 was used to evaluate the two sentence selection methods with four word representativeness measures. Parameter settings: τ = 0.2 and α = β = γ = 0.33. Evaluation metric: Normalized Discounted Cumulative Gain (NDCG).
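For reference, a common formulation of NDCG is sketched below, with DCG = Σ gain_i / log2(i + 1) over ranks i starting at 1, normalized by the DCG of the ideal ordering. The slide does not say which NDCG variant or gain definition was used in the experiments, so the formulation, the function names and the sample gains are all assumptions.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of gains (rank 1 first)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_gains, all_gains=None):
    """DCG of the ranking divided by the DCG of the ideal ranking of equal length.

    all_gains: gains of every candidate sentence; defaults to ranked_gains.
    """
    pool = ranked_gains if all_gains is None else all_gains
    ideal = sorted(pool, reverse=True)[:len(ranked_gains)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical gains, e.g. how many summarizers labeled each selected sentence.
quality = ndcg([2, 3, 0], all_gains=[3, 2, 1, 0, 0])
```

NDCG rewards a method for ranking the most agreed-upon sentences first, which is why it suits evaluation against a reference set with graded relevance.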
24. Conclusion: Reading comments does affect one’s understanding of a blog post. We evaluated two sentence selection methods with four word representativeness measures. ReQuT gives the flexibility to measure word representativeness through three aspects: reader, quotation and topic.