Comments-Oriented Blog Summarization by Sentence Extraction. Authors: Meishan Hu, Aixin Sun, Ee-Peng Lim. Publication: CIKM’07. Presenter: Jhih-Ming Chen.
Outline: Introduction; Problem Definition; ReQuT Model (Reader-, Quotation- and Topic- Measures; Word Representativeness Score; Sentence Selection); User Study and Experiments (User Study; Experimental Results); Conclusion.
Introduction: Readers treat comments associated with a post as an inherent part of the post, yet existing research largely ignores comments by focusing on blog posts only. This paper conducted a user study on summarizing blog posts by labeling representative sentences in those posts, to find out whether reading the comments would change a reader’s understanding of the post.
Introduction: This paper focuses on the problem of comments-oriented blog post summarization, that is, summarizing a blog post by extracting representative sentences from the post using information hidden in its comments.
Introduction: The approach has three steps. Sentence Detection splits blog post content into sentences. Word Representativeness Measure weighs words appearing in comments. Sentence Selection computes a representativeness score for each sentence based on the representativeness of the words it contains.
Problem Definition: Given a blog post P = {s_1, s_2, ..., s_n}, where each sentence s_i = {w_1, w_2, ..., w_m}, and the set of comments C = {c_1, c_2, ..., c_k} associated with P, the task of comments-oriented blog summarization is to extract a subset of sentences from P, denoted by S_r (S_r ⊂ P), that best represents the discussion in C.
Problem Definition: One straightforward approach is to compute a representativeness score for each sentence s_i, denoted by Rep(s_i), and select sentences with representativeness scores above a given threshold. Intuitively, word representativeness can be measured by counting the occurrences of a word in comments. Binary: Rep(w_k) = 1 if w_k appears in at least one comment, and Rep(w_k) = 0 otherwise. Comment Frequency (CF): Rep(w_k) is the number of comments containing word w_k. Term Frequency (TF): Rep(w_k) is the number of occurrences of w_k in all comments associated with a blog post.
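The three baseline measures on the slide above can be sketched in a few lines. This is a minimal illustration, assuming comments are already tokenized into lists of lowercase words; the function name `baseline_measures` and the sample data are hypothetical and not from the presentation.

```python
from collections import Counter


def baseline_measures(comments):
    """Return (binary, cf, tf) dictionaries for a list of tokenized comments.

    comments: list of token lists, one list per comment (assumed input shape).
    """
    tf = Counter()  # total occurrences of each word across all comments
    cf = Counter()  # number of comments containing each word
    for tokens in comments:
        tf.update(tokens)
        cf.update(set(tokens))  # count each word at most once per comment
    # Binary: 1 for every word seen in at least one comment (absent words are 0).
    binary = {word: 1 for word in tf}
    return binary, dict(cf), dict(tf)


comments = [["dark", "matter", "dark"], ["matter", "halo"]]
binary, cf, tf = baseline_measures(comments)
```

Words that never appear in a comment are simply missing from the dictionaries, so a lookup such as `binary.get(word, 0)` yields the "0 otherwise" case.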
Problem Definition: Binary captures minimum information; CF and TF capture slightly more. Other potentially useful information available in comments is ignored, e.g., authors of comments, quotations among comments, and so on. All three measures suffer from spam comments.
ReQuT Model: A comment, other than its content, is often associated with an author, a time-stamp, and even a permalink. Three observations provide guidelines on measuring word representativeness. Observation 1: a reader often mentions another reader’s name to indicate that the current comment is a reply to previous comment(s). Observation 2: a comment may contain quoted sentences from one or more comments. Observation 3: discussion in comments often branches into several topics.
Reader-, Quotation- and Topic- Measures: Based on the three observations, we believe that a word is representative if it is written by authoritative readers, appears in widely quoted comments, and represents hotly discussed topics.
Reader-, Quotation- and Topic- Measures: With Observation 1, given the full set of comments to a blog, we construct a directed reader graph G_R = (V_R, E_R). Each r_a ∈ V_R is a reader. An edge e_R(r_b, r_a) ∈ E_R exists if r_b mentions r_a in one of r_b’s comments. The edge weight W_R(r_b, r_a) is the ratio of the number of times r_b mentions r_a to the number of times r_b mentions other readers (including r_a).
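The edge weight W_R defined on the slide above is a per-reader normalization of mention counts. A minimal sketch, assuming mentions have already been extracted as (mentioner, mentioned) pairs; the function name and the reader names are hypothetical.

```python
from collections import Counter, defaultdict


def reader_edge_weights(mentions):
    """Compute W_R(r_b, r_a) from an iterable of (r_b, r_a) mention pairs.

    Each pair records one occasion on which reader r_b mentions reader r_a.
    """
    counts = defaultdict(Counter)
    for rb, ra in mentions:
        counts[rb][ra] += 1

    weights = {}
    for rb, targets in counts.items():
        total = sum(targets.values())  # all mentions made by r_b
        for ra, n in targets.items():
            weights[(rb, ra)] = n / total
    return weights


mentions = [("bob", "alice"), ("bob", "alice"), ("bob", "carol"), ("carol", "alice")]
weights = reader_edge_weights(mentions)
```

By construction, the outgoing weights of each reader who mentions anyone sum to 1, which is what lets the graph feed a random-walk style authority computation.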
Reader-, Quotation- and Topic- Measures: With Observation 1, compute reader authority, where |R| denotes the total number of readers of the blog and d is the damping factor (usually set to 0.85). The reader measure of a word w_k is denoted by RM(w_k), where tf(w_k, c_i) is the term frequency of word w_k in comment c_i and c_i ← r_a means that c_i is authored by reader r_a.
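The authority and RM formulas themselves are not preserved in this transcript. The sketch below is therefore an assumption: it uses a standard PageRank-style recurrence over the reader graph (uniform share (1 - d)/|R| plus d times the weighted authority of readers who mention r_a), and it takes RM(w_k) to be term frequency weighted by the author's authority. Both forms, and all function names, are illustrative rather than the authors' confirmed definitions.

```python
from collections import Counter


def reader_authority(readers, weights, d=0.85, iterations=50):
    """PageRank-style reader authority (assumed form, not confirmed by the slide).

    readers: list of reader names.
    weights: dict mapping (r_b, r_a) -> W_R(r_b, r_a).
    """
    n = len(readers)
    authority = {r: 1.0 / n for r in readers}
    for _ in range(iterations):
        updated = {}
        for ra in readers:
            inbound = sum(
                w * authority[rb]
                for (rb, target), w in weights.items()
                if target == ra
            )
            # Assumed recurrence: uniform share plus damped inbound authority.
            updated[ra] = (1 - d) / n + d * inbound
        authority = updated
    return authority


def reader_measure(word, comments, authority):
    """Assumed RM(w_k): sum over comments of tf(w_k, c_i) times the author's authority.

    comments: list of (author, tokens) pairs.
    """
    return sum(Counter(tokens)[word] * authority[author] for author, tokens in comments)
```

Readers who mention nobody leak authority in this simple version; a production implementation would redistribute that mass, which is omitted here for brevity.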
Reader-, Quotation- and Topic- Measures: With Observation 2, for the set of comments associated with each blog post, we construct a directed acyclic quotation graph G_Q = (V_Q, E_Q). Each c_i ∈ V_Q is a comment. An edge e_Q(c_j, c_i) ∈ E_Q indicates that c_j quoted sentences from c_i. The edge weight W_Q(c_j, c_i) is 1 over the number of comments that c_j ever quoted.
Reader-, Quotation- and Topic- Measures: With Observation 2, derive the quotation degree D(c_i) of a comment c_i, where |C| is the number of comments associated with the given post. A comment that is not quoted by any other comment receives a quotation degree of 1/|C|. The quotation measure of a word w_k is denoted by QM(w_k), where w_k ∈ c_i means that word w_k appears in comment c_i.
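The quotation-degree formula is not preserved in this transcript. A form consistent with the stated facts (unquoted comments get exactly 1/|C|, and W_Q(c_j, c_i) is 1 over the number of comments c_j quoted) is D(c_i) = 1/|C| + Σ W_Q(c_j, c_i) · D(c_j) over the comments c_j that quote c_i. The sketch below assumes that form; the function name and input layout are hypothetical.

```python
from functools import lru_cache


def quotation_degrees(num_comments, quotes):
    """Quotation degree D(c_i) for comments indexed 0..num_comments-1 (assumed form).

    quotes: dict mapping comment index j -> set of comment indices that j quotes.
    The quotation graph must be acyclic, as stated on the slide.
    """
    quoted_by = {i: [] for i in range(num_comments)}
    for j, targets in quotes.items():
        for i in targets:
            quoted_by[i].append(j)

    @lru_cache(maxsize=None)
    def degree(i):
        base = 1.0 / num_comments
        # Each quoting comment j passes on D(c_j) * W_Q(c_j, c_i),
        # where W_Q is 1 over the number of comments j quoted.
        return base + sum(degree(j) / len(quotes[j]) for j in quoted_by[i])

    return [degree(i) for i in range(num_comments)]


# Comment 1 quotes comment 0; comment 2 quotes comments 0 and 1.
degrees = quotation_degrees(3, {1: {0}, 2: {0, 1}})
```

Because the graph is acyclic, the memoized recursion terminates and no iteration to convergence is needed.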
Reader-, Quotation- and Topic- Measures: With Observation 3, given the set of comments associated with each blog post, we group these comments into topic clusters using a Single-Pass Incremental Clustering algorithm. The similarity threshold in clustering comments was empirically set to 0.4. A hotly discussed topic has a large number of comments, all close to the topic cluster centroid.
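Single-pass incremental clustering visits each comment once, assigning it to the most similar existing cluster if the similarity reaches the threshold and otherwise starting a new cluster. A minimal sketch, assuming raw term-count vectors and cosine similarity against a summed centroid; the presentation does not say which term weighting was used, so that choice and the function names are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(comments, threshold=0.4):
    """Group tokenized comments into clusters; returns lists of comment indices."""
    centroids = []  # summed term vectors (same direction as the mean vector)
    members = []
    for idx, tokens in enumerate(comments):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            centroids[best].update(vec)  # fold the comment into the centroid
            members[best].append(idx)
        else:
            centroids.append(Counter(vec))
            members.append([idx])
    return members


clusters = single_pass_cluster(
    [["dark", "matter", "halo"], ["dark", "matter", "galaxy"], ["css", "layout", "bug"]]
)
```

The result depends on the order in which comments are processed, which is a known property of single-pass methods.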
Reader-, Quotation- and Topic- Measures: With Observation 3, compute the importance of a topic cluster, where |c_i| is the length of comment c_i in number of words, C is the set of comments, and sim(c_i, t_u) is the cosine similarity between comment c_i and the centroid of topic cluster t_u. The topic measure of a word w_k is denoted by TM(w_k), where c_i ∈ t_u denotes that comment c_i is clustered into topic cluster t_u.
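The cluster-importance formula is not preserved in this transcript. One form that uses exactly the listed ingredients is the length-weighted similarity of a cluster's comments, normalized by the total length of all comments: Σ over c_i ∈ t_u of |c_i| · sim(c_i, t_u), divided by Σ over c_j ∈ C of |c_j|. The sketch below assumes that form and takes the similarities as precomputed inputs; it is an illustration, not the confirmed definition.

```python
def topic_importance(cluster, lengths, sims):
    """Assumed importance of a topic cluster t_u.

    cluster: indices of the comments assigned to t_u.
    lengths: |c_i| for every comment in C, indexed by comment.
    sims: sim(c_i, t_u) for each comment against its own cluster centroid.
    """
    total_length = sum(lengths)
    if total_length == 0:
        return 0.0
    return sum(lengths[i] * sims[i] for i in cluster) / total_length


lengths = [10, 30, 60]
sims = [0.5, 1.0, 0.8]
importance = topic_importance([0, 1], lengths, sims)
```

Under this assumed form, a cluster scores highly when it holds many long comments that sit close to its centroid, matching the slide's description of a hotly discussed topic.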
Word Representativeness Score: Rep(w_k) is the combination of the reader, quotation and topic measures in the ReQuT model: Rep(w_k) = α ∙ RM(w_k) + β ∙ QM(w_k) + γ ∙ TM(w_k), where 0 ≤ α, β, γ ≤ 1.0 and α + β + γ = 1.0.
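The linear combination above translates directly into code. A minimal sketch with the constraints on α, β and γ checked explicitly; the function name is hypothetical, and the sum check uses a loose tolerance (an assumption) so that rounded weights such as 0.33 each are accepted.

```python
import math


def word_representativeness(rm, qm, tm, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Rep(w_k) = alpha * RM(w_k) + beta * QM(w_k) + gamma * TM(w_k)."""
    coefficients = (alpha, beta, gamma)
    if any(c < 0.0 or c > 1.0 for c in coefficients):
        raise ValueError("alpha, beta and gamma must each lie in [0, 1]")
    # Loose tolerance (assumption) so rounded settings like 0.33 * 3 pass.
    if not math.isclose(sum(coefficients), 1.0, abs_tol=0.02):
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * rm + beta * qm + gamma * tm


score = word_representativeness(0.2, 0.4, 0.6)
```

Setting one coefficient to 1 and the others to 0 isolates a single measure, which is a convenient way to compare the reader, quotation and topic signals individually.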
Sentence Selection: Density-based selection (DBS) treats words appearing in comments as keywords and the rest as non-keywords. K is the total number of keywords contained in s_i, Score(w_j) is the score of keyword w_j, and distance(w_j, w_{j+1}) is the number of non-keywords (including stopwords) between the two adjacent keywords w_j and w_{j+1} in s_i.
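The DBS formula is not preserved in this transcript. The sketch below assumes a common keyword-density form: the sum over adjacent keyword pairs of Score(w_j) · Score(w_{j+1}) divided by the squared distance, normalized by K · (K + 1). To avoid dividing by zero when two keywords are directly adjacent, it uses the positional gap (non-keywords in between, plus one) as the distance. Both the formula and that adjustment are assumptions, and the function name is hypothetical.

```python
def density_based_score(sentence, scores):
    """Assumed DBS score for one sentence.

    sentence: list of tokens in order.
    scores: dict mapping each keyword to Score(w); other tokens are non-keywords.
    """
    positions = [i for i, word in enumerate(sentence) if word in scores]
    k = len(positions)
    if k < 2:
        return 0.0  # no adjacent keyword pair to measure density on
    total = 0.0
    for left, right in zip(positions, positions[1:]):
        gap = right - left  # non-keywords in between, plus one (assumption)
        total += scores[sentence[left]] * scores[sentence[right]] / gap ** 2
    return total / (k * (k + 1))


sentence = ["dark", "matter", "is", "a", "halo"]
scores = {"dark": 1.0, "matter": 2.0, "halo": 3.0}
dbs = density_based_score(sentence, scores)
```

Under this assumed form, sentences whose keywords are both high-scoring and tightly packed score highest, while a sentence with a single keyword scores zero.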
Sentence Selection: Summation-based selection (SBS) gives a higher representativeness score to a sentence if it contains more representative words. |s_i| is the length of sentence s_i in number of words (including stopwords), and τ > 0 is a parameter to flexibly control the contribution of a word’s representativeness score.
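The SBS formula is not preserved in this transcript. A form that matches the listed symbols is the length-normalized sum of word scores raised to the power τ, that is, (1/|s_i|) · Σ over w_k ∈ s_i of Rep(w_k)^τ. The sketch below assumes that form; the function name and sample data are hypothetical.

```python
def summation_based_score(sentence, rep, tau=0.2):
    """Assumed SBS score: (1 / |s_i|) * sum of Rep(w_k) ** tau over words in s_i.

    sentence: list of tokens, stopwords included, so len(sentence) is |s_i|.
    rep: dict mapping words to Rep(w_k); missing words contribute nothing.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    if not sentence:
        return 0.0
    total = sum(rep[word] ** tau for word in sentence if rep.get(word, 0.0) > 0.0)
    return total / len(sentence)


sbs = summation_based_score(
    ["dark", "matter", "is", "strange"], {"dark": 1.0, "matter": 1.0}
)
```

With τ below 1, large word scores are compressed, so a sentence with several moderately representative words can outrank one dominated by a single very high-scoring word.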
User Study and Experiments: We collected data from two famous blogs, Cosmic Variance and IEBlog, both having relatively large readerships and being widely commented on. Cosmic Variance has more loyal but fewer readers, with very diverse topics covered in its posts. IEBlog has less loyal but more readers, with topics mainly in Web development.
User Study: Our hypothesis is that one’s understanding of a blog post does not change after reading the comments associated with the post. The user study was conducted in two phases, with 3 human summarizers, 20 blog posts, and nearly 1000 comments associated with the 20 posts. Phase 1: select approximately 30% of sentences from each post, without its comments, as its summary; the selected sentences served as a labeled dataset known as RefSet-1. Phase 2: select approximately 30% of sentences from each post, read together with its comments, as its summary; the selected sentences served as a labeled dataset known as RefSet-2.
User Study: For each human summarizer, we computed the level of self-agreement, shown in the table. Self-agreement level is defined as the percentage of sentences labeled in both reference sets against the sentences in RefSet-1 by the same summarizer. The conclusion is that reading comments does change one’s understanding of blog posts.
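The self-agreement definition above is a set overlap ratio. A minimal sketch, assuming each reference set is given as a collection of sentence identifiers chosen by one summarizer; the function name and the sample identifiers are hypothetical and do not reproduce the study's data.

```python
def self_agreement(refset1, refset2):
    """Percentage of a summarizer's RefSet-1 sentences that also appear in RefSet-2."""
    first, second = set(refset1), set(refset2)
    if not first:
        raise ValueError("RefSet-1 must contain at least one sentence")
    return 100.0 * len(first & second) / len(first)


# Hypothetical sentence ids chosen by one summarizer in each phase.
agreement = self_agreement({0, 2, 5, 7}, {2, 5, 8, 9})
```

A value well below 100% means the summarizer picked different sentences after reading the comments, which is the kind of evidence the slide uses against the hypothesis.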
Experimental Results: RefSet-2 was used to evaluate the two sentence selection methods with four word representativeness measures. Parameters were set to τ = 0.2 and α = β = γ = 0.33. The evaluation metric was Normalized Discounted Cumulative Gain (NDCG).
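NDCG compares the gain accumulated by a ranked list with the gain of the ideal ordering of the same items. A minimal sketch using the common log2 rank discount; the exact gain and discount used in the paper are not given on the slide, so this variant and the example gain values are assumptions.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """Normalize DCG by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical gains for sentences in the order a method ranked them,
# e.g. how many summarizers selected each sentence.
quality = ndcg([3, 1, 2, 0])
```

A ranking that places the highest-gain sentences first scores 1.0, and any misordering near the top of the list is penalized more than one further down.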
Conclusion: Reading comments does affect one’s understanding of a blog post. Two sentence selection methods were evaluated with four word representativeness measures. ReQuT gives the flexibility to measure word representativeness through three aspects: reader, quotation and topic.
