Invited talk at SBP 2012: Intl. Conf. on Social Computing, Behavioral-Cultural Modeling, & Prediction (April 3, 2012). Based on paper by Ryu, Lease, and Woodward, to appear at ACM HyperText 2012. Joint work with Hohyon Ryu and Nicholas Woodward.
Discovering and Navigating Memes in Social Media
1. Discovering and Navigating Memes in Social Media
Matt Lease
School of Information
University of Texas at Austin
ml@ischool.utexas.edu
@mattlease
Joint Work with
Hohyon Ryu & Nicholas Woodward
Paper to appear at HyperText 2012: 23rd ACM Conference on Hypertext and Social Media
2. April 3, 2012 SBP 2012: Intl. Conf. on Social Computing, Behavioral-Cultural Modeling, & Prediction 2
3. Critical Reading (Literacy)
• Context-awareness (how work is situated)
– Related works, Time/Place, Author…
• Recognizing & questioning
– Sources of Influence
– Positions, Assumptions, Bias, …
• New challenges online
– Scale, authorship, citing of sources, borrowing…
• Traditional approach: education
4. Inspiration #1: Living Stories
livingstories.googlelabs.com
5. Memes
• Similar phrases found across multiple sources
– Includes multiple phrasings of same idea
• Re-use reveals implicit network
– Sources, Individuals, Communities
– Patterns of re-use reinforce links
• Questions
– Re-use?
– Intended re-use?
– Visible (quoted)?
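To make the "implicit network" idea concrete, here is a minimal sketch, assuming we already know which sources each phrase appeared in. The function name `reuse_network`, the source names, and the toy data are hypothetical illustrations, not part of the system described in the talk.

```python
from collections import defaultdict
from itertools import combinations


def reuse_network(phrase_sources):
    """Link sources that share a phrase; edge weight = number of shared phrases."""
    edges = defaultdict(int)
    for sources in phrase_sources.values():
        # Every pair of distinct sources using the same phrase gets one more link.
        for a, b in combinations(sorted(set(sources)), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Hypothetical toy data: phrase -> sources where it appeared.
phrase_sources = {
    "to be or not to be": ["blog_a", "blog_b", "blog_c"],
    "a penny saved": ["blog_a", "blog_b"],
}
network = reuse_network(phrase_sources)
```

Pairs that share more phrases end up with heavier edges, which is one simple reading of "patterns of re-use reinforce links".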
6. Inspiration #2: Meme Tracker
7. Where Repeated Text Occurs
• Intended Re-use
– Visible (Quotation): “to be or not to be”
• Leskovec et al., KDD’09 (memetracker.org)
– Hidden: e.g. plagiarism, false plurality
– Unmarked
• Near-Duplicate documents
• Boilerplate: All rights reserved
• Common adage: …a penny saved…
• Style, genre, laziness, …
• Accidental borrowing
• Shared context (e.g. named entities)
– E.g. named-entities: S. Skiena et al., Stony Brook (textmap.com)
• Chance (e.g. …then he said…)
8. Data
• TREC Blogs08 Collection
– http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
– 28M permalinks (January 2008 – January 2009)
– 250 GB compressed
• ICWSM 2009 Spinn3r Blog Dataset
– http://www.icwsm.org/data/
– 44 million blog posts (August - September, 2008)
– 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset
9. Inspiration #3: Popular Passages
• Kolak & Schilit, HyperText’08
• Find re-use in scanned books
– Find repeated phrases
– Group related phrases
– Rank passages
– MapReduce processing architecture
• Browsing interface with generated links
• Issues: data/task, locality, details, scalability
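The "find repeated phrases" step can be sketched in a map/reduce style. This is an illustrative simplification using fixed-length word n-grams, not the algorithm of Kolak & Schilit; the function names, the n-gram length of 4, and the toy documents are all assumptions made for the example.

```python
from collections import defaultdict


def map_ngrams(doc_id, text, n=4):
    """Map step: emit (n-gram, doc_id) for every word n-gram in a document."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n]), doc_id


def reduce_repeated(pairs, min_docs=2):
    """Reduce step: keep n-grams seen in at least min_docs distinct documents."""
    docs_by_ngram = defaultdict(set)
    for ngram, doc_id in pairs:
        docs_by_ngram[ngram].add(doc_id)
    return {g: d for g, d in docs_by_ngram.items() if len(d) >= min_docs}


# Hypothetical toy documents.
docs = {
    "d1": "he said a penny saved is a penny earned",
    "d2": "remember a penny saved is a penny earned today",
    "d3": "nothing in common here at all",
}
pairs = [p for doc_id, text in docs.items() for p in map_ngrams(doc_id, text)]
repeated = reduce_repeated(pairs)
# Rank by number of documents (descending), then alphabetically.
ranked = sorted(repeated, key=lambda g: (-len(repeated[g]), g))
```

Overlapping n-grams such as "a penny saved is" and "penny saved is a" would still need to be merged into one passage, which corresponds to the grouping step on the slide.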
10. Processing Architecture
• Blogs08 Test Collection
– 28M posts, 1.4TB
• Preprocessing (Pseudo-MapReduce)
– Decruft & Language Identification
– HTML Strip & Near-Duplicate Detection
– 16M posts, 960GB
• Common Phrase Extraction (3 MapReduce Stages)
– 15K posts, 43GB
• Common Phrase Ranking: Daily Top 200 Phrases (1 MapReduce Process)
– 6.2M phrases, 2GB
• Common Phrase Clustering (1 MapReduce Process)
– 75K phrases, 2.6MB
• Meme Browser
– 68K memes
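The "Daily Top 200 Phrases" ranking stage might look roughly like the sketch below. The slide does not say how phrases are scored, so ranking by raw daily frequency is an assumption here, as are the function name `daily_top_phrases`, the dates, and the toy data.

```python
from collections import Counter, defaultdict


def daily_top_phrases(occurrences, k=200):
    """Return {day: [(phrase, count), ...]} with the k most frequent phrases per day."""
    counts = defaultdict(Counter)
    for day, phrase in occurrences:
        counts[day][phrase] += 1
    # Assumption: rank by raw frequency; the actual scoring is not given on the slide.
    return {day: c.most_common(k) for day, c in counts.items()}


# Hypothetical (day, phrase) occurrences.
occurrences = [
    ("2008-09-01", "to be or not to be"),
    ("2008-09-01", "to be or not to be"),
    ("2008-09-01", "a penny saved"),
    ("2008-09-02", "a penny saved"),
]
top = daily_top_phrases(occurrences, k=1)
```

Grouping by day first keeps each day's ranking independent, which fits a single MapReduce pass keyed on the date.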
11. Meme Browser
12. Efficiency: Meme Clustering
• From WEKA ARFF format to sparse representation
– From ~96 hours to 11 hours
• Indexed vs. un-indexed
– From 11 hours to 16 minutes (single core)
– From 34 minutes to 3 minutes (136 cores)
• Distributed vs. single core
– From 11 hours to 34 minutes (un-indexed)
– From 16 minutes to 3 minutes (indexed)
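One plausible reading of "sparse representation" and "indexed" is sketched below: phrases are stored as sparse term-weight dictionaries, and an inverted index limits similarity comparisons to phrases that share at least one term. This interpretation, the function names, and the toy vectors are assumptions; the slides do not describe the actual data structures.

```python
from collections import defaultdict


def build_index(vectors):
    """Inverted index: term -> set of item ids whose sparse vector contains it."""
    index = defaultdict(set)
    for item_id, vec in vectors.items():
        for term in vec:
            index[term].add(item_id)
    return index


def candidates(item_id, vectors, index):
    """Items sharing at least one term with item_id; all others have zero overlap."""
    found = set()
    for term in vectors[item_id]:
        found |= index[term]
    found.discard(item_id)
    return found


def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[t] for t, w in u.items() if t in v)


# Hypothetical sparse phrase vectors.
vectors = {
    "p1": {"penny": 1.0, "saved": 1.0},
    "p2": {"penny": 1.0, "earned": 1.0},
    "p3": {"rights": 1.0, "reserved": 1.0},
}
index = build_index(vectors)
near_p1 = candidates("p1", vectors, index)
```

Without the index, every phrase would be compared against every other phrase. For scale, the slide's single-core figures (11 hours, i.e. 660 minutes, down to 16 minutes) correspond to roughly a 41x speedup from indexing.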
13. Thank You!
Matt Lease
ml@ischool.utexas.edu
www.ischool.utexas.edu/~ml
@mattlease
Joint Work with
– Hohyon (Will) Ryu
– Nicholas Woodward
Meme Browser: odyssey.ischool.utexas.edu/mb
Support
• FCT of Portugal / UT CoLab
• Amazon Web Services
• UT Austin LIFT Award
• John P. Commons Fellowship