Discovering Memes in Social Media

                              Matt Lease
                        School of Information
                      University of Texas at Austin
                        ml@ischool.utexas.edu
                              @mattlease

                             Joint Work with
                     Hohyon Ryu & Nicholas Woodward


Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
Memes
• Short, similar phrases found in
  many different sources
  – Re-use, shared temporal context
• Evolutionary mutation &
  propagation as they transmit
  from source-to-source
• Reveals implicit connections
  between sources, individuals
  and communities involved
  March 21, 2012   ACM SIGKDD - Austin Chapter Meeting   2
MemeBrowser & Critical Literacy




Google/NYT Living Stories




                 livingstories.googlelabs.com
Related Work
• Jure Leskovec et al. (KDD’09): blogs
     – quotations only: http://memetracker.org
• Steven Skiena, Stony Brook NY: blogs
     – Named-entities only: http://www.textmap.com
• O. Kolak and B. Schilit (HT’08): scanned books
     – Mine “popular passages” from complete texts
     – MapReduce “shingling” approach
     – Popular passages found are local, not global

MapReduce @ UT
• UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
• New hard disks @ TACC Longhorn installed Dec.’10
   – 48 Dell R610 nodes
         • 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
         • 48GB RAM with ~1.5TB disk per node
         • With 1 NameNode & 47 DataNodes, up to 376 parallel Mappers (47 nodes × 8 cores)
   – 16 Dell R710 (same CPU configuration)
         • 144GB RAM with ~0.8TB disk per node
   – Set up Hadoop, testing, benchmarking, etc.
• Baldridge & Lease teach MapReduce class Fall’11
Datasets
• TREC Blogs08 Collection
     – http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
     – 28M permalinks (January 2008 – January 2009)
     – 250 GB compressed
• ICWSM 2009 Spinn3r Blog Dataset
     – http://www.icwsm.org/data/
     – 44 million blog posts (August - September, 2008)
     – 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset

Processing Architecture
                                                               Blogs08 Test Collection
                                                                  28M posts, 1.4TB
       Preprocessing (Pseudo-MapReduce)
       Decruft & Language Identification
       HTML Strip & Near-Duplicate Detection                       16M posts, 960GB



       Common Phrase Extraction
                                                                    15K posts, 43GB
       3 MapReduce Stages

       Common Phrase Ranking
       Daily Top 200 Phrases                                       6.2M phrases, 2GB
       1 MapReduce Process

       Common Phrase Clustering
                                                                   75K phrases, 2.6MB
       1 MapReduce Process

       Meme Browser
                                                                      68K memes


Creating the Shingle Table
• e.g. trigram shingles for: what do you think of

  – what do you
  – do you think
  – you think of
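The trigram example above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, whitespace tokenization, lowercasing, and the (doc_id, position) layout of the table are all assumptions made for the example.

```python
def shingles(text, n=3):
    """Return the word n-gram shingles of `text`, in order of occurrence."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_shingle_table(docs, n=3):
    """Map each shingle to the list of (doc_id, position) pairs where it occurs.

    `docs` is a dict of doc_id -> text; this table layout is hypothetical.
    """
    table = {}
    for doc_id, text in docs.items():
        for pos, sh in enumerate(shingles(text, n)):
            table.setdefault(sh, []).append((doc_id, pos))
    return table


docs = {"a": "what do you think of", "b": "so what do you think"}
table = build_shingle_table(docs)
```

Here `shingles("what do you think of")` yields the three trigrams listed on the slide, and a shingle shared by both toy documents, such as "what do you", ends up with one entry per document in the table.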




Grouping Shingles by Document
• Mapper: trivial grouping; Reducer: Identity
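A rough single-process simulation of this stage, assuming the shingle table maps each shingle to (doc_id, position) pairs. The record layout, the function names, and the position-sorted shuffle are assumptions for illustration; the slide only states that the mapper re-groups and the reducer is the identity.

```python
from collections import defaultdict


def map_shingle(shingle, postings):
    """Mapper: re-key one shingle-table row by document id."""
    doc_ids = sorted({doc_id for doc_id, _ in postings})
    for doc_id, pos in postings:
        # value carries the position, the shingle, and every doc sharing it
        yield doc_id, (pos, shingle, doc_ids)


def identity_reduce(doc_id, values):
    """Reducer: pass the grouped values through unchanged."""
    yield doc_id, list(values)


def run_job(shingle_table):
    """Simulate map -> shuffle -> reduce in memory."""
    groups = defaultdict(list)
    for shingle, postings in shingle_table.items():
        for key, value in map_shingle(shingle, postings):
            groups[key].append(value)
    out = {}
    for doc_id in sorted(groups):
        # assumption: the shuffle delivers values ordered by position
        for key, values in identity_reduce(doc_id, sorted(groups[doc_id])):
            out[key] = values
    return out


table = {
    "what do you": [("a", 0), ("b", 1)],
    "do you think": [("a", 1), ("b", 2)],
    "you think of": [("a", 2)],
}
by_doc = run_job(table)
```

After the job, each document id maps to its shingles in order of position, which is the shape the next stage needs in order to merge adjacent shingles.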




Common Phrase (CP) Detection
• Mapper:
  Merge adjacent
  shingles into memes
  (ignoring small gaps)

• Reducer:
  Find set of
  documents in which
  each meme occurs
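One way to picture the two halves of this stage is the sketch below. It assumes the mapper already knows which shingle start offsets in a document are shared with other documents; the `max_gap` parameter, the function names, and the toy inputs are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def merge_common_shingles(tokens, common_positions, n=3, max_gap=1):
    """Mapper logic: merge runs of shared shingles into longer phrases.

    `common_positions` holds start offsets of shingles also seen elsewhere.
    Two shingles join the same phrase when at most `max_gap` shingle
    positions are missing between them.
    """
    phrases = []
    run_start = prev = None
    for pos in sorted(common_positions):
        if prev is not None and pos - prev <= max_gap + 1:
            prev = pos  # extend the current run
            continue
        if prev is not None:
            phrases.append(" ".join(tokens[run_start:prev + n]))
        run_start = prev = pos  # start a new run
    if prev is not None:
        phrases.append(" ".join(tokens[run_start:prev + n]))
    return phrases


def docs_per_phrase(mapper_output):
    """Reducer logic: collect the set of documents containing each phrase."""
    phrase_docs = defaultdict(set)
    for doc_id, phrase in mapper_output:
        phrase_docs[phrase].add(doc_id)
    return dict(phrase_docs)


tokens = "so what do you think of this".split()
merged = merge_common_shingles(tokens, [1, 2, 3])
```

With shared trigrams starting at offsets 1, 2, and 3, the mapper emits the single phrase "what do you think of"; the reducer then gathers, per phrase, every document that emitted it.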
Ranking Memes




Clustering Memes
• Mapper:
  Single-link
  hierarchical
  clustering with
  cosine similarity
• Reducer:
  create/merge
  clusters
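A single-machine sketch of single-link clustering with cosine similarity follows. It is not the MapReduce implementation from the talk: the bag-of-words vectors, the similarity threshold, the union-find merging, and the example phrases are all assumptions chosen to keep the example small. Cutting a single-link hierarchy at a fixed similarity threshold is equivalent to taking connected components of the "similar enough" graph, which is what the code computes.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def single_link_clusters(phrases, threshold=0.5):
    """Group phrases so that any pair with similarity >= threshold is linked."""
    vectors = [Counter(p.lower().split()) for p in phrases]
    parent = list(range(len(phrases)))

    def find(i):
        # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(phrases)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return sorted(clusters.values())


phrases = [
    "lipstick on a pig",
    "put lipstick on a pig",
    "you can put lipstick on a pig",
    "yes we can",
]
clusters = single_link_clusters(phrases, threshold=0.8)
```

At a threshold of 0.8 the first and third phrases are not directly similar enough (cosine about 0.76), yet they land in the same cluster because each is close to the second. That chaining behavior is characteristic of single-link clustering and suits phrases that mutate gradually.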


Efficiency: Meme Clustering



• From WEKA ARFF format to sparse representation
   – From ~96 hours → 11 hours
• Indexed vs. un-indexed
   – From 11 hours → 16 minutes (single core)
   – From 34 minutes → 3 minutes (136 cores)
• Distributed vs. single core
   – From 11 hours → 34 minutes (un-indexed)
   – From 16 minutes → 3 minutes (indexed)
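The speedup factors implied by these timings can be worked out directly. The snippet below only redoes the arithmetic on the reported numbers; the labels are informal, the 96-hour figure is approximate on the slide, and the end-to-end factor is derived here rather than reported.

```python
HOUR = 60  # minutes per hour

# (before, after) in minutes, copied from the timings above
timings = {
    "ARFF -> sparse": (96 * HOUR, 11 * HOUR),
    "indexing, single core": (11 * HOUR, 16),
    "indexing, 136 cores": (34, 3),
    "distribution, un-indexed": (11 * HOUR, 34),
    "distribution, indexed": (16, 3),
}


def speedup(before, after):
    """Ratio of old running time to new running time."""
    return before / after


factors = {name: round(speedup(*t), 1) for name, t in timings.items()}
overall = speedup(96 * HOUR, 3)  # ~96 hours down to 3 minutes
```

On these numbers, indexing gives the largest single-step gain (about 41x on one core), and the combined effect of all three changes is roughly a 1900x reduction in running time.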
Meme Browser: Original Interface




Meme Browser: Current Interface




Meme Evolution (Leskovec et al.’09)




Thank You!
Matt Lease
  ml@ischool.utexas.edu
  www.ischool.utexas.edu/~ml
  @mattlease
• Joint Work with
  – Hohyon (Will) Ryu
     • InfoChimps (Summer’11)
     • Indeed.com (Summer’12)
  – Nicholas Woodward (TACC)
     • Latin American Network Information Center (LANIC)
Support
• FCT of Portugal / UT CoLab
• Amazon Web Services
• UT Austin LIFT Award
• John P. Commons Fellowship

TopCryptoSupers 12thReport OrionX May2024
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 

Discovering Memes in Social Media

  • 1. Discovering Memes in Social Media
      Matt Lease, School of Information, University of Texas at Austin
      ml@ischool.utexas.edu, @mattlease
      Joint Work with Hohyon Ryu & Nicholas Woodward
      Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
  • 2. Memes
      • Short, similar phrases found in many different sources
        – Re-use, shared temporal context
      • Evolutionary mutation & propagation as they transmit from source to source
      • Reveals implicit connections between the sources, individuals and communities involved
      March 21, 2012   ACM SIGKDD - Austin Chapter Meeting
  • 3. MemeBrowser & Critical Literacy
  • 4. Google/NYT Living Stories: livingstories.googlelabs.com
  • 5. Related Work
      • Jure Leskovec et al. (KDD’09): blogs
        – Quotations only: http://memetracker.org
      • Steven Skiena, Stony Brook NY: blogs
        – Named entities only: http://www.textmap.com
      • O. Kolak and B. Schilit (HT’08): scanned books
        – Mine “popular passages” from complete texts
        – MapReduce “shingling” approach
        – Popular passages found are local, not global
  • 6. MapReduce @ UT
      • UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
      • New hard disks @ TACC Longhorn installed Dec.’10
        – 48 Dell R610 nodes
            • 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
            • 48GB RAM with ~1.5TB disk per node
            • With 1 NameNode & 47 DataNodes, up to 376 parallel Mappers (47 nodes × 8 cores)
        – 16 Dell R710 nodes (same CPU configuration)
            • 144GB RAM with ~0.8TB disk per node
        – Set up Hadoop, testing, benchmarking, etc.
      • Baldridge & Lease teach MapReduce class Fall’11
  • 7. Datasets
      • TREC Blogs08 Collection
        – http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
        – 28M permalinks (January 2008 – January 2009)
        – 250 GB compressed
      • ICWSM 2009 Spinn3r Blog Dataset
        – http://www.icwsm.org/data/
        – 44 million blog posts (August – September 2008)
        – 27 GB compressed
      • ICWSM 2011 Spinn3r Blog Dataset
  • 8. Processing Architecture (flow diagram; labels in reading order)
      – Blogs08 Test Collection: 28M posts, 1.4TB
      – Preprocessing (Pseudo-MapReduce): Decruft & Language Identification; HTML Strip & Near-Duplicate Detection
      – 16M posts, 960GB
      – Common Phrase Extraction
      – 15K posts, 43GB
      – 3 MapReduce Stages
      – Common Phrase Ranking: Daily Top 200 Phrases
      – 6.2M phrases, 2GB
      – 1 MapReduce Process
      – Common Phrase Clustering
      – 75K phrases, 2.6MB
      – 1 MapReduce Process
      – Meme Browser: 68K memes
  • 9. Creating the Shingle Table
      • e.g. trigram shingles for: what do you think of
        – what do you
        – do you think
        – you think of
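The trigram example on slide 9 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MapReduce code: the function names `shingles` and `build_shingle_table`, the lowercasing and whitespace tokenization, and the table layout (shingle mapped to the documents containing it) are all assumptions made for the example.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the word n-gram shingles of `text`, in order of appearance."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_shingle_table(docs, n=3):
    """Map each shingle to the sorted list of document ids that contain it.

    `docs` is a dict of {doc_id: text}; this table layout is hypothetical.
    """
    table = defaultdict(set)
    for doc_id, text in docs.items():
        for shingle in shingles(text, n):
            table[shingle].add(doc_id)
    return {shingle: sorted(ids) for shingle, ids in table.items()}


if __name__ == "__main__":
    print(shingles("what do you think of"))
    # ['what do you', 'do you think', 'you think of']
```

A shingle that maps to many document ids is a candidate fragment of a phrase shared across sources.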
  • 10. Grouping Shingles by Document • Mapper: trivial grouping; Reducer: Identity
  • 11. Common Phrase (CP) Detection
      • Mapper: Merge adjacent shingles into memes (ignoring small gaps)
      • Reducer: Find set of documents in which each meme occurs
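The two steps on slide 11 can be sketched on a single machine as follows. This is a hedged illustration rather than the paper's implementation: `merge_adjacent`, `documents_per_phrase`, the `max_gap` parameter, and the representation of common shingles as token start positions are all hypothetical choices made for the example.

```python
from collections import defaultdict


def merge_adjacent(tokens, common_starts, n=3, max_gap=1):
    """Mapper-side sketch: merge nearby common shingles into longer phrases.

    `common_starts` holds the token positions where a common n-gram begins.
    Two consecutive starts join the same phrase when at most `max_gap`
    start positions are skipped between them (the "small gaps").
    """
    phrases = []
    run_start = prev = None
    for pos in sorted(common_starts):
        if run_start is None:
            run_start = prev = pos
        elif pos - prev <= 1 + max_gap:
            prev = pos  # still within the same phrase
        else:
            phrases.append(" ".join(tokens[run_start:prev + n]))
            run_start = prev = pos
    if run_start is not None:
        phrases.append(" ".join(tokens[run_start:prev + n]))
    return phrases


def documents_per_phrase(pairs):
    """Reducer-side sketch: collect the set of documents for each phrase.

    `pairs` is an iterable of (phrase, doc_id) tuples emitted by the mapper.
    """
    docs = defaultdict(set)
    for phrase, doc_id in pairs:
        docs[phrase].add(doc_id)
    return dict(docs)
```

With `max_gap=0`, only strictly overlapping shingles are merged; larger values tolerate short stretches of non-shared words inside a phrase.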
  • 12. Ranking Memes
  • 13. Clustering Memes
      • Mapper: Single-link hierarchical clustering with cosine similarity
      • Reducer: Create/merge clusters
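Single-link clustering with cosine similarity, as named on slide 13, can be sketched as below. Cutting a single-link hierarchy at a similarity threshold yields the connected components of the graph that links every pair at or above that threshold, so the sketch uses a union-find structure. The bag-of-words vectors, the `threshold` value, and the function names are assumptions for illustration; the slides do not specify them, and this is a single-process version rather than the distributed one.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def single_link_clusters(phrases, threshold=0.5):
    """Group phrases so any pair with similarity >= threshold shares a cluster."""
    vectors = [Counter(p.lower().split()) for p in phrases]
    parent = list(range(len(phrases)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return sorted(clusters.values())
```

The nested loop compares every pair of phrases, which is quadratic in the number of phrases; the indexing described on the efficiency slide presumably avoids most of those comparisons.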
  • 14. Efficiency: Meme Clustering
      • From WEKA ARFF format to sparse representation
        – From ~96 hours → 11 hours
      • Indexed vs. un-indexed
        – From 11 hours → 16 minutes (single core)
        – From 34 minutes → 3 minutes (136 cores)
      • Distributed vs. single core
        – From 11 hours → 34 minutes (un-indexed)
        – From 16 minutes → 3 minutes (indexed)
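The timings on slide 14 translate into speedup factors with simple division. The short sketch below only restates the slide's numbers in minutes; the `speedup` helper and the dictionary labels are illustrative names, and the "~96 hours" figure is approximate in the source.

```python
def speedup(before_minutes, after_minutes):
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_minutes / after_minutes


# (before, after) in minutes, taken from the slide.
TIMINGS = {
    "ARFF -> sparse representation": (96 * 60, 11 * 60),
    "indexing, single core": (11 * 60, 16),
    "indexing, 136 cores": (34, 3),
    "distribution, un-indexed": (11 * 60, 34),
    "distribution, indexed": (16, 3),
}

if __name__ == "__main__":
    for label, (before, after) in TIMINGS.items():
        print(f"{label}: {speedup(before, after):.1f}x")
```

On these numbers, indexing on a single core gives the largest single gain (about 41x), larger than distributing the un-indexed job across 136 cores (about 19x).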
  • 15. Meme Browser: Original Interface
  • 16. Meme Browser: Current Interface
  • 17. Meme Evolution (Leskovec et al.’09)
  • 18. Thank You!
      Matt Lease: ml@ischool.utexas.edu, www.ischool.utexas.edu/~ml, @mattlease
      • Joint Work with
        – Hohyon (Will) Ryu
            • InfoChimps (Summer’11)
            • Indeed.com (Summer’12)
        – Nicholas Woodward (TACC)
            • Latin American Network Information Center (LANIC)
      • Support
        – FCT of Portugal / UT CoLab
        – Amazon Web Services
        – UT Austin LIFT Award
        – John P. Commons Fellowship