Rohini K. Srihari delivers her powerful presentation at the KDD 2010 Workshop on Social Media Analytics.
Overview:
-What is Social Media?
-Value Proposition: Why mine social media?
-Business Analytics
-Counterterrorism
-Challenges
-Technology, Challenges
-Multilingual social media mining
Social Media Analytics: The Value Proposition
1. Social Media Analytics: the Value Proposition
Rohini K. Srihari
KDD 2010 Workshop on Social Media Analytics
July 25, 2010
2. Outline
What is Social Media?
Value Proposition: Why mine social media?
Business Analytics
Counterterrorism
Challenges
Technology, Challenges
Multilingual social media mining
Future
3. Social Media Data Actionable Intelligence
Consumer Generated, Not Edited, Not Authenticated
4. Data/Text Mining
Extracting useful information from large data sets
Analyze Observational Data to find unsuspected relationships
and Summarize data in novel ways that are understandable
and useful to data owner
Information Discovery
non-trivial, implicit, previously unknown relationships
Ex of Trivial: Those who are pregnant are female
Summarize
as Patterns and Models (usually probabilistic)
Usefulness:
meaningful: lead to some advantage, usually economic
Analysis:
Automatic/Semi-Automatic Process (Knowledge Extraction)
6. Market Size
Business Analytics market projected to be $28 billion in
2011 (IDC Report)
Social Analytics taking leading position of interest within
organizations
Integrating Social Media Analytics and Business
Intelligence
Source: HCL India
7. Customer Relationship Management
Data sources are primarily internal
Call center transcripts
E-mail
Customer feedback
Cost avoidance
Product exchange mitigation
Early warning detection on new products
Increase in customer satisfaction and loyalty
Insight towards new products, product features
Identification of possible marketing
opportunities
8. e-Service Chat Monitoring
Operator: How can I assist you today?
Customer: I need help with operating your coffee maker I bought
from Amazon.com yesterday.
Operator: Certainly. What problem are you facing?
Customer: I fill in the coffee powder, water, and then press the red
button on the side, and nothing happens.
Operator: The red button enables the ‘clean coffee maker’ process.
You will need to use the white knob on the other side to brew
coffee.
Customer: I see.
Customer: BTW, in the Nespresso cappuccino machine I recently
bought, it was the red button for start.
Is there anything else I can assist with today? SEND
Alert: COMPETITOR PRODUCT MENTION
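A minimal sketch of how an alert like the one in this chat could be raised, assuming a hand-maintained watch list of competitor product names. The list contents and the function name are hypothetical and are not taken from the talk.

```python
import re

# Hypothetical watch list; a real deployment would maintain this separately.
COMPETITOR_TERMS = ["Nespresso"]

def competitor_mentions(message, terms=COMPETITOR_TERMS):
    """Return the watch-list terms that occur in a chat message (case-insensitive)."""
    found = []
    for term in terms:
        # Word boundaries avoid matching a term inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", message, flags=re.IGNORECASE):
            found.append(term)
    return found

message = ("BTW, in the Nespresso cappuccino machine I recently bought, "
           "it was the red button for start.")
hits = competitor_mentions(message)
if hits:
    print("Alert: COMPETITOR PRODUCT MENTION:", ", ".join(hits))
```

Plain keyword matching only catches names already on the list; spotting unknown competitor products would need entity extraction.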
9. Reputation Management
Data sources are primarily external, e.g.
www.youtube.com
www.epinions.com
tripadvisor.com (travel related website)
Consumer Brand Analytics
What are people saying about our brand?
Marketing Communications
Significant spending on marketing, advertising:
companies trying to position their products
Brand analytics helps to determine whether such
campaigns are effective
10. Mining Product Reviews
Application is Industrial Design
Automatically mine product reviews for information on
product features, new requests, etc.
Focus on wheelchairs
Features Extracted
Easy to use
Fit into a car
Comfortable chair
Light weight
Convenient to fold
Sturdy
Good price
11. Viral Marketing
Jure Leskovec (Stanford), Lada Adamic (U of Michigan),
Bernardo A. Huberman (HP Labs)
Personalized recommendations
Viral marketing
Cross-selling
“people who bought x also bought y”
Collaborative filtering
“based on ratings of users like you…”
Delicious, Digg.com
68% of consumers consult friends and family
before purchasing home electronics
(Burke 2003)
Success rate: # of purchases following a
recommendation / # recommenders
Books overall have a 3% success rate
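The success-rate definition above can be written out directly. This is a small sketch; the counts in the example are made up to reproduce a 3% rate and are not data from the study.

```python
def success_rate(purchases_after_recommendation, recommenders):
    """Success rate = purchases following a recommendation / number of recommenders."""
    if recommenders <= 0:
        raise ValueError("need at least one recommender")
    return purchases_after_recommendation / recommenders

# Hypothetical counts: 30 purchases following recommendations from 1,000 recommenders.
rate = success_rate(30, 1000)
print(f"{rate:.1%}")
```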
12. 500 million active users!
▪ More than 20 million users update their status at least once each day
▪ More than 850 million photos uploaded to the site each month
▪ >1 billion pieces of content (web links, blog posts, photos, etc.) shared each week
Many different groups clamoring for data and text analytics:
▪ FB Engineers
▪ Advertisers
▪ Page owners
▪ Platform/Connect developers
▪ Marketers
▪ Academics
13. An aside: Social Media Marketing
http://www.socialmediaexaminer.com/new-studies-show-value-of-social-media/
Lead Generation
Breakdown of respondents’ top benefits of social networking:
50%: Generating leads
45%: Keeping up with the industry
44%: Monitoring online conversation
38%: Finding vendors/suppliers
Online Forum Users Are Enthusiastic Brand Advocates
79.2% of forum contributors help a friend or family member make a decision about
a product purchase – compared with 47.6% of non-contributors and 53.8% overall.
65% of forum contributors share advice (offline and in person) based on
information that they’ve read online – compared with 35% of non-contributors and
40.8% overall.
57.7% of forum contributors proactively recommend someone make a particular
purchase – compared with 16.9% of non-contributors and 24.9% overall.
Only 47% of Companies Experimenting With Social Media
Gartner study predicts that by the end of 2010, more than 60% of Fortune 1000
companies will manage an online community.
ComBlu’s study, The State of Online Branded Communities, shows that most
companies do not understand how to engage within online communities and have
no real idea of what their customers want on these sites.
14. Citizen Response
E-RuleMaking
the use of digital technologies by government
agencies in rulemaking, decision making processes
solicit citizen feedback on bills being debated in
Congress
What new issues are being raised, what aspects of
bill are popular, unpopular
Better to mine social media than using focus
groups?
Political Campaigns
Why do people support a candidate: is it really
based on issues?
15. Use Case: Understanding and Visualizing Consumer Responses
Extracting Entities and Sentiment to Power Alerting, Link Diagrams, and Geo-Mapping
16. Twitter: Real-Time Citizen Journalism
• Mumbai terror attack regarded as coming
of age of Twitter
• citizen journalism provided more valuable
information than wire services, broadcast
news
• information about places to avoid, well
being of relatives, friends, etc.
• many redundant posts, users have to wade
through hundreds of posts to locate useful
information
• Goal: to mine this data in real-time and
produce well organized summaries
17. Law Enforcement, Homeland Security
• Facebook
• gang members frequently boast about their activities on their Facebook pages
• Chat rooms
• Stalkers, pedophiles
• Twitter
• protest rallies being planned G20 Summit Protest
• who, what, where, when
• Craigslist
18. Human Behaviour Analysis
Process social media content, provide tools for analysts to:
Identify social networks: groups, members
Identify topics of discussion and sentiment
• E.g. angry at govt., wanting retaliation, peacemakers
• Thought influencers
Identify social goals through analysis of verbal
communication
• Manipulation: Persuasion, threats, coercion
• Religious supremacy: religious analogues
• recruitment
[Diagram labels: Social Media Content, Link Diagrams, Predictive Modeling]
20. Analyzing Social Media Data
Content Analysis
Text analysis, multimedia analysis
Structure Analysis
Usage Analysis
Search engine optimization
What keywords are driving customers to your site,
competitor sites
Query logs, site traffic
Ideally combine all three of these!
22. Content Acquisition
Pre-selected, validated sites
Epinions.com, Amazon.com, NYT blogs,
reader comments
Search Service
Tripadvisor.com, Craigslist
Twitter, Facebook
Blog Search Engines
Google Blog Search
http://blogsearch.google.com/
Technorati http://technorati.com/
Blogpulse http://blogpulse.com/
BoardReader http://boardreader.com/
http://www.omgili.com/
[Diagram labels: Lucene Index, Storage]
Spidering
23. Data Collection: Spidering
“Dark Web”: the portion of the World Wide Web used to help achieve the sinister
objectives of terrorists and extremists.
• Spider uses breadth and depth first (BFS and DFS) traversal for crawl space URL ordering based on URL tokens, anchor text, and link levels.
• Automated discovery of proxy servers to distribute collection and increase reliability.
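The traversal part of the spider can be sketched as follows. This is an illustration only: it orders URLs by link level (depth) alone, leaves out the URL-token and anchor-text scoring and the proxy discovery mentioned on the slide, and runs over a toy in-memory link graph with hypothetical example.org URLs instead of fetching pages.

```python
from collections import deque

# Toy link graph standing in for fetched pages (hypothetical URLs).
LINKS = {
    "http://example.org/": ["http://example.org/forum", "http://example.org/about"],
    "http://example.org/forum": ["http://example.org/forum/thread1",
                                 "http://example.org/forum/thread2"],
    "http://example.org/about": [],
    "http://example.org/forum/thread1": ["http://example.org/"],
    "http://example.org/forum/thread2": [],
}

def crawl(seed, links, max_depth=2, depth_first=False):
    """Return URLs in visit order: breadth-first by default, depth-first on request."""
    frontier = deque([(seed, 0)])
    seen = {seed}
    order = []
    while frontier:
        # A queue (popleft) gives BFS; a stack (pop) gives a depth-first order.
        url, depth = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the allowed link level
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return order

for url in crawl("http://example.org/", LINKS):
    print(url)
```

Swapping the deque for a priority queue keyed on a relevance score would be one way to add the token and anchor-text ordering.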
24. Content Analysis
Model Based
Develop models that generalize characteristics of data
Machine learning: Supervised, semi-supervised, unsupervised
E.g., sequence labeling, classification
N-gram language models
Linguistic: based on rules of English grammar
Information Extraction
• Pattern Mining
• frequency analysis, local patterns
Google n-gram data
What words are used in conjunction with Buffalo, Buffalo Sabres, University at Buffalo
Query log analysis
Learn spelling corrections, Learn lists of named entities, Learn relationships
Discover trends
Flu, cough, fever : frequency of queries in certain regions, change from the norm
Combine both approaches
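As an illustration of the frequency-analysis side of pattern mining, the sketch below counts word n-grams to see which words occur next to "Buffalo". The three input sentences are invented for the example; this is not the Google n-gram data itself.

```python
from collections import Counter

def ngram_counts(texts, n=2):
    """Count word n-grams (as tuples of lowercased tokens) across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical snippets.
texts = [
    "Buffalo Sabres win again",
    "University at Buffalo opens new lab",
    "Buffalo Sabres trade rumors",
]
bigrams = ngram_counts(texts, n=2)
# Bigrams that contain "buffalo", most frequent first.
for gram, count in bigrams.most_common():
    if "buffalo" in gram:
        print(gram, count)
```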
25. Reliability of Data
How much trust in data? (Forrester)
Email from people you know: 77%
Consumer product ratings/reviews: 60%
Message board posts: 21%
Personal blog: 18%, company blog: 16%
Splog: Spam in weblogs
Off-topic posts
Comments on blog posts, forums quickly turn into personal
rants, completely off-topic
Possible Remedies
Focus on sites where data is known to be more reliable
Use technology to filter out spam, splog and off-topic posts
27. Informal Language
Loss of Functional Indicators
Missing punctuation
Missing or raNDOm case information
Whole phrases reduced to acronyms
Casual, Phonetic Spelling
tha, teh = the
Explicit Sentiment Commentary
Happy Birthdaaaayyyy!!!1!1!
must go <sigh>
:-P grrr…..
Mistaken auto-correction or replacement
Co-operation = Cupertino
The Queen = Queen Elizabeth, “hundreds of worker bees commanded by
Queen Elizabeth”
Twitter Conventions
alanbr82 RT @royjwells: New Blog Post - Will Old Spice Achieve a ROI?
http://ow.ly/2dZf7 #oldspice #sm #socialmedia
RT, hashtags #, url shortening
Word Inventions
refudiate, wee-wee’d up
momager, rickRoll
L33t, IMHO, meh
Solutions:
• spelling correction
• acronym look-up
• machine learning: treat it as a machine translation problem!
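A rough sketch of the first two remedies (spelling correction and acronym look-up) using small look-up tables. The table contents are illustrative assumptions, and the machine-translation approach is not shown.

```python
import re

# Tiny illustrative tables; real systems would use much larger resources.
ACRONYMS = {"imho": "in my humble opinion", "btw": "by the way"}
SPELLING = {"tha": "the", "teh": "the"}

def normalize(text):
    """Collapse elongated characters, then expand acronyms and fix known misspellings."""
    # Any character repeated three or more times is reduced to a single occurrence.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    out = []
    for token in text.split():
        key = token.lower().strip(".,!?")
        if key in ACRONYMS:
            out.append(ACRONYMS[key])   # note: attached punctuation is dropped
        elif key in SPELLING:
            out.append(SPELLING[key])
        else:
            out.append(token)
    return " ".join(out)

print(normalize("teh coffee maker is great IMHO"))
```

The character-collapsing rule is deliberately crude: it would also shorten the rare legitimate word with a tripled letter, which is one reason a learned model may be preferable.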
28. Legal Issues
Privacy of data
UK has lawful intercept program
What about results of data mining?
Liability
Major issue for pharmaceutical companies: if they discover
report of side effect of drug, they are required to report it
Analysts making positive public statements about company
earnings, yet contradicting this on blogs, Facebook pages
Workplace Issues
Time spent on social media sites during work hours leading
to lower productivity
29. Accuracy of Analysis
Text analysis is based on natural language
processing which is a useful, but imperfect
technology
“Bill Gates, the CEO of Microsoft was initially very
happy about its site location in Seattle, but now
he has other thoughts. He is very displeased with
the pollution…. Also, its employees are upset with
the construction work…around its vicinity. In
all, he wants to abandon the current site…..”
Who is expressing an opinion?
What is the opinion about?
Is it positive or negative?
Validate performance accuracy through benchmarks on specially
constructed data sets
30. Sentiment Analysis
Aims to determine the attitude of a speaker or a writer with respect to some target or topic.
I think, Obama needs to begin to take the
blame for his failed policies -- his statement
"that his policies are getting us out of this
mess" are a big lie. [1]
SENTIMENT attributes: ID: ex1, TargetID: t1, Polarity: Negative
Annotation labels: Opinion Holder, Topic, Target
[1] http://gretawire.blogs.foxnews.com/ouch-this-is-not-fair-to-president-obama-yes-an-accident-but-one-that-needs-
31. Opinion summary
In product reviews, we are interested in generating a
feature-based summary for a product.
Digital_camera_1:
Feature: picture quality
Positive: 253
<individual review sentences>
Negative: 6
<individual review sentences>
Feature: size
Positive: 134
<individual review sentences>
Negative: 10
<individual review sentences>
…
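A feature-based summary of this shape could be assembled as in the sketch below. It assumes an earlier extraction step has already produced (feature, polarity, sentence) triples; the triples here are invented and are not the review data behind the counts on the slide.

```python
from collections import defaultdict

def summarize(opinions):
    """Group review sentences by product feature and polarity."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in opinions:
        summary[feature][polarity].append(sentence)
    return summary

# Hypothetical output of an opinion-extraction step.
opinions = [
    ("picture quality", "positive", "The pictures are razor sharp."),
    ("picture quality", "positive", "Colors look great."),
    ("picture quality", "negative", "Low-light shots are noisy."),
    ("size", "positive", "It fits in my pocket."),
]

summary = summarize(opinions)
for feature, by_polarity in summary.items():
    print("Feature:", feature)
    for polarity in ("positive", "negative"):
        sentences = by_polarity[polarity]
        print(f"  {polarity.capitalize()}: {len(sentences)}")
        for sentence in sentences:
            print("    " + sentence)
```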
32. Scalability: Massively Distributed/Parallel Computing
Hadoop
Open-source framework for running Map-Reduce on a cluster of commodity
machines, as well as a distributed file system for long-term storage
Map-Reduce (invented at Google) provides a way to process large data sets
that scales linearly with the number of machines in the cluster....if your data
doubles in size, just buy twice as many computers
Hadoop now an Apache project led by the Grid Computing team at Yahoo!
HIVE
SQL-like query language, table partitioning schema, and metadata store built
on top of Hadoop
Developed at Facebook, now an Apache subproject
Facebook Analytics:
How many people are discussing being laid off; plot percentage of total posts by state
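To make the "laid off" query concrete, here is the same computation written as a map step and a reduce step in plain Python. It only illustrates the Map-Reduce pattern; it does not use Hadoop or Hive APIs, and the posts are invented sample records.

```python
# Hypothetical sample of (state, post text) records.
POSTS = [
    ("NY", "I was laid off today"),
    ("NY", "Great game last night"),
    ("CA", "Heading to the beach"),
    ("CA", "Just got laid off, looking for work"),
    ("CA", "New phone!"),
    ("CA", "Traffic again"),
]

def map_post(state, text):
    """Map step: emit (state, (is_match, 1)) for one post."""
    return state, (1 if "laid off" in text.lower() else 0, 1)

def reduce_by_state(pairs):
    """Reduce step: sum matches and post totals per state, then convert to percentages."""
    totals = {}
    for state, (matched, seen) in pairs:
        m, n = totals.get(state, (0, 0))
        totals[state] = (m + matched, n + seen)
    return {state: 100.0 * m / n for state, (m, n) in totals.items()}

percent_by_state = reduce_by_state(map_post(s, t) for s, t in POSTS)
print(percent_by_state)
```

Because each post is mapped independently and the per-state sums can be merged in any order, the work can be split across machines, which is the property the slide relies on.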
34. Language Usage Statistics[1]
English is not the only
language on the internet
Urdu speaking Internet users -
12,000,000 (2006)
~ 1.6% of 42.4%
[1] Source:Internet World Stats. Based on 1,733,993,741 estimated internet users for Sept 30, 2009
Copyright 2009, Miniwatts Marketing Group
35. Multilingual Social Media Mining
How did people in Egypt, Israel and Pakistan react to the
latest presidential speech?
Opinion Extraction
Topic: What is the opinion about?
Opinion Holder: Who is expressing it?
What is the intensity of the opinion?
In what context is it being expressed?
Emotion Detection
What kind of emotion is being expressed? – goes beyond
just the positive or negative emotion
Required to perform behavioral analysis, cross cultural
analysis
36. Faceted Search: Sentiment about Topic
People are filled with anger and sorrow because of the policies made by Musharaf.
OPINION HOLDER – Writer, People
TARGET –Musharaf’s policies (Musharaf is an implied target)
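One way to picture faceted search over extracted opinions is to store each opinion as a record and filter or count on its fields. The record layout below is an assumption made for illustration, and the "negative" polarity is an assumed reading of "anger and sorrow", not a label given on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Opinion:
    holder: str
    target: str
    polarity: str  # assumed label; the slide gives only holder and target

OPINIONS = [
    Opinion("Writer", "Musharaf's policies", "negative"),
    Opinion("People", "Musharaf's policies", "negative"),
]

def facet(opinions, **criteria):
    """Keep the opinions whose fields equal every given criterion."""
    return [o for o in opinions
            if all(getattr(o, field) == value for field, value in criteria.items())]

def facet_counts(opinions, field):
    """Count opinions per distinct value of one field (one facet)."""
    counts = {}
    for o in opinions:
        key = getattr(o, field)
        counts[key] = counts.get(key, 0) + 1
    return counts

print(facet(OPINIONS, holder="People"))
print(facet_counts(OPINIONS, "polarity"))
```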
37. Multilingual Text Analysis
Dealing with script, coding variations
Even low-level text analysis becomes difficult
Chinese: no white space between words
Arabic: complex diacriticals
Language Training Resources
Lexicons, annotated corpora, etc.
If sufficient training data exists, new languages
can be adapted to fairly easily
E.g. core Russian in 3 weeks!
Treat language porting as a special case of
domain porting
Ideally, should involve creation of new data
sources, not new code
39. Context Aware Translation
斯洛文尼亚总理扬沙,欧洲委员会主席巴罗佐和欧盟外交政策负责人索拉纳与梅德韦杰夫共进非正式晚餐
Babelfish Translation:
Slovenia premier the sand blowing, Council of Europe President Baluozuo
and European Union foreign policy person in charge Solana and
Medvedev have the unofficial supper.
Name translation output:
<NeGPE english="Slovenia"> 斯洛文尼亚 </NeGPE> 总理
<NePer english="Jansa"> 扬沙 </NePer> ,
<NeOrg english="European Commission"> 欧洲 委员会 </NeOrg> 主席
<NePer english="Barroso"> 巴罗佐 </NePer> 和
<NeGPE english="European Union"> 欧盟 </NeGPE> 外交 政策 负责人
<NePer english="Solana"> 索拉纳 </NePer> 与
<NePer english="Medvedev"> 梅德韦杰夫 </NePer>
共 进 非正式 晚餐 。
Powered by Semantex™ extracted entities, Babelfish translates as:
Slovenia Premier Jansa, Council of Europe President Barroso and
European Union foreign policy person in charge Solana and
Medvedev have the unofficial supper.
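The slide does not say how the extracted entities are passed to the translation engine. One plausible arrangement, sketched here as an assumption, is to read the tagged names into a glossary and substitute the English names into the source sentence before translation. The tag format follows the name translation output shown on this slide; the function names are hypothetical.

```python
import re

# Tagged output in the format shown on the slide (one entity per line).
TAGGED = """
<NeGPE english="Slovenia"> 斯洛文尼亚 </NeGPE>
<NePer english="Jansa"> 扬沙 </NePer>
<NeOrg english="European Commission"> 欧洲 委员会 </NeOrg>
<NePer english="Barroso"> 巴罗佐 </NePer>
<NeGPE english="European Union"> 欧盟 </NeGPE>
<NePer english="Solana"> 索拉纳 </NePer>
<NePer english="Medvedev"> 梅德韦杰夫 </NePer>
"""

ENTITY = re.compile(r'<(Ne\w+) english="([^"]+)">\s*(.*?)\s*</\1>')

def entity_glossary(tagged):
    """Map each source-language name (spaces removed) to its English form."""
    return {re.sub(r"\s+", "", source): english
            for _tag, english, source in ENTITY.findall(tagged)}

def substitute_names(sentence, glossary):
    """Replace known names with their English forms, longest names first."""
    for source in sorted(glossary, key=len, reverse=True):
        sentence = sentence.replace(source, " " + glossary[source] + " ")
    return sentence

glossary = entity_glossary(TAGGED)
sentence = "斯洛文尼亚总理扬沙,欧洲委员会主席巴罗佐和欧盟外交政策负责人索拉纳与梅德韦杰夫共进非正式晚餐"
print(substitute_names(sentence, glossary))
```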
40. Mining Wikipedia for Lexicons
• Translation lexicons automatically extracted from Chinese Wikipedia, use cross language
links to add English translations
• Easy to regenerate with new versions of Wikipedia
• Chinese Wikipedia is constantly growing
41. COLABA: Colloquial Arabic Blog Analysis
– Proliferation of open source, social media
– Dominance of non-English content
– Use of dialects and colloquial language
– Limited supply of multilingual analysts
42. Tools made for MSA fail on Arabic dialects
Human translation for all Arabic variants below is the same:
“There is no electricity, what happened?”
Arabic Variant | Arabic Source Text | Google Translate
Egyptian | الكهربا اتقطعت، ليه كده بس؟ | Atqtat electrical wires, Why are Posted?
Levantine | شكلو مفيش كهربا، ليش هيك؟ | Cklo Mafeesh كهربا, Lech heck?
Iraqi | شو ماكو كهرباء، خير؟ | Xu MACON electricity, good?
MSA | ليوجد كهرباء، ماذا حصل؟ | Does not have electricity, what happened?
Arabic Dialects are not handled well in current machine translation systems.
COLABA enables MSA tools to interpret dialects correctly.
43. Code Mixing, Switching
Use of Latin script: lack of transliteration
standards makes it difficult to process
Spanglish, Hinglish, Urdish, etc.
Afsoos key baat hai . kal tak jo batain Non Muslim bhi kartay
hoay dartay thay abhi this man has brought it out in the open.
[It is sad to see that those words that even a non muslim would
fear to utter until yesterday, this man has brought it out in the
open]
Solutions:
• Apply “romanized” POS tagger, English tagger in tandem: use machine learning
to combine evidence and generate final tag, language ID
• For longer English spans, use English NLP system
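The slide proposes combining a romanized tagger and an English tagger with machine learning. The sketch below swaps in a much simpler technique, a word-list look-up, purely to show the idea of per-token language ID followed by extraction of longer English spans. The word list is a hypothetical stand-in for a real lexicon or classifier.

```python
# Hypothetical mini-lexicon; a real system would use a full dictionary or a trained model.
ENGLISH_WORDS = {"this", "man", "has", "brought", "it", "out", "in", "the",
                 "open", "non", "muslim"}

def tag_languages(tokens, english_words=ENGLISH_WORDS):
    """Label each token 'en' if it is in the English word list, else 'other'."""
    return [(t, "en" if t.lower().strip(".,") in english_words else "other")
            for t in tokens]

def english_spans(tagged, min_len=3):
    """Return runs of at least min_len consecutive English tokens."""
    spans, current = [], []
    for token, lang in tagged:
        if lang == "en":
            current.append(token)
        else:
            if len(current) >= min_len:
                spans.append(" ".join(current))
            current = []
    if len(current) >= min_len:
        spans.append(" ".join(current))
    return spans

text = ("Afsoos key baat hai . kal tak jo batain Non Muslim bhi kartay "
        "hoay dartay thay abhi this man has brought it out in the open.")
print(english_spans(tag_languages(text.split())))
```

Spans found this way could be handed to an English NLP system, while the remaining tokens go to the romanized tagger.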
44. Resource Poor Languages
Bootstrap Learning: process of improving the performance of a trained
classifier by iteratively adding data that is labeled by the classifier itself
to the training set, and retraining the classifier
Useful when there is not enough annotated data
Requirement: needs seed data
[Diagram labels: seed, training data, corrections, correct samples]
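The bootstrap loop described above can be sketched with a toy classifier. Everything concrete here is an assumption for illustration: the one-dimensional features, the nearest-centroid model, and the confidence threshold. The correction step shown in the diagram is omitted.

```python
def train(labeled):
    """Nearest-centroid model: mean feature value per label."""
    sums = {}
    for x, label in labeled:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

def predict(centroids, x):
    """Return (label, confidence); confidence is the distance margin over the runner-up.
    Assumes the model has at least two labels."""
    ranked = sorted(centroids.items(), key=lambda item: abs(x - item[1]))
    best_label, best_centroid = ranked[0]
    margin = abs(x - ranked[1][1]) - abs(x - best_centroid)
    return best_label, margin

def bootstrap(seed, unlabeled, threshold=1.0, max_rounds=10):
    """Iteratively add confidently self-labeled items to the training set and retrain."""
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, rest = [], []
        for x in pool:
            label, confidence = predict(model, x)
            (confident if confidence >= threshold else rest).append((x, label))
        if not confident:
            break  # nothing new was labeled, so further rounds would not change the model
        labeled.extend(confident)
        pool = [x for x, _ in rest]
    return train(labeled), labeled

# Hypothetical seed data and unlabeled pool.
seed = [(0.0, "neg"), (10.0, "pos")]
model, labeled = bootstrap(seed, [1.0, 9.0, 4.0, 6.0])
print(model)
```

Raising the threshold trades coverage for precision: fewer self-labeled items are added, but fewer classifier mistakes are fed back into training.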
45. The Road Ahead?
Strengths
• free form facilitates capturing the true voice of customer, wisdom of crowd
• can be expressed through voice, text messaging on mobile phones, etc.
Weaknesses
• language analysis and mining are challenging
• susceptible to spam, self-serving use by companies
• Behaviour, predictive models need more research
Threats
• privacy and security issues: possible to assimilate detailed knowledge about person’s activities, whereabouts
• can lead to anti-social behaviour!
Opportunities
• promise of collective problem solving: coordination, cooperation
• mobile use supports dealing with societal problems, disaster situations: social network is geospatial proximity