Separating the Wheat from the Chaff
Finding Relevant Tweets in Social Media Streams
Na’im Tyson, PhD
Sciences, About.com
April 20, 2017
Tyson (About.com) Finding Relevance in Tweets April 20, 2017 1 / 23
1 Introduction
2 Ingesting Text Data
3 Document Preprocessing
4 Process Steps
5 Tokenization
6 Vectorization
7 Clustering
8 Model Diagnostics
9 Roads Not Taken
Introduction Consultant Role
Your Role as Consultant. . .
• Advise on open source and proprietary analytical solutions for small- to medium-sized businesses
• Build solutions that meet business goals using open source software (whenever possible)
• Develop systems to monitor solutions over time (when requested) OR
• Develop diagnostics to monitor model behaviour
Introduction Client Description
Brand Intelligence Firm
• Boutique social monitoring & analysis firm
• Provides quantitative summaries of qualitative data (tweets, Facebook posts, web pages, etc.)
• Analytics dashboards
• How do they acquire data?
• Data collector/aggregation services
• Collect social data from multiple APIs
• Saves engineering resources
Introduction Project Scope
Business Problem
Imagine: One batch of data - tweets
Relevance: How do you know which tweets are relevant to the brand?
Labeling: Would Turkers make good labelers for marking tweets as relevant?
Cost: How many tweets will they label for creating a model?
Scalability: Labeling thousands or hundreds of thousands of tweets
Consistency: How do you know whether they are consistent labelers?
• Implementation of consistency labeling statistics
Goal: Establish a system for programmatically computing the relevance of tweets
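The "consistency labeling statistics" bullet can be made concrete with an inter-annotator agreement measure. A standard choice for two labelers is Cohen's kappa, which corrects raw agreement for chance. The sketch below is a minimal pure-Python version, not the firm's actual implementation; the function name and example labels are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    ''' chance-corrected agreement between two labelers.
    labels_a, labels_b: equal-length label sequences, e.g. the
    relevant/irrelevant marks from two Turkers on the same tweets. '''
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement: fraction of tweets where both labelers match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / float(n)
    # expected agreement: chance that both pick the same label independently
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / float(n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

print(cohens_kappa(['rel', 'rel', 'irr', 'rel'], ['rel', 'rel', 'irr', 'irr']))
```

Kappa near 1 means consistent labelers; values above roughly 0.6 are commonly read as substantial agreement, so a low kappa would flag Turkers whose labels should not be used to train a model.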
Ingesting Text Data Scraping & Crawling
Most of the methods in this section, except the last two, came from [Bengfort (2016)].
Ingesting Text Data Scraping & Crawling
Two Sides of the Same Coin?
• Scraping (from a web page) is an information extraction task
• Text content, publish date, page links, or any other goodies
• Crawling is an information processing task
• Traversal of a website's link network by a crawler or spider
• Find out what you can crawl before you start crawling!
• Type into Google search: <DOMAIN NAME> robots.txt
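The robots.txt advice can also be checked programmatically: the standard library ships a parser for it (urllib.robotparser in Python 3; the robotparser module in Python 2). The sketch below parses a hypothetical robots.txt inline rather than fetching one, so the domain and rules shown are made up.

```python
from urllib.robotparser import RobotFileParser  # module `robotparser` in Python 2

rp = RobotFileParser()
# in a real crawler you would call rp.set_url('https://example.com/robots.txt')
# followed by rp.read(); parsing inline keeps this example offline
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('MyCrawler', 'https://example.com/news/story.html'))    # True
print(rp.can_fetch('MyCrawler', 'https://example.com/private/page.html'))  # False
```

Calling can_fetch before each page request keeps a crawler polite and avoids scraping paths the site has asked bots to skip.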
Ingesting Text Data Scraping & Crawling
Sample Scrape & Crawl in Python

import bs4
import requests
from slugify import slugify

sources = ['https://www.washingtonpost.com', 'http://www.nytimes.com/',
           'http://www.chicagotribune.com/', 'http://www.bostonherald.com/',
           'http://www.sfchronicle.com/']

def scrape_content(url, page_name):
    try:
        page = requests.get(url).content
        filename = slugify(page_name).lower() + '.html'
        with open(filename, 'wb') as f:
            f.write(page)
    except (requests.RequestException, IOError, TypeError):
        # a bare `except: pass` would also swallow typos and keyboard interrupts
        pass

def crawl(url):
    domain = url.split("//www.")[-1].split("/")[0]
    html = requests.get(url).content
    soup = bs4.BeautifulSoup(html, "lxml")
    links = set(soup.find_all('a', href=True))
    for link in links:
        sub_url = link['href']
        page_name = link.string
        if domain in sub_url:
            scrape_content(sub_url, page_name)

if __name__ == '__main__':
    for url in sources:
        crawl(url)
Ingesting Text Data RSS Reading
• RSS = Really Simple Syndication
• Standardized XML format for syndicated text content

import bs4
import feedparser
from slugify import slugify

feeds = ['http://blog.districtdatalabs.com/feed',
         'http://feeds.feedburner.com/oreilly/radar/atom',
         'http://blog.revolutionanalytics.com/atom.xml']

def rss_parse(feed):
    parsed = feedparser.parse(feed)
    posts = parsed.entries
    for post in posts:
        html = post.content[0].get('value')
        soup = bs4.BeautifulSoup(html, 'lxml')
        post_title = post.title
        filename = slugify(post_title).lower() + '.xml'
        TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li']
        for tag in soup.find_all(TAGS):
            paragraphs = tag.get_text()
            with open(filename, 'a') as f:
                f.write(paragraphs + '\n\n')
Ingesting Text Data APIs
API Details & Sample Python
• API = application programming interface
• Allows interaction between a client and a server-side service that are independent of each other
• Usually requires an API key, an API secret, an access token, and an access token secret
• Twitter requires registration at https://apps.twitter.com for API credentials (the tweepy package is a higher-level alternative to raw oauth2)

import oauth2

API_KEY = ' '
API_SECRET = ' '
TOKEN_KEY = ' '
TOKEN_SECRET = ' '

def oauth_req(url, key, secret, http_method="GET", post_body="",
              http_headers=None):
    consumer = oauth2.Consumer(key=API_KEY, secret=API_SECRET)
    token = oauth2.Token(key=key, secret=secret)
    client = oauth2.Client(consumer, token)
    resp, content = client.request(url, method=http_method,
                                   body=post_body, headers=http_headers)
    return content
Ingesting Text Data PDF Miner
PDF to Text

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path, codec='utf-8', password='', maxpages=0,
                       caching=True, pages=None):
    ''' convert pdf to text using PDFMiner.
    :param codec: target encoding of the text
    :param password: password for the pdf if it is password-protected
    :param maxpages: maximum number of pages to extract
    :param caching: whether to cache shared resources between pages
    :param pages: a list of page numbers to extract from the pdf (zero-based)
    :return: text string of all pages specified in the pdf
    '''
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    pagenos = set(pages) if pages else set()
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)
    device.close()
    txt = retstr.getvalue()
    retstr.close()
    return txt
Document Preprocessing Business Considerations
• Every tweet is a document
• Reject retweets
• Ignore (toss) hypertext links
• Why might this be a bad idea?
• Hint: can links tell you about relevant tweets?

RT @chriswheeldon2: #pinchmeplease so honored. #Beginnersluck Congrats to all at @AmericanInParis for Best Mus
------------------------------------------------------------
RT @VanKaplan: .@AmericanInParis won Best Musical @ Outer Critics Awards! http://t.co/3y9Xem0c9I @PittsburghCLO
------------------------------------------------------------
RT @cope_leanne: Congratulations @AmericanInParis @chriswheeldon2 @robbiefairchild 4 outer Critic Circle wins .
------------------------------------------------------------
.@robbiefairchild @chriswheeldon2 Congrats on Outer Critics Circle Awards for your brilliant work in @AmericanI
Document Preprocessing Cleaning Code

import re

def extract_links(text):
    ''' get hypertext links in a piece of text. '''
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    return re.findall(regex, text)

def clean_posts(postList):
    ''' remove retweets found w/in posts. keep a cache of urls to keep track
    of a mapping b/t a unique token for that url and the url itself. '''
    retweet_regex = r'^RT @\w+:'
    url_cache = {}
    link_num = 1
    cleaned_posts = []
    for post in postList:
        if re.match(retweet_regex, post):
            continue
        urls = extract_links(post)
        for url in urls:
            if url not in url_cache:
                url_cache[url] = 'LINK{0}'.format(link_num)
                link_num += 1
            post = post.replace(url, url_cache[url])
        cleaned_posts.append(post.strip())
    return cleaned_posts

def get_posts(post_filepath):
    postlist = open(post_filepath).read().splitlines()
    postlist = [p for p in postlist if len(p) > 0 and not p.startswith('---')]
    return postlist
Process Steps
Inspired by [Richert (2014)]
• Feature Extraction
• Extract salient features from each tweet; store them as a vector
• Cluster Vectors (of Tweets)
• Determine the cluster for the tweet in question
Tokenization Tokenizing Tweets

from nltk.tokenize import RegexpTokenizer

POST_PATTERN = r'''(?x)                    # set flag to allow verbose regexps
      ([A-Z]\.)+                           # abbreviations, e.g. U.S.A.
    | https?://[^\s<>"]+|www\.[^\s<>"]+    # html links
    | \w+([-']\w+)*                        # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?                     # currency and percentages, e.g. $12.40, 82%
    | \#\w+\b                              # hashtags
    | @\w+\b                               # handles
    '''

class MediaTokenizer(RegexpTokenizer):
    ''' regex tokenization class for tokenizing media posts given a pattern. '''
    def __init__(self, tokPattern, **kwargs):
        super(self.__class__, self).__init__(tokPattern, **kwargs)

    def __call__(self, text):
        return self.tokenize(text)

tweet_tokenizer = MediaTokenizer(POST_PATTERN)
print tweet_tokenizer('The quick brown fox jumped over the lazy dog.')
Vectorization
scikit-learn's Vectorizer Implemented

from ast import literal_eval
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

english_stemmer = SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    ''' stem words using the english stemmer so they can be vectorized by count. '''
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

config = {'encoding': 'utf-8', 'decode_error': 'strict', 'strip_accents': 'ascii',
          'ngram_range': '(1,2)', 'stop_words': 'english', 'lowercase': True,
          'min_df': 5, 'max_df': 0.8, 'binary': False}

vectorizer = StemmedCountVectorizer(min_df=config['min_df'], max_df=config['max_df'],
                                    encoding=config['encoding'], binary=config['binary'],
                                    lowercase=config['lowercase'],
                                    strip_accents=config['strip_accents'],
                                    stop_words=config['stop_words'],
                                    ngram_range=literal_eval(config['ngram_range']),
                                    # NOTE: smooth_idf belongs to TfidfVectorizer, not
                                    # CountVectorizer, so it is not passed here
                                    tokenizer=tweet_tokenizer  # FROM LAST SLIDE!
                                    # NOTE: tokenizer MUST have __call__()
                                    )
vec_posts = vectorizer.fit_transform(posts)
Clustering
What is KMeans?
• Clustering algorithm that segments data into k clusters
• Nondeterministic: different starting values may result in a different assignment of points to clusters
• Run the k-means algorithm several times and then compare the results
• This assumes you have time to do this!
• Might be simpler to change tokenization and vectorization methods

Algorithm [Janert (2010), pp. 662-663]
choose initial positions for the cluster centroids
repeat:
    for each point:
        calculate its distance from each cluster centroid
        assign the point to the nearest cluster
    recalculate the positions of the cluster centroids
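The pseudocode above maps almost line for line onto a small NumPy implementation. This is a teaching sketch under the usual Euclidean-distance assumption, with hypothetical parameter names; for real work use sklearn.cluster.KMeans as on the next slide.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=2):
    ''' minimal k-means following the pseudocode: assign, then recenter. '''
    rng = np.random.RandomState(seed)
    # choose initial positions for the cluster centroids (k random points)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # calculate each point's distance from each cluster centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # assign each point to the nearest cluster
        labels = dists.argmin(axis=1)
        # recalculate the positions of the cluster centroids
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):            # keep old centroid if cluster emptied
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                       # converged
        centroids = new_centroids
    return labels, centroids
```

Running it several times with different seeds, as the slide suggests, is just a loop over the seed parameter.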
Clustering How is it implemented?

import scipy as sp, sys, yaml
from sklearn.cluster import KMeans

seed = 2
sp.random.seed(seed)  # to reproduce the data later on

def train_cluster_model(posts, configDoc='prelim.yaml', tokenizer=None,
                        vectorizer_type=StemmedCountVectorizer):
    try:
        config = yaml.load(open(configDoc))
    except IOError, ie:
        sys.stderr.write("Can't open config file: %s" % str(ie))
        sys.exit(1)
    if not tokenizer:
        tokenizer = MediaTokenizer(POST_PATTERN)
    vectorizer = vectorizer_type(min_df=config['min_df'],
                                 max_df=config['max_df'],
                                 encoding=config['encoding'],
                                 lowercase=config['lowercase'],
                                 strip_accents=config['strip_accents'],
                                 stop_words=config['stop_words'],
                                 ngram_range=literal_eval(config['ngram_range']),
                                 tokenizer=tokenizer)
    vec_posts = vectorizer.fit_transform(posts)
    cls_model = KMeans(n_clusters=2, init='k-means++', n_jobs=2)
    cls_model.fit(vec_posts)
    return {'model': cls_model, 'vectorizer': vectorizer}
Clustering Model Testing

import cPickle as pickle, sys, yaml
from scipy.spatial.distance import euclidean

def test_model(posts_path, cls_mod_path, vectorizer_path, yaml_filepath):
    orig, posts = vectorize_posts(posts_path, vectorizer_path)
    try:
        config = yaml.load(open(yaml_filepath))
    except IOError, ie:
        sys.stderr.write("Can't open yaml file: %s" % str(ie))
        sys.exit(1)
    vectorizer = pickle.load(open(vectorizer_path, 'rb'))
    vec_posts = vectorizer.transform(posts)
    cls_model = pickle.load(open(cls_mod_path, 'rb'))
    cls_labels = cls_model.predict(vec_posts).tolist()
    dists = [None] * len(cls_labels)
    for i, label in enumerate(cls_labels):
        dists[i] = euclidean(vec_posts.getrow(i).toarray(),
                             cls_model.cluster_centers_[label])
    for t, l, d in zip(orig, cls_labels, dists):
        print '{0}\t{1}\t{2:.6f}'.format(t, l, d)
Model Diagnostics Top Terms Per Cluster

from warnings import warn, simplefilter

def top_terms_per_cluster(km, vectorizer, outFile, k=2, topNTerms=10):
    ''' print top terms from each cluster '''
    # NOTE: ignore the following (annoying) deprecation warning:
    # /Library/Python/2.7/site-packages/sklearn/utils/__init__.py:94:
    # DeprecationWarning: Function fixed_vocabulary is deprecated;
    # The `fixed_vocabulary` attribute is deprecated and will be removed in 0.18.
    # Please use `fixed_vocabulary_` instead.
    simplefilter('ignore', DeprecationWarning)
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    # check to see if top n terms is beyond centroid length
    centroid_vec_length = order_centroids[0, ].shape[0]
    if topNTerms > centroid_vec_length:
        warn('Top n terms parameter exceeds centroid vector length!')
        warn('Switching to centroid vector length: %d' % centroid_vec_length)
        topNTerms = centroid_vec_length
    terms = vectorizer.get_feature_names()
    with open(outFile, 'w') as topFeatsFile:
        topFeatsFile.write("Top terms per cluster:\n")
        for i in range(k):
            topFeatsFile.write("Cluster %d:\n" % (i + 1))
            for ind in order_centroids[i, :topNTerms]:
                topFeatsFile.write("  %s\n" % terms[ind])
            topFeatsFile.write('\n')
Model Diagnostics Model Visualization [Bari (2014)]

>>> from sklearn.decomposition import PCA
>>> from sklearn.cluster import KMeans
>>> import pylab as pl
>>> # NOTE: PCA wants a dense array, so sparse vectors need .toarray() first
>>> pca = PCA(n_components=2).fit(vectorized_posts)
>>> pca_2d = pca.transform(vectorized_posts)
>>> pl.figure('Reference Plot')
>>> pl.scatter(pca_2d[:, 0], pca_2d[:, 1], c=vectorized_posts_targets)
>>> kmeans = KMeans(n_clusters=2)  # REFER TO PRECEDING SLIDES
>>> kmeans.fit(vectorized_posts)
>>> pl.figure('K-means with 2 clusters')
>>> pl.scatter(pca_2d[:, 0], pca_2d[:, 1], c=kmeans.labels_)
>>> pl.show()
Roads Not Taken
• Batch vs. Stream Processing
• Batch KMeans (sklearn.cluster.MiniBatchKMeans)
• Other types of vectorization and tokenization
• Using unsupervised machine learning as a segue to a supervised solution
• What happened in the end with the client?
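The MiniBatchKMeans road can be sketched briefly: instead of recomputing every centroid from all points, each small batch nudges the nearest centroids with a per-centroid learning rate (the update popularized by Sculley's web-scale k-means; sklearn.cluster.MiniBatchKMeans exposes it through partial_fit). The function below is a hypothetical single-step illustration, not the sklearn internals.

```python
import numpy as np

def minibatch_kmeans_step(centroids, counts, batch):
    ''' one mini-batch k-means update. centroids is (k, d), counts is (k,)
    holding the number of points already absorbed per centroid, batch is
    (m, d). Both arrays are modified in place and returned. '''
    # assign each batch point to its nearest current centroid
    dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-centroid learning rate decays over time
        centroids[j] = (1.0 - eta) * centroids[j] + eta * x
    return centroids, counts
```

Calling this repeatedly over batches of vectorized tweets approximates full k-means at a fraction of the memory, which is what makes the streaming road viable.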
References
A. Bari, M. Chaouchi and T. Jung.
Predictive Analytics for Dummies (1st Edition).
For Dummies, 2014.
B. Bengfort, R. Bilbro and T. Ojeda.
Applied Text Analysis with Python.
O'Reilly Media, 2016.
Philipp K. Janert.
Data Analysis with Open Source Tools.
O'Reilly Media, 2010.
W. Richert and L. Pedro Coelho.
Building Machine Learning Systems with Python.
Packt Publishing, 2014.

 
Crossref Event Data and other new services
Crossref Event Data and other new servicesCrossref Event Data and other new services
Crossref Event Data and other new services
 
Data Strategy
Data StrategyData Strategy
Data Strategy
 
Personalized Hotlink Assignment
Personalized Hotlink AssignmentPersonalized Hotlink Assignment
Personalized Hotlink Assignment
 
BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1
 
Links That Increases Rankings
Links That Increases Rankings Links That Increases Rankings
Links That Increases Rankings
 
Identifying The Benefit of Linked Data
Identifying The Benefit of Linked DataIdentifying The Benefit of Linked Data
Identifying The Benefit of Linked Data
 
Session 1.2 improving access to digital content by semantic enrichment
Session 1.2   improving access to digital content by semantic enrichmentSession 1.2   improving access to digital content by semantic enrichment
Session 1.2 improving access to digital content by semantic enrichment
 
Web mining
Web miningWeb mining
Web mining
 
How Topics and Links Affect Everyone and Everything
How Topics and Links Affect Everyone and EverythingHow Topics and Links Affect Everyone and Everything
How Topics and Links Affect Everyone and Everything
 

Último

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 

Último (20)

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 

Finding Relevant Tweets in Social Media

  • 1. Separating the Wheat from the Chaff Finding Relevant Tweets in Social Media Streams Na’im Tyson, PhD Sciences, About.com April 20, 2017 Tyson (About.com) Finding Relevance in Tweets April 20, 2017 1 / 23
  • 11. 1 Introduction
        2 Ingesting Text Data
        3 Document Preprocessing
        4 Process Steps
        5 Tokenization
        6 Vectorization
        7 Clustering
        8 Model Diagnostics
        9 Roads Not Taken
  • 15. Introduction: Consultant Role
  Your Role as Consultant. . .
  • Advise on open source and proprietary analytical solutions for small- to medium-sized businesses
  • Build solutions to solve business goals using Open Source Software (whenever possible)
  • Develop systems to monitor solutions over time (when requested) OR
  • Develop diagnostics to monitor model behaviour
  • 20. Introduction: Client Description
  Brand Intelligence Firm
  • Boutique Social Monitoring & Analysis Firm
  • Provide quantitative summaries from qualitative data (tweets, Facebook posts, web pages, etc.)
  • Analytics Dashboards
  • How do they acquire data?
    • Data Collector/Aggregation Services
    • Collect social data from multiple APIs
    • Saves engineering resources
  • 28. Introduction: Project Scope
  Business Problem
  Imagine: One batch of data - tweets
  Relevance: How do you know which ones are relevant to the brand?
  Labeling: Would Turkers make good labelers for marking tweets as relevant?
  Cost: How many tweets will they label for creating a model?
  Scalability: Labeling thousands or hundreds of thousands of tweets
  Consistency: How do you know whether they are consistent labelers?
    • Implementation of consistency labeling statistics
  Goal: Establish a system for programmatically computing relevance of tweets
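The consistency check above is an inter-annotator agreement problem. The slides do not name a specific statistic, but a minimal sketch using Cohen's kappa (chance-corrected agreement between two labelers; the Turker labels below are invented) might look like:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / float(n)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # expected agreement if both labelers guessed by their own label frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / float(n * n)
    return (observed - expected) / (1.0 - expected)

# two hypothetical Turkers labeling the same 8 tweets: 1 = relevant, 0 = not
turker_1 = [1, 1, 0, 1, 0, 0, 1, 1]
turker_2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(round(cohens_kappa(turker_1, turker_2), 3))
```

Values near 1 indicate consistent labelers; values near 0 mean agreement is no better than chance, a sign the labeling instructions need tightening before paying for more labels.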
  • 29. Ingesting Text Data: Scraping & Crawling
  Most of the methods in this section—except the last two—came from [Bengfort (2016)]
  • 35. Ingesting Text Data: Scraping & Crawling
  Two Sides of the Same Coin?
  • Scraping (from a web page) is an information extraction task
    • Text content, publish date, page links or any other goodies
  • Crawling is an information processing task
    • Traversal of a website’s link network by crawler or spider
  • Find out what you can crawl before you start crawling!
    • Type into Google search: <DOMAIN NAME> robots.txt
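Rather than eyeballing robots.txt in a search result, the check can be automated with the standard-library robot parser (`urllib.robotparser` in Python 3); the rules below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# rules as they might appear in a hypothetical site's robots.txt
rules = [
    'User-agent: *',
    'Disallow: /private/',
    'Allow: /',
]

rp = RobotFileParser()
rp.parse(rules)
# against a live site you would instead do:
#   rp.set_url('https://www.example.com/robots.txt'); rp.read()

print(rp.can_fetch('*', 'https://www.example.com/news/story.html'))    # True
print(rp.can_fetch('*', 'https://www.example.com/private/draft.html')) # False
```

Calling `can_fetch()` before each request keeps a crawler on the right side of a site's stated crawling policy.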
  • 36. Ingesting Text Data: Scraping & Crawling
  Sample Scrape & Crawl in Python

      import bs4
      import requests
      from slugify import slugify

      sources = ['https://www.washingtonpost.com',
                 'http://www.nytimes.com/',
                 'http://www.chicagotribune.com/',
                 'http://www.bostonherald.com/',
                 'http://www.sfchronicle.com/']

      def scrape_content(url, page_name):
          try:
              page = requests.get(url).content
              filename = slugify(page_name).lower() + '.html'
              with open(filename, 'wb') as f:
                  f.write(page)
          except Exception:  # skip any page that fails to download or save
              pass

      def crawl(url):
          domain = url.split("//www.")[-1].split("/")[0]
          html = requests.get(url).content
          soup = bs4.BeautifulSoup(html, "lxml")
          links = set(soup.find_all('a', href=True))
          for link in links:
              sub_url = link['href']
              page_name = link.string
              if domain in sub_url:
                  scrape_content(sub_url, page_name)

      if __name__ == '__main__':
          for url in sources:
              crawl(url)
  • 39. Ingesting Text Data: RSS Reading
  • RSS = Really Simple Syndication
  • Standardized XML format for syndicated text content

      import bs4
      import feedparser
      from slugify import slugify

      feeds = ['http://blog.districtdatalabs.com/feed',
               'http://feeds.feedburner.com/oreilly/radar/atom',
               'http://blog.revolutionanalytics.com/atom.xml']

      def rss_parse(feed):
          parsed = feedparser.parse(feed)
          posts = parsed.entries
          for post in posts:
              html = post.content[0].get('value')
              soup = bs4.BeautifulSoup(html, 'lxml')
              post_title = post.title
              filename = slugify(post_title).lower() + '.xml'
              TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li']
              for tag in soup.find_all(TAGS):
                  paragraphs = tag.get_text()
                  with open(filename, 'a') as f:
                      f.write(paragraphs + '\n\n')
  • 44. Ingesting Text Data: APIs
  API Details & Sample Python
  • API = application programming interface
  • Allows interaction between a client and a server-side service that are independent of each other
  • Usually requires an API key, an API secret, an access token, and an access token secret
    • Twitter requires registration at https://apps.twitter.com for API credentials (import tweepy)

      import oauth2

      API_KEY = ' '
      API_SECRET = ' '
      TOKEN_KEY = ' '
      TOKEN_SECRET = ' '

      def oauth_req(url, key, secret, http_method="GET", post_body="",
                    http_headers=None):
          consumer = oauth2.Consumer(key=API_KEY, secret=API_SECRET)
          token = oauth2.Token(key=key, secret=secret)
          client = oauth2.Client(consumer, token)
          resp, content = client.request(url, method=http_method,
                                         body=post_body, headers=http_headers)
          return content
  • 45. Ingesting Text Data: PDF Miner
  PDF to Text

      from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
      from pdfminer.converter import TextConverter
      from pdfminer.layout import LAParams
      from pdfminer.pdfpage import PDFPage
      from cStringIO import StringIO

      def convert_pdf_to_txt(path, codec='utf-8', password='', maxpages=0,
                             caching=True, pages=None):
          '''
          convert pdf to text using PDFMiner.
          :param codec: target encoding of text
          :param password: password for the pdf if it is password-protected
          :param maxpages: maximum number of pages to extract
          :param caching: boolean
          :param pages: a list of page numbers to extract from the pdf (zero-based)
          :return: text string of all pages specified in the pdf
          '''
          rsrcmgr = PDFResourceManager()
          retstr = StringIO()
          device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=LAParams())
          interpreter = PDFPageInterpreter(rsrcmgr, device)
          pagenos = set(pages) if pages else set()
          with open(path, 'rb') as fp:
              for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                            password=password, caching=caching,
                                            check_extractable=True):
                  interpreter.process_page(page)
          device.close()
          txt = retstr.getvalue()
          retstr.close()
          return txt
  • 51. Document Preprocessing: Business Considerations
  • Every tweet is a document
  • Reject retweets
  • Ignore (toss) hypertext links
    • Why might this be a bad idea?
    • Hint: can links tell you about relevant tweets?

  RT @chriswheeldon2: #pinchmeplease so honored. #Beginnersluck Congrats to all at @AmericanInParis for Best Mus
  ------------------------------------------------------------
  RT @VanKaplan: .@AmericanInParis won Best Musical @ Outer Critics Awards! http://t.co/3y9Xem0c9I @PittsburghCLO
  ------------------------------------------------------------
  RT @cope_leanne: Congratulations @AmericanInParis @chriswheeldon2 @robbiefairchild 4 outer Critic Circle wins .
  ------------------------------------------------------------
  .@robbiefairchild @chriswheeldon2 Congrats on Outer Critics Circle Awards for your brilliant work in @AmericanI
  • 52. Document Preprocessing: Cleaning Code

      import re

      def extract_links(text):
          ''' get hypertext links in a piece of text. '''
          regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
          return re.findall(regex, text)

      def clean_posts(postList):
          '''
          remove retweets found w/in posts. keep a cache of urls to keep track of
          a mapping b/t a unique token for that url and the url itself.
          '''
          retweet_regex = r'^RT @\w+:'
          url_cache = {}
          link_num = 1
          cleaned_posts = []
          for post in postList:
              if re.match(retweet_regex, post):
                  continue
              urls = extract_links(post)
              for url in urls:
                  if url not in url_cache:
                      url_cache.setdefault(url, 'LINK{0}'.format(link_num))
                      link_num = link_num + 1
                  post = post.replace(url, url_cache[url])
              cleaned_posts.append(post.strip())
          return cleaned_posts

      def get_posts(post_filepath):
          postlist = open(post_filepath).read().splitlines()
          postlist = [p for p in postlist
                      if len(p) > 0 and not p.startswith('---')]
          return postlist
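A quick sanity check of the cleaning logic, with the two helpers restated so the snippet runs on its own (the sample posts are invented):

```python
import re

def extract_links(text):
    '''get hypertext links in a piece of text.'''
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    return re.findall(regex, text)

def clean_posts(postList):
    '''drop retweets; replace each distinct url with a stable LINKn token.'''
    retweet_regex = r'^RT @\w+:'
    url_cache = {}
    link_num = 1
    cleaned_posts = []
    for post in postList:
        if re.match(retweet_regex, post):
            continue  # retweets are rejected entirely
        for url in extract_links(post):
            if url not in url_cache:
                url_cache[url] = 'LINK{0}'.format(link_num)
                link_num += 1
            post = post.replace(url, url_cache[url])
        cleaned_posts.append(post.strip())
    return cleaned_posts

sample_posts = [
    'RT @VanKaplan: .@AmericanInParis won Best Musical http://t.co/3y9Xem0c9I',
    'Loved the show! http://t.co/abc123',
]
print(clean_posts(sample_posts))  # ['Loved the show! LINK1']
```

The retweet is dropped and the surviving tweet keeps a placeholder token, so the link's presence can still act as a feature downstream.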
• 56. Process Steps
Inspired by [Richert (2014)]
• Feature Extraction
  • Extract salient features from each tweet; store them as a vector
• Cluster Vectors (of Tweets)
• Determine the cluster for the tweet in question
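The three steps above can be sketched end to end with scikit-learn. The toy tweets are invented for illustration, and a plain CountVectorizer stands in for the deck's custom tokenizer and stemmed vectorizer to keep the sketch self-contained:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# toy corpus: two "relevant" congratulation tweets, two spammy ones
tweets = ["congrats @AmericanInParis best musical win",
          "so honored best musical award congrats",
          "buy cheap tickets now click here",
          "cheap tickets click now limited offer"]

# 1. feature extraction: each tweet becomes a count vector
vec = CountVectorizer()
X = vec.fit_transform(tweets)

# 2. cluster the vectors
km = KMeans(n_clusters=2, n_init=10, random_state=2).fit(X)

# 3. determine the cluster for a new tweet
new = vec.transform(["congrats on the best musical award"])
print(km.predict(new))
```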
• 57. Tokenization
Tokenizing Tweets

from nltk.tokenize import RegexpTokenizer

POST_PATTERN = r'''(?x)                       # set flag to allow verbose regexps
    (?:[A-Z]\.)+                              # abbreviations, e.g. U.S.A.
  | https?://[^\s<>"]+|www\.[^\s<>"]+         # html links
  | \w+(?:[-']\w+)*                           # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?                        # currency and percentages, e.g. $12.40, 82%
  | \#\w+\b                                   # hashtags ('#' must be escaped in verbose mode)
  | @\w+\b                                    # handles
'''

class MediaTokenizer(RegexpTokenizer):
    '''Regex tokenization class for tokenizing media posts given a pattern.'''
    def __init__(self, tokPattern, **kwargs):
        super(self.__class__, self).__init__(tokPattern, **kwargs)

    def __call__(self, text):
        return self.tokenize(text)

tweet_tokenizer = MediaTokenizer(POST_PATTERN)
print tweet_tokenizer('The quick brown fox jumped over the lazy dog.')
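A plain-`re` sketch of the same pattern (escapes restored, non-capturing groups so `finditer` returns whole matches, and the currency alternative moved ahead of bare words so "82%" survives intact; Python 3):

```python
import re

POST_PATTERN = r'''(?x)                       # verbose regex
    (?:[A-Z]\.)+                              # abbreviations, e.g. U.S.A.
  | https?://[^\s<>"]+|www\.[^\s<>"]+         # links
  | \$?\d+(?:\.\d+)?%?                        # currency and percentages
  | \#\w+                                     # hashtags ('#' must be escaped here)
  | @\w+                                      # handles
  | \w+(?:[-']\w+)*                           # words with internal hyphens/apostrophes
'''

def tokenize(text):
    # return the full text of every match, skipping unmatched punctuation
    return [m.group(0) for m in re.finditer(POST_PATTERN, text)]

print(tokenize('RT @VanKaplan: won 82% http://t.co/3y9Xem0c9I #win'))
```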
• 58. Vectorization
scikit-learn's Vectorizer Implemented

from ast import literal_eval
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

english_stemmer = SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    '''Stem words using the English stemmer so they can be vectorized by count.'''
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

config = {'encoding': 'utf-8', 'decode_error': 'strict',
          'strip_accents': 'ascii', 'ngram_range': '(1,2)',
          'stop_words': 'english', 'lowercase': True,
          'min_df': 5, 'max_df': 0.8, 'binary': False}

vectorizer = StemmedCountVectorizer(
    min_df=config['min_df'],
    max_df=config['max_df'],
    encoding=config['encoding'],
    binary=config['binary'],
    lowercase=config['lowercase'],
    strip_accents=config['strip_accents'],
    stop_words=config['stop_words'],
    ngram_range=literal_eval(config['ngram_range']),
    # NOTE: smooth_idf belongs to TfidfVectorizer, not CountVectorizer,
    # so it is not passed here
    tokenizer=tweet_tokenizer  # FROM LAST SLIDE! tokenizer MUST have __call__()
)
vec_posts = vectorizer.fit_transform(posts)
• 64. Clustering
What is KMeans?
• Clustering algorithm that segments data into k clusters
• Nondeterministic: different starting values may result in a different assignment of points to clusters
  • Run the k-means algorithm several times and then compare the results
  • This assumes you have time to do this!
  • Might be simpler to change tokenization and vectorization methods

Algorithm [Janert (2010), pp. 662-663]:
    choose initial positions for the cluster centroids
    repeat:
        for each point:
            calculate its distance from each cluster centroid
            assign the point to the nearest cluster
        recalculate the positions of the cluster centroids
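Janert's pseudocode maps almost line for line onto a NumPy sketch (toy 2-D points invented for illustration; sklearn's KMeans adds k-means++ seeding and multiple restarts on top of this basic loop):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=2):
    """Minimal k-means following the pseudocode above; a sketch,
    not sklearn's implementation (no k-means++ seeding, no restarts)."""
    rng = np.random.RandomState(seed)
    # choose initial positions for the cluster centroids (random data points)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # calculate each point's distance from each centroid,
        # assign the point to the nearest cluster
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recalculate the positions of the cluster centroids
        new_centroids = np.array([points[labels == i].mean(axis=0)
                                  for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

pts = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
labels, cents = kmeans(pts, 2)
print(labels)
```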
• 65. Clustering
How is it implemented?

import scipy as sp, sys, yaml
from sklearn.cluster import KMeans

seed = 2
sp.random.seed(seed)  # to reproduce the data later on

def train_cluster_model(posts, configDoc='prelim.yaml', tokenizer=None,
                        vectorizer_type=StemmedCountVectorizer):
    try:
        config = yaml.load(open(configDoc))
    except IOError, ie:
        sys.stderr.write("Can't open config file: %s" % str(ie))
        sys.exit(1)
    if not tokenizer:
        tokenizer = MediaTokenizer(POST_PATTERN)
    vectorizer = vectorizer_type(
        min_df=config['min_df'],
        max_df=config['max_df'],
        encoding=config['encoding'],
        lowercase=config['lowercase'],
        strip_accents=config['strip_accents'],
        stop_words=config['stop_words'],
        ngram_range=literal_eval(config['ngram_range']),
        tokenizer=tokenizer)
    vec_posts = vectorizer.fit_transform(posts)
    cls_model = KMeans(n_clusters=2, init='k-means++', n_jobs=2)
    cls_model.fit(vec_posts)
    return {'model': cls_model, 'vectorizer': vectorizer}
• 66. Clustering
Model Testing

import cPickle as pickle, sys, yaml
from scipy.spatial.distance import euclidean

def test_model(posts_path, cls_mod_path, vectorizer_path, yaml_filepath):
    orig, posts = vectorize_posts(posts_path, vectorizer_path)
    try:
        config = yaml.load(open(yaml_filepath))
    except IOError, ie:
        sys.stderr.write("Can't open yaml file: %s" % str(ie))
        sys.exit(1)
    vectorizer = pickle.load(open(vectorizer_path, 'rb'))
    vec_posts = vectorizer.transform(posts)
    cls_model = pickle.load(open(cls_mod_path, 'rb'))
    cls_labels = cls_model.predict(vec_posts).tolist()
    dists = [None] * len(cls_labels)
    for i, label in enumerate(cls_labels):
        dists[i] = euclidean(vec_posts.getrow(i).toarray(),
                             cls_model.cluster_centers_[label])
    for t, l, d in zip(orig, cls_labels, dists):
        print '{0}\t{1}\t{2:.6f}'.format(t, l, d)
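The distance-to-centroid idea behind test_model, in isolation: tweets whose vectors sit closest to their cluster's centroid are its most "typical" members, so sorting by that distance gives a relevance-style ranking. A minimal NumPy sketch with invented toy vectors (Python 3):

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 3.0]])  # toy tweet vectors
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])           # toy cluster centers
labels = np.array([0, 0, 1])                             # cluster assignments

# euclidean distance from each point to its assigned centroid
dists = np.linalg.norm(points - centroids[labels], axis=1)
order = dists.argsort()  # indices of the most central ("typical") points first
print(dists, order)
```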
• 67. Model Diagnostics
Top Terms Per Cluster

def top_terms_per_cluster(km, vectorizer, outFile, k=2, topNTerms=10):
    '''Print the top terms from each cluster.'''
    from warnings import warn, simplefilter
    # NOTE: ignore the following (annoying) deprecation warning:
    #   /Library/Python/2.7/site-packages/sklearn/utils/__init__.py:94:
    #   DeprecationWarning: Function fixed_vocabulary is deprecated;
    #   The `fixed_vocabulary` attribute is deprecated and will be removed
    #   in 0.18. Please use `fixed_vocabulary_` instead.
    simplefilter('ignore', DeprecationWarning)
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    # check whether the top n terms parameter exceeds the centroid length
    centroid_vec_length = order_centroids[0, ].shape[0]
    if topNTerms > centroid_vec_length:
        warn('Top n terms parameter exceeds centroid vector length!')
        warn('Switching to centroid vector length: %d' % centroid_vec_length)
        topNTerms = centroid_vec_length
    terms = vectorizer.get_feature_names()
    with open(outFile, 'w') as topFeatsFile:
        topFeatsFile.write("Top terms per cluster:\n")
        for i in range(k):
            topFeatsFile.write("Cluster %d:\n" % (i + 1))
            for ind in order_centroids[i, :topNTerms]:
                topFeatsFile.write("  %s\n" % terms[ind])
            topFeatsFile.write('\n')
• 68. Model Diagnostics
Model Visualization [Bari (2014)]

>>> from sklearn.decomposition import PCA
>>> from sklearn.cluster import KMeans
>>> import pylab as pl
>>> pca = PCA(n_components=2).fit(vectorized_posts)
>>> pca_2d = pca.transform(vectorized_posts)
>>> pl.figure('Reference Plot')
>>> # assumes target labels exist for the reference plot
>>> pl.scatter(pca_2d[:, 0], pca_2d[:, 1], c=vectorized_posts_targets)
>>> kmeans = KMeans(n_clusters=2)  # REFER TO PRECEDING SLIDES
>>> kmeans.fit(vectorized_posts)
>>> pl.figure('K-means with 2 clusters')
>>> pl.scatter(pca_2d[:, 0], pca_2d[:, 1], c=kmeans.labels_)
>>> pl.show()
• 73. Roads Not Taken
• Batch vs. Stream Processing
• Batch KMeans (sklearn.cluster.MiniBatchKMeans)
• Other types of vectorization and tokenization
• Using unsupervised machine learning as a segue to a supervised solution
• What happened in the end with the client?
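Of these, MiniBatchKMeans is the smallest change: its partial_fit method updates the centroids one batch at a time, which is what a streaming ingest would call per micro-batch. A sketch with synthetic, well-separated data invented for illustration (Python 3):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(2)
# five "micro-batches" of vectorized tweets: two well-separated blobs
stream = [np.vstack([rng.normal(0, 1, (20, 5)),
                     rng.normal(8, 1, (20, 5))])
          for _ in range(5)]

mbk = MiniBatchKMeans(n_clusters=2, random_state=2, n_init=3)
for batch in stream:
    mbk.partial_fit(batch)  # centroids updated incrementally, batch by batch

labels = mbk.predict(stream[-1])
```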
• 74. References
A. Bari, M. Chaouchi, and T. Jung. Predictive Analytics for Dummies (1st edition). For Dummies, 2014.
B. Bengfort, R. Bilbro, and T. Ojeda. Applied Text Analysis with Python. O'Reilly Media, 2016.
P. K. Janert. Data Analysis with Open Source Tools. O'Reilly Media, 2010.
W. Richert and L. P. Coelho. Building Machine Learning Systems with Python. Packt Publishing, 2014.