Mining Social Web APIs
with IPython Notebook
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
Austin - 9 January 2015
1
Intro
2
Hello, My Name Is ... Matthew
3
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining & machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
Transforming Curiosity Into Insight
4
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding
templates for data science experiments
Think of the book as "premium" support for the
OSS project
Table of Contents (1/2)
Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking
About, and More
Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More
Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and
More
Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human
Language, Summarize Blog Posts, and More
Chapter 6 - Mining Mailboxes: Analyzing Who's Talking to Whom About What, How Often, and
More
5
Table of Contents (2/2)
Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs,
and More
Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing
over RDF, and More
Chapter 9 - Twitter Cookbook
Appendix A - Information About This Book's Virtual Machine Experience
Appendix B - OAuth Primer
Appendix C - Python and IPython Notebook Tips & Tricks
6
Designed for Pedagogy
Brief Intro
Objectives
API Primer
Analysis Technique(s)
Data Visualization
Recap
Suggested Exercises
Recommended Resources
7
The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
8
*Estimates as of early 2014
Overview
Module 1 - Virtual Machine Setup (30 mins)
Module 2 - Mining Twitter (90 mins)
BREAK (15 mins)
Module 3 - Mining Facebook (60 mins)
Lunch (75 mins)
9
Module 4 - Mining LinkedIn (75 mins)
BREAK (15 mins)
Module 5 - Open/Choice (75 mins)
Module 6 - Privacy & Ethics (15 mins)
Module 7 - Final Q&A (15 mins)
Module Format
~10-15 minutes of exposition
I talk; you listen
~15 minutes of independent (or collaborative) work
You hack while I walk around and help you
~5 minutes of recap with Q&A
You ask; I try to answer
10
Workshop Objective
To send you away as a social web hacker
Broad working knowledge of popular social web APIs
Hands-on experience hacking on social web data with a common toolkit
Not for me to talk to you for 8 straight hours
11
Just a Few More Things
This workshop is...
An adaptation of Mining the Social Web, 2nd Edition
More of a guided hacking session where you follow along (vs a preso)
Wider than it is deep
There's only so much you can do in a few hours
I'm available 24/7 this week (and beyond) to help you be successful
12
Assumptions
At some point in your life, you have
Programmed with Python
Worked with JSON
Made requests and processed responses to/from web servers
Or you want to learn to do these things now...
And you're a quick learner
13
Module 1: Virtual Machine Setup
14
Why do you need a VM?
15
To save yourself a lot of time
Because installation and configuration management is tedious and time-consuming
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating
system
But I can do all of that myself...
True...
If you would rather troubleshoot unexpected installation/configuration issues
instead of immediately focusing on the real task at hand
At least give it a shot before resorting to your own devices so that you
don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to
be compiled
Which requires specific versions of developer libraries to be installed
You get the idea...
16
The Virtual Machine Experience
Vagrant
A nice abstraction around virtual machine providers
One ring to rule them all
Virtualbox, VMWare, AWS, ...
IPython Notebook
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking
17
What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a Virtualbox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888
18
Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step
Because it's great for collaboration
Sharing/publishing results is trivial
Because the UX is as easy as working in a notepad
Think of it as "executable paper"
19
20
21
VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!
Basically:
Install Virtualbox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 on your host machine's web browser
22
What Could Be Easier?
A hosted version of the VM!
But only for a few hours during this workshop
Because it costs money to run these servers
Go to http://bit.ly/XXX and pick a machine
Do not share the URLs outside of this workshop!
Please don't try to hack the machines
Learn how I arrived at this setup at http://MiningTheSocialWeb.com
23
Module 2: Mining Twitter
24
Objectives
25
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs
from tweets
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook
Twitter Primitives
26
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"
Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining
Streaming API filters
JSON responses
Cursors (not quite pagination)
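As a rough sketch of the request style above, the following builds the example user_timeline URL and walks a cursored resource. The helper names (build_url, iterate_cursor, fetch_page) are hypothetical, and fetch_page stands in for an authenticated HTTP call, since real requests need OAuth credentials. It assumes the v1.1 convention that a cursored response carries a next_cursor field, that -1 requests the first page, and that 0 means no pages remain.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def build_url(resource, **params):
    """Return the URL for a GET request against a REST resource."""
    query = urlencode(sorted(params.items()))
    return f"{API_ROOT}/{resource}.json" + (f"?{query}" if query else "")


def iterate_cursor(fetch_page, key="ids"):
    """Yield items from a cursored resource, one page at a time.

    fetch_page(cursor) is a hypothetical callable that performs the
    authenticated request and returns the decoded JSON response.
    """
    cursor = -1  # -1 asks for the first page
    while cursor != 0:  # 0 signals that no pages remain
        page = fetch_page(cursor)
        yield from page[key]
        cursor = page["next_cursor"]


# Stand-in for real API responses so the sketch runs offline.
FAKE_PAGES = {
    -1: {"ids": [1, 2], "next_cursor": 42},
    42: {"ids": [3], "next_cursor": 0},
}

url = build_url("statuses/user_timeline", screen_name="SocialWebMining")
follower_ids = list(iterate_cursor(FAKE_PAGES.__getitem__))
print(url)
print(follower_ids)
```

Unlike page numbers, a cursor is an opaque token: you can only move to the page the previous response pointed at, which is why the loop carries next_cursor forward.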
27
Twitter is an Interest Graph
28
[Figure: example interest graph with nodes labeled Roberto, Mercedes, Jorge, Ana, Nina, Johnny Araya, and Rodolfo Hernández]
What's in a Tweet?
29
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations
(financial) symbols
stock tickers
media
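A minimal sketch of pulling entities out of a tweet, assuming the v1.1 JSON layout in which each tweet carries an "entities" dictionary. The function name extract_entities is hypothetical, and the sample tweet is hand-written and trimmed to the fields used here; it is not real API output.

```python
def extract_entities(tweet):
    """Collect the most commonly used entity values from a tweet dict."""
    entities = tweet.get("entities", {})
    return {
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


# Hand-written sample for illustration only.
sample_tweet = {
    "text": "Hacking on #python with @ptwobrussell http://t.co/abc $TWTR",
    "entities": {
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "hashtags": [{"text": "python"}],
        "urls": [{"url": "http://t.co/abc",
                  "expanded_url": "http://MiningTheSocialWeb.com"}],
        "symbols": [{"text": "TWTR"}],
    },
}

found = extract_entities(sample_tweet)
print(found)
```

The "multiple variations" of a URL show up as separate keys on each URL entity (the shortened form and the expanded form here), so pick the one that suits the analysis.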
30
Data Mining = Curiosity + Stats
Curiosity
Interests, desires, and intuitions
Statistics
Counting
Comparing
Filtering
Ranking
Hypothesis testing; knowledge discovery
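The counting, filtering, and ranking steps can be sketched with the standard library alone. The hashtag list below is made up for illustration, and the threshold of 2 is an arbitrary choice.

```python
from collections import Counter

# Made-up hashtags, as if collected from a batch of tweets.
hashtags = ["python", "data", "python", "ipython", "data", "python", "nlp"]

# Counting: how often does each value occur?
counts = Counter(hashtags)

# Filtering: keep only values seen at least twice.
frequent = {tag: n for tag, n in counts.items() if n >= 2}

# Ranking: most frequent first, ties broken alphabetically.
ranked = sorted(frequent.items(), key=lambda item: (-item[1], item[0]))

for tag, n in ranked:
    print(f"{tag:<10} {n}")
```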
31
Histograms
A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
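To make the binning idea concrete, here is a small sketch that groups values into fixed-width ranges and prints a text histogram. The retweet counts are invented, the helper name histogram is hypothetical, and a plotting library would normally do this work for you.

```python
from collections import Counter


def histogram(values, bin_width):
    """Map each bin's lower bound to the number of values that fall in it."""
    bins = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(bins.items()))


# Invented retweet counts for a handful of tweets.
retweet_counts = [0, 3, 7, 12, 15, 18, 41, 44, 95]
bin_width = 10
hist = histogram(retweet_counts, bin_width)

# Each row is one bin (a range on the x-axis); the bar length is its frequency.
for lower, frequency in hist.items():
    upper = lower + bin_width - 1
    print(f"{lower:>3}-{upper:<3} {'#' * frequency}")
```

Note that this sketch omits empty bins entirely, whereas a plotted histogram would show them as gaps along the x-axis.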
32
33
Example: Histogram of Retweets
Social Media Analysis Framework
A memorable four step process to guide data science experiments:
Aspire
To test a hypothesis (answer a question)
Acquire
Get the data
Analyze
Count things
Summarize
Plot the results
34
Exercises
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
Execute each example sequentially
Customize queries
Explore tweet metadata; count tweet entities; plot histograms of results
Explore the "Chapter 9 (Twitter Cookbook)" notebook
Think of it as a collection of building blocks
35
Module 3: Mining Facebook
36
Objectives
37
Be able to identify Facebook primitives
Learn about Facebook’s Social Graph API and how to make API requests
Understand how Open Graph protocol extends Facebook's Social Graph
API
Be able to analyze likes from Facebook pages and friends
The Graph API v2 changes substantially tightened privacy and permissions
See https://developers.facebook.com/docs/apps/changelog
Facebook Primitives
Account Types: People & Pages
Mutual Connections
Likes
Shares
Comments
Extensive Privacy Controls
38
Facebook is an Interest Graph
39
[Figure: example interest graph with nodes labeled Roberto, Mercedes, Jorge, Ana, Nina, Johnny Araya, and Rodolfo Hernández]
Graph API
Nodes
Things
Edges
Connections between things
Fields
Info (properties) about things
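One way to see how nodes, edges, and fields map onto a request is to assemble the URL by hand. This is only a sketch: graph_url is a hypothetical helper, the access token value is a placeholder, and the https root is an assumption (the example URLs in this deck are written with http).

```python
from urllib.parse import urlencode

GRAPH_ROOT = "https://graph.facebook.com"  # assumption: https root


def graph_url(node, edge=None, fields=None, access_token=None):
    """Build a Graph API URL from a node, an optional edge, and optional fields."""
    path = f"{GRAPH_ROOT}/{node}"
    if edge:
        path += f"/{edge}"  # an edge is a connection hanging off the node
    params = {}
    if fields:
        params["fields"] = ",".join(fields)  # fields are properties of the node
    if access_token:
        params["access_token"] = access_token
    # safe="," keeps the comma-separated field list readable
    return path + (f"?{urlencode(params, safe=',')}" if params else "")


likes_url = graph_url("me", edge="likes")
page_url = graph_url("Mining-the-Social-Web", fields=["id", "name", "about", "likes"])
print(likes_url)
print(page_url)
```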
40
Facebook API Explorer
41
Go to https://developers.facebook.com/tools/explorer
Really, go there right now...
Graph API Explorer
42
Example Graph API Requests
Social Graph API requests
Easy to learn and use
http://graph.facebook.com/me/feed
http://graph.facebook.com/me/likes
http://graph.facebook.com/me/?fields=id,name,friends.fields(likes.limit(10))
http://graph.facebook.com/Mining-the-Social-Web?fields=id,name,about,likes
JSON responses
Traditional pagination
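Traditional pagination can be sketched as a loop that keeps following a "next" link. The sketch assumes responses shaped like {"data": [...], "paging": {"next": url}}; iterate_pages and fetch are hypothetical names, and the fake responses below replace real authenticated HTTP calls so the example runs offline.

```python
def iterate_pages(first_url, fetch):
    """Yield every item across pages by following paging.next links.

    fetch(url) is a hypothetical callable that performs the request
    and returns the decoded JSON response as a dict.
    """
    url = first_url
    while url:
        response = fetch(url)
        yield from response.get("data", [])
        url = response.get("paging", {}).get("next")  # None on the last page


# Fake responses keyed by URL; the names and URLs are made up.
FAKE_RESPONSES = {
    "page-1": {"data": [{"name": "CrossFit"}, {"name": "OReilly"}],
               "paging": {"next": "page-2"}},
    "page-2": {"data": [{"name": "MiningTheSocialWeb"}]},
}

liked_pages = [item["name"] for item in iterate_pages("page-1", FAKE_RESPONSES.__getitem__)]
print(liked_pages)
```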
43
44
Retrieve Your Likes
Permissions Prior to Graph v2
45
Permissions as of Graph v2
46
Explore Facebook Pages
47
Names of pages
MiningTheSocialWeb
CrossFit
OReilly
Web URLs (OGP extensions to Facebook's Social Graph)
http://www.imdb.com/title/tt0117500
Social Media Analysis Framework
Recall the same four step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
48
Exercises
Copy/paste your access token from the Graph API Explorer into the "Chapter 2
(Mining Facebook)" notebook
Execute and tinker with Examples 2-1 thru 2-6
Inspect content in your feed
Juxtapose public figures
Compare/contrast similar products/brands of interest
49
Module 4: Mining LinkedIn
50
Objectives
51
Learn about LinkedIn’s Developer Platform
Understand how clustering works
A fundamental type of machine learning
Be able to employ geocoding services to arrive at a set of coordinates
from a textual reference to a location
Visualize geographic data with cartograms
LinkedIn Primitives
Account Types: People, Groups, Companies, Jobs
And Activity Streams
Data is typically perceived as being more sensitive
Richest data source? (Think: LinkedIn's business model)
Profile descriptions from mutual connections
A little messier than it first appears
52
API Requests
HTTP-based Requests
Field selector syntax
http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url)
XML responses
CSV address book download
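The field selector syntax is easy to generate programmatically. In this sketch, people_url and parse_person are hypothetical helpers, and the XML document is a hand-written illustration of the general response shape rather than captured API output.

```python
import xml.etree.ElementTree as ET


def people_url(fields, member="~"):
    """Build a profile URL using field selector syntax; "~" means the current user."""
    return f"http://api.linkedin.com/v1/people/{member}:({','.join(fields)})"


def parse_person(xml_text):
    """Flatten a simple XML profile into a dict of tag -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


url = people_url(["first-name", "last-name", "headline", "picture-url"])

# Hand-written sample response for illustration only.
SAMPLE_XML = """
<person>
  <first-name>Matthew</first-name>
  <last-name>Russell</last-name>
  <headline>Example headline</headline>
</person>
"""

profile = parse_person(SAMPLE_XML)
print(url)
print(profile)
```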
53
Is LinkedIn an Interest Graph?
Fundamentally: yes, but the developer API requires you to do a bit of work to model it
Less trivial to find some of the "pivots"
e.g. There's no public Skills API for developers
But the data is there (mostly in profile descriptions) for your direct connections
Companies, job titles, job descriptions
Lots of richness is tucked away in human language data
54
Clustering
An unsupervised machine learning technique
Think: an algorithm that organizes the data into partitions
55
Example: Clustered Job Titles
56
3 Steps to Clustering Your Data
Normalization
Compare (similarity/distance measurement)
n-grams, edit distance, and Jaccard are common, but your imagination is the limit
Why can't you just compare everything to everything?
Dimensionality Reduction
Ideally, your clustering algorithm will mitigate the pain
k-means is among the most common clustering techniques in use
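A small sketch of the normalization step for job titles, plus a count of how many comparisons "everything to everything" would take. The abbreviation table is a made-up starting point and normalize_title is a hypothetical helper; real data needs a much longer list.

```python
import re

# Illustrative abbreviation table; extend it for real data.
ABBREVIATIONS = {
    "sr": "senior",
    "jr": "junior",
    "vp": "vice president",
    "cto": "chief technology officer",
    "ceo": "chief executive officer",
}


def normalize_title(title):
    """Lowercase, strip punctuation, and expand common abbreviations."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def pair_count(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(normalize_title("Sr. Software Engineer"))
print(normalize_title("CTO & Co-Founder"))
print(pair_count(1000))
```

The pair count grows quadratically with the number of items, which is why comparing everything to everything stops being practical as the data set grows.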
57
Jaccard Similarity
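Jaccard similarity is the size of the intersection of two sets divided by the size of their union, giving a score between 0 (nothing shared) and 1 (identical sets). A minimal sketch over the words of two job titles follows; treating two empty sets as identical is a choice made here, not something the slides specify.

```python
def jaccard(a, b):
    """Return |A intersect B| / |A union B| for two iterables of hashable items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


title_1 = "chief technology officer".split()
title_2 = "chief executive officer".split()

# Shared words: chief, officer (2). Distinct words overall: 4.
score = jaccard(title_1, title_2)
print(score)
```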
58
k-Means Explained
1. Randomly pick k points in the data space as initial values that will be used to
compute the k clusters: K1, K2, ..., Kk.
2. Assign each of the n points to a cluster by finding the nearest Kn—effectively
creating k clusters and requiring k*n comparisons.
3. For each of the k clusters, calculate the centroid of the cluster and reassign its Ki
value to be that value. (Hence, you’re computing “k-means” during each iteration of
the algorithm.)
4. Repeat steps 2–3 until the members of the clusters do not change between
iterations. Generally speaking, relatively few iterations are required for convergence.
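The four steps above can be sketched in plain Python. This is an illustrative implementation, not the book's code: it picks k of the data points as initial values (a common variant of step 1), uses Euclidean distance via math.dist (Python 3.8+), and the seed and max_iterations parameters are additions for reproducibility and safety.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1 (variant): start from k data points
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest center (k * n comparisons).
        new_assignment = [nearest(p, centers) for p in points]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the centroid of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # leave a center in place if its cluster is empty
                centers[i] = centroid(members)
    return centers, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)

clusters = {}
for point, label in zip(points, assignment):
    clusters.setdefault(label, []).append(point)
print(clusters)
```

On data this cleanly separated the algorithm settles within a few iterations; on messier data the result can depend on the initial picks, so it is common to run it several times with different seeds.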
59
k-Means: Initialize
60
k-Means: Step 1
61
k-Means: Step 2
62
k-Means: Step 3
63
k-Means: (Fast-Forward) Step 9
64
Geocoding
Transforming a location to a set of coordinates
Nashville, TN => (36.16783905029297, -86.77816009521484)
A harder problem than it first appears
The Bing API is especially generous
Requires an account sign up: http://bingmapsportal.com
Use the API key with the geopy package
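A sketch of geocoding a list of locations while avoiding repeat lookups. The helper geocode_all is hypothetical; it accepts any callable that returns an object with latitude and longitude attributes, which is the shape geopy geocoders return. The fake geocoder below, seeded with the Nashville coordinates from this slide, keeps the example runnable without an API key.

```python
from collections import namedtuple


def geocode_all(locations, geocode):
    """Return {location: (latitude, longitude) or None} for each distinct location.

    geocode is any callable taking a string and returning an object with
    .latitude and .longitude attributes, or None when nothing is found.
    """
    coordinates = {}
    for location in locations:
        key = location.strip()
        if key in coordinates:
            continue  # already looked up; skip the repeat request
        result = geocode(key)
        coordinates[key] = (result.latitude, result.longitude) if result else None
    return coordinates


# With geopy installed and a Bing Maps key (not run here):
#   from geopy.geocoders import Bing
#   coordinates = geocode_all(["Nashville, TN"], Bing(api_key="YOUR_KEY").geocode)

# Offline stand-in for a real geocoder.
Location = namedtuple("Location", "latitude longitude")
KNOWN = {"Nashville, TN": Location(36.16783905029297, -86.77816009521484)}
calls = []


def fake_geocode(query):
    calls.append(query)
    return KNOWN.get(query)


coordinates = geocode_all(["Nashville, TN", "Nashville, TN ", "Nowhere"], fake_geocode)
print(coordinates)
```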
65
Introducing: The Dorling Cartogram
66
Social Media Analysis Framework
Remember: Use the same four step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
67
Exercises
Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API
connection and follow along with the first few examples
Download your connections as a CSV file from http://www.linkedin.com/people/export-settings and save them to your VM
A deviation from instructions in Example 3-6 is necessary for remote VMs
See http://bit.ly/mtsw-ch03-helper-code
Try clustering your contacts in Example 3-12
Use the python-linkedin client to tap into Activity Streams
See https://developer.linkedin.com/documents/get-network-updates-and-statistics-api
68
Module 5: Choice
69
Objectives
70
To work on "loose ends" or areas of interest from previous modules
To hack on code in notebooks not yet encountered
To set up the virtual machine on your own box if you haven't yet
To collaborate/talk and otherwise make the most of our togetherness
Social Media Analysis Framework
Remember:
Aspire
Acquire
Analyze
Summarize
71
Recommendations
Set up your own development environment if you haven't already
Appendix A
Text Mining & Natural Language Processing
Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages)
Graph Mining
Chapter 7 (Mining GitHub)
72
Module 6: Privacy & Ethics
73
74
Know thy data, and know thyself
--Matthew A. Russell
75
If we have data, let’s look at data.
If we have opinions, let’s go with mine
--Jim Barksdale
76
In God we trust. All others must bring data
--W. Edwards Deming
Communication => Data
Communication
Senders
humans & machines
Messages
natural language, images, videos, etc.
Recipients
humans & machines
77
Data Alchemy
Data: Documents & document fragments (text messages, etc.)
Information: "Assertions", summaries, tags, etc.
Knowledge: Aggregated, queryable information
Wisdom: “Compressed” knowledge
Gold: Money
78
Machine Learning
79
A program that learns (improves) from experience (data) according
to some objective
Supervised learning
Unsupervised learning
Reinforcement learning
How to do it
Program mathematical models and hope for the best...
How to do it well
Program state-of-the-art mathematical models with sufficient
representative data
80
Knowledge is a process of piling up facts;
wisdom lies in their simplification
--Martin Fischer
81
Any sufficiently advanced technology is
indistinguishable from magic
--Arthur C. Clarke
Is Privacy Already an Illusion?
82
Digital happenings circa 2014
The Cloud
Social Media
Deep Learning
The Internet of Things
Internet.org
83
Civilization is the progress toward a society of privacy...
-- Ayn Rand
84
If you have something that you don’t want anyone to know,
maybe you shouldn’t be doing it in the first place.
-- Eric Schmidt, (former) CEO of Google
Influences on Ethics
Capitalism, economics, & marketing
A for-profit corporation's fiduciary duty: To maximize the common stock's value
How to do it? By transacting commerce
How to do it well? By advertising more effectively than competitors
How to do it really well? With highly relevant personalized ads (recommenders)
Terms of Service (ToS) - The legal extent of ethical obligations?
85
Module 7: Final Q&A
86
Free Stuff
http://MiningTheSocialWeb.com
Mining the Social Web 2E Chapter 1 (Chimera)
http://bit.ly/13XgNWR
Source Code (GitHub)
http://bit.ly/MiningTheSocialWeb2E
http://bit.ly/1fVf5ej (numbered examples)
Screencasts (Vimeo)
http://bit.ly/mtsw2e-screencasts
87

Mining Social Web APIs with IPython Notebook (Data Day Texas 2015)

BDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort Service
 
Production diary Film the city powerpoint
Production diary Film the city powerpointProduction diary Film the city powerpoint
Production diary Film the city powerpoint
 
CASH PAYMENT ON GIRL HAND TO HAND HOUSEWIFE
CASH PAYMENT ON GIRL HAND TO HAND HOUSEWIFECASH PAYMENT ON GIRL HAND TO HAND HOUSEWIFE
CASH PAYMENT ON GIRL HAND TO HAND HOUSEWIFE
 
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
 
Call Girls in Chattarpur (delhi) call me [9953056974] escort service 24X7
Call Girls in Chattarpur (delhi) call me [9953056974] escort service 24X7Call Girls in Chattarpur (delhi) call me [9953056974] escort service 24X7
Call Girls in Chattarpur (delhi) call me [9953056974] escort service 24X7
 
Capstone slidedeck for my capstone project part 2.pdf
Capstone slidedeck for my capstone project part 2.pdfCapstone slidedeck for my capstone project part 2.pdf
Capstone slidedeck for my capstone project part 2.pdf
 
College & House wife Call Girls in Paharganj 9634446618 -Best Escort call gi...
College & House wife  Call Girls in Paharganj 9634446618 -Best Escort call gi...College & House wife  Call Girls in Paharganj 9634446618 -Best Escort call gi...
College & House wife Call Girls in Paharganj 9634446618 -Best Escort call gi...
 
Finance-and-Operations-in-the-Azure-Cloud.pdf
Finance-and-Operations-in-the-Azure-Cloud.pdfFinance-and-Operations-in-the-Azure-Cloud.pdf
Finance-and-Operations-in-the-Azure-Cloud.pdf
 
Craft Your Legacy: Invest in YouTube Presence from Sociocosmos"
Craft Your Legacy: Invest in YouTube Presence from Sociocosmos"Craft Your Legacy: Invest in YouTube Presence from Sociocosmos"
Craft Your Legacy: Invest in YouTube Presence from Sociocosmos"
 
SEO Expert in USA - 5 Ways to Improve Your Local Ranking - Macaw Digital.pdf
SEO Expert in USA - 5 Ways to Improve Your Local Ranking - Macaw Digital.pdfSEO Expert in USA - 5 Ways to Improve Your Local Ranking - Macaw Digital.pdf
SEO Expert in USA - 5 Ways to Improve Your Local Ranking - Macaw Digital.pdf
 
Film show post-production powerpoint for site
Film show post-production powerpoint for siteFilm show post-production powerpoint for site
Film show post-production powerpoint for site
 
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
 
Capstone slidedeck for my capstone final edition.pdf
Capstone slidedeck for my capstone final edition.pdfCapstone slidedeck for my capstone final edition.pdf
Capstone slidedeck for my capstone final edition.pdf
 
Film show investigation powerpoint for the site
Film show investigation powerpoint for the siteFilm show investigation powerpoint for the site
Film show investigation powerpoint for the site
 
Pondicherry Call Girls Book Now 8617697112 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 8617697112 Top Class Pondicherry Escort Servi...Pondicherry Call Girls Book Now 8617697112 Top Class Pondicherry Escort Servi...
Pondicherry Call Girls Book Now 8617697112 Top Class Pondicherry Escort Servi...
 
Social media marketing/Seo expert and digital marketing
Social media marketing/Seo expert and digital marketingSocial media marketing/Seo expert and digital marketing
Social media marketing/Seo expert and digital marketing
 
Film show production powerpoint for site
Film show production powerpoint for siteFilm show production powerpoint for site
Film show production powerpoint for site
 
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort ServiceVellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
 
Interpreting the brief for the media IDY
Interpreting the brief for the media IDYInterpreting the brief for the media IDY
Interpreting the brief for the media IDY
 

Mining Social Web APIs with IPython Notebook (Data Day Texas 2015)

  • 1. Mining Social Web APIs with IPython Notebook Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com Austin - 9 January 2015
  • 3. Hello, My Name Is ... Matthew Background in Computer Science Data mining & machine learning CTO @ Digital Reasoning Systems Data mining; machine learning Author @ O'Reilly Media 5 published books on technology Principal @ Zaffra Selective boutique consulting
  • 4. Transforming Curiosity Into Insight An open source software (OSS) project http://bit.ly/MiningTheSocialWeb2E A book http://bit.ly/135dHfs Accessible to (virtually) everyone Virtual machine with turn-key coding templates for data science experiments Think of the book as "premium" support for the OSS project
  • 5. Table of Contents (1/2) Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and More Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More Chapter 6 - Mining Mailboxes: Analyzing Who's Talking to Whom About What, How Often, and More
  • 6. Table of Contents (2/2) Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More Chapter 9 - Twitter Cookbook Appendix A - Information About This Book's Virtual Machine Experience Appendix B - OAuth Primer Appendix C - Python and IPython Notebook Tips & Tricks
  • 7. Designed for Pedagogy Brief Intro Objectives API Primer Analysis Technique(s) Data Visualization Recap Suggested Exercises Recommended Resources
  • 8. The Social Web Is All the Rage World population: ~7B people Facebook: 1.15B users Twitter: 500M users Google+: 343M users LinkedIn: 238M users ~200M+ blogs (conservative estimate) *Estimates as of early 2014
  • 9. Overview Module 1 - Virtual Machine Setup (30 mins) Module 2 - Mining Twitter (90 mins) BREAK (15 mins) Module 3 - Mining Facebook (60 mins) Lunch (75 mins) Module 4 - Mining LinkedIn (75 mins) BREAK (15 mins) Module 5 - Open/Choice (75 mins) Module 6 - Privacy & Ethics (15 mins) Module 7 - Final Q&A (15 mins)
  • 10. Module Format ~10-15 minutes of exposition I talk; you listen ~15 minutes of independent (or collaborative) work You hack while I walk around and help you ~5 minutes of recap with Q&A You ask; I try to answer
  • 11. Workshop Objective To send you away as a social web hacker Broad working knowledge of popular social web APIs Hands-on experience hacking on social web data with a common toolkit Not for me to talk to you for 8 straight hours
  • 12. Just a Few More Things This workshop is... An adaptation of Mining the Social Web, 2nd Edition More of a guided hacking session where you follow along (vs a preso) Wider than it is deep There's only so much you can do in a few hours I'm available 24/7 this week (and beyond) to help you be successful
  • 13. Assumptions At some point in your life, you have Programmed with Python Worked with JSON Made requests and processed responses to/from web servers Or you want to learn to do these things now... And you're a quick learner
  • 14. Module 1: Virtual Machine Setup
  • 15. Why do you need a VM? To save yourself a lot of time Because installation and configuration management is tedious and time-consuming So that you can focus on the task at hand instead So that I can support you regardless of your hardware and operating system
  • 16. But I can do all of that myself... True... If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages Including scientific computing tools that require underlying C/C++ code to be compiled Which requires specific versions of developer libraries to be installed You get the idea...
  • 17. The Virtual Machine Experience Vagrant A nice abstraction around virtual machine providers One ring to rule them all VirtualBox, VMware, AWS, ... IPython Notebook The easiest way to program with Python A better REPL (interpreter) Great for hacking
  • 18. What happens when you vagrant up? Vagrant follows the instructions in your Vagrantfile Starts up a VirtualBox instance Uses Chef to provision it Installs OS patches/updates Installs MTSW software dependencies Starts IPython Notebook server on port 8888
  • 19. Why Should I Use IPython Notebook? Because it's great for hacking And hacking is usually the first step Because it's great for collaboration Sharing/publishing results is trivial Because the UX is as easy as working in a notepad Think of it as "executable paper"
  • 22. VM Quick Start Instructions Go to http://MiningTheSocialWeb.com/quick-start/ Follow the instructions And watch the screencasts! Basically: Install VirtualBox & Vagrant Run "vagrant up" in a terminal to start a guest VM Then, go to http://localhost:8888 on your host machine's web browser
  • 23. What Could Be Easier? A hosted version of the VM! But only for a few hours during this workshop Because it costs money to run these servers Go to http://bit.ly/XXX and pick a machine Do not share the URLs outside of this workshop! Please don't try to hack the machines Learn how I arrived at this setup at http://MiningTheSocialWeb.com
  • 24. Module 2: Mining Twitter
  • 25. Objectives Be able to identify Twitter primitives Understand tweet metadata and how to use it Learn how to extract entities such as user mentions, hashtags, and URLs from tweets Apply techniques for performing frequency analysis with Python Be able to plot histograms of Twitter data with IPython Notebook
  • 26. Twitter Primitives Account Types: "Anything" "Following" Relationships Favorites Retweets Replies (Almost) No Privacy Controls
  • 27. API Requests RESTful requests Everything is a "resource" You GET, PUT, POST, and DELETE resources Standard HTTP "verbs" Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining Streaming API filters JSON responses Cursors (not quite pagination)
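
As a minimal sketch of how the example request on the slide above is put together, the snippet below assembles the same URL from its parts using only Python's standard library. The helper name `user_timeline_url` is hypothetical, and a real request would also need OAuth credentials, which this sketch leaves out.

```python
from urllib.parse import urlencode

# Resource named in the slide's example request.
USER_TIMELINE = "https://api.twitter.com/1.1/statuses/user_timeline.json"


def user_timeline_url(screen_name, count=None):
    """Build the GET URL for a user's timeline (hypothetical helper)."""
    params = {"screen_name": screen_name}
    if count is not None:
        # Optional limit on how many tweets to ask for.
        params["count"] = count
    return USER_TIMELINE + "?" + urlencode(params)


print(user_timeline_url("SocialWebMining"))
```

The resource is the path and the query string only narrows it down, which is the sense in which everything is a "resource".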
  • 28. Twitter is an Interest Graph Roberto Mercedes Jorge Ana Nina Johnny Araya Rodolfo Hernández
  • 29. What's in a Tweet? 140 Characters ... ... Plus ~5KB of metadata! Authorship Time & location Tweet "entities" Replying, retweeting, favoriting, etc.
  • 30. What are Tweet Entities? Essentially, the "easy to get at" data in the 140 characters @usermentions #hashtags URLs multiple variations (financial) symbols stock tickers media
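
To make tweet entities concrete, here is a small sketch that pulls user mentions, hashtags, and URLs out of a tweet's `entities` field. The `sample_tweet` dictionary is made up for illustration and trimmed to the fields used here, and `extract_entities` is a hypothetical helper name.

```python
def extract_entities(tweet):
    """Collect the 'easy to get at' entities from one tweet dictionary."""
    entities = tweet.get("entities", {})
    return {
        "screen_names": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
    }


# Made-up tweet, reduced to the parts this sketch reads.
sample_tweet = {
    "text": "Hacking on #IPython with @ptwobrussell http://t.co/example",
    "entities": {
        "hashtags": [{"text": "IPython"}],
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "urls": [{"expanded_url": "http://MiningTheSocialWeb.com"}],
    },
}

print(extract_entities(sample_tweet))
```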
  • 31. Data Mining = Curiosity + Stats Curiosity Interests, desires, and intuitions Statistics Counting Comparing Filtering Ranking Hypothesis testing; knowledge discovery
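
Counting and ranking can be sketched in a few lines with `collections.Counter` from the standard library. The hashtag list below is invented sample data, not output from a real query.

```python
from collections import Counter

# Invented hashtags, as if collected from a batch of tweets.
hashtags = ["python", "data", "python", "ipython", "data", "python"]

# Counting: tally how often each hashtag occurs.
frequencies = Counter(hashtags)

# Ranking: most_common() sorts by descending count.
for tag, count in frequencies.most_common(2):
    print(tag, count)
```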
  • 32. Histograms A chart that is handy for frequency analysis They look like bar charts...except they're not bar charts Each value on the x-axis is a range (or "bin") of values Not categorical data Each value on the y-axis is the combined frequency of values in each range
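
The binning idea behind a histogram can be sketched without a plotting library: each value is mapped to the start of its range, and the y-value is how many values landed in that range. The retweet counts are invented, and `bin_values` is a hypothetical helper name.

```python
from collections import Counter


def bin_values(values, bin_width):
    """Return sorted (bin_start, frequency) pairs for non-negative integers."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return sorted(counts.items())


# Invented retweet counts for a handful of tweets.
retweet_counts = [1, 3, 12, 15, 18, 27]
bin_width = 10

# Crude text rendering: one row per bin, one '#' per value in the bin.
for start, frequency in bin_values(retweet_counts, bin_width):
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * frequency}")
```

In a notebook the same pairs could be handed to a plotting library; the text rendering just keeps the sketch dependency-free.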
  • 34. Social Media Analysis Framework A memorable four-step process to guide data science experiments: Aspire To test a hypothesis (answer a question) Acquire Get the data Analyze Count things Summarize Plot the results
  • 35. Exercises Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook Fill in Example 1-1 with credentials and begin work Execute each example sequentially Customize queries Explore tweet metadata; count tweet entities; plot histograms of results Explore the "Chapter 9 (Twitter Cookbook)" notebook Think of it as a collection of building blocks
  • 36. Module 3: Mining Facebook
  • 37. Objectives Be able to identify Facebook primitives Learn about Facebook’s Social Graph API and how to make API requests Understand how Open Graph protocol extends Facebook's Social Graph API Be able to analyze likes from Facebook pages and friends The Graph API v2 changes substantially tightened privacy and permissions See https://developers.facebook.com/docs/apps/changelog
  • 38. Facebook Primitives Account Types: People & Pages Mutual Connections Likes Shares Comments Extensive Privacy Controls
  • 39. Facebook is an Interest Graph Roberto Mercedes Jorge Ana Nina Johnny Araya Rodolfo Hernández
  • 40. Graph API Nodes Things Edges Connections between things Fields Info (properties) about things
  • 41. Facebook API Explorer Go to https://developers.facebook.com/tools/explorer Really, go there right now...
  • 43. Example Graph API Requests Social Graph API requests Easy to learn and use http://graph.facebook.com/me/feed http://graph.facebook.com/me/likes http://graph.facebook.com/me/?fields=id,name,friends.fields(likes.limit(10)) http://graph.facebook.com/Mining-the-Social-Web?fields=id,name,about,likes JSON responses Traditional pagination
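
"Traditional pagination" can be sketched as a loop that keeps following each response's `paging.next` link until there is none. The `fetch` argument is an assumption of this sketch: any callable that takes a URL and returns the decoded JSON response. The two fake pages stand in for real responses so the example runs offline.

```python
def iterate_items(fetch, url):
    """Yield every item across pages, following 'paging.next' links."""
    while url:
        page = fetch(url)
        yield from page.get("data", [])
        # No 'next' link means this was the last page.
        url = page.get("paging", {}).get("next")


# Fake responses keyed by made-up URLs, standing in for the network.
fake_pages = {
    "page-1": {"data": [{"name": "A"}, {"name": "B"}], "paging": {"next": "page-2"}},
    "page-2": {"data": [{"name": "C"}]},
}

names = [item["name"] for item in iterate_items(fake_pages.__getitem__, "page-1")]
print(names)
```

Against the live API, `fetch` could be something like a function that issues an HTTP GET with an access token and decodes the JSON body.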
  • 45. Permissions Prior to Graph v2
  • 46. Permissions as of Graph v2
  • 47. Explore Facebook Pages Names of pages MiningTheSocialWeb CrossFit OReilly Web URLs (OGP extensions to Facebook's Social Graph) http://www.imdb.com/title/tt0117500
  • 48. Social Media Analysis Framework Recall the same four-step process to guide data science experiments: Aspire Acquire Analyze Summarize
  • 49. Exercises Copy/paste your access token from the Graph API Explorer into the "Chapter 2 (Mining Facebook)" notebook Execute and tinker with Examples 2-1 through 2-6 Inspect content in your feed Juxtapose public figures Compare/contrast similar products/brands of interest
  • 50. Module 4: Mining LinkedIn
  • 51. Objectives Learn about LinkedIn’s Developer Platform Understand how clustering works A fundamental type of machine learning Be able to employ geocoding services to arrive at a set of coordinates from a textual reference to a location Visualize geographic data with cartograms
  • 52. LinkedIn Primitives Account Types: People, Groups, Companies, Jobs And Activity Streams Data is typically perceived as being more sensitive Richest data source? (Think: LinkedIn's business model) Profile descriptions from mutual connections A little messier than it first appears
  • 53. API Requests HTTP-based Requests Field selector syntax http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url) XML responses CSV address book download
  • 54. Is LinkedIn an Interest Graph? Fundamentally: yes. The developer API requires you to do a bit of work to model it Less trivial to find some of the "pivots" e.g. There's no public Skills API for developers But the data is there (mostly in profile descriptions) for your direct connections Companies, job titles, job descriptions Lots of richness is tucked away in human language data
  • 55. Clustering An unsupervised machine learning technique Think: an algorithm that organizes the data into partitions
  • 57. 3 Steps to Clustering Your Data Normalization Compare (similarity/distance measurement) n-grams, edit distance, and Jaccard are common, but your imagination is the limit Why can't you just compare everything to everything? Dimensionality Reduction Ideally, your clustering algorithm will mitigate the pain k-means is among the most common clustering techniques in use
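
As one example of the "compare" step, here is a minimal sketch of Jaccard similarity on job titles: normalize each title to a set of lowercase words, then divide the size of the intersection by the size of the union. The function name, the whitespace tokenization, and the choice to return 0.0 for two empty titles are assumptions of this sketch.

```python
def jaccard_similarity(title_a, title_b):
    """Jaccard similarity of two titles, compared as sets of lowercase words."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    union = words_a | words_b
    if not union:
        # Two empty titles: treated as dissimilar in this sketch.
        return 0.0
    return len(words_a & words_b) / len(union)


# Two shared words ("chief", "officer") out of four distinct words.
print(jaccard_similarity("Chief Technology Officer", "Chief Executive Officer"))
```

Comparing every title to every other title takes on the order of n² comparisons, which is one reason the slide asks why you can't simply compare everything to everything.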
  • 59. k-Means Explained 1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, ..., Kk. 2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k*n comparisons. 3. For each of the k clusters, calculate the centroid of the cluster and reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each iteration of the algorithm.) 4. Repeat steps 2–3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.
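
The four steps above can be sketched in plain Python (3.8+ for `math.dist`). This is an illustrative sketch, not the code from the book's notebooks. Two simplifications are assumed: the initial values are sampled from the data points themselves rather than from anywhere in the data space, and a cluster that ends up empty keeps its previous centroid.

```python
import math
import random


def k_means(points, k, seed=0, max_iterations=100):
    """Cluster tuples of coordinates; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Step 1: pick k initial values (here: k of the data points).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (k*n comparisons).
        new_assignments = [
            min(range(k), key=lambda i: math.dist(point, centroids[i]))
            for point in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    clusters = [[p for p, a in zip(points, assignments) if a == i] for i in range(k)]
    return centroids, clusters


# Two well-separated groups of made-up 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
print(clusters)
```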
  • 65. Geocoding Transforming a location to a set of coordinates Nashville, TN => (36.16783905029297, -86.77816009521484) A harder problem than it first appears The Bing API is especially generous Requires an account sign up: http://bingmapsportal.com Use the API key with the geopy package
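
A hedged sketch of the geocoding step: the helper below accepts any object with a geopy-style `geocode()` method and returns a `(latitude, longitude)` pair, or `None` when nothing is found. `to_coordinates` and `StubGeocoder` are hypothetical names; the stub replaces the network call so the example runs without an API key, and it simply returns the coordinates quoted on the slide.

```python
from types import SimpleNamespace


def to_coordinates(geocoder, place):
    """Resolve a textual location to (latitude, longitude), or None."""
    location = geocoder.geocode(place)
    if location is None:
        return None
    return (location.latitude, location.longitude)


class StubGeocoder:
    """Offline stand-in that mimics the shape of a geocoder's result."""

    def geocode(self, place):
        known = {"Nashville, TN": (36.16783905029297, -86.77816009521484)}
        if place not in known:
            return None
        latitude, longitude = known[place]
        return SimpleNamespace(latitude=latitude, longitude=longitude)


print(to_coordinates(StubGeocoder(), "Nashville, TN"))

# With geopy installed and a Bing Maps key, usage would look roughly like:
#   from geopy.geocoders import Bing
#   to_coordinates(Bing(api_key="YOUR_KEY"), "Nashville, TN")
```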
  • 66. Introducing: The Dorling Cartogram
  • 67. Social Media Analysis Framework Remember: Use the same four-step process to guide data science experiments: Aspire Acquire Analyze Summarize
  • 68. Exercises Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API connection and follow along with the first few examples Download your connections as a CSV file from http://www.linkedin.com/people/export-settings and save them to your VM A deviation from instructions in Example 3-6 is necessary for remote VMs See http://bit.ly/mtsw-ch03-helper-code Try clustering your contacts in Example 3-12 Use the python-linkedin client to tap into Activity Streams See https://developer.linkedin.com/documents/get-network-updates-and-statistics-api
  • 70. Objectives To work on "loose ends" or areas of interest from previous modules To hack on code in notebooks not yet encountered To set up the virtual machine on your own box if you haven't yet To collaborate/talk and otherwise make the most of our togetherness
  • 71. Social Media Analysis Framework Remember: Aspire Acquire Analyze Summarize
  • 72. Recommendations Set up your own development environment if you haven't already Appendix A Text Mining & Natural Language Processing Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages) Graph Mining Chapter 7 (Mining GitHub)
  • 73. Module 6: Privacy & Ethics
  • 74. Know thy data, and know thyself --Matthew A. Russell
  • 75. If we have data, let’s look at data. If we have opinions, let’s go with mine --Jim Barksdale
  • 76. In God we trust. All others must bring data --W. Edwards Deming
  • 77. Communication => Data Communication Senders humans & machines Messages natural language, images, videos, etc. Recipients humans & machines
  • 78. Data Alchemy Data: Documents & document fragments (text messages, etc.) Information: "Assertions", summaries, tags, etc. Knowledge: Aggregated, queryable information Wisdom: “Compressed” knowledge Gold: Money
  • 79. Machine Learning A program that learns (improves) from experience (data) according to some objective Supervised learning Unsupervised learning Reinforcement learning How to do it Program mathematical models and hope for the best... How to do it well Program state-of-the-art mathematical models with sufficient representative data
  • 80. Knowledge is a process of piling up facts; wisdom lies in their simplification --Martin Fischer
  • 81. Any sufficiently advanced technology is indistinguishable from magic --Arthur C. Clarke
  • 82. Is Privacy Already an Illusion? Digital happenings circa 2014 The Cloud Social Media Deep Learning The Internet of Things Internet.org
  • 83. Civilization is the progress toward a society of privacy... -- Ayn Rand
  • 84. If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place. -- Eric Schmidt, (former) CEO of Google
  • 85. Influences on Ethics Capitalism, economics, & marketing A for-profit corporation's fiduciary duty: To maximize the common stock's value How to do it? By transacting commerce How to do it well? By advertising more effectively than competitors How to do it really well? With highly relevant personalized ads (recommenders) Terms of Service (ToS) - The legal extent of ethical obligations?
  • 86. Module 7: Final Q&A
  • 87. Free Stuff http://MiningTheSocialWeb.com Mining the Social Web 2E Chapter 1 (Chimera) http://bit.ly/13XgNWR Source Code (GitHub) http://bit.ly/MiningTheSocialWeb2E http://bit.ly/1fVf5ej (numbered examples) Screencasts (Vimeo) http://bit.ly/mtsw2e-screencasts