Mining the Geo Needles in the Social Haystack
(Where 2.0, 2011)

Matthew A. Russell
http://linkedin.com/in/ptwobrussell
@ptwobrussell
About Me

• VP of Engineering @ Digital Reasoning Systems
• Principal @ Zaffra
• Author of Mining the Social Web et al.
• Triathlete-in-training

                                                  @SocialWebMining
Objectives

• Orientation to geo data in the social web space
• Hands-on exercises for analyzing/visualizing geo data
• Whet your appetite and send you away motivated and with useful
 tools/insight



Approximate Schedule


• Microformats: 10 minutes
• Twitter: 15 minutes
• LinkedIn: 15 minutes
• Facebook: 15 minutes
• Text-mining: 15 minutes
• General Q&A (time-permitting)
Development

• Your local machine
• Python version 2.{6,7}
  • Recommend Windows users try ActivePython
• We'll handle the rest along the way


Microformats



               Agile Data Solutions
Microformats

• My definition: "conventions for unambiguously including structured
 data into web pages in an entirely value-added way" (MTSW, p19)
• Bookmark and browse: http://microformats.org
• Examples:
  • geo, hCard, hEvent, hResume, XFN

geo
<!-- Download MTSW pp 30-34 from XXX -->

<!-- The multiple class approach -->
<span style="display: none" class="geo">
  <span class="latitude">36.166</span>
  <span class="longitude">-86.784</span>
</span>

<!-- When used as one class, the separator must be a semicolon -->
<span style="display: none" class="geo">36.166; -86.784</span>
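Both forms of the geo markup above can be read with nothing but the standard library. What follows is a minimal sketch, not the code from MTSW: the `GeoParser` class and its attribute names are made up for illustration, it assumes simple, well-nested markup like the snippet on this slide, and it is written for Python 3 rather than the 2.x used elsewhere in this deck.

```python
from html.parser import HTMLParser


class GeoParser(HTMLParser):
    """Collect (latitude, longitude) pairs from elements marked class="geo"."""

    def __init__(self):
        super().__init__()
        self.points = []    # finished (lat, lon) tuples
        self._depth = 0     # nesting depth inside the current geo element
        self._field = None  # "latitude" or "longitude" while inside that child
        self._parts = {}    # child values seen in the current geo element
        self._text = []     # text seen outside latitude/longitude children

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if "geo" in classes:
                self._depth, self._parts, self._text = 1, {}, []
            return
        self._depth += 1
        for name in ("latitude", "longitude"):
            if name in classes:
                self._field = name

    def handle_data(self, data):
        if self._depth == 0:
            return
        if self._field:
            self._parts[self._field] = data.strip()
        else:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._depth == 0:
            return
        self._depth -= 1
        self._field = None
        if self._depth == 0:
            if len(self._parts) == 2:   # the multiple class approach
                lat, lon = self._parts["latitude"], self._parts["longitude"]
            else:                       # one class, semicolon separator
                lat, lon = "".join(self._text).split(";")
            self.points.append((float(lat), float(lon)))


SAMPLE = """
<span style="display: none" class="geo">
  <span class="latitude">36.166</span>
  <span class="longitude">-86.784</span>
</span>
<span style="display: none" class="geo">36.166; -86.784</span>
"""

parser = GeoParser()
parser.feed(SAMPLE)
print(parser.points)  # [(36.166, -86.784), (36.166, -86.784)]
```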
Exercise!

• View source @ http://en.wikipedia.org/wiki/List_of_U.S._national_parks
• Use http://microform.at to extract the geo data as KML
  • http://microform.at/?type=geo&url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FList_of_U.S._national_parks
  • Try pasting this URL into Google Maps and see what happens
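If you would rather not percent-encode the page URL by hand, the query string can be built programmatically. This is a small sketch for Python 3; the helper name `microformat_kml_url` is made up for illustration, and it assumes only the `type` and `url` parameters shown in the exercise.

```python
from urllib.parse import urlencode


def microformat_kml_url(page_url, fmt="geo"):
    """Build a microform.at extraction URL for the given page."""
    # urlencode percent-encodes ':' and '/' so the page URL survives as a value
    return "http://microform.at/?" + urlencode({"type": fmt, "url": page_url})


print(microformat_kml_url("http://en.wikipedia.org/wiki/List_of_U.S._national_parks"))
# http://microform.at/?type=geo&url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FList_of_U.S._national_parks
```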

Exercise Results

• Feel free to hack on the KML
  • http://code.google.com/apis/kml/documentation/
• Google Earth can be fun too
  • But you already knew that
  • We'll see it later...


Twitter



Twitter Data

• There's geo data in the user profile
• And in tweets...
  • ...if the user enabled it in their prefs
• And even in the 140 chars of the tweet itself


A Tweet as JSON
{
    "user" : {
        "name" : "Matthew Russell",
        "description" : "Author of Mining the Social Web; International Sex Symbol",
        "location" : "Franklin, TN",
        "screen_name" : "ptwobrussell",
        ...
    },
    "geo" : { "type" : "Point", "coordinates" : [36.166, -86.784]},
    "text" : "Franklin, TN is the best small town in the whole wide world #WIN",
    ...
}
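Given a tweet shaped like the one above, a reasonable first pass is to prefer the structured "geo" point and fall back to the free-text profile location. The sketch below is illustrative only: `tweet_location` is a made-up helper, it assumes the `[latitude, longitude]` ordering seen in the "geo" field, and it is written for Python 3.

```python
def tweet_location(tweet):
    """Return ("geo", (lat, lon)), ("profile", text), or (None, None)."""
    geo = tweet.get("geo")
    if geo and geo.get("type") == "Point":
        lat, lon = geo["coordinates"]  # assumed order: latitude, longitude
        return "geo", (lat, lon)
    profile = (tweet.get("user") or {}).get("location")
    if profile:
        return "profile", profile      # still needs geocoding
    return None, None


tweets = [
    {"user": {"location": "Franklin, TN"},
     "geo": {"type": "Point", "coordinates": [36.166, -86.784]}},
    {"user": {"location": "Franklin, TN"}, "geo": None},
    {"user": {}, "geo": None},
]
for tweet in tweets:
    print(tweet_location(tweet))
```

The profile fallback returns plain text, so it would still have to go through a geocoder before it can be plotted.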


Exercise!
• In your browser, try accessing this URL:
  http://api.twitter.com/1/users/show.json?screen_name=ptwobrussell

• In a terminal with Python, try it programmatically:
  $ sudo easy_install twitter # 1.6.1 is the current version
  $ python
  >>> import twitter
  >>> t = twitter.Twitter()
  >>> user = t.users.show(screen_name='ptwobrussell')
  >>> import json
  >>> print json.dumps(user, indent=2)
Recipe #21


• Geocode locations in profiles:
  • https://github.com/ptwobrussell/Recipes-for-Mining-Twitter/blob/master/recipe__geocode_profile_locations.py
  • Recipe #21 from 21 Recipes for Mining Twitter

Sample Results
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
  <Folder>
    <name>Geocoded profiles for Twitterers showing up in search results for ... </name>
    <Placemark>
      <Style>
        <LineStyle>
          <color>cc0000ff</color>
          <width>5.0</width>
        </LineStyle>
      </Style>
      <name>Paris</name>
      <Point>
        <coordinates>2.3509871,48.8566667,0</coordinates>
      </Point>
    </Placemark>
    ...
  </Folder>
</kml>
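Output along these lines is easy to produce yourself once you have names and coordinates. The following is a rough sketch using the standard library, not the recipe's actual code: `placemarks_to_kml` is a made-up name, the `<Style>` element is omitted for brevity, and it targets Python 3. Note that KML wants longitude before latitude.

```python
import xml.etree.ElementTree as ET


def placemarks_to_kml(folder_name, places):
    """Serialize (name, lat, lon) tuples as a single-folder KML document."""
    kml = ET.Element("kml", xmlns="http://earth.google.com/kml/2.0")
    folder = ET.SubElement(kml, "Folder")
    ET.SubElement(folder, "name").text = folder_name
    for name, lat, lon in places:
        placemark = ET.SubElement(folder, "Placemark")
        ET.SubElement(placemark, "name").text = name
        point = ET.SubElement(placemark, "Point")
        # KML coordinate order is longitude,latitude,altitude
        ET.SubElement(point, "coordinates").text = "%s,%s,0" % (lon, lat)
    return ET.tostring(kml, encoding="unicode")


doc = placemarks_to_kml("Geocoded profiles", [("Paris", 48.8566667, 2.3509871)])
print(doc)
```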
Recipe #20


• Visualizing results with a Dorling Cartogram:
  • https://github.com/ptwobrussell/Recipes-for-Mining-Twitter/blob/master/recipe__dorling_cartogram.py
  • Recipe #20 from 21 Recipes for Mining Twitter

Sample Results




Recipe #22 (?!?)

• Extracting "geo" fields from a batch of search results
  • https://github.com/ptwobrussell/Recipes-for-Mining-Twitter/blob/master/recipe__geocode_tweets.py
  • Not in current edition of 21 Recipes for Mining Twitter
    • Just checked in especially for you

Sample Results
• Unfortunately (???), "geo" data for tweets seems really scarce
• Varies according to a particular user's privacy mindset?
• Examining only Twitter users who enable "geo" would be interesting in and of itself

[None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, {u'type': u'Point', u'coordinates':
[32.802900000000001, -96.828100000000006]}, {u'type':
u'Point', u'coordinates': [33.793300000000002, -117.852]},
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, {u'type': u'Point', u'coordinates':
[35.512099999999997, -97.631299999999996]}, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None,
None]
Mining the 140 Characters



• Not a trivial exercise
• Mining natural language data is hard
  • Mining bastardized natural language data is even harder
• We'll look at mining natural language data later


Fun Possibilities




#JustinBieber           #TeaParty
Oh, and by the way...




OAuth 1.0a - Now
import twitter
from twitter.oauth_dance import oauth_dance

# Get these from http://dev.twitter.com/apps/new
consumer_key, consumer_secret = 'key', 'secret'

(oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
                                       consumer_key, consumer_secret)

auth=twitter.oauth.OAuth(oauth_token, oauth_token_secret,
                         consumer_key, consumer_secret)

t = twitter.Twitter(domain='api.twitter.com', auth=auth)
OAuth 2.0 - "Soon"
     +----------+          Client Identifier      +---------------+
     |         -+----(A)--- & Redirect URI ------>|               |
     | End-user |                                 | Authorization |
     |    at    |<---(B)-- User authenticates --->|     Server    |
     | Browser  |                                 |               |
     |         -+----(C)-- Authorization Code ---<|               |
     +-|----|---+                                 +---------------+
       |    |                                         ^      v
      (A)  (C)                                        |      |
       |    |                                         |      |
       ^    v                                         |      |
     +---------+                                      |      |
     |         |>---(D)-- Client Credentials, --------'      |
     |   Web   |          Authorization Code,                |
     |  Client |            & Redirect URI                   |
     |         |                                             |
     |         |<---(E)----- Access Token -------------------'
     +---------+       (w/ Optional Refresh Token)

          See http://tools.ietf.org/html/draft-ietf-oauth-v2-10#section-1.4.1
LinkedIn



LinkedIn Data

• Coarsely grained geo data is available in user profiles
  • "Greater Nashville Area", "San Francisco Bay", etc.
  • Most geocoders don't seem to recognize these names...
  • No geocoordinates! (Yet???)
• Mitigation approach: (1) transform/normalize and then (2) geocode
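The two-step mitigation can be sketched in a few lines. Everything here is an illustrative assumption rather than the book's code: the `TRANSFORMS` list only covers the two examples on this slide, `normalize_location` and `geocode_all` are made-up names, and the geocoder is passed in as a plain callable so you can plug in whichever service you use. Written for Python 3.

```python
# (1) transform/normalize: string rewrites that turn LinkedIn-style region
# names into something a geocoder is more likely to recognize (illustrative)
TRANSFORMS = [
    ("Greater ", ""),
    (" Area", ""),
    ("San Francisco Bay", "San Francisco"),
]


def normalize_location(name):
    for old, new in TRANSFORMS:
        name = name.replace(old, new)
    return name.strip()


# (2) geocode: look up each distinct normalized name once
def geocode_all(names, geocode):
    """Map normalized names to whatever the supplied geocode callable returns."""
    cache = {}
    for name in names:
        query = normalize_location(name)
        if query not in cache:
            cache[query] = geocode(query)
    return cache


print(normalize_location("Greater Nashville Area"))  # Nashville
print(normalize_location("San Francisco Bay"))       # San Francisco
```

Caching by normalized name keeps the number of geocoder calls down when many profiles share the same region.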

Exercise!
• Get an API key at http://code.google.com/apis/maps/signup.html
$ easy_install geopy
$ python
>>> import geopy
>>> g = geopy.geocoders.Google(GOOGLE_MAPS_API_KEY)
>>> results = g.geocode("Nashville", exactly_one=False)
>>> for r in results:
...    print r # (u'Nashville, TN, USA', (36.165889, -86.784443))
• See also https://github.com/ptwobrussell/Recipes-for-Mining-Twitter/blob/master/etc/geocoding_pattern.py
Diving Deeper

• Example 6-14 from MTSW (pp194-195) works through an extended example
 and dumps KML output that includes clustered output
 • See http://github.com/ptwobrussell/Mining-the-Social-Web/python_code/linkedin__geocode.py



Clustering

• First half of MTSW Chapter 6 (pp167-188) provides a good/detailed intro
• Think of clustering as "approximate matching"
  • The task of grouping items together according to a similarity metric
• It's among the most useful algorithmic techniques in all of data mining
  • The catch: It's a hard problem.
• What do you name the clusters once you've created them?
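To make "approximate matching" concrete, here is a small greedy clustering sketch. It assumes Jaccard similarity over lowercase word sets as the similarity metric and a threshold of 0.5; both are illustrative choices, as are the function names, and the code is written for Python 3.

```python
def jaccard(a, b):
    """Similarity of two strings as overlap of their lowercase word sets."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def greedy_cluster(items, threshold=0.5):
    """Put each item in the first cluster whose first member is similar enough."""
    clusters = []
    for item in items:
        for cluster in clusters:
            if jaccard(item, cluster[0]) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])  # nothing matched: start a new cluster
    return clusters


locations = ["Greater Nashville Area", "Nashville Area",
             "San Francisco Bay", "San Francisco Bay Area"]
for cluster in greedy_cluster(locations):
    print(cluster)
```

The greedy pass depends on input order and never revisits a decision, which is part of why naming and evaluating the resulting clusters is hard.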
Example Output




Better Data Exploration




Clustering Approaches


• Agglomerative (hierarchical)
• Greedy
• Approximate
  • k-means


k-Means Algorithm
1. Randomly pick k points in the data space as initial values that will be used to compute the
   k clusters: K1, K2, ..., Kk.

2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating
   k clusters and requiring k*n comparisons.

3. For each of the k clusters, calculate the centroid (the mean of the cluster) and reassign
   its Ki value to be that value. (Hence, you’re computing “k-means” during each iteration of the
   algorithm.)

4. Repeat steps 2–3 until the members of the clusters do not change between iterations.
   Generally speaking, relatively few iterations are required for convergence.

Let's try it: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
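The four steps can be sketched in pure Python as follows. This is a minimal illustration, not the MTSW implementation: it works on 2-D points only, uses squared Euclidean distance, and (as a simplifying assumption) seeds the centroids with k randomly chosen data points rather than arbitrary points in the data space. Written for Python 3.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster 2-D points; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: initial values K1..Kk
    assignment = None
    for _ in range(max_iter):
        # step 2: assign every point to its nearest centroid (k*n comparisons)
        new_assignment = [
            min(range(k),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2)
            for x, y in points
        ]
        # step 4: stop once cluster membership no longer changes
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # step 3: move each centroid to the mean of its members
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```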
Step 0 (init)
Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Step 7
Step 8
Step 9 (done)
k-Means Applied




Facebook



Facebook Data

• Ridiculous amounts of data (all kinds) are available via the FB Platform
• Current location, hometown, "checkins"
• Access to the FB platform data is relatively painless:
  • Social Graph: http://developers.facebook.com/docs/reference/api/
  • FQL: http://developers.facebook.com/docs/reference/fql/

FQL Checkins
• See http://developers.facebook.com/docs/reference/fql/checkin/




FQL Connections
• See http://developers.facebook.com/docs/reference/fql/connection/




Sample FQL
• An excerpt from MTSW Example 9-18 (pp306-308) conveys the gist:
fql = FQL(ACCESS_TOKEN)

q = """select name, current_location, hometown_location
       from user
       where uid in
         (select target_id
          from connection
          where source_id = me() and target_type = 'user')"""

results = fql.query(q)

Example "App"

     • Basic idea is simple
     • You already have the tools to
      geocode and plot on a map...
     • See also: http://answers.oreilly.com/topic/2555-a-data-driven-game-using-facebook-data/
FB Platform Demo

• Minimal sample app at http://miningthesocialweb.appspot.com
• Source is at http://github.com/ptwobrussell/Mining-the-Social-Web/web_code/facebook_gae_demo_app




Text Mining



References


• MTSW Chapter 7 (Google Buzz: TF-IDF, Cosine Similarity, and Collocations)
• MTSW Chapter 8 (Blogs et al.: Natural Language Processing and Beyond)




"Legacy" NLP

• "Legacy" => Classic Information Retrieval (IR) techniques
  • Often (but not always) uses a "bag of words" model
  • tf-idf metric is usually the root of the core strategy
  • Variations on cosine similarity are often the fruition
  • Additional higher order analytics are possible, but inevitably
   cannot be optimal for deep semantic analysis
• Virtually every A-list search engine has started here
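As a rough illustration of the bag-of-words, tf-idf, and cosine similarity ideas above, here is a self-contained sketch. It assumes whitespace tokenization, term frequency normalized by document length, and idf = log(N / df); those are common textbook choices rather than the exact formulas from MTSW, and the function names are made up. Written for Python 3.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document (bag of words)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values())) *
            math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


docs = ["nashville is in tennessee",
        "franklin is in tennessee",
        "paris is in france"]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]))  # shares "tennessee" with the first document
print(cosine(v[0], v[2]))  # shares only words that occur in every document
```

Terms that appear in every document get an idf of zero, so "is" and "in" contribute nothing to the similarity.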
A Vector Space




How might you discover locations from text using "legacy" techniques?




Some possibilities
• Combinations of language dependent "hacks"
  • n-gram detection/examination
    • bigrams, trigrams, etc.
  • "Proper Case" hints
    • "Chipotle Mexican Grill"
  • prepositional phrase cues
    • "in the garden", "at the store"
  • Gazetteers
    • lists of "well-known" locations like "Statue of Liberty"
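A few of these hacks can be combined in a handful of lines. The sketch below is deliberately naive and English-only: the tiny `GAZETTEER`, the list of cue prepositions, and the "Proper Case" regular expression are all illustrative assumptions, and `location_candidates` is a made-up name. Written for Python 3.

```python
import re

# A toy gazetteer; a real one would hold many thousands of entries
GAZETTEER = {"Statue of Liberty", "Franklin, TN"}

# One or more capitalized words in a row, e.g. "Chipotle Mexican Grill"
PROPER = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"

# Prepositional cue followed by a "Proper Case" phrase
CUE = re.compile(r"\b(?:in|at|near|from)\s+(?:the\s+)?(" + PROPER + ")")


def location_candidates(text):
    """Return candidate place names found by gazetteer lookup or cue + case."""
    found = {name for name in GAZETTEER if name in text}
    found.update(m.group(1) for m in CUE.finditer(text))
    return found


text = "We ate at Chipotle Mexican Grill after visiting the Statue of Liberty."
print(sorted(location_candidates(text)))
# ['Chipotle Mexican Grill', 'Statue of Liberty']
```

Expect plenty of false positives from the cue pattern on real text; it only proposes candidates.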
"Modern" NLP Pipeline


• A deeper "understanding" of the data is much harder
  • End of Sentence (EOS) Detection
  • Tokenization
  • Part-of-Speech Tagging
  • Chunking
  • Anaphora Resolution
  • Extraction
  • Entity Resolution
• Blending in "legacy" IR techniques can be very helpful in reducing noise
Entity Interactions




Quality Metrics

       • Precision = TP/(TP+FP)
       • Recall = TP/(TP+FN)
       • F1 = (2*P*R)/(P+R)
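These three formulas translate directly into code. The sketch below assumes you are comparing a set of extracted entities against a hand-labeled set; the function name and the sample data are made up for illustration. Written for Python 3.

```python
def precision_recall_f1(predicted, actual):
    """Score a set of predicted entities against the known-correct set."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # true positives: found and correct
    fp = len(predicted - actual)   # false positives: found but wrong
    fn = len(actual - predicted)   # false negatives: missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


predicted = {"Paris", "Nashville", "Statue"}
actual = {"Paris", "Nashville", "Franklin", "Dallas"}
print(precision_recall_f1(predicted, actual))
```

Here TP = 2, FP = 1, and FN = 2, so precision is 2/3, recall is 1/2, and F1 is 4/7.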



Exercise!
• Get a webpage:
  • curl http://example.com/foo.html > foo.html
• Extract the text:
  • curl -d @foo.html "http://www.datasciencetoolkit.org/html2story" > foo.json
• Extract the locations:
  • curl -d @foo.json "http://www.datasciencetoolkit.org/text2places"
• NOTE: Windows users can work directly at http://www.datasciencetoolkit.org
Tools to Investigate



• NLTK - http://nltk.org
• Data Science Toolkit - http://www.datasciencetoolkit.org
• WordNet - http://wordnet.princeton.edu/



Q&A



The End




Customizable Contents Restoration TrainingCalvinarnold843
 

Recently uploaded (20)

WSMM Technology February.March Newsletter_vF.pdf
WSMM Technology February.March Newsletter_vF.pdfWSMM Technology February.March Newsletter_vF.pdf
WSMM Technology February.March Newsletter_vF.pdf
 
Planetary and Vedic Yagyas Bring Positive Impacts in Life
Planetary and Vedic Yagyas Bring Positive Impacts in LifePlanetary and Vedic Yagyas Bring Positive Impacts in Life
Planetary and Vedic Yagyas Bring Positive Impacts in Life
 
Data Analytics Strategy Toolkit and Templates
Data Analytics Strategy Toolkit and TemplatesData Analytics Strategy Toolkit and Templates
Data Analytics Strategy Toolkit and Templates
 
5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf
5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf
5-Step Framework to Convert Any Business into a Wealth Generation Machine.pdf
 
Go for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptx
Go for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptxGo for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptx
Go for Rakhi Bazaar and Pick the Latest Bhaiya Bhabhi Rakhi.pptx
 
Toyota and Seven Parts Storage Techniques
Toyota and Seven Parts Storage TechniquesToyota and Seven Parts Storage Techniques
Toyota and Seven Parts Storage Techniques
 
trending-flavors-and-ingredients-in-salty-snacks-us-2024_Redacted-V2.pdf
trending-flavors-and-ingredients-in-salty-snacks-us-2024_Redacted-V2.pdftrending-flavors-and-ingredients-in-salty-snacks-us-2024_Redacted-V2.pdf
trending-flavors-and-ingredients-in-salty-snacks-us-2024_Redacted-V2.pdf
 
How To Simplify Your Scheduling with AI Calendarfly The Hassle-Free Online Bo...
How To Simplify Your Scheduling with AI Calendarfly The Hassle-Free Online Bo...How To Simplify Your Scheduling with AI Calendarfly The Hassle-Free Online Bo...
How To Simplify Your Scheduling with AI Calendarfly The Hassle-Free Online Bo...
 
The Bizz Quiz-E-Summit-E-Cell-IITPatna.pptx
The Bizz Quiz-E-Summit-E-Cell-IITPatna.pptxThe Bizz Quiz-E-Summit-E-Cell-IITPatna.pptx
The Bizz Quiz-E-Summit-E-Cell-IITPatna.pptx
 
How to Conduct a Service Gap Analysis for Your Business
How to Conduct a Service Gap Analysis for Your BusinessHow to Conduct a Service Gap Analysis for Your Business
How to Conduct a Service Gap Analysis for Your Business
 
Introducing the AI ShillText Generator A New Era for Cryptocurrency Marketing...
Introducing the AI ShillText Generator A New Era for Cryptocurrency Marketing...Introducing the AI ShillText Generator A New Era for Cryptocurrency Marketing...
Introducing the AI ShillText Generator A New Era for Cryptocurrency Marketing...
 
1911 Gold Corporate Presentation Apr 2024.pdf
1911 Gold Corporate Presentation Apr 2024.pdf1911 Gold Corporate Presentation Apr 2024.pdf
1911 Gold Corporate Presentation Apr 2024.pdf
 
Technical Leaders - Working with the Management Team
Technical Leaders - Working with the Management TeamTechnical Leaders - Working with the Management Team
Technical Leaders - Working with the Management Team
 
Excvation Safety for safety officers reference
Excvation Safety for safety officers referenceExcvation Safety for safety officers reference
Excvation Safety for safety officers reference
 
MEP Plans in Construction of Building and Industrial Projects 2024
MEP Plans in Construction of Building and Industrial Projects 2024MEP Plans in Construction of Building and Industrial Projects 2024
MEP Plans in Construction of Building and Industrial Projects 2024
 
71368-80-4.pdf Fast delivery good quality
71368-80-4.pdf Fast delivery  good quality71368-80-4.pdf Fast delivery  good quality
71368-80-4.pdf Fast delivery good quality
 
How Generative AI Is Transforming Your Business | Byond Growth Insights | Apr...
How Generative AI Is Transforming Your Business | Byond Growth Insights | Apr...How Generative AI Is Transforming Your Business | Byond Growth Insights | Apr...
How Generative AI Is Transforming Your Business | Byond Growth Insights | Apr...
 
14680-51-4.pdf Good quality CAS Good quality CAS
14680-51-4.pdf  Good  quality CAS Good  quality CAS14680-51-4.pdf  Good  quality CAS Good  quality CAS
14680-51-4.pdf Good quality CAS Good quality CAS
 
Unveiling the Soundscape Music for Psychedelic Experiences
Unveiling the Soundscape Music for Psychedelic ExperiencesUnveiling the Soundscape Music for Psychedelic Experiences
Unveiling the Soundscape Music for Psychedelic Experiences
 
Customizable Contents Restoration Training
Customizable Contents Restoration TrainingCustomizable Contents Restoration Training
Customizable Contents Restoration Training
 

Mining the Geo Needles in the Social Haystack

  • 1. Mining the Geo Needles in the Social Haystack (Where 2.0, 2011) Matthew A. Russell http://linkedin.com/in/ptwobrussell @ptwobrussell
  • 2. About Me • VP of Engineering @ Digital Reasoning Systems • Principal @ Zaffra • Author of Mining the Social Web et al. • Triathlete-in-training @SocialWebMining 2
  • 3. Objectives • Orientation to geo data in the social web space • Hands-on exercises for analyzing/visualizing geo data • Whet your appetite and send you away motivated and with useful tools/insight 3
  • 4. Approximate Schedule • Microformats: 10 minutes • Twitter: 15 minutes • LinkedIn: 15 minutes • Facebook: 15 minutes • Text-mining: 15 minutes • General Q&A (time-permitting) 4
  • 5. Development • Your local machine • Python version 2.{6,7} • Recommend Windows users try ActivePython • We'll handle the rest along the way 5
  • 6. Microformats Agile Data Solutions
  • 7. Microformats • My definition: "conventions for unambiguously including structured data into web pages in an entirely value-added way" (MTSW, p19) • Bookmark and browse: http://microformats.org • Examples: • geo, hCard, hEvent, hResume, XFN 7
  • 8. geo
      <!-- Download MTSW pp 30-34 from XXX -->

      <!-- The multiple class approach -->
      <span style="display: none" class="geo">
        <span class="latitude">36.166</span>
        <span class="longitude">-86.784</span>
      </span>

      <!-- When used as one class, the separator must be a semicolon -->
      <span style="display: none" class="geo">36.166; -86.784</span>
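
A minimal sketch of how the two geo forms above could be pulled out of a page with only the Python 3 standard library (the slides themselves target Python 2.6/2.7). The names `GeoParser` and `extract_geo` are hypothetical, and the parser assumes reasonably well-formed markup in which every element inside a geo element has a closing tag.

```python
from html.parser import HTMLParser


class GeoParser(HTMLParser):
    """Collect (latitude, longitude) pairs from elements marked class="geo"."""

    def __init__(self):
        super().__init__()
        self.points = []    # finished (lat, lon) tuples
        self._depth = 0     # nesting depth inside the current geo element (0 = outside)
        self._field = None  # "latitude" / "longitude" while inside such a child
        self._parts = {}    # text collected for the multiple-class form
        self._text = []     # text collected for the single-class form

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0:
            if "geo" in classes:
                self._depth = 1
                self._field, self._parts, self._text = None, {}, []
            return
        self._depth += 1
        if "latitude" in classes:
            self._field = "latitude"
        elif "longitude" in classes:
            self._field = "longitude"

    def handle_data(self, data):
        if self._depth == 0:
            return
        if self._field:
            self._parts[self._field] = self._parts.get(self._field, "") + data
        else:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._depth == 0:
            return
        self._depth -= 1
        self._field = None
        if self._depth == 0:
            self._finish()

    def _finish(self):
        if "latitude" in self._parts and "longitude" in self._parts:
            lat, lon = self._parts["latitude"], self._parts["longitude"]
        else:
            # Single-class form: "lat; lon" separated by a semicolon.
            lat, _, lon = "".join(self._text).partition(";")
        try:
            self.points.append((float(lat), float(lon)))
        except ValueError:
            pass  # skip geo elements whose content is not two numbers


def extract_geo(html):
    """Return a list of (lat, lon) tuples found in the given HTML string."""
    parser = GeoParser()
    parser.feed(html)
    parser.close()
    return parser.points
```

Calling `extract_geo` on the slide's markup should yield the same coordinate pair twice, once for each form.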
  • 9. Exercise! • View source @ http://en.wikipedia.org/wiki/List_of_U.S._national_parks • Use http://microform.at to extract the geo data as KML • http://microform.at/?type=geo&url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FList_of_U.S._national_parks • Try pasting this URL into Google Maps and see what happens
  • 10. Exercise Results • Feel free to hack on the KML • http://code.google.com/apis/kml/documentation/ • Google Earth can be fun too • But you already knew that • We'll see it later... 10
  • 11. Twitter
  • 12. Twitter Data • There's geo data in the user profile • And in tweets... • ...if the user enabled it in their prefs • And even in the 140 chars of the tweet itself 12
  • 13. A Tweet as JSON
      {
        "user" : {
          "name" : "Matthew Russell",
          "description" : "Author of Mining the Social Web; International Sex Symbol",
          "location" : "Franklin, TN",
          "screen_name" : "ptwobrussell",
          ...
        },
        "geo" : { "type" : "Point", "coordinates" : [36.166, -86.784] },
        "text" : "Franklin, TN is the best small town in the whole wide world #WIN",
        ...
      }
  • 14. Exercise! • In your browser, try accessing this URL: http://api.twitter.com/1/users/show.json?screen_name=ptwobrussell • In a terminal with Python, try it programmatically:
      $ sudo easy_install twitter # 1.6.1 is the current version
      $ python
      >>> import twitter
      >>> t = twitter.Twitter()
      >>> user = t.users.show(screen_name='ptwobrussell')
      >>> import json
      >>> print json.dumps(user, indent=2)
  • 15. Recipe #21 • Geocode locations in profiles: • https://github.com/ptwobrussell/Recipes-for-Mining-Twitter/blob/master/recipe__geocode_profile_locations.py • Recipe #21 from 21 Recipes for Mining Twitter
  • 16. Sample Results <?xml version="1.0" encoding="UTF-8"?> <kml xmlns="http://earth.google.com/kml/2.0"> <Folder> <name>Geocoded profiles for Twitterers showing up in search results for ... </name> <Placemark> <Style> <LineStyle> <color>cc0000ff</color> <width>5.0</width> </LineStyle> </Style> <name>Paris</name> <Point> <coordinates>2.3509871,48.8566667,0</coordinates> </Point> </Placemark> ... </kml> 16
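
A rough sketch of how KML shaped like the sample above might be produced from already-geocoded places, using Python 3's standard `xml.etree.ElementTree`. The function name `to_kml` and its input format are assumptions for illustration; this is not the code from Recipe #21. Note that KML lists coordinates as longitude, latitude, altitude.

```python
import xml.etree.ElementTree as ET


def to_kml(places, folder_name="Geocoded places"):
    """Build a KML string from an iterable of (name, lat, lon) tuples."""
    kml = ET.Element("kml", xmlns="http://earth.google.com/kml/2.0")
    folder = ET.SubElement(kml, "Folder")
    ET.SubElement(folder, "name").text = folder_name
    for name, lat, lon in places:
        placemark = ET.SubElement(folder, "Placemark")
        ET.SubElement(placemark, "name").text = name
        point = ET.SubElement(placemark, "Point")
        # KML order is longitude first, then latitude, then altitude.
        ET.SubElement(point, "coordinates").text = "%s,%s,0" % (lon, lat)
    return ET.tostring(kml, encoding="unicode")


print(to_kml([("Paris", 48.8566667, 2.3509871)]))
```

The resulting string can be saved with a `.kml` extension and opened in Google Earth.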
  • 17. Recipe #20 • Visualizing results with a Dorling Cartogram: • https://github.com/ptwobrussell/Recipes-for-Mining-Twitter/blob/master/recipe__dorling_cartogram.py • Recipe #20 from 21 Recipes for Mining Twitter
  • 19. Recipe #22 (?!?) • Extracting "geo" fields from a batch of search results • https://github.com/ptwobrussell/Recipes-for-Mining-Twitter/blob/master/recipe__geocode_tweets.py • Not in current edition of 21 Recipes for Mining Twitter • Just checked in especially for you
  • 20. Sample Results • Unfortunately (???), "geo" data for tweets seems really scarce • Varies according to a particular user's privacy mindset? • Examining only Twitter users who enable "geo" would be interesting in and of itself
      [None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, {u'type': u'Point', u'coordinates': [32.802900000000001, -96.828100000000006]},
       {u'type': u'Point', u'coordinates': [33.793300000000002, -117.852]},
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, {u'type': u'Point', u'coordinates': [35.512099999999997, -97.631299999999996]},
       None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None, None, None, None, None, None, None, None, None, None,
       None]
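
To put a number on "really scarce", a small Python 3 sketch along these lines could filter a batch of tweets down to the ones carrying a "geo" point. The helper name `geo_points` and the sample tweets are made up for illustration; the tweets are assumed to be dicts shaped like the JSON shown earlier, with "geo" either `None` or a Point whose coordinates are [latitude, longitude].

```python
def geo_points(tweets):
    """Return (lat, lon) tuples for tweets that carry a "geo" Point."""
    points = []
    for tweet in tweets:
        geo = tweet.get("geo")
        if geo and geo.get("type") == "Point":
            lat, lon = geo["coordinates"]
            points.append((lat, lon))
    return points


# Hypothetical sample batch: one geotagged tweet out of four.
sample = [
    {"text": "a", "geo": None},
    {"text": "b", "geo": {"type": "Point", "coordinates": [32.8029, -96.8281]}},
    {"text": "c", "geo": None},
    {"text": "d"},
]
points = geo_points(sample)
print("%d of %d tweets are geotagged" % (len(points), len(sample)))
```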
  • 21. Mining the 140 Characters • Not a trivial exercise • Mining natural language data is hard • Mining bastardized natural language data is even harder • We'll look at mining natural language data later 21
  • 23. Oh, and by the way... 23
  • 24. OAuth 1.0a - Now
      import twitter
      from twitter.oauth_dance import oauth_dance

      # Get these from http://dev.twitter.com/apps/new
      consumer_key, consumer_secret = 'key', 'secret'

      (oauth_token, oauth_token_secret) = oauth_dance('MiningTheSocialWeb',
          consumer_key, consumer_secret)

      auth = twitter.oauth.OAuth(oauth_token, oauth_token_secret,
          consumer_key, consumer_secret)

      t = twitter.Twitter(domain='api.twitter.com', auth=auth)
  • 25. OAuth 2.0 - "Soon"
      +----------+          Client Identifier      +---------------+
      |         -+----(A)--- & Redirect URI ------>|               |
      | End-user |                                 | Authorization |
      |    at    |<---(B)-- User authenticates --->|    Server     |
      | Browser  |                                 |               |
      |         -+----(C)-- Authorization Code ---<|               |
      +-|----|---+                                 +---------------+
        |    |                                         ^      v
       (A)  (C)                                        |      |
        |    |                                         |      |
        ^    v                                         |      |
      +---------+                                      |      |
      |         |>---(D)-- Client Credentials, --------'      |
      |   Web   |          Authorization Code,                |
      |  Client |            & Redirect URI                   |
      |         |                                             |
      |         |<---(E)----- Access Token -------------------'
      +---------+       (w/ Optional Refresh Token)
    See http://tools.ietf.org/html/draft-ietf-oauth-v2-10#section-1.4.1
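
As a hedged illustration of steps (A) and (D) in the flow above, the following Python 3 sketch builds the authorization redirect URL and the form body for the token request. The endpoint URL and function names are hypothetical, and the parameter names (`response_type`, `client_id`, `redirect_uri`, `grant_type`, `code`, `client_secret`) are the ones I recall from the OAuth 2.0 drafts; check them against the section linked above before relying on them.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real provider documents its own.
AUTHORIZE_URL = "https://auth.example.com/authorize"


def authorization_url(client_id, redirect_uri):
    """Step (A): URL the end-user's browser is sent to."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
    }
    return AUTHORIZE_URL + "?" + urlencode(params)


def token_request_body(client_id, client_secret, code, redirect_uri):
    """Step (D): form-encoded body the web client POSTs to get a token."""
    return urlencode({
        "grant_type": "authorization_code",
        "client_id": client_id,
        "client_secret": client_secret,
        "code": code,
        "redirect_uri": redirect_uri,
    })


print(authorization_url("my-client-id", "https://app.example.com/callback"))
```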
  • 26. LinkedIn
  • 27. LinkedIn Data • Coarsely grained geo data is available in user profiles • "Greater Nashville Area", "San Francisco Bay", etc. • Most geocoders don't seem to recognize these names... • No geocoordinates! (Yet???) • Mitigation approach: (1) transform/normalize and then (2) geocode 27
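
The "transform/normalize" step could look something like the Python 3 sketch below, which strips a leading "Greater" and a trailing "Area" so that a geocoder sees a plain place name. The function name `normalize_location` and the two rules are assumptions; this is a simple heuristic, and names such as "San Francisco Bay" would need further rules of their own.

```python
import re


def normalize_location(name):
    """Strip LinkedIn-style decorations such as 'Greater ... Area'."""
    name = name.strip()
    name = re.sub(r"^Greater\s+", "", name, flags=re.IGNORECASE)
    name = re.sub(r"\s+Area$", "", name, flags=re.IGNORECASE)
    return name


for raw in ["Greater Nashville Area", "San Francisco Bay Area"]:
    print("%s -> %s" % (raw, normalize_location(raw)))
```

The normalized string is what would then be handed to the geocoder in step (2).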
  • 28. Exercise! • Get an API key at http://code.google.com/apis/maps/signup.html
      $ easy_install geopy
      $ python
      >>> import geopy
      >>> g = geopy.geocoders.Google(GOOGLE_MAPS_API_KEY)
      >>> results = g.geocode("Nashville", exactly_one=False)
      >>> for r in results:
      ...     print r # (u'Nashville, TN, USA', (36.165889, -86.784443))
    • See also https://github.com/ptwobrussell/Recipes-for-Mining-Twitter/blob/master/etc/geocoding_pattern.py
  • 29. Diving Deeper • Example 6-14 from MTSW (pp194-195) works through an extended example and dumps KML output that includes clustered output • See http://github.com/ptwobrussell/Mining-the-Social-Web/python_code/linkedin__geocode.py
  • 30. Clustering • First half of MTSW Chapter 6 (pp167-188) provides a good/detailed intro • Think of clustering as "approximate matching" • The task of grouping items together according to a similarity metric • It's among the most useful algorithmic techniques in all of data mining • The catch: It's a hard problem. • What do you name the clusters once you've created them? 30
  • 33. Clustering Approaches • Agglomerative (hierarchical) • Greedy • Approximate • k-means 33
  • 34. k-Means Algorithm
      1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, ..., Kk.
      2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k*n comparisons.
      3. For each of the k clusters, calculate the centroid (the mean of the cluster) and reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each iteration of the algorithm.)
      4. Repeat steps 2–3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.
    Let's try it: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
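
The four steps above can be sketched in plain Python 3 (3.8+ for `math.dist`) as follows. This is an illustrative sketch rather than the book's code: the function name `kmeans` is hypothetical, and step 1 is simplified to picking k of the data points themselves as the initial values, a common variant of "random points in the data space".

```python
import math
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k groups.

    Returns (centroids, assignment), where assignment[i] is the index
    of the cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1 (variant: sample data points)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (k*n comparisons).
        new_assignment = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # leave an empty cluster's centroid where it is
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Plain Euclidean distance is used here for simplicity; for points spread over large areas of the globe, a great-circle distance would be a more faithful similarity metric.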
  • 46. Facebook
  • 47. Facebook Data • Ridiculous amounts of data (all kinds) are available via the FB Platform • Current location, hometown, "checkins" • Access to the FB platform data is relatively painless: • Social Graph: http://developers.facebook.com/docs/reference/api/ • FQL: http://developers.facebook.com/docs/reference/fql/
  • 48. FQL Checkins • See http://developers.facebook.com/docs/reference/fql/checkin/ 48
  • 49. FQL Connections • See http://developers.facebook.com/docs/reference/fql/connection/ 49
  • 50. Sample FQL • An excerpt from MTSW Example 9-18 (pp306-308) conveys the gist:
      fql = FQL(ACCESS_TOKEN)
      q = """select name, current_location, hometown_location
             from user
             where uid in (select target_id from connection
                           where source_id = me() and target_type = 'user')"""
      results = fql.query(q)
  • 51. Example "App" • Basic idea is simple • You already have the tools to geocode and plot on a map... • See also: http://answers.oreilly.com/topic/2555-a-data-driven-game-using-facebook-data/
  • 52. FB Platform Demo • Minimal sample app at http://miningthesocialweb.appspot.com • Source is at http://github.com/ptwobrussell/Mining-the-Social-Web/web_code/facebook_gae_demo_app
  • 53. Text Mining
  • 54. References • MTSW Chapter 7 (Google Buzz: TF-IDF, Cosine Similarity, and Collocations) • MTSW Chapter 8 (Blogs et al.: Natural Language Processing and Beyond) 54
  • 55. "Legacy" NLP • "Legacy" => Classic Information Retrieval (IR) techniques • Often (but not always) uses a "bag of words" model • tf-idf metric is usually the root of the core strategy • Variations on cosine similarity are often the fruition • Additional higher order analytics are possible, but inevitably cannot be optimal for deep semantic analysis • Virtually every A-list search engine has started here 55
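
A bare-bones Python 3 sketch of the bag-of-words, tf-idf, and cosine similarity ideas named above. The function names are hypothetical, tokenization is naive whitespace splitting, and the weighting is one common tf-idf variant (term frequency times log of inverse document frequency), not necessarily the exact formula used in MTSW Chapter 7.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * \
        math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


docs = [
    "nashville is in tennessee",
    "franklin is in tennessee",
    "paris is in france",
]
v = tfidf_vectors(docs)
print(cosine_similarity(v[0], v[1]), cosine_similarity(v[0], v[2]))
```

Terms that occur in every document ("is", "in") get a weight of zero, so only the distinguishing terms contribute to the similarity score.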
  • 57. How might you discover locations from text using "legacy" techniques? 57
  • 58. Some possibilities • Combinations of language dependent "hacks" • n-gram detection/examination • bigrams, trigrams, etc. • "Proper Case" hints • "Chipotle Mexican Grill" • prepositional phrase cues • "in the garden", "at the store" • Gazetteers • lists of "well-known" locations like "Statue of Liberty"
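
Two of the ideas above, n-gram examination and gazetteer lookup, can be combined in a few lines of Python 3. This is a toy sketch: the function names and the tiny gazetteer are made up, punctuation handling is deliberately crude, and a real gazetteer would hold many thousands of entries.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def find_locations(text, gazetteer, max_n=3):
    """Return gazetteer entries found in text, longest n-grams first."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    entries = {tuple(name.lower().split()) for name in gazetteer}
    found = []
    for n in range(max_n, 0, -1):
        for gram in ngrams(tokens, n):
            if tuple(token.lower() for token in gram) in entries:
                found.append(" ".join(gram))
    return found


gazetteer = {"Statue of Liberty", "Nashville", "Paris"}
text = "We flew from Nashville to see the Statue of Liberty."
print(find_locations(text, gazetteer))
```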
  • 59. "Modern" NLP Pipeline • A deeper "understanding" of the data is much harder • End of Sentence (EOS) Detection • Tokenization • Part-of-Speech Tagging • Chunking • Anaphora Resolution • Extraction • Entity Resolution • Blending in "legacy" IR techniques can be very helpful in reducing noise
  • 61. Quality Metrics • Precision = TP/(TP+FP) • Recall = TP/(TP+FN) • F1 = (2*P*R)/(P+R) 61
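
The three formulas above translate directly into Python 3. The function name and the example counts are hypothetical; returning 0.0 when a denominator is zero is a convention chosen here, not something the slide specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = (2 * precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up example: an extractor returned 10 locations, 8 of them correct,
# and missed 8 other locations that were present in the text.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print("P=%.2f R=%.2f F1=%.2f" % (p, r, f1))
```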
  • 62. Exercise! • Get a webpage: • curl http://example.com/foo.html • Extract the text: • curl -d @foo.html "http://www.datasciencetoolkit.org/html2story" > foo.json • Extract the locations: • curl -d @foo.json "http://www.datasciencetoolkit.org/text2places" • NOTE: Windows users can work directly at http://www.datasciencetoolkit.org 62
  • 63. Tools to Investigate • NLTK - http://nltk.org • Data Science Toolkit - http://www.datasciencetoolkit.org • WordNet - http://wordnet.princeton.edu/ 63
  • 64. Q&A
  • 65. The End