What is a POI? POIs have names, locations, categories and context (depending on the envisaged use case). A point of interest (POI) is a focused geographic entity such as a landmark, a school, a historical building, or a business.
news articles from the U.S. and the U.K., but also included a small number of examples from Yahoo! Answers and a small number of queries submitted to a search engine. The inter-assessor agreement was 73.9%. In total, 1,337 of the examples they annotated contained POIs, which yielded 1,066 unique POIs.
Learn the set of feature weights Λ (capital lambda) that maximises the label sequence probability, i.e. the probability of a label sequence Y given an observed sequence X. Z is the normalising factor. F(Y, X) is the set of feature functions computed over the observations and the label transitions.
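The equation these notes describe is not preserved in the transcript. For reference, the standard linear-chain CRF form that matches the description is written out below. The feature index j, the training-example index k and the primed Y' in the normaliser are notation assumed here, not taken from the slide.

```latex
p_\Lambda(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_j \lambda_j F_j(Y, X) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\Big( \sum_j \lambda_j F_j(Y', X) \Big)

\hat{\Lambda} = \arg\max_{\Lambda} \sum_k \log p_\Lambda\big(Y^{(k)} \mid X^{(k)}\big)
```

Z(X) sums over every possible label sequence Y', so the probabilities of all label sequences for a given X add up to one.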
Up to ten snippets per query. Use BIO labelling.
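BIO labelling marks the first token of a POI mention as B (begin), the following tokens of the same mention as I (inside), and everything else as O (outside). A minimal sketch, assuming whitespace tokenisation and exact token matching; the function name `bio_tags` is hypothetical and the talk does not say how mentions were aligned to snippets.

```python
def bio_tags(tokens, poi_tokens):
    """Label each token B, I or O for every exact occurrence of poi_tokens."""
    tags = ["O"] * len(tokens)
    n = len(poi_tokens)
    if n == 0:
        return tags
    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == poi_tokens:
            tags[i] = "B"                 # first token of the mention
            for j in range(i + 1, i + n):
                tags[j] = "I"             # remaining tokens of the mention
            i += n                        # continue after the mention
        else:
            i += 1
    return tags


# Example snippet taken from the process overview slide.
snippet = "was only after he had left the Marriott Hotel that he remembered".split()
tags = bio_tags(snippet, ["Marriott", "Hotel"])
for token, tag in zip(snippet, tags):
    print(f"{token}\t{tag}")
```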
All three models are statistically significantly higher than the baseline.
C_user(t, L) is the number of unique users who use the term ‘t’ in the cell ‘L’. |L| is the sum of the user frequencies of all terms in the location. It makes sense to use highly precise existing information when it is available, so use the LM in combination with Placemaker (a gazetteer): a cascade model.
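The cascade idea can be sketched as "gazetteer first, language model second". This is a minimal sketch under stated assumptions: a plain dictionary stands in for the Placemaker gazetteer, `lm_locate` is a hypothetical callable for the language-model step, and the lower-cased exact lookup is an illustrative choice rather than the authors' matching rule.

```python
from typing import Callable, Dict, Optional, Tuple

LatLon = Tuple[float, float]


def locate_cascade(
    poi: str,
    gazetteer: Dict[str, LatLon],
    lm_locate: Callable[[str], Optional[LatLon]],
) -> Optional[LatLon]:
    """Return the gazetteer location if there is one, else fall back to the LM."""
    hit = gazetteer.get(poi.lower())
    if hit is not None:
        return hit            # precise existing information wins
    return lm_locate(poi)     # otherwise use the language-model estimate


# Toy stand-ins: one gazetteer entry and an LM that always returns a dummy cell centroid.
toy_gazetteer = {"the white house": (38.8977, -77.0365)}
toy_lm = lambda poi: (0.0, 0.0)

print(locate_cascade("The White House", toy_gazetteer, toy_lm))
print(locate_cascade("Adam's Bar", toy_gazetteer, toy_lm))
```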
Median distances in kilometres
Re-finding existing POIs allows us to get context from social media as well as confirm our model’s performance. Novel POIs are valuable, extending our knowledge of what is out there. They are not restricted by the biases of existing sources, such as commercial enterprises or narrow criteria for POIs.
Wild text (web snippets, Tweets, news, etc.) varies in cleanliness and consistency depending on the source. Discussion, open questions: automatically detecting POIs in UGC (“Corner of forth and main”); the subjective nature of POIs/locations, which is very application-dependent (how to evaluate discovery tasks?); localising them; manual annotation data for POI detection (how hard is it for humans?); analytics; combinations of sources.
Mining the Web for Points of Interest
Adam Rae, Vanessa Murdock, Adrian Popescu, Hugues Bouchard
SIGIR 2012, Portland, Oregon, Entities Session
! I’m at Adam’s Bar…? Mining the Web for Points of Interest Using social media to increase our knowledge of the world
Contents
§ Motivation
§ Point Of Interest (POI) extraction using user generated data
§ POI localisation using social media
§ Conclusions
Motivation
§ Geographic Points of Interest are valuable representations of important places in the world around us.
§ Browsing and search of POIs increasingly important
  › Web search
  › Mobile
  › Navigation
Where do POIs come from?
§ Editing listings coming from NMAs, commercial directories etc.
  › Costly process
  › Expensive to maintain freshness
  › Coverage
§ Do they reflect the kind of places that people are interested in looking for?
Can we get them from the web?
§ Un/semi-structured mentions of POIs throughout text on web
  › Lots of context
§ Structured mentions of POIs in micro blogging systems and Wikipedia articles
  › Easy to extract
When is a POI not a POI?
1 The White House is at 1600 Pennsylvania Avenue, Washington DC.
2 The White House released a statement today suggesting the moon is made of cheese.
3 The people living in the white house at the end of the street turned out to be Martians.
Can we bootstrap using social media?
§ Train Conditional Random Fields (CRF) using web snippets bootstrapped from structured mentions in micro-blog entries
  › Extract POI, use as query to search engine
  › Resultant snippets filtered to those that contain POI
  › Sanitise
§ Also from geocoded Wikipedia articles (according to Yago2)
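The filtering and sanitising steps above can be sketched as follows. This is a minimal sketch with several assumptions: the search engine call is left out, "sanitise" is taken to mean stripping markup-like tags and collapsing whitespace, matching is case-insensitive, and the limit of ten follows the "up to ten snippets per query" speaker note. The function names are hypothetical.

```python
import re


def sanitise(snippet: str) -> str:
    """Strip markup-like tags and collapse whitespace (an assumed notion of 'sanitise')."""
    text = re.sub(r"<[^>]+>", " ", snippet)
    return re.sub(r"\s+", " ", text).strip()


def filter_snippets(poi: str, snippets, limit: int = 10):
    """Keep at most `limit` sanitised snippets that actually mention the POI."""
    kept = []
    for raw in snippets:
        clean = sanitise(raw)
        if poi.lower() in clean.lower():
            kept.append(clean)
        if len(kept) == limit:
            break
    return kept


# Toy search results for the query "Marriott Hotel".
raw_results = [
    "He had left the <b>Marriott Hotel</b>   early.",
    "No mention here.",
    "marriott hotel deals",
]
print(filter_snippets("Marriott Hotel", raw_results))
```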
Ground Truth Data
§ Created by manual assessors given explicit instructions
  › 1,337 examples of POIs in (some) context
  › 1,066 unique POIs
  › Inter-assessor agreement:

Ground Truth Assessor   Precision   Recall   F-Measure
1                       0.749       0.792    0.770
2                       0.814       0.716    0.762
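The F-Measure column is consistent with the balanced F1 score, the harmonic mean of precision and recall. The short check below recomputes it from the reported precision and recall values; the function name is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for each ground-truth assessor.
print(round(f_measure(0.749, 0.792), 3))  # 0.77
print(round(f_measure(0.814, 0.716), 3))  # 0.762
```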
Features
§ Lexical
  › Word identity, shape, position, etc.
§ Grammatical
  › Part of Speech, Apache OpenNLP
§ Statistical
  › Normalised Point-wise Mutual Information of mobile search query logs
§ Geographic
  › Gazetteer attributes from Yahoo! Placemaker
  › http://developer.yahoo.com/geo/placemaker/
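To make the lexical and statistical features concrete, here is a minimal sketch. The shape encoding (X for upper case, x for lower case, d for digits, with repeated characters collapsed) and the feature names are assumptions for illustration. The `npmi` function uses the usual definition, PMI divided by -log p(x, y); how the probabilities were estimated from the query logs is not described in the slides.

```python
import math
import re


def word_shape(token: str) -> str:
    """Map a token to a coarse shape, e.g. 'Marriott' -> 'Xx', '1600' -> 'd'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)  # collapse runs of the same symbol


def token_features(tokens, i):
    """Lexical features for the token at position i (feature names are hypothetical)."""
    return {
        "word": tokens[i],
        "shape": word_shape(tokens[i]),
        "position": i,
        "prev_word": tokens[i - 1] if i > 0 else "<START>",
    }


def npmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Normalised PMI in [-1, 1]; requires 0 < p_xy < 1 and p_x, p_y > 0."""
    if not (0 < p_xy < 1) or p_x <= 0 or p_y <= 0:
        raise ValueError("probabilities out of range")
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)


tokens = "The White House is at 1600 Pennsylvania Avenue".split()
print(token_features(tokens, 1))
```

An NPMI of 1 means two terms only ever occur together, and a value near 0 means they co-occur about as often as chance would predict.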
Process Overview
[Diagram: article titles are extracted from geocoded Wikipedia articles, and POI mentions are extracted from Foursquare and Gowalla check-ins. Each set is sent to a search engine (Bing) to collect raw web snippets, which pass through snippet processing and CRF model training to produce Wikipedia-, Foursquare- and Gowalla-based bootstrapped POI taggers. Example snippet: “… was only after he had left the Marriott Hotel that he remembered…”]
Results

Training Data    Testing Data   Precision   Recall
Y! Placemaker    Manual Data    0.237       0.228
Wikipedia        Manual Data    0.514       0.337
Foursquare       Manual Data    0.276       0.655
Gowalla          Manual Data    0.360       0.414
Wikipedia        10-fold CV     0.879       0.955
Foursquare       10-fold CV     0.689       0.468
Gowalla          10-fold CV     0.857       0.868
Language Modelling
§ Partition the world into 1km cells
§ For each, create model from Flickr photos taken in that area
  $P(t \mid \theta_L) = \frac{c_{user}(t, L)}{|L|}$, where $|L| = \sum_{t_i \in L} c_{user}(t_i, L)$
§ Treat problem as IR, match a POI (query) against the cells (documents)
  › Return centroid of best matching cell
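The cell language model can be sketched in a few lines. This is a minimal sketch under stated assumptions: photos arrive as (cell id, user id, tags) triples, cells are ranked by query likelihood, and a small probability floor stands in for whatever smoothing the authors used, which the slides do not state. All function names are hypothetical, and mapping the winning cell to its centroid is left out.

```python
import math
from collections import defaultdict


def build_cell_models(photos):
    """photos: iterable of (cell_id, user_id, tags). Returns {cell: {term: unique-user count}}."""
    users = defaultdict(lambda: defaultdict(set))
    for cell, user, tags in photos:
        for term in tags:
            users[cell][term].add(user)  # count each user once per term and cell
    return {
        cell: {term: len(u) for term, u in terms.items()}
        for cell, terms in users.items()
    }


def term_prob(term, counts):
    """P(t | theta_L) = c_user(t, L) / |L|, with |L| the sum of all user counts in the cell."""
    total = sum(counts.values())
    return counts.get(term, 0) / total if total else 0.0


def best_cell(query_terms, models, floor=1e-9):
    """Rank cells by log query likelihood; `floor` is an assumed stand-in for smoothing."""
    def score(counts):
        return sum(math.log(max(term_prob(t, counts), floor)) for t in query_terms)
    return max(models, key=lambda cell: score(models[cell]))


# Toy data: two cells, three users.
photos = [
    ("cellA", "u1", ["eiffel", "tower"]),
    ("cellA", "u2", ["eiffel"]),
    ("cellA", "u1", ["eiffel"]),          # same user again, not counted twice
    ("cellB", "u3", ["tower", "bridge"]),
]
models = build_cell_models(photos)
print(best_cell(["eiffel", "tower"], models))
```

Counting unique users rather than raw tag occurrences keeps one prolific photographer from dominating a cell's model.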
Conclusions and Implications
§ POIs are valuable, but useful ones difficult to define
§ Generating evaluation data is hard
§ Can use web snippets bootstrapped with check-ins, and articles on Wikipedia to train POI tagger
  › Up to 88% precision on unlabelled data
  › Reflect the POIs users visit
  › Easily updated
  › Can be located accurately using hybrid gazetteer + Flickr language model technique
Benefits of this approach
§ Discover POIs:
  › that we already know about (replace/extend existing sources)
  › we didn’t already know about (novel POIs)
  › of more diverse types (increasing coverage)
  › that are fresher
§ Increase relevance of local and hyperlocal search using wisdom of the crowds
Research Areas
- Automatic POI detection in UGC
- Learning how users refer to places
- Localising media
- Generating evaluation data
  - (This is hard)
- Multi-source combination
- Quality & Credibility
Thank you
Adam Rae email@example.com
Vanessa Murdock, Adrian Popescu, Hugues Bouchard