Scikits.learn (http://scikit-learn.sourceforge.net/) is a scikit for machine learning that has gained a lot of popularity in recent months. In particular, it can be used for text mining and large-scale database mining.
CubicWeb (http://www.cubicweb.org/), for its part, is a Python-based framework for semantic web applications that has been used in various application fields (library, museum, conference and intranet applications).
The aim of this talk is to show how these tools can be used together for semantic data mining of RSS feeds (clustering, prediction) and for building a news aggregator similar to Google News.
Full description : http://www.euroscipy.org/talk/4291
1. A semantic news aggregator in Python using
Dbpedia, Cubicweb and Scikits-learn
Vincent Michel - Logilab
27 August 2011
2. Context
With a bunch of RSS news...
Google Borrows Apple Strategy : Google’s deal underscores the
allure of a business model pioneered...
Japan disaster plant cold shutdown could face delay : TOKYO
(Reuters) - Tokyo Electric Power Co said on Wednesday...
Libya shows signs of slipping from Muammar Gaddafi’s grasp :
Supply lines to capital in peril as coastal cities fall...
... how can we analyze them in Python ?
→ Clustering (grouping) RSS (e.g. Google News).
→ Extracting/synthesizing information.
→ Providing useful and original semantic visualisation and
analytics tools.
3. Overview
1 Introduction
2 Extracting information from RSS news
3 Analyzing RSS news
4. Semantic ?
[Figure: Linking Open Data cloud diagram, as of September 2010. Datasets are grouped by domain: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences.]
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
5. Tools
Fetching, storing and querying the data → CubicWeb (*)
Semantic CMS, high-level database management with metadata.
Multiple sources : RSS, micro-blogging, SQL database, . . .
Based on PostgreSQL : deals with (very) large databases.
Data-mining and machine learning → Scikits-learn (*)
Easy-to-use and general-purpose machine learning in Python.
Unsupervised learning, supervised learning, model selection . . .
Semantic information database → Dbpedia (*)
∼ 8 × 10^6 articles with abstracts, images, ... from Wikipedia.
∼ 0.6 × 10^6 categories, 273 types (e.g. person, place, ...)
∼ 100 × 10^6 links between articles, categories and types.
(*) Open Source/Creative Commons ! ! !
6. Storing RSS information with Cubicweb
Each object (or entity) is defined in a schema and may be displayed
using different views.
Storing RSS
Define (in schema.py) how RSS should be stored in the database.
class RSSArticle(EntityType):
    title = String()             # Title of the feed
    uri = String(unique=True)    # URI of the RSS feed
    content = String()           # Content of the feed
Fetching RSS (based on feedparser and BeautifulSoup)
Simply construct a source by giving a URL and a parser :
url = u'http://feeds.bbci.co.uk/news/world/rss.xml'
s = session.create_entity('CWSource', name=u'BBCNews World', url=url,
                          type=u'datafeed', parser=u'rss-parser',
                          config=u'synchronization-interval=240min')
s.pull_data(session)
7 English/American newspapers (The New York Times, ...)
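The fetching step above can be illustrated without Cubicweb. The following is a minimal sketch, assuming a hard-coded RSS 2.0 document and using the standard library's ElementTree in place of feedparser; the sample feed, the example.org links and the parse_rss helper are illustrative assumptions, not part of the talk's code. It maps each item onto the title/uri/content attributes of the RSSArticle schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hard-coded RSS 2.0 document standing in for a downloaded feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Google Borrows Apple Strategy</title>
      <link>http://example.org/news/1</link>
      <description>Google's deal underscores the allure of a business model...</description>
    </item>
    <item>
      <title>Japan disaster plant cold shutdown could face delay</title>
      <link>http://example.org/news/2</link>
      <description>Tokyo Electric Power Co said on Wednesday...</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict per <item>, keyed like the RSSArticle attributes."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter('item'):
        articles.append({
            'title': item.findtext('title', default=''),
            'uri': item.findtext('link', default=''),
            'content': item.findtext('description', default=''),
        })
    return articles


articles = parse_rss(SAMPLE_RSS)
for article in articles:
    print(article['uri'], '|', article['title'])
```

In a real source, each dict would be turned into an RSSArticle entity, with the unique uri preventing the same item from being stored twice.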
7. Storing Dbpedia information with Cubicweb
Storing Dbpedia page
Define (in schema.py ) how Dbpedia pages should be stored in the
database.
class DbpediaPage(EntityType):
    uri = String(unique=True, indexed=True)  # URI of the resource
    label = String(indexed=True)  # http://www.w3.org/2000/01/rdf-schema
    pageid = String()             # http://dbpedia.org/ontology/wikiPageID
    abstract = String()           # http://dbpedia.org/ontology/abstract
    homepage = String()           # http://xmlns.com/foaf/0.1/homepage
    thumbnail = String()          # http://dbpedia.org/ontology/thumbnail
    depiction = String()          # http://xmlns.com/foaf/0.1/depiction
    wikipage = String()           # http://xmlns.com/foaf/0.1/page
    latitude = String()           # http://www.w3.org/2003/01/geo/wgs84_pos
    longitude = String()          # http://www.w3.org/2003/01/geo/wgs84_pos
Storing all Dbpedia information (∼ 9 × 10^6 pages, ∼ 100 × 10^6 links,
20 GB) in Cubicweb takes less than 24 hours.
→ See the full schema
8. Analyzing RSS news
What can we do with the RSS news stored in the database ?
1 extract relevant features from the data.
2 construct a usable (i.e. matrix) representation of the data.
3 cluster (group) RSS news together.
4 perform deeper semantic analysis and visualization of the information.
Example sentence :
“Google is to buy mobile phone manufacturer Motorola Mobility,
allowing it to mount a serious challenge to Apple Inc.”
9. Overview
1 Introduction
2 Extracting information from RSS news
3 Analyzing RSS news
10. Extracting information : Classical approaches (Scikits-learn)
Char-N-gram
Extracts features of N characters from a text.
from scikits.learn.feature_extraction.text import CharNGramAnalyzer
analyzer = CharNGramAnalyzer(min_n=3, max_n=6)
features = analyzer.analyze(sentence)
450 features : ’goo’, ’oog’, ’ogl’, . . . , ’mob’, ’obi’, ’bil’, . . .
Word-N-gram
Extracts features of N words from a text.
from scikits.learn.feature_extraction.text import WordNGramAnalyzer
analyzer = WordNGramAnalyzer(min_n=3, max_n=6)
features = analyzer.analyze(sentence)
58 features : ’google is to’, ’is to buy’, . . . , ’serious challenge to’, . . .
Many irrelevant features (tokens).
Features carry little contextual information (i.e. they are hard
for humans to interpret).
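To make the two classical approaches concrete, here is a minimal pure-Python sketch of character and word n-gram extraction. It is not the scikits.learn implementation: the lower-casing and the choice of keeping only tokens of at least two characters are assumptions made here. With these conventions, the counts on the example sentence happen to match the 450 and 58 features reported above.

```python
import re


def char_ngrams(text, min_n=3, max_n=6):
    """All character n-grams of the lower-cased text, min_n <= n <= max_n."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, min_n=3, max_n=6):
    """All word n-grams; tokens shorter than two characters are dropped
    (an assumption of this sketch)."""
    words = re.findall(r'\b\w\w+\b', text.lower())
    return [' '.join(words[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(words) - n + 1)]


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")

char_features = char_ngrams(sentence)
word_features = word_ngrams(sentence)
print(len(char_features), char_features[:3])
print(len(word_features), word_features[:2])
```

Most of these features ('oog', 'is to buy', ...) say nothing about what the sentence is about, which is the limitation noted on this slide.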
11. Feature extraction : Dbpedia - Context
Main hypothesis : only things that exist in Dbpedia (i.e.
Wikipedia) are of interest for news analysis → Named
Entity Recognition (NER)
Dbpedia feature extraction (Cubicweb/Dbpedia)
from cubes.semnews.views.nertools import DbpediaEntitiesAnalyzer
analyzer = DbpediaEntitiesAnalyzer(session, lang='en')
tokens = analyzer.analyze(sentence)
→ 3 features : ’Apple Inc’, ’Google’, ’Motorola’
<ENAMEX TYPE="ORGANIZATION">Google</ENAMEX> is to buy mobile phone
manufacturer <ENAMEX TYPE="ORGANIZATION">Motorola</ENAMEX> Mobility, allowing it to
mount a serious challenge to <ENAMEX TYPE="ORGANIZATION">Apple Inc</ENAMEX>
→ Try it !
“DBpedia Spotlight : Shedding Light on the Web of Documents”, Pablo N. Mendes et al., I-Semantics 2011
“Learning Named Entity Recognition from Wikipedia”, Joel Nothman 2008
“Large-Scale Named Entity Disambiguation Based on Wikipedia Data”, Silviu Cucerzan 2007
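The idea of keeping only tokens that exist in Dbpedia can be sketched as a dictionary lookup. The snippet below is a simplified illustration, not the DbpediaEntitiesAnalyzer code: the LABELS table is a small hypothetical stand-in for the indexed DbpediaPage labels, and matching is done with one regular expression that prefers the longest label.

```python
import re

# Hypothetical label -> URI table standing in for the indexed Dbpedia labels.
LABELS = {
    'Google': 'http://dbpedia.org/resource/Google',
    'Motorola': 'http://dbpedia.org/resource/Motorola',
    'Apple Inc': 'http://dbpedia.org/resource/Apple_Inc.',
    'Apple': 'http://dbpedia.org/resource/Apple',
}


def find_entities(sentence, labels=LABELS):
    """Return the sorted set of known labels found in the sentence."""
    # Longest labels first, so that 'Apple Inc' wins over 'Apple'.
    alternatives = sorted(labels, key=len, reverse=True)
    pattern = re.compile(
        r'\b(' + '|'.join(re.escape(a) for a in alternatives) + r')\b')
    return sorted(set(pattern.findall(sentence)))


sentence = ("Google is to buy mobile phone manufacturer Motorola Mobility, "
            "allowing it to mount a serious challenge to Apple Inc.")
entities = find_entities(sentence)
print(entities)
```

On the example sentence this yields the same three features as above, each of which can be linked back to a Dbpedia URI.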
12. Feature extraction : Dbpedia - Properties
Efficient and robust feature extraction
Keep the meaning of a text → interpretable features.
’. . . said former soldier Larry, . . . ’
’. . . said student Larry, . . . ’
’. . . said Larry Page, . . . ’
Robust features based on redirections.
e.g. ’Obama’, ’Barak Obamba’, ’Pres. Obama’ redirect to ’Barack Obama’
Simple RQL (Relation Query Language) queries
Fast, based on indexed SQL tables and regular expressions :
rset = rql('Any E WHERE E is DbpediaPage, '
           'E label %(token)s', {'token': token})
e.g. 19 entities extracted in 4 s from 765 words, among ∼ 8 × 10^6 Dbpedia entries.
Different labels but same URI → cross-language feature
extraction : e.g. Grenada/Grenade → http://dbpedia.org/resource/Grenada
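The effect of redirects and of sharing one URI across languages can be sketched with a plain lookup table. This is only an illustration under stated assumptions: the REDIRECTS dictionary and the canonical_uri helper are hypothetical, whereas in the application this information comes from the Dbpedia data stored in Cubicweb.

```python
# Hypothetical redirect table: alias label -> canonical Dbpedia label.
REDIRECTS = {
    'Obama': 'Barack Obama',
    'Pres. Obama': 'Barack Obama',
    'Grenade': 'Grenada',  # French label of the same resource
}


def canonical_uri(label):
    """Map a label (or one of its aliases) to a single resource URI."""
    canonical = REDIRECTS.get(label, label)
    return 'http://dbpedia.org/resource/' + canonical.replace(' ', '_')


for label in ('Obama', 'Pres. Obama', 'Barack Obama', 'Grenade', 'Grenada'):
    print(label, '->', canonical_uri(label))
```

All spellings of the same entity collapse onto one feature, which keeps the feature space small and makes it independent of the language of the news item.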
13. Overview
1 Introduction
2 Extracting information from RSS news
3 Analyzing RSS news
14. Feature extraction : Matrix representation and storage
Obama’s approval ...        0 0 ... 1 0
Protest as Spain ...        0 0 ... 0 1
...                    →    ...
Libya rebels fight ...      0 1 ... 0 0
Obama in Spain ...          0 0 ... 1 1
E.g. the Obama-Merkel-Sarkozy space ...
Results are stored in a relation (appears_in_rss) in Cubicweb :
rql('Any X WHERE X appears_in_rss Y')
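The matrix representation above can be sketched in a few lines: one row per news item, one column per entity, and a 1 where the entity appears in the item. The occurrence_matrix helper and the entity sets below are illustrative assumptions; in the application the pairs come from the appears_in_rss relation.

```python
def occurrence_matrix(doc_entities):
    """Rows = documents, columns = sorted entity vocabulary, values 0/1."""
    vocabulary = sorted({e for entities in doc_entities for e in entities})
    column = {entity: j for j, entity in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in doc_entities]
    for i, entities in enumerate(doc_entities):
        for entity in entities:
            matrix[i][column[entity]] = 1
    return vocabulary, matrix


# Hypothetical entity sets for the four example headlines.
docs = [
    {'Barack Obama'},           # "Obama's approval ..."
    {'Spain'},                  # "Protest as Spain ..."
    {'Libya'},                  # "Libya rebels fight ..."
    {'Barack Obama', 'Spain'},  # "Obama in Spain ..."
]
vocabulary, X = occurrence_matrix(docs)
print(vocabulary)
for row in X:
    print(row)
```

Each news item then becomes a point in the space spanned by the entities, which is the input expected by the clustering algorithms.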
15. Exploiting information : Clustering news
Creating clusters (groups) of news, using the matrix
representation of the data.
Meanshift algorithm (Scikits-learn)
Based on locating the maxima of a density function.
Automatically tunes the number of clusters.
from scikits.learn.cluster import MeanShift, estimate_bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.005)
clustering = MeanShift(bandwidth=bandwidth)
clustering.fit(X)
labels = clustering.labels_
“Mean shift, mode seeking, and clustering.”, Yizong Cheng. IEEE Transactions on Pattern Analysis and Machine Intelligence 1995
→ Try it !
Other possible alternatives : Ward’s algorithm, K-means, . . .
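To show what "locating the maxima of a density function" means in practice, here is a minimal flat-kernel mean shift in pure Python. It is a simplified sketch, not the scikits.learn implementation: the bandwidth is given by hand instead of being estimated, every point is shifted independently, and modes closer than one bandwidth are merged. The toy points are an assumption of the example.

```python
import math


def mean_shift(points, bandwidth, n_iter=50, tol=1e-6):
    """Flat-kernel mean shift. Returns (modes, labels)."""
    def shift(p):
        # Move p to the mean of all data points within `bandwidth` of it.
        near = [q for q in points if math.dist(p, q) <= bandwidth]
        return tuple(sum(coords) / len(near) for coords in zip(*near))

    modes, labels = [], []
    for p in points:
        p = tuple(p)
        for _ in range(n_iter):
            new_p = shift(p)
            moved = math.dist(p, new_p)
            p = new_p
            if moved < tol:
                break
        # Attach the converged point to an existing mode, or create a new one.
        for k, mode in enumerate(modes):
            if math.dist(p, mode) <= bandwidth:
                labels.append(k)
                break
        else:
            modes.append(p)
            labels.append(len(modes) - 1)
    return modes, labels


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
modes, labels = mean_shift(points, bandwidth=1.0)
print(labels)
```

The number of clusters is never passed in: it falls out of how many distinct modes the points converge to, which is why only the bandwidth has to be tuned.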
16. Exploiting information : Queries and Views
Each query returns a result set (rset). A view is called on a result set
→ define the representation rules.
Defining a view (short version)
class MyEntitiesView(View):
    __regid__ = 'example-view'
    def call(self):
        rset = self.cw_rset
        for entity in rset.entities():
            do_whatever_you_want_...
            write_some_html_...
... you just have to plug in a rset from a query :
rset = rql('Any X WHERE ...')
self.wview('example-view', rset)
or within a URL :
http://myapplication/?rql=Any X WHERE ...&vid=example-view
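When the query and the view identifier are passed in a URL, the RQL text has to be URL-encoded. A minimal sketch with the standard library follows; the view_url helper and the concrete query 'Any X WHERE X is RSSArticle' are illustrative assumptions, while the rql and vid parameter names are those shown above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def view_url(base_url, rql, vid):
    """Build a URL that applies the view `vid` to the result of `rql`."""
    return base_url + '?' + urlencode({'rql': rql, 'vid': vid})


url = view_url('http://myapplication/',
               'Any X WHERE X is RSSArticle',
               'example-view')
print(url)

# Decoding the query string gives back the original parameters.
params = parse_qs(urlsplit(url).query)
print(params)
```

Encoding matters as soon as the query contains spaces, commas or quotes, which is the case for almost every RQL expression.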
17. Plugging Scikits.learn into a view
Combine machine learning tools and the database query system, in pure
Python.
Defining the view
class RSSClustering(View):
    __regid__ = 'rss-clustering'
    def call(self):
        # Get the result set and create the data
        rset = self.cw_rset
        occurrence_matrix = OccurrenceMatrix()
        occurrence_matrix.construct_matrix(rset=rset)
        # Scale the data
        X = scale(occurrence_matrix, axis=0, with_std=True, copy=True)
        # Compute bandwidth for clustering
        bandwidth = estimate_bandwidth(X)
        # Clustering
        cluster = MeanShift(bandwidth=bandwidth)
        cluster.fit(X)
        labels = cluster.labels_
        # Perform some HTML rendering
        ...
→ Try it !
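The scaling step of the view (axis=0, with_std=True) standardizes each column of the occurrence matrix. The following pure-Python sketch shows the arithmetic on a toy matrix; scale_columns is a hypothetical helper written for this illustration, and leaving constant columns unscaled is an assumption of the sketch.

```python
import math


def scale_columns(matrix):
    """Center each column to zero mean and divide it by its standard deviation."""
    n_rows = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n_rows for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n_rows)
            for col, m in zip(columns, means)]
    # Constant columns have zero deviation: center them but do not rescale.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in matrix]


X = [[1, 0, 0],
     [0, 0, 1],
     [0, 1, 0],
     [1, 0, 1]]
X_scaled = scale_columns(X)
for row in X_scaled:
    print([round(v, 3) for v in row])
```

After scaling, a rare entity and a frequent one contribute on a comparable scale to the distances used by the clustering step.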
18. A new approach for querying information from RSS
All musical artists in the news
rql('DISTINCT Any E, R WHERE E appears_in_rss R, '
    'E has_type T, T label "musical artist"')
All living office holders in the news
rql('DISTINCT Any E WHERE E appears_in_rss R, '
    'E has_type T, T label "office holder", '
    'E has_subject C, C label "Living people"')
All news that talk about Barack Obama and any scientist
rql('DISTINCT Any R WHERE E1 label "Barack Obama", '
    'E1 appears_in_rss R, E2 appears_in_rss R, '
    'E2 has_type T, T label "scientist"')
All news that talk about a drug
rql('Any X, R WHERE X appears_in_rss R, '
    'X has_type T, T label "drug"')
Try it ! . . . with an xml view . . . or a thumbnail view
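What such a query computes can be sketched with plain Python sets. The snippet mirrors the "Barack Obama and any scientist" query; the HAS_TYPE and APPEARS_IN_RSS dictionaries are small hypothetical stand-ins for the has_type and appears_in_rss relations, and the news identifiers are invented for the example.

```python
# Hypothetical in-memory stand-ins for the has_type and appears_in_rss relations.
HAS_TYPE = {
    'Barack Obama': {'office holder'},
    'Stephen Hawking': {'scientist'},
    'Lady Gaga': {'musical artist'},
}
APPEARS_IN_RSS = {
    'Barack Obama': {'rss1', 'rss2'},
    'Stephen Hawking': {'rss2'},
    'Lady Gaga': {'rss3'},
}


def news_about(label, type_label):
    """News items mentioning `label` and at least one entity of `type_label`."""
    news = set()
    for entity, types in HAS_TYPE.items():
        if type_label in types:
            # Same R for both entities: intersect their news sets.
            news |= APPEARS_IN_RSS[label] & APPEARS_IN_RSS[entity]
    return sorted(news)


print(news_about('Barack Obama', 'scientist'))
print(news_about('Barack Obama', 'musical artist'))
```

The RQL version expresses the same join declaratively and lets the database do the work on indexed tables.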
19. Visualisation : Mapping information
View
class EntityMapView(View):
    __regid__ = 'map'
    def call(self):
        rset = self.cw_rset
        self.init_map()
        for entity in rset.entities():
            self.add_marker(entity.latitude, entity.longitude,
                            entity.dc_title)
        self.center_and_zoom(0, 0, 1.5)
        self.finish_map()
Based on mapstraction (javascript) : http://mapstraction.com/
Automatically locate information from RSS news.
Try it !
20. Conclusion
Cubicweb and Scikits-learn :
Efficient and easy-to-use tools for data storing, querying and
mining.
Easy to plug together, only Python tools, all Open Source.
Using Dbpedia makes it possible to extract a small number of highly relevant features
Decreases the dimensionality of the data.
Links features to millions of pages of information.
A new semantic way for querying information
Simple information queries using RQL expressions.
Use Dbpedia types and categories to refine the selection.
21. Future Improvements
Named Entity Recognition
→ use disambiguation links / more refined regular expressions.
→ add new databases (MusicBrainz, diseasome, ...).
→ closely follow Wikipedia with Dbpedia live update :
Motorola Mobility (11:46, 15 August 2011) ’On the 15 August,
Google announced that it agreed to acquire the company.’
RSS news analysis
→ explore new algorithms : bi-clustering, ...
→ add new data sources : Twitter, blogs, ...
→ ’from scikits.learn predict __future__ . . . ’ → use Matrix
Completion to predict new edges in the correlation graph.