SlideShare uma empresa Scribd logo
1 de 22
Baixar para ler offline
A semantic news aggregator in Python using
   Dbpedia, Cubicweb and Scikits-learn

            Vincent Michel - Logilab



                 27 août 2011
Context

   With a bunch of RSS news...

       Google Borrows Apple Strategy : Google’s deal underscores the
       allure of a business model pioneered...
       Japan disaster plant cold shutdown could face delay : TOKYO
       (Reuters) - Tokyo Electric Power Co said on Wednesday...
       Libya shows signs of slipping from Muammar Gaddafi’s grasp :
       Supply lines to capital in peril as coastal cities fall...


   ... how can we analyze them in Python ?

    → Clustering (grouping) RSS (e.g. Google News).
    → Extracting/synthetizing information.
    → Providing semantic useful/original visualisation and
       analytics tools.
Overview




   1 Introduction

   2 Extracting information from RSS news

   3 Analyzing RSS news
Semantic ?

                                                                                                                                                                     Sussex                St.
                                                                                                                                                                     Reading            Andrews                     NDL
                                                                                                                                                  Audio-              Lists             Resource                  subjects              t4gm
                                                                                                                           MySpace               scrobbler                                Lists
                                                                                                            Moseley        (DBTune)              (DBTune)                                                                                                RAMEAU
                                                                                                             Folk                                                                                      NTU                                                 SH                  lobid
                                                                                             GTAA                                                                Plymouth                            Resource
                                                                                                                                                                                                       Lists
                                                                                                                                                                                                                                                                              Organi-
                                                                                                                                                                  Reading
                                                                                                                                                                    Lists
                                                                                                                                                                                                                                                                              sations
                                                                                                                   Music                                                              The Open                                                                                             ECS
                                                                                Magna-                             Brainz                         Music
                                                                   DB            tune                                                                                                  Library                                        LCSH                                                South-
                                                                                                                   (Data                          Brainz                                                          LIBRIS                                                                  ampton
                                                                 Tropes                                                                                                                                                                                          lobid                                       Ulm
                                                                                                                 Incubator)                      (zitgist)             Man-                                                                                                               EPrints
                                                                                                                                                                                                                                                               Resources
                                                                                                                                                                      chester
                                                                                             Surge                                                                    Reading
                                                biz.                                                                                Music                                                                                                                                                                               RISKS
                                                                                             Radio                                                                     Lists                        The Open                                                                                       ECS
                                               data.        John                                                                   Brainz
                                                                                                             Discogs                                                                                 Library                  PSH            Gem.                               UB                South-
                                              gov.uk         Peel                                                                 (DBTune)
                                                                           FanHubz                          (Data In-                                                                                (Talis)                                 Norm-                             Mann-              ampton
                                                             (DB                                            cubator)                                 Jamendo                                                                                 datei                             heim                                                  RESEX
                                                            Tune)
                                         Popula-                                                                                                                            Poké-                                                                                                                             DEPLOY
                                                                                            Last.fm
                                        tion (En-                                                                                                                           pédia
                                                                                            Artists                     Last.FM                                                            Linked                                                            RDF
                                         AKTing)         research          EUTC            (DBTune)                     (rdfize)                                                            LCCN                  VIAF                                       Book                                                            Wiki
                                                         data.gov         Produc-                                                                                                                                                                                                                 Pisa                                        Eurécom
                                                                                                                                                                                                                                     P20                    Mashup            semantic
                          NHS                               .uk            tions                                                         classical                                                                                                                            web.org
                       (EnAKTing)                                                                                                                            Pokedex
                                                                                                                                            (DB
                                            Mortality                                                                                     Tune)                               PBAC                                                                                                                            ECS
                                             (En-
                                            AKTing)
                                                                                             BBC                                                                                                     MARC                                                                                                     (RKB                                      Budapest
                                                                                           Program                                                                                                   Codes                                                                                                  Explorer)
          Energy                                         education        OpenEI                                 BBC                                                                                  List            Semantic                        Lotico          Revyu                                                         OAI
           (En-                CO2                       data.gov                            mes                 Music                                                                                                 Crunch                                                        SW
          AKTing)              (En-                         .uk                                                                  Chronic-                           Linked                                                                                                          Dog
                                                                                                                                                                                        NSZL                            Base
                              AKTing)                                                                                              ling            Event-            MDB                                                                 RDF                                        Food                                                         IRIT
                                                                                                                                 America           Media                               Catalog
                                                                                                                                                                                                                                        ohloh
                                                                                      BBC                                                                                                                                                                                                      DBLP                 ACM                                            IBM
                                                                                                                                                                                                         Good-                                          BibBase
                                            Ord-                                     Wildlife                                                                                                                                                                                                  (RKB
                                                                      Openly                          Recht-                                                                                              win
                                           nance                                     Finder                                                                                                                                                                                                  Explorer)
                                                                       Local                          spraak.                                                                                            Family                                                             DBLP
    legislation                            Survey                                                                       Tele-          New                                                                            VIVO UF
      .gov.uk                                                                                            nl            graphis         York                                  flickr                                                                                         (L3S)                                                              New-
                                                                                                                                                                                                                                            VIVO                                                                                               castle
                                                                                                                                      Times               URI               wrappr         Open                                            Indiana                                                                           RAE2001
                       UK Post-                                                                                                                          Burner                            Calais                                                            DBLP
                        codes                            statistics                                                                                                                                                                                           (FU
                                                                                                                                                                                                                               VIVO                                                                        CiteSeer                                                Roma
                                                         data.gov              LOIUS                 Taxon                                                                                                   iServe                                         Berlin)                       IEEE
                                                            .uk                                                                                                                                                               Cornell
                                                                                                    Concept             Geo
                                                                                                                                         World                                                                                                                              data
      ESD                                                                                                                                Fact-                                                                                                   OS                         dcs
                                                                                                                       Names              book                                                                                                                                                                                               dotAC
     stan-        reference                                                                                                                                                                                           Project
                                     Linked Data                                         NASA                                            (FUB)              Freebase
     dards        data.gov                                                                                                                                                                                            Guten-
                     .uk
                                     for Intervals                                       (Data                                                                                                                                                                                            GESIS          Course-
                                                                transport                                                                                                              DBpedia                         berg                                    STW                                                      ePrints                              CORDIS
                                                                                         Incu-                                                                                                                                                                                                            ware
                                                                data.gov                 bator)                                                                                                                       (FUB)
                                                                                                            Fishes                                                                                                                         ERA                             UN/
                                                                   .uk
                                                                                                           of Texas         Geo                                                                                                                                          LOCODE
                                                                                                                                              Uberblic
                                                                                 Euro-                                     Species
                The                                                               stat                                                                             dbpedia                               TCM                 SIDER                                                           Pub                                             KISTI
                                                                                 (FUB)                                                                               lite                                Gene                                          STITCH                               Chem                                                            JISC
              London                                                                               Geo                                                                                                                                                                                                       KEGG
                                                                                                                                                                                                          DIT                                                                                                                LAAS
              Gazette                   TWC LOGD                                                  Linked                                                                               Daily                                                                                OBO                              Drug
                                                               Eurostat                            Data           UMBEL            lingvoj                                             Med
                                                                                                   (es)                                                                                                                                    Disea-
                                                                                                                                                      YAGO                                                            Medi                 some
                                                                                                                                                                                                                      Care                                  ChEBI                                   KEGG                                         NSF
                                                                                                                                                                        Linked                                                                                                                                      KEGG            KEGG
                                                                               Linked                                                                                                                 Drug                                                                                           Cpd
                        GovTrack                        rdfabout                                                                                                                                                                                                                                                                    Glycan
                                                                            Sensor Data                                                                                   CT                          Bank                                                                                                         Pathway
                                                         US SEC                                                         Open                                                                                                                                                Reactome
                                                                             (Kno.e.sis)             riese                                                                               Uni
                                                                                                                         Cyc             Lexvo                                          Path-
                                                                                                                                                                                         way                                                          PDB                                                                                                  Media
                                        Semantic                                                                                                      totl.net                                                                Pfam
                                                                                                                                                                                                                                                                                             HGNC
                                          XBRL
                                                                                                                 WordNet                                                                                                                                                                                        KEGG               KEGG              Geographic
                                                                                                                  (VUA)              Linked                                  Taxo-                                                                                    CAS                                                         Reaction
                                                         rdfabout                                  Twarql                                                                                        UniProt                                                                                                       Enzyme
                                                                                 EUNIS                                                Open                                   nomy
                                                        US Census                                                                                                                                                                                                                                                                                Publications
                                                                                                                                    Numbers                                                                           PRO-              ProDom
                                                                                                                                                                                                                      SITE                                                               Chem2
                                                                                                                                                      UniRef                                                                                                                            Bio2RDF                                     User-generated content
                                                                          Climbing                                    WordNet                                                                                                                                SGD                                           Homolo
                                                                                            Linked                     (W3C)                                                            Affy-                                                                                                               Gene
                                                                                           GeoData
                                                                                                                                        Cornetto
                                                                                                                                                                                        metrix                                                                                                                                                   Government
                                                                                                                                                                                                             PubMed              Gene
                                                                                                                                                                      UniParc
                                                                                                                                                                                                                                Ontology
                                                                                                                                                                                                                                                                        GeneID                                                                 Cross-domain
                                                                                                            Airports
                                                                                                                            Product
                                                                                                                              DB                     UniSTS                                                                                            MGI
                                                                                                                                                                                    Gen                                                                                                                                                          Life sciences
                                                                                                                                                                                    Bank            OMIM              InterPro

                                                                                                                                                                                                                                                                                                                          As of September 2010




                                                                            “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.
                                                                                                                                http ://lod-cloud.net/”
Tools

   Fetching, storing and querying the data → CubicWeb (*)

     Semantic CMS, high-level database management with metadata.
     Multiple sources : RSS, micro-blogging, SQL database, . . .
     Based on PostgreSQL : deals with (very) large database.

   Data-mining and machine learning → Scikits-learn (*)

     Easy-to-use and general-purpose machine learning in Python.
     Unsupervised learning, supervised learning, model selection . . .

   Semantic information database → Dbpedia (*)

     ∼ 8.106 articles with abstracts, images, . . . from Wikipedia.
     ∼ 0, 6.106 categories, 273 types (e.g. person, place, . . . )
     ∼ 100.106 links between articles, categories and types.

                                            (*) Open Source/Creative Commons ! ! !
Storing RSS information with Cubicweb
   Each object (or entity) is defined in a schema and may be displayed
   using different views.

   Storing RSS
   Define (in schema.py ) how RSS should be stored in the database.
   class R S S A r t i c l e ( E n t i t y T y p e ) :
        t i t l e = S t r i n g ( ) # T i t l e o f t h e feed
       u r i = S t r i n g ( unique=True ) # U r i o f t h e r s s feed
       c o n t e n t = S t r i n g ( ) # Content o f t h e feed



   Fetching RSS (based on feedparser and BeautifulSoup)
   Simply construct a source, by giving an URL and a parser :
   u r l = u ’ h t t p : / / feeds . b b c i . co . uk / news / w o r l d / r s s . xml ’
                                                                                                     −
   s = s e s s i o n . c r e a t e _ e n t i t y ( ’ CWSource ’ , name=u ’BBCNews World ’ , u r l = u r l ,
                                                   t y p e =u ’ d a t af e e d ’ , p a r s e r =u ’ rss −p a r s e r ’ ,
                                                   c o n f i g =u ’ s y n c h r o n i z a t i o n − i n t e r v a l =240min ’ )
   s . p u l l _ d a t a ( session )

   7 english/american journals (The New York Times, . . . )
Storing Dbpedia information with Cubicweb

   Storing Dbpedia page
   Define (in schema.py ) how Dbpedia pages should be stored in the
   database.
   class DbpediaPage ( E n t i t y T y p e ) :
       u r i = S t r i n g ( unique=True , indexed=True ) # U r i o f t h e r e s s o u r c e
       l a b e l = S t r i n g ( indexed=True ) # h t t p : / / www. w3 . org / 2 0 0 0 / 0 1 / r d f −schema
       pageid = S t r i n g ( ) # h t t p : / / dbpedia . org / o n t o l o g y / wikiPageID
       a b s t r a c t = S t r i n g ( ) # h t t p : / / dbpedia . org / o n t o l o g y / a b s t r a c t
       homepage = S t r i n g ( ) # h t t p : / / xmlns . com / f o a f / 0 . 1 / homepage
       t h u m b n a i l = S t r i n g ( ) # h t t p : / / dbpedia . org / o n t o l o g y / t h u m b n a i l
       d e p i c t i o n = S t r i n g ( ) # h t t p : / / xmlns . com / f o a f / 0 . 1 / d e p i c t i o n
       w i k i p a g e = S t r i n g ( ) # h t t p : / / xmlns . com / f o a f / 0 . 1 / page
       l a t i t u d e = S t r i n g ( ) # h t t p : / / www. w3 . org / 2 0 0 3 / 0 1 / geo / wgs84_pos
       l o n g i t u d e = S t r i n g ( ) # h t t p : / / www. w3 . org / 2 0 0 3 / 0 1 / geo / wgs84_pos



   Storing all dbpedia information (∼ 9.106 pages, ∼ 100.106 links,
   20Go) in Cubicweb takes less than 24 hours.

                                            → See the full schema
Analyzing RSS news



  What can we do with the RSS news stored in database ?

    1    extract relevant features of the data.
    2    construct a usable (i.e. matrix) representation of the data.
    3    cluster (group) RSS together.
    4    deeper semantic analyze and visualization of the information.

  Example sentence :

        “Google is to buy mobile phone manufacturer Motorola Mobility,
            allowing it to mount a serious challenge to Apple Inc.”
Overview




   1 Introduction

   2 Extracting information from RSS news

   3 Analyzing RSS news
Extracting information : Classical approaches (Scikits-learn)

   Char-N-gram
   Extracts features of N characters from a text.
   from s c i k i t s . l e a r n . f e a t u r e _ e x t r a c t i o n . t e x t import CharNGramAnalyzer
   a n a l y z e r = CharNGramAnalyzer ( min_n =3 , max_n=6)
   f e a t u r e s = a n a l y z e r . analyze ( sentence )

   450 features : ’goo’, ’oog’, ’ogl’, . . . , ’mob’, ’obi’, ’bil’, . . .

   Word-N-gram
   Extracts features of N words from a text.
   from s c i k i t s . l e a r n . f e a t u r e _ e x t r a c t i o n . t e x t import WordNGramAnalyzer
   a n a l y z e r = WordNGramAnalyzer ( min_n =3 , max_n=6)
   f e a t u r e s = a n a l y z e r . analyze ( sentence )

   58 features : ’google is to’, ’is to buy’, . . . , ’serious challenge to’, . . .

       Many irrelevant features (tokens).
       Features do not carry lots of contextual information (i.e.
          understandable by humans).
Feature extraction : Dbpedia - Context

          Main Hypothesis : Only things that exist in Dbpedia (i.e
         Wikipedia) have some interest in news analysis → Named
                        Entities Recognition (NER)
   Dbpedia feature extraction (Cubicweb/Dbpedia)

   from cubes . semnews . views . n e r t o o l s import D b p e d i a E n t i t i e s A n a l y z e r
   a n a l y z e r = D b p e d i a E n t i t i e s A n a l y z e r ( session , l a n g = ’ en ’ )
   tokens = a n a l y z e r . analyze ( sentence )

   → 3 features : ’Apple Inc’, ’Google’, ’Motorola’

           ENAMEX TYPE=ORGANIZATION Google/ENAMEX is to buy mobile phone
   manufacturer ENAMEX TYPE=ORGANIZATION Motorola/ENAMEX Mobility, allowing it to
      mount a serious challenge to ENAMEX TYPE=ORGANIZATION Apple Inc/ENAMEX



                                                        → Try it !
                      “DBpedia Spotlight : Shedding Light on the Web of Documents”, Pablo N. Mendes et al., I-Semantics 2011
                                                      “Learning Named Entity Recognition from Wikipedia”, Joel Nothman 2008
                                 “Large-Scale Named Entity Disambiguation Based on Wikipedia Data”, Silviu Cucerzan 2007
Feature extraction : Dbpedia - Properties

   Efficient and robust feature extraction

        Keep the meaning of a text → interpretable features.
          ’. . . said former soldier Larry, . . . ’
          ’. . . said student Larry, . . . ’
          ’. . . said Larry Page, . . . ’
        Robust features based on redirections.
        e.g. ’Obama’, ’Barak Obamba’, ’Pres. Obama’ redirect to ’Barack Obama’



   Simple RQL (Relation Query Language) queries

        Fast, based on indexed SQL tables and regular expressions :
        rset = rql(’Any E WHERE E is DbpediaPage,
                                E label %(token)s’, {’token’: token})

        e.g. 19 entities extracted in 4s in 765 words, among ∼ 8.106 dbpedia entries.

        Different labels but same URI → cross-language feature
        extraction : e.g. Grenada/Grenade → http ://dbpedia.org/resource/Grenada
Overview




   1 Introduction

   2 Extracting information from RSS news

   3 Analyzing RSS news
Feature extraction : Matrix representation and storage

                                                                       
                   Obama’s approval . . .           0     0   ...   1   0
                  Protest as Spain . . .  
                                            
                                                    0
                                                          0   ...   0   1 
                                                                           
                            ...               →  ...                   
                                                                       
                  Libya rebels fight . . .    0        1   ...   0   0 
                     Obama in Spain . . .            0   0   ...   1   1




   E.g. The Obama-Merkel-Sarkozy
   space . . .




   Results stored in a relation (appears in rss) in Cubicweb :

   rql(’Any X WHERE X appears_in_rss Y’)
Exploiting information : Clustering news

              Creating clusters (groups) of news, using the matrix
                           representation of the data.


   Meanshift algorithm (Scikits-learn)

          Based on locating the maxima of a density function.
          Automatically tunes the number of clusters.

   from s c i k i t s . l e a r n . c l u s t e r import MeanShift , est im ate _ba nd wid th
   bandwidth = es ti mat e_b an dwi dth ( X , q u a n t i l e =0.005)
   c l u s t e r i n g = MeanShift ( bandwidth=bandwidth )
   c l u s t e r i n g . f i t (X)
   labels = clustering . labels_

      “Mean shift, mode seeking, and clustering.”, Yizong Cheng. IEEE Transactions on Pattern Analysis and Machine Intelligence
                                                                                                                           1995



                                                          → Try it !


   Other possible alternatives : Ward’s algorithm, K-means, . . .
Exploiting information : Queries and Views
   Each query returns a result set (rset). A view is called on a result set
   → define the representation rules.
   Defining a view (short version)

   class M y E n t i t i e s V i e w ( View ) :
       __regid__ = ’ example−view ’

        def c a l l ( s e l f ) :
            r s e t = s e l f . cw_rset
            for e n t i t y in r s e t . e n t i t i e s ( ) :
                    do_whatever_you_want_ . . .
                    write_some_html_ . . .



   . . . you just have to plug a rset from a query :
   r s e t = r q l ( ’ Any X WHERE . . . ’ )
   s e l f . wview ( ’ example−view ’ , r s e t )

   or within an url :

   http://myapplication/?rql=Any X WHERE ....vid=example-view
Pluging Scikits.learn in a view
   Combine machine learning tools and database query system, in pure
   Python.
   Defining the view

   class RSSClustering ( View ) :
       __regid__ = ’ rss −c l u s t e r i n g ’

        def c a l l ( s e l f ) :
            # Get t h e r e s u l t s s e t and c r e a t e t h e data
            r s e t = s e l f . cw_rset
            o c c u r r e n c e _ m a t r i x = Oc c ur re n ce Ma t ri x ( )
            occurrence_matrix . construct_matrix ( r s e t = r s e t )
            # Scale t h e data
            X = s c a l e ( o c c u r r e n c e _ m a t r i x , a x i s =0 , w i t h _ s t d =True , copy=True )
            # Compute bandwith f o r c l u s t e r i n g
            bandwidth = e sti mat e_ ban dwi dt h (X)
            # Clustering
            c l u s t e r = MeanShift ( bandwidth=bandwidth )
            c l u s t e r . f i t (X)
            labels = cluster . labels_
            # Perform some HTML r e n d e r i n g
            ...

                                                    → Try it !
A new approach for querying information from RSS
   All musical artists in the news
   rql(’DISTINCT Any E, R WHERE E appears_in_rss R,
        E has_type T, T label musical artist’)

   All living office holder persons in the news
   rql(’DISTINCT Any E WHERE E appears_in_rss R,
        E has_type T, T label office holder,
        E has_subject C, C label Living people’)

   All news that talk about Barack Obama and any scientist
   rql(’DISTINCT Any R WHERE E1 label Barack Obama,
        E1 appears_in_rss R, E2 appears_in_rss R,
        E2 has_type T, T label scientist’)

   All news that talk about a drug
   rql(’Any X, R WHERE X appears_in_rss R,
        X has_type T, T label drug’)
   Try it ! . . . with an xml view . . . or a thumbnail view
Vizualisation : Mapping information

   View
   class EntityMapView ( View ) :
       __regid__ = ’map ’

        def c a l l ( s e l f ) :
           r s e t = s e l f . cw_rset
           s e l f . init_map ( )
           for e n t i t y in r s e t . e n t i t i e s ( ) :
                 s e l f . add_marker ( e n t i t y . l a t i t u d e , e n t i t y . l o n g i t u d e ,
                                        entity . dc_title )
           s e l f . center_and_zoom ( 0 , 0 , 1 . 5 )
           s e l f . finish_map ( )

   Based on mapstraction (javascript) : http ://mapstraction.com/



               Automatically locate information from RSS news.

                                                          Try it !
Conclusion

   Cubicweb and Scikits-learn :
        Efficient and easy-to-use tools for data storing, querying and
        mining.
        Easy to plug together, only Python tools, all Open Source.

   Using Dbpedia allows to extract very few highly relevant features

     Decrease the dimensionality of the data.
     Link features to millions of pages of information.


   A new semantic way for querying information

     Simple information queries using RQL expressions.
     Use Dbpedia types and categories to refine the selection.
Future Improvements

   Named Entities Recognition

    → use disambiguations links / more refined Regular expressions.
    → add new databases (MusicBrainz, diseasome, . . . ).
    → closely follow Wikipedia with Dbpedia live update :
         Motorola Mobility (11 :46, 15 August 2011) ’On the 15 August,
          Google announced that it agreed to acquire the company.’


   RSS news analyzing

    → explore new algorithms : bi-clustering, . . .
    → add new data sources : Twitters, Blogs, . . .
    → ’from scikits.learn predict __future__ . . . ’ → use Matrix
       Completion to predict new edges in the correlation graph.
Thanks you for your attention !
          Questions ?

Mais conteúdo relacionado

Semelhante a Euroscipy SemNews 2011

Ontology Alignment using Linked Data
Ontology Alignment using Linked DataOntology Alignment using Linked Data
Ontology Alignment using Linked DataTim Hodson
 
SIKS 2011 Semantic Web Languages
SIKS 2011 Semantic Web LanguagesSIKS 2011 Semantic Web Languages
SIKS 2011 Semantic Web LanguagesRinke Hoekstra
 
Semantic Representations for Research
Semantic Representations for ResearchSemantic Representations for Research
Semantic Representations for ResearchRinke Hoekstra
 
Graph-based Ontology Analysis in the Linked Open Data
Graph-based Ontology Analysis in the Linked Open DataGraph-based Ontology Analysis in the Linked Open Data
Graph-based Ontology Analysis in the Linked Open DataLihua Zhao
 
LinkedDataと地理空間情報
LinkedDataと地理空間情報LinkedDataと地理空間情報
LinkedDataと地理空間情報Fumihiro Kato
 
20111110 LOD のご紹介
20111110 LOD のご紹介20111110 LOD のご紹介
20111110 LOD のご紹介Fumihiro Kato
 

Semelhante a Euroscipy SemNews 2011 (7)

Ontology Alignment using Linked Data
Ontology Alignment using Linked DataOntology Alignment using Linked Data
Ontology Alignment using Linked Data
 
SIKS 2011 Semantic Web Languages
SIKS 2011 Semantic Web LanguagesSIKS 2011 Semantic Web Languages
SIKS 2011 Semantic Web Languages
 
Semantic Representations for Research
Semantic Representations for ResearchSemantic Representations for Research
Semantic Representations for Research
 
Graph-based Ontology Analysis in the Linked Open Data
Graph-based Ontology Analysis in the Linked Open DataGraph-based Ontology Analysis in the Linked Open Data
Graph-based Ontology Analysis in the Linked Open Data
 
LinkedDataと地理空間情報
LinkedDataと地理空間情報LinkedDataと地理空間情報
LinkedDataと地理空間情報
 
20111110 LOD のご紹介
20111110 LOD のご紹介20111110 LOD のご紹介
20111110 LOD のご紹介
 
ReDD-Observatory
ReDD-ObservatoryReDD-Observatory
ReDD-Observatory
 

Mais de Logilab

Testinfra pyconfr 2017
Testinfra pyconfr 2017Testinfra pyconfr 2017
Testinfra pyconfr 2017Logilab
 
Open Source & Open Data : les bienfaits des communs
Open Source & Open Data : les bienfaits des communsOpen Source & Open Data : les bienfaits des communs
Open Source & Open Data : les bienfaits des communsLogilab
 
Salon Open Data
Salon Open DataSalon Open Data
Salon Open DataLogilab
 
Pydata Paris Python for manufacturing musical instruments
Pydata Paris Python for manufacturing musical instrumentsPydata Paris Python for manufacturing musical instruments
Pydata Paris Python for manufacturing musical instrumentsLogilab
 
Présentation Logilab
Présentation LogilabPrésentation Logilab
Présentation LogilabLogilab
 
Système d'archivage électronique mutualisé
Système d'archivage électronique mutualiséSystème d'archivage électronique mutualisé
Système d'archivage électronique mutualiséLogilab
 
Utiliser salt pour tester son infrastructure sur open stack ou docker
Utiliser salt pour tester son infrastructure sur open stack ou dockerUtiliser salt pour tester son infrastructure sur open stack ou docker
Utiliser salt pour tester son infrastructure sur open stack ou dockerLogilab
 
Importer des données en Python avec CubicWeb 3.21
Importer des données en Python avec CubicWeb 3.21Importer des données en Python avec CubicWeb 3.21
Importer des données en Python avec CubicWeb 3.21Logilab
 
Simulagora au service d'un grand défi industriel
Simulagora au service d'un grand défi industrielSimulagora au service d'un grand défi industriel
Simulagora au service d'un grand défi industrielLogilab
 
Simulagora - Salon du Bourget
Simulagora - Salon du BourgetSimulagora - Salon du Bourget
Simulagora - Salon du BourgetLogilab
 
Innover par et pour la donnée - Logilab ADBU Bibcamp 2015
Innover par et pour la donnée - Logilab ADBU Bibcamp 2015Innover par et pour la donnée - Logilab ADBU Bibcamp 2015
Innover par et pour la donnée - Logilab ADBU Bibcamp 2015Logilab
 
Study of the dynamic behavior of a pump with Code_ASTER on Simulagora
Study of the dynamic behavior of a pump with Code_ASTER on SimulagoraStudy of the dynamic behavior of a pump with Code_ASTER on Simulagora
Study of the dynamic behavior of a pump with Code_ASTER on SimulagoraLogilab
 
Initialiser des conteneurs Docker à partir de configurations Salt construites...
Initialiser des conteneurs Docker à partir de configurations Salt construites...Initialiser des conteneurs Docker à partir de configurations Salt construites...
Initialiser des conteneurs Docker à partir de configurations Salt construites...Logilab
 
Battle Opendata - Logilab - Cubicweb
Battle Opendata - Logilab - CubicwebBattle Opendata - Logilab - Cubicweb
Battle Opendata - Logilab - CubicwebLogilab
 
Debconf14 : Putting some salt in your Debian systems -- Julien Cristau
Debconf14 : Putting some salt in your Debian systems -- Julien CristauDebconf14 : Putting some salt in your Debian systems -- Julien Cristau
Debconf14 : Putting some salt in your Debian systems -- Julien CristauLogilab
 
Simulagora (Euroscipy2014 - Logilab)
Simulagora (Euroscipy2014 - Logilab)Simulagora (Euroscipy2014 - Logilab)
Simulagora (Euroscipy2014 - Logilab)Logilab
 
PAFI (Euroscipy2014 - Logilab)
PAFI (Euroscipy2014 - Logilab)PAFI (Euroscipy2014 - Logilab)
PAFI (Euroscipy2014 - Logilab)Logilab
 
Open Legislative Data Conference 2014
Open Legislative Data Conference 2014Open Legislative Data Conference 2014
Open Legislative Data Conference 2014Logilab
 
Pylint : 10 ans, état des lieux
Pylint : 10 ans, état des lieuxPylint : 10 ans, état des lieux
Pylint : 10 ans, état des lieuxLogilab
 
BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...Logilab
 

Mais de Logilab (20)

Testinfra pyconfr 2017
Testinfra pyconfr 2017Testinfra pyconfr 2017
Testinfra pyconfr 2017
 
Open Source & Open Data : les bienfaits des communs
Open Source & Open Data : les bienfaits des communsOpen Source & Open Data : les bienfaits des communs
Open Source & Open Data : les bienfaits des communs
 
Salon Open Data
Salon Open DataSalon Open Data
Salon Open Data
 
Pydata Paris Python for manufacturing musical instruments
Pydata Paris Python for manufacturing musical instrumentsPydata Paris Python for manufacturing musical instruments
Pydata Paris Python for manufacturing musical instruments
 
Présentation Logilab
Présentation LogilabPrésentation Logilab
Présentation Logilab
 
Système d'archivage électronique mutualisé
Système d'archivage électronique mutualiséSystème d'archivage électronique mutualisé
Système d'archivage électronique mutualisé
 
Utiliser salt pour tester son infrastructure sur open stack ou docker
Utiliser salt pour tester son infrastructure sur open stack ou dockerUtiliser salt pour tester son infrastructure sur open stack ou docker
Utiliser salt pour tester son infrastructure sur open stack ou docker
 
Importer des données en Python avec CubicWeb 3.21
Importer des données en Python avec CubicWeb 3.21Importer des données en Python avec CubicWeb 3.21
Importer des données en Python avec CubicWeb 3.21
 
Simulagora au service d'un grand défi industriel
Simulagora au service d'un grand défi industrielSimulagora au service d'un grand défi industriel
Simulagora au service d'un grand défi industriel
 
Simulagora - Salon du Bourget
Simulagora - Salon du BourgetSimulagora - Salon du Bourget
Simulagora - Salon du Bourget
 
Innover par et pour la donnée - Logilab ADBU Bibcamp 2015
Innover par et pour la donnée - Logilab ADBU Bibcamp 2015Innover par et pour la donnée - Logilab ADBU Bibcamp 2015
Innover par et pour la donnée - Logilab ADBU Bibcamp 2015
 
Study of the dynamic behavior of a pump with Code_ASTER on Simulagora
Study of the dynamic behavior of a pump with Code_ASTER on SimulagoraStudy of the dynamic behavior of a pump with Code_ASTER on Simulagora
Study of the dynamic behavior of a pump with Code_ASTER on Simulagora
 
Initialiser des conteneurs Docker à partir de configurations Salt construites...
Initialiser des conteneurs Docker à partir de configurations Salt construites...Initialiser des conteneurs Docker à partir de configurations Salt construites...
Initialiser des conteneurs Docker à partir de configurations Salt construites...
 
Battle Opendata - Logilab - Cubicweb
Battle Opendata - Logilab - CubicwebBattle Opendata - Logilab - Cubicweb
Battle Opendata - Logilab - Cubicweb
 
Debconf14 : Putting some salt in your Debian systems -- Julien Cristau
Debconf14 : Putting some salt in your Debian systems -- Julien CristauDebconf14 : Putting some salt in your Debian systems -- Julien Cristau
Debconf14 : Putting some salt in your Debian systems -- Julien Cristau
 
Simulagora (Euroscipy2014 - Logilab)
Simulagora (Euroscipy2014 - Logilab)Simulagora (Euroscipy2014 - Logilab)
Simulagora (Euroscipy2014 - Logilab)
 
PAFI (Euroscipy2014 - Logilab)
PAFI (Euroscipy2014 - Logilab)PAFI (Euroscipy2014 - Logilab)
PAFI (Euroscipy2014 - Logilab)
 
Open Legislative Data Conference 2014
Open Legislative Data Conference 2014Open Legislative Data Conference 2014
Open Legislative Data Conference 2014
 
Pylint : 10 ans, état des lieux
Pylint : 10 ans, état des lieuxPylint : 10 ans, état des lieux
Pylint : 10 ans, état des lieux
 
BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...
 

Último

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Último (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Euroscipy SemNews 2011

  • 1. A semantic news aggregator in Python using Dbpedia, Cubicweb and Scikits-learn Vincent Michel - Logilab 27 août 2011
  • 2. Context With a bunch of RSS news... Google Borrows Apple Strategy : Google’s deal underscores the allure of a business model pioneered... Japan disaster plant cold shutdown could face delay : TOKYO (Reuters) - Tokyo Electric Power Co said on Wednesday... Libya shows signs of slipping from Muammar Gaddafi’s grasp : Supply lines to capital in peril as coastal cities fall... ... how can we analyze them in Python ? → Clustering (grouping) RSS (e.g. Google News). → Extracting/synthetizing information. → Providing semantic useful/original visualisation and analytics tools.
  • 3. Overview 1 Introduction 2 Extracting information from RSS news 3 Analyzing RSS news
  • 4. Semantic ? Sussex St. Reading Andrews NDL Audio- Lists Resource subjects t4gm MySpace scrobbler Lists Moseley (DBTune) (DBTune) RAMEAU Folk NTU SH lobid GTAA Plymouth Resource Lists Organi- Reading Lists sations Music The Open ECS Magna- Brainz Music DB tune Library LCSH South- (Data Brainz LIBRIS ampton Tropes lobid Ulm Incubator) (zitgist) Man- EPrints Resources chester Surge Reading biz. Music RISKS Radio Lists The Open ECS data. John Brainz Discogs Library PSH Gem. UB South- gov.uk Peel (DBTune) FanHubz (Data In- (Talis) Norm- Mann- ampton (DB cubator) Jamendo datei heim RESEX Tune) Popula- Poké- DEPLOY Last.fm tion (En- pédia Artists Last.FM Linked RDF AKTing) research EUTC (DBTune) (rdfize) LCCN VIAF Book Wiki data.gov Produc- Pisa Eurécom P20 Mashup semantic NHS .uk tions classical web.org (EnAKTing) Pokedex (DB Mortality Tune) PBAC ECS (En- AKTing) BBC MARC (RKB Budapest Program Codes Explorer) Energy education OpenEI BBC List Semantic Lotico Revyu OAI (En- CO2 data.gov mes Music Crunch SW AKTing) (En- .uk Chronic- Linked Dog NSZL Base AKTing) ling Event- MDB RDF Food IRIT America Media Catalog ohloh BBC DBLP ACM IBM Good- BibBase Ord- Wildlife (RKB Openly Recht- win nance Finder Explorer) Local spraak. Family DBLP legislation Survey Tele- New VIVO UF .gov.uk nl graphis York flickr (L3S) New- VIVO castle Times URI wrappr Open Indiana RAE2001 UK Post- Burner Calais DBLP codes statistics (FU VIVO CiteSeer Roma data.gov LOIUS Taxon iServe Berlin) IEEE .uk Cornell Concept Geo World data ESD Fact- OS dcs Names book dotAC stan- reference Project Linked Data NASA (FUB) Freebase dards data.gov Guten- .uk for Intervals (Data GESIS Course- transport DBpedia berg STW ePrints CORDIS Incu- ware data.gov bator) (FUB) Fishes ERA UN/ .uk of Texas Geo LOCODE Uberblic Euro- Species The stat dbpedia TCM SIDER Pub KISTI (FUB) lite Gene STITCH Chem JISC London Geo KEGG DIT LAAS Gazette TWC LOGD Linked Daily OBO Drug Eurostat Data UMBEL lingvoj Med (es) Disea- YAGO Medi some Care ChEBI KEGG NSF Linked KEGG KEGG Linked Drug Cpd GovTrack rdfabout Glycan Sensor Data CT Bank Pathway US SEC Open Reactome (Kno.e.sis) riese Uni Cyc Lexvo Path- way PDB Media Semantic totl.net Pfam HGNC XBRL WordNet KEGG KEGG Geographic (VUA) Linked Taxo- CAS Reaction rdfabout Twarql UniProt Enzyme EUNIS Open nomy US Census Publications Numbers PRO- ProDom SITE Chem2 UniRef Bio2RDF User-generated content Climbing WordNet SGD Homolo Linked (W3C) Affy- Gene GeoData Cornetto metrix Government PubMed Gene UniParc Ontology GeneID Cross-domain Airports Product DB UniSTS MGI Gen Life sciences Bank OMIM InterPro As of September 2010 “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http ://lod-cloud.net/”
  • 5. Tools Fetching, storing and querying the data → CubicWeb (*) Semantic CMS, high-level database management with metadata. Multiple sources : RSS, micro-blogging, SQL database, . . . Based on PostgreSQL : deals with (very) large database. Data-mining and machine learning → Scikits-learn (*) Easy-to-use and general-purpose machine learning in Python. Unsupervised learning, supervised learning, model selection . . . Semantic information database → Dbpedia (*) ∼ 8.106 articles with abstracts, images, . . . from Wikipedia. ∼ 0, 6.106 categories, 273 types (e.g. person, place, . . . ) ∼ 100.106 links between articles, categories and types. (*) Open Source/Creative Commons ! ! !
  • 6. Storing RSS information with Cubicweb Each object (or entity) is defined in a schema and may be displayed using different views. Storing RSS Define (in schema.py ) how RSS should be stored in the database. class R S S A r t i c l e ( E n t i t y T y p e ) : t i t l e = S t r i n g ( ) # T i t l e o f t h e feed u r i = S t r i n g ( unique=True ) # U r i o f t h e r s s feed c o n t e n t = S t r i n g ( ) # Content o f t h e feed Fetching RSS (based on feedparser and BeautifulSoup) Simply construct a source, by giving an URL and a parser : u r l = u ’ h t t p : / / feeds . b b c i . co . uk / news / w o r l d / r s s . xml ’ − s = s e s s i o n . c r e a t e _ e n t i t y ( ’ CWSource ’ , name=u ’BBCNews World ’ , u r l = u r l , t y p e =u ’ d a t af e e d ’ , p a r s e r =u ’ rss −p a r s e r ’ , c o n f i g =u ’ s y n c h r o n i z a t i o n − i n t e r v a l =240min ’ ) s . p u l l _ d a t a ( session ) 7 english/american journals (The New York Times, . . . )
  • 7. Storing Dbpedia information with Cubicweb Storing Dbpedia page Define (in schema.py ) how Dbpedia pages should be stored in the database. class DbpediaPage ( E n t i t y T y p e ) : u r i = S t r i n g ( unique=True , indexed=True ) # U r i o f t h e r e s s o u r c e l a b e l = S t r i n g ( indexed=True ) # h t t p : / / www. w3 . org / 2 0 0 0 / 0 1 / r d f −schema pageid = S t r i n g ( ) # h t t p : / / dbpedia . org / o n t o l o g y / wikiPageID a b s t r a c t = S t r i n g ( ) # h t t p : / / dbpedia . org / o n t o l o g y / a b s t r a c t homepage = S t r i n g ( ) # h t t p : / / xmlns . com / f o a f / 0 . 1 / homepage t h u m b n a i l = S t r i n g ( ) # h t t p : / / dbpedia . org / o n t o l o g y / t h u m b n a i l d e p i c t i o n = S t r i n g ( ) # h t t p : / / xmlns . com / f o a f / 0 . 1 / d e p i c t i o n w i k i p a g e = S t r i n g ( ) # h t t p : / / xmlns . com / f o a f / 0 . 1 / page l a t i t u d e = S t r i n g ( ) # h t t p : / / www. w3 . org / 2 0 0 3 / 0 1 / geo / wgs84_pos l o n g i t u d e = S t r i n g ( ) # h t t p : / / www. w3 . org / 2 0 0 3 / 0 1 / geo / wgs84_pos Storing all dbpedia information (∼ 9.106 pages, ∼ 100.106 links, 20Go) in Cubicweb takes less than 24 hours. → See the full schema
  • 8. Analyzing RSS news What can we do with the RSS news stored in database ? 1 extract relevant features of the data. 2 construct a usable (i.e. matrix) representation of the data. 3 cluster (group) RSS together. 4 deeper semantic analyze and visualization of the information. Example sentence : “Google is to buy mobile phone manufacturer Motorola Mobility, allowing it to mount a serious challenge to Apple Inc.”
  • 9. Overview 1 Introduction 2 Extracting information from RSS news 3 Analyzing RSS news
  • 10. Extracting information : Classical approaches (Scikits-learn) Char-N-gram Extracts features of N characters from a text. from s c i k i t s . l e a r n . f e a t u r e _ e x t r a c t i o n . t e x t import CharNGramAnalyzer a n a l y z e r = CharNGramAnalyzer ( min_n =3 , max_n=6) f e a t u r e s = a n a l y z e r . analyze ( sentence ) 450 features : ’goo’, ’oog’, ’ogl’, . . . , ’mob’, ’obi’, ’bil’, . . . Word-N-gram Extracts features of N words from a text. from s c i k i t s . l e a r n . f e a t u r e _ e x t r a c t i o n . t e x t import WordNGramAnalyzer a n a l y z e r = WordNGramAnalyzer ( min_n =3 , max_n=6) f e a t u r e s = a n a l y z e r . analyze ( sentence ) 58 features : ’google is to’, ’is to buy’, . . . , ’serious challenge to’, . . . Many irrelevant features (tokens). Features do not carry lots of contextual information (i.e. understandable by humans).
  • 11. Feature extraction : Dbpedia - Context Main Hypothesis : Only things that exist in Dbpedia (i.e Wikipedia) have some interest in news analysis → Named Entities Recognition (NER) Dbpedia feature extraction (Cubicweb/Dbpedia) from cubes . semnews . views . n e r t o o l s import D b p e d i a E n t i t i e s A n a l y z e r a n a l y z e r = D b p e d i a E n t i t i e s A n a l y z e r ( session , l a n g = ’ en ’ ) tokens = a n a l y z e r . analyze ( sentence ) → 3 features : ’Apple Inc’, ’Google’, ’Motorola’ ENAMEX TYPE=ORGANIZATION Google/ENAMEX is to buy mobile phone manufacturer ENAMEX TYPE=ORGANIZATION Motorola/ENAMEX Mobility, allowing it to mount a serious challenge to ENAMEX TYPE=ORGANIZATION Apple Inc/ENAMEX → Try it ! “DBpedia Spotlight : Shedding Light on the Web of Documents”, Pablo N. Mendes et al., I-Semantics 2011 “Learning Named Entity Recognition from Wikipedia”, Joel Nothman 2008 “Large-Scale Named Entity Disambiguation Based on Wikipedia Data”, Silviu Cucerzan 2007
  • 12. Feature extraction : Dbpedia - Properties Efficient and robust feature extraction Keep the meaning of a text → interpretable features. ’. . . said former soldier Larry, . . . ’ ’. . . said student Larry, . . . ’ ’. . . said Larry Page, . . . ’ Robust features based on redirections. e.g. ’Obama’, ’Barak Obamba’, ’Pres. Obama’ redirect to ’Barack Obama’ Simple RQL (Relation Query Language) queries Fast, based on indexed SQL tables and regular expressions : rset = rql(’Any E WHERE E is DbpediaPage, E label %(token)s’, {’token’: token}) e.g. 19 entities extracted in 4s in 765 words, among ∼ 8.106 dbpedia entries. Different labels but same URI → cross-language feature extraction : e.g. Grenada/Grenade → http ://dbpedia.org/resource/Grenada
  • 13. Overview 1 Introduction 2 Extracting information from RSS news 3 Analyzing RSS news
  • 14. Feature extraction : Matrix representation and storage     Obama’s approval . . . 0 0 ... 1 0  Protest as Spain . . .     0  0 ... 0 1    ...  →  ...       Libya rebels fight . . .   0 1 ... 0 0  Obama in Spain . . . 0 0 ... 1 1 E.g. The Obama-Merkel-Sarkozy space . . . Results stored in a relation (appears in rss) in Cubicweb : rql(’Any X WHERE X appears_in_rss Y’)
  • 15. Exploiting information : Clustering news Creating clusters (groups) of news, using the matrix representation of the data. Meanshift algorithm (Scikits-learn) Based on locating the maxima of a density function. Automatically tunes the number of clusters. from s c i k i t s . l e a r n . c l u s t e r import MeanShift , est im ate _ba nd wid th bandwidth = es ti mat e_b an dwi dth ( X , q u a n t i l e =0.005) c l u s t e r i n g = MeanShift ( bandwidth=bandwidth ) c l u s t e r i n g . f i t (X) labels = clustering . labels_ “Mean shift, mode seeking, and clustering.”, Yizong Cheng. IEEE Transactions on Pattern Analysis and Machine Intelligence 1995 → Try it ! Other possible alternatives : Ward’s algorithm, K-means, . . .
  • 16. Exploiting information : Queries and Views Each query returns a result set (rset). A view is called on a result set → define the representation rules. Defining a view (short version) class M y E n t i t i e s V i e w ( View ) : __regid__ = ’ example−view ’ def c a l l ( s e l f ) : r s e t = s e l f . cw_rset for e n t i t y in r s e t . e n t i t i e s ( ) : do_whatever_you_want_ . . . write_some_html_ . . . . . . you just have to plug a rset from a query : r s e t = r q l ( ’ Any X WHERE . . . ’ ) s e l f . wview ( ’ example−view ’ , r s e t ) or within an url : http://myapplication/?rql=Any X WHERE ....vid=example-view
  • 17. Pluging Scikits.learn in a view Combine machine learning tools and database query system, in pure Python. Defining the view class RSSClustering ( View ) : __regid__ = ’ rss −c l u s t e r i n g ’ def c a l l ( s e l f ) : # Get t h e r e s u l t s s e t and c r e a t e t h e data r s e t = s e l f . cw_rset o c c u r r e n c e _ m a t r i x = Oc c ur re n ce Ma t ri x ( ) occurrence_matrix . construct_matrix ( r s e t = r s e t ) # Scale t h e data X = s c a l e ( o c c u r r e n c e _ m a t r i x , a x i s =0 , w i t h _ s t d =True , copy=True ) # Compute bandwith f o r c l u s t e r i n g bandwidth = e sti mat e_ ban dwi dt h (X) # Clustering c l u s t e r = MeanShift ( bandwidth=bandwidth ) c l u s t e r . f i t (X) labels = cluster . labels_ # Perform some HTML r e n d e r i n g ... → Try it !
  • 18. A new approach for querying information from RSS All musical artists in the news rql(’DISTINCT Any E, R WHERE E appears_in_rss R, E has_type T, T label musical artist’) All living office holder persons in the news rql(’DISTINCT Any E WHERE E appears_in_rss R, E has_type T, T label office holder, E has_subject C, C label Living people’) All news that talk about Barack Obama and any scientist rql(’DISTINCT Any R WHERE E1 label Barack Obama, E1 appears_in_rss R, E2 appears_in_rss R, E2 has_type T, T label scientist’) All news that talk about a drug rql(’Any X, R WHERE X appears_in_rss R, X has_type T, T label drug’) Try it ! . . . with an xml view . . . or a thumbnail view
  • 19. Vizualisation : Mapping information View class EntityMapView ( View ) : __regid__ = ’map ’ def c a l l ( s e l f ) : r s e t = s e l f . cw_rset s e l f . init_map ( ) for e n t i t y in r s e t . e n t i t i e s ( ) : s e l f . add_marker ( e n t i t y . l a t i t u d e , e n t i t y . l o n g i t u d e , entity . dc_title ) s e l f . center_and_zoom ( 0 , 0 , 1 . 5 ) s e l f . finish_map ( ) Based on mapstraction (javascript) : http ://mapstraction.com/ Automatically locate information from RSS news. Try it !
  • 20. Conclusion Cubicweb and Scikits-learn : Efficient and easy-to-use tools for data storing, querying and mining. Easy to plug together, only Python tools, all Open Source. Using Dbpedia allows to extract very few highly relevant features Decrease the dimensionality of the data. Link features to millions of pages of information. A new semantic way for querying information Simple information queries using RQL expressions. Use Dbpedia types and categories to refine the selection.
  • 21. Future Improvements Named Entities Recognition → use disambiguations links / more refined Regular expressions. → add new databases (MusicBrainz, diseasome, . . . ). → closely follow Wikipedia with Dbpedia live update : Motorola Mobility (11 :46, 15 August 2011) ’On the 15 August, Google announced that it agreed to acquire the company.’ RSS news analyzing → explore new algorithms : bi-clustering, . . . → add new data sources : Twitters, Blogs, . . . → ’from scikits.learn predict __future__ . . . ’ → use Matrix Completion to predict new edges in the correlation graph.
  • 22. Thanks you for your attention ! Questions ?