Text-similarity algorithm for the semantic annotation of Web Services
               SWAP research group - 27 July 2010
                     Michele Filannino, @bronko85
Outline
      The problem
           Reference scenario
           Similarity

      SAWA
           Word-to-word similarity
           Text-to-text similarity

      Experimental results
           Quality of the results
           Execution time

      Future developments
      Demo session
The problem
How can the similarity between two texts be measured?
Reference scenario
      WSDL file
        -> CODEArchitects Annotation Tool (natural language descriptions)
        -> CODEArchitects Annotation Tool (to approve/reject suggested annotations)
        -> SAWSDL file
Semantic similarity
      Assign a meaning-based similarity measure to a set of terms and/or documents;

      Similarity ≠ relatedness;
      “Bank” and “money” are related, although they are not similar at all;

      Similarity ⇒ relatedness;
      “Girl” and “lass” are similar, and therefore also related.
Semantic similarity in SWOP

    WS concepts              Ontology concepts
    - RequestOrder           - Order
    - Order                  - OrderNumber
    - BillingInformation     - OrderID
    - ...                    - BillID
                             - BillReference
                             - BusinessFirm
                             - Product
                             - Catalog
                             - ...
Computational cost

     Example:
      Ontology with 1,200 concepts
      WSDL with 15 annotations
      1,200 x 15 = 18,000 runs of SAWA   :(
SAWA
Similarity Algorithm Wikipedia-bAsed
Word-to-word similarity

      Given two words, determine how similar they are;
      Types of algorithms for computing the similarity between words:
        Corpus-based: pointwise mutual information, latent semantic analysis;
        Hierarchy-based: Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin, Jiang & Conrath;

      Input: two words;
      Output: a score between 0 and 1.
Lin's algorithm (1998)
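For reference, the information-theoretic similarity measure from Lin (1998) is commonly written as below. The notation is an assumption and is not taken from the slide: c_1 and c_2 are two concepts in a hierarchy, lcs(c_1, c_2) is their lowest common subsumer, and P(c) is the probability of encountering concept c. Whether DISCO applies exactly this form is not confirmed by the source.

```latex
\[
  \mathrm{sim}_{\mathrm{Lin}}(c_1, c_2)
    = \frac{2 \,\log P\bigl(\mathrm{lcs}(c_1, c_2)\bigr)}
           {\log P(c_1) + \log P(c_2)}
\]
```

The value lies between 0 and 1, which matches the output range stated for word-to-word similarity.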
Word-to-word similarity tool

       Library used: LinguaTools DISCO;
       Uses Wikipedia as the concept hierarchy
         202,578 concepts;
         Updated as of 1 January 2008
       Uses Lin's algorithm to compute similarity.
Examples

      Tiger, lion = 90%
      Doctor, nurse = 70%
      Stock, market = 47%
      Love, sex = 46%
      FBI, investigation = 35%
      Professor, cucumber = 0.006%
Quality of the algorithm
         Corpus used to measure quality: WordSim353;
         Correlation coefficients (Pearson):
           Wikipedia: 0.574;
           BNC: 0.415;
           PubMed: 0.105;
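A minimal sketch of how a Pearson correlation between human judgements and algorithm scores can be computed. The function name `pearson` and the toy score lists are hypothetical; they are not WordSim353 data and do not reproduce the coefficients above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalised; the factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: human ratings vs. scores produced by an algorithm.
human_scores = [7.4, 9.0, 1.5, 5.0]
system_scores = [0.60, 0.90, 0.10, 0.40]
print(round(pearson(human_scores, system_scores), 3))
```

A coefficient close to 1 means the algorithm ranks word pairs much like the human annotators do.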

Text-to-text similarity
       Given two texts, determine how similar they are;
       A suitable extension of the word-to-word similarity algorithms;
       Removal of stopwords, i.e. words with
         low discriminative power;
         high frequency of occurrence;

       Input: two texts;
       Output: a score between 0 and 1.
Stopwords

      Original text:
        “Returns the first and last name of each customer who is categorized as an individual consumer”

      After stopword removal:
        “name customer categorized individual consumer”
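The stopword step can be sketched as follows. The `STOPWORDS` set is a hypothetical list chosen only so that the slide's example is reproduced; the list actually used by SAWA is not given in the source.

```python
# Hypothetical stopword list (assumption): just enough to cover the example.
STOPWORDS = {
    "returns", "the", "first", "and", "last", "of",
    "each", "who", "is", "as", "an",
}


def remove_stopwords(text: str) -> str:
    """Lower-case the text, split on whitespace and drop stopwords."""
    tokens = text.lower().split()
    return " ".join(token for token in tokens if token not in STOPWORDS)


sentence = (
    "Returns the first and last name of each customer "
    "who is categorized as an individual consumer"
)
print(remove_stopwords(sentence))
# prints: name customer categorized individual consumer
```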
Corley & Mihalcea algorithm (2005)
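A simplified sketch of a text-to-text measure in the spirit of Corley & Mihalcea (2005): each word is matched with its most similar word in the other text, matches are weighted by idf, and the two directions are averaged. The part-of-speech grouping of the original paper is omitted, and `exact_match` and `uniform_idf` are hypothetical stand-ins for a real word-to-word measure and real idf weights.

```python
def directional_similarity(source, target, word_sim, idf):
    """idf-weighted average of each source word's best match in target."""
    weighted_sum = 0.0
    weight_total = 0.0
    for word in source:
        # Best word-to-word score against any word of the other text.
        best = max((word_sim(word, other) for other in target), default=0.0)
        weighted_sum += best * idf(word)
        weight_total += idf(word)
    return weighted_sum / weight_total if weight_total > 0 else 0.0


def text_similarity(text_a, text_b, word_sim, idf):
    """Average of the two directional similarities (symmetric score)."""
    return 0.5 * (
        directional_similarity(text_a, text_b, word_sim, idf)
        + directional_similarity(text_b, text_a, word_sim, idf)
    )


def exact_match(a, b):
    """Hypothetical word similarity: 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0


def uniform_idf(word):
    """Hypothetical idf: every word gets the same weight."""
    return 1.0


query = ["name", "customer"]
concept = ["customer", "address"]
print(text_similarity(query, concept, exact_match, uniform_idf))  # prints 0.5
```

With a word-to-word function that returns scores between 0 and 1, the result also stays between 0 and 1.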
Optimizations (v1.2)

  Caching of the frequency of each term;
  Caching of the similarities between terms;
  Incremental learning;
  Fewer accesses to DISCO;

  Execution time reduced by up to about 10 times;
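Caching of term similarities could look like the sketch below. `slow_word_similarity` is a hypothetical stand-in for an expensive lookup such as a DISCO query (it is not the DISCO API), and the code assumes the similarity is symmetric so that each unordered pair is looked up only once.

```python
from functools import lru_cache

# Counts how often the expensive lookup is actually executed.
CALLS = {"count": 0}


def slow_word_similarity(a: str, b: str) -> float:
    """Hypothetical expensive lookup; here just an exact-match placeholder."""
    CALLS["count"] += 1
    return 1.0 if a == b else 0.0


@lru_cache(maxsize=None)
def _cached_similarity(pair):
    """Memoised lookup keyed by an ordered (a, b) tuple."""
    return slow_word_similarity(*pair)


def word_similarity(a: str, b: str) -> float:
    """Cached similarity; assumes sim(a, b) == sim(b, a)."""
    key = (a, b) if a <= b else (b, a)
    return _cached_similarity(key)


word_similarity("tiger", "lion")
word_similarity("lion", "tiger")  # served from the cache
print(CALLS["count"])  # prints 1
```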
Experimental results
Quality and execution time
SELECTED WSDL DOCUMENT DESCRIPTION:
     "returns the first and last name of each customer who is categorized as an individual consumer"

     RANKING OF SIMILAR ONTOLOGY CONCEPTS (with their scores):
     *---------------------------------------------------------------------------------------------------------------*
     | Description                                                                                          | Score |
     *---------------------------------------------------------------------------------------------------------------*
     | name: name of customer                                                                               | 62,85% |
     | customer: Current customer individual information                                                    | 56,91% |
     | customeraddress: Customer address                                                                    | 42,36% |
     | customercredicard: Customer credit card information                                                  | 35,08% |
     | salesreason: Reasons why a customer may purchase a particular product.                               | 30,35% |
     | customerstore:Stores of our Company (customer and resellers).                                        | 17,31% |
     | salesorderdetail: Product details associated with a specific sales order.                            | 2,99% |
     | productinventory: Product inventory information.                                                     | 2,59% |
     | salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2,39% |
     | productlocation: Product manufacturing locations                                                     | 2,36% |
     | salestaxrate: Sales Tax rate.                                                                        | 2,36% |
     | salesterritory: Sales territory.                                                                     | 2,22% |
     | employeeaddress: Employee information such as salary, department, and title.                         | 2,18% |
     | product: Products sold or used in the manfacturing of sold products.                                 | 2,12% |
     | enterpricedepartment: Departments of Enterprise                                                      | 2,00% |
     | salesspecialoffer: Sales Special Offer (discounts).                                                  | 1,99% |
     | productlistpricehistory: Changes in the list price of a product over time.                           | 1,80% |
     | shipmethod: Shipping methods.                                                                        | 1,79% |
     | salesorder: General sales order information (header).                                                | 1,76% |
     | productdocument: Product Document                                                                    | 1,73% |
     | productcosthistory: Changes in the cost of a product over time.                                      | 1,68% |
     | productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1,61% |
     | productmodel: Product model classification.                                                          | 1,48% |
     | currencyrate: Currency exchange rates.                                                               | 1,40% |
     | salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled.       | 1,29% |
     | productcategory: High-level product categorization.                                                  | 1,27% |
     | addresstype: Types of addresses                                                                      | 0,95% |
     | unitmeasure: Unit of measure.                                                                        | 0,80% |
     | currency: Standard ISO currencies.                                                                   | 0,51% |
     | countryregion: ISO standard codes for countries and regions.                                         | 0,51% |
     | stateprovince: States and provinces                                                                  | 0,12% |
     *---------------------------------------------------------------------------------------------------------------*
     Time elapsed: 9.4 seconds.
SELECTED WSDL DOCUMENT DESCRIPTION:
     "lists the names and addresses of all individual customers"

     RANKING OF SIMILAR ONTOLOGY CONCEPTS (with their scores):
     *---------------------------------------------------------------------------------------------------------------*
     | Description                                                                                          | Score |
     *---------------------------------------------------------------------------------------------------------------*
     | addresstype: Types of addresses                                                                      | 51,77% |
     | customer: Current customer individual information                                                    | 24,03% |
     | customeraddress: Customer address                                                                    | 10,83% |
     | name: name of customer                                                                               | 6,32% |
     | productlistpricehistory: Changes in the list price of a product over time.                           | 4,91% |
     | customercredicard: Customer credit card information                                                  | 4,47% |
     | salesreason: Reasons why a customer may purchase a particular product.                               | 4,20% |
     | customerstore:Stores of our Company (customer and resellers).                                        | 3,21% |
     | salesorder: General sales order information (header).                                                | 2,72% |
     | salesspecialoffer: Sales Special Offer (discounts).                                                  | 2,53% |
     | salesorderdetail: Product details associated with a specific sales order.                            | 2,49% |
     | salesterritory: Sales territory.                                                                     | 2,14% |
     | salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2,08% |
     | employeeaddress: Employee information such as salary, department, and title.                         | 1,81% |
     | salestaxrate: Sales Tax rate.                                                                        | 1,79% |
     | productlocation: Product manufacturing locations                                                     | 1,78% |
     | countryregion: ISO standard codes for countries and regions.                                         | 1,64% |
     | product: Products sold or used in the manfacturing of sold products.                                 | 1,62% |
     | productinventory: Product inventory information.                                                     | 1,60% |
     | currencyrate: Currency exchange rates.                                                               | 1,46% |
     | enterpricedepartment: Departments of Enterprise                                                      | 1,45% |
     | productmodel: Product model classification.                                                          | 1,38% |
     | shipmethod: Shipping methods.                                                                        | 1,37% |
     | salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled.       | 1,36% |
     | productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1,32% |
     | productdocument: Product Document                                                                    | 1,27% |
     | productcosthistory: Changes in the cost of a product over time.                                      | 1,26% |
     | productcategory: High-level product categorization.                                                  | 1,01% |
     | currency: Standard ISO currencies.                                                                   | 0,85% |
     | stateprovince: States and provinces                                                                  | 0,73% |
     | unitmeasure: Unit of measure.                                                                        | 0,71% |
     *---------------------------------------------------------------------------------------------------------------*
     Time elapsed: 4.177 seconds.
SELECTED WSDL DOCUMENT DESCRIPTION:
     "returns the name of each customer that is categorized as a store"

     RANKING OF SIMILAR ONTOLOGY CONCEPTS (with their scores):
     *---------------------------------------------------------------------------------------------------------------*
     | Description                                                                                          | Score |
     *---------------------------------------------------------------------------------------------------------------*
     | name: name of customer                                                                               | 64,29% |
     | customeraddress: Customer address                                                                    | 43,83% |
     | customer: Current customer individual information                                                    | 40,05% |
     | customercredicard: Customer credit card information                                                  | 36,52% |
     | salesreason: Reasons why a customer may purchase a particular product.                               | 31,74% |
     | customerstore:Stores of our Company (customer and resellers).                                        | 21,07% |
     | employeeaddress: Employee information such as salary, department, and title.                         | 2,75% |
     | salesorderdetail: Product details associated with a specific sales order.                            | 2,67% |
     | productinventory: Product inventory information.                                                     | 2,52% |
     | salestaxrate: Sales Tax rate.                                                                        | 2,22% |
     | salesterritory: Sales territory.                                                                     | 2,19% |
     | salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2,09% |
     | productlocation: Product manufacturing locations                                                     | 1,91% |
     | enterpricedepartment: Departments of Enterprise                                                      | 1,87% |
     | salesorder: General sales order information (header).                                                | 1,84% |
     | product: Products sold or used in the manfacturing of sold products.                                 | 1,79% |
     | salesspecialoffer: Sales Special Offer (discounts).                                                  | 1,72% |
     | productlistpricehistory: Changes in the list price of a product over time.                           | 1,68% |
     | productdocument: Product Document                                                                    | 1,63% |
     | salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled.       | 1,61% |
     | shipmethod: Shipping methods.                                                                        | 1,52% |
     | productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1,47% |
     | productcosthistory: Changes in the cost of a product over time.                                      | 1,43% |
     | productmodel: Product model classification.                                                          | 1,42% |
     | currencyrate: Currency exchange rates.                                                               | 1,30% |
     | productcategory: High-level product categorization.                                                  | 1,15% |
     | addresstype: Types of addresses                                                                      | 1,02% |
     | unitmeasure: Unit of measure.                                                                        | 0,93% |
     | countryregion: ISO standard codes for countries and regions.                                         | 0,45% |
     | currency: Standard ISO currencies.                                                                   | 0,44% |
     | stateprovince: States and provinces                                                                  | 0,12% |
     *---------------------------------------------------------------------------------------------------------------*
     Time elapsed: 1.245 seconds.
Execution time

     #    Optimized    Non-optimized
     3    1.0 s        9.4 s
     6    1.7 s        9.8 s
     5    2.7 s        18.1 s
     7    3.6 s        21.8 s
     2    3.9 s        15.5 s
     8    5.6 s        23.1 s
     1    6.2 s        14.3 s
     4    9.4 s        39.4 s
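The speed-up per run can be derived from the timings above. This sketch assumes that the first column is a run identifier and that the two values in each row are the optimized and the non-optimized time, in that order; the variable names are hypothetical.

```python
# (optimized seconds, non-optimized seconds) per run, as read from the chart.
timings = {
    3: (1.0, 9.4),
    6: (1.7, 9.8),
    5: (2.7, 18.1),
    7: (3.6, 21.8),
    2: (3.9, 15.5),
    8: (5.6, 23.1),
    1: (6.2, 14.3),
    4: (9.4, 39.4),
}

# Speed-up factor = non-optimized time / optimized time.
speedups = {
    run: non_optimized / optimized
    for run, (optimized, non_optimized) in timings.items()
}

for run in sorted(speedups):
    print(f"run {run}: {speedups[run]:.1f}x faster")
```

Under these assumptions the largest gain is for run 3 (9.4x) and the smallest for run 1 (about 2.3x).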
Future developments
Near-term and longer-term
  Near-term:
    Implementation of the Web Service interface
    Implementation of the (free) Web interface
    Implementation of the network interface
    Scientific dissemination

  Other:
    Introduction of thresholds to improve performance
    Release of the source code under an open-source license
Demo session

Mais conteúdo relacionado

Semelhante a Algoritmo di text-similarity per l'annotazione semantica di Web Service

SAP Performance Testing Best Practice Guide v1.0
SAP Performance Testing Best Practice Guide v1.0SAP Performance Testing Best Practice Guide v1.0
SAP Performance Testing Best Practice Guide v1.0Argos
 
Sap performance testing best practice guidev1 0-130121141448-phpapp02
Sap performance testing best practice guidev1 0-130121141448-phpapp02Sap performance testing best practice guidev1 0-130121141448-phpapp02
Sap performance testing best practice guidev1 0-130121141448-phpapp02Kamalaksha Das
 
Sapperformancetestingbestpracticeguidev1 0-130121141448-phpapp02
Sapperformancetestingbestpracticeguidev1 0-130121141448-phpapp02Sapperformancetestingbestpracticeguidev1 0-130121141448-phpapp02
Sapperformancetestingbestpracticeguidev1 0-130121141448-phpapp02Pompee Das
 
VMworld 2013: Create a Key Metrics-based Actionable Roadmap to Deliver IT as ...
VMworld 2013: Create a Key Metrics-based Actionable Roadmap to Deliver IT as ...VMworld 2013: Create a Key Metrics-based Actionable Roadmap to Deliver IT as ...
VMworld 2013: Create a Key Metrics-based Actionable Roadmap to Deliver IT as ...VMworld
 
Discover Data That Matters- Deep dive into WSO2 Analytics
Discover Data That Matters- Deep dive into WSO2 AnalyticsDiscover Data That Matters- Deep dive into WSO2 Analytics
Discover Data That Matters- Deep dive into WSO2 AnalyticsSriskandarajah Suhothayan
 
Fusion Applications - PIM Deep Dive
Fusion Applications - PIM Deep DiveFusion Applications - PIM Deep Dive
Fusion Applications - PIM Deep DiveNachiketa Sharma
 
Agile in Medical Software Development
Agile in Medical Software DevelopmentAgile in Medical Software Development
Agile in Medical Software DevelopmentBernhard Kappe
 
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...marcja
 
Presentatie Duncan Rogers NMD2010 17 juni 2010
Presentatie Duncan Rogers NMD2010 17 juni 2010Presentatie Duncan Rogers NMD2010 17 juni 2010
Presentatie Duncan Rogers NMD2010 17 juni 2010OGZ
 
A wrapper for QuantLib and reference data
A wrapper for QuantLib and reference dataA wrapper for QuantLib and reference data
A wrapper for QuantLib and reference dataJun Hong
 
INWK Overview
INWK  OverviewINWK  Overview
INWK Overviewrlmeyers
 
Successfully manage your business model transformation
Successfully manage your business model transformationSuccessfully manage your business model transformation
Successfully manage your business model transformationSercan Yemeni
 
Designing a Future-proof API Program
Designing a Future-proof API ProgramDesigning a Future-proof API Program
Designing a Future-proof API ProgramPronovix
 
Why Generic Configurators dont work in the valve Industry
Why Generic Configurators dont work in the valve IndustryWhy Generic Configurators dont work in the valve Industry
Why Generic Configurators dont work in the valve IndustrySanjeev Nadkarni
 
Mathematical Model For Customer Life Time Based Offer Management
Mathematical Model For Customer Life Time Based Offer ManagementMathematical Model For Customer Life Time Based Offer Management
Mathematical Model For Customer Life Time Based Offer ManagementZehra Kendir
 

Semelhante a Algoritmo di text-similarity per l'annotazione semantica di Web Service (20)

SAP Performance Testing Best Practice Guide v1.0
SAP Performance Testing Best Practice Guide v1.0SAP Performance Testing Best Practice Guide v1.0
SAP Performance Testing Best Practice Guide v1.0
 
Sap performance testing best practice guidev1 0-130121141448-phpapp02
Sap performance testing best practice guidev1 0-130121141448-phpapp02Sap performance testing best practice guidev1 0-130121141448-phpapp02
Sap performance testing best practice guidev1 0-130121141448-phpapp02
 
Sapperformancetestingbestpracticeguidev1 0-130121141448-phpapp02
Sapperformancetestingbestpracticeguidev1 0-130121141448-phpapp02Sapperformancetestingbestpracticeguidev1 0-130121141448-phpapp02
Sapperformancetestingbestpracticeguidev1 0-130121141448-phpapp02
 
Forecast 2014: SaaS Data Exchange
Forecast 2014: SaaS Data ExchangeForecast 2014: SaaS Data Exchange
Forecast 2014: SaaS Data Exchange
 
VMworld 2013: Create a Key Metrics-based Actionable Roadmap to Deliver IT as ...
VMworld 2013: Create a Key Metrics-based Actionable Roadmap to Deliver IT as ...VMworld 2013: Create a Key Metrics-based Actionable Roadmap to Deliver IT as ...
VMworld 2013: Create a Key Metrics-based Actionable Roadmap to Deliver IT as ...
 
Discover Data That Matters- Deep dive into WSO2 Analytics
Discover Data That Matters- Deep dive into WSO2 AnalyticsDiscover Data That Matters- Deep dive into WSO2 Analytics
Discover Data That Matters- Deep dive into WSO2 Analytics
 
Crystal Qube™ Presentation
Crystal Qube™ PresentationCrystal Qube™ Presentation
Crystal Qube™ Presentation
 
Fusion Applications - PIM Deep Dive
Fusion Applications - PIM Deep DiveFusion Applications - PIM Deep Dive
Fusion Applications - PIM Deep Dive
 
Agile in Medical Software Development
Agile in Medical Software DevelopmentAgile in Medical Software Development
Agile in Medical Software Development
 
WFX Cloud ERP
WFX Cloud ERPWFX Cloud ERP
WFX Cloud ERP
 
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
Distributed Caches: A Developer’s Guide to Unleashing Your Data in High-Perfo...
 
Presentatie Duncan Rogers NMD2010 17 juni 2010
Presentatie Duncan Rogers NMD2010 17 juni 2010Presentatie Duncan Rogers NMD2010 17 juni 2010
Presentatie Duncan Rogers NMD2010 17 juni 2010
 
Data Warehouse-Final
Data Warehouse-FinalData Warehouse-Final
Data Warehouse-Final
 
A wrapper for QuantLib and reference data
A wrapper for QuantLib and reference dataA wrapper for QuantLib and reference data
A wrapper for QuantLib and reference data
 
Media
MediaMedia
Media
 
INWK Overview
INWK  OverviewINWK  Overview
INWK Overview
 
Successfully manage your business model transformation
Successfully manage your business model transformationSuccessfully manage your business model transformation
Successfully manage your business model transformation
 
Designing a Future-proof API Program
Designing a Future-proof API ProgramDesigning a Future-proof API Program
Designing a Future-proof API Program
 
Why Generic Configurators dont work in the valve Industry
Why Generic Configurators dont work in the valve IndustryWhy Generic Configurators dont work in the valve Industry
Why Generic Configurators dont work in the valve Industry
 
Mathematical Model For Customer Life Time Based Offer Management
Mathematical Model For Customer Life Time Based Offer ManagementMathematical Model For Customer Life Time Based Offer Management
Mathematical Model For Customer Life Time Based Offer Management
 

Mais de Michele Filannino

Using machine learning to predict temporal orientation of search engines’ que...
Using machine learning to predict temporal orientation of search engines’ que...Using machine learning to predict temporal orientation of search engines’ que...
Using machine learning to predict temporal orientation of search engines’ que...Michele Filannino
 
Temporal information extraction in the general and clinical domain
Temporal information extraction in the general and clinical domainTemporal information extraction in the general and clinical domain
Temporal information extraction in the general and clinical domainMichele Filannino
 
Mining temporal footprints from Wikipedia
Mining temporal footprints from WikipediaMining temporal footprints from Wikipedia
Mining temporal footprints from WikipediaMichele Filannino
 
Can computers understand time?
Can computers understand time?Can computers understand time?
Can computers understand time?Michele Filannino
 
Detecting novel associations in large data sets
Detecting novel associations in large data setsDetecting novel associations in large data sets
Detecting novel associations in large data setsMichele Filannino
 
Temporal expressions identification in biomedical texts
Temporal expressions identification in biomedical textsTemporal expressions identification in biomedical texts
Temporal expressions identification in biomedical textsMichele Filannino
 
Nonlinear component analysis as a kernel eigenvalue problem
Nonlinear component analysis as a kernel eigenvalue problemNonlinear component analysis as a kernel eigenvalue problem
Nonlinear component analysis as a kernel eigenvalue problemMichele Filannino
 
Sviluppo di un algoritmo di similarità a supporto dell'annotazione semantica ...
Sviluppo di un algoritmo di similarità a supporto dell'annotazione semantica ...Sviluppo di un algoritmo di similarità a supporto dell'annotazione semantica ...
Sviluppo di un algoritmo di similarità a supporto dell'annotazione semantica ...Michele Filannino
 
Tecniche fuzzy per l'elaborazione del linguaggio naturale
Tecniche fuzzy per l'elaborazione del linguaggio naturaleTecniche fuzzy per l'elaborazione del linguaggio naturale
Tecniche fuzzy per l'elaborazione del linguaggio naturaleMichele Filannino
 
SWOP project and META software
SWOP project and META softwareSWOP project and META software
SWOP project and META softwareMichele Filannino
 
Semantic Web Service Annotation
Semantic Web Service AnnotationSemantic Web Service Annotation
Semantic Web Service AnnotationMichele Filannino
 
Orchestrazione delle risorse umane nel BPM
Orchestrazione delle risorse umane nel BPMOrchestrazione delle risorse umane nel BPM
Orchestrazione delle risorse umane nel BPMMichele Filannino
 
Serendipity module in Item Recommender System
Serendipity module in Item Recommender SystemSerendipity module in Item Recommender System
Serendipity module in Item Recommender SystemMichele Filannino
 
Orchestrazione di risorse umane nel BPM: Gestione dinamica feature-based dell...
Orchestrazione di risorse umane nel BPM: Gestione dinamica feature-based dell...Orchestrazione di risorse umane nel BPM: Gestione dinamica feature-based dell...
Orchestrazione di risorse umane nel BPM: Gestione dinamica feature-based dell...Michele Filannino
 

Mais de Michele Filannino (16)

me_t3_october
me_t3_octoberme_t3_october
me_t3_october
 
Using machine learning to predict temporal orientation of search engines’ que...
Using machine learning to predict temporal orientation of search engines’ que...Using machine learning to predict temporal orientation of search engines’ que...
Using machine learning to predict temporal orientation of search engines’ que...
 
Temporal information extraction in the general and clinical domain
Temporal information extraction in the general and clinical domainTemporal information extraction in the general and clinical domain
Temporal information extraction in the general and clinical domain
 
Mining temporal footprints from Wikipedia
Mining temporal footprints from WikipediaMining temporal footprints from Wikipedia
Mining temporal footprints from Wikipedia
 
Can computers understand time?
Can computers understand time?Can computers understand time?
Can computers understand time?
 
Detecting novel associations in large data sets
Detecting novel associations in large data setsDetecting novel associations in large data sets
Detecting novel associations in large data sets
 
Temporal expressions identification in biomedical texts
Temporal expressions identification in biomedical textsTemporal expressions identification in biomedical texts
Temporal expressions identification in biomedical texts
 
My research taster project
My research taster projectMy research taster project
My research taster project
 
Nonlinear component analysis as a kernel eigenvalue problem
Nonlinear component analysis as a kernel eigenvalue problemNonlinear component analysis as a kernel eigenvalue problem
Nonlinear component analysis as a kernel eigenvalue problem
 
Sviluppo di un algoritmo di similarità a supporto dell'annotazione semantica ...
Sviluppo di un algoritmo di similarità a supporto dell'annotazione semantica ...Sviluppo di un algoritmo di similarità a supporto dell'annotazione semantica ...
Sviluppo di un algoritmo di similarità a supporto dell'annotazione semantica ...
 
Tecniche fuzzy per l'elaborazione del linguaggio naturale
Tecniche fuzzy per l'elaborazione del linguaggio naturaleTecniche fuzzy per l'elaborazione del linguaggio naturale
Tecniche fuzzy per l'elaborazione del linguaggio naturale
 
SWOP project and META software
SWOP project and META softwareSWOP project and META software
SWOP project and META software
 
Semantic Web Service Annotation
Semantic Web Service AnnotationSemantic Web Service Annotation
Semantic Web Service Annotation
 
Orchestrazione delle risorse umane nel BPM
Orchestrazione delle risorse umane nel BPMOrchestrazione delle risorse umane nel BPM
Orchestrazione delle risorse umane nel BPM
 
Serendipity module in Item Recommender System
Serendipity module in Item Recommender SystemSerendipity module in Item Recommender System
Serendipity module in Item Recommender System
 
Orchestrazione di risorse umane nel BPM: Gestione dinamica feature-based dell...
Orchestrazione di risorse umane nel BPM: Gestione dinamica feature-based dell...Orchestrazione di risorse umane nel BPM: Gestione dinamica feature-based dell...
Orchestrazione di risorse umane nel BPM: Gestione dinamica feature-based dell...
 

Text-similarity algorithm for the semantic annotation of Web Services

  • 1. Text-similarity algorithm for the semantic annotation of Web Services (WS). SWAP research group, 27 July 2010. Michele Filannino, @bronko85
  • 2. Outline. The problem: reference scenario; similarity. SAWA: word-to-word similarity; text-to-text similarity. Experimental results: quality of the results; execution time. Future developments. Demo session.
  • 3. The problem: how can the similarity between two texts be measured?
  • 4. Reference scenario. Diagram labels: WSDL file; CODEArchitects Annotation Tool (natural language descriptions); CODEArchitects Annotation Tool (to approve/reject suggested annotations); SAWSDL file.
  • 5. Semantic similarity. Assigning a meaning-based similarity metric to a set of terms and/or documents. Similarity ≠ relatedness: “bank” and “money” are related although they are not similar at all. Similarity ⇒ relatedness: “girl” and “maiden” are similar and therefore also related.
  • 6. Semantic similarity in SWOP. WS concepts: RequestOrder, Order, BillingInformation, ... Ontology concepts: Order, OrderNumber, OrderID, BillID, BillReference, BusinessFirm, Product, Catalog, ...
  • 7. Computational cost. Example: an ontology with 1,200 concepts and a WSDL with 15 annotations require 1,200 × 15 = 18,000 executions of SAWA :(
  • 9. Word-to-word similarity. Given two words, determine how similar they are. Types of algorithms for computing the similarity between words: corpus-based (pointwise mutual information, latent semantic analysis); hierarchy-based (Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin, Jiang & Conrath). Input: two words. Output: a score between 0 and 1.
  • 10. Lin's algorithm (1998)
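The formula shown on this slide did not survive extraction. As a reference point only, and assuming the slide shows the hierarchy-based measure from Lin (1998) that the LCS entry in the editor's notes points to, that measure is commonly written as follows, where P(c) is the probability of encountering concept c in a corpus and LCS(c1, c2) is the least common subsumer of the two concepts:

```latex
\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2)
  = \frac{2 \cdot \log P\big(\mathrm{LCS}(c_1, c_2)\big)}
         {\log P(c_1) + \log P(c_2)}
```

The score lies between 0 and 1 and equals 1 when the two concepts coincide, which matches the output range stated on the previous slide.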
  • 11. Word-to-word similarity tool. Library used: LinguaTools DISCO. It uses Wikipedia as its concept hierarchy (202,578 concepts), updated to 1 January 2008, and applies Lin's algorithm to compute similarity.
  • 12. Examples: tiger, lion = 90%; doctor, nurse = 70%; stock, market = 47%; love, sex = 46%; FBI, investigation = 35%; professor, cucumber = 0.006%.
  • 13. Quality of the algorithm. Corpus used to measure quality: WordSim353. Correlation coefficients (Pearson): Wikipedia: 0.574; BNC: 0.415; PubMed: 0.105.
  • 14. Text-to-text similarity. Given two texts, determine how similar they are. A suitable extension of the word-to-word similarity algorithms. Removal of stopwords, i.e. words with low discriminative power and a high frequency of occurrence. Input: two texts. Output: a score between 0 and 1.
  • 15. Stopwords. Before removal: “Returns the first and last name of each customer who is categorized as an individual consumer”. After removal: “name customer categorized individual consumer”.
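The stopword step on slide 15 can be sketched in a few lines of Python. This is only an illustration: the `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names, and the list is chosen solely to reproduce the slide's example, since the slides do not say which stopword list SAWA actually uses.

```python
# Hypothetical stopword list, chosen only to reproduce the slide's example;
# the list actually used by SAWA is not given in the slides.
STOPWORDS = {
    "returns", "the", "first", "and", "last",
    "of", "each", "who", "is", "as", "an",
}


def remove_stopwords(text: str) -> str:
    """Lower-case the text, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


sentence = (
    "Returns the first and last name of each customer "
    "who is categorized as an individual consumer"
)
print(remove_stopwords(sentence))
# prints: name customer categorized individual consumer
```

A set is used so that each membership check is constant time, which matters when the same filter runs over every ontology description.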
  • 16. Corley & Mihalcea's algorithm (2005)
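The content of slide 16 was not extracted. As a rough guide, the text-to-text measure from Corley & Mihalcea pairs each word of one text with its most similar word in the other text, weights that best score by the word's idf, normalises, and averages the two directions. The sketch below follows that idea under simplifying assumptions: it ignores the part-of-speech restriction of the original method, `toy_word_sim` is a made-up stand-in for a real word-to-word measure such as the one provided by DISCO, and every word gets an idf of 1.0 unless a value is supplied.

```python
from typing import Callable, Dict, List

WordSim = Callable[[str, str], float]


def directional_score(src: List[str], dst: List[str],
                      word_sim: WordSim, idf: Dict[str, float]) -> float:
    """idf-weighted average of each src word's best match in dst."""
    num = 0.0
    den = 0.0
    for w in src:
        weight = idf.get(w, 1.0)  # assumption: unknown words get idf 1.0
        best = max((word_sim(w, v) for v in dst), default=0.0)
        num += best * weight
        den += weight
    return num / den if den else 0.0


def text_similarity(t1: List[str], t2: List[str],
                    word_sim: WordSim, idf: Dict[str, float]) -> float:
    """Average of the two directional scores; result is in [0, 1]."""
    return 0.5 * (directional_score(t1, t2, word_sim, idf)
                  + directional_score(t2, t1, word_sim, idf))


# Toy word-to-word similarity (illustrative values, not DISCO output).
TOY_SCORES = {frozenset(("name", "address")): 0.2}


def toy_word_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


score = text_similarity(["name", "customer"], ["customer", "address"],
                        toy_word_sim, idf={})
print(round(score, 2))
```

In this toy run each direction has one exact match (1.0) and one weak match (0.2), so both directional scores are 0.6 and so is their average.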
  • 17. Optimizations (v1.2): caching of each term's frequency; caching of term-to-term similarities; incremental learning; fewer accesses to DISCO; execution time reduced by about a factor of 10.
  • 18. Experimental results: quality and execution time
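The caching of term-to-term similarities listed on slide 17 could look roughly like the sketch below. It is not SAWA's actual code: `CachedSimilarity` and `fake_backend` are hypothetical names, the backend stands in for a costly DISCO lookup, and the cache key assumes the word similarity is symmetric.

```python
from typing import Callable, Dict, Tuple


class CachedSimilarity:
    """Memoises a word-to-word similarity function.

    Assumption: the similarity is symmetric, so (a, b) and (b, a)
    share a single cache entry.
    """

    def __init__(self, backend: Callable[[str, str], float]) -> None:
        self._backend = backend
        self._cache: Dict[Tuple[str, str], float] = {}
        self.backend_calls = 0  # how many times the slow backend was hit

    def __call__(self, a: str, b: str) -> float:
        key = (a, b) if a <= b else (b, a)
        if key not in self._cache:
            self.backend_calls += 1
            self._cache[key] = self._backend(*key)
        return self._cache[key]


def fake_backend(a: str, b: str) -> float:
    """Stand-in for an expensive lookup; returns a fixed toy score."""
    return 1.0 if a == b else 0.5


sim = CachedSimilarity(fake_backend)
sim("tiger", "lion")
sim("lion", "tiger")   # served from the cache
sim("tiger", "lion")   # served from the cache
print(sim.backend_calls)
# prints: 1
```

Because the ontology descriptions are compared against many WSDL descriptions, the same word pairs recur often, which is why a cache of this kind reduces the number of accesses to the similarity backend.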
  • 19. CHOSEN WSDL DOCUMENT DESCRIPTION: "returns the first and last name of each customer who is categorized as an individual consumer"
    RANKING OF SIMILAR ONTOLOGY CONCEPTS (with score):
    | Description | Score |
    | name: name of customer | 62.85% |
    | customer: Current customer individual information | 56.91% |
    | customeraddress: Customer address | 42.36% |
    | customercredicard: Customer credit card information | 35.08% |
    | salesreason: Reasons why a customer may purchase a particular product. | 30.35% |
    | customerstore:Stores of our Company (customer and resellers). | 17.31% |
    | salesorderdetail: Product details associated with a specific sales order. | 2.99% |
    | productinventory: Product inventory information. | 2.59% |
    | salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2.39% |
    | productlocation: Product manufacturing locations | 2.36% |
    | salestaxrate: Sales Tax rate. | 2.36% |
    | salesterritory: Sales territory. | 2.22% |
    | employeeaddress: Employee information such as salary, department, and title. | 2.18% |
    | product: Products sold or used in the manfacturing of sold products. | 2.12% |
    | enterpricedepartment: Departments of Enterprise | 2.00% |
    | salesspecialoffer: Sales Special Offer (discounts). | 1.99% |
    | productlistpricehistory: Changes in the list price of a product over time. | 1.80% |
    | shipmethod: Shipping methods. | 1.79% |
    | salesorder: General sales order information (header). | 1.76% |
    | productdocument: Product Document | 1.73% |
    | productcosthistory: Changes in the cost of a product over time. | 1.68% |
    | productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1.61% |
    | productmodel: Product model classification. | 1.48% |
    | currencyrate: Currency exchange rates. | 1.40% |
    | salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1.29% |
    | productcategory: High-level product categorization. | 1.27% |
    | addresstype: Types of addresses | 0.95% |
    | unitmeasure: Unit of measure. | 0.80% |
    | currency: Standard ISO currencies. | 0.51% |
    | countryregion: ISO standard codes for countries and regions. | 0.51% |
    | stateprovince: States and provinces | 0.12% |
    Time elapsed: 9.4 seconds.
  • 20. CHOSEN WSDL DOCUMENT DESCRIPTION: "lists the names and addresses of all individual customers"
    RANKING OF SIMILAR ONTOLOGY CONCEPTS (with score):
    | Description | Score |
    | addresstype: Types of addresses | 51.77% |
    | customer: Current customer individual information | 24.03% |
    | customeraddress: Customer address | 10.83% |
    | name: name of customer | 6.32% |
    | productlistpricehistory: Changes in the list price of a product over time. | 4.91% |
    | customercredicard: Customer credit card information | 4.47% |
    | salesreason: Reasons why a customer may purchase a particular product. | 4.20% |
    | customerstore:Stores of our Company (customer and resellers). | 3.21% |
    | salesorder: General sales order information (header). | 2.72% |
    | salesspecialoffer: Sales Special Offer (discounts). | 2.53% |
    | salesorderdetail: Product details associated with a specific sales order. | 2.49% |
    | salesterritory: Sales territory. | 2.14% |
    | salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2.08% |
    | employeeaddress: Employee information such as salary, department, and title. | 1.81% |
    | salestaxrate: Sales Tax rate. | 1.79% |
    | productlocation: Product manufacturing locations | 1.78% |
    | countryregion: ISO standard codes for countries and regions. | 1.64% |
    | product: Products sold or used in the manfacturing of sold products. | 1.62% |
    | productinventory: Product inventory information. | 1.60% |
    | currencyrate: Currency exchange rates. | 1.46% |
    | enterpricedepartment: Departments of Enterprise | 1.45% |
    | productmodel: Product model classification. | 1.38% |
    | shipmethod: Shipping methods. | 1.37% |
    | salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1.36% |
    | productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1.32% |
    | productdocument: Product Document | 1.27% |
    | productcosthistory: Changes in the cost of a product over time. | 1.26% |
    | productcategory: High-level product categorization. | 1.01% |
    | currency: Standard ISO currencies. | 0.85% |
    | stateprovince: States and provinces | 0.73% |
    | unitmeasure: Unit of measure. | 0.71% |
    Time elapsed: 4.177 seconds.
  • 21. CHOSEN WSDL DOCUMENT DESCRIPTION: "returns the name of each customer that is categorized as a store"
    RANKING OF SIMILAR ONTOLOGY CONCEPTS (with score):
    | Description | Score |
    | name: name of customer | 64.29% |
    | customeraddress: Customer address | 43.83% |
    | customer: Current customer individual information | 40.05% |
    | customercredicard: Customer credit card information | 36.52% |
    | salesreason: Reasons why a customer may purchase a particular product. | 31.74% |
    | customerstore:Stores of our Company (customer and resellers). | 21.07% |
    | employeeaddress: Employee information such as salary, department, and title. | 2.75% |
    | salesorderdetail: Product details associated with a specific sales order. | 2.67% |
    | productinventory: Product inventory information. | 2.52% |
    | salestaxrate: Sales Tax rate. | 2.22% |
    | salesterritory: Sales territory. | 2.19% |
    | salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2.09% |
    | productlocation: Product manufacturing locations | 1.91% |
    | enterpricedepartment: Departments of Enterprise | 1.87% |
    | salesorder: General sales order information (header). | 1.84% |
    | product: Products sold or used in the manfacturing of sold products. | 1.79% |
    | salesspecialoffer: Sales Special Offer (discounts). | 1.72% |
    | productlistpricehistory: Changes in the list price of a product over time. | 1.68% |
    | productdocument: Product Document | 1.63% |
    | salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1.61% |
    | shipmethod: Shipping methods. | 1.52% |
    | productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1.47% |
    | productcosthistory: Changes in the cost of a product over time. | 1.43% |
    | productmodel: Product model classification. | 1.42% |
    | currencyrate: Currency exchange rates. | 1.30% |
    | productcategory: High-level product categorization. | 1.15% |
    | addresstype: Types of addresses | 1.02% |
    | unitmeasure: Unit of measure. | 0.93% |
    | countryregion: ISO standard codes for countries and regions. | 0.45% |
    | currency: Standard ISO currencies. | 0.44% |
    | stateprovince: States and provinces | 0.12% |
    Time elapsed: 1.245 seconds.
  • 22. Execution time (bar chart; values as labelled on the slide)
    | # | Optimized | Not optimized |
    | 3 | 1.0 s | 9.4 s |
    | 6 | 1.7 s | 9.8 s |
    | 5 | 2.7 s | 18.1 s |
    | 7 | 3.6 s | 21.8 s |
    | 2 | 3.9 s | 15.5 s |
    | 8 | 5.6 s | 23.1 s |
    | 1 | 6.2 s | 14.3 s |
    | 4 | 9.4 s | 39.4 s |
  • 24. Future developments. Imminent: implementation of the Web Service interface; implementation of the (free) Web interface; implementation of the network interface; scientific dissemination. Others: introduction of thresholds to improve performance; release of the source code under an open-source license.

Editor's notes

  1. LCS = Least Common Subsumer (Italian: ultimo sussuntore comune)