Text-similarity algorithm for the semantic annotation of Web Services
1. Text-similarity algorithm for the semantic annotation of WS
SWAP research group - 27 July 2010
Michele Filannino, @bronko85
2. Outline
The problem
Reference scenario
Similarity
SAWA
Word-to-word similarity
Text-to-text similarity
Experimental results
Quality of the results
Execution time
Future work
Demo session
4. Reference scenario
[Diagram: WSDL file -> CODEArchitects Annotation Tool -> SAWSDL file. Labels: "Natural language descriptions"; "To approve/reject suggested annotations".]
5. Semantic similarity
Assign a meaning-based similarity metric to a set of terms and/or documents;
Similarity ≠ Relatedness;
"Bank" and "money" are related although they are not similar at all;
Similarity ⇒ Relatedness;
"Girl" and "maiden" are similar and therefore also related.
6. Semantic similarity in SWOP
WS concepts: RequestOrder, Order, BillingInformation, ...
Ontology concepts: Order, OrderNumber, OrderID, BillID, BillReference, BusinessFirm, Product, Catalog, ...
7. Computational cost
Example:
Ontology with 1,200 concepts
WSDL with 15 annotations
1,200 x 15 = 18,000 runs of SAWA :(
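The figure in this example comes from scoring every annotation against every ontology concept. A minimal sketch of that arithmetic follows; the function name `sawa_runs` is purely illustrative and does not come from the slides.

```python
def sawa_runs(num_concepts: int, num_annotations: int) -> int:
    """Number of pairwise comparisons when every annotation is scored
    against every ontology concept (one run per pair)."""
    return num_concepts * num_annotations


print(sawa_runs(1200, 15))  # 18000
```

The cost grows linearly in both the size of the ontology and the number of annotations, which is why per-run time matters.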
9. Word-to-word similarity
Given two words, determine how similar they are;
Types of algorithms for computing the similarity between words:
Corpus-based: pointwise mutual information, latent semantic analysis;
Hierarchy-based: Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin, Jiang & Conrath;
Input: two words;
Output: a score between 0 and 1.
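To make the "two words in, score between 0 and 1 out" contract concrete, here is a minimal sketch of one of the hierarchy-based measures listed above, Wu & Palmer, computed as 2 * depth(lcs) / (depth(a) + depth(b)). The taxonomy is a toy example invented for illustration; it is not the hierarchy used in the talk, and this is not the Lin measure mentioned later.

```python
# Toy taxonomy (hypothetical): child -> parent, with "entity" as the root.
PARENT = {
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}


def path_to_root(concept: str) -> list:
    """Return the chain [concept, parent, ..., root]."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept: str) -> int:
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a: str, b: str) -> float:
    """Wu & Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(c for c in path_to_root(b) if c in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("dog", "cat"))  # share "animal": 2*2 / (3+3)
print(wu_palmer("dog", "car"))  # share only the root: 2*1 / (3+3)
print(wu_palmer("dog", "dog"))  # identical concepts: 1.0
```

With the root at depth 1 the score always falls in (0, 1], matching the output range stated on the slide.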
11. Word-to-word similarity tool
Library used: LinguaTools DISCO;
Uses Wikipedia as its concept hierarchy
202,578 concepts;
Updated as of 1 January 2008
Uses Lin's algorithm to compute similarity.
13. Quality of the algorithm
Dataset used to measure quality: WordSim353;
Correlation coefficients (Pearson):
Wikipedia: 0.574;
BNC: 0.415;
PubMed: 0.105;
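A Pearson coefficient of this kind compares the algorithm's scores with human similarity judgements over the same word pairs. The sketch below shows the computation itself; the two score lists are invented for illustration and are not taken from WordSim353 or from the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data for four word pairs:
# human ratings on a 0-10 scale, algorithm scores on a 0-1 scale.
human = [9.0, 7.5, 3.0, 1.0]
algo = [0.80, 0.70, 0.40, 0.05]
print(round(pearson(human, algo), 3))
```

Pearson correlation is insensitive to the scale of either list, so 0-10 human ratings can be compared directly with 0-1 algorithm scores.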
14. Text-to-text similarity
Given two texts, determine how similar they are;
A suitable extension of the word-to-word similarity algorithms;
Removal of stopwords, i.e. words with:
low discriminating power;
high frequency of occurrence;
Input: two texts;
Output: a score between 0 and 1.
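One well-known way to lift a word-level measure to whole texts is the scheme of Mihalcea, Corley and Strapparava (2006): each word of one text is matched with its most similar word in the other text, the matches are averaged with idf weights, and the two directions are averaged. The slides do not state that SAWA uses exactly this formula, so the sketch below is only an illustration under that assumption; `TOY_SIMILARITY`, `word_sim`, `directed` and `text_sim` are hypothetical names, and the word scores are invented.

```python
from typing import Dict, List, Optional

# Hypothetical word-level scores standing in for a real word-to-word measure.
TOY_SIMILARITY = {
    frozenset(["customer", "consumer"]): 0.8,
    frozenset(["name", "title"]): 0.5,
}


def word_sim(a: str, b: str) -> float:
    """Toy word similarity in [0, 1]: 1 for identical words, table lookup otherwise."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset([a, b]), 0.0)


def directed(source: List[str], target: List[str], idf: Dict[str, float]) -> float:
    """Weighted average, over the words in source, of the best match found in target.
    Assumes both token lists are non-empty."""
    weights = [idf.get(w, 1.0) for w in source]  # unknown words get weight 1
    best = [max(word_sim(w, t) for t in target) for w in source]
    return sum(b * wt for b, wt in zip(best, weights)) / sum(weights)


def text_sim(text1: List[str], text2: List[str],
             idf: Optional[Dict[str, float]] = None) -> float:
    """Symmetric text similarity in [0, 1]: mean of the two directed scores."""
    idf = idf or {}
    return 0.5 * (directed(text1, text2, idf) + directed(text2, text1, idf))


query = ["name", "customer", "categorized", "individual", "consumer"]
concept = ["name", "customer"]
print(text_sim(query, concept))
```

Both inputs are expected to have stopwords already removed, so that frequent, uninformative words do not dominate the averages.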
15. Stopwords
"Returns the first and last name of each customer who is categorized as an individual consumer"
STOPWORD REMOVAL
"name customer categorized individual consumer"
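The transformation in this example can be sketched in a few lines. The stopword list below is hypothetical: it was chosen only so that it reproduces the slide's output, and the list actually used by the tool is not given.

```python
# Hypothetical stopword list, chosen only so that it reproduces the slide's example.
STOPWORDS = {
    "returns", "the", "first", "and", "last", "of", "each",
    "who", "is", "as", "an",
}


def remove_stopwords(text: str) -> str:
    """Lowercase the text, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


description = ("Returns the first and last name of each customer "
               "who is categorized as an individual consumer")
print(remove_stopwords(description))
# name customer categorized individual consumer
```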
17. Optimizations (v1.2)
Caching of each term's frequency;
Caching of the similarities between terms;
Incremental learning;
Fewer accesses to DISCO;
Execution time reduced by up to about 10 times;
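The caching of term-to-term similarities can be illustrated with a small wrapper that stores each score the first time it is computed. This is a sketch under assumptions, not the tool's implementation: `CachedSimilarity` and `slow_similarity` are hypothetical names, the backend stands in for an expensive library lookup, and the similarity is assumed to be symmetric.

```python
class CachedSimilarity:
    """Wraps a slow word-similarity function with a symmetric in-memory cache."""

    def __init__(self, backend):
        self.backend = backend   # slow function (word, word) -> float
        self.cache = {}          # (word_a, word_b) in sorted order -> score
        self.backend_calls = 0   # how many times the backend was really invoked

    def __call__(self, a: str, b: str) -> float:
        # Assumes sim(a, b) == sim(b, a), so both orders share one cache entry.
        key = (a, b) if a <= b else (b, a)
        if key not in self.cache:
            self.backend_calls += 1
            self.cache[key] = self.backend(*key)
        return self.cache[key]


def slow_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for an expensive similarity lookup."""
    return 1.0 if a == b else 0.25


sim = CachedSimilarity(slow_similarity)
sim("customer", "consumer")
sim("consumer", "customer")  # served from the cache
sim("customer", "consumer")  # served from the cache
print(sim.backend_calls)     # 1
```

Because the cache persists across queries, later queries that reuse the same vocabulary need fewer backend accesses.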
19. SELECTED WSDL DOCUMENT DESCRIPTION:
"returns the first and last name of each customer who is categorized as an individual consumer"
RANKING OF SIMILAR ONTOLOGY CONCEPTS (with their scores):
*---------------------------------------------------------------------------------------------------------------*
| Description | Score |
*---------------------------------------------------------------------------------------------------------------*
| name: name of customer | 62,85% |
| customer: Current customer individual information | 56,91% |
| customeraddress: Customer address | 42,36% |
| customercredicard: Customer credit card information | 35,08% |
| salesreason: Reasons why a customer may purchase a particular product. | 30,35% |
| customerstore:Stores of our Company (customer and resellers). | 17,31% |
| salesorderdetail: Product details associated with a specific sales order. | 2,99% |
| productinventory: Product inventory information. | 2,59% |
| salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2,39% |
| productlocation: Product manufacturing locations | 2,36% |
| salestaxrate: Sales Tax rate. | 2,36% |
| salesterritory: Sales territory. | 2,22% |
| employeeaddress: Employee information such as salary, department, and title. | 2,18% |
| product: Products sold or used in the manfacturing of sold products. | 2,12% |
| enterpricedepartment: Departments of Enterprise | 2,00% |
| salesspecialoffer: Sales Special Offer (discounts). | 1,99% |
| productlistpricehistory: Changes in the list price of a product over time. | 1,80% |
| shipmethod: Shipping methods. | 1,79% |
| salesorder: General sales order information (header). | 1,76% |
| productdocument: Product Document | 1,73% |
| productcosthistory: Changes in the cost of a product over time. | 1,68% |
| productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1,61% |
| productmodel: Product model classification. | 1,48% |
| currencyrate: Currency exchange rates. | 1,40% |
| salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1,29% |
| productcategory: High-level product categorization. | 1,27% |
| addresstype: Types of addresses | 0,95% |
| unitmeasure: Unit of measure. | 0,80% |
| currency: Standard ISO currencies. | 0,51% |
| countryregion: ISO standard codes for countries and regions. | 0,51% |
| stateprovince: States and provinces | 0,12% |
*---------------------------------------------------------------------------------------------------------------*
Time elapsed: 9.4 seconds.
20. SELECTED WSDL DOCUMENT DESCRIPTION:
"lists the names and addresses of all individual customers"
RANKING OF SIMILAR ONTOLOGY CONCEPTS (with their scores):
*---------------------------------------------------------------------------------------------------------------*
| Description | Score |
*---------------------------------------------------------------------------------------------------------------*
| addresstype: Types of addresses | 51,77% |
| customer: Current customer individual information | 24,03% |
| customeraddress: Customer address | 10,83% |
| name: name of customer | 6,32% |
| productlistpricehistory: Changes in the list price of a product over time. | 4,91% |
| customercredicard: Customer credit card information | 4,47% |
| salesreason: Reasons why a customer may purchase a particular product. | 4,20% |
| customerstore:Stores of our Company (customer and resellers). | 3,21% |
| salesorder: General sales order information (header). | 2,72% |
| salesspecialoffer: Sales Special Offer (discounts). | 2,53% |
| salesorderdetail: Product details associated with a specific sales order. | 2,49% |
| salesterritory: Sales territory. | 2,14% |
| salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2,08% |
| employeeaddress: Employee information such as salary, department, and title. | 1,81% |
| salestaxrate: Sales Tax rate. | 1,79% |
| productlocation: Product manufacturing locations | 1,78% |
| countryregion: ISO standard codes for countries and regions. | 1,64% |
| product: Products sold or used in the manfacturing of sold products. | 1,62% |
| productinventory: Product inventory information. | 1,60% |
| currencyrate: Currency exchange rates. | 1,46% |
| enterpricedepartment: Departments of Enterprise | 1,45% |
| productmodel: Product model classification. | 1,38% |
| shipmethod: Shipping methods. | 1,37% |
| salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1,36% |
| productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1,32% |
| productdocument: Product Document | 1,27% |
| productcosthistory: Changes in the cost of a product over time. | 1,26% |
| productcategory: High-level product categorization. | 1,01% |
| currency: Standard ISO currencies. | 0,85% |
| stateprovince: States and provinces | 0,73% |
| unitmeasure: Unit of measure. | 0,71% |
*---------------------------------------------------------------------------------------------------------------*
Time elapsed: 4.177 seconds.
21. SELECTED WSDL DOCUMENT DESCRIPTION:
"returns the name of each customer that is categorized as a store"
RANKING OF SIMILAR ONTOLOGY CONCEPTS (with their scores):
*---------------------------------------------------------------------------------------------------------------*
| Description | Score |
*---------------------------------------------------------------------------------------------------------------*
| name: name of customer | 64,29% |
| customeraddress: Customer address | 43,83% |
| customer: Current customer individual information | 40,05% |
| customercredicard: Customer credit card information | 36,52% |
| salesreason: Reasons why a customer may purchase a particular product. | 31,74% |
| customerstore:Stores of our Company (customer and resellers). | 21,07% |
| employeeaddress: Employee information such as salary, department, and title. | 2,75% |
| salesorderdetail: Product details associated with a specific sales order. | 2,67% |
| productinventory: Product inventory information. | 2,52% |
| salestaxrate: Sales Tax rate. | 2,22% |
| salesterritory: Sales territory. | 2,19% |
| salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2,09% |
| productlocation: Product manufacturing locations | 1,91% |
| enterpricedepartment: Departments of Enterprise | 1,87% |
| salesorder: General sales order information (header). | 1,84% |
| product: Products sold or used in the manfacturing of sold products. | 1,79% |
| salesspecialoffer: Sales Special Offer (discounts). | 1,72% |
| productlistpricehistory: Changes in the list price of a product over time. | 1,68% |
| productdocument: Product Document | 1,63% |
| salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1,61% |
| shipmethod: Shipping methods. | 1,52% |
| productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1,47% |
| productcosthistory: Changes in the cost of a product over time. | 1,43% |
| productmodel: Product model classification. | 1,42% |
| currencyrate: Currency exchange rates. | 1,30% |
| productcategory: High-level product categorization. | 1,15% |
| addresstype: Types of addresses | 1,02% |
| unitmeasure: Unit of measure. | 0,93% |
| countryregion: ISO standard codes for countries and regions. | 0,45% |
| currency: Standard ISO currencies. | 0,44% |
| stateprovince: States and provinces | 0,12% |
*---------------------------------------------------------------------------------------------------------------*
Time elapsed: 1.245 seconds.
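22. Execution time
*-------------------------------*
| # | Optimized | Not optimized |
*-------------------------------*
| 3 | 1.0 s     | 9.4 s         |
| 6 | 1.7 s     | 9.8 s         |
| 5 | 2.7 s     | 18.1 s        |
| 7 | 3.6 s     | 21.8 s        |
| 2 | 3.9 s     | 15.5 s        |
| 8 | 5.6 s     | 23.1 s        |
| 1 | 6.2 s     | 14.3 s        |
| 4 | 9.4 s     | 39.4 s        |
*-------------------------------*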
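The speedup implied by these timings can be worked out directly. The sketch below reads each row as (identifier, optimized seconds, non-optimized seconds), which is an assumption about how the original chart was laid out; the numbers themselves are copied from the timings above.

```python
# (id, optimized seconds, non-optimized seconds), copied from the timings above.
TIMINGS = [
    (3, 1.0, 9.4), (6, 1.7, 9.8), (5, 2.7, 18.1), (7, 3.6, 21.8),
    (2, 3.9, 15.5), (8, 5.6, 23.1), (1, 6.2, 14.3), (4, 9.4, 39.4),
]

# Speedup = non-optimized time divided by optimized time.
speedups = {run_id: slow / fast for run_id, fast, slow in TIMINGS}

for run_id, ratio in sorted(speedups.items()):
    print(f"{run_id}: {ratio:.1f}x")

print(f"min {min(speedups.values()):.1f}x, max {max(speedups.values()):.1f}x")
# min 2.3x, max 9.4x
```

Under this reading the optimized version is between roughly 2 and 9 times faster, depending on the run.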
24. Future work
Upcoming:
Implementation of the Web Service interface
Implementation of the (free) Web interface
Implementation of the network interface
Scientific dissemination
Other:
Introduction of thresholds to improve performance
Release of the source code under an open-source license