2. 2nd SEALS Yardsticks for Ontology Management • Conformance and interoperability results • Scalability results • Conclusions
3. Conformance evaluation • Ontology language conformance – The ability to adhere to existing ontology language specifications • Goal: to evaluate the conformance of semantic technologies with regard to ontology representation languages • Diagram: Step 1: Import + Export — Tool X imports O1 as O1' and exports it as O1''; O1 = O1'' + α - α'
4. Metrics • Execution informs about the correct execution: – OK. No execution problem – FAIL. Some execution problem – Platform Error (P.E.). Platform exception • Information added or lost, in terms of triples, axioms, etc.: Oi = Oi' + α - α' • Conformance informs whether the ontology has been processed correctly, with no addition or loss of information (Oi = Oi'?): – SAME if Execution is OK and Information added and Information lost are void – DIFFERENT if Execution is OK but Information added or Information lost are not void – NO if Execution is FAIL or P.E.
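To make the added/lost-information metric concrete, here is a minimal sketch assuming the OWL API 3 (which several of the evaluated tools are built on): the original ontology and the tool's import + export result are loaded and their axiom sets compared. The file names and the helper class are illustrative, not part of the SEALS platform.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyCreationException;

import java.io.File;
import java.util.HashSet;
import java.util.Set;

public class ConformanceCheck {

    /** Compares O1 with O1'' (the tool's import + export result) and returns SAME or DIFFERENT. */
    public static String compare(File original, File exported) throws OWLOntologyCreationException {
        // Separate managers avoid "ontology already exists" clashes when both files share an ontology IRI.
        OWLOntology o1  = OWLManager.createOWLOntologyManager().loadOntologyFromOntologyDocument(original);
        OWLOntology o1e = OWLManager.createOWLOntologyManager().loadOntologyFromOntologyDocument(exported);

        Set<OWLAxiom> added = new HashSet<OWLAxiom>(o1e.getAxioms());   // alpha: axioms only in O1''
        added.removeAll(o1.getAxioms());
        Set<OWLAxiom> lost = new HashSet<OWLAxiom>(o1.getAxioms());     // alpha': axioms only in O1
        lost.removeAll(o1e.getAxioms());

        // SAME if nothing was added or lost; an execution failure would be reported as NO before this point.
        return (added.isEmpty() && lost.isEmpty()) ? "SAME" : "DIFFERENT";
    }

    public static void main(String[] args) throws OWLOntologyCreationException {
        System.out.println(compare(new File("original.owl"), new File("exported.owl")));
    }
}
```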
5. Interoperability evaluation • Ontology language interoperability – The ability to interchange ontologies and use them • Goal: to evaluate the interoperability of semantic technologies in terms of their ability to interchange ontologies and use them • Diagram: Step 1: Import + Export in Tool X (O1 → O1' → O1''), Step 2: Import + Export in Tool Y (O1'' → O1''' → O1''''); O1 = O1'' + α - α' and O1'' = O1'''' + β - β', so for the whole interchange O1 = O1'''' + α - α' + β - β'
6. Metrics • Execution informs about the correct execution: – OK. No execution problem – FAIL. Some execution problem – Platform Error (P.E.). Platform exception – Not Executed (N.E.). Second step not executed • Information added or lost, in terms of triples, axioms, etc.: Oi = Oi' + α - α' • Interchange informs whether the ontology has been interchanged correctly, with no addition or loss of information (Oi = Oi'?): – SAME if Execution is OK and Information added and Information lost are void – DIFFERENT if Execution is OK but Information added or Information lost are not void – NO if Execution is FAIL, N.E., or P.E.
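The interchange metric composes two of these import + export steps. The sketch below builds on the ConformanceCheck helper above and uses a hypothetical ToolAdapter interface (not part of any SEALS API) to stand in for each tool's import + export step.

```java
import java.io.File;

/** Hypothetical wrapper for one tool's import + export step: read the ontology, load it, write it back out. */
interface ToolAdapter {
    File importExport(File ontologyFile) throws Exception;
}

public class InterchangeCheck {

    /** Step 1: tool X produces O1''; Step 2: tool Y produces O1''''; then O1 is compared with O1''''. */
    public static String interchange(File o1, ToolAdapter toolX, ToolAdapter toolY) {
        try {
            File o1pp   = toolX.importExport(o1);      // O1''
            File o1pppp = toolY.importExport(o1pp);    // O1''''
            return ConformanceCheck.compare(o1, o1pppp);
        } catch (Exception e) {
            return "NO";                               // FAIL, N.E. or P.E. in either step
        }
    }
}
```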
7. Test suites used
Name | Definition | Nº Tests
RDF(S) Import Test Suite | Manual | 82
OWL Lite Import Test Suite | Manual | 82
OWL DL Import Test Suite | Keyword-driven generator | 561
OWL Full Import Test Suite | Manual | 90
OWL Content Pattern | Expressive generator | 81
OWL Content Pattern Expressive | Expressive generator | 81
OWL Content Pattern Full Expressive | Expressive generator | 81
9. Evaluation Execution • Evaluations automatically performed with the SEALS Platform – http://www.seals-project.eu/ • Evaluation materials available: – Test Data – Results – Metadata • Diagram: Conformance, Interoperability and Scalability test suites, each producing raw results and their interpretation
11. RDF(S) conformance results • Jena and Sesame behave identically (no problems) • The behaviour of the OWL API-based tools (NeOn Toolkit, OWL API and Protégé 4) has significantly changed – They transform ontologies to OWL 2 – Some problems remain, fewer in newer versions • Protégé OWL improves
12. OWL Lite conformance results • Jena and Sesame behave identically (no problems) • The OWL API-based tools (NeOn Toolkit, OWL API and Protégé 4) improve – They transform ontologies to OWL 2 • Protégé OWL improves
13. OWL DL conformance results • Jena and Sesame behave identically (no problems) • OWL API and Protégé 4 improve • NeOn Toolkit worsens • Protégé OWL behaves identically • Robustness increases
14. Content pattern conformance results • New issues identified in the OWL API-based tools (NeOn Toolkit, OWL API and Protégé 4) • New issue identified in Protégé 4 • No new issues
15. Interoperability results (1st vs. 2nd Evaluation Campaign) • Same analysis as in conformance • OWL DL: new issue found in interchanges from Protégé 4 to Protégé OWL • Conclusions: – RDF-based tools have no interoperability problems – OWL-based tools have no interoperability problems with OWL Lite but have some with OWL DL – Tools based on the OWL API cannot interoperate using RDF(S) (they convert ontologies into OWL 2)
16. 2nd SEALS Yardsticks for Ontology Management • Conformance and interoperability results • Scalability results • Conclusions
18. Execution settings Test suites: • Real World. Complex ontologies from biological and medical domains • Real World NCI. Thesaurus subsets (1.5–2 times bigger) • LUBM. Synthetic ontologies Execution environment: • Win7 64-bit, Intel Core 2 Duo CPU, 2.40 GHz, 4.00 GB RAM (Real World Ontologies test collections) • WinServer 64-bit, AMD Dual Core, 2.60 GHz (4 processors), 8.00 GB RAM (LUBM Ontologies test collection) Constraint: • 30-minute threshold per test case
23. 2nd SEALS Yardsticks for Ontology Management • Conformance and interoperability results • Scalability results • Conclusions
24. Conclusions – Test data • Test suites are not exhaustive – The new test suites helped detect new issues • A more expressive test suite does not imply detecting more issues • We used existing ontologies as input for the test data generator – This requires a previous analysis of the ontologies to detect defects – We found ontologies with issues that we had to correct
25. Conclusions – Results • Tools have improved their conformance, interoperability, and robustness • High influence of development decisions – The OWL API radically changed the way it deals with RDF ontologies • We need tools for easy evaluation • We need stronger regression testing • The automated generator defined test cases that a person would never have thought of but which identified new tool issues • Using bigger ontologies for conformance and interoperability testing makes it much more difficult to find problems in the tools
28. Advanced reasoning system • Description logic based system (DLBS) • Standard reasoning services – Classification – Class satisfiability – Ontology satisfiability – Logical entailment
30. Evaluation criteria • Interoperability – The capability of the software product to interact with one or more specified systems – A system must • conform to the standard input formats • be able to perform standard inference services • Performance – The capability of the software to provide appropriate performance, relative to the amount of resources used, under stated conditions
31. Evaluation metrics • Interoperability – Number of tests passed without parsing errors – Number of inference tests passed • Performance – Loading time – Inference time
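As a rough illustration of how the two performance metrics could be taken, here is a small sketch assuming the OWL API 3 and any OWLReasonerFactory (for example HermiT's); ontology satisfiability stands in for the inference step, and the class and method names are illustrative rather than SEALS code.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;

import java.io.File;

public class PerformanceMetrics {

    /** Measures loading time and inference time (ontology satisfiability used as the sample inference). */
    public static void measure(OWLReasonerFactory factory, File ontologyFile) throws Exception {
        long t0 = System.nanoTime();
        OWLOntology onto = OWLManager.createOWLOntologyManager().loadOntologyFromOntologyDocument(ontologyFile);
        long loadingMs = (System.nanoTime() - t0) / 1000000L;

        long t1 = System.nanoTime();
        OWLReasoner reasoner = factory.createReasoner(onto);
        boolean satisfiable = reasoner.isConsistent();
        long inferenceMs = (System.nanoTime() - t1) / 1000000L;
        reasoner.dispose();

        System.out.println("loading = " + loadingMs + " ms, inference = " + inferenceMs
                + " ms, satisfiable = " + satisfiable);
    }
}
```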
32. Class satisfiability evaluation • Standard inference service that is widely used in ontology engineering • Goal: to assess both a DLBS's interoperability and performance • Input – OWL ontology – One or several class IRIs • Output – TRUE: the evaluation outcome coincides with the expected result – FALSE: the evaluation outcome differs from the expected result – ERROR: indicates an I/O error – UNKNOWN: indicates that the system is unable to compute the inference in the given time frame
34. Ontology satisfiability evaluation • Standard inference service typically carried out before performing any other reasoning task • Goal: to assess both a DLBS's interoperability and performance • Input – OWL ontology • Output – TRUE: the evaluation outcome coincides with the expected result – FALSE: the evaluation outcome differs from the expected result – ERROR: indicates an I/O error – UNKNOWN: indicates that the system is unable to compute the inference in the given time frame
36. Classification evaluation • Inference service that is typically carried out after testing ontology satisfiability and prior to performing any other reasoning task • Goal: to assess both a DLBS's interoperability and performance • Input – OWL ontology • Output – OWL ontology – ERROR: indicates an I/O error – UNKNOWN: indicates that the system is unable to compute the inference in the given time frame
38. Logical entailment evaluation • Standard inference service that is the basis for query answering • Goal: to assess both a DLBS's interoperability and performance • Input – Two OWL ontologies • Output – TRUE: the evaluation outcome coincides with the expected result – FALSE: the evaluation outcome differs from the expected result – ERROR: indicates an I/O error – UNKNOWN: indicates that the system is unable to compute the inference in the given time frame
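All four evaluations share the same outcome vocabulary. The sketch below shows one way the TRUE / FALSE / ERROR / UNKNOWN mapping could be produced for a single class-satisfiability test; the executor-based timeout handling and the helper names are assumptions, not the SEALS implementation.

```java
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class InferenceOutcome {

    /** Runs one class-satisfiability test and maps the result to TRUE / FALSE / ERROR / UNKNOWN. */
    public static String classSatisfiability(final OWLReasoner reasoner, final OWLClass cls,
                                             boolean expected, long timeoutMinutes) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<Boolean> run = exec.submit(new Callable<Boolean>() {
            public Boolean call() {
                return reasoner.isSatisfiable(cls);
            }
        });
        try {
            boolean actual = run.get(timeoutMinutes, TimeUnit.MINUTES);
            return actual == expected ? "TRUE" : "FALSE";   // coincides with / differs from the expected result
        } catch (TimeoutException te) {
            return "UNKNOWN";                                // no answer within the given time frame
        } catch (Exception e) {
            return "ERROR";                                  // I/O or reasoner failure
        } finally {
            run.cancel(true);
            exec.shutdownNow();
        }
    }
}
```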
40. Storage and reasoning systems evaluation component • The SRS component is intended to evaluate description logic based systems (DLBS) – implementing the OWL API 3, the de facto standard for DLBS – implementing the SRS SEALS DLBS interface • SRS supports test data in all syntactic formats supported by the OWL API 3 • SRS saves the evaluation results and interpretations in MathML 3 format
41. DLBS interface • Java methods to be implemented by system developers – OWLOntology loadOntology(IRI iri) – boolean isSatisfiable(OWLOntology onto, OWLClass class) – boolean isSatisfiable(OWLOntology onto) – OWLOntology classifyOntology(OWLOntology onto) – URI saveOntology(OWLOntology onto, IRI iri) – boolean entails(OWLOntology onto1, OWLOntology onto2)
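As a sketch of what an implementation of these methods could look like when delegating to an OWL API 3 reasoner (e.g. HermiT or FaCT++ through their OWLReasonerFactory), the class below fills in the six operations; the class name, the reasoner factory and the choice of inferred-axiom generators are assumptions, since the slide only lists the signatures.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;
import org.semanticweb.owlapi.util.InferredAxiomGenerator;
import org.semanticweb.owlapi.util.InferredOntologyGenerator;
import org.semanticweb.owlapi.util.InferredSubClassAxiomGenerator;

import java.net.URI;
import java.util.Collections;
import java.util.List;

/** Sketch of the six DLBS methods, delegating to an OWL API reasoner supplied via its factory. */
public class ExampleDlbsAdapter {
    private final OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
    private final OWLReasonerFactory factory;   // e.g. HermiT's or FaCT++'s OWLReasonerFactory

    public ExampleDlbsAdapter(OWLReasonerFactory factory) { this.factory = factory; }

    public OWLOntology loadOntology(IRI iri) throws OWLOntologyCreationException {
        return manager.loadOntology(iri);
    }

    public boolean isSatisfiable(OWLOntology onto, OWLClass cls) {
        return factory.createReasoner(onto).isSatisfiable(cls);          // class satisfiability
    }

    public boolean isSatisfiable(OWLOntology onto) {
        return factory.createReasoner(onto).isConsistent();              // ontology satisfiability
    }

    public OWLOntology classifyOntology(OWLOntology onto) throws OWLOntologyCreationException {
        OWLReasoner reasoner = factory.createReasoner(onto);
        reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY);
        List<InferredAxiomGenerator<? extends OWLAxiom>> gens =
                Collections.<InferredAxiomGenerator<? extends OWLAxiom>>singletonList(
                        new InferredSubClassAxiomGenerator());
        OWLOntology result = manager.createOntology();                   // holds the inferred class hierarchy
        new InferredOntologyGenerator(reasoner, gens).fillOntology(manager, result);
        return result;
    }

    public URI saveOntology(OWLOntology onto, IRI iri) throws OWLOntologyStorageException {
        manager.saveOntology(onto, iri);
        return iri.toURI();
    }

    public boolean entails(OWLOntology onto1, OWLOntology onto2) {
        OWLReasoner reasoner = factory.createReasoner(onto1);
        return reasoner.isEntailed(onto2.getLogicalAxioms());            // does onto1 entail every axiom of onto2?
    }
}
```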
42. Testing Data • The ontologies from the Gardiner evaluation suite – over 300 ontologies of varying expressivity and size • Various versions of the GALEN ontology • Various ontologies that have been created in EU-funded projects, such as SEMINTEC, VICODI and AEO • 155 entailment tests from the OWL 2 test cases repository
43. Evaluation setup • 3 DLBSs – FaCT++: C++ implementation of the FaCT OWL DL reasoner – HermiT: Java-based OWL DL reasoner utilizing novel hypertableau algorithms – jcel: Java-based OWL 2 EL reasoner – FaCT++C: evaluated without the OWL prepareReasoner() call – HermiTC: evaluated without the OWL prepareReasoner() call • 2 AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ machines with 2 GB of main memory – DLBSs were allowed to allocate up to 1 GB
61. Conclusion • Errors: – datatypes not supported by the systems – syntax-related: a system was unable to register a role or a concept – expressivity errors • Execution time is dominated by a small number of hard problems
64. OAEI & SEALS • OAEI : Ontology Alignment Evalua-on Ini-a-ve – Organized as annual campaign from 2005 to 2012 – Included in Ontology Matching workshop at ISWC – Different tracks (evalua-on scenarios) organized by different researchers • Star-ng in 2010: Support from SEALS – OAEI 2010, OAEI 2011, and OAEI 2011.5 6/6/1264
66. OAEI tracks (Jose Aguirre, Jerome Euzenat, INRIA Grenoble) • Benchmark – Matching different versions of the same ontology – Scalability: size, runtimes • Conference • MultiFarm • Anatomy • Large BioMed
67. OAEI tracks (Ondřej Šváb-Zamazal, Vojtěch Svátek, Prague University of Economics) • Benchmark • Conference – Same domain, different ontologies – Manually generated reference alignment • MultiFarm • Anatomy • Large BioMed
68. OAEI tracks (Christian Meilicke, University of Mannheim; Cassia Trojahn, INRIA Grenoble) • Benchmark • Conference • MultiFarm: Multilingual Ontology Matching – Based on Conference – Test cases for Spanish, German, French, Russian, Portuguese, Czech, Dutch, Chinese • Anatomy • Large BioMed
69. OAEI tracks (Christian Meilicke, Heiner Stuckenschmidt, University of Mannheim) • Benchmark • Conference • MultiFarm • Anatomy – Matching mouse anatomy to human anatomy – Runtimes • Large BioMed
70. OAEI tracks (Ernesto Jimenez Ruiz, Bernardo Cuenca Grau, Ian Horrocks, University of Oxford) • Benchmark • Conference • MultiFarm • Anatomy • Large BioMed – Very large dataset (FMA-NCI) – Includes coherence analysis
72. Questions? Write a mail to Christian Meilicke: christian@informatik.uni-mannheim.de
73. IWEST 2012 workshop located at ESWC 2012 — Semantic Search Systems Evaluation Campaign
74. Two-phase approach • Semantic search tools evaluation demands a user-in-the-loop phase – usability criterion • Two phases: – User-in-the-loop – Automated
75. Evaluation criteria by phase Each phase will address a different subset of criteria. • Automated phase: query expressiveness, scalability, performance • User-in-the-loop phase: usability, query expressiveness
76. Participants
Tool | Description | UITL | Auto
K-Search | Form-based | x | x
Ginseng | Natural language with constrained vocabulary and grammar | x |
NLP-Reduce | Natural language for full English questions, sentence fragments, and keywords | x |
Jena Arq | SPARQL query engine; automated phase baseline | | x
RDF.Net Query | SPARQL-based | | x
Semantic Crystal | Graph-based | x |
Affective Graphs | Graph-based | x |
77. Usability Evaluation Setup • Data: Mooney Natural Language Learning Data • Subjects: 20 (10 expert users, 10 casual users) – Each subject evaluated the 5 participating tools • Task: formulate 5 questions in each tool's interface • Data collected: success rate, input time, number of attempts, response time, user satisfaction questionnaires, demographics
78. Questions 1) Give me all the capitals of the USA? (1 concept, 1 relation) 2) What are the cities in states through which the Mississippi runs? (2 concepts, 2 relations) 3) Which states have a city named Columbia with a city population over 50,000? (comparative) 4) Which lakes are in the state with the highest point? (superlative) 5) Tell me which rivers do not traverse the state with the capital Nashville? (negation)
79. Automated Evaluation Setup • Data: EvoOnt dataset – Five sizes: 1K, 10K, 100K, 1M, 10M triples • Task: answer 10 questions per dataset size • Data collected: ontology load time, query time, number of results, result list • Analyses: precision, recall, F-measure, mean query time, mean time per result, etc.
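For reference, the set-based measures listed above reduce to a few lines; this is a generic sketch (the result identifiers and gold-standard set are placeholders), not the SEALS interpretation code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SearchMeasures {

    /** Precision, recall and F-measure of a returned result set against a gold-standard (expected) set. */
    public static double[] precisionRecallF(Set<String> returned, Set<String> expected) {
        Set<String> truePositives = new HashSet<String>(returned);
        truePositives.retainAll(expected);

        double p = returned.isEmpty() ? 0.0 : (double) truePositives.size() / returned.size();
        double r = expected.isEmpty() ? 0.0 : (double) truePositives.size() / expected.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] { p, r, f };
    }

    public static void main(String[] args) {
        Set<String> returned = new HashSet<String>(Arrays.asList("a", "b", "c"));
        Set<String> expected = new HashSet<String>(Arrays.asList("b", "c", "d"));
        System.out.println(Arrays.toString(precisionRecallF(returned, expected)));  // P = R = F = 0.666...
    }
}
```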
80. Configuration • All tools executed on the SEALS Platform • Each tool executed within a virtual machine
 | Linux | Windows
OS | Ubuntu 10.10 (64-bit) | Windows 7 (64-bit)
Num CPUs | 2 | 4
Memory (GB) | 4 | 4
Tools | Arq v2.8.2 and Arq v2.9.0 | RDF Query v0.5.1-beta
82. Graph-based tools most liked (highest ranks and average SUS scores) • Perceived by expert users as intuitive, allowing them to easily formulate more complex queries • Casual users enjoyed the fun and visually appealing interfaces, which created a pleasant search experience [Chart: System Usability Scale (SUS) questionnaire scores for Semantic-Crystal, Affective-Graphs, K-Search, Ginseng and NLP-Reduce, split by casual vs. expert users]
83. Form-based approach most liked by casual users • Perceived by casual users as a midpoint between NL and graph-based approaches • Allows more complex queries than NL does • Less complicated, and less query input time, than the graph-based approach • Together with graph-based: most liked by expert users [Chart: extended questionnaire scores for "The system's query language was easy to understand and use", per tool, split by casual vs. expert users]
84. Casual users liked the controlled-NL approach • Casual users: – liked guidance through suggestions – prefer to be 'controlled' by the language model, allowing only valid queries • Expert users: – found it restrictive and frustrating – prefer to have more flexibility and expressiveness rather than support and restriction [Chart: SUS questionnaire scores per tool, split by casual vs. expert users]
85. Free-NL challenge: the habitability problem • Free NL liked for its simplicity, familiarity, naturalness and the low query input time required • Faces the habitability problem: a mismatch between the users' query terms and the tools' ones • This led to the lowest success rate, the highest number of trials to get a satisfying answer and, in turn, very low user satisfaction [Chart: answer-found rate per tool, split by casual vs. expert users]
87. Overview • K-Search couldn't load the ontologies – external ontology import not supported – cyclic relations with concepts in remote ontologies not supported • Non-NL tools transform queries a priori • Native SPARQL tools exhibit differences in query approach (see load and query times)
88. Ontology load time • RDF Query loads the ontology on the fly; load times are therefore independent of dataset size • Arq loads the ontology into memory [Chart: ontology load time (ms, log scale) vs. dataset size (thousands of triples) for Arq v2.8.2, Arq v2.9.0 and RDF Query v0.5.1-beta]
89. Query time • RDF Query loads the ontology on the fly; query times therefore incorporate the load time – Expensive for more than one query in a session • Arq loads the ontology into memory – Query times largely independent of dataset size [Chart: mean query time (ms, log scale) vs. dataset size (thousands of triples) for Arq v2.8.2, Arq v2.9.0 and RDF Query v0.5.1-beta]
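The in-memory strategy that Arq follows can be sketched with a few lines of Jena (a 2.x version, matching the ARQ releases above); the data file and query are placeholders. Loading once and querying the same model repeatedly avoids paying the load time per query, which is exactly the trade-off visible in the two charts.

```java
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;

public class ArqStyleQuery {
    public static void main(String[] args) {
        // Load the dataset into memory once: load time grows with dataset size,
        // but subsequent queries in the same session do not pay it again.
        Model model = FileManager.get().loadModel("dataset.rdf");        // placeholder file name

        String sparql = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10";         // placeholder query
        QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(sparql), model);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().get("s"));
            }
        } finally {
            qe.close();
        }
    }
}
```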
90. SEALS Semantic Web Service Tools Evaluation Campaign 2011 — Semantic Web Service Discovery Evaluation Results
91. Evaluation of SWS Discovery • Finding web services based on their semantic descriptions • For a given goal and a given set of service descriptions, the tool returns the match degree between the goal and each service • Measurement services are provided via the SEALS Platform to measure the rate of matching correctness
92. Campaign Overview – http://www.seals-project.eu/seals-evaluation-campaigns/2nd-seals-evaluation-campaigns/semantic-web-service-tools-evaluation-campaign-2011 • Goal – Which ontology/annotation is the best: WSMO-Lite, OWL-S or SAWSDL? • Assumptions: – Same corresponding test collections (TCs) – Same corresponding matchmaking algorithms (tools) – The corresponding tools will belong to the same provider – The level of performance of a tool for a specific TC is of secondary importance
103. Tools
WSMO-LITE-TC: WSMO-LITE-OU [1]
SAWSDL-TC: SAWSDL-OU [1], SAWSDL-URJC [2], SAWSDL-M0 [3]
OWLS-TC: OWLS-URJC [2], OWLS-M0 [3]
1. Ning Li, The Open University
2. Ziji Cong et al., University of Rey Juan Carlos
3. Matthias Klusch et al., German Research Center for Artificial Intelligence
105. Evaluation Execution • The evaluation workflow was executed on the SEALS Platform • All tools were executed within a virtual machine
OS | Windows 7 (64-bit)
Num CPUs | 4
Memory (GB) | 4
Tools | WSMO-LITE-OU, SAWSDL-OU
106. Partial Evaluation Results: WSMO-LITE vs. SAWSDL [Diagram: WSMO-LITE-OU evaluated on WSMO-LITE-TC and SAWSDL-OU evaluated on SAWSDL-TC, compared through a common measurement component M]
107. * This table only shows the results that are different
108. Analysis • Out of 42 goals, only 19 have different results in terms of precision and recall • On 17 of the 19 occasions, WSMO-Lite improves discovery precision over SAWSDL by specializing service semantics • WSMO-Lite performs worse than SAWSDL on discovery recall in 6 of the 19 occasions, while performing the same for the other 13
110. Lessons Learned • WSMO-LITE-OU tends to perform better than SAWSDL-OU in terms of precision, but slightly worse in recall • The only feature of WSMO-Lite used over SAWSDL was the service category (based on TC domains) – Services were filtered by service category in WSMO-LITE-OU and not in SAWSDL-OU • Further tests with additional tools and measures are needed for any conclusive results about WSMO-Lite vs. SAWSDL (many tools are not available yet)
111. Conclusions • This has been the first SWS evaluation campaign in the community focusing on the impact of the service ontology/annotation on performance • This comparison has been facilitated by the generation of WSMO-LITE-TC as a counterpart of SAWSDL-TC and OWLS-TC in the SEALS repository • The current comparison only involves two ontologies/annotations (WSMO-Lite and SAWSDL) • Raw and interpretation results are available in RDF via the SEALS repository (public access)