Case study of the American Institute of Physics thesaurus. Presented by Mark Cassar of the American Institute of Physics and Jack Bruce of Access Innovations, Inc. at the 2012 Data Harmony User Group meeting on February 8, 2012 at the Access Innovations, Inc. offices.
Developing the AIP Thesaurus: The Platform for an Ontology
1. Developing the AIP Thesaurus:
The Platform for an Ontology
Mark Cassar
American Institute of Physics
Jack Bruce
Marjorie (Margie) M.K. Hlava
Access Innovations:
505-998-0800
2. Background
• Physics and Astronomy Classification Scheme (PACS)
• Six-digit code schema used for indexing scholarly content
• 10 digit based
– domain headings with subcategories nested under
each domain.
• Precoordinated system
– Combine terms (concepts) at the time of indexing
4. Why Change?
• Improve searchability
• Move to a postcoordinated system
– Combine terms at time of search
• Semantic enrichment
• Flexible metadata for many applications
• Naturalize the vocabulary
– Represent concepts succinctly and concisely
– Easily add new concepts based on new and emerging
technologies and applications
– Allow unlimited hierarchy levels and polyhierarchy
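A minimal sketch of what "combine terms at time of search" can look like, assuming each document is indexed with single-concept terms and the search engine intersects them on request. The document IDs, the sample terms, and the function names are hypothetical and are not taken from the AIP system.

```python
from collections import defaultdict

# Hypothetical documents, each indexed with single-concept terms.
INDEXED_DOCS = {
    "doc1": {"Electron diffraction", "Condensed matter"},
    "doc2": {"Electron diffraction", "Thin films"},
    "doc3": {"Thin films", "Condensed matter"},
}

def build_inverted_index(docs):
    """Map each index term to the set of documents tagged with it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def search_all(index, *terms):
    """Postcoordinate: intersect the postings of every requested term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

index = build_inverted_index(INDEXED_DOCS)
print(sorted(search_all(index, "Electron diffraction", "Condensed matter")))
```

In a precoordinated scheme the combination "electron diffraction in condensed matter" would need its own code assigned at indexing time; here the searcher builds the combination from independent terms.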
5. Better ROI
• Rules-assisted indexing
– Provide end users with a swift indexing solution
based on the Machine-Aided Indexer (M.A.I.)
engine.
– Batch index large corpus of scholarly content, as
well as future content.
• Improve costs
– Automate a large portion of electronic indexing
– Less overhead for indexing
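The idea of rules-assisted batch indexing can be illustrated with a small sketch. This is not the M.A.I. rule syntax or engine; the regular-expression rules, the sample corpus keys, and the function names below are assumptions made only for illustration.

```python
import re

# Hypothetical rules: a text pattern that, when matched, suggests a thesaurus term.
RULES = [
    (re.compile(r"\bLEED\b|low[- ]energy electron diffraction", re.I),
     "Low energy electron diffraction"),
    (re.compile(r"\bRHEED\b|reflection high[- ]energy electron diffraction", re.I),
     "Reflection high energy electron diffraction"),
]

def suggest_terms(text):
    """Return the thesaurus terms whose rule matches the text."""
    return [term for pattern, term in RULES if pattern.search(text)]

def batch_index(corpus):
    """Apply the rules to every document in a {doc_id: text} mapping."""
    return {doc_id: suggest_terms(text) for doc_id, text in corpus.items()}

corpus = {
    "a": "We used RHEED to monitor film growth.",
    "b": "Low-energy electron diffraction patterns were recorded.",
}
print(batch_index(corpus))
```

A human indexer would then review the suggested terms rather than assign every term from scratch, which is where the reduced overhead would come from.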
6. Roadmap of the AIP Thesaurus
• Data Collection
– Load PACS codes and terms
– Incorporate Search logs; add top searched concepts into the
vocabulary
• Analysis of Content
– Test comparison of indexing to humanly indexed articles
• Thesaurus Construction
– Separate, disambiguate, and migrate concepts; Break up top
domains
– Apply thesaurus and taxonomy standardization to each term
– Multiple reviews for each top section
• Evaluation and Feedback
– Send back working draft to AIP for review
– Gather feedback from subject matter experts and incorporate the
changes into the thesaurus
• Finalization and Product Delivery
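One way to carry out the "test comparison of indexing to humanly indexed articles" step is to measure the overlap between the two term sets for each article. The slides do not say which measure was used; the sketch below assumes precision and recall against the human indexing, and the function name is hypothetical.

```python
def compare_indexing(machine_terms, human_terms):
    """Compare machine-suggested terms with human-assigned terms for one article."""
    machine, human = set(machine_terms), set(human_terms)
    agreed = machine & human
    # Precision: share of machine terms the human indexer also chose.
    precision = len(agreed) / len(machine) if machine else 0.0
    # Recall: share of human terms the machine also found.
    recall = len(agreed) / len(human) if human else 0.0
    return {"agreed": agreed, "precision": precision, "recall": recall}

report = compare_indexing(
    machine_terms=["Thin films", "Plasma", "Lasers", "Optics"],
    human_terms=["Thin films", "Plasma", "Superconductivity"],
)
print(report)
```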
7. Source Data
• PACS 2009 ed.
• 1999 ed. of AIP Thesaurus (out of date)
• Terms added to INSPEC since 2000
• Internal and external search logs
• Cumulative journal indexes
– Digital
– (2006 through 2009)
• List of AIP divisions and their internal classifications
8. Analysis of Content
• Organizational warrant
– PACS 2009 (2010)
– www.aip.org
– UniPHY
• Literary warrant
– Where we found the term used
• Most frequent search terms loaded into thesaurus
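Picking the most frequent search terms out of a log could be sketched as follows. The log format (one query string per entry), the normalization, and the function name are assumptions; real search logs would likely need more cleanup before candidate terms are reviewed.

```python
from collections import Counter

def top_search_terms(queries, n=3):
    """Return the n most frequent queries after trimming and lowercasing."""
    normalized = (q.strip().lower() for q in queries if q.strip())
    return Counter(normalized).most_common(n)

# Hypothetical log extract.
log = ["Graphene", "graphene ", "LEED", "", "leed", "graphene", "plasma"]
print(top_search_terms(log, n=2))
```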
9. Thesaurus Creation Process
• Load data (vocabulary) into Data Harmony MAIstro™
• PACS
– Restructure top domains
– Separate into discrete concepts
– Disambiguate terms
– Remove parenthetical qualifiers
– Create postcoordinated terms
– Migrate separated terms into new/relevant categories
• Sort flat lists (search logs) into main categories determined
• Use multiple reviewers for each physics domain
• About 8181 preferred terms and 5217 synonyms
11. PACS TERM:
– Low-energy electron diffraction (LEED) and reflection
high-energy electron diffraction (RHEED) (condensed
matter structure determination)
– Becomes
– BT Condensed matter structure determination
• NT Low energy electron diffraction
–Synonym LEED
• NT Reflection high energy electron diffraction
–Synonym RHEED
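The restructured entry above can be represented as linked term records. This is a minimal sketch of such a structure, assuming a simple in-memory model; the `Term` class and `link` helper are hypothetical and do not describe how Data Harmony stores terms.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str                                     # preferred term
    synonyms: list = field(default_factory=list)   # nonpreferred terms
    broader: list = field(default_factory=list)    # BT labels (more than one allowed)
    narrower: list = field(default_factory=list)   # NT labels

def link(parent, child):
    """Record a reciprocal BT/NT relationship between two terms."""
    parent.narrower.append(child.label)
    child.broader.append(parent.label)

structure = Term("Condensed matter structure determination")
leed = Term("Low energy electron diffraction", synonyms=["LEED"])
rheed = Term("Reflection high energy electron diffraction", synonyms=["RHEED"])
link(structure, leed)
link(structure, rheed)

print(structure.narrower)
print(leed.broader, leed.synonyms)
```

Because `broader` is a list, a term can sit under several parents, which is how polyhierarchy would be expressed in this model.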
12. Evaluation and Feedback
• Weekly scheduled live demos of the thesaurus
• Free web-hosted version of the thesaurus and
periodic spreadsheet exports
• Collect feedback based on SME suggestions and AIP
PACS experts
– Correspondence via email
• Incorporate changes into thesaurus
13. Available versions
• Electronic copy of AIP thesaurus supplied in
– XML
– Excel
– Web-based, read-only versions (Thesviewer)
– MARC, SKOS, OWL, CSV, etc.
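To give a feel for the SKOS option, the sketch below writes one thesaurus term as a SKOS concept in Turtle syntax. The `skos:` properties are standard SKOS; the base URI, the slug scheme, and the function names are hypothetical, and this is not the actual export produced for AIP. Labels containing double quotes would need escaping, which is omitted here.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
BASE = "http://example.org/thesaurus/"  # hypothetical namespace

def slug(label):
    """Turn a term label into a URI-friendly identifier."""
    return label.lower().replace(" ", "-")

def to_skos_turtle(label, synonyms=(), broader=(), base=BASE):
    """Serialize one term as a skos:Concept in Turtle."""
    statements = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
    statements += [f'skos:altLabel "{s}"@en' for s in synonyms]
    statements += [f"skos:broader <{base}{slug(b)}>" for b in broader]
    body = " ;\n    ".join(statements)
    return f"<{base}{slug(label)}>\n    {body} ."

print(SKOS_PREFIX)
print(to_skos_turtle(
    "Low energy electron diffraction",
    synonyms=["LEED"],
    broader=["Condensed matter structure determination"],
))
```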
15. To make an ontology
• Define additional Associative relationships
• Define additional Hierarchical relationships
– IsA, IsPartOf, HasA
• Define additional Equivalence relationship
• Multilingual options
• Weights and measures
16. Clearer disambiguation?
[Diagram: "Mercury" at the center, linked to Temperature, Planets, Roman god, Automobile, and Metallic element by typed relationships (IsA, TypeOf, BrandOf)]
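Typed relationships like these can disambiguate a term that a plain broader/narrower link cannot. The sketch below stores each sense of "Mercury" as a typed link. Which relationship attaches to which node is an assumption here, since the slide diagram is not fully recoverable, and the Temperature/TypeOf link is left out for that reason. The `TypedLink` record and the `senses` helper are hypothetical.

```python
from collections import namedtuple

# One typed relationship from a sense of a term to a target concept.
TypedLink = namedtuple("TypedLink", "term sense relation target")

LINKS = [
    TypedLink("Mercury", "planet", "IsA", "Planets"),
    TypedLink("Mercury", "element", "IsA", "Metallic element"),
    TypedLink("Mercury", "god", "IsA", "Roman god"),
    TypedLink("Mercury", "automobile", "BrandOf", "Automobile"),
]

def senses(term, links):
    """Return {sense: (relation, target)} for every sense of a term."""
    return {l.sense: (l.relation, l.target) for l in links if l.term == term}

for sense, (relation, target) in senses("Mercury", LINKS).items():
    print(f"Mercury [{sense}] {relation} {target}")
```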
17. Knowledge Organization Systems
(listed from least to most complex)
• Uncontrolled list (not complex)
• Name authority file
• Synonym set/ring
• Controlled vocabulary
• Taxonomy
• Thesaurus (AIP Thesaurus is here)
• Ontology
• Semantic network (highly complex)
18. Lessons Learned
• Learning the style for indexing
• Tendency to revert to the PACS style of language and
classification
• SME feedback turnaround
– Sit with them 2 hours
– Incorporate suggestions 8 hours
– 2117 terms added
– 1354 terms changed or updated
– 1333 terms deleted
– 11259 other actions
19. Where are we now?
• Platform is established
• OWL and other formats available
• One kind of Associative relationship
– (Related terms)
• One kind of Hierarchical relationship
– Broader/Narrower (Parent/Child)
– Multiple broader terms for interdisciplinary options
• One kind of Equivalence relationship
– Synonyms (nonpreferred terms)
• Built using the Z39.19 standard - interoperable
20. To Review the AIP Thesaurus
• Use a web browser
• http://thesview.accessinn.com/aipThes/
• Enter the username and password twice; in all cases both are
'aip'.
• This starts a Java app in your browser that shows the
thesaurus starting from the top level of the hierarchy.
• Use the collaboration module to comment and
discuss
21. Thank you
Marjorie Hlava
mhlava@accessin.com
505-998-0800