Metadata for mere mortals - Part 3: Controlled vocabularies
Presented by Erin Antognoli, Metadata Librarian
In this webinar, Erin discusses controlled vocabularies and how they fit into the use of metadata. If you haven’t yet listened to our previous discussions on metadata, you may want to take a few minutes to catch up.
Metadata and the data lifecycle: https://lac.gp/MetadataIntro
Choosing metadata standards: https://lac.gp/ChoosingStandards
Contact us: https://lac-group.com/contact-us/
3. Standardizing information
Controlled vocabulary = standardization of information
Benefits to record stewards and information users
● Consistent wording / spelling
● Limits on content entry
● Identical formatting
● Integration with linked data
5. Structure and consistency
Consistent representation of subject matter across records/collections
Example of inconsistency:
Avoid redundancy – maintain consistency
6. Organization and findability
Define how information will be structured and communicated
● Within records, within the organization as a whole, and beyond.
● Hierarchy example:
13. Controlled vocabularies – Complexity scale
Flat term lists
● No ordering; may have definitions and attribution; no overlap.
Authority files
● One term identified as the preferred term; other synonyms are variant terms, which usually link back to each other.
Classification schemes
● Codes that represent controlled vocabulary terms (e.g., Dewey Decimal System, Library of Congress Classification).
Hierarchical term lists
● List of terms grouped to imply order or organization (parent/child); may have definitions and attribution.
Less complex
14. Controlled vocabularies – Complexity scale
Thesauri
● Controlled vocabularies networked together by relationships between terms; often indicate preferred/variant terms, definitions, and attribution.
● Equivalence, hierarchy, and associative relationships defined.
Ontologies
● Describe concepts and relationships in programmatic ways and enable arbitrary relationships.
● Often no preferred terms; concepts and relationships are described in machine-readable ways (supports interoperability).
● Child terms inherit properties of parent terms (enables reuse and scalability).
More complex
15. Controlled vocabularies
LOC (Library of Congress)
● Subject headings
● Name authority file
● Genre/Form Terms for Library and Archival Materials (LCGFT)
GCMD (Global Change Master Directory)
● Earth science
● Instruments
● Locations
16. Controlled vocabularies
Crossref
● Funder Registry
Getty Vocabularies
● The Art & Architecture Thesaurus (AAT)
● The Cultural Objects Name Authority (CONA)
● The Union List of Artist Names (ULAN)
National Agricultural Library Thesaurus (NALT)
17. Decide on details
Define what important information is repeatable
Decide how much detail to include
Quality > Quantity
● Identify stakeholders and their needs
● Consider available resources
18. Less ambiguous, more useful metadata
Enhanced discovery and interoperability of information
Greater efficiency and client / end-user satisfaction
Why controlled vocabularies matter
How can we help?
19. Thank you
Presented by Erin Antognoli, Metadata Librarian
For more information, contact us.
Editor's Notes
Welcome to part 3 of our Metadata for Mere Mortals series, which serves as a basic introduction to the principles and function of metadata for content and digital asset managers who lack formal training in this area. My name is Erin Antognoli, and I’m a metadata librarian with LAC Group, on assignment at the National Agricultural Library in Beltsville, Maryland.
In this segment, we will examine how controlled vocabularies fit into the quest for metadata consistency and organization.
To begin, what is a controlled vocabulary?
A controlled vocabulary is essentially a vehicle to help standardize information.
One function of a controlled vocabulary is to keep the wording and spelling of information consistent. Another is to make certain that only the desired information ends up in particular fields. Still another is to keep similar fields formatted the same way.
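For anyone who likes to see ideas in code, here is a minimal sketch of limiting what can be entered in a field by checking it against a flat term list. The list contents, the "resource type" field, and the function name `validate_entry` are all made up for this illustration and do not come from any particular system.

```python
# Hypothetical flat term list for a "resource type" field (illustrative values only).
ALLOWED_RESOURCE_TYPES = {"Dataset", "Image", "Report", "Video"}


def validate_entry(value: str) -> str:
    """Return the entry in its controlled form, or raise if it is not on the list."""
    # Trim stray whitespace and normalize capitalization so formatting is identical.
    cleaned = value.strip().capitalize()
    if cleaned not in ALLOWED_RESOURCE_TYPES:
        raise ValueError(f"{value!r} is not in the controlled list")
    return cleaned


print(validate_entry("  dataset "))  # accepted and stored as "Dataset"
```

An entry such as "Spreadsheet" would be rejected here, which is the point: only the desired values reach the field, and they always arrive spelled and formatted the same way.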
One last benefit, which we will cover in more detail in an upcoming segment, involves implementing controlled vocabularies using linked data.
Using controlled vocabularies offers many benefits to both record stewards and information users.
Given that definition, how can controlled vocabularies help maximize metadata efficiency?
Controlled vocabularies that use linked data can help implement consistent representation of subject matter across records and collections and define how information will be structured and communicated. Some choices as basic as how to refer to your organization should be standardized. We struggle with this between divisions and collections at my library – the library itself will be referred to as the USDA-NAL, the USDA National Agricultural Library, and the United States Department of Agriculture National Agricultural Library, often within the same repository. However you decide to enter it, normalizing the data is critical. A controlled vocabulary of commonly used element terms, even if strictly internal, will help achieve this.
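To make the normalization step concrete, here is a small sketch that maps the three name variants mentioned above onto one preferred form. Choosing the fully spelled-out name as the preferred form is an assumption for the example, as are the names `VARIANTS` and `normalize_organization`; your organization may well pick a different form.

```python
# The three variants come from the talk; choosing the full name as the
# preferred form is an assumption made for this example.
PREFERRED_NAME = "United States Department of Agriculture National Agricultural Library"

VARIANTS = {
    "usda-nal": PREFERRED_NAME,
    "usda national agricultural library": PREFERRED_NAME,
    "united states department of agriculture national agricultural library": PREFERRED_NAME,
}


def normalize_organization(name: str) -> str:
    """Map a known variant to the preferred form; leave unknown names unchanged."""
    key = " ".join(name.lower().split())  # ignore case and extra spaces
    return VARIANTS.get(key, name)


records = ["USDA-NAL", "USDA National Agricultural Library", "usda-nal "]
print({normalize_organization(r) for r in records})  # a single name remains
```

Unknown names pass through untouched in this sketch, so they can be reviewed by a person rather than silently rewritten.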
Avoid redundancy. If information can be entered into a variety of fields in a number of ways, choose a single field and representation and apply the metadata consistently in that field. Use any community standards that exist for your subject matter.
Your controlled vocabulary choices can help maximize both the organization and findability of your information.
Depending on the collection, certain types of metadata can be organized in hierarchical levels. Examples might include noting how the information in a given record relates to other items within a collection, within your organization as a whole, and beyond the organization if your information needs to integrate with outside collections.
Included here is an example of my organization, and how I might create a hierarchy to express our information’s place within the overall structure of the USDA. You can use part or all of a hierarchy in your metadata, whatever makes the most sense to your stakeholders.
Most of you have probably shopped on Amazon or at least perused their site, so this is an example of controlled vocabularies in the wild we can all relate to. Amazon uses controlled vocabularies to consistently organize and categorize the millions of products they sell to make them more manageable and discoverable. For their main product description, they use about 3 dozen high-level category terms like Books, Music, and Clothing & Accessories, complete with a hierarchy of lower-level terms to further narrow down the nature of the product being described. This makes items more easily discoverable, and connects products with people who want to buy them.
For example, a top level category of “Clothing & Accessories” is a start, but is still too broad to be useful to most users, so a seller may further narrow down the product description to include the type of clothing, age, gender, or size the product is designed for. Sellers can also describe a category underneath the top level such as Coats, Shirts, Pants, Belts, Socks, and so on. This way, when a customer begins to type in a search, Amazon’s user interface will suggest related sub-categories to help narrow down the choices and provide the most relevant results.
This approach to metadata is exceptionally helpful if a customer doesn’t remember exactly what a product is called, or isn’t quite sure what they need. They can type in a high level category term and the database suggests potentially relevant refined categories based on that input, which allows users to browse Amazon’s offerings to discover available products.
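The suggestion behavior described above can be sketched in a few lines. This is not how Amazon actually implements it; it is only a toy model, assuming a simple two-level category tree and a hypothetical function named `suggest_subcategories`. The category names are the ones mentioned in the talk.

```python
# Category names come from the talk; the tree structure is an assumption.
CATEGORY_TREE = {
    "Clothing & Accessories": ["Coats", "Shirts", "Pants", "Belts", "Socks"],
    "Books": [],
    "Music": [],
}


def suggest_subcategories(query: str) -> list[str]:
    """Suggest 'Parent > Child' paths for top-level categories matching the typed text."""
    typed = query.strip().lower()
    suggestions = []
    for parent, children in CATEGORY_TREE.items():
        # Only suggest when something was typed and it starts a category name.
        if typed and parent.lower().startswith(typed):
            suggestions.extend(f"{parent} > {child}" for child in children)
    return suggestions


print(suggest_subcategories("cloth"))
```

Because the categories are a controlled list, the suggestions are predictable: typing the start of a high-level term always leads to the same set of refinements.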
Let’s take a closer look at the NAL Thesaurus, since that’s a controlled vocabulary I use across several of the repositories I work with. For example, if I’m cataloging information about products and commodities, there are a lot of potential keywords I could use, and different catalogers may prioritize different keyword tags. Without a controlled vocabulary, the poor soul who has to search for this record in the future would have to search under every one of those variant terms to possibly find the one the cataloger decided to use in this particular record.
Employing a controlled vocabulary cleans this inconsistency and uncertainty up easily. In addition, using the thesaurus provides even more context about the nature of my particular record. In this case, I’m looking at NALT’s entry for “products and commodities”, which is a preferred term.
The thesaurus tells me to use the preferred term for any of the variant terms listed below in the “Used For” section. The NALT entry also provides the broader and narrower term lists, related terms list, and scope notes defining when to use which preferred term for added consistency. This thesaurus entry also provides subject categories to give more context to this keyword. So, in this case, a single keyword can provide much more detail, while maintaining metadata clarity, than 10 independently conceived variations of this keyword applied as flat text might.
Even better, each preferred term in the NAL Thesaurus is assigned a unique number, so if updates to the thesaurus occur in the future, the metadata tagged with this keyword, provided it is linked using the persistent number and not applied as a flat keyword, will automatically be updated. Flat keywords, by contrast, would become out of date and potentially inaccurate in the event of a change in the preferred term or definitions. Using a controlled vocabulary as linked data guards against this type of obsolescence.
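Here is a rough sketch of why linking by a persistent number holds up better than a flat keyword. The identifier `T0001`, the record fields, and the renamed label are all hypothetical; they are not real NALT values, and a real linked data setup would use URIs rather than a Python dictionary.

```python
# All identifiers and the renamed label below are hypothetical, not real NALT values.
thesaurus = {
    "T0001": {"preferred": "products and commodities"},
}

# Record A links to the concept by its identifier; record B stores flat text.
record_linked = {"title": "Record A", "subject_id": "T0001"}
record_flat = {"title": "Record B", "subject": "products and commodities"}


def display_subject(record: dict) -> str:
    """Resolve a linked identifier to the current preferred label."""
    if "subject_id" in record:
        return thesaurus[record["subject_id"]]["preferred"]
    return record["subject"]


# Suppose the vocabulary maintainers later rename the preferred term.
thesaurus["T0001"]["preferred"] = "commodities and products"

print(display_subject(record_linked))  # follows the change
print(display_subject(record_flat))    # still shows the old wording
```

The linked record never stored the wording at all, only the number, so there is nothing in it to go stale.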
Let’s look at the different types of controlled vocabularies available that help achieve consistency, interoperability, and organization goals.
The Thesaurus is an example of a controlled vocabulary, though this one won’t be going extinct any time soon.
Controlled vocabularies fall into different categories, or types. You can choose more than one type to use throughout your metadata. The key, like everything else, is to evaluate your needs and review the different options out there to see what fits best with your information, the needs of the users, and the vision for how this collection will interact with others in the future.
A Flat term list does not have a hierarchy or specific ordering. Think of this like a school vocabulary test.
Authority files list one term as the preferred term and all other synonyms link back to the single preferred term for consistency. For example, author Samuel Clemens would link back to the preferred term “Mark Twain”.
Classification schemes are codes that represent controlled vocabulary terms. The Dewey Decimal System codes, which you find on the spines of books in the library, are a classification scheme.
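As a small illustration of codes standing in for terms, the sketch below looks up the main Dewey class for a call number. Only the ten top-level classes are included, and the function name `main_class` and the sample call numbers are assumptions for the example.

```python
# The ten main Dewey Decimal classes (top level only, not the full scheme).
DEWEY_CLASSES = {
    "000": "Computer science, information and general works",
    "100": "Philosophy and psychology",
    "200": "Religion",
    "300": "Social sciences",
    "400": "Language",
    "500": "Science",
    "600": "Technology",
    "700": "Arts and recreation",
    "800": "Literature",
    "900": "History and geography",
}


def main_class(call_number: str) -> str:
    """Return the label of the main class (hundreds group) for a Dewey number."""
    cleaned = call_number.strip()
    if not cleaned:
        return "Unknown class"
    hundreds = cleaned[0] + "00"  # e.g. "630.2" falls in the "600" group
    return DEWEY_CLASSES.get(hundreds, "Unknown class")


print(main_class("630.2"))  # Technology
```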
Hierarchical term lists group terms to imply order or organization. Listing a place according to a hierarchy might look something like Earth > Continent > Country > State > City > Street > Street Number.
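A parent/child list like this can be stored as simple "term points to its parent" links. The sketch below is one possible way to do it; the specific place names are sample values chosen for illustration, and `PARENT` and `path_to` are hypothetical names.

```python
# Each term points to its parent; None marks the top of the hierarchy.
# The place names are sample values for illustration.
PARENT = {
    "Earth": None,
    "North America": "Earth",
    "United States": "North America",
    "Maryland": "United States",
    "Beltsville": "Maryland",
}


def path_to(term: str) -> str:
    """Walk up the parent links and return the full path, broadest term first."""
    chain = []
    current = term
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return " > ".join(reversed(chain))


print(path_to("Beltsville"))
# Earth > North America > United States > Maryland > Beltsville
```

Storing only the parent link keeps the list easy to maintain, and you can still print part or all of the path depending on what your stakeholders need.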
Thesauri network terms by their relationships. They indicate preferred terms, broader and narrower term hierarchies, and define terms for clarity. The National Agricultural Library Thesaurus is a comprehensive grouping of agricultural terms.
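To show how the pieces of a thesaurus entry fit together, here is a minimal sketch of one entry with its equivalence, hierarchy, and associative relationships. The entry itself is invented for illustration and is not copied from NALT or any other real thesaurus; the class and function names are likewise assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    preferred: str
    used_for: list[str] = field(default_factory=list)   # equivalence (variants)
    broader: list[str] = field(default_factory=list)    # hierarchy (up)
    narrower: list[str] = field(default_factory=list)   # hierarchy (down)
    related: list[str] = field(default_factory=list)    # associative
    scope_note: str = ""


# Illustrative entry only; the terms are not copied from any real thesaurus.
ENTRIES = [
    ThesaurusEntry(
        preferred="apples",
        used_for=["apple fruit"],
        broader=["fruits"],
        narrower=["cider apples"],
        related=["orchards"],
        scope_note="Use for the fruit, not the tree.",
    ),
]


def find_entry(term: str):
    """Return the entry whose preferred or variant term matches, else None."""
    wanted = term.strip().lower()
    for entry in ENTRIES:
        variants = [v.lower() for v in entry.used_for]
        if wanted == entry.preferred.lower() or wanted in variants:
            return entry
    return None


entry = find_entry("Apple fruit")
print(entry.preferred, "| broader:", entry.broader, "| related:", entry.related)
```

Searching on a variant lands on the preferred term, and the entry brings its broader, narrower, and related terms along with it.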
Ontologies describe concepts and relationships in machine-readable ways, which is desired for large, dynamic, and more automated collections.
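One ontology feature, child concepts inheriting the properties of their parents, can be sketched with plain dictionaries. Real ontologies are normally expressed in dedicated machine-readable formats rather than like this, so treat the following as a simplified illustration; the concepts, property names, and the function `all_properties` are hypothetical.

```python
# Hypothetical mini-ontology: each concept has a parent and its own properties.
CONCEPTS = {
    "Plant": {"parent": None, "properties": {"kingdom": "Plantae"}},
    "Crop": {"parent": "Plant", "properties": {"cultivated": True}},
    "Wheat": {"parent": "Crop", "properties": {"type": "cereal grain"}},
}


def all_properties(concept: str) -> dict:
    """Collect a concept's properties, including those inherited from ancestors."""
    chain = []
    current = concept
    while current is not None:
        chain.append(current)
        current = CONCEPTS[current]["parent"]
    merged = {}
    # Apply the broadest concept first so more specific values can override.
    for name in reversed(chain):
        merged.update(CONCEPTS[name]["properties"])
    return merged


print(all_properties("Wheat"))
```

Nothing about the plant kingdom had to be repeated on "Wheat"; it is picked up from the parent concepts, which is what makes this approach reusable as a collection grows.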
So, looking at this list, you can see that some of the more complex controlled vocabularies have features of the less complex ones. There is indeed some overlap in functionality, but all of them help get metadata content more normalized and under control.
Don’t reinvent the wheel! Utilize existing controlled vocabularies wherever possible to maintain consistency and achieve greater findability and interoperability for your collections.
Here are a few of the MANY controlled vocabularies out there. Some are fairly general and multi-purpose, and others more subject specific.
Often an organization will produce multiple controlled vocabularies to meet the needs of its own collections, as the Library of Congress, the Global Change Master Directory, USDA, and Getty have done, and then make those vocabularies available to the general public through APIs, linked open data, or other means.
https://www.controlledvocabulary.com/examples.html
When deciding when and where to implement controlled vocabularies, define what information is repeatable and how much detail to include.
Identify stakeholder needs, but also consider prioritizing information, especially if you have limited staff. To save time and process as much as possible, you could enter metadata at the collection level rather than the individual item level.
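Entering metadata once at the collection level and letting items pick it up might look something like the sketch below. The collection, the items, and the field names are all invented for the example, and `effective_metadata` is a hypothetical helper.

```python
# Hypothetical collection and items; field names are assumptions.
collection = {"publisher": "Example Library", "subject": "soil science"}
items = [
    {"title": "Field survey, plot 1"},
    {"title": "Field survey, plot 2", "subject": "soil erosion"},
]


def effective_metadata(item: dict, collection_defaults: dict) -> dict:
    """Combine collection-level values with item values; item values win."""
    return {**collection_defaults, **item}


for item in items:
    print(effective_metadata(item, collection))
```

Staff only enter item-level values where an item genuinely differs from its collection, which saves time when resources are limited.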
Quality is better than quantity, but it isn’t a black and white issue. A small amount of carefully chosen information that is most useful to your core stakeholders can be more valuable than filling every field, and more effective at making the information findable, interoperable, and usable.
Above all, it’s essential to have a metadata strategy before getting into the details of a controlled vocabulary.