3. Before You Begin
Be Sure to Know
How will we use the thesaurus?
What is the size and the scope of the project?
Who will be viewing and accessing the vocabulary?
When will we update the thesaurus?
4. Approaches For Taxonomy
Design
Top-down
• Identifying top categories first or utilizing a pre-established category list
for your top areas.
• Organizing each top domain with the most relevant and broad coverage of
documents
• Dividing each category into subcategories to narrow down to granular topical
areas
• Establishing attributes and additional subcategories for each thesaurus
node created
5. Approaches For Taxonomy
Design
Bottom-up
• Beginning from an unsorted list of vocabulary terms and concepts
compiled from multiple resources
• Moving terms in the list to classify their Broader / Narrower relationships
• Declaring top domains after exploring the number of topics covered in a
single term
• Guaranteeing that every term is evaluated at least once
• Sorting extremely large unsorted data sets efficiently
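As a minimal sketch of the bottom-up approach, the snippet below assumes each term has already been assigned at most one Broader term; the function name and the sample terms are hypothetical, not taken from any particular tool. It derives the Narrower relationships and declares as top domains the terms left without a Broader term.

```python
from collections import defaultdict


def build_hierarchy(broader_of):
    """Derive top terms and Narrower lists from a term -> Broader term mapping.

    broader_of maps each term to its Broader term, or to None when the
    term has no Broader term (a candidate top domain).
    """
    narrower = defaultdict(list)
    for term, broader in broader_of.items():
        if broader is not None:
            narrower[broader].append(term)
    # Terms with no Broader term become the top domains.
    top_terms = sorted(t for t, b in broader_of.items() if b is None)
    return top_terms, {term: sorted(kids) for term, kids in narrower.items()}


# Hypothetical sample: an unsorted list after Broader terms were assigned.
sample = {
    "Optics": None,
    "Lasers": "Optics",
    "Fiber optics": "Optics",
    "Acoustics": None,
}
top_terms, narrower = build_hierarchy(sample)
```

Because every term appears as a key in the mapping, each one is touched at least once during the sort.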
6. Top-down
• Easier for smaller vocabulary sets
• Quick method of identifying key top areas
• Designed for a navigational mindset
Bottom-up
• More accurate representation of the content
• Ideal for larger scale thesauri
• Content drives the entire structure of the thesaurus
We recommend a mix of both, but every vocabulary demands
different courses of action
8. Resources for Designing a
Thesaurus
Existing Controlled Vocabularies
• Additional taxonomies
• Classification schemes
• Topics and headings
• Sitemaps
• Glossaries and Definitions
Listing of Keywords
• Entered by an author or indexer
• May range in size from 100 to 100,000 terms
9. Resources for Designing a
Thesaurus
• Search Logs
• An unruly mess of words
• What to look out for…
• Which topics are more frequently searched for by users?
• Has common terminology for concepts and technologies changed within the past x
years?
• Trim search logs to the most frequent and concise topics
• Data Mining
• N-gram tests
• “Content-aware” vocabulary
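One way to run a simple N-gram test over a search log is sketched below. It assumes the log has already been reduced to a list of query strings; the function name and the sample queries are illustrative only.

```python
from collections import Counter


def ngram_counts(queries, n=2):
    """Count word n-grams across a list of search queries."""
    counts = Counter()
    for query in queries:
        tokens = query.lower().split()
        # Slide a window of n words across the query.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical search-log excerpt.
queries = [
    "solar cell efficiency",
    "Solar cell materials",
    "thin film solar cell",
]
bigrams = ngram_counts(queries, n=2)
most_frequent = bigrams.most_common(5)
```

Frequent n-grams are only candidates; each still needs review against the warrant criteria before it becomes a thesaurus term.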
11. Selecting Thesaurus Terms
• Looking for descriptors, terms in the thesaurus which must
adequately reflect the content
• Terms which describe fields of study, technology, applications,
devices, research, and other content
• Thesaurus terms must be concise, must express a single concept,
and must be free of ambiguity.
• Concepts such as General and Applications will not describe what is
written within a single document.
12. Literary warrant
• Justification for the representation of a concept in an indexing language or
for the selection of a preferred term because of its frequent occurrence in
the literature
Organizational warrant
• Justification for the representation of a concept in an indexing language or
for the selection of a preferred term due to characteristics and context of
the organization
User Warrant
• Justification for the representation of a concept in an indexing language or
for the selection of a preferred term because of frequent requests for
information on the concept or free-text searches on the term by users of
an information storage and retrieval system.
14. Compiling the Terms
Existing vocabularies
• Be aware of overlap and multiple terminologies
• Standardize the terms (plural, hyphenation, etc.)
• Break up pre-coordination if it exists
Whether to include the vocabulary’s current hierarchy (if it
has one) is purely the decision of the thesaurus developer
• Retaining the existing hierarchy will save time and effort while providing an
early look at the structure of the vocabulary
• However, conflicting and overlapping terms may cause problems when
reviewing the initial build
15. Filtering the Unsorted Lists
Standardize the “Word Salad”
• Combining singular and plural forms of terms
• Combining hyphenated terms
• Removing named entities
• Identifying and/or removing acronyms
Add only the most frequently searched terms and added keywords
• Can limit to the top 50 or 100 most frequent
• Too many results can litter a vocabulary with rubbish terms
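The filtering steps above can be sketched roughly as follows. The plural handling is a deliberately naive assumption (it strips one trailing "s" to build a grouping key), and the function names are hypothetical. A real project would likely need a proper stemmer or a manual review pass.

```python
import re
from collections import Counter


def fold_key(term):
    """Build a grouping key that merges case, hyphen, and simple plural variants."""
    key = re.sub(r"[-\s]+", " ", term.strip().lower())
    # Naive singular/plural folding: drop one trailing "s", but keep "ss".
    if key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def top_terms(raw_terms, limit=50):
    """Return the `limit` most frequent folded terms with their counts."""
    counts = Counter(fold_key(term) for term in raw_terms)
    return counts.most_common(limit)


# Hypothetical "word salad" from keywords and search logs.
raw = ["Solar cells", "solar cell", "solar-cell", "Wind turbines", "glass"]
shortlist = top_terms(raw, limit=2)
```

The folded key is used only for counting; the preferred form of each surviving term (plural, hyphenation) is still chosen by the thesaurus developer.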
17. Creation of the Initial Build
• Establish primary categories for the thesaurus
• Sort uncontrolled terms into appropriate categories
• Most time-consuming process
• Content will be re-evaluated, so don’t stress too much about getting it right the first
time
• Create synonyms and related terms as you sort each term
• Double-check for conceptual duplicates within the project
• Ensure standardized spelling (American vs. British English)
• Check for typos
• Review Literary, Organizational, and User Warrant for each term
• Delete terms with little to no indexing value
18. Initial Build - Equivalence and
Associations
Six-Second Rule
• As a rule-of-thumb, give yourself six seconds to brainstorm multiple ways
to express a single concept.
Creating synonyms not only allows for a stronger thesaurus, but
will potentially identify duplicate concepts within the early
vocabulary.
Adding and searching for related terms will identify other subject
areas included in the unsorted taxonomy
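To illustrate how synonyms can expose duplicate concepts, here is a small sketch that assumes the early vocabulary is held as a mapping from each term to its list of synonyms. The structure and names are assumptions made for illustration.

```python
from collections import defaultdict


def find_possible_duplicates(synonyms_by_term):
    """Report labels (term names or synonyms) claimed by more than one term."""
    owners = defaultdict(set)
    for term, synonyms in synonyms_by_term.items():
        # A term's own name counts as one of its labels.
        for label in {term, *synonyms}:
            owners[label.lower()].add(term)
    return {
        label: sorted(terms)
        for label, terms in owners.items()
        if len(terms) > 1
    }


# Hypothetical early vocabulary.
vocab = {
    "Microelectromechanical systems": ["MEMS"],
    "MEMS devices": ["MEMS", "Micromachines"],
    "Qubits": ["Quantum bits"],
}
duplicates = find_possible_duplicates(vocab)
```

Any label shared by two terms is a prompt for review, not an automatic merge.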
20. Evaluation
• Review Literary, Organizational, and User Warrant
• Division of top terms
• Assign team members top levels to review
• Fill in missing gaps of classification
• Ensure no flat list of topics (more than 15 terms in a category) exists within a
single section
• Merge conceptual duplications within the content
• Preferring one expression over the others
• Delete terms with little to no indexing value
• Add synonyms not listed for each term
• Add related terms which do not appear
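The flat-list check in the evaluation step could be automated along these lines. The sketch assumes the hierarchy is available as a mapping from each term to its Narrower terms; the threshold of 15 comes from the guideline above, while the function name is hypothetical.

```python
def flat_categories(narrower, max_children=15):
    """List categories holding more than max_children Narrower terms."""
    return sorted(
        term
        for term, children in narrower.items()
        if len(children) > max_children
    )


# Hypothetical hierarchy: one overloaded category, one healthy one.
hierarchy = {
    "Optics": [f"Optics term {i}" for i in range(16)],
    "Acoustics": ["Ultrasonics"],
}
needs_subdividing = flat_categories(hierarchy)
```

Categories returned by the check are candidates for new subcategories.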
21. Evaluation - Term style and Form
Must represent a single train of thought
• Removes ambiguity and uncertainty of concepts
• Pre-coordination of terms should be avoided (“Acoustics in music”,
“Cancer and metastasis”)
Reduce slang and jargon for preferred terms unless no other word
describes the concept or the older terminology is infrequently used
• (Microelectromechanical Systems and MEMS)
• (Quantum bits and Qubits)
22. Evaluation - Term style and Form
Use nouns, or noun phrases / Avoid action verbs for concepts
• Catalysis rather than catalyze
• Distillation rather than distill
• Reading rather than read
Adjectives and Adverbs
• May be used to differentiate concepts
• Should not be used as individual terms
23. Evaluation
Proper nouns (including names, places, etc.) should have proper
capitalization
• Arabian Peninsula
• Milky Way Galaxy
• Louvre
• Albert Einstein
Compound terms
• Used for disambiguation and for specificity
• Granular descriptors
• “Lead coating on copper pipes”
24. Evaluation - Term style and Form
Loanwords are fine if they are covered well within the content
(habeas corpus)
Abbreviations and acronyms should be spelled out, unless the
proper name is rarely used (DNA)
Do not include parentheses unless disambiguating the term
• Mercury (element) = Okay
• Computed tomography (CT) = Frowned upon
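A rough style check for the parentheses rule might look like the sketch below. The heuristic (an all-uppercase parenthetical is treated as an abbreviation) is an assumption for illustration and will misjudge some terms, so flagged items still need a human look.

```python
import re

# Matches a parenthetical at the very end of a term.
_TRAILING_PAREN = re.compile(r"\(([^)]*)\)\s*$")


def parenthetical_kind(term):
    """Classify a term's trailing parenthetical as none, qualifier, or abbreviation."""
    match = _TRAILING_PAREN.search(term)
    if match is None:
        return "none"
    inner = match.group(1).strip()
    # Heuristic: all-uppercase content is probably an acronym.
    return "abbreviation" if inner.isupper() else "qualifier"


kinds = {
    term: parenthetical_kind(term)
    for term in ["Mercury (element)", "Computed tomography (CT)", "Distillation"]
}
```

Terms classified as "abbreviation" are the ones to rework, typically by moving the acronym into a synonym.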
25. Indexing
Post-coordination
• Two or more thesaurus terms are applied to an article to represent a
concept.
• Terms are combined at the time of search and retrieval
• Examples: Liver AND Anatomy; New York AND Subway
Pre-coordination
• Terms are combined before indexing
• Uses one node to describe content
• Examples: Furniture-California-San Francisco-History-20th Century;
Liver-Blood Vessels-Diseases-Congresses
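Post-coordination can be pictured as a set intersection performed at search time. The sketch below assumes a simple inverted index from each thesaurus term to the IDs of the documents indexed with it; the index contents and function name are made up for illustration.

```python
def search_all(index, *terms):
    """Return IDs of documents indexed with every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    # Combining simple terms with AND happens here, at retrieval time.
    return set.intersection(*postings)


# Hypothetical index: each simple term is applied to documents separately.
index = {
    "Liver": {1, 2, 3},
    "Anatomy": {2, 3, 4},
    "Diseases": {3},
}
liver_anatomy = search_all(index, "Liver", "Anatomy")
```

A pre-coordinated vocabulary would instead need a dedicated node such as "Liver-Anatomy" for every combination that might be searched.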
26. Post-coordinated terms work more effectively for MAIstro
(Thesaurus Master and M.A.I.)
• Allows M.A.I. to easily identify subject terms within a range of documents
without elaborate rules
• Easier to maintain simpler vocabulary terms
Pre-coordination allows an unlimited number of terms to be added
to the Thesaurus
• Expressing multiple concepts within a single thesaurus term will set a
precedent for creating all terms in this manner
• If you have the term Computers in chemistry, what will stop you from
creating Computers in biology, Computers in dentistry, Computers in
echolocation, etc.?
27. Evaluation - Term style and form
Keep terms plural unless
• Changing the term to a plural form alters the meaning of the term (e.g.
Technology; Technologies)
• If this is the case, disambiguate the concepts with parenthetical qualifiers:
Technology (applied sciences) and Technologies (devices)
• Literary warrant or User warrant dictates that the term be singular
Control the vocabulary through use of synonyms
• Terms must represent unique concepts
Keep a single train of thought
28. Revision and Reiteration
Thesaurus development is highly cyclical
• For teams with multiple people, reviewing alternate sections and each other’s work is
highly recommended
• A fresh pair of eyes will catch plenty of errors and inconsistencies
within the thesaurus terms
Subject Matter Expert feedback is always recommended
• Must be clear what SMEs are reviewing and why they are reviewing it
• Many experts are highly opinionated and unaware of the
scope/implementation of the project
• Feedback must be re-evaluated (sometimes taken with a grain of salt)
29. Standards and Compliance
• American National Standards Institute / National Information
Standards Organization
• ANSI/NISO Z39.19
• British Standards Institution
• BS 8723 parts 1-4
• International Organization for Standardization
• ISO 25964