Detailed information on the operation of the Data Harmony Machine Aided Indexer module from Access Innovations, Inc. Presented by Alice Redmond-Neal and Jack Bruce at the 2012 Data Harmony User Group meeting on February 7, 2012 at the Access Innovations, Inc. offices.
2. Machine Aided Indexer™ is available as a
stand-alone version or as part of MAIstro™
(integrated with Thesaurus Master™).
M.A.I.™ creates a simple rulebase from your
thesaurus terms to use for categorizing
documents.
You can fine-tune the rulebase to reflect
editorial knowledge and judgment, specifying
when thesaurus terms should be used.
Your result: Precision Indexing
3. M.A.I. under the hood
Concept Extractor™
Compares text to Knowledge Base rules to present
suggested index terms
Statistics Collector™
Gathers and stores the index experience of the
system, sorting into Hits / Misses / Noise
Prioritizes terms needing rule fine-tuning to improve
indexing accuracy
Rule Builder™
Human editor creates, edits, and reviews rules for
indexing terms
4. [Diagram: M.A.I. workflow]
IN: Text
Concept Extractor, using the MAI Knowledge Base
Rule Builder: editor manages the Knowledge Base
List of suggested terms from controlled vocabulary
Human review results in:
Hits: selected terms
Misses: added terms
Noise: rejected terms
Statistics Collector improves the Knowledge Base
OUT: Database, indexed set of documents
5. Objective in indexing:
apply indexing terms with...
Accuracy
Speed
Depth -- specificity
Breadth -- exhaustivity
Consistency
Objective in M.A.I. rulebuilding:
make rules reflect human thinking for
optimal categorization
6. How?
Formulate standard rules
for interpreting text
for applying thesaurus terms as subject
metadata to index/categorize
documents
2/14/2012
7. Why use rules for indexing?
Rules provide consistent direction for
interpreting text and applying indexing terms.
Accurate indexing results in precise
information retrieval.
8. M.A.I.’s starter rulebase
M.A.I. automatically generates rules
Starter rules match exactly to words in text
Identity rules for thesaurus terms
Synonym rules for established NonPreferred
terms
Success out of the box depends on
Taxonomy term expression of concepts
Writer’s creative expression of concepts
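The starter rulebase described above can be pictured as a simple lookup table. The following is a minimal Python sketch, not M.A.I.'s actual implementation; the function name `build_starter_rules` and the shape of the thesaurus dictionary are assumptions made for illustration.

```python
def build_starter_rules(thesaurus):
    """Build identity and synonym rules from a thesaurus.

    `thesaurus` maps each preferred term to a list of its NonPreferred
    terms (a hypothetical structure, assumed for this sketch).
    Returns a dict mapping lowercase text-to-match -> preferred term.
    """
    rules = {}
    for preferred, non_preferred in thesaurus.items():
        # Identity rule: the term itself is the text to match.
        rules[preferred.lower()] = preferred
        # Synonym rules: each NonPreferred term points to the preferred term.
        for synonym in non_preferred:
            rules[synonym.lower()] = preferred
    return rules


thesaurus = {
    "Unemployment": ["jobless"],
    "Aquaculture": ["fish farm"],
    "Irrigation": [],
}
starter_rules = build_starter_rules(thesaurus)
print(starter_rules["jobless"])     # Unemployment
print(starter_rules["irrigation"])  # Irrigation
```

Because every entry is an exact match, such a table only fires when the writer happens to use the thesaurus wording, which is why out-of-the-box success depends on both the taxonomy and the writing style.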
9. Fine-tuned by editors, rules enable
context clues to pinpoint word meaning
“reading between the lines”
natural language processing
greater accuracy over simple rule
indexing
Use M.A.I.’s Rule Builder Module
to fine-tune rules for applying terms.
10. Indexing and rule-building –
two processes
Indexing:
Read and interpret document text
Decide on indexing term
Rulebuilding:
Identify prompt word(s)
What brought the indexing term to mind?
This text to match in the document is the
starting point for rule-building.
11. Indexer reads the document text
“Indian leaders are asking the government…”
12. Indexer considers indexing terms
“government”
State government?
Federal government?
City government?
“Indians”
in India?
Indigenous people?
Native Americans?
13. Indexer selects indexing terms
“Indian leaders are asking the government
to prevent a repeat of the 1990 census
undercount that missed nearly 3000
Indians
in New Mexico.”
14. M.A.I. term suggestions
Government
New Mexico
Use your knowledge to select best terms –
from M.A.I. suggested terms
from thesaurus
Decide on indexing terms and apply them to
document.
15. Indexing done,
rule-building begins
The rule-building editor’s question:
What words in the text prompted
selection of those terms?
This word (or words) is the starting point for
building a rule with M.A.I. – the “gatekeeper.”
16. Choose the MAI Rule Builder tab
A rule has
two parts:
Viewing options:
--Text to Match
font
--rule body
style
size
17. M.A.I. rule starts with
Text to Match
The prompt word (or word part or
phrase) in the document --
whatever made the indexer think of
a specific indexing term --
becomes the Text to Match of a rule.
18. Importance of Text to Match
TTM opens the door to the rulebase
Without a word or phrase to match, the
knowledge of the rulebase is unavailable.
M.A.I. system programmatically creates a
starter rulebase
Identity rules – exact match to thesaurus term
Synonym rules – exact match to NonPref term
Starting point for a rulebase – Ready for fine-tuning
19. M.A.I. out of the box
Estimate 60% accuracy
Success depends on:
Style of thesaurus terms
Writing style of documents
Addition of synonyms
20. If only…
Document authors wrote
using the language of thesaurus terms,
then the starter rulebase would be sufficient…
but...
21. Editors make M.A.I. rules smarter
1. Modify the Text to Match
2. Modify the rule body
22. 1. Modify the Text to Match
Words with the same root
crystal ~ crystallize ~ crystalline ~
crystallization ~ crystal-forming
Text to match: crystal*
Words in inverted sequence
Power, Solar = Solar power
Text to match: solar
Phrases with same meaning, different syntax
Pollution control = Control of pollution
Text to match: pollution
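To make the truncation idea concrete, here is a minimal Python sketch of how a truncated Text to Match such as crystal* could be tested against words. It assumes that * stands for any run of word characters or hyphens and that matching ignores case; the slides do not specify M.A.I.'s exact matching behavior, and `ttm_to_regex` is a hypothetical helper.

```python
import re


def ttm_to_regex(ttm):
    """Translate a text-to-match containing * into a compiled regex.

    Assumption for this sketch: * stands for any run of word characters
    or hyphens, and matching ignores case.
    """
    pattern = re.escape(ttm).replace(r"\*", r"[\w-]*")
    return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)


crystal = ttm_to_regex("crystal*")
words = ["crystal", "crystallize", "crystalline",
         "crystallization", "crystal-forming", "chrysalis"]
matched = [w for w in words if crystal.fullmatch(w)]
print(matched)
```

One truncated entry covers all five variants from the slide, while an unrelated word such as "chrysalis" is left alone.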
23. 2. Modify the rule body
Starter rules (identity and synonym)
specify term to be used –
no ifs, ands or buts
You can
establish conditions or limits on the
suggestion of the indexing term(s)
direct M.A.I. to ignore a word or phrase in
text (NULL rule)
24. Two basic types of rules
1. Simple rules (starter rules)
no conditions to limit the use
of the indexing term
2. Condition rules
where rules get interesting!
25. Simple rules – how they work
The prompt word in the text suggests the
same indexing term every time that word
occurs
No IFs qualify the use of the indexing
term
Text to Match in the document
USE Indexing term
27. Simple rules – identity rule
Text to Match
is identical to
thesaurus term
in the
rule body --
No conditions
28. Simple rules – identity
Text to match: irrigation
USE Irrigation
Text to match: Lake Michigan
USE Lake Michigan
Text to match: marriage and divorce records
USE Marriage and divorce records
30. Simple rules – synonym rule
Show term equivalents (Use/Used
for)
Text to match: jobless USE Unemployment
Text to match: fish farm USE Aquaculture
Text to match: Y2K USE Y2K issue
Text to match: parish USE County
Text to match: e-business USE Ecommerce
31. Simple rules – synonym rule
Simplify morphological, punctuation,
spelling, and sequencing variations
Text to match: worker’s compensation
workman’s compensation
workmen’s compensation
work* comp*
USE Worker’s compensation
Text to match: e-commerce
USE Ecommerce
32. A synonym rule for the Text to Match “jobless”
suggests … USE Unemployment
When M.A.I. is integrated with
Thesaurus Master,
synonym rules for Non Preferred terms
are generated programmatically.
33. Simple rules – synonym rule
Separate out compound terms
Text to match: fishing USE Fishing and hunting
Text to match: hunting USE Fishing and hunting
Text to match: adoption USE Adoption and foster
care
Text to match: divorce USE Marriage and divorce
records
TIP: Trim TTM down to one core element
34. Simple rules – NULL
Ignore a thesaurus word that occurs
• as part of an irrelevant phrase
“physician’s orders”
• as part of an idiom
“in light of…”
“a bird in hand”
“looking back...”
Text to match: in light of
Rule: NULL
35. NULL rule –
Do not index with the thesaurus term
“Light” in this instance.
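The effect of a NULL rule can be sketched in Python as follows. This is an illustration only: it assumes that a longer Text to Match takes precedence over a shorter one and consumes the words it covers, which the slides imply but do not state. The `RULES` table and `suggest_terms` function are hypothetical.

```python
import re

# Hypothetical mini rulebase: text to match -> indexing term, or None for NULL.
RULES = {
    "light": "Light",
    "in light of": None,  # NULL rule: suggest nothing for this idiom
}


def suggest_terms(text, rules=RULES):
    """Suggest terms, letting longer text-to-match entries win.

    Assumption for this sketch: when a phrase rule matches, the words it
    covers are consumed, so shorter rules cannot fire on them.
    """
    remaining = text.lower()
    suggestions = []
    for ttm in sorted(rules, key=len, reverse=True):
        pattern = r"\b" + re.escape(ttm) + r"\b"
        if re.search(pattern, remaining):
            if rules[ttm] is not None:
                suggestions.append(rules[ttm])
            # Blank out the matched text so shorter rules skip it.
            remaining = re.sub(pattern, " ", remaining)
    return suggestions


print(suggest_terms("In light of the new budget, the plan changed."))  # []
print(suggest_terms("Plants need light to grow."))  # ['Light']
```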
36. Two basic types of rules
1. Simple rules (starter rules)
no conditions limit the use
of the indexing term
2. Condition rules
where rules get interesting!
39. Jay Leno’s headlines
Police Begin Campaign to Run Down
Jaywalkers
Local High School Dropouts Cut in Half
Red Tape Holds Up New Bridges
Include Your Children When Baking
Cookies
Kids Make Nutritious Snacks
Iraqi Head Loses Arm
40. How would you disambiguate…
What other words and/or conditions should lead to using each term?
• bush
Shrubs – OR
U.S. presidents
• balloon
Aerostatic aviation – OR
Party supplies
• will(s)
Jurisprudence, Last will and testament, Living wills
(auxiliary verb)
41. Example: routing
vehicles (direction)
work (workflow)
people, data, stuff (distribute, disperse)
the other team (overwhelming defeat)
wood (using power tool)
42. Example: Technology –
Need conditions?
Top term
Narrow terms: Engineering, Information technology, Medical technology, Technology transfer, Radio frequency identification technology
Scope note: The practical use of scientific knowledge in industry and everyday life; the scientific method and material used to achieve a commercial or industrial objective
Related terms: Technology assessment, Technology research
Set conditions on using term Technology?
“new fangled technology” “cooking technology”
“report from the Massachusetts Institute of Technology”
43. When the prompt word
is ambiguous
Could prompt word be interpreted differently?
Indian leaders are asking the government…
balloon
bush
bridge
adoption
Under what conditions would another
interpretation be correct?
44. Thinking conditionally –
let the IFs begin...
Convergent thinking
What other words in text would
confirm your interpretation of the
text-to-match meaning and your
proposed indexing term?
Divergent thinking
What words in text would contradict
your interpretation?
45. Condition rules – IF rules
For ambiguous word meanings, editor can set
IF conditions that must be met for rules to
suggest an indexing term.
Can incorporate conditions from Scope Notes
Editor can set one or more conditions, joined
with Boolean operators AND, OR, and NOT.
46. Example: Sniffer
BT Malicious code
SN A program that intercepts routed data and
examines each packet in search of specified
information, such as passwords transmitted in
clear text.
M.A.I. rule
TTM: sniffer
USE Sniffer
“Customs used a sniffer dog to identify
the contraband …”
47. In a botany taxonomy, “bushes” is a NonPref Term
that prompts the preferred term “Shrubs” --
even if the text is about (former) President Bush.
When a simple rule won’t do, set conditions in the
rule to increase precision Hits and decrease Noise.
49. 4 types of conditions
1. Proximity of rule’s TTM to quoted word
from document text
(4 levels of proximity)
2. Capitalization of TTM
3. Exact MATCH of TTM to word in text
4. TTM begins or ends a sentence
Mix and match conditions with
Boolean operators: AND, OR, NOT
50. Condition rules – Proximity
Text to match: safety
IF (NEAR “security”) WITHIN 3 WORDS
USE Crime prevention
ENDIF
IF (WITH “community”) WITHIN SENTENCE
USE Public safety
ENDIF
IF (AROUND “product”) WITHIN 50 WORDS
USE Product safety
ENDIF
IF (MENTIONS “food”) WITHIN 250 WORDS
USE Food handling and safety
ENDIF
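The four proximity levels can be sketched in Python as below. The window sizes follow the slide (NEAR within 3 words, WITH within the sentence, AROUND within 50 words, MENTIONS within 250 words); everything else is an assumption for illustration, including the crude sentence splitter, the distance measure, and the `proximity` function itself, which is not M.A.I.'s API.

```python
import re

# Window sizes as shown on the slide; WITH means "same sentence".
WORD_WINDOWS = {"NEAR": 3, "AROUND": 50, "MENTIONS": 250}


def tokenize(text):
    """Return (word, sentence_number) pairs; a crude splitter for illustration."""
    tokens = []
    for sent_no, sentence in enumerate(re.split(r"[.!?]+", text)):
        for word in re.findall(r"[\w'-]+", sentence.lower()):
            tokens.append((word, sent_no))
    return tokens


def proximity(operator, ttm, target, text):
    """True if `target` occurs close enough to `ttm` for `operator`."""
    tokens = tokenize(text)
    ttm_hits = [i for i, (w, _) in enumerate(tokens) if w == ttm]
    target_hits = [i for i, (w, _) in enumerate(tokens) if w == target]
    for i in ttm_hits:
        for j in target_hits:
            if operator == "WITH":
                # Same sentence, regardless of word distance.
                if tokens[i][1] == tokens[j][1]:
                    return True
            elif abs(i - j) <= WORD_WINDOWS[operator]:
                return True
    return False


text = ("The community meeting covered safety. "
        "A later session reviewed product recalls.")
print(proximity("WITH", "safety", "community", text))   # True
print(proximity("NEAR", "safety", "product", text))     # False
print(proximity("AROUND", "safety", "product", text))   # True
```

In this sample, "product" is five words from "safety", so the tight NEAR window rejects it while the wider AROUND window accepts it.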
51. Condition rules – Proximity
Text to match: bear
IF (NEAR “Chicago” OR WITH “football”)
USE Chicago Bears
ENDIF
IF (NEAR “market” OR AROUND “stock”)
USE Stock market
ENDIF
IF (MENTIONS “forest” OR MENTIONS “woods”)
USE Wild animals
ENDIF
52. Example: Documentation
Text to match: documentation
USE Documentation
Identity rule created problems
Add conditions for greater precision:
IF (AROUND “software” OR WITH “application”
OR AROUND “hardware” OR WITH “instruction”)
USE Documentation
ENDIF
53. Condition rules – Negation
Text to match: wages
IF (NOT WITH “war”)
USE Wages and salaries
ENDIF
• Text to match: web
IF (NOT WITH “spin*”)
USE Internet
ENDIF
(“spider” no longer differentiates internet from arachnids)
54. Condition rules – Case
Text to match: aids
IF (ALL CAPS)
USE AIDS and HIV
ENDIF
Text to match: masters
IF (INITIAL CAPS AND MENTIONS “poet*”)
USE Edgar Lee Masters
ENDIF
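A capitalization condition can be sketched in Python as follows. The `case_condition` helper is hypothetical, and the sketch assumes a condition holds if any occurrence of the word in the text has the required capitalization; how M.A.I. treats a word that is capitalized only because it starts a sentence is not covered by the slides.

```python
import re


def case_condition(condition, ttm, text):
    """Check ALL CAPS or INITIAL CAPS for occurrences of `ttm` in `text`.

    Assumption for this sketch: the condition holds if any occurrence of
    the word in the text has the required capitalization.
    """
    occurrences = re.findall(r"\b" + re.escape(ttm) + r"\b", text, re.IGNORECASE)
    if condition == "ALL CAPS":
        return any(word.isupper() for word in occurrences)
    if condition == "INITIAL CAPS":
        return any(word[0].isupper() and word[1:].islower() for word in occurrences)
    raise ValueError(f"unknown condition: {condition}")


print(case_condition("ALL CAPS", "aids", "New AIDS research was funded."))  # True
print(case_condition("ALL CAPS", "aids", "Hearing aids are covered."))      # False
print(case_condition("INITIAL CAPS", "masters", "Masters wrote poetry."))   # True
```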
55. Condition rules – Match
Text to match: employ*
IF (MATCH “employment”)
USE Employment
ENDIF
IF (MATCH “employee” AND
(WITH “municipal” OR WITH “city”
OR WITH “town”))
USE Municipal employees
ENDIF
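The MATCH condition narrows a truncated Text to Match back down to one exact word. A minimal Python sketch of that idea follows; `matched_words` and `match_condition` are hypothetical helpers, and case-insensitive comparison is an assumption.

```python
import re


def matched_words(ttm, text):
    """Return the lowercase words in `text` that a truncated TTM matches."""
    stem = re.escape(ttm.rstrip("*"))
    return [w.lower() for w in re.findall(r"\b" + stem + r"\w*", text, re.IGNORECASE)]


def match_condition(ttm, exact, text):
    """MATCH holds when the TTM matched exactly the word `exact`."""
    return exact.lower() in matched_words(ttm, text)


text = "The city hired a new employee after the employer withdrew."
print(matched_words("employ*", text))                  # ['employee', 'employer']
print(match_condition("employ*", "employee", text))    # True
print(match_condition("employ*", "employment", text))  # False
```

This lets one truncated entry (employ*) carry several branches, each tied to a specific word form.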
56. Condition rules – Sentence position
IF (BEGIN SENTENCE)
IF (END SENTENCE)
57. Conditions in rules help
increase precision Hits
decrease Noise
for more precise information retrieval.
Conditions depend on human logic.
58. M.A.I. can save illogical statements, which produce bad results.
M.A.I. cannot save a rule with incorrect syntax.
Rule Check and Save check the syntax of a rule.
Error warning – explains syntax problems
– shows line location
Example warning: closing parenthesis missing
59. Mind your IFs and ( )s – come in 2s
IF starts the system thinking about a condition;
ENDIF completes the thought.
Every IF condition goes in ( )s.
Every ( must close with ) -- multiple ( )s are OK.
Every IF condition must close with an ENDIF.
Every “ must close with ”.
Function words must be spelled correctly.
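The pairing rules above lend themselves to a simple checker. The Python sketch below is a hypothetical helper, not M.A.I.'s Rule Check: it only verifies that parentheses, quotation marks, and IF/ENDIF come in pairs, and it assumes straight double quotes and uppercase function words in the rule text.

```python
import re


def check_rule_syntax(rule_text):
    """Return (line_number, message) problems found in a rule body."""
    problems = []
    open_parens = []  # line numbers of "(" still waiting for ")"
    open_ifs = []     # line numbers of IFs still waiting for ENDIF
    for line_no, line in enumerate(rule_text.splitlines(), start=1):
        if line.count('"') % 2:
            problems.append((line_no, "quotation mark not closed"))
        bare = re.sub(r'"[^"]*"', "", line)  # ignore quoted words
        for token in re.findall(r"\(|\)|\bENDIF\b|\bIF\b", bare):
            if token == "(":
                open_parens.append(line_no)
            elif token == ")":
                if open_parens:
                    open_parens.pop()
                else:
                    problems.append((line_no, "closing parenthesis without opening"))
            elif token == "IF":
                open_ifs.append(line_no)
            elif open_ifs:  # token is ENDIF and an IF is open
                open_ifs.pop()
            else:
                problems.append((line_no, "ENDIF without IF"))
    problems += [(n, "closing parenthesis missing") for n in open_parens]
    problems += [(n, "ENDIF missing") for n in open_ifs]
    return problems


good_rule = 'IF (AROUND "afford*")\nUSE Affordable housing\nENDIF'
bad_rule = 'IF (AROUND "afford*"\nUSE Affordable housing\nENDIF'
print(check_rule_syntax(good_rule))  # []
print(check_rule_syntax(bad_rule))   # [(1, 'closing parenthesis missing')]
```

Tracking the line number of each opening token is what allows a warning to show the line location, as the error display described earlier does.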
60. Kicking rules up a notch
Rules can express
Multiple concepts
Alternative concepts
Contingent concepts
61. Condition rules – IF-IF
Independent conditions:
Text to match: housing
IF (AROUND “afford*”)
USE Affordable housing
ENDIF
IF (AROUND “public”)
USE Public housing
ENDIF
IS DIFFERENT FROM
Contingent conditions:
Text to match: housing
IF (AROUND “afford*”)
USE Affordable housing
IF (AROUND “public”)
USE Public housing
ENDIF
ENDIF
62. Condition rules – IF-IF
Contingent conditions:
Text to Match: agricultur*
IF (WITH “products”)
USE Agricultural products
IF (WITH “programs”)
USE Agricultural programs
ENDIF
ENDIF
Agricultural programs is available ONLY IF the Agricultural products condition is met.
Independent conditions:
Text to Match: agricultur*
IF (WITH “products”)
USE Agricultural products
ENDIF
IF (WITH “programs”)
USE Agricultural programs
ENDIF
BOTH terms may be used; they are independent.
63. Condition rules – IF-IF
Text to Match: agricultur*
IF (WITH “products”)
USE Agricultural products
IF (WITH “programs”)
USE Agricultural programs
ENDIF
ENDIF
Indentation emphasizes contingent condition
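The difference between contingent and independent conditions maps directly onto nested versus sibling `if` statements. The Python sketch below mirrors the two agricultur* rules; the `with_word` helper is a simplified, hypothetical stand-in for the WITH condition and assumes the input is a single sentence.

```python
def with_word(word, sentence):
    """Simplified WITH: the word appears in the (single) sentence given."""
    return word in sentence.lower().split()


def contingent_rule(sentence):
    """Nested IFs: 'programs' is only tested when 'products' matched."""
    terms = []
    if with_word("products", sentence):
        terms.append("Agricultural products")
        if with_word("programs", sentence):
            terms.append("Agricultural programs")
    return terms


def independent_rule(sentence):
    """Sibling IFs: each condition is tested on its own."""
    terms = []
    if with_word("products", sentence):
        terms.append("Agricultural products")
    if with_word("programs", sentence):
        terms.append("Agricultural programs")
    return terms


sentence = "New agricultural programs were announced"
print(contingent_rule(sentence))   # []
print(independent_rule(sentence))  # ['Agricultural programs']
```

With the contingent form, a sentence that mentions only "programs" produces no suggestion at all, which is easy to overlook when the ENDIFs are not indented.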
64. Condition rules – IF-ELSE 1
IF - ELSE provides further options in
rules, a default if the first condition is not
met.
ELSE may be used without a condition of its own.
Text to match: technology
IF (AROUND “transfer*”)
USE Technology transfer
ELSE
USE Technology
ENDIF
65. Condition rules – IF-ELSE 2
Text to match: norwegian
IF (AROUND “language” OR
WITH “speak*”)
USE Norwegian language
ELSE
USE Norway
ENDIF
66. Condition rules – IF-ELSE IF
IF - ELSE IF
or add extra conditions
Text to match: norwegian
IF (MENTIONS “language”)
USE Norwegian language
ELSE IF (MENTIONS “country”)
USE Norway
ENDIF
ENDIF
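The practical difference between IF-ELSE and IF-ELSE IF is whether the rule always suggests something. The Python sketch below illustrates this with the norwegian example; the `mentions` helper is a simplified, hypothetical stand-in for the rule conditions (it ignores proximity windows and truncation).

```python
def mentions(word, text):
    """Simplified condition: the word appears anywhere in the text."""
    return word in text.lower().split()


def if_else_rule(text):
    """IF-ELSE: the ELSE branch is a default, so a term is always suggested."""
    if mentions("language", text):
        return "Norwegian language"
    return "Norway"


def if_else_if_rule(text):
    """IF-ELSE IF: the second branch has its own condition,
    so the rule may suggest nothing."""
    if mentions("language", text):
        return "Norwegian language"
    elif mentions("country", text):
        return "Norway"
    return None


text = "The Norwegian salmon catch grew"
print(if_else_rule(text))     # Norway
print(if_else_if_rule(text))  # None
```

A default ELSE trades some Noise for fewer Misses; adding a condition to the second branch reverses that trade.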
67. You can...
Truncate a single word with *
e.g. agri*
Use * as a wild card between words,
e.g. drinking * driving
Truncate in the text to match and/or
in the rule body
68. And you can...
Include multiple conditions in a
rule, starting from a single text-to-match
Text to match: tax*
IF (WITH “business”) USE Business taxes
IF (WITH “income”) USE Income taxes
IF (WITH “sales”) USE Sales taxes
IF (AROUND “forms”) USE Tax forms
IF (AROUND “law*” OR AROUND “legis*”
OR AROUND “legal”) USE Tax laws
71. And you can...
Use multiple Boolean operators in rules
Embed clauses within clauses using Boolean
operators
Text to match: activit*
IF (WITH “extracurricular” OR (WITH “school” AND
(WITH “after” OR WITH “before” OR WITH “outside”)))
USE Extracurricular activities
ENDIF
Watch the ( )s!
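Why the parentheses matter can be shown by writing the same condition with the grouping moved. This Python sketch is illustrative only; it treats WITH as simple word presence in one sentence, and both functions are hypothetical.

```python
def words_of(sentence):
    """Lowercase word set for one sentence (simplified tokenizer)."""
    return set(sentence.lower().strip(".").split())


def as_written(sentence):
    """extracurricular OR (school AND (after OR before OR outside))"""
    w = words_of(sentence)
    return "extracurricular" in w or (
        "school" in w and ("after" in w or "before" in w or "outside" in w)
    )


def misplaced(sentence):
    """(extracurricular OR school) AND (after OR before OR outside)"""
    w = words_of(sentence)
    return ("extracurricular" in w or "school" in w) and (
        "after" in w or "before" in w or "outside" in w
    )


sentence = "Extracurricular clubs meet weekly."
print(as_written(sentence))  # True
print(misplaced(sentence))   # False
```

Both versions pass a syntax check, but only the first suggests Extracurricular activities for this sentence, so a misplaced parenthesis is a logic error that only a human editor can catch.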
72. M.A.I. in action
(105 ILCS 45/1-20)
Sec. 1-20. Enrollment. If the parents or guardians of a homeless
child or youth choose to enroll the child in a school other than the
school of origin, that school immediately shall enroll the homeless
child or youth even if the child or youth is unable to produce records
normally required for enrollment, such as previous academic records,
medical records, proof of residency, or other documentation. Nothing in
this subsection shall prohibit school districts from requiring parents
or guardians of a homeless child to submit an address or such other
contact information as the district may require from parents or
guardians of nonhomeless children. It shall be the duty of the
enrolling school to immediately contact the school last attended by the
child or youth to obtain relevant academic and other records. If the
child or youth must obtain immunizations, it shall be the duty of the
enrolling school to promptly refer the child or youth for those
immunizations.
(Source: P.A. 88-634, eff. 1-1-95; 88-686, eff. 1-24-95.)
73. Original identity rule for “Children and youth”
Modify rule for “Children and youth”
to Text to Match: child*
75. Reading M.A.I. results
Indexing term | Total matches | Document words matching TTM (count)
Children and youth | 15 | child* (9), youth (6)
Schools | 7 | school* (7)
Homeless people | 3 | homeless* (3)
Immunizations | 2 | immuniz* (2)
76. M.A.I. Statistics let you track performance
as you fine-tune the Knowledge Base.
M.A.I.’s Statistics Collector gathers and stores
indexing experience.
Statistics compare editor’s indexing results to
M.A.I.’s suggestions: Hits, Misses, Noise
Statistics prioritize the terms for which rules
need fine-tuning.
77. M.A.I. statistics
Hits
System suggests indexing terms that are
chosen by the editor--good!
Misses
System misses terms editor uses
Noise
System suggests terms not used by editor
Misses and Noise … need more rule-building
78. Open Misses to reveal thesaurus terms used
by an editor but not suggested by M.A.I.
Buddhism was used by editors for indexing
3 records, but was not suggested by M.A.I.
79. Open the key beside the term to see the list
of records where the term was used...
The file name, record number and editor’s
name are stored with each record.
80. Click to highlight any record line on the left.
The full record appears on the right, with
M.A.I.’s Suggested Terms and the editor’s
Used Terms.
81. In this record, M.A.I. interprets “devotion” and
suggests the indexing term “Prayer” -- Hit.
The editor used “Buddhism” though M.A.I.
did not suggest the term -- Miss.
M.A.I. suggested “Libraries” and “Religions”
though the terms were not used -- Noise.
For this record, M.A.I. scored
• 3 Hits -- Prayer, Sri Lanka, Religious beliefs
• 1 Miss -- Buddhism
• 2 Noise -- Religions, Libraries
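The Hits / Misses / Noise breakdown is a comparison of two sets of terms. The Python sketch below reproduces the scoring for this record using the terms listed above; the `score_record` function is a hypothetical illustration, not the Statistics Collector's implementation.

```python
def score_record(suggested, used):
    """Split terms into Hits, Misses, and Noise for one record."""
    return {
        "hits": suggested & used,    # suggested and chosen by the editor
        "misses": used - suggested,  # chosen by the editor, not suggested
        "noise": suggested - used,   # suggested, not chosen by the editor
    }


suggested = {"Prayer", "Sri Lanka", "Religious beliefs", "Religions", "Libraries"}
used = {"Prayer", "Sri Lanka", "Religious beliefs", "Buddhism"}

score = score_record(suggested, used)
for label in ("hits", "misses", "noise"):
    print(label, len(score[label]), sorted(score[label]))
```

Aggregating these per-term counts across many records is one way to see which rules most need fine-tuning.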
82. The word “Buddhism” does not appear
in the record, although “Buddha” does.
The editor’s use of the thesaurus term
Buddhism to index the record is
appropriate.
M.A.I.’s Knowledge Base can be fine-tuned
to reflect human knowledge and
interpretation of the text.
83. Search the Knowledge Base for rules for Buddha.
(Truncate buddh* to widen the search.)
Click Search,
results appear
84. Rules exist for “Buddhism” and “Buddhist”
but not for “Buddha,” which is in the text.
You can easily create a new rule …
Text to Match: Buddha
IF (MENTIONS “religion”)
USE Buddhism
ENDIF
If “buddha” and “religion” are both in the text,
M.A.I. suggests the indexing term Buddhism.
85. Enter a rule for Text to Match: buddha ...
Better yet: combine all 3 rules by using
Text to Match: buddh*
87. The new rule Text to Match: buddha prompts
Buddhism in Suggested Terms for indexing.
88. At any time, you can: modify a rule
check the rule
for syntax
save the rule
see the rule’s
history
add an editorial
note
find a word
clear the screen
delete the rule
89. Each rule in the Knowledge Base that the
editor fine-tunes increases M.A.I.’s ability to
• recognize synonyms
• find connections between non-contiguous words
• interpret idioms
• make sense of allusions
• “read between the lines”
Over time, statistics for Hits increase,
while Misses and Noise decrease.
91. When to make rules
Before processing documents
Proactive rule building provides head start
Increases hits from the start
After processing documents
Statistics report lets indexer see what rules
need fine-tuning to improve Hits, avoid Misses,
and decrease Noise, based on comparison of
M.A.I. suggestions with editor’s indexing
Rule-building is an on-going process
Frequency diminishes, results improve
92. Custom configure M.A.I.
How many term suggestions?
Limit use of a term to n documents?
How much text to scan?
Treat singular the same as plural?
Ignore stopwords?
Quote marks?
Most specific term only?
Suggest Candidates?
93. M.A.I. measurably improves indexing results:
• Consistency
same term suggested under same text conditions
• Indexing coverage
terms reflect full range of indexable concepts in data
• Indexing depth
terms reflect the granularity and precision of
deeper levels of thesaurus
• Faster throughput
nearly 7 times faster indexing
94. M.A.I. mines the full depth of your
thesaurus, suggesting the most specific
and appropriate indexing term.
M.A.I. can also filter indexing terms,
displaying more general Broad Terms,
while retaining the more precise indexing
terms stored with the document.
95. Pairing Machine Aided Indexer
with Thesaurus Master
as MAIstro provides
• simple thesaurus construction
and maintenance
• faster indexing
• deeper indexing
• greater concept coverage
• more consistent indexing
Efficiency and Economy
in document storage and retrieval