This was originally presented at BEA 2015. The presentation looks at the experiences of two publishers as they conducted machine indexing projects. It also shows the capabilities of machine indexing today.
2. Presenters
Moderator
• Pat Payton, Senior Manager Publisher Relations, Bowker
Speakers
• Randi Park, Publishing Officer, The World Bank
• Hassan Zaidi, Digital Publishing Officer, International Monetary Fund
• Jim Bryant, CEO, Trajectory Inc.
3. Terminology
• Automated or Machine Indexing
– Process of assigning index terms against a set
vocabulary or taxonomy without human intervention
– Full text or bibliographic records
– Multiple vocabularies/rule sets allow for complex text
analysis
• Optical Character Recognition (OCR)
– Machine conversion of an image to text
– PDF of book content
• Extensible Markup Language (XML)
– Set of rules for encoding documents
– Both machine readable and human readable
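To make the first definition concrete, here is a minimal sketch of assigning index terms against a set vocabulary without human intervention. The vocabulary, the trigger phrases and the function name `assign_index_terms` are illustrative assumptions, not taken from any presenter's system; production tools apply far richer rule sets.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: preferred term -> phrases that trigger it.
VOCABULARY = {
    "Poverty reduction": ["poverty reduction", "reduce poverty", "extreme poverty"],
    "Economic growth": ["economic growth"],
    "Anticorruption": ["anticorruption", "anti-corruption"],
}

def assign_index_terms(text, vocabulary=VOCABULARY, min_hits=1):
    """Return preferred terms whose trigger phrases occur in the text, most frequent first."""
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = Counter()
    for term, phrases in vocabulary.items():
        for phrase in phrases:
            # Word boundaries stop a phrase from matching inside a longer word.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            hits[term] += len(re.findall(pattern, normalized))
    return [term for term, count in hits.most_common() if count >= min_hits]

sample = ("Anti-corruption measures support economic growth. "
          "Sustained economic growth helps reduce poverty.")
print(assign_index_terms(sample))
```

Mapping several surface phrases to one preferred term is what keeps the assigned terms consistent across a large corpus.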
5. ABOUT THE WORLD BANK
• The World Bank Group is the world’s largest
source of funding and technical assistance for
developing countries.
• Through its five institutions, the Bank Group
partners with developing countries to reduce
poverty, increase economic growth, and
improve the quality of life.
• Comprised of 188 member countries with
offices in 120 countries around the world.
Our Twin Goals
End Extreme Poverty within a Generation &
Boost Shared Prosperity
6. Like other publishers in some respects but...
• Publishing arm of a larger institution, with institutional
imperatives
• Open access
o Dissemination trumps revenue
• Research is performed by in-house economists and experts in
other fields, by development practitioners working on the ground,
and by external contributors.
• Our publishing outputs are meant to enrich the development
debate, inform policies, and support the development goals of our
client countries.
We are a “Knowledge Bank”
The World Bank is the largest source of development knowledge
12. Metadata strategy
Primary Purpose
• Supports user-centered
discovery in WB electronic
products
• Semantic fields often exposed
and browseable
• Complemented by full text
search and filtering
• Book, chapter and article level
abstracts, topics, regions,
countries, keywords
• Books do not inherit chapter
semantics
Secondary Re-purpose
• Search and discovery services
• Aggregators
• Retail sales channels, both print
and electronic
13. Our experience with machine generated metadata
Set up
• Customized our enterprise system as much as was practical
Pros
• Reasonable solution when
there is a huge corpus
• Fast throughput
• Inexpensive to run after labor-
intensive set up
• PDF source for extraction of
topics, subtopics, countries,
regions, keywords
• XML output easily
transformed
Cons
• Set up effort/cost
• Inconsistent use of keyword
terms, depending on how
they were used in the text
anti-corruption/anticorruption
decision-making/decision making
policy-making/policy making
• Abstracts must be written by
humans
• False hits due to footnotes,
references, names, etc.
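The inconsistent keyword spellings listed under Cons (anti-corruption/anticorruption and so on) can be reconciled with a small normalization pass. The sketch below is one possible approach, not the World Bank's actual process; the `CANONICAL` table and the choice of preferred spellings are assumptions made for illustration.

```python
import re

# Hypothetical preferred spellings, keyed by a hyphen- and space-free form.
# A real pick-list would come from the publisher's own taxonomy.
CANONICAL = {
    "anticorruption": "anticorruption",
    "decisionmaking": "decision making",
    "policymaking": "policy making",
}

def variant_key(keyword):
    """Collapse case, hyphens and spaces so spelling variants share one key."""
    return re.sub(r"[\s\-]+", "", keyword.lower())

def normalize_keywords(keywords):
    """Map each keyword to its preferred form and drop duplicates, keeping order."""
    seen, result = set(), []
    for kw in keywords:
        preferred = CANONICAL.get(variant_key(kw), kw.strip().lower())
        if preferred not in seen:
            seen.add(preferred)
            result.append(preferred)
    return result

raw = ["anti-corruption", "Anticorruption", "decision-making",
       "decision making", "policy-making", "governance"]
print(normalize_keywords(raw))
# ['anticorruption', 'decision making', 'policy making', 'governance']
```

Keywords that are not in the table pass through lowercased, so the table only needs entries for terms known to vary.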
15. Present workflow – human generated
Pros
• Book and chapter level
including abstracts
• Able to manage keyword
vocabulary using pick-lists
with additions as needed
• More accurate, author
provides book level draft, EP
team does sense check
• New rules and terms can be
added any time with little set-
up
Cons
• Cost per book/chapter
• Capacity
• Inconsistencies between
legacy (edited machine-
generated) and newer content
to be addressed
• Single version of keywords
may not be ideal for all
channels (i.e., more keywords
for discovery services)
16. Future
• Interested in using technology to improve
discovery for direct users and in discovery
services
• Full text XML and ePub available for indexing
• Institutional need to implement new taxonomy
and full text search for over 200k documents
18. Introduction: IMF Publications
Objectives: Establish digital publishing program 2010-2011
• New IMF eLibrary
• Digital distribution
• Digital production
• New metadata management system
• Create metadata to a granular level (chapters and articles) ***
21. New Challenges – New Solutions
Manual vs. Machine
• Metadata quality
• Time factor
• Cost of labor comparison
Challenge: Cataloging to a granular level (keywords,
countries, topics and sub-topics)
22. New challenges – New solutions
Do the Math
IMF example:
• 12,000 titles containing 60,000 chapters/articles (assumes an
average of 5 per title),
• 15 minutes to catalog each chapter/article with keywords etc.,
• 60,000 × 15 minutes = 15,000 hours
• 15,000 hours / 40 hours per week = 375 weeks
• 375 weeks / 52 = about 7 years of work for one cataloger.
If you pay just $30 per hour to a cataloger, the overall cost would be
$450,000. Not to mention new content is being created daily.
Automation allows us to slash the time it takes to catalog our
content, saving us time and money.
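The arithmetic on this slide can be reproduced with a few lines of Python. The figures are the ones quoted above; the function and parameter names are illustrative only.

```python
def cataloging_effort(titles=12_000, items_per_title=5, minutes_per_item=15,
                      hours_per_week=40, weeks_per_year=52, hourly_rate=30):
    """Estimate manual cataloging effort from the slide's assumptions."""
    items = titles * items_per_title               # chapters/articles to catalog
    total_hours = items * minutes_per_item / 60    # minutes converted to hours
    weeks = total_hours / hours_per_week           # one full-time cataloger
    years = weeks / weeks_per_year
    cost = total_hours * hourly_rate               # US dollars
    return {"items": items, "hours": total_hours, "weeks": weeks,
            "years": years, "cost": cost}

effort = cataloging_effort()
print(effort)
```

With the default figures this gives 60,000 items, 15,000 hours, 375 weeks (a little over 7 years) and $450,000. Changing `minutes_per_item` or `hourly_rate` shows how sensitive the total is to each assumption.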
30. Simple Search - Type a word or phrase into the
search bar at the top of every page…
…or Advanced Search allows
multiple concepts and filters
31. Search within results to search
within publications using a single
word or phrase.
Select Content Type (Books and
Journals/Chapters and Articles),
Countries/Region, Topics,
Languages, or Date.
Type a word in the Starts with box
to go to the first title that begins
with the word.
Sort by Title, Date, Source or
Author.
Change the number of Items per
page.
Keywords
32. Read on screen
in HTML
Read on a
variety of
devices
Citation
tools
Click on a title from the results page to go to the publication
landing page.
37. • New IMF eLibrary was delivered in March 2011
• Digital distribution: Distribute IMF contents to 35 channels
in various digital formats
• Digital production: Have an established workflow to
generate XML based contents, ePubs, Mobi and PDF ebooks
• New metadata management system: MetaLogic is a fully
functioning metadata management system
• Create metadata to a granular level
38. Generating Metadata By Machine
BEA May 29, 2015 11:30 – 12:20
THIS INFORMATION IS PROVIDED IN CONFIDENCE AND MAY NOT BE DISCLOSED TO ANY
THIRD PARTY OR USED FOR ANY OTHER PURPOSE WITHOUT THE EXPRESS WRITTEN PERMISSION OF TRAJECTORY, INC.
39. Attributes/Entities that Characterize A Book
40. Sentiment: Analyzing the Words Within the Book
Each word is given a numeric value based on its subjective meaning.
“Positive” words range on a positive scale; “Negative” words range on a
negative scale.
Trajectory’s Analytics Engine uses these values to compute the book’s
sentiment curve across sentence, paragraph, chapter and entire book.
This sentiment “fingerprint” at an aggregate level yields a unique
picture of the book.
• “Outstanding” words (5): breathtaking, thrilled, superb
• “Wow” words (4): winning, stunning
• “Happy” words (3): praise, marvelous, impressive
• “Welcome” words (2): strengthen, rich, funky
• “Yes” words (1): validate, safe, adequate
• “No” words (-1): numb, provoke, pushy
• “Upset” words (-2): worthless, travesty, threaten
• “Terrible” words (-3): woeful, worsen, kill
• “Damned” words (-4): torture, fraud, (unmentionables)
• “Catastrophic” words (-5): hell, rape, (more unmentionables)
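As a rough illustration of word-value sentiment scoring, the sketch below sums word values per sentence and smooths them into a curve. It is not Trajectory's Analytics Engine: the lexicon is only a subset of the example words on the slide (as best they can be read), and the sentence splitting, the moving-average smoothing and all function names are assumptions.

```python
import re

# A subset of the slide's example words and values; a real lexicon is far larger.
LEXICON = {
    "breathtaking": 5, "thrilled": 5, "superb": 5,
    "winning": 4, "stunning": 4,
    "praise": 3, "marvelous": 3, "impressive": 3,
    "strengthen": 2, "rich": 2, "funky": 2,
    "validate": 1, "safe": 1, "adequate": 1,
    "numb": -1, "provoke": -1, "pushy": -1,
    "worthless": -2, "travesty": -2, "threaten": -2,
    "woeful": -3, "worsen": -3, "kill": -3,
    "torture": -4, "fraud": -4,
}

def sentence_scores(text):
    """Sum the lexicon values of the words in each sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [sum(LEXICON.get(word, 0) for word in re.findall(r"[a-z']+", s.lower()))
            for s in sentences]

def sentiment_curve(text, window=2):
    """Smooth sentence scores with a trailing moving average."""
    scores = sentence_scores(text)
    curve = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - window + 1): i + 1]
        curve.append(sum(chunk) / len(chunk))
    return curve

sample = ("The view was breathtaking and the hosts were superb. "
          "Then rumors of fraud began to worsen. "
          "In the end the town felt safe again.")
print(sentence_scores(sample))   # [10, -7, 1]
print(sentiment_curve(sample))   # [10.0, 1.5, -3.0]
```

The same per-sentence scores could be averaged over paragraphs or chapters to get the coarser levels the slide mentions.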
41. Sentiment: Analyzing the Words Within the Book
42. Sentiment: Analyzing the Words Within the Book
43. Trajectory Index
44. Keyword Analysis and Comparison
45. Keyword Translation into Local Languages
46. Recommendations
47. Thank You
BEA 2015 – BOOTH 1347
United States:
50 Doaks Lane
Marblehead, Massachusetts
01945 United States
info@trajectory.com
www.trajectory.com
China:
No. 3, 8 ChuangYe Road
Haidian District,
Beijing, China 100085
48. Q & A
Generating Metadata by Machine
BEA 2015
Friday, May 29, 11:30-12:20
Room 1E10