BEA 2015 Generating Metadata by Machine

This was originally presented at BEA 2015. The presentation looks at the experiences of two publishers as they conducted machine indexing projects, and it shows the capabilities of machine indexing today.

  1. Generating Metadata by Machine, BEA 2015, Friday, May 29, 11:30-12:20, Room 1E10
  2. Presenters
     Moderator
     • Pat Payton, Senior Manager Publisher Relations, Bowker
     Speakers
     • Randi Park, Publishing Officer, The World Bank
     • Hassan Zaidi, Digital Publishing Officer, International Monetary Fund
     • Jim Bryant, CEO, Trajectory Inc.
  3. Terminology
     • Automated or Machine Indexing
       – Process of assigning index terms against a set vocabulary or taxonomy without human intervention
       – Full text or bibliographic records
       – Multiple vocabularies/rule sets allow for complex text analysis
     • Optical Character Recognition (OCR)
       – Machine conversion of an image to text
       – PDF of book content
     • Extensible Markup Language (XML)
       – Set of rules for encoding documents
       – Both machine readable and human readable
  4. Experience with semantic metadata creation
     Randi Park, Rpark@worldbankgroup.org, World Bank Publications
  5. About the World Bank
     • The World Bank Group is the world’s largest source of funding and technical assistance for developing countries.
     • Through its five institutions, the Bank Group partners with developing countries to reduce poverty, increase economic growth, and improve the quality of life.
     • Composed of 188 member countries, with offices in 120 countries around the world.
     Our Twin Goals: End Extreme Poverty within a Generation & Boost Shared Prosperity
  6. Like other publishers in some respects but...
     • Publishing arm of a larger institution, with institutional imperatives
     • Open access
       – Dissemination trumps revenue
     • Research is performed by in-house economists and experts in other fields, by development practitioners working on the ground, and by external contributors.
     • Our publishing outputs are meant to enrich the development debate, inform policies, and support the development goals of our client countries.
     We are a “Knowledge Bank”: the World Bank is the largest source of development knowledge
  7. Popular Annuals and Flagships
  8. Two platforms: The World Bank eLibrary and the Open Knowledge Repository (OKR)
  9. Mobile applications
  10. Topics we cover = 29
      • Plus 5 Regions, Countries and Keywords
  11. Metadata strategy
      Primary Purpose
      • Supports user-centered discovery in WB electronic products
      • Semantic fields often exposed and browseable
      • Complemented by full text search and filtering
      • Book, chapter and article level abstracts, topics, regions, countries, keywords
      • Books do not inherit chapter semantics
      Secondary Re-purpose
      • Search and discovery services
      • Aggregators
      • Retail sales channels, both print and electronic
  12. Our experience with machine generated metadata
      Set up
      • Customized our enterprise system as much as was practical
      Pros
      • Reasonable solution when there is a huge corpus
      • Fast throughput
      • Inexpensive to run after labor-intensive set up
      • PDF source for extraction of topics, subtopics, countries, regions, keywords
      • XML output easily transformed
      Cons
      • Set up effort/cost
      • Inconsistent use of keyword terms, depending on how they were used in the text:
        – anti-corruption/anticorruption
        – decision-making/decision making
        – policy-making/policy making
      • Abstracts must be written by humans
      • False hits due to footnotes, references, names, etc.
  13. Present workflow – human generated
      Pros
      • Book and chapter level including abstracts
      • Able to manage keyword vocabulary using pick-lists with additions as needed
      • More accurate: author provides book level draft, EP team does sense check
      • New rules and terms can be added any time with little set-up
      Cons
      • Cost per book/chapter
      • Capacity
      • Inconsistencies between legacy (edited machine-generated) and newer content to be addressed
      • Single version of keywords may not be ideal for all channels (i.e., more keywords for discovery services)
  14. Future
      • Interested in using technology to improve discovery for direct users and in discovery services
      • Full text XML and ePub available for indexing
      • Institutional need to implement new taxonomy and full text search for over 200k documents
  15. Randi Park, Rpark@worldbankgroup.org, World Bank Publications
  16. Introduction: IMF Publications
      Objectives: Establish digital publishing program 2010-2011
      • New IMF eLibrary
      • Digital distribution
      • Digital production
      • New metadata management system
      • Create metadata to a granular level (chapters and articles)
  17. Digitization and Metadata Challenges 2010-2011
  18. Digitization and Metadata Challenges: 2010-2011
  19. New Challenges – New Solutions
      Manual vs. Machine
      • Metadata quality
      • Time factor
      • Cost of labor comparison
      Challenge: Cataloging to a granular level (keywords, countries, topics and sub-topics)
  20. New challenges – New solutions: Do the Math
      IMF example:
      • 12,000 titles containing 60,000 chapters/articles (assumes an average of 5 per title)
      • 15 minutes to catalog each chapter/article with keywords etc.
      • 60,000 × 15 minutes = 15,000 hours
      • 15,000 hours / 40 hours per week = 375 weeks
      • 375 weeks / 52 weeks per year = about 7 years of work for one cataloger
      If you pay just $30 per hour to a cataloger, the overall cost would be $450,000, and new content is being created daily. Automation allows us to slash the time it takes to catalog our content, saving us time and money.
  21. Machine in Action
  22. Machine in Action
  23. Machine in Action
  24. Results on eLibrary: Super keywords or specific subjects
  25. Browsing the IMF eLibrary
  26. Browse by Topics
  27. Simple Search: type a word or phrase into the search bar at the top of every page, or use Advanced Search, which allows multiple concepts and filters
  28. • Search within results to search within publications using a single word or phrase.
      • Select Content Type (Books and Journals/Chapters and Articles), Countries/Region, Topics, Languages, or Date.
      • Type a word in the Starts with box to go to the first title that begins with the word.
      • Sort by Title, Date, Source or Author.
      • Change the number of Items per page.
      • Keywords
  29. • Click on a title from the results page to go to the publication landing page.
      • Read on screen in HTML
      • Read on a variety of devices
      • Citation tools
  30. Related documents
  31. Related documents
  32. • New IMF eLibrary was delivered in March 2011
      • Digital distribution: distribute IMF content to 35 channels in various digital formats
      • Digital production: have an established workflow to generate XML-based content, ePubs, Mobi and PDF ebooks
      • New metadata management system: MetaLogic is a fully functioning metadata management system
      • Create metadata to a granular level
  33. THIS INFORMATION IS PROVIDED IN CONFIDENCE AND MAY NOT BE DISCLOSED TO ANY THIRD PARTY OR USED FOR ANY OTHER PURPOSE WITHOUT THE EXPRESS WRITTEN PERMISSION OF TRAJECTORY, INC.
      Generating Metadata By Machine, BEA, May 29, 2015, 11:30 – 12:20
  34. Attributes/Entities that Characterize a Book
  35. Sentiment: Analyzing the Words Within the Book
      • “Outstanding” words (5): breathtaking, thrilled, superb
      • “Wow” words (4): winning, stunning
      • “Happy” words (3): praise, marvelous, impressive
      • “Welcome” words (2): strengthen, rich, funky
      • “Yes” words (1): validate, safe, adequate
      • “No” words (-1): numb, provoke, pushy
      • “Upset” words (-2): worthless, travesty, threaten
      • “Terrible” words (-3): woeful, worsen, kill
      • “Damned” words (-4): torture, fraud, (unmentionables)
      • “Catastrophic” words (-5): hell, rape, (more unmentionables)
      Each word is given a numeric value based on its subjective meaning. “Positive” words range on a positive scale; “Negative” words range on a negative scale. Trajectory’s Analytics Engine uses these values to compute the book’s sentiment curve across sentence, paragraph, chapter and entire book. This sentiment “fingerprint” at an aggregate level yields a unique picture of the book.
  36. Sentiment: Analyzing the Words Within the Book
  37. Sentiment: Analyzing the Words Within the Book
  38. Trajectory Index
  39. Keyword Analysis and Comparison
  40. Keyword Translation into Local Languages
  41. Recommendations
  42. Thank You
      2015 BEA – Booth 1347
      United States: 50 Doaks Lane, Marblehead, Massachusetts 01945, United States
      info@trajectory.com
      www.trajectory.com
      China: No. 3, 8 ChuangYe Road, Haidan District, Beijing, China 100085
  43. Q & A
      Generating Metadata by Machine, BEA 2015, Friday, May 29, 11:30-12:20, Room 1E10
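
Slide 3 defines machine indexing as assigning index terms against a set vocabulary without human intervention. As a rough illustration only, the sketch below matches text against a tiny controlled vocabulary. The vocabulary entries, the function name `assign_index_terms`, and the `min_hits` threshold are all assumptions made for this example; the presenters' actual systems are not described at this level of detail.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> surface forms to match.
VOCABULARY = {
    "Anticorruption": ["anticorruption", "anti-corruption"],
    "Poverty reduction": ["poverty reduction", "reducing poverty"],
    "Economic growth": ["economic growth"],
}


def assign_index_terms(text, vocabulary=VOCABULARY, min_hits=1):
    """Return preferred terms whose surface forms occur at least min_hits times."""
    lowered = text.lower()
    counts = {}
    for preferred, forms in vocabulary.items():
        hits = 0
        for form in forms:
            # Whole-word match so "growth" inside another word is not counted.
            pattern = r"\b" + re.escape(form) + r"\b"
            hits += len(re.findall(pattern, lowered))
        if hits >= min_hits:
            counts[preferred] = hits
    # Most frequent terms first; ties broken alphabetically.
    return sorted(counts, key=lambda term: (-counts[term], term))


sample = "Anti-corruption efforts support economic growth. Anticorruption reforms matter."
print(assign_index_terms(sample))
```

Raising `min_hits` is one simple way to suppress terms that appear only in passing, such as a single mention in a footnote.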
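
Slide 12 lists inconsistent keyword spellings (anti-corruption/anticorruption, decision-making/decision making) as a drawback of machine-generated metadata. One possible clean-up step, sketched here under the assumption that variants differ only in case, hyphens, and spaces, is to group keywords by a collapsed key and flag the groups with more than one spelling. The helper names `variant_key` and `group_variants` are hypothetical.

```python
import re
from collections import defaultdict


def variant_key(keyword):
    """Collapse case, hyphens, and whitespace so spelling variants share one key."""
    return re.sub(r"[\s\-]+", "", keyword.lower())


def group_variants(keywords):
    """Return only the keys that were spelled in more than one way."""
    groups = defaultdict(set)
    for keyword in keywords:
        groups[variant_key(keyword)].add(keyword)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


extracted = ["anti-corruption", "anticorruption", "decision-making",
             "decision making", "poverty"]
for key, forms in group_variants(extracted).items():
    print(key, "->", forms)
```

A human editor would still need to choose the preferred form for each group, which matches the pick-list approach described on slide 13.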
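
The "Do the Math" estimate on slide 20 can be reproduced step by step. The sketch below uses only the figures given on the slide; the function name and parameter names are assumptions for illustration.

```python
def cataloging_estimate(titles=12_000, items_per_title=5, minutes_per_item=15,
                        hours_per_week=40, weeks_per_year=52, hourly_rate=30):
    """Estimate manual cataloging effort and cost from the slide's inputs."""
    items = titles * items_per_title          # chapters/articles to catalog
    hours = items * minutes_per_item / 60     # total cataloging hours
    weeks = hours / hours_per_week            # full-time weeks for one cataloger
    years = weeks / weeks_per_year
    cost = hours * hourly_rate                # labor cost in dollars
    return {"items": items, "hours": hours, "weeks": weeks,
            "years": years, "cost": cost}


estimate = cataloging_estimate()
print(f"{estimate['items']:,} items, {estimate['hours']:,.0f} hours, "
      f"{estimate['weeks']:.0f} weeks, {estimate['years']:.1f} years, "
      f"${estimate['cost']:,.0f}")
```

The exact result is a little over 7.2 years, which the slide rounds to 7.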
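
Slide 35 describes scoring each word on a scale from -5 to 5 and computing a sentiment curve across sentences, paragraphs, chapters, and the whole book. How Trajectory's Analytics Engine actually aggregates the values is not stated, so the sketch below makes its own assumptions: it sums word values per sentence and takes a plain average for the aggregate. The lexicon is a small subset of the words shown on the slide, and all function names are hypothetical.

```python
import re

# Subset of the word values shown on slide 35.
LEXICON = {
    "breathtaking": 5, "thrilled": 5, "superb": 5,
    "winning": 4, "stunning": 4,
    "praise": 3, "marvelous": 3, "impressive": 3,
    "strengthen": 2, "rich": 2, "funky": 2,
    "validate": 1, "safe": 1, "adequate": 1,
    "numb": -1, "provoke": -1, "pushy": -1,
    "worthless": -2, "travesty": -2, "threaten": -2,
    "woeful": -3, "worsen": -3, "kill": -3,
    "torture": -4, "fraud": -4,
}


def score_sentence(sentence):
    """Sum the lexicon values of the words in one sentence (assumed aggregation)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def sentiment_curve(text):
    """Return one score per sentence, in reading order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [score_sentence(s) for s in sentences]


def aggregate_sentiment(scores):
    """Average the sentence scores; an empty text scores 0."""
    return sum(scores) / len(scores) if scores else 0.0


passage = "The view was breathtaking and the ending superb. The fraud would only worsen."
curve = sentiment_curve(passage)
print(curve, aggregate_sentiment(curve))
```

The same per-sentence scores could be rolled up by paragraph or chapter to approximate the multi-level curve the slide mentions.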
