IMPACT final event – The Hague – 26 June 2012

Metadata extraction from title pages
Evaluation of the FEP pilot
at the German National Library

Christa Schöning-Walter


Outline

1. Institutional background
2. IMPACT test case
3. Strategic goals
4. Preliminary work
5. Results
6. Perspective

1                                                      2 | IMPACT event | June 26, 2012 |




The German National Library (DNB) – some facts and figures (I)

− Legal deposit:
  Collecting, cataloguing, archiving and making available to the general public all German and German-language publications, publications about Germany etc. from 1913
− Bibliographic services:
   − National Bibliography
   − Authority files
   − Bibliographic standards
− 2 sites: Leipzig, Frankfurt am Main


The German National Library (DNB) – some facts and figures (II)

− Collection size (January 2012): 27 million media units
− Daily input: 1,500 physical units (each with 2 copies)
− Since 2006:
  Collection mandate includes non-physical media (online publications)
   − DNBG = Law regarding the German National Library
   − PflAV = Legal Deposit Regulation
− Since 2009:
  Considerations on and implementation of automated cataloguing processes




Target of the IMPACT scenario

Opening questions (summer 2011):

− Can metadata extraction from title pages be done successfully by a rule engine for simply structured monographic publications?
− Is this useful for accelerating the cataloguing processes if no machine-readable metadata from other sources is available?

Test case: Theses

− 14,000 print units annually
− simple structure !?


Starting point

Since January 2012:
− Experimental application studies in collaboration with the University of Innsbruck
− Using the rule-based exploitation features of FEP (Functional Extension Parser)

What is FEP?
− Software platform for analysing the logical structure of documents
− Developed within IMPACT work package EE4 (Goal: enrichment of OCR output with structure information)




Strategic goals

In particular:
− Making descriptive cataloguing less time-consuming and literature processing of printed media faster by
   − Partial digitisation
   − Automated metadata extraction
   − Result transfer into the bibliographic record
   − Quality check and completion of cataloguing by the staff

Generally:
− Gaining experience in the area of automated metadata extraction / automated cataloguing


Conceptual design of the workflow

Example: http://d-nb.info/1017138931

[Workflow diagram. Labelled elements: Accession (printed media units); Bibliographic record; Service partner: Scan service (title page + ToC); Repository; OAI-Harvester; Data Provider; OCR output / Indexing; FEP results; Cataloguing; Quality check; Statistics; Stack]




The idea:
Taking bibliographic data over from metadata mining tools.


The Objective: Automated exploitation of descriptive bibliographic data

− Specification, implementation, evaluation and gradual improvement of
   − Appropriate structure types
   − Dictionaries (controlled vocabulary, indicating keywords, abbreviations etc)
   − Expert rules
   − Etc

Illustration: University of Innsbruck




Preliminary work (I)

− Specification of the bibliographic statements to be mined from the title page

     Attribute            Value
     Publication year     2010
     Language code        /1ger
                          /1eng
     Creator              <last name>,<first name>
     Title                <full title>:<additional title information>/<author statement>
     Size                 30 cm
                          21 cm
     Theses statement     <city name>, <corporate body name>, <type of publication>,<year of graduation>


Preliminary work (II)

− Going over some hundreds of title pages of theses (scans from 2009-2011 + documents from daily business)
− Exploring typical structural patterns / regularities etc, such as
   − Prefixes
   − Phrases
   − Notation
   − Position
  as input for the expert rules

Examples of indicating phrases to find out the creator:
     von
     von <Verfasser> vorgelegte Dissertation
     von Herrn/Frau:
     vorgelegt von(:)
     vorgelegt JJJJ von
     vorgelegt dem Fachbereich ... von
     Name:
     Name des Verfassers:
     Name der Verfasserin:
     verfasst von(:)
     eingereicht von
     ...
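
The indicating phrases for the creator can be turned into patterns fairly directly. The following is a minimal sketch in Python, not the FEP rule syntax: the regular expressions, the function names and the simple "last word is the last name" heuristic are all assumptions made for illustration. Matching is case-sensitive and anchored at the start of a line, which avoids some false hits but would also miss capitalised variants.

```python
import re

# Patterns derived from the indicating phrases on the slide. Longer, more
# specific phrases come first so that the bare "von" is only a fallback.
# The regexes are illustrative assumptions, not the rules used in the pilot.
CREATOR_PATTERNS = [
    r"Name\s+de[rs]\s+Verfasser(?:in|s)\s*:\s*(?P<name>.+)",
    r"von\s+(?P<name>.+?)\s+vorgelegte\s+Dissertation",
    r"vorgelegt\s+dem\s+Fachbereich\s+.+?\s+von\s+(?P<name>.+)",
    r"vorgelegt\s+\d{4}\s+von\s+(?P<name>.+)",
    r"(?:vorgelegt|verfasst|eingereicht)\s+von\s*:?\s*(?P<name>.+)",
    r"von\s+(?:Herrn|Frau)\s*:?\s*(?P<name>.+)",
    r"Name\s*:\s*(?P<name>.+)",
    r"von\s+(?P<name>.+)",
]


def to_catalogue_form(name):
    """Turn 'First Last' into '<last name>,<first name>' (naive heuristic)."""
    parts = name.strip(" .,:").split()
    if len(parts) < 2:
        return " ".join(parts)
    return f"{parts[-1]},{' '.join(parts[:-1])}"


def find_creator(lines):
    """Return the creator found on the first matching line, or None."""
    for line in lines:
        text = line.strip()
        for pattern in CREATOR_PATTERNS:
            match = re.match(pattern, text)
            if match:
                return to_catalogue_form(match.group("name"))
    return None


title_page = [
    "Zur Theorie der Beispiele",
    "Dissertation",
    "vorgelegt von Anna Müller",
]
print(find_creator(title_page))
```

Names with academic titles, name particles or several last names would need additional handling, for example a lookup in a name authority file.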




Preliminary work (III)

Choosing / preparing dictionaries for tagging, matching and mapping purposes:

− List of universities which have the right to award degrees (identifying the corporate bodies)
− Name Authority File subset (identifying personal names)
− List of academic degrees

Theses statement items (examples):
     …
     Berlin, ESCP Europe Wirtschaftshochschule
     Berlin, Freie Univ.
     Berlin, Humboldt-Univ.
     Berlin, Steinbeis-Hochsch.
     Berlin, Techn. Univ.
     Berlin, Univ. der Künste
     …

Academic degrees (examples):
     …
     M.A.     Master of Arts / Magister Artium
     M.Sc.    Master of Science
     M.Eng.   Master of Engineering
     LL.M.    Master of Laws / Legum Magister
     M.F.A.   Master of Fine Arts
     M.Mus.   Master of Music
     M.Ed.    Master of Education
     …


Preliminary work (IV)

− Setting up a sample of documents for evaluation purposes:
   − 1,000 theses from several universities
   − Publication year: 2010 – 2011
   − Different dimensions (A- and B-size)
   − Scans: 300 dpi, bitonal
   − Transfer format: PDF (in future: XML files)
− Ground truth determination:
   − Manual region tagging on image files (done in Vietnam with the Aletheia tool)
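
To illustrate how such dictionaries could be used for tagging and mapping, here is a small Python sketch. It is not how FEP stores or applies its dictionaries. The full institution names used as lookup keys, the publication type "Diss." and all function names are assumptions; only the normalised forms and the degree abbreviations are taken from the examples above.

```python
# Excerpt of a university dictionary: full name as it might appear on a
# title page (assumed keys) -> (city, normalised corporate body name).
UNIVERSITIES = {
    "Freie Universität Berlin": ("Berlin", "Freie Univ."),
    "Humboldt-Universität zu Berlin": ("Berlin", "Humboldt-Univ."),
    "Technische Universität Berlin": ("Berlin", "Techn. Univ."),
    "Universität der Künste Berlin": ("Berlin", "Univ. der Künste"),
}

# Excerpt of the degree dictionary from the slide.
DEGREES = {
    "M.A.": "Master of Arts / Magister Artium",
    "M.Sc.": "Master of Science",
    "M.Eng.": "Master of Engineering",
    "LL.M.": "Master of Laws / Legum Magister",
    "M.F.A.": "Master of Fine Arts",
    "M.Mus.": "Master of Music",
    "M.Ed.": "Master of Education",
}


def tag_universities(text):
    """Return (city, normalised name) for every dictionary entry found."""
    return [norm for full, norm in UNIVERSITIES.items() if full in text]


def tag_degrees(text):
    """Return the degree abbreviations that occur in the text."""
    return [abbr for abbr in DEGREES if abbr in text]


def theses_statement(text, publication_type, year):
    """Build '<city name>, <corporate body name>, <type>,<year>' or None."""
    hits = tag_universities(text)
    if len(hits) != 1:
        # Nothing found, or ambiguous: leave it to the cataloguing staff.
        return None
    city, body = hits[0]
    return f"{city}, {body}, {publication_type},{year}"


page = "Technische Universität Berlin\nFakultät II\nDissertation"
print(theses_statement(page, "Diss.", 2010))
```

Plain substring matching misses inflected or abbreviated forms of a name, which is one reason the dictionaries need ongoing maintenance.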




Document processing in brief

− Database: Storage of all available information (OCR output, automatically or manually produced annotations, dictionaries, facts etc)
− Input of expert rules
− Rule engine: Stepwise proceeding taking intermediary results into account

Illustration: University of Innsbruck


Results

Second test phase with a revised list of universities (June 2012):

[Results chart not preserved in this text version. Legend:]
(1) total conformity
(2) complete title + noise (just to be deleted by the staff)
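
The stepwise proceeding described under "Document processing in brief" can be pictured as a loop in which each rule may read the facts produced by earlier rules. The Python sketch below is only an illustration of that idea under simplifying assumptions; the rule functions, the fact names and the loop itself are hypothetical and do not reproduce the FEP rule engine.

```python
def rule_title(lines, facts):
    """Needs an intermediate result: the title is assumed to be
    everything above the line that marks the thesis type."""
    if "marker_line" not in facts:
        return None
    title = " ".join(lines[: facts["marker_line"]]).strip()
    return {"title": title} if title else None


def rule_marker(lines, facts):
    """Find the first line that contains the word 'Dissertation'."""
    for index, line in enumerate(lines):
        if "Dissertation" in line:
            return {"marker_line": index}
    return None


def rule_creator(lines, facts):
    """Find a line starting with the indicating phrase 'vorgelegt von'."""
    phrase = "vorgelegt von"
    for line in lines:
        if line.startswith(phrase):
            return {"creator": line[len(phrase):].strip(" :")}
    return None


def run_rules(lines, rules):
    """Apply rules repeatedly until no rule contributes a new fact."""
    facts = {}
    pending = list(rules)
    progress = True
    while pending and progress:
        progress = False
        for rule in list(pending):
            result = rule(lines, facts)
            if result:
                facts.update(result)
                pending.remove(rule)
                progress = True
    return facts


lines = ["Zur Theorie der Beispiele", "Dissertation", "vorgelegt von Anna Müller"]
facts = run_rules(lines, [rule_title, rule_marker, rule_creator])
print(facts)
```

Here `rule_title` is listed first but can only fire in the second pass, after `rule_marker` has stored its intermediate result.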




Forecast: Feasibility study

− Technical and organisational requirements:
  Operational aspects, technical workflow, interfaces etc
− Further functional enhancements needed:
   − Dictionary maintenance: Expanding controlled vocabulary, sorting out unsuitable items etc
   − Taking additional facts into account: Ground truth etc
   − Additional expert rules (?)
   − Additional functions: Language guesser, document size etc
   − Customising FEP (?)


New ideas

− Extraction of defined structures from the body of monographic publications, such as table of contents, abstracts, pure text (without any introductory remarks, footers, references etc)

Target:
− Improvement of the results of current automated subject cataloguing projects, such as
   − Thematic classification by machine learning techniques
   − Assignment of subject headings by text analysis techniques

Reducing the noise via preceding structure analysis processes
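
As an illustration of the "language guesser" listed among the additional functions, a very simple approach counts language-specific function words in the title-page text. The Python sketch below is an assumption about how such a function might start, not a description of a planned FEP feature; the word lists are tiny hand-picked samples, and the codes "ger" and "eng" follow the language code examples in the attribute table.

```python
import re

# Tiny, hand-picked function-word lists (illustrative assumption).
STOPWORDS = {
    "ger": {"der", "die", "das", "und", "von", "zur", "des", "im", "ein", "eine"},
    "eng": {"the", "and", "of", "in", "for", "to", "on", "with", "a", "an"},
}


def guess_language(text):
    """Return 'ger' or 'eng', or None if the text gives no clear signal."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        code: sum(word in stopwords for word in words)
        for code, stopwords in STOPWORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if best[1] == 0 or best[1] == runner_up[1]:
        # No hits or a tie: better to leave the field empty.
        return None
    return best[0]


print(guess_language("Untersuchungen zur Struktur der Metadaten und des Katalogs"))
print(guess_language("A study of the extraction of metadata"))
```

Returning None on a tie keeps uncertain cases visible for the quality check by the staff instead of guessing.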




Thank you for your attention.

Christa Schöning-Walter
Staff position 'Automated Cataloguing'
c.schoening@dnb.de

Sandra Hamm
Project leader
s.hamm@dnb.de

German National Library
Digital Services
Frankfurt am Main, Germany



Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 

IMPACT Final Event 26-06-2012 - Automated metadata extraction from title pages. Report on the FEP pilot at the German National Library by Christa Schöning-Walter (DNB)

IMPACT final event – The Hague – 26 June 2012
Metadata extraction from title pages
Evaluation of the FEP pilot at the German National Library
Christa Schöning-Walter

Outline
1. Institutional background
2. IMPACT test case
3. Strategic goals
4. Preliminary work
5. Results
6. Perspective

The German National Library (DNB) – some facts and figures (I)
− Legal deposit: Collecting, cataloguing, archiving and making available to the general public all German and German-language publications, publications about Germany etc from 1913
− Bibliographic services:
   − National Bibliography
   − Authority files
   − Bibliographic standards
− 2 sites: Leipzig, Frankfurt am Main

The German National Library (DNB) – some facts and figures (II)
− Collection size (January 2012): 27 million media units
− Daily input: 1,500 physical units (each with 2 copies)
− Since 2006: Collection mandate includes non-physical media (online publications)
   − DNBG = Law regarding the German National Library
   − PflAV = Legal Deposit Regulation
− Since 2009: Considerations on and implementation of automated cataloguing processes

Target of the IMPACT scenario
Opening questions (summer 2011):
− Can metadata extraction from title pages be done successfully by a rule engine in the case of simply structured monographic publications?
− Is this useful for accelerating the cataloguing processes if no machine-readable metadata from other sources is available?
Test case: Theses
− 14,000 print units annually
− simple structure !?

Starting point
Since January 2012:
− Experimental application studies in collaboration with the University of Innsbruck
− Using the rule-based exploitation features of FEP (Functional Extension Parser)
What is FEP?
− Software platform for the purpose of analysing the logical structure of documents
− Developed within IMPACT work package EE4 (Goal: enrichment of OCR output with structure information)

Strategic goals
In particular:
− Making descriptive cataloguing less time-consuming and processing of printed media faster by
   − Partial digitisation
   − Automated metadata extraction
   − Result transfer into the bibliographic record
   − Quality check and completion of cataloguing by the staff
Generally:
− Gaining experience in the area of automated metadata extraction / automated cataloguing

Conceptual design of the workflow
Example: http://d-nb.info/1017138931
[Workflow diagram; labels as extracted: Accession (printed media units), Repository, FEP results, literature, OAI-Harvester, Cataloguing, Bibliographic record, Quality check, Data Provider, Statistics, Service partner, Scan service (title page + ToC), OCR output / Indexing, Stack]

The Objective: Automated exploitation of descriptive bibliographic data
− Specification, implementation, evaluation and gradual improvement of
   − Appropriate structure types
   − Dictionaries (controlled vocabulary, indicating keywords, abbreviations etc)
   − Expert rules
   − Etc
The idea: Taking bibliographic data over from metadata mining tools.
Illustration: University of Innsbruck

Preliminary work (I)
− Specification of the bibliographic statements to be mined from the title page

  Attribute          Value
  Publication year   2010
  Language code      /1ger
                     /1eng
  Creator            <last name>,<first name>
  Title              <full title>:<additional title information>/<author statement>
  Size               30 cm
                     21 cm
  Theses statement   <city name>, <corporate body name>, <type of publication>,<year of graduation>

Preliminary work (II)
− Going over some hundreds of title pages of theses (scans from 2009-2011 + documents from daily business)
− Exploring typical structural patterns / regularities etc, such as
   − Prefixes
   − Phrases
   − Notation
   − Position
  Result: Expert rules
Examples of indicating phrases to find out the creator:
   von
   von <Verfasser> vorgelegte Dissertation
   von Herrn/Frau:
   vorgelegt von(:)
   vorgelegt JJJJ von
   vorgelegt dem Fachbereich ... von
   Name:
   Name des Verfassers:
   Name der Verfasserin:
   verfasst von(:)
   eingereicht von
   ...

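The indicating phrases listed above lend themselves to a simple pattern match. The following Python sketch is an illustration only, not the FEP rule engine: the function names `find_creator` and `to_catalogue_form`, the selection of phrases, the priority ordering and the sample title page are all assumptions. It assumes the creator's name is the text that follows the phrase, on the same line or on the next one.

```python
import re

# A subset of the indicating phrases from the slide, ordered from most to
# least specific (assumed priority), so the bare "von" is tried last.
INDICATING_PHRASES = [
    "Name der Verfasserin:",
    "Name des Verfassers:",
    "vorgelegt von:",
    "vorgelegt von",
    "verfasst von:",
    "verfasst von",
    "eingereicht von",
    "von Herrn/Frau:",
    "Name:",
    "von",
]


def to_catalogue_form(name):
    """Turn 'First Last' into '<last name>,<first name>' (naive split)."""
    parts = name.rsplit(None, 1)
    if len(parts) < 2:
        return name
    return parts[1] + "," + parts[0]


def find_creator(title_page_text):
    """Return the creator in catalogue form, or None if no phrase matches."""
    for phrase in INDICATING_PHRASES:
        # The phrase must start the text or follow whitespace; the name is
        # whatever follows it up to the end of that line.
        pattern = re.compile(
            r"(?:^|\s)" + re.escape(phrase) + r"\s+([^\n]+)",
            re.IGNORECASE,
        )
        match = pattern.search(title_page_text)
        if match:
            return to_catalogue_form(match.group(1).strip())
    return None


# Invented sample title page, for illustration only.
sample = (
    "Die Rolle von Metadaten in Bibliotheken\n"
    "Dissertation\n"
    "vorgelegt von\n"
    "Anna Muster\n"
    "Berlin 2010"
)
print(find_creator(sample))
```

Trying the specific phrases before the bare "von" keeps a "von" inside the title from being mistaken for the author statement. Name particles, academic titles and multi-line names are not handled here and would need further rules or a name dictionary.
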
Preliminary work (III)
Choosing / preparing dictionaries for tagging, matching and mapping purposes:
− List of universities entitled to award degrees (identifying the corporate bodies)
− Name Authority File subset (identifying personal names)
− List of academic degrees
Theses statement items (examples):
   …
   Berlin, ESCP Europe Wirtschaftshochschule
   Berlin, Freie Univ.
   Berlin, Humboldt-Univ.
   Berlin, Steinbeis-Hochsch.
   Berlin, Techn. Univ.
   Berlin, Univ. der Künste
   …
Academic degrees (examples):
   …
   M.A.     Master of Arts / Magister Artium
   M.Sc.    Master of Science
   M.Eng.   Master of Engineering
   LL.M.    Master of Laws / Legum Magister
   M.F.A.   Master of Fine Arts
   M.Mus.   Master of Music
   M.Ed.    Master of Education
   …

Preliminary work (IV)
− Setting up a sample of documents for evaluation purposes:
   − 1,000 theses from several universities
   − Publication year: 2010 – 2011
   − Different dimensions (A- and B-size)
   − Scans: 300 dpi, bitonal
   − Transfer format: PDF (in future: XML files)
− Ground truth determination:
   − Manual region tagging on image files (done in Vietnam with the Aletheia tool)

Document processing in brief
− Database: Storage of all available information (OCR output, automatically or manually produced annotations, dictionaries, facts etc)
− Input of expert rules
− Rule engine: Stepwise proceeding taking intermediary results into account
Illustration: University of Innsbruck

Results
Second test phase with a revised list of universities (June 2012):
[Results chart not reproduced; categories shown:]
(1) total conformity
(2) complete title + noise (just to be deleted by the staff)

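The two result categories named on the Results slide can be illustrated by comparing an extracted title with its ground-truth counterpart. The Python sketch below is an assumption about how such a comparison might look; the slides do not describe the actual evaluation procedure. The function names, the whitespace normalisation and the third category "other" are hypothetical.

```python
from collections import Counter


def _normalise(text):
    """Collapse runs of whitespace so layout differences do not count."""
    return " ".join(text.split())


def classify_title(extracted, ground_truth):
    """Assign one of the result categories to an extracted title."""
    ext = _normalise(extracted)
    truth = _normalise(ground_truth)
    if ext == truth:
        return "total conformity"
    if truth and truth in ext:
        # The full title is present; the surplus text is noise that the
        # cataloguing staff would only have to delete.
        return "complete title + noise"
    return "other"  # hypothetical catch-all, not a category from the slide


def summarise(pairs):
    """Count categories over (extracted, ground_truth) pairs."""
    return Counter(classify_title(e, t) for e, t in pairs)


# Invented pairs, for illustration only.
pairs = [
    ("Ein  Titel", "Ein Titel"),
    ("Dissertation Ein Titel", "Ein Titel"),
    ("Ein", "Ein Titel"),
]
print(summarise(pairs))
```

A substring test is a deliberately crude stand-in for "complete title + noise"; OCR errors inside the title would push a record into the catch-all category.
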
Forecast: Feasibility study
− Technical and organisational requirements: Operational aspects, technical workflow, interfaces etc
− Further functional enhancements needed:
   − Dictionary maintenance: Expanding controlled vocabulary, sorting out unsuitable items etc
   − Taking additional facts into account: Ground truth etc
   − Additional expert rules (?)
   − Additional functions: Language guesser, document size etc
   − Customising FEP (?)

New ideas
− Extraction of defined structures from the body of monographic publications, such as table of contents, abstracts, pure text (without any introductory remarks, footers, references etc)
Target:
− Improvement of the results of current automated subject cataloguing projects, such as
   − Thematic classification by machine learning techniques
   − Obtaining subject headings by text analysis techniques
Reducing the noise via preceding structure analysis processes

Thank you for your attention.
Christa Schöning-Walter, Staff position 'Automated Cataloguing', c.schoening@dnb.de
Sandra Hamm, Project leader, s.hamm@dnb.de
German National Library, Digital Services
Frankfurt am Main, Germany