Exploring Challenges in Mining Historical Text
Beatrice Alex, Claire Grover, Richard Tobin and Ewan Klein

      Working with text: Tools, techniques and approaches for text mining
                           Edinburgh - 07/07/2012
Overview
‣ Project
‣ Data
‣ Preprocessing historical text
      ‣ Improvements to OCR
      ‣ Language identification
      ‣ Text mining tables
‣    Text-mining
      ‣ Improved commodity identification
      ‣ Ports-based geo-grounding
      ‣ Relation extraction


Project (01/2012-12/2014)
‣ Funded by Digging into Data (round 2)
‣ Partners
                    Ewan Klein, Claire Grover, Bea Alex (text mining)


                    Colin Coates, Jim Clifford (historical analysis)


                    James Reid (data integration)


                    Aaron Quigley, Uta Hinrichs (information
                    visualisation)
Trading Consequences
‣ What does archival text say about the
     economic and environmental consequences of
     global commodity trading during the
     nineteenth century?
‣    Help historians to discover novel patterns and
     explore new hypotheses.
‣    Example questions:
      ‣ What were the routes and volumes of international
            trade in resource commodities 1850-1914?
      ‣     What were the local environmental consequences of
            this demand for these resources?

Geolocating Cinchona




Trading Consequences
‣ Scope: global but with focus on Canadian
     natural resource flows to test reliability and
     efficacy of our methods
‣    Methods:
      ‣ Text mining and geo-parsing to transform the text
            into structured data, e.g. relational database
      ‣     Query interface targeted at historians
      ‣     Information visualisation for interactive exploration




Historical Data
‣ Digitised sources from the 19th century
   British Empire, currently processing
    ‣ Early Canadiana Online: 83,038 files
    ‣ JSTOR data: 1,000 XML files
    ‣ House of Commons Parliamentary Papers: 4,135
          files
    ‣     Books: selected books on nineteenth century trade

‣ Further sources:
    ‣ ProQuest data
    ‣ Encyclopaedia Britannica, JSTOR Plants, Forestry
          Journals?, The Botanist?

Processing Historical Data
‣ Challenges so far:
    ‣ Different formats
    ‣ Low-quality OCRed text
      ‣ Old/low-quality prints, quality of OCR
             technology
         ‣ Historical English: historical word variants,
             ſ (long s) characters mixed up with f by OCR
         ‣ Artefacts in original documents: headers/footers,
             page numbers, notes in margins, end-of-line
             hyphenation
    ‣     Text in different languages
    ‣     Information in tables


Improvements to OCR
‣ Normalisation and post-correction
‣ Fixed end-of-line hyphenation
    ‣ Dehyphenate all token-splitting hyphens using a
          dictionary-based approach (dictionary is the system
          dictionary + the text of the current document)
‣ Added f-to-s conversion
    ‣ Convert all false f characters to s using a
          corpus-based approach (corpus is a collection of historical
          documents from Project Gutenberg)
‣ Example: reduced number of words
  unrecognised by spell checker from 61 to 21
  (approx. 66% reduction)
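
A minimal sketch of the two post-correction steps, under stated assumptions. The function names, the toy dictionary and the toy frequency counts are hypothetical and are not taken from the project's tools. Picking the most frequent f/s variant is one possible reading of "corpus-based", not necessarily the method used.

```python
import re
from collections import Counter
from itertools import product


def dehyphenate(text, dictionary):
    """Join words split by an end-of-line hyphen when the joined form is known.

    The lookup set is the given dictionary plus every word of the text itself.
    The line break at the hyphen is dropped in both cases.
    """
    known = {w.lower() for w in dictionary}
    known.update(w.lower() for w in re.findall(r"[A-Za-z]+", text))

    def join(match):
        left, right = match.group(1), match.group(2)
        if (left + right).lower() in known:
            return left + right          # token-splitting hyphen: remove it
        return left + "-" + right        # genuine hyphen: keep it

    return re.sub(r"([A-Za-z]+)-\n\s*([A-Za-z]+)", join, text)


def fix_long_s(token, freq):
    """Replace false f characters by s, guided by corpus frequencies.

    freq must be a collections.Counter of lower-cased corpus words.
    Tokens already attested in the corpus are left unchanged.
    """
    if "f" not in token or freq[token] > 0:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    best, best_count = token, 0
    # Try every f/s combination and keep the most frequent attested variant.
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if freq[candidate] > best_count:
            best, best_count = candidate, freq[candidate]
    return best


toy_freq = Counter({"colonists": 3, "success": 5, "first": 9})  # hypothetical counts
print(dehyphenate("raw cot-\nton and corpus-\nbased lists", {"cotton"}))
print(fix_long_s("colonifts", toy_freq))
```

A limitation of this sketch: a misread word that happens to be a real word (e.g. "fame" for "same") is attested in the corpus and therefore left alone.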
Improvements to OCR
‣ Extensive evaluation of both tools against
    human corrected/normalised gold standard
‣   Reduced word error rate by 12.5% (relative) in a random
    Canadiana sample (word accuracy: 0.776 -> 0.804)
‣   Improvements have an effect on later text
    mining steps and would also be beneficial for
    searching text in any IR system (e.g. Jstor
    database search for “French colonifts”)
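
The 12.5% figure follows from the two word accuracy values if it is read as a relative reduction in word error rate (error rate = 1 - accuracy). A short worked check, with a hypothetical helper name:

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative drop in error rate, where error rate = 1 - accuracy."""
    err_before = 1.0 - acc_before   # 0.224 for the sample on the slide
    err_after = 1.0 - acc_after     # 0.196
    return (err_before - err_after) / err_before


reduction = relative_error_reduction(0.776, 0.804)
print(f"{reduction:.1%}")
```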



Language Identification
‣    Most sources do not contain language
     information like Canadiana does
‣    The table displays the number of text
     elements in Canadiana per language, ignoring
     notes and titles

     ISO Code    Language            Frequency
     eng         English             2,677,498
     fra         French              1,208,811
     deu         German                  2,886
     chn         Chinook jargon          2,488
     moh         Mohawk                  1,547
     oji         Ojibwa                  1,395
     emg         Eastern Meohang           835
     enb         Markweeta                 666
     cre         Cree                      501
     iro         Iroquoian                 324
     alg         Algonquian                210
     nge         Ngemba                    157
     nld         Dutch                     131
     lat         Latin                     119
     mic         Micmac                     61
     gla         Scottish Gaelic            22

Language Identification
‣ Make use of automatic language
     identification using TextCat, especially for the
     JSTOR data which is also multi-lingual.
‣    LID is done for each paragraph and for the
     entire document by taking the most frequent
     language tag assigned.
‣    Can limit processing to English (and French)
     documents only.
‣    740 English documents (out of 1,000)
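
The document-level decision described above can be sketched as a majority vote over paragraph tags. This assumes the paragraph-level identifier (TextCat in the project) has already produced one tag per paragraph. The function names and the ISO-style tags are illustrative assumptions.

```python
from collections import Counter


def document_language(paragraph_tags):
    """Return the most frequent paragraph-level language tag, or None if empty.

    On a tie, the tag that was seen first wins.
    """
    if not paragraph_tags:
        return None
    return Counter(paragraph_tags).most_common(1)[0][0]


def keep_document(paragraph_tags, wanted=("eng", "fra")):
    """Limit further processing to documents in the wanted languages."""
    return document_language(paragraph_tags) in wanted


print(document_language(["eng", "fra", "eng"]))
print(keep_document(["deu", "deu", "eng"]))
```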


Text Mining Tables
‣ Tables contain a lot of relevant information
     but are difficult to mine.
‣    HCPP documents contain coordinates for
     each table entry.
                <w p="961,1777,1026,1807" v="d">Rio</w>
                <w p="1026,1777,1170,1807" v="d">Janeiro</w>
                ...
                <w p="961,1892,1087,1921" v="n">Culcutta</w>
                <w p="1496,1530,1565,1555" v="o">141</w>
                <w p="1565,1525,1631,1555" v="d">bags</w>
                <w p="1227,1774,1336,1804" v="d">Wood</w>
                <w p="1353,1791,1366,1799" v="o">-</w>
                <w p="1494,1776,1565,1804" v="o">338</w>
                <w p="1565,1783,1676,1803" v="d">planks</w>
                <w p="1704,1791,1718,1799" v="o">-</w>

‣ Planning to do a feasibility study for a table
     mining algorithm.
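
One way the coordinates could be used, sketched under assumptions: the `p` attribute is read as left, top, right, bottom in pixels, words whose vertical midpoints lie close together are taken to share a table row, and the `<page>` wrapper, function name and tolerance value are hypothetical. This is an illustration only, not the planned algorithm.

```python
import xml.etree.ElementTree as ET

# The <w> elements are the ones shown above; the <page> root is added here.
SAMPLE = """<page>
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
<w p="961,1892,1087,1921" v="n">Culcutta</w>
<w p="1496,1530,1565,1555" v="o">141</w>
<w p="1565,1525,1631,1555" v="d">bags</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1353,1791,1366,1799" v="o">-</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="1565,1783,1676,1803" v="d">planks</w>
<w p="1704,1791,1718,1799" v="o">-</w>
</page>"""


def group_rows(xml_text, tolerance=15):
    """Group words into rows by vertical midpoint, then order each row by left edge."""
    words = []
    for w in ET.fromstring(xml_text).iter("w"):
        x1, y1, x2, y2 = (int(n) for n in w.get("p").split(","))
        words.append(((y1 + y2) / 2, x1, w.text))
    words.sort()  # top to bottom, then left to right

    rows, last_y = [], None
    for y_mid, x1, text in words:
        # Start a new row when the vertical gap to the previous word is large.
        if last_y is None or y_mid - last_y > tolerance:
            rows.append([])
        rows[-1].append((x1, text))
        last_y = y_mid
    return [[text for _, text in sorted(row)] for row in rows]


for row in group_rows(SAMPLE):
    print(" | ".join(row))
```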
Text Mining Pipeline
‣ Steps after the OCR improvements and LID:
   ‣     Tokenisation
   ‣     Part-of-speech tagging
   ‣     Lemmatisation
   ‣     Wordnet lookup to find commodities
   ‣     Named-entity recognition including commodity
         lexicon lookup
   ‣     Port-based Geo-grounding
   ‣     Chunking




Commodities Identification
‣ WordNet lookup using an approximation of
  commodity named entities:
   ‣ Noun phrases with hypernyms such as substance,
         physical matter, plant or animal in WordNet.
   ‣     Each NP which leads to a match is assigned a
         wn="true" attribute.
‣ Commodities gazetteer lookup using a list of
  commodities derived by historians.
   ‣ Strings matching the entries in the gazetteer are
         assigned a commlex="true" attribute.
‣ Words/phrases with wn="true" and
  commlex="true" are good candidates.
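
A small sketch of how the two flags combine. The hypernym table below is a hand-made stand-in for WordNet and the gazetteer is a made-up three-word list. Both are hypothetical, as are the function names. A real system would query WordNet and the historians' lexicon instead.

```python
# Toy stand-in for WordNet: each noun maps to its chain of hypernyms (hypothetical data).
TOY_HYPERNYMS = {
    "cotton": ["fibre", "material", "substance"],
    "cinchona": ["tree", "plant"],
    "whisky": ["liquor", "beverage", "substance"],
    "notice": ["announcement", "message"],
}

# Hypernyms named above as signals for commodity-like nouns.
COMMODITY_HYPERNYMS = {"substance", "physical matter", "plant", "animal"}

# Toy stand-in for the commodities gazetteer (hypothetical entries).
TOY_GAZETTEER = {"cotton", "cinchona", "quinine"}


def annotate(noun):
    """Assign the wn and commlex flags to a single noun."""
    wn = bool(COMMODITY_HYPERNYMS.intersection(TOY_HYPERNYMS.get(noun, [])))
    commlex = noun in TOY_GAZETTEER
    return {"word": noun, "wn": wn, "commlex": commlex}


def is_good_candidate(annotation):
    """A good candidate carries both flags."""
    return annotation["wn"] and annotation["commlex"]


for noun in ["cotton", "whisky", "quinine", "notice"]:
    print(noun, is_good_candidate(annotate(noun)))
```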
Ports-based Geo-grounding
‣ Started with non-optimised geo-resolution.
‣ Incorporated the list of ports.
      Locations are assigned an is_port="1" or an
      is_port="0" attribute.
      ‣ Grounding now ignores non-port candidates in case
         of ambiguous location mentions.
      ‣ is_port locations are also given a higher weight in
         the scoring.
‣ Hypothesis: ports are more likely to be
     significant locations in historic documents
     about trade.
‣    Not tested yet as need gold standard data.
Ports-based Geo-grounding
‣ Example:
Dalhousie is in the list of ports as:
DALHOUSIE                -66.4   48.1

Geo-grounding in non-optimised resolver:
<ent id="rb3" type="location" lat="32.5333300" long="75.9833300" in-country="IN"
gazref="geonames:1273648" feat-type="ppl" pop-size="7601">
  <parts>
   <part ew="w136" sw="w136">Dalhousie</part>
  </parts>
 </ent>

Geo-grounding in ports-dependent resolver:
 <ent id="rb2" type="location" lat="48.0550200" long="-66.3847200"
 in-country="CA" gazref="geonames:6943599" feat-type="ppl" pop-size="0">
  <parts>
   <part ew="w97" sw="w97">Dalhousie</part>
  </parts>
 </ent>
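
The Dalhousie example can be replayed with a toy resolver. Everything here is an assumption made for illustration: the `Candidate` class, matching a candidate to the ports list by name plus coordinates within 0.5 degrees, and using population as the baseline score. The extra scoring weight for ports is left out, and only the filtering of non-port candidates is shown.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    lat: float
    lon: float
    country: str
    population: int
    is_port: bool = False


# Ports list entry as shown above: name -> (long, lat).
PORTS = {"DALHOUSIE": (-66.4, 48.1)}


def mark_ports(candidates, ports=PORTS, max_degrees=0.5):
    """Set is_port when name and coordinates agree with the ports list (assumed rule)."""
    for c in candidates:
        entry = ports.get(c.name.upper())
        if entry is not None:
            lon, lat = entry
            c.is_port = (abs(c.lat - lat) <= max_degrees
                         and abs(c.lon - lon) <= max_degrees)
    return candidates


def resolve(candidates, use_ports=True):
    """Pick one candidate; with use_ports, non-port candidates are ignored if a port exists."""
    if not candidates:
        return None
    pool = candidates
    if use_ports and any(c.is_port for c in candidates):
        pool = [c for c in candidates if c.is_port]
    return max(pool, key=lambda c: c.population)  # assumed baseline: largest population


candidates = mark_ports([
    Candidate("Dalhousie", 32.53333, 75.98333, "IN", 7601),
    Candidate("Dalhousie", 48.05502, -66.38472, "CA", 0),
])
print(resolve(candidates, use_ports=False).country)
print(resolve(candidates, use_ports=True).country)
```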
Ports-based Geo-grounding
‣ Geo-grounding assumes that each text is a
     coherent whole. All locations contribute to
     the resolution of all others. May have to
     change that.
‣    Segmentation (e.g. of books) into smaller
     units might improve the resolution.
‣    Need to consider old spellings of place
     names.



Relation Extraction
‣ Crude way to identify commodity-location
   relations:
    ‣ Sentences (s) containing words (w) with the
          commlex="true" and wn="true" and a location.

 Good: The quantity of raw cotton imported annually into the United Kingdom—take for
 example, the year 1854—amounted to, at least, 887,335,9041bs., of which the United States
 supplied 722,154,101 lbs.

 Of interest: Another kind of quinine-yieldmg bark has been discovered on the western side of the
 Cordillera, which produces more sulphate than the common cinchona; and as the cinchona
 grows on both sides of the Cordillera, it may be inferred that the new plant will be found also in
 the lands of Gualaquiza and Canelos.

 Bad: The first-class refreshment room, Central Station, Leeds, has a notice that only five-year old
 whisky is sold there. OR
 This paper was concealed in the handle of a spear, carried from Omdurman to Gedarif in that
 way.
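
The crude co-occurrence rule can be sketched as follows. The markup is simplified and partly hypothetical: locations are marked here with a `type="location"` attribute directly on `<w>`, which is an assumption and not the project's actual format, and the three toy sentences only echo the examples above.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up markup for illustration only.
SAMPLE = """<text>
<s id="s1"><w commlex="true" wn="true">cotton</w> <w>imported</w> <w>into</w>
  <w>the</w> <w type="location">United Kingdom</w></s>
<s id="s2"><w commlex="true" wn="true">whisky</w> <w>is</w> <w>sold</w>
  <w>in</w> <w type="location">Leeds</w></s>
<s id="s3"><w commlex="true" wn="true">cinchona</w> <w>grows</w> <w>there</w></s>
</text>"""


def commodity_location_pairs(xml_text):
    """Pair every flagged commodity with every location in the same sentence."""
    pairs = []
    for s in ET.fromstring(xml_text).iter("s"):
        words = list(s.iter("w"))
        commodities = [w.text for w in words
                       if w.get("commlex") == "true" and w.get("wn") == "true"]
        locations = [w.text for w in words if w.get("type") == "location"]
        pairs.extend((c, loc) for c in commodities for loc in locations)
    return pairs


print(commodity_location_pairs(SAMPLE))
```

The whisky and Leeds pair shows why the rule is crude: co-occurrence alone cannot tell a trade relation from an incidental mention.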


Relation Extraction
‣ Need to improve the relation extraction.
‣ Will look at pattern-based relation extraction
     exploiting vocabulary like "import", "export",
     "ship", "shipment", "trade", "manufacture",
     "grow" etc.
‣    Will annotate a small test corpus for
     evaluation.
‣    Need to distinguish between irrelevant or
     false commodity-location relations and
     commodity-location relations referring to
     trade.
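
A first approximation of the vocabulary idea, as a sketch only: a sentence counts as trade-related if it contains a word starting with one of the cue stems. The regular expression and function name are hypothetical. Prefix matching is a simplification (it misses irregular forms such as "grew") and is not the planned pattern-based method.

```python
import re

# Cue stems derived from the vocabulary listed above ("ship" also covers "shipment").
TRADE_CUES = re.compile(
    r"\b(?:import|export|ship|trade|manufactur|grow)\w*", re.IGNORECASE
)


def mentions_trade(sentence):
    """True if the sentence contains at least one trade cue word."""
    return TRADE_CUES.search(sentence) is not None


good = "The quantity of raw cotton imported annually into the United Kingdom"
bad = ("The first-class refreshment room, Central Station, Leeds, has a notice "
       "that only five-year old whisky is sold there.")
print(mentions_trade(good), mentions_trade(bad))
```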
Thank You
‣ Questions?




Example Input
‣ Different sources converted into common
  XML format




Example Output




 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

Exploring Challenges in Mining Historical Text

  • 1. Exploring Challenges in Mining Historical Text
      Beatrice Alex, Claire Grover, Richard Tobin and Ewan Klein
      Working with text: Tools, techniques and approaches for text mining - Edinburgh - 07/07/2012
  • 2. Overview
      ‣ Project
      ‣ Data
      ‣ Preprocessing historical text
          ‣ Improvements to OCR
          ‣ Language identification
          ‣ Text mining tables
      ‣ Text-mining
          ‣ Improved commodity identification
          ‣ Ports-based geo-grounding
          ‣ Relation extraction
  • 3. Project (01/2012-12/2014)
      ‣ Funded by Digging into Data (round 2)
      ‣ Partners
          ‣ Ewan Klein, Claire Grover, Bea Alex (text mining)
          ‣ Colin Coates, Jim Clifford (historical analysis)
          ‣ James Reid (data integration)
          ‣ Aaron Quigley, Uta Hinrichs (information visualisation)
  • 4. Trading Consequences
      ‣ What does archival text say about the economic and environmental consequences of global commodity trading during the nineteenth century?
      ‣ Help historians to discover novel patterns and explore new hypotheses.
      ‣ Example questions:
          ‣ What were the routes and volumes of international trade in resource commodities 1850-1914?
          ‣ What were the local environmental consequences of this demand for these resources?
  • 5. Geolocating Cinchona
  • 6. Trading Consequences
      ‣ Scope: global but with focus on Canadian natural resource flows to test reliability and efficacy of our methods
      ‣ Methods:
          ‣ Text mining and geo-parsing to transform the text into structured data, e.g. relational database
          ‣ Query interface targeted at historians
          ‣ Information visualisation for interactive exploration
  • 7. Historical Data
      ‣ Digitised sources from the 19th century British Empire, currently processing:
          ‣ Early Canadiana Online: 83,038 files
          ‣ JSTOR data: 1,000 XML files
          ‣ House of Commons Parliamentary Papers: 4,135 files
          ‣ Books: selected books on nineteenth century trade
      ‣ Further sources:
          ‣ ProQuest data
          ‣ Encyclopaedia Britannica, JSTOR Plants, Forestry Journals?, The Botanist?
  • 8. Processing Historical Data
      ‣ Challenges so far:
          ‣ Different formats
          ‣ Low-quality OCRed text
              ‣ Old/low-quality prints, quality of OCR technology
              ‣ Historical English: historical word variants, ſ (long s) characters mixed up with f by OCR
              ‣ Artefacts in original documents: headers/footers, page numbers, notes in margins, end-of-line hyphenation
          ‣ Text in different languages
          ‣ Information in tables
  • 10. Improvements to OCR
      ‣ Normalisation and post-correction
      ‣ Fixed end-of-line hyphenation
          ‣ Dehyphenate all token-splitting hyphens using a dictionary-based approach (dictionary is the system dictionary + the text of the current document)
      ‣ Added f-to-s conversion
          ‣ Convert all false f characters to s using a corpus-based approach (corpus is a collection of historical documents from Project Gutenberg)
      ‣ Example: reduced number of words unrecognised by spell checker from 61 to 21 -> roughly two-thirds fewer
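The two post-correction steps on slide 10 could look roughly like the sketch below. It is an illustration only, not the project's tool: the function names, the tiny word lists and the frequency table are assumptions, standing in for the system dictionary and the Project Gutenberg corpus mentioned on the slide.

```python
import re
from itertools import product


def dehyphenate(text, system_words):
    """Join words split by an end-of-line hyphen when the joined form is known.

    The vocabulary is the (assumed) system word list plus every word that
    occurs in the document itself, as described on the slide.
    """
    vocabulary = {w.lower() for w in system_words}
    vocabulary |= set(re.findall(r"[a-z]+", text.lower()))

    def join(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in vocabulary:
            # Move the line break to after the re-joined word.
            return joined + "\n"
        return match.group(0)  # unknown word: leave the hyphen alone

    return re.sub(r"(\w+)-\n(\w+) ?", join, text)


def fix_long_s(token, frequencies):
    """Replace false 'f' characters with 's' if a corpus prefers that spelling.

    `frequencies` is a hypothetical word -> count table built from a
    reference corpus of historical text.
    """
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    best, best_freq = token, frequencies.get(token.lower(), 0)
    # Try every combination of keeping 'f' or switching to 's'.
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        freq = frequencies.get(candidate.lower(), 0)
        if freq > best_freq:
            best, best_freq = candidate, freq
    return best
```

A token is only changed when some s-variant is more frequent in the reference table than the original spelling, so genuine words such as "of" are left untouched.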
  • 11. Improvements to OCR
  • 12. Improvements to OCR
  • 13. Improvements to OCR
      ‣ Extensive evaluation of both tools against a human-corrected/normalised gold standard
      ‣ Reduced word error rate by 12.5% in a random Canadiana sample (word acc: 0.776 -> 0.804)
      ‣ Improvements have an effect on later text mining steps and would also be beneficial for searching text in any IR system (e.g. JSTOR database search for "French colonifts")
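The 12.5% figure on slide 13 is consistent with a relative reduction in word error rate, taking error rate as one minus word accuracy. That reading is an assumption, and the helper name below is hypothetical; the short check simply reproduces the arithmetic from the two accuracy values given on the slide.

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative drop in word error rate, with error rate = 1 - accuracy."""
    err_before = 1.0 - acc_before   # 0.224 for the Canadiana sample
    err_after = 1.0 - acc_after     # 0.196 after post-correction
    return (err_before - err_after) / err_before


reduction = relative_error_reduction(0.776, 0.804)
print(f"{reduction:.1%}")
```

An absolute gain of 2.8 points in word accuracy therefore corresponds to removing one eighth of the original word errors.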
  • 14. Language Identification
      ‣ Most sources do not contain language information like Canadiana does
      ‣ The table displays the number of text elements in Canadiana per language ignoring notes and titles

      ISO Code   Language           Frequency
      eng        English            2,677,498
      fra        French             1,208,811
      deu        German                 2,886
      chn        Chinook jargon         2,488
      moh        Mohawk                 1,547
      oji        Ojibwa                 1,395
      emg        Eastern Meohang          835
      enb        Markweeta                666
      cre        Cree                     501
      iro        Iroquoian                324
      alg        Algonquian               210
      nge        Ngemba                   157
      nld        Dutch                    131
      lat        Latin                    119
      mic        Micmac                    61
      gla        Scottish Gaelic           22
  • 15. Language Identification
      ‣ Make use of automatic language identification using TextCat, especially for the JSTOR data which is also multi-lingual.
      ‣ LID is done for each paragraph and for the entire document by taking the most frequent language tag assigned.
      ‣ Can limit processing to English (and French) documents only.
      ‣ 740 English documents (out of 1,000)
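The document-level step on slide 15 (take the most frequent paragraph tag) can be sketched as follows. The paragraph tags are assumed to come from a language identifier such as TextCat; no TextCat API is called here, and the function names are hypothetical.

```python
from collections import Counter


def document_language(paragraph_tags):
    """Return the most frequent paragraph-level language tag, or None."""
    counts = Counter(tag for tag in paragraph_tags if tag)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def keep_for_processing(paragraph_tags, wanted=("eng", "fra")):
    """Decide whether a document stays in the pipeline (English/French only)."""
    return document_language(paragraph_tags) in wanted
```

For example, a document whose paragraphs are tagged `["eng", "fra", "eng"]` is treated as English, while one tagged mostly `"deu"` would be skipped.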
  • 16. Text Mining Tables
  • 17. Text Mining Tables
      ‣ Tables contain a lot of relevant information but are difficult to mine.
      ‣ HCPP documents contain coordinates for each table entry.
          <w p="961,1777,1026,1807" v="d">Rio</w>
          <w p="1026,1777,1170,1807" v="d">Janeiro</w>
          ...
          <w p="961,1892,1087,1921" v="n">Culcutta</w>
          <w p="1496,1530,1565,1555" v="o">141</w>
          <w p="1565,1525,1631,1555" v="d">bags</w>
          <w p="1227,1774,1336,1804" v="d">Wood</w>
          <w p="1353,1791,1366,1799" v="o">-</w>
          <w p="1494,1776,1565,1804" v="o">338</w>
          <w p="1565,1783,1676,1803" v="d">planks</w>
          <w p="1704,1791,1718,1799" v="o">-</w>
      ‣ Planning to do a feasibility study for a table mining algorithm.
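One possible starting point for the table mining mentioned on slide 17 is to group the word tokens into rows by their vertical position. The sketch below assumes that the `p` attribute holds "left,top,right,bottom" pixel coordinates and wraps the tokens in a hypothetical `<page>` element so they parse as one document; neither assumption is confirmed by the slides, and this is not the planned algorithm.

```python
import xml.etree.ElementTree as ET

# Tokens copied from the slide; the <page> wrapper is an assumption.
SNIPPET = """<page>
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
<w p="961,1892,1087,1921" v="n">Culcutta</w>
<w p="1496,1530,1565,1555" v="o">141</w>
<w p="1565,1525,1631,1555" v="d">bags</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1353,1791,1366,1799" v="o">-</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="1565,1783,1676,1803" v="d">planks</w>
<w p="1704,1791,1718,1799" v="o">-</w>
</page>"""


def read_tokens(xml_text):
    """Read each <w> element with its left edge and vertical midpoint."""
    tokens = []
    for w in ET.fromstring(xml_text).iter("w"):
        left, top, _right, bottom = (int(n) for n in w.get("p").split(","))
        tokens.append({"text": w.text, "left": left, "mid_y": (top + bottom) / 2})
    return tokens


def group_rows(tokens, tolerance=15):
    """Group tokens into rows: a vertical gap above `tolerance` starts a new row."""
    rows = []
    previous_y = None
    for tok in sorted(tokens, key=lambda t: t["mid_y"]):
        if previous_y is None or tok["mid_y"] - previous_y > tolerance:
            rows.append([])
        rows[-1].append(tok)
        previous_y = tok["mid_y"]
    # Within a row, read left to right.
    return [[t["text"] for t in sorted(row, key=lambda t: t["left"])] for row in rows]


rows = group_rows(read_tokens(SNIPPET))
```

Column detection would need a similar pass over the horizontal coordinates; the 15-pixel tolerance is an arbitrary value chosen for this sample.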
  • 18. Text Mining Pipeline
      ‣ Steps after the OCR improvements and LID:
          ‣ Tokenisation
          ‣ Part-of-speech tagging
          ‣ Lemmatisation
          ‣ WordNet lookup to find commodities
          ‣ Named-entity recognition including commodity lexicon lookup
          ‣ Port-based geo-grounding
          ‣ Chunking
  • 20. Commodities Identification
      ‣ WordNet lookup using an approximation of commodity named entities:
          ‣ Noun phrases with hypernyms such as substance, physical matter, plant or animal in WordNet.
          ‣ Each NP which leads to a match is assigned a wn="true" attribute.
      ‣ Commodities gazetteer lookup using a list of commodities derived by historians.
          ‣ Strings matching the entries in the gazetteer are assigned a commlex="true" attribute.
      ‣ Words/phrases with wn="true" and commlex="true" are good candidates.
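The way the two lookups on slide 20 combine can be sketched as below. To keep the example self-contained, a small hand-written hypernym table stands in for WordNet and a three-word set stands in for the historians' gazetteer; both are made-up illustrations, as are the function names.

```python
# Hypernyms that mark a noun phrase as a possible commodity (from the slide).
COMMODITY_HYPERNYMS = {"substance", "physical matter", "plant", "animal"}


def annotate(noun_phrases, hypernyms_of, gazetteer):
    """Attach wn and commlex flags to each noun phrase.

    `hypernyms_of` is a hypothetical phrase -> hypernym list mapping that
    stands in for a real WordNet lookup.
    """
    annotated = []
    for phrase in noun_phrases:
        key = phrase.lower()
        wn = bool(COMMODITY_HYPERNYMS & set(hypernyms_of.get(key, ())))
        commlex = key in gazetteer
        annotated.append({"text": phrase, "wn": wn, "commlex": commlex})
    return annotated


def good_candidates(annotated):
    """Keep phrases flagged by both the WordNet and the gazetteer lookup."""
    return [a["text"] for a in annotated if a["wn"] and a["commlex"]]


# Toy data, not taken from the project.
toy_hypernyms = {
    "cotton": ["plant", "fibre"],
    "cinchona": ["plant"],
    "notice": ["communication"],
}
toy_gazetteer = {"cotton", "cinchona", "whisky"}
flags = annotate(["cotton", "notice", "whisky"], toy_hypernyms, toy_gazetteer)
```

Here only "cotton" carries both flags: "notice" fails both lookups and "whisky" is in the toy gazetteer but has no entry in the toy hypernym table.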
  • 21. Ports-based Geo-grounding
      ‣ Started with non-optimised geo-resolution.
      ‣ Incorporated the list of ports. Locations are assigned with an is_port="1" or an is_port="0" attribute.
      ‣ Grounding now ignores non-port candidates in case of ambiguous location mentions.
      ‣ is_port locations are also given a higher weight in the scoring.
      ‣ Hypothesis: ports are more likely to be significant locations in historic documents about trade.
      ‣ Not tested yet as need gold standard data.
  • 22. Ports-based Geo-grounding
      ‣ Example: Dalhousie is in the list of ports as:
          DALHOUSIE                -66.4   48.1
      Geo-grounding in non-optimised resolver:
          <ent id="rb3" type="location" lat="32.5333300" long="75.9833300" in-country="IN" gazref="geonames:1273648" feat-type="ppl" pop-size="7601">
            <parts>
              <part ew="w136" sw="w136">Dalhousie</part>
            </parts>
          </ent>
      Geo-grounding in ports-dependent resolver:
          <ent id="rb2" type="location" lat="48.0550200" long="-66.3847200" in-country="CA" gazref="geonames:6943599" feat-type="ppl" pop-size="0">
            <parts>
              <part ew="w97" sw="w97">Dalhousie</part>
            </parts>
          </ent>
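The Dalhousie example can be reproduced with a small sketch of the port filter. The candidate coordinates and population sizes are taken from the slide; everything else is an assumption made for illustration: the half-degree matching window, the population-based fallback ranking and the function names are not the actual resolver's scoring.

```python
# Port list entry from the slide: name -> (longitude, latitude).
PORTS = {"DALHOUSIE": (-66.4, 48.1)}

# The two gazetteer candidates shown on the slide.
CANDIDATES = [
    {"country": "IN", "lat": 32.53333, "long": 75.98333, "pop": 7601},
    {"country": "CA", "lat": 48.05502, "long": -66.38472, "pop": 0},
]


def is_port(name, candidate, max_degrees=0.5):
    """Flag a candidate that lies close to a port of the same name (assumed window)."""
    port = PORTS.get(name.upper())
    if port is None:
        return False
    lon, lat = port
    return (abs(candidate["long"] - lon) <= max_degrees
            and abs(candidate["lat"] - lat) <= max_degrees)


def resolve(name, candidates, use_ports=True):
    """Pick one candidate; with use_ports, ambiguous names keep only port candidates."""
    pool = candidates
    if use_ports and len(candidates) > 1:
        ports = [c for c in candidates if is_port(name, c)]
        if ports:
            pool = ports
    # Assumed fallback ranking: prefer the most populous remaining candidate.
    return max(pool, key=lambda c: c["pop"])


without_ports = resolve("Dalhousie", CANDIDATES, use_ports=False)
with_ports = resolve("Dalhousie", CANDIDATES, use_ports=True)
```

Without the port list the more populous Indian town wins; with it, the non-port candidate is dropped and the Canadian port remains.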
  • 23. Ports-based Geo-grounding
      ‣ Geo-grounding assumes that each text is a coherent whole. All locations contribute to the resolution of all others. May have to change that.
      ‣ Segmentation (e.g. of books) into smaller units might improve the resolution.
      ‣ Need to consider old spellings of place names.
  • 24. Relation Extraction
      ‣ Crude way to identify commodity-location relations:
          ‣ Sentences (s) containing words (w) with commlex="true" and wn="true" attributes, plus a location.
      Good: The quantity of raw cotton imported annually into the United Kingdom—take for example, the year 1854—amounted to, at least, 887,335,9041bs., of which the United States supplied 722,154,101 lbs.
      Of interest: Another kind of quinine-yieldmg bark has been discovered on the western side of the Cordillera, which produces more sulphate than the common cinchona; and as the cinchona grows on both sides of the Cordillera, it may be inferred that the new plant will be found also in the lands of Gualaquiza and Canelos.
      Bad: The first-class refreshment room, Central Station, Leeds, has a notice that only five-year old whisky is sold there.
      OR This paper was concealed in the handle of a spear, carried from Omdurman to Gedarif in that way.
  • 25. Relation Extraction
  • 26. Relation Extraction
      ‣ Need to improve the relation extraction.
      ‣ Will look at pattern-based relation extraction exploiting vocabulary like "import", "export", "ship", "shipment", "trade", "manufacture", "grow" etc.
      ‣ Will annotate a small test corpus for evaluation.
      ‣ Need to distinguish between irrelevant or false commodity-location relations and commodity-location relations referring to trade.
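A first cut at the pattern-based extraction planned on slide 26 might require a trade cue word in the sentence before pairing commodities with locations. The sketch below is only one way to do this: the cue pattern is built from the vocabulary listed on the slide, the commodity and location mentions are assumed to come from the earlier pipeline steps, and the function name is hypothetical.

```python
import re

# Stems of the trade vocabulary from the slide; \w* also admits inflected forms
# such as "imported" or "shipment".
TRADE_CUES = re.compile(
    r"\b(?:import|export|ship|trade|manufactur|grow)\w*", re.IGNORECASE
)


def trade_relations(sentence, commodities, locations):
    """Pair commodities with locations only if the sentence has a trade cue.

    Returns (commodity, cue, location) triples using the first cue found.
    """
    cue = TRADE_CUES.search(sentence)
    if cue is None:
        return []
    word = cue.group(0).lower()
    return [(c, word, loc) for c in commodities for loc in locations]


good = trade_relations(
    "The quantity of raw cotton imported annually into the United Kingdom",
    ["raw cotton"],
    ["United Kingdom"],
)
bad = trade_relations(
    "The first-class refreshment room, Central Station, Leeds, has a notice "
    "that only five-year old whisky is sold there.",
    ["whisky"],
    ["Leeds"],
)
```

With this filter the "Good" example from slide 24 yields a relation while the whisky sentence does not, though a cue-word test alone would still accept sentences where the cue is unrelated to the commodity.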
  • 27. Thank You
      ‣ Questions?
  • 28. Example Input
      ‣ Different sources converted into common XML format
  • 29. Example Output
