Cpascoe pimms or2012_

The PIMMS project and Natural Language
Processing for Climate Science
Extending the Chemical Tagger natural language processing tool with
climate science controlled vocabularies

Charlotte Pascoe, Hannah Barjat
Peter Murray-Rust and Gerry Devine

June 9th 2012, Open Repositories 2012

Portable Infrastructure for the
Metafor Metadata System
http://proj.badc.rl.ac.uk/pimms/

Common Information Model

• Data: We can talk about DataObjects collected together in any number of ways, stored in a particular medium.
• Software: We can talk about hierarchical ModelComponents with ModelProperties, some of which can be coupled together.
• Shared: Some concepts are shared.
• ISO: We reuse various ISO classes.
• Quality: We can record the quality of things.
• Activity: A particular Activity uses a particular SoftwareComponent. We can talk about Simulations run in support of Experiments. Experiments consist of Requirements; Simulations conform to Requirements.
• Grids: We can define a GridSpec or some other geometry.
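
The Activity relationships can be pictured with a few small classes. This is only a sketch: the class names echo the CIM terms on the slide, but the fields, the method `unmet_requirements`, and the example names are assumptions made for illustration and are not taken from the CIM schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Requirement:
    name: str


@dataclass
class Experiment:
    # Experiments consist of Requirements.
    name: str
    requirements: list = field(default_factory=list)


@dataclass
class Simulation:
    # Simulations are run in support of an Experiment and conform to Requirements.
    name: str
    supports: Experiment
    conforms_to: list = field(default_factory=list)

    def unmet_requirements(self):
        """Requirements of the supported Experiment that this Simulation does not conform to."""
        return [r for r in self.supports.requirements if r not in self.conforms_to]


# Hypothetical example data.
experiment = Experiment("demo-experiment",
                        [Requirement("boundary condition"), Requirement("run length")])
simulation = Simulation("demo-run", experiment, [Requirement("run length")])
print(simulation.unmet_requirements())
```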

Mind Maps

Mind maps are used to capture
information requirements from domain
experts and build a controlled vocabulary.

Python Parser
The python parser processes the XML files generated by the mind maps
<component name="Radiation">
  <definition status="missing">Definition of component type Radiation required</definition>
  <parameter name="RadiativeTimeStep" choice="keyboard">
    <definition status="missing">Definition of property name RadiativeTimeStep required</definition>
    <value format="numerical" name="time step" units="time units"/>
  </parameter>
  <parametergroup name="Longwave">
    <parameter name="SchemeType" choice="XOR">
      <definition status="missing">Definition of property name SchemeType required</definition>
      <value name="Wide-band model"/>
      <value name="Wide-band (Morcrette)"/>
      <value name="K-correlated"/>
      <value name="K-correlated (RRTM)"/>
      <value name="other"/>
    </parameter>
    <parameter name="Method" choice="XOR">
      <definition status="missing">Definition of property name Method required</definition>
      <value name="Two stream"/>
      <value name="Layer interaction"/>
      <value name="other"/>
    </parameter>
    <parameter name="NumberOfSpectralIntervals" choice="keyboard">
      <definition status="missing">Definition of property name NumberOfSpectralIntervals required</definition>
      <value format="numerical" name=""/>
    </parameter>
  </parametergroup>
</component>
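
A minimal sketch of how such a file might be read, using Python's standard `xml.etree.ElementTree`. This is not the PIMMS parser itself: the function name `list_parameters` and the trimmed sample are assumptions for illustration, and only the element and attribute names are taken from the excerpt above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Radiation excerpt, closed so that it parses on its own.
MINDMAP_XML = """
<component name="Radiation">
  <parameter name="RadiativeTimeStep" choice="keyboard">
    <value format="numerical" name="time step" units="time units"/>
  </parameter>
  <parametergroup name="Longwave">
    <parameter name="SchemeType" choice="XOR">
      <value name="Wide-band model"/>
      <value name="K-correlated"/>
      <value name="other"/>
    </parameter>
  </parametergroup>
</component>
"""


def _describe(group_name, param):
    """Build one (group, parameter, choice, allowed values) row."""
    values = [v.get("name") for v in param.findall("value")]
    return (group_name, param.get("name"), param.get("choice"), values)


def list_parameters(xml_text):
    """Return a row for every parameter, grouped or not, in document order per level."""
    root = ET.fromstring(xml_text)
    # Parameters directly under the component have no group.
    rows = [_describe(None, p) for p in root.findall("parameter")]
    for group in root.findall("parametergroup"):
        rows.extend(_describe(group.get("name"), p) for p in group.findall("parameter"))
    return rows


for row in list_parameters(MINDMAP_XML):
    print(row)
```

Rows with `choice="XOR"` carry the list of permitted values; rows with `choice="keyboard"` describe free entry fields.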

Web Forms
Web forms generate content in CIM XML format: http://q.cmip5.ceda.ac.uk/

CIM Viewer
http://zonda5.badc.rl.ac.uk/site/public/tools/viewer/integrated/1.5/en/73c59aba-dc6d-11df-a442-00163e9152a5/1

Chemical Tagger
http://chemicaltagger.ch.cam.ac.uk/
ChemicalTagger is an open-source tool that uses OSCAR4 and NLP techniques for tagging and
parsing experimental sections in the chemistry literature.

Chemical Tagger
https://bitbucket.org/wwmm/chemicaltagger & https://bitbucket.org/wwmm/acpgeo
• Java project developed by the Peter Murray-Rust group, Cambridge. Online demo: http://chemicaltagger.ch.cam.ac.uk/
• Adapted for use with ACP abstracts (Lezan Hawizy and Hannah Barjat).
– Modification by use of dictionaries and changes to grammar.
– First use case outside of laboratory chemistry.
– Still with a significant chemistry component.
– Wider physical science.
• Open Source NLP tool for processing chemical text
• Combines Chemical Entity Recognition (OSCAR) with NLP techniques
• Extendible and Reconfigurable Taggers and Parsers generated using ANTLR (ANother Tool for Language Recognition)

Chemical Tagger & PIMMS

• Aim: to extend ChemicalTagger to be better suited to climate modelling.
– Specifically:
• Palaeoclimate modelling, and how the process of text mining might differ from the development of a controlled vocabulary.
• Highlighting of text for comparison with CIM documents.
• Initially only using XML abstracts, e.g. from EGU’s Geoscientific Model Development and Climate of the Past.
– Brief look at PDF to text.


Palaeoclimate Language
• Time periods and climatic events
– Includes named Ages, Epochs, Eras etc. [Including all those in a mind map produced for the PIMMS project at Bristol].
– Context of proper nouns, e.g. with words such as ‘period’, ‘era’, ‘epoch’
– Numbers with appropriate units, e.g. Mya, yr BP
– Likely date numbers, e.g. 1750 AD.
– Acronyms: known ones, e.g. ‘LGM’ [acronyms in context have not been investigated]
– Related adjectives, e.g. seasonal, decadal, glacial, interglacial, stadial, interstadial, maximum, minimum where used as proper nouns.

• Palaeoclimate Models
– Can guess model names from context
• e.g. proper noun or acronym followed by model
• e.g. reconstruction / simulation with XXX
– Can develop/use glossary of model names.

• Palaeoclimate Acronyms
– Time periods and models.
– Theories, techniques, physical and chemical parameters?
– Can develop/use glossary of acronyms – problem area: often not unique even
within subject.
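
The cues listed above (numbers with units, likely dates, proper nouns next to context words, and names followed by "model") can be approximated with regular expressions. The sketch below is illustrative only: ChemicalTagger uses ANTLR grammars and dictionaries rather than these patterns, and the pattern names, the function `find_cues`, and the sample sentence are assumptions.

```python
import re

# Illustrative patterns; the units and context words come from the slides.
NUMBER_WITH_UNIT = re.compile(r"\b\d+(?:\.\d+)?\s?(?:Mya|kyr BP|yr BP)\b")
CALENDAR_YEAR = re.compile(r"\b\d{3,4}\s?AD\b")
# A capitalised word directly followed by a context word such as 'epoch'.
NAMED_PERIOD = re.compile(r"\b([A-Z][a-z]+)\s+(?:period|era|epoch)\b")
# A capitalised name or acronym directly followed by 'model'.
# Note: a sentence-initial "The model" would be a false positive.
MODEL_BY_CONTEXT = re.compile(r"\b([A-Z][A-Za-z0-9-]+)\s+model\b")


def find_cues(text):
    """Collect candidate time expressions and model names from free text."""
    return {
        "numbers_with_units": NUMBER_WITH_UNIT.findall(text),
        "calendar_years": CALENDAR_YEAR.findall(text),
        "named_periods": NAMED_PERIOD.findall(text),
        "models": MODEL_BY_CONTEXT.findall(text),
    }


# Made-up sentence for demonstration.
sample = ("The Holocene epoch began about 11700 yr BP; "
          "the MIROC4 model was run for 1750 AD and 3 Mya.")
print(find_cues(sample))
```

Patterns like these only propose candidates; a glossary of period and model names would still be needed to confirm them.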

Natural Language vs CV

• Quick compilation of proper nouns used for time periods
(primarily from Wikipedia) contains 185 words.
– Use of these words together with adjectives / dates / details of
events would produce a very large number of phrases.

• Controlled Vocabulary from Bristol contains around 24 of
these.
• Use of these words together with other proper nouns /
adjectives / dates gives only 44 phrases within the Bristol CV.

• Map natural language to CV?
– Straightforward for most dates?
– Understanding of context important
• Does context refer to main emphasis of paper?
• Is an event/time period described unambiguously? e.g. “Last Glacial

Preliminary Results (from 68 files)

Tag / Tags | Example | Comment
<timePhrase>, <PALAEOTIME> | (i) Holocene, (ii) 8 kyr BP, (iii) |
<referencePhrase> | (i) (Otto et al. 2009b), (ii) Giraudeau et al. 2000 | Important to distinguish year pattern from dates relevant to the study.
<locationPhrase> | (i) around Lake Kotokel, (ii) over Tibetan Plateau | False positives: e.g. “from Sphagnum”
<LOCATION> | (i) 52°47´ N, 108°07´ E, 458 m a.s.l, (ii) London. | Cannot currently do degrees from pdf-text.
<TempPhrase> | | ‘warm’ and ‘cool’: verbs in synthetic chem unlike env. chem.

Tag / Tags | Example | Numbers found
<CAMPAIGN> | (i) PMIP, (ii) PANASH | Less relevant here than to ACP in general
<MODEL> | (i) REVEALS model, (ii) ECBILT-CLIO intermediate complexity climate model |
<acronymPhrase> | (i) Modern Analogues Technique ( MAT ), (ii) REVEALS ( Regional Estimates of VEgetation Abundance from Large Sites ) | May pick up campaigns / models where phrases above have failed.
<QUANTITY> | (i) 10 ppm, (ii) 0.53 mm/day | Units dictionary could be more extensive
<MOLECULE> | (i) CO2, (ii) calcium carbonate | Many false positives, as chemistry is what ChemicalTagger was designed for.

Chemical Tagger
Rendering of PALEOTIME
XML rendered with CSS http://www.clim-past.net/2/205/2006/cp-2-205-2006.html


GMD Journal Article
http://www.geosci-model-dev.net/4/1035/2011/gmd-4-1035-2011.html

CIM Document Viewer

The acronym / name MIROC4 is not explained, so the sentence is reproduced.

The description is just the first few sentences after the appearance of <MODEL>.

CIM Document Viewer
http://zonda5.badc.rl.ac.uk/site/public/tools/viewer

Makes use of existing
chemical tagging.

CIM Document Viewer
http://zonda5.badc.rl.ac.uk/site/public/repository

The number of spectral intervals was not found! There is no place to record “not found”.

Climate Models –
General Constraints
• Unless the paper is specifically about the model, we
are unlikely to find much Metafor-type CV in
the abstract
– Look at experimental / methods sections
• model name
• model resolution
• model schemes
– Problem with PDF -> text.
– Only certain elements easy to extract (e.g.
resolution)

Refine ACPgeo Output

• Add a few more phrases e.g. specific phrases to
look for model resolution, using expected
vocabulary (e.g. grid, levels, resolution, directions
etc).
• Refine output of ACPgeo to look for specific CV
terms.
• Try to put CV terms in context:
– Look for proximity of CV terms to other phrases:
• Within phrase; within sentence or within a number of
sentences
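
The proximity idea can be sketched as a search for context words within a window of sentences around each CV term. This is a simplified illustration and not ACPgeo code: the sentence splitter is naive, matching is by lowercase substring, and the function names and sample text are assumptions.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def terms_near(text, cv_terms, context_words, window=0):
    """Return (cv_term, context_word, sentence_index) triples.

    A triple is reported when the context word occurs within `window`
    sentences of a sentence that contains the CV term (window=0 means
    the same sentence).
    """
    sentences = [s.lower() for s in split_sentences(text)]
    hits = []
    for i, sentence in enumerate(sentences):
        for term in cv_terms:
            if term.lower() not in sentence:
                continue
            start, stop = max(0, i - window), min(len(sentences), i + window + 1)
            nearby = " ".join(sentences[start:stop])
            for word in context_words:
                if word.lower() in nearby:
                    hits.append((term, word, i))
    return hits


# Made-up text for demonstration.
text = ("The atmosphere uses a T42 grid with 20 levels. "
        "Radiation is computed hourly. "
        "The ocean resolution is 1 degree.")
print(terms_near(text, ["Radiation"], ["resolution", "grid"], window=0))
print(terms_near(text, ["Radiation"], ["resolution", "grid"], window=1))
```

Widening `window` trades precision for recall: more context is found, but the link between the CV term and the context word becomes weaker.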


<MOLECULE>

– Chemical Tagger was designed to be used primarily with
chemistry.
• Unsurprising that there is a tendency to assign acronyms,
hyphenated words, and words with common chemical
endings as molecules.
– It is possible to filter some of these wrongly assigned words by
probability.
– There are still conflicts e.g. C3 and C4 could refer to
hydrocarbons or plants.
• Extensive testing and modifying / machine learning might
reduce these.
– Better to get right first time if important!
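
Filtering by probability might look like the sketch below. The data structure, the scores, and the function name are hypothetical; they do not reflect the actual ChemicalTagger or OSCAR4 output format, which would need to be checked before adapting this.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    text: str
    tag: str
    probability: float  # hypothetical confidence score, not a real OSCAR4 field


def filter_molecules(tokens, threshold=0.5, stoplist=frozenset()):
    """Drop MOLECULE tags that score below the threshold or are on a stoplist.

    Tokens with any other tag are passed through unchanged.
    """
    kept = []
    for tok in tokens:
        if tok.tag == "MOLECULE" and (tok.probability < threshold or tok.text in stoplist):
            continue
        kept.append(tok)
    return kept


# Made-up scores for demonstration.
tokens = [
    TaggedToken("CO2", "MOLECULE", 0.95),
    TaggedToken("ECBILT-CLIO", "MOLECULE", 0.30),  # hyphenated model name
    TaggedToken("C4", "MOLECULE", 0.60),           # hydrocarbon or plant type?
    TaggedToken("PMIP", "CAMPAIGN", 0.80),
]
print([t.text for t in filter_molecules(tokens, stoplist={"C4"})])
```

A threshold alone cannot resolve genuinely ambiguous terms such as C3 and C4, which is why the sketch also accepts a stoplist.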

Harvested Metadata vs
Documented Metadata
http://proj.badc.rl.ac.uk/pimms/blog/
CIM was designed to be populated by modellers, with the (probably over-simplistic) assumption
that if something isn't in the CIM document then it either isn't in the model or isn't relevant. But
CIM documents created by harvesting information from papers will naturally not cover
everything about a model, so missing information doesn't mean that those things weren't
included or aren't relevant.

PIMMS will need to describe different protocols for interpreting CIM documents depending on
how they were created, but we will also want to ensure that the CIM accounts for missing data
more intelligently in future releases.
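
One way to picture such a protocol is to make the interpretation of a missing property depend on how the document was created. The names `Provenance` and `interpret_missing` and the returned wording are assumptions for illustration; nothing here is part of the CIM.

```python
from enum import Enum


class Provenance(Enum):
    MODELLER = "modeller"    # document filled in by the modellers themselves
    HARVESTED = "harvested"  # document text-mined from a journal article


def interpret_missing(provenance):
    """Say what the absence of a property means for a given provenance."""
    if provenance is Provenance.MODELLER:
        # Original CIM assumption: absent means not in the model or not relevant.
        return "not in the model or not relevant"
    # Harvested documents only cover what the paper happened to mention.
    return "unknown: not mentioned in the source text"


print(interpret_missing(Provenance.MODELLER))
print(interpret_missing(Provenance.HARVESTED))
```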

In essence the difference between journal article descriptions and metadata documentation is
Narrative. Journal articles need to tell a story so the information they include is only that which
is relevant to the narrative, whereas metadata documentation is an attempt to include as much
as possible across the board. The general nature of metadata documentation is probably why it
has historically been perceived as such a boring task to complete.

PIMMS will make metadata documentation more fun by bringing back the Narrative. Once
PIMMS is established at an institution, users will be able to create generalised metadata having
only described those things that are relevant to the story of their experiment.
