Presentation at the OpenAIRE-COAR Conference: "Open Access Movement to Reality: Putting the Pieces Together", Athens - May 21-22, 2014.
Argo: a platform for interoperable and customisable text analytics, by Sophia Ananiadou - School of Computer Science, Director, National Centre for Text Mining, University of Manchester
1. Argo: a platform for interoperable and customisable text mining
Sophia Ananiadou
National Centre for Text Mining
School of Computer Science
The University of Manchester
2. Overview
• Sharing tools, resources and text mining workflows
• Challenges
• Interoperable infrastructure for processing and annotation
2Open AIRE-COAR ConferenceAnaniadou
3. NaCTeM
• 1st publicly funded national text mining centre
• Location: Manchester Institute of Biotechnology
• Phase I - Biology (2004-2008)
• Phase II - Biology, Medicine, Social Sciences (2008-2011)
• Phase III - Biology, Medicine, Humanities, Social Sciences; fully sustainable centre (2011- )
www.nactem.ac.uk
4. Challenges
Language Technology
Languages
English
French
German
Spanish
Portuguese
Italian
Polish
….
Chinese
Hindi
Urdu
Japanese
Korean….
Tasks
Translation
Information Extraction
Semantic Search
Question Answering
Sentiment Analysis
Summarization
Knowledge Discovery
….
Domains
Finance/Business
Health
Biology
Social Sciences
Humanities….
Text Types
Newswire
Scientific Literature
Full papers/abstracts
Twitter
Patents
Clinical records, EMR
Textbooks, monographs
Online forums….
Technology
Sentence Splitter
Paragraph Splitter
NP Chunkers
C-parser
D-parser
Semantic parser
NE recognizers
Relation recognizers
…….
Diversity of Languages
Diversity of Contexts
Diversity of Applications
TM Workflows
TM Modules
Shared!
6. Requirements from TM infrastructure
• Modularity of TM modules
• Interoperability among TM modules and resources
• Generic across different languages, domains, and text
types
– Adaptability
8. Example: extracting proteins, annotations
GENIA
PennBioIE
AIMed
GENETAG
Incompatibility
Type definitions
Texts
Problem: Inconsistency
9. The problem with incompatibility
• Difficult to evaluate NERs
Which NER is best for my task? (NER A and NER B, evaluated on Corpus C and Corpus D)
On one corpus: A: 93%, B: 36%. A is better than B.
On the other corpus: A: 63%, B: 90%. B is better than A.
Why are the results so different among different corpora and NERs?
10. Text mining workflows
• A pipeline that executes particular tools and resources in order
• Example: semantic search
• Various versions (language- or domain-specific) of basic components are needed for different applications and tasks
• Different workflows can be created, compared and evaluated by seamlessly “mixing and matching” various versions of components
PoS Tagger
Dictionary Lookup
NE Extraction
Chunking
Parsing
Semantic Query
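The "mix and match" idea can be sketched in a few lines of Python. This is only an illustration, not Argo or U-Compare code: the component functions (`whitespace_tokenize`, `toy_pos_tagger`, `dictionary_lookup`), the tiny lexicon and the `run_workflow` helper are all hypothetical names invented here. Each component reads a shared document structure and adds its own annotation layer, which is what lets one version of a component be swapped for another.

```python
from typing import Any, Callable, Dict, List

# A document is modelled as a dict holding the text plus annotation layers.
Document = Dict[str, Any]
Component = Callable[[Document], Document]


def whitespace_tokenize(doc: Document) -> Document:
    """Hypothetical tokenizer: split the text on whitespace."""
    doc["tokens"] = doc["text"].split()
    return doc


def toy_pos_tagger(doc: Document) -> Document:
    """Hypothetical PoS tagger: capitalised tokens get 'NNP', the rest 'X'."""
    doc["pos"] = ["NNP" if tok[:1].isupper() else "X" for tok in doc["tokens"]]
    return doc


def dictionary_lookup(doc: Document) -> Document:
    """Hypothetical dictionary lookup against a tiny, made-up lexicon."""
    lexicon = {"p53": "Protein", "glucose": "Chemical"}
    doc["entities"] = [
        (tok, lexicon[tok.lower()])
        for tok in doc["tokens"]
        if tok.lower() in lexicon
    ]
    return doc


def run_workflow(components: List[Component], text: str) -> Document:
    """Execute the components in order, passing the document along."""
    doc: Document = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


# Swapping any element of this list for another version gives a new workflow.
workflow = [whitespace_tokenize, toy_pos_tagger, dictionary_lookup]
result = run_workflow(workflow, "Glucose activates p53 signalling")
print(result["entities"])
```

Because every component has the same signature, replacing `toy_pos_tagger` with a domain-specific tagger changes one list entry and nothing else.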
11. Text mining workflows
Interoperability
Common Data Representation and Types
Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. (2011). U-Compare: a modular NLP workflow construction and evaluation system. IBM Journal of Research and Development.
12. Common Type System
• A common type system is required for complete interoperability
• Solution: Maintain local type systems and bridge them
via a sharable type system
A single common type system is almost impossible to impose on all developers.
U-Compare
Sharable Type System
Local Type System A Local Type System B
bridging bridging
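A rough sketch of the bridging idea follows. It assumes that each local type system can be related to the sharable one by a simple name mapping; the type names (`a:ProteinMention`, `b:GeneProduct`, `b:CellLine`, `shared:Protein`) and the `to_shared` helper are hypothetical and do not come from U-Compare.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    begin: int       # character offset where the annotation starts
    end: int         # character offset where the annotation ends
    type_name: str   # name of the type in some type system


# Hypothetical bridges from two local type systems to the sharable one.
BRIDGE_A: Dict[str, str] = {"a:ProteinMention": "shared:Protein"}
BRIDGE_B: Dict[str, str] = {"b:GeneProduct": "shared:Protein"}


def to_shared(annotations: List[Annotation], bridge: Dict[str, str]) -> List[Annotation]:
    """Rename local types to shared types; drop types with no shared counterpart."""
    return [
        Annotation(a.begin, a.end, bridge[a.type_name])
        for a in annotations
        if a.type_name in bridge
    ]


# Two tools annotate the same span using their own local type names.
from_tool_a = [Annotation(0, 3, "a:ProteinMention")]
from_tool_b = [Annotation(0, 3, "b:GeneProduct"), Annotation(10, 14, "b:CellLine")]

shared_a = to_shared(from_tool_a, BRIDGE_A)
shared_b = to_shared(from_tool_b, BRIDGE_B)

# Once bridged, the outputs of the two tools become directly comparable.
print(shared_a == shared_b)
```

Each developer keeps a local type system and maintains only one mapping, rather than agreeing on a single vocabulary with every other developer.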
14. U-Compare: Evaluate and Compare TM Workflows
[Figure: components from a library (Sentence Splitter A, Sentence Splitter B, POS tagger A, POS tagger B, NER) are combined in different ways into Workflow A, Workflow B and Workflow C, which are compared by F-Score A, F-Score B and F-Score C.]
Example components:
• Sentence detection (SD): UIMA SD, OpenNLP SD, GENIA SD
• Tokenization: UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer
• POS tagging: GENIA Tagger, Stepp Tagger, OpenNLP Tagger
• NER: ABNER, MedT-NER, GENIA Tagger as NER
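To give a sense of how many workflows such a library yields, the sketch below enumerates every combination of one component per stage. It assumes, purely for illustration, that all combinations of the components named on the slide are compatible; the `best_workflow` helper and its `score` argument are hypothetical, standing in for whatever evaluation (for example an F-score against a gold standard) the user supplies.

```python
from itertools import product
from typing import Callable, List, Tuple

# Component names as listed on the slide, one list per workflow stage.
sentence_detectors = ["UIMA SD", "OpenNLP SD", "GENIA SD"]
tokenizers = ["UIMA Tokenizer", "OpenNLP Tokenizer", "GENIA Tagger as Tokenizer"]
pos_taggers = ["GENIA Tagger", "Stepp Tagger", "OpenNLP Tagger"]
ner_tools = ["ABNER", "MedT-NER", "GENIA Tagger as NER"]

Workflow = Tuple[str, str, str, str]

# Every workflow picks exactly one component for each stage.
workflows: List[Workflow] = list(
    product(sentence_detectors, tokenizers, pos_taggers, ner_tools)
)
print(len(workflows))  # 3 * 3 * 3 * 3 = 81 candidate workflows


def best_workflow(candidates: List[Workflow],
                  score: Callable[[Workflow], float]) -> Workflow:
    """Return the candidate with the highest user-supplied score."""
    return max(candidates, key=score)
```

No scores are shown here because they would have to come from actually running each workflow on an annotated corpus.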
15. Argo
• Web-based application
• Interactive creation of workflows
• Cloud and high-performance computing
• Integrated TM/NLP processing system
• GUI for workflow creation
• Library of ready-to-use processing components
• Statistics, visualizations, developer APIs
• Supports UIMA
• http://argo.nactem.ac.uk
Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.
17. Processing Components
• Approaching 100 components (U-Compare)
– Additional 50 will be added soon
• META-NET
• Developed or co-developed by NaCTeM
– Planned: Make the library open to others to contribute
• Generic Listener component
– Developers can plug their own locally run UIMA component into a workflow in Argo
19. Workflows
• Users create workflows as block diagrams
• Workflows can be shared among users
– Read only
– Planned: Read & write
– Planned: downloadable workflows
• Workflows can be deployed as web services
– Plain text (input only), XMI, RDF, BioC
22. Sample Use Cases
1 Recognition of chemical entities (chemical NER)
2 Semi-automatic curation of metabolic pathways
3 Evaluation of inter-annotator agreement
4 Information extraction as a Web service
23. Use Case 1: Chemical NER
Supplies gold standard corpus
Removes gold annotations so that they can be created automatically
Combinations of syntactic and semantic components create annotations
Compares and reports precision, recall and F1 of the different branches against the gold standard corpus
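The comparison step can be illustrated with a small sketch. It assumes exact-match scoring, in which a predicted annotation counts as correct only if its offsets and type equal those of a gold annotation; the slide does not say which matching criterion Argo uses. The spans and the names `branch_a` and `branch_b` are made-up illustration data.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (begin offset, end offset, type)


def precision_recall_f1(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Score predicted annotations against gold ones using exact matching."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up gold standard and the output of two workflow branches.
gold = {(0, 7, "Chemical"), (20, 27, "Chemical"), (40, 48, "Chemical")}
branch_a = {(0, 7, "Chemical"), (20, 27, "Chemical")}
branch_b = {(0, 7, "Chemical"), (21, 27, "Chemical"),
            (40, 48, "Chemical"), (60, 64, "Chemical")}

for name, predicted in [("branch A", branch_a), ("branch B", branch_b)]:
    p, r, f = precision_recall_f1(predicted, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy data, branch A finds fewer entities but makes no errors, while branch B finds more but with a boundary error and a spurious mention, so the two branches trade precision against recall.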
24. Chemical Entity Recogniser
• Chemical model evaluated at BioCreative IV
CHEMDNER challenge
• The challenge
– Data: 10,000 manually annotated PubMed abstracts
– Task: automatically recognise names of chemical entities in text
25. Chemical Entity Recogniser
• Our solution
– Ranked unique mentions: ranked 1st out of 18 groups
– All mentions: ranked 3rd out of 19 groups
Subtask                   Precision %   Recall %   F-score %
Ranked unique mentions    91            85         88
All mentions              93            81         87
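As a quick consistency check, the reported F-scores can be recomputed from the precision and recall columns. This assumes the F-score is the balanced F1 (the harmonic mean of precision and recall); since the table values are rounded to whole percentages, agreement can only be expected after rounding.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-score %) as given in the table.
reported = {
    "Ranked unique mentions": (91, 85, 88),
    "All mentions": (93, 81, 87),
}

for subtask, (p, r, f) in reported.items():
    print(f"{subtask}: recomputed {f_score(p, r):.1f}, reported {f}")
```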
26. Use Case 2: Semi-automatic Curation –
Metabolic Pathways
Search for relevant documents
Manual correction of automatic annotations
NER for chemicals, genes, process indicators
Linking to ontologies: CTD, ChEBI, UniProt
Save results in various formats, e.g., RDF for querying and incorporation into databases
27. Manual Annotation Editor
Create new annotations by selecting text
Create, modify or delete annotations
Edit details of annotations
Open a graphical interface to link annotations to ontologies
29. Manual Annotation Editor: linking to ontologies
Automatic pre-selection can be modified by the user
Details show ontology entry webpage
30. Use Case 3: Information extraction as a Web service
Web service-enabled reader
Web service-enabled writer
31. Language Universal
• Reusable modules
• Generic TM modules: Competence
• Annotated Text, corpora: Performance
• Standards of Data Representation and Types for
Resources: Competence
• Dictionaries, Thesauri, Ontologies: Performance