Scalable Text Mining
Jee-Hyub Kim
Text-Mining Pipeline Builder
Literature Services Team
2 Feb 2016
A Text-Mining Pipeline
Contents
● Text-Mining Pipeline Crisis
● Session 1: Build Your Own Pipeline
● Session 2: Build Your Own Dictionary
● Wrap Up
A long list of requests
Use case | Semantic type | Dictionary type | Document type | Section | Metadata | Delivery method
OpenAIRE | accession numbers | pattern (e.g., [0-9][A-Za-z0-9]{3}) | patents | Title, Claim, Description, Abstract, Figure, Table | Pubyear, IPCR | summary table
ERC | grant identifiers | pattern | articles | Acknowledgements | - | search index
CTTV | gene, disease | term (e.g., IBD) | articles, abstracts | - | - | json
ELIXIR-EXCELERATE | resource names | term | articles | - | - | summary table
1000 Genomes | cell line names | pattern | articles | !Acknowledgements | - | REST API
Wikipedia | accession numbers | pattern | wikipages | - | - | summary table
KEW Garden | species names (multilingual) | term | articles | - | - | summary table
ChEMBL | resource name | term | articles | - | Author, Journal | summary table
Ensembl | genomic range | pattern | articles | - | - | summary table
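To make the "pattern" dictionary type concrete, here is a minimal Python sketch that applies the OpenAIRE example pattern to free text. The pattern itself comes from the table; the word-boundary anchors, the function name and the sample sentence are assumptions added for illustration.

```python
import re

# Pattern copied from the OpenAIRE row of the table.
# The \b word-boundary anchors are an added assumption, not part of the slide.
ACCESSION = re.compile(r"\b[0-9][A-Za-z0-9]{3}\b")


def find_candidates(text):
    """Return every substring that matches the accession-number pattern."""
    return [m.group(0) for m in ACCESSION.finditer(text)]


# A year matches the raw pattern as well as the accession-like token.
print(find_candidates("Structure 1ABC was deposited in 2016."))
```

A bare pattern like this also picks up four-digit years, which is one reason a pattern dictionary usually needs filtering rules on top.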
Scalable Text Mining
● For the last few years, we have been having a pipeline crisis!
● A long list of requests and our slow responses
○ Makes you unhappy.
● Even worse, it’s a long tail!
○ The same pipeline is never used for two requests.
○ Every time, we have to build a new pipeline.
○ We need a new approach to solve this crisis.
Objective
● We want to build a LEGO-like platform that helps you to
build your own text-mining pipeline and your own
dictionary.
A Key Block: Dictionary-Based Tagger
● Role: To identify names (e.g., proteins, species,
accession numbers, etc.)
● Dictionary-based approach for mining names.
○ Simple
○ Readable
○ Interactive
● Building a dictionary is a VERY iterative process
○ 20% for building an initial dictionary and the rest for
refining it.
● Good dictionaries are key to text-mining success
stories.
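A dictionary-based tagger can be sketched in a few lines of Python. This is only an illustration of the general idea, not the team's implementation: the two entries reuse terms and IDs that appear elsewhere in these slides, while the `Annotation` type, the function names and the longest-match-first strategy are assumptions.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int
    end: int
    text: str
    entry_id: str


# Toy dictionary: surface term -> entry ID (terms and IDs taken from the slides).
TERMS = {
    "Nutlin-3": "ChEBI:46742",
    "basal cell": "EFO_0001997",
}


def build_matcher(terms):
    # Longest terms first, so a longer name wins over a shorter one at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # (?<!\w) and (?!\w) stop a match from starting or ending inside a word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")


def tag(text, terms):
    """Return one Annotation per case-sensitive dictionary hit in text."""
    matcher = build_matcher(terms)
    return [Annotation(m.start(), m.end(), m.group(0), terms[m.group(0)])
            for m in matcher.finditer(text)]


for annotation in tag("Nutlin-3 affects basal cell growth.", TERMS):
    print(annotation)
```

Because the dictionary is plain data, refining it means editing entries and rerunning, which suits the iterative process described above.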
Agile Revision Process
Session 1
Build Your Own Pipeline
As …, I want a pipeline to do ...
Pipeline Stories
● CTTV
○ As a researcher, I want to find articles with
supporting evidence from drug discovery
● ERC
○ As a funder, I want to make funded articles more
searchable.
● ELIXIR-EXCELERATE
○ As a resource manager, I want to know the impact of
resources.
Second, Find & Describe Blocks You Need
When you want | You can use
to extract a sentence | Sentence splitter
to limit your mining to an article section | Section tagger
to identify disease names | Dictionary-based tagger
to identify database identifiers | Dictionary-based tagger
to find relations between genes and diseases | Relation extractor
to get some analytics | Summary table generator
to get article metadata | Europe PMC REST API
to produce text-mined data in RDF | RDF generator
Then, Build a Pipeline using Blocks
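One way to picture the LEGO-like composition is to treat each block as a function that takes a document and returns an enriched copy. The sketch below is hypothetical throughout: the block implementations are deliberately naive, and the function names, the document layout (a dict with a "text" key) and the entry ID are assumptions rather than the platform's actual interfaces.

```python
import re
from collections import Counter
from functools import reduce


def sentence_splitter(doc):
    # Naive split after ., ! or ? followed by whitespace; real splitters are more careful.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc["text"]) if s]
    return {**doc, "sentences": sentences}


def make_dictionary_tagger(terms):
    def tagger(doc):
        # One hit per (sentence, term) pair in which the term occurs.
        hits = [(term, entry_id)
                for sentence in doc["sentences"]
                for term, entry_id in terms.items()
                if term in sentence]
        return {**doc, "annotations": hits}
    return tagger


def summary_table(doc):
    return {**doc, "summary": Counter(doc["annotations"])}


def build_pipeline(*blocks):
    # Feed the output of each block into the next one, left to right.
    return lambda doc: reduce(lambda d, block: block(d), blocks, doc)


pipeline = build_pipeline(
    sentence_splitter,
    make_dictionary_tagger({"IBD": "hypothetical:0001"}),  # made-up entry ID
    summary_table,
)
result = pipeline({"text": "IBD is common. Genes affect IBD risk."})
print(result["summary"])
```

Swapping, adding or removing a block then only changes the argument list of `build_pipeline`.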
Session 2
Build Your Own Dictionary
Designing filtering rules
How to Revise a Dictionary?
● We want to build an expressive language for filtering.
● Global filtering rule
○ Term length > 2
○ Case sensitive
● Per-entry filtering rule
○ A term should be tagged when it is mentioned in the
Methods section.
○ A pattern should be tagged when it follows the term
“omim”.
● Blacklist: e.g., stop words
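The global rules and the blacklist could look roughly like this in Python. It is a sketch under assumptions: the blacklist contents, the function names, the toy dictionary IDs and the choice to compare blacklist entries in lower case are all illustrative and are not taken from the slides.

```python
# Hypothetical blacklist; a real one would hold a full stop-word list.
BLACKLIST = frozenset({"the", "and", "was"})


def keep_term(term, blacklist=BLACKLIST):
    """Apply the global rules: term length > 2 and not on the blacklist."""
    if len(term) <= 2:
        return False
    # Assumption: blacklist entries are stored in lower case.
    if term.lower() in blacklist:
        return False
    return True


def term_matches(term, token, case_sensitive=True):
    """Compare a dictionary term with a token, case-sensitively by default."""
    if case_sensitive:
        return term == token
    return term.lower() == token.lower()


def filter_dictionary(terms):
    """Drop every dictionary entry that fails the global rules."""
    return {term: entry_id for term, entry_id in terms.items() if keep_term(term)}


print(filter_dictionary({"IBD": "x:1", "of": "x:2", "The": "x:3"}))
```

Global rules of this kind are cheap to apply to the whole dictionary before tagging, whereas per-entry rules have to be checked against each individual mention.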
Per-Entry Rules
● A spreadsheet per entry
● Definitions
○ Context: should (not) be after a term.
○ Section: should (not) be mentioned in a section.
○ URI: check if http://www.ebi.ac.uk/efo/EFO_0001997 exists
Columns 1 to 4 are entry information; columns 5 to 7 are filtering rules.
Term/Pattern | Entry | ID | DB | Context | Section | URI
Pattern | HG[0-9]{5} | - | 1000 genomes | !(grant|fund) | !ACK | -
Term | basal cell | EFO_0001997 | efo | - | Methods | Yes
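The Context and Section columns can be read as small predicates attached to each entry. The Python sketch below encodes the two example rows that way. The "!" negation convention is taken from the table; the `EntryRule` class, the 30-character look-behind window used to decide what counts as "after", the case-insensitive context check and the sample sentences are assumptions. The URI check is left out because it needs network access.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryRule:
    pattern: str                   # regex for the term or pattern
    context: Optional[str] = None  # regex; a leading "!" means "must not follow"
    section: Optional[str] = None  # section label; a leading "!" means "must not be in"


def _check(rule_value, test):
    """Evaluate an optional rule, treating a leading '!' as negation."""
    if rule_value is None:
        return True
    negated = rule_value.startswith("!")
    outcome = test(rule_value[1:] if negated else rule_value)
    return not outcome if negated else outcome


def tag_with_rule(rule, text, section, window=30):
    """Return matches of rule.pattern that satisfy the context and section rules."""
    hits = []
    for m in re.finditer(rule.pattern, text):
        # Assumption: "after a term" means within `window` characters before the match.
        before = text[max(0, m.start() - window):m.start()]
        context_ok = _check(
            rule.context,
            lambda rx: re.search(rx, before, re.IGNORECASE) is not None)
        section_ok = _check(rule.section, lambda name: name == section)
        if context_ok and section_ok:
            hits.append(m.group(0))
    return hits


# The two example rows from the table above.
CELL_LINE = EntryRule(r"\bHG[0-9]{5}\b", context="!(grant|fund)", section="!ACK")
BASAL_CELL = EntryRule(r"\bbasal cell\b", section="Methods")

print(tag_with_rule(CELL_LINE, "DNA from HG00096 was sequenced.", "Methods"))
print(tag_with_rule(CELL_LINE, "Supported by grant HG00096.", "Methods"))
```

The second call returns no hits because the match follows "grant", which is the kind of false positive the Context column is meant to suppress.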
Analytics
● Summary table
PMCID | Term | ID | Frequency
PMCID4698870 | Nutlin-3 | ChEBI:46742 | 16
PMCID4698870 | cell cycle arrests | GO:0007050 | 6
● Top 100 frequent terms
Top | Name | Document Freq. | Collection Freq.
1 | protein | 678,987 | 1,823,783
2 | water | 563,234 | 1,233,332
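Both analytics tables can be derived from a flat list of annotations. The Python sketch below makes several assumptions: each annotation is a (PMCID, term, ID) tuple, document frequency means the number of documents that mention a term, collection frequency means the total number of mentions, and the ranking is by document frequency. The slides do not define these, and the sample data is made up.

```python
from collections import Counter


def summary_table(annotations):
    """One row per (PMCID, term, ID) with its mention count, most frequent first."""
    counts = Counter(annotations)
    return [(pmcid, term, entry_id, n)
            for (pmcid, term, entry_id), n in counts.most_common()]


def top_terms(annotations, n=100):
    """Rank terms by document frequency, then by collection frequency."""
    # Collection frequency: total number of mentions across all documents.
    collection_freq = Counter(term for _, term, _ in annotations)
    # Document frequency: number of distinct documents mentioning the term.
    document_freq = Counter(
        term for _, term in {(pmcid, term) for pmcid, term, _ in annotations})
    ranked = sorted(collection_freq,
                    key=lambda t: (-document_freq[t], -collection_freq[t], t))
    return [(rank, t, document_freq[t], collection_freq[t])
            for rank, t in enumerate(ranked[:n], start=1)]


# Made-up annotations: one tuple per mention.
SAMPLE = [
    ("PMC1", "protein", "X:1"),
    ("PMC1", "protein", "X:1"),
    ("PMC2", "protein", "X:1"),
    ("PMC1", "water", "X:2"),
]
print(summary_table(SAMPLE))
print(top_terms(SAMPLE))
```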
Spreadsheet for Filtering Rules
http://tinyurl.com/zlwbx2y
Wrap Up
● What is your pipeline story?
● Have you managed to create your own dictionary?
● What service blocks are missing?
● What should be the interfaces?
● How should we deliver?
