SlideShare uma empresa Scribd logo
1 de 151
Lars Juhl Jensen
Text mining and data integration
exponential growth
~45 seconds per paper
information retrieval
named entity recognition
information extraction
association networks
data integration
information retrieval
find the relevant papers
ad hoc retrieval
user-specified query
“yeast AND cell cycle”
PubMed
indexing
fast lookup
stemming
word endings
dynamic query expansion
MeSH terms
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-
dependent Swe1 hyperphosphorylation
and degradation
no tool will find that
named entity recognition
computer
as smart as a dog
teach it specific tricks
identify the concepts
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-
dependent Swe1 hyperphosphorylation
and degradation
comprehensive lexicon
CDC2
cyclin dependent kinase 1
orthographic variation
upper- and lower-case
CDC2
Cdc2
spaces and hyphens
cyclin dependent kinase 1
cyclin-dependent kinase 1
prefixes and postfixes
CDC2
hCDC2
“black list”
SDS
scalable implementation
>10 km
<10 hours
augmented browsing
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-
dependent Swe1 hyperphosphorylation
and degradation
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-
dependent Swe1 hyperphosphorylation
and degradation
Reflect
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
information extraction
formalize the facts
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-
dependent Swe1 hyperphosphorylation
and degradation
two approaches
co-mentioning
counting
within documents
within paragraphs
within sentences
co-mentioning score
NLP
Natural Language Processing
grammatical analysis
part-of-speech tagging
multiword detection
semantic tagging
sentence parsing
Gene and protein names
Cue words for entity
recognition
Verbs for relation extraction
[nxexpr The expression of
[nxgene the cytochrome
genes
[nxpg CYC1 and CYC7]]]
is controlled by
[nxpg HAP1]
extract stated facts
high precision
poor recall
text corpus
most use abstracts
few use full-text articles
no access
PDF files
layout-aware extraction
my corpus
~22 million abstracts
~4 million articles
association networks
guilt by association
STRING
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
computational predictions
gene fusion
Korbel et al., Nature Biotechnology, 2004
gene neighborhood
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
a real example
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
physical interactions
Jensen & Bork, Science, 2008
curated knowledge
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
hard work
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
data integration
general approach
suite of web resources
STITCH
STRING + 300k chemicals
Kuhn et al., Nucleic Acids Research, 2012
COMPARTMENTS
subcellular localization
compartments.jensenlab.org
TISSUES
tissue expression
tissues.jensenlab.org
DISEASES
disease genes
unification
curated knowledge
text mining
experimental data
computational predictions
common identifiers
quality scores
visualization
dissemination
web interfaces
evidence viewers
web services
diseases.jensenlab.org
bulk download
thank you!

Mais conteúdo relacionado

Mais procurados

The pragmatic text miner: From literature to electronic health records
The pragmatic text miner: From literature to electronic health recordsThe pragmatic text miner: From literature to electronic health records
The pragmatic text miner: From literature to electronic health records
Lars Juhl Jensen
 

Mais procurados (9)

Theory and practice of graphical population analysis
Theory and practice of graphical population analysisTheory and practice of graphical population analysis
Theory and practice of graphical population analysis
 
Explaining the assembly model
Explaining the assembly modelExplaining the assembly model
Explaining the assembly model
 
Ashg2015 schneider final
Ashg2015 schneider finalAshg2015 schneider final
Ashg2015 schneider final
 
The pragmatic text miner: From literature to electronic health records
The pragmatic text miner: From literature to electronic health recordsThe pragmatic text miner: From literature to electronic health records
The pragmatic text miner: From literature to electronic health records
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
agbt 2016 workshop church
agbt 2016 workshop churchagbt 2016 workshop church
agbt 2016 workshop church
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
 
Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRC
 

Destaque

Data mining query languages
Data mining query languagesData mining query languages
Data mining query languages
Marcy Morales
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
 

Destaque (12)

The Integration Issue
The Integration IssueThe Integration Issue
The Integration Issue
 
Data mining query languages
Data mining query languagesData mining query languages
Data mining query languages
 
IS Issue: IT Integration during M&A
IS Issue: IT Integration during M&AIS Issue: IT Integration during M&A
IS Issue: IT Integration during M&A
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Types of Data Processing
Types of Data ProcessingTypes of Data Processing
Types of Data Processing
 
Data Processing-Presentation
Data Processing-PresentationData Processing-Presentation
Data Processing-Presentation
 
DataPreProcessing
DataPreProcessing DataPreProcessing
DataPreProcessing
 
Slideshare ppt
Slideshare pptSlideshare ppt
Slideshare ppt
 

Semelhante a Text mining and data integration

Mining literature and medical records
Mining literature and medical recordsMining literature and medical records
Mining literature and medical records
Lars Juhl Jensen
 
Systems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems levelSystems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems level
Lars Juhl Jensen
 
Systems biology - Bioinformatics on complete biological systems
Systems biology - Bioinformatics on complete biological systemsSystems biology - Bioinformatics on complete biological systems
Systems biology - Bioinformatics on complete biological systems
Lars Juhl Jensen
 

Semelhante a Text mining and data integration (20)

Mining literature and medical records
Mining literature and medical recordsMining literature and medical records
Mining literature and medical records
 
Biomedical text mining
Biomedical text miningBiomedical text mining
Biomedical text mining
 
Text mining
Text miningText mining
Text mining
 
Text mining
Text miningText mining
Text mining
 
Integration of biomedical literature and databases
Integration of biomedical literature and databasesIntegration of biomedical literature and databases
Integration of biomedical literature and databases
 
Literature mining and large-scale data integration
Literature mining and large-scale data integrationLiterature mining and large-scale data integration
Literature mining and large-scale data integration
 
Integration of biomedical literature and databases
Integration of biomedical literature and databasesIntegration of biomedical literature and databases
Integration of biomedical literature and databases
 
Literature Mining and Systems Biology
Literature Mining and Systems BiologyLiterature Mining and Systems Biology
Literature Mining and Systems Biology
 
Biomedical literature mining (and why we really need open access)
Biomedical literature mining (and why we really need open access)Biomedical literature mining (and why we really need open access)
Biomedical literature mining (and why we really need open access)
 
Biological literature mining - from information retrieval to biological disco...
Biological literature mining - from information retrieval to biological disco...Biological literature mining - from information retrieval to biological disco...
Biological literature mining - from information retrieval to biological disco...
 
Biomedical literature mining
Biomedical literature miningBiomedical literature mining
Biomedical literature mining
 
STRING: Large-scale data and text mining
STRING: Large-scale data and text miningSTRING: Large-scale data and text mining
STRING: Large-scale data and text mining
 
Text mining for protein and small molecule relations
Text mining for protein and small molecule relationsText mining for protein and small molecule relations
Text mining for protein and small molecule relations
 
Network biology: Large-scale data and text mining
Network biology: Large-scale data and text miningNetwork biology: Large-scale data and text mining
Network biology: Large-scale data and text mining
 
Systems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems levelSystems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems level
 
The STRING database and related tools
The STRING database and related toolsThe STRING database and related tools
The STRING database and related tools
 
Large-scale integration of data and text
Large-scale integration of data and textLarge-scale integration of data and text
Large-scale integration of data and text
 
Systems biology - Bioinformatics on complete biological systems
Systems biology - Bioinformatics on complete biological systemsSystems biology - Bioinformatics on complete biological systems
Systems biology - Bioinformatics on complete biological systems
 
Information integration
Information integrationInformation integration
Information integration
 
STRING: protein association networks
STRING: protein association networksSTRING: protein association networks
STRING: protein association networks
 

Mais de Lars Juhl Jensen

Mais de Lars Juhl Jensen (20)

One tagger, many uses: Illustrating the power of dictionary-based named entit...
One tagger, many uses: Illustrating the power of dictionary-based named entit...One tagger, many uses: Illustrating the power of dictionary-based named entit...
One tagger, many uses: Illustrating the power of dictionary-based named entit...
 
One tagger, many uses: Simple text-mining strategies for biomedicine
One tagger, many uses: Simple text-mining strategies for biomedicineOne tagger, many uses: Simple text-mining strategies for biomedicine
One tagger, many uses: Simple text-mining strategies for biomedicine
 
Extract 2.0: Text-mining-assisted interactive annotation
Extract 2.0: Text-mining-assisted interactive annotationExtract 2.0: Text-mining-assisted interactive annotation
Extract 2.0: Text-mining-assisted interactive annotation
 
Network visualization: A crash course on using Cytoscape
Network visualization: A crash course on using CytoscapeNetwork visualization: A crash course on using Cytoscape
Network visualization: A crash course on using Cytoscape
 
STRING & STITCH : Network integration of heterogeneous data
STRING & STITCH: Network integration of heterogeneous dataSTRING & STITCH: Network integration of heterogeneous data
STRING & STITCH : Network integration of heterogeneous data
 
Biomedical text mining: Automatic processing of unstructured text
Biomedical text mining: Automatic processing of unstructured textBiomedical text mining: Automatic processing of unstructured text
Biomedical text mining: Automatic processing of unstructured text
 
Medical network analysis: Linking diseases and genes through data and text mi...
Medical network analysis: Linking diseases and genes through data and text mi...Medical network analysis: Linking diseases and genes through data and text mi...
Medical network analysis: Linking diseases and genes through data and text mi...
 
Network Biology: A crash course on STRING and Cytoscape
Network Biology: A crash course on STRING and CytoscapeNetwork Biology: A crash course on STRING and Cytoscape
Network Biology: A crash course on STRING and Cytoscape
 
Cellular networks
Cellular networksCellular networks
Cellular networks
 
Cellular Network Biology: Large-scale integration of data and text
Cellular Network Biology: Large-scale integration of data and textCellular Network Biology: Large-scale integration of data and text
Cellular Network Biology: Large-scale integration of data and text
 
Statistics on big biomedical data: Methods and pitfalls when analyzing high-t...
Statistics on big biomedical data: Methods and pitfalls when analyzing high-t...Statistics on big biomedical data: Methods and pitfalls when analyzing high-t...
Statistics on big biomedical data: Methods and pitfalls when analyzing high-t...
 
STRING & related databases: Large-scale integration of heterogeneous data
STRING & related databases: Large-scale integration of heterogeneous dataSTRING & related databases: Large-scale integration of heterogeneous data
STRING & related databases: Large-scale integration of heterogeneous data
 
Tagger: Rapid dictionary-based named entity recognition
Tagger: Rapid dictionary-based named entity recognitionTagger: Rapid dictionary-based named entity recognition
Tagger: Rapid dictionary-based named entity recognition
 
Network Biology: Large-scale integration of data and text
Network Biology: Large-scale integration of data and textNetwork Biology: Large-scale integration of data and text
Network Biology: Large-scale integration of data and text
 
Medical text mining: Linking diseases, drugs, and adverse reactions
Medical text mining: Linking diseases, drugs, and adverse reactionsMedical text mining: Linking diseases, drugs, and adverse reactions
Medical text mining: Linking diseases, drugs, and adverse reactions
 
Network biology: Large-scale integration of data and text
Network biology: Large-scale integration of data and textNetwork biology: Large-scale integration of data and text
Network biology: Large-scale integration of data and text
 
Medical data and text mining: Linking diseases, drugs, and adverse reactions
Medical data and text mining: Linking diseases, drugs, and adverse reactionsMedical data and text mining: Linking diseases, drugs, and adverse reactions
Medical data and text mining: Linking diseases, drugs, and adverse reactions
 
Cellular Network Biology
Cellular Network BiologyCellular Network Biology
Cellular Network Biology
 
Network biology: Large-scale integration of data and text
Network biology: Large-scale integration of data and textNetwork biology: Large-scale integration of data and text
Network biology: Large-scale integration of data and text
 
Biomarker bioinformatics: Network-based candidate prioritization
Biomarker bioinformatics: Network-based candidate prioritizationBiomarker bioinformatics: Network-based candidate prioritization
Biomarker bioinformatics: Network-based candidate prioritization
 

Último

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 

Último (20)

Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 

Text mining and data integration