Stefan Geißler, Kairntech, AI-SDV, Oct 2020
From “Knowledge Acquisition Bottleneck” to
Knowledge Firehose
Kairntech?
• Software and service company with a focus on NLP and AI for industry use cases
• Focus on making powerful ML approaches accessible to domain experts (not just programmers and data scientists)
• Founded in December 2018, HQ in Grenoble, France
• 20+ years of experience in the field (Xerox, IBM, TEMIS, …)
• We’ve been attending the SDV for many years; it is a pleasure to be ‘here’ again :-)
Kairntech @ SDV: 2019 vs 2020
2019:
Introducing ML-powered creation of
models by simple and fast annotation of
content by domain experts
2020:
Adding broad entity annotation
with existing vocabularies
Introduction
• Finding and extracting concepts and named entities is an important ingredient in many document analysis processes
• Today broad public resources exist that can be used for this purpose
• Required: Turning these resources into high-quality, large-scale services
• We’ll describe challenges and solutions
Enriching documents with world knowledge
Requirements:
• Broad:
• Many topics: Life Sciences, IT, Business, Legal, General Interest, …
• Multilingual
• Up-to-date
• Accurate
• Entity extraction is more than just matching a string
• Scoring: Assign a confidence value
• Typing: Distinguish places, persons, substances, body parts, diseases, …
• Disambiguating: Select the appropriate meaning among several choices
• Linking: Connect entities with background information
• High-throughput
Our choice: Wikidata
• Community effort: wikidata.org
• Designed as a foundation for other projects, among them Wikipedia
• ~90 million concepts (and growing)
• Wikidata is a superset of many domain-specific vocabularies:
• MeSH, GeoNames, DrugBank, …
• … and contains identifiers/links to these
• CC license (CC0)
• Why Wikidata? (There are other large public vocabularies: DBpedia, Freebase, YAGO, OpenCyc)
Wikidata
• Wikidata is ‘FAIR’
• FAIR data: ‘Findable, Accessible, Interoperable, and Reusable’
• Cf. Waagmeester et al. 2019: ‘Wikidata as a FAIR knowledge graph for the life sciences’, doi: https://doi.org/10.1101/799684
• Wikidata compares favorably on a range of >30 criteria against other public knowledge graphs:
• Cf. Färber et al. 2017: “Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO”, doi: https://doi.org/10.3233/SW-170275
• A working open-source entity recognition package exists (https://github.com/kermitt2/entity-fishing), incidentally written by Kairntech’s Chief ML Expert Patrice Lopez
Wikidata … some details
• Wikidata is available as data dumps
• English alone: ~13.5 GB compressed
• Updated dumps are published regularly (often even weekly).
• 90 million entities and (as of 2019) 750 million statements
• By default Kairntech considers EN, DE, FR, ES, IT: ~37 million entities
• Transformation of the dumps into the content of our service at Kairntech: ~2 days of compilation
• Wikidata entities can be annotated with hundreds of specific properties. E.g. property P351 links a gene with its identifier in Entrez Gene
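To make the dump and property bullets above concrete, here is a minimal Python sketch of how a Wikidata JSON dump could be streamed, filtered by label language and queried for an external identifier such as P351. It is not Kairntech's compilation pipeline. It assumes the public JSON dump layout (a JSON array with one entity per line), and the two sample entities, their IDs and the value "0000" are made up for illustration.

```python
import json
from typing import Iterable, Iterator, List, Set

# Languages named on the slide (ISO codes; an assumption about how they map).
LANGUAGES: Set[str] = {"en", "de", "fr", "es", "it"}


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a Wikidata JSON dump.

    The dump is a JSON array with one entity per line, so the opening and
    closing brackets are skipped and the trailing comma is stripped.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def has_label_in(entity: dict, languages: Set[str]) -> bool:
    """True if the entity has a label in at least one of the languages."""
    return any(lang in languages for lang in entity.get("labels", {}))


def string_values(entity: dict, prop: str) -> List[str]:
    """Collect plain string values (e.g. external IDs) for one property."""
    values = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            value = snak["datavalue"]["value"]
            if isinstance(value, str):
                values.append(value)
    return values


# Made-up two-entity dump; IDs and values are placeholders, not real data.
SAMPLE_DUMP = [
    "[",
    json.dumps({
        "type": "item",
        "id": "Q_EXAMPLE_GENE",
        "labels": {"en": {"language": "en", "value": "example gene"}},
        "claims": {"P351": [{"mainsnak": {
            "snaktype": "value",
            "property": "P351",
            "datavalue": {"value": "0000", "type": "string"},
        }}]},
    }) + ",",
    json.dumps({
        "type": "item",
        "id": "Q_EXAMPLE_OTHER",
        "labels": {"ja": {"language": "ja", "value": "example"}},
        "claims": {},
    }),
    "]",
]

kept = [e for e in iter_entities(SAMPLE_DUMP) if has_label_in(e, LANGUAGES)]
for entity in kept:
    print(entity["id"], string_values(entity, "P351"))
```

For a real dump, the same iterator could be fed from a compressed file, for example with bz2.open(path, "rt", encoding="utf-8"), so that the multi-gigabyte file never has to be loaded into memory at once.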
Wikidata … sample life science use case
• Example taken from https://www.biorxiv.org/content/10.1101/799684v1.full.pdf
“consider a pulmonologist interested in identifying
candidate chemical compounds for testing in
disease models. […] She may start by identifying
genes with a genetic association to any respiratory
disease, with a particular interest in genes that
encode membrane-bound proteins. She may then
look for chemical compounds that either directly
inhibit those proteins, or finding none, compounds that
inhibit another protein in the same pathway. Finally
she may specifically filter for proteins containing a
serine-threonine kinase domain”
SELECT DISTINCT ?compound ?compoundLabel
WHERE
{
  # gene has genetic association with a respiratory disease
  ?gene wdt:P31 wd:Q7187 .
  ?gene wdt:P2293 ?diseaseGA .
  ?diseaseGA wdt:P279* wd:Q3286546 .
  # gene product is localized to the membrane
  ?gene wdt:P688 ?protein .
  ?protein wdt:P681 ?cc .
  ?cc wdt:P279*|wdt:P361* wd:Q14349455 .
  # gene is involved in a pathway with another gene ("gene2")
  ?pathway wdt:P31 wd:Q4915012 ;
           wdt:P527 ?gene ;
           wdt:P527 ?gene2 .
  ?gene2 wdt:P31 wd:Q7187 .
  # gene2 product has a Ser/Thr protein kinase domain AND known enzyme inhibitor
  ?gene2 wdt:P688 ?protein2 .
  ?protein2 wdt:P129 ?compound ;
            wdt:P527 wd:Q24787419 ;
            p:P129 ?s2 .
  ?s2 ps:P129 ?cp2 .
  ?compound wdt:P31 wd:Q11173 .
  FILTER EXISTS { ?s2 pq:P366 wd:Q427492 . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
Wikidata … sample life science use case
• Running the query on https://query.wikidata.org/
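Besides the web interface, a query can also be sent to the public SPARQL endpoint from a script. The following is a minimal sketch using only the Python standard library. The User-Agent string and the helper names are assumptions for illustration, and the query is a deliberately small one (items with an Entrez Gene ID), not the pulmonologist query from the slide.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint behind https://query.wikidata.org/
ENDPOINT = "https://query.wikidata.org/sparql"

# Small illustrative query: up to five items that carry an Entrez Gene ID.
QUERY = """
SELECT ?gene ?geneLabel ?entrez WHERE {
  ?gene wdt:P351 ?entrez .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""


def build_request(query: str,
                  user_agent: str = "example-sparql-client/0.1") -> urllib.request.Request:
    """Build a GET request for the endpoint; nothing is sent yet."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={
            "User-Agent": user_agent,  # hypothetical client name
            "Accept": "application/sparql-results+json",
        },
    )


def run_query(query: str, timeout: float = 60.0) -> list:
    """Send the query and return the result rows (needs network access)."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


def values(bindings: list, variable: str) -> list:
    """Extract the plain values of one variable from the result rows."""
    return [row[variable]["value"] for row in bindings if variable in row]


if __name__ == "__main__":
    # Only the URL is printed here; call run_query(QUERY) to hit the network.
    print(build_request(QUERY).full_url)
```

Calling values(run_query(QUERY), "geneLabel") would then return the English labels of the matched items. The longer query from the slide can be passed to run_query in the same way.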
Wikidata @ Kairntech
Three scenarios:
1. Annotate content with entities and concepts from a wide variety of
types
2. Let annotation with Wikidata create suggestions to be reviewed
and refined by the users
• Example: Annotate documents on clinical trials with “Diseases”.
• User loops through the matches and marks each one as an “Adverse Event” or not.
3. Let Wikidata annotation enrich the input for ML algorithms.
1. Wikidata annotation
Scenario:
• Using the annotation service (REST API) to enrich document content
• Information is typed, scored, disambiguated and linked
{
  "start" : 236,
  "end" : 251,
  "labelName" : "misc",
  "score" : 1.0,
  "properties" : {
    "wikidataId" : "Q6271957",
    "preferredTerm" : "Jonah crab",
    "wikipediaExternalRef" : 6098993
  },
  "text" : "Cancer borealis",
  "label" : "Wikidata Concept"
}
Left: REST API annotation result
Right: DB record (with links into Freebase, NCBI, SeaLifeBase and many others)
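A client consuming such a record might turn it into a typed object and derive the link into Wikidata from the identifier. The sketch below uses the field names exactly as they appear in the sample record above. The class, the helper names and the score threshold are assumptions for illustration, not part of the Kairntech API.

```python
import json
from dataclasses import dataclass
from typing import List

# The sample record from the slide, verbatim.
RECORD = """
{
  "start" : 236,
  "end" : 251,
  "labelName" : "misc",
  "score" : 1.0,
  "properties" : {
    "wikidataId" : "Q6271957",
    "preferredTerm" : "Jonah crab",
    "wikipediaExternalRef" : 6098993
  },
  "text" : "Cancer borealis",
  "label" : "Wikidata Concept"
}
"""


@dataclass
class Annotation:
    start: int
    end: int
    text: str
    score: float
    wikidata_id: str
    preferred_term: str

    @property
    def wikidata_url(self) -> str:
        """Link to the concept page on wikidata.org."""
        return f"https://www.wikidata.org/wiki/{self.wikidata_id}"


def parse_annotation(record: dict) -> Annotation:
    """Map one JSON record onto the Annotation class."""
    props = record["properties"]
    return Annotation(
        start=record["start"],
        end=record["end"],
        text=record["text"],
        score=record["score"],
        wikidata_id=props["wikidataId"],
        preferred_term=props["preferredTerm"],
    )


def confident(annotations: List[Annotation], threshold: float = 0.5) -> List[Annotation]:
    """Keep annotations at or above a score threshold (value is an assumption)."""
    return [a for a in annotations if a.score >= threshold]


annotation = parse_annotation(json.loads(RECORD))
print(annotation.text, "->", annotation.preferred_term, annotation.wikidata_url)
```

The offsets are consistent with the matched string: end minus start equals the length of "Cancer borealis".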
1. Wikidata annotation
• Practically every non-trivial thesaurus is massively ambiguous: Labels may have different meanings
• When to pick which meaning?
• Example “NHL”
• Disambiguation depending on context: does “NHL” mean “Non-Hodgkin lymphoma” or the “National Hockey League”?
• Entity Extraction inside Kairntech disambiguates these and many others, based on automatically learned context information (no manual maintenance required)
• Linking: Connect the occurrence of the extracted terms with background information from Wikidata or client-specific knowledge
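The idea of context-dependent disambiguation can be illustrated with a toy word-overlap scorer. This is explicitly not the Kairntech approach, which relies on automatically learned context information. The hand-written context words below are hypothetical and only stand in for what a trained model would learn.

```python
import re
from typing import Optional, Set

# Hypothetical, hand-picked context words per sense (a learned model would
# replace this table entirely).
CANDIDATES = {
    "NHL": {
        "Non-Hodgkin lymphoma": {"lymphoma", "patients", "chemotherapy", "tumor", "diagnosis"},
        "National Hockey League": {"hockey", "season", "team", "playoffs", "goalie"},
    }
}


def tokenize(text: str) -> Set[str]:
    """Lower-cased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(mention: str, context: str) -> Optional[str]:
    """Pick the sense whose context words overlap most with the text.

    Returns None when no sense has any overlap, i.e. the toy scorer
    has no evidence either way.
    """
    tokens = tokenize(context)
    scores = {
        sense: len(words & tokens)
        for sense, words in CANDIDATES[mention].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("NHL", "Patients with NHL received chemotherapy after diagnosis."))
print(disambiguate("NHL", "The NHL season ended when the team lost in the playoffs."))
```

The same mention resolves to different concepts depending on the surrounding words, which is the behavior the slide describes. A production system would score candidates with learned representations instead of fixed word lists.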
1. Wikidata annotation (more examples)
What do Kairntech clients say?
Olivier Deguernel (Sealk.co)
“Our processes require the enrichment of document content with a wide range of named entities. We have analysed the available APIs on the market and have decided to integrate the Kairntech Named-Entity Extraction API into our offering. The clear API, the superior quality and the wealth of information returned by the API made it a valuable completion of our processes.”
2. Wikidata supporting the manual annotation
Scenario:
• Train a model to distinguish, in clinical trials, “Adverse Events” from the disease that is the subject of the trial
• Requires annotating a training corpus
• Normally that means manually scanning the documents, finding occurrences of a disease and assigning each one to this or that type
• … can be lengthy …
• Instead: Let the Wikidata annotator suggest all occurring diseases. The user only needs to review these.
• Available for thousands of entity types: Diseases, Substances, Car Models, Amino Acids, …
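The suggest-then-review workflow above can be sketched in a few lines. In this sketch a plain string matcher stands in for the Wikidata annotator, and the term list, the sample sentence and the label "Trial Disease" are assumptions for illustration. Only "Adverse Event" comes from the slides.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Suggestion:
    start: int
    end: int
    text: str
    label: Optional[str] = None  # stays None until a user has reviewed it


def suggest_diseases(text: str, disease_terms: List[str]) -> List[Suggestion]:
    """Stand-in for the Wikidata annotator: case-insensitive term matching."""
    suggestions = []
    for term in disease_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            suggestions.append(Suggestion(match.start(), match.end(), match.group(0)))
    return sorted(suggestions, key=lambda s: s.start)


def review(suggestion: Suggestion, is_adverse_event: bool) -> Suggestion:
    """Record the user's decision on one suggested disease mention."""
    suggestion.label = "Adverse Event" if is_adverse_event else "Trial Disease"
    return suggestion


TEXT = "Patients with asthma were enrolled; two reported headache."
suggestions = suggest_diseases(TEXT, ["asthma", "headache"])

# Simulated user decisions, one per suggestion in document order.
decisions = [False, True]
reviewed = [review(s, d) for s, d in zip(suggestions, decisions)]

for s in reviewed:
    print(f"{s.text!r} [{s.start}:{s.end}] -> {s.label}")
```

The reviewed suggestions can then serve as training examples, so the user's effort is limited to one decision per pre-annotated mention instead of scanning the full text.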
3. Wikidata information enriching DL training?
Current best practice:
• Input to learning is a sequence of token vectors.
• Vectors are fed into an embedding layer that expresses contextual semantic information drawn from massive amounts of text data (options in the software are currently BERT and ELMo)
• Embeddings are able to express impressive amounts of semantic information:
• Example: Most similar tokens to “cytosine”: Thymine, Uracil, Adenine, Cytidine, Nucleotides, Pyrimidine, …
• Evidence from the literature suggests that adding explicit semantic information from large taxonomies may not necessarily improve on the state of the art for embedding-powered Entity Recognition
• Raiman & Raiman 2018: “DeepType: Multilingual Entity Linking by Neural Type System Evolution”, AAAI 2018, https://arxiv.org/pdf/1802.01021.pdf
• Investigation in our team is under way. This may turn out to be further evidence for the massive benefits of large pretrained language models such as BERT compared to explicit knowledge.
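The "most similar tokens" example above rests on nearest-neighbor search in embedding space, usually with cosine similarity. The sketch below shows that computation on made-up three-dimensional vectors. Real BERT or ELMo embeddings have hundreds of dimensions, and the numbers here are not taken from any model.

```python
import math
from typing import Dict, List, Tuple

# Made-up toy vectors for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "cytosine": [0.90, 0.80, 0.10],
    "thymine": [0.85, 0.75, 0.15],
    "uracil": [0.80, 0.90, 0.20],
    "hockey": [0.10, 0.00, 0.95],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(token: str, vectors: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k tokens closest to `token`, best first."""
    query = vectors[token]
    scored = [
        (other, cosine(query, vector))
        for other, vector in vectors.items()
        if other != token
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


for name, score in most_similar("cytosine", TOY_VECTORS):
    print(f"{name}: {score:.3f}")
```

With these toy values the two other nucleobases rank well above the unrelated token, mirroring on a small scale the kind of neighborhood the slide reports for real embeddings.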
Summary
At Kairntech we combine, in one intuitive environment, the ability to
• Learn powerful models for your own specific analysis demands
• With analysis based on vast existing knowledge on just about any domain
Conclusion
1. Document analysis tasks benefit from World Knowledge
2. Wikidata is a powerful resource for World Knowledge
3. Annotation with Wikidata knowledge is an integrated part of Kairntech software, available off-the-shelf
4. We are looking forward to hearing about your analysis requirements
Thank you!
info@kairntech.com
www.kairntech.com

AI-SDV 2020: Combining Knowledge and Machine Learning for the Analysis of Scientific Documents Stefan Geißler (Consultant, Germany)

  • 1. Stefan Geißler, Kairntech, AI-SDV, Oct 2020 From “Knowledge Acquisition Bottleneck” to Knowledge Firehose
  • 2. Kairntech?
      • Software & service company with a focus on NLP & AI for industry use cases
      • Focus on making powerful ML approaches accessible for domain experts (not just programmers and data scientists)
      • Created in Dec 2018, HQ in Grenoble, France
      • 20+ years of experience in the field (Xerox, IBM, TEMIS, …)
      • We’ve been attending the SDV for many years; it is a pleasure to be ‘here’ again
  • 3. Kairntech @ SDV: 2019 vs 2020
      • 2019: Introducing ML-powered creation of models by simple and fast annotation of content by domain experts
      • 2020: Adding broad entity annotation with existing vocabularies
  • 4. Introduction
      • Finding and extracting concepts and named entities is an important ingredient in many document analysis processes
      • Today broad public resources exist that can be used for this purpose
      • Required: Turning these resources into high-quality, large-scale services
      • We’ll describe challenges and solutions
  • 5. Enriching documents with world knowledge
      Requirements:
      • Broad:
          • Many topics: Life Sciences, IT, Business, Legal, General Interest, …
          • Multilingual
          • Up-to-date
      • Accurate:
          • Entity extraction is more than just matching a string
          • Scoring: Assign a confidence value
          • Typing: Distinguish places, persons, substances, body parts, diseases, …
          • Disambiguating: Select the appropriate meaning among several choices
          • Linking: Connect entities with background information
      • High-throughput
  • 6. Our choice: Wikidata
      • Community effort: wikidata.org
      • Designed as a foundation for other projects, among them Wikipedia
      • ~90 million concepts (and growing)
      • Wikidata is a superset of many domain-specific vocabularies:
          • MeSH, Geonames, DrugBank, …
          • … and contains identifiers/links to these
      • CC License
      • Why Wikidata? (There are other large public vocabularies: DBpedia, Freebase, YAGO, OpenCyc)
  • 7. Wikidata
      • Wikidata is ‘FAIR’
          • FAIR data: ‘Findable, Accessible, Interoperable, and Reusable’
          • Cf. Waagmeester et al. 2019: ‘Wikidata as a FAIR knowledge graph for the life sciences’, doi: https://doi.org/10.1101/799684
      • Wikidata compares favorably on a range of >30 criteria against other public knowledge graphs:
          • Cf. Färber et al. 2017: ‘Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO’, doi: https://doi.org/10.3233/SW-170275
      • An open source software package (working entity recognition software) exists (https://github.com/kermitt2/entity-fishing), incidentally written by Kairntech’s Chief ML Expert Patrice Lopez
  • 8. Wikidata … some details
      • Wikidata is available as data dumps
          • English alone: ~13.5 GB compressed
          • Updated dumps are published regularly (often even weekly)
      • 90 million entities and (as of 2019) 750 million statements
      • By default Kairntech considers EN, DE, FR, SP, IT: ~37 million entities
      • Transformation of the dumps into the content of our service at Kairntech: ~2 days of compilation
      • Wikidata entities can be annotated with hundreds of specific properties. E.g. property P351 links a gene with its identifier in Entrez Gene
  • 9. Wikidata … sample life science use case
      • Example taken from https://www.biorxiv.org/content/10.1101/799684v1.full.pdf
      “consider a pulmonologist interested in identifying candidate chemical compounds for testing in disease models. […] She may start by identifying genes with a genetic association to any respiratory disease, with a particular interest in genes that encode membrane-bound proteins. She may then look for chemical compounds that either directly inhibit those proteins, or finding none, compounds that inhibit another protein in the same pathway. Finally she may specifically filter for proteins containing a serine-threonine kinase domain”

      SELECT DISTINCT ?compound ?compoundLabel where {
        # gene has genetic association with a respiratory disease
        ?gene wdt:P31 wd:Q7187 .
        ?gene wdt:P2293 ?diseaseGA .
        ?diseaseGA wdt:P279* wd:Q3286546 .
        # gene product is localized to the membrane
        ?gene wdt:P688 ?protein .
        ?protein wdt:P681 ?cc .
        ?cc wdt:P279*|wdt:P361* wd:Q14349455 .
        # gene is involved in a pathway with another gene ("gene2")
        ?pathway wdt:P31 wd:Q4915012 ;
                 wdt:P527 ?gene ;
                 wdt:P527 ?gene2 .
        ?gene2 wdt:P31 wd:Q7187 .
        # gene2 product has a Ser/Thr protein kinase domain AND known enzyme inhibitor
        ?gene2 wdt:P688 ?protein2 .
        ?protein2 wdt:P129 ?compound ;
                  wdt:P527 wd:Q24787419 ;
                  p:P129 ?s2 .
        ?s2 ps:P129 ?cp2 .
        ?compound wdt:P31 wd:Q11173 .
        FILTER EXISTS { ?s2 pq:P366 wd:Q427492 . }
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
      }
  • 10. Wikidata … sample life science use case
      • Running the query on https://query.wikidata.org/
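A query like the one on slide 9 can also be sent to the public endpoint from a script instead of the web form. The following is only a minimal sketch using the Python standard library. The short query (genes together with their Entrez Gene ID, P351), the helper names `build_request`, `rows` and `run`, and the User-Agent string are illustrative assumptions, not part of the talk.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint behind https://query.wikidata.org/
ENDPOINT = "https://query.wikidata.org/sparql"

# Deliberately small illustrative query (NOT the one from the slide):
# five genes (Q7187) together with their Entrez Gene ID (P351).
QUERY = """
SELECT ?gene ?geneLabel ?entrez WHERE {
  ?gene wdt:P31 wd:Q7187 ;
        wdt:P351 ?entrez .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""


def build_request(query, user_agent="example-sparql-client/0.1"):
    """Build a GET request asking the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={"User-Agent": user_agent},  # hypothetical client name
    )


def rows(result):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def run(query):
    """Send the query over the network and return the flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as resp:
        return rows(json.load(resp))
```

Calling `run(QUERY)` needs network access and returns one dictionary per result row. The longer query from slide 9 can be passed to `run` in the same way.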
  • 11. Wikidata @ Kairntech
      Three scenarios:
      1. Annotate content with entities and concepts from a wide variety of types
      2. Let annotation with Wikidata create suggestions to be reviewed and refined by the users
          • Example: Annotate documents on clinical trials with “Diseases”.
          • User loops through matches and specifies for each one whether it is an “Adverse Event” or not.
      3. Let Wikidata annotation enrich the input for ML algorithms.
  • 12. 1. Wikidata annotation
      Scenario:
      • Using the annotation service (REST API) to enrich document content
      • Information is typed, scored, disambiguated and linked

      {
        "start" : 236,
        "end" : 251,
        "labelName" : "misc",
        "score" : 1.0,
        "properties" : {
          "wikidataId" : "Q6271957",
          "preferredTerm" : "Jonah crab",
          "wikipediaExternalRef" : 6098993
        },
        "text" : "Cancer borealis",
        "label" : "Wikidata Concept"
      }

      [Figure labels: REST API · Annotation Result · DB record (with links into Freebase, NCBI, SeaLifeBase and many others)]
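To make the annotation record on slide 12 concrete, here is a minimal sketch of how a client could read one such record. Only the JSON field names come from the slide. The `Annotation` class, the `parse_annotation` helper and the Wikidata URL pattern used for linking are assumptions made for illustration, not the Kairntech client API.

```python
import json
from dataclasses import dataclass

# The record shown on the slide, reproduced as a JSON string.
SAMPLE = """
{
  "start": 236, "end": 251, "labelName": "misc", "score": 1.0,
  "properties": {
    "wikidataId": "Q6271957",
    "preferredTerm": "Jonah crab",
    "wikipediaExternalRef": 6098993
  },
  "text": "Cancer borealis",
  "label": "Wikidata Concept"
}
"""


@dataclass
class Annotation:
    """Hypothetical container for one annotation record."""
    start: int            # character offset where the mention begins
    end: int              # character offset just past the mention
    text: str             # surface form found in the document
    score: float          # confidence assigned by the service
    wikidata_id: str      # disambiguated Wikidata identifier
    preferred_term: str   # canonical name of the linked entity

    @property
    def wikidata_url(self):
        # Linking: follow the identifier to the public Wikidata page.
        return f"https://www.wikidata.org/wiki/{self.wikidata_id}"


def parse_annotation(record):
    """Turn one decoded JSON record into an Annotation."""
    props = record.get("properties", {})
    return Annotation(
        start=record["start"],
        end=record["end"],
        text=record["text"],
        score=record["score"],
        wikidata_id=props.get("wikidataId", ""),
        preferred_term=props.get("preferredTerm", ""),
    )


annotation = parse_annotation(json.loads(SAMPLE))
```

Here `annotation.preferred_term` gives the canonical name for the mention "Cancer borealis", and `annotation.wikidata_url` points at the linked Wikidata entry.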
  • 13. 1. Wikidata annotation
      • Practically every non-trivial thesaurus is massively ambiguous: labels may have different meanings
          • When to pick which meaning?
          • Example: “NHL”
      • Disambiguation depends on context: is “NHL” the “Non-Hodgkin lymphoma” or the “National Hockey League”?
      • Entity extraction inside Kairntech disambiguates these and many others, based on automatically learned context information (no manual maintenance required)
      • Linking: Connect the occurrence of the extracted terms with background information from Wikidata or client-specific knowledge
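The idea of context-dependent disambiguation on slide 13 can be illustrated with a deliberately naive sketch: each candidate meaning of “NHL” gets a set of typical context words, and the meaning that shares the most words with the surrounding text wins. This is a toy word-overlap heuristic with hand-written word lists (an assumption made here for illustration). It is not the Kairntech or entity-fishing method, which learns context information automatically.

```python
import re

# Hand-written context words per candidate meaning (illustrative assumption;
# a real system would learn these from data instead).
CANDIDATES = {
    "NHL": {
        "Non-Hodgkin lymphoma": {
            "lymphoma", "patients", "chemotherapy", "tumor", "diagnosis", "treatment",
        },
        "National Hockey League": {
            "hockey", "season", "team", "players", "game", "playoffs",
        },
    }
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(mention, context):
    """Pick the candidate meaning whose context words overlap most with the text.

    Returns (meaning, confidence), where confidence is the winner's share
    of all overlapping words (0.0 if nothing overlaps).
    """
    tokens = tokenize(context)
    scores = {
        meaning: len(words & tokens)
        for meaning, words in CANDIDATES[mention].items()
    }
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[best] / total if total else 0.0
    return best, confidence


medical = disambiguate("NHL", "Patients with NHL received chemotherapy after diagnosis.")
sports = disambiguate("NHL", "The NHL season starts when every team has signed its players.")
```

The same label resolves to different meanings in the two sentences, which is the behaviour the slide describes. The returned share of overlapping words plays the role of a crude confidence score.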
  • 14. 1. Wikidata annotation (more examples)
  • 15. What do Kairntech clients say?
      Olivier Deguernel (Sealk.co): “Our processes require the enrichment of document content with a wide range of named entities. We have analysed the available APIs on the market and have decided to integrate the Kairntech Named-Entity Extraction API into our offering. The clear API, the superior quality and the wealth of information returned by the API made it a valuable completion of our processes.”
  • 16. 2. Wikidata supporting the manual annotation
      Scenario:
      • Train a model to distinguish, in clinical trials, « Adverse Events » from the disease that is the subject of the trial
      • This requires annotating a training corpus
          • Normally that means manually scanning the documents, finding occurrences of a disease and assigning each to this or that type
          • … can be lengthy …
      • Instead: Let the Wikidata annotator suggest all occurring diseases. The user only needs to review these.
      • Available for thousands of entity types: Diseases, Substances, Car Models, Amino Acids, …
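The suggest-then-review workflow of slide 16 can be sketched as follows. The record layout (a `type` field on each annotation), the `Suggestion` class and the `suggest` and `review` helpers are hypothetical and only illustrate the idea: the annotator proposes every disease mention, and the reviewer merely decides which label each one gets.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Suggestion:
    """One pre-annotated mention waiting for a reviewer's decision."""
    start: int
    end: int
    text: str
    label: Optional[str] = None  # filled in during review


def suggest(annotations: List[dict], entity_type: str = "disease") -> List[Suggestion]:
    """Keep only mentions of the requested type as review candidates."""
    return [
        Suggestion(a["start"], a["end"], a["text"])
        for a in annotations
        if a.get("type") == entity_type
    ]


def review(suggestions: List[Suggestion],
           decide: Callable[[Suggestion], str]) -> List[Suggestion]:
    """Apply the reviewer's decision to each suggestion."""
    for s in suggestions:
        s.label = decide(s)
    return suggestions


# Made-up example document and annotator output (illustrative data only).
DOC = "Patients with asthma reported headache."
ANNOTATIONS = [
    {"start": 14, "end": 20, "text": "asthma", "type": "disease"},
    {"start": 30, "end": 38, "text": "headache", "type": "disease"},
    {"start": 0, "end": 8, "text": "Patients", "type": "misc"},
]

# Stand-in for the human reviewer: here "asthma" is the trial's subject.
labelled = review(
    suggest(ANNOTATIONS),
    lambda s: "Trial Disease" if s.text == "asthma" else "Adverse Event",
)
```

In an interactive tool the `decide` callback would be a click by the domain expert. The reviewed suggestions then form the training corpus for the model.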
  • 17. 3. Wikidata information enriching DL training?
      Current best practice:
      • Input to learning is a sequence of token vectors.
      • Vectors are fed into an embedding layer that expresses contextual semantic information drawn from massive amounts of text data (options in the software are currently BERT and ELMo)
      • Embeddings are able to express impressive amounts of semantic information:
          • Example: Most similar tokens to « cytosine »: Thymine, Uracil, Adenine, Cytidine, Nucleotides, Pyrimidine, …
      • Evidence from the literature suggests that adding explicit semantic information from large taxonomies may not necessarily improve on the state of the art for embedding-powered entity recognition
          • Raiman & Raiman 2018: « DeepType: Multilingual Entity Linking by Neural Type System Evolution », AAAI 2018, https://arxiv.org/pdf/1802.01021.pdf
      • Investigation in our team is under way. This may be further evidence for the massive benefits of large pretrained language models such as BERT compared to explicit knowledge.
  • 18. Summary
      At Kairntech we combine, in one intuitive environment, the possibility to
      • learn powerful models for your own specific analysis demands
      • with analysis based on vast existing knowledge on just about any domain
  • 19. Conclusion
      1. Document analysis tasks benefit from World Knowledge
      2. Wikidata is a powerful resource for World Knowledge
      3. Annotation with Wikidata knowledge is an integrated part of Kairntech software, available off-the-shelf
      4. We are looking forward to hearing about your analysis requirements
      Thank you! info@kairntech.com www.kairntech.com