Using text to build semantic networks for pharmacogenomics

George Karystianis

Adrien Coulet, Nigam Shah, Yael Garten, Mark Musen, Russ B. Altman

Journal of Biomedical Informatics (2010)

Motivation
● Manually crafted rules to define relationships
between entities.
– Limited scope domains.
● Pharmacogenomics.
– Semantic complexity.
● Enhance PharmGKB.
● Large size of literature.
● NLP techniques promising.

Aim
● Automatic relationship extraction.
● Entity mapping in a schema.
– Semantic network structure.
● Curation of PGx knowledge.
● Resource for knowledge discovery.

What is the meaning of
Pharmacogenomics?

Pharmacogenomics (1)

Pharmaco + Genomics = PGx

Φάρμακο (drug) + Γίνομαι (to become)

Pharmacogenomics (2)
● How genetic variation influences drug
response in patients.
● Most of this knowledge presented in binary
relationships.

R(a, b): R = relationship, a = subject, b = object
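
A minimal sketch of how a binary relationship R(a, b) could be represented in code. The class name, field names, and the example values are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """A binary relationship R(a, b)."""
    relation: str  # R: the relationship type
    subject: str   # a: the subject entity
    obj: str       # b: the object entity

    def __str__(self) -> str:
        return f"{self.relation}({self.subject}, {self.obj})"


# Illustrative instance only; the names are hypothetical.
r = Relationship("associated_with", "VKORC1_polymorphisms", "warfarin_dose")
print(r)
```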

Is This Something New?
● Co-occurrence approach:
– Pharmspresso.
– Tri-co-occurrences.
– Complex relationship semantics; manual relationship evaluation.
● Syntactic parser approach:
– OpenDMAP.
– Vocabularies.
– Explicit relationship identification; large pattern sets; stable ontologies.

So...

● Gene-disease networks.
● Molecular interaction networks.
● Drug-disease networks.
● Regular gene expression networks.

Method Overview
● MEDLINE abstracts -> dependency graphs of sentences -> relationships (R) -> ontology -> PGx network.

1a. Sentence Parsing
● Implementation of lexicons for sentence
retrieval.
● Stanford Parser.
● Focused on sentences with at least 2 key PGx
entities.
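
The lexicon-based filter can be sketched roughly as follows. This is a simplified illustration with tiny hand-made lexicons and single-token matching; the lexicon contents and the function names are assumptions, and the actual pipeline is not described at this level of detail on the slides.

```python
import re

# Hypothetical toy lexicons; real lexicons would be far larger.
LEXICONS = {
    "gene": {"VKORC1", "CYP2C9"},
    "drug": {"warfarin"},
    "phenotype": {"bleeding"},
}


def find_entities(sentence: str) -> list[tuple[str, str]]:
    """Return (term, entity_type) pairs for every lexicon term in the sentence."""
    tokens = {t.lower() for t in re.findall(r"[A-Za-z0-9]+", sentence)}
    hits = []
    for etype, terms in LEXICONS.items():
        for term in sorted(terms):
            if term.lower() in tokens:
                hits.append((term, etype))
    return hits


def has_entity_pair(sentence: str) -> bool:
    """Keep a sentence only if it mentions at least two key entities."""
    return len(find_entities(sentence)) >= 2


sentence = (
    "Several single nucleotide polymorphisms (SNPs) in VKORC1 are associated "
    "with warfarin dose across the normal dose range"
)
print(find_entities(sentence), has_entity_pair(sentence))
```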

1b. Sentence Parsing
● Querying the sentence index using seeds.
– particular terms corresponding to recognized entities.
– focus on gene-drug/gene-phenotype pairs.
● Reducing set/size of parse trees.
● Parse trees -> dependency graphs.
– rooted, oriented, labelled; easier to read, process, and
understand than parse trees.

Parsing Example
“Several single nucleotide polymorphisms (SNPs) in VKORC1 are associated
with warfarin dose across the normal dose range”

Dependency Graph

2a. Relation Extraction
● Sentence analysis for raw relationship
extraction.
● Seed recognition:
– through PharmGKB lexicons.
● Seed expansion:
– traverse the edges of the dependency graph (DG) to determine
whether the seed is a key entity or a modified entity.
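
A minimal sketch of seed expansion, assuming a toy dependency graph written by hand for the VKORC1 example sentence (it is not actual Stanford Parser output) and a hypothetical set of "expanding" dependency labels. The real rules for expanding, ending, or interrupting the expansion are not listed on the slides.

```python
# Hand-written edges (governor, label, dependent); NOT real parser output.
EDGES = [
    ("associated", "nsubjpass", "polymorphisms"),
    ("polymorphisms", "prep_in", "VKORC1"),
    ("associated", "prep_with", "dose"),
    ("dose", "nn", "warfarin"),
]

# Hypothetical rule: edge labels that let a seed grow into a modified entity.
EXPANDING = {"nn", "prep_in", "prep_of"}


def expand_seed(seed: str, edges=EDGES) -> list[str]:
    """Walk from the seed towards its governors while the edge label expands it."""
    path = [seed]
    current = seed
    while True:
        step = next(
            (gov for gov, label, dep in edges
             if dep == current and label in EXPANDING),
            None,
        )
        if step is None or step in path:  # no expanding edge left: stop here
            return path
        path.append(step)
        current = step


for seed in ("VKORC1", "warfarin"):
    path = expand_seed(seed)
    kind = "key entity" if len(path) == 1 else "modified entity"
    print("_".join(path), "->", kind)
```

A seed whose path has length one stays a key entity; a longer path yields a modified entity such as VKORC1_polymorphisms.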

Dependencies for Seed
Expansion

● Expand the seed
● End the expansion
● Interrupt the expansion

2b. Relation Extraction
● Seed coupling
– Two seeds linked by a normalised verb.
– Relationship creation.
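
Seed coupling might look like the sketch below: two expanded seeds form a raw relationship when the heads of both share the same governor. The edge list is hand-written for illustration, the lemma table is hypothetical, and the check that the governor is actually a verb is omitted for brevity.

```python
from typing import Optional

# Hand-written edges (governor, label, dependent); NOT real parser output.
EDGES = [
    ("associated", "nsubjpass", "polymorphisms"),
    ("polymorphisms", "prep_in", "VKORC1"),
    ("associated", "prep_with", "dose"),
    ("dose", "nn", "warfarin"),
]

# Hypothetical verb normalisation table.
LEMMAS = {"associated": "associate"}


def governor(node: str, edges=EDGES) -> Optional[str]:
    """Return the governor of a node, or None if the node is the root."""
    for gov, _label, dep in edges:
        if dep == node:
            return gov
    return None


def couple(head_a: str, head_b: str, name_a: str, name_b: str,
           edges=EDGES) -> Optional[tuple[str, str, str]]:
    """Create (verb, subject, object) if both heads hang off the same governor."""
    gov_a, gov_b = governor(head_a, edges), governor(head_b, edges)
    if gov_a is None or gov_a != gov_b:
        return None
    return (LEMMAS.get(gov_a, gov_a), name_a, name_b)


raw = couple("polymorphisms", "dose", "VKORC1_polymorphisms", "warfarin_dose")
print(raw)
```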

2c. Relation Extraction
● Evaluation of precision:
– manual precision evaluation of the extracted raw
relationships.
– random selection of 220 raw relationships.
– classification: complete and true, incomplete and true,
or false.
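
Given the three manual labels, precision can be computed as in this sketch. The label strings and the toy sample are made up for illustration and are not the paper's results; whether "incomplete and true" counts as correct is exposed as a parameter because the slides do not say.

```python
from collections import Counter


def precision(labels: list[str], count_incomplete: bool = True) -> float:
    """Fraction of evaluated relationships judged true."""
    if not labels:
        raise ValueError("no labels to evaluate")
    counts = Counter(labels)
    correct = counts["complete_true"]
    if count_incomplete:
        correct += counts["incomplete_true"]
    return correct / len(labels)


# Toy sample of manual judgements (hypothetical, NOT the 220 from the study).
toy_labels = ["complete_true"] * 3 + ["incomplete_true"] + ["false"]
lenient = precision(toy_labels)
strict = precision(toy_labels, count_incomplete=False)
print(lenient, strict)
```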

3. Ontology Construction
● Identification of R types.
● Hierarchical organisation of relationship types (R) and entities (E).
– 4 lists: the most frequent relationship types, and the most
frequent entities modified by genes, drugs, and phenotypes.
● Refine choice available.

4a. Relationship Normalization
● Application of ontology to relationship
instances.
● Creation of set of normalised relationships.
● Normalization of entity names:
– modified entity name returned in normalized form
according to the ontology.
– the modified entity is decomposed and processed iteratively to
construct the normalised form.

Example
● Seed: VKORC1_polymorphisms.
● Seed concept: Gene.
● Next word: polymorphism.
– refers to a concept modified by Gene.
– synonym of the concept “variant”.
● Normalised word:
– VKORC1_variant.

4b. Relation Normalization
● Normalization of relationship types.
– search for a role label which matches the relationship.
– the identifier of the corresponding role is the
normalized type.
– creation of a knowledge base of PGx relationships.
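
Matching a raw relationship type against role labels could be sketched like this. The role identifiers and labels below are invented for illustration; they are not the actual roles of the ontology.

```python
from typing import Optional

# Hypothetical roles: identifier -> labels that map to it.
ROLE_LABELS = {
    "isAssociatedWith": ["associate", "associated", "correlate"],
    "inhibits": ["inhibit", "inhibits", "block"],
}


def normalise_type(raw_type: str) -> Optional[str]:
    """Return the identifier of the role whose label matches, else None."""
    raw_type = raw_type.lower()
    for role_id, labels in ROLE_LABELS.items():
        if raw_type in labels:
            return role_id
    return None


print(normalise_type("Associated"))
```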

Did it work?
● Input:
– 17,396,436 MEDLINE abstracts.
● Sentences:
– 87,806,828.
● Sentences with pairs of PGx entities:
– 295,569.
● After pruning:
– 41,134 raw relationships: 21,050 gene-drug pairs and
20,084 gene-phenotype pairs.
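
A quick consistency check of the figures above, written as a sketch. The variable names are illustrative; the numbers are the ones quoted on the slide.

```python
abstracts = 17_396_436
sentences = 87_806_828
pair_sentences = 295_569
raw_relationships = 41_134
gene_drug = 21_050
gene_phenotype = 20_084

# The two pair types should add up to the total of raw relationships.
assert gene_drug + gene_phenotype == raw_relationships

# Share of all sentences that contain a pair of PGx entities.
pair_share = pair_sentences / sentences
# Average number of sentences per abstract.
sentences_per_abstract = sentences / abstracts

print(f"{pair_share:.2%} of sentences contain a PGx entity pair")
print(f"{sentences_per_abstract:.1f} sentences per abstract on average")
```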

Results
● The 200 most frequent raw relationship types:
– 80% of the extracted relationships.
● Creation of an ontology:
– built from the 200 most frequent relationship types and modified
entities; named PHARE (PHArmacogenomics RElationships).
– 237 concepts and 76 roles.
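
The coverage of the most frequent relationship types can be measured as in this sketch. The toy list of types is invented for illustration; on the real data, k would be 200.

```python
from collections import Counter


def top_k_coverage(types: list[str], k: int) -> float:
    """Fraction of relationship instances whose type is among the k most frequent."""
    if not types:
        raise ValueError("no relationship types given")
    counts = Counter(types)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(types)


# Toy data (hypothetical): 10 instances over 4 types.
toy_types = ["associate"] * 5 + ["inhibit"] * 3 + ["metabolize", "induce"]
coverage = top_k_coverage(toy_types, k=2)
print(coverage)
```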

Discussion (1)
● Identification of both PGx entities.
● Identification of PGx modified entities.
● Use of key entity lexicons for discovery and
normalization of modified entities.
● Record and recognition of modified entities
under very general textual conditions.
● Flexible, precise method.

Discussion (2)
● Concern: lower recall due to the large corpus
size.
– improve precision with full text parsing.
● Applicable to other domains.
– Human effort required for the ontology creation.

Conclusions (1)
● New method for PGx relationship extraction.
● Use of key PGx entities to identify modified
entities.
● Capture and normalization of raw
relationships.
● Automatic labelling of parsed sentences.

Conclusions (2)
● Creation of a knowledge base.
● Creation of relationship summaries between:
– Genes, drugs, phenotypes.
● Novel approach for PGx text processing.

Questions?