PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)
1. PhenDisco: a new phenotype discovery system for the database of genotypes and phenotypes
Son Doan, Hyeoneui Kim
Division of Biomedical Informatics
University of California San Diego
Open Access Journal Club, 09/05/2013
2. Roadmap to the Presentation
— Background
— dbGaP
— Challenges in using dbGaP
— pFINDR program
— PhenDisco development
— User requirement analysis for PhenDisco
— Data standardization (variables, study metadata)
— System development: technical details
— PhenDisco demo
— Performance evaluation
4. Overview of dbGaP
• Database of Genotypes and Phenotypes
• Developed by NCBI
• Stores and distributes the data and results of studies on the interactions between genotypes and phenotypes
• Provides 2 levels of access
– Open access: variable information, including summary statistics, and study information
– Controlled access: raw data, upon approval by an NIH Data Access Committee (DAC)
5. A Typical Challenge in Using dbGaP
Potentially, dbGaP is great… it contains so many different types of studies and their data!
However, I find it very hard to reuse dbGaP data because there is no easy but robust way to filter studies by important study-related information such as study design, analysis methods, and analysis data produced by the studies.
Even if I find the studies that seem fitting to my needs, I still need to make sure that the studies have the genotype and/or the phenotype information that I need.
Of course, dealing with data values in all sorts of different formats is another challenge to go through…
(Erin Smith, PhD, Division of Genome Information Science, UCSD)
9. pFINDR (phenotype Finding IN Data Repositories)
• Funded by NHLBI
• To facilitate dbGaP use by improving accuracy and completeness of search returns
– Standardized phenotype variables
– Searchable study-related information
11. Use-Case Driven Development
• User requirements collected from
– Analysis of data use descriptions from data requests available in dbGaP (14,287 requests)
– Online user survey (17 users)
– User interviews (8 local dbGaP users)
– Recommendations and suggestions from NIH officers and the Scientific Advisory Board (SAB)
12. Data Request Analysis
[Figure: chart of dbGaP data requests by topic category. Labeled shares: Neoplasm/Cancer (30%), Psychiatric Disease (13%), Genetic Disease / Congenital Abnormality (8.6%), Cardiovascular Disease (8.1%). Other categories shown without percentages: Disease; Chemical or Biological Substance; Therapeutic or Preventive Procedure; Research Activity; Laboratory Procedure or Test; Pathologic Function; Signs or Symptoms; Diagnostic Procedure; Clinical Attributes; Mood, Emotion, and Individual Behavior; Qualitative Concept; Mental Process; Social Behavior; Organism Function; Daily Function or Activity; Health Care Activity; Food; Other.]
13. Interviews, Survey and SAB/NIH officers’ feedback
• Functions that maximize search efficiency
• Examples
– “option to expand search terms through synonyms”
– “studies displayed in the order of relevancy”
– “select studies from the returned list and save for later review”
– “search results organized in a way that supports quick browsing”
14. Problems We Addressed
• Focus areas:
– Completeness and accuracy of search results: abbreviation expansion, concept-based search
– Ease of result review: sorting the results by relevancy, highlighting search keywords in the retrieved records
– Additional functionality: export of selected study and variable information, categorization of variables
16. Phenotype Variable Standardization
• Used variable descriptions
• Focused on identifying
– Topic (main theme: “pain”, “walking”)
– Subject of information (SOI), i.e., the bearer: “study subject”
• Mapped the topic and SOI concepts to the UMLS Metathesaurus
Example variable:
Variable ID | Variable Name | Variable Description
Phv00116192.v2.p2 | C41RPACE | Get pain when walk at ordinary pace?
22. Phenotype Variable Standardization
Input: variable descriptions (135,608 variables)
Normalization
• Spell out abbreviations and shorthand expressions
• Drop question numbers and other unimportant characters
• Example: “77 age mom diagnosed – stroke (tia)” becomes “age mother diagnosed stroke (tia)”
MetaMap Processing
• Generate CUIs, concept names, semantic types
• Example: C0001779: age [organism attribute], C0026591: Mother [family group], C0038454: Stroke [disease or syndrome]
Semantic Role Assignment
• Semantic types and keyword-based role identification
• Evaluation from random sample of 500: 73% accuracy
• Example: C0001779: age and C0038454: Stroke are topics; C0026591: Mother is the subject of information
Variable Categorization
• Semantic types and keyword-based categorization
• Evaluation from random sample of 500: 71% accuracy
• Example: family history, demographics
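The normalization stage can be illustrated with a minimal Python sketch. The abbreviation table and the regular expressions below are assumptions made for illustration; the slides do not list the actual rules PhenDisco uses.

```python
import re

# Hypothetical abbreviation table; the real mapping is not given in the slides.
ABBREVIATIONS = {"mom": "mother", "dad": "father"}


def normalize_description(text: str) -> str:
    """Lowercase, drop a leading question number and separator dashes,
    then expand known abbreviations word by word."""
    text = text.lower().strip()
    # Drop a leading question number such as "77" or "12a."
    text = re.sub(r"^\d+[a-z]?[.)]?\s+", "", text)
    # Replace free-standing dashes (hyphen, en dash, em dash) with a space.
    text = re.sub(r"\s[-\u2013\u2014]+\s", " ", text)
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(normalize_description("77 age mom diagnosed – stroke (tia)"))
# prints: age mother diagnosed stroke (tia)
```

A rule this simple would also strip a leading number that carries meaning, so a production version would need a more careful definition of "question number".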
23. Category Examples
Variable Description | Topics | Subject of Information | Variable Categories
Gender of the participant | gender | study subject | Demographics
Last known smoking status | smoking | study subject | Smoking History
Cigarettes/day, exam 1 | smoking, medical examination | study subject | Smoking History; Healthcare Activity Finding
Age in years at uric acid measurement | age, uric acid measurement | study subject | Demographics; Lab Tests
AGE of living mother | age | mother | Demographics - Family
Age at dementia onset as defined by the DSM IV definition | age, dementia | study subject | Demographics; Medical History
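Role assignment and categorization driven by semantic types might look like the sketch below. The two rule tables are assumptions chosen so that the stroke example from the slides works; the actual PhenDisco rules and keyword lists are not shown in the deck. The default subject "study subject" follows the category examples.

```python
from typing import List, NamedTuple, Tuple


class Concept(NamedTuple):
    cui: str
    name: str
    semantic_type: str


# Assumed rule tables, for illustration only.
SOI_SEMANTIC_TYPES = {"family group"}
CATEGORY_BY_SEMANTIC_TYPE = {
    "organism attribute": "demographics",
    "family group": "family history",
}
DEFAULT_SUBJECT = "study subject"


def assign_roles(concepts: List[Concept]) -> Tuple[List[str], str]:
    """Split concepts into topics and a single subject of information."""
    subjects = [c.name for c in concepts if c.semantic_type in SOI_SEMANTIC_TYPES]
    topics = [c.name for c in concepts if c.semantic_type not in SOI_SEMANTIC_TYPES]
    subject = subjects[0].lower() if subjects else DEFAULT_SUBJECT
    return topics, subject


def categorize(concepts: List[Concept]) -> List[str]:
    """Collect the categories implied by each concept's semantic type."""
    found = {
        CATEGORY_BY_SEMANTIC_TYPE[c.semantic_type]
        for c in concepts
        if c.semantic_type in CATEGORY_BY_SEMANTIC_TYPE
    }
    return sorted(found)


concepts = [
    Concept("C0001779", "age", "organism attribute"),
    Concept("C0026591", "Mother", "family group"),
    Concept("C0038454", "Stroke", "disease or syndrome"),
]
topics, subject = assign_roles(concepts)
print(topics, subject, categorize(concepts))
```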
24. Phenotype Variable Standardization
Pipeline stages: Normalization, MetaMap Processing, Semantic Role Assignment, Variable Categorization, Identification of Similar Variables
Identification of Similar Variables (in progress)
• Same CUI, similar keywords, and same category
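Since this stage is described as in progress, the following is only one possible reading of "same CUI, similar keywords, and same category". Keyword similarity is measured here with Jaccard overlap and a threshold of 0.5; both are assumptions, as are the variable IDs, the second description, and the placeholder CUI.

```python
from typing import FrozenSet, NamedTuple, Set


class Variable(NamedTuple):
    var_id: str  # hypothetical identifiers, not real dbGaP accessions
    description: str
    cuis: FrozenSet[str]
    categories: FrozenSet[str]


def keywords(description: str) -> Set[str]:
    return set(description.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def are_similar(v1: Variable, v2: Variable, min_overlap: float = 0.5) -> bool:
    """Same CUIs, same categories, and sufficiently overlapping keywords."""
    return (
        v1.cuis == v2.cuis
        and v1.categories == v2.categories
        and jaccard(keywords(v1.description), keywords(v2.description)) >= min_overlap
    )


stroke_cuis = frozenset({"C0001779", "C0026591", "C0038454"})
cats = frozenset({"family history", "demographics"})
a = Variable("var_a", "age mother diagnosed stroke (tia)", stroke_cuis, cats)
b = Variable("var_b", "age when mother diagnosed with stroke", stroke_cuis, cats)
# "CUI_OTHER" is a placeholder standing in for a different concept.
c = Variable("var_c", "age mother diagnosed stroke", frozenset({"C0001779", "CUI_OTHER"}), cats)

print(are_similar(a, b), are_similar(a, c))
```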
25. Study Level Metadata Annotation
• Manual annotation of 422 studies (as of 07/31/13)
• Metadata items generated
– Disease topics (encoded with UMLS)
– Geographical information (encoded with ISO 3166-2 subdivision codes: state and country)
– IRB approval (required or not)
– Consent type (not restricted, restricted, unspecified)
– Sample demographics (race and/or ethnicity, gender, age)
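One way to picture the annotated metadata is as a small record type. The field names, the example study ID, and the filter function below are hypothetical; the slides only list the kinds of items annotated, not a schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ConsentType(Enum):
    NOT_RESTRICTED = "not restricted"
    RESTRICTED = "restricted"
    UNSPECIFIED = "unspecified"


@dataclass
class StudyMetadata:
    study_id: str                   # placeholder value below, not a real accession
    disease_cuis: List[str]         # UMLS concept identifiers for disease topics
    locations: List[str]            # ISO 3166-2 subdivision codes, e.g. "US-CA"
    irb_required: bool
    consent_type: ConsentType = ConsentType.UNSPECIFIED
    races: List[str] = field(default_factory=list)
    gender: Optional[str] = None
    min_age: Optional[int] = None
    max_age: Optional[int] = None


def matches(study: StudyMetadata, cui: Optional[str] = None,
            irb_required: Optional[bool] = None) -> bool:
    """Return True if the study satisfies every filter that was supplied."""
    if cui is not None and cui not in study.disease_cuis:
        return False
    if irb_required is not None and study.irb_required != irb_required:
        return False
    return True


example = StudyMetadata(
    study_id="study-0001",
    disease_cuis=["C0038454"],      # Stroke, as used in the standardization example
    locations=["US-CA"],
    irb_required=False,
    consent_type=ConsentType.RESTRICTED,
)
print(matches(example, cui="C0038454", irb_required=False))
```

Structured fields of this kind are what make bracketed filters such as [IRB not required] in the advanced search evaluation possible.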
29. Contextual Query Language
• Query types:
– Simple queries: keywords, phrases
– Boolean logic: AND, OR, NOT
– Comparisons on indexed values, e.g., age > 40
• Language guideline: grammar specified in BNF form
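The query types above can be captured by a small grammar. The BNF in the comment and the parser below are an illustrative sketch using only the Python standard library; they are not the PhenDisco grammar. Phrases must use straight double quotes here.

```python
import re

# Assumed grammar (illustrative):
#   query    ::= and_expr ("OR" and_expr)*
#   and_expr ::= not_expr ("AND" not_expr)*
#   not_expr ::= "NOT" not_expr | atom
#   atom     ::= "(" query ")" | phrase | field comparator number | term

TOKEN = re.compile(r'"[^"]*"|[()]|>=|<=|[<>=]|[^\s()<>="]+')
COMPARATORS = {">", "<", ">=", "<=", "="}


class Parser:
    def __init__(self, query: str):
        self.tokens = TOKEN.findall(query)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse(self):
        tree = self.or_expr()
        if self.peek() is not None:
            raise ValueError(f"unexpected token: {self.peek()!r}")
        return tree

    def or_expr(self):
        node = self.and_expr()
        while self.peek() == "OR":
            self.take()
            node = ("OR", node, self.and_expr())
        return node

    def and_expr(self):
        node = self.not_expr()
        while self.peek() == "AND":
            self.take()
            node = ("AND", node, self.not_expr())
        return node

    def not_expr(self):
        if self.peek() == "NOT":
            self.take()
            return ("NOT", self.not_expr())
        return self.atom()

    def atom(self):
        tok = self.take()
        if tok is None or tok in (")", "AND", "OR"):
            raise ValueError(f"expected a term, got {tok!r}")
        if tok == "(":
            node = self.or_expr()
            if self.take() != ")":
                raise ValueError("missing closing parenthesis")
            return node
        if tok.startswith('"'):
            return ("PHRASE", tok.strip('"'))
        if self.peek() in COMPARATORS:
            op = self.take()
            value = self.take()
            if value is None:
                raise ValueError("comparison is missing a value")
            return ("COMPARE", tok, op, float(value))
        return ("TERM", tok)


def parse_query(query: str):
    return Parser(query).parse()


print(parse_query('"macular degeneration" AND white'))
print(parse_query("stroke AND age > 40"))
```

AND binds more tightly than OR in this sketch, which is a common convention but an assumption about how PhenDisco resolves precedence.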
32. BM25 ranking algorithm
Symbols used in the ranking formula:
• N: total number of studies
• n_t: number of studies containing the term t
• c: field in study d
• w_c: boost factor for each field c
• tf: term frequency
• idf: inverse document frequency
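The symbols above are consistent with a field-weighted BM25 variant (often called BM25F). The sketch below shows one common simplified form, in which per-field term frequencies are combined using the boosts w_c and then saturated with a parameter k1. The exact formula used in PhenDisco is not reproduced here: the idf smoothing, the value k1 = 1.2, the omission of length normalization, and the example fields and boosts are all assumptions.

```python
import math
from typing import Dict, List


def idf(n_studies: int, n_t: int) -> float:
    """Smoothed inverse document frequency; stays positive for any n_t."""
    return math.log(1.0 + (n_studies - n_t + 0.5) / (n_t + 0.5))


def bm25f_score(query_terms: List[str],
                study_fields: Dict[str, List[str]],
                boosts: Dict[str, float],
                doc_freq: Dict[str, int],
                n_studies: int,
                k1: float = 1.2) -> float:
    """Score one study: sum over query terms of idf times saturated weighted tf."""
    score = 0.0
    for term in query_terms:
        # Weighted term frequency: sum over fields c of w_c * tf(term, c).
        weighted_tf = sum(
            boosts.get(name, 1.0) * tokens.count(term)
            for name, tokens in study_fields.items()
        )
        if weighted_tf == 0:
            continue
        saturation = weighted_tf * (k1 + 1) / (weighted_tf + k1)
        score += idf(n_studies, doc_freq.get(term, 0)) * saturation
    return score


boosts = {"title": 2.0, "description": 1.0}  # hypothetical boost factors
study_a = {"title": ["copd", "cohort"], "description": ["lung", "study"]}
study_b = {"title": ["lung", "cohort"], "description": ["copd", "study"]}
doc_freq = {"copd": 2}

score_a = bm25f_score(["copd"], study_a, boosts, doc_freq, n_studies=422)
score_b = bm25f_score(["copd"], study_b, boosts, doc_freq, n_studies=422)
print(score_a > score_b)
```

With these boosts, a match in the title outranks the same match in the description, which is the effect a per-field boost factor is meant to produce.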
33. Technical Infrastructure
• URL: http://pfindr-data.ucsd.edu/_PhDVer1/
• Linux machine: Ubuntu, 64-bit
• Memory: 32 GB RAM
• Database: MySQL 14.14
• Web server: Apache 2.2.20
• Programming languages: PHP, Python, JavaScript
• Python toolkits: pyparsing, Whoosh
36. Evaluation on Basic Search
Basic search, dbGaP vs. PhenDisco (as of July 7, 2013)
Query | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
“macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
“breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%
Average F-measure: dbGaP 0.64, PhenDisco 0.89
37. Evaluation on Advanced Search
Advanced search in PhenDisco (as of July 7, 2013)
Query | Recall | Precision
“macular degeneration” AND white AND [whole genome genotyping] | 100% | 66.67%
“breast cancer” AND “breast density” AND [IRB not required] AND [whole genome genotyping] | 100% | 100%
schizophrenia AND [female] AND [AFFY_6.0] | 100% | 100%
cardiomyopathy AND [copy number variant analysis] | 100% | 100%
Average | 100% | 91.67%
Average F-measure: 0.96
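The average F-measures reported on slides 36 and 37 are consistent with applying the standard F1 formula (harmonic mean of precision and recall) to the averaged recall and precision. The short check below assumes that reading; the slides do not state how the averages were combined.

```python
def f_measure(precision: float, recall: float) -> float:
    """F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (average recall, average precision) taken from the evaluation tables.
averages = {
    "dbGaP basic": (1.0, 0.4661),
    "PhenDisco basic": (0.8333, 0.9571),
    "PhenDisco advanced": (1.0, 0.9167),
}

for system, (recall, precision) in averages.items():
    print(system, round(f_measure(precision, recall), 2))
# prints 0.64, 0.89 and 0.96 respectively
```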