PhenDisco: a new phenotype
discovery system for the database
of genotypes and phenotypes
Son Doan, Hyeoneui Kim
Division of Biomedical Informatics
University of California San Diego
Open Access Journal Club, 09/05/2013
Roadmap to the Presentation
—  Background
—  dbGaP
—  Challenges in using dbGaP
—  pFINDR program
—  PhenDisco development
—  User requirement analysis for PhenDisco
—  Data standardization (variables, study metadata)
—  System development: technical details
—  PhenDisco demo
—  Performance evaluation
Background
Overview of dbGaP
—  Database of Genotypes and Phenotypes
—  Developed by NCBI
—  Stores and distributes the data and outputs of the
studies on the interactions of genotypes &
phenotypes
—  Provides 2 levels of access
—  Open access: variable information including
summary statistics and study information
—  Controlled access: raw data – upon approval by
NIH DAC
A Typical Challenge in Using dbGaP
Potentially, dbGaP is great…it contains so many different types of studies
and their data!
However, I find it very hard to reuse dbGaP data because there is no easy
but robust way to filter studies by important study related information such
as study design, analysis methods, analysis data produced by the studies.
Even if I find the studies that seem fitting to my needs, I still need to make
sure that the studies have the genotype and/or the phenotype information
that I need.
Of course, dealing with the data values with all sort of different formats is
another challenge to go through…
(Erin Smith, PhD, Division of Genome Information Science, UCSD)
[Slides 6-8: screenshots of the dbGaP website, http://www.ncbi.nlm.nih.gov/gap]
pFINDR (phenotype Finding IN Data Repositories)
•  Funded by NHLBI
•  To facilitate dbGaP use by improving
accuracy and completeness of search
returns
–  Standardized phenotype variables
–  Searchable study related information
User Requirement Analysis
Use-Case Driven Development
— User requirements collected from
—  Analysis of data use descriptions from data
requests available in dbGaP (14,287 requests)
—  Online user survey (17 users)
—  User interviews (8 local dbGaP users)
—  NIH officers/Scientific Advisory Board
recommendations and suggestions
Data Request Analysis
[Chart of the categories found in the data requests.
Labeled shares: Neoplasm/Cancer (30%), Psychiatric Disease (13%), Genetic Disease / Congenital Abnormality (8.6%), Cardiovascular Disease (8.1%).
Categories shown without a percentage: Disease, Chemical or Biological Substance, Therapeutic or Preventive Procedure, Research Activity, Laboratory Procedure or Test, Pathologic Function, Signs or Symptoms, Diagnostic Procedure, Clinical Attributes, Mood, Emotion, and Individual Behavior, Qualitative Concept, Mental Process, Social Behavior, Organism Function, Daily Function or Activity, Health Care Activity, Food, Other.]
Interviews, Survey and
SAB/NIH officers’ feedback
—  Functions that maximize search efficiency
—  Examples
—  “option to expand search terms through
synonyms”
—  “studies displayed in the order of relevancy”
—  “select studies from the returned list and save for
later review”
—  “search results organized in a way that supports
quick browsing”
Problems We Addressed
—  Focus areas:
—  Completeness and accuracy of search results
—  Abbreviation expansion
—  Concept-based search
—  Ease of result review
—  Sorting the results by relevancy
—  Highlighting search keywords in the retrieved records
—  Additional functionality
—  Export of selected study and variable information
—  Categorization of variables
Data Standardization
•  Variable Standardization
•  Study Level Metadata Generation
Phenotype Variable Standardization
—  Used variable descriptions
—  Focused on identifying
—  Topic (main theme: “pain”, “walking”)
—  Subject of information (i.e., bearer: “study subject”)
—  Mapped the topic and SOI concepts to UMLS
Metathesaurus
Example variable:
Variable ID | Variable Name | Variable Description
Phv00116192.v2.p2 | C41RPACE | Get pain when walk at ordinary pace?
Category Examples
Variable Description | Topics | Subject of Information | Variable Categories
Gender of the participant | gender | study subject | Demographics
Last known smoking status | smoking | study subject | Smoking History
Cigarettes/day, exam 1 | smoking, medical examination | study subject | Smoking History; Healthcare Activity Finding
Age in years at uric acid measurement | age, uric acid measurement | study subject | Demographics; Lab Tests
AGE of living mother | age | mother | Demographics - Family
Age at dementia onset as defined by the DSM IV definition | age, dementia | study subject | Demographics; Medical History
Phenotype Variable Standardization
Variable Descriptions
•  135,608 variables
•  Example: “77 age mom diagnosed – stroke (tia)”
Normalization
•  Spell out abbreviations and shorthand expressions
•  Drop question numbers and other unimportant characters
•  Example: “age mother diagnosed stroke (tia)”
MetaMap Processing
•  Generate CUIs, concept names, semantic types
•  Example: C0001779: age [organism attribute], C0026591: Mother [family group], C0038454: Stroke [disease or syndrome]
Semantic Role Assignment
•  Semantic types and keyword-based role identification
•  Evaluation from random sample of 500: 73% accuracy
•  Example: C0001779: age, C0038454: Stroke – topic; C0026591: Mother – subject of information
Variable Categorization
•  Semantic types and keyword-based categorization
•  Evaluation from random sample of 500: 71% accuracy
•  Example: family history, demographics
Identification of Similar Variables
•  Same CUI, similar keywords, and same category
•  In progress
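The first and third steps of this pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abbreviation table and the rule that a [family group] concept is the subject of information are hypothetical stand-ins, the function names are invented here, and MetaMap is not called (its output is copied from the slide).

```python
import re

# Hypothetical lookup tables: the slides do not list the real abbreviation
# dictionary or the semantic-type rules, so these entries are illustrative.
ABBREVIATIONS = {"mom": "mother", "dad": "father"}
SUBJECT_SEMANTIC_TYPES = {"family group"}


def normalize_description(text):
    """Drop a leading question number and stray dashes, expand abbreviations."""
    text = re.sub(r"^\s*\d+[a-z]?[.)]?\s+", "", text)  # e.g. "77 " or "12a. "
    text = re.sub(r"\s[\u2013\u2014-]\s", " ", text)   # free-standing dashes
    words = [ABBREVIATIONS.get(word.lower(), word) for word in text.split()]
    return " ".join(words)


def assign_roles(concepts):
    """Split (cui, name, semantic_type) tuples into topic and subject of information."""
    roles = {"topic": [], "subject of information": []}
    for cui, name, semantic_type in concepts:
        is_subject = semantic_type in SUBJECT_SEMANTIC_TYPES
        key = "subject of information" if is_subject else "topic"
        roles[key].append((cui, name))
    return roles


# MetaMap output copied from the slide; MetaMap itself is not called here.
concepts = [
    ("C0001779", "age", "organism attribute"),
    ("C0026591", "Mother", "family group"),
    ("C0038454", "Stroke", "disease or syndrome"),
]
print(normalize_description("77 age mom diagnosed \u2013 stroke (tia)"))
print(assign_roles(concepts))
```

A rule this simple would also strip a meaningful leading number (for example a description starting with "24 hour"), which is one reason a real normalizer needs more context than a single regular expression.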
Study Level Metadata Annotation
•  Manual annotation of 422 studies (07/31/13)
•  Metadata items generated
•  Disease topics (encoded with UMLS)
•  Geographical information (encoded with ISO
3166-2 subdivision code: state and country)
•  IRB approval (required or not)
•  Consent type (not restricted, restricted,
unspecified)
•  Sample demographics (race and/or ethnicity,
gender, age)
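One way to picture an annotated study is as a small record with one field per metadata item. The sketch below is an assumption about shape only: the class and field names, the accession "phs000000", and the example values are made up for illustration and do not come from PhenDisco.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ConsentType(Enum):
    NOT_RESTRICTED = "not restricted"
    RESTRICTED = "restricted"
    UNSPECIFIED = "unspecified"


@dataclass
class StudyMetadata:
    study_id: str                     # hypothetical accession
    disease_topics: List[str]         # UMLS CUIs
    locations: List[str]              # ISO 3166-2 subdivision codes
    irb_required: bool
    consent_type: ConsentType = ConsentType.UNSPECIFIED
    races: List[str] = field(default_factory=list)
    gender: Optional[str] = None
    age_range: Optional[str] = None


def filter_studies(studies, irb_required=None, consent_type=None):
    """Keep studies matching every filter that is not None."""
    kept = []
    for study in studies:
        if irb_required is not None and study.irb_required != irb_required:
            continue
        if consent_type is not None and study.consent_type != consent_type:
            continue
        kept.append(study)
    return kept


# Made-up records for illustration only.
studies = [
    StudyMetadata("phs000000", ["C0038454"], ["US-CA"], irb_required=False),
    StudyMetadata("phs000001", ["C0038454"], ["US-CA"], irb_required=True,
                  consent_type=ConsentType.RESTRICTED),
]
print([s.study_id for s in filter_studies(studies, irb_required=False)])
```

Structured fields like these are what make a filter such as [IRB not required] possible alongside a free-text query.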
System Development:
Integration
PhenDisco: Put-it-all-together
[Architecture diagram. Query side: free text, query parser, sdGaP, relevant studies, ranked studies (BM25 ranking algorithm). Data side: dbGaP, NLP tools + MetaMap, information model mapping.]
System Development:
Query Parser
Contextual Query Language
—  Query types:
—  Simple queries: keywords, phrases.
—  Using Boolean logic: AND, OR, NOT
—  Can process index values, e.g., age > 40
—  Build a language guideline:
—  BNF form
BNF form
cqlQuery ::= prefixAssignment cqlQuery | scopedClause
prefixAssignment ::= '>' prefix '=' uri | '>' uri
scopedClause ::= scopedClause booleanGroup searchClause | searchClause
booleanGroup ::= boolean [modifierList]
boolean ::= 'and' | 'or' | 'not' | 'prox'
searchClause ::= '(' cqlQuery ')' | index relation searchTerm | searchTerm
relation ::= comparitor [modifierList]
comparitor ::= comparitorSymbol | namedComparitor
comparitorSymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>' | '=='
namedComparitor ::= identifier
modifierList ::= modifierList modifier | modifier
modifier ::= '/' modifierName [comparitorSymbol modifierValue]
prefix, uri, modifierName, modifierValue, searchTerm, index ::= term
term ::= identifier | 'and' | 'or' | 'not' | 'prox' | 'sortby'
identifier ::= charString1 | charString2
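To make the grammar concrete, here is a minimal recursive-descent sketch covering only the scopedClause and searchClause rules. It is not the PhenDisco parser: the slides list pyparsing, whereas this sketch uses only the standard library, ignores prefixes and modifiers, accepts straight double quotes only, and always treats and/or/not/prox as operators. The function names and the tuple-based tree are choices made here.

```python
import re

TOKEN = re.compile(r'\s*(?:"([^"]*)"|(<=|>=|<>|==|=|<|>)|([()])|([^\s()<>="]+))')
BOOLEANS = {"and", "or", "not", "prox"}


def tokenize(query):
    """Turn a query string into (kind, value) pairs."""
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        match = TOKEN.match(query, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        quoted, relation, paren, word = match.groups()
        if quoted is not None:
            tokens.append(("term", quoted))
        elif relation:
            tokens.append(("rel", relation))
        elif paren:
            tokens.append((paren, paren))
        else:
            kind = "bool" if word.lower() in BOOLEANS else "term"
            tokens.append((kind, word))
        pos = match.end()
    return tokens


def parse(query):
    """Parse a query into nested tuples; booleans associate left to right."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def search_clause():
        kind, value = peek()
        if kind == "(":
            advance()
            node = scoped_clause()
            if peek()[0] != ")":
                raise ValueError("expected ')'")
            advance()
            return node
        if kind != "term":
            raise ValueError(f"unexpected token {value!r}")
        advance()
        if peek()[0] == "rel":  # index relation searchTerm
            relation = advance()[1]
            if peek()[0] != "term":
                raise ValueError("expected a search term after the relation")
            return ("relation", value, relation, advance()[1])
        return ("term", value)

    def scoped_clause():
        node = search_clause()
        while peek()[0] == "bool":
            operator = advance()[1].lower()
            node = (operator, node, search_clause())
        return node

    tree = scoped_clause()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return tree


print(parse('"macular degeneration" AND white'))
print(parse("age > 40"))
```

Because scopedClause is left-recursive in the grammar, the sketch implements it as a loop, so a query like a or b not c groups as (a or b) not c.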
System Development:
Study Ranking
BM25 ranking algorithm
•  N: total number of studies
•  n_t: number of studies containing the term t
•  c: field in study d
•  w_c: boost factor for each field c
•  tf: term frequency
•  idf: inverse document frequency
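The formula on this slide did not survive in the transcript, so the sketch below uses one common field-weighted BM25 formulation (often called BM25F) built from the symbols in the legend. It should be read as an assumption, not as the scoring PhenDisco actually uses: the idf variant, the defaults k1 = 1.2 and b = 0.75, the function name, and the toy studies are all choices made here.

```python
import math
from collections import Counter


def bm25f_scores(query_terms, studies, boosts, k1=1.2, b=0.75):
    """Score each study; studies map a field name to a list of tokens."""
    n_studies = len(studies)                      # N
    fields = list(boosts)
    avg_len = {
        c: sum(len(s.get(c, [])) for s in studies) / n_studies for c in fields
    }

    # n_t: number of studies containing term t in any scored field
    doc_freq = Counter()
    wanted = set(query_terms)
    for study in studies:
        present = set()
        for c in fields:
            present.update(study.get(c, []))
        doc_freq.update(present & wanted)

    scores = []
    for study in studies:
        score = 0.0
        for t in query_terms:
            n_t = doc_freq[t]
            if n_t == 0:
                continue
            idf = math.log((n_studies - n_t + 0.5) / (n_t + 0.5) + 1.0)
            weighted_tf = 0.0
            for c in fields:
                tokens = study.get(c, [])
                tf = tokens.count(t)
                if tf == 0 or avg_len[c] == 0:
                    continue
                length_norm = 1.0 - b + b * len(tokens) / avg_len[c]
                weighted_tf += boosts[c] * tf / length_norm   # w_c boost
            score += idf * weighted_tf / (k1 + weighted_tf)
        scores.append(score)
    return scores


# Toy studies for illustration only.
studies = [
    {"title": ["copd", "study"], "abstract": ["lung", "function"]},
    {"title": ["heart", "study"], "abstract": ["copd", "cohort", "copd"]},
    {"title": ["asthma"], "abstract": ["children"]},
]
scores = bm25f_scores(["copd"], studies, boosts={"title": 3.0, "abstract": 1.0})
print(scores)
```

With the title boosted over the abstract, a single title match outranks two abstract matches in this toy data, which is the kind of behavior a per-field boost factor is meant to give.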
Technical Infrastructure
—  URL: http://pfindr-data.ucsd.edu/_PhDVer1/
—  Linux machine: 64-bit Ubuntu
—  Memory: 32GB RAM
—  Database: MySQL 14.14
—  Apache 2.2.20 Web server
—  Programming languages: PHP, Python, JavaScript
—  Python toolkits: pyparsing, Whoosh
System Demonstration
System Evaluation
•  Search Accuracy
•  User Interface
Evaluation on Basic Search (as of July 7, 2013)
Basic Search | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
“macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
“breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%
Average F-measure: dbGaP 0.64, PhenDisco 0.89
Evaluation on Advanced Search (as of July 7, 2013)
Advanced Search in PhenDisco | Recall | Precision
“macular degeneration” AND white AND [whole genome genotyping] | 100% | 66.67%
“breast cancer” AND “breast density” AND [IRB not required] AND [whole genome genotyping] | 100% | 100%
schizophrenia AND [female] AND [AFFY_6.0] | 100% | 100%
cardiomyopathy AND [copy number variant analysis] | 100% | 100%
Average | 100% | 91.67%
Average F-measure: 0.96
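The F-measure figures in the two evaluation tables can be checked with the standard harmonic-mean formula. Applying it to the averaged recall and precision reproduces the reported values; the dictionary labels below are just names chosen for this sketch.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (F1)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Average recall and precision as reported on the slides.
averages = {
    "dbGaP basic search": (1.0, 0.4661),
    "PhenDisco basic search": (0.8333, 0.9571),
    "PhenDisco advanced search": (1.0, 0.9167),
}
for name, (recall, precision) in averages.items():
    print(f"{name}: F = {f_measure(recall, precision):.2f}")
```

This prints 0.64, 0.89, and 0.96, matching the slides, which suggests the reported F-measures were computed from the averaged recall and precision rather than by averaging per-query F-measures.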
Feedback on the User Interface (N=6)
Trainees
—  Post-doctoral trainees
—  Ko-Wei Lin, DVM, PhD (Study Abstraction, Standardization,
Evaluation)
—  Mindy Ross, MD, MBA (Study Abstraction, Ontology Building)
—  Neda Alipanah, PhD (Ontology Building)
—  Xiaoqian Jiang, PhD (Ranking Algorithm)
—  Mike Conway, PhD (Study Abstraction)
—  Undergraduate trainees
—  Alexander Hsieh (Standardization)
—  Vinay Venkatesh (System Development)
—  Rafael Talavera (Evaluation)
—  Karen Truong (Study Abstraction)
—  Asher Garland (System Development)
Acknowledgements
—  Lucila Ohno-Machado (PI)
—  Collaborator
—  Hua Xu
—  Other contribution
—  Jihoon Kim
—  Wendy Chapman
—  Melissa Tharp
—  Staff
—  Stephanie Feudjio Feupe, MS
—  Seena Farzaneh, MS
—  Rebecca Walker, BS
—  Funding: UH2HL108785 from NHLBI, NIH
Questions?
Project Homepage: http://pfindr.net
PhenDisco: http://pfindr-data.ucsd.edu/_PhDVer1/index.php
Contact:
lohnomachado@ucsd.edu
hyk038@ucsd.edu
sondoan@ucsd.edu

Mais conteúdo relacionado

Semelhante a PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)

FAIR as a Working Principle for Cancer Genomic Data
FAIR as a Working Principle for Cancer Genomic DataFAIR as a Working Principle for Cancer Genomic Data
FAIR as a Working Principle for Cancer Genomic DataIan Fore
 
Evolution of Knowledge Discovery and Management
Evolution of Knowledge Discovery and Management Evolution of Knowledge Discovery and Management
Evolution of Knowledge Discovery and Management inscit2006
 
Biostats2019 5
Biostats2019 5Biostats2019 5
Biostats2019 5daforerog
 
Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Ian Foster
 
NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for p...
NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for p...NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for p...
NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for p...European School of Oncology
 
Clinical Research Statistics for Non-Statisticians
Clinical Research Statistics for Non-StatisticiansClinical Research Statistics for Non-Statisticians
Clinical Research Statistics for Non-StatisticiansBrook White, PMP
 
Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchFranciscoJAzuajeG
 
The Future of Personalized Medicine
The Future of Personalized MedicineThe Future of Personalized Medicine
The Future of Personalized MedicineEdgewater
 
Computational Pathology Workshop July 8 2014
Computational Pathology Workshop July 8 2014Computational Pathology Workshop July 8 2014
Computational Pathology Workshop July 8 2014Joel Saltz
 
The Role of Statistician in Personalized Medicine: An Overview of Statistical...
The Role of Statistician in Personalized Medicine: An Overview of Statistical...The Role of Statistician in Personalized Medicine: An Overview of Statistical...
The Role of Statistician in Personalized Medicine: An Overview of Statistical...Setia Pramana
 
TCIA Data Harmonization Project
TCIA Data Harmonization ProjectTCIA Data Harmonization Project
TCIA Data Harmonization Projectimgcommcall
 
Narrative review | Prisma systematic review | Medical writing
Narrative review | Prisma systematic review | Medical writingNarrative review | Prisma systematic review | Medical writing
Narrative review | Prisma systematic review | Medical writingPubrica
 
The Clinical Genome Conference 2014
The Clinical Genome Conference 2014The Clinical Genome Conference 2014
The Clinical Genome Conference 2014Nicole Proulx
 
Introduction to Systematic Reviews
Introduction to Systematic ReviewsIntroduction to Systematic Reviews
Introduction to Systematic ReviewsLaura Koltutsky
 
Critical Analysis Journal club how to do as a beginner
Critical Analysis Journal club  how to do as a beginnerCritical Analysis Journal club  how to do as a beginner
Critical Analysis Journal club how to do as a beginnerebinroshan07
 
provenance of microarray experiments
provenance of microarray experimentsprovenance of microarray experiments
provenance of microarray experimentsHelena Deus
 
Secondary Data Analysis
Secondary Data AnalysisSecondary Data Analysis
Secondary Data AnalysisREY DECASTRO
 
The Role of The Statisticians in Personalized Medicine: An Overview of Stati...
The Role of The Statisticians in Personalized Medicine:  An Overview of Stati...The Role of The Statisticians in Personalized Medicine:  An Overview of Stati...
The Role of The Statisticians in Personalized Medicine: An Overview of Stati...Setia Pramana
 
1Big Data Analytics forHealthcareChandan K. ReddyD.docx
1Big Data Analytics forHealthcareChandan K. ReddyD.docx1Big Data Analytics forHealthcareChandan K. ReddyD.docx
1Big Data Analytics forHealthcareChandan K. ReddyD.docxaulasnilda
 

Semelhante a PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP) (20)

FAIR as a Working Principle for Cancer Genomic Data
FAIR as a Working Principle for Cancer Genomic DataFAIR as a Working Principle for Cancer Genomic Data
FAIR as a Working Principle for Cancer Genomic Data
 
Evolution of Knowledge Discovery and Management
Evolution of Knowledge Discovery and Management Evolution of Knowledge Discovery and Management
Evolution of Knowledge Discovery and Management
 
Biostats2019 5
Biostats2019 5Biostats2019 5
Biostats2019 5
 
Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009
 
NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for p...
NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for p...NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for p...
NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for p...
 
Clinical Research Statistics for Non-Statisticians
Clinical Research Statistics for Non-StatisticiansClinical Research Statistics for Non-Statisticians
Clinical Research Statistics for Non-Statisticians
 
Maas, Andrew
Maas, AndrewMaas, Andrew
Maas, Andrew
 
Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical research
 
The Future of Personalized Medicine
The Future of Personalized MedicineThe Future of Personalized Medicine
The Future of Personalized Medicine
 
Computational Pathology Workshop July 8 2014
Computational Pathology Workshop July 8 2014Computational Pathology Workshop July 8 2014
Computational Pathology Workshop July 8 2014
 
The Role of Statistician in Personalized Medicine: An Overview of Statistical...
The Role of Statistician in Personalized Medicine: An Overview of Statistical...The Role of Statistician in Personalized Medicine: An Overview of Statistical...
The Role of Statistician in Personalized Medicine: An Overview of Statistical...
 
TCIA Data Harmonization Project
TCIA Data Harmonization ProjectTCIA Data Harmonization Project
TCIA Data Harmonization Project
 
Narrative review | Prisma systematic review | Medical writing
Narrative review | Prisma systematic review | Medical writingNarrative review | Prisma systematic review | Medical writing
Narrative review | Prisma systematic review | Medical writing
 
The Clinical Genome Conference 2014
The Clinical Genome Conference 2014The Clinical Genome Conference 2014
The Clinical Genome Conference 2014
 
Introduction to Systematic Reviews
Introduction to Systematic ReviewsIntroduction to Systematic Reviews
Introduction to Systematic Reviews
 
Critical Analysis Journal club how to do as a beginner
Critical Analysis Journal club  how to do as a beginnerCritical Analysis Journal club  how to do as a beginner
Critical Analysis Journal club how to do as a beginner
 
provenance of microarray experiments
provenance of microarray experimentsprovenance of microarray experiments
provenance of microarray experiments
 
Secondary Data Analysis
Secondary Data AnalysisSecondary Data Analysis
Secondary Data Analysis
 
The Role of The Statisticians in Personalized Medicine: An Overview of Stati...
The Role of The Statisticians in Personalized Medicine:  An Overview of Stati...The Role of The Statisticians in Personalized Medicine:  An Overview of Stati...
The Role of The Statisticians in Personalized Medicine: An Overview of Stati...
 
1Big Data Analytics forHealthcareChandan K. ReddyD.docx
1Big Data Analytics forHealthcareChandan K. ReddyD.docx1Big Data Analytics forHealthcareChandan K. ReddyD.docx
1Big Data Analytics forHealthcareChandan K. ReddyD.docx
 

Último

Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...
Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...
Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...narwatsonia7
 
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Dipal Arora
 
VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...
VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...
VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...Neha Kaur
 
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Dehradun Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
VIP Call Girls Tirunelveli Aaradhya 8250192130 Independent Escort Service Tir...
VIP Call Girls Tirunelveli Aaradhya 8250192130 Independent Escort Service Tir...VIP Call Girls Tirunelveli Aaradhya 8250192130 Independent Escort Service Tir...
VIP Call Girls Tirunelveli Aaradhya 8250192130 Independent Escort Service Tir...narwatsonia7
 
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort ServicePremium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Servicevidya singh
 
Call Girls Bareilly Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Bareilly Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Bareilly Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Bareilly Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Lucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel roomLucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel roomdiscovermytutordmt
 
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore EscortsCall Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escortsvidya singh
 
Top Rated Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...
Top Rated  Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...Top Rated  Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...
Top Rated Hyderabad Call Girls Erragadda ⟟ 6297143586 ⟟ Call Me For Genuine ...chandars293
 
Bangalore Call Girl Whatsapp Number 100% Complete Your Sexual Needs
Bangalore Call Girl Whatsapp Number 100% Complete Your Sexual NeedsBangalore Call Girl Whatsapp Number 100% Complete Your Sexual Needs
Bangalore Call Girl Whatsapp Number 100% Complete Your Sexual NeedsGfnyt
 
Call Girls Faridabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Faridabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Faridabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Faridabad Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Low Rate Call Girls Kochi Anika 8250192130 Independent Escort Service Kochi
Low Rate Call Girls Kochi Anika 8250192130 Independent Escort Service KochiLow Rate Call Girls Kochi Anika 8250192130 Independent Escort Service Kochi
Low Rate Call Girls Kochi Anika 8250192130 Independent Escort Service KochiSuhani Kapoor
 
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...Taniya Sharma
 
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...jageshsingh5554
 
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...hotbabesbook
 

Último (20)

Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...
Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...
Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...
 
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service Available
 
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
 
VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...
VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...
VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...
 
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Dehradun Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service Available
 
VIP Call Girls Tirunelveli Aaradhya 8250192130 Independent Escort Service Tir...
VIP Call Girls Tirunelveli Aaradhya 8250192130 Independent Escort Service Tir...VIP Call Girls Tirunelveli Aaradhya 8250192130 Independent Escort Service Tir...
VIP Call Girls Tirunelveli Aaradhya 8250192130 Independent Escort Service Tir...
 
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort ServicePremium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
 
Call Girls Bareilly Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Bareilly Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Bareilly Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Bareilly Just Call 9907093804 Top Class Call Girl Service Available
 
Lucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel roomLucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel room
 
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Nagpur Just Call 9907093804 Top Class Call Girl Service Available
 
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)

  • 1. PhenDisco: a new phenotype discovery system for the database of genotypes and phenotypes. Son Doan, Hyeoneui Kim, Division of Biomedical Informatics, University of California San Diego. Open Access Journal Club, 09/05/2013
  • 2. Roadmap to the Presentation: Background; dbGaP; Challenges in using dbGaP; pFINDR program; PhenDisco development; User requirement analysis for PhenDisco; Data standardization (variables, study metadata); System development: technical details; PhenDisco demo; Performance evaluation
  • 4. Overview of dbGaP: Database of Genotypes and Phenotypes; developed by NCBI; stores and distributes the data and outputs of studies on the interactions of genotypes and phenotypes; provides 2 levels of access. Open access: variable information, including summary statistics and study information. Controlled access: raw data, upon approval by the NIH DAC.
  • 5. A Typical Challenge in Using dbGaP: “Potentially, dbGaP is great…it contains so many different types of studies and their data! However, I find it very hard to reuse dbGaP data because there is no easy but robust way to filter studies by important study related information such as study design, analysis methods, analysis data produced by the studies. Even if I find the studies that seem fitting to my needs, I still need to make sure that the studies have the genotype and/or the phenotype information that I need. Of course, dealing with the data values with all sort of different formats is another challenge to go through…” (Erin Smith, PhD, Division of Genome Information Science, UCSD)
  • 9. pFINDR (phenotype Finding IN Data Repositories): funded by NHLBI; aims to facilitate dbGaP use by improving the accuracy and completeness of search returns through (1) standardized phenotype variables and (2) searchable study-related information.
  • 11. Use-Case Driven Development. User requirements collected from: analysis of data use descriptions from data requests available in dbGaP (14,287 requests); online user survey (17 users); user interviews (8 local dbGaP users); NIH officers/Scientific Advisory Board recommendations and suggestions.
  • 12. Data Request Analysis (chart). Labeled shares: Neoplasm/Cancer (30%); Psychiatric Disease (13%); Genetic Disease Congenital Abnormality (8.6%); Cardiovascular Disease (8.1%). Legend categories: Disease; Chemical or Biological Substance; Therapeutic or Preventive Procedure; Research Activity; Laboratory Procedure or Test; Pathologic Function; Signs or Symptoms; Diagnostic Procedure; Clinical Attributes; Mood, Emotion, and Individual Behavior; Qualitative Concept; Mental Process; Social Behavior; Organism Function; Daily Function or Activity; Health Care Activity; Food; Other.
  • 13. Interviews, Survey and SAB/NIH officers’ feedback. Functions that maximize search efficiency. Examples: “option to expand search terms through synonyms”; “studies displayed in the order of relevancy”; “select studies from the returned list and save for later review”; “search results organized in a way that supports quick browsing”.
  • 14. Problems We Addressed. Focus areas: completeness and accuracy of search results (abbreviation expansion; concept-based search); ease of result review (sorting the results by relevancy; highlighting search keywords in the retrieved records). Additional functionality: export of selected study and variable information; categorization of variables.
  • 15. Data Standardization: variable standardization; study-level metadata generation.
  • 16. Phenotype Variable Standardization. Used variable descriptions; focused on identifying the topic (main theme: “pain”, “walking”) and the subject of information (SOI, i.e., the bearer: “study subject”); mapped the topic and SOI concepts to the UMLS Metathesaurus.
      Example variable:
      Variable ID | Variable Name | Variable Description
      Phv00116192.v2.p2 | C41RPACE | Get pain when walk at ordinary pace?
  • 18. Phenotype Variable Standardization
      Variable descriptions (135,608 variables). Example: “77 age mom diagnosed – stroke (tia)”
  • 19. Phenotype Variable Standardization
      Variable descriptions (135,608 variables). Example: “77 age mom diagnosed – stroke (tia)”
      Normalization: spell out abbreviations and shorthand expressions; drop question numbers and other unimportant characters. Result: “age mother diagnosed stroke (tia)”
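The normalization step on this slide can be illustrated with a minimal Python sketch. The abbreviation table and the regular expressions below are assumptions made for illustration only; the slides do not show the lexicon or the rules PhenDisco actually uses.

```python
import re

# Hypothetical abbreviation table (assumption): the real PhenDisco lexicon is not shown.
ABBREVIATIONS = {"mom": "mother", "dad": "father"}


def normalize(description):
    """Normalize a variable description: drop a leading question number,
    drop free-standing dashes, and spell out known abbreviations."""
    text = description.lower()
    # Drop a leading question number such as "77 " or "3. " (assumed pattern).
    text = re.sub(r"^\s*\d+[a-z]?[.)]?\s+", "", text)
    # Drop dashes that stand alone between words (hyphen, en dash, em dash).
    text = re.sub(r"\s+[\u2013\u2014-]+\s+", " ", text)
    # Spell out abbreviations token by token.
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("77 age mom diagnosed \u2013 stroke (tia)"))
    # prints: age mother diagnosed stroke (tia)
```

A lookup table keeps the sketch transparent; a production system would likely need a much larger, curated abbreviation list and context-aware handling of leading numbers that are part of the description.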
  • 20. Phenotype Variable Standardization
      Variable descriptions (135,608 variables). Example: “77 age mom diagnosed – stroke (tia)”
      Normalization: spell out abbreviations and shorthand expressions; drop question numbers and other unimportant characters. Result: “age mother diagnosed stroke (tia)”
      MetaMap processing: generate CUIs, concept names, and semantic types. Result: C0001779: age [organism attribute]; C0026591: Mother [family group]; C0038454: Stroke [disease or syndrome]
  • 21. Phenotype Variable Standardization
      Variable descriptions (135,608 variables). Example: “77 age mom diagnosed – stroke (tia)”
      Normalization: spell out abbreviations and shorthand expressions; drop question numbers and other unimportant characters. Result: “age mother diagnosed stroke (tia)”
      MetaMap processing: generate CUIs, concept names, and semantic types. Result: C0001779: age [organism attribute]; C0026591: Mother [family group]; C0038454: Stroke [disease or syndrome]
      Semantic role assignment: role identification based on semantic types and keywords; evaluation on a random sample of 500: 73% accuracy. Result: topic: C0001779 age, C0038454 Stroke; subject of information: C0026591 Mother
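As a rough illustration of semantic-type-based role assignment, the sketch below splits MetaMap-style concepts into topics and a subject of information. The rule used here (the semantic type "family group" marks the subject of information, and "study subject" is the default) is an assumption inferred from the slide's example, not the project's actual rule set, which also uses keywords.

```python
# Hypothetical rule (assumption): these semantic types mark the subject of information.
SUBJECT_SEMANTIC_TYPES = {"family group"}
# Default bearer when no other subject is mentioned (taken from the slides' examples).
DEFAULT_SUBJECT = "study subject"


def assign_roles(concepts):
    """Split (cui, name, semantic_type) tuples into topics and subjects of information."""
    topics = [name for _, name, sem_type in concepts
              if sem_type not in SUBJECT_SEMANTIC_TYPES]
    subjects = [name for _, name, sem_type in concepts
                if sem_type in SUBJECT_SEMANTIC_TYPES]
    return {
        "topic": topics,
        "subject_of_information": subjects or [DEFAULT_SUBJECT],
    }


if __name__ == "__main__":
    concepts = [
        ("C0001779", "age", "organism attribute"),
        ("C0026591", "Mother", "family group"),
        ("C0038454", "Stroke", "disease or syndrome"),
    ]
    print(assign_roles(concepts))
```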
  • 22. Phenotype Variable Standardization
      Variable descriptions (135,608 variables). Example: “77 age mom diagnosed – stroke (tia)”
      Normalization: spell out abbreviations and shorthand expressions; drop question numbers and other unimportant characters. Result: “age mother diagnosed stroke (tia)”
      MetaMap processing: generate CUIs, concept names, and semantic types. Result: C0001779: age [organism attribute]; C0026591: Mother [family group]; C0038454: Stroke [disease or syndrome]
      Semantic role assignment: role identification based on semantic types and keywords; evaluation on a random sample of 500: 73% accuracy. Result: topic: C0001779 age, C0038454 Stroke; subject of information: C0026591 Mother
      Variable categorization: categorization based on semantic types and keywords; evaluation on a random sample of 500: 71% accuracy. Result: family history, demographics
  • 23. Category Examples
      Variable Description | Topics | Subject of Information | Variable Categories
      Gender of the participant | gender | study subject | Demographics
      Last known smoking status | smoking | study subject | Smoking History
      Cigarettes/day, exam 1 | smoking, medical examination | study subject | Smoking History; Healthcare Activity Finding
      Age in years at uric acid measurement | age, uric acid measurement | study subject | Demographics; Lab Tests
      AGE of living mother | age | mother | Demographics - Family
      Age at dementia onset as defined by the DSM IV definition | age, dementia | study subject | Demographics; Medical History
  • 24. Phenotype Variable Standardization
      Variable descriptions (135,608 variables). Example: “77 age mom diagnosed – stroke (tia)”
      Normalization: spell out abbreviations and shorthand expressions; drop question numbers and other unimportant characters. Result: “age mother diagnosed stroke (tia)”
      MetaMap processing: generate CUIs, concept names, and semantic types. Result: C0001779: age [organism attribute]; C0026591: Mother [family group]; C0038454: Stroke [disease or syndrome]
      Semantic role assignment: role identification based on semantic types and keywords; evaluation on a random sample of 500: 73% accuracy. Result: topic: C0001779 age, C0038454 Stroke; subject of information: C0026591 Mother
      Variable categorization: categorization based on semantic types and keywords; evaluation on a random sample of 500: 71% accuracy. Result: family history, demographics
      Identification of similar variables (in progress): same CUI, similar keywords, and same category
  • 25. Study Level Metadata Annotation. Manual annotation of 422 studies (07/31/13). Metadata items generated: disease topics (encoded with UMLS); geographical information (encoded with ISO 3166-2 subdivision codes: state and country); IRB approval (required or not); consent type (not restricted, restricted, unspecified); sample demographics (race and/or ethnicity, gender, age).
  • 27. PhenDisco: Put-it-all-together (architecture diagram). Components shown: free text query; query parser; NLP tools + MetaMap; information model mapping; dbGaP; sdGaP; relevant studies; BM25 ranking algorithm; ranked studies.
  • 29. Contextual Query Language. Query types: simple queries (keywords, phrases); Boolean logic (AND, OR, NOT); index values, e.g., age > 40. Language guideline: defined in BNF form.
  • 30. BNF form
      cqlQuery ::= prefixAssignment cqlQuery | scopedClause
      prefixAssignment ::= '>' prefix '=' uri | '>' uri
      scopedClause ::= scopedClause booleanGroup searchClause | searchClause
      booleanGroup ::= boolean [modifierList]
      boolean ::= 'and' | 'or' | 'not' | 'prox'
      searchClause ::= '(' cqlQuery ')' | index relation searchTerm | searchTerm
      relation ::= comparitor [modifierList]
      comparitor ::= comparitorSymbol | namedComparitor
      comparitorSymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>' | '=='
      namedComparitor ::= identifier
      modifierList ::= modifierList modifier | modifier
      modifier ::= '/' modifierName [comparitorSymbol modifierValue]
      prefix, uri, modifierName, modifierValue, searchTerm, index ::= term
      term ::= identifier | 'and' | 'or' | 'not' | 'prox' | 'sortby'
      identifier ::= charString1 | charString2
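To make the grammar more concrete, here is a minimal hand-written parser for a small subset of it: left-associative `scopedClause` chains, parenthesized clauses, bare search terms, and `index relation searchTerm` with the symbolic comparitors. It is a sketch under stated assumptions, not the PhenDisco parser: the slides say the system uses pyparsing, whereas this example uses only the Python standard library, ignores prefix assignments, modifiers, and named comparitors, and returns nested tuples whose shape is invented here for illustration.

```python
import re

# One token: a quoted phrase, a comparison/parenthesis symbol, or a bare word.
TOKEN = re.compile(r'\s*(?:"([^"]*)"|(>=|<=|<>|==|=|>|<|\(|\))|([^\s()=<>"]+))')
BOOLEANS = {"and", "or", "not", "prox"}
RELATIONS = {"=", ">", "<", ">=", "<=", "<>", "=="}


def tokenize(query):
    """Turn a query string into (kind, value) pairs: 'term', 'sym', or 'bool'."""
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        match = TOKEN.match(query, pos)
        if not match:
            raise ValueError("cannot tokenize at position %d" % pos)
        phrase, symbol, word = match.groups()
        if phrase is not None:
            tokens.append(("term", phrase))
        elif symbol is not None:
            tokens.append(("sym", symbol))
        elif word.lower() in BOOLEANS:
            tokens.append(("bool", word.lower()))
        else:
            tokens.append(("term", word))
        pos = match.end()
    return tokens


def parse(query):
    """Parse a query into nested tuples (the tuple layout is an illustrative choice)."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def search_clause():
        kind, value = peek()
        if (kind, value) == ("sym", "("):
            advance()
            node = scoped_clause()
            if peek() != ("sym", ")"):
                raise ValueError("missing closing parenthesis")
            advance()
            return node
        if kind != "term":
            raise ValueError("expected a search term")
        advance()
        rel_kind, rel = peek()
        if rel_kind == "sym" and rel in RELATIONS:
            advance()
            term_kind, term = peek()
            if term_kind != "term":
                raise ValueError("expected a value after %r" % rel)
            advance()
            return ("rel", value, rel, term)  # index relation searchTerm
        return ("term", value)

    def scoped_clause():
        node = search_clause()
        while peek()[0] == "bool":  # left-associative boolean chain
            operator = advance()[1]
            node = (operator, node, search_clause())
        return node

    tree = scoped_clause()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return tree


if __name__ == "__main__":
    print(parse('schizophrenia AND age > 40'))
```

Left recursion in `scopedClause` is handled with a loop, which yields the same left-associative grouping the grammar implies without risking infinite recursion.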
  • 32. BM25 ranking algorithm. Notation: N: total number of studies; n_t: number of studies containing the term t; c: a field in study d; w_c: boost factor for field c; tf: term frequency; idf: inverse document frequency.
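The scoring formula itself did not survive in this transcript, so the following is only a sketch of one common field-boosted variant of Okapi BM25 that uses the quantities listed on the slide (N, n_t, fields c, boosts w_c, tf, idf). The parameters `k1` and `b`, the smoothed idf, the whitespace tokenization, and the function name are assumptions; the formula on the slide and the Whoosh implementation used by PhenDisco may differ.

```python
import math


def bm25_field_scores(query_terms, studies, field_boosts, k1=1.2, b=0.75):
    """Score studies for a query with a field-boosted BM25 variant.

    studies: {study_id: {field_name: text}}
    field_boosts: {field_name: w_c}; missing fields default to a boost of 1.0.
    """
    # Naive whitespace tokenization (assumption).
    tokenized = {
        sid: {field: text.lower().split() for field, text in fields.items()}
        for sid, fields in studies.items()
    }
    n_studies = len(tokenized)  # N
    # Boost-weighted length of each study, used for length normalization.
    lengths = {
        sid: sum(field_boosts.get(f, 1.0) * len(toks) for f, toks in fields.items())
        for sid, fields in tokenized.items()
    }
    avg_len = sum(lengths.values()) / n_studies if n_studies else 0.0

    scores = {}
    for sid, fields in tokenized.items():
        score = 0.0
        for term in query_terms:
            term = term.lower()
            # n_t: number of studies containing the term in any field.
            n_t = sum(
                1 for other in tokenized.values()
                if any(term in toks for toks in other.values())
            )
            if n_t == 0:
                continue
            idf = math.log(1 + (n_studies - n_t + 0.5) / (n_t + 0.5))
            # Boost-weighted term frequency: sum over fields c of w_c * tf(t, c, d).
            tf = sum(field_boosts.get(f, 1.0) * toks.count(term)
                     for f, toks in fields.items())
            norm = k1 * (1 - b + b * lengths[sid] / avg_len) if avg_len else k1
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores[sid] = score
    return scores


if __name__ == "__main__":
    demo = {
        "s1": {"title": "COPD genetics", "description": "lung function study"},
        "s2": {"title": "Heart study", "description": "COPD patients and controls"},
    }
    print(bm25_field_scores(["COPD"], demo, {"title": 2.0, "description": 1.0}))
```

With a title boost of 2.0, a study that mentions the query term in its title ranks above one that mentions it only in the description, which is the intended effect of the per-field boost factor.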
  • 33. Technical Infrastructure. URL: http://pfindr-data.ucsd.edu/_PhDVer1/; Linux machine: Ubuntu, 64-bit; memory: 32 GB RAM; database: MySQL 14.14; web server: Apache 2.2.20; programming languages: PHP, Python, JavaScript; Python toolkits: pyparsing, Whoosh.
  • 35. System Evaluation: search accuracy; user interface.
  • 36. Evaluation on Basic Search (as of July 7, 2013)
      Basic Search | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
      COPD | 100% | 41.67% | 80.00% | 100%
      “macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
      “breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
      schizophrenia | 100% | 46.88% | 86.67% | 92.86%
      cardiomyopathy | 100% | 35.00% | 100% | 100%
      Average | 100% | 46.61% | 83.33% | 95.71%
      Average F-measure: dbGaP 0.64; PhenDisco 0.89
  • 37. Evaluation on Advanced Search (as of July 7, 2013)
      Advanced Search in PhenDisco | Recall | Precision
      “macular degeneration” AND white AND [whole genome genotyping] | 100% | 66.67%
      “breast cancer” AND “breast density” AND [IRB not required] AND [whole genome genotyping] | 100% | 100%
      schizophrenia AND [female] AND [AFFY_6.0] | 100% | 100%
      cardiomyopathy AND [copy number variant analysis] | 100% | 100%
      Average | 100% | 91.67%
      Average F-measure: 0.96
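The average F-measures reported on the two evaluation slides are consistent with the harmonic mean of the averaged recall and precision, F = 2PR / (P + R). The short check below reproduces the three reported values from the slide averages; the function name is illustrative only.

```python
def f_measure(recall, precision):
    """Balanced F-measure (harmonic mean of recall and precision), inputs in [0, 1]."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


if __name__ == "__main__":
    # Averages taken from the evaluation slides.
    print(round(f_measure(1.0, 0.4661), 2))     # dbGaP basic search: 0.64
    print(round(f_measure(0.8333, 0.9571), 2))  # PhenDisco basic search: 0.89
    print(round(f_measure(1.0, 0.9167), 2))     # PhenDisco advanced search: 0.96
```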
  • 38. Feedback on the User Interface (N=6)
  • 39. Trainees
      Post-doctoral trainees: Ko-Wei Lin, DVM, PhD (Study Abstraction, Standardization, Evaluation); Mindy Ross, MD, MBA (Study Abstraction, Ontology Building); Neda Alipanah, PhD (Ontology Building); Xiaoqian Jiang, PhD (Ranking Algorithm); Mike Conway, PhD (Study Abstraction)
      Undergraduate trainees: Alexander Hsieh (Standardization); Vinay Venkatesh (System Development); Rafael Talavera (Evaluation); Karen Truong (Study Abstraction); Asher Garland (System Development)
  • 40. Acknowledgements
      PI: Lucila Ohno-Machado
      Collaborator: Hua Xu
      Other contribution: Jihoon Kim; Wendy Chapman; Melissa Tharp
      Staff: Stephanie Feudjio Feupe, MS; Seena Farzaneh, MS; Rebecca Walker, BS
      Funding: UH2HL108785 from NHLBI, NIH
  • 41. Questions?
      Project homepage: http://pfindr.net
      PhenDisco: http://pfindr-data.ucsd.edu/_PhDVer1/index.php
      Contact: lohnomachado@ucsd.edu, hyk038@ucsd.edu, sondoan@ucsd.edu