1. Text mining
Text mining the PCD literature
PCD validity
Uses and Validity of Primary Care Database studies
May 2013
David Springate, Evan Kontopantelis, Ivan Olier, David Reeves
May 2013 Uses and Validity of Primary Care Database studies
Outline
1 Use of text-mining to explore the scientific literature
2 Text-mining the PCD literature
What is being studied using PCDs?
Changes in topics of investigation over time
3 Validity of Clinical coding
4 ClinicalCodes.org : A new repository for clinical code lists
Text mining
What is it?
The process of extracting high-quality structured information
from unstructured text (e.g. scientific literature).
Uses a variety of computational and statistical methods to
find patterns and trends in text
Text mining consists of:
1 Information extraction
Automatically extracting structured information from
unstructured text
2 Semantic searching
Improves search accuracy by incorporating context into the search
3 Knowledge discovery
Identifying relationships in extracted data
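
As a rough illustration of these three steps, the sketch below runs on a few made-up abstracts. The topic vocabulary, the sample abstracts and the function names are all hypothetical, and plain keyword matching stands in for the far more sophisticated methods that real text-mining tools use.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical topic vocabulary: topic -> keywords (illustrative only)
TOPICS = {
    "asthma": {"asthma", "inhaler"},
    "diabetes": {"diabetes", "insulin"},
    "statins": {"statin", "statins"},
}

# Made-up abstracts, not real publications
ABSTRACTS = [
    "Statin use and risk of diabetes in a primary care cohort.",
    "Inhaler prescribing for asthma in general practice.",
    "Insulin treatment, statins and cardiovascular outcomes.",
]


def extract_topics(text):
    """Information extraction: map free text to a set of topic labels."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {topic for topic, keywords in TOPICS.items() if words & keywords}


def search(abstracts, topic):
    """Crude stand-in for semantic searching: match on extracted topics
    rather than on the literal query string."""
    return [a for a in abstracts if topic in extract_topics(a)]


def cooccurrence(abstracts):
    """Knowledge discovery: count how often two topics appear together."""
    counts = Counter()
    for abstract in abstracts:
        for pair in combinations(sorted(extract_topics(abstract)), 2):
            counts[pair] += 1
    return counts
```

Here `search(ABSTRACTS, "diabetes")` also returns the third abstract, which mentions insulin but never the word "diabetes". Pair counts like those from `cooccurrence` are one simple way to obtain edges for a topic network.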
Why do we need it?
The scientific literature is rapidly
increasing in size
Humans can’t keep up to date with
the literature
75 trials and 11 systematic
reviews published per day!
Bastian et al. (2010) PLoS
Medicine
It is increasingly difficult to home in
on relevant papers
More of the literature is being held
online in machine-readable archives
TM can reduce processing time for
systematic reviews by 80%
(NCTM)
Text-mining is not a magic bullet
Many publications are not open
access
Often need to rely on
abstracts
Grey literature is often
inaccessible
Still need plenty of human
input!
TM algorithms can be very
complex
Breadth at the expense of depth
Text mining the PCD literature
UK Primary Care Databases
GPRD / CPRD
The General Practice Research Database / The Clinical Practice Research Datalink
~900 papers
THIN
The Health Improvement Network
~360 papers
QResearch
~75 papers
The Dataset
All articles reported by CPRD, THIN and QResearch that are indexed in PubMed
1185 abstracts with metadata
141 full-text articles for validation
How are PCDs being used by researchers?
PCD studies are a growth area!
Number of publications is rapidly increasing. . .
[Figure: PCD articles in PubMed. x-axis: year (1990 to 2010); y-axis: number of articles (0 to 150)]
. . . and there is global interest in UK PCD research
[Figure: Institutions affiliated with UK PCD publications]
Broad scope of topics in PCD studies
A network graph of PCD topics of investigation
[Figure: network graph. Node labels: Cancer1, Fractures/osteo, VTE, antipsychotics/smi, Diabetes, Asthma, NSAIDs, HRT, Flu vaccination, Pregnancy, CHD/antihypertensives, Stroke, Pneumonia, Statins, Psoriasis, Antibiotics, Steroids, Atrial/warfarin, Epilepsy, Antidepressants, Paracetamol, Heart attack, IBS, BMI/obesity, Kidney disease, Cancer2, Seizures, Auto-immune, COPD, Healthcare costs, Beta blockers]
PCD validity
Threats to validity
Unmeasured confounding
Correlation does not equal causation
GP recording
Clinical coding
Clinical Coding in PCDs
All clinical events are entered by GPs as clinical codes:
Symptoms, signs & diagnoses (Read codes)
Referrals to external care centres
Immunisation records
Prescription information
Diagnostic test records and results
Everything recorded by a GP can be identified (if you know
which codes to look for and where to look for them!)
e.g.
H331.00 - Asthma diagnosis
H33z011 - Severe asthma attack
33G1 - Spirometry testing
Clinical codes in PCD studies
Diagnoses are made by reference to a set of clinical codes
Workflow
1 Researchers decide on a rough set of codes for a condition
By searching lookup tables for matching terms
By reference to an external source (e.g. QOF)
2 Clinicians go through this draft list by hand and select the
relevant codes
3 The database is searched for events matching the finalised
code list
4 The correct combination of events in the timeframe of interest
gives a diagnosis
e.g. For asthma: at least one clinical event and at least one drug
event in the last year are needed to qualify
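
The workflow above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. The lookup table reuses the three example codes from the previous slide, the drug codes are placeholders, the event records are made up, and the rule is assumed to require both a clinical event and a drug event within 365 days of an index date.

```python
from datetime import date, timedelta

# Lookup table built from the example codes shown earlier (code -> description)
LOOKUP = {
    "H331.00": "Asthma diagnosis",
    "H33z011": "Severe asthma attack",
    "33G1": "Spirometry testing",
}

# Placeholder drug codes (hypothetical, not from any real dictionary)
DRUG_CODES = {"DRUG_A", "DRUG_B"}


def draft_code_list(lookup, term):
    """Step 1: search the lookup table for descriptions matching a term."""
    return {code for code, desc in lookup.items() if term.lower() in desc.lower()}


# Step 2 would be clinician review; here the draft is assumed to be accepted as-is.
CLINICAL_CODES = draft_code_list(LOOKUP, "asthma")


def has_diagnosis(events, index_date, window_days=365):
    """Steps 3 and 4: require at least one clinical event and at least one
    drug event within the window ending at index_date."""
    start = index_date - timedelta(days=window_days)
    recent = [code for code, when in events if start <= when <= index_date]
    has_clinical = any(code in CLINICAL_CODES for code in recent)
    has_drug = any(code in DRUG_CODES for code in recent)
    return has_clinical and has_drug


# Made-up event records for one patient: (code, event date)
patient_events = [
    ("H331.00", date(2012, 11, 3)),
    ("DRUG_A", date(2013, 1, 15)),
    ("33G1", date(2010, 6, 1)),
]
```

With these records, `has_diagnosis(patient_events, date(2013, 5, 1))` is true, while the same call with an index date of 1 June 2014 is false because both qualifying events fall outside the one-year window. The result depends entirely on the code list and the window, which is why unpublished code lists make a diagnosis hard to verify.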
Code list? What code list?
Currently no obligation to publish code lists
No centralised repository for clinical codes
The vast majority of PCD studies do not publish their codes
No way of knowing if a condition diagnosis is valid
No way to replicate the research
For example. . .
In 45 UK case-control PCD studies (diabetes):
Only 5 reported ANY clinical codes. . .
Only 2 of these published codes in appendix
Only 1 provided full set of code lists
Validity of Clinical coding
Clinical codes should be subject to scrutiny and peer review (either
pre- or post-publication)
This would allow for:
replication of studies
validation of diagnoses
incremental improvements to clinical definitions
ClinicalCodes.org
. . . is an online repository for PCD researchers to upload their
codes upon publication.
Deposit code-lists for published studies
Download historical code-lists
Archive for all Quality and Outcomes Framework business rules (2004 - current)
Database-specific information (e.g. consultation types)
ClinicalCodes.org
Allows for validation / replication of PCD studies
Tracking of disease definitions through time
Comparative studies of clinical codes
Don’t reinvent the wheel!
Currently in development on campus:
medcodes.ls.manchester.ac.uk:8080/codesdb
Summary
Publish open access!
Upload your codes!
Thank you