1. Working With Large-Scale Clinical Datasets
Craig Smail, MA, MSc (@craigsmail)
KU Medical Center
9th October 2014
Background: http://jsgamingtv.com/wp-content/uploads/2014/07/server-room-hd-free-23325111.jpg
3. Overview
• Target audience: anyone involved (directly
or indirectly) in clinical data extraction,
validation, and standardization
• Sections:
1. Data extraction: planning
2. Data extraction
3. Data standardization
4. Data transfer
4. Data Extraction: Planning
• Dataset type
– Most common: limited and de-identified
– Difference: limited can contain some personal
information (DOB, DOD, city, state, age)
• Legal agreements
– Data Use Agreement (DUA)
– Business Associate Agreement (BAA)
– Institutional Review Board (IRB)
• Usually only if IRB considers activity Human Subjects
Research
5. Data Extraction: Planning
• Important to finalize list of data elements
before pull
– Time-consuming to repull
– Reallocation of resources (e.g. programmer time)
• Summary statistics are helpful in planning
stage
– e.g. death status is requested a lot, but is very rarely
available in the EHR
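A minimal sketch of the kind of summary statistic that helps at the planning stage: the percentage of patients with each requested data element populated. The data frame `patients`, its column names, and its values are all made up for illustration.

```r
# Hypothetical extract: one row per patient; NA means the element is not recorded
patients <- data.frame(
  patient_id    = 1:6,
  dob           = as.Date(c("1950-01-01", "1962-05-12", NA, "1948-11-30",
                            "1971-07-04", "1955-03-19")),
  date_of_death = as.Date(c(NA, NA, NA, "2013-02-10", NA, NA)),
  hba1c         = c(7.1, NA, 6.4, 8.0, NA, 7.5)
)

# Percentage of patients with a non-missing value for each data element
pct_populated <- function(df) {
  round(100 * colMeans(!is.na(df)), 1)
}

completeness <- pct_populated(patients)
print(completeness)
```

A sparsely populated element (here the hypothetical `date_of_death`) is a signal to discuss a proxy or an external source before the pull is finalized.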
6. Data Extraction: Planning
• Use of data proxy correlated with data
element of interest
– sometimes need to develop proxies for data
points of interest (e.g. severity of pain;
hypoglycemic events)
– Example use case: aspirin as a proxy for
antiphospholipid antibodies lab [1]
• Proxy data elements should be supported by
data
[1] Frankovich, J., Longhurst, C., Sutherland, S. Evidence-Based Medicine in the EMR Era. N Engl J Med
2011; 365:1758-1759
7. Example: Proxy for Death Status
• Data extracted from large multi-specialty
clinic on the east coast
• 300,000 patients in EHR
• ~10,000 with date-of-death (we’ll take this as
gold-standard)
• Is days since last encounter a good proxy?
8. Example: Proxy for Death Status
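# glm() is part of base R (stats package), so no extra library is needed
# import data (column 1 = date of last encounter, mm/dd/yyyy;
# column 2 = death status, assumed coded 1 = deceased, 0 = alive)
# setwd("[dir here]")  # point this at the folder holding lastenc.csv
Encs = read.csv("lastenc.csv", header = FALSE)
# find days since last encounter (vectorized, so no loop is needed)
Encs[, 3] = as.numeric(as.Date("2014-09-02") - as.Date(Encs[, 1], "%m/%d/%Y"))
# binarize (no encounter in last 1000 days = 1, <= 1000 = 0; also tried 180, 365, 750)
Encs[, 4] = ifelse(Encs[, 3] > 1000, 1, 0)
# clean up table: keep death status and the binarized gap
Encs = Encs[, c(2, 4)]
names(Encs) = c("dead", "noRecentEnc")
# fit model (logistic regression, but could use something else)
fit = glm(dead ~ noRecentEnc, data = Encs, family = "binomial")
# rows = predicted class (fitted probability rounded to 0/1), columns = observed death status
confusionMatrix = table(round(fit$fitted.values), Encs$dead)
misclassRate = (confusionMatrix[1, 2] + confusionMatrix[2, 1]) / sum(confusionMatrix) # 0.34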
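The binarization comment above mentions that several cut-offs were tried. A minimal sketch of how such a comparison could be run is below. It is not the slide's method: instead of fitting a logistic regression, it uses the binarized gap directly as the predicted death flag. The data are synthetic and the helper `proxy_error_rate` is a hypothetical name.

```r
# Synthetic data for illustration only (not the clinic's data)
days_since_last_enc <- c(30, 200, 400, 800, 1200, 1500, 90, 2000, 600, 1100)
dead                <- c(0,  0,   0,   1,   1,    0,    0,  1,    0,   0)

# Misclassification rate when "gap > threshold" is used directly as the death flag
proxy_error_rate <- function(days, dead, threshold) {
  predicted <- as.integer(days > threshold)
  mean(predicted != dead)
}

# Compare the cut-offs mentioned in the comment above
thresholds <- c(180, 365, 750, 1000)
error_by_threshold <- sapply(thresholds, function(th) {
  proxy_error_rate(days_since_last_enc, dead, th)
})
names(error_by_threshold) <- thresholds
print(error_by_threshold)
```

Reporting the error rate for every cut-off, rather than only the best one, makes it easier to judge whether any threshold gives a usable proxy.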
9. Example: Proxy for Death Status
• Is days since last encounter a good proxy?
No (error rate = 34%)
• Consequences:
10. Data Extraction: Planning
• Cohort definition
– Spell out cohort definitions explicitly, including all
assumptions
– Real-world example:
• ‘Two consecutive eGFRs >= 15 and < 60 occurring at least 90
days apart’
• A further restriction specified ‘if any value > 60 falls within the 90
days, then throw out’
• But the word ‘consecutive’ already means that no values within the 90 days
will be considered at all
– If any other eGFR value occurs within the 90 days, then the
patient does not meet the first restriction
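One way to make the definition above unambiguous is to write it as code. The sketch below reads ‘consecutive’ as adjacent results in date order. The function name `meets_egfr_definition` and the three example patients are hypothetical.

```r
# TRUE if any two adjacent (consecutive) eGFR results are both >= 15 and < 60
# and were taken at least 90 days apart
meets_egfr_definition <- function(dates, egfr) {
  ord   <- order(dates)
  dates <- dates[ord]
  egfr  <- egfr[ord]
  n <- length(egfr)
  if (n < 2) return(FALSE)
  in_range <- egfr >= 15 & egfr < 60
  gap_days <- as.numeric(diff(dates))
  any(in_range[-n] & in_range[-1] & gap_days >= 90)
}

# A: two qualifying values 120 days apart, nothing in between
a <- meets_egfr_definition(as.Date(c("2013-01-01", "2013-05-01")), c(45, 50))

# B: same two values, but an eGFR of 72 falls between them,
#    so the qualifying values are no longer consecutive
b <- meets_egfr_definition(
  as.Date(c("2013-01-01", "2013-03-01", "2013-05-01")), c(45, 72, 50))

# C: an in-range value falls between them, so no adjacent pair is 90+ days apart
c_result <- meets_egfr_definition(
  as.Date(c("2013-01-01", "2013-02-01", "2013-05-01")), c(45, 50, 48))

print(c(A = a, B = b, C = c_result))
```

Patients B and C show why the extra ‘throw out’ restriction is redundant under this reading: any intervening value, in range or not, already breaks the consecutive pair.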
11. Data Extraction: Planning
• Final thought on planning:
“Not everything that counts can be counted and
not everything that can be counted counts.”
—Albert Einstein (or William Bruce Cameron,
depends who you believe)
• Some data elements are well populated, but
reflect things like coding bias (e.g. ‘up-coding’
to a code with a larger reimbursement)
12. Data Extraction
• What are data extractions being used for in the
NRN?
– Pharmaceutical companies: data on 143,057
patients from 8 health care organizations/health
care systems
– Federally-funded research (NIH, AHRQ): data on
~100,000 patients
– Health IT vendors: work with Cerner to produce
performance reports for use by participating
providers
• Clinicians like performance feedback; if your EHR cannot
provide it, they will go elsewhere (i.e. switch to another
vendor)
13. Data Extraction
• Longitudinal data important
– look at temporal trends in the same
patient
– during EHR transitions, some EHR vendors will
import all data, but restrict full access to only the
last 18/24/26 months; clinicians don’t like this
because they want to be able to access all data
14. Data Validation
• Date parameters (e.g. look at min and max encounter dates
in the dataset; with 1000s of patients in the dataset, you would expect
the observed dates to match the requested range)
– Percentage of distinct patients in extraction vs. overall practice count:
cohort percentages are quite stable across practices
» e.g. ‘all patients over age 18 with a diagnosis of type-2 diabetes
defined by ICD-9 code xx.xxx’
– Caveat: doesn’t work well with small practices (< 2,000 distinct
patients)
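The two checks above can be sketched as follows. The `encounters` table, the requested date window, and the practice size are all made-up values for illustration.

```r
# Hypothetical extraction: one row per encounter
encounters <- data.frame(
  patient_id = c(1, 1, 2, 3, 3, 4),
  enc_date   = as.Date(c("2012-01-15", "2013-06-30", "2012-03-02",
                         "2012-11-20", "2013-12-28", "2013-08-09"))
)

# Requested date window and overall practice size (both made up)
requested_start <- as.Date("2012-01-01")
requested_end   <- as.Date("2013-12-31")
practice_patient_count <- 50

# Check 1: observed min/max encounter dates fall inside the requested window
observed_range <- range(encounters$enc_date)
dates_in_window <- observed_range[1] >= requested_start &&
  observed_range[2] <= requested_end

# Check 2: share of the practice's distinct patients that ended up in the cohort
cohort_pct <- 100 * length(unique(encounters$patient_id)) / practice_patient_count

# Caveat from the slide: the percentage check is unreliable for small practices
small_practice <- practice_patient_count < 2000

print(list(dates_in_window = dates_in_window,
           cohort_pct      = cohort_pct,
           small_practice  = small_practice))
```

In practice the cohort percentage would be compared across practices; a practice far from the others is a candidate for a closer look at its extraction.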
15. Data Standardization
• Open-source models (e.g. the Observational Medical Outcomes
Partnership (OMOP) common data model)
• Script data out of database (e.g. SQL view)
• Map labs/procedures to standardized concept list
– Why? Different string labels refer to the same creatinine blood test across
three data feeds, with frequency of occurrence…
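A minimal sketch of mapping raw lab labels to one standardized concept. The raw labels, the lookup table `concept_map`, and the concept name `creatinine_serum` are hypothetical; they are not the labels from the data feeds mentioned above.

```r
# Hypothetical raw labels from different data feeds
labs <- data.frame(
  raw_label = c("Creatinine", "CREAT SERUM", "creatinine, serum",
                "Creat", "Hemoglobin A1c"),
  value     = c(1.1, 0.9, 1.4, 1.0, 7.2),
  stringsAsFactors = FALSE
)

# Hand-built lookup from normalized raw label to one standard concept name
concept_map <- c(
  "creatinine"        = "creatinine_serum",
  "creat serum"       = "creatinine_serum",
  "creatinine, serum" = "creatinine_serum",
  "creat"             = "creatinine_serum"
)

# Normalize case and whitespace, then look up;
# labels missing from the map come back as NA so they can be reviewed by hand
labs$concept <- unname(concept_map[tolower(trimws(labs$raw_label))])

# Frequency of each raw label per concept, including unmapped labels
label_counts <- table(labs$raw_label, labs$concept, useNA = "ifany")
print(label_counts)
```

Leaving unmapped labels as NA, rather than guessing, keeps the mapping auditable.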
17. Data Transfer
• HIPAA requirements
• Usually FTP to secure site (e.g. Egnyte)
Ref: http://www.hhs.gov/ocr/privacy/hipaa/enforcement/examples/
18. Concluding Thoughts
• Extracted data is treated as gold-standard, since it is pulled
directly from the data source (i.e. the EHR). However, data often comes
from an intermediate product (such as a registry product, like the
one DARTNet provides), and you usually don’t have control
over the data mapping from EHR to registry
• The EHR of the future (?):
– Genetic data (WGS or WES)
» WGS = ~100 GB
» WES = ~8 GB
– Integration with consumer wearable devices (e.g. Fitbit; iPhone ECG)
– Further down the road: human microbiome; home microbiome
19. Always question the data
Pic ref: http://www.yoyowall.com/wp-content/uploads/2013/07/Gandalf-The-Grey-The-Lord-Of-The-Rings.jpg
20. Questions?
• Slides available from slideshare
(URL @craigsmail)
• Email: csmail@aafp.org
Editor's Notes
Repulls waste everyone’s time
Used aspirin as a proxy for the antiphospholipid antibodies lab (due to the practice of prescribing aspirin to these patients at the site) when treating a 13-year-old girl with systemic lupus erythematosus (SLE)
Audience participation: ask what other factors might explain a gap in encounters (e.g. moved out-of-town, changed provider)
Only about a dozen lines of code
Binarized time since last encounter (tried 180, 365, 750)
CKD study: ~100,000 in dataset; if the same ratio holds (3% of individuals in EHR are dead), that gives 3,000 names for NDI
Cost: $350 + ($0.15 * 3,000 * 10) = $4,850
So you want to make sure the cohort you send to NDI is right!
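The cost estimate in the note above can be parameterized so it is easy to rerun for a different cohort size. This is a sketch: the variable names are hypothetical, and reading the factor of 10 as the number of years searched is an assumption, since the note does not say what it represents.

```r
ehr_patients      <- 100000  # patients in the CKD dataset (from the note)
proportion_dead   <- 0.03    # assumes the ~3% ratio carries over to this dataset
service_charge    <- 350     # flat fee in dollars (from the note)
fee_per_record_yr <- 0.15    # dollars per record (from the note)
years_searched    <- 10      # ASSUMPTION: the note's factor of 10 is years searched

# Number of names that would be sent for matching
records_sent <- ehr_patients * proportion_dead

# Total cost: flat fee plus per-record fee for every record and year
ndi_cost <- service_charge + fee_per_record_yr * records_sent * years_searched
print(ndi_cost)
```

Because the per-record term scales with cohort size, a loosely defined cohort inflates the bill directly.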
What are data extractions being used for in the NRN?
Pharmaceutical companies: type-2 diabetes study looking at drug prescribing habits of primary-care physicians for patients with type-2 diabetes (data on 143,057 patients from 8 health-care organizations/health care systems)
Federally-funded research (NIH, AHRQ): decision support for chronic kidney disease, working with National Kidney Foundation (data on ~100,000 patients)
Health IT vendors: we work with Cerner to produce performance reports for use by participating providers, used to compare performance on several metrics (e.g. blood pressure targets; accuracy of ICD-9 coding)
Clinicians like performance feedback; if your EHR cannot provide it, they will go elsewhere (i.e. switch)