NY Prostate Cancer Conference - P.A. Fearn - Session 1: Data management for predictive tools
1. Data Management for Predictive Tools. Paul Fearn, MBA, NLM Informatics Research Fellow, Biomedical and Health Informatics, University of Washington | Fred Hutchinson Cancer Research Center, Seattle, Washington. PROSTATE CANCER: PREDICTIVE MODELS FOR DECISION MAKING, April 7th – 9th, 2011, MSKCC, New York, NY
2. Data Management Requirements: Need to assemble large datasets for predictive modeling; Pooling data across sites, systems and countries; Linking data across clinical, specimen and lab repositories; Quality assurance (for reproducibility of results); Tradeoffs between accuracy and reproducibility of data points; Transparency of data processing; Complete and up-to-date datasets; Easy to access, sort, filter and export data; Statistical analysis in Stata, R, SPSS, SAS, Excel; SQL queries and reports; Sustainability; Secondary (N-ary) use of clinical and research data; Cumulative cost of data entry; Cumulative cost of staff training and turnover; Cumulative risks and opportunity costs of staff entrenchment
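The requirements of linking data across clinical, specimen and lab repositories and of supporting SQL queries can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module; the table names, column names and sample rows are hypothetical, since the slides do not specify any schema.

```python
import sqlite3

# Hypothetical schema: three repositories linked by a shared patient_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients  (patient_id INTEGER PRIMARY KEY, site TEXT);
CREATE TABLE specimens (specimen_id INTEGER PRIMARY KEY, patient_id INTEGER, tissue TEXT);
CREATE TABLE labs      (lab_id INTEGER PRIMARY KEY, patient_id INTEGER, test TEXT, value REAL);
""")

# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO patients VALUES (?, ?)",
                 [(1, "NY"), (2, "Seattle")])
conn.executemany("INSERT INTO specimens VALUES (?, ?, ?)",
                 [(10, 1, "prostate biopsy")])
conn.executemany("INSERT INTO labs VALUES (?, ?, ?, ?)",
                 [(100, 1, "PSA", 4.2), (101, 2, "PSA", 6.8)])


def linked_rows(connection):
    """Return one row per patient/specimen/lab combination.

    LEFT JOINs keep patients who have no specimen or lab record,
    so gaps in the linked dataset stay visible as NULL (None).
    """
    query = """
    SELECT p.patient_id, p.site, s.tissue, l.test, l.value
    FROM patients p
    LEFT JOIN specimens s ON s.patient_id = p.patient_id
    LEFT JOIN labs l ON l.patient_id = p.patient_id
    ORDER BY p.patient_id
    """
    return connection.execute(query).fetchall()


rows = linked_rows(conn)
```

The same result set could then be exported for analysis in one of the statistical packages listed on the slide.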
3. The Growth Problem: Lu Z. PubMed and Beyond. Database 2011;2011:baq036. 21245076[pmid]
8. The Curation Problem: Increasing volume of data; More data points for annotation: clinical / patient, genomic / biological, public health / environment; Parallel curation issues in modern clinical and biological research databases (Krallinger 2008*); Development of NLP system to support clinical research operations (Savova 2010**). *18834499[pmid], **20819853[pmid]
9. On the Other Hand…: Long tail of research efforts; Small heterogeneous labs and projects; Subsets of data; Specialized requirements; Innovative approaches
10. Spectrum of Approaches: One dataset per project (i.e. study based systems); Registry databases (i.e. one treatment or disease); Data warehouse or data repository; Common schema (data model); “Amalgamation” of heterogeneous datasets; Common security and access; Common syntax (data format); Defined links between records; Indexed for searching and retrieval; Federation / grid of semantically integrated data; Common vocabulary / terminology; Formal models (caBIG)
25. Appendix: Clinical Systems: Surgical Reports; Radiation Therapy Reports; Pathology Reports; Laboratory Reports; Radiology Reports; Review of Systems and Patient Reported Outcomes; Electronic Medical / Health Records; Registration / demographics; Clinical trials eligibility and recruitment; Scheduling and operations
26. Appendix: Engaging Patients in Data Management: Pre-first visit questionnaires; Web-based survey systems (e.g. REDCap); Patient reported outcomes; Longitudinal follow-up process; Tablets, iPads and mobile applications
Editor's Notes
I hope you will give a broad overview of the key features of the database that would allow the development of optimal predictive models, demonstrate how Caisis works to collect clinical and research data, and show how it has proved to be so valuable to the development of predictive models.
Constraints on data entry increase reproducibility, but may decrease accuracy. Conducive to quantitative research and hypothesis testing. Open fields / coding may increase accuracy, but decrease reproducibility. Conducive to qualitative research and discovery.
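The tradeoff in this note can be sketched in a few lines of Python. The picklist, field and function names below are hypothetical and do not come from Caisis or any other system named in the slides; the point is only that a constrained field yields uniform values while rejecting nuance, and an open field does the reverse.

```python
# Hypothetical picklist for a constrained "clinical stage" field.
ALLOWED_STAGES = {"T1", "T2", "T3", "T4"}


def record_constrained(value):
    """Accept only picklist values.

    Every stored value is one of a few codes (reproducible), but any
    nuance the clinician wanted to express is rejected (less accurate).
    """
    code = value.strip().upper()
    if code not in ALLOWED_STAGES:
        raise ValueError(f"{value!r} is not one of {sorted(ALLOWED_STAGES)}")
    return code


def record_open(value):
    """Accept free text.

    Nuance is preserved (potentially more accurate), but two people may
    describe the same case in different words (less reproducible).
    """
    return value.strip()


note = "T2, possibly T3 on imaging"
open_value = record_open(note)          # nuance kept verbatim
try:
    record_constrained(note)            # nuance cannot be stored
    constrained_accepted = True
except ValueError:
    constrained_accepted = False
```

In practice a system might combine the two, for example a coded field plus an optional comment field, but that design choice is outside what the note states.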
Krallinger et al. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol (2008) vol. 9 Suppl 2 pp. S8. Savova et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc (2010) vol. 17 (5) pp. 507-13
Caisis is a data repository. One data model to rule them all.
How much time and effort does it take to pool databases and spreadsheets for predictive modeling? Stein. Creating a bioinformatics nation. Nature (2002) vol. 417 (6885) pp. 119-20. 12000935[pmid]. If there is a need for large aggregated datasets from heterogeneous sources to support predictive modeling, we need to plan for this model. Building for one site and rolling out to other sites successfully is rare.
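To make the pooling effort concrete, here is a minimal sketch of mapping two sites' spreadsheet exports onto one common set of field names. Everything in it is an assumption for illustration: the site names, column names, mappings and sample values are hypothetical, and real pooling would also need unit conversion, de-duplication and quality checks.

```python
import csv
import io

# Hypothetical per-site mappings from local column names to common field names.
SITE_MAPPINGS = {
    "site_a": {"id": "patient_id", "age_years": "age", "psa_ng_ml": "psa"},
    "site_b": {"PatientID": "patient_id", "Age": "age", "PSA": "psa"},
}


def pool(sources):
    """Combine CSV text from several sites into one list of records.

    Each record uses the common field names and keeps a 'site' field so
    that the origin of every data point stays traceable.
    """
    pooled = []
    for site, text in sources.items():
        mapping = SITE_MAPPINGS[site]
        for row in csv.DictReader(io.StringIO(text)):
            record = {"site": site}
            for local_name, common_name in mapping.items():
                record[common_name] = row[local_name]
            pooled.append(record)
    return pooled


# Made-up sample exports; values stay as strings, as csv reads them.
sources = {
    "site_a": "id,age_years,psa_ng_ml\n1,62,4.2\n",
    "site_b": "PatientID,Age,PSA\n7,58,6.8\n",
}
pooled = pool(sources)
```

Each additional site requires another hand-written mapping, which is one way to see why the cost of pooling grows with the number of heterogeneous sources.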
Most people proclaimed that they did not want to “reinvent the wheel”, but proceeded to do so. Disconnect between beliefs and actions. Harris et al. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform (2009) vol. 42 (2) pp. 377-81