Data science 101

  1. 1. Data Science 101 Robert Hoyt MD FACP January 12, 2017
  2. 2. Disclaimer • I have no conflicts of interest to report • The opinions presented are those of the author and do not necessarily reflect those of the University of West Florida
  3. 3. Learning Objectives Upon completion of the presentation participants should be able to: • Summarize the characteristics of data science • Summarize the skill sets for data scientists • Compare and contrast predictive analytics using statistics vs. machine learning • Enumerate features of IBM Watson Analytics (IBMWA) • Enumerate features of WEKA machine learning • List the challenges facing data science
  4. 4. Look Familiar?
  5. 5. AHIMA Supports Data Analytics
  6. 6. Definitions • Data science is “the scientific study of the creation, validation and transformation of data to create meaning.” 1 Because data science is relatively new, definitions are still evolving. Data science is a good “umbrella” term. • Analytics is “the discovery and communication of meaningful patterns in data.” While some would argue for separating data analytics from data mining and knowledge discovery from data (KDD), we will use the terms interchangeably. 2
  7. 7. Venn diagram of Data Science
  8. 8. Critical need for data scientists with: • Domain expertise (example: healthcare) • In depth statistical knowledge • Computer science expertise • Machine learning expertise • Programming expertise: R, SQL and Python languages • Relational database system (RDBS) knowledge • Comfort level with “Big Data”
  9. 9. Historical Background • While all industries (including sports) are incorporating analytics and data science, the business world was first. • Businesses benefitted from knowing which customers were likely to unsubscribe (churn) and if you purchased item A, would you purchase item B (market basket analysis). • As far back as the 1960s a small group of statisticians suggested their field should be broadened to handle more volume and variety of data. • In the 1990s computer scientists developed and promoted machine learning software.
  10. 10. Historical background • There is evidence that many healthcare workers lack training in statistics and machine learning. 3 • There is also evidence that statistics is not easy to teach to non-statisticians and difficult to retain. 4 • Statisticians recommend knowledge of calculus and linear algebra; not routinely studied by healthcare workers. They often prefer that statistical formulas should be calculated longhand.
  11. 11. Stats: Logistic Regression • Would you prefer approach A or B? (Figure: two panels, A and B)
  12. 12. Historical Background • As a result, many workers are not comfortable with statistics and data analytics • This observation flies in the face of a health data explosion and a shortage of data scientists • The data explosion is fueled by genomic, EHR-related, wearable technology and social media data. • About 75% of healthcare data is unstructured (free text), so it is difficult to analyze • Enter the Big Data era to further confuse matters
  13. 13. Big Data Definition • #1 Too much data to analyze on one computer • #2 The Five V’s • Volume: massive amounts of data are being generated • Velocity: data is being generated so rapidly that it needs to be analyzed without placing it in a database • Variety: roughly 80% of data in existence is unstructured so it won’t fit into a database or spreadsheet. • Veracity: current data can be “messy” with missing data and other challenges. • Value: data scientists now have the capability to turn large volumes of unstructured data into something meaningful. 5
  14. 14. Data Science is part of the federal vision of a healthcare system • Learning health system: “an ecosystem where all stakeholders can securely, effectively and efficiently contribute, share and analyze data” 6 (the PDCA cycle) • Precision medicine: “identifying which approaches will be effective for which patients based on genetic, environmental, and lifestyle factors.” Clearly this initiative requires a big data approach to integrate these data.7 • Population health requires data analytics • Value based care requires data
  15. 15. Types of analytics (Gartner) • Predictive analytics is characterized by four attributes: 1. An emphasis on prediction 2. Rapid analysis measured in hours or days 3. An emphasis on the business relevance of the resulting insights (no ivory tower analyses) 4. An emphasis on ease of use, thus making the tools accessible to business users. 8
  16. 16. Predictive Analytics • It could be argued that predictive analytics is the most important aspect of data science, where an outcome of importance is predicted based on multiple factors influencing the outcome. This is the area I will focus on. • Use cases will be discussed on the next slide • I will not cover: • Text mining with natural language processing (NLP), which is very important for mining unstructured data • Data visualization software, such as Tableau and QlikSense, used for descriptive analytics • Deep learning based on artificial intelligence (AI)
  17. 17. Predictive Analytics Use Cases • Predict poor patient outcomes (morbidity) • Sepsis prediction 9 • Impending renal failure 10 • Predict death (mortality) • Predict readmissions: in fiscal 2016 only 24% (799/3400) of reporting hospitals will not receive a penalty (0.1%-3% range) for too many readmissions. 11-12 • Predict high cost patients for population health care management: 5% of Medicare/Medicaid patients use 50% of resources. 13-14
  18. 18. Predictive Modeling Approaches 1. Modeling with statistics 2. Modeling with Machine Learning 3. Modeling using the R or Python programming languages (not covered)
  19. 19. Predictive analytics: design the model (diagram) • Statistical modeling: statistics • Machine learning: association, regression, classification, clustering • Programming languages
  20. 20. Data Science/Analytics Process
  21. 21. Predictive Analytics • The most common approach is to use classification, where you predict an outcome (dependent variable) that is categorical data (e.g. lived, died) from multiple predictors (independent variables). For example, you have a data set of pregnant women with Zika virus. Some have children with microcephaly and others don’t. You run a classification model to see if factors such as age, trimester of infection, fever, symptoms, etc. predict microcephaly • If the outcome is numerical then you would use linear regression
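The distinction on slide 21 between a categorical outcome (classification) and a numerical outcome (linear regression) can be sketched in a few lines of plain Python. This is only an illustration, not the method used in the slides: the data values are made up, there is a single predictor `x`, and the function names `fit_logistic` and `fit_linear` are hypothetical. Logistic regression stands in for "a classification model" here; it is fitted by simple gradient descent rather than by a statistics package.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=20000):
    """Classification: fit p(y=1 | x) = sigmoid(w*x + b) by gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

def fit_linear(xs, ys):
    """Regression: ordinary least squares for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data: one predictor, categorical outcome (0 = no, 1 = yes).
x_values = [1, 2, 3, 4, 5, 6, 7, 8]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(x_values, outcome)
boundary = -b / w  # value of x where the predicted probability crosses 50%

# Made-up data: one predictor, numerical outcome.
slope, intercept = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])

print(f"predicted probability at x=8: {sigmoid(w * 8 + b):.2f}")
print(f"linear fit: y = {slope:.1f}x + {intercept:.1f}")
```

With more than one predictor the same idea applies, but in practice you would let a tool such as WEKA or a statistics package do the fitting.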
  22. 22. Need for better data analytical tools • We would benefit from more user friendly tools and some degree of automation • MS Excel with the Analysis ToolPak add-in is a possibility but implies you know which stats tests to use • There are also multiple statistical packages, such as SPSS and SAS, also associated with a steep learning curve
  23. 23. Need for better data analytical tools • Tool #1: IBM Watson Analytics: automated predictive, descriptive and visualization analytics • Tool #2: WEKA: open source machine learning platform
  24. 24. IBM Watson Analytics • New program offered in 2015 that is not related to Watson Health (cognitive computing). Business oriented • Program is based on SPSS-based statistical tests. Covers regression, classification, decision trees, chi-square, t-tests, etc. • Program can automatically convert nominal data to numerical and vice versa • Versions • Free • Professional (Academic)
  25. 25. IBM Watson Analytics Academic program • Free for universities to use for teaching (non-commercial) purposes. Includes 100 students/professor/year • University of West Florida has used the program for about 12 months in a Health Informatics graduate course and a Data Mining (computer science) course • IBM did an on-site visit for training • Multiple videos on YouTube • PDF user guide available
  26. 26. IBM Watson analytics features • IBMWA is completely online • Accepts Excel and CSV input, as well as feeds from most relational database systems (RDBSs) • 100 GB storage • Limits: 500 columns and 10 million rows • Twitter Feed analysis
  27. 27. Our 2016 Review Article https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5080525/
  28. 28. IBMWA versions • About the time we submitted our detailed analysis of Watson Analytics, IBM had created Watson-2 that combined “Explore” and “Predict” into “Discover.” Watson-1 will be retired shortly. • Watson-2 includes statistical details about prediction when the target/outcome is numerical. They are working on adding the statistical details for categorical targets/outcomes. • Watson-2 includes a “data quality” score but doesn’t point out missing or skewed data and outliers.
  29. 29. Breast-Cancer-spreadsheet • 286 patients • One outcome and 9 Attributes or predictors
  30. 30. Step 1: upload data
  31. 31. Step 2 Create a new Discovery
  32. 32. “Ask a question” function
  33. 33. Select a Visualization
  34. 34. Predictive Analytics: What Drives Breast Cancer Recurrence + predictive model
  35. 35. Predict breast cancer recurrence
  36. 36. Degree of malignancy and prediction of recurrence and no recurrence
  37. 37. Decision Tree Results
  38. 38. Confusion matrix is created but not explained (degree of malignancy and no recurrence)

                              Predicted no recurrence   Predicted recurrence   Total
      Actual no recurrence    TN = 161                  FP = 40                201
      Actual recurrence       FN = 40                   TP = 45                85
      Total                   201                       85                     286

      • Accuracy = (TP + TN)/Total = 72%
      • Sensitivity (recall) = TP/(TP + FN) = 53%
      • Specificity = TN/(TN + FP) = 80%
      • Precision = TP/(TP + FP) = 53%
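The four measures on slide 38 can be recomputed from the cell counts. The sketch below uses the counts shown on the slide (TP = 45, FP = 40, FN = 40, TN = 161); the function name `confusion_metrics` is hypothetical, and negative predictive value (NPV) is an extra measure that is not on the slide but is mentioned later among the measures healthcare analysts expect.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return common classification measures from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,     # share of all cases predicted correctly
        "sensitivity": tp / (tp + fn),     # recall: share of actual recurrences found
        "specificity": tn / (tn + fp),     # share of actual non-recurrences found
        "precision": tp / (tp + fp),       # PPV: share of predicted recurrences that were real
        "npv": tn / (tn + fn),             # share of predicted non-recurrences that were real
    }

# Cell counts taken from slide 38 (breast cancer recurrence data set, 286 patients).
metrics = confusion_metrics(tp=45, fp=40, fn=40, tn=161)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
# accuracy: 72%
# sensitivity: 53%
# specificity: 80%
# precision: 53%
# npv: 80%
```

Note that accuracy looks respectable mainly because most patients had no recurrence; sensitivity shows that only about half of the actual recurrences were predicted.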
  39. 39. Create your Display • Display can be shared by email, hyperlink, Tweet or downloaded • Display is interactive
  40. 40. IBMWA limitations • Business oriented, so not aligned perfectly with healthcare data analytics. Predictive strength is good, but we are used to sensitivity, specificity, PPV, NPV, ROC curves, etc. • No choice of statistical tests • IBMWA does not perform unsupervised learning • This approach (results first, stats second) may not appeal to purists • Sample dataset I used was of excellent quality, therefore not typical of many datasets
  41. 41. IBMWA Application Process • Process to apply for the academic program is easy. Apply at: https://www.ibm.com/blogs/watson-analytics/calling-all-academics-have-we-got-a-watson-analytics-for-you/ • IBM contact information: Randy Messina at randymessina@us.ibm.com • My contact information: rhoyt@uwf.edu • Questions?
  42. 42. Machine Learning (ML) • Machine learning was developed by computer scientists and is largely based on mathematics, like statistics • While some ML algorithms are difficult to understand (e.g. neural networks), others are easier, such as decision trees and regression • Modeling is like baking: you decide what you want to bake and then select the best recipe (algorithm) to accomplish it. Optimally, you select multiple recipes and compare the results! Example: you want a model to decide what is spam email. You test many algorithms for best results and determine the best combination of predictors
  43. 43. Algorithm Types • Supervised learning • Classification for categorical data (spam v no spam) • Regression for numerical data ($, mortality rate) • Unsupervised learning • Association rules: an example would be market basket analysis • Clustering: when you don’t know the data categories and you are looking for patterns in large data sets. Used extensively with genetic data sets • What ML has in common with statistical approach • Both will perform linear regression, logistic regression and decision trees
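As a small illustration of the association rules mentioned on slide 43 (market basket analysis: if you purchased item A, would you purchase item B), the sketch below computes the three usual rule measures: support, confidence and lift. The baskets are made up and the function name `rule_stats` is hypothetical; real association-rule software searches over all item combinations rather than scoring one rule.

```python
def rule_stats(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule 'antecedent -> consequent'."""
    n = len(baskets)
    n_a = sum(1 for basket in baskets if antecedent in basket)
    n_c = sum(1 for basket in baskets if consequent in basket)
    n_both = sum(1 for basket in baskets if antecedent in basket and consequent in basket)
    if n_a == 0 or n_c == 0:
        raise ValueError("both items must appear in at least one basket")
    support = n_both / n              # how often A and B are bought together
    confidence = n_both / n_a         # of the baskets with A, the share that also have B
    lift = confidence / (n_c / n)     # above 1 means A buyers pick B more often than average
    return support, confidence, lift

# Made-up shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]
support, confidence, lift = rule_stats(baskets, "bread", "butter")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

No outcome column is needed, which is what makes this an unsupervised method.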
  44. 44. Open Source Free ML Programs • WEKA 15 • Pentaho Community 16 • RapidMiner Community 17 • KNIME 18 • Orange 19
  45. 45. Machine Learning with Orange University of Ljubljana,Slovenia
  46. 46. WEKA • Named after a bird in New Zealand and stands for Waikato Environment for Knowledge Analysis • Free software is associated with a free ML course and a low cost textbook • Software works on all operating systems • WEKA is the only ML program mentioned that does not require moving around widgets or operators
  47. 47. Predicting type 2 diabetes (WEKA)
  48. 48. Results using logistic regression
  49. 49. Outcome Measurements • Accuracy is hitting the bull’s-eye every time. • Precision is hitting the same place each time, even if it is not the place you aimed for.
  50. 50. Receiver operating characteristic (ROC) curve (c-statistic = AUC) (axes: TP, FP)
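To make the ROC curve and the c-statistic (AUC) on slide 50 concrete, here is a minimal sketch in plain Python. It is not how WEKA computes the value internally; the labels and scores are made up, and the function names `auc` and `roc_points` are hypothetical. The AUC is computed directly from its interpretation: the chance that a randomly chosen positive case gets a higher score than a randomly chosen negative case.

```python
def auc(labels, scores):
    """C-statistic: probability a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def roc_points(labels, scores):
    """(FP rate, TP rate) pairs obtained by lowering the decision threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Made-up example: 1 = outcome occurred, scores = predicted risk from some model.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print(f"AUC = {auc(labels, scores):.4f}")
for fpr, tpr in roc_points(labels, scores):
    print(f"FP rate {fpr:.2f}  TP rate {tpr:.2f}")
```

A value of 0.5 means the scores rank cases no better than chance, and 1.0 means every positive case is ranked above every negative case.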
  51. 51. Decision Tree for Contact Lenses
  52. 52. Clustering algorithm (3 groups identified)
  53. 53. Predictive Analytics Report Card • Many risk prediction models yield mediocre results at this point (C-statistic 0.56-0.80), but we are early in the game. • Models need to work in real-time ideally • Some risk models are used in healthcare organizations that might not fit your patient demographic, such as safety-net hospitals, etc. • It is helpful to identify patients at risk for morbidity and mortality but you still have to have an intervention team ready to apply additional resources to high risk patients
  54. 54. Data Science Education Stats • Certificate (82); Bachelor (24); Masters (259); Doctorate (14) • 37% of courses are offered online • 101 programs were related to business schools, 40 related to mathematics and statistics departments, 39 related to computer science departments and 9 related to new data science departments. The remainder were from a wide variety of college and university departments.
  55. 55. Data Science Centers • Multiple universities and medical centers have created “data centers” to create the right environment for data analysis and research • They tend to be multi-disciplinary and not just relegated to the computer science department • Every industry seems to have interest in data science and analytics, hence the need to create a central hub
  56. 56. Data Science Resources • My web site www.informaticseducation.org has a resource center with Chapter 23: Data Science resources: • Data sets: health and non-health related • Free data science courses • Free statistics resources • Free visualization software • Free programming tutorials • Other helpful stuff • Chapter 23 is available thru Lulu.com for $2.99
  57. 57. ONC Sponsored Free Courses • Healthcare Data Analytics • Bellevue College (limited to Veterans Administration Staff Only) • Columbia University • Normandale College • Oregon Health & Science University • University of Alabama at Birmingham • University of Texas Health Science Center at Houston
  58. 58. Machine Learning Resources • I would recommend beginning with Jason Brownlee’s eBooks: • Machine Learning Algorithms $27 (163 pages) • Machine Learning Mastery with WEKA $27 (248 pages) • www.machinelearningmastery.com
  59. 59. Machine Learning Data Sets
  60. 60. Data Science Challenges • Not enough data scientists; it is estimated that we will need 140,000 by 2018 20 • Not enough data science training programs • Expensive to build big data and data science centers • Privacy and security concerns • Hype. Adverse unintended consequences (AUCs) • Medical data is heterogeneous and complex, compared to other industries 21 • Correlation does not equal causation • 80% of the time spent with data analysis is spent preparing the data for analysis 22
  61. 61. Data Science Challenges • Difficult to find patient-level data • It has been stated that clinical medicine accounts for only 20% of population health; 80% is due to psycho-social-environmental-behavioral-economic factors that are beyond the control of the healthcare system. Therefore, interventions based on good data can result in no impact 23 • Just because you have technology and voluminous data doesn’t mean it changes patient outcomes. Example: fitness devices affecting behavior 24
  62. 62. Make data science part of patient care Not everyone will be able to afford a robust analytics platform overlaid on a clinical data warehouse and the ability to handle Big Data. But we can start the educational process to learn more about data science
  63. 63. Anticipate the onslaught of data analytical vendors
  64. 64. Conclusions • Data science is a new information science that serves as an umbrella for data creation, manipulation, analysis and research • Data scientists are in high demand and it will take years before we can educate enough scientists to meet the demand • Data science is a team sport; it will require teams with individual skill sets to accomplish robust data science • New tools such as Watson and WEKA likely represent the beginning of analysis automation
  65. 65. Conclusions • I encourage everyone to increase their knowledge in data science areas, such as predictive analytics • There are a myriad of free and affordable courses now available online (mentioned in my blog and on the resource page) • I encourage academic centers and HIT vendors to expand their data science offerings at multiple levels
  66. 66. Questions? Slides available as Data Science 101 on www.slideshare.net

Editor's Notes

  • 1 Data Science Association. www.datascienceassn.org Accessed September 12, 2016
    2 Analytics. Wikipedia. www.wikipedia.org Accessed January 16, 2016
  • 3 Wegwarth O, Schwartz LM, Woloshin S et al. Do physicians understand cancer screening statistics? A national survey of primary care physicians in the United States. Ann Intern Med 2012;156(5):340-9

    4 Manrai AK, Bhatia G, Strymish J et al. Medicine’s uncomfortable relationship with math. Research Letter. June 2014. JAMA Intern Med 2014;174(6):991-993
  • 5 IBM Big Data and Analytics Hub http://www.ibmbigdatahub.com/blog/why-only-one-5-vs-big-data-really-matters
  • 6 ONC definition of learning health system. Connecting Health and Care for the Nation. A Shared Nationwide Interoperability Roadmap. October 2015
    7 National Research Council. Towards Precision Medicine: Building a Network for Biomedical Research and a new Taxonomy of Disease. National Academies Press. 2011
  • 8 Gartner IT Glossary: http://www.gartner.com/it-glossary/predictive-analytics/
  • 9 Desautels T, Calvert J, Hoffman J et al. Prediction of sepsis in the ICU with minimal EHR data: a machine learning approach. JMIR Medical Informatics 2016;4(3):e28

    10 Echouffo-Tcheugui JB, Kengne AP. Risk models to predict chronic kidney disease and its progression: a systematic review. PLOS Medicine. November 20, 2012. Journals.plos.org

    11 Most hospitals face 30 day readmissions penalty in fiscal 2016. August 3, 2015 www.modernhealthcare.com

    12 Amarasingham R, Patel P, Tolo K et al. Allocating scarce resources in real-time to reduce heart failure readmissions: a prospective controlled trial. BMJ Quality Safety. July 31 2013

    13 Stanton M. The high concentration of US Health Care expenditures. Research in Action. Issue 19. 2002. AHRQ Archive. https://archive.ahrq.gov

    14 Chechulin Y, Nazerian A, Rais S. Predicting patients with high risk of becoming high cost healthcare users in Ontario. Healthcare Policy. 2014;9(3):68-79
  • From Doing Data Science by O’Neil and Schutt. O’Reilly Media. 2014
  • 15 WEKA: http://www.cs.waikato.ac.nz/ml/weka/

    16 Pentaho Community www.community.pentaho.com

    17 RapidMiner Community www.community.rapidminer.com

    18 KNIME www.knime.org

    19 Orange data mining www.orange.biolab.si

  • C-statistic (used to compare logistic regression models): The probability that predicting the outcome is better than chance. Used to compare the goodness of fit of logistic regression models, values for this measure range from 0.5 to 1.0. A value of 0.5 indicates that the model is no better than chance at making a prediction of membership in a group and a value of 1.0 indicates that the model perfectly identifies those within a group and those not. Models are typically considered reasonable when the C-statistic is higher than 0.7 and strong when C exceeds 0.8 (Hosmer & Lemeshow, 2000; Hosmer & Lemeshow, 1989). http://mchp-appserv.cpe.umanitoba.ca/viewDefinition.php?definitionID=104234

    Area under the curve: based on prediction rules with true positives plotted against false positives (1-specificity). The closer to 1, the better. 0.5 is essentially worthless. http://gim.unmc.edu/dxtests/roc3.htm
  • DataScience Community. http://datascience.community/colleges
  • 20 Manyika J, Chui M, Brown B et al. Big Data: The Next Frontier for Innovation, Competition and Productivity. http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation

    21 Cios KJ, Moore GW. Uniqueness of medical data mining. Artif Intell Med 2002;26:1-24

    22 Press G. Cleaning big data: most time consuming, least enjoyable data science task, survey says. Forbes. March 23 2016 www.forbes.com
  • 23 Jacobsen RM, Isham GJ, Rutten LJF. Population Health as a means for health care organizations to deliver value. Mayo Clinic Proceedings November 2015;90(11):1465-1470

    24 Jakicic JM, Davis KK, Rogers RJ et al. Effect of wearable technology combined with a lifestyle intervention on long-term weight loss: the IDEA randomized clinical trial. JAMA 2016;316(11):1161-1171
  • Parikh RB, Obermeyer Z, Bates DW. Making predictive analytics a routine part of patient care. Harvard Business Review. April 21, 2016. https://hbr.org
  • CloudMEDX www.cloudmedxhealth.com