- 1. Data Science 101 Robert Hoyt MD FACP January 12, 2017
- 2. Disclaimer • I have no conflicts of interest to report • The opinions presented are those of the author and do not necessarily reflect those of the University of West Florida
- 3. Learning Objectives Upon completion of the presentation participants should be able to: • Summarize the characteristics of data science • Summarize the skill sets for data scientists • Compare and contrast predictive analytics using statistics vs. machine learning • Enumerate features of IBM Watson Analytics (IBMWA) • Enumerate features of WEKA machine learning • List the challenges facing data science
- 4. Look Familiar?
- 5. AHIMA Supports Data Analytics
- 6. Definitions • Data science is “the scientific study of the creation, validation and transformation of data to create meaning.” 1 Because data science is relatively new, definitions are still evolving. Data science is a good “umbrella” term. • Analytics is “the discovery and communication of meaningful patterns in data.” While some would argue for separating data analytics from data mining and knowledge discovery from data (KDD), we will use the terms interchangeably. 2
- 7. Venn diagram of Data Science
- 8. Critical need for data scientists with: • Domain expertise (example: healthcare) • In-depth statistical knowledge • Computer science expertise • Machine learning expertise • Programming expertise: R, SQL and Python languages • Relational database system (RDBS) knowledge • Comfort level with “Big Data”
- 9. Historical Background • While all industries (including sports) are incorporating analytics and data science, the business world was first. • Businesses benefitted from knowing which customers were likely to unsubscribe (churn) and if you purchased item A, would you purchase item B (market basket analysis). • As far back as the 1960s a small group of statisticians suggested their field should be broadened to handle more volume and variety of data. • In the 1990s computer scientists developed and promoted machine learning software.
- 10. Historical background • There is evidence that many healthcare workers lack training in statistics and machine learning. 3 • There is also evidence that statistics is hard to teach to non-statisticians and hard to retain. 4 • Statisticians recommend knowledge of calculus and linear algebra, which healthcare workers do not routinely study. They also often prefer that statistical formulas be calculated longhand.
- 11. Stats: Logistic Regression • Would you prefer approach A or B? (Figure: panels A and B)
- 12. Historical Background • As a result, many workers are not comfortable with statistics and data analytics • This observation flies in the face of a health data explosion and a shortage of data scientists • The data explosion is fueled by genomic, EHR-related, wearable technology and social media data. • About 75% of healthcare data is unstructured (free text), so it is difficult to analyze • Enter the Big Data era to further confuse matters
- 13. Big Data Definition • #1 Too much data to analyze on one computer • #2 The Five V’s • Volume: massive amounts of data are being generated • Velocity: data is being generated so rapidly that it needs to be analyzed without placing it in a database • Variety: roughly 80% of data in existence is unstructured so it won’t fit into a database or spreadsheet. • Veracity: current data can be “messy” with missing data and other challenges. • Value: data scientists now have the capability to turn large volumes of unstructured data into something meaningful. 5
- 14. Data Science is part of the federal vision of a healthcare system • Learning health system: “an ecosystem where all stakeholders can securely, effectively and efficiently contribute, share and analyze data” 6 (the PDCA cycle) • Precision medicine: “identifying which approaches will be effective for which patients based on genetic, environmental, and lifestyle factors.” Clearly this initiative requires a big data approach to integrate these data.7 • Population health requires data analytics • Value based care requires data
- 15. Types of analytics (Gartner) Predictive analytics is characterized by four attributes: 1. An emphasis on prediction 2. Rapid analysis measured in hours or days 3. An emphasis on the business relevance of the resulting insights (no ivory tower analyses) 4. An emphasis on ease of use, thus making the tools accessible to business users. 8
- 16. Predictive Analytics • It could be argued that predictive analytics is the most important aspect of data science, where an outcome of importance is predicted based on multiple factors influencing the outcome. This is the area I will focus on • Use cases will be discussed on the next slide • I will not cover: • Text mining with natural language processing (NLP), which is very important for mining unstructured data • Data visualization software, such as Tableau and QlikSense, used for descriptive analytics • Deep learning based on artificial intelligence (AI)
- 17. Predictive Analytics Use Cases • Predict poor patient outcomes (morbidity) • Sepsis prediction 9 • Impending renal failure 10 • Predict death (mortality) • Predict readmissions: in fiscal 2016 only 24% (799/3400) of reporting hospitals will not receive a penalty (0.1%-3% range) for too many readmissions. 11-12 • Predict high cost patients for population health care management: 5% of Medicare/Medicaid patients use 50% of resources. 13-14
- 18. Predictive Modeling Approaches 1. Modeling with statistics 2. Modeling with Machine Learning 3. Modeling using the R or Python programming languages (not covered)
- 19. Predictive analytics: design the model (diagram labels: Statistical Modeling, Statistics, Machine learning, Association, Regression, Classification, Clustering, Programming Languages)
- 20. Data Science/Analytics Process
- 21. Predictive Analytics • The most common approach is to use classification, where you predict an outcome (dependent variable) that is categorical (e.g. lived, died) from multiple predictors (independent variables). For example, you have a data set of pregnant women with Zika virus. Some have children with microcephaly and others don’t. You run a classification model to see if factors such as age, trimester of infection, fever, symptoms, etc. predict microcephaly • If the outcome is numerical then you would use linear regression
- 22. Need for better data analytical tools • We would benefit from more user friendly tools and some degree of automation • MS Excel with the Analysis ToolPak add-in is a possibility but implies you know which stats tests to use • There are also multiple statistical packages, such as SPSS and SAS, but they too have a steep learning curve
- 23. Need for better data analytical tools • Tool #1: IBM Watson Analytics: automatic predictive, descriptive and visualization analytics • Tool #2: WEKA: open source machine learning platform
- 24. IBM Watson Analytics • New program offered in 2015 that is not related to Watson Health (cognitive computing). Business oriented • Program is based on SPSS-based statistical tests. Covers regression, classification, decision trees, chi-square, t-tests, etc. • Program can automatically convert nominal data to numerical and vice versa • Versions • Free • Professional (Academic)
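For the numerical-outcome case mentioned on this slide, a minimal sketch of simple linear regression by ordinary least squares might look like the following. The data values and the function name `fit_line` are made up for illustration and do not come from the presentation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic toy data (an assumption, not real patient data)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(f"predicted outcome = {slope:.2f} * x + {intercept:.2f}")
```

With a categorical outcome the same predictors would instead feed a classifier such as logistic regression or a decision tree.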
- 25. IBM Watson Analytics Academic program • Free for universities to use for teaching (non-commercial) purposes. Includes 100 students/professor/year • University of West Florida has used the program for about 12 months in a Health Informatics graduate course and a Data Mining (computer science) course • IBM did an on-site visit for training • Multiple videos on YouTube • PDF user guide available
- 26. IBM Watson analytics features • IBMWA is completely online • Accepts Excel and CSV input, as well as feeds from most relational database systems (RDBSs) • 100 GB storage • Limits: 500 columns and 10 million rows • Twitter Feed analysis
- 27. Our 2016 Review Article https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5080525/
- 28. IBMWA versions • About the time we submitted our detailed analysis of Watson Analytics, IBM had created Watson-2 that combined “Explore” and “Predict” into “Discover.” Watson-1 will be retired shortly. • Watson-2 includes statistical details about prediction when the target/outcome is numerical. They are working on adding the statistical details for categorical targets/outcomes. • Watson-2 includes a “data quality” score but doesn’t point out missing or skewed data and outliers.
- 29. Breast-Cancer-spreadsheet • 286 patients • One outcome and 9 Attributes or predictors
- 30. Step 1: upload data
- 31. Step 2 Create a new Discovery
- 32. “Ask a question” function
- 33. Select a Visualization
- 34. Predictive Analytics: What Drives Breast Cancer Recurrence + predictive model
- 35. Predict breast cancer recurrence
- 36. Degree of malignancy and prediction of recurrence and no recurrence
- 37. Decision Tree Results
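- 38. Confusion matrix is created but not explained (degree of malignancy and no recurrence) • Actual no recurrence: predicted no recurrence TN = 161, predicted recurrence FP = 40 (row total 201) • Actual recurrence: predicted no recurrence FN = 40, predicted recurrence TP = 45 (row total 85) • Column totals: 201 predicted no recurrence, 85 predicted recurrence; 286 overall • Accuracy = (TP + TN)/Total = 72% • Sensitivity (recall) = TP/(FN + TP) = 53% • Specificity = TN/(TN + FP) = 80% • Precision = TP/(TP + FP) = 53%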
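The four measures on this slide can be checked with a few lines of Python. This is only a sketch: the function name `confusion_metrics` is hypothetical, and the counts (TP = 45, TN = 161, FP = 40, FN = 40) are the ones given in the slide's confusion matrix.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity and precision as fractions."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),     # also called positive predictive value
    }


# Counts taken from the slide's confusion matrix
metrics = confusion_metrics(tp=45, tn=161, fp=40, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

Sensitivity and precision coincide here only because FP and FN happen to be equal in this data set.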
- 39. Create your Display • Display can be shared by email, hyperlink, Tweet or downloaded • Display is interactive
- 40. IBMWA limitations • Business oriented, so not aligned perfectly with healthcare data analytics. Predictive strength is good, but we are used to sensitivity, specificity, PPV, NPV, ROC curves, etc. • No choice of statistical tests • IBMWA does not perform unsupervised learning • This approach (results first, stats second) may not appeal to purists • Sample dataset I used was of excellent quality, therefore not typical of many datasets
- 41. IBMWA Application Process (questions) • Process to apply for the academic program is easy. Apply at: https://www.ibm.com/blogs/watson-analytics/calling-all-academics-have-we-got-a-watson-analytics-for-you/ • IBM Contact information: Randy Messina at randymessina@us.ibm.com • My contact information: rhoyt@uwf.edu
- 42. Machine Learning (ML) • Machine learning was developed by computer scientists and is largely based on mathematics, like statistics • While some ML algorithms are difficult to understand (e.g. neural networks), others are easier, such as decision trees and regression • Modeling is like baking: you decide what you want to bake and then select the best recipe (algorithm) to accomplish it. Optimally, you select multiple recipes and compare the results! Example: you want a model to decide what is spam email. You test many algorithms for best results and determine the best combination of predictors
- 43. Algorithm Types • Supervised learning • Classification for categorical data (spam vs. no spam) • Regression for numerical data ($, mortality rate) • Unsupervised learning • Association rules: an example would be market basket analysis • Clustering: when you don’t know the data categories and you are looking for patterns in large data sets. Used extensively with genetic data sets • What ML has in common with the statistical approach • Both will perform linear regression, logistic regression and decision trees
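As a rough illustration of the association-rule idea behind market basket analysis, the sketch below computes the two usual rule measures, support and confidence, for the rule "bread implies milk". The baskets are invented toy data, and the helper names `support` and `confidence` are assumptions for this example rather than part of any tool named in the slides.

```python
# Invented toy transactions; each basket is the set of items in one purchase
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

No outcome column is supplied, which is what makes this an unsupervised technique: the rules are discovered from co-occurrence alone.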
- 44. Open Source Free ML Programs • WEKA 15 • Pentaho Community 16 • RapidMiner Community 17 • KNIME 18 • Orange 19
- 45. Machine Learning with Orange University of Ljubljana,Slovenia
- 46. WEKA • Named after a bird in New Zealand; the name stands for Waikato Environment for Knowledge Analysis • Free software is associated with a free ML course and a low cost textbook • Software works on all operating systems • WEKA is the only ML program mentioned that does not require moving around widgets or operators
- 47. Predicting type 2 diabetes (WEKA)
- 48. Results using logistic regression
- 49. Outcome Measurements Accuracy is hitting the bull’s-eye every time. Precision is hitting the same place each time, even if it is not the place you aimed for.
- 50. Receiver operating characteristic (ROC) curve (c-statistic = AUC) • Axes: true-positive (TP) rate vs. false-positive (FP) rate
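One way to see what the c-statistic measures: it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by comparing every positive/negative pair. The labels and scores are made-up toy values, and the function name `c_statistic` is hypothetical.

```python
def c_statistic(labels, scores):
    """AUC computed as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = event occurred, 0 = no event; scores are predicted risks
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc = c_statistic(labels, scores)
print(f"c-statistic = {auc:.3f}")
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect ranking. The pairwise loop is quadratic in the number of cases, so it suits small teaching examples rather than large data sets.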
- 51. Decision Tree for Contact Lenses
- 52. Clustering algorithm (3 groups identified)
- 53. Predictive Analytics Report Card • Many risk prediction models yield mediocre results at this point (C-statistic 0.56-0.80), but we are early in the game. • Ideally, models need to work in real time • Some risk models used in healthcare organizations might not fit your patient demographic, such as that of safety-net hospitals • It is helpful to identify patients at risk for morbidity and mortality, but you still need an intervention team ready to apply additional resources to high risk patients
- 54. Data Science Education Stats • Certificate (82); Bachelor (24); Masters (259); Doctorate (14) • 37% of courses are offered online • 101 programs were related to business schools, 40 related to mathematics and statistics departments, 39 related to computer science departments and 9 related to new data science departments. The remainder were from a wide variety of college and university departments.
- 55. Data Science Centers • Multiple universities and medical centers have created “data centers” to create the right environment for data analysis and research • They tend to be multi-disciplinary and not just relegated to the computer science department • Every industry seems to have interest in data science and analytics, hence the need to create a central hub
- 56. Data Science Resources • My web site www.informaticseducation.org has a resource center with Chapter 23: Data Science resources: • Data sets: health and non-health related • Free data science courses • Free statistics resources • Free visualization software • Free programming tutorials • Other helpful stuff • Chapter 23 is available thru Lulu.com for $2.99
- 57. ONC Sponsored Free Courses • Healthcare Data Analytics • Bellevue College (limited to Veterans Administration Staff Only) • Columbia University • Normandale College • Oregon Health & Science University • University of Alabama at Birmingham • University of Texas Health Science Center at Houston
- 58. Machine Learning Resources • I would recommend beginning with Jason Brownlee’s eBooks: • Machine Learning Algorithms $27 (163 pages) • Machine Learning Mastery with WEKA $27 (248 pages) • www.machinelearningmastery.com
- 59. Machine Learning Data Sets
- 60. Data Science Challenges • Not enough data scientists; it is estimated that we will need 140,000 by 2018 20 • Not enough data science training programs • Expensive to build big data and data science centers • Privacy and security concerns • Hype. Adverse unintended consequences (AUCs) • Medical data is heterogeneous and complex, compared to other industries 21 • Correlation does not equal causation • 80% of the time spent with data analysis is spent preparing the data for analysis 22
- 61. Data Science Challenges • Difficult to find patient-level data • It has been stated that clinical medicine accounts for only 20% of population health; 80% is due to psycho-social-environmental-behavioral-economic factors that are beyond the control of the healthcare system. Therefore, interventions based on good data can result in no impact 23 • Just because you have technology and voluminous data doesn’t mean it changes patient outcomes. Example: fitness devices affecting behavior 24
- 62. Make data science part of patient care Not everyone will be able to afford a robust analytics platform overlaid on a clinical data warehouse and the ability to handle Big Data. But we can start the educational process to learn more about data science
- 63. Anticipate the onslaught of data analytical vendors
- 64. Conclusions • Data science is a new information science that serves as an umbrella for data creation, manipulation, analysis and research • Data scientists are in high demand and it will take years before we can educate enough scientists to meet the demand • Data science is a team sport; it will require teams with individual skill sets to accomplish robust data science • New tools such as Watson and WEKA likely represent the beginning of analysis automation
- 65. Conclusions • I encourage everyone to increase their knowledge in data science areas, such as predictive analytics • There are a myriad of free and affordable courses now available online (mentioned in my blog and on the resource page) • I encourage academic centers and HIT vendors to expand their data science offerings at multiple levels
- 66. Questions? Slides available as Data Science 101 on www.slideshare.net