Analyzing Breast Cancer Dataset with Azure Machine Learning Studio

This presentation was given on January 23, 2018 by Frank Mendoza of Catalytics, a member of the Chicago Technology for Value-Based Healthcare Meetup (https://www.meetup.com/Chicago-Technology-For-Value-Based-Healthcare-Meetup/).

Published in: Data & Analytics

  1. Analyzing Breast Cancer Dataset with Azure Machine Learning (ML) Studio
     Frank Mendoza, CEO, Catalytics
     Chicago Technology for Value-Based Healthcare Meetup, January 23, 2018
     2018 Catalytics, LLC - Proprietary and Confidential
  2. Breast Cancer Wisconsin (Diagnostic) Dataset Description
     • Total of 569 records in the dataset, donated in 1995
     • 30 distinct numerical attributes (or features) associated with each record
     • No categorical features available within the dataset
     Location: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
  3. Breast Cancer Wisconsin (Diagnostic) Dataset Description, cont.
     • The column identified as “Diagnosis” is the dataset label
       • M = malignant
       • B = benign
     [Figure: Example of Measurements; chart labels 300+ and 200+]
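As a rough illustration of the layout described on the two slides above, the sketch below parses one record into an ID, a numeric label, and 30 features. It assumes a comma-separated row with no header in the order ID, Diagnosis, 30 values. The function name `parse_record` and the sample row are hypothetical, and the row's values are synthetic, not real measurements.

```python
import csv
import io

def parse_record(row):
    """Split one row into (record_id, label, features)."""
    record_id = row[0]
    # Map the Diagnosis label to an integer: malignant -> 1, benign -> 0.
    label = {"M": 1, "B": 0}[row[1]]
    features = [float(value) for value in row[2:]]
    if len(features) != 30:
        raise ValueError(f"expected 30 features, got {len(features)}")
    return record_id, label, features

# Synthetic row for illustration only; these are NOT real measurements.
sample_line = "1001,M," + ",".join(str(0.5 + i) for i in range(30))
row = next(csv.reader(io.StringIO(sample_line)))
record_id, label, features = parse_record(row)
print(record_id, label, len(features))
```

Encoding the label as 0/1 is one common choice for two-class models; the slides do not say how the label was encoded in Azure ML Studio.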
  4. Core Steps to Build Predictive Models Using Machine Learning
     [Diagram: numbered steps, including 5(a) Test API]
  5. Acquire Data & Prepare
     • The dataset did not have any missing values
     • Manipulation (normalization, etc.) was still required to ensure the training process would succeed
     • Data was split into two sets to train and test the model
       • Training set = 311 records (~55%)
       • Testing set 1 = 208 records (~37%)
     • An additional testing set was held back to test the model after the API was created, step 5(a)
       • Testing set 2 = 50 records (~9%)
     • The Training set and Testing set 1 were uploaded to Azure Machine Learning (ML) Studio
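The split and normalization described on this slide could be sketched as follows. The record counts (569, 311, 208, 50) come from the slide; their exact shares work out to about 54.7%, 36.6%, and 8.8%. Everything else is an assumption made for illustration: the random shuffle and its seed, the choice of min-max scaling (the slide only says "normalization, etc."), and the helper names `split_indices`, `fit_min_max`, and `apply_min_max`.

```python
import random

def split_indices(n_records, n_train, n_test1, seed=0):
    """Shuffle record indices and cut them into train / test 1 / test 2."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary choice
    train = indices[:n_train]
    test1 = indices[n_train:n_train + n_test1]
    test2 = indices[n_train + n_test1:]  # held back until the API exists
    return train, test1, test2

def fit_min_max(rows):
    """Learn per-column (min, max) bounds from the training rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_min_max(rows, bounds):
    """Scale each column to [0, 1] using bounds learned on the training set."""
    return [
        [(value - lo) / (hi - lo) if hi > lo else 0.0
         for value, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]

train, test1, test2 = split_indices(569, 311, 208)
shares = [round(100 * len(part) / 569, 1) for part in (train, test1, test2)]
print(len(train), len(test1), len(test2), shares)

# Tiny made-up feature rows, only to show the scaling step.
toy_train = [[10.0, 200.0], [20.0, 400.0], [15.0, 300.0]]
bounds = fit_min_max(toy_train)
print(apply_min_max(toy_train, bounds))
```

Learning the scaling bounds from the training rows alone, then reusing them for the test sets, keeps information from the test data out of the training process.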
  6. Training Predictive Model: Choosing Algorithms
     • Since the label has two classes (benign vs. malignant), it was clear that a classification model would be necessary
     • Multiple models were developed to identify the best algorithm to use
       • Two-Class Logistic Regression
       • Two-Class Support Vector Machine
       • Two-Class Boosted Decision Tree
       • Two-Class Neural Network (WINNER)
  7. Optimizing Neural Network Model
     • Feature selection: identify which attributes matter
     [Figure: attributes grouped as Important and Less Important]
  8. Feature Selection, continued
     • Azure ML contains a module called “Permutation Feature Importance” that tests features to identify their importance
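The general idea behind permutation feature importance can be sketched in a few lines: shuffle one feature column at a time and measure how much the model's accuracy drops. This is a minimal illustration of the technique, not Azure ML's implementation. The toy model, the made-up data, and the parameters `n_repeats` and `seed` are all assumptions.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which model(row) equals the label."""
    correct = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return correct / len(rows)

def permutation_importance(model, rows, labels, n_repeats=5, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical stand-in for a trained classifier: it only looks at feature 0.
def toy_model(row):
    return 1 if row[0] > 0.5 else 0

# Made-up data: feature 0 determines the label, feature 1 is unrelated.
rows = [[float(i % 2), float(i % 5)] for i in range(40)]
labels = [int(row[0]) for row in rows]
scores = permutation_importance(toy_model, rows, labels)
print(scores)
```

A feature the model ignores scores exactly zero here, which is the kind of signal that would justify dropping an attribute.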
  9. Cross Validation
     • Azure ML contains a module called “Cross-Validate Model” that evaluates a model by partitioning the data; it is used to check that the model will perform well on unseen (new) data
     • 10 folds
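To make the partitioning concrete, here is a minimal sketch of 10-fold index generation. Each record lands in the validation fold exactly once and in the training portion of the other nine rounds. The record count of 519 (311 + 208, the sets uploaded to Azure ML Studio) is an assumption, since the slides do not say which records were cross-validated; the helper name `k_fold_indices` and the seed are likewise illustrative.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, validation_idx) pairs for k-fold cross validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 519 = 311 + 208 is an assumption: the slides do not say which records
# were passed to the cross-validation step.
splits = list(k_fold_indices(519, k=10))
print([len(validation) for _, validation in splits])
```

In practice a model would be trained on each `train` list and scored on the matching `validation` list, and the ten scores averaged.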
  10. Neural Network Classification Model Optimized
      • Feature selection allowed us to remove 14 attributes that did not contribute to improving the model
      • Accuracy improved from 0.976 to 0.981
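For a sense of scale, the short sketch below shows what these accuracy figures would mean in record counts. It assumes the accuracy was measured on the 208-record Testing set 1, which the slides do not state. Under that assumption, 0.976 and 0.981 correspond to 203 and 204 correct predictions, a difference of one record.

```python
def accuracy(n_correct, n_total):
    """Accuracy = correctly classified records / all records scored."""
    return n_correct / n_total

N_TEST = 208  # Testing set 1 size from the slides; using it here is an assumption.

before = accuracy(203, N_TEST)  # 5 misclassified records
after = accuracy(204, N_TEST)   # 4 misclassified records
print(round(before, 3), round(after, 3))

# Attributes remaining after removing 14 of the original 30.
remaining_attributes = 30 - 14
print(remaining_attributes)
```

If the figures instead came from cross validation or another set, the record counts would differ, but the arithmetic is the same.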
  11. AZURE ML DEMONSTRATION
  12. AZURE ML API/EXCEL DEMONSTRATION
  13. Frank Mendoza, CEO & Chief Catalyst
      900 E. Pecan St, Suite 300-286, Pflugerville, TX 78660-8048
      Phone: +1 (512) 767-8604
      Fax: +1 (737) 703-5478
      Email: Frank@CatalyticsConsulting.com
      linkedin.com/in/fxmendoza
      Twitter: @DataDrivenMind
  14. Appendix
  15. Attribute Information
      1) ID number
      2) Diagnosis (M = malignant, B = benign)
      3-32) Ten real-valued features are computed for each cell nucleus:
        a) radius (mean of distances from center to points on the perimeter)
        b) texture (standard deviation of gray-scale values)
        c) perimeter
        d) area
        e) smoothness (local variation in radius lengths)
        f) compactness (perimeter^2 / area - 1.0)
        g) concavity (severity of concave portions of the contour)
        h) concave points (number of concave portions of the contour)
        i) symmetry
        j) fractal dimension ("coastline approximation" - 1)
      Location: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.707&rep=rep1&type=pdf
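The compactness formula in item f) can be checked on simple shapes. The sketch below applies it to a circle and a rectangle, both chosen purely for illustration and not taken from the dataset. For context, the UCI description reports the mean, standard error, and worst value of each of the ten features, which is how ten features yield the 30 columns in positions 3-32.

```python
import math

def compactness(perimeter, area):
    """Compactness as defined in the attribute list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Sanity check on a perfect circle of radius r (an illustrative shape, not a
# record from the dataset): perimeter = 2*pi*r, area = pi*r^2.
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)
print(circle, 4 * math.pi - 1.0)

# A 2 x 8 rectangle encloses less area per unit of perimeter, so it scores higher.
rectangle = compactness(2 * (2 + 8), 2 * 8)
print(rectangle)
```

For a circle the value is 4π - 1 regardless of radius, the smallest the formula can give for any closed shape, so higher values indicate a less circular outline.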
