This presentation was given by Frank Mendoza of Catalytics, a member of the Chicago Technology for Value-Based Healthcare Meetup (https://www.meetup.com/Chicago-Technology-For-Value-Based-Healthcare-Meetup/), on January 23, 2018.
Analyzing Breast Cancer Dataset with Azure Machine Learning Studio
1. 2018 Catalytics, LLC - Proprietary and Confidential
Analyzing Breast Cancer Dataset with Azure Machine Learning (ML) Studio
Frank Mendoza
CEO, Catalytics
Chicago Technology for Value-Based Healthcare Meetup
January 23, 2018
Breast Cancer Wisconsin (Diagnostic) Dataset Description
• Total of 569 records in dataset, donated in 1995
• 30 distinct numerical attributes (or features) associated with each record
• No categorical features available within the dataset
Location: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Breast Cancer Wisconsin (Diagnostic) Dataset Description, cont.
• Column identified as “Diagnosis” is the dataset label
  • M = malignant (200+ records)
  • B = benign (300+ records)
Figure: Example of Measurements
Core Steps to build Predictive Models using Machine Learning
Step 5(a): Test API
Acquire Data & Prepare
• Dataset did not have any missing values
• Manipulation (normalization, etc.) was still required to ensure the training process would be successful
• Data was split into sets to train and test the model
  • Training set = 311 records (~54.7%)
  • Testing set 1 = 208 records (~36.6%)
• An additional testing set was held back to test the model after the API was created, step 5(a)
  • Testing set 2 = 50 records (~8.8%)
• Training set and Testing set 1 were uploaded to Azure Machine Learning (ML) Studio
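The split and normalization described above can be sketched in plain Python. This is only an illustration, not the Azure ML Studio workflow used in the talk. The slides do not say which normalization was applied, so z-score scaling is an assumption, and `split_records`, `fit_zscore`, and `apply_zscore` are hypothetical helper names. The feature rows are random placeholders, not real measurements.

```python
import random
import statistics

def split_records(records, sizes, seed=0):
    """Shuffle a copy of `records` and cut it into consecutive parts of the given sizes."""
    if sum(sizes) != len(records):
        raise ValueError("split sizes must add up to the number of records")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for size in sizes:
        parts.append(shuffled[start:start + size])
        start += size
    return parts

def fit_zscore(rows):
    """Return per-column (mean, standard deviation) computed from `rows`."""
    columns = list(zip(*rows))
    # A constant column has deviation 0; fall back to 1.0 to avoid dividing by zero.
    return [(statistics.fmean(col), statistics.pstdev(col) or 1.0) for col in columns]

def apply_zscore(rows, params):
    """Scale every column using statistics that were fitted on the training set only."""
    return [[(value - mean) / std for value, (mean, std) in zip(row, params)] for row in rows]

# Placeholder data: 569 records with 2 made-up features each (assumption, not real values).
rng = random.Random(42)
records = [[rng.uniform(5, 30), rng.uniform(100, 2500)] for _ in range(569)]

train, test_1, test_2 = split_records(records, sizes=(311, 208, 50))
params = fit_zscore(train)
train_scaled = apply_zscore(train, params)
test_1_scaled = apply_zscore(test_1, params)

for name, part in (("train", train), ("test_1", test_1), ("test_2", test_2)):
    print(f"{name}: {len(part)} records ({len(part) / len(records):.1%})")
```

Fitting the scaling parameters on the training set and reusing them for the test sets keeps information from the held-out records out of training.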
Training Predictive Model
Choosing algorithms
• Since the label has two classes (Benign vs. Malignant), a classification model was necessary
• Multiple models were developed to identify the best algorithm to use
  • Two-Class Logistic Regression
  • Two-Class Support Vector Machine
  • Two-Class Boosted Decision Tree
  • Two-Class Neural Network (WINNER)
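Choosing a winner among several trained models comes down to scoring each one on the same held-out records. The sketch below shows that selection step only. The slides do not state which metric decided the winner, so accuracy is an assumption, and the three "models" are hypothetical stand-in rules on toy data, not the Azure ML algorithms listed above.

```python
def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)

# Toy held-out set: one made-up feature per record, label "M" or "B".
test_rows = [[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]]
test_labels = ["B", "B", "B", "M", "M", "M"]

# Hypothetical stand-ins for trained models; each maps a feature row to a label.
candidates = {
    "always benign": lambda row: "B",
    "threshold at 2.5": lambda row: "M" if row[0] > 2.5 else "B",
    "threshold at 4.5": lambda row: "M" if row[0] > 4.5 else "B",
}

# Score every candidate on the same records, then keep the best one.
scores = {name: accuracy(model, test_rows, test_labels) for name, model in candidates.items()}
winner = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("winner:", winner)
```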
Optimizing Neural Network Model
• Feature Selection: identify which attributes matter
Figure: attributes marked as Important or Less Important
Feature Selection, continued
• Azure ML contains a module called “Permutation Feature Importance” that tests features to identify their importance, scoring each one by how much model performance changes when its values are randomly shuffled
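The idea behind permutation feature importance can be sketched in a few lines: shuffle one feature column at a time and measure how far the score falls. This is the general technique, not the Azure ML module's implementation. The `permutation_importance` helper, the toy data, and the stand-in `predict` rule are all assumptions made for illustration.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled_col = [row[col] for row in rows]
            rng.shuffle(shuffled_col)
            # Rebuild each row with only this one column replaced.
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled_col)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (made up): feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]
labels = ["M" if row[0] > 5 else "B" for row in rows]

def predict(row):
    # Hypothetical stand-in for a trained model; it reads feature 0 only.
    return "M" if row[0] > 5 else "B"

importances = permutation_importance(predict, rows, labels)
print(importances)
```

A feature the model ignores scores zero, which is the kind of attribute that feature selection can drop.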
Cross Validation
• Azure ML contains a module called “Cross Validate Model” that evaluates a model by partitioning the data into folds (10 folds here); it is used to check that the model will perform well against unseen/new data
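A 10-fold cross validation can be sketched as follows: each fold is held out once for scoring while the model is trained on the other nine. This illustrates the general procedure rather than the Azure ML module. The slides do not say which records were cross-validated, so using the 311 training records is an assumption, and `fit_majority` is a hypothetical stand-in model on made-up labels.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_indices, validation_indices) for k folds over shuffled record indices."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    start = 0
    for fold in range(k):
        # Spread the remainder over the first folds so sizes differ by at most one.
        size = n_records // k + (1 if fold < n_records % k else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(fit, score, rows, labels, k=10, seed=0):
    """Train on k-1 folds and score on the held-out fold, once per fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(rows), k, seed):
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in val_idx], [labels[i] for i in val_idx]))
    return scores

def fit_majority(rows, labels):
    # Hypothetical stand-in model: remembers the most common training label.
    return max(set(labels), key=labels.count)

def score_accuracy(model, rows, labels):
    # The "model" is a single label, so accuracy is the share of matching labels.
    return sum(1 for label in labels if label == model) / len(labels)

# Made-up data with the size of the training set from the slides (311 records).
rows = [[float(i)] for i in range(311)]
labels = ["B"] * 200 + ["M"] * 111

fold_sizes = [len(validation) for _, validation in k_fold_indices(311, k=10)]
scores = cross_validate(fit_majority, score_accuracy, rows, labels, k=10)
print(fold_sizes)
print(f"mean accuracy over folds: {sum(scores) / len(scores):.3f}")
```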
Neural Network Classification Model, Optimized
• Feature selection allowed us to remove 14 attributes that did not contribute to improving the model, leaving 16 of the original 30
• Accuracy improved from 0.976 to 0.981
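To put the accuracy figures in terms of records, the arithmetic below converts each accuracy into a count of correct predictions. It assumes the accuracy was measured on the 208 records of Testing set 1, which the slides do not state. If a different evaluation set was used, such as the cross-validation folds, the counts would differ. `correct_count` is a hypothetical helper.

```python
def correct_count(accuracy, n_records):
    """Number of correct predictions implied by an accuracy on n_records."""
    return round(accuracy * n_records)

n_test = 208  # size of Testing set 1; using it as the evaluation set is an assumption

before = correct_count(0.976, n_test)
after = correct_count(0.981, n_test)

print(f"before feature selection: {before} of {n_test} correct")
print(f"after feature selection:  {after} of {n_test} correct")
print(f"round trip: {before / n_test:.3f} -> {after / n_test:.3f}")
```

Under that assumption, the improvement corresponds to one additional correctly classified record (203 versus 204 out of 208).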
Attribute Information
1) ID number
2) Diagnosis (M = malignant, B = benign)
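3-32) Ten real-valued features are computed for each cell nucleus; the mean, standard error, and "worst" (largest) value of each are recorded per image, giving 30 features: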
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
Location: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.707&rep=rep1&type=pdf
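The 32 fields listed above can be given column names with a short sketch. The raw data file has no header row, so the names built here are hypothetical. The field order follows the dataset documentation: fields 3-12 are means, 13-22 standard errors, and 23-32 worst values. The sample record is a made-up placeholder, not a real row.

```python
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension",
]
STATISTICS = ["mean", "se", "worst"]  # se = standard error

# Ten means first, then ten standard errors, then ten worst values.
COLUMNS = ["id", "diagnosis"] + [
    f"{feature}_{stat}" for stat in STATISTICS for feature in BASE_FEATURES
]

def parse_record(line):
    """Turn one comma-separated record into a dict keyed by COLUMNS."""
    fields = line.strip().split(",")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    record = {"id": fields[0], "diagnosis": fields[1]}
    for name, value in zip(COLUMNS[2:], fields[2:]):
        record[name] = float(value)
    return record

# Placeholder record with made-up values, only to show the shape of a row.
sample = "0,M," + ",".join(["1.0"] * 30)
record = parse_record(sample)

print(len(COLUMNS), COLUMNS[2], COLUMNS[12], COLUMNS[22])
```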