Determine Presence of Breast Cancer

This machine learning program detects the presence (or absence) of breast cancer from pertinent data regarding physical characteristics.

Project Set Up and Installation

This project comprises training two models, one using AutoML and the second using HyperDrive. The best model in each case is registered. Of the registered models, the one with the greater accuracy is then deployed as an endpoint service. Finally, this service is invoked to make predictions. The diagram below captures the general flow and the main aspects of building the models.

Dataset

Overview

The dataset is the Breast Cancer Prediction Dataset, hosted on Kaggle. A discussion that explains the data is available at https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset/discussion/66975#509394

Task

The task is to predict the presence of breast cancer given certain physical characteristics. Each record has 5 features (physical characteristics of the cell) and a 'diagnosis' label indicating whether or not the cell is cancerous.

Access

The data is downloaded as a CSV file from https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset . It is then made available at a publicly accessible GitHub location such as https://github.com/dntrply/nd00333-capstone/raw/master/dataset/Breast_cancer_data.csv . The data is read into the AzureML project using the Tabular Dataset Factory function that reads from a file/URL. Once ingested, the Tabular Dataset is used from there on.
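As a minimal sketch of the ingestion step, the snippet below reads the CSV from the GitHub URL given above into a Tabular Dataset. It assumes the Azure ML SDK v1 (`azureml-core`) is installed; the helper name `load_dataset` is hypothetical and is not taken from the project notebooks.

```python
# URL taken from the Access section of this README.
DATA_URL = (
    "https://github.com/dntrply/nd00333-capstone/raw/master/"
    "dataset/Breast_cancer_data.csv"
)


def load_dataset(url=DATA_URL):
    """Return an Azure ML TabularDataset read from the delimited file at `url`.

    `load_dataset` is a hypothetical helper used for illustration only.
    """
    # Imported lazily so this module can be loaded where azureml-core is absent.
    from azureml.data.dataset_factory import TabularDatasetFactory

    return TabularDatasetFactory.from_delimited_files(path=url)
```

With the SDK available, `load_dataset().to_pandas_dataframe().head()` would show the first few records for a quick sanity check.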

Automated ML

The objective is binary classification (cancerous or not), and so the primary_metric chosen is 'AUC_weighted'. Early stopping is enabled to prevent overfitting, and an experiment timeout is set to limit the total run time. The training data and the label column name are also specified.
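The settings described above might be expressed roughly as follows with the Azure ML SDK v1 `AutoMLConfig` class. This is a sketch, not the notebook's actual configuration: the timeout value of 30 minutes and the helper name `build_automl_config` are assumptions, and newer SDK releases prefer `experiment_timeout_hours` over `experiment_timeout_minutes`.

```python
# Settings that mirror the description in this section.
AUTOML_SETTINGS = {
    "task": "classification",
    "primary_metric": "AUC_weighted",
    "label_column_name": "diagnosis",
    "enable_early_stopping": True,
    "experiment_timeout_minutes": 30,  # assumed value; the README gives none
}


def build_automl_config(training_data, compute_target):
    """Build an AutoMLConfig from the settings above (hypothetical helper).

    `training_data` is the Tabular Dataset and `compute_target` is the
    compute cluster; both are created elsewhere.
    """
    # Imported lazily so this module can be loaded without the AutoML SDK.
    from azureml.train.automl import AutoMLConfig

    return AutoMLConfig(
        training_data=training_data,
        compute_target=compute_target,
        **AUTOML_SETTINGS,
    )
```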

Results

From the output, it would appear that the best model is an ensemble of models. The accuracy of the best model is 0.930.

Screenshots of the AutoML RunDetails widget

Screenshots of the best AutoML model with parameters

Hyperparameter Tuning

Given the nature of the data and the desired outcome, a LogisticRegression model is chosen, using the scikit-learn implementation. Two parameters were chosen for optimization: inverse regularization strength (--C) and maximum number of iterations (--max_iter). The sampling method was random parameter sampling, with a set of discrete values provided for each parameter. The Bandit Policy with a slack factor of 0.1 was chosen as the early termination policy. The primary metric was Accuracy, with the goal of maximizing it. The training code, 'train.py', fits the logistic regression model and saves the resulting accuracy and model, which HyperDrive later uses to identify the best model. An environment was specified, the primary consideration being the conda package scikit-learn.
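To make the sampling and early termination choices concrete, here is a plain-Python illustration of what random sampling over discrete values and a Bandit rule with a slack factor of 0.1 do. It is not Azure ML SDK code. The value sets in `SEARCH_SPACE` are hypothetical (the README does not list the actual ones), and the termination rule follows the documented Bandit behaviour for a metric that is being maximized.

```python
import random

# Hypothetical discrete value sets for the two tuned parameters.
SEARCH_SPACE = {
    "--C": [0.01, 0.1, 1, 10, 100],
    "--max_iter": [50, 100, 200, 400],
}


def sample_parameters(space, rng):
    """Random parameter sampling: pick one value per parameter, independently."""
    return {name: rng.choice(values) for name, values in space.items()}


def bandit_should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Bandit rule for a maximized metric.

    A run is cancelled when its metric, even after allowing for the slack,
    still falls short of the best metric seen so far.
    """
    return run_metric * (1 + slack_factor) < best_metric


rng = random.Random(0)  # seeded only to make the illustration repeatable
candidates = [sample_parameters(SEARCH_SPACE, rng) for _ in range(3)]
```

For example, with a best accuracy of 0.938 so far, a run reporting 0.80 would be cancelled (0.80 x 1.1 = 0.88, which is below 0.938), while a run reporting 0.90 would be allowed to continue.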

Results

The accuracy of the best model was 0.938. The corresponding parameter values were --C of 10 and --max_iter of 400.

Screenshots of the HyperParameter RunDetails widget

Screenshots of the best HyperParameter model with parameters

Model Deployment

The HyperDrive model had a slightly better accuracy and was chosen to be deployed. Deployment consists of specifying an inference configuration and a deployment configuration. The inference configuration specifies the environment and the scoring code (with init and run functions). Once the deployment is successful, the scoring URI can be retrieved from the deployment service and used to make an HTTP request with the input data. The input data in this example is a batch array of parameter values. The endpoint takes the batch array, invokes the model prediction, and returns the predicted results.
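A scoring script with init and run functions could look roughly like the sketch below. It is not the project's `hyper_scoring.py`: the model file name `model.joblib`, the `{"data": [...]}` request shape, and the helper `predict_rows` are all assumptions made for illustration. `AZUREML_MODEL_DIR` is the environment variable Azure ML sets to the registered model's directory inside the deployed container.

```python
import json
import os

model = None  # populated by init() when the service starts


def init():
    """Load the registered model once, at service start-up."""
    global model
    import joblib  # imported lazily; available in the scikit-learn environment

    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    # "model.joblib" is an assumed file name.
    model = joblib.load(os.path.join(model_dir, "model.joblib"))


def predict_rows(fitted_model, raw_data):
    """Parse a JSON request body and return predictions as a list of ints."""
    rows = json.loads(raw_data)["data"]  # assumed request shape
    return [int(p) for p in fitted_model.predict(rows)]


def run(raw_data):
    """Entry point called for each request; errors are returned, not raised."""
    try:
        return predict_rows(model, raw_data)
    except Exception as exc:
        return {"error": str(exc)}
```

Keeping the parsing and prediction in `predict_rows` makes the logic easy to exercise locally with a stand-in model before deploying.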

Any resources such as the service and the compute cluster are then deleted.

An example of input data (a batch of four samples) is:

```
[[9.504, 12.44, 60.34, 273.9, 0.1024],
 [15.37, 22.76, 100.2, 728.2, 0.092],
 [21.09, 26.57, 142.7, 1311.0, 0.1141],
 [11.04, 14.93, 70.67, 372.7, 0.07987]]
```

with the results

    [1, 0, 0, 1]

The results indicate that the first and last samples are likely cancerous.
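Invoking the endpoint with such a batch might be sketched as follows, using only the Python standard library. The `{"data": [...]}` payload shape is an assumption that must match whatever the scoring script expects, and the function names and the URI passed in are placeholders; the real scoring URI comes from the deployed service.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows):
    """Build a JSON POST request carrying a batch of feature rows."""
    body = json.dumps({"data": rows}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(scoring_uri, rows):
    """Send the batch to the endpoint and return the decoded JSON response."""
    request = build_scoring_request(scoring_uri, rows)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `score(scoring_uri, batch)` with the four-sample batch shown above would be expected to return one prediction per sample.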

Screenshot showing model endpoint as Healthy

Suggestions for improvements

Experiment with normalizing the data to determine if model accuracy can be improved.
Determine if the input features are highly correlated. If so, remove highly correlated features prior to training.
Provide enhanced instrumentation/logging, especially during inference.
Convert the model to ONNX format for greater interoperability.
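As a starting point for the correlation suggestion, the sketch below computes pairwise Pearson correlations in plain Python and lists feature pairs above a threshold. The column names are assumed from the order of values in the sample input, the 0.95 threshold is arbitrary, and the four sample rows are far too few to draw conclusions; the real check should run on the full dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Three columns from the four sample rows shown under Model Deployment;
# the column names are assumptions.
sample_columns = {
    "mean_radius": [9.504, 15.37, 21.09, 11.04],
    "mean_perimeter": [60.34, 100.2, 142.7, 70.67],
    "mean_smoothness": [0.1024, 0.092, 0.1141, 0.07987],
}
pairs = correlated_pairs(sample_columns)
```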

Screen Recording

A video screencast demonstrating the project may be found here
