Design and implementation of a machine learning Spark application for breast cancer detection
I'm a computer science Master's Degree student and this is one of my university project. See my other projects here on GitHub!
-
The project consists in the design and the implementation of a Spark application that analyzes the dataset and applies various Machine Learning techniques to create a classification model for the detection and prediction of breast cancer.
-
To build the ML models the Pipeline approach of the MLib Spark API has been used.
-
Five different classification models have been used (Logistic Regression, Decision Three, Random Forest, Linear Support Vector, Naïve Bayes)
-
All these models have been evaluated by using different accuracy metrics to choose the best model.
The columns of the used dataset represent all the features computed from a digitalized image of a fine-needle aspiration (FNA) of a breast mass. After all these exams, digital images are computed and all the features of the cell nuclei have been collected into the datased that will be used to predict the diagnosis (M=malignant, B=benign).
First phase of every machine learning project regards dataset analysis, attributes understanding, missing values search and their fixing and the verification of dataset's classes balance.
Heatmaps, pairplots and histograms have been used in this phase.
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.
Various classification methods have been used in this work, but the first three steps of the pipeline are in common for all classification methods:
- StringIndexer: used to transform the string values (M or B) of the “diagnosis” column in numeric binary values (0=B and 1=M)
- VectorAssembler: a feature-transformer that merges multiple columns (30, in this case --> the initial 32 columns minus the first two: the ID column (it's irrelevant) and the diagnosis column (it's the target attribute to evaluate) into a vector column.
- StandardScaler: used to standardize features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
Five different machine learning models have been used in this work:
- Logistic Regression
- Decision Three
- Random Forest
- Linear Support Vector
- Naïve Bayes
Different evaluation metrics have been used in this project to evaluate the accuracy of the models:
- Precision = how many are correctly classified among that class
- Recall = how many of a certain class you find over the whole number of element of this class
- F1-score = the harmonic mean between precision and recall
- Support = number of occurrences of a given class in your dataset (if you have 37.5K instances of class 0 and 37.5K of class 1, it is a really well balanced dataset)
- Accuracy = rate of positive predicted values
- Test Error = 1 - Accuracy
- ROC Curve and AUC value
For any support, error corrections, etc. please email me at [email protected]