In this repository, we present artifacts for an intelligent framework that provides automated data transformations and optimal model deployment, to accelerate accurate and timely data and model quality checks and to improve the productivity of data and ML teams across the organization.
In order to demonstrate the orchestrated workflow, we use the example of the Patient Readmission Dataset. This data comprises a historical representation of patient and hospital outcomes, and the goal is to build a machine learning (ML) model to predict hospital readmission. The model has to predict whether a high-risk diabetic patient is likely to be readmitted to the hospital within thirty days of a previous encounter, after thirty days, or not at all. Since this use case deals with multiple outcomes, the ML problem is called multi-class classification.
- An AWS account
- An Amazon SageMaker Studio domain with the managed policy attached to the IAM execution role, as shown in the blog
- An Amazon S3 bucket
For a full walkthrough of automating exploratory data analysis and model operationalization with Amazon SageMaker, see this blog post.
Our solution demonstrates an automated, end-to-end approach to performing exploratory data analysis (EDA) with a human in the loop to determine the model quality thresholds and approve the optimal, qualified data to be pushed into a SageMaker pipeline, which loads the final data into Feature Store, thereby speeding up the overall workflow.
Further, the approach shows how to deploy the best candidate model and create a model endpoint from the transformed dataset, which is processed automatically as new data arrives in the framework.
Below is the initial setup for the data preprocessing step, prior to automating the workflow:
This step comprises a data flow that processes the raw data stored in an S3 bucket. A sequence of steps is created in the Data Wrangler UI to perform feature engineering on the data. Then, a SageMaker Processing job is executed to save the flow to S3 and store the transformed features in SageMaker Feature Store for reuse.
Once the flow has been created, which includes the recipe of instructions to be executed on the data pertaining to the use case, the goal is to automate the process of creating the flow on any new incoming data, extract model quality insights, pass the information to an authorized user to inspect the data quality, and wait for approval before executing the model building and deployment steps automatically.
The architecture below showcases the end-to-end automation of data transformation, followed by human-in-the-loop approval to facilitate the steps of model training and deployment.
Prior to automating with a Step Functions workflow, we need to perform a sequence of data transformations to create a flow.
To start using Data Wrangler, complete the following steps:
- In a SageMaker Studio domain, on the Launcher tab, choose "New data flow".
- Import the Patient Readmission Dataset `flowHealthcareDiabeticReadmission.csv` from its location in Amazon S3.
- Choose Import dataset.
Now we can start with data transformation in the Data Wrangler UI. We are going to perform a sequence of eight steps to process, clean, and transform the data, along with some analysis dashboards to run initial data and model quality checks.
- Since we intend to review the data quality and model checks first, start by clicking the "Analysis" tab next to the "Data" tab and click `Create new analysis`.
- Under the Create analysis section, choose `Data Quality and Insights Report` as the analysis type, `readmitted` as the target column, and `Classification` as the problem type, as shown below. Then click Create.
A summary report is created providing dataset statistics, information about the target column, a quick model summary, a confusion matrix, and feature details. This gives insight into the data and model quality based on the currently imported data.
Similarly, other analysis reports can be created in Data Wrangler to easily surface additional insights from the data.
- Choose `Bias Report` as the analysis type. You can use this bias report in Data Wrangler to uncover potential biases in your data. Select the target column to be predicted, i.e., `readmitted` for our use case. The Bias Report analysis uses Amazon SageMaker Clarify to perform bias analysis.
- You can choose the predicted column value as `no`, indicating the patient will not be readmitted.
- Select gender as `male` or `female` as shown below, then click `Check for bias` and `Save` to save the bias report.
- After inspecting the bias, you can also create a feature correlation chart to determine the correlation between features, which helps identify features that can be dropped or used for predicting the target column.
- Other analysis charts can also be created. For brevity, we have created the analysis charts shown below.
- Now that we have inspected the data, we can move on to performing some data transformations in the Data Wrangler UI. We will execute the following steps.
- Go back to the Data tab and start with a `Drop column` step. Drop `max_glu_cerum` and `a1c_result`, since they have low impact on the target column as seen in the analysis reports. Click `Preview` and then `Update`. (A rough pandas equivalent is sketched below.)
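For reference, a minimal pandas sketch of this first step outside the Data Wrangler UI might look like the following; it assumes the raw CSV has been downloaded locally under the same file name:

```python
import pandas as pd

# Load the raw dataset (the same file imported into Data Wrangler).
df = pd.read_csv('flowHealthcareDiabeticReadmission.csv')

# Drop the two low-impact columns identified in the analysis reports.
df = df.drop(columns=['max_glu_cerum', 'a1c_result'])
```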
- Next, choose `Impute` as the transform step, select `Numeric` as the column type, and select `time_in_hospital`, `num_procedures`, `num_medications`, `number_diagnosis`, `change`, and `diabetes_med` with `Approximate Median` as the imputing strategy. This replaces the missing values in each selected column with that column's median (see the sketch after this step). Click `Preview` and then `Update` to reflect the changes in the dataset.
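If you want to prototype this imputation outside Data Wrangler, a rough pandas equivalent, assuming the `df` DataFrame from the previous sketch and numeric values in these columns, looks like this:

```python
# Approximate-median imputation sketch: fill missing values in the
# selected columns with each column's median.
numeric_cols = ['time_in_hospital', 'num_procedures', 'num_medications',
                'number_diagnosis', 'change', 'diabetes_med']
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
```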
- Based on the analysis, we can also remove additional unwanted columns that can introduce bias into the data. We will drop `gender`, `num_procedures`, and `number_outpatient` in the next step as shown below. You can execute the same step in the primary drop-columns section as well, but after reviewing the detailed insights from the Analysis tab, you may decide to drop additional columns that won't help in predicting the target column.
- Since our source dataset has features with special characters, we need to clean them before training. Let's use the `Search and edit` transform.
- To further clean the data, pick `Search and edit` from the list of transforms in the right panel and select `Find and replace substring`.
- Select the target column `race` as the input column and use the `\?` regex as the pattern. For the `Replacement string`, use `Other`. Let's leave `Output column` blank for in-place replacement. (A pandas version of this step is sketched after this list.)
- Once reviewed, click `Add` to add the transform to your data flow.
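The same cleanup can be sketched in pandas; this assumes missing `race` values are encoded as the literal `?` character, as in the source data:

```python
# Replace the '?' placeholder in the race column with 'Other',
# in place (no separate output column), mirroring the UI step.
df['race'] = df['race'].str.replace(r'\?', 'Other', regex=True)
```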
- Since we will be storing the transformed features in `SageMaker Feature Store` for reuse, we need to add an event time and a unique record ID as additional columns to timestamp the features. These two fields are required by SageMaker Feature Store to create feature groups and store the transformed features. We use Data Wrangler's custom transform option with Pandas and add the following code as shown below:
Code snippet:

```python
# Table is available as variable `df`
import time
from uuid import uuid4

# Generate a unique record ID for every row; SageMaker Feature Store
# requires a record identifier feature for each feature group.
df['Record_id'] = [str(uuid4()) for _ in range(len(df))]

# Stamp all rows with the current Unix-epoch time as the required
# event time feature.
df['EventTime'] = time.time()

# Drop any remaining rows with missing values.
df = df.dropna()
```
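Note that `time.time()` stamps every row with the same fractional Unix-epoch timestamp; Feature Store accepts this fractional format for the event time feature, and a per-row timestamp would only be needed if records arrived at different times.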
- Next, since the `readmitted` column comprises `<30`, `>30`, and `no` as its values, with imbalance between the `<30` and `>30` values as seen in the bias report, we can convert this into a binary classification problem by merging `<30` and `>30` into a numerical output of 1, as shown. Select `Find and replace substring` as the transform step, with `readmitted` as the input column, `<30|>30` as the regex pattern, and `1` as the replacement string.
- Similarly, choose `Find and replace substring` with the same input column, `no` as the regex pattern, and `0` as the replacement string. Click Preview and then Update to reflect the changes in the transformed dataset. (See the pandas sketch below.)
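Together, these two find-and-replace steps collapse the three readmission outcomes into a binary label. A minimal pandas sketch of the same conversion, assuming the `df` DataFrame from the earlier sketches, is:

```python
# Merge '<30' and '>30' into the positive class 1 and map 'no' to 0,
# turning the multi-class target into a binary label.
df['readmitted'] = df['readmitted'].str.replace(r'<30|>30', '1', regex=True)
df['readmitted'] = df['readmitted'].str.replace('no', '0', regex=False)
```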
- Upon further inspection, since the `race` column might not contribute to predicting the target column, we can drop `race` as well, as shown. You should be able to see the sequence of steps performed, as depicted in the picture below.
Note that you can perform many alternative transformations for your specific dataset and use case when performing the initial data transformation steps.
- Now, after executing the final step, navigate to the `Analysis` tab and create a `Data Quality and Insights Report` and a `Quick Model` report to inspect the transformed data statistics.
- As per your inspection of the `Data Quality and Insights` report, you should see 100% valid data with no missing values and 0% duplicate rows in the transformed dataset; you can also verify this programmatically, as sketched below.
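As a quick hedged check on the transformed DataFrame, assuming it is available as `df`:

```python
# Sanity-check the transformed data: expect no missing values and
# no duplicate rows, matching the Data Quality and Insights Report.
assert df.isnull().sum().sum() == 0, 'found missing values'
assert df.duplicated().sum() == 0, 'found duplicate rows'
```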
Now, after defining all of the transformations to perform on the dataset, you can export the resulting ML features to Feature Store as shown in the blog; a rough sketch with the SageMaker Python SDK follows.
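As a hedged illustration of that export (the blog shows the full walkthrough), creating a feature group and ingesting the transformed DataFrame `df` with the SageMaker Python SDK looks roughly like this; the feature group name, S3 URI, and role ARN are placeholders:

```python
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Hypothetical feature group name; replace with your own.
feature_group = FeatureGroup(name='patient-readmission-features',
                             sagemaker_session=session)

# Infer feature definitions from the transformed DataFrame `df`.
# Object-dtype columns may need casting to string dtype first.
feature_group.load_feature_definitions(data_frame=df)

# Create the feature group keyed on the Record_id and EventTime columns
# added in the custom transform step; S3 URI and role ARN are placeholders.
feature_group.create(
    s3_uri='s3://YOUR_BUCKET/feature-store',
    record_identifier_name='Record_id',
    event_time_feature_name='EventTime',
    role_arn='arn:aws:iam::111122223333:role/YOUR_SAGEMAKER_ROLE',
    enable_online_store=True,
)

# Once the feature group reaches the 'Created' state, ingest the features.
feature_group.ingest(data_frame=df, max_workers=3, wait=True)
```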
To avoid any recurring charges, stop any running Data Wrangler and Jupyter notebook instances within Studio when not in use. Make sure to delete the SageMaker endpoint. Also delete the output files in Amazon S3 that you created while running the orchestration workflow via Step Functions. You have to delete the data in the S3 buckets before you can delete the buckets.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.