Machine Learning Using BigQueryML
Table of Contents
The data science world as we know it is obsessed with building the most innovative and effective ML models. Machine learning on large datasets requires extensive programming and knowledge of ML frameworks. These requirements restrict solution development to a very small set of people.Many companies couldn’t afford data scientists with such a level of expertise but have a need to build models to drive their business forward.
BigQuery ML democratizes machine learning by letting SQL practitioners build models using existing SQL tools and skills.Companies can use their data analysts who understand the data but have limited knowledge and programming expertise.
Our project will walk you through the project on how to use this amazing tool using a sample NCAA Basketball dataset. Using this dataset we will attempt to produce analyses on how some of the most fundamental tracked stats impact a teams chances of winning a basketball game.
This is an example of how you may give instructions on setting up your project locally. To get a local copy up and running follow these simple example steps.
A working knowledge of database querying skills.
To get started with BigQueryML
-
Create a free trial account with Google Cloud:- Create Account
-
In the google console page:- Create a Project
Navigate to the BigQuery Console. Check the access of your account to the console here
Start with creating a project in the GCP Sandbox.
- In the navigation panel, there is an option to create a project right next to "Google Cloud Platform".
- Start with a new project. All the steps going forward- the data, models and results will be stored at one place in this project.
The second step is to create a BigQuery dataset to store your model. To create your dataset, follow these steps:
- In the Cloud Console, go to the BigQuery page. Go to the BigQuery page
- In the navigation panel, in the Resources section, click your project name.
- On the right side, in the details panel, click
- On the Create dataset page, for Dataset ID, enter census.
- Leave all of the other default settings in place and click Create dataset.
Creating Tables. Add the data that you want to work on to the database created in step2.
CREATE TABLE IF NOT EXISTS `big-data-bas.bigdata.main_data`
AS
SELECT
Game_id, field_goals_made,
turnovers, offensive_rebounds,
defensive_rebounds, free_throws_made,
assists, blocks, steals ,win,
CASE
WHEN row_num <= 20862 THEN 'training'
WHEN row_num <= 26822 THEN 'evaluation'
ELSE 'prediction'
END AS dataframe
FROM
(
SELECT
Game_id, field_goals_made,
turnovers, offensive_rebounds,
defensive_rebounds, free_throws_made,
assists, blocks, steals ,win, ROW_NUMBER() OVER (partition by win) as row_num
FROM `bigquery-public-data.ncaa_basketball.mbb_teams_games_sr`
WHERE win is NOT NULL AND
turnovers IS NOT NULL AND
offensive_rebounds IS NOT NULL AND
defensive_rebounds IS NOT NULL AND
free_throws_made IS NOT NULL AND
assists IS NOT NULL AND
blocks IS NOT NULL AND
steals IS NOT NULL AND
field_goals_made IS NOT NULL
);
Create a Model.
# Model
CREATE MODEL IF NOT EXISTS 'big-data-bas.bigdata.win_model1`
OPTIONS
(MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER',
learn_rate = 0.1,
early_stop = TRUE,
input_label_cols=['win']) AS
SELECT
field_goals_made,
turnovers, offensive_rebounds,
defensive_rebounds, free_throws_made,
assists, blocks, steals ,win
FROM `big-data-bas.bigdata.train_V1`
WHERE win is NOT NULL;
Model Fine Tuning You can check the performance of your classifier at different score thresholds and choose an appropriate thereshold as per your business problem.
Evaluation of the Model's predictions:
# Evaluate the model
SELECT
*
FROM
ML.EVALUATE (MODEL `big-data-bas.bigdata.win_model1`,
(
SELECT
*
FROM
`big-data-bas.bigdata.main_data`
WHERE
dataframe = 'evaluation'
)
)
;
This project repository is created in partial fulfillment of the requirements for the Big Data Analytics course offered by the Master of Science in Business Analytics program at the Carlson School of Management, University of Minnesota.
These are some useful resources to get started with Bigquery ML.
- Google Cloud end-to-end user journey for each model
- NCAA Basketball dataset
- BigQuery Cheat Sheet
- Velotio's BigQuery 101: All the Basics You Need to Know
- Bailey Hill - @LinkedIn - [email protected]
- Kiran - @LinkedIn - [email protected]
- Priyanka Bhosale - @LinkedIn - [email protected]
- Qujiaheng (Johnny) Zhang - @LinkedIn - [email protected]
- Rahul Agarwal - @LinkedIn - [email protected]
Project Link: https://github.com/bvfst/BigQueryML