Skip to content

baileythill/BigQueryML

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 

Repository files navigation

Machine Learning Using BigQueryML

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Resources
  8. Contributers

About The Project

Big Query ML

The data science world as we know it is obsessed with building the most innovative and effective ML models. Machine learning on large datasets requires extensive programming and knowledge of ML frameworks. These requirements restrict solution development to a very small set of people.Many companies couldn’t afford data scientists with such a level of expertise but have a need to build models to drive their business forward.

BigQuery ML democratizes machine learning by letting SQL practitioners build models using existing SQL tools and skills.Companies can use their data analysts who understand the data but have limited knowledge and programming expertise.

Our project will walk you through the project on how to use this amazing tool using a sample NCAA Basketball dataset. Using this dataset we will attempt to produce analyses on how some of the most fundamental tracked stats impact a teams chances of winning a basketball game.

(back to top)

Getting Started

This is an example of how you may give instructions on setting up your project locally. To get a local copy up and running follow these simple example steps.

Prerequisites

A working knowledge of database querying skills.

Installation

To get started with BigQueryML

  1. Create a free trial account with Google Cloud:- Create Account

  2. In the google console page:- Create a Project

  3. Make sure billing is enabled.

  4. Ensure that the API is enabled.

(back to top)

Usage

Navigate to the BigQuery Console. Check the access of your account to the console here

Step one: Create a Project

Start with creating a project in the GCP Sandbox.

  1. In the navigation panel, there is an option to create a project right next to "Google Cloud Platform".
  2. Start with a new project. Create Project All the steps going forward- the data, models and results will be stored at one place in this project.

Step two: Create your dataset

The second step is to create a BigQuery dataset to store your model. To create your dataset, follow these steps:

  1. In the Cloud Console, go to the BigQuery page. Go to the BigQuery page
  2. In the navigation panel, in the Resources section, click your project name.
  3. On the right side, in the details panel, click Create Dataset
  4. On the Create dataset page, for Dataset ID, enter census.
  5. Leave all of the other default settings in place and click Create dataset.

Step Three:

Creating Tables. Add the data that you want to work on to the database created in step2.

CREATE TABLE IF NOT EXISTS  `big-data-bas.bigdata.main_data`
AS 
SELECT
Game_id, field_goals_made, 
turnovers, offensive_rebounds, 
defensive_rebounds, free_throws_made, 
assists, blocks, steals ,win,
CASE
 WHEN row_num <= 20862  THEN 'training'
 WHEN row_num <= 26822  THEN 'evaluation'
 ELSE 'prediction'
END AS dataframe

FROM
(
SELECT 
 Game_id, field_goals_made, 
 turnovers, offensive_rebounds, 
 defensive_rebounds, free_throws_made, 
 assists, blocks, steals ,win, ROW_NUMBER() OVER (partition by win) as row_num

FROM `bigquery-public-data.ncaa_basketball.mbb_teams_games_sr`
WHERE win is NOT NULL AND
 turnovers  IS NOT NULL AND
 offensive_rebounds IS NOT NULL AND 
 defensive_rebounds  IS NOT NULL AND
 free_throws_made IS NOT NULL AND
 assists  IS NOT NULL AND
 blocks  IS NOT NULL AND
 steals  IS NOT NULL AND
 field_goals_made IS NOT NULL 
);

Step Four:

Create a Model.

# Model
CREATE MODEL IF NOT EXISTS 'big-data-bas.bigdata.win_model1`
OPTIONS 
(MODEL_TYPE = 'BOOSTED_TREE_CLASSIFIER', 
learn_rate = 0.1,
early_stop = TRUE,
input_label_cols=['win']) AS
SELECT 
field_goals_made, 
turnovers, offensive_rebounds, 
defensive_rebounds, free_throws_made, 
assists, blocks, steals ,win
FROM `big-data-bas.bigdata.train_V1`
WHERE win is NOT NULL;

Model Fine Tuning Model Fine Tuning You can check the performance of your classifier at different score thresholds and choose an appropriate thereshold as per your business problem.

Step Five:

Evaluation of the Model's predictions:

# Evaluate the model
SELECT
  *
FROM
  ML.EVALUATE (MODEL `big-data-bas.bigdata.win_model1`,
    (
    SELECT
      *
    FROM
      `big-data-bas.bigdata.main_data`
    WHERE
      dataframe = 'evaluation'
    )
  )
  ;

Evaluation Results Evaluation results

(back to top)

Here's out of sample preformance of our model with probability threshold of 0.5.

License

This project repository is created in partial fulfillment of the requirements for the Big Data Analytics course offered by the Master of Science in Business Analytics program at the Carlson School of Management, University of Minnesota.

(back to top)

Resources used

These are some useful resources to get started with Bigquery ML.

(back to top)

Contributers

Project Link: https://github.com/bvfst/BigQueryML

(back to top)

About

Trends Marketplace

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published