Submission:
- Please push your assignment to your fork (your GitHub repository of the course) and submit a link to it via the form shared in Slack.
In this project, you will implement the exploratory data analysis plan developed in Project 1. This will lay the groundwork for our modeling exercise in Project 3.
Before completing an analysis, it is critical to understand your data. You will need to identify all the biases of the variables in your model in order to accurately assess the strengths and limitations of your analysis and predictions.
Following these steps will help you better understand your dataset.
Objective: A Jupyter notebook writeup that provides a dataset overview with visualizations and statistical analysis.
Requirements:
- Read in your dataset, determine how many samples are present, and identify any missing data.
- Create a table of descriptive statistics for each of the variables (count, mean, standard deviation, ...).
- Describe the distributions of your data.
- Plot boxplots for each variable.
- Create a covariance matrix.
- Determine any issues or limitations based on your exploratory analysis.
- Outline exploratory analysis methods.
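A minimal sketch of the first few requirements, assuming a small, entirely numeric dataset. The column names (`height`, `weight`, `age`) and the inline CSV are hypothetical stand-ins, not the project dataset; in your notebook you would point `pd.read_csv` at your own file instead.

```python
import io

import pandas as pd

# Hypothetical stand-in data; in practice, pass the path of your own file to pd.read_csv.
csv_text = """height,weight,age
170,65,31
182,,45
158,52,27
175,80,39
"""
df = pd.read_csv(io.StringIO(csv_text))

# Number of samples (rows) in the dataset.
n_samples = len(df)

# Count of missing values in each column.
missing_per_column = df.isnull().sum()

# Descriptive statistics: count, mean, std, min, quartiles, max.
summary = df.describe()

# Covariance matrix; pandas skips missing values pairwise.
cov_matrix = df.cov()

print(n_samples)
print(missing_per_column)
print(summary)
print(cov_matrix)
```

Note that `describe()` reports a lower `count` for any column with missing values, which is a quick cross-check against the `isnull().sum()` result.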
The dataset is available here.
For this project we will be using a Jupyter notebook. This notebook will use `matplotlib` for plotting and visualizing our data. This type of visualization is handy for prototyping and quick data analysis. We will discuss more advanced data visualizations for disseminating your work.
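As a rough illustration of this kind of quick plotting, the sketch below draws a boxplot and a histogram for each variable. The DataFrame is randomly generated and the column names (`feature_a`, `feature_b`) are hypothetical; substitute the DataFrame you read from your own dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data for illustration only; use your own DataFrame here.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(50, 10, 200),    # roughly symmetric
    "feature_b": rng.exponential(2.0, 200),  # right-skewed
})

# One column of plots per variable: boxplot on top, histogram below.
fig, axes = plt.subplots(nrows=2, ncols=len(df.columns), figsize=(8, 6))
for i, col in enumerate(df.columns):
    values = df[col].dropna()  # drop missing values before plotting
    axes[0, i].boxplot(values)
    axes[0, i].set_title(f"Boxplot: {col}")
    axes[1, i].hist(values, bins=20)
    axes[1, i].set_title(f"Histogram: {col}")
fig.tight_layout()
```

In a Jupyter notebook the figure is normally rendered inline below the cell. Viewing the boxplot and histogram together makes skew and outliers easier to describe in your writeup.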
- Open the starter code notebook in Anaconda.
- Read in your dataset.
- Try out a few `pandas` commands for describing your data: `df.describe()`, `df['columnName'].sum()`, `df['columnName'].mean()`, `df['columnName'].count()`, `df.corr()`
- Read the documentation for `pandas`. Most of the time, there is a tutorial that you can follow; learning to read documentation is crucial to your success as a data scientist.
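To see how the commands listed above behave, here is a small sketch on a toy DataFrame. The data and the second column name (`other`) are made up for illustration; `columnName` is the placeholder used in the list above. One value is deliberately missing to show that these methods skip missing values by default.

```python
import pandas as pd

# Toy data for illustration only; one value in 'columnName' is missing.
df = pd.DataFrame({
    "columnName": [1.0, 2.0, 3.0, None],
    "other": [2.0, 4.0, 6.0, 8.0],
})

total = df["columnName"].sum()      # 6.0 (the missing value is skipped)
average = df["columnName"].mean()   # 2.0 (mean of the 3 non-missing values)
n_present = df["columnName"].count()  # 3 (count excludes missing values)

# Pairwise Pearson correlation between numeric columns.
corr_matrix = df.corr()

print(total, average, n_present)
print(corr_matrix)
```

Because `count()` excludes missing values, comparing it with `len(df)` is a simple way to spot incomplete columns. `df.corr()` is a scaled counterpart of the covariance matrix, so its entries always fall between -1 and 1.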
Look at some sample notebooks for an example of the types of visualizations you can use in your notebook.
The rubric is available here.