GeneralAssemb.ly

DAT11 Course Repository

Welcome to Data Science at General Assembly! This is where we will be hosting all class slides, assignments, resources, and more. Course materials for General Assembly's Data Science course in Washington, DC (1/12/16 - 3/31/16).

Course Outline (tentative, subject to change)

| Tuesday | Thursday |
| --- | --- |
| 1/12: Introduction to Data Science | 1/14: Python Data Model; Data Reading and Cleaning |
| 1/19: Command Line and Version Control | 1/21: Exploratory Data Analysis |
| 1/26: Data Visualization | 1/28: Machine Learning Introduction |
| 2/2: K-Nearest Neighbors | 2/4: Linear Regression |
| 2/9: Web Scraping and Data Cleansing | 2/11: Basic Model Evaluation |
| 2/16: Logistic Regression | 2/18: Advanced Model Evaluation |
| 2/23: First Project Presentation | 2/25: Naive Bayes and Text Data |
| 3/1: Natural Language Processing | 3/3: Kaggle Competition |
| 3/8: Decision Trees | 3/10: Ensembling (Random Forest) |
| 3/15: Advanced scikit-learn/Clustering | 3/17: Final Project Presentation |
| 3/22: Final Project Presentation | 3/24: Selected Topics, Wrap-up |

Class Assignment Outline (tentative, subject to change)

| Date | Assignment |
| --- | --- |
| 1/21 | HW#1: Chipotle Python |
| 1/26 | HW#2: Command Line |
| 1/28 | HW#3: IMDB with Pandas |
| 2/4 | Project Brainstorming Deadline |
| 2/9 | HW#4: Yelp Votes Linear Regression |
| 2/11 | Project Question and Dataset Due |
| 2/16 | HW#5: Web Scraping - IMDB (Optional) |
| 2/23 | First Project Presentation |
| 3/8 | HW#6: Naive Bayes with Yelp Review Text (Optional) & Draft Project Paper Due |
| 3/15 | Peer Review Due |
| 3/22 | Final Project Presentations |

Submission Forms


Class 1: Introduction to Data Science

Python Resources

Resources:


Class 2: Python Fundamentals, Data Reading, and Cleaning

  • Python:
    • Spyder interface / IPython Notebooks
    • Looping exercise
    • Lesson on file reading with airline safety data (code, data, article)
    • Data cleaning exercise
    • Walkthrough of Python homework with Chipotle data (code, data, article)
  • Discuss Course Project
  • Wrap up: Course schedule, office hours

Homework:

  • Complete the Python homework assignment with the Chipotle data, add a commented Python script to your GitHub repo, and submit a link using the homework submission form. (Note: Pandas, which is covered in class 4, should not be used for this assignment.)

  • Review the code from the beginner and intermediate Python workshops. If you don't feel comfortable with any of the content (excluding the "requests" and "APIs" sections), you should spend some time this weekend practicing Python:

    • Introduction to Python does a great job explaining Python essentials and includes tons of example code.
    • If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
    • If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
    • If you have more time, try missions 2 and 3 from DataQuest's Learning Python course.
    • If you've already mastered these topics and want more of a challenge, try solving Python Challenge number 1 (decoding a message) and send me your code in Slack.
  • To give you a framework for thinking about your project, watch What is machine learning, and how does it work? (10 minutes). (This is the IPython notebook shown in the video.) Alternatively, read A Visual Introduction to Machine Learning, which focuses on a specific machine learning model called decision trees.

  • Optional: Browse through some more example student projects, which may help to inspire your own project!

  • Install the Anaconda distribution of Python 2.7.x.

    • If you choose not to use Anaconda, here is a list of the Python packages you will need to install during the course.

Resources:


Class 3: Command Line and Version Control

  • Git and GitHub (slides)
  • Command line exercise (code)
  • Intermediate command line

Homework:

Create a Markdown document that includes your answers to questions 1-3 below and the code you used to arrive at those answers. Add this file to a GitHub repo that you'll use for all of your coursework, and submit a link to your repo using the homework submission form:

  1. Using chipotle.tsv in the data subdirectory:
    • Look at the head and the tail, and think for a minute about how the data is structured. What do you think each column means? What do you think each row means? Tell me! (If you're unsure, look at more of the file contents.)
    • How many orders do there appear to be?
    • How many lines are in the file?
    • Which burrito is more popular, steak or chicken?
    • Do chicken burritos more often have black beans or pinto beans?
  2. Count the number of occurrences of the word 'dictionary' (regardless of case) across all files in the DAT9 repo.
  3. Optional: Use the command line to discover something "interesting" about the Chipotle data. The advanced commands may be helpful to you!
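
If you are unsure where to start, the kinds of commands these questions call for can be sketched on a small stand-in file. This is only an illustration: `sample.tsv` and `notes.txt` are made-up files created in a temporary directory, not the real `chipotle.tsv` or the repo contents, so the numbers will not match your homework answers.

```shell
# Build a tiny tab-separated sample file (hypothetical data, not chipotle.tsv).
workdir=$(mktemp -d)
printf 'order_id\titem\n1\tChicken Bowl\n1\tChips\n2\tSteak Burrito\n3\tchicken burrito\n' > "$workdir/sample.tsv"

# Peek at the first and last lines to see how the file is structured.
head -n 2 "$workdir/sample.tsv"
tail -n 2 "$workdir/sample.tsv"

# Count the lines in the file (the header row is included).
line_count=$(wc -l < "$workdir/sample.tsv" | tr -d ' ')

# Count the lines that mention a word, ignoring case (-i) and counting (-c).
chicken_lines=$(grep -ic 'chicken' "$workdir/sample.tsv")

# Count every occurrence of a word across all files under a directory:
# -r recurses, -i ignores case, -o prints each match on its own line.
printf 'A dictionary maps keys to values. Dictionary lookups are fast.\n' > "$workdir/notes.txt"
word_count=$(grep -rio 'dictionary' "$workdir" | wc -l | tr -d ' ')

echo "lines=$line_count chicken_lines=$chicken_lines word_count=$word_count"
```

Note the difference between `grep -c`, which counts matching lines, and `grep -o ... | wc -l`, which counts individual matches; a line that contains the word twice is counted once by the former and twice by the latter.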

Git and Markdown Resources:

Command Line Resources:

  • The Linux command line
  • If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
  • If you want to do more at the command line with CSV files, try out csvkit, which can be installed via pip.

Git Repo setup

Step 1

Fork the class repo.

Step 2

Copy the link from your new forked repo.

Step 3

Clone your new forked repo to your computer:

git clone git@github.com:YOUR_USERNAME/DAT-DC-11.git

Step 4

cd (change directory) into the cloned repo.

Step 5

git remote add upstream https://github.com/ga-students/DAT-DC-11

Step 6

Repeat this step often to keep your Repo up to date with the Class Repo:

git fetch upstream
git merge upstream/master
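
If you would like to rehearse steps 3 through 6 before touching GitHub, the sketch below replays them entirely on your own machine. It rests on several assumptions: a throwaway local repository stands in for the class repo, the clone stands in for your fork, and the file name, commit messages, and "Demo" identity are placeholders invented for the demo.

```shell
work=$(mktemp -d)

# Stand-in for the class repo (normally hosted on GitHub).
git init -q "$work/class"
git -C "$work/class" symbolic-ref HEAD refs/heads/master   # make sure the branch is named master
echo "week 1" > "$work/class/notes.txt"
git -C "$work/class" add notes.txt
git -C "$work/class" -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Add week 1 notes"

# Steps 3-5: clone, cd into the clone, and register the class repo as "upstream".
git clone -q "$work/class" "$work/fork"
cd "$work/fork"
git remote add upstream "$work/class"

# Later, the class repo gains new material.
echo "week 2" >> "$work/class/notes.txt"
git -C "$work/class" -c user.name="Demo" -c user.email="demo@example.com" commit -q -a -m "Add week 2 notes"

# Step 6: fetch the new commits and merge them into your local copy.
git fetch -q upstream
git merge -q upstream/master

synced_lines=$(wc -l < notes.txt | tr -d ' ')
echo "notes.txt now has $synced_lines lines"
```

In the real setup, `git remote -v` should list both `origin` (your fork) and `upstream` (the class repo) once step 5 is done.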

Resources:


Class 4: Exploratory Data Analysis

Homework:

Resources:


Class 5: Visualization

  • Part 2 of Exploratory Data Analysis with Pandas (code)
  • Visualization with Pandas and Matplotlib (notebooks)
  • Python homework with the Chipotle data ([Solution](homework solutions/03_python_homework_chipotle_explained.ipynb))

Homework:

Pandas Resources:

Visualization Resources:


Class 6: Machine Learning

Homework:

Machine Learning Resources:

IPython Notebook Resources:

  • If you would like to learn the IPython Notebook, the official Notebook tutorials are useful.
  • This Reddit discussion compares the relative strengths of the IPython Notebook and Spyder.

Class 7: K-Nearest Neighbors

Homework:

KNN Resources:

Seaborn Resources:


Class 8: Linear Regression

Homework:

Linear Regression Resources:

Other Resources:


Meet the Team

Aleks Ontman (Instructor)

Dr. Ontman joined Deloitte in 2012 and is currently a Sr. Data Scientist on Deloitte's Advanced Analytics Visualization team (VizStudio), specializing in machine learning, design thinking for prototyping new solutions, ideation workshops, and guided interactions with Big Data. Dr. Ontman's projects have involved big data solutions for 5+ Fortune 100 companies, 10+ Fortune 500 companies, and several Federal Agencies.

Alex Sherman (Instructor)

Alex is a passionate business analytics advocate. He currently works as a Technology Consultant at Deloitte Consulting, where he leads the design and implementation of informatics and analytics software development projects, repurposing semantic open source software to enhance data access for federal health care clients. In his free time, Alex is an avid jazz percussionist and the self-proclaimed best drum stick spinner in the DC metro area.

Al is interested in creatively applying software engineering and data science to solve real-world problems. He recently graduated with a degree in computer science from the McCormick School of Engineering at Northwestern University and currently works at the Washington Post as a Data Scientist in the Big Data & Data Warehouse Solutions department.

Tim Foley (Course Producer)

As your Course Producer, it's Tim's job to make sure that you (and your instructors) have everything you need for a successful experience in DAT11. If you've got a question, and you're not sure who to ask, start with Tim!

Before GA, Tim lived and worked in China as a facilitator and program-designer for youth leadership programs at international schools all over Asia (e.g. student-council retreats, backpacking trips, etc.). After a year abroad, he was ready to move back to the good ol' USA. Tim started out at the front desk as a member of GA's Front Lines team, moved up to "Campus Commander" (yes, a real title), and then in January started as a full-time Course Producer. In addition to Data Science, Tim also produces Front- and Back-End Web Development, Data Analytics, Mobile Development, and Digital Marketing. Tim has been trying to learn Esperanto since high school.

Contact Info