Week 1
Download the Titanic dataset from Kaggle, and complete the following exercises. You will need to create a Kaggle account unless you already have one, but it is very straightforward. The dataset consists of personal information about all the people on board the Titanic, along with information on whether or not they survived. For now we'll stick to preprocessing and general handling of the data.
- Read the data in your favorite language. Hint: with Python you can use `pandas` (a sketch of the whole preprocessing pipeline follows this list).
- Have a look at the data. We will build new representations of the dataset that are better suited for a particular purpose. Some of the columns, e.g. `Name`, simply identify a person and cannot be useful for prediction tasks - remove them.
- The column `Cabin` contains a letter and a number. A clever data scientist might conclude that the letter stands for a deck on the ship (which is indeed true) and that having just the deck information might improve the results of a classifier predicting an output variable. Add a new column to the dataset, which is simply the deck letter.
- You'll notice that some of the columns, such as the previously added deck column, are categorical. Their representation as strings is not efficient for further computation. Transform them into numeric values so that a unique integer id corresponds to each distinct category. Hint: `pandas` can do this for you.
- Some of the rows in the data have missing values, e.g. when the cabin number of a person is not known. Most machine learning algorithms have trouble with missing values, and they need to be handled in preprocessing:
  a) For continuous values, replace the missing values with the average of the non-missing values of that column.
  b) For discrete and categorical values, replace the missing values with the mode of the column.
  This is known as imputation.
- At this point, all data are numeric. Write the data, with the modifications, to a `.csv` file. Then, write another file, this time in the `json` format, with the following structure:
```json
[
  {
    "Deck": 0,
    "Age": 20,
    "Survived": 0,
    ...
  },
  {
    ...
  }
]
```
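A minimal sketch of the whole pipeline in `pandas`, assuming the Kaggle training file was saved as train.csv; the output file names and the exact lists of columns to drop, encode, and impute are suggestions you can adapt:

```python
import pandas as pd

# Read the raw data (assumes the Kaggle file was saved as train.csv).
df = pd.read_csv("train.csv")

# Drop columns that merely identify a person.
df = df.drop(columns=["PassengerId", "Name", "Ticket"])

# Add the deck letter (first character of Cabin; missing cabins stay missing).
df["Deck"] = df["Cabin"].str[0]
df = df.drop(columns=["Cabin"])

# Turn categorical columns into integer ids, keeping missing values missing.
for col in ["Sex", "Embarked", "Deck"]:
    codes = df[col].astype("category").cat.codes   # NaN becomes -1
    df[col] = codes.where(codes != -1)

# Imputation: mean for continuous columns, mode for discrete/categorical ones.
for col in ["Age", "Fare"]:
    df[col] = df[col].fillna(df[col].mean())
for col in ["Pclass", "Sex", "SibSp", "Parch", "Embarked", "Deck"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Write the result both as CSV and as a list of JSON records.
df.to_csv("titanic_clean.csv", index=False)
df.to_json("titanic_clean.json", orient="records", indent=2)
```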
You can study the records to see if there is an evident pattern in the chances of survival. We will be doing more systematic analysis next week, so your work here will be rewarded.
Next we'll look at some text data. We'll be looking into Amazon reviews, and the steps needed to transform a raw dataset into one more suitable for prediction tasks.
- Download the automotive 5-core dataset from here. Extract it to find the data in `json` format. You can also download one of the bigger ones, if you are feeling ambitious.
- The `reviewText` field contains the unstructured review text written by the user. When dealing with natural language, it is important to notice that while, for example, the words "Copper" and "copper." are represented by two different strings, they have the same meaning. When applying statistical methods on this data, it is useful to ensure that words with the same meaning are represented by the same string. To do this, we usually normalize the data, for example by removing punctuation and capitalization differences. A related issue is that while, again, the words "swims" and "swim" are distinct strings, they both refer to swimming. Stemming refers to the process of mapping words in inflected form to their base form: swims -> swim, etc. Finally, another popular approach is to remove so-called stop-words, words that are very common and have little to do with the actual content matter. There are plenty of openly available lists of stop-words for almost any (natural) language.
- Do the following (a sketch of these steps follows this list):
  a) Open the json file in your favorite environment, e.g. python.
  b) Access the `reviewText` field, and downcase the contents.
  c) Remove all punctuation, as well as the stop-words. You can find a stop-word list for English e.g. here.
  d) Apply a stemmer on the paragraphs, so that inflected forms are mapped to the base form. For example, for python the popular natural language toolkit `nltk` has an easy-to-use stemmer.
  e) Filter the data by selecting reviews where the field `overall` is 4 or 5, and store the review texts in file `pos.txt`. Similarly, select reviews with rating 1 or 2 and store the reviews in file `neg.txt`. (Ignore the reviews with overall rating 3.) Each line in the two files should contain exactly one preprocessed review text without the rating.
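A minimal sketch of steps (a)-(e) using `nltk`'s English stop-word list and Porter stemmer. The input file name Automotive_5.json is an assumption, and the code assumes one valid JSON record per line (as in the 5-core review dumps); if your copy is not strict JSON, you may need something like `ast.literal_eval` instead of `json.loads`:

```python
import json
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # needed once for the stop-word list

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
punct_table = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Downcase, strip punctuation, drop stop-words, and stem."""
    words = text.lower().translate(punct_table).split()
    return " ".join(stemmer.stem(w) for w in words if w not in stop_words)

pos, neg = [], []
with open("Automotive_5.json") as f:      # assumed file name; one record per line
    for line in f:
        review = json.loads(line)
        text = preprocess(review.get("reviewText", ""))
        if review["overall"] >= 4:
            pos.append(text)
        elif review["overall"] <= 2:
            neg.append(text)
        # reviews with overall rating 3 are ignored

with open("pos.txt", "w") as f:
    f.write("\n".join(pos) + "\n")
with open("neg.txt", "w") as f:
    f.write("\n".join(neg) + "\n")
```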
Having created two collections of positive and negative reviews, respectively, you may wish to take a quick look to see how the review texts differ between them. Here too, we will be using this data later to experiment with machine learning methods.
This exercise involves an SQL database with historical data on baseball players.
- Download the sqlite version of The History of Baseball dataset from Kaggle. Note that you will need to unzip the file before it can be accessed.
- You can choose whether to install the sqlite database management software and use it to complete this exercise using SQL, or to use an sqlite library in your favorite programming environment. See the examples in the slides from Lecture 2.
- Join together the tables `player` and `hall_of_fame` by the key `player_id`, which appears in both. Select the rows where the field `inducted` takes value 'Y' (i.e., those entries where the player in question was voted to the Hall of Fame). You should get 312 famous players, including Babe Ruth and Yogi Berra.
- Now join in a third table, `player_college`, again by the key `player_id`. The `player_college` table indicates for which college (university) the player has played, if any, in each year. Note that the number of rows in the table `player_college` per player varies between zero and nine (somebody probably enjoyed studying as much as you do!). Keep the restriction that the player should have been selected to the Hall of Fame (`inducted` = 'Y'), and group the output by college (`college_id`). Count how many Hall of Fame players are alumni of each college. There is one college that has as many as three. Which one? (To see the full name of the college, you can look up the `college_id` in table `college`.) A sketch of one possible approach follows the hints below.
Hint: Remember that you can process and analyse the data in, e.g., python, so you don't necessarily have to do the counting in item (4) by writing SQL code. In other words, unless you happen to be an SQL lover, you may wish to do the table joins in SQL but all the rest in another language.
Hint: If you decide to do the counting in item (4) in SQL, the `DISTINCT` clause may be useful.
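One possible approach, sketched in python with `sqlite3` and `pandas`: the joins are done in SQL and the counting in pandas. The database file name database.sqlite is an assumption based on the Kaggle download; check the table and column names against the actual schema.

```python
import sqlite3
import pandas as pd

# Open the sqlite database (assumed file name from the Kaggle download).
conn = sqlite3.connect("database.sqlite")

# Items (3)-(4): join the three tables in SQL, keeping only Hall of Fame inductees.
query = """
SELECT p.player_id, pc.college_id
FROM player AS p
JOIN hall_of_fame AS h ON p.player_id = h.player_id
JOIN player_college AS pc ON p.player_id = pc.player_id
WHERE h.inducted = 'Y'
"""
df = pd.read_sql_query(query, conn)
conn.close()

# A player can appear once per year at the same college, so count distinct
# (player, college) pairs before grouping by college.
counts = (df.drop_duplicates(["player_id", "college_id"])
            .groupby("college_id")
            .size()
            .sort_values(ascending=False))
print(counts.head())
```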