-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpsy-data-inspection.Rmd
285 lines (213 loc) · 9.18 KB
/
psy-data-inspection.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
---
title: "psy-data-inspection"
author: "Chaehan So"
# date: "2/5/2019"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# knitr::opts_knit$set(global.par = TRUE)
# clear the workspace
rm(list=ls())
# mode <- "new"
mode <- "old"
# select target and features
# target.label <- "PERF07"
target.label <- "PERF.all"
dataset.label <- paste0(c("data/dataset.rds"), collapse = ".")
dataset.NA.label <- paste0(c("data/dataset.NA.rds"), collapse = ".")
# features.set <- "big5composite"
features.set <- "big5items"
# devtools::install_github("agilebean/machinelearningtools", force = TRUE)
# load libraries
libraries <- c("dplyr", "magrittr", "tidyverse", "purrr"
, "sjlabelled" # read SPSS
, "caret", "doParallel"
, "stargazer", "DataExplorer", "skimr"
, "machinelearningtools"
, "knitr", "pander"
)
sapply(libraries, require, character.only = TRUE)
# nominal <- FALSE # with ordinal as ORDERED factors
nominal <- TRUE # with ordinal as NOMINAL factor
seed <- 17
```
```{r, echo=FALSE}
# par(mar=c(0,0,0,0)) #it's important to have that in a separate chunk
```
# 1. Data Acquistion
The file "Personality-Performance-Turnover-Chaehan So.sav" is in SPSS format. Columns contain hidden "variable labels" below column names as shown in the following.
```{r data acquisition, echo=FALSE, message=FALSE, warning=FALSE}
#######################################################################
# 1. Data Acquistion
#######################################################################
filename <- "data/Personality-Performance-Turnover-Chaehan So.sav"
file.raw <- sjlabelled::read_spss(filename,
atomic.to.fac = TRUE,
verbose = FALSE)
data.labels <- foreign::read.spss(filename) %>%
attributes %>%
.$variable.labels %T>% print
file.raw %<>%
dplyr::select(-id, -prinum, -TESTDATE) %>%
tbl_df
file.raw$job %>% table
file.raw %>%
filter(job != 1 & job != 10) %>%
count()
```
# 2. Data Preparation
```{r data preparation code, include=FALSE}
################################################################################
# 2. Data Preparation
################################################################################
if (nominal) {
# data.raw <- sjlabelled::copy_labels(data.raw, df_origin = file.raw)
# file.raw %>% str
## tricky: either mutate+one_of(column_labels) OR mutate_at+column_labels
data.raw <- file.raw %>%
# convert categorical variables to factors
mutate_at((c("COMNAME", "team_id", "class", "job", "gender", "educa")),
as.factor) %>%
# convert numerical variables to numeric datatype
mutate_at(vars(starts_with("TO")), as.numeric) %>%
# fix import error for PERF07
mutate_at("PERF07", as.numeric) %T>% print
# leave Big5 items' datatype to numeric bec. they are already converted
# dataset %>% str
# dataset %>% glimpse
} else { # factors treated as ordinal
}
```
## 2.1 Data Inspection
### 2.1.1 Data Structure
A glimpse at the data structure of the raw data shows that 6 of the 57 variables are categorical: COMNAME, team_id, class, job, gender, educa. Thereof, only educa is ordinal.
All other variables (Big5 items, LIFE_S_R, JOB_S_R, turnover, performance) had been converted to numerical values although they are - strictly speaking - ordinal variables. The only variable reflecting an intrinsically numeric data type is howlong, defined as tenure in months.
```{r data preparation, echo=TRUE}
data.raw %>% glimpse
```
\newpage
### 2.1.2 Histograms
All big5 items, social desirability (sd) and infrequency (inf) are more or less normally distributed.
Tenure (howlong) shows a logarithmic decline.
In contrast, performance variables all are highley skewed except PERF09 and PERF11. The turnover variables all show little or zero variance.
```{r histograms, fig.height=8.7, fig.width=7, cache=FALSE, echo=FALSE}
# par(mar=c(0,0,0,0))
data.raw %>% plot_histogram(nrow = 9, ncol = 6)
```
\newpage
### 2.1.3 Missing Values
The following table and plot shows all variables containing NAs sorted by decreasing count:
```{r missing values, fig.height=9, fig.width=7, cache=TRUE, echo=FALSE}
################################################################################
### 2.1.3 Missing Values
################################################################################
data.raw %>% DataExplorer::plot_missing()
```
\newpage
The top NA variables are:
```{r missing values code, echo=FALSE}
# find columms with most NAs
## count NAs per columns
na.columns <- data.raw %>%
# find any columns with NAs
select_if(function(x) any(is.na(x))) %>%
# sum NAs for each column
summarise_all(funs(sum(is.na(.))))
## rank top NA columns
na.sorted <- na.columns %>%
t %>% as_tibble(rownames = "variable") %>%
rename(NA.count = V1) %>%
arrange(desc(NA.count)) %T>% print
```
## 2.2 Data Cleaning
### Goal: Define a datasets with least # of NAs
### 2.2.1 Prune dataset
The first step of feature engineering is to remove those variables that contain too many missing variables (NAs).
To obtain data.pruned, we remove the variables
- JOB_S_R (1027 NAs)
- howlong (910 NAs)
- job (342 NAs)
- class (272 NAs)
- team_id (238 NAs)
As the remaining dataset would only have 858 observations without NAs, we remove PERF10 and PERF11 which increases sample size to n=958. We also remove TO07 due to zero variance.
```{r define data.pruned code, echo=TRUE}
data.pruned <-
data.raw %>%
select(-JOB_S_R, # 1027 NAs
-howlong, # 910 NAs
# -PERF11, # 502 NAs
# -PERF10, # 431 NAs
# -job, # 342 NAs
# -PERF09, # 331 NAs
-PERF08,
-PERF07, # 309 NAs
# -TO07, # 288 NAs
-class, # 272 NAs +(904-713)
-team_id, # 238 NAs +(1053-904)
# -PERF08, # 198 NAs +(1202-1053)
# -LIFE_S_R # 107 NAs +(1288-1202)
-starts_with("TO"),
-COMNAME,
-gender, -educa, -inf, -sd
) %>%
# create mean performance & turnover score
# mutate(PERF.all = rowMeans(select(., starts_with("PERF")) ) ) %>%
# mutate(TO.all = rowMeans(select(., starts_with("TO")) ) ) %>%
# move life satisfaction to dataframe end to indicate target variable
select(-LIFE_S_R, everything()) %>% print
```
### 2.2.2 Clean dataset
Removing the missing values shrinks the dataset from n=1621 to n=1023.
```{r define dataset containing NA, echo=TRUE}
data.pruned %>% nrow
data.pruned %>% saveRDS(dataset.NA.label)
## final removal of NA rows
# dataset <- data.pruned %>% drop_na %T>% { print(nrow(.)) }
dataset <- data.pruned %T>% { print(nrow(.)) }
```
The final dataset has the following structure:
```{r define dataset.items, echo=TRUE}
dataset %>% glimpse
```
### 2.2.3 Correlation Plot "items": All Big5 items
To visualize the correlations, the following dataset uses the Big5 items instead of the Big5 composite scores.
```{r define dataset.items code, echo=TRUE}
dataset.items <- dataset %>%
# remove composite scores - equivalent to (-nn, -ee, -oo, -aa, -cc)
select(-matches("(oo|cc|ee|aa|nn)$")) %T>% print
```
In the correlation plot for dataset.items, we can see that the Big5 composite scores (e.g. nn) are highly correlated with the corresponding items (e.g. nn1, nn2, ..., nn6). Therefore, the Big5 items can be removed from the dataset without losing too much predictive information.
Importantly, some of the turnover variables (TO08, TO09) yield no correlations - this can be explained by their nearly zero variance pattern, i.e. their standard deviation is nearly zero.
```{r corrplot1, fig.height=6, fig.width=6, cache=FALSE, echo=FALSE, warning=FALSE}
dataset.items %>%
select_if(is.numeric) %>%
# select(-TO08, -TO09) %>% # suppress warning: "the standard deviation is zero"
cor(use = "pairwise.complete.obs") %>%
corrplot::corrplot()
```
### 2.2.4 Correlation Plot "composites": Big5 composite scores
This dataset uses the Big5 composite scores instead of the Big5 items.
```{r define dataset.composites code, echo=TRUE}
dataset.composites <- dataset %>%
# remove Big5 items
select(-matches(".*(1|2|3|4|5|6)")) %>%
# remove turnover items except "TO.all"
select(-matches("^TO[0-9]{2}$")) %T>% print
```
The correlation plot for dataset.composites shows an interesting pattern:
The performance variables PERF07, PERF08, PERF09 show correlations in increasing order with every Big5 composite score.
```{r corrplot2, fig.height=6, fig.width=6, cache=FALSE, echo=FALSE}
dataset.composites %>%
select_if(is.numeric) %>%
cor(use = "pairwise.complete.obs") %>%
corrplot::corrplot()
```
Performance (PERF.all) correlates moderately positively with conscientiousness and moderately negatively with neuroticism, as expected.
Social desirability (sd) correlates strongly negatively with neuroticism, interestingly, and moderately positively with conscientiousness. Although latter positivly relates to performance, social desirability and performance only show a weak positive relationship.
```{r save datasets, include=FALSE}
dataset %>%
# remove turnover variables by regexp for select(-one_of("TO08", ...))
select(-matches("^(TO)[0-9]{2}$"))
dataset %>% saveRDS(dataset.label)
```