psy-data-inspection.Rmd

---
title: "psy-data-inspection"
author: "Chaehan So"
# date: "2/5/2019"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# knitr::opts_knit$set(global.par = TRUE)

# clear the workspace
rm(list=ls())

# mode <- "new"
mode <- "old"

# select target and features
# target.label <- "PERF07"
target.label <- "PERF.all"

dataset.label <- paste0(c("data/dataset.rds"), collapse = ".")
dataset.NA.label <- paste0(c("data/dataset.NA.rds"), collapse = ".")

# features.set <- "big5composite"
features.set <- "big5items"

# devtools::install_github("agilebean/machinelearningtools", force = TRUE)
# load libraries
libraries <- c("dplyr", "magrittr", "tidyverse", "purrr"
               , "sjlabelled" # read SPSS
               , "caret", "doParallel"
               , "stargazer", "DataExplorer", "skimr"
               , "machinelearningtools"
               , "knitr", "pander"
)
sapply(libraries, require, character.only = TRUE)

# nominal <- FALSE # with ordinal as ORDERED factors
nominal <- TRUE # with ordinal as NOMINAL factor

seed <- 17
```

```{r, echo=FALSE}
# par(mar=c(0,0,0,0)) #it's important to have that in a separate chunk
``` 

# 1. Data Acquistion
The file "Personality-Performance-Turnover-Chaehan So.sav" is in SPSS format. Columns contain hidden "variable labels" below column names as shown in the following.

```{r data acquisition, echo=FALSE, message=FALSE, warning=FALSE}

#######################################################################
# 1. Data Acquistion
#######################################################################

filename <- "data/Personality-Performance-Turnover-Chaehan So.sav"

file.raw <- sjlabelled::read_spss(filename, 
                                  atomic.to.fac = TRUE,
                                  verbose = FALSE) 

data.labels <- foreign::read.spss(filename) %>% 
  attributes %>% 
  .$variable.labels %T>% print

file.raw %<>%
  dplyr::select(-id, -prinum, -TESTDATE) %>% 
  tbl_df 

file.raw$job %>% table

file.raw %>% 
  filter(job != 1 & job != 10) %>% 
  count()

```

# 2. Data Preparation

```{r data preparation code, include=FALSE}
################################################################################
# 2. Data Preparation
################################################################################

if (nominal) {
  
  # data.raw <- sjlabelled::copy_labels(data.raw, df_origin = file.raw)
  # file.raw %>% str
  
  ## tricky: either mutate+one_of(column_labels) OR mutate_at+column_labels
  data.raw <- file.raw %>%
    # convert categorical variables to factors
    mutate_at((c("COMNAME", "team_id", "class", "job", "gender", "educa")),
              as.factor) %>%
    # convert numerical variables to numeric datatype
    mutate_at(vars(starts_with("TO")), as.numeric) %>% 
    # fix import error for PERF07
    mutate_at("PERF07", as.numeric) %T>% print
  # leave Big5 items' datatype to numeric bec. they are already converted
  
  # dataset %>% str
  # dataset %>% glimpse
  
} else { # factors treated as ordinal
  
  
}
```

## 2.1 Data Inspection

### 2.1.1 Data Structure
A glimpse at the data structure of the raw data shows that 6 of the 57 variables are categorical: COMNAME, team_id, class, job, gender, educa. Thereof, only educa is ordinal. 
All other variables (Big5 items, LIFE_S_R, JOB_S_R, turnover, performance) had been converted to numerical values although they are - strictly speaking - ordinal variables. The only variable reflecting an intrinsically numeric data type is howlong, defined as tenure in months.

```{r data preparation, echo=TRUE}
data.raw %>% glimpse
```

\newpage
### 2.1.2 Histograms
All big5 items, social desirability (sd) and infrequency (inf) are more or less normally distributed.
Tenure (howlong) shows a logarithmic decline.
In contrast, performance variables all are highley skewed except PERF09 and PERF11. The turnover variables all show little or zero variance.


```{r histograms, fig.height=8.7, fig.width=7, cache=FALSE, echo=FALSE}
# par(mar=c(0,0,0,0))
data.raw %>% plot_histogram(nrow = 9, ncol = 6)
```

\newpage
### 2.1.3 Missing Values
The following table and plot shows all variables containing NAs sorted by decreasing count:

```{r missing values, fig.height=9, fig.width=7, cache=TRUE, echo=FALSE}
################################################################################
### 2.1.3 Missing Values
################################################################################
data.raw %>% DataExplorer::plot_missing()
```

\newpage
The top NA variables are:

```{r missing values code, echo=FALSE}

# find columms with most NAs
## count NAs per columns
na.columns <- data.raw %>% 
  # find any columns with NAs
  select_if(function(x) any(is.na(x))) %>% 
  # sum NAs for each column
  summarise_all(funs(sum(is.na(.)))) 

## rank top NA columns
na.sorted <- na.columns %>% 
  t %>% as_tibble(rownames = "variable") %>% 
  rename(NA.count = V1) %>% 
  arrange(desc(NA.count)) %T>% print
```


## 2.2 Data Cleaning

### Goal: Define a datasets with least # of NAs

### 2.2.1 Prune dataset

The first step of feature engineering is to remove those variables that contain too many missing variables (NAs). 

To obtain data.pruned, we remove the variables
- JOB_S_R (1027 NAs)
- howlong (910 NAs)
- job (342 NAs)
- class (272 NAs)
- team_id (238 NAs)

As the remaining dataset would only have 858 observations without NAs, we remove PERF10 and PERF11 which increases sample size to n=958. We also remove TO07 due to zero variance.

```{r define data.pruned code, echo=TRUE}
data.pruned <-
  data.raw %>% 
  select(-JOB_S_R, # 1027 NAs
         -howlong, #  910 NAs
         # -PERF11,  #  502 NAs
         # -PERF10,  #  431 NAs
         # -job,     #  342 NAs
         # -PERF09,  #  331 NAs
         -PERF08,
         -PERF07, # 309 NAs
         # -TO07,    #  288 NAs
         -class,   #  272 NAs +(904-713)
         -team_id, #  238 NAs +(1053-904)
         # -PERF08,  #  198 NAs +(1202-1053)
         # -LIFE_S_R #  107 NAs +(1288-1202)
         -starts_with("TO"),
         -COMNAME,
         -gender, -educa, -inf, -sd
  ) %>% 
  # create mean performance & turnover score
  # mutate(PERF.all = rowMeans(select(., starts_with("PERF")) ) ) %>%
  # mutate(TO.all = rowMeans(select(., starts_with("TO")) ) ) %>% 
  # move life satisfaction to dataframe end to indicate target variable
  select(-LIFE_S_R, everything()) %>% print

```


### 2.2.2 Clean dataset
Removing the missing values shrinks the dataset from n=1621 to n=1023.

```{r define dataset containing NA, echo=TRUE}
data.pruned %>% nrow
data.pruned %>% saveRDS(dataset.NA.label)
## final removal of NA rows
# dataset <- data.pruned %>% drop_na %T>% { print(nrow(.)) }
dataset <- data.pruned %T>% { print(nrow(.)) }
```

The final dataset has the following structure:

```{r define dataset.items, echo=TRUE}
dataset %>% glimpse
```


### 2.2.3 Correlation Plot "items": All Big5 items

To visualize the correlations, the following dataset uses the Big5 items instead of the Big5 composite scores.

```{r define dataset.items code, echo=TRUE}
dataset.items <- dataset %>% 
  # remove composite scores - equivalent to (-nn, -ee, -oo, -aa, -cc)
  select(-matches("(oo|cc|ee|aa|nn)$")) %T>% print
```

In the correlation plot for dataset.items, we can see that the Big5 composite scores (e.g. nn) are highly correlated with the corresponding items (e.g. nn1, nn2, ..., nn6). Therefore, the Big5 items can be removed from the dataset without losing too much predictive information.

Importantly, some of the turnover variables (TO08, TO09) yield no correlations - this can be explained by their nearly zero variance pattern, i.e. their standard deviation is nearly zero.


```{r corrplot1, fig.height=6, fig.width=6, cache=FALSE, echo=FALSE, warning=FALSE}
dataset.items %>% 
  select_if(is.numeric) %>% 
  # select(-TO08, -TO09) %>% # suppress warning: "the standard deviation is zero"
  cor(use = "pairwise.complete.obs") %>% 
  corrplot::corrplot()
```


### 2.2.4  Correlation Plot "composites": Big5 composite scores
This dataset uses the Big5 composite scores instead of the Big5 items.

```{r define dataset.composites code, echo=TRUE}
dataset.composites <- dataset %>% 
  # remove Big5 items
  select(-matches(".*(1|2|3|4|5|6)")) %>% 
  # remove turnover items except "TO.all"
  select(-matches("^TO[0-9]{2}$")) %T>% print
```

The correlation plot for dataset.composites shows an interesting pattern:
The performance variables PERF07, PERF08, PERF09 show correlations in increasing order with every Big5 composite score. 
  
```{r corrplot2, fig.height=6, fig.width=6, cache=FALSE, echo=FALSE}
dataset.composites %>% 
  select_if(is.numeric) %>% 
  cor(use = "pairwise.complete.obs") %>% 
  corrplot::corrplot()
```

Performance (PERF.all) correlates moderately positively with conscientiousness and moderately negatively with neuroticism, as expected.
Social desirability (sd) correlates strongly negatively with neuroticism, interestingly, and moderately positively with conscientiousness. Although latter positivly relates to performance, social desirability and performance only show a weak positive relationship.


```{r save datasets, include=FALSE}
dataset %>%
  # remove turnover variables by regexp for select(-one_of("TO08", ...))
  select(-matches("^(TO)[0-9]{2}$"))
dataset %>% saveRDS(dataset.label)
```