---
title: "Introduction_to_R_252C"
author: "Zachary Butzin-Dozier"
date: "7/30/2019"
output: pdf_document
---
Adapted from materials originally created by Dr. Jade Benjamin-Chung
```{r}
# Load the dplyr package
library(dplyr)
```
# Downloading, merging, and subsetting the WASH Benefits data
Note: do not be alarmed if the code below is overwhelming. We have merged and subset the data for you, but we want to make the code available for you to reference in the future.
Data analysis exercises in this course use data from the WASH Benefits Bangladesh trial. The trial assessed whether water, sanitation, handwashing, and nutrition interventions, delivered separately or together, could reduce child diarrhea and/or improve child growth. The trial used a cluster- and block-randomized design. Within each geographic block, 8 village clusters were randomized to a treatment or control arm. See Luby et al. 2018 for full details (doi: http://dx.doi.org/10.1016/).
In this first exercise we will calculate the prevalence of diarrhea in different treatment arms after the interventions were delivered.
The data and codebooks are publicly available here: https://osf.io/pqzj5/
```{r}
# Load the diarrhea dataset:
d = read.csv("washb-bangladesh-diar-public.csv")
# Load the dataset with treatment assignments
tr = read.csv("washb-bangladesh-tr-public.csv")
```
Next let's merge the two datasets together. This will allow us to calculate the prevalence of diarrhea in different treatment arms (e.g., water, sanitation, handwashing, etc.)
```{r}
d_tr = left_join(d, tr, by=c("block","clusterid"))
```
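To see what the join does, here is a minimal sketch on made-up data. The data frames `toy_d` and `toy_tr` and their values are hypothetical and only mimic the structure described above: several child-level rows per cluster, and one treatment row per cluster.

```{r}
library(dplyr)

# Hypothetical child-level data: two children in cluster 101, one each in 102 and 201
toy_d = data.frame(block = c(1, 1, 1, 2),
                   clusterid = c(101, 101, 102, 201),
                   diar7d = c(0, 1, 0, 1))

# Hypothetical cluster-level data: one treatment assignment per cluster
toy_tr = data.frame(block = c(1, 1, 2),
                    clusterid = c(101, 102, 201),
                    tr = c("Control", "Water", "Control"))

# left_join keeps every row of toy_d and adds the matching columns from toy_tr
toy_d_tr = left_join(toy_d, toy_tr, by = c("block", "clusterid"))
toy_d_tr
```

Each child row picks up the treatment of its cluster, so the joined data still has one row per child.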
Now let's filter to keep only the rows for diarrhea measurements taken after the interventions were delivered. The svy variable takes the values 0, 1, and 2. We are going to drop the 0 values, which indicate the time period before interventions were delivered.
```{r}
# Note: in R, "!" means "not". In English, the code below means "Filter d_tr to keep rows where svy is not equal to 0"
# The %>% is the pipe operator, which we use to chain multiple operations on a dataset
d_tr = d_tr %>% filter(svy!=0)
```
Now we are going to drop children with missing values in the diarrhea variable from the dataset. This assumes that the values were missing completely at random, i.e., that no characteristics are associated with whether a child was missing a diarrhea measurement.
```{r}
d_tr = d_tr %>% filter(!is.na(diar7d))
# Take a look at the first 6 rows of the merged dataset with the command "head":
head(d_tr)
```
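The two filtering steps can be tried on a small made-up example. The data frame `toy_svy` and its values are hypothetical; only the column names svy and diar7d follow the dataset above.

```{r}
library(dplyr)

# Hypothetical data: svy is the survey round, diar7d is the outcome (NA = missing)
toy_svy = data.frame(svy = c(0, 1, 1, 2, 2),
                     diar7d = c(0, 1, NA, 0, 1))

# is.na() is TRUE for missing values, so !is.na() is TRUE for observed values
!is.na(toy_svy$diar7d)

# Keep rows measured after baseline (svy not equal to 0) that have an observed outcome
toy_clean = toy_svy %>%
  filter(svy != 0) %>%
  filter(!is.na(diar7d))
toy_clean
```

Comparing the number of rows before and after each filter, for example with nrow(), is a quick way to check how many observations were dropped.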
# Questions
To count the number of people with the disease, the code below uses the pipe operator %>% to perform operations on the data frame d_tr. The pipe operator allows us to run a series of actions on our data frame. The pipe separates each action, and these actions are carried out by functions (filter and summarise below) that are like verbs in a sentence.
First, we filter the data frame to only include rows where diar7d==1. The next pipe indicates that the summarise command is performed on the filtered dataset. In this filtered dataset, we count the number of rows using the summarise(n_with_diarrhea=n()) command. summarise can be used for a variety of calculations, such as the mean; n() is used with summarise to count rows. Below we label the result n_with_diarrhea. We then include a line with ndiar on its own, which tells R to display the contents of this object.
In general, it is good practice to keep your code tidy and organized. This makes it easier to read and understand and also easier to catch mistakes. One way to keep your code tidy is to use a line break after each pipe.
```{r}
# Calculate the number of children with diarrhea across all children in the dataset (i.e., not stratifying by the treatment variable). Use the variable diar7d for diarrhea. Save the result in an object called ndiar. Label the result inside ndiar as n_with_diarrhea
ndiar = d_tr %>%
  filter(diar7d==1) %>%
  summarise(n_with_diarrhea=n())
ndiar
```
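It may help to see that the pipe is only a different way of writing nested function calls. The sketch below uses a hypothetical data frame `toy_pipe` and runs the same two steps both ways.

```{r}
library(dplyr)

# Hypothetical data with a 0/1 outcome
toy_pipe = data.frame(diar7d = c(1, 0, 0, 1, 1))

# With the pipe: the steps read from top to bottom
piped = toy_pipe %>%
  filter(diar7d == 1) %>%
  summarise(n_with_diarrhea = n())

# Without the pipe: the same steps, nested inside one another
nested = summarise(filter(toy_pipe, diar7d == 1), n_with_diarrhea = n())

piped
nested
```

Both versions return the same count. The piped version tends to stay readable as more steps are added.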
Problem 1: Calculate the number of children without diarrhea across all children in the dataset (i.e., not stratifying by the treatment variable). Use the variable diar7d for diarrhea. Save your result in an object called p1. Label the result inside p1 as n_without_diarrhea
```{r}
p1 = "<<<<<<<<<<<<< YOUR CODE HERE >>>>>>>>>>>>>>>"
```
```{r}
#We can use R to calculate prevalence in a dataframe. Below, we will calculate the prevalence of bruising in our dataset
#Note: we need to filter to remove NA values for bruising
pbruise = d_tr %>%
filter(!is.na(bruise7d)) %>%
summarise(mean(bruise7d))
pbruise
```
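The mean works as a prevalence here because the variable is coded 0/1: the sum of the values is the number of cases, and dividing by the number of observations gives the proportion. A minimal sketch with made-up values (the vectors `has_condition` and `with_missing` are hypothetical):

```{r}
# Hypothetical 0/1 outcome for ten children: 1 = has the condition, 0 = does not
has_condition = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

# Proportion computed by hand: number of 1s divided by the number of children
sum(has_condition) / length(has_condition)

# The mean of a 0/1 variable gives the same value
mean(has_condition)

# With missing values, mean() returns NA unless they are removed first
with_missing = c(1, 0, NA, 1)
mean(with_missing)
mean(with_missing, na.rm = TRUE)
```

This is why the code above filters out NA values before taking the mean; the na.rm = TRUE argument is an alternative way to do the same thing.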
```{r}
# Problem 2: Calculate diarrhea prevalence in the whole dataset (ignoring treatment arm) and save it in an object called prevalence.
prevalence = "<<<<<<<<<<<<< YOUR CODE HERE >>>>>>>>>>>>>>>"
```
We can use R to create tables based on counts in our data. In epidemiology, we often create 2X2 tables based on a binary exposure and a binary outcome. Below, we have written code that creates a table with 3 columns. The first column is called "tr" and describes the treatment arm. The second column is called "bruise7d" and includes 0 for children without bruising and 1 for children with bruising. The third column is called "n" and includes the number of children with or without bruising in that arm.
```{r}
bruise_tr_table = d_tr %>%
group_by(tr, bruise7d) %>%
summarise(n=n())
bruise_tr_table
```
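The grouped counts above are in "long" format, with one row per combination. As a rough illustration of how this relates to a 2X2 table, the sketch below uses a hypothetical data frame `toy_2x2` with a binary exposure and a binary outcome, and compares the long format with the grid produced by the base R function table().

```{r}
library(dplyr)

# Hypothetical data: a binary exposure and a binary outcome for seven people
toy_2x2 = data.frame(exposed = c(0, 0, 0, 1, 1, 1, 1),
                     outcome = c(0, 0, 1, 0, 1, 1, 1))

# Long format: one row per combination of exposure and outcome
toy_counts = toy_2x2 %>%
  group_by(exposed, outcome) %>%
  summarise(n = n())
toy_counts

# Base R table() shows the same four counts laid out as a 2X2 grid
toy_grid = table(exposed = toy_2x2$exposed, outcome = toy_2x2$outcome)
toy_grid
```

The four values in the n column match the four cells of the grid; only the layout differs.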
Problem 3: Now let's get counts of whether children did or did not have diarrhea in each treatment arm. Since the WASH Benefits trial had 7 different arms (6 intervention + control), create a data frame with 14 rows (two for each arm). The first column is called "tr" for treatment. The second column is called "diar7d" and includes 0 for children without diarrhea and 1 for children with diarrhea. The third column is called "n" and includes the number of children with or without diarrhea in that arm.
```{r}
# Hint: the row for Control with no diarrhea should be this:
# tr diar7d n
# Control 0 3782
#-----------------------------------------------
diar_tr_table = "<<<<<<<<<<<<< YOUR CODE HERE >>>>>>>>>>>>>>>"
diar_tr_table
```
Problem 4: Calculate the diarrhea prevalence in each treatment arm. You will need to combine different commands used in this problem set to calculate this. Save the results in an object called prevalence_tr. It should have 7 rows (one for each treatment) and two columns. The first column should be for the treatment name and the second should be for the prevalence.
```{r}
prevalence_tr = "<<<<<<<<<<<<< YOUR CODE HERE >>>>>>>>>>>>>>>"
prevalence_tr
```
Problem 5: Examine the results in prevalence_tr. Which arm had the lowest diarrhea prevalence? Save the name of the treatment arm, using the same spelling as in the treatment label in prevalence_tr, in an object called p5 (e.g., p5 = "Control").
```{r}
p5 = "<<<<<<<<<<<<< YOUR CODE HERE >>>>>>>>>>>>>>>"
```
Problem 6: Which treatment arm had prevalence closest to the prevalence in the control arm? Save the name of the treatment arm, using the same spelling as in the treatment label in prevalence_tr, in an object called p6 (e.g., p6 = "Control").
```{r}
p6 = "<<<<<<<<<<<<< YOUR CODE HERE >>>>>>>>>>>>>>>"
```
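# cbind and rbind
We can use the commands cbind and rbind to combine objects. cbind binds objects side by side as columns, so they should have the same number of rows. rbind stacks objects as rows, so they should have the same number of columns.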
```{r}
# Let's explore how we can combine vectors using cbind and rbind
x = c(1,2,3)
y = c(4,5,6)
z = c(7,8,9)
```
```{r}
#first, let's try cbind
cxyz = cbind(x,y,z)
cxyz
```
```{r}
#now, let's try again with rbind
rxyz = rbind(x,y,z)
rxyz
```
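One way to check which direction each command binds in is to look at the dimensions of the result. The sketch below uses two hypothetical vectors, `a` and `b`, and also shows that binding plain vectors produces a matrix rather than a data frame.

```{r}
# Two hypothetical vectors of length 3
a = c(1, 2, 3)
b = c(4, 5, 6)

col_bound = cbind(a, b)  # each vector becomes a column
row_bound = rbind(a, b)  # each vector becomes a row

# dim() reports the number of rows, then the number of columns
dim(col_bound)
dim(row_bound)

# Binding plain vectors gives a matrix, not a data frame
is.matrix(col_bound)
is.data.frame(col_bound)

# as.data.frame() converts the matrix if a data frame is needed
as.data.frame(col_bound)
```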
Below are the times for 3 high school cross-country runners in their last 4 mile runs.
```{r}
ryan = c(6,7,7,8)
sarah = c(6,6,7,8)
phil = c(7,8,9,9)
runtimes = rbind(ryan, sarah, phil)
runtimes
```
Problem 7: A transfer student, Tyler, just joined the team. Tyler's last 4 run times are listed below. How would you merge Tyler's times with the rest of the team?
```{r}
tyler = c(6,6,6,8)
runtimes2 = "<<<<<<<<<<<<< YOUR CODE HERE >>>>>>>>>>>>>>>"
```
Problem 8: The students all run their next race. Their race times are listed below (ryan first, sarah second, phil third, and tyler fourth). How would you merge these times with the object runtimes2?
```{r}
race5 = c(6,7,5,5)
runtimes3 = "<<<<<<<<<<<<< YOUR CODE HERE >>>>>>>>>>>>>>>"
```