# Tutorial 3: Adding contexts and complex units
## Overview
This is the third of five tutorials on adding datasets to your `traits.build` database.
Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and have successfully built a database from the datasets in `traits.build-template`. Instructions are available at [Tutorial: Example compilation](tutorial_compilation.html).\
### Goals
- Learn how to [add contexts](#add_contexts).
- Learn some complexities with respect to [units](#complex_units).
- Learn additional [custom_R_code tricks](#custom_R_code_tricks).
### New functions introduced
- `metadata_add_contexts`
------------------------------------------------------------------------
## Adding tutorial_dataset_3
### Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled `tutorial_dataset_3` within the data folder.
- Ensure that this folder exists on your computer.
- Ensure that the file `data.csv` exists within the `tutorial_dataset_3` folder.
- Ensure that a folder `raw` is nested within the `tutorial_dataset_3` folder and contains one file, `notes.txt`.
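These checks can also be run from the R console. The sketch below is a minimal illustration in base R. `check_dataset_files()` is a hypothetical helper, not a `traits.build` function, and it assumes your working directory is the root of your copy of the template repository, so that datasets sit under `data/`.

```r
# Hypothetical helper (not part of traits.build): report which expected
# pieces of a dataset folder are present under `root`.
check_dataset_files <- function(dataset_id, root = "data") {
  dataset_dir <- file.path(root, dataset_id)
  raw_dir <- file.path(dataset_dir, "raw")
  c(
    dataset_folder = dir.exists(dataset_dir),
    data_csv = file.exists(file.path(dataset_dir, "data.csv")),
    raw_folder = dir.exists(raw_dir),
    # list.files() returns an empty vector if the folder is missing or empty
    raw_has_files = length(list.files(raw_dir)) > 0
  )
}

check_dataset_files("tutorial_dataset_3")
```

Any `FALSE` entry in the result points to a folder or file that still needs to be put in place before continuing.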
### Source necessary functions
- If you have restarted RStudio since last adding a dataset, ensure all functions are loaded from both the `traits.build` package and the custom functions file:
```{r, eval=FALSE}
library(traits.build)
source("R/custom_R_code.R")
```
### Use functions to create a metadata.yml file
#### **Create a metadata template**
To create the metadata template, run:
```{r, eval=FALSE}
metadata_create_template("tutorial_dataset_3")
```
As in the previous tutorials, this function leads you through a series of menus requiring user input. Ensure you select:
[data format:]{style="color:blue;"} [**wide**]{style="color:red;"}\
[taxon_name column:]{style="color:blue;"} [**1: Species**]{style="color:red;"}\
[location_name column:]{style="color:blue;"} [**5: site**]{style="color:red;"}\
[individual_id column:]{style="color:blue;"} [**1: NA**]{style="color:red;"}\
[collection_date column:]{style="color:blue;"} [**1: NA**]{style="color:red;"}\
[Enter collection_date range in format '2007/2009':]{style="color:blue;"} [**2011-02/2011-03**]{style="color:red;"}\
[Do all traits need `repeat_measurements_id`'s?]{style="color:blue;"} [**2: No**]{style="color:red;"}\
In this dataset, unlike the first two, the data are at the individual level. Since there is only a single data row for each individual, you do not need to map in an `individual_id`. An `individual_id` column is required only if you want to keep track of multiple rows of data for the same individual.
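Whether an `individual_id` is needed comes down to whether any individual occupies more than one row. One rough way to check this is sketched below in base R. The toy values are invented for illustration, and treating `Species`, `Treatment` and `Replicate` as jointly identifying an individual is an assumption made for this sketch, not something stated in the dataset notes.

```r
# Toy stand-in for data.csv; the values are invented for illustration.
toy <- data.frame(
  Species = c("Species A", "Species A", "Species B"),
  Treatment = c("Drought", "Watered", "Drought"),
  Replicate = c(1, 1, 1)
)

# Assumption: these three columns together identify one individual.
id_columns <- c("Species", "Treatment", "Replicate")

# anyDuplicated() returns 0 when every row is unique,
# otherwise the index of the first repeated row.
needs_individual_id <- anyDuplicated(toy[, id_columns]) > 0
needs_individual_id
```

If the result is `TRUE`, some individual has several rows and an `individual_id` column would be needed to link them.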
*Navigate to the dataset's folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.*
------------------------------------------------------------------------
#### **Propagate source information into the metadata.yml file**
This dataset is from a published source and therefore the source information can be added with the function `metadata_add_source_doi`:
```{r, eval=FALSE}
metadata_add_source_doi(dataset_id = "tutorial_dataset_3",
                        doi = "10.1007/s11104-013-1725-x")
```
Confirm:
1. the authors' names are formatted as `first name last name` or `first initial last name`\
2. the article title is in sentence case\
3. the page numbers are filled in as a range, separated by a double dash\
You have now added three DOIs, all of which yielded perfect reference information. Indeed, most references are added correctly, but some journals, and the DOIs of many older references, return titles in ALL CAPS or omit page numbers, so it is worth checking.
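Two of these checks lend themselves to a quick automated test. The sketch below uses base R only. `is_all_caps()` and `has_page_range()` are hypothetical helpers written for this illustration, not `traits.build` functions, and the example strings are invented rather than taken from the reference.

```r
# Hypothetical helper: TRUE when every letter in the title is upper case.
is_all_caps <- function(title) {
  letters_only <- gsub("[^[:alpha:]]", "", title)
  nchar(letters_only) > 0 & letters_only == toupper(letters_only)
}

# Hypothetical helper: TRUE when pages look like "123--456"
# (two numbers separated by a double dash).
has_page_range <- function(pages) {
  grepl("^[0-9]+--[0-9]+$", pages)
}

# Invented example values
is_all_caps("A STUDY OF LEAF TRAITS")
is_all_caps("A study of leaf traits")
has_page_range("101--115")
has_page_range("101-115")
```

A title flagged by `is_all_caps()` should be retyped in sentence case, and a `pages` value that fails `has_page_range()` should be corrected by hand in `metadata.yml`.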
------------------------------------------------------------------------
#### **Add location details**
All data for this dataset was collected at a single location, specified in the `data.csv` file as `The University of Melbourne Burnley campus`. No additional details are provided. For such studies, it is best to look up the campus location and input approximate latitude/longitude coordinates.
As well as adding locations and location properties from a table, the function `metadata_add_locations` lets you add a basic location data scaffold in metadata.yml.
For instance, for this study:\
1. you add the location names from the data.csv file\
2. the function automatically adds blank fields for latitude, longitude, and description\
3. values for these fields must then be filled in manually\
```{r, eval=FALSE}
data <- read_csv("data/tutorial_dataset_3/data.csv")
metadata_add_locations("tutorial_dataset_3", data)
```
You select the location name, but not any location properties, as none are provided in the data.csv file or another tabular format.
[location_name:]{style="color:blue;"} [**4: site**]{style="color:red;"}\
[location_property columns:]{style="color:blue;"} [*just press enter*]{style="color:red;"}\
This creates the following scaffold in `metadata.yml`:
```{r, eval=FALSE}
The University of Melbourne Burnley campus:
  latitude (deg): na_character
  longitude (deg): na_character
  description: na_character
```
- `metadata_add_locations` automatically selects the unique values in the location name column.
- If no columns with location properties are specified, the function just adds the three core location properties.
- The values for these location properties are available in the notes file.
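To preview which location names will be offered, you can list the unique values of the location column yourself. The sketch below uses a toy data frame with invented species names. With the real file you would apply `unique()` to the `site` column of the `data` object read in earlier.

```r
# Toy stand-in for data.csv; species names are invented for illustration.
toy <- data.frame(
  Species = c("Species A", "Species B", "Species C"),
  site = rep("The University of Melbourne Burnley campus", 3)
)

# Every row shares the same site, so only one location name remains.
location_names <- unique(toy$site)
location_names
length(location_names)
```

A single unique value is what you would expect for a study carried out at one location.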
------------------------------------------------------------------------
#### **Add traits**
To select columns in the `data.csv` file that include trait data, run:
```{r, eval=FALSE}
metadata_add_traits(dataset_id = "tutorial_dataset_3")
```
Select columns [**5 6 7 8 9 10**]{style="color:red;"}, as these contain trait data.
------------------------------------------------------------------------
#### **Add contexts** {#add_contexts}
A context is any piece of ancillary information that helps explain why a certain trait value was measured.
In traits.build, some contexts are mapped in as part of the default metadata structure, including the location (and location properties), a general sense of organism age (`life_stage`), `basis_of_record`, and the general methods for each trait.\
However, most contexts are pieces of information that are essential to record for a specific dataset, but not recorded for most other datasets. The context field therefore allows any context property to be added manually.\
Context properties are divided into 5 categories:\
1. **method contexts**: Context properties that capture *differences* in method between measurements of the same trait. For plants, canopy position and leaf age are two common method contexts.\
2. **temporal contexts**: Context properties that capture explicit time-related differences between groups of measurements. This is separate from `collection_date`, as an explicit meaning should accompany each temporal context property and the distinct values may span a range of collection dates. For plants, `sampling season` (dry versus wet) is a commonly mapped temporal context.\
3. **entity contexts**: This context property category pertains to individual-level measurements, and documents features of the individual that explicitly distinguish it from other individuals that are measured. In addition to features like the sex of an individual, it is the location to document individual-level co-variates that are not themselves traits, but are information required to interpret other trait values.\
4. **treatment contexts**: Any experimental treatment that has been applied to groups of individuals.\
5. **plot contexts**: Any variation within a documented location, where different individuals experience known differences in growing/living conditions or growing/living history. For plants, this context category is frequently used to map in slope position or fire history.\
Context properties are most frequently included in the `data.csv` file as columns of values. Occasionally, separate columns of trait values might represent measurements with different context property values, a topic for a later tutorial.
Context properties that are columns in the data file can be added with the function `metadata_add_contexts`:
```{r, eval=FALSE}
metadata_add_contexts("tutorial_dataset_3")
```
This leads to a user prompt to select the relevant columns:
[Indicate all columns that contain additional contextual data for tutorial_dataset_3 (by number separated by space; e.g. '1 2 4'):]{style="color:blue;"}\
[1: Species]{style="color:blue;"}\
[2: Treatment]{style="color:blue;"}\
[3: Replicate]{style="color:blue;"}\
[4: site]{style="color:blue;"}\
[5: life_form]{style="color:blue;"}\
[6: WP leaf (Mpa) predawn]{style="color:blue;"}\
[7: WP leaf (Mpa) midday]{style="color:blue;"}\
[8: LMA kg/m2]{style="color:blue;"}\
[9: Stomatal density Upper surface]{style="color:blue;"}\
[10: Stomatal density Lower surface]{style="color:blue;"}\
Select column 2, which is the only column with a context property:
[Selection:]{style="color:blue;"} [**2**]{style="color:red;"}\
Additional user prompts ask for details about the context property category and values:
[What category does context Treatment fit in? (by number separated by space; e.g. '1 2 4'):]{style="color:blue;"}\
[1: treatment_context]{style="color:blue;"}\
[2: plot_context]{style="color:blue;"}\
[3: temporal_context]{style="color:blue;"}\
[4: method_context]{style="color:blue;"}\
[5: entity_context]{style="color:blue;"}\
This is a treatment context, so select 1:
[Selection:]{style="color:blue;"} [**1**]{style="color:red;"}\
[The following values exist for this context: Drought, Watered.]{style="color:blue;"}\
[Are replacement values required? (y/n)]{style="color:blue;"} [**y**]{style="color:red;"}\
Although the context values `Drought` and `Watered` are probably sufficiently descriptive, for other drought-treatment studies we have used `drought` and `well-watered`, so we prefer to align the context property values with these.
[Are descriptions required? (y/n)]{style="color:blue;"} [**y**]{style="color:red;"}\
The free-form description field lets you add details about the exact meaning of `drought` vs `well-watered` for this study.
In the `metadata.yml` file, there will now be a scaffold for the contexts:
```{r, eval= FALSE}
contexts:
- context_property: unknown
  category: treatment
  var_in: Treatment
  values:
  - find: Drought
    value: unknown
    description: unknown
  - find: Watered
    value: unknown
    description: unknown
```
In addition to filling in the preferred context property values and descriptions, you must also assign a name to the `context_property`. This is a free-form field, but as with `location_property` it is best to ensure you align `context_property` names throughout the database. In the AusTraits plant trait database, this `context_property` is always called `drought treatment`.
The finished context section will be:
```{r, eval= FALSE}
contexts:
- context_property: drought treatment
  category: treatment
  var_in: Treatment
  values:
  - find: Drought
    value: drought
    description: The plants were watered with 20% of the water used by well-watered
      plants (determined gravimetrically) in the 3-4 days preceding each watering
      event.
  - find: Watered
    value: well-watered
    description: The plants were watered to pot capacity (2 L per pot).
```
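The effect of the `find` and `value` pairs can be pictured as a simple lookup from the values in the `Treatment` column to the preferred context values. The base R sketch below only illustrates that mapping on a few toy rows. It is not how `traits.build` performs the substitution internally.

```r
# Toy values as they might appear in the Treatment column of data.csv
treatment_raw <- c("Drought", "Watered", "Watered", "Drought")

# find -> value pairs, taken from the contexts section shown earlier
replacements <- c(Drought = "drought", Watered = "well-watered")

# Look up each raw value by name; unname() drops the lookup names
treatment_clean <- unname(replacements[treatment_raw])
treatment_clean
```

A raw value with no matching `find` entry would come back as `NA` in this sketch, which is a useful reminder to list every distinct value of the column.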
------------------------------------------------------------------------
### Manual filling in of metadata
The components of this dataset that can be propagated with functions are now complete, and the remaining `unknown` fields must be filled in manually.\
- the `contributors` section\
- `description`, `basis_of_record`, `life_stage`, `sampling_strategy`, `original_file`, and `notes` under the `dataset` section\
- details for each trait, including `unit_in`, `trait_name`, `entity_type`, `value_type`, `basis_of_value`, `replicates` and `methods`\
#### **Adding contributors**
The file `data/tutorial_dataset_3/raw/tutorial_dataset_3_notes.txt` indicates the main data_contributor for this study.\
#### **Dataset fields**
The file `data/tutorial_dataset_3/raw/tutorial_dataset_3_notes.txt` indicates how to fill in the `unknown` dataset fields for this study.
#### **Trait details** {#complex_units}
The file `data/tutorial_dataset_3/raw/tutorial_dataset_3_notes.txt` indicates how to fill in the `unknown` trait fields for this study, but see below as well.
Remember, the `trait_name` must match a trait concept within the [traits dictionary](https://github.com/traitecoevo/traits.build-template/blob/master/config/traits.yml). For this example:
| column in dataset | trait concept | unit_in | entity_type | value_type | basis_of_value | replicates |
|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| life_form | life_form | .na | species | mode | expert_score | .na |
| WP leaf (Mpa) predawn | water_potential_predawn | neg_MPa | individual | raw | measurement | 1 |
| WP leaf (Mpa) midday | water_potential_midday | neg_MPa | individual | raw | measurement | 1 |
| LMA kg/m2 | leaf_mass_per_area | kg/m2 | individual | raw | measurement | 1 |
| Stomatal density Upper surface | leaf_stomatal_density_adaxial | '{count}/mm2' | individual | raw | measurement | 1 |
| Stomatal density Lower surface | leaf_stomatal_density_abaxial | '{count}/mm2' | individual | raw | measurement | 1 |
With the units, note:
- In the data.csv file, all water potential values are positive, indicating the data contributor mapped in the "negative" of the true water potential values (which are always below zero). A negative sign at the beginning of the units field is not recognised and therefore the convention is to use the prefix `neg_` to indicate the values input are the negative of the true values.
- Stomatal density is a "count density", a number of stomata per unit area. The actual UCUM standard for this is simply `1/mm2`, but for clarity we use `{count}/mm2`. The word `count` is in curly brackets, since it is a "note" rather than a true unit.
- If the unit begins with a curly bracket, the unit needs to be placed in single quotes.
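The arithmetic that the `neg_` prefix stands for is a sign flip. The sketch below uses invented water potential values and base R only. It shows what the prefix implies about the true values and is not a reproduction of the unit conversion code in the build pipeline.

```r
# Invented values as recorded in the data file: positive numbers,
# entered with unit_in = neg_MPa
wp_input <- c(0.45, 1.2, 2.35)

# The true water potentials are the negatives of the recorded values
wp_true <- -1 * wp_input
wp_true

# Sanity check: true water potentials should all be below zero
all(wp_true < 0)
```

If the sanity check fails on real data, some values were probably entered with their true sign and the `neg_` prefix would not apply to them.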
### Testing, error fixes, and report building {#custom_R_code_tricks}
At this point, run the dataset tests and rebuild the dataset:
```{r, eval=FALSE}
dataset_test("tutorial_dataset_3")
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
```
The dataset test should yield an error that one `water_potential_predawn` value does not convert to numeric, indicating that a placeholder character is being used in place of an NA. [(Note: this error was not triggering when this vignette was written.)]{style="color:red;"}
The `excluded_data` table shows a "\*" in one column, so add the following `custom_R_code`:
```{r, eval=FALSE}
custom_R_code: '
  data %>%
    mutate(
      across(c("WP leaf (Mpa) predawn"), ~na_if(.x,"*"))
    )
'
```
However, now you'll get the error: [Caused by error in na_if(): ! Can't convert y <character> to match type of x <double>.]{style="color:blue;"}
This indicates a mismatch between column types: the column is numeric while the placeholder is a character string, so you must first convert the column to character:
```{r, eval=FALSE}
custom_R_code: '
  data %>%
    mutate(
      across(c("WP leaf (Mpa) predawn"), ~as.character(.x)),
      across(c("WP leaf (Mpa) predawn"), ~na_if(.x,"*"))
    )
'
```
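To try the two steps outside the build pipeline, the sketch below applies them to a toy data frame. It assumes the `dplyr` package is installed. The values are invented, and the toy column is already character, so `as.character()` changes nothing here but mirrors the code above.

```r
library(dplyr)

# Toy stand-in for data.csv with a "*" placeholder; values are invented.
# check.names = FALSE keeps the column name with spaces and brackets intact.
toy <- data.frame(
  `WP leaf (Mpa) predawn` = c("0.5", "*", "1.2"),
  check.names = FALSE
)

cleaned <- toy %>%
  mutate(
    # convert to character first so na_if() compares like with like
    across(c("WP leaf (Mpa) predawn"), ~as.character(.x)),
    # replace the placeholder with a missing value
    across(c("WP leaf (Mpa) predawn"), ~na_if(.x, "*"))
  )

# The cleaned column now converts to numeric without a coercion warning
wp_numeric <- as.numeric(cleaned[["WP leaf (Mpa) predawn"]])
wp_numeric
```

The placeholder row becomes `NA`, while the remaining values convert to numbers as expected.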
At this point, rerunning the tests and rebuilding the database should not generate any errors or excluded values, so you can build and review the report.
As a final step, build a report for the study:
```{r, eval=FALSE}
# A fix because the function was built around specific AusTraits versions
traits.build_database$build_info$version <- "5.0.0"
dataset_report("tutorial_dataset_3", traits.build_database, overwrite = TRUE)
```