Combined table output #144

Closed
wants to merge 35 commits into from
Changes from 27 commits
Commits
7199d18
Create uni_table.R
ehwenk Nov 10, 2023
bb95cbe
new function: database_create_combined_table
dfalster Nov 10, 2023
bf2d654
Merge branch 'develop' into combined-table-output
dfalster Nov 10, 2023
0747302
rename context columns
ehwenk Nov 13, 2023
487ede8
Merge branch 'develop' into combined-table-output
ehwenk Nov 16, 2023
3eddab2
Update output_combined_table.R
ehwenk Nov 16, 2023
a67a1b0
Update output_combined_table.R
ehwenk Nov 16, 2023
45c03a7
Remove trailing whitespaces
yangsophieee Nov 16, 2023
b2e31f0
Merge branch 'develop' into combined-table-output
yangsophieee Nov 16, 2023
5a15b22
Add "original_name" to `left_join`
yangsophieee Nov 16, 2023
10380f1
Add tests for database_create_combined_table to `dataset_test`
yangsophieee Nov 16, 2023
c227716
Update metadata of Test_2023_1 for testing
yangsophieee Nov 16, 2023
91a56a0
Add function skeleton for unpacking combined table
yangsophieee Nov 16, 2023
fce66ad
Merge branch 'develop' into combined-table-output
ehwenk Nov 16, 2023
26465d9
Merge branch 'develop' into combined-table-output
ehwenk Nov 16, 2023
c101fb4
Fix formatting
yangsophieee Nov 16, 2023
4ad7f54
Fix formatting
yangsophieee Nov 16, 2023
667fe32
Merge branch 'combined-table-output' of https://github.com/traitecoev…
ehwenk Nov 17, 2023
95d016c
Update formatting
yangsophieee Nov 18, 2023
6d0b24a
Merge branch 'combined-table-output' of https://github.com/traitecoev…
yangsophieee Nov 18, 2023
713a60f
Merge branch 'combined-table-output' of https://github.com/traitecoev…
ehwenk Nov 19, 2023
f0767b8
Merge branch 'develop' into combined-table-output
ehwenk Nov 19, 2023
8b6f7b7
Update output_combined_table.R
ehwenk Nov 22, 2023
8d6474e
Merge branch 'develop' into combined-table-output
ehwenk Dec 3, 2023
dc622c9
Update output_combined_table.R
ehwenk Dec 3, 2023
f495c7b
Update output_combined_table.R
ehwenk Dec 3, 2023
68fa9d4
Update formatting
yangsophieee Dec 6, 2023
12cbe01
Fix `str_replace_all` arguments for context descriptions
yangsophieee Dec 6, 2023
b53d933
update function documentation
ehwenk Jan 24, 2024
e78833b
Update output_combined_table.R
ehwenk Jan 31, 2024
9717d45
Merge branch 'develop' into combined-table-output
ehwenk Aug 30, 2024
63b812a
Update metadata.yml
ehwenk Aug 30, 2024
e9f7145
Update output_combined_table.R
ehwenk Aug 30, 2024
6ad632f
updated documentation
ehwenk Aug 30, 2024
97d9a07
fix error with RDS file not existing
ehwenk Aug 30, 2024
135 changes: 135 additions & 0 deletions R/output_combined_table.R
@@ -0,0 +1,135 @@
#' Create combined traits.build table
#'
#' Create a single database output that merges together the information
#' in all relational tables within a traits.build database.
#' Trait measurements are still output in long format (1 row per trait value),
#' but all measurement-related metadata (methods, location properties, context properties, contributors)
#' are now included as additional columns in a single table.
#'
#' @param database A traits.build database
#'
#' @return
Collaborator

Needs documentation

#' @export
#'
#' @examples
database_create_combined_table <- function(database) {

  location_latlon <-
    database$locations %>%
    dplyr::filter(location_property %in% c("latitude (deg)", "longitude (deg)")) %>%
    tidyr::pivot_wider(names_from = location_property, values_from = value)

  location_properties <-
    database$locations %>%
    dplyr::filter(!location_property %in% c("latitude (deg)", "longitude (deg)")) %>%
    dplyr::mutate(
      location_property = stringr::str_replace_all(location_property, "=", "-"),
      value = stringr::str_replace_all(value, "=", "-"),
      location_property = stringr::str_replace_all(location_property, ";", ","),
      value = stringr::str_replace_all(value, ";", ",")
    ) %>%
    dplyr::mutate(location_properties = paste0(location_property, "=", value)) %>%
    dplyr::select(dplyr::all_of(c("dataset_id", "location_id", "location_name", "location_properties"))) %>%
    dplyr::group_by(dataset_id, location_id, location_name) %>%
    dplyr::mutate(location_properties = paste0(location_properties, collapse = "; ")) %>%
Collaborator

Would dplyr::summarise here work the same and replace the need for distinct() later? Same with later tables.

Collaborator Author

Possibly, but I think distinct() is actually clearer in this case; otherwise you'd have to list all the columns individually and decide how to treat each one (lots of "take first value").
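For the `location_properties` table, only the grouping columns and the collapsed column remain after `select()`, so the two approaches discussed here should return the same rows. A small sketch on made-up data (the tibble below is hypothetical and not taken from the test datasets):

```r
library(dplyr)

# Toy stand-in for database$locations once the "property=value" column exists
# (hypothetical data, for illustration only).
locations <- tibble(
  dataset_id = c("A", "A", "A"),
  location_id = c("01", "01", "02"),
  location_name = c("Site 1", "Site 1", "Site 2"),
  location_properties = c("elevation (m)=800", "rainfall (mm)=2000", "elevation (m)=25")
)

# Approach in the PR: collapse within each group, then drop the repeated rows.
via_mutate <- locations %>%
  group_by(dataset_id, location_id, location_name) %>%
  mutate(location_properties = paste0(location_properties, collapse = "; ")) %>%
  ungroup() %>%
  distinct()

# Suggested alternative: summarise() returns one row per group directly.
via_summarise <- locations %>%
  group_by(dataset_id, location_id, location_name) %>%
  summarise(
    location_properties = paste0(location_properties, collapse = "; "),
    .groups = "drop"
  )

# Compare the two results as plain data frames.
all.equal(as.data.frame(via_mutate), as.data.frame(via_summarise))
```

If a table carried extra non-grouping columns, `summarise()` would need an explicit rule for each of them, which is the concern raised in the reply above.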

    dplyr::ungroup() %>%
    dplyr::distinct()

  contributors <-
    database$contributors %>%
    dplyr::mutate(
      affiliation = stringr::str_replace_all(affiliation, ":", "-"),
      affiliation = stringr::str_replace_all(affiliation, ";", ","),
      affiliation = stringr::str_replace_all(affiliation, "<", "("),
      affiliation = stringr::str_replace_all(affiliation, ">", ")"),
      additional_role = stringr::str_replace_all(additional_role, "<", "("),
      additional_role = stringr::str_replace_all(additional_role, ">", ")"),
      data_collectors = paste0(given_name, " ", last_name),
      data_collectors = ifelse(
        !is.na(ORCID),
        paste0(data_collectors, " <ORCID:", ORCID),
        data_collectors),
      data_collectors = ifelse(
        is.na(ORCID),
        paste0(data_collectors, " <affiliation:", affiliation),
        paste0(data_collectors, ";affiliation:", affiliation)),
Collaborator

Should we add spaces after the semicolons and use equal (=) signs instead of colons, consistent with location_properties? It might be more readable.

      data_collectors = ifelse(
        !is.na(additional_role),
        paste0(data_collectors, ";additional_role:", additional_role, ">"),
        paste0(data_collectors, ">"))
    ) %>%
    dplyr::select(-dplyr::all_of(c("last_name", "given_name", "ORCID", "affiliation", "additional_role"))) %>%
    dplyr::group_by(dataset_id) %>%
    dplyr::mutate(data_collectors = paste0(data_collectors, collapse = "; ")) %>%
    dplyr::ungroup() %>%
    dplyr::distinct()

  contexts_tmp <-
    database$contexts %>%
    dplyr::mutate(
      context_property = stringr::str_replace_all(context_property, "=", "-"),
      value = stringr::str_replace_all(value, "=", "-"),
      description = stringr::str_replace_all(description, "=", "-"),
      context_property = stringr::str_replace_all(context_property, ";", ","),
Collaborator

Do the characters "<" and ">" also need to be replaced in case they ever make it into the context_property, value and description fields?
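One way to cover every delimiter character in a single place would be a small helper that applies all the substitutions at once. The sketch below is only an illustration: `sanitise_delimiters` is a hypothetical name that does not appear in the PR, and the set of characters to replace is an assumption based on the delimiters used elsewhere in this function.

```r
library(stringr)

# Hypothetical helper (not part of the PR): swap each character that the
# packed format uses as a delimiter for a harmless look-alike.
# A named vector gives str_replace_all() several pattern = replacement pairs.
sanitise_delimiters <- function(x) {
  str_replace_all(x, c("=" = "-", ";" = ",", "<" = "(", ">" = ")"))
}

# Missing values are passed through as NA.
sanitise_delimiters(c("pH <5; acidic", "depth=10 cm", NA))
```

The same helper could then be applied to `context_property`, `value` and `description` with one `mutate()` call each, rather than one call per character.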

      value = stringr::str_replace_all(value, ";", ","),
      description = stringr::str_replace_all(description, ";", ","),
yangsophieee marked this conversation as resolved.
      value = ifelse(
        is.na(description),
        paste0(context_property, ":", value),
Collaborator

Similarly here I wonder if "=" would be more readable than ":"

        paste0(context_property, ":", value, " <", description, ">"))
    ) %>%
    dplyr::select(-dplyr::all_of(c("description", "context_property", "category"))) %>%
    tidyr::separate_longer_delim(link_vals, ", ") %>%
    dplyr::distinct()

  reformat_contexts <- function(contexts_table, context_id) {
    context_category <- gsub("_id", "_properties", context_id, fixed = TRUE)
    out <- contexts_table %>%
      dplyr::filter(link_id == context_id) %>%
      dplyr::select(-link_id) %>%
      dplyr::distinct(dataset_id, link_vals, .keep_all = TRUE)
Collaborator

Is this line necessary? Shouldn't it be distinct anyway, otherwise there's something wrong with the contexts table that we want to be picked up by dataset_test?
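To illustrate the concern: `distinct()` with `.keep_all = TRUE` keeps the first row for each key and drops the others without any message, so conflicting context entries would not be noticed. A minimal sketch on made-up data (the tibble and its values are hypothetical) that also shows one way to surface such conflicts:

```r
library(dplyr)

# Hypothetical contexts table with two conflicting rows for the same key.
contexts <- tibble(
  dataset_id = c("A", "A", "A"),
  link_vals = c("01", "01", "02"),
  value = c("treatment:control", "treatment:fertilised", "treatment:control")
)

# Keeps the first row per (dataset_id, link_vals); the second row is dropped silently.
kept <- contexts %>%
  distinct(dataset_id, link_vals, .keep_all = TRUE)

# Counting rows per key makes any duplicated keys visible instead.
duplicated_keys <- contexts %>%
  count(dataset_id, link_vals) %>%
  filter(n > 1)

kept
duplicated_keys
```

A check along the lines of `duplicated_keys` could live in `dataset_test`, leaving the join itself free of the `distinct()` call.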


    names(out)[which(names(out) == "value")] <- context_category
    names(out)[which(names(out) == "link_vals")] <- context_id
    out
  }

  join_contexts <- function(data, contexts_tmp) {
    data %>%
      dplyr::left_join(
        by = c("dataset_id", "treatment_context_id"),
        reformat_contexts(contexts_tmp, "treatment_context_id")
      ) %>%
      dplyr::left_join(
        by = c("dataset_id", "plot_context_id"),
        reformat_contexts(contexts_tmp, "plot_context_id")
      ) %>%
      dplyr::left_join(
        by = c("dataset_id", "entity_context_id"),
        reformat_contexts(contexts_tmp, "entity_context_id")
      ) %>%
      dplyr::left_join(
        by = c("dataset_id", "temporal_context_id"),
        reformat_contexts(contexts_tmp, "temporal_context_id")
      ) %>%
      dplyr::left_join(
        by = c("dataset_id", "method_context_id"),
        reformat_contexts(contexts_tmp, "method_context_id")
      )
  }

  combined_table <-
Collaborator

I'm wondering if the ID columns that help join to other relational tables could be removed from the combined table?

    database$traits %>%
    dplyr::left_join(location_latlon, by = c("dataset_id", "location_id")) %>%
    dplyr::left_join(location_properties, by = c("dataset_id", "location_id", "location_name")) %>%
    austraits::join_contexts(contexts_tmp) %>%
Collaborator

Don't you define a function called join_contexts above? Why do we use the austraits version here?

    dplyr::left_join(
      database$methods %>% dplyr::select(-dplyr::all_of(c("data_collectors"))),
      by = c("dataset_id", "trait_name", "method_id")
    ) %>%
    dplyr::left_join(contributors, by = c("dataset_id")) %>%
    dplyr::left_join(database$taxa, by = c("taxon_name")) %>%
    dplyr::left_join(database$taxonomic_updates, by = c("taxon_name", "dataset_id", "original_name"))

  combined_table
}
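The function above packs each location's properties into a single `property=value; property=value` string, which is why `=` and `;` are replaced inside the raw values before pasting. A minimal base R sketch of how such a string could be split back apart; `unpack_properties` is a hypothetical helper, not the unpacking function from this PR:

```r
# Hypothetical helper: split one packed string of the form
# "property=value; property=value" into a two-column data frame.
unpack_properties <- function(packed) {
  # Split the string into individual "property=value" pairs.
  pairs <- strsplit(packed, "; ", fixed = TRUE)[[1]]
  # Split each pair into its name and its value.
  parts <- strsplit(pairs, "=", fixed = TRUE)
  data.frame(
    property = vapply(parts, `[`, character(1), 1),
    value = vapply(parts, `[`, character(1), 2),
    stringsAsFactors = FALSE
  )
}

unpack_properties("elevation (m)=800; rainfall (mm)=2000")
```

The split is only unambiguous because the delimiters cannot occur inside a property name or value once the replacements have been applied.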
14 changes: 14 additions & 0 deletions R/testdata.R
@@ -1176,6 +1176,20 @@ dataset_test_worker <-
info = paste0(red(dataset_id), "\t`db_traits_pivot_longer` threw a warning")
)
}

expect_no_error(
combined_table <- database_create_combined_table(dataset),
Collaborator

This tests creating a combined table for a single dataset, which may very well be intended.

Will we also add database_create_combined_table to build.R or to the GitHub actions list for AusTraits?

info = paste0(red(dataset_id), "\t`database_create_combined_table`")
)

expect_equal(
nrow(combined_table), nrow(dataset$traits),
info = sprintf(
"%s\tnumber of rows of combined table not equal to rows of original traits table",
red(dataset_id)
)
)

}
})
}
3 changes: 2 additions & 1 deletion tests/testthat/examples/Test_2023_1/metadata.yml
@@ -64,11 +64,12 @@ dataset:
locations:
Atherton:
description: Tropical rain forest vegetation.
latitude (deg): .na
longitude (deg): .na
elevation (m): 800
rainfall (mm): 2000
Cape Tribulation:
description: Complex mesophyll vine forest in tropical rain forest.
elevation (m): 25
latitude (deg): .na
longitude (deg): .na
rainfall (mm): 3500