Remove duplicate flora data & add authorship to flora data (#791)

Two pieces of work on this branch cause most data.csv files for datasets with flora data to be completely overwritten:

* Add species-by-species authorship to all flora data. This adds a new line to all data.csv files for floras where it was possible to add attribution for who wrote each profile. This was requested by the taxonomy community and has been mapped in as either "source_id" or "measurement_remarks". These are not curated; they are simply the "sources" or "authors" that could be automatically downloaded.

* Filter the original flora data in AusTraits (datasets: ABRS_1981, NHNSW_2014, 2014_2, 2016, WAH_1998, SAH_2014, NTH_2014) using the following rules:

  - Remove all woodiness, growth form and life history values from the "original" flora scrapings, since we have complete trait value datasets for these traits. The most common error here is "vines that climb to tree tops" being designated as trees, but there are others.

  - Remove all taxon_name x trait_name x dataset_id combinations that are in both the "original" and "new" scraped datasets. There are indeed updated values for a number of numeric traits, and in the ~100 profiles I've looked up where old and new differ, I found only 1 mistake in the newer versions. That said, the differences are a small minority: for trait x taxon x dataset values present in both old and new flora extractions, 98+ % are identical.

  - Retain all categorical data that is only in the "original" scrapings (except the three complete traits). I've spot-checked lots of values and haven't found any errors, and other than growth form, woodiness and life history there isn't much overlap in the categorical traits scraped in the "original" and "new" flora datasets.

  - For numeric traits, for trait x taxon x dataset combinations that are only in the "original" scrapings, I manually checked every data point (~8000 values across all floras) and manually corrected or dismissed incorrect values.

Overall, this has removed ~100,000 data points (a difference of 107,672 rows). These are almost entirely true duplicates:

    nrow(austraits_develop$traits)
    [1] 1813898
    nrow(austraits_removed$traits)
    [1] 1706226
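The first two filtering rules above can be sketched in base R. This is a minimal illustration on toy data, not the code used on this branch: the data frames `original` and `new`, the helper `filter_original()`, the `value` column, and the exact trait names in `complete_traits` are all assumptions, and the sketch assumes both data frames come from the same flora so that taxon x trait pairs can be compared directly.

```r
# Toy stand-ins for an "original" and a "new" scraping of the same flora.
# All values here are invented for illustration.
original <- data.frame(
  dataset_id = c("WAH_1998", "WAH_1998", "WAH_1998", "WAH_1998"),
  taxon_name = c("Taxon a", "Taxon a", "Taxon b", "Taxon b"),
  trait_name = c("woodiness", "leaf_length", "leaf_length", "flower_colour"),
  value      = c("woody", "10", "12", "white"),
  stringsAsFactors = FALSE
)
new <- data.frame(
  dataset_id = "WAH_2022_1",
  taxon_name = "Taxon a",
  trait_name = "leaf_length",
  value      = "11",
  stringsAsFactors = FALSE
)

# Assumed names for the three traits that have complete trait value datasets.
complete_traits <- c("woodiness", "plant_growth_form", "life_history")

# Hypothetical helper: returns the rows of `original` that survive rules 1 and 2.
filter_original <- function(original, new, complete_traits) {
  # Rule 1: drop traits that are covered by complete trait datasets.
  keep <- !(original$trait_name %in% complete_traits)

  # Rule 2: drop taxon x trait combinations that also occur in the new scraping.
  # A tab separator is used for the key because it should not occur in names.
  key_original <- paste(original$taxon_name, original$trait_name, sep = "\t")
  key_new      <- paste(new$taxon_name, new$trait_name, sep = "\t")
  keep <- keep & !(key_original %in% key_new)

  original[keep, , drop = FALSE]
}

filtered  <- filter_original(original, new, complete_traits)
n_removed <- nrow(original) - nrow(filtered)
```

The rows left in `filtered` are the ones found only in the "original" scrapings, which is the set the last two rules apply to: categorical values are retained and numeric values are checked by hand.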
traitecoevo · May 14, 2024 · 36f3855
1 parent 2b11d9c
commit 36f3855

Showing 73 changed files with 627,418 additions and 697,429 deletions.
diff --git a/build.R b/build.R
@@ -1317,10 +1317,6 @@ WAH_1998_config <- dataset_configure("data/WAH_1998/metadata.yml", definitions)
 WAH_1998_raw <- dataset_process("data/WAH_1998/data.csv", WAH_1998_config, schema, resource_metadata, unit_conversions)
 WAH_1998 <- dataset_update_taxonomy(WAH_1998_raw, taxon_list)
 
-WAH_2016_config <- dataset_configure("data/WAH_2016/metadata.yml", definitions)
-WAH_2016_raw <- dataset_process("data/WAH_2016/data.csv", WAH_2016_config, schema, resource_metadata, unit_conversions)
-WAH_2016 <- dataset_update_taxonomy(WAH_2016_raw, taxon_list)
-
 WAH_2022_1_config <- dataset_configure("data/WAH_2022_1/metadata.yml", definitions)
 WAH_2022_1_raw <- dataset_process("data/WAH_2022_1/data.csv", WAH_2022_1_config, schema, resource_metadata, unit_conversions)
 WAH_2022_1 <- dataset_update_taxonomy(WAH_2022_1_raw, taxon_list)
@@ -1801,7 +1797,6 @@ austraits_raw <- build_combine(
   Vesk_2019,
   Vlasveld_2018,
   WAH_1998,
-  WAH_2016,
   WAH_2022_1,
   WAH_2022_2,
   WAH_2023_1,