Remove duplicate flora data & add authorship to flora data (#791)
Two pieces of work on this branch cause most data.csv files for datasets with flora data to be completely overwritten:

*  Add species-by-species authorship to all flora data. This adds a new line to every data.csv file for floras where it was possible to add attribution for who wrote each profile. The taxonomy community requested this, and the authorship has been mapped in as either "source_id" or "measurement_remarks". These values are not curated; they are simply the "sources" or "authors" that could be downloaded automatically.
*  Filter the original flora data in AusTraits (datasets: ABRS_1981, NHNSW_2014, 2014_2, 2016, WAH_1998, SAH_2014, NTH_2014) using the following rules:

-  Remove all woodiness, growth form, and life history values from the "original" flora scrapings, since we have complete trait value datasets for these traits (the most common error here is "vines that climb to tree tops" being designated as trees, but there are others).
-  Remove all taxon_name x trait_name x dataset_id combinations that are in both the "original" and "new" scraped datasets. There are indeed updated values for a number of numeric traits, and in the ~100 profiles I've looked up where old and new differ, I found only 1 mistake in the newer versions. That said, the differences are a small minority: for trait x taxon x dataset values present in both old and new flora extractions, 98+ % are identical.
-  Retain all categorical data that is only in the "original" scrapings (except the three complete traits). I've spot-checked many values and haven't found any errors, and other than growth form, woodiness, and life history there isn't much overlap in the categorical traits scraped in the "original" and "new" flora datasets.
-  For numeric traits, for trait x taxon x dataset combinations that are only in the "original" scrapings, I manually checked every data point (~8000 values across all floras) and manually corrected or dismissed incorrect values.
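
The filtering rules above can be sketched in base R roughly as follows. This is an illustration on toy data, not the script used for this commit: the data frames `original` and `new`, the trait names in `complete_traits` and `numeric_traits`, and the `FLORA_X` dataset id are all hypothetical placeholders, and the sketch assumes both scrapings share comparable `taxon_name`, `trait_name` and `dataset_id` values.

```r
# Toy stand-ins for the "original" and "new" flora scrapings (hypothetical data).
original <- data.frame(
  taxon_name = c("Acacia a", "Acacia a", "Banksia b", "Banksia b"),
  trait_name = c("woodiness", "leaf_length", "leaf_length", "flower_colour"),
  dataset_id = "FLORA_X",
  value      = c("woody", "10", "25", "yellow"),
  stringsAsFactors = FALSE
)
new <- data.frame(
  taxon_name = "Acacia a",
  trait_name = "leaf_length",
  dataset_id = "FLORA_X",
  value      = "12",
  stringsAsFactors = FALSE
)

# Placeholder names for the three traits that have complete datasets elsewhere.
complete_traits <- c("woodiness", "growth_form", "life_history")
# Placeholder list of numeric traits.
numeric_traits <- c("leaf_length")

# Build a single comparison key from taxon x trait x dataset.
make_key <- function(d) paste(d$taxon_name, d$trait_name, d$dataset_id, sep = "|")

# Rule 1: drop the three complete traits from the original scrapings.
kept <- original[!(original$trait_name %in% complete_traits), ]

# Rule 2: drop combinations that also appear in the new scrapings.
kept <- kept[!(make_key(kept) %in% make_key(new)), ]

# Rule 3: categorical values found only in the original scrapings are retained.
categorical_kept <- kept[!(kept$trait_name %in% numeric_traits), ]

# Rule 4: numeric values found only in the original scrapings go to manual review.
needs_manual_check <- kept[kept$trait_name %in% numeric_traits, ]
```

In this toy example, `kept` ends up with the two Banksia rows: the woodiness value is dropped by rule 1 and the Acacia leaf length is dropped by rule 2 because the new scrape also has it.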

Overall, this has removed ~100,000 data points. These are almost entirely true duplicates:

    nrow(austraits_develop$traits)
    [1] 1813898
    nrow(austraits_removed$traits)
    [1] 1706226
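
As a quick check on the "~100,000" figure, the difference between the two row counts above can be computed directly. The variable names here are illustrative only; the two counts are copied from the output above.

```r
# Row counts copied from the nrow() output above.
n_develop <- 1813898  # nrow(austraits_develop$traits)
n_removed <- 1706226  # nrow(austraits_removed$traits)

# Number of data points dropped by this commit.
n_dropped <- n_develop - n_removed
n_dropped
# [1] 107672

# Share of the develop build that was removed, in percent.
pct_dropped <- round(100 * n_dropped / n_develop, 1)
pct_dropped
# [1] 5.9
```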
ehwenk authored May 14, 2024
1 parent 2b11d9c commit 36f3855
Showing 73 changed files with 627,418 additions and 697,429 deletions.
5 changes: 0 additions & 5 deletions build.R
@@ -1317,10 +1317,6 @@ WAH_1998_config <- dataset_configure("data/WAH_1998/metadata.yml", definitions)
WAH_1998_raw <- dataset_process("data/WAH_1998/data.csv", WAH_1998_config, schema, resource_metadata, unit_conversions)
WAH_1998 <- dataset_update_taxonomy(WAH_1998_raw, taxon_list)

-WAH_2016_config <- dataset_configure("data/WAH_2016/metadata.yml", definitions)
-WAH_2016_raw <- dataset_process("data/WAH_2016/data.csv", WAH_2016_config, schema, resource_metadata, unit_conversions)
-WAH_2016 <- dataset_update_taxonomy(WAH_2016_raw, taxon_list)

WAH_2022_1_config <- dataset_configure("data/WAH_2022_1/metadata.yml", definitions)
WAH_2022_1_raw <- dataset_process("data/WAH_2022_1/data.csv", WAH_2022_1_config, schema, resource_metadata, unit_conversions)
WAH_2022_1 <- dataset_update_taxonomy(WAH_2022_1_raw, taxon_list)
@@ -1801,7 +1797,6 @@ austraits_raw <- build_combine(
Vesk_2019,
Vlasveld_2018,
WAH_1998,
-WAH_2016,
WAH_2022_1,
WAH_2022_2,
WAH_2023_1,