Step 3 - IUCN cleanup

using GBIF              # taxonomic name matching against the GBIF backbone
using CSV, DataFrames   # tabular data I/O and manipulation
using ProgressMeter     # progress reporting for the name-matching loop
using Base.Threads      # multi-threaded cleanup of the checklist

We downloaded a checklist of mammals reported to be in Canada from the IUCN database. Before settling on this solution, we examined a few alternatives, notably the use of GBIF occurrences. GBIF occurrences had a few issues, including spurious records, incorrectly tagged museum specimens, and captive exotic species reported as occurrences.

checklist = DataFrame(CSV.File(joinpath("data", "taxonomy.csv")))
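As a quick sanity check (a sketch, not part of the original pipeline), we can peek at the two columns the filtering relies on, assuming the IUCN export names them familyName and scientificName as the code below does:

first(select(checklist, [:familyName, :scientificName]), 5)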

Taxonomy filtering

The European metaweb is limited to "terrestrial" mammals. For this reason, we identified a number of taxonomic groups (mostly families of marine mammals) that are present in Canada but excluded from the source dataset, and we remove them below.

valid_rows = map(
    fam ->
        !(
            fam ∈ [
                "BALAENIDAE",
                "PHYSETERIDAE",
                "DELPHINIDAE",
                "BALAENOPTERIDAE",
                "OTARIIDAE",
                "PHOCIDAE",
                "ODOBENIDAE",
                "ZIPHIIDAE",
                "MONODONTIDAE",
                "ESCHRICHTIIDAE",
                "KOGIIDAE",
                "PHOCOENIDAE"
            ]
        ),
    checklist.familyName,
)
checklist = checklist[findall(valid_rows), :]
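A quick check (a sketch with a throwaway excluded_families variable, not part of the original script) confirms that none of the excluded families remain in the checklist:

excluded_families = [
    "BALAENIDAE", "PHYSETERIDAE", "DELPHINIDAE", "BALAENOPTERIDAE",
    "OTARIIDAE", "PHOCIDAE", "ODOBENIDAE", "ZIPHIIDAE",
    "MONODONTIDAE", "ESCHRICHTIIDAE", "KOGIIDAE", "PHOCOENIDAE",
]
@assert isempty(intersect(unique(checklist.familyName), excluded_families))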

Species-specific removal

Neovison macrodon (considered to be extinct) and Enhydra lutris (considered a marine mammal) are removed as well.

extinct_sp = map(
    sp ->
        !(
            sp ∈ [
                "Neovison macrodon",
                "Enhydra lutris"
            ]
        ),
    checklist.scientificName,
)
checklist = checklist[findall(extinct_sp), :]
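Again as a sketch (not in the original script), we can verify that neither species remains after this step:

@assert "Neovison macrodon" ∉ checklist.scientificName
@assert "Enhydra lutris" ∉ checklist.scientificName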

Reconciliation on the GBIF names

By this point, the approach should be familiar: we create one DataFrame per thread so that writes are thread-safe, and use the GBIF API to find the correct matches.

# One results table per thread: the original checklist name, the matched GBIF
# name, the GBIF ID, and whether the two names are identical
checklist_cleanup_components = [
    DataFrame(; code=String[], gbifname=String[], gbifid=Int64[], equal=Bool[]) for
    i in 1:nthreads()
]

Again, we get rid of underscores before doing the matching. This is not something we want built into the name-cleaning function itself, because some taxa have underscores as part of their valid identifiers. None of the taxa in this specific dataset do, but it is better to keep the low-level tools general and make the dataset-specific changes in user code.
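Concretely, the substitution is a plain character replacement, illustrated here on a hypothetical name:

replace("Canis_lupus", '_' => ' ')   # "Canis lupus"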

p = Progress(length(checklist.scientificName))
@threads for i in 1:length(checklist.scientificName)
    # Underscores are replaced by spaces before querying the GBIF API
    cname = replace(checklist.scientificName[i], '_' => ' ')
    try
        tax = GBIF.taxon(cname; strict=false, class="Mammalia")
        push!(
            checklist_cleanup_components[threadid()],
            (
                checklist.scientificName[i],
                tax.species[1],
                tax.species[2],
                cname == tax.species[1],
            ),
        )
    catch
        # Names with no GBIF match are skipped, but the progress bar still advances
    end
    next!(p)
end

Finally, we write the artifact:

checklist_cleanup = vcat(checklist_cleanup_components...)
CSV.write(joinpath("artifacts", "iucn_gbif_names.csv"), checklist_cleanup)
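As an optional follow-up (a sketch, not part of the original script), the equal column makes it easy to inspect the names for which the GBIF match differs from the IUCN checklist name:

mismatches = checklist_cleanup[.!checklist_cleanup.equal, :]
select(mismatches, [:code, :gbifname, :gbifid])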