Skip to content

Commit

Permalink
misc/watersurfaces: update for version 2024
Browse files Browse the repository at this point in the history
  • Loading branch information
cecileherr committed Dec 11, 2024
1 parent 4613f9b commit 265fbe0
Showing 1 changed file with 122 additions and 67 deletions.
189 changes: 122 additions & 67 deletions src/miscellaneous/watersurfaces.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -29,10 +29,12 @@ opts_chunk$set(

# Exploring the watersurfaces data source

The previous versions were 1.0 (10.5281/zenodo.3386859) and 1.1 (10.5281/zenodo.4117543).
We want to explore the changes in the most recent version: 1.2 (10.5281/zenodo.7440931).
The previous versions were 1.0 (10.5281/zenodo.3386859), 1.1 (10.5281/zenodo.4117543)
and 1.2 (10.5281/zenodo.7440931)

```{r v1.2}
We want to explore the changes in the most recent version: 2024 (10.5281/zenodo.14203168)

```{r v2024}
filepath <- file.path(fileman_up("n2khab_data"), "10_raw/watersurfaces/watersurfaces.gpkg")
```

Expand All @@ -56,7 +58,7 @@ Import the spatial layer:
st_layers(filepath)
```

```{r spatial-layer-v12}
```{r spatial-layer-v2024}
(ws <- read_sf(filepath, stringsAsFactors = TRUE, layer = "Watervlakken"))
```

Expand All @@ -72,38 +74,43 @@ This has been explored for version 1.1, and handled where needed by `n2khab::rea
Given the factor levels explored below, it does not appear that more changes are needed.
Hence this exploratory code has been dropped (2023-11-09).

### A summary of the spatial layer for version 1.2
### A summary of the spatial layer for version 2024

```{r}
ws %>%
st_drop_geometry %>%
summary
```

`PEILBEHEER` is a new column in version 1.2!
`PEILBEHEER` was a new column in version 1.2 and is still present in version 2024.
The column HYLAC was dropped in version 2024 (since the Hyla-codes are not used anymore by Natuurpunt)

Let us look for a few typical errors more systematically in version 1.2.
Let us look for a few typical errors more systematically in version 2024

### Are there `NA` values?

There are plenty of `NA`'s but only in fields where we expect them.
There are no `NA` values at all in version 2024 but ... see below!


```{r}
sapply(ws, function(x) sum(is.na(x)))
```

We see a missing value in `KRWTYPES` for `KRWTYPE == "Ad"`, while all other `KRWTYPE` values are accompanied by a value for `KRWTYPES`.
### Are there `""` values?

Yes, plenty of them.
The empty fields are actually coded as an empty string ("")
(or more accurately, since all those fields are factors: as a factor level "")

The "" are in fields where `NA` can be expected.

```{r}
ws |>
st_drop_geometry() |>
count(KRWTYPE, KRWTYPES)
sapply(ws |> st_drop_geometry(), function(x) sum(as.character(x) == '', na.rm = TRUE))
```


### Are there `<Null>` values?

One `<Null>` string is present in `GEBIED` column:
None:

```{r}
sapply(ws |> st_drop_geometry(), function(x) sum(as.character(x) == '<Null>', na.rm = TRUE))
Expand All @@ -117,20 +124,21 @@ None:
sapply(ws |> st_drop_geometry(), function(x) sum(as.character(x) == '0', na.rm = TRUE))
```

Previously, `HYLAC` was zero if missing!
### Are there any "/" values?

Number of unique values of numeric variable `HYLAC`:
None

```{r}
n_distinct(ws$HYLAC)
sapply(ws |> st_drop_geometry(), function(x) sum(as.character(x) == "/", na.rm = TRUE))
```

How many `NA` values for numeric variable `HYLAC`?

Many! So the zeroes were replaced by `NA`.
### Are there any "-" values?

Yes there are "-" in `KRWTYPEA`, but it is expected (described in the metadata).

```{r}
ws$HYLAC %>% is.na %>% sum
sapply(ws |> st_drop_geometry(), function(x) sum(as.character(x) == "-", na.rm = TRUE))
```

### Are WVLC codes unique?
Expand All @@ -145,23 +153,76 @@ ws$WVLC %>% unique %>% length == nrow(ws)
### Levels for each factor

We can compare the levels for each factor variable with the information given in
[Scheers et al. (2022)](https://pureportal.inbo.be/nl/publications/watervlakken-versie-12-polygonenkaart-van-stilstaand-water-in-vla).
[Leyssen et al. (2024)](https://pureportal.inbo.be/nl/publications/watervlakken-2024-polygonenkaart-van-stilstaand-water-in-vlaander).

We do not check `NAAM` and `GEBIED` since there are many possible options.

**WVLC**

Similar problem for the `WVLC` but we can check whether the structure of the codes is correct:

Does `WVLC` start with a province code?

```{r}
substr(levels(ws$WVLC),1 ,3) %>% unique()
```

How many records with an unexpected length? (not 10 characters)?

```{r}
ws %>% filter(str_length(WVLC) != 10) %>% nrow()
```

There is one problem:

```{r}
ws |> filter(str_detect(WVLC, pattern = "d")) |> as.matrix() |> t()
```

**KRWTYPE**: the codes are correct but there are more codes mentioned in the metadata report.

```{r}
levels(ws$KRWTYPE)
```
**KRWTYPEA**: the codes are correct but there are more codes mentioned in the metadata report.

```{r}
levels(ws$KRWTYPEA)
```


**KRWTYPES** (status): the codes in the spatial layer are the same as in the metadata report.

```{r}
levels(ws$KRWTYPES)
```

**DIEPKL**: there are extra codes in the dataset ("0 - 2 m", "2 - 4 m", "4 - 6 m" and "> 6 m") and the ordering of the levels could be more logical.
`KRWTYPE` / `KRWTYPEA` / `KRWTYPES` consistency:

```{r}
ws |>
st_drop_geometry() |>
count(KRWTYPE, KRWTYPES) |>
print()
```

```{r}
ws |>
st_drop_geometry() |>
filter(!is.na(KRWTYPEA), KRWTYPEA != "-") |>
count(KRWTYPE, KRWTYPEA) |>
print(n = Inf)
```

One record has `KRWTYPES` set to ‘voorlopig’:

```{r}
ws |> filter(KRWTYPES == "voorlopig") |> as.matrix() |> t()
```

After discussion with the authors, this seems to be correct.

**DIEPKL**: the codes in the spatial layer are the same as in the metadata report but the ordering of the levels could be more logical.

This comment has been minimized.

Copy link
@florisvdh

florisvdh Dec 13, 2024

Member

the ordering of the levels could be more logical.

Makes sense if you actually mean n2khab::read_watersurfaces() here, but the notebook object is just created with sf::read_sf(..., stringsAsFactors = TRUE), so such things are still within expectation here.

This comment has been minimized.

Copy link
@florisvdh

florisvdh Jan 8, 2025

Member

@cecileherr does this comment still make sense / does it have implications?

(tagging #73)

This comment has been minimized.

Copy link
@cecileherr

cecileherr Jan 8, 2025

Author Collaborator

well, we might want to reorder the levels (and this for the different versions of the layer) but it has not happened yet in read_watersurfaces. Ideally I would correct it but I'm a bit short on time at the moment. At least so it is visible for next version (ahem ahem)


```{r}
levels(ws$DIEPKL)
Expand All @@ -174,57 +235,49 @@ However this was just for exploratory purposes as it was not the purpose to alwa
- also in the future we will not rectify this (by default) in reading functions for raw data sources: problems in the data will be returned as-is and should be solved in a future version of the data source.
By default we just streamline column names and variable types, we make sure that values referring to `NA` are effectively returned as `NA` and we try to avoid some encoding problems.

Because on Windows the ≥/\u2265 character is not well displayed, we will recode it into '=>'
(otherwise "≥" in ws$DIEPKL are rendered as "=" in the html output):
Because on Windows the ≥/\u2265 character is not well displayed, we had to recode it into '=>'
(otherwise "≥" in ws$DIEPKL was rendered as "=" in the html output).
Note that this is not needed anymore for version 2024 that does not contain the character ≥ anymore
but we will keep this in the read fucntion for the previous versions

```{r}
levels(ws$DIEPKL) <- gsub(pattern = "\u2265", ">=", levels(ws$DIEPKL))
levels(ws$DIEPKL)
```

**CONNECT**: the codes differ from those in the report, which have the codes from version 1.1.
**CONNECT**: the codes in the spatial layer are the same as in the metadata report.

```{r}
levels(ws$CONNECT)
```

These are the categories in the metadata report for `CONNECT`:

- geïsoleerd: niet verbonden met een waterloop
- permanent: het watervlak staat permanent in verbinding met minstens één
waterloop
- periodiek: het watervlak staat tijdelijk (door peilbeheer of droogte) in verbinding
met minstens één waterloop

**FUNCTIE**: the code "veedrenk" is not mentioned in the metadata report and
there are many more codes in the report than in the
dataset
**FUNCTIE**: there are many more codes in the report than in the dataset

```{r}
levels(ws$FUNCTIE)
```

And here are the categories in the metadata report for `FUNCTIE`:
Here are the categories in the metadata report for `FUNCTIE`:

functie toewijzing
---------------------- ---------------------
natuur doelstelling natuurbehoud
hengelintensief intensief hengelen (met infrastructuur, bepoting of gebruikt voor wedstrijdhengelen)
hengelextensief extensief hengelen (geen infrastructuur, bepoting of wedstrijdhengelen)
jacht jagen
tuin/park esthetisch (verblijfsrecreatie, tuin- en parkvijvers)
tuin_park esthetisch (verblijfsrecreatie, tuin- en parkvijvers)
vogel waterpartij voor gedomesticeerde watervogels
viskweek opkweken van vis
zwemmen zwemmen
duiken duiken
zachterecreatie niet gemotoriseerde watersport
motorrecreatie gemotoriseerde watersport
berging waterberging ten behoeve van overstromings- of peilbeheer
duik duiken
zachteRecreatie niet gemotoriseerde waterrecreatie
motorrecreatie gemotoriseerde waterrecreatie
waterberging waterberging ten behoeve van overstromings- of peilbeheer
opslag reservoir voor water (industrie, landbouw, bluswater, waterkracht…)
drinkwater drinkwaterwinning
zuivering (kleinschalige) waterzuivering, infiltratie
bezinking bezinking van proceswater
drinkplaats watervoorziening voor vee
veedrenk watervoorziening voor vee
geen geen specifieke functie

**PEILBEHEER**: one code less than in the report.
Expand All @@ -233,25 +286,24 @@ geen geen specifieke functie
levels(ws$PEILBEHEER)
```


## Potential issues

- `KRWTYPE`, `FUNCTIE`, `PEILBEHEER`: more levels in the metadata report than in the dataset.
- there is a new field: `KRWTYPEA` and one field was dropped `HYLAC`
- `KRWTYPE`, `KRWTYPEA`, `FUNCTIE`, `PEILBEHEER`: more levels in the metadata report than in the dataset.
This is not necessarily a problem.
- `DIEPKL`, `FUNCTIE`: extra codes in the dataset that are not present in the report
- `CONNECT`: codes differ between dataset and report
- a missing value in `KRWTYPES` for `KRWTYPE == "Ad"`
- One `<Null>` string is present in `GEBIED` column

The extra codes will be discussed with the authors of the layers.
- All the missing values for the factors are coded as level empty string ""
- One wrong WVLC code ("d") in de Kleiputten van Heist (Palingpot)


## Validity of the geometries

Let's inspect features with invalid or corrupt geometry:

```{r}
st_is_valid(ws) %>% table
invalid_geoms <- ws[!st_is_valid(ws) | is.na(st_is_valid(ws)), ]
validities <- st_is_valid(ws)
validities %>% table
invalid_geoms <- ws[!validities | is.na(validities), ]
st_is_valid(invalid_geoms, reason = TRUE)
```

Expand All @@ -275,7 +327,7 @@ all(st_is_valid(valid_geoms))
```

So this works well; in derived data we will fix these geometries.
We might also consider an optional geometry reparation step in `read_watersurfaces()`.
We can use the optional geometry reparation step in `read_watersurfaces()`.

We also check that no empty geometries are present:

Expand All @@ -290,7 +342,7 @@ Refer to <https://github.com/inbo/n2khab-preprocessing/issues/60> and <https://r

```{r plot}
# plot watersurfaces v1.2
# plot watersurfaces v2024
p <- ggplot() +
geom_sf(data = ws, aes(), color = "blue")
Expand All @@ -304,24 +356,25 @@ print(p)
```

<!-- ### Are all the watersurfaces in Flanders? -->
<!-- ### Are all the watersurfaces in Flanders? -->

<!-- ```{r streams-in-fl} -->
<!-- in_vl <- st_contains(x = sf_vl, y = ws) -->

<!-- not_contained <- ws %>% -->
<!-- st_drop_geometry() %>% -->
<!-- mutate(orig_rowname = rownames(.)) %>% -->
<!-- not_contained <- ws %>% -->
<!-- st_drop_geometry() %>% -->
<!-- mutate(orig_rowname = rownames(.)) %>% -->
<!-- filter(!rownames(.) %in% in_vl[[1]]) -->

<!-- # 61 watersurfaces -->
<!-- # 62 watersurfaces -->
<!-- print(not_contained %>% nrow()) -->

<!-- not_contained %>% -->
<!-- distinct(WVLC, NAAM) %>% -->
<!-- not_contained %>% -->
<!-- distinct(WVLC, NAAM) %>% -->
<!-- kable() -->
<!-- ``` -->
<!-- 61 watersurfaces are not 'contained' within the polygon for Flanders (see list above) -->

<!-- 62 watersurfaces are not 'contained' within the polygon for Flanders (see list above) -->

# Tidyverse-styled, internationalized column names when using the data source in R

Expand All @@ -333,10 +386,10 @@ data source variable data frame variable
---------------------- ---------------------
`WVLC` `polygon_id`
`WTRLICHC` `wfd_code`
`HYLAC` `hyla_code`
`NAAM` `name`
`GEBIED` `area_name`
`KRWTYPE` `wfd_type `
`KRWTYPEA` `wfd_type_alternative`
`KRWTYPES` `wfd_type_certain`
`DIEPKL` `depth_class`
`CONNECT` `connectivity`
Expand All @@ -347,12 +400,14 @@ data source variable data frame variable

- not uptaking `OPPWVL`, `OMTWVL`, `SHAPE_Length`, `SHAPE_Area` (area & perimeter are easily calculated etc) -- OK
- sort by `polygon_id` -- OK
- add translations to long text for `wfd_type ` and `connectivity` -- OK but not by default
- add translations to long text for `wfd_type `, `wfd_type_alternative `, `connectivity` -- OK but not by default
- add translations to long text for `usage`? -- for a later version (as more
codes will be used)
- converting null / 0 values to `NA` -- OK
- support new variable `water_level_management` -- OK

- converting null / 0 / "" values to `NA` -- OK
- support variable water_level_management (since 1.2) – OK
- support variable wfd_type_alternative (since v2024) – OK
- made variable hyla_code optional (dropped in v2024) – OK
- there is one wrong WVLC code ("d") in de Kleiputten van Heist (Palingpot) - should we stay true to source?

This comment has been minimized.

Copy link
@florisvdh

florisvdh Dec 13, 2024

Member

should we stay true to source?

Good point. For nasty bugs we could sidestep the dilemma using the same approach as fix_geom, e.g. fix_extra with default FALSE, to honour the default of just returning the raw contents (even though we do streamline column names and classes to get more uniformity between objects).

This comment has been minimized.

Copy link
@florisvdh

florisvdh Jan 8, 2025

Member

Handled in c34e521 (#73)


# Used environment

Expand Down

0 comments on commit 265fbe0

Please sign in to comment.