-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path02_regex_and_casewhen.Rmd
77 lines (61 loc) · 2.96 KB
/
02_regex_and_casewhen.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
---
title: "Regular expressions, if_else, case_when"
author: "John Little"
date: "`r Sys.Date()`"
output: html_notebook
---
`regex` is used to find patterns in your data. This is helpful for data cleaning and for normalizing data, e.g. when a sub-field of data needs to be liberated from a more expansive variable.
`if_else` and `case_when` can be used to conditionally change data values.
```{r}
library(tidyverse)
```
## Regex
There's so much to say about regex. We'll say the absolute minimum and leave you with a [link to more resources](https://cs.lmu.edu/~ray/notes/regex/). One approach is to watch a short youtube video on the general concept of _regular expressions_. Then use the [`stringr` library](https://stringr.tidyverse.org) to apply the regex concept in R and the Tidyverse. For my own approach, I use the [stringr cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/strings.pdf) as a help in understanding how to apply the regex concepts in the R context. Nonetheless, even the cheatsheet can sometimes be confusing. Regex is a deceptively deep topic. Start small and build.
regex symbol | regex function
------------:|:--------------
`\d` | a digit
`$` | the end of the pattern
`^` | the beginning of the pattern
`\w` | a word character
`+` | a multiplier: 1 or more
`*` | a multiplier: 0 or more
`\s` | a space
`.` | a wildcard (matches all symbols)
`()` | a capture group
> NOTE: in R you have to escape the back-slash. i.e `\\`. So `\d` is written as `\\d`
```{r}
starwars |>
select(height, name) |>
mutate(last_height_number = str_extract(height, "\\d$")) |>
mutate(first_height_number = str_extract(height, "^\\d")) |>
relocate(height, .after = first_height_number) |>
mutate(first_name = str_extract(name, "^\\w+")) |>
mutate(numer_of_names = str_count(name, "\\s") + 1) |>
mutate(number_of_vowels = str_count(name, "[aeiouAEIOU]")) |>
mutate(switch_names = str_replace(name, "(\\w+)\\s(.*)", "\\2, \\1")) |>
relocate(name, .after = first_name) |>
mutate(double_consonants = str_detect(name, "tt|ll|ff|ss|bb|cc|rr|gg")) |>
mutate(replace_dashes = if_else(str_detect(name, "-"), str_replace_all(name, "-", " FOO "), "")) |>
mutate(repd = str_replace_all(name, "-", " FOO "))
# filter(str_detect(name, regex("skywalker", ignore_case = TRUE)))
```
## `if_else()`
```{r}
starwars |>
filter(homeworld == "Tatooine") |>
mutate(skin_color = if_else(sex == "none", "powder coated aluminum", skin_color ))
```
## `case_when()`
```{r}
medium_worlds <- c("Alderaan", "Coruscant", "Kamino")
starwars |>
filter(homeworld %in% medium_worlds) |>
select(!is.list) |>
mutate(notes = case_when(
homeworld == "Alderaan" ~ "Extreme destruction",
homeworld == "Kamino" ~ "Super-earth",
homeworld == "Coruscant" ~ "Exoplanet"
)) |>
# relocate(notes, .after = name) |>
select(species, starts_with("n"), homeworld)
```