From f9ec0eb03b705b56dbe8de9afa533e4a6e57b6e6 Mon Sep 17 00:00:00 2001 From: Bill Denney Date: Wed, 18 Dec 2024 16:58:54 -0500 Subject: [PATCH] Fix spelling issues, update janitor.md --- DESCRIPTION | 2 + NEWS.md | 8 +-- R/round_half_up.R | 2 +- vignettes/janitor.Rmd | 2 +- vignettes/janitor.md | 155 +++++++++++++++++++----------------------- vignettes/tabyls.md | 6 +- 6 files changed, 82 insertions(+), 93 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index 715ff2f7..ff84bb41 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -44,6 +44,7 @@ Suggests: rmarkdown, RSQLite, sf, + spelling, testthat (>= 3.0.0), tibble, tidygraph @@ -54,3 +55,4 @@ Encoding: UTF-8 LazyData: true Roxygen: list(markdown = TRUE) RoxygenNote: 7.3.2 +Language: en-US diff --git a/NEWS.md b/NEWS.md index b6a2844f..26f469ee 100644 --- a/NEWS.md +++ b/NEWS.md @@ -106,7 +106,7 @@ These are all minor breaking changes resulting from enhancements and are not exp ## New features -* The `adorn_totals()` function now accepts the special argument `fill = NA`, which will insert a class-appropriate `NA` value into each column that isn't being totaled. This preserves the class of each column; previously they were all convered to character. (thanks **@hamstr147** for implementing in #404 and **@ymer** for reporting in #298). +* The `adorn_totals()` function now accepts the special argument `fill = NA`, which will insert a class-appropriate `NA` value into each column that isn't being totaled. This preserves the class of each column; previously they were all converted to character. (thanks **@hamstr147** for implementing in #404 and **@ymer** for reporting in #298). * `adorn_totals()` now takes the value of `"both"` for the `where` argument. That is, `adorn_totals("both")` is a shorter version of `adorn_totals(c("col", "row"))`. (#362, thanks to **@svgsstats** for implementing and **@sfd99** for suggesting). @@ -130,7 +130,7 @@ These are all minor breaking changes resulting from enhancements and are not exp * A call to make a 3-way `tabyl()` now succeeds when the first variable is of class `ordered` (#386) -* If a totals row and/or column is present on a tabyl as a result of `adorn_totals()`, the functions `chisq.test()` and `fisher.test()` drop the totals and print a warning before proceding with the calculations (#385). +* If a totals row and/or column is present on a tabyl as a result of `adorn_totals()`, the functions `chisq.test()` and `fisher.test()` drop the totals and print a warning before proceeding with the calculations (#385). # janitor 2.0.1 (2020-04-12) @@ -276,7 +276,7 @@ This builds on the original functionality of janitor, with similar-but-improved ### A fully-overhauled `tabyl` -`tabyl()` is now a single function that can count combinations of one, two, or three variables, ala base R's `table()`. The resulting `tabyl` data.frames can be manipulated and formatted using a family of `adorn_` functions. See the [tabyls vignette](https://sfirke.github.io/janitor/articles/tabyls.html) for more. +`tabyl()` is now a single function that can count combinations of one, two, or three variables, a la base R's `table()`. The resulting `tabyl` data.frames can be manipulated and formatted using a family of `adorn_` functions. See the [tabyls vignette](https://sfirke.github.io/janitor/articles/tabyls.html) for more. The now-redundant legacy functions `crosstab()` and `adorn_crosstab()` have been deprecated, but remain in the package for now. Existing code that relies on the version of `tabyl` present in janitor versions <= 0.3.1 will break if the `sort` argument was used, as that argument no longer exists in `tabyl` (use `dplyr::arrange()` instead). @@ -292,7 +292,7 @@ No further changes are planned to `clean_names()` and its results should be stab ## Major features -- `clean_names()` transliterates accented letters, e.g., `çãüœ` becomes `cauoe` [(#120)](https://github.com/sfirke/janitor/issues/120). Thanks to **@fernandovmacedo**. +- `clean_names()` transliterates accented letters, e.g., `C'C#C%`, `make_clean_names()` allows for more general usage, e.g., on a vector. +Like base R's `make.names()`, but with the styling and case choice of the long-time janitor function `clean_names()`. While `clean_names()` is still offered for use in data.frame pipeline with `%>%`, `make_clean_names()` allows for more general usage, e.g., on a vector. It can also be used as an argument to `.name_repair` in the newest version of `tibble::as_tibble`: ```{r} diff --git a/vignettes/janitor.md b/vignettes/janitor.md index 3627d7b3..2ee3b61f 100644 --- a/vignettes/janitor.md +++ b/vignettes/janitor.md @@ -1,68 +1,45 @@ Overview of janitor functions ================ -2023-02-02 - -- Major functions - - Cleaning - - Clean data.frame names - with clean_names() - - Do those - data.frames actually contain the same columns? - - Exploring - - tabyl() - a - better version of table() - - Explore - records with duplicated values for specific combinations of variables - with get_dupes() - - Explore - relationships between columns with get_one_to_one() -- Minor functions - - Cleaning - - Manipulate - vectors of names with make_clean_names() - - Validate - that a column has a single_value() per group - - remove_empty() rows - and columns - - remove_constant() - columns - - Directionally-consistent - rounding behavior with round_half_up() - - Round - decimals to precise fractions of a given denominator with - round_to_fraction() - - Fix - dates stored as serial numbers with - excel_numeric_to_date() - - Convert a - mix of date and datetime formats to date - - Elevate column - names stored in a data.frame row - - Find the - header row buried within a messy data.frame - - Exploring - - Count - factor levels in groups of high, medium, and low with - top_levels() +2024-12-18 + +- [Major functions](#major-functions) + - [Cleaning](#cleaning) + - [Clean data.frame names with + `clean_names()`](#clean-dataframe-names-with-clean_names) + - [Do those data.frames actually contain the same + columns?](#do-those-dataframes-actually-contain-the-same-columns) + - [Exploring](#exploring) + - [`tabyl()` - a better version of + `table()`](#tabyl---a-better-version-of-table) + - [Explore records with duplicated values for specific combinations + of variables with + `get_dupes()`](#explore-records-with-duplicated-values-for-specific-combinations-of-variables-with-get_dupes) + - [Explore relationships between columns with + `get_one_to_one()`](#explore-relationships-between-columns-with-get_one_to_one) +- [Minor functions](#minor-functions) + - [Cleaning](#cleaning-1) + - [Manipulate vectors of names with + `make_clean_names()`](#manipulate-vectors-of-names-with-make_clean_names) + - [Validate that a column has a `single_value()` per + group](#validate-that-a-column-has-a-single_value-per-group) + - [`remove_empty()` rows and + columns](#remove_empty-rows-and-columns) + - [`remove_constant()` columns](#remove_constant-columns) + - [Directionally-consistent rounding behavior with + `round_half_up()`](#directionally-consistent-rounding-behavior-with-round_half_up) + - [Round decimals to precise fractions of a given denominator with + `round_to_fraction()`](#round-decimals-to-precise-fractions-of-a-given-denominator-with-round_to_fraction) + - [Fix dates stored as serial numbers with + `excel_numeric_to_date()`](#fix-dates-stored-as-serial-numbers-with-excel_numeric_to_date) + - [Convert a mix of date and datetime formats to + date](#convert-a-mix-of-date-and-datetime-formats-to-date) + - [Elevate column names stored in a data.frame + row](#elevate-column-names-stored-in-a-dataframe-row) + - [Find the header row buried within a messy + data.frame](#find-the-header-row-buried-within-a-messy-dataframe) + - [Exploring](#exploring-1) + - [Count factor levels in groups of high, medium, and low with + `top_levels()`](#count-factor-levels-in-groups-of-high-medium-and-low-with-top_levels) The janitor functions expedite the initial data exploration and cleaning that comes with any new data set. This catalog describes the usage for @@ -78,7 +55,7 @@ Functions for everyday use. Call this function every time you read data. -It works in a `%>%` pipeline, and handles problematic variable names, +It works in a `%>%` pipeline and handles problematic variable names, especially those that are so well-preserved by `readxl::read_excel()` and `readr::read_csv()`. @@ -94,8 +71,10 @@ and `readr::read_csv()`. ``` r # Create a data.frame with dirty names test_df <- as.data.frame(matrix(ncol = 6)) -names(test_df) <- c("firstName", "ábc@!*", "% successful (2009)", - "REPEAT VALUE", "REPEAT VALUE", "") +names(test_df) <- c( + "firstName", "ábc@!*", "% successful (2009)", + "REPEAT VALUE", "REPEAT VALUE", "" +) ``` Clean the variable names, returning a data.frame: @@ -111,8 +90,8 @@ Compare to what base R produces: ``` r make.names(names(test_df)) -#> [1] "firstName" "ábc..." "X..successful..2009." "REPEAT.VALUE" "REPEAT.VALUE" -#> [6] "X" +#> [1] "firstName" "ábc..." "X..successful..2009." +#> [4] "REPEAT.VALUE" "REPEAT.VALUE" "X" ``` This function is powered by the underlying exported function @@ -229,10 +208,11 @@ sets of one-to-one clusters: ``` r library(dplyr) -starwars[1:4,] %>% +starwars[1:4, ] %>% get_one_to_one() #> [[1]] -#> [1] "name" "height" "mass" "skin_color" "birth_year" "films" +#> [1] "name" "height" "mass" "skin_color" "birth_year" +#> [6] "films" #> #> [[2]] #> [1] "hair_color" "starships" @@ -250,7 +230,7 @@ than the equivalent code they replace. ### Manipulate vectors of names with `make_clean_names()` -Like base R’s `make.names()`, but with the stylings and case choice of +Like base R’s `make.names()`, but with the styling and case choice of the long-time janitor function `clean_names()`. While `clean_names()` is still offered for use in data.frame pipeline with `%>%`, `make_clean_names()` allows for more general usage, e.g., on a vector. @@ -273,7 +253,7 @@ tibble::as_tibble(iris, .name_repair = janitor::make_clean_names) #> 8 5 3.4 1.5 0.2 setosa #> 9 4.4 2.9 1.4 0.2 setosa #> 10 4.9 3.1 1.5 0.1 setosa -#> # … with 140 more rows +#> # ℹ 140 more rows ``` ### Validate that a column has a `single_value()` per group @@ -290,7 +270,8 @@ where it should not: ``` r not_one_to_one <- data.frame( X = rep(1:3, each = 2), - Y = c(rep(1:2, each = 2), 1:2)) + Y = c(rep(1:2, each = 2), 1:2) +) not_one_to_one #> X Y @@ -303,12 +284,13 @@ not_one_to_one # throws informative error: try(not_one_to_one %>% - dplyr::group_by(X) %>% - dplyr::mutate( - Z = single_value(Y, info = paste("Calculating Z for group X =", X))) - ) + dplyr::group_by(X) %>% + dplyr::mutate( + Z = single_value(Y, info = paste("Calculating Z for group X =", X)) + )) #> Error in dplyr::mutate(., Z = single_value(Y, info = paste("Calculating Z for group X =", : -#> ℹ In argument: `Z = single_value(Y, info = paste("Calculating Z for group X =", X))`. +#> ℹ In argument: `Z = single_value(Y, info = paste("Calculating Z for +#> group X =", X))`. #> ℹ In group 3: `X = 3`. #> Caused by error in `single_value()`: #> ! More than one (2) value found (1, 2): Calculating Z for group X = 3: Calculating Z for group X = 3 @@ -320,9 +302,11 @@ Does what it says. For cases like cleaning Excel files that contain empty rows and columns after being read into R. ``` r -q <- data.frame(v1 = c(1, NA, 3), - v2 = c(NA, NA, NA), - v3 = c("a", NA, "b")) +q <- data.frame( + v1 = c(1, NA, 3), + v2 = c(NA, NA, NA), + v3 = c("a", NA, "b") +) q %>% remove_empty(c("rows", "cols")) #> v1 v3 @@ -419,8 +403,10 @@ names of the data.frame and optionally (by default) remove the row in which names were stored and/or the rows above it. ``` r -dirt <- data.frame(X_1 = c(NA, "ID", 1:3), - X_2 = c(NA, "Value", 4:6)) +dirt <- data.frame( + X_1 = c(NA, "ID", 1:3), + X_2 = c(NA, "Value", 4:6) +) row_to_names(dirt, 2) #> ID Value @@ -454,7 +440,8 @@ grouped into head/middle/tail groups. ``` r f <- factor(c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"), - levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree")) + levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree") +) top_levels(f) #> f n percent #> strongly agree, agree 3 0.5000000 diff --git a/vignettes/tabyls.md b/vignettes/tabyls.md index ea526931..262a1801 100644 --- a/vignettes/tabyls.md +++ b/vignettes/tabyls.md @@ -254,7 +254,7 @@ humans %>% function or using janitor’s `round_half_up()` to round all ties up ([thanks, StackOverflow](https://stackoverflow.com/a/12688836/4470365)). - - e.g., round 10.5 up to 11, consistent with Excel’s tie-breaking + - e.g., round 10.5 up to 11, consistent with Excel's tie-breaking behavior. - This contrasts with rounding 10.5 down to 10 as in base R’s `round(10.5)`. @@ -263,7 +263,7 @@ humans %>% `adorn_pct_formatting()`; these two functions should not be called together. - **`adorn_ns()`**: add Ns to a tabyl. These can be drawn from the - tabyl’s underlying counts, which are attached to the tabyl as + tabyl's underlying counts, which are attached to the tabyl as metadata, or they can be supplied by the user. - **`adorn_title()`**: add a title to a tabyl (or other data.frame). Options include putting the column title in a new row on top of the @@ -427,7 +427,7 @@ comparison %>% #> Total 100.0% (3,000) 100.0% (3,000) 100.0% (6,000) ``` -Now we format them to insert the thousands commas. A tabyl’s raw Ns are +Now we format them to insert the thousands commas. A tabyl's raw Ns are stored in its `"core"` attribute. Here we retrieve those with `attr()`, then apply the base R function `format()` to all numeric columns. Lastly, we append these Ns using `adorn_ns()`.