diff --git a/docs/404.html b/docs/404.html deleted file mode 100644 index 7f43f6e3..00000000 --- a/docs/404.html +++ /dev/null @@ -1,163 +0,0 @@ - - - -
- - - - -I would prefer some discussion before an unsolicited code contribution, i.e., pull request. This ensures that your effort is not wasted and that we’re aligned on how to improve the janitor package.
-This is especially true if your proposed contribution does not match a currently open issue. If that’s the case, please open new issue(s) to have the discussion there, prior to submitting code.
-If your proposed contribution addresses multiple issues, it should ideally be broken into multiple pull requests. This will make it easier for me to review and approve.
-git clone https://github.com/<yourgithubusername>/janitor.git
-janitor
at sfirke/janitor
) by doing git remote add upstream https://github.com/sfirke/janitor.git
. Before making changes, make sure to pull in changes from upstream, either with git fetch upstream
(merging later) or with git pull upstream
to fetch and merge in one step
-sfirke/janitor
-YEAR: 2016 -COPYRIGHT HOLDER: Sam Firke -- -
The MIT License (MIT)
-Copyright (c) 2016 Sam Firke
-Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
-The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
-THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- - -A stable version 1.0.0, with a new tabyl
API and with breaking changes to the output of clean_names()
.
This preserves the original functionality of janitor, but significantly changes the implementation.
-tabyl
-This is now a single function tabyl()
to count combinations of one, two, or three variables, à la base R’s table()
. This replaces the crosstab()
function. The resulting tabyl
data.frames can be manipulated and formatted using a family of adorn_
functions. See the tabyls vignette for more.
The now-redundant legacy functions crosstab()
and adorn_crosstab()
have been deprecated, but remain in the package for now. Existing code that relies on tabyl
will break if the sort
argument is used, as that argument no longer exists in tabyl
(use dplyr::arrange()
instead).
clean_names
-clean_names()
now detects and preserves camelCase inputs, allows multiple options for case outputs of the cleaned data.frame, and preserves whether there’s space between letters and numbers. It also transliterates accented letters and turns #
into "number"
. This may cause old code to break. E.g., variableName
as a raw column name is now converted to variable_name
(or variableName
, VariableName
, etc. depending on your preference), where it would previously have been converted to variablename
. To minimize this inconvenience, there’s a quick fix for compatibility: you can find-and-replace to insert the argument case = "old_janitor"
, preserving the old behavior of clean_names()
as of janitor version 0.3.1, so you won’t have to revise your scripts any further.
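A minimal sketch of the compatibility switch described above, assuming a janitor version that still accepts case = "old_janitor". The data.frame `raw` is hypothetical and exists only for illustration.

```r
library(janitor)

# Hypothetical data.frame with a camelCase column name
raw <- data.frame(variableName = 1:2)

# Default behavior since 1.0.0: camelCase is detected and split
new_names <- names(clean_names(raw))

# Legacy behavior, as in janitor 0.3.1
old_names <- names(clean_names(raw, case = "old_janitor"))

new_names
old_names
```

Per the release note, the first call should give variable_name and the second variablename.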
clean_names()
transliterates accented letters, e.g., çãüœ
becomes cauoe
(#120). Thanks to @fernandovmacedo.
clean_names()
offers multiple options for variable name styling. In addition to snake_case
output you can select smallCamelCase
, BigCamelCase
, ALL_CAPS
and others. (#131).
-clean_names()
. Thanks also to @maelle for proposing this feature.
-Launched the janitor documentation website: http://sfirke.github.io/janitor. Thanks to the pkgdown package!
remove_empty_rows()
and remove_empty_cols()
, which are replaced by the single function remove_empty()
. (#100)
-remove_empty()
does not have a default value for the which
argument, forcing more explicit and readable code. e.g. remove_empty("rows")
.
-The new adorn_title()
function shows the name of the 2nd tabyl
variable (column name) - this un-tidies the data.frame but makes the result clearer to readers (#77)
tabyl
objects now print with row numbers suppressed
-clean_names()
now retains the character #
as "number"
in the resulting names
-round_half_up()
is now exported for public use. It’s an exact implementation of http://stackoverflow.com/questions/12688717/round-up-from-5-in-r/12688836#12688836, written by @mrdwab.
-adorn_totals("row")
handles quirky variable names in 1st column (#118)
-get_dupes()
returns the correct result when a variable in the input data.frame is already called "n"
(#162)
-This is a bug-fix release with no new functionality or changes. It fixes a bug where adorn_crosstab()
failed if the tibble
package was version > 1.4.
Major changes to janitor are currently in development on GitHub and will be released soon. This is not that next big release.
-The primary purpose of this release is to maintain accuracy given breaking changes to the dplyr package, upon which janitor is built, in dplyr version >0.6.0. This update also contains a number of minor improvements.
-Critical: if you update the package dplyr
to version >0.6.0, you must update janitor to version 0.3.0 to ensure accurate results from janitor’s tabyl()
function. This is due to a change in the behavior of dplyr’s _join
functions (discussed in #111).
janitor 0.3.0 is compatible with this new version of dplyr as well as old versions of dplyr back to 0.5.0. That is, updating janitor to 0.3.0 does not necessitate an update to dplyr >0.6.0.
-add_totals_row
and add_totals_col
were combined into a single function, adorn_totals()
. (#57). The add_totals_
functions are now deprecated and should not be used.
-adorn_crosstab()
is now “dat” instead of “crosstab” (indicating that the function can be called on any data.frame, not just a result of crosstab()
)
-%>%
pipe from magrittr (#107).
-Deprecated the following functions:
-use_first_valid_of()
- use dplyr::coalesce()
instead
-convert_to_NA()
- use dplyr::na_if()
instead
-add_totals_row()
and add_totals_col()
- replaced by the single function adorn_totals()
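As a rough illustration of the two dplyr replacements named in this list, the sketch below uses made-up vectors (`primary`, `backup`, `scores`); only dplyr::coalesce() and dplyr::na_if() come from the text above.

```r
library(dplyr)

# Hypothetical inputs for illustration
primary <- c("a", NA, "c")
backup  <- c("x", "b", "y")

# coalesce() takes the first non-missing value at each position,
# which is what use_first_valid_of() was used for
first_valid <- coalesce(primary, backup)
first_valid
#> [1] "a" "b" "c"

# na_if() turns a sentinel value into NA, covering a common use of convert_to_NA()
scores  <- c(10, -99, 30)
cleaned <- na_if(scores, -99)
cleaned
#> [1] 10 NA 30
```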
-adorn_totals()
and ns_to_percents()
can now be called on data.frames that have non-numeric columns beyond the first one (those columns will be ignored) (#57)
-adorn_totals("col")
retains the factor class in the 1st column if the 1st column of the input data.frame was a factor
-clean_names()
now handles leading spaces (#85)
-adorn_crosstab()
and ns_to_percents()
work on a 2-column data.frame (#89)
-adorn_totals()
now works on a grouped tibble (#97)
-tabyl()
and crosstab()
(#87)
-NA_
column in the result of a crosstab()
will appear at the last column position (#109)
-tabyl()
and crosstab()
now appear in the package manual (#65)
-tabyl()
and crosstab()
failed to retain ill-formatted variable names only when using R 3.2.5 for Windows (#76)
-add_totals_row()
works on two-column data.frame (#69)
-use_first_valid_of()
returns a POSIXct-class result when given POSIXct inputs
-Submitted to CRAN!
-mtcars %>% tabyl(mpg) %>% tabyl(n)
(#54)
-get_dupes()
now works on variables with spaces in column names (#62)
-adorn_crosstab()
that formats the results of a crosstab()
for pretty printing. Shows % and N in the same cell, with the % symbol, user-specified rounding (method and number of digits), and the option to include a totals row and/or column. E.g., mtcars %>% crosstab(cyl, gear) %>% adorn_crosstab()
.
-crosstab()
can be called in a %>%
pipeline, e.g., mtcars %>% crosstab(cyl, gear)
. Thanks to @chrishaid (#34)
-tabyl()
can also be called in a %>%
pipeline, e.g., mtcars %>% tabyl(cyl)
(#35)
-use_first_valid_of()
function (#32)
-ns_to_percents()
, add_totals_row()
, add_totals_col()
,
-crosstab()
returns 0 instead of NA when there are no instances of a variable combination.
-tabyl(df$vecname)
retains the more-descriptive $
symbol in the column name of the result - if you want a legal R name in the result, call it as df %>% tabyl(vecname)
-clean_names()
---Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
-– “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insight” - The New York Times, 2014
-
janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.
-The main janitor functions:
-table()
; and
-The tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.
-janitor is a #tidyverse-oriented package. Specifically, it plays nicely with the %>%
pipe and is optimized for cleaning data brought in with the readr and readxl packages.
You can install:
-the latest released version from CRAN with
- -the latest development version from GitHub with
- -Below are quick examples of how janitor tools are commonly used. A full description of each function can be found in janitor’s catalog of functions vignette.
-Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx:
Dirtiness includes:
-Here’s that data after being read in to R:
-library(pacman) # for loading packages
-p_load(readxl, janitor, dplyr, here)
-
-roster_raw <- read_excel(here("dirty_data.xlsx")) # available at http://github.com/sfirke/janitor
-glimpse(roster_raw)
-#> Observations: 13
-#> Variables: 11
-#> $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", N...
-#> $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lam...
-#> $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Tea...
-#> $ Subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English",...
-#> $ `Hire Date` <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 347...
-#> $ `% Allocated` <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80
-#> $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", ...
-#> $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
-#> $ Certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science ...
-#> $ Certification__1 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", N...
-#> $ Certification__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
Excel formatting led to an untitled empty column and 5 empty rows at the bottom of the table (only 12 records have any actual data). Bad column names are preserved.
-Clean it with janitor functions:
-roster <- roster_raw %>%
- clean_names() %>%
- remove_empty(c("rows", "cols")) %>%
- mutate(hire_date = excel_numeric_to_date(hire_date),
- cert = coalesce(certification, certification_1)) %>% # from dplyr
- select(-certification, -certification_1) # drop unwanted columns
-
-roster
-#> # A tibble: 12 x 8
-#> first_name last_name employee_status subject hire_date percent_allocated full_time cert
-#> <chr> <chr> <chr> <chr> <date> <dbl> <chr> <chr>
-#> 1 Jason Bourne Teacher PE 2008-08-30 0.750 Yes Physical ed
-#> 2 Jason Bourne Teacher Drafting 2008-08-30 0.250 Yes Physical ed
-#> 3 Alicia Keys Teacher Music 2001-08-15 1.00 Yes Instr. music
-#> 4 Ada Lovelace Teacher <NA> 1975-05-01 1.00 Yes PENDING
-#> 5 Desus Nice Administration Dean 2013-06-06 1.00 Yes PENDING
-#> 6 Chien-Shiung Wu Teacher Physics 1930-03-20 0.500 Yes Science 6-12
-#> 7 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.500 Yes Science 6-12
-#> 8 James Joyce Teacher English 1990-05-01 0.500 No English 6-12
-#> 9 Hedy Lamarr Teacher Science 1976-06-08 0.500 No PENDING
-#> 10 Carlos Boozer Coach Basketball 2015-08-05 NA No Physical ed
-#> 11 Young Boozer Coach <NA> 1995-01-01 NA No Political sci.
-#> 12 Micheal Larsen Teacher English 2009-09-15 0.800 No Vocal music
The core janitor cleaning function is clean_names()
- call it whenever you load data into R.
Use get_dupes()
to identify and examine duplicate records during data cleaning. Let’s see if any teachers are listed more than once:
roster %>% get_dupes(first_name, last_name)
-#> # A tibble: 4 x 9
-#> first_name last_name dupe_count employee_status subject hire_date percent_allocated full_time cert
-#> <chr> <chr> <int> <chr> <chr> <date> <dbl> <chr> <chr>
-#> 1 Chien-Shiung Wu 2 Teacher Physics 1930-03-20 0.500 Yes Science…
-#> 2 Chien-Shiung Wu 2 Teacher Chemistry 1930-03-20 0.500 Yes Science…
-#> 3 Jason Bourne 2 Teacher PE 2008-08-30 0.750 Yes Physica…
-#> 4 Jason Bourne 2 Teacher Drafting 2008-08-30 0.250 Yes Physica…
Yes, some teachers appear twice. We ought to address this before counting employees.
-A variable (or combinations of two or three variables) can be tabulated with tabyl()
. The resulting data.frame can be tweaked and formatted with the suite of adorn_
functions for quick analysis and printing of pretty results in a report. adorn_
functions can be helpful with non-tabyls, too.
tabyl
can be called two ways:
tabyl(roster$subject)
-roster %>% tabyl(subject, employee_status)
.
-%>%
pipe; this allows for dplyr commands earlier in the pipeline
-Like table()
, but pipe-able, data.frame-based, and fully featured.
One variable:
-roster %>%
- tabyl(subject)
-#> subject n percent valid_percent
-#> Basketball 1 0.08333333 0.1
-#> Chemistry 1 0.08333333 0.1
-#> Dean 1 0.08333333 0.1
-#> Drafting 1 0.08333333 0.1
-#> English 2 0.16666667 0.2
-#> Music 1 0.08333333 0.1
-#> PE 1 0.08333333 0.1
-#> Physics 1 0.08333333 0.1
-#> Science 1 0.08333333 0.1
-#> <NA> 2 0.16666667 NA
Two variables:
-roster %>%
- filter(hire_date > as.Date("1950-01-01")) %>%
- tabyl(employee_status, full_time)
-#> employee_status No Yes
-#> Administration 0 1
-#> Coach 2 0
-#> Teacher 3 4
Three variables:
-roster %>%
- tabyl(full_time, subject, employee_status)
-#> $Administration
-#> full_time Basketball Chemistry Dean Drafting English Music PE Physics Science
-#> No 0 0 0 0 0 0 0 0 0
-#> Yes 0 0 1 0 0 0 0 0 0
-#>
-#> $Coach
-#> full_time Basketball Chemistry Dean Drafting English Music PE Physics Science NA_
-#> No 1 0 0 0 0 0 0 0 0 1
-#> Yes 0 0 0 0 0 0 0 0 0 0
-#>
-#> $Teacher
-#> full_time Basketball Chemistry Dean Drafting English Music PE Physics Science NA_
-#> No 0 0 0 0 2 0 0 0 1 0
-#> Yes 0 1 0 1 0 1 1 1 0 1
The suite of adorn_
functions dress up the results of these tabulation calls for fast, basic reporting. Here are some of the functions that augment a summary table for reporting:
roster %>%
- tabyl(employee_status, full_time) %>%
- adorn_totals("row") %>%
- adorn_percentages("row") %>%
- adorn_pct_formatting() %>%
- adorn_ns() %>%
- adorn_title("combined")
-#> employee_status/full_time No Yes
-#> Administration 0.0% (0) 100.0% (1)
-#> Coach 100.0% (2) 0.0% (0)
-#> Teacher 33.3% (3) 66.7% (6)
-#> Total 41.7% (5) 58.3% (7)
Pipe that right into knitr::kable()
in your RMarkdown report!
These modular adornments can be layered to reduce R’s deficit against Excel and SPSS when it comes to quick, informative counts.
-You are welcome to:
-The janitor functions expedite the initial data exploration and cleaning that comes with any new data set. This catalog describes the usage for each function.
-Functions for everyday use.
-clean_names()
-Call this function every time you read data.
-It works in a %>%
pipeline, and handles problematic variable names, especially those that are so well preserved by readxl::read_excel()
and readr::read_csv()
.
_
as a separator
-# Load dplyr for the %>% pipe
-library(dplyr)
-# Create a data.frame with dirty names
-test_df <- as.data.frame(matrix(ncol = 6))
-names(test_df) <- c("hIgHlo", "REPEAT VALUE", "REPEAT VALUE",
- "% successful (2009)", "abc@!*", "")
Clean the variable names, returning a data.frame:
-test_df %>%
- clean_names()
-#> h_ig_hlo repeat_value repeat_value_2 percent_successful_2009 abc x
-#> 1 NA NA NA NA NA NA
Compare to what base R produces:
-make.names(names(test_df))
-#> [1] "hIgHlo" "REPEAT.VALUE" "REPEAT.VALUE"
-#> [4] "X..successful..2009." "abc..." "X"
tabyl()
- a better version of table()
-tabyl()
is a tidyverse-oriented replacement for table()
. It counts combinations of one, two, or three variables, and then can be formatted with a suite of adorn_*
functions to look just how you want. For instance:
mtcars %>%
- tabyl(gear, cyl) %>%
- adorn_totals("col") %>%
- adorn_percentages("row") %>%
- adorn_pct_formatting(digits = 2) %>%
- adorn_ns()
-#> gear 4 6 8 Total
-#> 3 6.67% (1) 13.33% (2) 80.00% (12) 100.00% (15)
-#> 4 66.67% (8) 33.33% (4) 0.00% (0) 100.00% (12)
-#> 5 40.00% (2) 20.00% (1) 40.00% (2) 100.00% (5)
Learn more in the tabyls vignette.
-get_dupes()
-This is for hunting down and examining duplicate records during data cleaning - usually when there shouldn’t be any.
-For example, in a tidy data frame you might expect to have a unique ID repeated for each year, and year repeated for each unique ID, but no duplicated pairs of unique ID & year. Say you want to check for their presence, and study any such duplicated records.
-get_dupes()
returns the records (and inserts a count of duplicates) so you can sleuth out the problematic cases:
get_dupes(mtcars, wt, cyl) # or mtcars %>% get_dupes(wt, cyl) if you prefer to pipe
-#> # A tibble: 4 x 12
-#> wt cyl dupe_… mpg disp hp drat qsec vs am gear carb
-#> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
-#> 1 3.44 6.00 2 19.2 168 123 3.92 18.3 1.00 0 4.00 4.00
-#> 2 3.44 6.00 2 17.8 168 123 3.92 18.9 1.00 0 4.00 4.00
-#> 3 3.57 8.00 2 14.3 360 245 3.21 15.8 0 0 3.00 4.00
-#> 4 3.57 8.00 2 15.0 301 335 3.54 14.6 0 1.00 5.00 8.00
Smaller functions for use in particular situations. More human-readable than the equivalent code they replace.
-excel_numeric_to_date()
-Ever load data from Excel and see a value like 42223
where a date should be? This function converts those serial numbers to class Date
, and contains an option for specifying the alternate date system for files created with Excel for Mac 2008 and earlier versions (which count from a different starting point).
excel_numeric_to_date(41103)
-#> [1] "2012-07-13"
-excel_numeric_to_date(41103, date_system = "mac pre-2011")
-#> [1] "2016-07-14"
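The conversion can be approximated in base R, which may help show what the date_system option changes. This is a sketch under the assumption that the two systems differ only in their origin date (1899-12-30 for modern Excel, 1904-01-01 for the older Mac system); it is not janitor's own implementation.

```r
serial <- 41103

# Modern Excel date system: serial numbers count days from 1899-12-30
modern <- as.Date(serial, origin = "1899-12-30")

# Older Excel-for-Mac date system: serial numbers count days from 1904-01-01
mac_1904 <- as.Date(serial, origin = "1904-01-01")

format(modern)
#> [1] "2012-07-13"
format(mac_1904)
#> [1] "2016-07-14"
```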
remove_empty_cols()
and remove_empty_rows()
-One-line wrapper functions that do what they say. For cases like cleaning Excel files containing empty rows and columns.
-q <- data.frame(v1 = c(1, NA, 3),
- v2 = c(NA, NA, NA),
- v3 = c("a", NA, "b"))
-q %>%
- remove_empty_cols() %>%
- remove_empty_rows()
-#> v1 v3
-#> 1 1 a
-#> 3 3 b
top_levels()
-Originally designed for use with Likert survey data stored as factors. Returns a tbl_df
frequency table with appropriately-named rows, grouped into head/middle/tail groups.
NA
values.
-f <- factor(c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"),
- levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree"))
-top_levels(f)
-#> f n percent
-#> strongly agree, agree 3 0.5000000
-#> neutral 2 0.3333333
-#> disagree, strongly disagree 1 0.1666667
-top_levels(f, n = 1)
-#> f n percent
-#> strongly agree 2 0.3333333
-#> agree, neutral, disagree 4 0.6666667
-#> strongly disagree 0 0.0000000
The janitor functions expedite the initial data exploration and cleaning that comes with any new data set. This catalog describes the usage for each function.
-Functions for everyday use.
-clean_names()
-Call this function every time you read data.
-It works in a %>%
pipeline, and handles problematic variable names, especially those that are so well-preserved by readxl::read_excel()
and readr::read_csv()
.
œ
to oe
.
-# Create a data.frame with dirty names
-test_df <- as.data.frame(matrix(ncol = 6))
-names(test_df) <- c("firstName", "ábc@!*", "% successful (2009)",
-                    "REPEAT VALUE", "REPEAT VALUE", "")
Clean the variable names, returning a data.frame:
-test_df %>%
-  clean_names()
-#>   first_name abc percent_successful_2009 repeat_value repeat_value_2  x
-#> 1         NA  NA                      NA           NA             NA NA
Compare to what base R produces:
-make.names(names(test_df))
-#> [1] "firstName"            "ábc..."               "X..successful..2009."
-#> [4] "REPEAT.VALUE"         "REPEAT.VALUE"         "X"
This function is powered by the underlying exported function make_clean_names()
, which accepts and returns a character vector of names (see below). This allows for cleaning the names of any object, not just a data.frame. clean_names()
is retained for its convenience in piped workflows, and can be called on an sf
simple features object or a tbl_graph
tidygraph object in addition to a data.frame.
compare_df_cols()
-For cases when you are given a set of data files that should be identical, and you wish to read and combine them for analysis. But then dplyr::bind_rows()
or rbind()
fails, because of different columns or because the column classes don’t match across data.frames.
compare_df_cols()
takes unquoted names of data.frames / tibbles, or a list of data.frames, and returns a summary of how they compare. See what the column types are, which are missing or present in the different inputs, and how column types differ.
df1 <- data.frame(a = 1:2, b = c("big", "small")) # b is a factor by default in R < 4.0.0
-df2 <- data.frame(a = 10:12, b = c("medium", "small", "big"), c = 0, stringsAsFactors = FALSE)
-df3 <- df1 %>%
-  dplyr::mutate(b = as.character(b))
-
-compare_df_cols(df1, df2, df3)
-#>   column_name     df1       df2       df3
-#> 1           a integer   integer   integer
-#> 2           b  factor character character
-#> 3           c    <NA>   numeric      <NA>
-
-compare_df_cols(df1, df2, df3, return = "mismatch")
-#>   column_name    df1       df2       df3
-#> 1           b factor character character
-compare_df_cols(df1, df2, df3, return = "mismatch", bind_method = "rbind") # default is dplyr::bind_rows
-#>   column_name    df1       df2       df3
-#> 1           b factor character character
-#> 2           c   <NA>   numeric      <NA>
compare_df_cols_same()
returns TRUE
or FALSE
indicating if the data.frames can be successfully row-bound with the given binding method:
compare_df_cols_same(df1, df3)
-#>   column_name    ..1       ..2
-#> 1           b factor character
-#> [1] FALSE
-compare_df_cols_same(df2, df3)
-#> [1] TRUE
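To make the comparison idea concrete, here is a base R sketch that collects the first class of each column across two data.frames and flags the columns whose classes differ or are missing. The helper `col_classes()` is hypothetical and is not how janitor implements compare_df_cols(); stringsAsFactors is set explicitly so the result does not depend on the R version.

```r
df1 <- data.frame(a = 1:2, b = c("big", "small"), stringsAsFactors = TRUE)
df2 <- data.frame(a = 10:12, b = c("medium", "small", "big"), c = 0,
                  stringsAsFactors = FALSE)

# Hypothetical helper: first class of every column, as a named character vector
col_classes <- function(df) vapply(df, function(col) class(col)[1], character(1))

all_cols <- union(names(df1), names(df2))

# Indexing by name yields NA where a data.frame lacks the column
comparison <- data.frame(
  column_name = all_cols,
  df1 = unname(col_classes(df1)[all_cols]),
  df2 = unname(col_classes(df2)[all_cols]),
  stringsAsFactors = FALSE
)

# Keep rows where the classes differ or a column is absent from one input
differs  <- comparison$df1 != comparison$df2
mismatch <- comparison[is.na(differs) | differs, ]

comparison
mismatch
```

Here `mismatch` keeps column b (factor versus character) and column c (absent from df1), mirroring the stricter rbind-style comparison shown above.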
tabyl()
- a better version of table()
-tabyl()
is a tidyverse-oriented replacement for table()
. It counts combinations of one, two, or three variables, and then can be formatted with a suite of adorn_*
functions to look just how you want. For instance:
mtcars %>%
-  tabyl(gear, cyl) %>%
-  adorn_totals("col") %>%
-  adorn_percentages("row") %>%
-  adorn_pct_formatting(digits = 2) %>%
-  adorn_ns() %>%
-  adorn_title()
-#>             cyl
-#>  gear          4          6           8        Total
-#>     3  6.67% (1) 13.33% (2) 80.00% (12) 100.00% (15)
-#>     4 66.67% (8) 33.33% (4)   0.00% (0) 100.00% (12)
-#>     5 40.00% (2) 20.00% (1)  40.00% (2)  100.00% (5)
Learn more in the tabyls vignette.
-get_dupes()
-This is for hunting down and examining duplicate records during data cleaning - usually when there shouldn’t be any.
-For example, in a tidy data.frame you might expect to have a unique ID repeated for each year, but no duplicated pairs of unique ID & year. Say you want to check for and study any such duplicated records.
-get_dupes()
returns the records (and inserts a count of duplicates) so you can examine the problematic cases:
get_dupes(mtcars, wt, cyl) # or mtcars %>% get_dupes(wt, cyl) if you prefer to pipe
-#>                 wt cyl dupe_count  mpg  disp  hp drat  qsec vs am gear
-#> Merc 280      3.44   6          2 19.2 167.6 123 3.92 18.30  1  0    4
-#> Merc 280C     3.44   6          2 17.8 167.6 123 3.92 18.90  1  0    4
-#> Duster 360    3.57   8          2 14.3 360.0 245 3.21 15.84  0  0    3
-#> Maserati Bora 3.57   8          2 15.0 301.0 335 3.54 14.60  0  1    5
-#>               carb
-#> Merc 280         4
-#> Merc 280C        4
-#> Duster 360       4
-#> Maserati Bora    8
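For readers who want to see what such a check involves, a roughly equivalent result can be sketched with plain dplyr verbs. This is an approximation for illustration, not janitor's implementation; the column name dupe_count is borrowed from the output above.

```r
library(dplyr)

dupes <- mtcars %>%
  add_count(wt, cyl, name = "dupe_count") %>%  # count rows per wt/cyl combination
  filter(dupe_count > 1) %>%                   # keep only combinations seen more than once
  arrange(wt, cyl)

nrow(dupes)
#> [1] 4
```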
Smaller functions for use in particular situations. More human-readable than the equivalent code they replace.
-make_clean_names()
-Like base R’s make.names()
, but with the stylings and case choice of the long-time janitor function clean_names()
. While clean_names()
is still offered for use in data.frame pipeline with %>%
, make_clean_names()
allows for more general usage, e.g., on a vector.
It can also be used as an argument to .name_repair
in the newest version of tibble::as_tibble
:
tibble::as_tibble(iris, .name_repair = janitor::make_clean_names)
-#> New names:
-#> * Sepal.Length -> sepal_length
-#> * Sepal.Width -> sepal_width
-#> * Petal.Length -> petal_length
-#> * Petal.Width -> petal_width
-#> * Species -> species
-#> # A tibble: 150 x 5
-#>    sepal_length sepal_width petal_length petal_width species
-#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>
-#>  1          5.1         3.5          1.4         0.2 setosa
-#>  2          4.9         3            1.4         0.2 setosa
-#>  3          4.7         3.2          1.3         0.2 setosa
-#>  4          4.6         3.1          1.5         0.2 setosa
-#>  5          5           3.6          1.4         0.2 setosa
-#>  6          5.4         3.9          1.7         0.4 setosa
-#>  7          4.6         3.4          1.4         0.3 setosa
-#>  8          5           3.4          1.5         0.2 setosa
-#>  9          4.4         2.9          1.4         0.2 setosa
-#> 10          4.9         3.1          1.5         0.1 setosa
-#> # … with 140 more rows
remove_empty()
rows and columns
-Does what it says. For cases like cleaning Excel files that contain empty rows and columns after being read into R.
-q <- data.frame(v1 = c(1, NA, 3),
-                v2 = c(NA, NA, NA),
-                v3 = c("a", NA, "b"))
-q %>%
-  remove_empty(c("rows", "cols"))
-#>   v1 v3
-#> 1  1  a
-#> 3  3  b
Just a simple wrapper for one-line functions, but it saves a little thinking for both the code writer and the reader.
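The one-liners being wrapped might look something like the base R below. This is a sketch of the idea only; janitor's actual implementation may differ.

```r
q <- data.frame(v1 = c(1, NA, 3),
                v2 = c(NA, NA, NA),
                v3 = c("a", NA, "b"))

# A row or column counts as empty when every value in it is NA
keep_rows <- rowSums(!is.na(q)) > 0
keep_cols <- colSums(!is.na(q)) > 0

cleaned <- q[keep_rows, keep_cols, drop = FALSE]
cleaned
```

The result keeps rows 1 and 3 and columns v1 and v3, matching the remove_empty() output above.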
-remove_constant()
columns
-Drops columns from a data.frame that contain only a single constant value (with an na.rm
option to control whether NAs should be considered as different values from the constant).
remove_constant
and remove_empty
work on matrices as well as data.frames.
a <- data.frame(good = 1:3, boring = "the same")
-a %>% remove_constant()
-#>   good
-#> 1    1
-#> 2    2
-#> 3    3
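The underlying check can be sketched in base R as follows. The name `is_constant` is hypothetical, and this simplified version treats NA as just another value rather than offering an na.rm switch.

```r
a <- data.frame(good = 1:3, boring = "the same")

# A column is constant when it holds at most one distinct value
is_constant <- vapply(a, function(col) length(unique(col)) <= 1, logical(1))

kept <- a[, !is_constant, drop = FALSE]
names(kept)
#> [1] "good"
```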
round_half_up()
-R uses “banker’s rounding”, i.e., halves are rounded to the nearest even number. This function, an exact implementation of https://stackoverflow.com/questions/12688717/round-up-from-5/12688836#12688836, will round all halves up. Compare:
-nums <- c(2.5, 3.5)
-round(nums)
-#> [1] 2 4
-round_half_up(nums)
-#> [1] 3 4
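One way to get this behavior in base R is sketched below. The function name `half_up_sketch` is hypothetical, and the code follows the general approach of scaling, adding 0.5, and truncating; treat it as an illustration rather than the package source. Note that, written this way, negative halves round away from zero.

```r
half_up_sketch <- function(x, digits = 0) {
  scaled <- abs(x) * 10^digits
  # The tiny epsilon guards against halves stored slightly below .5 in floating point
  rounded <- trunc(scaled + 0.5 + sqrt(.Machine$double.eps))
  sign(x) * rounded / 10^digits
}

half_up_sketch(c(2.5, 3.5))
#> [1] 3 4
half_up_sketch(-2.5)
#> [1] -3
```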
round_to_fraction()
-Say your data should only have values of quarters: 0, 0.25, 0.5, 0.75, 1, etc. But there are either user-entered bad values like 0.2
or floating-point precision problems like 0.25000000001
. round_to_fraction()
will enforce the desired fractional distribution by rounding the values to the nearest value given the specified denominator.
There’s also a digits
argument for optional subsequent rounding.
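The arithmetic behind this kind of rounding can be sketched in base R: multiply by the denominator, round to a whole number, and divide back. The vector `x` is made up for illustration, and base round() is used here, so exact halves follow R's default rounding rather than whatever rule the package applies.

```r
x <- c(0.2, 0.25000000001, 0.7)
denominator <- 4

# Snap each value to the nearest multiple of 1/denominator
snapped <- round(x * denominator) / denominator
snapped
#> [1] 0.25 0.25 0.75
```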
excel_numeric_to_date()
-Ever load data from Excel and see a value like 42223
where a date should be? This function converts those serial numbers to class Date
, with options for different Excel date encoding systems, preserving fractions of a date as time (in which case the returned value is of class POSIXlt
), and specifying a time zone.
excel_numeric_to_date(41103)
-#> [1] "2012-07-13"
-excel_numeric_to_date(41103.01) # ignores decimal places, returns Date object
-#> [1] "2012-07-13"
-excel_numeric_to_date(41103.01, include_time = TRUE) # returns POSIXlt object
-#> [1] "2012-07-13 01:14:24 EDT"
-excel_numeric_to_date(41103.01, date_system = "mac pre-2011")
-#> [1] "2016-07-14"
Building on excel_numeric_to_date()
, the new functions convert_to_date()
and convert_to_datetime()
are more robust to a mix of inputs. Handy when reading many spreadsheets that should have the same column formats, but don’t.
For instance, here a vector with a date and an Excel datetime sees both values successfully converted to Date class:
-convert_to_date(c("2020-02-29", "40000.1"))
-#> [1] "2020-02-29" "2009-07-06"
If a data.frame has the intended variable names stored in one of its rows, row_to_names
will elevate the specified row to become the names of the data.frame and optionally (by default) remove the row in which names were stored and/or the rows above it.
dirt <- data.frame(X_1 = c(NA, "ID", 1:3),
-                   X_2 = c(NA, "Value", 4:6))
-
-row_to_names(dirt, 2)
-#>   ID Value
-#> 3  1     4
-#> 4  2     5
-#> 5  3     6
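The same promotion can be sketched by hand in base R, which shows what is being automated. The variable names `header_row` and `promoted` are hypothetical, and this version always drops the header row and everything above it.

```r
dirt <- data.frame(X_1 = c(NA, "ID", 1:3),
                   X_2 = c(NA, "Value", 4:6),
                   stringsAsFactors = FALSE)

header_row <- 2

# Drop the header row and the rows above it, then reuse its values as names
promoted <- dirt[-seq_len(header_row), , drop = FALSE]
names(promoted) <- as.character(unlist(dirt[header_row, ]))

promoted
```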
top_levels()
-Originally designed for use with Likert survey data stored as factors. Returns a tbl_df
frequency table with appropriately-named rows, grouped into head/middle/tail groups.
NA
values.
-f <- factor(c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"),
-            levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree"))
-top_levels(f)
-#>                            f n   percent
-#>        strongly agree, agree 3 0.5000000
-#>                      neutral 2 0.3333333
-#>  disagree, strongly disagree 1 0.1666667
-top_levels(f, n = 1)
-#>                         f n   percent
-#>            strongly agree 2 0.3333333
-#>  agree, neutral, disagree 4 0.6666667
-#>         strongly disagree 0 0.0000000
vignettes/tabyls.Rmd
- tabyls.Rmd
Analysts do a lot of counting. Indeed, it’s been said that “data science is mostly counting things.” But the base R function for counting, table()
, leaves much to be desired:
%>%
pipe)
-tabyl()
is an approach to tabulating variables that addresses these shortcomings. It’s part of the janitor package because counting is such a fundamental part of data cleaning and exploration.
tabyl()
is tidyverse-aligned and is primarily built upon the dplyr and tidyr packages.
On its surface, tabyl()
produces frequency tables using 1, 2, or 3 variables. Under the hood, tabyl()
also attaches a copy of these counts as an attribute of the resulting data.frame.
The result looks like a basic data.frame of counts, but because it’s also a tabyl
containing this metadata, you can use adorn_
functions to add additional information and pretty formatting.
The adorn_
functions are built to work on tabyls
, but have been adapted to work with similar, non-tabyl data.frames that need formatting.
This vignette demonstrates tabyl
in the context of studying humans in the starwars
dataset from dplyr:
Tabulating a single variable is the simplest kind of tabyl:
-library(janitor)
-#>
-#> Attaching package: 'janitor'
-#> The following objects are masked from 'package:stats':
-#>
-#>     chisq.test, fisher.test
-
-t1 <- humans %>%
-  tabyl(eye_color)
-
-t1
-#>  eye_color  n    percent
-#>       blue 12 0.34285714
-#>  blue-gray  1 0.02857143
-#>      brown 17 0.48571429
-#>       dark  1 0.02857143
-#>      hazel  2 0.05714286
-#>     yellow  2 0.05714286
When NA
values are present, tabyl()
also displays “valid” percentages, i.e., with missing values removed from the denominator. And while tabyl()
is built to take a data.frame and column names, you can also produce a one-way tabyl by calling it directly on a vector:
x <- c("big", "big", "small", "small", "small", NA)
-tabyl(x)
-#>      x n   percent valid_percent
-#>    big 2 0.3333333           0.4
-#>  small 3 0.5000000           0.6
-#>   <NA> 1 0.1666667            NA
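The difference between the two percentage columns comes down to the denominator, which a few lines of base R can make explicit. This is a hand-rolled sketch for illustration, not how tabyl() computes its output.

```r
x <- c("big", "big", "small", "small", "small", NA)

# Counts per value, with missing values counted separately
n <- c(big = sum(x %in% "big"), small = sum(x %in% "small"), missing = sum(is.na(x)))

# "percent" divides by all observations, including NA
percent <- n / length(x)

# "valid_percent" divides by the non-missing observations only
valid_percent <- n[c("big", "small")] / sum(!is.na(x))

percent
valid_percent
```

With six observations and one NA, "big" is 2/6 of all values but 2/5 of the valid ones, which is the 0.3333333 versus 0.4 seen in the output above.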
Most adorn_
helper functions are built for 2-way tabyls, but those that make sense for a 1-way tabyl do work:
t1 %>%
-  adorn_totals("row") %>%
-  adorn_pct_formatting()
-#>  eye_color  n percent
-#>       blue 12   34.3%
-#>  blue-gray  1    2.9%
-#>      brown 17   48.6%
-#>       dark  1    2.9%
-#>      hazel  2    5.7%
-#>     yellow  2    5.7%
-#>      Total 35  100.0%
This is often called a “crosstab” or “contingency” table. Calling tabyl
on two columns of a data.frame produces the same result as the common combination of dplyr::count()
, followed by tidyr::pivot_wider()
to wide form:
t2 <- humans %>%
-  tabyl(gender, eye_color)
-
-t2
-#>     gender blue blue-gray brown dark hazel yellow
-#>   feminine    3         0     5    0     1      0
-#>  masculine    9         1    12    1     1      2
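For comparison, the count-then-pivot combination mentioned above might be written as follows. The sketch uses the built-in mtcars data rather than the humans subset, and it assumes a tidyr version in which values_fill accepts a scalar.

```r
library(dplyr)
library(tidyr)

wide <- mtcars %>%
  count(gear, cyl) %>%                  # one row per gear/cyl combination
  pivot_wider(names_from = cyl,         # spread cyl values into columns
              values_from = n,
              values_fill = 0)          # combinations that never occur become 0

wide
```

The counts should match the two-way mtcars tabyl shown earlier in these docs (for example, 12 eight-cylinder cars with 3 gears), though the result is a plain tibble without the tabyl attributes that the adorn_ functions rely on.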
Since it’s a tabyl
, we can enhance it with adorn_
helper functions. For instance:
-t2 %>%
-  adorn_percentages("row") %>%
-  adorn_pct_formatting(digits = 2) %>%
-  adorn_ns()
-#>     gender       blue blue-gray       brown      dark      hazel    yellow
-#>   feminine 33.33% (3) 0.00% (0)  55.56% (5) 0.00% (0) 11.11% (1) 0.00% (0)
-#>  masculine 34.62% (9) 3.85% (1) 46.15% (12) 3.85% (1)  3.85% (1) 7.69% (2)
Adornments have options to control axes, rounding, and other relevant formatting choices (more on that below).
-Just as table()
accepts three variables, so does tabyl()
, producing a list of tabyls:
t3 <- humans %>% - tabyl(eye_color, skin_color, gender) - -# the result is a tabyl of eye color x skin color, split into a list by gender -t3 -#> $feminine -#> eye_color dark fair light pale tan white -#> blue 0 2 1 0 0 0 -#> blue-gray 0 0 0 0 0 0 -#> brown 0 1 4 0 0 0 -#> dark 0 0 0 0 0 0 -#> hazel 0 0 1 0 0 0 -#> yellow 0 0 0 0 0 0 -#> -#> $masculine -#> eye_color dark fair light pale tan white -#> blue 0 7 2 0 0 0 -#> blue-gray 0 1 0 0 0 0 -#> brown 3 4 3 0 2 0 -#> dark 1 0 0 0 0 0 -#> hazel 0 1 0 0 0 0 -#> yellow 0 0 0 1 0 1
If the adorn_
helper functions are called on a list of data.frames - like the output of a three-way tabyl
call - they will call purrr::map()
to apply themselves to each data.frame in the list:
library(purrr) -humans %>% - tabyl(eye_color, skin_color, gender, show_missing_levels = FALSE) %>% - adorn_totals("row") %>% - adorn_percentages("all") %>% - adorn_pct_formatting(digits = 1) %>% - adorn_ns %>% - adorn_title -#> $feminine -#> skin_color -#> eye_color fair light -#> blue 22.2% (2) 11.1% (1) -#> brown 11.1% (1) 44.4% (4) -#> hazel 0.0% (0) 11.1% (1) -#> Total 33.3% (3) 66.7% (6) -#> -#> $masculine -#> skin_color -#> eye_color dark fair light pale tan white -#> blue 0.0% (0) 26.9% (7) 7.7% (2) 0.0% (0) 0.0% (0) 0.0% (0) -#> blue-gray 0.0% (0) 3.8% (1) 0.0% (0) 0.0% (0) 0.0% (0) 0.0% (0) -#> brown 11.5% (3) 15.4% (4) 11.5% (3) 0.0% (0) 7.7% (2) 0.0% (0) -#> dark 3.8% (1) 0.0% (0) 0.0% (0) 0.0% (0) 0.0% (0) 0.0% (0) -#> hazel 0.0% (0) 3.8% (1) 0.0% (0) 0.0% (0) 0.0% (0) 0.0% (0) -#> yellow 0.0% (0) 0.0% (0) 0.0% (0) 3.8% (1) 0.0% (0) 3.8% (1) -#> Total 15.4% (4) 50.0% (13) 19.2% (5) 3.8% (1) 7.7% (2) 3.8% (1)
This automatic mapping supports interactive data analysis that switches between combinations of 2 and 3 variables. That way, if a user starts with humans %>% tabyl(eye_color, skin_color)
, adds some adorn_
calls, then decides to split the tabulation by gender and modifies their first line to humans %>% tabyl(eye_color, skin_color, gender)
, they don’t have to rewrite the subsequent adornment calls to use map()
.
However, if it feels more natural to call these with map()
or lapply()
, that is still supported. For instance, t3 %>% lapply(adorn_percentages)
would produce the same result as t3 %>% adorn_percentages
.
tabyl will show missing levels (levels not present in the data) in the result
NA values can be displayed or suppressed
tabyls print without displaying row numbers

You can call chisq.test()
and fisher.test()
on a two-way tabyl to perform those statistical tests, just like on a base R table()
object.
adorn_*
functions

These modular functions build on a tabyl
to approximate the functionality of a PivotTable in Microsoft Excel. They print elegant results for interactive analysis or for sharing in a report, e.g., with knitr::kable()
. For example:
humans %>%
  tabyl(gender, eye_color) %>%
  adorn_totals(c("row", "col")) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(rounding = "half up", digits = 0) %>%
  adorn_ns() %>%
  adorn_title("combined") %>%
  knitr::kable()
gender/eye_color | blue | blue-gray | brown | dark | hazel | yellow | Total
---|---|---|---|---|---|---|---
feminine | 33% (3) | 0% (0) | 56% (5) | 0% (0) | 11% (1) | 0% (0) | 100% (9)
masculine | 35% (9) | 4% (1) | 46% (12) | 4% (1) | 4% (1) | 8% (2) | 100% (26)
Total | 34% (12) | 3% (1) | 49% (17) | 3% (1) | 6% (2) | 6% (2) | 100% (35)
adorn_totals(): Add totals row, column, or both.
adorn_percentages(): Calculate percentages along either axis or over the entire tabyl.
adorn_pct_formatting(): Format percentage columns, controlling the number of digits to display and whether to append the % symbol.
adorn_rounding(): Round a data.frame of numbers (usually the result of adorn_percentages), either using the base R round() function or using janitor’s round_half_up() to round all ties up (thanks, StackOverflow). For example, 10.5 rounds up to 11, whereas base R’s round(10.5) returns 10. adorn_rounding() returns columns of class numeric, allowing for graphing, sorting, etc. It’s a less-aggressive substitute for adorn_pct_formatting(); these two functions should not be called together.
adorn_ns(): Add Ns to a tabyl. These can be drawn from the tabyl’s underlying counts, which are attached to the tabyl as metadata, or they can be supplied by the user.
adorn_title(): Add a title to a tabyl (or other data.frame). Options include putting the column title in a new row on top of the data.frame or combining the row and column titles in the data.frame’s first name slot.

These adornments should be called in a logical order, e.g., you probably want to add totals before percentages are calculated. In general, call them in the order they appear above.
-You can also call adorn_
functions on other data.frames, not only the results of calls to tabyl()
. E.g., mtcars %>% adorn_totals("col") %>% adorn_percentages("col")
performs as expected, despite mtcars
not being a tabyl
.
This can be handy when you have a data.frame that is not a simple tabulation generated by tabyl
but would still benefit from the adorn_
formatting functions.
A simple example: calculate the proportion of records meeting a certain condition, then format the results.
-percent_above_165_cm <- humans %>% - group_by(gender) %>% - summarise(pct_above_165_cm = mean(height > 165, na.rm = TRUE)) - -percent_above_165_cm %>% - adorn_pct_formatting() -#> # A tibble: 2 x 2 -#> gender pct_above_165_cm -#> <chr> <chr> -#> 1 feminine 12.5% -#> 2 masculine 100.0%
You can control which columns are adorned by using the ...
argument. It accepts the tidyselect helpers. That is, you can specify columns the same way you would using dplyr::select()
.
For instance, say you have a numeric column that should not be included in percentage formatting and you wish to exempt it. Here, only the proportion
column is adorned:
mtcars %>%
  count(cyl, gear) %>%
  rename(proportion = n) %>%
  adorn_percentages("col", na.rm = TRUE, proportion) %>%
  adorn_pct_formatting(,,,proportion) # the commas say to use the default values of the other arguments
#>  cyl gear proportion
#>    4    3       3.1%
#>    4    4      25.0%
#>    4    5       6.2%
#>    6    3       6.2%
#>    6    4      12.5%
#>    6    5       3.1%
#>    8    3      37.5%
#>    8    5       6.2%
Here we specify that only two consecutive numeric columns should be totaled (year
is numeric but should not be included):
cases <- data.frame(
  region = c("East", "West"), # two regions, one per row
  year = 2015,
  recovered = c(125, 87),
  died = c(13, 12),
  stringsAsFactors = FALSE
)

cases %>%
  adorn_totals(c("col", "row"), fill = "-", na.rm = TRUE, name = "Total Cases", recovered:died)
#>       region year recovered died Total Cases
#>         East 2015       125   13         138
#>         West 2015        87   12          99
#>  Total Cases    -       212   25         237
Here’s a more complex example that uses a data.frame of means, not counts. We create a table containing the mean of a 3rd variable when grouped by two other variables, then use adorn_
functions to round the values and append Ns. The first part is pretty straightforward:
library(tidyr) # for spread()
mpg_by_cyl_and_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mpg = mean(mpg)) %>%
  spread(am, mpg)

mpg_by_cyl_and_am
#> # A tibble: 3 x 3
#> # Groups:   cyl [3]
#>     cyl   `0`   `1`
#>   <dbl> <dbl> <dbl>
#> 1     4  22.9  28.1
#> 2     6  19.1  20.6
#> 3     8  15.0  15.4
Now to adorn_
it. Since this is not the result of a tabyl()
call, it doesn’t have the underlying Ns stored in the core
attribute, so we’ll have to supply them:
mpg_by_cyl_and_am %>%
  adorn_rounding() %>%
  adorn_ns(
    ns = mtcars %>% # calculate the Ns on the fly by calling tabyl on the original data
      tabyl(cyl, am)
  ) %>%
  adorn_title("combined", row_name = "Cylinders", col_name = "Is Automatic")
#>   Cylinders/Is Automatic         0        1
#> 1                      4  22.9 (3) 28.1 (8)
#> 2                      6  19.1 (4) 20.6 (3)
#> 3                      8 15.1 (12) 15.4 (2)
If needed, Ns can be manipulated in their own data.frame before they are appended. E.g., if you have a tabyl with values of N in the thousands, you could divide them by 1000, round, and append “k” before inserting them with adorn_ns
.
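As a rough sketch of the "divide by 1000, round, and append k" idea, the base R snippet below reshapes a data.frame of counts before it would be handed to adorn_ns(). The data.frame ns and its column names are made up for illustration, and the final hand-off to adorn_ns() is deliberately left out because it depends on the tabyl being adorned.

```r
# Hypothetical counts in the thousands, laid out like a two-way tabyl
ns <- data.frame(
  region = c("East", "West"),
  urban = c(12400, 8700),
  rural = c(3100, 950),
  stringsAsFactors = FALSE
)

# Divide every count column by 1000, round, and append "k";
# the first (label) column is left untouched.
count_cols <- setdiff(names(ns), "region")
ns_k <- ns
ns_k[count_cols] <- lapply(ns[count_cols], function(n) paste0(round(n / 1000), "k"))
```

A data.frame shaped like ns_k is what the sentence above describes supplying through the ns argument; whether a given janitor version accepts it unchanged is an assumption worth checking against that version's documentation.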
File an issue on GitHub if you have suggestions related to tabyl()
and its adorn_
helpers or encounter problems while using them.
2018-03-17
-On Travis CI:
-0 errors | 0 warnings | 0 notes
-This update to janitor v1.0 involves breaking changes that affect some downstream packages. I advised all downstream package maintainers of these changes on February 21, including providing code that would keep their packages compatible with all versions of janitor.
-I checked 4 reverse dependencies from CRAN.
-These CRAN packages depending on janitor each have 1 warning:
---Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
-– “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insight” - The New York Times, 2014
-
janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.
-The main janitor functions:
-table()
; and

The tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.
-janitor is a #tidyverse-oriented package. Specifically, it plays nicely with the %>%
pipe and is optimized for cleaning data brought in with the readr and readxl packages.
You can install:
-the most recent officially-released version from CRAN with
-install.packages("janitor")
the latest development version from GitHub with
install.packages("devtools")
devtools::install_github("sfirke/janitor")
This marks a major release for janitor, with many new functions and some breaking changes that may affect existing code. Please see the NEWS page to learn more about new capabilities.
-A full description of each function, organized by topic, can be found in janitor’s catalog of functions vignette. There you will find functions not mentioned in this README, like compare_df_cols()
which provides a summary of differences in column names and types when given a set of data.frames.
Below are quick examples of how janitor tools are commonly used.
-Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx:
Dirtiness includes:
-Here’s that data after being read in to R:
-library(readxl); library(janitor); library(dplyr); library(here) - -roster_raw <- read_excel(here("dirty_data.xlsx")) # available at http://github.com/sfirke/janitor -glimpse(roster_raw) -#> Rows: 13 -#> Columns: 11 -#> $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", NA,… -#> $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lamar… -#> $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Teach… -#> $ Subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English", "… -#> $ `Hire Date` <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 34700… -#> $ `% Allocated` <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80 -#> $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", "N… -#> $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA -#> $ Certification...9 <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science 6-… -#> $ Certification...10 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", NA,… -#> $ Certification...11 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
Excel formatting led to an untitled empty column and an empty row in the table (only 12 of the 13 records have any actual data). Bad column names are preserved.
-Name cleaning comes in two flavors. make_clean_names()
operates on character vectors and can be used during data import:
roster_raw_cleaner <- read_excel(here("dirty_data.xlsx"), - .name_repair = make_clean_names) -# Tells read_excel() how to repair repetitive column names, overriding the -# default repair setting -glimpse(roster_raw_cleaner) -#> Rows: 13 -#> Columns: 11 -#> $ first_name <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", NA, "… -#> $ last_name <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lamarr"… -#> $ employee_status <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Teacher… -#> $ subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English", "Sc… -#> $ hire_date <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 34700, … -#> $ percent_allocated <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80 -#> $ full_time <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", "No" -#> $ do_not_edit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA -#> $ certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science 6-12… -#> $ certification_2 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", NA, "… -#> $ certification_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
This can be further cleaned:
-roster <- roster_raw_cleaner %>% - remove_empty(c("rows", "cols")) %>% - mutate(hire_date = excel_numeric_to_date(hire_date), - cert = coalesce(certification, certification_2)) %>% # from dplyr - select(-certification, -certification_2) # drop unwanted columns - -roster -#> # A tibble: 12 x 8 -#> first_name last_name employee_status subject hire_date percent_allocated full_time cert -#> <chr> <chr> <chr> <chr> <date> <dbl> <chr> <chr> -#> 1 Jason Bourne Teacher PE 2008-08-30 0.75 Yes Physical ed -#> 2 Jason Bourne Teacher Drafting 2008-08-30 0.25 Yes Physical ed -#> 3 Alicia Keys Teacher Music 2001-08-15 1 Yes Instr. music -#> 4 Ada Lovelace Teacher <NA> 1975-05-01 1 Yes PENDING -#> 5 Desus Nice Administration Dean 2013-06-06 1 Yes PENDING -#> 6 Chien-Shiung Wu Teacher Physics 1930-03-20 0.5 Yes Science 6-12 -#> 7 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.5 Yes Science 6-12 -#> 8 James Joyce Teacher English 1990-05-01 0.5 No English 6-12 -#> 9 Hedy Lamarr Teacher Science 1976-06-08 0.5 No PENDING -#> 10 Carlos Boozer Coach Basketball 2015-08-05 NA No Physical ed -#> 11 Young Boozer Coach <NA> 1995-01-01 NA No Political sci. -#> 12 Micheal Larsen Teacher English 2009-09-15 0.8 No Vocal music
clean_names()
is a convenience version that can be used for piped data.frame workflows:
data("iris")
names(iris) # before cleaning:
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

iris %>%
  clean_names() %>%
  names() # after cleaning:
#> [1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"
Use get_dupes()
to identify and examine duplicate records during data cleaning. Let’s see if any teachers are listed more than once:
roster %>% get_dupes(contains("name")) -#> # A tibble: 4 x 9 -#> first_name last_name dupe_count employee_status subject hire_date percent_allocat… full_time cert -#> <chr> <chr> <int> <chr> <chr> <date> <dbl> <chr> <chr> -#> 1 Chien-Shiung Wu 2 Teacher Physics 1930-03-20 0.5 Yes Science 6… -#> 2 Chien-Shiung Wu 2 Teacher Chemistry 1930-03-20 0.5 Yes Science 6… -#> 3 Jason Bourne 2 Teacher PE 2008-08-30 0.75 Yes Physical … -#> 4 Jason Bourne 2 Teacher Drafting 2008-08-30 0.25 Yes Physical …
Yes, some teachers appear twice. We ought to address this before counting employees.
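To make the idea of examining duplicates concrete, here is a base R sketch of the underlying technique: flag every row whose key columns occur more than once. The data.frame people is a small made-up roster, and this is only an approximation of what get_dupes() reports (it does not add a dupe_count column).

```r
# Hypothetical roster with a repeated first/last name pair
people <- data.frame(
  first_name = c("Jason", "Jason", "Alicia", "Ada"),
  last_name = c("Bourne", "Bourne", "Keys", "Lovelace"),
  subject = c("PE", "Drafting", "Music", NA),
  stringsAsFactors = FALSE
)

key <- people[c("first_name", "last_name")]
# duplicated() flags only the second and later copies,
# so scan from both ends to catch every member of a duplicate set
is_dupe <- duplicated(key) | duplicated(key, fromLast = TRUE)
dupes <- people[is_dupe, ]
```

Here dupes keeps both Jason Bourne rows, including the non-key subject column, which is what makes it possible to decide how the duplicates should be resolved.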
-A variable (or combinations of two or three variables) can be tabulated with tabyl()
. The resulting data.frame can be tweaked and formatted with the suite of adorn_
functions for quick analysis and printing of pretty results in a report. adorn_
functions can be helpful with non-tabyls, too.
tabyl
can be called two ways:
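On a vector: tabyl(roster$subject)
On a data.frame, naming the variables to tabulate: roster %>% tabyl(subject, employee_status). Here the data.frame is passed in with the %>% pipe; this allows tabyl to be used in an analysis pipeline.

Like table()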
, but pipe-able, data.frame-based, and fully featured.
One variable:
-roster %>% - tabyl(subject) -#> subject n percent valid_percent -#> Basketball 1 0.08333333 0.1 -#> Chemistry 1 0.08333333 0.1 -#> Dean 1 0.08333333 0.1 -#> Drafting 1 0.08333333 0.1 -#> English 2 0.16666667 0.2 -#> Music 1 0.08333333 0.1 -#> PE 1 0.08333333 0.1 -#> Physics 1 0.08333333 0.1 -#> Science 1 0.08333333 0.1 -#> <NA> 2 0.16666667 NA
Two variables:
-roster %>% - filter(hire_date > as.Date("1950-01-01")) %>% - tabyl(employee_status, full_time) -#> employee_status No Yes -#> Administration 0 1 -#> Coach 2 0 -#> Teacher 3 4
Three variables:
-roster %>% - tabyl(full_time, subject, employee_status, show_missing_levels = FALSE) -#> $Administration -#> full_time Dean -#> Yes 1 -#> -#> $Coach -#> full_time Basketball NA_ -#> No 1 1 -#> -#> $Teacher -#> full_time Chemistry Drafting English Music PE Physics Science NA_ -#> No 0 0 2 0 0 0 1 0 -#> Yes 1 1 0 1 1 1 0 1
Adorning tabyls
-The adorn_
functions dress up the results of these tabulation calls for fast, basic reporting. Here are some of the functions that augment a summary table for reporting:
roster %>% - tabyl(employee_status, full_time) %>% - adorn_totals("row") %>% - adorn_percentages("row") %>% - adorn_pct_formatting() %>% - adorn_ns() %>% - adorn_title("combined") -#> employee_status/full_time No Yes -#> Administration 0.0% (0) 100.0% (1) -#> Coach 100.0% (2) 0.0% (0) -#> Teacher 33.3% (3) 66.7% (6) -#> Total 41.7% (5) 58.3% (7)
Pipe that right into knitr::kable()
in your RMarkdown report.
These modular adornments can be layered to reduce R’s deficit against Excel and SPSS when it comes to quick, informative counts. Learn more about tabyl()
and the adorn_
functions from the tabyls vignette.
You are welcome to:
-Briefly describe what the feature would do and why it is in scope for the janitor package.
-Please briefly describe your problem and what output you expect.
-Please include a minimal reprex. The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it. If you’ve never heard of a reprex before, start by reading https://github.com/jennybc/reprex#what-is-a-reprex, and follow the advice further down the page.
-Delete these instructions once you have read them.
-Brief description of the problem
-# insert reprex here
NEWS.md
- Transliteration of characters within make_clean_names()
now operates across operating systems, independent of differences in stringi
installations (Fix #365, thanks to @eamoncaddigan for reporting and @billdenney for fixing).
This bug patch represents a breaking change with the way that make_clean_names()
worked in janitor versions 1.2.1.9000 and 2.0.0, as the transliterations are now more general and follow best practice for transliterating to ASCII more closely.
clean_names()
and make_clean_names()
are now more locale-independent and translation to ASCII is simpler (in many cases, Unicode is removed, e.g., the Greek character “delta” becomes a “d”). You may also now control how substitutions occur and add your own substitutions (like “%” becoming “percent”). As a result of these changes, the clean names generated by these functions may break with what was produced in prior versions of janitor. (Fix #331, thanks to @billdenney)

As part of the improvements to make_clean_names()
and clean_names()
, the ...
argument was added, allowing the user to pass additional information to the underlying transformation function from the snakecase
package, to_any_case()
. This allows for greater user control of clean_names()
/ make_clean_names()
and for new functionality like specifying case = "title"
for transforming variable names back to title case for making plots.
The adorn_*
family of functions now allows control of columns to be adorned using the ...
argument. This often-requested feature results in a small breakage as the now-redundant argument skip_first_col
in adorn_percentages()
was removed.
Obsolete functions were deprecated: crosstab
, adorn_crosstab
, use_first_valid_of
, convert_to_NA
, remove_empty_cols
, remove_empty_rows
, add_totals_col
, add_totals_row
.
The new functions convert_to_date()
and convert_to_datetime()
generalize the work done by excel_numeric_to_date()
allowing conversion to date or datetimes from many forms of input from numeric, to characters that look like numbers, to characters that look like dates or datetimes, to Dates, to date-times (POSIXt) (#310, thanks to @billdenney for implementing). For instance, this succeeds: convert_to_date(c("2020-02-29", "40000.1"))
.
The new function signif_half_up()
rounds a numeric vector to the specified number of significant digits with halves rounded up (#314, thanks to @khueyama for suggesting and implementing).
make_clean_names()
now allows the user to specify parts of names to be replaced (Fix #316, thanks to @woodwards for reporting and @woodwards and @billdenney for implementing)
make_clean_names()
will ensure that column names are never duplicated (Fix #251, thanks to @jzadra for reporting and @billdenney for implementing)
clean_names()
and make_clean_names()
have a more generic interface where all arguments from make_clean_names()
are accessible from clean_names()
(Fix #339, thanks to @ari-nz and @billdenney).
The variables considered by the function get_dupes()
can be specified using the select helper functions from tidyselect
. This includes -column_name
to omit a variable as well as the matching functions starts_with()
, ends_with()
, contains()
, and matches()
. See ?tidyselect::select_helpers
for more (#326, thanks to @jzadra for suggesting and implementing).
A quiet
argument was added to remove_empty()
and remove_constant()
providing more information when quiet = FALSE
(#70, thanks to @jbkunst for suggesting and @billdenney for implementing).
row_to_names()
works on matrix input (#320, thanks to @billdenney for suggesting and implementing).
clean_names()
can now be called on tbl_graph objects from the tidygraph
package. (#252, thanks to @gvdr for bringing up the issue and thanks to @Tazinho for proposing a solution).
adorn_ns()
doesn’t append anything to character columns when called on a data.frame resulting from a call to adorn_percentages()
. (#195).
The name
argument to adorn_totals()
is correctly applied to 3-way tabyls (#306) (thanks to @jzadra for reporting).
adorn_rounding()
now works when called on a 3-way tabyl.
remove_constant()
works correctly with tibbles (in addition to already working on data.frames and matrices) (thanks to @billdenney for implementing).
get_dupes()
works when called on a grouped tibble (#329) (thanks to @jzadra for fixing).
When the second variable in a tabyl (the column variable) contains the empty string ""
, it is converted to "emptystring_"
before being spread to the tabyl’s column names. Previously it became the default variable name V1
. (#203).
Behind-the-scenes code changes to maintain compatibility with breaking changes to dplyr 1.0.0, tibble 3.0.0, and R 4.0.0.
Adjusted a single test to account for a different error message produced by the tidyselect
package. No changes to package functionality.
make_clean_names()
takes a character vector and returns the cleaned text, with the same functionality as the existing clean_names()
, which runs on a data.frame, manipulating its names. (#197, thanks @tazinho and everyone who contributed to the discussion).

This function can be supplied as a value for the .name_repair
argument of as_tibble()
in the tibble
package. For example: as_tibble(iris, .name_repair = make_clean_names)
.
The new function compare_df_cols()
compares the names and classes of columns in a set of supplied data.frames or tibbles, reporting on the specific columns that are or are not similar. This is for the common use case where a set of data files should all have the same specifications but, in practice, may not. A companion function compare_df_cols_same()
gives a TRUE/FALSE
result indicating if the columns are the same (and therefore bindable, though FALSE is not definitive that binding will fail).
Its helper function describe_class()
is exported for developers who wish to extend it so that the compare_df_
functions treat their custom classes appropriately.
This feature (#50) took almost 3 years from conception to implementation. Major thanks to @billdenney for making it happen!
-A new function round_to_fraction()
allows rounding to a fraction with specified denominator, e.g., to the nearest 1/7 (#235, thanks to @billdenney for suggesting & implementing).
The functions janitor::chisq.test()
and janitor::fisher.test()
enable running these statistical tests from the base stats
package on two-way tabyl
objects. While the package loading message says the base functions are masked, the base tests still run on table
objects (#255, thanks @juba for implementing).
remove_empty()
now has a companion function remove_constant()
which removes columns containing only a single unique value, optionally ignoring NA
(#222, thanks to @billdenney for suggesting & implementing).
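Two of the functions described above, round_to_fraction() and remove_constant(), are easy to picture with small base R stand-ins. The helpers nearest_fraction() and drop_constant() below are hypothetical names written for this sketch; they approximate the described behavior and are not janitor's implementations.

```r
# Hypothetical stand-in: round to the nearest multiple of 1/denominator
nearest_fraction <- function(x, denominator) {
  round(x * denominator) / denominator
}
sevenths <- nearest_fraction(c(0.3, 0.6), 7)  # nearest multiples of 1/7

# Hypothetical stand-in: drop columns that hold only one unique value,
# optionally ignoring NA when counting unique values
drop_constant <- function(dat, na.rm = FALSE) {
  keep <- vapply(dat, function(col) {
    if (na.rm) col <- col[!is.na(col)]
    length(unique(col)) > 1
  }, logical(1))
  dat[, keep, drop = FALSE]
}

dat <- data.frame(id = 1:3, site = "A", flag = c(TRUE, NA, TRUE))
kept_default <- names(drop_constant(dat))
kept_na_rm <- names(drop_constant(dat, na.rm = TRUE))
```

With the default, the flag column survives because NA counts as a second value; with na.rm = TRUE it is treated as constant and dropped along with site.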
excel_numeric_to_date()
now returns a POSIXct object and includes a time zone. (#225, thanks to @billdenney for the feature.)
clean_names()
can now be called on a simple features object from the sf
package. (#247, thanks to @JosiahParry for suggesting & implementing.)
adorn_totals()
gains an argument "name"
that allows the user to specify a value other than “Total” to appear as the name of the added row and/or column (#263). Thanks to @StephieLaPugh for suggesting and @daniel-barnett for implementing.
remove_empty()
and remove_constant()
now work with matrices (returning a matrix). (#215) Thanks to @jsta for reporting and @billdenney for patching.
If the third variable in a three-way tabyl is a factor, the resulting list is sorted in order of its levels (#250). Empty factor levels in the 3rd variable are still omitted regardless of the value of show_missing_levels
.
excel_numeric_to_date()
no longer gives an overflow error for integer input (for dates since 1968). (#241) Thanks to @hideaki for reporting and @billdenney for patching.
clean_names()
and make_clean_names()
now support ‘none’ as a case option, passed through to snakecase::to_any_case()
. (#269) Thanks to @andrewbarros for reporting and patching.
Patches a bug introduced in version 1.1.0 where excel_numeric_to_date()
would fail if given an input vector containing an NA
value.
excel_numeric_to_date()
again handles NA
correctly; in version 1.1.0 the function would error if any values of the input vector were NA. (#220). Thanks @emilelatour for reporting and @billdenney for patching.

This release was requested by CRAN to address some minor package dependency issues. It also contains several updates and additions described below.
-The new function row_to_names()
handles the case where a dirty data file is read in with its names stored as a row of the data.frame, rather than in the names. This function sets the names of the data.frame to this row and optionally cleans up the rows above and including where the names were stored. Thanks to @billdenney for writing this feature.
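The situation row_to_names() addresses can be sketched in base R as follows. The data.frame raw and the choice of header_row are made up for illustration, and the steps shown are an approximation of the described behavior, not the function's actual code.

```r
# Hypothetical file read in with its real header stuck in the second row
raw <- data.frame(
  X1 = c("report 2019", "id", "1", "2"),
  X2 = c(NA, "score", "90", "85"),
  stringsAsFactors = FALSE
)

header_row <- 2
promoted <- raw
# Use the values of the header row as the column names
names(promoted) <- unlist(raw[header_row, ], use.names = FALSE)
# Drop the header row and everything above it
promoted <- promoted[-seq_len(header_row), , drop = FALSE]
rownames(promoted) <- NULL
```

After these steps promoted has the columns id and score and only the two data rows, which is the shape the description above is aiming for.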
excel_numeric_to_date()
can now convert fractions of a day to time, e.g., excel_numeric_to_date(43001.01, include_time = TRUE)
returns the POSIXlt value "2017-09-23 00:14:24"
. Thanks to @billdenney.
As part of excel_numeric_to_date()
now handling times, if a Date-only result is requested (the default behavior of include_time = FALSE
), any fractional part of the date is now removed. The printed date itself is identical, but the internal representation of this object now contains only the integer part of the date. For example, while under both the old and new versions of this function the call excel_numeric_to_date(42001.1)
would return the Date object "2014-12-28"
, calling as.numeric
on this Date result would previously return 16432.1
, while now it returns 16432
.
This is an improved behavior, as now excel_numeric_to_date(42001.1, include_time = FALSE) == as.Date("2014-12-28")
returns TRUE, while previously it would appear to be equivalent from the printed value but this comparison would return FALSE.
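The effect of a leftover fractional day can be reproduced with base R's as.Date() alone. This sketch assumes the common Excel origin of 1899-12-30 and uses floor() to mimic dropping the fraction; it illustrates the comparison issue described above and does not call janitor.

```r
excel_origin <- "1899-12-30"  # assumed origin for Excel serial dates

# Keeping the fraction: prints like a plain date but stores 16432.1
with_fraction <- as.Date(42001.1, origin = excel_origin)
# Dropping the fraction first: stores the whole number 16432
whole_days <- as.Date(floor(42001.1), origin = excel_origin)

printed_same <- format(with_fraction) == format(whole_days)
equal_with_fraction <- with_fraction == as.Date("2014-12-28")
equal_whole_days <- whole_days == as.Date("2014-12-28")
```

Both objects print as the same date, yet only whole_days compares equal to as.Date("2014-12-28"), which is the kind of surprise the change removes.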
A stable version 1.0.0, with a new tabyl
API and with breaking changes to the output of clean_names()
.
This builds on the original functionality of janitor, with similar-but-improved tools and significantly-changed implementation.
-tabyl
-tabyl()
is now a single function that can count combinations of one, two, or three variables, à la base R’s table()
. The resulting tabyl
data.frames can be manipulated and formatted using a family of adorn_
functions. See the tabyls vignette for more.
The now-redundant legacy functions crosstab()
and adorn_crosstab()
have been deprecated, but remain in the package for now. Existing code that relies on the version of tabyl
present in janitor versions <= 0.3.1 will break if the sort
argument was used, as that argument no longer exists in tabyl
(use dplyr::arrange()
instead).
clean_names
-clean_names()
now detects and preserves camelCase inputs, allows multiple options for case outputs of the cleaned names, and preserves whether there’s space between letters and numbers. It also transliterates accented letters and turns #
into "number"
.
These changes may cause old code to break. E.g., a raw column name variableName
would now be converted to variable_name
(or variableName
, VariableName
, etc. depending on your preference), where previously it would have been converted to variablename
.
To minimize this inconvenience, there’s a quick fix for compatibility: you can find-and-replace to insert the argument case = "old_janitor"
, preserving the old behavior of clean_names()
as of janitor version 0.3.1 (and thus not have to redo your scripts beyond that.)
No further changes are planned to clean_names()
and its results should be stable from version 1.0.0 onward.
clean_names()
transliterates accented letters, e.g., çãüœ
becomes cauoe
(#120). Thanks to @fernandovmacedo.
clean_names()
offers multiple options for variable name styling. In addition to snake_case
output you can select smallCamelCase
, BigCamelCase
, ALL_CAPS
and others. (#131). Thanks to @tazinho, who wrote the snakecase package that janitor depends on to do this, as well as the patch to incorporate it into clean_names()
. And thanks to @maelle for proposing this feature.
Launched the janitor documentation website: http://sfirke.github.io/janitor. Thanks to the pkgdown package.
remove_empty_rows()
and remove_empty_cols()
, which are replaced by the single function remove_empty()
. (#100)
-To encourage transparency, remove_empty()
prints a message if no value is supplied for the which
argument; to suppress this, supply a value to which
, even if it’s the default c("rows", "cols")
.
The new adorn_title()
function adds the name of the 2nd tabyl
variable (i.e., the name of the column variable). This un-tidies the data.frame but makes the result clearer to readers (#77)
round_half_up()
is now exported for public use. It’s an exact implementation of http://stackoverflow.com/questions/12688717/round-up-from-5-in-r/12688836#12688836, written by @mrdwab.
tabyl objects now print with row numbers suppressed
clean_names() now retains the character # as "number" in the resulting names
adorn_totals("row")
handles quirky variable names in 1st column (#118)
-get_dupes()
returns the correct result when a variable in the input data.frame is already called "n"
(#162)
-This is a bug-fix release with no new functionality or changes. It fixes a bug where adorn_crosstab()
failed if the tibble
package was version > 1.4.
Major changes to janitor are currently in development on GitHub and will be released soon. This is not that next big release.
-The primary purpose of this release is to maintain accuracy given breaking changes to the dplyr package, upon which janitor is built, in dplyr version >0.6.0. This update also contains a number of minor improvements.
-Critical: if you update the package dplyr
to version >0.6.0, you must update janitor to version 0.3.0 to ensure accurate results from janitor’s tabyl()
function. This is due to a change in the behavior of dplyr’s _join
functions (discussed in #111).
janitor 0.3.0 is compatible with this new version of dplyr as well as old versions of dplyr back to 0.5.0. That is, updating janitor to 0.3.0 does not necessitate an update to dplyr >0.6.0.
-add_totals_row
and add_totals_col
were combined into a single function, adorn_totals()
. (#57). The add_totals_
functions are now deprecated and should not be used.
adorn_crosstab()
is now “dat” instead of “crosstab” (indicating that the function can be called on any data.frame, not just a result of crosstab()
)
Deprecated the following functions:
-use_first_valid_of()
- use dplyr::coalesce()
instead
convert_to_NA()
- use dplyr::na_if()
instead
add_totals_row()
and add_totals_col()
- replaced by the single function adorn_totals()
-adorn_totals()
and ns_to_percents()
can now be called on data.frames that have non-numeric columns beyond the first one (those columns will be ignored) (#57)
-adorn_totals("col")
retains factor class in 1st column if 1st column in the input data.frame was a factor
clean_names()
now handles leading spaces (#85)
-adorn_crosstab()
and ns_to_percents()
work on a 2-column data.frame (#89)
-adorn_totals()
now works on a grouped tibble (#97)
-tabyl()
and crosstab()
(#87)
-NA_
column in the result of a crosstab()
will appear at the last column position (#109)
-tabyl()
and crosstab()
now appear in the package manual (#65)
-tabyl()
and crosstab()
failed to retain ill-formatted variable names only when using R 3.2.5 for Windows (#76)
-add_totals_row()
works on two-column data.frame (#69)
-use_first_valid_of()
returns POSIXct-class result when given POSIXct inputs
mtcars %>% tabyl(mpg) %>% tabyl(n)
(#54)
-get_dupes()
now works on variables with spaces in column names (#62)
-adorn_crosstab()
that formats the results of a crosstab()
for pretty printing. Shows % and N in the same cell, with the % symbol, user-specified rounding (method and number of digits), and the option to include a totals row and/or column. E.g., mtcars %>% crosstab(cyl, gear) %>% adorn_crosstab()
.
crosstab()
can be called in a %>%
pipeline, e.g., mtcars %>% crosstab(cyl, gear)
. Thanks to @chrishaid (#34)
-tabyl()
can also be called in a %>%
pipeline, e.g., mtcars %>% tabyl(cyl)
(#35)
-use_first_valid_of()
function (#32)
-ns_to_percents()
, add_totals_row()
, add_totals_col()
,
crosstab()
returns 0 instead of NA when there are no instances of a variable combination.
tabyl(df$vecname)
retains the more-descriptive $
symbol in the column name of the result - if you want a legal R name in the result, call it as df %>% tabyl(vecname)
-clean_names()
-2016-12-23
-This page is for planning the janitor package at a high level. More-finite questions and ideas can be handled via GitHub issues. This page is for, say, articulating what the package does or doesn’t do and how it should be organized. If it turns out we need a more discussion- and comment-friendly format, we can move to Google Docs, but let’s try commenting and editing here.
-Provide a framework and associated functions for checking and cleaning dirty data. There are two kinds of checks: interactive checks, like tabyl
, and programmatic checks that, say, confirm in production that some variables contain no duplicate records or no missing values.
This function is deprecated, use adorn_totals
instead.
add_totals_col(dat, na.rm = TRUE)

| Argument | Description |
|---|---|
| dat | an input data.frame with at least one numeric column. |
| na.rm | should missing values (including NaN) be omitted from the calculations? |
Returns a data.frame with a totals column containing row-wise sums.
- -This function is deprecated, use adorn_totals
instead.
add_totals_row(dat, fill = "-", na.rm = TRUE)

| Argument | Description |
|---|---|
| dat | an input data.frame with at least one numeric column. |
| fill | if there is more than one non-numeric column, what string should fill the bottom row of those columns? |
| na.rm | should missing values (including NaN) be omitted from the calculations? |
Returns a data.frame with a totals row, consisting of "Total" in the first column and column sums in the others.
- -This function adds the column variable name to the top of a tabyl
for a complete display of information. This makes the tabyl prettier, but renders the data.frame less useful for further manipulation.
adorn_col_title(dat, placement = "top")

| Argument | Description |
|---|---|
| dat | a data.frame of class |
| placement | whether the column name should be added to the top of the tabyl in an otherwise-empty row |
the input tabyl, augmented with the column title.
- - --#> Error in adorn_col_title(., placement = "top"): object 'ns' not found
R/janitor_deprecated.R
- adorn_crosstab.Rd
This function is deprecated, use the adorn_
family of functions instead.
adorn_crosstab(
  dat,
  denom = "row",
  show_n = TRUE,
  digits = 1,
  show_totals = FALSE,
  rounding = "half to even"
)

| Argument | Description |
|---|---|
| dat | a data.frame with row names in the first column and numeric values in all other columns. Usually the piped-in result of a call to |
| denom | the denominator to use for calculating percentages. One of "row", "col", or "all". |
| show_n | should counts be displayed alongside the percentages? |
| digits | how many digits should be displayed after the decimal point? |
| show_totals | display a totals summary? Will be a row, column, or both depending on the value of |
| rounding | method to use for rounding percentages - either "half to even", the base R default method, or "half up", where 14.5 rounds up to 15. |
Returns a data.frame.
- -This function adds back the underlying Ns to a tabyl
whose percentages were calculated using adorn_percentages()
, to display the Ns and percentages together. You can also call it on a non-tabyl data.frame to which you wish to append Ns.
adorn_ns(dat, position = "rear", ns = attr(dat, "core"), ...)

| Argument | Description |
|---|---|
| dat | a data.frame of class |
| position | should the N go in the front, or in the rear, of the percentage? |
| ns | the Ns to append. The default is the "core" attribute of the input tabyl |
| ... | columns to adorn. This takes a tidyselect specification. By default, all columns are adorned except for the first column and columns not of class |
a data.frame with Ns appended
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting() %>%
  adorn_ns(position = "front")
#> am 4 6 8
#> 0 3 (27.3%) 4 (57.1%) 12 (85.7%)
#> 1 8 (72.7%) 3 (42.9%) 2 (14.3%)
R/adorn_pct_formatting.R
- adorn_pct_formatting.Rd
Numeric columns get multiplied by 100 and formatted as percentages according to user specifications. This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to adorn in the ...
argument. Non-numeric columns are always excluded.
adorn_pct_formatting(
  dat,
  digits = 1,
  rounding = "half to even",
  affix_sign = TRUE,
  ...
)

| Argument | Description |
|---|---|
| dat | a data.frame with decimal values, typically the result of a call to |
| digits | how many digits should be displayed after the decimal point? |
| rounding | method to use for rounding - either "half to even", the base R default method, or "half up", where 14.5 rounds up to 15. |
| affix_sign | should the % sign be affixed to the end? |
| ... | columns to adorn. This takes a tidyselect specification. By default, all numeric columns (besides the initial column, if numeric) are adorned, but this allows you to manually specify which columns should be adorned, for use on a data.frame that does not result from a call to |
a data.frame with formatted percentages
#> am 4 6 8
#> 0 27.3% 57.1% 85.7%
#> 1 72.7% 42.9% 14.3%
R/adorn_percentages.R
- adorn_percentages.Rd
This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to adorn in the ...
argument.
adorn_percentages(dat, denominator = "row", na.rm = TRUE, ...)

| Argument | Description |
|---|---|
| dat | a |
| denominator | the direction to use for calculating percentages. One of "row", "col", or "all". |
| na.rm | should missing values (including NaN) be omitted from the calculations? |
| ... | columns to adorn. This takes a tidyselect specification. By default, all numeric columns (besides the initial column, if numeric) are adorned, but this allows you to manually specify which columns should be adorned, for use on a data.frame that does not result from a call to |
Returns a data.frame of percentages, expressed as numeric values between 0 and 1.
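The three denominator choices can be illustrated with base R alone. The following is a minimal sketch that uses `prop.table()` on a plain contingency table; it is not janitor's implementation, only the same arithmetic on the `mtcars` counts used in the examples.

```r
# Counts of transmission type (am) by cylinder count (cyl).
counts <- table(am = mtcars$am, cyl = mtcars$cyl)

row_pct <- prop.table(counts, margin = 1)  # "row": each row sums to 1
col_pct <- prop.table(counts, margin = 2)  # "col": each column sums to 1
all_pct <- prop.table(counts)              # "all": the whole table sums to 1

col_pct
```

In each case the results are proportions between 0 and 1, which is the form that a later percentage-formatting step would multiply by 100.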
#> am 4 6 8
#> 0 0.2727273 0.5714286 0.8571429
#> 1 0.7272727 0.4285714 0.1428571

# calculates correctly even with totals column and/or row:
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_totals("row") %>%
  adorn_percentages()
#> am 4 6 8
#> 0 0.1578947 0.2105263 0.6315789
#> 1 0.6153846 0.2307692 0.1538462
#> Total 0.3437500 0.2187500 0.4375000
Can run on any data.frame with at least one numeric column. This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to round in the ...
argument.
If you're formatting percentages, e.g., the result of adorn_percentages()
, use adorn_pct_formatting()
instead. This is a more flexible variant for ad-hoc usage. Compared to adorn_pct_formatting()
, it does not multiply by 100 or pad the numbers with spaces for alignment in the results data.frame. This function retains the class of numeric input columns.
adorn_rounding(dat, digits = 1, rounding = "half to even", ...)

| Argument | Description |
|---|---|
| dat | a |
| digits | how many digits should be displayed after the decimal point? |
| rounding | method to use for rounding - either "half to even", the base R default method, or "half up", where 14.5 rounds up to 15. |
| ... | columns to adorn. This takes a tidyselect specification. By default, all numeric columns (besides the initial column, if numeric) are adorned, but this allows you to manually specify which columns should be adorned, for use on a data.frame that does not result from a call to |
Returns the data.frame with rounded numeric columns.
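The difference between the two rounding methods is easiest to see on values that end in exactly .5. The sketch below is written in base R; `round_half_up_sketch()` is a hypothetical helper named for illustration, not the package's own code, and the small epsilon term is an assumption added to guard against floating-point representation error.

```r
# Hypothetical helper (not janitor's source): round halves away from zero.
round_half_up_sketch <- function(x, digits = 0) {
  scaled <- abs(x) * 10^digits
  # The tiny epsilon protects values that are stored slightly below an exact half.
  sign(x) * trunc(scaled + 0.5 + sqrt(.Machine$double.eps)) / 10^digits
}

half_values <- c(0.5, 14.5, 2.5)
round(half_values)                 # base R default: halves go to the nearest even number
round_half_up_sketch(half_values)  # halves always move away from zero
```

With "half to even", 14.5 becomes 14, while "half up" gives 15, matching the description of the `rounding` argument.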
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_percentages() %>%
  adorn_rounding(digits = 2, rounding = "half up")
#> am 4 6 8
#> 0 0.16 0.21 0.63
#> 1 0.62 0.23 0.15

mtcars %>%
  tabyl(am, cyl) %>%
  adorn_percentages("all") %>%
  mutate(dummy = "a") %>%
  adorn_rounding()
#> am 4 6 8 dummy
#> 0 0.1 0.1 0.4 a
#> 1 0.2 0.1 0.1 a

# control the columns modified using the ... argument:
mtcars %>%
  tabyl(am, cyl) %>%
  adorn_percentages("row") %>%
  adorn_rounding(digits = 1, rounding = "half up", starts_with("8"))
#> am 4 6 8
#> 0 0.1578947 0.2105263 0.6
#> 1 0.6153846 0.2307692 0.2
This function adds the column variable name to the top of a tabyl
for a complete display of information. This makes the tabyl prettier, but renders the data.frame less useful for further manipulation.
adorn_title(dat, placement = "top", row_name, col_name)

| Argument | Description |
|---|---|
| dat | a data.frame of class |
| placement | whether the column name should be added to the top of the tabyl in an otherwise-empty row |
| row_name | (optional) default behavior is to pull the row name from the attributes of the input |
| col_name | (optional) default behavior is to pull the column_name from the attributes of the input |
the input tabyl, augmented with the column title. Non-tabyl inputs that are of class tbl_df
are downgraded to basic data.frames so that the title row prints correctly.
#> cyl
#> am 4 6 8
#> 0 3 4 12
#> 1 8 3 2

# Adding a title to a non-tabyl
library(tidyr); library(dplyr)
mtcars %>%
  group_by(gear, am) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  spread(gear, avg_mpg) %>%
  adorn_title("top", row_name = "Gears", col_name = "Cylinders")
#> Cylinders
#> 1 Gears 3 4 5
#> 2 0 16.1066666666667 21.05 <NA>
#> 3 1 <NA> 26.275 21.38
This function defaults to excluding the first column of the input data.frame, assuming that it contains a descriptive variable, but this can be overridden by specifying the columns to be totaled in the ...
argument. Non-numeric columns are converted to character class and have a user-specified fill character inserted in the totals row.
adorn_totals(dat, where = "row", fill = "-", na.rm = TRUE, name = "Total", ...)

| Argument | Description |
|---|---|
| dat | an input data.frame with at least one numeric column. If given a list of data.frames, this function will apply itself to each data.frame in the list (designed for 3-way |
| where | one of "row", "col", or |
| fill | if there are non-numeric columns, what string should fill the bottom row of those columns? |
| na.rm | should missing values (including NaN) be omitted from the calculations? |
| name | name of the totals column or row |
| ... | columns to total. This takes a tidyselect specification. By default, all numeric columns (besides the initial column, if numeric) are included in the totals, but this allows you to manually specify which columns should be included, for use on a data.frame that does not result from a call to |
Returns a data.frame augmented with a totals row, column, or both. The data.frame is now also of class tabyl
and stores information about the attached totals and underlying data in the tabyl attributes.
#> am 4 6 8
#> 0 3 4 12
#> 1 8 3 2
#> Total 11 7 14
tabyl
attributes to a data.frame. — as_tabyl • janitorA tabyl
is a data.frame containing counts of a variable or co-occurrences of two variables (a.k.a., a contingency table or crosstab). This specialized kind of data.frame has attributes that enable adorn_
functions to be called for precise formatting and presentation of results. E.g., display results as a mix of percentages, Ns, add totals rows or columns, rounding options, in the style of Microsoft Excel PivotTable.
A tabyl
can be the result of a call to janitor::tabyl()
, in which case these attributes are added automatically. This function adds tabyl
class attributes to a data.frame that isn't the result of a call to tabyl
but meets the requirements of a two-way tabyl:
-1) First column contains values of variable 1
-2) Column names 2:n are the values of variable 2
-3) Numeric values in columns 2:n are counts of the co-occurrences of the two variables.*
* = this is the ideal form of a tabyl, but janitor's adorn_
functions tolerate and ignore non-numeric columns in positions 2:n.
For instance, the result of dplyr::count()
followed by tidyr::spread()
can be treated as a tabyl
.
The result of calling tabyl()
on a single variable is a special class of one-way tabyl; this function only pertains to the two-way tabyl.
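The three requirements for a two-way tabyl describe a particular data.frame shape. As a rough illustration, the sketch below builds that shape with base R only; the object name `two_way` is hypothetical, and the commented-out call at the end assumes janitor is installed.

```r
# Count co-occurrences of two variables with base R.
counts <- table(mtcars$am, mtcars$cyl)

# Requirement 1: the first column holds the values of variable 1 (am).
# Requirement 2: the remaining column names are the values of variable 2 (cyl).
# Requirement 3: the numeric cells are the co-occurrence counts.
two_way <- data.frame(
  am = rownames(counts),
  as.data.frame.matrix(counts),
  check.names = FALSE  # keep "4", "6", "8" as column names
)
rownames(two_way) <- NULL

two_way
# With janitor installed, this could then be given tabyl attributes, e.g.:
# janitor::as_tabyl(two_way, row_var_name = "am", col_var_name = "cyl")
```

Any data.frame that ends up in this layout, however it was produced, is a candidate input.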
as_tabyl(dat, axes = 2, row_var_name = NULL, col_var_name = NULL)

| Argument | Description |
|---|---|
| dat | a data.frame with variable values in the first column and numeric values in all other columns. |
| axes | is this a two_way tabyl or a one_way tabyl? If this function is being called by a user, this should probably be "2". One-way tabyls are created by |
| row_var_name | (optional) the name of the variable in the row dimension; used by |
| col_var_name | (optional) the name of the variable in the column dimension; used by |
Returns the same data.frame, but with the additional class of "tabyl" and the attribute "core".
- --as_tabyl(mtcars)#> mpg cyl disp hp drat wt qsec vs am gear carb -#> 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 -#> 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 -#> 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 -#> 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 -#> 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 -#> 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 -#> 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 -#> 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 -#> 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 -#> 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 -#> 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 -#> 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 -#> 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 -#> 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 -#> 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 -#> 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 -#> 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 -#> 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 -#> 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 -#> 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 -#> 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 -#> 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 -#> 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 -#> 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 -#> 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 -#> 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 -#> 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 -#> 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 -#> 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 -#> 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 -#> 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 -#> 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2-
This generic function overrides stats::chisq.test. If the passed table is a two-way tabyl, it runs it through janitor::chisq.test.tabyl, otherwise it just calls stats::chisq.test.
chisq.test(x, ...)

# S3 method for default
chisq.test(x, y = NULL, ...)

# S3 method for tabyl
chisq.test(x, tabyl_results = TRUE, ...)

| Argument | Description |
|---|---|
| x | a two-way tabyl, a numeric vector or a factor |
| ... | other parameters passed to stats::chisq.test |
| y | if x is a vector, must be another vector or factor of the same length |
| tabyl_results | if TRUE and x is a tabyl object, also return `observed`, `expected`, `residuals` and `stdres` as tabyl |
The result is the same as the one of stats::chisq.test. If `tabyl_results` -is TRUE, the returned tables `observed`, `expected`, `residuals` and `stdres` -are converted to tabyls.
#> Warning: Chi-squared approximation may be incorrect
#>
#> Pearson's Chi-squared test
#>
#> data: tab
#> X-squared = 18.036, df = 4, p-value = 0.001214
#>
chisq.test(tab)$residuals
#> Warning: Chi-squared approximation may be incorrect
#> gear 4 6 8
#> 3 -1.8303523 -0.70731720 2.1225827
#> 4 1.9079181 0.84866842 -2.2912878
#> 5 0.2145291 -0.08964215 -0.1267731
Resulting names are unique and consist only of the _
character, numbers, and letters.
-Capitalization preferences can be specified using the case
parameter.
Accented characters are transliterated to ASCII. For example, an "o" with a German umlaut over it becomes "o", and the Spanish character "enye" becomes "n".
-This function takes and returns a data.frame, for ease of piping with
-`%>%`
. For the underlying function that works on a character vector
-of names, see make_clean_names
.
clean_names(dat, ...)

# S3 method for data.frame
clean_names(dat, ...)

# S3 method for default
clean_names(dat, ...)

# S3 method for sf
clean_names(dat, ...)

# S3 method for tbl_graph
clean_names(dat, ...)

| Argument | Description |
|---|---|
| dat | the input data.frame. |
| ... | Arguments passed on to |
Returns the data.frame with clean names.
-clean_names()
is intended to be used on data.frames
- and data.frame
like objects. For this reason there are methods to
- support using clean_names()
on sf
and tbl_graph
(from
- tidygraph
) objects. For cleaning named lists and vectors, consider
- using make_clean_names()
.
# not run:
# clean_names(poorly_named_df)

# or pipe in the input data.frame:
# poorly_named_df %>% clean_names()

# if you prefer camelCase variable names:
# poorly_named_df %>% clean_names(., "small_camel")

# not run:
# library(readxl)
# read_excel("messy_excel_file.xlsx") %>% clean_names()
R/compare_df_cols.R
- compare_df_cols.Rd
Generate a comparison of data.frames (or similar objects) that indicates if -they will successfully bind together by rows.
compare_df_cols(
  ...,
  return = c("all", "match", "mismatch"),
  bind_method = c("bind_rows", "rbind"),
  strict_description = FALSE
)

| Argument | Description |
|---|---|
| ... | A combination of data.frames, tibbles, and lists of data.frames/tibbles. The values may optionally be named arguments; if named, the output column will be the name; if not named, the output column will be the data.frame name (see examples section). |
| return | Should a summary of "all" columns be returned, only return "match"ing columns, or only "mismatch"ing columns? |
| bind_method | What method of binding should be used to determine matches? With "bind_rows", columns missing from a data.frame would be considered a match (as in |
| strict_description | Passed to |
A data.frame with a column named "column_name" with a value named
- after the input data.frames' column names, and then one column per
- data.frame (named after the input data.frame). If more than one input has
- the same column name, the column naming will have suffixes defined by
- sequential use of base::merge()
and may differ from expected naming.
- The rows within the data.frame-named columns are descriptions of the
- classes of the data within the columns (generated by
- describe_class
).
Due to the returned "column_name" column, no input data.frame may be - named "column_name".
-The strict_description
argument is most typically used to understand
- if factor levels match or are bindable. Factors are typically bindable,
- but the behavior of what happens when they bind differs based on the
- binding method ("bind_rows" or "rbind"). Even when
- strict_description
is FALSE
, data.frames may still bind
- because some classes (like factors and characters) can bind even if they
- appear to differ.
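The comparison idea can be sketched in base R: collect every column name across the inputs, then record each input's class for that column, leaving NA where the column is absent. `column_classes()` below is a hypothetical helper written for illustration; it is not the package's implementation and it ignores factor levels and the `bind_method` distinction.

```r
# Hypothetical helper (not janitor's code): one row per column name,
# one column per input data.frame, cells hold the class description.
column_classes <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  sapply(dfs, function(d) {
    sapply(all_cols, function(col) {
      if (col %in% names(d)) {
        paste(class(d[[col]]), collapse = ", ")
      } else {
        NA_character_  # column is missing from this data.frame
      }
    })
  })
}

classes <- column_classes(list(dfA = data.frame(A = 1), dfB = data.frame(B = 2)))
classes
```

A row whose non-missing entries differ would indicate a class mismatch; whether a missing entry counts as a mismatch depends on the binding method described above.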
Other Data frame type comparison:
-compare_df_cols_same()
,
-describe_class()
#> column_name data.frame(A = 1) data.frame(B = 2)
#> 1 A numeric <NA>
#> 2 B <NA> numeric
#> column_name dfA dfB
#> 1 A numeric <NA>
#> 2 B <NA> numeric

# a combination of list and data.frame input
compare_df_cols(listA=list(dfA=data.frame(A=1), dfB=data.frame(B=2)), data.frame(A=3))
#> column_name dfA dfB data.frame(A = 3)
#> 1 A numeric <NA> numeric
#> 2 B <NA> numeric <NA>
R/compare_df_cols.R
- compare_df_cols_same.Rd
Check whether a set of data.frames are row-bindable. Calls
-compare_df_cols()
and returns TRUE if there are no mismatching rows.
compare_df_cols_same(
  ...,
  bind_method = c("bind_rows", "rbind"),
  verbose = TRUE
)

| Argument | Description |
|---|---|
| ... | A combination of data.frames, tibbles, and lists of data.frames/tibbles. The values may optionally be named arguments; if named, the output column will be the name; if not named, the output column will be the data.frame name (see examples section). |
| bind_method | What method of binding should be used to determine matches? With "bind_rows", columns missing from a data.frame would be considered a match (as in |
| verbose | Print the mismatching columns if binding will fail. |
TRUE
if row binding will succeed or FALSE
if it will
- fail.
Other Data frame type comparison:
-compare_df_cols()
,
-describe_class()
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE
#> column_name ..1 ..2
#> 1 A numeric <NA>
#> 2 B <NA> numeric
#> [1] FALSE
NA
values. — convert_to_NA • janitorConverts instances of user-specified strings into NA
. Can operate on either a single vector or an entire data.frame.
convert_to_NA(dat, strings)

| Argument | Description |
|---|---|
| dat | vector or data.frame to operate on. |
| strings | character vector of strings to convert. |
Returns a cleaned object. Can be a vector, data.frame, or tibble::tbl_df
depending on the provided input.
Deprecated, do not use in new code. Use dplyr::na_if()
instead.
janitor_deprecated
R/convert_to_date.R
- convert_to_date.Rd
Convert many date and datetime formats as may be received from Microsoft -Excel
convert_to_date(
  x,
  ...,
  character_fun = lubridate::ymd,
  string_conversion_failure = c("error", "warning")
)

convert_to_datetime(
  x,
  ...,
  tz = "UTC",
  character_fun = lubridate::ymd_hms,
  string_conversion_failure = c("error", "warning")
)

| Argument | Description |
|---|---|
| x | The object to convert |
| ... | Passed to further methods. Eventually may be passed to `excel_numeric_to_date()`, `base::as.POSIXct()`, or `base::as.Date()`. |
| character_fun | A function to convert non-numeric-looking, non-NA values in `x` to POSIXct objects. |
| string_conversion_failure | If a character value fails to parse into the desired class and instead returns `NA`, should the function return the result with a warning or throw an error? |
| tz | The timezone for POSIXct output, unless an object is POSIXt already. Ignored for Date output. |
POSIXct objects for `convert_to_datetime()` or Date objects for - `convert_to_date()`.
Character conversion checks if it matches something that looks like a Microsoft Excel numeric date, converts those to numeric, and then runs convert_to_datetime_helper() on those numbers. Then, character to Date or POSIXct conversion occurs via `character_fun(x, ...)` or `character_fun(x, tz=tz, ...)`, respectively.
-convert_to_datetime
: Convert to a date-time (POSIXct)
Other Date-time cleaning:
-excel_numeric_to_date()
convert_to_date("2009-07-06")
#> [1] "2009-07-06"
convert_to_date(40000)
#> [1] "2009-07-06"
convert_to_date("40000.1")
#> [1] "2009-07-06"
#> [1] "2020-02-29" "2009-07-06"
convert_to_datetime(
  c("2009-07-06", "40000.1", "40000", NA),
  character_fun=lubridate::ymd_h, truncated=1, tz="UTC"
)
#> [1] "2009-07-06 00:00:00 UTC" "2009-07-06 02:24:00 UTC"
#> [3] "2009-07-06 00:00:00 UTC" NA
This function is deprecated, use tabyl(dat, var1, var2)
instead.
crosstab(...)

| Argument | Description |
|---|---|
| ... | arguments |
Describe the class(es) of an object
describe_class(x, strict_description = TRUE)

# S3 method for factor
describe_class(x, strict_description = TRUE)

# S3 method for default
describe_class(x, strict_description = TRUE)

| Argument | Description |
|---|---|
| x | The object to describe |
| strict_description | Should differing factor levels be treated as differences for the purposes of identifying mismatches? |
A character scalar describing the class(es) of an object where if the - scalar will match, columns in a data.frame (or similar object) should bind - together without issue.
-For package developers, an S3 generic method can be written for
- describe_class()
for custom classes that may need more definition
- than the default method. This function is called by compare_df_cols
.
factor
: Describe factors with their levels
-and if they are ordered.
default
: List all classes of an object.
Other Data frame type comparison:
-compare_df_cols_same()
,
-compare_df_cols()
describe_class(1)
#> [1] "numeric"
#> [1] "factor(levels=c(\"A\"))"
#> [1] "ordered, factor(levels=c(\"A\", \"B\"))"
#> [1] "factor"
R/excel_dates.R
- excel_numeric_to_date.Rd
Converts numbers like 42370
into date values like
-2016-01-01
.
Defaults to the modern Excel date encoding system. However, Excel for Mac 2008 and earlier Mac versions of Excel used a different date system. To determine what platform to specify: if the date 2016-01-01 is represented by the number 42370 in your spreadsheet, it's the modern system. If it's 40908, it's the old Mac system. More on date encoding systems at http://support.office.com/en-us/article/Date-calculations-in-Excel-e7fe7167-48a9-4b96-bb53-5612a800b487.
-A list of all timezones is available from base::OlsonNames()
, and the
-current timezone is available from base::Sys.timezone()
.
If your input data has a mix of Excel numeric dates and actual dates, see the -more powerful functions `convert_to_date()` and `convert_to_datetime()`.
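The two date systems differ only in the day that serial numbers are counted from. A minimal base R sketch of that arithmetic follows; the origin dates used here (1899-12-30 for the modern system, 1904-01-01 for the old Mac system) are the commonly cited ones and are stated as an assumption, not taken from this page.

```r
# Modern system: days counted from an assumed origin of 1899-12-30.
modern <- as.Date(42370, origin = "1899-12-30")

# Old Mac system: days counted from an assumed origin of 1904-01-01.
mac_pre_2011 <- as.Date(40908, origin = "1904-01-01")

# Both serial numbers refer to the same calendar day.
format(modern)
format(mac_pre_2011)
```

This mirrors the rule of thumb above: the same date appears as 42370 under one system and 40908 under the other.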
excel_numeric_to_date(
  date_num,
  date_system = "modern",
  include_time = FALSE,
  round_seconds = TRUE,
  tz = ""
)

| Argument | Description |
|---|---|
| date_num | numeric vector of serial numbers to convert. |
| date_system | the date system, either |
| include_time | Include the time (hours, minutes, seconds) in the output? (See details) |
| round_seconds | Round the seconds to an integer (only has an effect when |
| tz | Time zone, used when |
Returns a vector of class Date if include_time
is
- FALSE
. Returns a vector of class POSIXlt if include_time
is
- TRUE
.
When using include_time=TRUE
, days with leap seconds will not
- be accurately handled as they do not appear to be accurately handled by
- Windows (as described in
- https://support.microsoft.com/en-us/help/2722715/support-for-the-leap-second).
Other Date-time cleaning:
-convert_to_date()
excel_numeric_to_date(40000)
#> [1] "2009-07-06"
excel_numeric_to_date(40000.5) # No time is included
#> [1] "2009-07-06"
excel_numeric_to_date(40000.5, include_time = TRUE) # Time is included
#> [1] "2009-07-06 13:00:00 EDT"
excel_numeric_to_date(40000.521, include_time = TRUE) # Time is included
#> [1] "2009-07-06 13:30:14 EDT"
excel_numeric_to_date(40000.521, include_time = TRUE,
  round_seconds = FALSE) # Time with fractional seconds is included
#> [1] "2009-07-06 13:30:14 EDT"
This generic function overrides stats::fisher.test. If the passed table is a two-way tabyl, it runs it through janitor::fisher.test.tabyl, otherwise it just calls stats::fisher.test.
fisher.test(x, ...)

# S3 method for default
fisher.test(x, y = NULL, ...)

# S3 method for tabyl
fisher.test(x, ...)

| Argument | Description |
|---|---|
| x | a two-way tabyl, a numeric vector or a factor |
| ... | other parameters passed to stats::fisher.test |
| y | if x is a vector, must be another vector or factor of the same length |
The result is the same as the one of stats::fisher.test.
#>
#> Fisher's Exact Test for Count Data
#>
#> data: tab
#> p-value = 8.26e-05
#> alternative hypothesis: two.sided
#>
data.frame
with identical values for the specified variables. — get_dupes • janitordata.frame
with identical values for the specified variables.R/get_dupes.R
- get_dupes.Rd
For hunting duplicate records during data cleaning. Specify the data.frame and the variable combination to search for duplicates and get back the duplicated rows.
get_dupes(dat, ...)

| Argument | Description |
|---|---|
| dat | The input data.frame. |
| ... | Unquoted variable names to search for duplicates. This takes a tidyselect specification. |
Returns a data.frame with the full records where the specified variables have duplicated values, as well as a variable dupe_count
showing the number of rows sharing that combination of duplicated values. If the input data.frame was of class tbl_df
, the output is as well.
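The returned shape (full records plus a count of rows sharing the key) can be approximated in base R. The sketch below uses `ave()` to compute a per-row group size; it is an illustration of the idea under that assumption, not the package's implementation, and the names `group_size` and `dupes` are hypothetical.

```r
# Size of each (mpg, hp) combination, repeated for every row in that group.
group_size <- ave(seq_len(nrow(mtcars)), mtcars$mpg, mtcars$hp, FUN = length)

# Keep only rows whose combination occurs more than once, and attach the count.
dupes <- mtcars[group_size > 1, ]
dupes$dupe_count <- group_size[group_size > 1]

dupes[, c("mpg", "hp", "dupe_count")]
```

On `mtcars` this keeps the two Mazda RX4 rows, which share both `mpg` and `hp`.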
-get_dupes(mtcars, mpg, hp)
-#> mpg hp dupe_count cyl disp drat wt qsec vs am gear carb
-#> Mazda RX4 21 110 2 6 160 3.9 2.620 16.46 0 1 4 4
-#> Mazda RX4 Wag 21 110 2 6 160 3.9 2.875 17.02 0 1 4 4
-# or called with the magrittr pipe %>% :
-mtcars %>% get_dupes(wt)
-#> wt dupe_count mpg cyl disp hp drat qsec vs am gear carb
-#> Hornet Sportabout 3.44 3 18.7 8 360.0 175 3.15 17.02 0 0 3 2
-#> Merc 280 3.44 3 19.2 6 167.6 123 3.92 18.30 1 0 4 4
-#> Merc 280C 3.44 3 17.8 6 167.6 123 3.92 18.90 1 0 4 4
-#> Duster 360 3.57 2 14.3 8 360.0 245 3.21 15.84 0 0 3 4
-#> Maserati Bora 3.57 2 15.0 8 301.0 335 3.54 14.60 0 1 5 8
-
-#> mpg cyl disp hp drat vs am gear carb dupe_count wt qsec
-#> Mazda RX4 21 6 160 110 3.9 0 1 4 4 2 2.620 16.46
-#> Mazda RX4 Wag 21 6 160 110 3.9 0 1 4 4 2 2.875 17.02
-mtcars %>% get_dupes(starts_with("cy"))
-#> cyl dupe_count mpg disp hp drat wt qsec vs am gear
-#> Datsun 710 4 11 22.8 108.0 93 3.85 2.320 18.61 1 1 4
-#> Merc 240D 4 11 24.4 146.7 62 3.69 3.190 20.00 1 0 4
-#> Merc 230 4 11 22.8 140.8 95 3.92 3.150 22.90 1 0 4
-#> Fiat 128 4 11 32.4 78.7 66 4.08 2.200 19.47 1 1 4
-#> Honda Civic 4 11 30.4 75.7 52 4.93 1.615 18.52 1 1 4
-#> Toyota Corolla 4 11 33.9 71.1 65 4.22 1.835 19.90 1 1 4
-#> Toyota Corona 4 11 21.5 120.1 97 3.70 2.465 20.01 1 0 3
-#> Fiat X1-9 4 11 27.3 79.0 66 4.08 1.935 18.90 1 1 4
-#> Porsche 914-2 4 11 26.0 120.3 91 4.43 2.140 16.70 0 1 5
-#> Lotus Europa 4 11 30.4 95.1 113 3.77 1.513 16.90 1 1 5
-#> Volvo 142E 4 11 21.4 121.0 109 4.11 2.780 18.60 1 1 4
-#> Mazda RX4 6 7 21.0 160.0 110 3.90 2.620 16.46 0 1 4
-#> Mazda RX4 Wag 6 7 21.0 160.0 110 3.90 2.875 17.02 0 1 4
-#> Hornet 4 Drive 6 7 21.4 258.0 110 3.08 3.215 19.44 1 0 3
-#> Valiant 6 7 18.1 225.0 105 2.76 3.460 20.22 1 0 3
-#> Merc 280 6 7 19.2 167.6 123 3.92 3.440 18.30 1 0 4
-#> Merc 280C 6 7 17.8 167.6 123 3.92 3.440 18.90 1 0 4
-#> Ferrari Dino 6 7 19.7 145.0 175 3.62 2.770 15.50 0 1 5
-#> Hornet Sportabout 8 14 18.7 360.0 175 3.15 3.440 17.02 0 0 3
-#> Duster 360 8 14 14.3 360.0 245 3.21 3.570 15.84 0 0 3
-#> Merc 450SE 8 14 16.4 275.8 180 3.07 4.070 17.40 0 0 3
-#> Merc 450SL 8 14 17.3 275.8 180 3.07 3.730 17.60 0 0 3
-#> Merc 450SLC 8 14 15.2 275.8 180 3.07 3.780 18.00 0 0 3
-#> Cadillac Fleetwood 8 14 10.4 472.0 205 2.93 5.250 17.98 0 0 3
-#> Lincoln Continental 8 14 10.4 460.0 215 3.00 5.424 17.82 0 0 3
-#> Chrysler Imperial 8 14 14.7 440.0 230 3.23 5.345 17.42 0 0 3
-#> Dodge Challenger 8 14 15.5 318.0 150 2.76 3.520 16.87 0 0 3
-#> AMC Javelin 8 14 15.2 304.0 150 3.15 3.435 17.30 0 0 3
-#> Camaro Z28 8 14 13.3 350.0 245 3.73 3.840 15.41 0 0 3
-#> Pontiac Firebird 8 14 19.2 400.0 175 3.08 3.845 17.05 0 0 3
-#> Ford Pantera L 8 14 15.8 351.0 264 4.22 3.170 14.50 0 1 5
-#> Maserati Bora 8 14 15.0 301.0 335 3.54 3.570 14.60 0 1 5
-#> carb
-#> Datsun 710 1
-#> Merc 240D 2
-#> Merc 230 2
-#> Fiat 128 1
-#> Honda Civic 2
-#> Toyota Corolla 1
-#> Toyota Corona 1
-#> Fiat X1-9 1
-#> Porsche 914-2 2
-#> Lotus Europa 2
-#> Volvo 142E 2
-#> Mazda RX4 4
-#> Mazda RX4 Wag 4
-#> Hornet 4 Drive 1
-#> Valiant 1
-#> Merc 280 4
-#> Merc 280C 4
-#> Ferrari Dino 6
-#> Hornet Sportabout 2
-#> Duster 360 4
-#> Merc 450SE 3
-#> Merc 450SL 3
-#> Merc 450SLC 3
-#> Cadillac Fleetwood 4
-#> Lincoln Continental 4
-#> Chrysler Imperial 4
-#> Dodge Challenger 2
-#> AMC Javelin 2
-#> Camaro Z28 4
-#> Pontiac Firebird 2
-#> Ford Pantera L 4
-#> Maserati Bora 8
janitor has simple little tools for examining and cleaning dirty data.
-The main janitor functions can: perfectly format ugly data.frame
column names; isolate
-duplicate records for further study; and provide quick one- and two-variable tabulations
-(i.e., frequency tables and crosstabs) that improve on the base R function table()
.
Other functions in the package format the results of these tabulations for reporting.
-These tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.
-This package follows the principles of the "tidyverse" and in particular works well with
-the %>%
pipe function.
janitor was built with beginning-to-intermediate R users in mind
-and is optimized for user-friendliness. Advanced users can already do everything
-covered here, but they can do it faster with janitor and save their thinking for
-more fun tasks.
- -R/janitor_deprecated.R
- janitor_deprecated.Rd
These functions have already become defunct or may be defunct as soon as the next release.
-R/make_clean_names.R
- make_clean_names.Rd
Resulting strings are unique and consist only of the _
-character, numbers, and letters. By default, the resulting strings will only
-consist of ASCII characters, but non-ASCII (e.g. Unicode) may be allowed by
-setting ascii=FALSE
. Capitalization preferences can be specified
-using the case
parameter.
For use on the names of a data.frame, e.g., in a `%>%`
pipeline,
-call the convenience function clean_names
.
When ascii=TRUE
(the default), accented characters are transliterated
-to ASCII. For example, an "o" with a German umlaut over it becomes "o", and
-the Spanish character "enye" becomes "n".
The order of operations is: replace
, (optional) ASCII conversion,
-removing initial spaces and punctuation, apply base::make.names()
,
-apply to_any_case
, and add numeric suffixes to
-duplicates.
See the documentation for snakecase::to_any_case
for more about how
-to control its behavior.
On some systems, not all transliterators to ASCII are available. If this is
-the case on your system, all available transliterators will be used, and a
-warning will be issued once per session indicating that results may be
-different when run on a different system. That warning can be disabled with
-options(janitor_warn_transliterators=FALSE)
.
If the objective of your call to make_clean_names()
is only to translate to
-ASCII, try the following instead:
-stringi::stri_trans_general(x, id="Any-Latin;Greek-Latin;Latin-ASCII")
.
make_clean_names(
-  string,
-  case = "snake",
-  replace = c(`'` = "", `"` = "", `%` = "_percent_", `#` = "_number_"),
-  ascii = TRUE,
-  use_make_names = TRUE,
-  sep_in = "\\.",
-  transliterations = "Latin-ASCII",
-  parsing_option = 1,
-  numerals = "asis",
-  ...
-)
string | -A character vector of names to clean. |
-
---|---|
case | -The desired target case (default is |
-
replace | -A named character vector where the name is replaced by the -value. |
-
ascii | -Convert the names to ASCII ( |
-
use_make_names | -Should |
-
sep_in | -(short for separator input) if character, is interpreted as a
-regular expression (wrapped internally into |
-
transliterations | -A character vector (if not |
-
parsing_option | -An integer that will determine the parsing_option.
|
-
numerals | -A character specifying the alignment of numerals ( |
-
... | -Arguments passed on to
|
-
Returns the "cleaned" character vector.
-# cleaning the names of a vector:
-x <- structure(1:3, names = c("name with space", "TwoWords", "total $ (2009)"))
-x
-#> name with space TwoWords total $ (2009)
-#> 1 2 3
-
-#> name_with_space two_words total_2009
-#> 1 2 3
-
-#> [1] "nameWithSpace" "twoWords" "total2009"
-# similar to janitor::clean_names(poorly_named_df):
-# not run:
-# make_clean_names(names(poorly_named_df))
R/remove_empties.R
- remove_constant.Rd
Remove constant columns from a data.frame or matrix.
-remove_constant(dat, na.rm = FALSE, quiet = TRUE)- -
dat | -the input data.frame or matrix. |
-
---|---|
na.rm | -should |
-
quiet | -Should messages be suppressed ( |
-
remove_empty()
for removing empty
- columns or rows.
Other remove functions:
-remove_empty()
-#> B
-#> 1 1
-#> 2 2
-#> 3 3
-# To find the columns that are constant
-data.frame(A=1, B=1:3) %>%
-  dplyr::select_at(setdiff(names(.), names(remove_constant(.)))) %>%
-  unique()
-#> A
-#> 1 1
R/remove_empties.R
- remove_empty.Rd
Removes all rows and/or columns from a data.frame or matrix that
- are composed entirely of NA
values.
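The idea can be illustrated in base R. The following is a minimal sketch, not janitor's implementation: `drop_all_na()` is a hypothetical helper, and it assumes that "empty" means every value in the row or column is `NA`.

```r
# drop_all_na() is a hypothetical helper written for illustration only;
# it is not part of janitor and may differ from remove_empty() in edge cases.
drop_all_na <- function(dat, which = c("rows", "cols")) {
  if ("rows" %in% which) {
    # Keep rows that have at least one non-NA value.
    keep_rows <- rowSums(is.na(dat)) < ncol(dat)
    dat <- dat[keep_rows, , drop = FALSE]
  }
  if ("cols" %in% which) {
    # Keep columns that have at least one non-NA value.
    keep_cols <- colSums(is.na(dat)) < nrow(dat)
    dat <- dat[, keep_cols, drop = FALSE]
  }
  dat
}

dat <- data.frame(a = c(1, NA, 3), b = c(NA, NA, NA), c = c("x", NA, "z"))
cleaned <- drop_all_na(dat)  # drops the second row and column b
```

Rows are filtered before columns here, so a column that only had values in removed rows is also dropped.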
remove_empty(dat, which = c("rows", "cols"), quiet = TRUE)- -
dat | -the input data.frame or matrix. |
-
---|---|
which | -one of "rows", "cols", or |
-
quiet | -Should messages be suppressed ( |
-
Returns the object without its missing rows or columns.
-remove_constant()
for removing
- constant columns.
Other remove functions:
-remove_constant()
-# not run: -# dat %>% remove_empty("rows") -
R/janitor_deprecated.R
- remove_empty_cols.Rd
This function is deprecated; use remove_empty("cols")
instead.
remove_empty_cols(dat)- -
dat | -the input data.frame. |
-
---|
Returns the data.frame with no empty columns.
- --# not run: -# dat %>% remove_empty_cols -
This function is deprecated; use remove_empty("rows")
instead.
remove_empty_rows(dat)- -
dat | -the input data.frame. |
-
---|
Returns the data.frame with no empty rows.
- --# not run: -# dat %>% remove_empty_rows -
R/round_half_up.R
- round_half_up.Rd
In base R round()
, halves are rounded to even, e.g., 12.5 and 11.5 are both rounded to 12. This function rounds 12.5 to 13 (assuming digits = 0
). Negative halves are rounded away from zero, e.g., -0.5 is rounded to -1.
Rounding halves up may skew subsequent statistical analysis of the data, but it can be desirable in certain contexts. The implementation is taken directly from http://stackoverflow.com/a/12688836; see that question and its comments for discussion of this issue.
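To make the difference from base R concrete, here is a minimal sketch of half-up rounding. `half_up()` is a hypothetical name, and the small epsilon term is an assumption borrowed from the general approach in the linked answer rather than a copy of janitor's source.

```r
# half_up() is a hypothetical helper for illustration; it is not janitor's code.
half_up <- function(x, digits = 0) {
  scale <- 10^digits
  # Add 0.5 to the scaled magnitude and truncate, then restore the sign so
  # that negative halves move away from zero. The epsilon guards against
  # values that land just below a half because of floating point error.
  sign(x) * trunc(abs(x) * scale + 0.5 + sqrt(.Machine$double.eps)) / scale
}

base_result <- round(c(11.5, 12.5))             # base R rounds halves to even
half_up_result <- half_up(c(11.5, 12.5, -0.5))  # halves go away from zero
```

Working on `abs(x)` and reapplying the sign is what makes the rule symmetric around zero.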
-round_half_up(x, digits = 0)- -
x | -a numeric vector to round. |
-
---|---|
digits | -how many digits should be displayed after the decimal point? |
-
-round_half_up(12.5)
-#> [1] 13
-round_half_up(1.125, 2)
-#> [1] 1.13
-round_half_up(1.125, 1)
-#> [1] 1.1
-round_half_up(-0.5, 0) # negatives get rounded away from zero
-#> [1] -1
R/round_to_fraction.R
- round_to_fraction.Rd
Round a decimal to the precise decimal value of a specified
-fractional denominator. Common use cases include addressing floating point
-imprecision and enforcing that data values fall into a certain set.
-E.g., if a decimal represents hours and values should be logged to the nearest
-minute, round_to_fraction(x, 60)
would enforce that distribution and 0.57
-would be rounded to 0.566667, the equivalent of 34/60. 0.56 would also be rounded
-to 34/60.
Set denominator = 1
to round to whole numbers.
The digits
argument allows for rounding of the subsequent result.
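The arithmetic behind this can be sketched in base R. `to_fraction()` is a hypothetical helper, and it assumes plain `round()` for the numerator, so its treatment of exact halves may differ from janitor's.

```r
# to_fraction() is a hypothetical helper for illustration; not janitor's code.
to_fraction <- function(x, denominator, digits = Inf) {
  # Round the numerator to a whole number, then divide back.
  out <- round(x * denominator) / denominator
  # Optionally round the resulting decimal as a second step.
  if (is.finite(digits)) out <- round(out, digits)
  out
}

minutes <- to_fraction(c(0.56, 0.57), denominator = 60)  # both become 34/60
halves <- to_fraction(1.6, denominator = 2)              # 3/2
```

Both 0.56 and 0.57 scale to a numerator that rounds to 34, which is why they collapse to the same value.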
round_to_fraction(x, denominator, digits = Inf)- -
x | -A numeric vector |
-
---|---|
denominator | -The denominator of the fraction for rounding (a scalar or -vector positive integer). |
-
digits | -Integer indicating the number of decimal places to be used
-after rounding to the fraction. This is passed to |
-
the input x rounded to a decimal value that has an integer numerator relative
- to denominator
(possibly subsequently rounded to a number of decimal
- digits).
If digits
is Inf
, x
is rounded to the fraction
- and then kept at full precision. If digits
is "auto"
, the
- number of digits is automatically selected as
- ceiling(log10(denominator)) + 1
.
-round_to_fraction(1.6, denominator = 2)
-#> [1] 1.5
-round_to_fraction(pi, denominator = 7) # 22/7
-#> [1] 3.142857
-
-#> [1] 8.142857 9.250000
-
-#> [1] 8.143 9.250
-
-#> [1] 8.1400 9.2500 10.2997
R/row_to_names.R
- row_to_names.Rd
Elevate a row to be the column names of a data.frame.
-row_to_names(dat, row_number, remove_row = TRUE, remove_rows_above = TRUE)- -
dat | -The input data.frame |
-
---|---|
row_number | -The row of |
-
remove_row | -Should the row |
-
remove_rows_above | -If |
-
A data.frame with new names (and some rows removed, if specified)
-x <- data.frame(X_1 = c(NA, "Title", 1:3),
-           X_2 = c(NA, "Title2", 4:6))
-x %>%
-  row_to_names(row_number = 2)
-#> Title Title2
-#> 3 1 4
-#> 4 2 5
-#> 5 3 6
R/round_half_up.R
- signif_half_up.Rd
In base R signif()
, halves are rounded to even, e.g.,
-signif(11.5, 2)
and signif(12.5, 2)
are both rounded to 12.
-This function rounds 12.5 to 13 (assuming digits = 2
). Negative halves
-are rounded away from zero, e.g., signif_half_up(-2.5, 1)
returns -3.
This may skew subsequent statistical analysis of the data, but may be
-desirable in certain contexts. This function is implemented from
-https://stackoverflow.com/a/1581007; see that question and
-comments for discussion of this issue.
-signif_half_up(x, digits = 6)- -
x | -a numeric vector to round. |
-
---|---|
digits | -integer indicating the number of significant digits to be used. |
-
-signif_half_up(12.5, 2)
-#> [1] 13
-signif_half_up(1.125, 3)
-#> [1] 1.13
-signif_half_up(-2.5, 1) # negatives get rounded away from zero
-#> [1] -3
A fully-featured alternative to table()
. Results are data.frames and can be formatted and enhanced with janitor's family of adorn_
functions.
Specify a data.frame and the one, two, or three unquoted column names you want to tabulate. Supplying three variables generates a list of 2-way tabyls, split by the third variable.
-Alternatively, you can tabulate a single variable that isn't in a data.frame by calling tabyl
on a vector, e.g., tabyl(mtcars$gear)
.
tabyl(dat, ...)
-
-# S3 method for default
-tabyl(dat, show_na = TRUE, show_missing_levels = TRUE, ...)
-
-# S3 method for data.frame
-tabyl(dat, var1, var2, var3, show_na = TRUE, show_missing_levels = TRUE, ...)
dat | -a data.frame containing the variables you wish to count. Or, a vector you want to tabulate. |
-
---|---|
... | -the arguments to tabyl (here just for the sake of documentation compliance, as all arguments are listed with the vector- and data.frame-specific methods) |
-
show_na | -should counts of |
-
show_missing_levels | -should counts of missing levels of factors be displayed? These will be rows and/or columns of zeroes. Useful for keeping consistent output dimensions even when certain factor levels may not be present in the data. |
-
var1 | -the column name of the first variable. |
-
var2 | -(optional) the column name of the second variable (the columns in a 2-way tabulation). |
-
var3 | -(optional) the column name of the third variable (the list in a 3-way tabulation). |
-
Returns a data.frame with frequencies and percentages of the tabulated variable(s). A 3-way tabulation returns a list of data.frames.
-tabyl(mtcars, cyl)
-#> cyl n percent
-#> 4 11 0.34375
-#> 6 7 0.21875
-#> 8 14 0.43750
-tabyl(mtcars, cyl, gear)
-#> cyl 3 4 5
-#> 4 1 8 2
-#> 6 2 4 1
-#> 8 12 0 2
-tabyl(mtcars, cyl, gear, am)
-#> $`0`
-#> cyl 3 4 5
-#> 4 1 2 0
-#> 6 2 2 0
-#> 8 12 0 0
-#>
-#> $`1`
-#> cyl 3 4 5
-#> 4 0 6 2
-#> 6 0 2 1
-#> 8 0 0 2
-#>
-# or using the %>% pipe
-mtcars %>%
-  tabyl(cyl, gear)
-#> cyl 3 4 5
-#> 4 1 8 2
-#> 6 2 4 1
-#> 8 12 0 2
-# illustrating show_na functionality:
-my_cars <- rbind(mtcars, rep(NA, 11))
-my_cars %>% tabyl(cyl)
-#> cyl n percent valid_percent
-#> 4 11 0.33333333 0.34375
-#> 6 7 0.21212121 0.21875
-#> 8 14 0.42424242 0.43750
-#> NA 1 0.03030303 NA
-my_cars %>% tabyl(cyl, show_na = FALSE)
-#> cyl n percent
-#> 4 11 0.34375
-#> 6 7 0.21875
-#> 8 14 0.43750
-
-#> val n percent
-#> hi 1 0.25
-#> lo 1 0.25
-#> med 2 0.50
R/top_levels.R
- top_levels.Rd
Get a frequency table of a factor variable, grouped into categories by level.
-top_levels(input_vec, n = 2, show_na = FALSE)- -
input_vec | -the factor variable to tabulate. |
-
---|---|
n | -number of levels to include in top and bottom groups |
-
show_na | -should cases where the variable is NA be shown? |
-
Returns a data.frame (actually a tbl_df
) with the frequencies of the grouped, tabulated variable. Includes counts and percentages, and valid percentages (calculated omitting NA
values, if present in the vector and show_na = TRUE
.)
-#> as.factor(mtcars$hp) n percent
-#> 52, 62 2 0.0625
-#> <<< Middle Group (18 categories) >>> 28 0.8750
-#> 264, 335 2 0.0625
tabyl attributes from a data.frame. - untabyl • janitor
Strips away all tabyl
-related attributes from a data.frame.
untabyl(dat)- -
dat | -a data.frame of class |
-
---|
Returns the same data.frame, but without the tabyl
class and attributes.
-#> $names
-#> [1] "am" "n" "percent"
-#>
-#> $class
-#> [1] "data.frame"
-#>
-#> $row.names
-#> [1] 1 2
-#>
R/janitor_deprecated.R
- use_first_valid_of.Rd
At each position of the input vectors, iterates through in order and returns the first non-NA value. This is a robust replacement of the common ifelse(!is.na(x), x, ifelse(!is.na(y), y, z))
. It's more readable and handles problems like ifelse
's inability to work with dates in this way.
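The position-by-position logic can be sketched in base R as follows. `first_valid()` is a hypothetical helper, not the deprecated function's source, and it assumes all inputs have the same length and type.

```r
# first_valid() is a hypothetical helper for illustration only.
first_valid <- function(...) {
  vecs <- list(...)
  out <- vecs[[1]]
  for (v in vecs[-1]) {
    # Fill only the positions that are still NA, in the order supplied.
    gaps <- is.na(out)
    out[gaps] <- v[gaps]
  }
  out
}

x <- as.Date(c("2016-01-01", NA, NA))
y <- as.Date(c("2016-02-01", "2016-02-02", NA))
z <- as.Date(c("2016-03-01", "2016-03-02", "2016-03-03"))
result <- first_valid(x, y, z)  # stays a Date vector
```

Because the gaps are filled by subset assignment rather than `ifelse()`, the class of the first vector is preserved, which is what makes the approach work with dates.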
use_first_valid_of(..., if_all_NA = NA)- -
... | -the input vectors. Order matters: these are searched and prioritized in the order they are supplied. |
-
---|---|
if_all_NA | -what value should be used when all of the vectors return |
-
Returns a single vector with the selected values.
Deprecated; do not use in new code. Use dplyr::coalesce()
instead.
janitor_deprecated