add sort_by.data.table #6679

Open
wants to merge 6 commits into
base: master
Changes from 1 commit
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -206,3 +206,5 @@ S3method(format_list_item, data.frame)

export(fdroplevels, setdroplevels)
S3method(droplevels, data.table)

S3method(sort_by, data.table)
MichaelChirico marked this conversation as resolved.
3 changes: 3 additions & 0 deletions NEWS.md
@@ -69,6 +69,9 @@ rowwiseDT(

6. `fread()` gains `logicalYN` argument to read columns consisting only of strings `Y`, `N` as `logical` (as opposed to character), [#4563](https://github.com/Rdatatable/data.table/issues/4563). The default is controlled by option `datatable.logicalYN`, itself defaulting to `FALSE`, for back-compatibility -- some smaller tables (especially sharded tables) might inadvertently read a "true" string column as `logical` and cause bugs. This is particularly important for tables with a column named `y` or `n` -- automatic header detection under `logicalYN=TRUE` will see these values in the first row as being "data" as opposed to column names. A parallel option was not included for `fwrite()` at this time -- users looking for a compact representation of logical columns can still use `fwrite(logical01=TRUE)`. We also opted for now to check only `Y`, `N` and not `Yes`/`No`/`YES`/`NO`.

7. Base R generic `sort_by()` (new in R 4.4.0) is now implemented for data.tables. Internally it uses data.table's `forder()` instead of base R's `order()` for efficiency; as a result, it sorts in the C-locale, following data.table's usual sorting convention (suggested by @rikivillalba).


## BUG FIXES

1. `fwrite()` respects `dec=','` for timestamp columns (`POSIXct` or `nanotime`) with sub-second accuracy, [#6446](https://github.com/Rdatatable/data.table/issues/6446). Thanks @kav2k for pointing out the inconsistency and @MichaelChirico for the PR.
12 changes: 12 additions & 0 deletions R/data.table.R
@@ -2532,6 +2532,18 @@
}
}

sort_by.data.table <- function (x, y, ...)

Check warning on line 2535 in R/data.table.R (GitHub Actions / lint-r)
file=R/data.table.R,line=2535,col=31,[function_left_parentheses_linter] Remove spaces before the left parenthesis in a function definition.
MichaelChirico marked this conversation as resolved.
{
  if (!cedta()) return(NextMethod()) # nocov
  if (inherits(y, "formula"))
    y <- .formula2varlist(y, x)
  if (!is.list(y))
    y <- list(y)
Member

Needs a few more tests -- currently all cases use the formula interface.
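
For reference, the calling conventions that additional tests would need to cover can be sketched with base R's `data.frame` method (R >= 4.4.0). The data and variable names below are made up for illustration, and it is only an assumption that the data.table method in this PR accepts the same three forms.

```r
# Illustrative only: uses base R's sort_by.data.frame, which needs R >= 4.4.0.
df <- data.frame(a = c(1, 3, 2, NA, 3), b = 4:0)

if (getRversion() >= "4.4.0") {
  by_formula <- sort_by(df, ~ a + b)           # formula interface
  by_vector  <- sort_by(df, df$a)              # a single atomic vector
  by_list    <- sort_by(df, list(df$a, df$b))  # a list of vectors
}
```

The formula and list forms name the same two keys, so they should produce the same row order; the vector form sorts by `a` alone.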

Contributor Author @rikivillalba, Jan 17, 2025

Added some tests for other interfaces
https://github.com/Rdatatable/data.table/blob/bc3a413d1912340f93b1b50004a04a3b1ca2b485/inst/tests/tests.Rraw#L20711-L20725

A couple of questions came up:

  • The base method sort_by.data.frame uses .formula2varlist(), a public but "internal" function of the base package; as the man page says of such functions,

most of which are only user-visible because of the special nature of the base namespace.

we can use that in sort_by.data.table right?

  • List columns: I included a test in which the sort column is a list column. Because forder treats a list like a data.table itself, the result may have as many rows as each vector in the sorting list has elements. Nevertheless, this is the same behaviour as dt[order(x)] where x is a list column.

Member

we can use that in sort_by.data.table right?

Hmm, thanks for flagging. Actually, I'm not concerned so much about that, but you did make me remember that it's a pretty new {base} function. It won't be available in all the R versions we support --> will cause an R CMD check NOTE there.

And actually, we've crossed this bridge before in #5393:

data.table/R/data.table.R

Lines 2460 to 2465 in d263924

# same as split.data.frame - handling all exceptions, factor orders etc, in a single stream of processing was a nightmare in factor and drop consistency
# evaluate formula mirroring split.data.frame #5392. Mimics base::.formula2varlist.
if (inherits(f, "formula"))
  f <- eval(attr(terms(f), "variables"), x, environment(f))
# be sure to use x[ind, , drop = FALSE], not x[ind], in case downstream methods don't follow the same subsetting semantics (#5365)
return(lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...), function(ind) x[ind, , drop = FALSE]))

So actually, we should just make a local copy of .formula2varlist() to re-use between split.data.table and sort_by.data.table.
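
A minimal sketch of what such a shared local helper might look like, reusing the expression already present in split.data.table. The name `formula_varlist` is hypothetical, and this simplified version may differ from `base::.formula2varlist()` in edge cases.

```r
# Hypothetical helper name; mirrors the expression used in split.data.table.
formula_varlist <- function(formula, data) {
  stopifnot(inherits(formula, "formula"))
  # terms() stores the formula's variables as an unevaluated list(...) call;
  # evaluate it in `data`, falling back to the formula's environment.
  eval(attr(stats::terms(formula), "variables"), data, environment(formula))
}

df <- data.frame(a = c(2, 1), b = c(10L, 20L))
vars <- formula_varlist(~ a + b, df)  # unnamed list holding columns a and b
```

Both split.data.table and sort_by.data.table could then call the same function, which avoids depending on a base function that is missing from older R versions.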

Member

List columns: I included a test in which the sort column is a list column. Because forder treats a list like a data.table itself, the result may have as many rows as each vector in the sorting list has elements. Nevertheless, this is the same behaviour as dt[order(x)] where x is a list column.

As long as the behavior is consistent with [order(...)] I think we can ignore it in this PR.

  # use forder instead of base 'order'
  o <- do.call(forder, c(unname(y), list(...)))
  x[o, , drop = FALSE]
}

# TO DO, add more warnings e.g. for by.data.table(), telling user what the data.table syntax is but letting them dispatch to data.frame if they want

copy = function(x) {
7 changes: 7 additions & 0 deletions inst/tests/tests.Rraw
@@ -20686,6 +20686,13 @@ test(2299.10, data.table(a=1), output="a\
test(2299.11, data.table(a=list(data.frame(b=1))), output="a\n1: <data.frame[1x1]>")
test(2299.12, data.table(a=list(data.table(b=1))), output="a\n1: <data.table[1x1]>")

# sort_by.data.table
DT1 = data.table(a = c(1, 3, 2, NA, 3) , b = 4:0)
DT2 = data.table(a = c("c", "a", "B")) # data.table uses C-locale and should sort_by if cedta()
test(2300.01, sort_by(DT1, ~ a + b), data.table(a = c(1,2,3,3,NA), b = c(4L,2L,0L,3L,1L)))
test(2300.02, sort_by(DT1, ~ I(a + b)), data.table(a = c(3,2,1,3,NA), b = c(0L,2L,4L,3L,1L)))
test(2300.03, sort_by(DT2, ~ a), data.table(a = c("B", "a", "c")))
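
The expected value in test 2300.03 reflects C-locale ordering, in which uppercase letters sort before lowercase ones. A minimal base R sketch of that difference, independent of this PR (variable names are illustrative):

```r
x <- c("c", "a", "B")

# method = "radix" always sorts character vectors in the C-locale.
c_sorted <- sort(x, method = "radix")

# The default method follows the session's collation, so the result may differ.
locale_sorted <- sort(x)
```

In many non-C locales the default call places "a" before "B", which is why the test comment calls out data.table's C-locale convention.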

if (test_bit64) {
# Join to integer64 doesn't require integer32 representation, just integer64, #6625
i64_val = .Machine$integer.max + 1