add sort_by.data.table #6679

Open
wants to merge 6 commits into
base: master


rikivillalba
Contributor

@rikivillalba rikivillalba commented Dec 21, 2024

Proposal for #6662

  • Note that sort_by.data.table() now sorts using the C locale. This is incompatible with base::sort_by.data.frame() as previously dispatched on data.tables, which orders character columns according to the session locale.
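
To make the locale difference concrete, here is a minimal base R sketch (it does not call data.table or sort_by() at all). It relies on the documented behaviour that order(method = "radix") sorts character vectors in the C locale, while the default method for character vectors uses the session's collation. The vector x is made up for illustration.

```r
# Illustration only: base R, no data.table required.
x <- c("b", "A", "a", "B")

# Radix ordering compares bytes (C locale), so uppercase sorts before lowercase.
c_sorted <- x[order(x, method = "radix")]
print(c_sorted)  # "A" "B" "a" "b"

# The default method for character vectors follows the session's collation
# rules, so the result depends on the locale and is deliberately not asserted.
locale_sorted <- x[order(x)]
print(locale_sorted)
```

In many non-C locales the second result interleaves upper and lower case, which is the kind of difference the note above warns about.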


github-actions bot commented Dec 21, 2024

Comparison Plot

Generated via commit fe8f6f1

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 4 minutes and 32 seconds
Installing different package versions 7 minutes and 52 seconds
Running and plotting the test cases 2 minutes and 26 seconds

R/data.table.R (outdated)
if (inherits(y, "formula"))
  y <- .formula2varlist(y, x)
if (!is.list(y))
  y <- list(y)
Member

Needs a few more tests -- currently all cases use the formula interface.
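
As a rough sketch of what tests for the non-formula interfaces could check, the snippet below exercises the three input forms the reviewed code accepts (formula, list, bare vector). It uses base R only; sort_keys() and ord() are hypothetical helpers written for this illustration and are not the PR's code or its actual tests.

```r
# Hypothetical helper mirroring the snippet under review; not data.table's code.
sort_keys <- function(y, x) {
  if (inherits(y, "formula"))
    y <- eval(attr(terms(y), "variables"), x, environment(y))
  if (!is.list(y))
    y <- list(y)
  y
}

df <- data.frame(g = c(2L, 1L, 2L, 1L), v = c(4L, 3L, 2L, 1L))
ord <- function(y) do.call(order, sort_keys(y, df))

by_formula <- ord(~ g + v)             # formula interface
by_list    <- ord(list(df$g, df$v))    # list interface
by_vector  <- ord(df$g)                # single-vector interface

# The formula and list forms describe the same keys, so they should agree.
stopifnot(identical(by_formula, by_list))
print(by_formula)  # 4 2 3 1
print(by_vector)   # 2 4 1 3
```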

Contributor Author

@rikivillalba rikivillalba Jan 17, 2025

Added some tests for other interfaces
https://github.com/Rdatatable/data.table/blob/bc3a413d1912340f93b1b50004a04a3b1ca2b485/inst/tests/tests.Rraw#L20711-L20725

Some questions came up:

  • The base method sort_by.data.frame uses .formula2varlist(), which is a public but "internal" function of the base package. As the manpage says of such functions,

most of which are only user-visible because of the special nature of the base namespace.

Can we use that in sort_by.data.table?

  • List columns. I included a test in which the sort column is a list column. Since forder treats a list like a data.table itself, the result may have as many rows as each vector in the sorting list has elements. Nevertheless, this is the same behaviour as dt[order(x)] where x is a list column.

Member

> we can use that in sort_by.data.table right?

Hmm, thanks for flagging. Actually, I'm not so concerned about that, but you did make me remember that it's a pretty new {base} function. It won't be available in all the R versions we support, so using it will cause an R CMD check NOTE there.

And actually, we've crossed this bridge before in #5393:

data.table/R/data.table.R

Lines 2460 to 2465 in d263924

# same as split.data.frame - handling all exceptions, factor orders etc, in a single stream of processing was a nightmare in factor and drop consistency
# evaluate formula mirroring split.data.frame #5392. Mimics base::.formula2varlist.
if (inherits(f, "formula"))
  f <- eval(attr(terms(f), "variables"), x, environment(f))
# be sure to use x[ind, , drop = FALSE], not x[ind], in case downstream methods don't follow the same subsetting semantics (#5365)
return(lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...), function(ind) x[ind, , drop = FALSE]))

So actually, we should just make a local copy of .formula2varlist() to re-use between split.data.table and sort_by.data.table.
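
A minimal sketch of what such a shared local helper might look like, assuming it simply wraps the eval(attr(terms(f), "variables"), ...) idiom from the split.data.table snippet quoted above. The name formula_vars() is hypothetical, and the example uses a plain data.frame so it runs without data.table; the real helper and its callers may differ.

```r
# Hypothetical local stand-in for base::.formula2varlist(); name is illustrative.
formula_vars <- function(f, data) {
  eval(attr(terms(f), "variables"), data, environment(f))
}

df <- data.frame(g = c("b", "a", "b"), v = c(3L, 1L, 2L))

# Split-style caller: group row indices by the formula's variables.
groups <- split(seq_len(nrow(df)), formula_vars(~ g, df))

# Sort-style caller: order rows by the formula's variables.
sorted <- df[do.call(order, formula_vars(~ g + v, df)), , drop = FALSE]

print(groups)
print(sorted)
```

Keeping the formula evaluation in one place means both methods resolve variables the same way (in the data first, then in the formula's environment) on R versions that predate the base function.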

Member

> List columns. I included a test in which the sort column is a list column. Since forder treats a list like a data.table itself, the result may have as many rows as each vector in the sorting list has elements. Nevertheless, this is the same behaviour as dt[order(x)] where x is a list column.

As long as the behavior is consistent with [order(...)] I think we can ignore it in this PR.

Member

@MichaelChirico MichaelChirico left a comment

Looks good! I think it needs an entry in man/, probably in ?forder would be the most logical place?

@rikivillalba
Contributor Author

> Looks good! I think it needs an entry in man/, probably in ?forder would be the most logical place?

Added a reference in setorder.Rd (?forder). PTAL.
