min_rows() doesn't work when passed a tbl_spark #1045

Closed
jmunyoon opened this issue Jan 11, 2024 · 2 comments · Fixed by #1047

@jmunyoon

The problem

min_rows() errors when passed a tbl_spark because nrow(tbl_spark) returns NA; sdf_nrow(tbl_spark) should be used in this case.
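
To make the failure mode concrete, here is a minimal base-R sketch that needs no Spark connection. It assumes only what is reported here, namely that the row count comes back as NA; the variable names mirror the arguments of min_rows() but are otherwise illustrative.

```r
# Stand-in for nrow(tbl_spark), which this issue reports as NA.
n <- NA_integer_
num_rows <- 5
offset <- 0

# The comparison itself does not fail; it quietly evaluates to NA.
cmp <- num_rows > n - offset
is.na(cmp)
# TRUE

# if() is the step that rejects an NA condition and raises the error.
err <- tryCatch(
  if (cmp) "too many rows requested" else "ok",
  error = function(e) conditionMessage(e)
)
err
```

In an English locale, err holds the same "missing value where TRUE/FALSE needed" message that fit() reports in the example below.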

Reproducible example

library(parsnip)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

cars <- copy_to(sc, mtcars)

rand_forest(
  mode = "regression",
  engine = "spark",
  mtry = 5,
  trees = 100,
  min_n = 5
) %>%
  translate()
## see how ml_random_forest() is using min_rows():
# Random Forest Model Specification (regression)
# 
# Main Arguments:
# mtry = 5
# trees = 100
# min_n = 5
# 
# Computational engine: spark 
# 
# Model fit template:
#   sparklyr::ml_random_forest(x = missing_arg(), formula = missing_arg(), 
#                              type = "regression", feature_subset_strategy = "5", num_trees = 100, 
#                              min_instances_per_node = min_rows(~5, x), seed = sample.int(10^5, 1))

rand_forest(
  mode = "regression",
  engine = "spark",
  mtry = 5,
  trees = 100,
  min_n = 5
) %>%
  fit(
    object = .,
    formula = mpg ~ .,
    data = cars
  )
# Error in if (num_rows > n - offset) { : 
#   missing value where TRUE/FALSE needed

min_rows
## see use of nrow() as first line of function body:
# function (num_rows, source, offset = 0) 
# {
#   n <- nrow(source)
#   if (num_rows > n - offset) {
#     msg <- paste0(num_rows, " samples were requested but there were ", 
#                   n, " rows in the data. ", n - offset, " will be used.")
#     rlang::warn(msg)
#     num_rows <- n - offset
#   }
#   as.integer(num_rows)
# }

## this doesn't work for a tbl_spark...
nrow(cars) # NA
## ...but this does
sdf_nrow(cars) # 32
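
One possible shape for a fix, as a rough sketch only: swap the nrow() call for a row count that is aware of tbl_spark. The names count_rows() and min_rows_sketch() are hypothetical and not part of parsnip, and base warning() stands in for rlang::warn() so the snippet has no dependencies beyond sparklyr on the Spark branch. This is not necessarily how the package should or does implement it.

```r
# Hypothetical helper (not part of parsnip): row count for local or Spark data.
count_rows <- function(source) {
  if (inherits(source, "tbl_spark")) {
    # sdf_nrow() asks Spark for the count; this branch needs sparklyr installed.
    sparklyr::sdf_nrow(source)
  } else {
    nrow(source)
  }
}

# Same logic as the min_rows() body printed above, with only the row count swapped.
min_rows_sketch <- function(num_rows, source, offset = 0) {
  n <- count_rows(source)
  if (num_rows > n - offset) {
    msg <- paste0(num_rows, " samples were requested but there were ",
                  n, " rows in the data. ", n - offset, " will be used.")
    # Assumption: base warning() used here in place of rlang::warn().
    warning(msg, call. = FALSE)
    num_rows <- n - offset
  }
  as.integer(num_rows)
}

# Local data frames behave as before.
min_rows_sketch(5, mtcars)
# 5
```

With a live connection, min_rows_sketch(5, cars) would take the sdf_nrow() branch instead of comparing against NA.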
@jmunyoon jmunyoon changed the title min_rows() doesn't work when passed a tbl_spark min_rows() doesn't work when passed a tbl_spark Jan 11, 2024
@hfrick
Member

hfrick commented Jan 17, 2024

@jmunyoon thanks for the well-written issue!


github-actions bot commented Feb 1, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 1, 2024