`min_rows()` doesn't work when passed a `tbl_spark` #1045

jmunyoon · 2024-01-11T21:38:55Z

The problem

`min_rows()` errors when it is passed a `tbl_spark`, because `nrow()` returns `NA` for a `tbl_spark`. `sdf_nrow()` should be used to count the rows in this case.

Reproducible example

## packages the example relies on
library(sparklyr)
library(dplyr)
library(parsnip)

sc = spark_connect(master = "local")

cars = copy_to(sc, mtcars)

rand_forest(
  mode = "regression",
  engine = "spark",
  mtry = 5,
  trees = 100,
  min_n = 5
) %>%
  translate()
## see how ml_random_forest() is using min_rows():
# Random Forest Model Specification (regression)
# 
# Main Arguments:
# mtry = 5
# trees = 100
# min_n = 5
# 
# Computational engine: spark
# 
# Model fit template:
#   sparklyr::ml_random_forest(x = missing_arg(), formula = missing_arg(), 
#                              type = "regression", feature_subset_strategy = "5", num_trees = 100, 
#                              min_instances_per_node = min_rows(~5, x), seed = sample.int(10^5, 1))

rand_forest(
  mode = "regression",
  engine = "spark",
  mtry = 5,
  trees = 100,
  min_n = 5
) %>%
  fit(
    object = .,
    formula = mpg ~ .,
    data = cars
  )
# Error in if (num_rows > n - offset) { : 
#   missing value where TRUE/FALSE needed

min_rows
## see use of nrow() as first line of function body:
# function (num_rows, source, offset = 0) 
# {
#   n <- nrow(source)
#   if (num_rows > n - offset) {
#     msg <- paste0(num_rows, " samples were requested but there were ", 
#                   n, " rows in the data. ", n - offset, " will be used.")
#     rlang::warn(msg)
#     num_rows <- n - offset
#   }
#   as.integer(num_rows)
# }

## this doesn't work for a tbl_spark...
nrow(cars) # NA
## ...but this does
sdf_nrow(cars) # 32
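
One way the suggestion above could look is sketched below. This is only an illustration, not necessarily what the eventual fix does: `n_rows_any()` and `min_rows_sketch()` are hypothetical names, and base `warning()` stands in for `rlang::warn()` so the snippet runs without extra packages. The only Spark-specific call is `sparklyr::sdf_nrow()`, which is reached only when `source` is a `tbl_spark`.

```r
# Hypothetical helper (not parsnip code): count rows for local or Spark data.
n_rows_any <- function(source) {
  if (inherits(source, "tbl_spark")) {
    # nrow() returns NA for a tbl_spark, so ask Spark for the row count.
    sparklyr::sdf_nrow(source)
  } else {
    nrow(source)
  }
}

# Hypothetical variant of min_rows() that uses the helper above.
min_rows_sketch <- function(num_rows, source, offset = 0) {
  n <- n_rows_any(source)
  if (num_rows > n - offset) {
    warning(
      num_rows, " samples were requested but there were ", n,
      " rows in the data. ", n - offset, " will be used.",
      call. = FALSE
    )
    num_rows <- n - offset
  }
  as.integer(num_rows)
}

# Local data frames behave as before:
min_rows_sketch(5, mtcars)
```

With a `tbl_spark` such as `cars`, `n` would come from `sdf_nrow()` instead of `NA`, so the `if ()` comparison no longer fails with "missing value where TRUE/FALSE needed".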


hfrick · 2024-01-17T09:36:25Z

@jmunyoon thanks for the well-written issue!

github-actions · 2024-02-01T00:52:05Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

jmunyoon changed the title ~~min_rows() doesn't work when passed a tbl_spark~~ min_rows() doesn't work when passed a tbl_spark Jan 11, 2024

simonpcouch mentioned this issue Jan 16, 2024

fix model fit for spark tbls #1047

Merged

simonpcouch closed this as completed in #1047 Jan 17, 2024

github-actions bot locked and limited conversation to collaborators Feb 1, 2024
