min_rows() errors when a tbl_spark is passed due to nrow(tbl_spark) returning NA; sdf_nrow(tbl_spark) should be used in this case
Reproducible example
library(parsnip)
library(sparklyr)
sc <- spark_connect(master = "local")
cars <- copy_to(sc, mtcars)
rand_forest(
  mode = "regression",
  engine = "spark",
  mtry = 5,
  trees = 100,
  min_n = 5
) %>%
  translate()
## see how ml_random_forest() is using min_rows():
# Random Forest Model Specification (regression)
#
# Main Arguments:
#   mtry = 5
#   trees = 100
#   min_n = 6
#
# Computational engine: spark
#
# Model fit template:
# sparklyr::ml_random_forest(x = missing_arg(), formula = missing_arg(),
#     type = "regression", feature_subset_strategy = "5", num_trees = 100,
#     min_instances_per_node = min_rows(~6, x), seed = sample.int(10^5, 1))
rand_forest(
  mode = "regression",
  engine = "spark",
  mtry = 5,
  trees = 100,
  min_n = 5
) %>%
  fit(
    object = .,
    formula = mpg ~ .,
    data = cars
  )
# Error in if (num_rows > n - offset) { :
#   missing value where TRUE/FALSE needed
min_rows
## see use of nrow() as first line of function body:
# function (num_rows, source, offset = 0)
# {
#     n <- nrow(source)
#     if (num_rows > n - offset) {
#         msg <- paste0(num_rows, " samples were requested but there were ",
#             n, " rows in the data. ", n - offset, " will be used.")
#         rlang::warn(msg)
#         num_rows <- n - offset
#     }
#     as.integer(num_rows)
# }
## this doesn't work for a tbl_spark...
nrow(cars) # NA
## ...but this does
sdf_nrow(cars) # 32
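One possible shape for a fix, as a minimal sketch rather than the actual parsnip change: count rows with sdf_nrow() when the source is a tbl_spark and fall back to nrow() otherwise. The names count_rows() and min_rows_sketch() are hypothetical and exist only for this illustration, and base warning() stands in for rlang::warn() so the sketch has no extra dependencies.

```r
# Hypothetical helper (not part of parsnip): row count for local or Spark data.
count_rows <- function(source) {
  if (inherits(source, "tbl_spark")) {
    # nrow() returns NA for a tbl_spark, so ask Spark for the count instead.
    return(sparklyr::sdf_nrow(source))
  }
  nrow(source)
}

# Sketch of min_rows() with only the row count swapped out; the rest mirrors
# the function body printed above.
min_rows_sketch <- function(num_rows, source, offset = 0) {
  n <- count_rows(source)
  if (num_rows > n - offset) {
    msg <- paste0(
      num_rows, " samples were requested but there were ",
      n, " rows in the data. ", n - offset, " will be used."
    )
    # Assumption: base warning() replaces rlang::warn() in this sketch.
    warning(msg, call. = FALSE)
    num_rows <- n - offset
  }
  as.integer(num_rows)
}

# Local data frames behave as before.
min_rows_sketch(6, mtcars) # 6
```

With a live connection, min_rows_sketch(6, cars) would take the sdf_nrow() branch, so the comparison no longer sees NA. The sparklyr namespace is only touched when a tbl_spark is actually passed.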
jmunyoon changed the title to "min_rows() doesn't work when passed a tbl_spark" on Jan 11, 2024
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.