For a single-record data frame, train_test_split() sometimes assigns the single record to the test set. #975
Comments
What's the behavior of scikit-learn here? We should match that, unless there's some reason not to. One thing to note: we can't check the length of the DataFrame / array during graph construction. So if scikit-learn does any kind of length check, then we won't be able to (easily) match that behavior.
The following code (which should be equivalent to the dask code above):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    if __name__ == '__main__':
        for _ in range(20):
            df = pd.DataFrame({'x0': [0], 'x1': [1], 'y': [2]})
            x = df[['x0', 'x1']]
            y = df['y']
            x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
            # line below throws identical error as line above
            # x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
            # pandas shapes are plain ints, so no .compute() call is needed here
            if x_train.shape[0] == 0:
                print('x_train is empty!')
                break

throws the following error:
So it looks like the default behavior for this case is to raise?
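A minimal sketch of why both the `test_size=0.3` and the `train_size=0.7` call fail on a single record, assuming scikit-learn rounds a fractional test size up and a fractional train size down before rejecting an empty train set. `split_sizes` is a hypothetical helper written for illustration, not part of scikit-learn or dask, and the error message is paraphrased:

```python
import math


def split_sizes(n_samples, test_size=None, train_size=None):
    """Hypothetical helper: turn fractional sizes into row counts.

    Assumption: the test fraction is rounded up and the train fraction
    is rounded down; whichever side is not given takes the remainder.
    """
    if test_size is None and train_size is None:
        test_size = 0.25  # assumed default when neither size is given

    n_test = math.ceil(test_size * n_samples) if test_size is not None else None
    n_train = math.floor(train_size * n_samples) if train_size is not None else None

    if n_train is None:
        n_train = n_samples - n_test
    if n_test is None:
        n_test = n_samples - n_train

    if n_train == 0:
        # Paraphrased message, not the library's exact wording.
        raise ValueError(
            f"with n_samples={n_samples}, the resulting train set would be empty"
        )
    return n_train, n_test


# One record: both spellings of a 70/30 split leave no rows for training.
for kwargs in ({"test_size": 0.3}, {"train_size": 0.7}):
    try:
        split_sizes(1, **kwargs)
    except ValueError as exc:
        print(kwargs, "->", exc)
```

Because this check needs `n_samples` up front, it only works when the length is known eagerly, which is the difficulty with graph construction raised earlier in the thread.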
Hey, can I work on this issue?
Sure, thanks.
Describe the issue:
Disclaimer: I know the bug looks silly but I still wanted to give a heads up.
For a data frame with only 1 record, train_test_split() sometimes returns an empty train set and a test set with 1 record. Is that the desired behavior?
Minimal Complete Verifiable Example:
Anything else we need to know?:
Nope
Environment: