Add cpu backend to q28 #257
base: main
Conversation
Thanks for working on this; I have requested changes.
if isinstance(x, cudf.Series):
    vectorizer = HashingVectorizer
    preprocessor = lambda s: s.str.lower()
This is common for both, so it should be outside the if statement.
They need to use different hashing vectorizers (HashingVectorizer vs SKHashingVectorizer), and the lambda functions need to be slightly different (s: s.str.lower() vs s: s.lower()).
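To make that concrete, a minimal sketch of the dispatch, assuming cuML's HashingVectorizer is imported under its own name and scikit-learn's under the SKHashingVectorizer alias used in this diff (the build_vectorizer wrapper is hypothetical):

import cudf
from cuml.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer as SKHashingVectorizer

def build_vectorizer(x):
    if isinstance(x, cudf.Series):
        # cuML's preprocessor receives a cuDF string Series, so lowercase via .str
        vectorizer = HashingVectorizer
        preprocessor = lambda s: s.str.lower()
    else:
        # scikit-learn's preprocessor receives one Python str at a time
        vectorizer = SKHashingVectorizer
        preprocessor = lambda s: s.lower()
    return vectorizer(preprocessor=preprocessor)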
    output_ser = cudf.Series(cudf.core.column.full(size=len(ser), fill_value=2, dtype=np.int32))
else:
    output_ser = pd.Series(2, index=ser.index, dtype=np.int32)
We should have similar behavior here. We should not use cuDF-specific API when not needed. Let's use CuPy and NumPy arrays here:

pd.Series(np.full(shape=len(ser), fill_value=2, dtype=np.int32))
cudf.Series(cp.full(shape=len(ser), fill_value=2, dtype=cp.int32))
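In context, assuming the surrounding isinstance check mirrors the one in the diff (a sketch; cp and np are the usual CuPy/NumPy aliases):

import cudf
import cupy as cp
import numpy as np
import pandas as pd

if isinstance(ser, cudf.Series):
    # GPU path: build the constant column from a CuPy array
    output_ser = cudf.Series(cp.full(shape=len(ser), fill_value=2, dtype=cp.int32))
else:
    # CPU path: identical construction with NumPy
    output_ser = pd.Series(np.full(shape=len(ser), fill_value=2, dtype=np.int32), index=ser.index)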
if isinstance(reviews_df, dask_cudf.DataFrame):
    y = y.map_partitions(lambda x: cp.asarray(x, np.int32)).persist()
    y._meta = cp.array(y._meta)
Why do we need to do this?
Basically the map_partitions call creates a dask array with a CuPy chunktype, but for some reason the metadata shows a NumPy chunktype. This issue has a deeper explanation: rapidsai/cudf#4309
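A quick way to see (and apply) the fix, sketched with the names from the diff; the prints are illustrative only:

import cupy as cp
import numpy as np

y = y.map_partitions(lambda x: cp.asarray(x, np.int32)).persist()
print(type(y._meta))         # can still report numpy.ndarray here
y._meta = cp.array(y._meta)  # overwrite the metadata with a CuPy chunktype
print(type(y._meta))         # now cupy.ndarray, so dispatch takes the GPU path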
if average == "binary" and nclasses > 2:
    raise ValueError
if nclasses < 2:
    raise ValueError("Single class precision is not yet supported")
ddh = DistributedDataHandler.create([y, y_pred])
gpu_futures = client.sync(_extract_partitions, [y, y_pred], client)
Why are they called gpu_futures? Won't this be used for both CPU and GPU?
global_tp = cp.sum(res[:, 0])
global_fp = cp.sum(res[:, 1])
Suggested change:

- global_tp = cp.sum(res[:, 0])
- global_fp = cp.sum(res[:, 1])
+ global_tp = res[:, 0].sum()
+ global_fp = res[:, 1].sum()
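The method form dispatches on the array's own type, so the same lines work whether res is a NumPy or a CuPy array. A small sketch (assumes cupy is installed):

import numpy as np
import cupy as cp

for xp in (np, cp):
    res = xp.asarray([[1, 0], [0, 1], [1, 1]])
    # .sum() resolves to the array's own backend; no cp/np call needed here
    global_tp = res[:, 0].sum()
    global_fp = res[:, 1].sum()
    print(type(global_tp))   # NumPy scalar on the first pass, CuPy array on the second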
    model.fit(X_train, y_train)
else:
    model = ParallelPostFit(estimator=MultNB(alpha=0.001))
    model.fit(X_train.compute(), y_train.compute())
You are going to train this on the client process; is this intentional? We should not do anything on the client process.
Yep, that was a placeholder I meant to replace. Do you have any suggestions for the best way to parallelize scikit-learn here? Unfortunately, the dask-ml naive Bayes model is incompatible with the sparse matrices we use in this query.
We should still train on worker processes. We can probably do something like the following:

est_model = MultNB(alpha=0.001)
# Collapse to a single partition so fit sees all of the data on one worker
X_d = X_train.repartition(npartitions=1).to_delayed()
y_d = y_train.repartition(npartitions=1).to_delayed()
# fit returns the estimator itself, so each delayed call yields a fitted model
delayed_model = [delayed(est_model.fit)(x_p, y_p) for x_p, y_p in zip(X_d, y_d)]
model = delayed_model[0].compute()
model = ParallelPostFit(estimator=model)
del est_model
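For what it's worth, the key choice in the sketch above is repartition(npartitions=1): it gathers the training data into a single partition, so the delayed fit call runs on whichever worker holds that partition rather than on the client, while ParallelPostFit still fans prediction back out across partitions.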