Add cpu backend to q28 #257
base: main
Conversation
Thanks for working on this; I have requested changes.
if isinstance(x, cudf.Series):
    vectorizer = HashingVectorizer
    preprocessor = lambda s: s.str.lower()
This is common for both, so it should be outside the if statement.
They need to use different hashing vectorizers (HashingVectorizer vs SKHashingVectorizer), and the lambda functions need to be slightly different (s: s.str.lower() vs s: s.lower()).
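To make that concrete, a minimal sketch of the dispatch, assuming cuML's HashingVectorizer is imported under its own name and scikit-learn's under the SKHashingVectorizer alias used in this diff (the build_vectorizer wrapper is hypothetical):

import cudf
from cuml.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer as SKHashingVectorizer

def build_vectorizer(x):
    if isinstance(x, cudf.Series):
        # cuML's preprocessor receives a cuDF string Series, so lowercase via .str
        vectorizer = HashingVectorizer
        preprocessor = lambda s: s.str.lower()
    else:
        # scikit-learn's preprocessor receives one Python str at a time
        vectorizer = SKHashingVectorizer
        preprocessor = lambda s: s.lower()
    return vectorizer(preprocessor=preprocessor)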
    output_ser = cudf.Series(cudf.core.column.full(size=len(ser), fill_value=2, dtype=np.int32))
else:
    output_ser = pd.Series(2, index=ser.index, dtype=np.int32)
We should have similar behavior here. We should not use cuDF-specific API when not needed. Let's use CuPy and NumPy arrays here:

pd.Series(np.full(shape=len(ser), fill_value=2, dtype=np.int32))
cudf.Series(cp.full(shape=len(ser), fill_value=2, dtype=cp.int32))
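In context, assuming the surrounding isinstance check mirrors the one in the diff (a sketch; cp and np are the usual CuPy/NumPy aliases):

import cudf
import cupy as cp
import numpy as np
import pandas as pd

if isinstance(ser, cudf.Series):
    # GPU path: build the constant column from a CuPy array
    output_ser = cudf.Series(cp.full(shape=len(ser), fill_value=2, dtype=cp.int32))
else:
    # CPU path: identical construction with NumPy
    output_ser = pd.Series(np.full(shape=len(ser), fill_value=2, dtype=np.int32), index=ser.index)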
if isinstance(reviews_df, dask_cudf.DataFrame):
    y = y.map_partitions(lambda x: cp.asarray(x, np.int32)).persist()
    y._meta = cp.array(y._meta)
Why do we need to do this?
Basically the map_partitions call creates a dask array with a CuPy chunktype, but for some reason the metadata shows a NumPy chunktype. This issue has a deeper explanation: rapidsai/cudf#4309
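A quick way to see (and apply) the fix, sketched with the names from the diff; the prints are illustrative only:

import cupy as cp
import numpy as np

y = y.map_partitions(lambda x: cp.asarray(x, np.int32)).persist()
print(type(y._meta))         # can still report numpy.ndarray here
y._meta = cp.array(y._meta)  # overwrite the metadata with a CuPy chunktype
print(type(y._meta))         # now cupy.ndarray, so dispatch takes the GPU path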
if average == "binary" and nclasses > 2:
    raise ValueError
if nclasses < 2:
    raise ValueError("Single class precision is not yet supported")
ddh = DistributedDataHandler.create([y, y_pred])
gpu_futures = client.sync(_extract_partitions, [y, y_pred], client)
Why are they called gpu_futures? Won't this be used for both CPU and GPU?
global_tp = cp.sum(res[:, 0])
global_fp = cp.sum(res[:, 1])
Suggested change:

- global_tp = cp.sum(res[:, 0])
- global_fp = cp.sum(res[:, 1])
+ global_tp = res[:, 0].sum()
+ global_fp = res[:, 1].sum()
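The method form dispatches on the array's own type, so the same lines work whether res is a NumPy or a CuPy array. A small sketch (assumes cupy is installed):

import numpy as np
import cupy as cp

for xp in (np, cp):
    res = xp.asarray([[1, 0], [0, 1], [1, 1]])
    # .sum() resolves to the array's own backend; no cp/np call needed here
    global_tp = res[:, 0].sum()
    global_fp = res[:, 1].sum()
    print(type(global_tp))   # NumPy scalar on the first pass, CuPy array on the second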
    model.fit(X_train, y_train)
else:
    model = ParallelPostFit(estimator=MultNB(alpha=0.001))
    model.fit(X_train.compute(), y_train.compute())
You are going to train this on the client process; is this intentional? We should not do anything on the client process.
Yep, that was a placeholder I meant to replace. Do you have any suggestions for the best way to parallelize scikit-learn here? Unfortunately, the dask-ml naive Bayes model is incompatible with the sparse matrices we use in this query.
We should still train on worker processes. We can probably do something like the following:

est_model = MultNB(alpha=0.001)
# Collapse to a single partition so fit sees all of the data on one worker
X_d = X_train.repartition(npartitions=1).to_delayed()
y_d = y_train.repartition(npartitions=1).to_delayed()
# fit returns the estimator itself, so each delayed call yields a fitted model
delayed_model = [delayed(est_model.fit)(x_p, y_p) for x_p, y_p in zip(X_d, y_d)]
model = delayed_model[0].compute()
model = ParallelPostFit(estimator=model)
del est_model
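For what it's worth, the key choice in the sketch above is repartition(npartitions=1): it gathers the training data into a single partition, so the delayed fit call runs on whichever worker holds that partition rather than on the client, while ParallelPostFit still fans prediction back out across partitions.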