[template] create templates for use in generating actions #1282

Open: wants to merge 129 commits into base: master

cjac (Contributor) commented Dec 20, 2024

This PR should resolve #1276 and is an attempt to better address the problem space of #1030.

I believe that #1259 could be implemented more easily using this change, but its dependency on rebooting is antithetical to Dataproc in many ways, so it has not been included. I will meet with NVIDIA and the Dataproc engineering team to troubleshoot the problem.

This PR includes code refactored out of the GPU-acceleration-related and dask-related actions and into files under the templates/ directory of the repository. There is a set of PRs that rebase onto this branch:

cjac (Contributor Author) commented:

Hello @davorg - As I was reading through the literature, bringing myself back up to speed on the state of the art of template toolkits, I saw that there was a book with one of my friends' names on it that I had glanced at many times over the last couple of decades. I did not realize until just a few days ago that the dlc who was in charge of desk allocation in my cube farm when I started was the same dlc who wrote the book on this particular subject.

Anyway, I've been thinking of you and our peers as I've been hacking away at this installer. If you felt like looking things over and picking some nits, I'd love to hear your feedback. I hope your holidays are merry and all that!

cjac (Contributor Author) commented:

@shlomif oh, hey I see that you are actively participating in Template.pm development. I'm not doing a lot with it in this repository; everything is pretty straightforward, I think. If you had some spare time to take a peek at the new templates/ directory in this repo, and especially the templates/generate-action.pl, it might be fun to chat about it. I hope your holidays went well!

@cjac: hi! Where can I find the templates directory? Please give a url.

cjac (Contributor Author) commented Jan 8, 2025

cjac (Contributor Author) commented Jan 2, 2025

/gcbrun

cjac force-pushed the template-gpu-20241219 branch from 8d28938 to e511a6e on January 2, 2025 20:23

cjac (Contributor Author) commented Jan 2, 2025

/gcbrun

cjac (Contributor Author) left a comment

added some comments to address issues with documentation

# --metadata=ENABLE_MIG can be used to enable or disable MIG. The default is to enable it.
# The script does a reboot to fully enable MIG and then configures the MIG device based on the
# user specified MIG_CGI profiles specified via: --metadata=^:^MIG_CGI='9,9'. If MIG_CGI
# is not specified it assumes it's using an A100 and configures 2 instances with profile id 9.

cjac (Contributor Author) commented:

s/A100/H100/
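
For context, a minimal sketch of how a comma-separated MIG_CGI value such as '9,9' could be split into one profile id per MIG instance. The helper name parse_mig_cgi is hypothetical and this is not the code in mig.sh.in; the '9,9' fallback simply mirrors the default described in the quoted comment (2 instances with profile id 9).

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: emit one MIG profile id per line.
# Falls back to '9,9' when no value is given, matching the documented default.
parse_mig_cgi() {
  printf '%s\n' "${1:-9,9}" | tr ',' '\n'
}

# Walk the requested instances in order.
i=0
for profile in $(parse_mig_cgi "9,9"); do
  i=$((i + 1))
  echo "instance ${i}: profile id ${profile}"
done
```

Each emitted id corresponds to one requested GPU instance; a list in this form is what would typically be handed to `nvidia-smi mig -cgi`.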

#
# This script should be specified in --metadata=startup-script-url= option and
# --metadata=ENABLE_MIG can be used to enable or disable MIG. The default is to enable it.
# The script does a reboot to fully enable MIG and then configures the MIG device based on the

cjac (Contributor Author) commented:

does not ever reboot, and neither should you

templates/spark-rapids/mig.sh.in (outdated, resolved)

cjac (Contributor Author) commented Jan 2, 2025

/gcbrun

cjac (Contributor Author) commented Jan 3, 2025

/gcbrun

cjac (Contributor Author) commented Jan 3, 2025

using the test suite I just cleaned up for #1275

cjac (Contributor Author) commented Jan 3, 2025

/gcbrun

cjac (Contributor Author) commented Jan 3, 2025

2.1-debian11 failure:

2025-01-03T03:18:35.157639402Z AssertionError: 1 != 0 : Failed to execute command:
2025-01-03T03:18:35.157650162Z gcloud dataproc jobs submit spark --cluster=test-gpu-standard-2-1-20250103-030909-kdee --region=us-central1 --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar --class=org.apache.spark.examples.ml.JavaIndexToStringExample --properties=spark.executor.resource.gpu.amount=1,spark.executor.cores=6,spark.executor.memory=4G,spark.task.resource.gpu.amount=0.333,spark.task.cpus=2,spark.yarn.unmanagedAM.enabled=false
2025-01-03T03:18:35.157660172Z STDOUT:
2025-01-03T03:18:35.157694322Z 
2025-01-03T03:18:35.157706472Z STDERR:
2025-01-03T03:18:35.157715992Z Job [474683bad64a45e8af6cc00ccc9695ae] submitted.
2025-01-03T03:18:35.157726222Z Waiting for job output...
2025-01-03T03:18:35.157735722Z 25/01/03 03:14:42 INFO SparkEnv: Registering MapOutputTracker
2025-01-03T03:18:35.157768422Z 25/01/03 03:14:42 INFO SparkEnv: Registering BlockManagerMaster
2025-01-03T03:18:35.157778362Z 25/01/03 03:14:42 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
2025-01-03T03:18:35.157787802Z 25/01/03 03:14:42 INFO SparkEnv: Registering OutputCommitCoordinator
2025-01-03T03:18:35.157797152Z 25/01/03 03:14:43 INFO DataprocSparkPlugin: Registered 128 driver metrics
2025-01-03T03:18:35.157805932Z 25/01/03 03:14:43 INFO ShimLoader: Loading shim for Spark version: 3.3.2
2025-01-03T03:18:35.157815022Z 25/01/03 03:14:43 INFO ShimLoader: Complete Spark build info: 3.3.2, https://bigdataoss-internal.googlesource.com/third_party/apache/spark, dataproc-branch-3.3.2, 5672c094ffe3ff9aa967db7b81163e1cc586a093, 2024-10-23T22:06:45Z
2025-01-03T03:18:35.157824862Z 25/01/03 03:14:43 INFO ShimLoader: findURLClassLoader found a URLClassLoader org.apache.spark.util.MutableURLClassLoader@61ab89b0
2025-01-03T03:18:35.157836082Z 25/01/03 03:14:43 INFO ShimLoader: Updating spark classloader org.apache.spark.util.MutableURLClassLoader@61ab89b0 with the URLs: jar:file:/usr/lib/spark/jars/rapids-4-spark_2.12-23.08.2.jar!/spark3xx-common/, jar:file:/usr/lib/spark/jars/rapids-4-spark_2.12-23.08.2.jar!/spark332/
2025-01-03T03:18:35.157845492Z 25/01/03 03:14:43 INFO ShimLoader: Spark classLoader org.apache.spark.util.MutableURLClassLoader@61ab89b0 updated successfully
2025-01-03T03:18:35.157869502Z 25/01/03 03:14:43 INFO ShimLoader: Updating spark classloader org.apache.spark.util.MutableURLClassLoader@61ab89b0 with the URLs: jar:file:/usr/lib/spark/jars/rapids-4-spark_2.12-23.08.2.jar!/spark3xx-common/, jar:file:/usr/lib/spark/jars/rapids-4-spark_2.12-23.08.2.jar!/spark332/
2025-01-03T03:18:35.157880132Z 25/01/03 03:14:43 INFO ShimLoader: Spark classLoader org.apache.spark.util.MutableURLClassLoader@61ab89b0 updated successfully
2025-01-03T03:18:35.157891322Z 25/01/03 03:14:43 INFO RapidsPluginUtils: RAPIDS Accelerator build: {date=2023-10-05T09:57:39Z, cudf_version=23.08.0, version=23.08.2, user=, branch=HEAD, url=https://github.com/NVIDIA/spark-rapids.git, revision=56da18a1be0148025cb00ced2ffe039fbf9c3391}
2025-01-03T03:18:35.157900352Z 25/01/03 03:14:43 INFO RapidsPluginUtils: RAPIDS Accelerator JNI build: {date=2023-08-10T03:31:37Z, version=23.08.0, user=, branch=HEAD, url=https://github.com/NVIDIA/spark-rapids-jni.git, revision=73fcd5ce22a622e5937a613bc5c4a1b32a40aec1}
2025-01-03T03:18:35.157909062Z 25/01/03 03:14:43 INFO RapidsPluginUtils: cudf build: {date=2023-08-10T03:31:37Z, version=23.08.0, user=, branch=HEAD, url=https://github.com/rapidsai/cudf.git, revision=8150d38e080c8fb021921ade83fe3aa3be04b47d}
2025-01-03T03:18:35.157917332Z 25/01/03 03:14:43 WARN RapidsPluginUtils: RAPIDS Accelerator 23.08.2 using cudf 23.08.0.
2025-01-03T03:18:35.157926612Z 25/01/03 03:14:43 WARN RapidsPluginUtils: spark.rapids.sql.multiThreadedRead.numThreads is set to 20.
2025-01-03T03:18:35.157935632Z 25/01/03 03:14:43 WARN RapidsPluginUtils: The current setting of spark.task.resource.gpu.amount (0.333) is not ideal to get the best performance from the RAPIDS Accelerator plugin. It's recommended to be 1/{executor core count} unless you have a special use case.
2025-01-03T03:18:35.157944382Z 25/01/03 03:14:43 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
2025-01-03T03:18:35.157954512Z 25/01/03 03:14:43 WARN RapidsPluginUtils: spark.rapids.sql.explain is set to `NOT_ON_GPU`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
2025-01-03T03:18:35.157963672Z 25/01/03 03:14:44 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at test-gpu-standard-2-1-20250103-030909-kdee-m.us-central1-f.c.cloud-dataproc-ci.internal./10.128.0.50:8032
2025-01-03T03:18:35.157972112Z 25/01/03 03:14:44 INFO AHSProxy: Connecting to Application History server at test-gpu-standard-2-1-20250103-030909-kdee-m.us-central1-f.c.cloud-dataproc-ci.internal./10.128.0.50:10200
2025-01-03T03:18:35.157991762Z 25/01/03 03:14:44 INFO Configuration: found resource resource-types.xml at file:/etc/hadoop/conf.empty/resource-types.xml
2025-01-03T03:18:35.158000832Z 25/01/03 03:14:44 INFO ResourceUtils: Adding resource type - name = yarn.io/gpu, units = , type = COUNTABLE
2025-01-03T03:18:35.158009842Z 25/01/03 03:14:46 INFO YarnClientImpl: Submitted application application_1735873832026_0001
2025-01-03T03:18:35.158020202Z 25/01/03 03:14:56 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
2025-01-03T03:18:35.158029782Z 25/01/03 03:15:00 WARN GpuOverrides: 
2025-01-03T03:18:35.158038982Z !Exec <ObjectHashAggregateExec> cannot run on GPU because not all expressions can be replaced
2025-01-03T03:18:35.158050092Z   @Expression <AggregateExpression> stringindexeraggregator(org.apache.spark.ml.feature.StringIndexerAggregator@f243d5c, Some(createexternalrow(category#1.toString, StructField(category,StringType,false))), Some(interface org.apache.spark.sql.Row), Some(StructType(StructField(category,StringType,false))), encodeusingserializer(input[0, java.lang.Object, true], true), decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true), encodeusingserializer(input[0, java.lang.Object, true], true), BinaryType, true, 0, 0) could run on GPU
2025-01-03T03:18:35.158059502Z     ! <ComplexTypedAggregateExpression> StringIndexerAggregator(org.apache.spark.sql.Row) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression
2025-01-03T03:18:35.158069402Z       ! <CreateExternalRow> createexternalrow(category#1.toString, StructField(category,StringType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
2025-01-03T03:18:35.158101482Z         ! <Invoke> category#1.toString cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
2025-01-03T03:18:35.158112172Z           @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.158123202Z       ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.158131912Z         ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158140402Z       ! <DecodeUsingSerializer> decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.DecodeUsingSerializer
2025-01-03T03:18:35.158148742Z         ! <BoundReference> input[0, binary, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158158062Z       ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.158166272Z         ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158186702Z   @Expression <AttributeReference> StringIndexerAggregator(org.apache.spark.sql.Row)#14 could run on GPU
2025-01-03T03:18:35.158195922Z   @Expression <Alias> StringIndexerAggregator(org.apache.spark.sql.Row)#14 AS StringIndexerAggregator(org.apache.spark.sql.Row)#15 could run on GPU
2025-01-03T03:18:35.158204532Z     @Expression <AttributeReference> StringIndexerAggregator(org.apache.spark.sql.Row)#14 could run on GPU
2025-01-03T03:18:35.158213622Z   !Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient
2025-01-03T03:18:35.158222172Z     @Partitioning <SinglePartition$> could run on GPU
2025-01-03T03:18:35.158230522Z     !Exec <ObjectHashAggregateExec> cannot run on GPU because not all expressions can be replaced
2025-01-03T03:18:35.158239022Z       @Expression <AggregateExpression> partial_stringindexeraggregator(org.apache.spark.ml.feature.StringIndexerAggregator@f243d5c, Some(createexternalrow(category#1.toString, StructField(category,StringType,false))), Some(interface org.apache.spark.sql.Row), Some(StructType(StructField(category,StringType,false))), encodeusingserializer(input[0, java.lang.Object, true], true), decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true), encodeusingserializer(input[0, java.lang.Object, true], true), BinaryType, true, 0, 0) could run on GPU
2025-01-03T03:18:35.158247572Z         ! <ComplexTypedAggregateExpression> StringIndexerAggregator(org.apache.spark.sql.Row) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression
2025-01-03T03:18:35.158256122Z           ! <CreateExternalRow> createexternalrow(category#1.toString, StructField(category,StringType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
2025-01-03T03:18:35.158265132Z             ! <Invoke> category#1.toString cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
2025-01-03T03:18:35.158284682Z               @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.158293312Z           ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.158302412Z             ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158311202Z           ! <DecodeUsingSerializer> decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.DecodeUsingSerializer
2025-01-03T03:18:35.158327352Z             ! <BoundReference> input[0, binary, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158336902Z           ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.158345412Z             ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158353802Z       @Expression <AttributeReference> buf#19 could run on GPU
2025-01-03T03:18:35.158372552Z       @Expression <AttributeReference> buf#20 could run on GPU
2025-01-03T03:18:35.158381252Z       ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
2025-01-03T03:18:35.158390802Z         @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.158399782Z 
2025-01-03T03:18:35.158408922Z 25/01/03 03:15:00 INFO GpuOverrides: Plan conversion to the GPU took 82.60 ms
2025-01-03T03:18:35.158418282Z 25/01/03 03:15:00 WARN GpuOverrides: 
2025-01-03T03:18:35.158427332Z !Exec <ObjectHashAggregateExec> cannot run on GPU because not all expressions can be replaced
2025-01-03T03:18:35.158436242Z   @Expression <AggregateExpression> stringindexeraggregator(org.apache.spark.ml.feature.StringIndexerAggregator@f243d5c, Some(createexternalrow(category#1.toString, StructField(category,StringType,false))), Some(interface org.apache.spark.sql.Row), Some(StructType(StructField(category,StringType,false))), encodeusingserializer(input[0, java.lang.Object, true], true), decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true), encodeusingserializer(input[0, java.lang.Object, true], true), BinaryType, true, 0, 0) could run on GPU
2025-01-03T03:18:35.158444812Z     ! <ComplexTypedAggregateExpression> StringIndexerAggregator(org.apache.spark.sql.Row) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression
2025-01-03T03:18:35.158453752Z       ! <CreateExternalRow> createexternalrow(category#1.toString, StructField(category,StringType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
2025-01-03T03:18:35.158462122Z         ! <Invoke> category#1.toString cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
2025-01-03T03:18:35.158470662Z           @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.158478612Z       ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.158487022Z         ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158495642Z       ! <DecodeUsingSerializer> decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.DecodeUsingSerializer
2025-01-03T03:18:35.158504672Z         ! <BoundReference> input[0, binary, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158513622Z       ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.158522272Z         ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158531532Z   @Expression <AttributeReference> StringIndexerAggregator(org.apache.spark.sql.Row)#14 could run on GPU
2025-01-03T03:18:35.158540392Z   @Expression <Alias> StringIndexerAggregator(org.apache.spark.sql.Row)#14 AS StringIndexerAggregator(org.apache.spark.sql.Row)#15 could run on GPU
2025-01-03T03:18:35.158563802Z     @Expression <AttributeReference> StringIndexerAggregator(org.apache.spark.sql.Row)#14 could run on GPU
2025-01-03T03:18:35.158572982Z   !Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient
2025-01-03T03:18:35.158581902Z     @Partitioning <SinglePartition$> could run on GPU
2025-01-03T03:18:35.158591402Z     !Exec <ObjectHashAggregateExec> cannot run on GPU because not all expressions can be replaced
2025-01-03T03:18:35.158600182Z       @Expression <AggregateExpression> partial_stringindexeraggregator(org.apache.spark.ml.feature.StringIndexerAggregator@f243d5c, Some(createexternalrow(category#1.toString, StructField(category,StringType,false))), Some(interface org.apache.spark.sql.Row), Some(StructType(StructField(category,StringType,false))), encodeusingserializer(input[0, java.lang.Object, true], true), decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true), encodeusingserializer(input[0, java.lang.Object, true], true), BinaryType, true, 0, 0) could run on GPU
2025-01-03T03:18:35.158609412Z         ! <ComplexTypedAggregateExpression> StringIndexerAggregator(org.apache.spark.sql.Row) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression
2025-01-03T03:18:35.158617432Z           ! <CreateExternalRow> createexternalrow(category#1.toString, StructField(category,StringType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
2025-01-03T03:18:35.158626352Z             ! <Invoke> category#1.toString cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
2025-01-03T03:18:35.158635312Z               @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.158644272Z           ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.158653102Z             ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158661752Z           ! <DecodeUsingSerializer> decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.DecodeUsingSerializer
2025-01-03T03:18:35.158690872Z             ! <BoundReference> input[0, binary, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158701652Z           ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.158710832Z             ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158719952Z       @Expression <AttributeReference> buf#19 could run on GPU
2025-01-03T03:18:35.158729072Z       @Expression <AttributeReference> buf#20 could run on GPU
2025-01-03T03:18:35.158738432Z       ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
2025-01-03T03:18:35.158759192Z         @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.158768182Z 
2025-01-03T03:18:35.158777272Z 25/01/03 03:15:00 INFO GpuOverrides: Plan conversion to the GPU took 7.28 ms
2025-01-03T03:18:35.158786472Z 25/01/03 03:15:01 INFO GpuOverrides: Plan conversion to the GPU took 1.26 ms
2025-01-03T03:18:35.158795052Z 25/01/03 03:15:01 WARN GpuOverrides: 
2025-01-03T03:18:35.158804352Z !Exec <ObjectHashAggregateExec> cannot run on GPU because not all expressions can be replaced
2025-01-03T03:18:35.158827882Z   @Expression <AggregateExpression> stringindexeraggregator(org.apache.spark.ml.feature.StringIndexerAggregator@f243d5c, Some(createexternalrow(category#1.toString, StructField(category,StringType,false))), Some(interface org.apache.spark.sql.Row), Some(StructType(StructField(category,StringType,false))), encodeusingserializer(input[0, java.lang.Object, true], true), decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true), encodeusingserializer(input[0, java.lang.Object, true], true), BinaryType, true, 0, 0) could run on GPU
2025-01-03T03:18:35.158838842Z     ! <ComplexTypedAggregateExpression> StringIndexerAggregator(org.apache.spark.sql.Row) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression
2025-01-03T03:18:35.158848372Z       ! <CreateExternalRow> createexternalrow(category#1.toString, StructField(category,StringType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
2025-01-03T03:18:35.158857722Z         ! <Invoke> category#1.toString cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
2025-01-03T03:18:35.158867352Z           @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.158876662Z       ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.158888172Z         ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158897632Z       ! <DecodeUsingSerializer> decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.DecodeUsingSerializer
2025-01-03T03:18:35.158906542Z         ! <BoundReference> input[0, binary, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158915172Z       ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.158923792Z         ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.158932872Z   @Expression <AttributeReference> StringIndexerAggregator(org.apache.spark.sql.Row)#14 could run on GPU
2025-01-03T03:18:35.158941622Z   @Expression <Alias> StringIndexerAggregator(org.apache.spark.sql.Row)#14 AS StringIndexerAggregator(org.apache.spark.sql.Row)#15 could run on GPU
2025-01-03T03:18:35.158962062Z     @Expression <AttributeReference> StringIndexerAggregator(org.apache.spark.sql.Row)#14 could run on GPU
2025-01-03T03:18:35.158971042Z   !Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient
2025-01-03T03:18:35.158980162Z     @Partitioning <SinglePartition$> could run on GPU
2025-01-03T03:18:35.158988692Z     !Exec <ObjectHashAggregateExec> cannot run on GPU because not all expressions can be replaced
2025-01-03T03:18:35.158997872Z       @Expression <AggregateExpression> partial_stringindexeraggregator(org.apache.spark.ml.feature.StringIndexerAggregator@f243d5c, Some(createexternalrow(category#1.toString, StructField(category,StringType,false))), Some(interface org.apache.spark.sql.Row), Some(StructType(StructField(category,StringType,false))), encodeusingserializer(input[0, java.lang.Object, true], true), decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true), encodeusingserializer(input[0, java.lang.Object, true], true), BinaryType, true, 0, 0) could run on GPU
2025-01-03T03:18:35.159030592Z         ! <ComplexTypedAggregateExpression> StringIndexerAggregator(org.apache.spark.sql.Row) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression
2025-01-03T03:18:35.159043982Z           ! <CreateExternalRow> createexternalrow(category#1.toString, StructField(category,StringType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
2025-01-03T03:18:35.159053702Z             ! <Invoke> category#1.toString cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
2025-01-03T03:18:35.159062732Z               @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.159081082Z           ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.159091482Z             ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.159100782Z           ! <DecodeUsingSerializer> decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.DecodeUsingSerializer
2025-01-03T03:18:35.159109962Z             ! <BoundReference> input[0, binary, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.159119532Z           ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.159128692Z             ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.159138382Z       @Expression <AttributeReference> buf#19 could run on GPU
2025-01-03T03:18:35.159147352Z       @Expression <AttributeReference> buf#20 could run on GPU
2025-01-03T03:18:35.159156532Z       ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
2025-01-03T03:18:35.159177462Z         @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.159186652Z 
2025-01-03T03:18:35.159195722Z 25/01/03 03:15:01 INFO GpuOverrides: Plan conversion to the GPU took 4.75 ms
2025-01-03T03:18:35.159204872Z 25/01/03 03:15:01 INFO GpuOverrides: GPU plan transition optimization took 13.66 ms
2025-01-03T03:18:35.159214022Z 25/01/03 03:15:01 WARN GpuOverrides: 
2025-01-03T03:18:35.159223582Z !Exec <ShuffleExchangeExec> cannot run on GPU because Columnar exchange without columnar children is inefficient
2025-01-03T03:18:35.159233342Z   @Partitioning <SinglePartition$> could run on GPU
2025-01-03T03:18:35.159242602Z   !Exec <ObjectHashAggregateExec> cannot run on GPU because not all expressions can be replaced
2025-01-03T03:18:35.159252392Z     @Expression <AggregateExpression> partial_stringindexeraggregator(org.apache.spark.ml.feature.StringIndexerAggregator@f243d5c, Some(createexternalrow(category#1.toString, StructField(category,StringType,false))), Some(interface org.apache.spark.sql.Row), Some(StructType(StructField(category,StringType,false))), encodeusingserializer(input[0, java.lang.Object, true], true), decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true), encodeusingserializer(input[0, java.lang.Object, true], true), BinaryType, true, 0, 0) could run on GPU
2025-01-03T03:18:35.159261702Z       ! <ComplexTypedAggregateExpression> StringIndexerAggregator(org.apache.spark.sql.Row) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression
2025-01-03T03:18:35.159282182Z         ! <CreateExternalRow> createexternalrow(category#1.toString, StructField(category,StringType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
2025-01-03T03:18:35.159293432Z           ! <Invoke> category#1.toString cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.Invoke
2025-01-03T03:18:35.159302532Z             @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.159311742Z         ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.159321662Z           ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.159333852Z         ! <DecodeUsingSerializer> decodeusingserializer(input[0, binary, true], Array[org.apache.spark.util.collection.OpenHashMap], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.DecodeUsingSerializer
2025-01-03T03:18:35.159343432Z           ! <BoundReference> input[0, binary, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.159353252Z         ! <EncodeUsingSerializer> encodeusingserializer(input[0, java.lang.Object, true], true) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.EncodeUsingSerializer
2025-01-03T03:18:35.159362232Z           ! <BoundReference> input[0, java.lang.Object, true] cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.BoundReference
2025-01-03T03:18:35.159371352Z     @Expression <AttributeReference> buf#19 could run on GPU
2025-01-03T03:18:35.159390812Z     @Expression <AttributeReference> buf#20 could run on GPU
2025-01-03T03:18:35.159400042Z     ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
2025-01-03T03:18:35.159408912Z       @Expression <AttributeReference> category#1 could run on GPU
2025-01-03T03:18:35.159417562Z 
2025-01-03T03:18:35.159426412Z 25/01/03 03:15:01 INFO GpuOverrides: Plan conversion to the GPU took 4.15 ms
2025-01-03T03:18:35.159435922Z 25/01/03 03:15:01 INFO GpuOverrides: GPU plan transition optimization took 7.25 ms
2025-01-03T03:18:35.159446272Z 25/01/03 03:15:43 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 2 for reason Container from a bad node: container_1735873832026_0001_01_000003 on host: test-gpu-standard-2-1-20250103-030909-kdee-w-0.us-central1-f.c.cloud-dataproc-ci.internal. Exit status: 1. Diagnostics: [2025-01-03 03:15:43.085]Exception from container-launch.
2025-01-03T03:18:35.159455422Z Container id: container_1735873832026_0001_01_000003
2025-01-03T03:18:35.159464442Z Exit code: 1
2025-01-03T03:18:35.159472832Z Exception message: Launch container failed
2025-01-03T03:18:35.159481212Z Shell error output: Nonzero exit code=1, error message='Invalid argument number'

@cjac
Copy link
Contributor Author

cjac commented Jan 4, 2025

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Jan 4, 2025

well that's good news, then.

@cjac
Copy link
Contributor Author

cjac commented Jan 4, 2025

/gcbrun

@cjac cjac marked this pull request as ready for review January 4, 2025 07:59
@cjac
Copy link
Contributor Author

cjac commented Jan 4, 2025

/gcbrun

cjac added 2 commits January 8, 2025 22:09
* increased minimum memory threshold for ram disk
* moved apt_add_repo and friends to common/install_functions

templates/dask/util_functions:
* validating conda tarball before caching to gcs

templates/generate-action.pl:
* improved usage documentation a little

templates/gpu/install_functions
* using /opt/conda/miniconda3/bin/python3 instead of /usr/bin/ for
  venv pre-install
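
The "validating conda tarball before caching to gcs" item above can be pictured roughly as follows. This is a minimal sketch, not the code in templates/dask/util_functions: the function names `tarball_is_valid` and `cache_tarball` are hypothetical, and `gsutil cp` is only assumed as the default upload command. Sending the `tar -t` listing to /dev/null is one way to keep its output out of the logs.

```shell
# Hypothetical helper: list the archive to prove it is readable,
# discarding the listing so it does not clutter the logs.
tarball_is_valid() {
  tar -tf "$1" >/dev/null 2>&1
}

# Hypothetical helper: upload only when the archive passes the check.
#   $1: local tarball path
#   $2: destination (for example a gs:// URL)
#   $3: upload command; "gsutil cp" is an assumed default
cache_tarball() {
  tarball="$1"
  dest="$2"
  upload_cmd="${3:-gsutil cp}"
  if ! tarball_is_valid "${tarball}"; then
    echo "refusing to cache invalid tarball: ${tarball}" >&2
    return 1
  fi
  # Intentionally unquoted so "gsutil cp" splits into command and argument.
  ${upload_cmd} "${tarball}" "${dest}"
}
```

Passing `echo` as the third argument gives a dry run that prints the source and destination instead of uploading.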
@cjac cjac force-pushed the template-gpu-20241219 branch from 019f562 to 119f1b1 Compare January 9, 2025 06:09
cjac added 10 commits January 8, 2025 22:10
* increase wait time for scheduler to come online
* reduce noise from tar -t

templates/gpu/yarn_functions,
templates/gpu/install_functions:
* protect many functions from running without attached accelerator

templates/gpu/install_gpu_driver.sh.in
* set +e in exit handler

templates/gpu/spark_functions:
* re-factor new function into this template

templates/spark-rapids/spark-rapids.sh.in
* removed redundant call to configure_gpu_script
* set +e in exit handler
* include version in action generator
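
One way to read "protect many functions from running without attached accelerator" is an early-return guard at the top of each GPU-specific function. The sketch below is illustrative only: `has_accelerator` and `install_gpu_stack` are hypothetical names, and matching "nvidia" in a PCI listing is an assumed detection method, not necessarily what templates/gpu/install_functions does.

```shell
# Hypothetical check: treat any NVIDIA entry in a PCI listing as an
# attached accelerator. The listing is passed in as an argument so the
# check can be exercised on a machine without a GPU.
has_accelerator() {
  printf '%s\n' "$1" | grep -qi 'nvidia'
}

# Hypothetical guarded function: return successfully without doing any
# work when no accelerator is attached.
install_gpu_stack() {
  if ! has_accelerator "$1"; then
    echo "no accelerator attached; skipping GPU install"
    return 0
  fi
  echo "accelerator found; installing GPU stack"
}
```

On a real node the listing might come from `lspci`, as in `install_gpu_stack "$(lspci)"`, assuming that tool is installed on the image.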
@cjac cjac force-pushed the template-gpu-20241219 branch from 763e1ff to 900c10a Compare January 9, 2025 22:03
Copy link
Contributor Author

@dlc - Can I get a review of the templates/ directory in this repository, please? I tried to keep it simple for the initial implementation, but if you have any advice about how we can further reduce duplication, I'm all ears. I'm thinking about picking up your book and getting into the minutiae, but the PR will be closed long before then, I hope!

@cjac cjac force-pushed the template-gpu-20241219 branch from cea2aa3 to 2afff45 Compare January 10, 2025 03:20
@cjac cjac force-pushed the template-gpu-20241219 branch from 2afff45 to aa792c3 Compare January 10, 2025 03:23
@cjac
Copy link
Contributor Author

cjac commented Jan 11, 2025

What changes will be required in the steps to create a cluster with init scripts using templates after this PR?

Currently, the steps include:
--initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh

Referencing the example: Create a Dataproc cluster using T4s.

Update: I just learnt about [spark-rapids] generate spark-rapids/spark-rapids.sh from template. I assume we programmatically regenerate spark-rapids.sh whenever a change is made to the template.

Those instructions seem right, but cluster creation may take less time with the new versions, since much of the work is now cached and, when enough memory is available, installation uses RAM disks.

It may be worth mentioning that secure boot can be enabled with the new custom images. The new custom image script requires that the Secret Manager API be enabled for the project.

And yes, new actions will be generated from templates on each, now versioned, release.
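
The RAM disk behaviour mentioned above amounts to a memory check before choosing a working directory. The sketch below shows the idea under stated assumptions: the 32 GiB threshold, the directory paths, and all function names are hypothetical, since the real values used by the templates are not given in this thread.

```shell
# Hypothetical threshold in kB (32 GiB); not the templates' actual value.
MIN_MEM_KB_FOR_RAMDISK=33554432

# Read total memory in kB from /proc/meminfo (Linux only).
mem_total_kb() {
  awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Succeed when the given MemTotal (kB) meets the threshold.
should_use_ramdisk() {
  [ "$1" -ge "${MIN_MEM_KB_FOR_RAMDISK}" ]
}

# Print the directory the install should work in.
#   $1: MemTotal in kB
choose_build_dir() {
  if should_use_ramdisk "$1"; then
    echo "/mnt/ramdisk-build"   # hypothetical tmpfs mount point
  else
    echo "/tmp/build"           # hypothetical on-disk fallback
  fi
}
```

A caller would run `choose_build_dir "$(mem_total_kb)"` and, for the RAM disk case, mount a tmpfs at the returned path before starting the install.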

@cjac cjac force-pushed the template-gpu-20241219 branch from effd8b5 to 374ff96 Compare January 12, 2025 03:55
@cjac cjac force-pushed the template-gpu-20241219 branch from 76a6df6 to 0c3eb51 Compare January 20, 2025 00:51
cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 23, 2025
cjac added a commit to LLC-Technologies-Collier/initialization-actions that referenced this pull request Jan 23, 2025
Development

Successfully merging this pull request may close these issues.

[initialization-actions] The repository has manually-generated, re-used code which gets out of sync
4 participants