Add support for org.apache.spark.sql.catalyst.expressions.Bin #11967

Open
wants to merge 2 commits into
base: branch-25.02

ustcfy (Collaborator) commented Jan 14, 2025

Closes #11648

Based on NVIDIA/spark-rapids-jni#2760

This PR adds support for org.apache.spark.sql.catalyst.expressions.Bin.
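For reviewers who want a CPU-side reference for what the expression should return, here is a minimal sketch in plain Scala. It assumes a LongType input and relies on `java.lang.Long.toBinaryString`, which produces the same strings as the examples in the Spark SQL docs (`bin(13)` is `1101`, and negative values are rendered as 64-bit two's complement). The `BinReference` object is a hypothetical name for illustration only; it is not code from this PR or from spark-rapids-jni.

```scala
// Hypothetical reference helper, not part of this PR.
object BinReference {
  // Option models a nullable column value: a null input yields a null output.
  def bin(a: Option[Long]): Option[String] =
    a.map(v => java.lang.Long.toBinaryString(v))

  def main(args: Array[String]): Unit = {
    println(bin(Some(13L)))                 // Some(1101)
    println(bin(Some(0L)))                  // Some(0)
    println(bin(Some(-1L)).map(_.length))   // Some(64): two's complement, all 64 bits set
    println(bin(None))                      // None
  }
}
```

Note that the output length varies from 1 to 64 characters per row, so the result is a variable-width string column.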

I ran a simple perf test five times.

// use big data gen
import org.apache.spark.sql.tests.datagen._
val dataTable = DBGen().addTable("data", "a long", 100000000)
dataTable("a").setNullProbability(0.1)
dataTable.toDF(spark).write.mode("OVERWRITE").parquet("long")

// spark-rapids
val df = spark.read.parquet("long")
spark.time(df.selectExpr("bin(a)").foreach(_ => ()))

Results:

GPU Time taken: 6748 ms 5384 ms 5520 ms 5411 ms 5411 ms
CPU Time taken: 4726 ms 3697 ms 3558 ms 3584 ms 3613 ms 

GpuBin is roughly 1.5x slower than CpuBin in this test (about 5.4 s vs. 3.6 s per run once the first, slower run is excluded). 😰
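To put a number on the gap, the timings above can be summarized with a few lines of plain Scala. This is only a sketch: treating the first run of each series as warm-up and dropping it is an assumption on my part, and `BinTimings` is a hypothetical name.

```scala
// Hypothetical helper that summarizes the timings reported above.
object BinTimings {
  // Wall-clock times in milliseconds, copied from the results.
  val gpuMs: Seq[Int] = Seq(6748, 5384, 5520, 5411, 5411)
  val cpuMs: Seq[Int] = Seq(4726, 3697, 3558, 3584, 3613)

  def mean(xs: Seq[Int]): Double = xs.sum.toDouble / xs.size

  // Assumption: the first run includes warm-up cost, so it is excluded.
  val gpuWarmMean: Double = mean(gpuMs.tail)
  val cpuWarmMean: Double = mean(cpuMs.tail)
  val slowdown: Double = gpuWarmMean / cpuWarmMean

  def main(args: Array[String]): Unit = {
    println(f"GPU warm mean: $gpuWarmMean%.1f ms")
    println(f"CPU warm mean: $cpuWarmMean%.1f ms")
    println(f"GPU / CPU:     $slowdown%.2f")
  }
}
```

With the first run excluded, the means come out to about 5432 ms on the GPU and 3613 ms on the CPU, a ratio of roughly 1.5.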

revans2 (Collaborator) commented Jan 14, 2025

The GPU is consistently slower than the CPU is at this. Have you done any profiling so we can understand what is happening?

revans2 (Collaborator) left a comment

The code here looks fine. I just want to understand why we are losing to the CPU, and whether we can fix that before we check it in or should file a follow-on issue.

ustcfy (Collaborator, Author) commented Jan 14, 2025

> The GPU is consistently slower than the CPU is at this. Have you done any profiling so we can understand what is happening?

I haven't done profiling yet, but I will do it next. 👀
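As a cheap first step before a full profile, one option is to time a baseline query that reads the same column without calling `bin`, so the cost of the expression can be separated from the Parquet scan and from handing rows back to `foreach` on the CPU. This is only a sketch for spark-shell, reusing the `long` dataset and the `spark.time` pattern from the test in the description; the idea that the string output and row conversion contribute to the gap is a guess, not a measured result.

```scala
// Run in spark-shell with the same session settings as the original test.
val df = spark.read.parquet("long")

// Print the physical plan to check where the projection is actually executed.
df.selectExpr("bin(a)").explain()

// Baseline: scan the column and iterate over rows without computing bin.
spark.time(df.selectExpr("a").foreach(_ => ()))

// Same pipeline with bin applied; the difference from the baseline
// approximates the cost attributable to bin and its string output.
spark.time(df.selectExpr("bin(a)").foreach(_ => ()))
```

If the baseline already accounts for most of the time, the expression itself is probably not the bottleneck, and the profile should focus on the scan and the transfer of results instead.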


Successfully merging this pull request may close these issues.

[FEA] it would be nice if we could support org.apache.spark.sql.catalyst.expressions.Bin