[FEA] Create NDS-H benchmark for performance analysis #182

Open
5 of 7 tasks
mattahrens opened this issue Mar 6, 2024 · 2 comments

@mattahrens
Collaborator

mattahrens commented Mar 6, 2024

I would like to add another benchmark to the repository to support additional workloads for comparison. Several partners use the TPC-H benchmark for comparison, so we should enable running a TPC-H-like workload benchmark. The requirements are similar to what we have for NDS:

Data generation

  • P0: Support generation of raw data at various scale factors
  • P0: Support conversion of raw data to Parquet
  • P1: Support conversion of raw data to ORC
  • P1: Support conversion of raw data to CSV
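
As a rough illustration of the CSV conversion requirement above, here is a minimal sketch using only the Python standard library. It assumes the raw files are pipe-delimited with a trailing `|` on every line, as TPC-H dbgen `.tbl` output is. The function name `tbl_to_csv` is hypothetical and not part of this repository. Parquet and ORC conversion would presumably go through Spark, as for NDS, and is not shown here.

```python
import csv
import io


def tbl_to_csv(tbl_text: str) -> str:
    """Convert pipe-delimited raw rows (each ending in '|') to CSV text.

    Hypothetical helper for illustration only; not part of the repository.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in tbl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Drop the trailing delimiter so it does not produce an empty column.
        if line.endswith("|"):
            line = line[:-1]
        writer.writerow(line.split("|"))
    return out.getvalue()
```

Fields that contain a comma are quoted by `csv.writer`, so free-text comment columns survive the conversion.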

Query generation

  • P0: Support generation of queries at various scale factors

Power run execution

  • P0: Support execution of full query set given a specified input path
  • P1: Support execution of individual query given a specific query and input path
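
The two power run requirements above could be sketched roughly as follows. This is only an outline under stated assumptions: `power_run` and its parameters are hypothetical names, `execute` stands in for whatever actually runs the SQL (for example a thin wrapper around a Spark session), and registering tables from the input path is left out.

```python
import time
from typing import Callable, Dict, Optional


def power_run(
    queries: Dict[str, str],
    execute: Callable[[str], object],
    only: Optional[str] = None,
) -> Dict[str, float]:
    """Run queries one at a time, in order, and return elapsed seconds per query.

    Hypothetical sketch: `execute` is any callable that runs one SQL string,
    and `only` restricts the run to a single named query.
    """
    if only is not None:
        if only not in queries:
            raise KeyError(f"unknown query: {only}")
        queries = {only: queries[only]}
    timings: Dict[str, float] = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        execute(sql)  # run the query; the result is discarded in this sketch
        timings[name] = time.perf_counter() - start
    return timings
```

Passing `only="query5"` would cover the individual-query case, while omitting it runs the full set.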

We can add additional requirements once the initial NH scripts are set up, to more closely match how we execute NDS.

Relevant links of other repos that execute TPC-H workloads:

Disclaimers for TPC-H:

  • TPC-H is Copyright © 1993-2024 Transaction Processing Performance Council. The full TPC-H specification in PDF format can be found here
  • TPC, TPC Benchmark, and TPC-H are trademarks of the Transaction Processing Performance Council.
mattahrens added the "? - Needs Triage" label (Need team to review and classify) on Mar 6, 2024
@wjxiz1992
Collaborator

Hi Matt, after some discussion with @GaryShen2008, there are several things to confirm:

  1. Do we need the latest TPC-H tool version?
    If we want to be able to execute the whole TPC-H benchmark as soon as possible, we can use https://github.com/databricks/spark-sql-perf?tab=readme-ov-file#tpc-h directly to generate TPC-H data and run TPC-H queries. Note, however, that the TPC-H version it uses is still v2.4.0, while the latest is v3.0.1. I see a number of patches listed in the TPC-H specification PDF, so I think the version gap is an issue.
    If we want to use the latest TPC-H tool, the effort will be similar to the effort for NDS.

  2. Code structure change
    There is some NDS-specific code, such as https://github.com/NVIDIA/spark-rapids-benchmarks/blob/dev/nds/nds_gen_data.py#L42-L68, but also a lot of general code, such as https://github.com/NVIDIA/spark-rapids-benchmarks/blob/dev/nds/nds_power.py#L125-L135. All of it currently lives under the NDS folder. If we want clean code, a refactor will be necessary. But if we have a short-term goal, for example being able to run TPC-H ASAP, we can just create an NDH folder and put in existing code such as https://github.com/databricks/spark-sql-perf/blob/master/src/main/notebooks/TPC-multi_datagen.scala, along with some simple wrapper code to make it work.

These are the gaps we currently see, based on previous related work.

@mattahrens
Collaborator Author

  1. Yes, let's use the latest version of the TPC-H tool. I believe the other repos linked in the issue description may already be using the latest version.
  2. Let's start by just bringing up the NH benchmark, and then we can refactor to share common utilities between NDS and NH.
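
To make the "common utilities" idea in point 2 a bit more concrete, here is one possible shape for it, purely as a sketch. `BenchmarkSpec`, `NH`, and `table_paths` are hypothetical names that do not exist in the repository, and the one-directory-per-table layout is an assumption. The table list is the standard set of eight TPC-H tables.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Dict, Tuple


@dataclass(frozen=True)
class BenchmarkSpec:
    """Benchmark-specific settings; everything else could be shared code."""
    name: str
    tables: Tuple[str, ...]


# Hypothetical spec for the new benchmark, using the eight TPC-H table names.
NH = BenchmarkSpec(
    name="nh",
    tables=(
        "customer", "lineitem", "nation", "orders",
        "part", "partsupp", "region", "supplier",
    ),
)


def table_paths(spec: BenchmarkSpec, input_root: str) -> Dict[str, str]:
    """Shared helper: map each table to its path under the input root.

    Assumes one directory per table under `input_root`.
    """
    root = PurePosixPath(input_root)
    return {table: str(root / table) for table in spec.tables}
```

An NDS spec would then differ only in its name and table list, while helpers like `table_paths` stay common to both benchmarks.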

mattahrens removed the "? - Needs Triage" label (Need team to review and classify) on Apr 10, 2024
mattahrens changed the title from "[FEA] Create NH benchmark for performance analysis" to "[FEA] Create NDS-H benchmark for performance analysis" on May 21, 2024
bilalbari assigned bilalbari and unassigned yinqingh on May 22, 2024