Add nds_h validation script #198
Conversation
Signed-off-by: Yinqing Hao <[email protected]>
Observed inconsistent results for query18. Further investigation is needed.
I reran this query with the same data (we synced offline) and got the same diff; I'm afraid it's a RAPIDS bug. I'll file a bug on the RAPIDS side.
Signed-off-by: Yinqing Hao <[email protected]>
Force-pushed from a3cb119 to 574de5a
    Returns:
        bool: True if result matches otherwise False
    """
    if query_name in SKIP_QUERIES:
Skip checking query15_part1 and query15_part3, since these are create/drop view queries that produce no output.
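The skip check from the diff can be sketched as a standalone helper. The set contents follow this review comment; the helper name `should_skip` is hypothetical:

```python
# Queries that only create/drop views and therefore produce no output to validate.
SKIP_QUERIES = {'query15_part1', 'query15_part3'}

def should_skip(query_name: str) -> bool:
    """Return True when a query has no result set worth comparing."""
    return query_name in SKIP_QUERIES
```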
                use_iterator: bool):
    # skip output for specific query columns
    if query_name in SKIP_COLUMNS:
        df = df.drop(*SKIP_COLUMNS[query_name])
Drop column o_orderkey from the output of query 18 due to its non-deterministic results.
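A minimal pure-Python sketch of this column-skip step, using a list of dicts as a stand-in for the Spark DataFrame (`drop_skip_columns` is a hypothetical name; the real script calls `df.drop(*SKIP_COLUMNS[query_name])`):

```python
# Columns whose values are non-deterministic and must not be compared.
SKIP_COLUMNS = {'query18': ['o_orderkey']}

def drop_skip_columns(query_name, rows):
    """Remove skipped columns from each row before comparison.

    `rows` is a list of dicts standing in for a Spark DataFrame.
    """
    cols = set(SKIP_COLUMNS.get(query_name, []))
    return [{k: v for k, v in row.items() if k not in cols} for row in rows]
```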
nds-h/nds_h_validate.py
Outdated
import os
import re
import time
from decimal import *
nit: wildcard import is too broad?
Fixed. Thanks!
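The likely shape of the fix, assuming only `Decimal` is actually needed (the exact names imported after the fix are not shown in this thread):

```python
# Explicit import instead of the wildcard `from decimal import *`;
# only the names the script actually uses should be listed.
from decimal import Decimal

# Decimal keeps exact decimal arithmetic, unlike binary floats.
assert Decimal('0.1') + Decimal('0.2') == Decimal('0.3')
```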
nds-h/nds_h_validate.py
Outdated
from decimal import *

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import *
nit: same as above
Fixed. Thanks!
nds-h/nds_h_validate.py
Outdated
        df = df.drop(*SKIP_COLUMNS[query_name])

    # apply sorting if specified
    non_float_cols = [col(field.name) for \
non_float_cols = [col(field.name) for field in df.schema.fields
if field.dataType not in (FloatType(), DoubleType())]
float_cols = [col(field.name) for field in df.schema.fields
if field.dataType in (FloatType(), DoubleType())]
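The suggested type partitioning can be checked without a Spark session by treating the schema as (name, type) pairs; the tuple schema below is a made-up stand-in for `df.schema.fields` and the `FloatType()`/`DoubleType()` instances:

```python
# Stand-in schema: (column name, type name) pairs instead of StructField objects.
schema = [('o_orderkey', 'long'), ('revenue', 'double'), ('c_name', 'string')]

FLOATING = ('float', 'double')
# Mirrors the two comprehensions suggested above: split columns by type so that
# float-valued columns can get tolerance-based comparison later.
non_float_cols = [name for name, dtype in schema if dtype not in FLOATING]
float_cols = [name for name, dtype in schema if dtype in FLOATING]
```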
Fixed. Thanks!
nds-h/nds_h_validate.py
Outdated
        return math.isclose(expected, actual, rel_tol=epsilon)
    elif isinstance(expected, str) and isinstance(actual, str):
        return expected == actual
    elif expected == None and actual == None:
These cases are already covered by the `return expected == actual` below.
I updated this function to make it more readable, and I think it works the same as the previous version. Please let me know if any special cases are not covered by this function. Thanks! cc @wjxiz1992
def compare(expected, actual, epsilon=0.00001):
    # TODO 1: we can optimize this with case-match after Python 3.10
    # TODO 2: we can support complex data types like nested type if needed in the future.
    # now NDS only contains simple data types.
    if isinstance(expected, float) and isinstance(actual, float):
        # Double is converted to float in pyspark...
        if math.isnan(expected) and math.isnan(actual):
            return True
        return math.isclose(expected, actual, rel_tol=epsilon)
    if isinstance(expected, Decimal) and isinstance(actual, Decimal):
        return math.isclose(expected, actual, rel_tol=epsilon)
    return expected == actual
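As a sanity check, the reworked `compare` can be exercised standalone; the function body is reproduced from the review comment above so the snippet runs on its own:

```python
import math
from decimal import Decimal

def compare(expected, actual, epsilon=0.00001):
    # Same logic as the reworked function in the review comment above.
    if isinstance(expected, float) and isinstance(actual, float):
        # NaN never equals itself, so treat two NaNs as a match explicitly.
        if math.isnan(expected) and math.isnan(actual):
            return True
        return math.isclose(expected, actual, rel_tol=epsilon)
    if isinstance(expected, Decimal) and isinstance(actual, Decimal):
        return math.isclose(expected, actual, rel_tol=epsilon)
    # Strings, ints, None, and mismatched types all fall through to plain equality.
    return expected == actual

print(compare(1.0, 1.0 + 1e-7))             # True: within rel_tol
print(compare(float('nan'), float('nan')))  # True: matching NaNs
print(compare(None, None))                  # True: handled by the fallthrough
```

This shows the earlier explicit `str` and `None` branches are indeed subsumed by the final `expected == actual`.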
nds-h/nds_h_validate.py
Outdated
        raise Exception(f"More than one summary file found for query {query_name} in folder {prefix}.")
    if len(file_glob) == 0:
        raise Exception(f"No summary file found for query {query_name} in folder {prefix}.")
    for filename in file_glob:
The `if len(file_glob) > 1:` check should have already errored out by this point.
Updated. Thanks!
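The agreed-upon checks can be sketched as a small helper over the matched paths (the name `pick_summary_file` is hypothetical; the error messages are taken from the diff):

```python
def pick_summary_file(file_glob, query_name, prefix):
    """Expect exactly one matched summary file; raise otherwise."""
    if len(file_glob) > 1:
        raise Exception(f"More than one summary file found for query {query_name} in folder {prefix}.")
    if len(file_glob) == 0:
        raise Exception(f"No summary file found for query {query_name} in folder {prefix}.")
    # Exactly one match remains, so no loop over file_glob is needed.
    return file_glob[0]
```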
Signed-off-by: Yinqing Hao <[email protected]>
Force-pushed from 56fc6f2 to 25a0cdd
LGTM
This PR adds a validation script for NDS-H queries to compare outputs between CPU and GPU runs. The main checking logic is the same as in nds_validate.py, but some special checks specific to NDS queries are removed.