[GSProcessing] Enforce re-order for node label processing during classification #1136
base: main
Conversation
Force-pushed: 4d0afcd → 865d943 → 40b16f9 → 811f4b1
Do we have any performance numbers comparing the built-in Spark method and the pandas UDF method?
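For reference, a minimal sketch of the two approaches being compared; the DataFrame, column names, and exact cast logic below are illustrative assumptions, not the PR's actual code.

```python
# Minimal sketch contrasting a built-in Spark expression with a pandas UDF
# for casting a label column (illustrative names, not the PR's actual code).
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("n0", "0"), ("n1", "1")], ["node_id", "label"])

# Built-in Spark method: a Catalyst-optimized column expression, no Python workers.
builtin_df = df.select("node_id", F.col("label").cast("long").alias("label"))

# pandas UDF method: runs in Arrow batches inside Python workers,
# which adds serialization overhead for a simple cast like this.
@pandas_udf("long")
def to_long(labels: pd.Series) -> pd.Series:
    return pd.to_numeric(labels).astype("int64")

udf_df = df.select("node_id", to_long(F.col("label")).alias("label"))
```

In general the built-in expression should be at least as fast, since it avoids moving data into Python workers, but measuring on a representative dataset would settle it.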
.alias(self.label_column)
.cast("long")
.alias(self.label_column),
*original_cols,
So we add all the original columns here along with the label value? Won't that be overkill?
Hmm, well here we want to ensure that the order_col is preserved in the output, but DistSingleLabelTransformation is not aware of order_col. We could modify the constructor to also provide that, and then return just the label and order column.
The argument against would be that these two columns are selected downstream anyway, and the other columns are not materialized, so there shouldn't be any real performance penalty.
I'm good with either option.
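A rough sketch of the two options (the variable and column names below are assumptions for illustration, not the actual code):

```python
# Rough sketch of the two options discussed (illustrative names).
from pyspark.sql import functions as F

# Option in the PR: return the transformed label plus all original columns,
# so order_col survives even though the transformation doesn't know about it.
transformed_df = input_df.select(
    F.col(label_column).cast("long").alias(label_column),
    *original_cols,  # includes order_col; unused columns are pruned downstream
)

# Alternative: pass order_col to the constructor and return only the two
# columns that are actually consumed downstream.
transformed_df = input_df.select(
    F.col(label_column).cast("long").alias(label_column),
    F.col(order_col),
)
```

Because Spark is lazy and the optimizer prunes columns that are never referenced, the first option should not actually materialize the extra columns, which is the argument made above.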
bucket,
f"{s3_prefix}/{output_file}",
)
def run(self) -> None:
    """
    Executes the Spark processing job.
    Executes the Spark processing job, optional repartition job, and uploads any metadata files
Suggested change:
Executes the Spark processing job, optional repartition job, and uploads any metadata files
Executes the Spark processing job, optional repartition job, and uploads transformed metadata files
This also uploads perf_counters.json; I'm not sure we would qualify that as a "transformed" metadata file.
Resolved comment on graphstorm-processing/graphstorm_processing/graph_loaders/dist_heterogeneous_loader.py
if self.filesystem_type == FilesystemType.LOCAL:
    os.makedirs(os.path.dirname(out_path), exist_ok=True)

pq.write_table(
Shall we move this part to line 666? And does emr-serverless support using pyarrow to write directly to S3?
We can move it, yes.
As for whether emr-serverless supports using pyarrow to write directly to S3: yes, we already use pyarrow to write and modify files on EMRS during the re-partition stage that runs on the Spark leader.
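For context, a minimal sketch of how pyarrow can write Parquet either to a local path or directly to S3; the table contents, paths, and bucket name are illustrative assumptions.

```python
# Minimal sketch: writing Parquet locally or directly to S3 with pyarrow
# (illustrative table, paths, and bucket name).
import os
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

table = pa.table({"node_id": [0, 1, 2], "train_mask": [1, 0, 1]})

# Local filesystem: create the directory first, then write.
out_path = "output/train_mask.parquet"
os.makedirs(os.path.dirname(out_path), exist_ok=True)
pq.write_table(table, out_path)

# S3: hand pq.write_table an S3FileSystem so it writes straight to the bucket.
# This is the kind of direct pyarrow write that also runs on the EMR Serverless
# Spark leader during the re-partition stage.
s3 = fs.S3FileSystem()
pq.write_table(table, "my-bucket/gsprocessing/train_mask.parquet", filesystem=s3)
```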
split_metadata = {}

def write_masks_numpy(np_mask_arrays: Sequence[np.ndarray]):
What about putting these two functions into utils.py?
My rule of thumb for making a function public is: "is this function used in at least two places in the codebase?" E.g., that's why I moved _create_metadata_entry from an inner function to be part of DGHL in this PR.
Right now the answer is no for these functions; if we see a need to re-use them we can pull them out.
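To illustrate the trade-off being discussed (hypothetical signatures and behavior, not the actual implementation): a helper that is only called in one place can stay as an inner function, and would only move to utils.py once a second caller appears.

```python
# Sketch of the pattern under discussion (hypothetical signatures and paths).
from typing import Sequence
import numpy as np

def _write_split_masks(out_dir: str, masks: Sequence[np.ndarray]) -> None:
    # Inner helper: only used here, so it stays private to this function.
    # If another module ever needed it, it could be promoted to utils.py.
    def write_masks_numpy(np_mask_arrays: Sequence[np.ndarray]) -> None:
        for split_name, mask in zip(("train", "val", "test"), np_mask_arrays):
            np.save(f"{out_dir}/{split_name}_mask.npy", mask)

    write_masks_numpy(masks)
```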
Issue #, if available:
Fixes #1135 #1138
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.