
[GSProcessing] Enforce re-order for node label processing during classification #1136

Open

wants to merge 1 commit into main from reorder-node-ids
Conversation

thvasilo
Contributor

@thvasilo thvasilo commented Jan 17, 2025

Issue #, if available:

Fixes #1135 #1138

Description of changes:

  • We guarantee ordering for node classification labels by sorting the transformed label DataFrame by the NODE_INT_MAPPING id after processing, and doing the same for the train/val/test masks.
  • Because Spark does not guarantee row order when writing to Parquet, even for ordered DataFrames, we collect the labels and masks to a pandas DataFrame on the Spark leader and write that out using pyarrow (a minimal sketch of the approach follows below).
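
Since the description references sorting by the node-id mapping and writing through pyarrow, here is a minimal, self-contained sketch of the idea (not the actual GSProcessing code; column names such as `node_int_id` and `label` are illustrative placeholders):

```python
from pyspark.sql import SparkSession
import pyarrow as pa
import pyarrow.parquet as pq

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Toy stand-in for the transformed label DataFrame; "node_int_id" plays the
# role of the NODE_INT_MAPPING id and "label" is the class label.
labels_df = spark.createDataFrame(
    [(2, 1), (0, 0), (1, 1)], ["node_int_id", "label"]
)

# Sort by the integer node id so the row order matches the node id mapping.
# Because Spark does not guarantee row order when writing Parquet, collect
# the label DataFrame to pandas on the leader and write a single file with
# pyarrow, which preserves the order deterministically.
ordered_pdf = labels_df.orderBy("node_int_id").toPandas()
pq.write_table(
    pa.Table.from_pandas(ordered_pdf, preserve_index=False),
    "labels-part-00000.parquet",
)
```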

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@thvasilo thvasilo added ready able to trigger the CI gsprocessing For issues and PRs related to the GSProcessing library 0.4.1 labels Jan 17, 2025
@thvasilo thvasilo force-pushed the reorder-node-ids branch 2 times, most recently from 4d0afcd to 865d943 Compare January 17, 2025 19:28
@thvasilo thvasilo marked this pull request as ready for review January 17, 2025 19:42
@thvasilo thvasilo requested a review from jalencato January 17, 2025 19:42
@thvasilo thvasilo self-assigned this Jan 17, 2025
@thvasilo thvasilo added this to the 0.4.1 release milestone Jan 17, 2025
Collaborator

@jalencato jalencato left a comment


Do we have any performance numbers comparing the built-in Spark method and the pandas UDF method?

.alias(self.label_column)
.cast("long")
.alias(self.label_column),
*original_cols,
Collaborator

So we add all the original columns here along with the label value? Isn't that overkill?

Contributor Author

Hmm, well, here we want to ensure that the order_col is preserved in the output, but DistSingleLabelTransformation is not aware of order_col. We could modify the constructor to also provide that, and then return just the label and order columns.

The argument against would be that these two columns are selected downstream anyway, and the other columns are never materialized, so there shouldn't be any real performance penalty.

I'm good with either option.
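
For illustration, a rough sketch of the alternative described above, where the constructor also receives the order column and apply() returns only the label and order columns. This is a hypothetical variant, not the current DistSingleLabelTransformation API:

```python
from pyspark.sql import DataFrame, functions as F


class OrderAwareLabelTransformation:
    """Hypothetical label transformation that is aware of the order column."""

    def __init__(self, label_column: str, order_col: str):
        self.label_column = label_column
        self.order_col = order_col

    def apply(self, input_df: DataFrame) -> DataFrame:
        # Return only the cast label and the order column, instead of
        # carrying along all the original columns.
        return input_df.select(
            F.col(self.label_column).cast("long").alias(self.label_column),
            F.col(self.order_col),
        )
```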

bucket,
f"{s3_prefix}/{output_file}",
)

def run(self) -> None:
"""
Executes the Spark processing job.
Executes the Spark processing job, optional repartition job, and uploads any metadata files
Collaborator

Suggested change
Executes the Spark processing job, optional repartition job, and uploads any metadata files
Executes the Spark processing job, optional repartition job, and uploads transformed metadata files

Contributor Author

This also uploads perf_counters.json; I'm not sure we would qualify that as a "transformed" metadata file.

if self.filesystem_type == FilesystemType.LOCAL:
os.makedirs(os.path.dirname(out_path), exist_ok=True)

pq.write_table(
Collaborator

Shall we move this part to line 666? And does EMR Serverless support using pyarrow directly to write to S3?

Contributor Author

We can move it, yes.

> does EMR Serverless support using pyarrow directly to write to S3?

Yes, we already use pyarrow to write and modify files on EMR Serverless during the re-partition stage, which runs on the Spark leader.
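
For reference, a minimal sketch (assuming standard pyarrow APIs; the bucket, prefix, and column names are placeholders) of writing a Parquet file directly to S3 from the leader:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Placeholder data standing in for a collected labels/masks table.
table = pa.table({"node_int_id": [0, 1, 2], "train_mask": [1, 0, 1]})

# pyarrow's S3FileSystem lets the Spark leader write a single ordered file
# without going through Spark's Parquet writers.
s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(
    table,
    "my-bucket/prefix/train_mask-part-00000.parquet",
    filesystem=s3,
)
```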

split_metadata = {}

def write_masks_numpy(np_mask_arrays: Sequence[np.ndarray]):
Collaborator

What about putting these two functions into utils.py?

Contributor Author

@thvasilo thvasilo Jan 17, 2025

My rule of thumb for making a function public is: "Is this function used in at least two places in the codebase?"

E.g., that's why I moved _create_metadata_entry from an inner function to be part of DGHL in this PR.

Right now the answer is no for these functions; if we see a need to re-use them, we can pull them out.
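
For context, a hedged sketch of what a mask-writing helper along the lines of write_masks_numpy could look like; the actual implementation in the PR may differ, and the mask_names parameter here is illustrative:

```python
from typing import Sequence

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def write_masks_numpy(
    np_mask_arrays: Sequence[np.ndarray], mask_names: Sequence[str]
) -> None:
    """Write each 0/1 mask array as its own single-column Parquet file."""
    for mask_array, mask_name in zip(np_mask_arrays, mask_names):
        table = pa.table({mask_name: pa.array(mask_array, type=pa.int8())})
        pq.write_table(table, f"{mask_name}-part-00000.parquet")


# Example usage with toy train/val masks.
write_masks_numpy(
    [np.array([1, 0, 1]), np.array([0, 1, 0])], ["train_mask", "val_mask"]
)
```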

Labels
0.4.1 gsprocessing For issues and PRs related to the GSProcessing library ready able to trigger the CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GSProcessing Local Run does not output file list in alphabetic order
2 participants