This repository has been archived by the owner on May 9, 2024. It is now read-only.

[WIN] Plasticc benchmark crashes on the original plasticc dataset on HDK tasks execution #581

Open
gshimansky opened this issue Jul 13, 2023 · 0 comments

Apparently the plasticc benchmark no longer completes successfully when it is run on the original plasticc dataset instead of the synthetic one. This is true for both HDK 0.6 and HDK 0.7. On some systems execution ends silently; on others there is an error message:
[2023-07-13 17:34:49.912154] [0x00005874] [info] 0 71 BufferMgr.cpp:720 Check failed: buffer_it->second->buffer
Debugging shows that the failure happens on the line that triggers HDK execution: df_meta.shape # to trigger real execution.

You can reproduce the problem by checking out the benchmarks repo https://github.com/gshimansky/data-science-processing-workload.

To execute the benchmark on the original dataset, download test_set.csv, training_set.csv, test_set_metadata.csv and training_set_metadata.csv from the modin datasets S3 bucket s3://modin-datasets/plasticc. You can execute benchmarks/plasticc.py directly like this:

set MODIN_STORAGE_FORMAT=hdk
set MODIN_ENGINE=native
set MODIN_EXPERIMENTAL=true
python benchmarks/plasticc.py training_set.csv test_set.csv training_set_metadata.csv test_set_metadata.csv

or you can rename these files to plasticc_training_set.csv, plasticc_test_set.csv, plasticc_training_set_metadata.csv and plasticc_test_set_metadata.csv, respectively, and run launcher.py with the -ru (reuse) option:

python launcher.py -m plasticc -ru --hdk

With -ru, the launcher skips the generation stage and reuses the dataset files already present in the current directory.
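If renaming the four files by hand is inconvenient, the step can be scripted. The following is only a sketch: the helper name stage_plasticc_files and the choice to copy rather than rename (so the downloaded originals stay usable for the direct benchmarks/plasticc.py invocation) are my assumptions, not part of the benchmarks repo. The file name mapping is the one listed above.

```python
import shutil
from pathlib import Path

# Original file names -> names the launcher looks for with -ru,
# taken from the reproduction steps in this issue.
RENAMES = {
    "training_set.csv": "plasticc_training_set.csv",
    "test_set.csv": "plasticc_test_set.csv",
    "training_set_metadata.csv": "plasticc_training_set_metadata.csv",
    "test_set_metadata.csv": "plasticc_test_set_metadata.csv",
}


def stage_plasticc_files(src_dir, dst_dir):
    """Copy the downloaded files into dst_dir under the launcher's names.

    Hypothetical helper (not part of the benchmarks repo).
    Returns the list of created paths.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)

    # Fail early if any of the four dataset files is missing.
    missing = [name for name in RENAMES if not (src_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing dataset files: {missing}")

    dst_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for old_name, new_name in RENAMES.items():
        target = dst_dir / new_name
        shutil.copyfile(src_dir / old_name, target)
        staged.append(target)
    return staged
```

Calling stage_plasticc_files("downloads", ".") from the directory where launcher.py is run would place the files where the -ru option expects them; "downloads" here is just a placeholder for wherever the S3 files were saved.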
