Too slow "Column Pair Trends" #546
Comments
OK, I have to say that after a change everything runs fast, but in reality I don't know exactly why, because I have no clue how the library works. Many of my fields are high-cardinality ones with almost unique text values. So far, I had them as Is this a correct approach to skip them? I just need some validation. Thanks.
Hi @echatzikyriakidis, appreciate the feedback. Before going to parallelization (which we certainly can look into), it is helpful to look back at metadata and ensure everything is running right. SDMetrics uses metadata to make sure it is applying the correct metrics. For example, if you are storing HTTP codes such as Here are the docs for what the metadata should look like. Based on your description, here's what I think is going on:
I would recommend continuing to use
Hi @npatki, that's exactly what I ended up doing. I set those high-cardinality text fields to sdtype=text and now everything is fast. Thanks!
Thanks for confirming @echatzikyriakidis. I left a feature request in #548 to make it easier (and more intuitive) to specify which columns you want to ignore when generating a report.
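For readers who land here with the same slowdown, the fix confirmed in this thread (marking near-unique text columns as sdtype text in the metadata) can be sketched roughly as below. This is only an illustration, not SDMetrics code: the helper name `mark_high_cardinality_as_text`, the `threshold` value, and the column-name-to-list input shape are all assumptions made up for the example. Only the `{"columns": {name: {"sdtype": ...}}}` metadata layout follows the metadata format the maintainers point to.

```python
from copy import deepcopy


def mark_high_cardinality_as_text(columns, metadata, threshold=0.9):
    """Return a copy of `metadata` with near-unique string columns set to sdtype 'text'.

    Hypothetical helper, not part of SDMetrics.
    `columns` maps a column name to a list of its values.
    `threshold` is the unique-value ratio above which a column counts as near-unique.
    """
    updated = deepcopy(metadata)  # leave the caller's metadata untouched
    for name, values in columns.items():
        if name not in updated["columns"]:
            continue
        non_null = [v for v in values if v is not None]
        # Only consider columns that hold nothing but strings.
        if not non_null or not all(isinstance(v, str) for v in non_null):
            continue
        unique_ratio = len(set(non_null)) / len(non_null)
        if unique_ratio >= threshold:
            updated["columns"][name]["sdtype"] = "text"
    return updated


# Toy data: "comment" is almost unique free text, "status" is a real category.
columns = {
    "status": ["ok", "ok", "error", "ok"],
    "comment": ["first note", "second note", "third note", "fourth note"],
}
metadata = {
    "columns": {
        "status": {"sdtype": "categorical"},
        "comment": {"sdtype": "categorical"},
    }
}
updated = mark_high_cardinality_as_text(columns, metadata)
```

The updated dictionary would then be passed as the metadata when generating the report. It is worth reviewing the result by hand, since a ratio threshold can also catch short columns that merely happen to have distinct values.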
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
Hi @npatki,
It seems that it is too slow when running Column Pair Trends from the Quality Report.
My current example:
Generating report ...
(1/4) Evaluating Column Shapes: : 100%|██████████| 59/59 [03:39<00:00, 3.72s/it]
(2/4) Evaluating Column Pair Trends: : 0%| | 0/158 [00:00<?, ?it/s]
Suggestion:
Is it possible to change the library so that both single-table and multi-table reports (Quality, Diagnostic, and any others that exist) allow parallelization (either multithreading or multiprocessing)?
Every calculation of column shapes or trends in column pairs can run in parallel. No need for sequential computation, since each computation is independent. Right?
Thanks!
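To make the suggestion concrete, here is a minimal sketch of what running independent column-pair computations in parallel could look like. It does not use or reflect SDMetrics internals: `pair_score` is a made-up stand-in metric (Jaccard overlap of observed value pairs), and `score_all_pairs` and `max_workers` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def pair_score(real, synthetic, col_a, col_b):
    """Hypothetical per-pair metric: Jaccard overlap of the observed value pairs."""
    real_pairs = set(zip(real[col_a], real[col_b]))
    synth_pairs = set(zip(synthetic[col_a], synthetic[col_b]))
    union = real_pairs | synth_pairs
    return len(real_pairs & synth_pairs) / len(union) if union else 1.0


def score_all_pairs(real, synthetic, max_workers=4):
    """Score every column pair; each pair is independent, so they can run concurrently."""
    pairs = list(combinations(sorted(real), 2))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so scores line up with `pairs`.
        scores = pool.map(lambda p: pair_score(real, synthetic, *p), pairs)
        return dict(zip(pairs, scores))


# Toy tables stored as column-name -> list of values.
real = {"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [True, False, True]}
synthetic = {"a": [1, 2, 3], "b": ["x", "y", "y"], "c": [True, False, True]}
scores = score_all_pairs(real, synthetic)
```

Threads keep the sketch simple, but in standard CPython they give little speedup for CPU-bound pure-Python work because of the global interpreter lock. `ProcessPoolExecutor` offers the same `map` interface and would be the more likely choice for heavy metrics, provided the scoring function is defined at module level so it can be pickled.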