feat: support merge fragment with dataset #3256
Conversation
Force-pushed from a32f6ba to 5330afa.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #3256      +/-   ##
==========================================
+ Coverage   78.47%   79.01%   +0.53%
==========================================
  Files         245      246       +1
  Lines       85088    87521    +2433
  Branches    85088    87521    +2433
==========================================
+ Hits        66772    69151    +2379
- Misses      15501    15507       +6
- Partials     2815     2863      +48

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
What is the goal of the PR? Normally, when distributing work across fragments, we are trying to divide N units of work into F tasks where each task does N/F units of work.
However, with this merge we are taking N units of work (read N rows, build a hashtable of size N, write N rows) and breaking it into F tasks where each task still does (more or less) N units of work (read N rows, build a hashtable of size N, write N/F rows).
Is the problem just that the write step is very expensive?
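(A rough back-of-the-envelope sketch of the concern above; N and F below are made-up numbers, not figures from the PR.)

# Illustrative arithmetic only; N and F are made-up numbers.
N = 1_000_000   # total rows in the dataset
F = 10          # number of fragments / tasks

# Usual fragment-parallel pattern: each task handles roughly N/F rows end to end.
work_per_task_split = N / F            # 100,000 units

# Concern raised above: each merge task still reads N rows and builds an
# N-sized hashtable; only the write step shrinks to N/F rows.
work_per_task_merge = N + N + N / F    # read + build + write = 2,100,000 units

print(work_per_task_split, work_per_task_merge)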
python/src/dataset.rs (Outdated)
#[getter(manifest_max_field_id)]
fn manifest_max_field_id(self_: PyRef<'_, Self>) -> PyResult<i32> {
    Ok(self_.ds.manifest().max_field_id())
}
Can we call this max_field_id?
@westonpace For example, I have tagged all rows, and the tags are created by a clustering algorithm, so it's hard to use the add_columns API to create this column. Instead, I want to shuffle the tag dataframe based on fragment_id.
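(A minimal sketch of that workflow, assuming a fragment-level merge entry point along the lines this PR adds; the merge method name, the row_id join key, and the file paths below are illustrative, not the exact API.)

import lance
import pandas as pd

ds = lance.dataset("path/to/dataset")      # illustrative path

# Tags computed offline by a clustering job, one row per dataset row.
# The fragment_id / row_id columns are assumptions about how the tags are keyed.
tags = pd.read_parquet("tags.parquet")

for fragment in ds.get_fragments():
    frag_tags = tags[tags["fragment_id"] == fragment.fragment_id]
    # Hypothetical fragment-level merge: only this fragment's rows are scanned,
    # and only this fragment's slice of the tags goes into the hashtable.
    fragment.merge(frag_tags, left_on="row_id", right_on="row_id")

(Depending on the actual API, the per-fragment results may still need to be committed back to the dataset in a final step; that is omitted here.)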
I see, I had misunderstood. I thought the hashtable was built on the dataset for some reason. However, you are right: on closer look we are building the hashtable on the fragment data, so it's only the scan step that remains N units of work. Also, perhaps more importantly, this reduces the amount of memory required because we have a smaller hashtable.
Looks good. Thanks for working on this.
This PR allows merging a dataset concurrently, fragment by fragment.
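(A sketch of what running the merge concurrently could look like from the caller's side, reusing the hypothetical fragment-level merge call from the earlier sketch; the thread pool, worker count, and column names are assumptions.)

from concurrent.futures import ThreadPoolExecutor

import lance
import pandas as pd

ds = lance.dataset("path/to/dataset")      # illustrative path
tags = pd.read_parquet("tags.parquet")     # precomputed tags, keyed as assumed above

def merge_one(fragment):
    frag_tags = tags[tags["fragment_id"] == fragment.fragment_id]
    # Hypothetical fragment-level merge call, as in the earlier sketch.
    return fragment.merge(frag_tags, left_on="row_id", right_on="row_id")

# Each fragment can be merged independently, so the tasks can run in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(merge_one, ds.get_fragments()))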