feat: support merge fragment with dataset #3256

Merged
merged 11 commits into from
Dec 23, 2024


chenkovsky
Contributor

@chenkovsky chenkovsky commented Dec 16, 2024

This PR allows a dataset to be merged concurrently.

@github-actions bot added the enhancement (New feature or request) and python labels on Dec 16, 2024
@chenkovsky force-pushed the feature/merge_fragment branch from a32f6ba to 5330afa on December 16, 2024 16:26
@codecov-commenter

codecov-commenter commented Dec 16, 2024

Codecov Report

Attention: Patch coverage is 14.03509% with 49 lines in your changes missing coverage. Please review.

Project coverage is 79.01%. Comparing base (83b8efd) to head (745c345).
Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance/src/dataset/fragment.rs 14.03% 49 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3256      +/-   ##
==========================================
+ Coverage   78.47%   79.01%   +0.53%     
==========================================
  Files         245      246       +1     
  Lines       85088    87521    +2433     
  Branches    85088    87521    +2433     
==========================================
+ Hits        66772    69151    +2379     
- Misses      15501    15507       +6     
- Partials     2815     2863      +48     
Flag Coverage Δ
unittests 79.01% <14.03%> (+0.53%) ⬆️


Contributor

@westonpace westonpace left a comment


What is the goal of the PR? Normally when distributing work across fragments we are trying to divide N units of work into F tasks where each task does N/F units of work.

However, with this merge we are taking N units of work (read N rows, build a hashtable of size N, write N rows) and breaking it into F tasks where each task is still (more or less) N units of work (read N rows, build a hashtable of size N, write N/F rows).

Is the problem just that the write step is very expensive?
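The concern above can be put into rough numbers. The following is only a back-of-the-envelope sketch: it counts one unit of work per row read, per hashtable entry built, and per row written, and the function names are hypothetical, not anything from this PR.

```python
def single_task_work(n: int) -> int:
    # One task: read N rows, build a hashtable of size N, write N rows.
    return n + n + n


def ideal_split_total(n: int, f: int) -> int:
    # Ideal split: each of the F tasks does 1/F of every step.
    per_task = (n + n + n) // f
    return f * per_task


def feared_split_total(n: int, f: int) -> int:
    # The worry raised above: each task still reads N rows and builds a
    # hashtable of size N, and only the write shrinks to N/F rows.
    per_task = n + n + n // f
    return f * per_task
```

With N = 1000 rows and F = 10 fragments, the ideal split keeps the total at 3000 units, while the feared split totals 21000 units, which is why the question of where the hashtable is built matters.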

Comment on lines 519 to 523
#[getter(manifest_max_field_id)]
fn manifest_max_field_id(self_: PyRef<'_, Self>) -> PyResult<i32> {
    Ok(self_.ds.manifest().max_field_id())
}

Contributor


Can we call this max_field_id?

@chenkovsky
Contributor Author

chenkovsky commented Dec 20, 2024

@westonpace For example, I have tagged all rows, and the tag is produced by a clustering algorithm, so it is hard to use the add_columns API to create this column. Instead, I want to shuffle the tag dataframe by fragment_id.
For example, if the dataframe schema is <row_addr, tag>, we can get fragment_id from row_addr. Then, for each fragment, we only need to build a small piece of the hashtable, so the merge is quick and we don't need to worry about memory.
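A minimal sketch of the shuffle described above, in plain Python. It assumes a row address is a 64-bit integer with the fragment id in the upper 32 bits and the row offset in the lower 32 bits; the helper names `fragment_id_of` and `shuffle_by_fragment` are hypothetical and are not part of this PR or of the Lance API.

```python
from collections import defaultdict


def fragment_id_of(row_addr: int) -> int:
    # Assumption: the upper 32 bits of a row address hold the fragment id.
    return row_addr >> 32


def shuffle_by_fragment(rows):
    """Group (row_addr, tag) pairs by the fragment they belong to."""
    groups = defaultdict(list)
    for row_addr, tag in rows:
        groups[fragment_id_of(row_addr)].append((row_addr, tag))
    return dict(groups)
```

Each resulting group holds only the tags for one fragment, so a worker handling that fragment never needs to see the rest of the tag dataframe.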

@westonpace
Contributor

I see. I had misunderstood. I thought the hashtable was built on the DS for some reason. However, you are right, on closer look we are making the hashtable on the fragment data so it's only the scan step that remains N units of work.

Also, perhaps more importantly, this reduces the amount of memory required because we have a smaller hashtable.
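To illustrate the point about smaller hashtables, here is a toy simulation in plain Python. It does not use Lance; the data layout, the function names `merge_fragment` and `merge_all`, and the left-join behaviour (unmatched rows get `None`) are all assumptions made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_fragment(fragment_rows, tag_partition):
    # Build a hashtable from this fragment's share of the tags only,
    # so its size is bounded by the fragment, not by the whole dataset.
    table = {row_addr: tag for row_addr, tag in tag_partition}
    # Probe the table once per row; unmatched rows get None.
    return [(row_addr, value, table.get(row_addr))
            for row_addr, value in fragment_rows]


def merge_all(fragments, partitions):
    """Merge every fragment concurrently, one task per fragment."""
    with ThreadPoolExecutor() as pool:
        futures = {
            fid: pool.submit(merge_fragment, rows, partitions.get(fid, []))
            for fid, rows in fragments.items()
        }
        return {fid: fut.result() for fid, fut in futures.items()}
```

Here `fragments` maps a fragment id to its `(row_addr, value)` rows and `partitions` maps the same id to the matching `(row_addr, tag)` pairs. Peak memory per task is driven by the largest partition rather than by the full tag set.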

Contributor

@wjones127 wjones127 left a comment


Looks good. Thanks for working on this

@wjones127 wjones127 merged commit ae70478 into lancedb:main Dec 23, 2024
24 of 26 checks passed
Labels
enhancement New feature or request python

4 participants