feat: support merge fragment with dataset #3256
Conversation
Force-pushed from a32f6ba to 5330afa.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #3256      +/-   ##
==========================================
+ Coverage   78.47%   79.01%   +0.53%
==========================================
  Files         245      246       +1
  Lines       85088    87521    +2433
  Branches    85088    87521    +2433
==========================================
+ Hits        66772    69151    +2379
- Misses      15501    15507       +6
- Partials     2815     2863      +48

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
What is the goal of the PR? Normally, when distributing work across fragments, we are trying to divide N units of work into F tasks where each task does N/F units of work.
However, with this merge we are taking N units of work (read N rows, build a hashtable of size N, write N rows) and breaking it into F tasks where each task still does (more or less) N units of work (read N rows, build a hashtable of size N, write N/F rows).
Is the problem just that the write step is very expensive?
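(A rough back-of-the-envelope sketch of the concern above; N and F below are made-up numbers, not figures from the PR.)

# Illustrative arithmetic only; N and F are made-up numbers.
N = 1_000_000   # total rows in the dataset
F = 10          # number of fragments / tasks

# Usual fragment-parallel pattern: each task handles roughly N/F rows end to end.
work_per_task_split = N / F            # 100,000 units

# Concern raised above: each merge task still reads N rows and builds an
# N-sized hashtable; only the write step shrinks to N/F rows.
work_per_task_merge = N + N + N / F    # read + build + write = 2,100,000 units

print(work_per_task_split, work_per_task_merge)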
python/src/dataset.rs (Outdated)
#[getter(manifest_max_field_id)]
fn manifest_max_field_id(self_: PyRef<'_, Self>) -> PyResult<i32> {
    Ok(self_.ds.manifest().max_field_id())
}
Can we call this max_field_id?
@westonpace For example, I have tagged all rows, and the tags are created by a clustering algorithm, so it's hard to use the add_columns API to create this column. Instead, I want to shuffle the tag dataframe based on fragment_id.
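(A minimal sketch of that workflow, assuming a fragment-level merge entry point along the lines this PR adds; the merge method name, the row_id join key, and the file paths below are illustrative, not the exact API.)

import lance
import pandas as pd

ds = lance.dataset("path/to/dataset")      # illustrative path

# Tags computed offline by a clustering job, one row per dataset row.
# The fragment_id / row_id columns are assumptions about how the tags are keyed.
tags = pd.read_parquet("tags.parquet")

for fragment in ds.get_fragments():
    frag_tags = tags[tags["fragment_id"] == fragment.fragment_id]
    # Hypothetical fragment-level merge: only this fragment's rows are scanned,
    # and only this fragment's slice of the tags goes into the hashtable.
    fragment.merge(frag_tags, left_on="row_id", right_on="row_id")

(Depending on the actual API, the per-fragment results may still need to be committed back to the dataset in a final step; that is omitted here.)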
I see, I had misunderstood. I thought the hashtable was built on the dataset for some reason. However, you are right: on closer look we are building the hashtable on the fragment data, so it's only the scan step that remains N units of work. Also, perhaps more importantly, this reduces the amount of memory required because we have a smaller hashtable.
Looks good. Thanks for working on this.
This PR allows merging a dataset concurrently, fragment by fragment.
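(A sketch of what running the merge concurrently could look like from the caller's side, reusing the hypothetical fragment-level merge call from the earlier sketch; the thread pool, worker count, and column names are assumptions.)

from concurrent.futures import ThreadPoolExecutor

import lance
import pandas as pd

ds = lance.dataset("path/to/dataset")      # illustrative path
tags = pd.read_parquet("tags.parquet")     # precomputed tags, keyed as assumed above

def merge_one(fragment):
    frag_tags = tags[tags["fragment_id"] == fragment.fragment_id]
    # Hypothetical fragment-level merge call, as in the earlier sketch.
    return fragment.merge(frag_tags, left_on="row_id", right_on="row_id")

# Each fragment can be merged independently, so the tasks can run in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(merge_one, ds.get_fragments()))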