-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added toolkit compare
#719
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #719 +/- ##
==========================================
- Coverage 87.75% 87.45% -0.30%
==========================================
Files 127 128 +1
Lines 11038 11355 +317
Branches 1484 1533 +49
==========================================
+ Hits 9686 9931 +245
- Misses 982 1046 +64
- Partials 370 378 +8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Deploying datachain-documentation with Cloudflare Pages
|
@skshetry @dreadatour can we please review this folks? |
@iterative/datachain a reminder, please take a look team. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Not related to this PR, but it is also would be nice to have an example of real-world usage 👀
@@ -112,25 +125,27 @@ def _to_list(obj: Union[str, Sequence[str]]) -> list[str]: | |||
for c in [f"{_rprefix(c, rc)}{rc}" for c, rc in zip(on, right_on)] | |||
] | |||
) | |||
diff_cond.append((added_cond, "A")) | |||
diff_cond.append((added_cond, CompareStatus.ADDED)) |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
@dmpetrov can you also take a look please? you have more context on this. |
src/datachain/toolkit/diff.py
Outdated
added: bool = True, | ||
deleted: bool = True, | ||
modified: bool = True, | ||
unchanged: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need these arguments? Can we leave this up to the user to filter?
dc.compare(...).filter(C("col") == "added")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataChain.compare()
returns new chain with that column which can be filtered and this new toolkit method does exactly what you described, so it saves the user that one filter step. Now, we can discuss if toolkit method is even needed in the first place.
So with this PR we have:
compare()
insrc.datachain.diff
-> accepts 2 chains and returns new "diff" chain with status columnDataChain.compare()
-> simple wrapper around 1) where left chain isself
compare()
insrc.datachain.toolkit
-> wrapper around 1) but instead of returning one "diff" chain with status column, it splits that chain into multiple chains where each chain represents only one status which is basically what you did in your comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with this new toolkit we maybe have too many functions, although 2) was meant to be "private"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the 3rd should have a different name and should follow similar pattern as the compare() - it should be in dc.py if there is not much code or in lib.diff otherwise with a wrapper in dc.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, lib.diff is a better place for the code than a new toolkit.
PS: I might be the person who proposed the toolkit file but it does not seem a good idea in this case 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 3) should not be in dc.py
as it returns multiple instances of DataChain
so it should be in util file.
Also, the question is should we use src.datachain.diff.py
or src.datachain.lib.diff.py
for public util functions?
src/datachain/toolkit/diff.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain the difference between a toolkit
and a lib
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was to move "heavy" code from dc.py to somewhere else and keep only simple wrapper function in dc.py.
If it's implemented in lib/diff.py - it's enough and we don't need toolkit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we organize into individual top-level modules, eg: datachain.diff
, instead of cramming everything in a nested module in datachain.toolkit
or datahchain.lib
modules?
Namespaces are one honking great idea -- let's do more of those!
Flat is better than nested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that looks like the best option.
@ilongin what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also for top-level modules. I will move this to datachain.diff
@@ -16,6 +17,21 @@ | |||
C = Column | |||
|
|||
|
|||
def get_status_col_name() -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the column name need to be random? Can we have a default column name that can be changed by users?
Eg:
def compare(col="status"):
pass
dc.compare(col=...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this column name will not be in the results. It's only needed for our internal implementation of the diff. User will have separate chains for each status and status column is not needed in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments are inline. Please take a look.
I'm approving to not to block.
) | ||
# we still need status column for internal implementation even if not | ||
# needed in the output | ||
status_col = status_col or get_status_col_name() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure you drop this random column if it was None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Status column is dropped with select_except(...)
before returning to the user.
src/datachain/toolkit/diff.py
Outdated
) | ||
``` | ||
""" | ||
status_col = get_status_col_name() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User can define it, can't they?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this toolkit method status column is not returned to the user so it's created only for our internal implementation and removed before returning to the user. User can define status column in core DataChain.compare()
which returns one chain with all statuses written in that status column
src/datachain/toolkit/diff.py
Outdated
CompareStatus.UNCHANGED | ||
) | ||
|
||
return chains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how status column is cleaned up. or I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's cleaned in filter_by_status()
method with select_except()
src/datachain/toolkit/diff.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was to move "heavy" code from dc.py to somewhere else and keep only simple wrapper function in dc.py.
If it's implemented in lib/diff.py - it's enough and we don't need toolkit.
Added
compare
function to toolkit. It usesdatachain.lib.diff.compare()
to compareDataChain
s and returns dictionary with values for each ofadded
,removed
,modified
andunchanged
chains. Each chain consists only of those changes that are related to it (added
only has added fields,deleted
onlydeleted fields etc.) Keys of dicts are shortcuts for each status:
A,
D,
Mand
U`.Status column is removed from each resulting chain as it's not needed.
Also added
CompareStatus
enum and replaced hardcoded status letters (A
,D
...) across the codebasej