A means of viewing all differences between two datatrees #9929

Open
danielfromearth opened this issue Jan 7, 2025 · 11 comments
Labels
API design enhancement topic-DataTree Related to the implementation of a DataTree class topic-testing

Comments

@danielfromearth

danielfromearth commented Jan 7, 2025

Is your feature request related to a problem?

It can be frustrating to figure out why two Datatrees are not returning True when running xarray.DataTree.identical() or xarray.DataTree.equals().

Currently, if xarray's diff functions detect any difference in the tree structure, they raise at that point, and so do not show all of the differences. Thus, the current functions excel when the user wants to check that two datatrees are equal, but not when the user wants to discover subtle differences — and there are cases in which such subtle differences may be desired.

For example, when developing or testing new datatree transformations, I would like to be able to quickly check that the datatree has been modified as expected. Or, when expecting two datasets to be the same but they are not, it would be helpful to be able to quickly traverse the entire tree structure and see the differences.

Describe the solution you'd like

I think it would be useful to have a means of visually representing all the differences between two xarray Datatree objects, either showing the whole trees and highlighting all the differences, or showing only the differences.

I'm imagining a solution that shows a comparison report similar to ncompare, which provides aligned and colorized difference reports for quick assessments of groups, variable names, types, shapes, and attributes (see ncompare's readme gif or the example notebook). In contrast to ncompare, the proposed solution would work on the xarray data model.

The solution could be a new function, perhaps in the testing suite, such as xarray.testing.all_differences(dt1: DataTree, dt2: DataTree). This could be based on the diff_datatree_repr function that is used in assert_isomorphic:

assert a.isomorphic(b), diff_datatree_repr(a, b, "isomorphic")

def diff_datatree_repr(a: DataTree, b: DataTree, compat):

Describe alternatives you've considered

Showing differences between Datatrees will achieve similar goals to https://github.com/nasa/ncompare. However, a solution in xarray would be different than ncompare, because ncompare looks directly at the netCDF/HDF files, and makes assumptions that that is the data model you care about. xarray instead opens netCDF (or a range of other formats) into an in-memory object which has a data model that is almost but not quite the same as netCDF's data model, then xarray's assertions compare those. For example, netCDF can have dimensions with no corresponding coordinate values, which aren't a part of xarray's data model. In addition, a solution in xarray would be applicable to data coming from additional formats like Zarr.

Additional context

No response


welcome bot commented Jan 7, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas TomNicholas added API design topic-testing topic-DataTree Related to the implementation of a DataTree class labels Jan 7, 2025
@mhuzaifa5

Hey, I'd like to work on this. If it is not assigned to anyone else, kindly assign it to me.

@TomNicholas
Member

Thanks for raising this, @danielfromearth.

(FYI @mhuzaifa5 we don't generally assign issues to individuals - usually we discuss how to solve things on an issue and then anyone is free to open a pull request.)


My main question about this feature is whether or not the use case could instead be relatively easily handled using existing API, or with a tweak to the behaviour of existing API.

Pseudocode using existing (though private) API:

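from xarray import DataTree
from xarray.core.formatting import diff_dataset_repr  # private API, may change

def show_all_differences(dt1: DataTree, dt2: DataTree) -> str:
    # Assumes both trees have the same structure, so the subtrees iterate in the same order.
    diff = ''
    for node1, node2 in zip(dt1.subtree, dt2.subtree):
        # The third argument names the comparison to report on ("equals" or "identical").
        diff += diff_dataset_repr(node1.ds, node2.ds, "identical")

    return diff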

Tweaking existing API:

> Currently, if xarray's diff functions detect any difference in the tree structure, they raise at that point, and so do not show all of the differences.

We could change that. We could show all differences by default, we could show the differences in the structure first and then the detailed differences afterwards (I think that would be my vote), or we could even use Exception Groups and put the differences for each node in a separate Exception Group...

It would be useful if others could weigh in on how useful they would find this.

@mhuzaifa5

@TomNicholas I think showing the differences in the structure first, followed by the detailed differences, would be a more structured and friendly way of viewing the differences between the respective trees.

@mhuzaifa5

@TomNicholas Can I work on this with your guidance?

@danielfromearth
Author

> @TomNicholas I think the showing differences in the structure followed by detailed differences would be a more structured and friendly way of viewing the differences in the respective tree structures.

Following some conversation with @betolink, note that in ncompare, a side-by-side text report is generated that shows a group node (regardless of whether it is in both trees) followed by all differences in that group, then it proceeds to another group node, etc. That works well enough, but for xarray, generating an HTML-based comparison would be an interesting alternative, since it (1) could enable collapsing of groups and (2) could be more intuitive for xarray users if it matched xarray's current in-notebook representation of an individual dataset/datatree.

@TomNicholas
Member

TomNicholas commented Jan 8, 2025

> generating an HTML-based comparison

Interesting idea! The implementation of xarray's HTML repr is in the xarray.core.formatting_html module, but it has no diff'ing code, so there is currently no HTML equivalent of diff_dataset_repr to just call.

@flamingbear
Member

I have some un-formed thoughts. Mostly about how much to reveal in the details and is it possible to tune that to the user's desires. If the datatree is a giant, possibly lazy, zarr store you might not want to show all of the detailed differences. I have wanted to know each of the possible options: are the trees identical? are there missing groups from either (and what is missing)? what variables are different, and even what data in the variables are different. Not positive this is useful information though.

@TomNicholas
Member

TomNicholas commented Jan 8, 2025

Good points @flamingbear. That reminds me that we do already have an xarray.testing.assert_isomorphic function that checks the tree structures are identical without comparing any of the actual data in the groups.

@danielfromearth
Author

> are the trees identical? are there missing groups from either (and what is missing)? what variables are different, and even what data in the variables are different

I would add at least one other level of detail before the last (data value checking) one in your list: "what variable characteristics (e.g., scale factors, dimensions, shape, units) are different?"
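One way this intermediate level could be sketched is a metadata-only comparison that reads `dims`, `shape`, `dtype` and `attrs` but never the values, so lazily loaded data stays unloaded. The function name `variable_metadata_diff` and the row layout are hypothetical, and attributes are compared by `repr` as a simplification:

```python
import xarray as xr


def variable_metadata_diff(ds1: xr.Dataset, ds2: xr.Dataset) -> list:
    """Hypothetical helper: list (variable, field, left, right) metadata mismatches."""
    rows = []
    names = sorted(set(ds1.variables) | set(ds2.variables), key=str)
    for name in names:
        in_left, in_right = name in ds1.variables, name in ds2.variables
        if not (in_left and in_right):
            rows.append((str(name), "presence", in_left, in_right))
            continue
        v1, v2 = ds1.variables[name], ds2.variables[name]
        # Structural characteristics; none of these accesses load the array values.
        for field in ("dims", "shape", "dtype"):
            left, right = getattr(v1, field), getattr(v2, field)
            if left != right:
                rows.append((str(name), field, left, right))
        # Attributes (e.g. units), compared via repr to sidestep array-valued attrs.
        for key in sorted(set(v1.attrs) | set(v2.attrs), key=str):
            left, right = v1.attrs.get(key), v2.attrs.get(key)
            if repr(left) != repr(right):
                rows.append((str(name), f"attrs[{key!r}]", left, right))
    return rows


# Illustrative datasets differing in shape, dtype, an attribute, and one extra variable.
ds_left = xr.Dataset({"t": (("time", "x"), [[1, 2, 3], [4, 5, 6]], {"units": "K"})})
ds_right = xr.Dataset(
    {
        "t": (("time", "x"), [[1.0, 2.0], [4.0, 5.0]], {"units": "degC"}),
        "extra": ("x", [0.0, 0.0]),
    }
)
for row in variable_metadata_diff(ds_left, ds_right):
    print(row)
```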

@TomNicholas
Member

> I would add at least one other level of detail before the last (data value checking) one in your list: "what variable characteristics (e.g., scale factors, dimensions, shape, units) are different?"

I think diff_dataset_repr already does this, at least for variable names, dimensions and shapes.

> scale factors

These are not explicitly part of xarray's model, so aren't available to be compared in the same way. Instead the dataset will have used these values to decode upon opening. (The values are still kept in .encoding, but the assertions don't check that.)

> units

Unless you're using pint-xarray these also aren't an explicit part of the data model, instead just being one of many entries in .attrs. Those are compared by assert_identical though.
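A short sketch of how these two cases behave with the existing comparison methods (the `make` helper and its values are made up for illustration):

```python
import xarray as xr


def make(units: str, scale_factor: float) -> xr.Dataset:
    # Variable tuple form: (dims, data, attrs, encoding).
    return xr.Dataset(
        {"t": ("x", [280.0, 281.5], {"units": units}, {"scale_factor": scale_factor})}
    )


kelvin = make("K", 0.01)
relabelled = make("degC", 0.01)      # same values, different units attribute
other_encoding = make("K", 0.1)      # same values and attrs, different encoding

# equals ignores attrs, so a units mismatch goes unnoticed.
units_ignored_by_equals = kelvin.equals(relabelled)
# identical compares attrs, so the units mismatch is detected.
units_seen_by_identical = not kelvin.identical(relabelled)
# Neither method looks at .encoding, so a scale_factor mismatch is not detected.
encoding_ignored = kelvin.identical(other_encoding)

print(units_ignored_by_equals, units_seen_by_identical, encoding_ignored)
```

So a diff report built on `identical` would surface unit changes stored in `.attrs`, while anything held only in `.encoding` would need a separate check.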

Projects
None yet
Development

No branches or pull requests

4 participants