Large memory expansion when data are converted and organized into groups/EchoData object #489
Comments
@oftfrfbf: I looked into the 95 MB file you linked and the problem is indeed of the same form as I suspected before: the 3 frequency channels are configured with different pulse lengths and hence different sampling intervals. You can see that the backscatter data from the 3 channels have the same number of pings (9923) but different numbers of samples (1302, 2604, 10417). When converted to actual range in meters, all 3 channels are set up to record to 500 m. When the …
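To make the padding mechanism concrete, here is a minimal sketch in plain numpy/xarray (not echopype internals; only the ping and sample counts above come from the file, everything else is a placeholder) showing how combining the 3 channels along a new frequency dimension NaN-pads the shorter channels out to the longest one:

```python
import numpy as np
import xarray as xr

# Stand-ins for the 3 channels' backscatter: same number of pings,
# but different numbers of range samples, as reported for this file.
n_ping = 9923
n_samples = [1302, 2604, 10417]

channels = [
    xr.DataArray(
        np.zeros((n_ping, n), dtype="float32"),
        dims=("ping_time", "range_bin"),
        coords={"ping_time": np.arange(n_ping), "range_bin": np.arange(n)},
        name="backscatter_r",
    )
    for n in n_samples
]

# Combining along a new "frequency" dimension outer-joins range_bin, so every
# channel is padded with NaN out to the longest channel (10417 samples).
# (Shrink n_ping above if you want to run this without ~2 GB of RAM.)
combined = xr.concat(channels, dim="frequency")
print(combined.shape)         # (3, 9923, 10417)
print(combined.nbytes / 1e9)  # ~1.24 GB, vs ~0.57 GB before padding
```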
In the spirit of having everything in one place, below I printed out the same info for the 725 MB OOI file as the above, so that we are all on the same page about what the NaN-padding-induced problem is. Even though the backscatter data from all 3 channels have the same number of pings and samples along range, only the last channel is a split-beam, so there are no angle data from the first 2 channels. When assembling the …
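As a hedged illustration of this second situation (a hypothetical make_channel helper and scaled-down sizes; not echopype's actual assembly code), merging channels where only one carries the angle variables materializes NaN-filled angle arrays for the other two:

```python
import numpy as np
import xarray as xr

# Placeholder sizes; variable and dimension names follow the discussion above.
n_ping, n_range = 1000, 1000

def make_channel(freq_hz, split_beam):
    data_vars = {
        "backscatter_r": (("ping_time", "range_bin"),
                          np.zeros((n_ping, n_range), dtype="float32")),
    }
    if split_beam:
        # Only the split-beam channel records the two angle variables.
        data_vars["alongship_angle"] = (("ping_time", "range_bin"),
                                        np.zeros((n_ping, n_range), dtype="float32"))
        data_vars["athwartship_angle"] = (("ping_time", "range_bin"),
                                          np.zeros((n_ping, n_range), dtype="float32"))
    return xr.Dataset(data_vars).expand_dims({"frequency": [freq_hz]})

channels = [make_channel(38000.0, False),
            make_channel(120000.0, False),
            make_channel(200000.0, True)]

# Merging outer-joins the frequency coordinate, so the angle variables are
# materialized (as NaN) for the two single-beam channels as well.
combined = xr.merge(channels)
print(combined["alongship_angle"].shape)  # (3, 1000, 1000), 2/3 of it NaN
```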
I am looking at this problem again and am exploring converting the numpy arrays to dask arrays packaged within xarray datasets (specifically to minimize memory expansion when multiple channels are merged here, as noted in the previous examples). My hope is that the chunking that dask arrays allow for could keep the memory profile within reasonable limits, but I will report back with either a success or a failure. Also, I am not sure whether my thinking goes against the recent notes on eliminating xarray merges in favor of numpy?
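For reference, a minimal sketch of that idea (store path, chunk sizes and shapes are placeholders, not a proposed echopype API):

```python
import dask.array as da
import numpy as np
import xarray as xr

# A parsed channel's backscatter as a plain numpy array (stand-in values).
backscatter_np = np.zeros((9923, 10417), dtype="float32")

# Wrap it in a chunked dask array so later xarray operations (alignment,
# concatenation, writing) can be evaluated lazily, chunk by chunk.
backscatter_dask = da.from_array(backscatter_np, chunks=(1000, -1))

ds = xr.Dataset(
    {"backscatter_r": (("ping_time", "range_bin"), backscatter_dask)}
)

# Nothing is computed up front; to_zarr streams the chunks to disk.
ds.to_zarr("channel_38kHz.zarr", mode="w")
```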
Since dask arrays implement a good part of the numpy API, the recent work on converting xarray merges to numpy operations should be easily changed to work with dask instead.
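For illustration, a hedged sketch (hypothetical helper functions and scaled-down shapes, not code from the existing PRs) of how NaN-padding logic written against the numpy API translates almost one-to-one to dask:

```python
import dask.array as da
import numpy as np

# Padding/stacking logic written against the numpy API ...
def pad_and_stack_np(arrays, n_range):
    padded = [np.pad(a, ((0, 0), (0, n_range - a.shape[1])),
                     constant_values=np.nan) for a in arrays]
    return np.stack(padded)

# ... and its near drop-in dask counterpart, which only builds a task graph.
def pad_and_stack_da(arrays, n_range):
    padded = [da.pad(a, ((0, 0), (0, n_range - a.shape[1])),
                     constant_values=np.nan) for a in arrays]
    return da.stack(padded)

n_ping = 1000  # scaled down from 9923 to keep the example light
np_arrays = [np.zeros((n_ping, n), dtype="float32") for n in (1302, 2604, 10417)]
da_arrays = [da.from_array(a, chunks=(250, -1)) for a in np_arrays]

eager = pad_and_stack_np(np_arrays, 10417)  # fully materialized in memory
lazy = pad_and_stack_da(da_arrays, 10417)   # lazy; computed chunk by chunk
print(eager.shape, lazy.shape)              # (3, 1000, 10417) for both
```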
In PR #774 we extended the …

@oftfrfbf thank you very much for your input on this issue and for providing us with an excellent test file!
This issue is split from #407 since the symptom is similar but the underlying cause is different.
Tagging @oftfrfbf, @lsetiawan @emiliom so that we can continue the discussion here. 😃
This issue focuses on the problem that there is sometimes a very large expansion of memory use when echopype converts a file of moderate size.
(whereas #407 focuses on the problem where the data file itself is too large to fit into memory.)
Description
Based on @oftfrfbf and @lsetiawan's investigations (quoted below), there may very well be two different types of things happening (or more, depending on the exact form of the data):

- for this 725 MB OOI file: only the split-beam channel produces `alongship_angle` and `athwartship_angle` data, while the other 2 single-beam channels do not produce this data, so the missing angle data are NaN-padded when all channels are assembled along `frequency` (channel), `range_bin` and `ping_time`.
- for the 95 MB file shared by @oftfrfbf: the 3 channels have different sampling intervals and hence different numbers of samples along range, so the shorter channels are NaN-padded out to the longest `range_bin` when the channels are combined.
Below is @oftfrfbf's profiling result (quoted from here):
Potential solution?
For the case with the OOI file, I wonder if we could circumvent the problem by saving the larger data variables (`backscatter_r`, `alongship_angle`, `athwartship_angle`) one by one with `xr.Dataset.to_zarr` using `mode='a'`, so that the max memory usage would not surpass what is required for one of them (vs having all 3 in memory at the same time). The case with the 95 MB file is more complicated if the reason is as described above.
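A hedged sketch of what that could look like (placeholder shapes, store path and helper; this is not echopype's writer code):

```python
import numpy as np
import xarray as xr

dims = ("frequency", "ping_time", "range_bin")
shape = (3, 1000, 1000)  # placeholder sizes
store = "beam_group.zarr"

def one_variable(name):
    """Hypothetical helper: a Dataset holding a single large variable."""
    return xr.Dataset({name: (dims, np.zeros(shape, dtype="float32"))})

# Create the store with the first large variable ...
one_variable("backscatter_r").to_zarr(store, mode="w")

# ... then add the remaining large variables one at a time with mode="a",
# so at most one of them has to be fully resident in memory at a time.
for name in ("alongship_angle", "athwartship_angle"):
    one_variable(name).to_zarr(store, mode="a")
```

Writing each variable in its own `to_zarr(..., mode='a')` call means the peak memory is roughly the size of the largest single variable rather than the sum of all three.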
A couple of not particularly well-formed thoughts (a minimal sketch of the first follows after this list):

- use `.to_zarr` with a `region` specification while organizing and storing the Dataset, so that the in-memory presence is smaller.
- use `pydata/sparse` for the in-memory component. There's some discussion on dask support for sparse arrays here.

Thoughts???
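On the first thought, here is a hedged sketch of how a `region`-based write could keep only one channel in memory at a time (placeholder shapes, store path and chunking; not a worked-out echopype design):

```python
import dask.array as da
import numpy as np
import xarray as xr

store = "beam_group.zarr"
n_freq, n_ping, n_range = 3, 1000, 1000  # placeholder sizes
dims = ("frequency", "ping_time", "range_bin")

# 1) Write only the store layout (metadata) up front, using a lazy template.
template = xr.Dataset(
    {"backscatter_r": (dims, da.zeros((n_freq, n_ping, n_range),
                                      chunks=(1, n_ping, n_range),
                                      dtype="float32"))}
)
template.to_zarr(store, mode="w", compute=False)

# 2) Fill one channel's slot at a time with `region`, so only one channel's
#    worth of data needs to be in memory while organizing the Dataset.
for i in range(n_freq):
    one_channel = xr.Dataset(
        {"backscatter_r": (dims, np.zeros((1, n_ping, n_range), dtype="float32"))}
    )
    one_channel.to_zarr(store, mode="r+", region={"frequency": slice(i, i + 1)})
```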