Large memory expansion when data are converted and organized into groups/EchoData object #489

Closed
leewujung opened this issue Nov 12, 2021 · 7 comments

@leewujung (Member) commented Nov 12, 2021

This issue is split from #407 since the symptom is similar but the underlying cause is different.

Tagging @oftfrfbf, @lsetiawan, and @emiliom so that we can continue the discussion here. 😃

This issue focuses on the problem that memory use sometimes expands dramatically when echopype converts a file of moderate size (whereas #407 focuses on data files that are themselves too large to fit into memory).

Description

Based on @oftfrfbf's and @lsetiawan's investigations (quoted below), there may well be two different things happening (or more, depending on the exact form of the data):

  • for this 725 MB OOI file:

    • 1 of the 3 channels is split-beam and produces alongship_angle and athwartship_angle data, but the other 2 single-beam channels do not produce this data.
    • The data are first parsed into lists and then organized/merged into xarray DataArray/Dataset objects; @lsetiawan pointed out that the memory expansion happens during this second stage.
    • I suspect this is due to padding the (correctly) missing angle data slices with NaN in order to assemble a data cube with dimensions frequency (channel), range_bin, and ping_time.
  • for the 95 MB file shared by @oftfrfbf:
    Below is @oftfrfbf's profiling result (quoted from here):

    [Figure: memory profile of parser.parse_raw() vs. creation of the EchoData object]

    Profiling shows that parser.parse_raw() (shown in red) only uses about 3 GB of memory, while lines 424 through 445, which account for the creation of the EchoData model (shown in blue), bring the total up to about 10 GB.

    • I wonder whether the recorded range within this file changed dramatically from ping to ping? I haven't had a chance to look into the file but will try to do so soon and report back.
    • The reason I suspect this is again the NaN-padding approach used to bring pings of different lengths into a 3D data cube: if some pings are much "longer" than the others, all the other pings would be padded with NaN to that longer length, exploding the memory (a minimal sketch of this padding effect follows this list).
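
To make the padding effect concrete, here is a minimal sketch (illustrative shapes, not echopype code) of how combining per-channel DataArrays with different range lengths into one cube implicitly pads with NaN and inflates memory:

```python
# Minimal sketch (not echopype code): three channels with different numbers of
# samples per ping are concatenated into one cube; the outer join on range_bin
# pads the shorter channels with NaN.
import numpy as np
import xarray as xr

def channel_da(chan, n_pings, n_samples):
    return xr.DataArray(
        np.zeros((n_pings, n_samples), dtype="float32"),
        dims=("ping_time", "range_bin"),
        coords={"ping_time": np.arange(n_pings), "range_bin": np.arange(n_samples)},
        name="backscatter_r",
    ).expand_dims(channel=[chan])

das = [channel_da(0, 1000, 1000), channel_da(1, 1000, 2000), channel_da(2, 1000, 8000)]
cube = xr.concat(das, dim="channel")          # shorter channels padded with NaN

print(cube.shape)                              # (3, 1000, 8000)
print(sum(d.data.nbytes for d in das) / 1e6)   # ~44 MB of real samples
print(cube.data.nbytes / 1e6)                  # ~96 MB after padding
```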

Potential solution?

For the case with the OOI file, I wonder if we could circumvent the problem by saving the larger data variables (backscatter_r, alongship_angle, athwartship_angle) one by one via xr.Dataset.to_zarr using mode='a', so that the max memory usage would not surpass what is required for one of them (vs having all 3 in memory at the same time).
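
A rough sketch of that idea (names and shapes are placeholders, not echopype's actual conversion code): build and write one large variable at a time, appending each to the same zarr store with mode='a'.

```python
# Sketch only: write the large beam-group variables one at a time so that at most
# one of them is held in memory. build_variable() is a hypothetical stand-in for
# the code that assembles a single variable from the parsed data.
import numpy as np
import xarray as xr

large_vars = ["backscatter_r", "alongship_angle", "athwartship_angle"]

def build_variable(name: str) -> xr.Dataset:
    data = np.zeros((3, 100, 500), dtype="float32")  # dummy data for illustration
    return xr.Dataset({name: (("frequency", "ping_time", "range_bin"), data)})

store = "Beam_group.zarr"  # illustrative path
for i, name in enumerate(large_vars):
    ds_one = build_variable(name)                       # only this variable in memory
    ds_one.to_zarr(store, mode="w" if i == 0 else "a")  # "a" adds new variables to the store
    del ds_one                                          # release before building the next one
```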

The case with the 95 MB file is more complicated if the reason is as described above.
A couple of not particularly well-formed thoughts:

  • use .to_zarr with the region specification while organizing and storing the Dataset, so that the in-memory footprint is smaller (a sketch follows this list).
  • somehow use sparse arrays, like what's discussed here and in pydata/sparse, for the in-memory component. There's some discussion of dask support for sparse arrays here.
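
A sketch of the first idea (region writes; chunk sizes and paths are illustrative, not echopype code): initialize the zarr store lazily with the full cube shape, then write one channel's block at a time so only that block is ever in memory.

```python
# Sketch only: pre-create a zarr store with the full cube shape (metadata only),
# then fill it region by region so the whole cube is never in memory at once.
import numpy as np
import dask.array as da
import xarray as xr

n_chan, n_pings, max_len = 3, 9923, 10417
template = xr.Dataset(
    {"backscatter_r": (
        ("frequency", "ping_time", "range_bin"),
        da.full((n_chan, n_pings, max_len), np.nan, dtype="float32",
                chunks=(1, 1000, max_len)),
    )}
)
store = "Beam_group.zarr"  # illustrative path
template.to_zarr(store, mode="w", compute=False)  # writes metadata, no data

for chan in range(n_chan):
    block = xr.Dataset(
        {"backscatter_r": (
            ("frequency", "ping_time", "range_bin"),
            np.zeros((1, n_pings, max_len), dtype="float32"),  # one channel's data
        )}
    )
    block.to_zarr(store, region={"frequency": slice(chan, chan + 1)})
```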

Thoughts???

@leewujung (Member, Author) commented:

@emiliom : this and this in the xarray sparse support thread are interestingly related to our discussion on ragged array representation a couple weeks back. :)

@emiliom emiliom changed the title Large memory expansion when data are organized into groups/EchoData object Large memory expansion when data are converted and organized into groups/EchoData object Nov 13, 2021
@leewujung (Member, Author) commented:

@oftfrfbf : I looked into the 95 MB file you linked and the problem is indeed of the same form as I suspected before:

The 3 frequency channels are configured with different pulse lengths and hence different sampling intervals.

Working from the parser_obj:
[Screenshot: parser_obj showing the shapes of the per-channel backscatter and angle arrays]

You can see that the backscatter data from the 3 channels have the same number of pings (9923) but different numbers of samples along range (1302, 2604, 10417). When converted to actual range in meters, all 3 channels are set up to record to 500 m.

When the EchoData object is assembled, these data are merged into a 3D data cube with the shorter pings padded with NaN. The data size for the 1st and 2nd channels therefore grows by roughly 8x (10417/1302) and 4x (10417/2604), respectively. The same expansion happens for the angle data, since all 3 channels are split-beam.
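
A quick back-of-envelope estimate (assuming 4-byte samples; the actual dtype may differ) of the expansion for these shapes:

```python
# Rough memory estimate for the reported shapes, before vs. after NaN-padding
# into a dense (frequency, ping_time, range_bin) cube. Assumes 4-byte samples.
n_pings = 9923
samples = [1302, 2604, 10417]
bytes_per_sample = 4

parsed = sum(n_pings * n * bytes_per_sample for n in samples)
padded = len(samples) * n_pings * max(samples) * bytes_per_sample
print(f"{parsed / 1e9:.2f} GB parsed")       # ~0.57 GB
print(f"{padded / 1e9:.2f} GB padded cube")  # ~1.24 GB, with a similar expansion for each angle variable
```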

@leewujung (Member, Author) commented:

In the spirit of having everything in one place, below I printed out the same info for the 725 MB OOI file as above, so that we are all on the same page about what the NaN-padding-induced problem is.

[Screenshot: parser_obj showing the shapes of the per-channel backscatter and angle arrays for the 725 MB OOI file]

Even though the backscatter data from all 3 channels have the same number of pings and samples along range, only the last channel is split-beam, so there are no angle data from the first 2 channels.

When assembling the EchoData object, 2 all-NaN slices are created to match the size of the last channel for each of the angles (alongship and athwartship), hence the memory expansion (a minimal sketch below illustrates this).
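
A minimal sketch (illustrative shapes, not echopype code) of how this plays out when the angle data exist for only one of the three channels:

```python
# Sketch only: merging a single-channel angle variable with a 3-channel cube
# creates two all-NaN slices for the channels that never recorded angle data.
import numpy as np
import xarray as xr

n_pings, n_samples = 2000, 4000
backscatter = xr.DataArray(
    np.zeros((3, n_pings, n_samples), dtype="float32"),
    dims=("frequency", "ping_time", "range_bin"),
    coords={"frequency": [38_000.0, 120_000.0, 200_000.0]},
    name="backscatter_r",
)
angle = xr.DataArray(
    np.zeros((1, n_pings, n_samples), dtype="float32"),
    dims=("frequency", "ping_time", "range_bin"),
    coords={"frequency": [200_000.0]},   # only the split-beam channel has angles
    name="alongship_angle",
)

ds = xr.merge([backscatter, angle])             # outer join pads angle with NaN
print(angle.data.nbytes / 1e6)                  # ~32 MB of real angle data
print(ds["alongship_angle"].data.nbytes / 1e6)  # ~96 MB once the 2 NaN slices are added
```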

@oftfrfbf (Contributor) commented:

I am looking at this problem again and am exploring converting NumPy arrays to Dask arrays packaged within xarray datasets (specifically to minimize memory expansion when multiple channels are merged here, as noted in the previous examples; a rough sketch of the idea follows). My hope is that the chunking that Dask arrays allow for could keep the memory profile within reasonable limits, but I will report back with either a success or a failure. Also, I am not sure if my thinking goes against recent notes on eliminating xarray merges in favor of numpy?
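
A rough sketch of the idea (chunk sizes and variable names are illustrative, not echopype code): wrap each parsed channel in a dask array and pad it lazily, so the full NaN-padded cube is never materialized at once and is written to zarr chunk by chunk.

```python
# Sketch only: lazy NaN-padding with dask so the dense cube is never fully in memory.
import numpy as np
import dask.array as da
import xarray as xr

n_pings = 9923
samples = [1302, 2604, 10417]
max_len = max(samples)

channels = []
for n in samples:
    parsed = np.zeros((n_pings, n), dtype="float32")    # stand-in for parser output
    lazy = da.from_array(parsed, chunks=(1000, n))      # chunked view, no copy
    lazy = da.pad(lazy, ((0, 0), (0, max_len - n)),
                  constant_values=np.nan)               # lazy NaN padding
    channels.append(lazy)

cube = da.stack(channels).rechunk((1, 1000, max_len))   # (channel, ping_time, range_bin)
ds = xr.Dataset({"backscatter_r": (("channel", "ping_time", "range_bin"), cube)})
ds.to_zarr("beam_lazy.zarr", mode="w")                  # computed and written chunk by chunk
```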

@imranmaj (Contributor) commented:

Also, I am not sure if my thinking goes against recent notes on eliminating xarray merges in favor of numpy?

Since dask arrays implement a good part of the numpy API, the recent work on converting xarray merges to numpy operations should be easily changed to work with dask instead.
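
For example (illustrative only, not the actual echopype code path), the numpy call and its dask counterpart share the same signature, so the swap is mostly mechanical:

```python
import numpy as np
import dask.array as da

a = np.zeros((100, 50), dtype="float32")
b = np.zeros((100, 80), dtype="float32")

merged_np = np.concatenate([a, b], axis=1)                                # eager, in memory
merged_da = da.concatenate([da.from_array(a), da.from_array(b)], axis=1)  # lazy, chunked
assert merged_np.shape == merged_da.shape == (100, 130)
```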

@b-reyes (Contributor) commented Aug 11, 2022

In PR #774 we extended the open_raw function with a kwarg that allows users to write variables with a large memory footprint directly to a temporary zarr store for EK60/EK80 echosounders. In that PR we also successfully opened and wrote to zarr both of the files presented in this issue. Currently this functionality is in beta: it still requires unit testing, and we would like to add an option that automatically determines whether one should write to a temporary zarr store (see #782). However, PR #774 addresses the core problem highlighted in this issue. I believe we can close this issue.
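
For anyone landing here later, the workflow looks roughly like the sketch below; the keyword name is an assumption, so please check PR #774 and the docs for the exact argument:

```python
# Rough sketch of the new workflow; the kwarg name below is illustrative
# (see PR #774 for the actual argument and its defaults).
import echopype as ep

ed = ep.open_raw(
    "path/to/large_file.raw",
    sonar_model="EK60",
    offload_to_zarr=True,   # hypothetical name: write large variables to a temporary zarr store
)
ed.to_zarr("converted_file.zarr")  # then serialize the EchoData object as usual
```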

@oftfrfbf thank you very much for your input on this issue and for providing us with an excellent test file!

@leewujung (Member, Author) commented:

@b-reyes: I agree that we can close this issue now! I believe @imranmaj's changes addressed the AD2CP side of things, and irregular data like this are much less likely to occur for AZFP.

Repository owner moved this from In Progress to Done in Echopype Aug 18, 2022