Transforming dataset to a gridded dataset #167
Hi @miguelcarcamov. I'll be back in the office next week and will reply in more detail then. In the meantime, you may want to take a look at pfb-clean https://github.com/ratt-ru/pfb-clean/blob/master/pfb/operators/gridder.py. /cc @landmanbester
I guess the short answer here is no. All the gridding actually happens inside the wgridder in ducc0, which does not expose the grids. From the user's perspective you have images going in and visibilities coming out; all w-corrections, padding, oversampling etc. are taken care of under the hood. If I understand correctly, what you are suggesting might not actually give you a smaller data product in the end, because you will have to store a grid roughly the size of the image per baseline. Note that the gridder operator above is a bit outdated; I keep it there for legacy reasons. All gridding-related operations now happen here, but I am not sure that is useful to you because these workers really just do the data handling. There are dask wrappers for the wgridder in codex-africanus here.
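For reference, a minimal sketch of that round trip from the user's perspective, using the ducc0.wgridder functions mentioned above (parameter values here are illustrative, and the exact signature may differ between ducc0 releases):

```python
import numpy as np
import ducc0

rng = np.random.default_rng(42)
nrow = 1000
uvw = rng.uniform(-100.0, 100.0, (nrow, 3))   # metres
freq = np.array([1.4e9])                      # Hz, one channel
vis = np.ones((nrow, 1), dtype=np.complex64)

# Visibilities -> dirty image. Padding, oversampling and the
# w-correction all happen internally; no grid is ever exposed.
dirty = ducc0.wgridder.ms2dirty(
    uvw=uvw, freq=freq, ms=vis,
    npix_x=512, npix_y=512,
    pixsize_x=1e-5, pixsize_y=1e-5,           # radians
    epsilon=1e-4, do_wstacking=True, nthreads=4)

# Dirty image -> model visibilities (degridding).
model_vis = ducc0.wgridder.dirty2ms(
    uvw=uvw, freq=freq, dirty=dirty,
    pixsize_x=1e-5, pixsize_y=1e-5,
    epsilon=1e-4, do_wstacking=True, nthreads=4)
```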
@landmanbester I disagree with this. I mean, the size of the product will depend on the size of the grid and your pixel size (in Fourier space), but also on the size of your convolution kernels. The worst case would be a really big kernel, and then, yes, you can end up with a grid the size of the image per baseline. Although if we are using dask, there is not much to worry about there. To get to the point, I have two questions:
Cheers,
Much of the performance cost of handling each baseline's data separately comes from the fact that each baseline's data is not contiguous in a monotonically TIME-ordered Measurement Set, which is the ordering that newer interferometers will almost certainly choose. Instead, each baseline's data is striped over the entire Measurement Set, which leads to poor disk I/O performance. If I recall correctly, WSClean performs a reordering step where data is separated out into individual baselines in order to avoid these issues. This reordering step should be efficient because the data can be read and written contiguously.

The other alternative, as @landmanbester points out, is to maintain a separate grid per baseline for each dask chunk. Obviously, the dominant issues here are the size of the grid and the number of baselines. E.g. for single precision complex valued MeerKAT data (see the back-of-the-envelope sketch below):

But I'm interested in your intention behind gridding -- it sounds as if you're gridding data with the intention of degridding it again to produce data with a higher time resolution. Is this the case? Perhaps you may be intending to average complex visibility data instead? We have some experience with this too and have an application called xova which performs this process.
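Picking up the MeerKAT example above, a back-of-the-envelope calculation (the 8192-pixel grid side is my assumed example value, not a measured number; MeerKAT has 64 antennas, hence 2016 baselines):

```python
# Memory for one single-precision complex grid per baseline.
n_ant = 64                          # MeerKAT
n_bl = n_ant * (n_ant - 1) // 2     # 2016 baselines
npix = 8192                         # assumed grid side length
bytes_per_cell = 8                  # np.complex64

grid_bytes = npix * npix * bytes_per_cell   # 512 MiB per grid
total_bytes = grid_bytes * n_bl             # ~1008 GiB in total
print(f"{grid_bytes / 2**20:.0f} MiB per baseline grid, "
      f"{total_bytes / 2**30:.0f} GiB over {n_bl} baselines")
```

At that scale, per-baseline grids clearly dominate memory unless the grids are much smaller or stored sparsely.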
I haven't tried using joblib, but this should, in principle, be possible.
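For what it's worth, a minimal joblib sketch of parallel per-baseline gridding; grid_baseline here is a toy nearest-cell stand-in for whatever gridder you actually use:

```python
import numpy as np
from joblib import Parallel, delayed

def grid_baseline(uv, vis, npix=256, cell=1.0):
    """Toy per-baseline gridder: nearest-cell accumulation."""
    grid = np.zeros((npix, npix), dtype=np.complex64)
    iu = (np.round(uv[:, 0] / cell).astype(int) + npix // 2) % npix
    iv = (np.round(uv[:, 1] / cell).astype(int) + npix // 2) % npix
    np.add.at(grid, (iv, iu), vis)   # accumulate duplicate cells correctly
    return grid

rng = np.random.default_rng(0)
per_baseline = [(rng.uniform(-100, 100, (50, 2)),
                 rng.standard_normal(50).astype(np.complex64))
                for _ in range(6)]

# One task per baseline, dispatched to parallel workers.
grids = Parallel(n_jobs=2)(
    delayed(grid_baseline)(uv, vis) for uv, vis in per_baseline)
```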
I think the answer to this question will lie in your response to my response in (1) :-)
Hi @sjperkins, I have just come back from holidays.
Is it possible to do this reordering using dask-ms? Maybe I can test the reordering and check whether the per-baseline gridding goes faster.
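In case it helps, here is how I imagine the grouping side would look with dask-ms (a sketch assuming a standard Measurement Set; I haven't benchmarked this):

```python
from daskms import xds_from_ms

# One dataset per (field, spw, baseline), rows ordered by TIME.
datasets = xds_from_ms(
    "observation.ms",
    group_cols=["FIELD_ID", "DATA_DESC_ID", "ANTENNA1", "ANTENNA2"],
    index_cols=["TIME"],
    chunks={"row": 100_000})

for ds in datasets:
    a1, a2 = ds.attrs["ANTENNA1"], ds.attrs["ANTENNA2"]
    print(a1, a2, ds.DATA.data.shape)
```

Note that this groups the rows lazily rather than rewriting the data on disk, so the striped I/O pattern may remain; a true reorder would mean writing the grouped datasets back out (e.g. with xds_to_table).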
Correct, although the size would decrease a bit, since not all pixels in the grid will have non-zero values (some of them will be zero) and I am planning to store only the non-zero ones. So it also depends on the convolution kernel and your pixel size in uv-space. I am trying to replicate the original dataset structure, but with gridded data instead of ungridded data; obviously those zero-valued data points in the grid won't be part of the gridded dataset.
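To illustrate that storage scheme, a small sketch of keeping only the occupied cells of a grid and rebuilding the dense grid on demand (toy values, independent of any particular convolution kernel):

```python
import numpy as np

# Dense grid produced by some gridding step (toy example).
grid = np.zeros((512, 512), dtype=np.complex64)
grid[100, 200] = 1 + 2j
grid[300, 400] = 3 - 1j

# Store only the non-zero cells: indices plus values.
iv, iu = np.nonzero(grid)
values = grid[iv, iu]

# Reconstruct the dense grid when needed (e.g. before degridding).
dense = np.zeros_like(grid)
dense[iv, iu] = values
assert np.array_equal(dense, grid)
```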
Not exactly... I am planning to grid in order to decrease the size of the original data. As I said, the amount of this decrease depends on the convolution kernels and the pixel size in uv-space, but let's say that we end up with a smaller dataset. This dataset will then pass through a reconstruction algorithm, which will give us the image(s) and add a model column to the dataset. The dataset will still be gridded, but it will be possible to apply degridding in order to obtain data points of the data columns (data and model) at the original (non-gridded) uv points. I hope this explains better what I am trying to do. Cheers!
Description
Hello, is anyone working on transforming the dask-ms dataset to a gridded dataset? I am working on it using dask and the function bincount, which works well. However, I would like to have the same structure as the original MS or dask-ms dataset: say, fewer rows (due to the gridding) but still per baseline and per spw.
Looping over the MSs inside an ms_list is not a problem. The problem is looping over the baselines inside that ms_list: this takes a lot of computing time because, I guess, each time I call compute() the file is opened and read. I will share my code to see if you can give me a hand (an illustrative sketch of the approach follows below).
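Since the code itself didn't make it into this thread, here is my own minimal sketch of the bincount approach described above (nearest-cell gridding; all names and values are illustrative):

```python
import numpy as np

def bincount_grid(u, v, vis, npix, cell):
    """Nearest-cell gridding via np.bincount.

    np.bincount only takes real weights, so the real and imaginary
    parts are accumulated separately; counts records how many
    visibilities landed in each cell.
    """
    iu = np.round(u / cell).astype(np.int64) + npix // 2
    iv = np.round(v / cell).astype(np.int64) + npix // 2
    flat = iv * npix + iu   # flattened grid index per visibility

    re = np.bincount(flat, weights=vis.real, minlength=npix * npix)
    im = np.bincount(flat, weights=vis.imag, minlength=npix * npix)
    counts = np.bincount(flat, minlength=npix * npix)

    return (re + 1j * im).reshape(npix, npix), counts.reshape(npix, npix)

rng = np.random.default_rng(1)
u = rng.uniform(-100.0, 100.0, 1000)
v = rng.uniform(-100.0, 100.0, 1000)
vis = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)

grid, counts = bincount_grid(u, v, vis, npix=256, cell=1.0)
```

With dask, this would run per chunk, with the partial grids summed afterwards (e.g. a tree reduction over chunks).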