dask-ms does not scale with big datasets (spectral line datasets) #214
Comments
This is less of a problem with
There aren't any amazing solutions at present. My plan is to move away from the underlying casacore tables and instead store everything in zarr arrays. |
Thanks for the reply @JSKenyon. Yep, I was worried that it could be that. Well, let me know if there are any updates incorporating zarr arrays into dask-ms. Also, how can I use the convert function inside my code? Follow-up question to this: do you know if ngCASA also uses the python-casacore wrappers to read Measurement Sets? |
You would likely need to use some of the lower-level stuff - you can take a look at https://github.com/ratt-ru/dask-ms/blob/process-executors/daskms/apps/convert.py to see how this happens in
I am not sure - I should actually reach out to them. Last I checked, they were also going the route of converting casacore tables to xarray datasets backed by zarr. Unfortunately, most of the limitations are in the casa tables themselves and would require a major rewrite to change. It is technically possible to read a Measurement Set from multiple processes (rather than threads), but it is not possible to write in parallel. A single write will also lock all reads. The best option will likely be to read the MS in parallel from multiple processes and then write to a format which supports parallel writes (e.g. zarr, if you handle chunks correctly). |
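For illustration, a minimal sketch of that read-MS-write-zarr pattern using dask-ms's experimental zarr writer (`daskms.experimental.zarr`, available in recent dask-ms releases); the MS path and chunk size are placeholder assumptions, not values from this thread:

```python
import dask
from daskms import xds_from_ms
from daskms.experimental.zarr import xds_to_zarr

# Read the MS as a list of xarray Datasets, chunked along the row axis
# ("observation.ms" and the chunk size are placeholders).
datasets = xds_from_ms("observation.ms", chunks={"row": 100_000})

# Build lazy writes into a zarr store; zarr supports parallel writes
# provided the chunks being written do not overlap.
writes = xds_to_zarr(datasets, "observation.zarr")

# Execute the reads and writes as a single dask graph.
dask.compute(writes)
```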
Follow-up question - what if we write the data using a TaQL query (as a short-term patch)? I remember that the casacore developers helped me a lot with writing a fast write-to-MS code. Check the writeMS function. Also, how does CASA do it then? Because the reads and writes in CASA are really, really fast! Cheers, |
@miguelcarcamov the C++ table system underneath opens a set of tables only once, so even if you use TaQL or other table objects they point to the same table, with the same table lock under the hood. If you use TaQL to select a smaller set to read and/or write it will of course make things faster - it looks like you are doing that in the C++ snippet.

I've experimented with adding multi-threaded reading support to the casacore table system (see casacore/casacore#1167). However, this is still a work in progress and quite a large undertaking, since the table system must be made thread-safe. I have shown that you can get the same scaling as you would from multiple processes. As @JSKenyon mentioned, currently all calls to the table system via python-casacore are GIL-locked.

If you want really experimental Python support you can check out casacore/python-casacore#209. If you follow the discussion you will see that there are caveats. Because the table system is not thread-safe you should not pass tables between threads (edit: or open a table proxy with tables initially opened in other threads). The next task I will do on this thread is to make the table system fully thread-safe so that this can be done safely. We don't yet fully support this in dask-ms (and won't until a full implementation is completed for casacore and python-casacore). |
An alternative I have for you is to do what the WSRT archive did and split the database out by scan - at least on the data I have. (edit: If you want to do this in a single script you will still need to use multiple processes, due to the aforementioned GIL locking of python-casacore operations.) This way you get a table lock per table object and you can possibly chunk things that way. It is however very cumbersome, and not worth it if you intend this as a once-off operation? |
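A rough sketch of the multi-process side of that idea with python-casacore: each worker process opens its own table handle and selects a single scan, so reads proceed in parallel without the GIL. The MS path and the per-scan work are placeholder assumptions:

```python
from concurrent.futures import ProcessPoolExecutor

from casacore.tables import table


def process_scan(args):
    """Open the MS in this process and work on a single scan's rows."""
    ms_path, scan = args
    # Each process gets its own table object and hence its own handle.
    with table(ms_path, ack=False) as tab:
        sub = tab.query(f"SCAN_NUMBER == {scan}")
        data = sub.getcol("DATA")  # read only this scan's rows
        sub.close()
    return scan, data.shape


if __name__ == "__main__":
    ms_path = "observation.ms"  # placeholder path

    with table(ms_path, ack=False) as tab:
        scans = sorted(set(tab.getcol("SCAN_NUMBER")))

    # Reads from separate processes sidestep the GIL; writes would
    # still contend for the single casacore table lock.
    with ProcessPoolExecutor() as pool:
        for scan, shape in pool.map(process_scan, [(ms_path, s) for s in scans]):
            print(f"scan {scan}: DATA shape {shape}")
```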
Yep, in fact that's what I'm doing when reading and writing an MS. That was advice the casacore developers gave me. |
Ok, sorry about that - unfortunately it is a fundamental limitation of the data format itself. Currently the only way you would get parallelism is to have hefty operations per chunk that significantly outweigh the reading/writing. Hopefully, with some support coming from the future MeerKAT / SKA data processor, there will be some traction to make the data format more fine-grained parallel. |
In fact, I realized that if you simulate a ~12h observation in one frequency (1 channel) - which returns a dataset with ~3M rows - dask-ms reads and writes very slowly as well. |
Could you be a bit more specific? Is it slow relative to reads using just python-casacore with the same TaQL selections and read/write operations? I find that if you tune the chunk sizes appropriately, the performance in terms of flagging scales quite well - certainly better than what we get with aoflagger.

However, your read and write operations have to be large enough here - thousands of channels and tens of thousands of rows per operation (and subsequent processing step) - so that you are not bound by the internal casacore locking and GIL locking, as discussed. 3M rows with a single channel sounds really small to me. |
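As a concrete illustration of those magnitudes, a hypothetical chunking along both the row and channel axes; the path, column list, and sizes are placeholders to adapt to your data:

```python
from daskms import xds_from_ms

# Large chunks amortise casacore's per-call locking over real work:
# tens of thousands of rows and thousands of channels per read.
datasets = xds_from_ms(
    "observation.ms",                      # placeholder path
    columns=["DATA", "FLAG"],
    chunks={"row": 50_000, "chan": 4096},  # placeholder sizes
)
```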
See the discussion here: https://arxiv.org/abs/2206.09179?context=astro-ph

I've uploaded a preprint while I wait for ASP to wrap up the proceedings. By large dataset I'm referring to the 100s-of-GiB range, in order to properly test strong scaling. It may also be useful for you to profile your code using cProfile and inspect where the time is spent - if it is in casacore I don't expect it to scale; as mentioned, the library is inherently serialised. |
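For example, a generic profiling recipe with the standard-library cProfile (an equivalent is `python -m cProfile -o daskms_profile.stats script.py`); `main()` is a placeholder for your script's entry point:

```python
import cProfile
import pstats

# Run the pipeline under the profiler and store the raw stats on disk.
cProfile.run("main()", "daskms_profile.stats")

# Sort by cumulative time; if casacore table calls dominate,
# the workload will not scale, as discussed above.
stats = pstats.Stats("daskms_profile.stats")
stats.sort_stats("cumulative").print_stats(20)
```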
Hi @bennahugo, sorry for not being clear on this. Check this dataset. I have read it using

```python
from daskms import xds_from_ms

ms = xds_from_ms(
    "9.8-9.5.ms",
    taql_where="!FLAG_ROW",
    index_cols=["SCAN_NUMBER", "TIME", "ANTENNA1", "ANTENNA2"],
)
ms[0].MODEL_DATA.data.compute()
```

This takes ~3m 44s, which feels too long to me considering it's just one channel and two correlations. I agree that one might need to change the chunks, though. Is there any way to change the chunks from the dask-ms API? |
@miguelcarcamov I suspect the
dask-ms used to warn about excessively fragmented row ordering, and we're thinking about reintroducing these warnings so that these kinds of performance problems are more obvious to the user:
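Until such warnings exist, one way to spot fragmentation yourself is to inspect the dask chunking on the returned datasets; a small diagnostic sketch where the MS path and the DATA column are placeholders:

```python
from daskms import xds_from_ms

datasets = xds_from_ms("observation.ms")  # placeholder path

for i, ds in enumerate(datasets):
    data = ds.DATA.data  # the underlying dask array
    # Many tiny row chunks mean many tiny, lock-bound casacore reads.
    print(f"dataset {i}: chunks={data.chunks} ({data.npartitions} pieces)")
```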
Hi @sjperkins, I have updated to dask-ms==0.2.9 and that was enough to get a very decent speed-up - it went from ~3 minutes to ~2 seconds. I'm going to keep testing using |
Great to hear.
Yes, this would only be valid for an MS with 1 spectral window, but you could have any number of channels in that SPW.
At minimum, I think it's necessary to have
This has some cost to it, probably because it needs to load each grouping column from disk. |
I'm going to close this issue, but feel free to reopen if you think I've done this in error. |
I am glad you seem to have come right @miguelcarcamov. There are actually a host of factors at play here.
Absolutely - it takes a `chunks` argument:

```python
from daskms import xds_from_ms

ms = xds_from_ms(
    "9.8-9.5.ms",
    taql_where="!FLAG_ROW",
    index_cols=["SCAN_NUMBER", "TIME", "ANTENNA1", "ANTENNA2"],
    group_cols=["DATA_DESC_ID"],
    chunks={"row": 100000},
)
ms[0].MODEL_DATA.data.compute()
```

Edit: Do note that using |
I've created an issue tracking documentation of performance tuning concerns: |
Two follow-up questions.
|
If you use
This is supported by building CASA Table and Data Manager descriptors from xarray Datasets. There's an undocumented API to do this here: https://github.com/ratt-ru/dask-ms/tree/master/daskms/descriptors
Yes, see the functionality in
But the idea is for xarray Datasets to represent data independently of the on-disk format, with conversion between formats occurring through these Datasets. So there's
|
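To make that concrete, a hedged sketch of a format round trip mediated by Datasets, using dask-ms's experimental zarr reader/writer; the file names are placeholders, and the creation-via-descriptors behaviour is as described above:

```python
import dask
from daskms import xds_from_ms, xds_to_table
from daskms.experimental.zarr import xds_from_zarr, xds_to_zarr

# MS -> Datasets -> zarr
datasets = xds_from_ms("input.ms", chunks={"row": 100_000})
dask.compute(xds_to_zarr(datasets, "data.zarr"))

# zarr -> Datasets -> CASA table; writing to a table that does not
# yet exist creates it from descriptors built off the Datasets.
zarr_datasets = xds_from_zarr("data.zarr")
dask.compute(xds_to_table(zarr_datasets, "output.ms", columns="ALL"))
```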
I'd recommend looking through the test cases to get an understanding of how the above functionality works |
Description
I'm trying to apply a phase correction to my visibilities, so I loop over the MS list and apply a phase shift to each dataset. Then I write the visibilities using xds_to_table and then I do dask.write. This should be really straightforward; however, it takes more than 2 days to run.
What I Did
I'm using pyralysis IO which uses dask-ms.
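For reference, a minimal sketch of the phase-shift loop described above in plain dask-ms (not pyralysis); the MS path, the constant phase factor, and the choice of the DATA column are placeholder assumptions:

```python
import dask
import numpy as np
from daskms import xds_from_ms, xds_to_table

# Open with large row chunks so each write is one big casacore call.
datasets = xds_from_ms("observation.ms", chunks={"row": 100_000})

writes = []
for ds in datasets:
    # Placeholder phase factor; a real correction would be derived
    # from the desired phase-centre shift and the UVW coordinates.
    phase = np.exp(2j * np.pi * 0.1)
    shifted = ds.DATA.data * phase
    ds = ds.assign(DATA=(ds.DATA.dims, shifted))
    writes.append(xds_to_table(ds, "observation.ms", ["DATA"]))

# A single compute executes all reads, shifts and writes together.
dask.compute(writes)
```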
Cheers,