-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Develop new approach for handling remote I/O #15919
Comments
Thank you for raising @vyasr ! I have spent some time exploring the importance of cudf's If we were to change the python code to stop relying on Near-term Solution: In order to avoid excessive host-memory usage in the near term, we could probably introduce some kind of "sparse" byte-range data-source to libcudf. It is fairly easy to populate a mapping of known byte ranges efficiently with fsspec. If these known byte ranges could be used to populate a structure that is understood as a file-like object by libcudf, then we can avoid the host-memory issue. (Possible) Long-term Solution: We roll our own filesystem API at the cpp level and avoid all python-related performance concerns :) |
@martindurant - As I mentioned in #16004, we'd love to hear your thoughts. A bit more background: Cudf currently relies on python/fsspec to transfer the necessary data into host memory. For the parquet and csv readers, we support partial IO by wrapping the fsspec file in an arrow PythonFile. For all other IO, we always transfer the entire file into local memory before handing over the bytes to libcudf (c++). Although my original plan was to expand/improve Arrow NativeFile support throughout cudf, it now seems clear that we need to remove the problematic libarrow dependency instead. This means that we are in need of something new at the C++ level. Some questions for you:
|
Sounds like we should talk - there's a lot going on here. I assume that the best case would be if you can move bytes directly from the network interface to GPU without bothering with python bytes (as a reminder, I and others have rust implementations of some of this) but python can continue to be the control layer. Another matter, is that HTTP streams are commonly compressed by the server, which is basically pointless for binary fine formals which have internal compression, and take serious CPU usage before you even get the range of bytes you asked for. Specific answers to your questions:
cat_ranges, yes. The set of calls is sent concurrently, so latency is not a problem, but the event loop is running in a single python thread, so no true parallelism. That's not a porblem for almost all use cases, where the stream handling (including any decompression) is fast compared to the network bandwidth. I assume in your case, you will have super-high bandwidth cases too. When called from multiple dask threads in a single process, all the fsspec IO still happens in one dedicated thread.
I mentioned the rust prototype (rfsspec) for this reason. It doesn't have a file-like interface, but the FS implementations are designed to be like fsspec, so it would not be hard. Unlike asyncio, tokio can grab the even loop in many threads and efficiently spawn workers for compute (the prototype does not do this). The question is, do you really want a python file-like object, which is blocking and has internal state? Maybe not. The arrow file objects are like that too, I think. To me, it would make sense to have multiple cursors into memory buffers, especially if you know which buffers you will be needing (like fsspec.parquet). OTOH, directly calling python with a C++ wrapper (which you describe) is probably very easy to implement, but see the next answer.
this is hard to say. Getting bandwidth limited transfers with fsspec and pure python is certainly possible now, but calling into the python interpreter on every |
Thanks for the quick response @martindurant !
Yes we absolutely want to do this in the long run. My preliminary plan/suggestion was to improve remote-IO performance in two general steps:
Okay, thanks for clarifying - This event loop limitation makes sense. Just to clarify, if we were only interested in transferring raw bytes as quickly as possible (no "compute" at all), would there be any benefit to initiating an s3 transfer in parallel? Or does the concurrent
It's really great that you have already been looking into this - I am very interested in exploring rfsspec to see how the general
I completely agree with you that a python-like file is not the ideal solution here. I am only interested in a |
Absolutely, and this would be a very simple construct. I could write it in rust in no time, but not C++ :) For python, of course we already have this.
My code of course only uses pretty high-level things, so no direct talking to devices. I have no idea what it would take, but I'm sure you know people who do. I have considered whether it might be useful to have a full rust-based general fsspec-like, with python bindings, but not exclusively. I don't have the spare effort to do that really. Of course, python encourages more backends, since they are much easier to write, and tricks like what kerchunk achieves - I would not want to exclude the potential to do that.
I did some light benchmarking of zarr workflows at the time (no ranges, whole files only), and found that there was no drastic difference, at most 20% improvement occasionally, sometimes none at all. Probably, asyncio is not the bottleneck, the overhead is pretty small. Having said that, I never finished implementing https://github.com/martindurant/rfsspec/blob/main/src/io.rs , which would bring a file-like experience without copying or making |
I don't know if, in theory, it's possible to have multiple threads talking to the network interface at once. Certainly, parallel copies for host to device memory are possible. None of this is touched by the high-level python code, of course. I think in python's case, multiple event loops in multiple threads would just end up costing overhead (because of the GIL, or maybe even without it). So yes, I think cat_ranges can be near optimal. Other network parameters probably come way sooner, like fsspec/s3fs#873 . |
FYI: |
Is your feature request related to a problem? Please describe.
Currently cudf supports reading files from remote sources by reusing the arrow NativeFile interface. Such files can be passed down from Python into libcudf and configured to only read the selected subset of data from the remote resource. This can be vital for the performance of some workflows. However, as part of #15193 we will be removing libarrow as a dependency of cudf and libcudf. This means that we will no longer be able to rely on the NativeFile interface. This is a breaking change for the cudf and libcudf APIs, as well as being a performance hit for some workflows.
Describe the solution you'd like
We need to evaluate alternatives that will allow us to maintain or improve upon performance for remote I/O while not depending on libarrow. The arrow removal has numerous ancillary benefits and will be moving forward, so we need to find a way to mitigate that. Ideally we would also want to get a sense of how much the NativeFile-based interfaces are currently used.
The text was updated successfully, but these errors were encountered: