-
Notifications
You must be signed in to change notification settings - Fork 921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[QST] cudf and kvikio? (this is actually a question about implementing RDMA in another library) #16004
Comments
cuDF does use KvikIO's C++ library libkvikio for GPUDirect Storage, in the libcudf C++ code that underpins cuDF's Python interface. Here are some pointers to the build system and docs referencing KvikIO functionality in cuDF:
If you have any questions about KvikIO or how cuDF uses it, please feel free to ask! |
Ahhh ok I missed that and only saw the stuff in the parquet library that has cuFile and nvcomp calls in C++. We'd like to start over in uproot by just working with local file access via DMA with kvikio to the GPU. Is there any conceptual issue we should be aware of there? Thanks! Also @nsmith- |
@madsbk might be better suited to answer that question, but I think you should be fine to use the Python kvikio API or the C++ libkvikio API. Also I don't know this source code very well but there is an example from the cuCIM library that shows how to do GPUDirect Storage reads of uncompressed TIFF images using the kvikio Python API. See: https://github.com/rapidsai/cucim/blob/branch-24.08/examples/python/gds_whole_slide/ |
Since uproot is pure Python, I guess you want to use KvikIO's Python API, which should look a lot like Python's regular file API: https://docs.rapids.ai/api/kvikio/nightly/quickstart/ Using |
OK - thanks @madsbk @bdice. I had read the quickstart, but became curious about performance penalties stemming from any conceptual issues with using the python bindings when I saw direct usage of cuFile and nvcomp in the C++/CUDA of cudf. This is why I am asking. Is there anything we need to do in order to use fsspec effectively with this software stack? We do have a custom storage access protocol that we've glued into fsspec. I imagine local file access is fine, but is there any special care to be taken for remote s3, etc.? |
Using the Python API in a Python application shouldn't introduce any extra overhead compared to writing your own bindings. cudf uses the C++ API because cudf is mostly written in C++.
No, KvikIO only support local file access at the moment :/ |
Thanks for the clarification, we'll go ahead and give it a shot in uproot. Should be an interesting ride. Are there any particular bits of care we need to give with respect to thread divergence and decompression? I imagine keeping chunks of data mostly the same size ~helps? |
And, indeed, the remote file access will be super important for our use case. I'll talk to the nVidia contacts we have with Fermilab about getting some attention in this direction. |
Obviously, I have my own opinions on how remote file access should work, at least from the user API perspective, so if any group at rapids/nvidia/etc are working on this, I'd be happy to be part of the conversation. |
cc: @rjzamora on the topic of remote filesystem access. |
We are in the process of prioritizing remote-file performance in cudf, and it may make the most sense to leverage kvikio as a space to make the necessary improvements. @martindurant - I'd love to get your thoughts in #15919 |
If you are trying to do remote RDMA (GPU RDMA) you could use https://github.com/rapidsai/ucxx but you will need accelerated network devices as well. This is more common at HPC Labs so I would also ask about Infiniband or Slingshot. cc @pentschev It looks like Wilson is being decommissioned soon but was architected with Infiniband NICs |
Yes - we have an internal cluster that we're putting together testing this stuff on. We'll have to get accelerated NICs for it, which will come in time. We unfortunately don't have infiniband or slingshot on that cluster. Wilson's GPUs are quite old in any case, K40s, iirc. The main thing I'd like to achieve with this is loading files from a remote storage system into GPU memory. While pretty obvious if it's mounted, with s3 or similar it wasn't clear the proper way forward. Will continue with kvikio and mounted filesystems in the mean time. |
Is there any reason why I would start getting memory errors on a 40GB slice of an a100 when trying to allocate more than 10 GiB or so? |
Are you using RMM (if you are allocating through cudf, you are), you can try the new memory profiler in the nightly package: https://docs.rapids.ai/api/rmm/nightly/guide/#memory-profiler . It might show where and the size of the peak memory allocation. |
@madsbk Great, thank you. I am indeed still using cudf so I'll try this out. awkward-array uses cupy as a backend, so in the fullness of time we'll have to understand if we're using |
CuPy does not use RMM by default but you can enable RMM as the memory manager for CuPy. Here are some references. https://docs.rapids.ai/api/rmm/stable/guide/#using-rmm-with-third-party-libraries https://docs.cupy.dev/en/stable/user_guide/interoperability.html#rmm |
Yes! I saw that, thanks for further docs. We'll have to figure out something in awkward that detects and properly uses RMM when it is present. |
For parquet, whether or not you can decompress bytes into their final storage depends on quite a few different factors. You generally want V2 pages and SIMPLE encoding; and even then, only some data types will work (not string!). I wouldn't be surprised if cudf doesn't even bother looking for this efficient path. |
This is float32/int/bool data in either singly-nested ragged arrays or just flat arrays. No strings or anything. |
The encoding stats - how many pages of each type - are contained in the main file metadata (row_groups[].columns[].meta_data.encoding_stats). I think strictly they are optional, but I expect them to be present. The pages themselves can be either V1 or V2 as you come to them. A "version: 2" in the global metadata guarantees nothing. In https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html , you see data_page_version as an option, V1 by default(!!), so if you were writing your own files, this is what you would change. For INT, delta encoding would not allow decompress-to-target |
Oh, and I don't know if you can see any of those details via pyarrow. I use fastparquet... |
It looks like I can enforce these options by setting |
@fstrug you should follow this! |
These encodings are something we will want in our data, especially the split float (due to our use of mantissa truncation for reduced precision), so maybe not worth worrying too much about minimizing intermediate buffers and just operate on smaller numbers of row groups at a time? |
@nsmith- We'll have to see how this balances with throughput performance. I'd be curious if we can load the next dataset in while finishing compute on the last. Probably some automated sizing can be done per-analysis. |
Hi! I'm an experimental high energy physicist interested in rapid data analysis, in high energy physics we predominantly use the ROOT which has a pleasant python implementation here: https://github.com/scikit-hep/uproot5
cudf's cuFile and nvcomp based file i/o is extremely powerful (recently benchmarked this ourselves). However, I noticed that https://github.com/rapidsai/kvikio implements this i/o and decompression layer more generally for other libraries, and it is not used in cudf.
Would there be any lost performance from using kvikio as opposed to a dedicated C++ implementation, since uproot is nominally a python only implementation of the ROOT i/o standard? i.e. are there any major performance reasons reasons cudf does not use kvikio?
@martindurant @jpivarski
The text was updated successfully, but these errors were encountered: