Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QST] Should byte_array_view in parquet reader/writer change #11408

Open
hyperbolic2346 opened this issue Jul 29, 2022 · 6 comments
Open

[QST] Should byte_array_view in parquet reader/writer change #11408

hyperbolic2346 opened this issue Jul 29, 2022 · 6 comments
Labels
cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. question Further information is requested

Comments

@hyperbolic2346
Copy link
Contributor

What is your question?
Should byte_array_view change to a different implementation method or even go away completely.

Motivation

When reviewing the byte_array_view PR it was brought up in review comments that things could be done differently and possibly better. This issue is an attempt to bring this design out in the light and get some discourse going so we can build it the best way possible. Jake was, rightfully, concerned about the cognitive overload of having another object type that has to be understood, no matter how minimal the type turns out to be.

Backstory and origin

The original thought was that it would be nice to leverage the existing templates in the statistics code to get elements and compute max/min just like everything else. This meant that .element on a column would be able to return a type that represents a list<uint8>. This is almost identical to a string column, so the thought was to have something analogous to string_view that could be used. This was quickly dismissed due to the issue of not having all list columns comprised of this thing and it felt like we were forcing something. All string columns are lists of chars, but not all list columns are lists of bytes.

Requirements

The requirements in the statistics code are the ability to get an element from a table, compare elements, and compose an element from a pointer and a length. The statistics code goes to great length to type-erase the statistics blobs so they can be easily consumed at a large scale on the GPU and the reconstructs them later. It also uses thrust::min and cub::reduceBlock to process them, so comparison operators are needed.

Slippery issues to understand

We can't use the same statistics types as strings because string_view::max() is actually not the same as a max byte or a max byte_array_view. The distinction is subtle, but important between all of them.

  • The max UTF8 string is actually just 5 bytes long and defined inside the string_view header. No UTF8 string can have a higher value, so comparisons work even though it isn't an infinitely-long character string as one would initially think.
  • Maximum value for an unsigned byte is obviously 255, but this isn't the what is intended when one asks for the max byte array view. Instead, the goal is to know the "biggest" one. This includes the length and the internal bytes. 0xff, 0x05 is less than 0xff, 0x15 and 0xff is less than 0x00, 0x00.
  • Maximum byte_array_view is defined conceptually as an infinite array of 0xff. This isn't possible to statically define for comparison like the string_view class, so some magic values were used of a nullptr and max length. These then have to be explicitly compared later in the comparison function to achieve the proper results.

Lots of places required special handling for byte_array_view and potentially get worse with the different possible solutions. The goal of course is to make these areas as clean as possible, so I thought it would be good to point some of them out here.

  • Here is where the code grabs the data from the column. There is conversion in here for types, which is used for things like duration and timestamps. Originally it was thought this could be a good spot to convert from a list_view, which can be returned from .element calls on a list column. This didn't end up being a great solution, but I can't remember the details.
  • min/max calculations and block reduce happens down in typed_statistics_chunk. This code is responsible for figuring out min, max, null counts, and aggregations like sum. It has to pick up this new type and operate on it.
  • Actual data writing in parquet looks something like this where an element is grabbed and written into place.

Possible solutions

  1. Use device_span directly. This requires passing comparison functions to cub and thrust for the calculations, but is completely doable. This was attempted, potentially poorly, with not great looking results.
  2. Composition vs inheritance. This came up multiple times as to why it was built with composition, holding a device_span inside, vs inheriting from device_span either publicly or privately. There isn't a great answer here to argue against inheritance. I originally thought that this would be a very small subset of device_span and I didn't want to muddy the waters with all the accessors and iterators, but after further inspection, I don't see anything that I would want to remove from device_span, so this would be a viable path. It does still hold the issue of cognitive overload of yet another type someone encounters.
  3. Continues to live on as it is now.
  4. Your amazing idea that didn't come up in development or review.
@hyperbolic2346 hyperbolic2346 added question Further information is requested Needs Triage Need team to review and classify labels Jul 29, 2022
@GregoryKimball GregoryKimball added code quality cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Jul 29, 2022
@vuule
Copy link
Contributor

vuule commented Jul 29, 2022

I got some ideas, but they depend on a few points I'm not sure about yet:

  1. Are comparison semantics of byte_array_view similar/equivalent to string_view?
  2. Is byte_array_view a device_span? I.e. would deriving from device_span follow Liskov Substitution Principle?

@hyperbolic2346
Copy link
Contributor Author

  1. Identical outside of max value. string_view has a nice and clean static for max value and the byte array has to indicate it with magic values.
  2. It can certainly be viewed as such. It is a device_span with helper comparison functions inside. It is a list of bytes, which is exactly a device_span.

@vuule
Copy link
Contributor

vuule commented Aug 1, 2022

Aiming to avoid code duplication:
ordered_device_span: public device_span` + comparison impl

string_view : derived from ordered_device_span<char> + min/max impl

byte_array_view : derived from ordered_device_span<uint8_t> + min/max impl

Probably requires CRTP for min/max without virtual functions. not required at all, never mind :)

@davidwendt
Copy link
Contributor

davidwendt commented Aug 1, 2022

I'd rather keep cudf::string_view out of this discussion. It is intended to match up somewhat to std::string_view https://en.cppreference.com/w/cpp/string/basic_string_view
And some version of it may migrate to libcu++ at some point. At the very least, the cudf::string_view in the future may contain a cuda::std::string_view member.

@vuule
Copy link
Contributor

vuule commented Aug 1, 2022

Didn't know about libcu++ potential involvement.
In that case, my vote is to publicly derive from device_span to remove duplicated data access members (FWIW). Much less exciting solution :D

@github-actions
Copy link

github-actions bot commented Sep 1, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed inactive-30d labels Apr 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants