[QST] Should byte_array_view in parquet reader/writer change #11408

hyperbolic2346 · 2022-07-29T21:19:39Z

What is your question?
Should byte_array_view change to a different implementation method or even go away completely.

Motivation

When reviewing the byte_array_view PR it was brought up in review comments that things could be done differently and possibly better. This issue is an attempt to bring this design out in the light and get some discourse going so we can build it the best way possible. Jake was, rightfully, concerned about the cognitive overload of having another object type that has to be understood, no matter how minimal the type turns out to be.

Backstory and origin

The original thought was that it would be nice to leverage the existing templates in the statistics code to get elements and compute max/min just like everything else. This meant that .element on a column would be able to return a type that represents a list<uint8>. This is almost identical to a string column, so the thought was to have something analogous to string_view that could be used. This was quickly dismissed due to the issue of not having all list columns comprised of this thing and it felt like we were forcing something. All string columns are lists of chars, but not all list columns are lists of bytes.

Requirements

The requirements in the statistics code are the ability to get an element from a table, compare elements, and compose an element from a pointer and a length. The statistics code goes to great length to type-erase the statistics blobs so they can be easily consumed at a large scale on the GPU and the reconstructs them later. It also uses thrust::min and cub::reduceBlock to process them, so comparison operators are needed.

Slippery issues to understand

We can't use the same statistics types as strings because string_view::max() is actually not the same as a max byte or a max byte_array_view. The distinction is subtle, but important between all of them.

The max UTF8 string is actually just 5 bytes long and defined inside the string_view header. No UTF8 string can have a higher value, so comparisons work even though it isn't an infinitely-long character string as one would initially think.
Maximum value for an unsigned byte is obviously 255, but this isn't the what is intended when one asks for the max byte array view. Instead, the goal is to know the "biggest" one. This includes the length and the internal bytes. 0xff, 0x05 is less than 0xff, 0x15 and 0xff is less than 0x00, 0x00.
Maximum byte_array_view is defined conceptually as an infinite array of 0xff. This isn't possible to statically define for comparison like the string_view class, so some magic values were used of a nullptr and max length. These then have to be explicitly compared later in the comparison function to achieve the proper results.

Lots of places required special handling for byte_array_view and potentially get worse with the different possible solutions. The goal of course is to make these areas as clean as possible, so I thought it would be good to point some of them out here.

Here is where the code grabs the data from the column. There is conversion in here for types, which is used for things like duration and timestamps. Originally it was thought this could be a good spot to convert from a list_view, which can be returned from .element calls on a list column. This didn't end up being a great solution, but I can't remember the details.
min/max calculations and block reduce happens down in typed_statistics_chunk. This code is responsible for figuring out min, max, null counts, and aggregations like sum. It has to pick up this new type and operate on it.
Actual data writing in parquet looks something like this where an element is grabbed and written into place.

Possible solutions

Use device_span directly. This requires passing comparison functions to cub and thrust for the calculations, but is completely doable. This was attempted, potentially poorly, with not great looking results.
Composition vs inheritance. This came up multiple times as to why it was built with composition, holding a device_span inside, vs inheriting from device_span either publicly or privately. There isn't a great answer here to argue against inheritance. I originally thought that this would be a very small subset of device_span and I didn't want to muddy the waters with all the accessors and iterators, but after further inspection, I don't see anything that I would want to remove from device_span, so this would be a viable path. It does still hold the issue of cognitive overload of yet another type someone encounters.
Continues to live on as it is now.
Your amazing idea that didn't come up in development or review.

The text was updated successfully, but these errors were encountered:

vuule · 2022-07-29T23:02:45Z

I got some ideas, but they depend on a few points I'm not sure about yet:

Are comparison semantics of byte_array_view similar/equivalent to string_view?
Is byte_array_view a device_span? I.e. would deriving from device_span follow Liskov Substitution Principle?

hyperbolic2346 · 2022-08-01T18:27:00Z

Identical outside of max value. string_view has a nice and clean static for max value and the byte array has to indicate it with magic values.
It can certainly be viewed as such. It is a device_span with helper comparison functions inside. It is a list of bytes, which is exactly a device_span.

vuule · 2022-08-01T20:52:13Z

Aiming to avoid code duplication:
ordered_device_span: public device_span` + comparison impl

string_view : derived from ordered_device_span<char> + min/max impl

byte_array_view : derived from ordered_device_span<uint8_t> + min/max impl

~~Probably requires CRTP for min/max without virtual functions.~~ not required at all, never mind :)

davidwendt · 2022-08-01T21:11:06Z

I'd rather keep cudf::string_view out of this discussion. It is intended to match up somewhat to std::string_view https://en.cppreference.com/w/cpp/string/basic_string_view
And some version of it may migrate to libcu++ at some point. At the very least, the cudf::string_view in the future may contain a cuda::std::string_view member.

vuule · 2022-08-01T23:52:07Z

Didn't know about libcu++ potential involvement.
In that case, my vote is to publicly derive from device_span to remove duplicated data access members (FWIW). Much less exciting solution :D

github-actions · 2022-09-01T00:13:05Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

hyperbolic2346 added question Further information is requested Needs Triage Need team to review and classify labels Jul 29, 2022

hyperbolic2346 mentioned this issue Jul 29, 2022

Adding byte array view structure #11322

Merged

3 tasks

GregoryKimball added code quality cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Jul 29, 2022

github-actions bot added the inactive-30d label Sep 1, 2022

GregoryKimball added this to the Parquet continuous improvement milestone Nov 19, 2022

GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed inactive-30d labels Apr 2, 2023

vyasr removed the code quality label Feb 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[QST] Should byte_array_view in parquet reader/writer change #11408

[QST] Should byte_array_view in parquet reader/writer change #11408

hyperbolic2346 commented Jul 29, 2022

vuule commented Jul 29, 2022 •

edited

Loading

hyperbolic2346 commented Aug 1, 2022

vuule commented Aug 1, 2022 •

edited

Loading

davidwendt commented Aug 1, 2022 •

edited

Loading

vuule commented Aug 1, 2022

github-actions bot commented Sep 1, 2022

[QST] Should byte_array_view in parquet reader/writer change #11408

[QST] Should byte_array_view in parquet reader/writer change #11408

Comments

hyperbolic2346 commented Jul 29, 2022

Motivation

Backstory and origin

Requirements

Slippery issues to understand

Possible solutions

vuule commented Jul 29, 2022 • edited Loading

hyperbolic2346 commented Aug 1, 2022

vuule commented Aug 1, 2022 • edited Loading

davidwendt commented Aug 1, 2022 • edited Loading

vuule commented Aug 1, 2022

github-actions bot commented Sep 1, 2022

vuule commented Jul 29, 2022 •

edited

Loading

vuule commented Aug 1, 2022 •

edited

Loading

davidwendt commented Aug 1, 2022 •

edited

Loading