[BUG] Backwards compatible parquet MAP_KEY_VALUE is not treated properly #12044

Open
revans2 opened this issue Nov 1, 2022 · 1 comment
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

revans2 commented Nov 1, 2022

Describe the bug
The Parquet specification at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, in its discussion of backward compatibility for maps, says:

Some existing data incorrectly used MAP_KEY_VALUE in place of MAP. For backward-compatibility, a group annotated with MAP_KEY_VALUE that is not contained by a MAP-annotated group should be handled as a MAP-annotated group.
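The rule quoted above can be sketched as a small predicate. This is only an illustration of the spec text, not libcudf's reader logic: the `Group` class and `treat_as_map` function are hypothetical names, and "contained by" is assumed here to mean the direct parent group.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    """Hypothetical schema node: a group with an optional annotation."""
    name: str
    annotation: Optional[str] = None  # e.g. "MAP", "MAP_KEY_VALUE", or None
    parent: Optional["Group"] = None


def treat_as_map(group: Group) -> bool:
    """Return True if the group should be handled as a MAP-annotated group."""
    if group.annotation == "MAP":
        return True
    if group.annotation == "MAP_KEY_VALUE":
        # Legacy data: MAP_KEY_VALUE counts as MAP only when it is not
        # contained by a MAP-annotated group (assumed: the direct parent).
        inside_map = group.parent is not None and group.parent.annotation == "MAP"
        return not inside_map
    return False


# Legacy layout from this issue: the outer group carries MAP_KEY_VALUE.
legacy = Group("my_map", "MAP_KEY_VALUE")

# Standard layout: MAP on the outer group, MAP_KEY_VALUE on the inner one.
outer = Group("my_map", "MAP")
inner = Group("key_value", "MAP_KEY_VALUE", parent=outer)
```

Under these assumptions `treat_as_map(legacy)` is true, while `treat_as_map(inner)` is false because the enclosing group already carries the MAP annotation.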

The example schema given for this is:

// Map<String, Integer> (nullable map, nullable values)
optional group my_map (MAP_KEY_VALUE) {
  repeated group map {
    required binary key (UTF8);
    optional int32 value;
  }
}

I created a very similar Parquet file and put it in file.zip; it uses int32 for both the key and the value.

message spark {
  required group my_map (MAP_KEY_VALUE) {
    repeated group map {
      required int32 key;
      required int32 value;
    }
  }
}

When I read the data back using cuDF, I get a schema like TABLE<STRUCT<STRUCT<INT32, INT32>>>, but what we want is TABLE<LIST<STRUCT<INT32, INT32>>>. Because that first column is a STRUCT and not a LIST, only the first row in the LIST is returned.
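To illustrate why the LIST level matters, here is a minimal sketch of assembling rows from the repeated `map` group using repetition levels. It is not how libcudf decodes the file: the function names are hypothetical, the repetition levels `[0, 1]` are assumed from the single-row pandas output in this issue, and definition levels (nulls, empty maps) are ignored.

```python
def assemble_map_rows(keys, values, rep_levels):
    """Group (key, value) pairs into rows; repetition level 0 starts a new row."""
    rows = []
    for key, value, rep in zip(keys, values, rep_levels):
        if rep == 0:
            rows.append([])  # a new top-level record begins here
        rows[-1].append((key, value))
    return rows


def assemble_as_struct(keys, values, rep_levels):
    """Mimic the reported symptom: without a LIST level, one entry per row survives."""
    return [row[0] for row in assemble_map_rows(keys, values, rep_levels)]


# Assumed leaf data for the one-row test file described in this issue.
keys, values, rep_levels = [0, 1], [2, 3], [0, 1]

as_list = assemble_map_rows(keys, values, rep_levels)     # LIST<STRUCT<INT32, INT32>>
as_struct = assemble_as_struct(keys, values, rep_levels)  # STRUCT<INT32, INT32>
```

With this data, the list interpretation keeps both entries in the single row, while the struct interpretation keeps only `(0, 2)`.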

It looks like pandas is able to read this correctly:

>>> pd.read_parquet("MAP_KEY_VALUE_TEST.parquet")
             my_map
0  [(0, 2), (1, 3)]
>>> pd.read_parquet("MAP_KEY_VALUE_TEST.parquet").info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   my_map  1 non-null      object
dtypes: object(1)
memory usage: 136.0+ bytes

Additional context
This is probably not a super high priority. It is an odd/rare corner case, at least until a customer hits it.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Nov 1, 2022
@GregoryKimball

Thank you @revans2 for documenting this deviation. If this came from your testing of corner-cases in the parquet spec, that is a great outcome.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Nov 14, 2022
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023