[BUG] Backwards compatible parquet MAP_KEY_VALUE is not treated properly #12044

revans2 · 2022-11-01T21:46:34Z

Describe the bug
The Parquet specification at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, when discussing backward compatibility for maps, says:

Some existing data incorrectly used MAP_KEY_VALUE in place of MAP. For backward-compatibility, a group annotated with MAP_KEY_VALUE that is not contained by a MAP-annotated group should be handled as a MAP-annotated group.

The example schema given for this is:

// Map<String, Integer> (nullable map, nullable values)
optional group my_map (MAP_KEY_VALUE) {
  repeated group map {
    required binary key (UTF8);
    optional int32 value;
  }
}
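The backward-compatibility rule quoted above can be sketched as a small schema-normalization pass. This is only an illustration, not how libcudf or any other reader implements it: `SchemaNode` and `normalize_legacy_maps` are hypothetical names, physical types are omitted, and "not contained by a MAP-annotated group" is assumed here to mean that the direct parent is not annotated with MAP.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchemaNode:
    """Hypothetical, simplified stand-in for a Parquet schema element."""
    name: str
    repetition: str                    # "required", "optional", or "repeated"
    annotation: Optional[str] = None   # e.g. "MAP", "MAP_KEY_VALUE", "UTF8"
    children: List["SchemaNode"] = field(default_factory=list)


def normalize_legacy_maps(node: SchemaNode, parent_is_map: bool = False) -> SchemaNode:
    """Return a copy of the tree in which a MAP_KEY_VALUE group whose parent
    is not MAP-annotated is re-annotated as MAP (assumed reading of the rule)."""
    annotation = node.annotation
    is_group = bool(node.children)
    if is_group and annotation == "MAP_KEY_VALUE" and not parent_is_map:
        annotation = "MAP"
    children = [
        normalize_legacy_maps(child, parent_is_map=(annotation == "MAP"))
        for child in node.children
    ]
    return SchemaNode(node.name, node.repetition, annotation, children)


# The legacy layout from this issue: MAP_KEY_VALUE used in place of MAP.
legacy = SchemaNode("spark", "required", None, [
    SchemaNode("my_map", "required", "MAP_KEY_VALUE", [
        SchemaNode("map", "repeated", None, [
            SchemaNode("key", "required"),
            SchemaNode("value", "required"),
        ]),
    ]),
])

fixed = normalize_legacy_maps(legacy)
print(fixed.children[0].annotation)  # MAP
```

A MAP_KEY_VALUE group that sits directly under a MAP-annotated group (the standard layout) is left unchanged by this pass.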

I created a very similar Parquet file and put it in file.zip. It uses int32 for both the key and the value.

message spark {
  required group my_map (MAP_KEY_VALUE) {
    repeated group map {
      required int32 key;
      required int32 value;
    }
  }
}

When I read the data back using cuDF, I get a schema like TABLE<STRUCT<STRUCT<INT32, INT32>>>, but what we want is TABLE<LIST<STRUCT<INT32, INT32>>>. Because that first column is read as a STRUCT and not a LIST, only the first entry in the LIST is returned.

It looks like pandas is able to do this.

>>> pd.read_parquet("MAP_KEY_VALUE_TEST.parquet")
             my_map
0  [(0, 2), (1, 3)]
>>> pd.read_parquet("MAP_KEY_VALUE_TEST.parquet").info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   my_map  1 non-null      object
dtypes: object(1)
memory usage: 136.0+ bytes
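To illustrate why reading the column as a STRUCT instead of a LIST drops entries, here is a toy model of record assembly. It is a sketch under assumptions, not cuDF's decoder: the `(repetition_level, key, value)` triples are a simplified view of the column data, the repetition levels are assumed (0 starts a new row, 1 continues the current one), and the keys and values are taken from the pandas output above.

```python
from typing import List, Tuple

Entry = Tuple[int, int, int]  # (repetition_level, key, value), assumed layout


def assemble_as_list(entries: List[Entry]) -> List[List[Tuple[int, int]]]:
    """LIST view: repetition level 0 opens a new row, level 1 appends to it."""
    rows: List[List[Tuple[int, int]]] = []
    for rep_level, key, value in entries:
        if rep_level == 0:
            rows.append([])
        rows[-1].append((key, value))
    return rows


def assemble_as_struct(entries: List[Entry]) -> List[Tuple[int, int]]:
    """STRUCT view: exactly one (key, value) per row, so entries that
    continue a row (repetition level > 0) have nowhere to go."""
    return [(key, value) for rep_level, key, value in entries if rep_level == 0]


# One row holding two map entries: (0 -> 2) and (1 -> 3).
entries = [(0, 0, 2), (1, 1, 3)]

print(assemble_as_list(entries))    # [[(0, 2), (1, 3)]]
print(assemble_as_struct(entries))  # [(0, 2)]
```

The LIST view matches what pandas returns, while the STRUCT view keeps only the first entry, which is consistent with the behavior described in this report.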

Additional context
This is probably not a super high priority. It is an odd/rare corner case, at least until a customer hits it.


GregoryKimball · 2022-11-14T04:25:34Z

Thank you @revans2 for documenting this deviation. If this came from your testing of corner-cases in the parquet spec, that is a great outcome.

revans2 added the bug (Something isn't working), Needs Triage (Need team to review and classify), and Spark (Functionality that helps Spark RAPIDS) labels Nov 1, 2022

revans2 mentioned this issue Nov 1, 2022

[BUG] The backwards compatible parquet annotation MAP_KEY_VALUE crashes on read NVIDIA/spark-rapids#6970

Open

GregoryKimball added the 0 - Backlog (In queue waiting for assignment) and cuIO (cuIO issue) labels and removed the Needs Triage (Need team to review and classify) label Nov 14, 2022

GregoryKimball added this to the Parquet continuous improvement milestone Nov 19, 2022

GregoryKimball added the libcudf (Affects libcudf (C++/CUDA) code) label Apr 2, 2023
