0.10.0 Regression - Cannot parse metadata #39

harel-e · 2024-02-13T19:35:24Z

Starting with DuckDB 0.10.0, I can no longer use the extension to scan iceberg files

select * from iceberg_scan('s3://zones/iceberg/test/zones_43_1b7d1e67-8f53-42ab-ba55-a66f438f344f/metadata/00006-0d6ca258-8039-446f-843a-69e21344f272.metadata.json');

Error: IO Error: Invalid field found while parsing field: required

The above works fine with DuckDB 0.9.2

I attached the metadata file, which was created by Iceberg.
00006-0d6ca258-8039-446f-843a-69e21344f272.metadata.json

harel-e · 2024-02-14T07:57:36Z

It seems that if the schema contains a required field ("required" : true) the extension fails.
I tried it with https://duckdb.org/data/iceberg_data.zip
I changed one of the fields in v2.metadata.json from "required" : false to "required" : true

0.9.2 - No problems

0.10.0

SELECT * FROM iceberg_scan('data/iceberg/lineitem_iceberg', allow_moved_paths = true);
Error: IO Error: Invalid field found while parsing field: required
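
For anyone who wants to reproduce this without editing the file by hand, here is a minimal sketch in Python. It assumes the usual Iceberg metadata layout (a top-level "schemas" list, and/or a single "schema" object, each with a "fields" list). The helper name mark_field_required and the sample field names are only illustrative, and the dictionary below is a tiny stand-in, not the real v2.metadata.json.

```python
import json

def mark_field_required(metadata: dict, field_name: str) -> int:
    """Set "required": true on every top-level schema field named field_name.

    Returns the number of fields changed. Assumes Iceberg-style metadata
    with a "schemas" list and/or a single "schema" object.
    """
    schemas = list(metadata.get("schemas", []))
    if "schema" in metadata:
        schemas.append(metadata["schema"])
    changed = 0
    for schema in schemas:
        for field in schema.get("fields", []):
            if field.get("name") == field_name and not field.get("required"):
                field["required"] = True
                changed += 1
    return changed

# Tiny in-memory stand-in for a metadata file (illustrative only)
example = {
    "format-version": 2,
    "schemas": [{"type": "struct", "schema-id": 0, "fields": [
        {"id": 1, "name": "l_orderkey", "required": False, "type": "long"},
        {"id": 2, "name": "l_comment", "required": False, "type": "string"},
    ]}],
}
changed = mark_field_required(example, "l_orderkey")
patched_json = json.dumps(example, indent=2)
```

With a real file, one would json.load() it, call the helper, and json.dump() the result back before running the iceberg_scan query above.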

Sol-Hee · 2024-02-20T08:09:59Z

I'm experiencing the same problem.
After upgrading from 0.9.2 to 0.10.0, code that used to work now throws an IOException:

IOException: IO Error: Invalid field found while parsing field: type

samansmink · 2024-02-22T11:36:06Z

Thanks for reporting @harel-e and @Sol-Hee.

This is a known issue that is a result of #30. In that PR we switched from parsing the Parquet schemas to actually parsing the Iceberg schema. However, since DuckDB's schema param for the Parquet reader does not yet support nested types, this support was dropped. The way forward here is to:

- add support for nested types in the DuckDB Parquet reader's schema option
- add support for Iceberg nested types in this extension
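
As a rough way to check whether a given table uses nested types at all, the sketch below walks one schema object from a metadata file. It relies on the Iceberg JSON convention that primitive types are plain strings while struct, list, and map types are JSON objects with a "type" key. The function name nested_field_names and the sample schema are hypothetical, and the check only looks at top-level fields.

```python
NESTED_TYPES = {"struct", "list", "map"}

def nested_field_names(schema: dict) -> list:
    """Return names of top-level fields whose type is struct, list, or map."""
    names = []
    for field in schema.get("fields", []):
        field_type = field.get("type")
        # Primitive types are plain strings; nested types are JSON objects
        if isinstance(field_type, dict) and field_type.get("type") in NESTED_TYPES:
            names.append(field["name"])
    return names

# Illustrative schema, not taken from any real table
example_schema = {
    "type": "struct",
    "schema-id": 0,
    "fields": [
        {"id": 1, "name": "id", "required": True, "type": "long"},
        {"id": 2, "name": "tags", "required": False,
         "type": {"type": "list", "element-id": 3,
                  "element-required": True, "element": "string"}},
        {"id": 4, "name": "attrs", "required": False,
         "type": {"type": "map", "key-id": 5, "key": "string",
                  "value-id": 6, "value-required": False, "value": "string"}},
    ],
}
nested = nested_field_names(example_schema)
```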

harel-e · 2024-02-27T04:01:20Z

@samansmink Thank you for this wonderful extension.
I wonder if there is an estimate for when this issue will be fixed, and whether there are open issues for the two items above?
I can assist with testing if needed.
Thanks

samansmink · 2024-02-27T08:21:24Z

Hey @harel-e, at this moment we cannot give any time estimate on this, sorry!

harel-e · 2024-02-28T09:06:34Z

Ok, thanks for the update. Too bad I cannot upgrade. I'll stick to DuckDB 0.9.2 until the issue is fixed.

veca1982 · 2024-03-16T09:23:45Z

Hello, first of all thanks to the DuckDB community for the beautiful work.
I stumbled across this issue while facing the same problem :).
There is a way to work around it, assuming you are working with DuckDB through code.
We just need to get the Iceberg data files (in my case Parquet files) from a table scan (Iceberg API) and pass those file paths to DuckDB's read_parquet function.
A snippet would look something like this (using Python and the pyiceberg API):

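from pyiceberg.expressions import And, GreaterThanOrEqual, LessThanOrEqual

# `catalog` is an already-loaded pyiceberg catalog
table = catalog.load_table("some.table")
scan = table.scan(row_filter=And(GreaterThanOrEqual("ts", "2024-03-14T08:00:00.000+00:00"), LessThanOrEqual("ts", "2024-03-14T08:10:00.000+00:00")))
# Plan the scan and collect the path of every data file it would read
files_data = scan.plan_files()
scan_file_paths = [file_data.file.file_path for file_data in files_data]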

Let's say you're using a Jupyter notebook with JupySQL:

%%sql
SELECT count(*)
FROM read_parquet({{scan_file_paths}});
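
Outside a notebook, the same query can be built as a plain string. The sketch below is only an illustration: the helper name read_parquet_query and the sample paths are made up, and it relies on DuckDB's read_parquet accepting a list of file paths.

```python
def read_parquet_query(paths, select="count(*)"):
    """Build a DuckDB query over an explicit list of Parquet files."""
    if not paths:
        raise ValueError("no data files to scan")
    # Double any single quote so each path is a valid SQL string literal
    quoted = ", ".join("'" + p.replace("'", "''") + "'" for p in paths)
    return f"SELECT {select} FROM read_parquet([{quoted}]);"

# Hypothetical paths standing in for scan_file_paths
query = read_parquet_query(["s3://bucket/a.parquet", "s3://bucket/b.parquet"])
```

The resulting string can then be passed to duckdb.sql(query). This reads only the data files listed, so any Iceberg delete files are ignored.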

harel-e · 2024-03-16T19:24:37Z

@veca1982 - Thanks for the workaround suggestion. Does this approach handle updates/deletes ?

veca1982 · 2024-03-16T20:15:33Z

No, it works for inserts/appends only.
For updates/deletes you can use a pyiceberg table scan, output it to an Arrow table, pandas DataFrame, DuckDB table... and then query it with DuckDB 😁.
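
A minimal sketch of that second route, assuming a pyiceberg scan object (which provides to_arrow()) and a DuckDB connection (which provides register() and execute()). The function name query_iceberg_scan and the view name iceberg_snapshot are placeholders, not part of either library.

```python
def query_iceberg_scan(scan, con, sql, view_name="iceberg_snapshot"):
    """Materialize a pyiceberg scan as Arrow and query it through DuckDB.

    scan: object with to_arrow(), e.g. the result of table.scan(...)
    con:  DuckDB connection, e.g. duckdb.connect()
    sql:  query text that refers to view_name
    """
    # pyiceberg resolves the snapshot here, so the Arrow table holds current rows
    arrow_table = scan.to_arrow()
    # Expose the Arrow table to DuckDB under a queryable name
    con.register(view_name, arrow_table)
    return con.execute(sql).fetchall()

# Usage, assuming duckdb and pyiceberg are installed and `table` is loaded:
#   import duckdb
#   rows = query_iceberg_scan(table.scan(), duckdb.connect(),
#                             "SELECT count(*) FROM iceberg_snapshot")
```

Note that to_arrow() loads the whole scan result into memory, so a row_filter on the scan is worth adding for large tables.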

rustyconover · 2024-04-10T12:18:40Z

I think I've addressed this with #50, it should fix parsing the metadata.

harel-e · 2024-04-10T15:00:28Z

@rustyconover - can't wait to test it out. I hope it will make it to the 0.10.2 release.

harel-e · 2024-05-08T07:45:21Z

I just tested #50 - It does indeed fix the metadata parsing issue.

samansmink · 2024-05-13T12:15:14Z

fixed by #50

Fokko mentioned this issue Feb 22, 2024

Unable to read complex data types(e.g. Map, Struct) after upgrading to latest (0.10.0) version #41

Closed

harel-e mentioned this issue May 8, 2024

Change yyjson_get_tag() to yyjson_get_type() when doing type comparisons. #50

Merged

samansmink closed this as completed May 13, 2024
