Generated Parquet files are extremely fragmented #123
Comments
I also did some very rough benchmarks before/after making the rowgroups nicer: DuckDB:
Polars:
Pretty significant!
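For reference, a minimal sketch of how such a before/after timing could be run; the file names and the query (summing a TPC-H lineitem column) are hypothetical placeholders, not the exact benchmark used above:

```python
import time

import duckdb
import polars as pl

# Hypothetical paths: "fragmented.parquet" with ~3,400-row row groups,
# "rewritten.parquet" with larger row groups.
for path in ["fragmented.parquet", "rewritten.parquet"]:
    t0 = time.perf_counter()
    duckdb.sql(f"SELECT sum(l_extendedprice) FROM read_parquet('{path}')").fetchall()
    print(f"DuckDB  {path}: {time.perf_counter() - t0:.3f}s")

    t0 = time.perf_counter()
    pl.scan_parquet(path).select(pl.col("l_extendedprice").sum()).collect()
    print(f"Polars  {path}: {time.perf_counter() - t0:.3f}s")
```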
Right... That is something we should take up upstream in Polars.
I think this
I tried doing

Ideally, Parquet rowgroups should be much larger (the format spec recommends 512 MB to 1 GB, but in practice I've seen more like 128 MB or so). Any thoughts on whether I should create an issue in Polars itself to follow up there?
Though that recommendation doesn't necessarily mean better performance, I think. 512 MB is very large, and we could do a lot more in parallel if we shrink the sizes. Currently, our row-groups are fixed in row count and we don't dynamically try to hit a certain output size. Maybe we could, but I do think we should favor relatively smaller groups. (But I agree, not this small.)
Yup, I agree that 512 MB as a blanket statement is probably not the best idea. Those were suggestions based on HDFS, I believe, but for modern workloads we should probably try to optimize for something like AWS S3/object storage.
Do you have an idea for how small you'd prefer here? For AWS S3, typical sizes for byte-range requests are 8 MB or 16 MB, which could be a good target to build around. Perhaps a good guideline would be to try for column chunks of about that size. WDYT?
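To make that concrete, here's a rough back-of-the-envelope sketch for deriving a row count from a target column-chunk size; the input file name and the 16 MB target below are purely illustrative:

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("lineitem.parquet").metadata  # hypothetical input file

# Average compressed bytes per row across all row groups and columns.
compressed_bytes = sum(
    meta.row_group(i).column(j).total_compressed_size
    for i in range(meta.num_row_groups)
    for j in range(meta.num_columns)
)
bytes_per_row = compressed_bytes / meta.num_rows

# Rows needed so the *average* column chunk is ~16 MB (one S3 byte-range request).
target_chunk_bytes = 16 * 1024 * 1024
rows_per_group = int(target_chunk_bytes / (bytes_per_row / meta.num_columns))
print(f"{bytes_per_row:.1f} B/row -> ~{rows_per_group:,} rows per row group")
```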
I think I will try to hit a row count rather than a row-group size (defaulting to 512^2 = 262,144 rows). There is currently an issue in Polars that allows very small splits (a few rows) to be written. Will fix it.
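A minimal sketch of what that could look like on the writing side, assuming the row_group_size parameter on Polars' DataFrame.write_parquet; the DataFrame and path are placeholders:

```python
import polars as pl

df = pl.DataFrame({"a": range(1_000_000)})  # placeholder data

# Request row groups of 512**2 = 262,144 rows rather than letting the writer
# emit many tiny splits (assumes the row_group_size parameter is available).
df.write_parquet("lineitem.parquet", row_group_size=512**2)
```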
That sounds great and will be a big improvement vs. the current behavior of just a few thousand rows! Thanks for the quick responses 👏 👏 👏

Any thoughts on why
Hi, I noticed that the generated Parquet files are extremely fragmented in terms of rowgroups. This likely indicates a bug/issue in the Polars Parquet writer, but it definitely also affects the results of the benchmarks.
For a SCALE_FACTOR=10 table generation, the Parquet files have a staggering 20,000 rowgroups!
Each rowgroup only has about 3,400 rows and a size of 117 kB. For reference, Parquet rowgroups are often suggested to be in the range of about 128 MB. Because we have so many rowgroups, the Parquet metadata itself is 27 MB, and this likely introduces a ton of extra hops when reading the file 😅
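Those numbers can be read straight from the Parquet footer with PyArrow, e.g. (the path is a placeholder for the generated file):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("lineitem.parquet").metadata  # placeholder path

print(meta.num_row_groups)                # number of row groups (~20,000 here)
print(meta.row_group(0).num_rows)         # rows in the first row group (~3,400 here)
print(meta.row_group(0).total_byte_size)  # bytes of column data in that row group
print(meta.serialized_size)               # size of the thrift-encoded footer (~27 MB here)
```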
Writing this instead with PyArrow (I amended the code in prepare_data.py), we get much more well-behaved rowgroups:

Still fairly small as rowgroups go, but I think it's much more reasonable and represents Parquet data in the wild a little better!
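The exact change to prepare_data.py isn't shown here, but a minimal sketch of that kind of PyArrow rewrite could look like this; the paths and row-group size are illustrative:

```python
import pyarrow.parquet as pq

# Read the fragmented file and rewrite it with explicitly sized row groups.
table = pq.read_table("lineitem.parquet")       # placeholder input
pq.write_table(
    table,
    "lineitem_rewritten.parquet",               # placeholder output
    row_group_size=512 * 512,                   # ~262k rows per group, illustrative
)
```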