exposition: impose row group limit for parquet #31

Merged: 4 commits, May 31, 2024

@mihirn mihirn commented May 28, 2024

Remove internal buffering from the ParquetWriter and instead
write single snapshots to the ArrowWriter which buffers data
internally until the row group size limit is reached. Set
the row group limit for the ArrowWriter from the specified
batch size in the ParquetOptions. Finally, reduce the default
row group size from 1M to 50K.
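
The buffering behavior described above can be illustrated with a small standard-library-only model. This is a sketch under stated assumptions, not the code in this PR: `RowGroupWriter`, its fields, and its methods are hypothetical names, and a plain `u64` stands in for one snapshot row. In the real `parquet` crate the limit is set with `WriterProperties::builder().set_max_row_group_size(..)` and handed to `ArrowWriter`, which does the buffering itself.

```rust
/// Hypothetical stand-in for a writer that buffers rows into row groups.
/// It only models the row-count limit; it does not write Parquet data.
struct RowGroupWriter {
    /// Maximum number of rows held in memory before a row group is flushed.
    max_row_group_size: usize,
    /// Rows waiting to be flushed as part of the current row group.
    buffered: Vec<u64>,
    /// Row count of every row group flushed so far.
    flushed_groups: Vec<usize>,
}

impl RowGroupWriter {
    fn new(max_row_group_size: usize) -> Self {
        assert!(max_row_group_size > 0, "row group size must be non-zero");
        RowGroupWriter {
            max_row_group_size,
            buffered: Vec::new(),
            flushed_groups: Vec::new(),
        }
    }

    /// Append a single snapshot row; flush once the limit is reached.
    fn write(&mut self, row: u64) {
        self.buffered.push(row);
        if self.buffered.len() >= self.max_row_group_size {
            self.flush();
        }
    }

    /// Emit the buffered rows as one row group and release the buffer.
    fn flush(&mut self) {
        if !self.buffered.is_empty() {
            self.flushed_groups.push(self.buffered.len());
            self.buffered.clear();
        }
    }

    /// Flush any partial row group and return the size of each group.
    fn close(mut self) -> Vec<usize> {
        self.flush();
        self.flushed_groups
    }
}

fn main() {
    // 50_000 matches the new default in this PR; 120_000 rows is arbitrary.
    let mut writer = RowGroupWriter::new(50_000);
    for row in 0..120_000u64 {
        writer.write(row);
    }
    println!("{:?}", writer.close()); // prints [50000, 50000, 20000]
}
```

In this model the number of rows held in memory never exceeds the configured limit, which is why a smaller row group size lowers the writer's peak footprint.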

@swlynch99
Contributor

Does this solve https://github.com/iopsystems/systemslab/issues/1571? If so that would be quite nice

@mihirn
Contributor Author

mihirn commented May 28, 2024

I don't know if it fixes it entirely, but it should reduce the footprint substantially (for instance, I'm able to tune stuff down to a memory overhead of < 1GB).

@@ -14,7 +12,7 @@ use parquet::format::{FileMetaData, KeyValue};

 use crate::snapshot::{HashedSnapshot, Snapshot};

-const DEFAULT_MAX_BATCH_SIZE: usize = 1024 * 1024;
+const DEFAULT_MAX_BATCH_SIZE: usize = 50_000;
Contributor


Is it worth leaving a comment with the rationale for this size?

@mihirn mihirn merged commit 4abf8d9 into main May 31, 2024
9 checks passed
@mihirn mihirn deleted the msn/rowgroup-size branch May 31, 2024 22:28
3 participants