-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(puffin): Add PuffinReader #892
base: main
Are you sure you want to change the base?
Conversation
31c248b
to
4c97569
Compare
4c97569
to
62a2e6d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fqaiser94 for this great pr, generally LGTM! Just some minor suggestions.
crates/iceberg/src/puffin/reader.rs
Outdated
} | ||
|
||
/// Returns file metadata | ||
pub(crate) async fn file_metadata(&mut self) -> Result<&FileMetadata> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider using a LazyLock
to initalize it? This way it could be thread safe, and discard the &mut
modifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to be able to get rid of the &mut
here 👍
Struggling to get a LazyLock
based implementation working though.
The main challenge is that FileMetadata::read
is an async function :/
Open to suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think async_lazy crate is a good fit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL, I implemented the lazy initialization using tokio::sync::OnceCell
.
I couldn't get the async lazy crate approach to work because:
- It doesn't expose any APIs for fallible operations.
- My
FileMetadata::read
function needs a reference to theInputFile
which doesn't work well with this approach:
let file_metadata: Lazy<Result<FileMetadata>> = Lazy::new(|| Box::pin(async { FileMetadata::read(&input_file).await }));
mismatched types
expected fn pointer `fn() -> Pin<Box<(dyn futures::Future<Output = std::result::Result<puffin::metadata::FileMetadata, error::Error>> + std::marker::Send + 'static)>>`
found closure `{closure@crates/iceberg/src/puffin/reader.rs:36:67: 36:69}`rustc[Click for full compiler diagnostic](rust-analyzer-diagnostics-view:/diagnostic%20message%20[2]?2#file:///Users/fq/src/iceberg-rust/crates/iceberg/src/puffin/reader.rs)
reader.rs(36, 57): arguments to this function are incorrect
reader.rs(36, 107): closures can only be coerced to `fn` types if they do not capture any variables
lib.rs(73, 12): associated function defined here
/// Sequence number of the Iceberg table's snapshot the blob was computed from | ||
pub(crate) sequence_number: i64, | ||
/// The actual blob data | ||
pub(crate) data: Vec<u8>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add comment to explain that this is always uncompressed data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or can we include a CompressionCodec
in the Blob
to indicate the compress method? (we can do this in a follow-up PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add comment to explain that this is always uncompressed data?
Done.
Or can we include a CompressionCodec in the Blob to indicate the compress method? (we can do this in a follow-up PR)
I can see where you're coming from and while I'm not completely opposed to the idea, I think I prefer to return the uncompressed data at this point. I will also note that the Java PuffinReader API does the same i.e. always returns the data after decompression.
4d64762
to
2216a8b
Compare
8b2af10
to
e15baab
Compare
e15baab
to
8752ce4
Compare
Part of #744
Summary
Context