Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(puffin): Add PuffinReader #892

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

fqaiser94
Copy link
Contributor

@fqaiser94 fqaiser94 commented Jan 15, 2025

Part of #744

Summary

  • Add PuffinReader

Context

  • This is the fourth of a number of PRs to add support for Iceberg Puffin file format.
  • It might be helpful to refer to the overarching PR from which these changes were split to understand better how these changes will fit in to the larger picture.
  • It may also be helpful to refer to the Java reference implementation for PuffinReader here.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fqaiser94 for this great pr, generally LGTM! Just some minor suggestions.

}

/// Returns file metadata
pub(crate) async fn file_metadata(&mut self) -> Result<&FileMetadata> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider using a LazyLock to initalize it? This way it could be thread safe, and discard the &mut modifier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to be able to get rid of the &mut here 👍
Struggling to get a LazyLock based implementation working though.
The main challenge is that FileMetadata::read is an async function :/
Open to suggestions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think async_lazy crate is a good fit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL, I implemented the lazy initialization using tokio::sync::OnceCell.

I couldn't get the async lazy crate approach to work because:

  1. It doesn't expose any APIs for fallible operations.
  2. My FileMetadata::read function needs a reference to the InputFile which doesn't work well with this approach:
let file_metadata: Lazy<Result<FileMetadata>> = Lazy::new(|| Box::pin(async { FileMetadata::read(&input_file).await }));

mismatched types
expected fn pointer `fn() -> Pin<Box<(dyn futures::Future<Output = std::result::Result<puffin::metadata::FileMetadata, error::Error>> + std::marker::Send + 'static)>>`
      found closure `{closure@crates/iceberg/src/puffin/reader.rs:36:67: 36:69}`rustc[Click for full compiler diagnostic](rust-analyzer-diagnostics-view:/diagnostic%20message%20[2]?2#file:///Users/fq/src/iceberg-rust/crates/iceberg/src/puffin/reader.rs)
reader.rs(36, 57): arguments to this function are incorrect
reader.rs(36, 107): closures can only be coerced to `fn` types if they do not capture any variables
lib.rs(73, 12): associated function defined here

crates/iceberg/src/puffin/reader.rs Outdated Show resolved Hide resolved
/// Sequence number of the Iceberg table's snapshot the blob was computed from
pub(crate) sequence_number: i64,
/// The actual blob data
pub(crate) data: Vec<u8>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add comment to explain that this is always uncompressed data?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or can we include a CompressionCodec in the Blob to indicate the compress method? (we can do this in a follow-up PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add comment to explain that this is always uncompressed data?

Done.

Or can we include a CompressionCodec in the Blob to indicate the compress method? (we can do this in a follow-up PR)

I can see where you're coming from and while I'm not completely opposed to the idea, I think I prefer to return the uncompressed data at this point. I will also note that the Java PuffinReader API does the same i.e. always returns the data after decompression.

@fqaiser94 fqaiser94 force-pushed the puffin_reader branch 2 times, most recently from 4d64762 to 2216a8b Compare January 23, 2025 01:54
@fqaiser94 fqaiser94 force-pushed the puffin_reader branch 4 times, most recently from 8b2af10 to e15baab Compare January 25, 2025 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants