Commit
Support for decompressive transcoding (e.g. Content-Encoding: gzip)
This fixes fsspec#461 and fsspec#233 without requiring users to change their existing code. The bugs were caused by assumptions in fsspec about the veracity and non-ambiguity of the 'size' information returned by AbstractBufferedFile subclasses like GCSFile and AbstractFileSystem subclasses like GCSFileSystem (e.g. `self.size = self.details["size"]` in `AbstractBufferedFile`, which all base caches use to truncate requests and responses). In GCS, when compression at rest (compression transcoding) is used, there is no way to retrieve the real size of the object's *content* without decompressing the whole thing, either server side or client side, so fixing these issues required overriding some behaviors of the underlying base classes. Care was taken, however, to preserve behavior for storage objects that do not use compression at rest.

This commit:

1) adds a read() implementation in GCSFile which allows calls to succeed even when size isn't well-defined.
2) adds a TranscodingReadAheadCache, which is mostly identical to the readahead cache that GCSFile already uses but allows end = None to read until the end of the file, while still handling cached data prefixes.
3) changes FileSystem _info() to set size = None if contentEncoding is gzip.

The fix keeps the data handling for non-gzip GCS files identical, while adding new control flow to detect when transcoding is done and some logic for handling those edge cases. This unfortunately meant implementing variant methods with only minor changes to how they perform underlying operations (e.g. read() in GCSFile), which were previously just inherited from AbstractBufferedFile.

It does introduce two new semantic changes though. First off, [in line with fsspec's ArchiveFileSystem](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.archive.AbstractArchiveFileSystem.info) semantics, GCSFs will return size = None when the file size cannot be fully determined in advance.
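To illustrate the end = None behavior from item 2, here is a minimal, self-contained sketch of a read-ahead cache that can read to the end of a stream of unknown length while still reusing a cached prefix. It is not the code from this commit: the class name `ReadAheadToEndCache`, the `fetcher` callable and the in-memory gzip payload are all hypothetical stand-ins.

```python
import gzip


class ReadAheadToEndCache:
    """Hypothetical sketch; not the TranscodingReadAheadCache from this commit."""

    def __init__(self, blocksize, fetcher):
        self.blocksize = blocksize
        self.fetcher = fetcher  # callable(start, end) -> bytes; end=None means "to EOF"
        self.cache = b""
        self.start = 0  # offset of cache[0] within the decompressed content

    def fetch(self, start, end):
        cache_end = self.start + len(self.cache)
        prefix = b""
        if self.start <= start < cache_end:
            if end is not None and end <= cache_end:
                # Request is fully covered by the cache.
                return self.cache[start - self.start : end - self.start]
            # Reuse the cached prefix and fetch only what is missing.
            prefix = self.cache[start - self.start :]
            start = cache_end
        # Bounded requests read one block ahead; unbounded ones stay unbounded.
        fetch_end = None if end is None else end + self.blocksize
        data = self.fetcher(start, fetch_end)
        self.cache = data
        self.start = start
        wanted = None if end is None else end - start
        return prefix + data[:wanted]


content = b"0123456789" * 10
stored = gzip.compress(content)  # stored size differs from the content length


def fetcher(start, end):
    # Stand-in for the backend: slices the decompressed bytes.
    return gzip.decompress(stored)[start:end]


cache = ReadAheadToEndCache(blocksize=8, fetcher=fetcher)
head = cache.fetch(0, 4)      # bounded read, caches 8 extra bytes
tail = cache.fetch(10, None)  # reuses cached bytes 10-11, then reads to EOF
```

The point of the sketch is that no step needs the total size: a bounded request is clipped by `end`, and an unbounded one simply returns whatever the backend yields.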
The only possible performance overhead for users who do not use compression transcoding is a single HEAD request issued before the GCSFile object is created in GCSFileSystem: the cache has to be swapped for one that tolerates the lack of a concrete file size, but at that point we do not yet have the information needed to make that control flow decision.
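The decision that this extra metadata lookup enables might look roughly like the sketch below. The helper names `info_from_metadata` and `choose_cache_type` and the `"transcoding_readahead"` label are assumptions made for illustration, not the identifiers used in this commit; the metadata dict mimics a GCS object resource, where `size` is reported as a string.

```python
def info_from_metadata(metadata):
    """Build an fsspec-style info dict from a GCS-like object metadata dict."""
    transcoded = metadata.get("contentEncoding") == "gzip"
    return {
        "name": metadata["name"],
        "type": "file",
        # The stored (compressed) size is not the content length, so report None.
        "size": None if transcoded else int(metadata["size"]),
    }


def choose_cache_type(info):
    """Pick a cache that does not rely on a known size when size is None."""
    # "transcoding_readahead" is a hypothetical label for the new cache.
    return "transcoding_readahead" if info["size"] is None else "readahead"


plain = info_from_metadata({"name": "bucket/a.txt", "size": "100"})
gzipped = info_from_metadata(
    {"name": "bucket/a.txt.gz", "size": "33", "contentEncoding": "gzip"}
)
```

Objects without a gzip contentEncoding keep their integer size and the existing readahead cache, which matches the goal of leaving non-transcoded objects untouched.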