s3fs 2024.3.0 fails reading glob patterns through pandas #862
Comments
This is a consequence of fsspec/filesystem_spec#1536 (cc @Skylion007), and the logic is: pandas only reads single files, but glob can return many. With the previous version, we would simply pick the first match, but now we assume single paths are not to be expanded, as that would cause ambiguity. The glob() method still functions as before, and open_files will still expand (e.g., as used by dask's read_csv).
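To make the distinction above concrete, here is a minimal sketch using fsspec's in-memory filesystem, so no S3 access is needed. The directory name `gr_demo_glob` and the file contents are made up for illustration; only `glob()` and `open_files()` are exercised, since those are the calls described as still expanding.

```python
import fsspec

# In-memory filesystem: stands in for S3 in this sketch.
fs = fsspec.filesystem("memory")
fs.pipe("/gr_demo_glob/a.csv", b"x\n1\n")
fs.pipe("/gr_demo_glob/b.csv", b"x\n2\n")

# glob() returns every matching path.
matches = fs.glob("/gr_demo_glob/*.csv")

# open_files() expands the pattern into one OpenFile per match.
files = fsspec.open_files("memory://gr_demo_glob/*.csv", mode="rb")

# The earlier behaviour of open() amounted to taking the first of these.
first = files[0]
with first as f:
    first_bytes = f.read()
```

A reader that only accepts a single file (such as pandas' `read_csv`) would therefore have seen at most the first match, never all of them.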
@Skylion007 , do you think we should fall back to the old behaviour in the case that no file is found with the exact given path, and it contains special glob characters? |
The old behavior can be manually specified by passing |
|
Ugh... maybe allowing that arg to be overridden better would be a nicer place to start? |
I don't think most users will want to add an extra kwarg to their calls (right, @calloc ?) |
I am a bit confused about the old behavior: wouldn't the previous glob pattern only return the first element if it went through fsspec.open()? |
fsspec.open returned fsspec.open_files(..)[0] |
The following fixes passing expand=False
but simply falling back if open(expand=False) fails does not work, because in the current implementation, open() always succeeds whether or not the target file exists
and only fails when this OpenFile is actually used. OTOH, we don't really want to check whether every concrete filepath passed exists, and instead assume the caller has passed reasonable parameters. So in the case that the path can be interpreted in multiple ways (i.e., it has glob characters), we need special handling. Thoughts, @Skylion007 ? |
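One way to picture the special handling being discussed is the sketch below. It is not fsspec's implementation: `resolve_single_path` is a hypothetical helper, it works on the local filesystem with the standard library only, and it assumes the desired rule is "touch the filesystem only when the path contains glob characters; prefer an exact match, otherwise fall back to the first glob match".

```python
import glob
import os

GLOB_CHARS = "*?["  # characters that make a path ambiguous


def resolve_single_path(path):
    """Hypothetical helper: pick the one path a single-file reader should open."""
    # Plain paths are trusted as-is; no existence check is performed.
    if not any(ch in path for ch in GLOB_CHARS):
        return path
    # A file whose literal name contains glob characters wins over expansion.
    if os.path.exists(path):
        return path
    # Otherwise fall back to the old behaviour: first match of the pattern.
    matches = sorted(glob.glob(path))
    return matches[0] if matches else path
```

With this rule, a missing plain path is still returned unchanged and only fails later when the file is actually opened, which matches the current lazy behaviour of `open()` described above.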
I fixed expand= in fsspec/filesystem_spec#1551 , which made release 2024.3.1, but I am still considering the right default value, depending on how many people were relying on the previous (accidental) behaviour. |
@martindurant This does feel like a trap, since pandas' CSV reader will only read one of the globbed files in this case, but I suppose supporting this use case is better than not doing so. I was thinking that we may want to refactor it a bit so expand could only be overridden in |
Following my PR, changing the default is easy, although I suppose it could be a config option rather than/as well as a kwarg. |
The intention is to read all CSV files under an S3 directory using pandas via s3fs.
We specify path as
s3://some-workspace/tmp/jr_1d02d47beccb8f7d2c2e8429bbaa3e143f90fcb7837373fb565abdade4a8c845_audit/*.csv
Reverting to the old version, s3fs==2024.2.0, works. Note: pandas==2.1.4.
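For the stated goal of reading every matching CSV, one version-independent approach is to expand the pattern explicitly and concatenate the results. The sketch below assumes `fsspec` and `pandas` are installed (plus `s3fs` and credentials for `s3://` URLs); `read_csv_glob` is a hypothetical helper name, not part of either library.

```python
import fsspec
import pandas as pd


def read_csv_glob(pattern, **read_csv_kwargs):
    """Hypothetical helper: read all files matching `pattern` into one DataFrame."""
    # open_files() expands the glob into one OpenFile per matching path.
    open_files = fsspec.open_files(pattern, mode="rb")
    if not open_files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = []
    for open_file in open_files:
        # Each OpenFile is a context manager yielding a file-like object.
        with open_file as f:
            frames.append(pd.read_csv(f, **read_csv_kwargs))
    # Stack the per-file frames and renumber the index.
    return pd.concat(frames, ignore_index=True)
```

Calling it with a pattern of the form `s3://<bucket>/<prefix>/*.csv` would return the rows of all matching files, rather than relying on how a single-path `open()` treats glob characters.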