Add documentation for other tabular files (`.xls*` and `.sqlite`) #247
Comments
From the user perspective:
@goeffthomas Do you see more parameters? @benjelloun What do you think of the proposed Croissant equivalents for each parameter?
I wonder if we could use less Excel-specific language? Like, would
This prompted a couple thoughts:
Could we look for both formats? I'm a bit biased here because I'll be going from file extension -> MIME type and won't be using the deprecated
Similar to some of the themes from the Excel comments above:
Actually, I've just uncovered that our implementation of what becomes
@goeffthomas Can you say more on why you need this for the Kaggle implementation? My naive assumption was that all tabular data files are loaded and converted into a homogeneous representation on the Kaggle side, and so you can provide access to all of them through Croissant in a common format, such as CSV. Is that not the case?
@benjelloun No, that's not the case. Our Data Explorer presents all of these tabular files in a viewer that may make it seem that way. But all of the files remain intact as xlsx or sqlite in the dataset.
@benjelloun Could we try to get this into the next Croissant version?
@goeffthomas That's a good idea. The directions discussed here make sense to me overall. Do you and @marcenacp want to put together a short proposal for a spec change, and we can use that to iron out the details? A couple comments on the earlier discussion:
While testing, I discovered some major limitations to a dependency on `mlcroissant`. Namely:
- The spec does not currently support sqlite or Excel-like files, which means we can't load data from those file types: mlcommons/croissant#247
- Our current implementation of Croissant doesn't support parquet files because we don't analyze the schema of parquet files (yet). Without that, we can't generate the `RecordSet` information. So parquet is also unusable purely with Croissant.
- Our implementation of Croissant directs users to download the archive first, before pathing into the various tables within. This means that interacting with any table in a dataset requires downloading the entire dataset.

The real benefit of `mlcroissant` was that it handled all the file type parsing and reading for us. But since it's incomplete, we can do that ourselves with pandas.
There's plenty of documentation about how to build `RecordSet`s and `Field`s from a CSV via `source.extract.column`, but there isn't any for `.xls*` or `.sqlite` files. These also house tabular info that should be representable as `RecordSet`s with `Field`s. How should those columns be extracted from the sheets (for `.xls*`) and tables (for `.sqlite`)?
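
To make the `.sqlite` half of the question concrete, here is a minimal sketch of the information a tool would need before it could describe each table as a `RecordSet` with one `Field` per column. It uses only Python's standard `sqlite3` module. The helper name `describe_sqlite_tables` and the demo tables are hypothetical, and nothing here is part of the Croissant spec or of `mlcroissant`; how these names would map onto `source.extract` is exactly what this issue leaves open.

```python
import sqlite3


def describe_sqlite_tables(conn):
    """Map each user table to its ordered list of (column_name, declared_type).

    Hypothetical helper for illustration only; not a Croissant or mlcroissant API.
    """
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in names:
        # Skip SQLite's internal bookkeeping tables (e.g. sqlite_sequence).
        if table.startswith("sqlite_"):
            continue
        # Quote the identifier so unusual table names do not break the PRAGMA.
        quoted = '"' + table.replace('"', '""') + '"'
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


# Demo on an in-memory database with two made-up tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (id INTEGER PRIMARY KEY, name TEXT, fare REAL)")
conn.execute("CREATE TABLE ports (code TEXT, city TEXT)")
schema = describe_sqlite_tables(conn)
conn.close()
```

Under this sketch, each key of `schema` would be a candidate `RecordSet` and each `(column_name, declared_type)` pair a candidate `Field`. A sheet-based equivalent for `.xls*` files would need a third-party reader, so it is not shown here.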