
Add documentation for other tabular files (.xls* and .sqlite) #247

Open
goeffthomas opened this issue Oct 19, 2023 · 7 comments

Comments

@goeffthomas
Contributor

There's plenty of documentation about how to build RecordSets and Fields from a CSV via source.extract.column, but there isn't any for .xls* or .sqlite files. These also house tabular info that should be representable as RecordSets with Fields. How should those columns be extracted from the sheets (for .xls*) and tables (for .sqlite)?

@marcenacp
Contributor

From the user perspective:

  • In order to read from Excel
    • Name of the sheet. Croissant: this could be RecordSet.extract.sheet_name (new).
    • Subset of columns to use in the sheet (optional). For example, E:F if only columns E and F are of interest in the sheet. Croissant: this could be RecordSet.extract.column (already exists).
  • In order to read from sqlite
    • Link to the database to connect to. Croissant: this could be FileObject.contentUrl (already exists) with "encodingFormat": "application/x-sqlite3" (new).
    • SQL command to execute. Croissant: this could be RecordSet.extract.sql (new).

@goeffthomas Do you see more parameters?

@benjelloun What do you think of the proposed Croissant equivalents for each parameter?
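To make the proposal concrete, here is a rough sketch in Python, with dicts standing in for the Croissant JSON-LD. The `sheet_name` and `sql` keys are the proposed additions, and all names and values are hypothetical:

```python
# Hypothetical sketch of the proposed extract parameters, written as
# Python dicts standing in for Croissant JSON-LD. `sheet_name` and `sql`
# are NEW keys from this proposal; `column`, `contentUrl`, and
# `encodingFormat` already exist in the spec.

excel_source = {
    "fileObject": {"@id": "sales-xlsx"},
    "extract": {
        "sheet_name": "Q3",  # proposed: which sheet to read
        "column": "E:F",     # existing key, here used as a column range
    },
}

sqlite_file_object = {
    "@type": "cr:FileObject",
    "@id": "db",
    "contentUrl": "data/dataset.sqlite",
    "encodingFormat": "application/x-sqlite3",  # proposed value
}
sqlite_source = {
    "fileObject": {"@id": "db"},
    "extract": {"sql": "SELECT name, price FROM products"},  # proposed
}

print(excel_source["extract"]["sheet_name"])  # Q3
```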

@goeffthomas
Contributor Author

goeffthomas commented Nov 1, 2023

Name of the sheet. Croissant: this could be RecordSet.extract.sheet_name (new).

I wonder if we could use less Excel-specific language? For example, would RecordSet.extract.table be more general while achieving the same purpose? In my mind, Excel is just a file that supports multiple tables, but maybe that's oversimplifying. Maybe the sqlite handling would benefit from this as well in some way?

For example, E:F if only columns E and F are of interest in the sheet

This prompted a couple thoughts:

  1. I thought extract was at the Field level. If not, and you can use this at the RecordSet level, would you forego documenting Fields because you're gathering many columns in a single extraction?
  2. Adding targeting by column lettering is pretty handy, especially if there's no header. That got me thinking: how does our current handling of CSVs account for data without headers? Can you specify a column by its 0-based index?
  3. I'd also be okay just not implementing this kind of Excel-specific targeting unless it's asked for later, so basically just treat each sheet like its own CSV
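On point 2, a point of comparison: pandas handles headerless CSVs by assigning 0-based integer column labels, so addressing a column by index is natural. A minimal illustration (the data is made up):

```python
import io

import pandas as pd

# A headerless CSV: with header=None, pandas assigns 0-based integer
# column labels instead of treating the first row as a header.
csv_data = io.StringIO("alice,34\nbob,29\n")
df = pd.read_csv(csv_data, header=None)

print(list(df.columns))  # [0, 1]
print(df[1].tolist())    # [34, 29]
```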

with "encodingFormat": "application/x-sqlite3"

Could we look for both formats? I'm a bit biased here because I'll be going from file extension -> MIME type and won't be using the deprecated x-sqlite3. Context: https://stackoverflow.com/a/55071463

SQL command to execute. Croissant: this could be RecordSet.extract.sql

Similar to some of the themes in the Excel comments above:

  1. What about a RecordSet.extract.command as a more generic term that might apply to other formats?
  2. What are your thoughts on supporting at the Field level? So, similar to Excel if we used a more generic extract.table, you could use extract.table and extract.column in tandem. Maybe a SQL command could be constructed from this if the RecordSet.fields all have the same extract.table.
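The second idea can be sketched as a small helper: derive one SQL command from Field-level extract metadata, assuming all fields of a RecordSet share the same extract.table. Both the `extract.table` key and the helper are hypothetical, not part of the spec:

```python
# Hypothetical: build a SQL command from Field-level extract metadata,
# assuming every field shares the same (proposed) extract.table.

def build_sql(fields):
    tables = {f["extract"]["table"] for f in fields}
    if len(tables) != 1:
        raise ValueError("all fields must share the same extract.table")
    columns = ", ".join(f["extract"]["column"] for f in fields)
    return f"SELECT {columns} FROM {tables.pop()}"

fields = [
    {"extract": {"table": "products", "column": "name"}},
    {"extract": {"table": "products", "column": "price"}},
]
print(build_sql(fields))  # SELECT name, price FROM products
```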

@goeffthomas
Contributor Author

Could we look for both formats? I'm a bit biased here because I'll be going from file extension -> MIME type and won't be using the deprecated x-sqlite3. Context: https://stackoverflow.com/a/55071463

Actually, I've just uncovered that our implementation of what becomes encodingFormat doesn't account for .sqlite files. All of them end up as the default application/octet-stream. Though it may be nicer to key on encodingFormat alone for determining how to handle the file, we may want a fallback that checks by file extension when it's application/octet-stream.
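The extension fallback could look something like this sketch. The extension map and function are assumptions for illustration, not existing mlcroissant behavior (note the IANA-registered type is application/vnd.sqlite3):

```python
import mimetypes
import pathlib

# Hypothetical fallback: when the declared encodingFormat is the generic
# application/octet-stream (or missing), consult the file extension.
# This extension map is an assumption, not part of any spec.
_EXTENSION_FORMATS = {
    ".sqlite": "application/vnd.sqlite3",   # IANA-registered type
    ".sqlite3": "application/vnd.sqlite3",
    ".db": "application/vnd.sqlite3",
}

def resolve_encoding_format(path, declared=None):
    if declared and declared != "application/octet-stream":
        return declared
    ext = pathlib.Path(path).suffix.lower()
    if ext in _EXTENSION_FORMATS:
        return _EXTENSION_FORMATS[ext]
    guessed, _ = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"

print(resolve_encoding_format("data.sqlite"))  # application/vnd.sqlite3
```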

@benjelloun
Contributor

@goeffthomas Can you say more on why you need this for the Kaggle implementation? My naive assumption was that all tabular data files are loaded and converted into a homogeneous representation on the Kaggle side, and so you can provide access to all of them through Croissant in a common format, such as CSV. Is that not the case?

@goeffthomas
Contributor Author

My naive assumption was that all tabular data files are loaded and converted into a homogeneous representation on the Kaggle side, and so you can provide access to all of them through Croissant in a common format, such as CSV. Is that not the case?

@benjelloun No, that's not the case. Our Data Explorer presents all of these tabular files in a viewer that may make it seem that way. But all of the files remain intact as xlsx or sqlite in the dataset.

@goeffthomas
Contributor Author

@benjelloun Could we try to get this into the next Croissant version?

@benjelloun
Contributor

@goeffthomas That's a good idea. The directions discussed here make sense to me overall. Do you and @marcenacp want to put together a short proposal for a spec change, and we can use that to iron out the details?

A couple comments on the earlier discussion:

  • There is a need to specify how to describe a "database connection" in a way that works for SQLite and other systems as well. Hopefully we can find a standard approach to adopt.
  • We need a mechanism to deal with multiple tabs in a spreadsheet.
  • We need a mechanism to address nameless columns, e.g., using numbering. We could change the current extract column to be column_name or column_number.
  • Fancy: Support SQL queries for extraction. This can be useful, but it means we need a way to evaluate that query (e.g., by sending it to a database server).

goeffthomas added a commit to Kaggle/kagglehub that referenced this issue Dec 11, 2024
While testing, I discovered some major limitations of depending on `mlcroissant`. Namely:
- The spec does not currently support sqlite or Excel-like files, which means we can't load data from those file types: mlcommons/croissant#247
- Our current implementation of Croissant doesn't support parquet files because we don't analyze the schema of parquet files (yet). Without that, we can't generate the `RecordSet` information. So parquet is also unusable purely with Croissant
- Our implementation of Croissant directs users to download the archive first, before pathing into the various tables within. This means that interacting with any table in a dataset requires downloading the entire dataset.

The real benefit of `mlcroissant` was that it handled all the file type parsing and reading for us. But since it's incomplete, we can do that ourselves with pandas.
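The "do it ourselves with pandas" approach can be sketched as a dispatch on file extension. This is an illustration under assumptions, not the actual kagglehub implementation, and the file/table names are hypothetical:

```python
import sqlite3

import pandas as pd

# Sketch: load one table from a dataset file, dispatching on extension.
# (Assumed helper for illustration; not the kagglehub implementation.)
def load_table(path, table=None):
    if path.endswith((".xls", ".xlsx")):
        # Each sheet behaves like its own CSV.
        return pd.read_excel(path, sheet_name=table)
    if path.endswith((".sqlite", ".sqlite3", ".db")):
        conn = sqlite3.connect(path)
        try:
            return pd.read_sql_query(f"SELECT * FROM {table}", conn)
        finally:
            conn.close()
    if path.endswith(".parquet"):
        return pd.read_parquet(path)
    return pd.read_csv(path)
```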