
Add documentation for other tabular files (.xls* and .sqlite) #247

Open
goeffthomas opened this issue Oct 19, 2023 · 7 comments

Comments

@goeffthomas
Contributor

There's plenty of documentation about how to build RecordSets and Fields from a CSV via source.extract.column, but there isn't any for .xls* or .sqlite files. These also house tabular info that should be representable as RecordSets with Fields. How should those columns be extracted from the sheets (for .xls*) and tables (for .sqlite)?

@marcenacp
Contributor

From the user perspective:

  • In order to read from Excel
    • Name of the sheet. Croissant: this could be RecordSet.extract.sheet_name (new).
    • Subset of columns to use in the sheet (optional). For example, E:F if only columns E and F are of interest in the sheet. Croissant: this could be RecordSet.extract.column (already exists).
  • In order to read from sqlite
    • Link to the database to connect to. Croissant: this could be FileObject.contentUrl (already exists) with "encodingFormat": "application/x-sqlite3" (new).
    • SQL command to execute. Croissant: this could be RecordSet.extract.sql (new).

@goeffthomas Do you see more parameters?

@benjelloun What do you think of the proposed Croissant equivalents for each parameter?
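To make the proposal concrete, here is a rough sketch in Python, with dicts standing in for the Croissant JSON-LD. The `sheet_name` and `sql` keys are the proposed additions, and all names and values are hypothetical:

```python
# Hypothetical sketch of the proposed extract parameters, written as
# Python dicts standing in for Croissant JSON-LD. `sheet_name` and `sql`
# are NEW keys from this proposal; `column`, `contentUrl`, and
# `encodingFormat` already exist in the spec.

excel_source = {
    "fileObject": {"@id": "sales-xlsx"},
    "extract": {
        "sheet_name": "Q3",  # proposed: which sheet to read
        "column": "E:F",     # existing key, here used as a column range
    },
}

sqlite_file_object = {
    "@type": "cr:FileObject",
    "@id": "db",
    "contentUrl": "data/dataset.sqlite",
    "encodingFormat": "application/x-sqlite3",  # proposed value
}
sqlite_source = {
    "fileObject": {"@id": "db"},
    "extract": {"sql": "SELECT name, price FROM products"},  # proposed
}

print(excel_source["extract"]["sheet_name"])  # Q3
```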

@goeffthomas
Contributor Author

goeffthomas commented Nov 1, 2023

Name of the sheet. Croissant: this could be RecordSet.extract.sheet_name (new).

I wonder if we could use less Excel-specific language? For example, would RecordSet.extract.table be more general while achieving the same purpose? In my mind, Excel is just a file that supports multiple tables, but maybe that's oversimplifying. Maybe the sqlite handling would benefit from this as well in some way?

For example, E:F if only columns E and F are of interest in the sheet

This prompted a couple thoughts:

  1. I thought extract was at the Field level. If not, and you can use this at the RecordSet level, would you forego documenting Fields because you're gathering many columns in a single extraction?
  2. Adding targeting by column lettering is pretty handy, especially if there's no header. That got me thinking: how does our current handling of CSVs account for data without headers? Can you specify a column by its 0-based index?
  3. I'd also be okay just not implementing this kind of Excel-specific targeting unless it's asked for later, so basically just treat each sheet like its own CSV
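On point 2, a point of comparison: pandas handles headerless CSVs by assigning 0-based integer column labels, so addressing a column by index is natural. A minimal illustration (the data is made up):

```python
import io

import pandas as pd

# A headerless CSV: with header=None, pandas assigns 0-based integer
# column labels instead of treating the first row as a header.
csv_data = io.StringIO("alice,34\nbob,29\n")
df = pd.read_csv(csv_data, header=None)

print(list(df.columns))  # [0, 1]
print(df[1].tolist())    # [34, 29]
```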

with "encodingFormat": "application/x-sqlite3"

Could we look for both formats? I'm a bit biased here because I'll be going from file extension -> MIME type and won't be using the deprecated x-sqlite3. Context: https://stackoverflow.com/a/55071463

SQL command to execute. Croissant: this could be RecordSet.extract.sql

Similar to some of the themes in the Excel comments above:

  1. What about a RecordSet.extract.command as a more generic term that might apply to other formats?
  2. What are your thoughts on supporting at the Field level? So, similar to Excel if we used a more generic extract.table, you could use extract.table and extract.column in tandem. Maybe a SQL command could be constructed from this if the RecordSet.fields all have the same extract.table.
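The second idea can be sketched as a small helper: derive one SQL command from Field-level extract metadata, assuming all fields of a RecordSet share the same extract.table. Both the `extract.table` key and the helper are hypothetical, not part of the spec:

```python
# Hypothetical: build a SQL command from Field-level extract metadata,
# assuming every field shares the same (proposed) extract.table.

def build_sql(fields):
    tables = {f["extract"]["table"] for f in fields}
    if len(tables) != 1:
        raise ValueError("all fields must share the same extract.table")
    columns = ", ".join(f["extract"]["column"] for f in fields)
    return f"SELECT {columns} FROM {tables.pop()}"

fields = [
    {"extract": {"table": "products", "column": "name"}},
    {"extract": {"table": "products", "column": "price"}},
]
print(build_sql(fields))  # SELECT name, price FROM products
```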

@goeffthomas
Contributor Author

Could we look for both formats? I'm a bit biased here because I'll be going from file extension -> MIME type and won't be using the deprecated x-sqlite3. Context: https://stackoverflow.com/a/55071463

Actually, I've just uncovered that our implementation of what becomes encodingFormat doesn't account for .sqlite files. All of them end up as the default application/octet-stream. Though it may be nicer to key on encodingFormat alone for determining how to handle the file, we may want a fallback that checks by file extension when it's application/octet-stream.
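The extension fallback could look something like this sketch. The extension map and function are assumptions for illustration, not existing mlcroissant behavior (note the IANA-registered type is application/vnd.sqlite3):

```python
import mimetypes
import pathlib

# Hypothetical fallback: when the declared encodingFormat is the generic
# application/octet-stream (or missing), consult the file extension.
# This extension map is an assumption, not part of any spec.
_EXTENSION_FORMATS = {
    ".sqlite": "application/vnd.sqlite3",   # IANA-registered type
    ".sqlite3": "application/vnd.sqlite3",
    ".db": "application/vnd.sqlite3",
}

def resolve_encoding_format(path, declared=None):
    if declared and declared != "application/octet-stream":
        return declared
    ext = pathlib.Path(path).suffix.lower()
    if ext in _EXTENSION_FORMATS:
        return _EXTENSION_FORMATS[ext]
    guessed, _ = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"

print(resolve_encoding_format("data.sqlite"))  # application/vnd.sqlite3
```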

@benjelloun
Contributor

@goeffthomas Can you say more on why you need this for the Kaggle implementation? My naive assumption was that all tabular data files are loaded and converted into a homogeneous representation on the Kaggle side, and so you can provide access to all of them through Croissant in a common format, such as CSV. Is that not the case?

@goeffthomas
Contributor Author

My naive assumption was that all tabular data files are loaded and converted into a homogeneous representation on the Kaggle side, and so you can provide access to all of them through Croissant in a common format, such as CSV. Is that not the case?

@benjelloun No, that's not the case. Our Data Explorer presents all of these tabular files in a viewer that may make it seem that way. But all of the files remain intact as xlsx or sqlite in the dataset.

@goeffthomas
Contributor Author

@benjelloun Could we try to get this into the next Croissant version?

@benjelloun
Contributor

@goeffthomas That's a good idea. The directions discussed here make sense to me overall. Do you and @marcenacp want to put together a short proposal for a spec change, and we can use that to iron out the details?

A couple comments on the earlier discussion:

  • There is a need to specify how to describe a "database connection" in a way that works for SQLite and other systems as well. Hopefully we can find a standard approach to adopt.
  • We need a mechanism to deal with multiple tabs in a spreadsheet.
  • We need a mechanism to address nameless columns, e.g., using numbering. We could change the current extract column to be column_name or column_number.
  • Fancy: Support SQL queries for extraction. This can be useful, but it means we need a way to evaluate that query (e.g., by sending it to a database server).

goeffthomas added a commit to Kaggle/kagglehub that referenced this issue Dec 11, 2024
While testing, I discovered some major limitations of depending on `mlcroissant`. Namely:
- The spec does not currently support sqlite or Excel-like files, which means we can't load data from those file types: mlcommons/croissant#247
- Our current implementation of Croissant doesn't support parquet files because we don't analyze the schema of parquet files (yet). Without that, we can't generate the `RecordSet` information. So parquet is also unusable purely with Croissant
- Our implementation of Croissant directs users to download the archive first, before pathing into the various tables within. This means that interacting with any table in a dataset requires downloading the entire dataset.

The real benefit of `mlcroissant` was that it handled all the file type parsing and reading for us. But since it's incomplete, we can do that ourselves with pandas.
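The "do it ourselves with pandas" approach can be sketched as a dispatch on file extension. This is an illustration under assumptions, not the actual kagglehub implementation, and the file/table names are hypothetical:

```python
import sqlite3

import pandas as pd

# Sketch: load one table from a dataset file, dispatching on extension.
# (Assumed helper for illustration; not the kagglehub implementation.)
def load_table(path, table=None):
    if path.endswith((".xls", ".xlsx")):
        # Each sheet behaves like its own CSV.
        return pd.read_excel(path, sheet_name=table)
    if path.endswith((".sqlite", ".sqlite3", ".db")):
        conn = sqlite3.connect(path)
        try:
            return pd.read_sql_query(f"SELECT * FROM {table}", conn)
        finally:
            conn.close()
    if path.endswith(".parquet"):
        return pd.read_parquet(path)
    return pd.read_csv(path)
```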