File format not detected for some files #139

Open
kalbers opened this issue Jul 5, 2018 · 11 comments

Comments

@kalbers

kalbers commented Jul 5, 2018

For example, a file containing HTML is detected as text/html. See attached.

brokenCSV.xlsx

@zerocrates
Contributor

There are bound to be type detection issues with plain text; it's more or less impossible to do reliably.

The module needs to be able to handle things if the detection comes back as text/plain, text/html, or whatever.

@Daniel-KM
Contributor

The module understands only .csv, .tsv, and .ods. .xlsx is a strange format that duplicates .ods, so it should not exist. .ods is also supported by most proprietary software, and you can even set it as the default format.

@zerocrates
Contributor

zerocrates commented Jul 5, 2018

Sorry, that's just a bit of confusion: the issue isn't about allowing XLSX as input.

His actual problem file was a .csv which had a column containing HTML data. Apparently there was enough HTML to cause the detected mime type to be text/html.

He converted it to XLSX for the purpose of the issue only because GitHub apparently doesn't allow CSV uploads.

@zerocrates
Contributor

@kalbers, you should probably be able to upload the actual file you used as a Gist and link to it that way, I'd imagine.

@kalbers
Author

kalbers commented Jul 6, 2018

John's correct: I only uploaded it as an XLSX because GitHub wouldn't allow the CSV file. If you don't want to generate the CSV file from the XLSX, I've uploaded it here: https://gist.github.com/kalbers/c6ddea80ca492dd0d382f8e410430d74#file-brokencsv-csv

@zerocrates
Contributor

The solution here would seem to be to map unknown text types to the "csv" source... we could enumerate some extras in the config file but I'm sure there are corner cases beyond "too much HTML" where you can get text/plain or other formats.
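
To make the idea concrete, here is a minimal sketch of that kind of catch-all mapping. It is written in Python purely for illustration and is not the module's code; `KNOWN_SOURCES` and `resolve_source` are hypothetical names, and the table contents are assumptions.

```python
# Hypothetical media-type-to-source table; the keys and values are
# illustrative, not the module's actual configuration.
KNOWN_SOURCES = {
    "text/csv": "csv",
    "text/tab-separated-values": "tsv",
    "application/vnd.oasis.opendocument.spreadsheet": "ods",
}


def resolve_source(media_type):
    """Return a source name for a detected media type, or None if unsupported."""
    media_type = media_type.strip().lower()
    if media_type in KNOWN_SOURCES:
        return KNOWN_SOURCES[media_type]
    # Any other text/* result (text/plain, text/html, ...) is treated as CSV,
    # because content sniffing on plain text is unreliable.
    if media_type.startswith("text/"):
        return "csv"
    return None
```

With this shape, no extra text types need to be enumerated in the config: anything detected as text but not explicitly listed lands on the "csv" source, and non-text types are still rejected.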

@zerocrates
Contributor

I have a slight concern about the TSV mapping as well: we could mis-autodetect a CSV, or a non-standards-compliant TSV, as the tab-separated-values type, which eliminates the enclosure settings and so on. I don't know if that's a realistic real-world problem, though.

@zerocrates
Contributor

For now I'm taking the simple route of just adding text/html alongside text/plain as a type for which we'll detect the format based on the file extension.
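
Roughly, the extension fallback could look like the sketch below. It is Python for illustration only, not the actual change; `AMBIGUOUS_TYPES`, `EXTENSION_TYPES`, and `effective_media_type` are hypothetical names, and the extension table is an assumption.

```python
import os

# Detected types considered too ambiguous to trust (the two named in this thread).
AMBIGUOUS_TYPES = {"text/plain", "text/html"}

# Hypothetical extension table used only for the fallback.
EXTENSION_TYPES = {
    ".csv": "text/csv",
    ".tsv": "text/tab-separated-values",
    ".ods": "application/vnd.oasis.opendocument.spreadsheet",
}


def effective_media_type(detected_type, filename):
    """Trust the detected type unless it is ambiguous; then use the extension."""
    if detected_type not in AMBIGUOUS_TYPES:
        return detected_type
    extension = os.path.splitext(filename)[1].lower()
    # Unknown extensions keep the detected type so the caller can still reject them.
    return EXTENSION_TYPES.get(extension, detected_type)
```

A .csv file whose content sniffs as text/html would then be handled as text/csv, while a type that was detected unambiguously is left alone.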

I'd still like to remove the TSV mapping but we're not currently exposing "escape character" in that form so it would be slightly different... so I'm leaving that alone for the time being.

@pprw

pprw commented Sep 8, 2019

On the same subject, CSVImport does not recognize a CSV file. It is a simple CSV file (with a .csv extension), but it was produced from an .xlsx with LibreOffice. The import works when the same file is saved as .ods. CSVImport says that it doesn't recognize the format.

@zerocrates
Contributor

Do you know what it is being detected as? Or, can you share the file? The problem is probably down to something about the specific content of the CSV looking like "something else."

@pprw

pprw commented Sep 10, 2019

I have sent you the problematic csv file by email.
