File format not detected for some files #139
Comments
There are just bound to be type detection issues with plain text; it's more or less impossible to do reliably. The module needs to be able to handle things whether the detection comes back as text/plain, text/html, or whatever.
The module understands only
Sorry, that's just a bit of confusion: the issue isn't about allowing XLSX as input. His actual problem file was a .csv which had a column containing HTML data. Apparently there was enough HTML to cause the detected MIME type to be text/html. He converted it to XLSX for the purpose of the issue just because GitHub doesn't allow upload of CSV, apparently.
@kalbers, you should probably be able to upload the actual file you used as a Gist and link to it that way, I'd imagine.
John's correct, I only uploaded it as an XLSX because GitHub wouldn't allow the CSV file. If you don't want to generate the CSV file from the XLSX, I've uploaded it here: https://gist.github.com/kalbers/c6ddea80ca492dd0d382f8e410430d74#file-brokencsv-csv
The solution here would seem to be to map unknown text types to the "csv" source... we could enumerate some extras in the config file but I'm sure there are corner cases beyond "too much HTML" where you can get text/plain or other formats.
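To make that idea concrete, here is a minimal sketch of such a fallback. It is written in Python for illustration and is not the module's actual code: the `KNOWN_SOURCES` table, the source names, and `resolve_source` are all hypothetical.

```python
# Hypothetical mapping from detected media type to an import source name.
# These entries are illustrative, not the module's actual configuration.
KNOWN_SOURCES = {
    "text/csv": "csv",
    "text/tab-separated-values": "tsv",
    "application/vnd.oasis.opendocument.spreadsheet": "ods",
}


def resolve_source(media_type):
    """Return a source name for a detected media type, or None if unsupported."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    base = media_type.split(";", 1)[0].strip().lower()
    if base in KNOWN_SOURCES:
        return KNOWN_SOURCES[base]
    # Assumption: any other text/* type (text/plain, text/html, ...) is CSV.
    if base.startswith("text/"):
        return "csv"
    return None
```

With this shape, a CSV whose content is detected as text/html still resolves to the "csv" source, while non-text types stay unsupported.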
I have a slight concern about the TSV mapping as well (that we could mis-autodetect a CSV, or a non-standards-compliant TSV, as the tab-separated-values type, which eliminates the enclosure settings and so on), but I don't know if that's a realistic real-world problem.
For now I'm taking the simple route of just adding text/html alongside text/plain as a type where we'll detect based on the extension. I'd still like to remove the TSV mapping but we're not currently exposing "escape character" in that form so it would be slightly different... so I'm leaving that alone for the time being.
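Roughly, the extension-based route looks like the sketch below. This is an illustration in Python under stated assumptions, not the change that was actually committed: `AMBIGUOUS_TYPES`, `EXTENSION_TYPES`, and `effective_media_type` are hypothetical names, and the extension table is only an example.

```python
import os

# Detected types considered too unreliable to trust on their own;
# for these, the file extension decides instead.
AMBIGUOUS_TYPES = {"text/plain", "text/html"}

# Hypothetical extension-to-type table, for illustration only.
EXTENSION_TYPES = {
    ".csv": "text/csv",
    ".tsv": "text/tab-separated-values",
    ".ods": "application/vnd.oasis.opendocument.spreadsheet",
}


def effective_media_type(detected_type, filename):
    """Prefer the file extension when the detected type is ambiguous."""
    if detected_type not in AMBIGUOUS_TYPES:
        # Specific detections are trusted as-is.
        return detected_type
    extension = os.path.splitext(filename)[1].lower()
    # Keep the detected type if the extension is not in the table.
    return EXTENSION_TYPES.get(extension, detected_type)
```

A file named brokenCSV.csv that is detected as text/html would be treated as text/csv here, while a detection that is already specific is left unchanged.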
On the same subject, CSVImport does not recognize a CSV file. It is a simple CSV file (with a .csv extension), but it was produced from an .xlsx with LibreOffice. The import works when the same file is saved as .ods. CSVImport says that it doesn't recognize the format.
Do you know what it is being detected as? Or, can you share the file? The problem is probably down to something about the specific content of the CSV looking like "something else."
I have sent you the problematic csv file by email.
For example, a file containing HTML is detected as text/html. See attached.
brokenCSV.xlsx