Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: refine filetype detection #3828

Merged
merged 3 commits into from
Dec 17, 2024
Merged

Conversation

scanny
Copy link
Collaborator

@scanny scanny commented Dec 14, 2024

Summary
Fixes a bug where a CSV file with asserted content-type application/vnd.ms-excel was incorrectly identified as an XLS file and failed partitioning.

Additional Context
The content_type argument to partitioning is often authored by the client system (e.g. Unstructured SDK) and is both unreliable and outside the control of the user. In this case the .csv -> XLS mapping is correct for certain purposes (Excel is often used to load and edit CSV files) but not for partitioning, and the user has no readily available way to override the mapping.

XLS files as well as seven other common binary file types can be efficiently detected 100% of the time (at least 99.999%) using code we already have in the file detector.

@cragwolfe
Copy link
Contributor

worth adding a test for the specific bug, "CSV file with asserted content-type application/vnd.ms-excel was incorrectly identified as an XLS file"?

@scanny scanny force-pushed the scanny/fix-csv-filetype-detection branch 2 times, most recently from 88cd3b0 to 2f1f68c Compare December 15, 2024 00:57
@scanny
Copy link
Collaborator Author

scanny commented Dec 15, 2024

worth adding a test for the specific bug, "CSV file with asserted content-type application/vnd.ms-excel was incorrectly identified as an XLS file"?

@cragwolfe Oh yes, absolutely, that was the first one I added :) it's this one, named a little differently as test_it_ignores_asserted_XLS_content_type_when_file_is_CSV():
2d15fea#diff-d88874605ea6403a7c07fd5ebb3bf5d795888e7fb2ecd90f15e60ef2300901a5R364

Line 364 in ‎test_unstructured/file_utils/test_filetype.py if the link doesn't work quite right.

The eight file-types based on CFB or ZIP compound files can be detected
with 100% accuracy (or at least 99.999%). Try this strategy first,
ignoring the unreliable content-type for these file-types.

This fixes a problem where a CSV file with an asserted XLS file-type was
mistakenly typed as XLS and failed partitioning.
@scanny scanny force-pushed the scanny/fix-csv-filetype-detection branch from 60fb87c to b7a0b3b Compare December 17, 2024 00:23
@scanny scanny enabled auto-merge December 17, 2024 00:26
@scanny scanny added this pull request to the merge queue Dec 17, 2024
Merged via the queue into main with commit b5ff79d Dec 17, 2024
41 checks passed
@scanny scanny deleted the scanny/fix-csv-filetype-detection branch December 17, 2024 01:43
Copy link

sentry-io bot commented Dec 23, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ _ArrayMemoryError: Unable to allocate 23.9 GiB for an array with shape (80032, 80032) and data type float32 /general/v0/general View Issue

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants