[BUG] Reading CSV file with "" field causes rows to not be read #8926
Comments
cudf sees the 2 x double quotes (`""`) as a way to escape a single double quote. Possibly related issue: rapidsai/cudf#12145
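To make that escaping rule concrete, here is a minimal sketch (the file path, session setup, and the `escape` option are my assumptions, not from the thread) of how Spark's CPU reader treats a doubled quote inside a quoted field:

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Per RFC 4180, "" inside a quoted field stands for one literal ".
// The field "a""b" should therefore parse to the value: a"b
Files.write(Paths.get("/tmp/escape.csv"), "\"a\"\"b\"\n".getBytes)

spark.read
  .option("escape", "\"")  // treat "" as an escaped quote (Spark's default escape is \)
  .csv("/tmp/escape.csv")
  .show()                  // expect one row with the value: a"b
```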
@tgravescs can you please share the steps to repro the issue? I'm only seeing nulls returned for the fields containing only `""`.
For me, with Spark, I just read a file with the contents in the description and it comes back wrong. I did not create a cuDF reproduce case. I just tested this again, and there might be multiple things going on, one of them being me messing up. The original user query looked something like: When I was testing to get the small repro case, I left off the delimiter specification. CPU:
GPU:
Going back to specifying the delimiter actually doesn't show the same problem with my reproduce case.
CPU:
Now, if I go back to the actual customer data using the delimiter option, it still reports the wrong number of rows.
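For reference, the two read variants being compared probably looked something like this; this is a sketch only, since the actual queries were not captured (path and session setup are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Small repro case: delimiter left off, so Spark assumes comma-separated.
spark.read.csv("/tmp/repro.txt").count()

// Customer-style query: delimiter explicitly specified as tab.
spark.read.option("sep", "\t").csv("/tmp/repro.txt").count()
```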
So to summarize:
- With no delimiter specified, my small repro case comes back wrong on the GPU.
- With the delimiter specified, my small repro case matches the CPU.
- With the delimiter specified, the actual customer data still reports the wrong number of rows on the GPU.
OK, so I was able to narrow down the customer data to the following to reproduce it. Note there are tabs between `27` and `"foo""` and between `2` and `"bar"`.
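A reconstruction of that narrowed-down data based on the note above (the original snippet did not survive the page scrape, so the exact contents and file name are assumptions):

```scala
import java.nio.file.{Files, Paths}

// Two tab-delimited rows; the first row's second field ends with a doubled quote.
val rows = "27\t\"foo\"\"\n" + // 27 <TAB> "foo""
           "2\t\"bar\"\n"      // 2  <TAB> "bar"
Files.write(Paths.get("/tmp/narrowed.tsv"), rows.getBytes)
```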
Bobby pointed out this looks like #6435 as well.
Note the second issue is with the customer data, where the delimiter is specified as tab and the reproduce data contains the `"foo""` field shown above.
This is a dupe of #6435, and the cuDF issue rapidsai/cudf#11948 is to fix it.
I suspect that I had trouble reproing this because the issue only reproes when a field has other characters besides the `""`.
Describe the bug
Doing a count() after reading a CSV file with the plugin reports fewer rows than the CPU.
After investigating, it looks like a field containing just quotes (`""`) is not read properly.
In this case the file was tab-delimited and some rows have entries with `""`, similar to:
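A minimal sketch of such a file and the count comparison (the actual data was not captured, so the contents, path, and options here are assumptions):

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Two tab-delimited rows; the first row's second field is just "".
Files.write(Paths.get("/tmp/bug.tsv"), "a\t\"\"\nb\tc\n".getBytes)

// The CPU reader returns 2 rows; with the plugin enabled, the GPU
// reportedly returns 1.
spark.read.option("sep", "\t").csv("/tmp/bug.tsv").count()
```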
Doing a count() on that on the GPU returns 1 instead of 2.