
[BUG] Reading CSV file with "" field causes rows to not be read #8926

Open
tgravescs opened this issue Aug 3, 2023 · 7 comments
Labels: bug (Something isn't working)

@tgravescs
Collaborator

Describe the bug
Doing a count() after reading a CSV file with the plugin reports fewer rows than the CPU.

After investigating, it looks like a field containing just double quotes ("") is not being read properly.

In this case the file was tab-delimited and some rows have entries with "" similar to:

row1    10      ""
rows2   11      ""

Doing a count on that on the GPU returns 1 instead of 2.
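
A minimal sketch of the repro shape in spark-shell (the file path and the write step below are mine, just so there is something to read back, not the customer data):

import java.nio.file.{Files, Paths}

// write a tab-delimited file whose last field is just ""
val rows = "row1\t10\t\"\"\nrows2\t11\t\"\"\n"
Files.write(Paths.get("/tmp/quoted.csv"), rows.getBytes("UTF-8"))

spark.conf.set("spark.rapids.sql.enabled", "true")
val df = spark.read.format("csv").option("delimiter", "\t").load("/tmp/quoted.csv")
df.count()   // per this description the GPU comes back with 1 instead of the expected 2

spark.conf.set("spark.rapids.sql.enabled", "false")
spark.read.format("csv").option("delimiter", "\t").load("/tmp/quoted.csv").count()   // 2 on the CPU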

@tgravescs added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Aug 3, 2023
@sameerz
Collaborator

sameerz commented Aug 7, 2023

cudf sees the two double quotes ("") as a way to escape a single ". Spark sees the two double quotes as an empty field.

Possibly related issue rapidsai/cudf#12145
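
As a rough illustration of the Spark (CPU) side of that difference, a sketch (the inline one-line dataset below is illustrative, not from the report):

import spark.implicits._

// a tab-delimited line whose last field is just "" parses as an empty field on the CPU
// and shows up as null
val lines = Seq("row1\t10\t\"\"").toDS()
spark.read.option("delimiter", "\t").csv(lines).show()
// +----+---+----+
// | _c0|_c1| _c2|
// +----+---+----+
// |row1| 10|null|
// +----+---+----+
// cudf instead treats the second " as escaping the first, keeps scanning for a closing
// quote, and can end up merging or dropping rows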

@mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Aug 8, 2023
@vuule

vuule commented Aug 15, 2023

@tgravescs can you please share the steps to repro the issue? I'm only seeing nulls returned for the fields containing only "".

@tgravescs
Collaborator Author

For me, with Spark, I just read a file with the contents in the description and it comes back wrong. I did not create a cudf reproduce case.

I just tested this again and there might be multiple things going on, one of them being me messing up.

The original user query looked something like:
val df = spark.read.format("csv").option("delimiter", "\t").load("./csvfileDir/")

When I was testing to get the small repro case, I left off the delimiter specification.

CPU:

scala> val df = spark.read.csv("./foo.csv")

scala> spark.sql("set spark.rapids.sql.enabled=false")
res3: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> df.count()
res4: Long = 2

scala> df.show()
+-------------+
|          _c0|
+-------------+
| row1\t10\t""|
|rows2\t11\t""|
+-------------+

GPU:

scala> spark.sql("set spark.rapids.sql.enabled=true")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val df = spark.read.csv("./foo.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string]

scala> df.count()
res7: Long = 1

scala> df.show()
+-----------+
|        _c0|
+-----------+
|row1\t10\t"|
+-----------+

Going back to specifying the delimiter actually doesn't show the same problem with my reproduce case.
GPU:

scala> val df = spark.read.format("csv").option("delimiter", "\t").load("./foo.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]

scala> df.count()
res9: Long = 2

scala> df.show()
+-----+---+----+
|  _c0|_c1| _c2|
+-----+---+----+
| row1| 10|null|
|rows2| 11|null|
+-----+---+----+

CPU:

scala> spark.sql("set spark.rapids.sql.enabled=false")
res15: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val df = spark.read.format("csv").option("delimiter", "\t").load("./foo.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]

scala> df.count()
res16: Long = 2

scala> df.show()
+-----+---+----+
|  _c0|_c1| _c2|
+-----+---+----+
| row1| 10|null|
|rows2| 11|null|
+-----+---+----+

Now if I go back to the actual customer data, though, using the delimiter option, it still reports the wrong number of rows.

scala> spark.sql("set spark.rapids.sql.enabled=true")
scala> val df = spark.read.format("csv").option("delimiter", "\t").load("./customerdata/")

scala> df.count()
res12: Long = 7600375     

scala> spark.sql("set spark.rapids.sql.enabled=false")

scala> df.count()
res14: Long = 7916199       

So to summarize:

  1. There is some issue when the delimiter is not specified. Not sure if it's the plugin or cudf.
  2. I need to go back to the customer data to find a repro case with the delimiter.

@tgravescs
Collaborator Author

OK, so I was able to narrow down the customer data to the following to reproduce it:

Note there are tabs between 27 and "foo\"", and between 2 and "bar":

27      "foo\""
2       "bar"
// CPU
scala> spark.conf.set("spark.rapids.sql.enabled", "false")
scala> val df = spark.read.format("csv").option("delimiter", "\t").load("./testRepro.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]

scala> df.show(false)
+---+----+
|_c0|_c1 |
+---+----+
|27 |foo"|
|2  |bar |
+---+----+

// GPU
scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> df.show(false)


+---+------------------+
|_c0|_c1               |
+---+------------------+
|27 |"foo\"\n2\t"bar"\n|
+---+------------------+

Bobby pointed out this looks like #6435 as well.

@tgravescs
Collaborator Author

tgravescs commented Aug 15, 2023

Note: the second issue, with the customer data where the delimiter is specified as tab and the reproduce data contains "foo\"", actually seems to be the same as #129.

@revans2
Collaborator

revans2 commented Aug 16, 2023

This is a dupe of #6435, and the cudf issue rapidsai/cudf#11948 tracks the fix.

@vuule

vuule commented Aug 16, 2023

I suspect that I had trouble reproing this because the issue only reproes when a field has other characters besides "". If that's the case, this is a duplicate of rapidsai/cudf#11948.
I looked into that issue earlier and found that the state machine is insufficient to deal with this case, but I can't find any comments about my findings ;(
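
A sketch that separates the two shapes, built from the data already in this thread (the file names are hypothetical, and it assumes the backslash in "foo\"" is literally present in the raw data, as the repro above suggests):

import java.nio.file.{Files, Paths}

// only_quotes.csv: fields that are exactly "" (did not repro above once the delimiter was set)
Files.write(Paths.get("/tmp/only_quotes.csv"),
  "row1\t10\t\"\"\nrows2\t11\t\"\"\n".getBytes("UTF-8"))

// escaped_quote.csv: a quoted field with content plus an escaped quote (the shape that loses rows)
Files.write(Paths.get("/tmp/escaped_quote.csv"),
  "27\t\"foo\\\"\"\n2\t\"bar\"\n".getBytes("UTF-8"))

for (path <- Seq("/tmp/only_quotes.csv", "/tmp/escaped_quote.csv");
     gpu  <- Seq("true", "false")) {
  spark.conf.set("spark.rapids.sql.enabled", gpu)
  val n = spark.read.format("csv").option("delimiter", "\t").load(path).count()
  println(s"$path gpu=$gpu -> $n rows")
}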
