
[BUG] Reading CSV file with "" field causes rows to not be read #8926

Open
tgravescs opened this issue Aug 3, 2023 · 7 comments
Labels: bug (Something isn't working)

@tgravescs
Collaborator

Describe the bug
Doing a count() after reading a CSV file with the plugin reports fewer rows than the CPU.

After investigating, it looks like a field containing just double quotes ("") is not being read properly.

In this case the file was tab-delimited and some rows have entries with "" similar to:

row1    10      ""
rows2   11      ""

Doing a count on that on the GPU returns 1 instead of 2.
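
A minimal sketch of the repro shape in spark-shell (the file path and the write step below are mine, just so there is something to read back, not the customer data):

import java.nio.file.{Files, Paths}

// write a tab-delimited file whose last field is just ""
val rows = "row1\t10\t\"\"\nrows2\t11\t\"\"\n"
Files.write(Paths.get("/tmp/quoted.csv"), rows.getBytes("UTF-8"))

spark.conf.set("spark.rapids.sql.enabled", "true")
val df = spark.read.format("csv").option("delimiter", "\t").load("/tmp/quoted.csv")
df.count()   // per this description the GPU comes back with 1 instead of the expected 2

spark.conf.set("spark.rapids.sql.enabled", "false")
spark.read.format("csv").option("delimiter", "\t").load("/tmp/quoted.csv").count()   // 2 on the CPU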

@tgravescs added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Aug 3, 2023
@sameerz
Collaborator

sameerz commented Aug 7, 2023

cudf sees the two double quotes ("") as a way to escape a single ". Spark sees the two double quotes as an empty field.

Possibly related issue rapidsai/cudf#12145
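
As a rough illustration of the Spark (CPU) side of that difference, a sketch (the inline one-line dataset below is illustrative, not from the report):

import spark.implicits._

// a tab-delimited line whose last field is just "" parses as an empty field on the CPU
// and shows up as null
val lines = Seq("row1\t10\t\"\"").toDS()
spark.read.option("delimiter", "\t").csv(lines).show()
// +----+---+----+
// | _c0|_c1| _c2|
// +----+---+----+
// |row1| 10|null|
// +----+---+----+
// cudf instead treats the second " as escaping the first, keeps scanning for a closing
// quote, and can end up merging or dropping rows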

@mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Aug 8, 2023
@vuule

vuule commented Aug 15, 2023

@tgravescs can you please share the steps to repro the issue? I'm only seeing nulls returned for the fields containing only "".

@tgravescs
Collaborator Author

For me, with Spark, I just read a file with the contents in the description and it comes back wrong. I did not create a cudf reproduce case.

I just tested this again and there might be multiple things going on, one of them being me messing up.

The original user query looked something like:
val df = spark.read.format("csv").option("delimiter", "\t").load("./csvfileDir/")

When I was testing to get the small repro case, I left off the delimiter specification.

CPU:

scala> val df = spark.read.csv("./foo.csv")

scala> spark.sql("set spark.rapids.sql.enabled=false")
res3: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> df.count()
res4: Long = 2

scala> df.show()
+-------------+
|          _c0|
+-------------+
| row1\t10\t""|
|rows2\t11\t""|
+-------------+

GPU:

scala> spark.sql("set spark.rapids.sql.enabled=true")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val df = spark.read.csv("./foo.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string]

scala> df.count()
res7: Long = 1

scala> df.show()
+-----------+
|        _c0|
+-----------+
|row1\t10\t"|
+-----------+

Going back to specifying the delimiter actually doesn't show the same problem with my reproduce case.
GPU:

scala> val df = spark.read.format("csv").option("delimiter", "\t").load("./foo.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]

scala> df.count()
res9: Long = 2

scala> df.show()
+-----+---+----+
|  _c0|_c1| _c2|
+-----+---+----+
| row1| 10|null|
|rows2| 11|null|
+-----+---+----+

CPU:

scala> spark.sql("set spark.rapids.sql.enabled=false")
res15: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val df = spark.read.format("csv").option("delimiter", "\t").load("./foo.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]

scala> df.count()
res16: Long = 2

scala> df.show()
+-----+---+----+
|  _c0|_c1| _c2|
+-----+---+----+
| row1| 10|null|
|rows2| 11|null|
+-----+---+----+

Now if I go back to the actual customer data, though, using the delimiter option, it still reports the wrong number of rows.

scala> spark.sql("set spark.rapids.sql.enabled=true")
scala> val df = spark.read.format("csv").option("delimiter", "\t").load("./customerdata/")

scala> df.count()
res12: Long = 7600375     

scala> spark.sql("set spark.rapids.sql.enabled=false")

scala> df.count()
res14: Long = 7916199       

So to summarize:

  1. There is some issue when the delimiter is not specified. Not sure if it's the plugin or cudf.
  2. I need to go back to the customer data to find a repro case with the delimiter.

@tgravescs
Collaborator Author

OK, so I was able to narrow down the customer data to the following to reproduce it:

Note there are tabs between 27 and "foo\"", and between 2 and "bar":

27      "foo\""
2       "bar"
// CPU
scala> spark.conf.set("spark.rapids.sql.enabled", "false")
scala> val df = spark.read.format("csv").option("delimiter", "\t").load("./testRepro.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]

scala> df.show(false)
+---+----+
|_c0|_c1 |
+---+----+
|27 |foo"|
|2  |bar |
+---+----+

// GPU
scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> df.show(false)


+---+------------------+
|_c0|_c1               |
+---+------------------+
|27 |"foo\"\n2\t"bar"\n|
+---+------------------+

Bobby pointed out this looks like #6435 as well.

@tgravescs
Collaborator Author

tgravescs commented Aug 15, 2023

Note: the second issue, with the customer data where the delimiter is specified as tab and the reproduce data contains "foo\"", actually seems to be the same as #129.

@revans2
Collaborator

revans2 commented Aug 16, 2023

This is a dupe of #6435, and the cudf issue rapidsai/cudf#11948 tracks the fix.

@vuule

vuule commented Aug 16, 2023

I suspect that I had trouble reproing this because the issue only reproes when a field has other characters besides "". If that's the case, this is a duplicate of rapidsai/cudf#11948.
I looked into that issue earlier and found that the state machine is insufficient to deal with this case, but I can't find any comments about my findings ;(
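
A sketch that separates the two shapes, built from the data already in this thread (the file names are hypothetical, and it assumes the backslash in "foo\"" is literally present in the raw data, as the repro above suggests):

import java.nio.file.{Files, Paths}

// only_quotes.csv: fields that are exactly "" (did not repro above once the delimiter was set)
Files.write(Paths.get("/tmp/only_quotes.csv"),
  "row1\t10\t\"\"\nrows2\t11\t\"\"\n".getBytes("UTF-8"))

// escaped_quote.csv: a quoted field with content plus an escaped quote (the shape that loses rows)
Files.write(Paths.get("/tmp/escaped_quote.csv"),
  "27\t\"foo\\\"\"\n2\t\"bar\"\n".getBytes("UTF-8"))

for (path <- Seq("/tmp/only_quotes.csv", "/tmp/escaped_quote.csv");
     gpu  <- Seq("true", "false")) {
  spark.conf.set("spark.rapids.sql.enabled", gpu)
  val n = spark.read.format("csv").option("delimiter", "\t").load(path).count()
  println(s"$path gpu=$gpu -> $n rows")
}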
