Unicode characters incorrectly escaped in encode #334

samhh · 2020-08-16T22:33:23Z

At least, that's what I think is happening. In a REPL, the following will fail:

Toml.decode (Toml.text "k") $ Toml.encode (Toml.text "k") "ü"
-- Left [ParseError (TomlParseError {unTomlParseError = "1:8:\n  |\n1 | k = \"\\252\"\n  |        ^\nInvalid escape sequence: \\2\n"})]

Looking at the encoding, here's what we're given:

Toml.encode (Toml.text "k") "ü"
-- "k = \"\\252\"\n"

And passing that into decode will fail per the above:

Toml.decode (Toml.text "k") "k = \"\\252\"\n"
-- Left [ParseError (TomlParseError {unTomlParseError = "1:8:\n  |\n1 | k = \"\\252\"\n  |        ^\nInvalid escape sequence: \\2\n"})]

Looking at the TOML spec, it looks like Unicode characters should be encoded with a \u prefix. Modifying the string to contain an extra u0 allows decoding to succeed, and I think that's what we want given that's roughly the decimal output in an online Unicode converter:

Toml.decode (Toml.text "k") "k = \"\\u0252\"\n"
-- Right "\594"

But I'm pretty ignorant about character encoding and am honestly not sure if that's the right output. 😄
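A small sketch of why the `Right "\594"` result above is not the original character, using only `base`. The names `decimalCodePoint`, `hexCodePoint` and `misreadCodePoint` are made up for illustration. Haskell's `show` writes code points in decimal, while a TOML `\uXXXX` escape takes four hexadecimal digits, so the digits 252 mean different things in the two notations:

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- 'ü' written with Haskell's decimal escape; this is the form `show` emits.
decimalCodePoint :: Int
decimalCodePoint = ord '\252'            -- 252

-- The same code point in hexadecimal, the base a TOML \uXXXX escape uses.
hexCodePoint :: String
hexCodePoint = showHex (ord '\252') ""   -- "fc"

-- Reading the digits 0252 as hexadecimal gives a different code point.
misreadCodePoint :: Int
misreadCodePoint = 0x0252                -- 594
```

Under that reading, the TOML escape for ü would be `\u00FC`, and `\u0252` decodes to code point 594, which is what the `Right "\594"` output shows.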


chshersh · 2020-08-18T16:28:48Z

Hi @samhh, thanks for submitting the issue! This does indeed look like unexpected behaviour 😞
tomland uses Text internally, and during encoding Text is printed using the show function. show escapes every non-ASCII character as a backslash followed by its decimal code point, which is not a valid TOML escape sequence. You can reproduce this behaviour even more easily in GHCi:

λ: show "ü"
"\"\\252\""

It looks like some smarter handling of Unicode characters is required to preserve TOML semantics.

Relevant code to change is here:

tomland/src/Toml/Type/Printer.hs

Line 141 in 38d74d3

valText (Text s) = showText s
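For illustration only, here is a minimal sketch of what TOML-aware escaping could look like in place of `show`. It is not the fix that was merged for this issue; `escapeTomlChar` and `renderTomlString` are hypothetical names, and the sketch works on `String` with `base` only rather than on `Text`. It assumes the TOML rule for basic strings: the quotation mark, the backslash and control characters must be escaped, while other Unicode characters may appear literally.

```haskell
import Data.Char (ord)
import Text.Printf (printf)

-- Escape one character for a TOML basic string (hypothetical helper).
escapeTomlChar :: Char -> String
escapeTomlChar c = case c of
    '"'  -> "\\\""
    '\\' -> "\\\\"
    '\b' -> "\\b"
    '\t' -> "\\t"
    '\n' -> "\\n"
    '\f' -> "\\f"
    '\r' -> "\\r"
    -- Remaining control characters get a four-digit hexadecimal escape.
    _ | n < 0x20 || n == 0x7F -> printf "\\u%04X" n
      -- Everything else, including 'ü', is written as-is.
      | otherwise             -> [c]
  where
    n = ord c

-- Wrap the escaped characters in double quotes (hypothetical helper).
renderTomlString :: String -> String
renderTomlString s = "\"" ++ concatMap escapeTomlChar s ++ "\""
```

With this sketch, `renderTomlString "ü"` yields the three characters `"ü"` rather than `"\252"`. Emitting `\u00FC` instead would be an equally valid design choice under the same assumption.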

A more interesting question is why such errors weren't caught by our property tests? 🤔 🤔 🤔

dariodsa · 2020-12-08T20:25:47Z

I implemented the requested feature; now I only have to implement tests and figure out why it wasn't caught by our test cases. The code needs some cleaning, but it is working. :-)

dariodsa · 2020-12-08T21:18:09Z

Our tests were OK, but they were testing something different. Text was generated with Unicode characters, but only in escaped form such as \u010d, never in their real form, č. Thank you @samhh for catching that error.
I will need to rewrite some tests, but I will explain it in more detail in the PR.
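The difference between the two forms can be sketched with `base` alone. The names `escapedForm`, `realForm` and `isAsciiOnly` are made up for illustration, and this is not the project's actual test generator. A string that spells out `\u010d` is six ASCII characters, so `show` only doubles its backslash and the result is still valid TOML; a string holding the real character č is one non-ASCII character, which `show` turns into a decimal escape:

```haskell
import Data.Char (isAscii)

-- Six ASCII characters: backslash, 'u', '0', '1', '0', 'd'.
escapedForm :: String
escapedForm = "\\u010d"

-- One character: U+010D (č), written with a Haskell hexadecimal escape.
realForm :: String
realForm = "\x010d"

-- True when a string contains ASCII characters only.
isAsciiOnly :: String -> Bool
isAsciiOnly = all isAscii

-- show escapedForm == "\"\\\\u010d\""   (backslash doubled, valid in TOML)
-- show realForm    == "\"\\269\""       (decimal escape, rejected by the parser)
```

A generator that only ever produces strings like `escapedForm` would therefore never reach the failing path.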

* [#334] parse and unparse tests
* removed parsing and unparing tests
* [#334] showUnicodeText
* [#334] escaping unicode character as well as regular characters
* [#334] resolved issue with escaping regular unescaped chars
* added tests, but they are not in use
* examples.hs revert to original content
* [#334] changes requested by chshersh

The `tomland` library currently has some issues with Unicode characters, which makes it not usable right now. In the future we could try to migrate again but for now we should probably stick to YAML. See kowainik/tomland#334

chshersh added the bug and pretty-printer labels Aug 18, 2020

dariodsa self-assigned this Nov 14, 2020

dariodsa added a commit that referenced this issue Nov 14, 2020

[#334] parse and unparse tests

f922d9a

dariodsa added a commit that referenced this issue Nov 18, 2020

[#334] showUnicodeText

0e81ac8

dariodsa mentioned this issue Nov 27, 2020

[Parser] Multiline string #353

Closed

dariodsa added a commit that referenced this issue Dec 8, 2020

[#334] escaping unicode character as well as regular characters

7587a21

dariodsa added a commit that referenced this issue Dec 9, 2020

[#334] resolved issue with escaping regular unescaped chars

0349ff4

dariodsa mentioned this issue Dec 9, 2020

Unicode escape #354

Merged

dariodsa added a commit that referenced this issue Dec 19, 2020

[#334] changes requested by chshersh

fb68968

chshersh modified the milestone: v1.3.2.0: Always improving Feb 12, 2021

srid mentioned this issue Aug 17, 2022

Stork search unicode issue in title srid/emanote#336

Closed
