Escaped HTML tags are "un-escaped" when rendering HTML #52

sh-at-cs · 2024-02-19T13:58:40Z

Consider:

from mjml import mjml_to_html

result = mjml_to_html("""
<mjml>
  <mj-body>
    <mj-text>
      Pretty unsafe: &lt;script&gt;
    </mj-text>
  </mj-body>
</mjml>""")

print(result["html"])

In the resulting HTML output, the formerly HTML-escaped < (&lt;) and > (&gt;) are "un-escaped", so the rendered HTML actually contains Pretty unsafe: <script>.

Why does this happen? This reverses the user's safety measures and can be dangerous.

The MJML reference implementation doesn't do this and correctly keeps such escape sequences untouched: https://mjml.io/try-it-live/fvvhZhdu9V

sh-at-cs · 2024-02-19T18:16:47Z

I tracked it down to the changes made in 84c495d (#45), where the formatter argument to BeautifulSoup's decode_contents was explicitly set to None. If it is set to "minimal" instead (the default and prior value), HTML escape sequences are left as they are.

But it seems like (part of) the whole point of #45 was to not leave HTML entities as they are so as to fix #44...
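The difference between the two formatter settings can be imitated with the standard library alone. This is only an analogy, not BeautifulSoup's actual code path: `html.unescape` stands in for the parser decoding entities, and `html.escape` stands in for a serializer that performs minimal entity substitution.

```python
from html import escape, unescape

source = "Pretty unsafe: &lt;script&gt;"

# Parsing decodes entities, so the in-memory text holds real angle brackets.
parsed_text = unescape(source)

# Serializing with no entity substitution (analogous to formatter=None).
raw_output = parsed_text

# Serializing with minimal escaping (analogous to formatter="minimal").
escaped_output = escape(parsed_text, quote=False)

print(raw_output)
print(escaped_output)
```

Only the second variant round-trips the input unchanged; the first one emits a live tag.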

@caseyjhol Care to comment? 🙂

caseyjhol · 2024-02-19T19:03:25Z

Ah I could've sworn I tested and accounted for this scenario, but clearly I missed the mark. I think the better approach then might be to limit the scope to mjStyle rather than all HTML.

sh-at-cs · 2024-02-19T21:51:02Z

@caseyjhol Thanks for the quick response! You're right, you did add an example of a very similar situation:

mjml-python/tests/testdata/css-inlining.mjml

Line 22 in 713d0aa

&lt; Hello World &gt;

mjml-python/tests/testdata/css-inlining-expected.html

Line 127 in 713d0aa

    
           <div style="font-family:helvetica;font-size:20px;line-height:1;text-align:left;color:#F45E43;"><span style="border: 3px solid blue;"> &lt; Hello World &gt; </span></div>

But as it turns out, the entities are in fact "un-escaped" during rendering of the test MJML file, as described in this issue. The test passes anyway because the space between &lt; and Hello makes HTMLCompare consider the expected &lt; Hello equal to < Hello (the rendered output). Minimal example:

# no exception:
htmlcompare.assert_same_html("&lt; script &gt;", "< script >")
# exception:
htmlcompare.assert_same_html("&lt;script &gt;", "<script >")

I haven't checked HTMLCompare's code, but I think this is probably because putting a space between < and the tag name makes it stop being an HTML tag, so the &lt; is in fact equivalent to < in this context.

But that is exactly the case that people don't need to guard against with sanitization/escaping. So the fix for this issue should add another test case (or extend this one) with an escaped HTML tag like &lt;script&gt;.
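The tag-versus-text guess above can be checked with Python's built-in `html.parser`. This is a stand-in for illustration, not HTMLCompare's implementation; `TagAndTextCollector` and `summarize` are hypothetical names made up for this sketch.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Records start tags and text separately while parsing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_parts.append(data)


def summarize(markup):
    """Return (start tags, concatenated text) for a markup fragment."""
    parser = TagAndTextCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags, "".join(parser.text_parts)


# With a space after "<", both inputs are plain text and compare equal.
print(summarize("&lt; script &gt;") == summarize("< script >"))

# Without the space, the unescaped variant becomes a real start tag.
print(summarize("&lt;script &gt;"))
print(summarize("<script >"))
```

With this parser, the spaced variants are indistinguishable, so a comparison built on parsed output cannot catch the un-escaping in that test case. A regression test needs an input whose unescaped form is a genuine tag.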

FelixSchwarz · 2024-02-20T08:44:27Z

@sh-at-cs Thank you for reporting this issue, including the detailed analysis. I think this is a serious issue which we need to fix. I'll try to spend some time on this later - either on reviewing a solution or trying to fix this myself. When we merged #45 somehow the security implications completely escaped my attention.

FelixSchwarz · 2024-02-21T07:38:29Z

Unfortunately all I could do yesterday was to add the test case in a new branch fix-escaped-html-tags.

How should we fix this issue? I see two approaches (just brainstorming here):

contents of <mj-style> should be kept as-is
possibly parse the contents using tinycss2 to ensure this is really just CSS

Would that solve the issue if we also remove the formatter=None? As far as I could see, css_inline will keep the escaped sequences (as it should).

caseyjhol · 2024-02-21T15:16:57Z

Going to try to dig into this today a bit.

FelixSchwarz · 2024-02-21T16:10:17Z

I pushed some additional code in the branch fix-escaped-html-tags. Basically the idea is to do the unescape only for contents of <mj-style>.
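A minimal sketch of that idea, assuming a plain regex pass over the MJML source is good enough for illustration. This is not the code in the fix-escaped-html-tags branch; `MJ_STYLE_RE` and `unescape_mj_style_only` are hypothetical names, and a real implementation would more likely work on the parsed tree.

```python
import re
from html import unescape

# Hypothetical pattern: captures the opening tag, the contents, and the
# closing tag of each <mj-style> element.
MJ_STYLE_RE = re.compile(
    r"(<mj-style\b[^>]*>)(.*?)(</mj-style>)",
    re.DOTALL | re.IGNORECASE,
)


def unescape_mj_style_only(mjml: str) -> str:
    """Decode HTML entities inside <mj-style> and leave everything else as-is."""

    def _replace(match: re.Match) -> str:
        opening, contents, closing = match.groups()
        return opening + unescape(contents) + closing

    return MJ_STYLE_RE.sub(_replace, mjml)


sample = (
    "<mj-style>a &gt; b { color: red; }</mj-style>"
    "<mj-text>&lt;script&gt;</mj-text>"
)
print(unescape_mj_style_only(sample))
```

In this sketch the CSS selector inside `<mj-style>` gets its `>` back, while the escaped tag inside `<mj-text>` stays escaped.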

I also added CSS parsing using tinycss2. My idea is that this would blow up if the contents would be invalid (e.g. other HTML code) but I did not test that. Not sure if it is worth the additional dependency because arbitrary CSS can influence the displayed contents...

Update: It seems like tinycss2 happily parses even completely invalid HTML, but I think css_inline would remove that. Should we just assume that <mj-style> does not contain untrusted user input? (And maybe add a section to the readme?)

FelixSchwarz · 2024-02-21T16:20:41Z

Maybe you can also check the security advisory draft I created for this issue. Feel free to suggest additions, and please check whether you agree with the severity classification.

sh-at-cs changed the title ~~Escaped HTML tags are "un-escaped" when rendering to HTML~~ Escaped HTML tags are "un-escaped" when rendering HTML Feb 19, 2024

FelixSchwarz mentioned this issue Feb 21, 2024

escaped HTML entities like &gt; were unescaped in the final mjml output #54

Merged

FelixSchwarz closed this as completed in #54 Feb 22, 2024
