Sending Huge Tree loses namespace #218

Open
coolacid opened this issue Feb 22, 2017 · 6 comments

@coolacid

Sending TAXII 1.1 messages with a huge tree causes future messages to lose the namespace.

Example output:

{http://taxii.mitre.org/messages/taxii_xml_binding-1.1}Discovery_Request
{http://taxii.mitre.org/messages/taxii_xml_binding-1.1}Discovery_Request
{http://taxii.mitre.org/messages/taxii_xml_binding-1.1}Discovery_Request
internal error: Huge input lookup, line 1, column 13580900 (line 1)
Discovery_Request
internal error: Huge input lookup, line 1, column 10616240 (line 1)
Discovery_Request
Inbox_Message
Discovery_Request

Simple Test Script:

from lxml import etree
from libtaxii.common import parse_xml_string

def test(xml_string, encoding='utf_8'):
    # From https://github.com/TAXIIProject/libtaxii/blob/master/libtaxii/messages_11.py#L79
    decoded_string = xml_string.decode(encoding, 'replace')
    etree_xml = parse_xml_string(decoded_string)
    qn = etree.QName(etree_xml)
    print qn

with open('data1.txt', 'r') as myfile:
    lines = myfile.readlines()

for s in lines:
    try:
        test(s)
    except Exception as e:
        print "Exception: %s" % e

Sample data would be too big to provide. However, you can reproduce the issue by sending a standard discovery packet, then a huge-tree packet (for example, a large sample file), then another discovery packet.

@gtback
Contributor

gtback commented Feb 22, 2017

Thanks, @coolacid. I'll look into this.

@kscheetz

Is this related to the huge_tree issue? If so, possible duplicate of #18.

@coolacid
Author

Not a duplicate of #18 - while related, this is specific to when huge_tree is disabled and the namespace no longer parses.

While turning on huge_tree does work around the issue, the package should be able to handle the input when huge_tree is set to False.
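For reference, a minimal sketch of the huge_tree workaround (the namespace URI and the ~10 MB payload size here are illustrative, not from libtaxii): an lxml parser built with huge_tree=True accepts a text node large enough to trip libxml2's default limits, and the namespace is still resolved.

```python
import io
from lxml import etree

# Build a ~10 MB document whose text node contains an entity that needs
# expansion, mirroring the failure mode described in this issue.
huge_doc = (
    '<root xmlns="http://example.com/ns">'
    + '&lt;' + 'x' * (10 * 1000 * 1000)
    + '</root>'
).encode('utf-8')

# huge_tree=True lifts libxml2's text-node size limit for this parser.
parser = etree.XMLParser(huge_tree=True)
root = etree.parse(io.BytesIO(huge_doc), parser).getroot()
print(etree.QName(root).namespace)  # the document's namespace, intact
```

With the default huge_tree=False, the same parse raises XMLSyntaxError ("huge text node"), which is where this issue starts.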

@daybarr
Contributor

daybarr commented May 29, 2019

Here's a self contained script to reproduce the issue.

from __future__ import print_function

import libtaxii.messages_11 as tm11

def load_xml(xml_bytes):
    try:
        msg = tm11.get_message_from_xml(xml_bytes)
        print('Loaded %r' % (msg,))
    except Exception:
        import traceback
        print(traceback.format_exc())

if __name__ == '__main__':
    poll_request_bytes = tm11.PollRequest(
        message_id=tm11.generate_message_id(),
        collection_name='Test collection',
        subscription_id='Test subscription',
    ).to_xml()

    # Use some content for the text node that is known to trigger the
    # parser's
    #      XMLSyntaxError: xmlSAX2Characters: huge text node
    # exception by having ten million bytes in the text node *and* an entity
    # that needs expansion (< in this example).
    huge_text = '<' + ('x' * (10 * 1000 * 1000))
    inbox_msg_bytes = tm11.InboxMessage(
        message_id=tm11.generate_message_id(),
        content_blocks=[
            tm11.ContentBlock(
                content_binding='urn:example.com:huge_tree_issue:218',
                content=huge_text,
            )
        ]
    ).to_xml()

    print('First load of poll request - works fine')
    load_xml(poll_request_bytes)
    print('\nLoad inbox message - expected to error due to "huge text node"')
    load_xml(inbox_msg_bytes)
    print('\nSecond load of same poll request - should work fine too')
    # Uncomment next line to "fix" by resetting broken cached parser
    # libtaxii.common.set_xml_parser(None)
    load_xml(poll_request_bytes)

On a Python 2.7.15 install this gives:

First load of poll request - works fine
Loaded <libtaxii.messages_11.PollRequest object at 0x0000000003790128>

Load inbox message - expected to error due to "huge text node"
Traceback (most recent call last):
  File "break_libtaxii.py", line 8, in load_xml
    msg = tm11.get_message_from_xml(xml_bytes)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\messages_11.py", line 76, in get_message_from_xml
    etree_xml = parse_xml_string(xml_string)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\common.py", line 60, in parse_xml_string
    return parse(xmlstr)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\common.py", line 34, in parse
    e = etree.parse(s, get_xml_parser()).getroot()
  File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1857, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1758, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
XMLSyntaxError: xmlSAX2Characters: huge text node, line 1, column 10000380 (line 1)


Second load of same poll request - should work fine too
Traceback (most recent call last):
  File "break_libtaxii.py", line 8, in load_xml
    msg = tm11.get_message_from_xml(xml_bytes)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\messages_11.py", line 79, in get_message_from_xml
    raise ValueError('Unsupported namespace: %s' % qn.namespace)
ValueError: Unsupported namespace: None

Note how the second parse of the poll request now fails with ValueError: Unsupported namespace: None. In fact, once a message with a huge tree has been parsed and resulted in error, no more messages can be parsed until the cached lxml.etree.XMLParser that libtaxii uses is reset by a call to libtaxii.common.set_xml_parser(None). Uncommenting the line to enable that workaround, and rerunning the script gives:

First load of poll request - works fine
Loaded <libtaxii.messages_11.PollRequest object at 0x00000000038D0128>

Load inbox message - expected to error due to "huge text node"
Traceback (most recent call last):
  File "break_libtaxii.py", line 8, in load_xml
    msg = tm11.get_message_from_xml(xml_bytes)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\messages_11.py", line 76, in get_message_from_xml
    etree_xml = parse_xml_string(xml_string)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\common.py", line 60, in parse_xml_string
    return parse(xmlstr)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\common.py", line 34, in parse
    e = etree.parse(s, get_xml_parser()).getroot()
  File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1857, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1758, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
XMLSyntaxError: xmlSAX2Characters: huge text node, line 1, column 10000379 (line 1)


Second load of same poll request - should work fine too
Loaded <libtaxii.messages_11.PollRequest object at 0x00000000042BA278>

Arguably this is an lxml bug - it seems that the lxml.etree.XMLParser can be left in a state where it cannot be used for parsing new messages. I can't find explicit lxml docs that say it's expected to be ok to re-use the same parser for processing multiple documents though.

There are several ways to work around the problem:

  • Don't have libtaxii.common.get_xml_parser cache the XMLParser it creates in the libtaxii.common._XML_PARSER global. Just create it from scratch on every call. It only takes ~2 microseconds on my machine. Note, though, that this would break the libtaxii.common.set_xml_parser feature, which works by changing that global.
  • Use lxml.etree.XMLParser.copy() to return a copy of the cached _XML_PARSER on every call to get_xml_parser. This seems to reset the bit of state that causes future parses to lose the namespace, and taking a copy is about twice as quick as reconstructing the parser from scratch. However, it's not clear why copying clears that state, so it may be safer not to rely on this for now and just reconstruct every time.
  • Change the way we construct the lxml.etree.XMLParser to not use ns_clean=True - it seems this is what breaks the namespaces.
  • Change the way we construct the lxml.etree.XMLParser to use huge_tree=True instead of False. This is #18 (Enable parsing of huge XML files) and the workaround suggested above by @coolacid.
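The copy-based workaround could look roughly like this. This is a hypothetical sketch, not libtaxii's actual code; the real get_xml_parser may differ in signature and parser options.

```python
from lxml import etree

_XML_PARSER = None  # module-level cache, as in libtaxii.common


def get_xml_parser():
    """Return a copy of the cached parser rather than the parser itself."""
    global _XML_PARSER
    if _XML_PARSER is None:
        _XML_PARSER = etree.XMLParser(ns_clean=True, huge_tree=False)
    # Copying appears to discard whatever state a failed "huge text node"
    # parse leaves behind, so every caller gets a usable parser while the
    # set_xml_parser(...) override mechanism keeps working.
    return _XML_PARSER.copy()
```

Each call returns a fresh copy, so a parse failure can only poison the copy, never the cached original.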

@daybarr
Contributor

daybarr commented May 29, 2019

I don't think that (switching to huge_tree=True) is a good idea, because it can cause Python to crash given a trivially constructed, deeply nested input document. See my comment on #18.
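To illustrate the concern: a tiny amount of attacker-controlled input trivially expands into extreme nesting (the element name and depth here are arbitrary), which huge_tree=True would ask libxml2 to process without its usual depth limits.

```python
# A few hundred kilobytes of input yields 100,000 levels of nesting.
depth = 100 * 1000
deeply_nested = '<a>' * depth + '</a>' * depth

print(len(deeply_nested))  # 700000 characters of input

# Feeding this to a parser configured with huge_tree=True risks exhausting
# the C stack and crashing the interpreter, so it is deliberately not
# parsed here.
```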

@emmanvg emmanvg mentioned this issue Jul 26, 2019
@emmanvg
Contributor

emmanvg commented Jul 26, 2019

I made a change to how we use the cached XMLParser. It will not fully address this issue, but it should at least prevent losing future namespaces if at some point the XMLParser ends up in an inconsistent state.
