Sending Huge Tree loses namespace #218

Open
coolacid opened this issue Feb 22, 2017 · 6 comments

@coolacid

Sending TAXII 1.1 messages with a huge tree causes future messages to lose the namespace.

Example output:

{http://taxii.mitre.org/messages/taxii_xml_binding-1.1}Discovery_Request
{http://taxii.mitre.org/messages/taxii_xml_binding-1.1}Discovery_Request
{http://taxii.mitre.org/messages/taxii_xml_binding-1.1}Discovery_Request
internal error: Huge input lookup, line 1, column 13580900 (line 1)
Discovery_Request
internal error: Huge input lookup, line 1, column 10616240 (line 1)
Discovery_Request
Inbox_Message
Discovery_Request

Simple Test Script:

from lxml import etree
from libtaxii.common import parse_xml_string

def test(xml_string, encoding='utf_8'):
    # From https://github.com/TAXIIProject/libtaxii/blob/master/libtaxii/messages_11.py#L79
    decoded_string = xml_string.decode(encoding, 'replace')
    etree_xml = parse_xml_string(decoded_string)
    qn = etree.QName(etree_xml)
    print qn

with open('data1.txt', 'r') as myfile:
    lines = myfile.readlines()

for s in lines:
    try:
        test(s)
    except Exception as e:
        print "Exception: %s" % e

Sample data would be too big to provide. However, you can reproduce the issue by sending a standard discovery packet, then a huge-tree packet (for example, a large sample file), then another discovery packet.

@gtback
Contributor

gtback commented Feb 22, 2017

Thanks, @coolacid. I'll look into this.

@kscheetz

Is this related to the huge_tree issue? If so, possible duplicate of #18.

@coolacid
Author

Not a duplicate of #18 - while related, this is specific to when huge_tree is disabled and the namespace no longer parses.

While turning on huge_tree does work around the issue, the package should be able to handle the input when huge_tree is set to False.
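For reference, a minimal sketch of the huge_tree workaround (the namespace URI and the ~10 MB payload size here are illustrative, not from libtaxii): an lxml parser built with huge_tree=True accepts a text node large enough to trip libxml2's default limits, and the namespace is still resolved.

```python
import io
from lxml import etree

# Build a ~10 MB document whose text node contains an entity that needs
# expansion, mirroring the failure mode described in this issue.
huge_doc = (
    '<root xmlns="http://example.com/ns">'
    + '&lt;' + 'x' * (10 * 1000 * 1000)
    + '</root>'
).encode('utf-8')

# huge_tree=True lifts libxml2's text-node size limit for this parser.
parser = etree.XMLParser(huge_tree=True)
root = etree.parse(io.BytesIO(huge_doc), parser).getroot()
print(etree.QName(root).namespace)  # the document's namespace, intact
```

With the default huge_tree=False, the same parse raises XMLSyntaxError ("huge text node"), which is where this issue starts.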

@daybarr
Contributor

daybarr commented May 29, 2019

Here's a self contained script to reproduce the issue.

from __future__ import print_function

import libtaxii.messages_11 as tm11

def load_xml(xml_bytes):
    try:
        msg = tm11.get_message_from_xml(xml_bytes)
        print('Loaded %r' % (msg,))
    except Exception:
        import traceback
        print(traceback.format_exc())

if __name__ == '__main__':
    poll_request_bytes = tm11.PollRequest(
        message_id=tm11.generate_message_id(),
        collection_name='Test collection',
        subscription_id='Test subscription',
    ).to_xml()

    # Use some content for the text node that is known to trigger the
    # parser's
    #      XMLSyntaxError: xmlSAX2Characters: huge text node
    # exception by having ten million bytes in the text node *and* an entity
    # that needs expansion (< in this example).
    huge_text = '<' + ('x' * (10 * 1000 * 1000))
    inbox_msg_bytes = tm11.InboxMessage(
        message_id=tm11.generate_message_id(),
        content_blocks=[
            tm11.ContentBlock(
                content_binding='urn:example.com:huge_tree_issue:218',
                content=huge_text,
            )
        ]
    ).to_xml()

    print('First load of poll request - works fine')
    load_xml(poll_request_bytes)
    print('\nLoad inbox message - expected to error due to "huge text node"')
    load_xml(inbox_msg_bytes)
    print('\nSecond load of same poll request - should work fine too')
    # Uncomment next line to "fix" by resetting broken cached parser
    # libtaxii.common.set_xml_parser(None)
    load_xml(poll_request_bytes)

On a Python 2.7.15 install this gives:

First load of poll request - works fine
Loaded <libtaxii.messages_11.PollRequest object at 0x0000000003790128>

Load inbox message - expected to error due to "huge text node"
Traceback (most recent call last):
  File "break_libtaxii.py", line 8, in load_xml
    msg = tm11.get_message_from_xml(xml_bytes)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\messages_11.py", line 76, in get_message_from_xml
    etree_xml = parse_xml_string(xml_string)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\common.py", line 60, in parse_xml_string
    return parse(xmlstr)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\common.py", line 34, in parse
    e = etree.parse(s, get_xml_parser()).getroot()
  File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1857, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1758, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
XMLSyntaxError: xmlSAX2Characters: huge text node, line 1, column 10000380 (line 1)


Second load of same poll request - should work fine too
Traceback (most recent call last):
  File "break_libtaxii.py", line 8, in load_xml
    msg = tm11.get_message_from_xml(xml_bytes)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\messages_11.py", line 79, in get_message_from_xml
    raise ValueError('Unsupported namespace: %s' % qn.namespace)
ValueError: Unsupported namespace: None

Note how the second parse of the poll request now fails with ValueError: Unsupported namespace: None. In fact, once a message with a huge tree has been parsed and resulted in error, no more messages can be parsed until the cached lxml.etree.XMLParser that libtaxii uses is reset by a call to libtaxii.common.set_xml_parser(None). Uncommenting the line to enable that workaround, and rerunning the script gives:

First load of poll request - works fine
Loaded <libtaxii.messages_11.PollRequest object at 0x00000000038D0128>

Load inbox message - expected to error due to "huge text node"
Traceback (most recent call last):
  File "break_libtaxii.py", line 8, in load_xml
    msg = tm11.get_message_from_xml(xml_bytes)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\messages_11.py", line 76, in get_message_from_xml
    etree_xml = parse_xml_string(xml_string)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\common.py", line 60, in parse_xml_string
    return parse(xmlstr)
  File "C:\Users\day.barr\dev\libtaxii\libtaxii\common.py", line 34, in parse
    e = etree.parse(s, get_xml_parser()).getroot()
  File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1857, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1758, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
XMLSyntaxError: xmlSAX2Characters: huge text node, line 1, column 10000379 (line 1)


Second load of same poll request - should work fine too
Loaded <libtaxii.messages_11.PollRequest object at 0x00000000042BA278>

Arguably this is an lxml bug - it seems that the lxml.etree.XMLParser can be left in a state where it cannot be used for parsing new messages. I can't find explicit lxml docs that say it's expected to be ok to re-use the same parser for processing multiple documents though.

There are several ways to work around the problem:

  • Don't have libtaxii.common.get_xml_parser cache the XMLParser it creates in the libtaxii.common._XML_PARSER global. Just create it from scratch on every call. It only takes ~2 microseconds on my machine. Note, though, that this would break the libtaxii.common.set_xml_parser feature, which works by changing that global.
  • Use lxml.etree.XMLParser.copy() to return a copy of the cached _XML_PARSER on every call to get_xml_parser. This seems to reset the bit of state that causes future parses to lose the namespace, and taking a copy is about twice as quick as reconstructing the parser from scratch. However, it's not clear why copying clears that state, so it may be safer not to rely on this for now and just reconstruct every time.
  • Change the way we construct the lxml.etree.XMLParser to not use ns_clean=True - it seems this is what breaks the namespaces.
  • Change the way we construct the lxml.etree.XMLParser to use huge_tree=True instead of False. This is #18 (Enable parsing of huge XML files) and the workaround suggested above by @coolacid.
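The copy-based workaround could look roughly like this. This is a hypothetical sketch, not libtaxii's actual code; the real get_xml_parser may differ in signature and parser options.

```python
from lxml import etree

_XML_PARSER = None  # module-level cache, as in libtaxii.common


def get_xml_parser():
    """Return a copy of the cached parser rather than the parser itself."""
    global _XML_PARSER
    if _XML_PARSER is None:
        _XML_PARSER = etree.XMLParser(ns_clean=True, huge_tree=False)
    # Copying appears to discard whatever state a failed "huge text node"
    # parse leaves behind, so every caller gets a usable parser while the
    # set_xml_parser(...) override mechanism keeps working.
    return _XML_PARSER.copy()
```

Each call returns a fresh copy, so a parse failure can only poison the copy, never the cached original.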

@daybarr
Contributor

daybarr commented May 29, 2019

I don't think that (switching to huge_tree=True) is a good idea, because it can cause Python to crash given a trivially constructed, deeply nested input document. See my comment on #18.
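To illustrate the concern: a tiny amount of attacker-controlled input trivially expands into extreme nesting (the element name and depth here are arbitrary), which huge_tree=True would ask libxml2 to process without its usual depth limits.

```python
# A few hundred kilobytes of input yields 100,000 levels of nesting.
depth = 100 * 1000
deeply_nested = '<a>' * depth + '</a>' * depth

print(len(deeply_nested))  # 700000 characters of input

# Feeding this to a parser configured with huge_tree=True risks exhausting
# the C stack and crashing the interpreter, so it is deliberately not
# parsed here.
```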

@emmanvg emmanvg mentioned this issue Jul 26, 2019
@emmanvg
Contributor

emmanvg commented Jul 26, 2019

I made a change to how we use the cached XMLParser. It will not fully address this issue, but it should at least prevent losing future namespaces if at some point the XMLParser ends up in an inconsistent state.
