Improving uri_regex #168

Open
guidovranken opened this issue Feb 9, 2015 · 4 comments

@guidovranken
Contributor

The current uri_regex defined in validation.py is not precise enough to distinguish legitimate, specification-compliant URIs from strings that cannot possibly be used as URIs.

Currently, the following regex is used:

uri_regex = RegexTuple("(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", "URI Format")

Parts of this regex between brackets, such as

[^:/?#]

match any character except the colon, the slash, the question mark and the number sign (#). Consequently, invalid characters, including non-printable "garbage" characters, are accepted by the above regex:

>>> import re
>>> r = r"(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
>>> re.match(r, "https://taxii.mitre.org").group(0)
'https://taxii.mitre.org'
>>> re.match(r, chr(0x0A) * 4 + "://taxii.mitre.org").group(0)
'\n\n\n\n://taxii.mitre.org'

(The first re.match call shows the correct validation of https://taxii.mitre.org; the second shows a string that begins with newline characters being accepted incorrectly.)

Given sufficiently diverse input data, I reckon this could cause problems in the long run, since strings that are not real URIs would be parsed and "green-lighted" by libtaxii.
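
One possible way to tighten the check, offered as a minimal sketch rather than a proposed patch: keep the Appendix B pattern for splitting the string, then apply extra checks to the pieces. The helper name `looks_like_uri` and the specific rules (require a scheme that follows the RFC 3986 `scheme` grammar, reject unencoded control characters and whitespace) are assumptions made for illustration and are not part of libtaxii. The sketch also assumes Python 3.4 or later, since it uses `fullmatch`.

```python
import re

# The RFC 3986 Appendix B pattern, the same one uri_regex uses today.
APPENDIX_B = re.compile(r"(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, section 3.1)
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# Control characters, space and DEL are not allowed unencoded in a URI.
FORBIDDEN = re.compile(r"[\x00-\x20\x7f]")


def looks_like_uri(value):
    """Hypothetical stricter check; the rules are illustrative assumptions."""
    if FORBIDDEN.search(value):
        return False
    match = APPENDIX_B.fullmatch(value)
    if match is None:
        return False
    scheme = match.group(2)
    # Requiring a scheme also rejects relative references.
    return scheme is not None and SCHEME.fullmatch(scheme) is not None
```

With these rules, `looks_like_uri("https://taxii.mitre.org")` returns `True`, while the newline-prefixed string from the session above returns `False`. A complete validator would also have to check each component against the RFC 3986 grammar, which this sketch does not attempt.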

@MarkDavidson
Contributor

@guidovranken, thanks for reporting this. I think improving the regex is a good thing to do.

This regex is from Appendix B of the URI specification [1]. libtaxii uses it slightly differently from the RFC (libtaxii uses it for validation, whereas the RFC uses it for capture groups and parsing); it was chosen mainly because it was a low-effort quick win.

Thank you.
-Mark

[1] https://tools.ietf.org/html/rfc3986

@guidovranken
Contributor Author

Thanks. Matching a URI perfectly is more complex than it first seems: see here for an attempt at constructing a sufficiently complete URI-matching regex. That regex might be usable for libtaxii, but beware that it may use a syntax different from the one Python's re module accepts; I haven't tried it yet.
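
A quick first check for the syntax concern is to see whether Python's re module compiles a candidate pattern at all. The helper name `compiles_in_python` below is hypothetical, and this is only a rough sketch: a pattern written for another regex flavor can compile and still behave differently (POSIX bracket classes such as `[[:punct:]]`, for example, are not supported by re), so it would still need testing against known good and bad inputs.

```python
import re


def compiles_in_python(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
    except re.error:
        # Raised for syntax that re cannot parse, e.g. an unbalanced group.
        return False
    return True
```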

@bworrell
Contributor

@MarkDavidson, @guidovranken,

We actually use that Daring Fireball regex in the email-to-cybox tool here.

It seems to work fine for what we intended: just grabbing URLs out of email message bodies.

@guidovranken
Contributor Author

Great, then this seems to be resolved.
