Improving uri_regex #168
Comments
@guidovranken, thanks for reporting this. I think improving the regex is a good thing to do. This regex comes from the URI specification (RFC 3986) [1], Appendix B. While libtaxii uses the regex somewhat differently than the RFC does (libtaxii uses it for validation, the RFC uses it for capture groups / parsing), it was chosen because it was a low-effort quick win. Thank you.
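As an illustrative aside, a minimal sketch of that difference (the constant name below is illustrative, not libtaxii's): the Appendix B pattern parses a URI cleanly via its capture groups, but as a yes/no validator it accepts nearly any string.

```python
import re

# RFC 3986, Appendix B pattern; the constant name is illustrative, not libtaxii's.
RFC3986_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# The RFC's intended use: split a URI into components via the capture groups
# (scheme = group 2, authority = 4, path = 5, query = 7, fragment = 9).
m = RFC3986_APPENDIX_B.match("http://taxii.mitre.org/services?x=1#top")
print(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))
# -> http taxii.mitre.org /services x=1 top

# Used as a validator, the same pattern accepts almost any string, because
# every component is optional and the character classes are negated.
print(bool(RFC3986_APPENDIX_B.match("this is not a URI")))  # -> True
```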
Thanks. Perfectly matching a URI is more complex than it initially seems: see here for an attempt at constructing a sufficiently complete URI-matching regex. That particular regex might be usable for libtaxii, although beware that it may use a regex syntax different from Python's.
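As a brief hedged aside on that caveat: Python's re module rejects some syntax from other regex flavors at compile time, so a borrowed pattern may simply fail to compile.

```python
import re

# Illustrative only: Unicode property escapes such as \p{L} (common in PCRE)
# are not supported by Python's re module and raise re.error at compile time.
try:
    re.compile(r"\p{L}+://\S+")
except re.error as exc:
    print("Not valid Python re syntax:", exc)
```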
We actually use that Daring Fireball regex in one of our other projects, and it seems to work fine for what we intended: just grabbing URLs out of email message bodies.
Great, then this seems to be resolved.
The current uri_regex, as defined in validation.py, is not sufficiently precise to differentiate between legitimate, specification-compliant URIs and strings that could not possibly be used as URIs. Currently, the following regex is used:
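A sketch of that pattern, assuming the RFC 3986, Appendix B regex it is based on; the exact form in validation.py may differ:

```python
# Assumed reconstruction based on RFC 3986, Appendix B; not necessarily the
# exact string used in validation.py.
uri_regex = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
```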
Parts of this regex between brackets, such as [^:/?#], imply that every character may be matched except the colon, the slash, the question mark, and the number sign (#). Consequently, invalid characters, including non-printable "garbage" characters, will be considered valid by the above regex:
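A sketch of the demonstration, assuming the pattern above; the exact garbage string used originally may have differed:

```python
import re

uri_regex = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

print(re.match(uri_regex, "http://taxii.mitre.org"))  # matches, as expected
print(re.match(uri_regex, "http://\n\x07garbage\n"))  # also matches, despite the control characters
```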
(The first re.match demonstrates the correct validation of http://taxii.mitre.org; the second re.match demonstrates the incorrect validation of a string that contains newline characters.) Given sufficiently diverse input data, I reckon that this might cause problems in the long run if strings that are not real URIs are parsed and "green-lighted" by libtaxii.