Map keys not checked for uniqueness (must they be?) #28
Hmm, the descriptions for this need to be tightened up, but basically:
Spec clarified here: 2690775
Ah, now I see what that section was meant to say. I'm still unclear on this schema concept. It seems to be a combination of a traditional schema and a way to change the fundamental rules of parsing. CE already has ways to define "custom" types, but the schema will also let you re-purpose existing data types for new semantics. Normally I would just ignore a schema system, since I've seen them for XML/JSON/CSV/… and never found them to be worth the trouble. But with CE it looks like I may be physically unable to parse some CE data unless I implement schemas. Right? That's a little weird.
Hmm, I just noticed this rule:
And since references to keyable types are themselves keyable types, that means I have to keep all of the key sets of every map that contains any references, so that when I get to the end of the document and they're finally resolvable, I can tell whether the earlier maps were valid, and what's in them. I'd hoped I could process an arbitrarily large document using only O(max_nesting_depth) memory, but I don't think that's possible. Technically, that wasn't possible before -- I needed to keep the current map's worth of keys to check it -- but then at least "max object count in a container" would limit that. This looks like a bigger DOS risk to me. All someone has to do is make a bunch of maps with a lot of long keys and one forward reference, and the parser has to store essentially the entire document in memory until the end. (Or write a multi-pass parser, but if the data isn't coming from random-access storage, that might not be feasible.)
With these two features, it seems easy to build a 5 GB document that would require a parser to hold all of its data in memory at once until the very end (using well over 5 GB of memory, since hash tables are more expensive than raw data), without hitting any of the mandatory limits (no more than 1000 of anything in anything). I don't know when I'd ever use forward refs. I don't know why I'd ever use refs as keys. The intersection of these two features seems like more trouble than it's worth. Personally, I'd rather have a more streaming-friendly format. Maybe I'll give my implementation a "max map key buffer size" parameter, and set it to a tiny value by default. I don't want to block an actual feature, but I also can't imagine it'll be used much for legitimate data.
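For what it's worth, here's a rough sketch of what that "max map key buffer size" idea could look like (Go, all names hypothetical, nothing here comes from go-concise-encoding): one key set per open map for duplicate detection, plus a document-wide cap on the total bytes of keys being held, so forward references can't force unbounded retention.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errDuplicateKey      = errors.New("duplicate map key")
	errKeyBufferTooLarge = errors.New("max map key buffer size exceeded")
)

// keyTracker is a hypothetical sketch, not real library code: it keeps one key
// set per open map (for duplicate detection) and caps the total bytes of keys
// buffered across the whole document.
type keyTracker struct {
	maxKeyBufferBytes int                    // configurable safety limit
	bufferedBytes     int                    // bytes currently held across all open maps
	stack             []map[string]struct{}  // one key set per nesting level
}

func (t *keyTracker) beginMap() {
	t.stack = append(t.stack, map[string]struct{}{})
}

func (t *keyTracker) addKey(key string) error {
	top := t.stack[len(t.stack)-1]
	if _, exists := top[key]; exists {
		return errDuplicateKey
	}
	if t.bufferedBytes+len(key) > t.maxKeyBufferBytes {
		return errKeyBufferTooLarge
	}
	top[key] = struct{}{}
	t.bufferedBytes += len(key)
	return nil
}

func (t *keyTracker) endMap() {
	top := t.stack[len(t.stack)-1]
	for k := range top {
		t.bufferedBytes -= len(k)
	}
	t.stack = t.stack[:len(t.stack)-1]
}

func main() {
	t := &keyTracker{maxKeyBufferBytes: 8}
	t.beginMap()
	fmt.Println(t.addKey("a"))               // <nil>
	fmt.Println(t.addKey("a"))               // duplicate map key
	fmt.Println(t.addKey("a-very-long-key")) // max map key buffer size exceeded
	t.endMap()
}
```

A real parser would additionally have to keep key sets alive past endMap() whenever a map contains an unresolved forward reference, which is exactly where a cap like this earns its keep.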
The only type that cannot be parsed without a schema is the constant (which is only available in CTE, so only during human editing, or when viewed on a system where such a schema is available). Everything else is parseable and gets interpreted according to the defaults when there's no schema to say otherwise. I still haven't defined what the schema even is, although JSON-Schema looks good as a starting point. Regarding memory usage, you'll need to keep all keys of all maps leading up to the current depth. So if you had 1000-entry maps, 50 containers deep, you'd have to keep on average 50 × 1000 / 2 = 25,000 entries in memory in order to check for duplicates, with a worst case of 50 × 1000 = 50,000. But yes, if you have a forward ref to a key, you'd need to keep map keys around longer than the current depth... potentially for the whole document. Maybe it's better to just not allow references as keys at all. Not sure they add enough usefulness anyway... With references, an amplification attack is a real danger, which is why I'd double-count all objects and sub-objects that a reference represents. So if you have a marked map containing 1000 entries within its depths, every reference to that map "adds" 1000 objects towards the maximum objects allowed in a document. I think I put that in the spec but it may not be very clear...
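Here's how I read that double-counting rule, as a sketch (hypothetical names again, not taken from the spec or any implementation): when a reference is decoded, the object count of the marked subtree it points to is charged against the document's object budget a second time, as if it had been written inline, so repeated references can't amplify a small document past the limit.

```go
package main

import "fmt"

// objectBudget is a hypothetical sketch: every decoded object is charged
// against a document-wide limit, and a reference charges the full object
// count of the subtree it points to, as if that subtree had been inlined.
type objectBudget struct {
	maxObjects   int
	totalObjects int
	markedSizes  map[string]int // marker ID -> object count of the marked subtree
}

func (b *objectBudget) addObjects(n int) error {
	b.totalObjects += n
	if b.totalObjects > b.maxObjects {
		return fmt.Errorf("object count %d exceeds limit %d", b.totalObjects, b.maxObjects)
	}
	return nil
}

// onReference is called whenever a reference to markerID is decoded.
func (b *objectBudget) onReference(markerID string) error {
	return b.addObjects(b.markedSizes[markerID])
}

func main() {
	b := &objectBudget{maxObjects: 3000, markedSizes: map[string]int{"big": 1000}}
	_ = b.addObjects(1001)            // the marked map itself: 1000 entries plus the map
	fmt.Println(b.onReference("big")) // <nil>: 2001 objects counted so far
	fmt.Println(b.onReference("big")) // error: 3001 exceeds limit 3000
}
```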
I would think that if you had the document

The question of double-counting all referenced objects depends on how the parser reports them to the application. I haven't really given that much thought yet, and I don't know that it's ever spelled out in the spec. I'm starting to think that it would make sense, depending on how the application intends to use the document, to let it request any of:
I tried converting a marker/reference from CE to formats that don't have this feature:
I was hoping this would either duplicate the data or give a "references not supported" error, to give a hint as to the intent of that feature, but it dropped them, which I suppose just means it's not implemented yet either way.
The CE spec says, in various places:
Yet there's also a section "Key Collisions" which says:
and goes on to list 4 possible solutions. The 4th one is "abort processing", but the first 3 seem non-conformant.
Anyway, enctool seems to pick a 5th option: pass duplicate keys without error or warning.
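For comparison, that silent pass-through is roughly what you get from Go's encoding/json with duplicate object keys: no error, and the later value quietly wins once the data lands in a map. (Not enctool code, just an analogy.)

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// encoding/json accepts duplicate object keys without complaint;
	// the last value silently wins when unmarshaling into a map.
	var m map[string]int
	err := json.Unmarshal([]byte(`{"a": 1, "a": 2}`), &m)
	fmt.Println(m, err) // map[a:2] <nil>
}
```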