Consider Using JSON Lines format #3
We did consider JSON Lines, but decided against using it for at least the following reasons:
We don't think JSON Lines is going to buy us much, and it's yet another bespoke format that, while easy to implement, might lead to interop problems down the road like:
That said, we don't know how strongly people feel about 1, and we could address 2 by injecting a "previousLog" item as the first entry for the metadata log. We also don't think CELs are going to be processing speed/storage limited in any way given the current choices (or that they'd be all that different using JSON Lines vs. not). Sure, there may be corner cases, and we should talk about those, but we haven't found one that would make a huge difference yet.
@msporny The CBOR spec defines support for streaming applications, which should be supported by most processors. I believe it just means that you keep appending CBOR-encoded objects to the data stream. It might also be worth considering deterministic CBOR for the entries. As to your JSONL questions:
Stop processing, the log is invalid.
The newline at the end of each log line would never be part of a hash, and string values would need to escape any newline characters that are contained within.
I don't know why you would want to, but again it would be automatically escaped within a string literal.
No, use a consistent encoding for the whole file, ideally UTF-8. I don't think you can switch encodings in the middle of a JSON file?
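To illustrate the escaping point above: any conforming JSON serializer escapes a newline inside a string value, so a raw newline can never appear inside a single JSON Lines record. A quick Python sketch:

```python
import json

# A string value containing a newline is escaped by the serializer,
# so the literal never breaks a JSON Lines record.
record = {"message": "line one\nline two"}
encoded = json.dumps(record)

# The encoded form contains the two-character escape \n, not a raw newline.
assert "\n" not in encoded
assert "\\n" in encoded

# Round-tripping restores the original value.
assert json.loads(encoded)["message"] == "line one\nline two"
```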
I believe standardizing on the ability to create a cryptographic event log as a series of JSON documents is vastly more valuable than pure JSON. Since an event log is append-only, so too should its data structure be. JSON Lines is just semantics here -- what we're really talking about is an ordered list of JSON documents. The
I don't think I agree here but am not very experienced with CBOR to be honest. At the end of the day JSON Lines is still JSON so shouldn't it compress the same?
This can easily be done with an event at the beginning of a log that essentially points to a
Put another way, a powerful application of cryptographic event logs is often write once, read many (blockchains). Having to rewrite the data entirely on every event is a non-starter in many use cases.
@andrewwhitehead wrote:
Good, the answer I was expecting. I couldn't find where this rule is specified in the did:webvh spec (there is no text that identifies what a conforming document is).
So, the \r would be included in the hash? I couldn't find this rule in the did:webvh spec.
It's not about "want to" -- developers do weird things, and you don't have to escape

My point is: JSON Lines is being presented as something simple and easy to use -- and the reality is that it's underspecified and doesn't have a stable ref, so using it in a standard will require you to either create an RFC at IETF/W3C, or bake all the rules into the spec itself. Now, signing up to do that work is fine, it's just not free and it's far more effort than one might think it is... that, or you just remove the requirement and don't have to pay the high cost of standardizing a web page that a developer put together.
Hmm, upon re-reading the JSON Lines web page, looks like they mandate UTF-8, which is fine, at least that addresses the issue I stated above. Unfortunately, they then say that an implementer can choose to escape characters to work with plain ASCII, which then creates canonicalization problems... what do you hash, the UTF-8 version or the canonicalized version?
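A small sketch of the canonicalization hazard raised above: the raw-UTF-8 and ASCII-escaped forms decode to the same document, but their bytes (and therefore their hashes) differ. Python illustration:

```python
import hashlib
import json

doc = {"name": "café"}

# JSON Lines permits either raw UTF-8 output or ASCII-escaped output;
# both decode to the same document, but the byte streams differ.
utf8_form = json.dumps(doc, ensure_ascii=False).encode("utf-8")
ascii_form = json.dumps(doc, ensure_ascii=True).encode("utf-8")

# Semantically identical...
assert json.loads(utf8_form) == json.loads(ascii_form)
# ...yet they hash differently, so "what do you hash?" is a real question.
assert hashlib.sha256(utf8_form).hexdigest() != hashlib.sha256(ascii_form).hexdigest()
```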
@brianorwhatever wrote:
You have the sort of confidence in developers that I do not have (based on far too many observations of misimplementations of software standards). :)
That's not the point I was trying to make. The point I was trying to make was that a JSON Lines file and a CBOR file will have very different compression characteristics as the file size grows. I was also making the point that the differences might not matter at all (that is, all we need might be JSON and the folks arguing for CBOR don't have a strong case). That said, we'd need more use case examples -- if we only think about DID Document change logs, then we'll over-optimize for that case and might fail to create a generalized solution for all use cases. If we have a significant number of people that really want to see a CBOR representation, and we don't provide it, or provide an easy way to get there, then we will have failed to create a broadly useful standard.
Yes, I agree.
I'm not entirely convinced of this, but there seems to be enough push for "figure out how to use JSON Lines" for me to make another pass on the spec to see how it might work, using the suggestion you provided above. Before I do that, here are the questions I have: These questions presume people aren't going to back down from having a CBOR representation.
But ultimately, if we do some/all of the above, would did:webvh be interested in using this log format? If not, why? What other features would be needed?
I think there is a misinterpretation of how we are using JSON Lines in did:webvh. There aren't any canonicalization concerns, as we are never hashing the file. We are only ever hashing a JSON document using traditional DI proofs. So a

Does that answer your questions above?
I think @brianorwhatever has answered your questions about how JSON Lines does not impact hashing/signing/canonicalization, and we've all pitched the value of it -- both for writing and reading. I do see your comments about tightening up how JSON Lines MUST be used in the
@brianorwhatever wrote:
Yeah, I got that mostly. I was trying to ask questions beyond how IOW, if you're going to just stick to what
The event.
The JSON version.
No, all hashes are derived from the JSON canonicalized version.
The CBOR version contains just the compressed JSON data; JCS is mandatory. DCBOR isn't used. Streaming CBOR is used for the CBOR log format.
Yes, canonicalization is necessary before hashing the object. The minimum canonicalization necessary is to remove all
The process is above:
NOTE: This creates a situation where each JSON Line can have a different bytestream (some with spaces, some with horizontal and vertical tabs, some with
Probably, given the processor needs to do that at some point. This could have a negative consequence for very large JSON objects, so some limits would need to be specified (like SHOULD NOT allow JSON objects that have more than 1,000 properties throughout the entire data structure).

All of those answers are completely acceptable for your use case, but as you can imagine, the folks that want CBOR might not be happy with those answers, given that it requires their CBOR processors to convert to/from JSON and perform JCS to verify the log. Doing stuff like that in embedded systems is typically frowned upon. IOW, we run the risk of alienating the CBOR-only developers (or finding a middle ground).
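A rough sketch of the canonicalize-then-hash step described in the answers above. Note this is an approximation: `json.dumps` with sorted keys and compact separators removes insignificant whitespace and fixes key order, but full JCS (RFC 8785) additionally specifies number and string serialization rules.

```python
import hashlib
import json

def canonical_hash(obj):
    # Compact separators drop all insignificant whitespace; sort_keys gives
    # a deterministic key order. This approximates JCS (RFC 8785) for simple
    # data but is not a complete implementation of it.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two serializations that differ only in whitespace and key order
# hash identically after canonicalization.
a = json.loads('{"event": {"id": 1}, "proof": []}')
b = json.loads('{ "proof" : [ ] ,\n\t"event" : { "id" : 1 } }')
assert canonical_hash(a) == canonical_hash(b)
```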
@swcurran wrote:
Not exactly :), see #3 (comment) ... and note this one (which is among the ones with the strongest consequences for the CEL spec):
I think for our purposes it would make more sense to have a
On re-reading the spec, I think I've found something interesting for this conversation. None of the algorithms reference anything outside of the
I would add to @andrewwhitehead's
@msporny — to try to answer your questions — especially from Comment #3 and if did:webvh can use CEL.
I’m still trying to wrap my head around whether we could easily use CEL for did:webvh. At least the SCID, hash handling, linking and parameters usage would all have to be rethought. We can’t lose those, but the question is whether doing them differently would be acceptable in the tradeoffs we’ve considered — log size, portability, verification effort, SCID-security, etc. Seems more like an interactive conversation than a GitHub issue.
If full JSONL isn't acceptable, I think a top-level array makes more sense than the current design. See #12
Changes the CEL data model to use arrays instead of objects at the root level. This makes the format simpler and better matches how event logs actually work.

Key changes:
- Root structure is now an array instead of `{log: [...]}`
- previousLog is now an event type rather than a property
- Updated examples and docs to match

Before:
```json
{ "log": [{ "event": {...}, "proof": [...] }] }
```

After:
```json
[{ "event": {...}, "proof": [...] }]
```

This should make the format easier to work with, especially for streaming. It also means everything (including previousLog) is just an event, which simplifies processing.

Closes digitalbazaar#3
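A minimal sketch of how a processor might consume the array-root log, treating previousLog as just another event type as the change describes (the field names and URL here are illustrative, not taken from the spec):

```python
# Hypothetical array-root CEL log: previousLog is modeled as an ordinary
# event type, so a processor is a single loop over entries.
log = [
    {"event": {"type": "previousLog", "url": "https://example.com/old.log"}},
    {"event": {"type": "update", "id": 1}, "proof": []},
]

prior_logs, events = [], []
for entry in log:
    ev = entry["event"]
    if ev.get("type") == "previousLog":
        # A pointer to an earlier log segment, handled like any other event.
        prior_logs.append(ev["url"])
    else:
        events.append(ev)

assert prior_logs == ["https://example.com/old.log"]
assert len(events) == 1
```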
Given this is intended to be a log, the use of JSON Lines might be useful. Logs are precisely the use case JSON Lines was designed for.
By using JSON Lines, the log can be extended with each entry, vs. being re-written on every update. Further, those that are getting an update of a log they have already processed can request it (where supported -- e.g. using HTTP header parameters) from the point they have already received. Given the `previousEvent` hash, the log reader would know if there was a problem in the receipt. The log reader could cache either the entire log, adding on the new entries, or they could retain the (implementation-specific) state of their processing, and continue processing the new log entries. The current format precludes that type of partial file retrieval.
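The append-and-resume pattern described above can be sketched in a few lines, using an in-memory buffer to stand in for the log file (over the network, the saved byte offset is what an HTTP Range request would express):

```python
import io
import json

# An append-only JSON Lines log: each entry is one line.
log = io.BytesIO()
for event in [{"versionId": 1}, {"versionId": 2}]:
    log.write(json.dumps(event).encode("utf-8") + b"\n")
offset = log.tell()  # reader state: bytes consumed so far

# A new entry is appended without rewriting earlier bytes.
log.write(json.dumps({"versionId": 3}).encode("utf-8") + b"\n")

# The reader fetches only the tail and continues processing.
log.seek(offset)
new_entries = [json.loads(line) for line in log.read().splitlines()]
assert new_entries == [{"versionId": 3}]
```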