
Consider Using JSON Lines format #3

Open
swcurran opened this issue Dec 3, 2024 · 16 comments · May be fixed by #13

Comments

@swcurran

swcurran commented Dec 3, 2024

Given this is intended to be a log, the use of JSON Lines might be useful. Logs are the precise use case of JSON Lines.

By using JSON Lines, the log can be extended with each entry, rather than being re-written on every update. Further, those getting an update of a log they have already processed can request (where supported -- e.g., using HTTP headers such as Range) only the portion beyond the point they have already received. Given the previousEvent hash, the log reader would know if there was a problem in the receipt. The log reader could either cache the entire log, adding on the new entries, or retain the (implementation-specific) state of their processing and continue processing the new log entries.

The current format precludes that type of partial file retrieval.
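The incremental-fetch pattern described above can be sketched as follows. This is a minimal illustration, not spec text: it assumes the server honors HTTP Range requests, and the `versionId` field name is purely illustrative.

```python
import json

def parse_new_entries(cached_size: int, full_log: bytes) -> list:
    """Simulate a ranged fetch of a JSON Lines log: only bytes past what
    the reader already holds are retrieved and parsed as new entries.
    (A real client would send 'Range: bytes={cached_size}-' and receive
    just the tail; here we slice the full log to stand in for that.)"""
    new_bytes = full_log[cached_size:]
    return [json.loads(line) for line in new_bytes.splitlines() if line.strip()]

# A two-entry log; the reader has already processed the first 18 bytes
# (the first line plus its newline), so only entry 2 is parsed.
log = b'{"versionId":"1"}\n{"versionId":"2"}\n'
new_entries = parse_new_entries(18, log)
```

The reader would then check each new entry's previous-event hash against its cached tail to detect gaps or corruption in the transfer.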

@msporny
Member

msporny commented Dec 4, 2024

We did consider JSON Lines, but decided against using it for at least the following reasons:

  1. We couldn't easily go to CBOR, as there is no CBOR Lines definition, and CBOR was a target of interest among some that provided requirements for the spec. The CBOR representation is around 175 characters per log entry (with 3 witnesses per entry), which means you can effect about 1,000 changes and still be under 175KB in size... with processing times (to verify the log) under several hundred milliseconds: https://digitalbazaar.github.io/cel-spec/#minimizing-event-logs
  2. We wanted to put metadata on the top-most log object to provide things like the ability to link an arbitrary number of these files together, which is described more here (via the previousLog feature): https://digitalbazaar.github.io/cel-spec/#example-an-event-log-containing-multiple-events-0 ... this feature also allows one to do the "processing state" thing you mention before.

We don't think JSON Lines is going to buy us much, and it's yet another bespoke format that, while easy to implement, might lead to interop problems down the road, like:

  • What do you do with a malformed line? throw it out? stop processing?
  • Do you include the hash of a \r if you see a \r\n in processing?
  • Are you allowed to use U+2028 and U+2029 in JSON Lines? (legal in JSON)
  • Can you mix and match UTF-32BE, UTF-16BE, UTF-32LE, UTF-16LE, and UTF-8 on each line? (all legal modes for JSON)
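The U+2028/U+2029 edge case is easy to demonstrate. The snippet below (Python shown purely as an illustration) shows that U+2028 is legal unescaped inside a JSON string, yet some line-splitting routines treat it as a line break, silently breaking the one-object-per-line invariant:

```python
import json

# U+2028 (LINE SEPARATOR) is legal unescaped inside a JSON string,
# but Python's str.splitlines() treats it as a line boundary -- a
# naive JSON Lines splitter would corrupt this record.
record = '{"note": "a\u2028b"}'

parsed = json.loads(record)        # valid JSON: parses without error
naive_lines = record.splitlines()  # 2 "lines" -- the invariant breaks
strict_lines = record.split("\n")  # 1 line -- splitting strictly on \n is safe
```

A spec using JSON Lines would need to pin down exactly this kind of detail (split strictly on U+000A, or forbid unescaped U+2028/U+2029).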

That said, we don't know how strongly people feel about 1, and we could address 2 by injecting a "previousLog" item as the first entry for the metadata log. We also don't think CELs are going to be processing speed/storage limited in any way given the current choices (or that they'd be all that different using JSON Lines vs. not). Sure, there may be corner cases, and we should talk about those, but we haven't found one that would make a huge difference yet.

@andrewwhitehead

andrewwhitehead commented Dec 4, 2024

@msporny The CBOR spec defines support for streaming applications, which should be supported by most processors. I believe it just means that you keep appending CBOR-encoded objects to the data stream. It might also be worth considering deterministic CBOR for the entries.

As to your JSONL questions:

What do you do with a malformed line? throw it out? stop processing?

Stop processing, the log is invalid.

Do you include the hash of a \r if you see a \r\n in processing?

The newline at the end of each log line would never be part of a hash, and string values would need to escape any newline characters that are contained within.
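As a quick illustration of this point (Python's stdlib serializer shown; any conforming JSON serializer escapes control characters the same way):

```python
import json

entry = {"event": {"note": "line one\nline two"}, "proof": []}
line = json.dumps(entry)  # serializers emit \n inside strings as the escape \\n

# So each serialized object occupies exactly one physical line,
# and the embedded newline survives the round trip.
contains_newline = "\n" in line
round_tripped = json.loads(line)
```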

Are you allowed to use U+2028 and U+2029 in JSON Lines? (legal in JSON)

I don't know why you would want to, but again it would be automatically escaped within a string literal.

Can you mix and match UTF-32BE, UTF-16BE, UTF-32LE, UTF-16LE, and UTF-8 on each line? (all legal modes for JSON)

No, use a consistent encoding for the whole file, ideally UTF-8. I don't think you can switch encodings in the middle of a JSON file?

@brianorwhatever

I believe standardizing on the ability to create a cryptographic event log as a series of JSON documents is vastly more valuable than pure JSON. Since an event log is append-only, so too should its data structure be. JSON Lines is just semantics here -- what we're really talking about is an ordered list of JSON documents.

The previousLog in the example has an application/cel media type. That media type can just as easily define a list of JSON documents as it can define another type of JSON document. I'm also confident any developer (human or robot) wouldn't have a problem parsing a list of JSON to get a data type they are expecting, given the name of the data model (a log of events).

We couldn't easily go to CBOR, as there is no CBOR Lines definition, and CBOR was a target of interest among some that provided requirements for the spec. The CBOR representation is around 175 characters per log entry (with 3 witnesses per entry), which means you can effect about 1,000 changes and still be under 175KB in size... with processing times (to verify the log) under several hundred milliseconds: https://digitalbazaar.github.io/cel-spec/#minimizing-event-logs

I don't think I agree here but am not very experienced with CBOR to be honest. At the end of the day JSON Lines is still JSON so shouldn't it compress the same?

We wanted to put metadata on the top-most log object to provide things like the ability to link an arbitrary number of these files together, which is described more here (via the previousLog feature): https://digitalbazaar.github.io/cel-spec/#example-an-event-log-containing-multiple-events-0 ... this feature also allows one to do the "processing state" thing you mention before.

This can easily be done with an event at the beginning of a log that essentially points to a previousLog. In fact, regardless of the application of the event log I expect most will have some sort of initializing event.

@brianorwhatever

Put another way, a powerful application of cryptographic event logs is often write-once-read-many (blockchains). Having to rewrite the data entirely on every event is a non-starter in many use cases.

@msporny
Member

msporny commented Dec 5, 2024

@andrewwhitehead wrote:

As to your JSONL questions:

What do you do with a malformed line? throw it out? stop processing?
Stop processing, the log is invalid.

Good, the answer I was expecting. I couldn't find where this rule is specified in the did:webvh spec (there is no text that identifies what a conforming document is).

Do you include the hash of a \r if you see a \r\n in processing?

The newline at the end of each log line would never be part of a hash, and string values would need to escape any newline characters that are contained within.

So, the \r would be included in the hash? I couldn't find this rule in the did:webvh spec.

Are you allowed to use U+2028 and U+2029 in JSON Lines? (legal in JSON)
I don't know why you would want to, but again it would be automatically escaped within a string literal.

It's not about "want to" -- developers do weird things, and you don't have to escape U+2028 and U+2029 in JSON -- my point being, using JSON Lines creates new things you need to think about and spec... and the JSON Lines spec doesn't say anything about these details.

My point is: JSON Lines is being presented as something simple and easy to use -- and the reality is that it's underspecified and doesn't have a stable ref, so using it in a standard will require you to either create an RFC at IETF/W3C, or bake all the rules into the spec itself.

Now, signing up to do that work is fine, it's just not free and it's far more effort than one might think it is... that, or you just remove the requirement and don't have to pay the high cost of standardizing a web page that a developer put together.

Can you mix and match UTF-32BE, UTF-16BE, UTF-32LE, UTF-16LE, and UTF-8 on each line? (all legal modes for JSON)
No, use a consistent encoding for for the whole file, ideally UTF-8. I don't think you can switch encodings in the middle of a JSON file?

Hmm, upon re-reading the JSON Lines web page, looks like they mandate UTF-8, which is fine, at least that addresses the issue I stated above. Unfortunately, they then say that an implementer can choose to escape characters to work with plain ASCII, which then creates canonicalization problems... what do you hash, the UTF-8 version or the canonicalized version?
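The canonicalization problem described here is concrete: two byte sequences that both satisfy the JSON Lines page (UTF-8, or ASCII with escapes) encode the same object but hash differently. A small Python sketch (the `did:example:café` identifier is purely illustrative):

```python
import hashlib
import json

event = {"id": "did:example:caf\u00e9"}  # illustrative identifier with a non-ASCII char

raw_utf8 = json.dumps(event, ensure_ascii=False).encode("utf-8")  # café as UTF-8 bytes
escaped = json.dumps(event, ensure_ascii=True).encode("utf-8")    # café as \u00e9 escape

# Both are valid JSON Lines encodings of the same object,
# yet they produce different hashes.
hash_utf8 = hashlib.sha256(raw_utf8).hexdigest()
hash_escaped = hashlib.sha256(escaped).hexdigest()
```

Any hash over the serialized line therefore requires the spec to mandate one of the two encodings (or a full canonicalization step) before hashing.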

@msporny
Member

msporny commented Dec 5, 2024

@brianorwhatever wrote:

I'm also confident any developer (human or robot) wouldn't have a problem parsing a list of JSON to get a data type they are expecting

You have the sort of confidence in developers that I do not have (based on far too many observations of misimplementations of software standards). :)

I don't think I agree here but am not very experienced with CBOR to be honest. At the end of the day JSON Lines is still JSON so shouldn't it compress the same?

That's not the point I was trying to make. The point I was trying to make was that a JSON Lines file and a CBOR file will have very different compression characteristics as the file size grows. I was also making the point that the differences might not matter at all (that is, all we need might be JSON and the folks arguing for CBOR don't have a strong case).

That said, we'd need more use case examples -- if we only think about DID Document change logs, then we'll over-optimize for that case and might fail to create a generalized solution for all use cases. If we have a significant number of people that really want to see a CBOR representation, and we don't provide it, or provide an easy way to get there, then we will have failed to create a broadly useful standard.

This can easily be done with an event at the beginning of a log that essentially points to a previousLog. In fact, regardless of the application of the event log I expect most will have some sort of initializing event.

Yes, I agree.

@msporny
Member

msporny commented Dec 5, 2024

Having to rewrite the data entirely every event is a non-starter in many use cases.

I'm not entirely convinced of this, but there seems to be enough push for "figure out how to use JSON Lines" for me to make another pass on the spec to see how it might work, using the suggestion you provided above. Before I do that, here are the questions I have:

These questions presume people aren't going to back down from having a CBOR representation.

  1. What is the content that's hashed? The JSON version or the CBOR version? That is, are the hashes for a CBOR-encoded version and a JSON-encoded version different or the same?
  2. For CBOR, do we hash a bytestream without any canonicalization? Do we use dCBOR and hash that? How do we separate CBOR objects in the log (using Streaming CBOR?)?
  3. For JSON Lines, is there any canonicalization on the JSON object that's done before we hash the object in JSON Lines? Clearly, there is an unspoken c14n step for JSON Lines, but what is that process? Should we hash a JCS version of the event and that becomes the entry?

But ultimately, if we do some/all of the above, would did:webvh be interested in using this log format? If not, why? What other features would be needed?

@msporny changed the title from "JSON Lines for logs" to "Consider Using JSON Lines format" on Dec 5, 2024
@brianorwhatever

I think there is a misinterpretation of how we are using JSON lines in did:webvh. There aren't any canonicalization concerns as we are never hashing the file. We are only ever hashing a JSON document using traditional DI proofs. So a did.jsonl that has 5 versions has 5 lines in it, where each line is a JSON object that has Data Integrity proofs on it.
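The append-only pattern described here (one self-verifying entry per line, file never rewritten) might look roughly like this sketch; the entry contents and file name mirror the `did.jsonl` example above, but the helper names are hypothetical:

```python
import json
import os
import tempfile

def append_entry(path: str, entry: dict) -> None:
    """Append one log entry as a single JSON Line.
    Earlier lines are never rewritten -- the file only grows."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

def read_entries(path: str) -> list:
    """Parse each line back into an entry object (proofs verified elsewhere)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Two versions -> two lines, appended independently.
log_path = os.path.join(tempfile.mkdtemp(), "did.jsonl")
append_entry(log_path, {"versionId": "1-abc", "proof": []})
append_entry(log_path, {"versionId": "2-def", "proof": []})
entries = read_entries(log_path)
```

Because each line carries its own Data Integrity proof, nothing in this layer needs to hash the file itself.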

Does that answer your questions above?

@swcurran
Author

swcurran commented Dec 5, 2024

I think @brianorwhatever has answered your questions about how JSON Lines does not impact hashing/signing/canonicalization, and we’ve all pitched the value of it — both for writing and reading. I do see your comments about tightening up how JSON Lines MUST be used in the did:webvh (and other specs) and we’ll see what we can do to clarify.

@msporny
Member

msporny commented Dec 8, 2024

@brianorwhatever wrote:

I think there is a misinterpretation of how we are using JSON lines in did:webvh. There aren't any canonicalization concerns as we are never hashing the file. We are only ever hashing a JSON document using traditional DI proofs.

Yeah, I got that mostly. I was trying to ask questions beyond how did:webvh works today. Like -- is there only ever one log file? Can you point to things outside of the log file (via cryptographic hash)? If so, aren't you concerned of having really huge log files? What do you do in that case? Is it possible to hash the log file itself?

IOW, if you're going to just stick to what did:webvh does today, then these seem to be the answers to those questions:

What is the content that's hashed?

The event.

The JSON version or the CBOR version?

The JSON version.

That is, are the hashes for a CBOR-encoded version and a JSON-encoded version different or the same?

No, all hashes are derived from the JSON canonicalized version.

For CBOR, do we hash a bytestream without any canonicalization? Do we use DBOR and hash that? How do we separate CBOR objects in the log (using Streaming CBOR?)?

The CBOR version contains just the compressed JSON data; JCS is mandatory, and dCBOR isn't used. Streaming CBOR is used for the CBOR log format.

For JSON Lines, is there any canonicalization on the JSON object that's done before we hash the object in JSON Lines?

Yes, canonicalization is necessary before hashing the object. The minimum canonicalization necessary is to remove all \n values and encode the entire object as UTF-8. JCS is not required for canonicalization to JSON Lines, but is required for the proofs.

Clearly, there is an unspoken c14n step for JSON Lines, but what is that process?

The process is above:

  1. Remove all \n values.

NOTE: This creates a situation where each JSON Line can have a different bytestream (some with spaces, some with horizontal and vertical tabs, some with \r values). The benefit here is "simpler canonicalization", the drawback is that it would be nice to JCS the content given that the proof step needs to do that anyway.

Should we hash a JCS version of the event and that becomes the entry?

Probably, given the processor needs to do that at some point. This could have a negative consequence for very large JSON objects, so some limits would need to be specified (like SHOULD NOT allow JSON objects that have more than 1,000 properties throughout the entire data structure).
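That JCS-then-hash step might be sketched as follows. This approximates JCS with Python's stdlib `json` options; full JCS (RFC 8785) additionally constrains number serialization, which this sketch omits:

```python
import hashlib
import json

def jcs_like(obj) -> bytes:
    """Approximate JCS (RFC 8785): lexicographically sorted keys, no
    insignificant whitespace, UTF-8 bytes. (Full JCS also fixes number
    serialization rules, omitted here.)"""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def entry_hash(event: dict) -> str:
    """Hash the canonical form; this digest would become the log entry."""
    return hashlib.sha256(jcs_like(event)).hexdigest()

# Key order no longer affects the hash once canonicalized.
a = entry_hash({"type": "create", "id": "ex:1"})
b = entry_hash({"id": "ex:1", "type": "create"})
```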

All of those answers are completely acceptable for your use case, but as you can imagine, the folks that want CBOR might not be happy with those answers given that it requires their CBOR processors to convert to/from JSON, and perform JCS to verify the log. Doing stuff like that in embedded systems is typically frowned upon. IOW, we run the risk of alienating the CBOR-only developers (or finding a middle ground).

@msporny
Member

msporny commented Dec 8, 2024

@swcurran wrote:

I think @brianorwhatever has answered your questions

Not exactly :), see #3 (comment)

... and not this one (which is the one with the strongest consequences for the CEL spec):

But ultimately, if we do some/all of the above, would did:webvh be interested in using this log format? If not, why? What other features would be needed?

@andrewwhitehead

I think for our purposes it would make more sense to have a nextLog-type transaction (setting a parameter with the new log location). With previousLog the resolver would need to walk backwards to find the first log, then walk forwards again through all the transactions. It seems more efficient to always start from the first log (did.jsonl in our case), and this avoids any log rewriting. The cryptographic chain from the entry hashes is sufficient without also hashing the previous log file.
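The forward-walking resolution described here might look like the following sketch; the `nextLog` parameter name, the `parameters` placement, and the in-memory `logs` mapping are all hypothetical stand-ins for fetched log files:

```python
def resolve_all_entries(logs: dict, first: str) -> list:
    """Walk log files forward from the first, following a hypothetical
    'nextLog' parameter carried by the final entry of each file."""
    entries, name = [], first
    while name is not None:
        log = logs[name]
        entries.extend(log)
        # The last entry of a file may point at a continuation file;
        # absence of the pointer ends the walk.
        name = log[-1].get("parameters", {}).get("nextLog") if log else None
    return entries

# Illustrative stand-in for did.jsonl plus one continuation file.
logs = {
    "did.jsonl": [
        {"versionId": "1-aaa"},
        {"versionId": "2-bbb", "parameters": {"nextLog": "did-2.jsonl"}},
    ],
    "did-2.jsonl": [{"versionId": "3-ccc"}],
}
history = resolve_all_entries(logs, "did.jsonl")
```

Because each entry's hash already chains to its predecessor, the walk needs no separate hash over the previous log file.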

@brianorwhatever

on re-read of the spec I think I've found something interesting for this conversation. None of the algorithms reference anything outside of the eventLog. So the JSON here is essentially just a meta layer wrapping the event log which is the core of this specification. I wonder if a more explicit separation of these layers would make this conversation irrelevant.
What data is in this meta layer outside of previousLog? The data at that layer won't be cryptographically protected unless otherwise stated (as is the case with previousLog, which has a proof). This seems like it will add complexity for consumers whenever a new meta property is added here. I think verification by consumers should be tied to log entries only.

@swcurran
Author

swcurran commented Dec 9, 2024

I would add to @andrewwhitehead’s nextLog that did:webvh would need to have relative references to the logs because (a) did:webvh allows portability (retaining the SCID and history, but changing the location part of the DID), and (b) that requires that the log be moved WITHOUT changing the history. Thus the nextLog references would have to point to files (logically) beside the first log.

@swcurran
Author

swcurran commented Dec 9, 2024

@msporny — to try to answer your questions — especially from Comment #3 and if did:webvh can use CEL.

  1. Currently did:webvh does not have a way to do multi-file logs, so that would be helpful.
  2. The hashing in did:webvh is very specific as to what is included in any hash — the SCID and entryHash. It is not “just over the event”, but it is over specific JSON that is canonicalized using JCS. Likewise, JCS is used in the Data Integrity proof cryptosuite.
    1. The SCID calculation puts placeholders (the literal {SCID}) in for the SCID, hashes the result, and then replaces the placeholders with the SCID. Verification is the reverse.
    2. Entry hashing sets the versionId to the previous entry’s value (excluding the proof), hashes the entry, and puts the resulting hash into the versionId. This enables the chaining of the events from the inception. Again, verification repeats the process.
  3. The JSON Lines serialization is generated from the event after it is created. No hashing is ever done on it.
  4. I would think that would be how it could be used in CEL — that (for example) a prevHash would be defined as a hash on the event before it is put into JSON Lines format using JCS or the like. I agree that any edge cases on JSON Lines such as you mentioned need to be formalized. But I’m optimistic those are pretty light.
  5. In addition to the versionId, versionTime, the DIDDoc (state) and the proof — there is an additional field parameters that contains metadata about how to process the current and subsequent DID Log entries. It is used in a number of places, and is important to enable the longevity of a DID — being able to outlast crypto algorithms over the lifetime of the DID. We could not use CEL without that.
  6. The challenge I have is both the need for CEL to have same capability, and if it did, where specific parameters would go. For example:
    1. I think both CEL and did:webvh would need the identification of a version.
    2. Which should define what hash algorithms are permitted? I suspect both.
    3. Clearly some of the parameters are did:webvh specific — pre-rotation, authorized keys.
  7. Witness handling is very different in CEL and did:webvh, as we have opted to try to keep the bulk (in files size and verification processing) introduced by witnesses to a minimum.

I’m still trying to wrap my head around whether we could easily use CEL for did:webvh. At least the SCID, hash handling, linking and parameters usage would all have to be rethought. We can’t lose those, but the question is whether doing them differently would be acceptable in the tradeoffs we’ve considered — log size, portability, verification effort, SCID-security, etc.

Seems more like an interactive conversation than a GitHub issue.

@brianorwhatever

If full JSONL isn't acceptable, I think a top-level array makes more sense than the current design. See #12

brianorwhatever added a commit to aviarytech/cel-spec that referenced this issue Jan 15, 2025
Changes the CEL data model to use arrays instead of objects at the root level. This makes the format simpler and better matches how event logs actually work.

Key changes:
- Root structure is now an array instead of {log: [...]}
- previousLog is now an event type rather than a property
- Updated examples and docs to match

Before:
```json
{
"log": [{
"event": {...},
"proof": [...]
}]
}
```

After:
```json
[{
"event": {...},
"proof": [...]
}]
```

This should make the format easier to work with, especially for streaming. It also means everything (including previousLog) is just an event, which simplifies processing.

Closes digitalbazaar#3
@brianorwhatever brianorwhatever linked a pull request Jan 15, 2025 that will close this issue