
Use array for data model #12

Open
brianorwhatever opened this issue Jan 14, 2025 · 1 comment

Comments

brianorwhatever commented Jan 14, 2025

I’d like to propose defining the CEL data model as an array rather than a single JSON object. Here’s why:

  1. Event Log Use Cases: Many use cases I envision treat CEL as an event log, which is inherently sequential. Arrays naturally represent an ordered list of events, making the data easier to parse and verify as a chronological record.

  2. Abstraction from JSONL: The format doesn’t have to be JSONL; it could be a top-level array in JSON. If we represent the data model as an array, then JSONL can easily and effectively be abstracted into array-like entries. This approach maintains simplicity while allowing compatibility with streaming or line-delimited formats.

  3. Top-Level Properties as Events: Any properties that might otherwise be placed at the top level (such as previousLog) can be inserted as events that follow the same parsing and verification logic. This consistency reduces complexity, since implementers only need to handle a single, well-defined structure (the array).

  4. Clarity and Consistency: By focusing on an array-based model, we keep the spec uncluttered and aligned with the predominant need for event logging. Objects that include additional top-level fields introduce more overhead and potential confusion when verifying or parsing data.
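To illustrate points 2 and 3, here is a minimal sketch of what this could look like (the event shapes, type names, and handler structure are all hypothetical, not from any spec):

```python
import json

# Hypothetical array-based CEL: every entry, including what might otherwise
# be a top-level property like "previousLog", is just another event.
log = [
    {"type": "previousLog", "hash": "<hash-of-prior-log>"},
    {"type": "create", "data": {"id": "example:123"}},
    {"type": "update", "data": {"id": "example:123", "name": "renamed"}},
]

HANDLERS = {
    "previousLog": lambda e: None,  # resolve/verify the prior log (stubbed)
    "create": lambda e: None,
    "update": lambda e: None,
}

def verify(log):
    # Single well-defined structure: iterate the array in order and apply
    # the same parsing/verification logic to every entry.
    for event in log:
        handler = HANDLERS.get(event["type"])
        if handler is None:
            raise ValueError(f"unrecognized event type: {event['type']}")
        handler(event)

verify(log)

# JSONL abstraction (point 2): one event per line maps 1:1 to array entries.
jsonl = "\n".join(json.dumps(e) for e in log)
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == log
```

The JSONL round-trip at the end shows why a line-delimited serialization can be treated as a pure abstraction over the array model.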

For reference, please see: #3 and decentralized-identity/didwebvh#160 (comment)

dlongley (Member) commented Jan 20, 2025

Just some quick thoughts... perhaps there are easier answers to what I jotted down below; I would have written something shorter if I had more time.

One disadvantage of using an array at the top level of the log is that it invites putting every feature into every event in the log. This could include something like witness signatures, which would only be produced after the event's creation -- and potentially after subsequent events as well.

This has two possible negative implications. The first is that events in the log would become only partially immutable, requiring a special algorithm for reproducing their hashes / covered content for signature or general blockchain-like verification. Second, implementations could not keep a simple counter / index into the log marking what they have verified so far, harming the efficiency and usefulness of caching portions of the already-verified log.
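To make the first concern concrete: if mutable data such as witness signatures lived inside each event, an event's hash could no longer cover the whole event, forcing a special covered-content algorithm like the sketch below (the field names and canonicalization choice are invented for illustration):

```python
import hashlib
import json

# Hypothetical: fields that may be added or changed after event creation
# must be stripped before hashing, since they aren't immutable.
MUTABLE_FIELDS = {"witnessSignatures"}

def event_hash(event):
    # Remove mutable fields, then hash a deterministic serialization.
    covered = {k: v for k, v in event.items() if k not in MUTABLE_FIELDS}
    data = json.dumps(covered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(data.encode()).hexdigest()

e1 = {"type": "create", "data": {"id": "x"}}
e2 = {**e1, "witnessSignatures": ["sig-added-later"]}

# The hash must stay stable even though the event was mutated in place --
# exactly the "partially immutable" property described above.
assert event_hash(e1) == event_hash(e2)
```

Every implementation would have to agree on the mutable-field list and the canonicalization, which is the extra complexity being flagged here.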

Of course, another way to model these features is as additional events themselves, with the requirement that every new event be immutable. But this then strongly implies that any new feature requires buy-in from every consumer, since existing consumers would otherwise be unable to read the log. Presumably, consumers should reject logs with unrecognized event types. If not, then some kind of "critical/non-critical" flag would need to be understood, and there are a lot of dragons in that direction. And what if consumers later add support for non-critical events? Would this require reprocessing of the whole history? Will different consumers make different choices, and is that okay in actual practice -- and if so, how?
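A sketch of the dilemma described above, with a hypothetical consumer loop and an invented `critical` flag (none of these names come from any spec):

```python
def process(log, supported):
    """Hypothetical consumer: either reject unknown event types outright,
    or honor a critical/non-critical flag and skip what it can't handle."""
    for event in log:
        if event["type"] in supported:
            continue  # apply the event (elided)
        if event.get("critical", True):
            # Unknown critical event: this consumer cannot safely proceed.
            raise ValueError(f"unsupported critical event: {event['type']}")
        # Unknown non-critical event: skipped for now -- but if support is
        # added later, does the whole history need reprocessing?

log = [
    {"type": "create"},
    {"type": "witnessSignature", "critical": False},  # hypothetical feature
]

# Succeeds only because the non-critical flag is honored; a stricter
# consumer that rejects all unknown types could not read this log at all.
process(log, supported={"create"})
```

Two consumers making different choices here would disagree about whether the same log is valid, which is the interoperability hazard being raised.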

All of these concerns could potentially be addressed by using a top-level object that separates the event log from the other information -- but there could be other ways too. For instance, features like witness signatures could be kept in totally separate streams (i.e., in other "log files"), such that they are also always immutable and can later be processed independently if a consumer adds support for them. However, considering the witness case specifically, it is not expected that all witnesses will coordinate, nor that every witness will be desirable to every consumer, which implies it might be better to have separate files for different witnesses too. This further suggests that a metadata file to keep all of these things together would be of benefit.

That metadata file itself needn't be signed. But it would still be desirable to keep hashes of the two most recent metadata file versions in some separate secure record (e.g., as in the High Assurance DIDs with DNS proposal) to increase confidence that there hasn't been an unauthorized overwrite. (Note: keeping at least two hashes provides availability during upgrades across partitions.)
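A small sketch of the availability point in that note: keeping the two most recent metadata hashes in the secure record lets consumers that fetch either version during an upgrade window still validate (all details here are hypothetical):

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

old_metadata = b'{"eventLog":"log-v1.jsonl"}'
new_metadata = b'{"eventLog":"log-v2.jsonl"}'

# Hypothetical secure record (e.g., a DNS-based entry) holding the two
# most recent metadata hashes during the rollout of a new version.
secure_record = {h(old_metadata), h(new_metadata)}

def metadata_ok(fetched: bytes) -> bool:
    # A consumer accepts metadata only if its hash appears in the record.
    return h(fetched) in secure_record

# Consumers on either side of the upgrade partition still validate,
# while an unauthorized overwrite is rejected.
assert metadata_ok(old_metadata) and metadata_ok(new_metadata)
assert not metadata_ok(b'{"eventLog":"evil.jsonl"}')
```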

Some of this gets into the threat modeling around the CEL spec and what it does and doesn't mitigate (e.g., log servers could serve truncated, but otherwise valid, log files to different consumers, and so on -- what should the CEL spec say about this?).

Anyway, these ideas should be considered when deciding whether to use a top-level object or array, and when settling the general design (one file or many, etc.). Maybe we want some kind of metadata file to keep immutable and mutable data, and different streams or versions of information, separate. This could be a single top-level file that points at the most recent event log (and anything else, e.g., "witness log files" or other "features"), or it could be a single top-level file with an object that embeds the most recent event log and other information. If we decide the spec should support multiple files anyway, then the former seems better to me.
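One possible shape for the "points at" variant, where mutable pointers live in the metadata file while the referenced logs stay append-only and immutable (every name and field here is invented, purely to make the idea concrete):

```python
import hashlib

# The immutable, append-only event log, fetched separately.
event_log_bytes = b'{"type":"create"}\n{"type":"update"}\n'

# Hypothetical metadata file: it points at the most recent event log and
# at per-witness log files, so each stream can be processed independently.
metadata = {
    "eventLog": {
        "url": "https://example.com/log.jsonl",
        "hash": hashlib.sha256(event_log_bytes).hexdigest(),
    },
    "witnessLogs": [
        {"witness": "did:example:w1", "url": "https://example.com/w1.jsonl"},
    ],
}

def event_log_ok(fetched: bytes) -> bool:
    # A consumer fetches metadata["eventLog"]["url"] and checks the hash,
    # so only the small metadata file is mutable.
    return hashlib.sha256(fetched).hexdigest() == metadata["eventLog"]["hash"]

assert event_log_ok(event_log_bytes)
assert not event_log_ok(event_log_bytes + b'{"type":"sneaky"}\n')
```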
