Replace JSON Lines with JSON to simplify implementation, improve processing speed, and enhance extensibility #160

Open
filip26 opened this issue Dec 13, 2024 · 13 comments

@filip26

filip26 commented Dec 13, 2024

Hi,
I’d like to propose avoiding JSON Lines for the following reasons:

Added Complexity

  • Supporting JSON Lines requires additional implementation effort to handle both standard JSON and JSON Lines parsing.
  • Converting JSON Lines into standard JSON through pre-processing is inefficient, as it results in redundant parsing with no added value other than compatibility.

Limited Extensibility

  • JSON Lines does not allow adding metadata, such as positions or links to subsequent chunks, etc.

Inefficient Processing

  • Processing line-by-line in a streaming context is less efficient compared to handling chunks or pages.
  • JSON Lines enforces sequential, linear history processing. A standard JSON object with embedded links enables non-linear history processing.
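
To make the "metadata and embedded links" idea concrete, here is a minimal sketch of what a chunked JSON log could look like. It is only an illustration: the field names `position`, `entries` and `next`, the page names, and the in-memory `PAGES` store are all assumptions made up for this example and are not defined anywhere in this thread or in the spec.

```python
import json

# Hypothetical chunked layout. The field names "position", "entries" and "next"
# are illustrative assumptions; nothing like them is defined in the thread.
PAGES = {
    "log-0.json": json.dumps(
        {"position": 0, "entries": [{"n": 1}, {"n": 2}], "next": "log-1.json"}
    ),
    "log-1.json": json.dumps(
        {"position": 2, "entries": [{"n": 3}], "next": None}
    ),
}


def load_page(name):
    # Stand-in for a file read or an HTTP fetch of one page.
    return json.loads(PAGES[name])


def iter_entries(start):
    # Follow the "next" links from any starting page until the chain ends.
    name = start
    while name is not None:
        page = load_page(name)
        yield from page["entries"]
        name = page.get("next")
```

Under these assumptions a reader that already knows a later page name can start there, for example `list(iter_entries("log-1.json"))`, without parsing the earlier pages.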

Using JSON improves adoption, speeds up processing, and supports extensibility.

Please consider the outcome. Thank you.

JSON Lines were likely intended to serve as a replacement for CSV

@brianorwhatever
Contributor

I ran some comparisons in Python, Go, and Node (see below).

Comparison Results

Python

--- Testing with 10000 items ---
Single JSON parse: 12ms
JSONL parse: 31ms

--- Testing with 100000 items ---
Single JSON parse: 126ms
JSONL parse: 326ms

--- Testing with 1000000 items ---
Single JSON parse: 1228ms
JSONL parse: 2834ms

Go

--- Testing with 10000 items ---
Single JSON parse: 20ms
JSONL parse: 27ms

--- Testing with 100000 items ---
Single JSON parse: 240ms
JSONL parse: 312ms

--- Testing with 1000000 items ---
Single JSON parse: 2294ms
JSONL parse: 3159ms

Node.js

--- Testing with 10000 items ---
Single JSON parse: 7ms
JSONL parse: 20ms

--- Testing with 100000 items ---
Single JSON parse: 129ms
JSONL parse: 128ms

--- Testing with 1000000 items ---
Single JSON parse: 2218ms
JSONL parse: 1441ms

Python: Single JSON parse is roughly 2–3x faster than JSONL parsing at all tested sizes. For 1 million items, it finishes in 1228ms compared to JSONL’s 2834ms.

Go: Single JSON parse is consistently faster for all sizes. The difference ranges from 7ms at 10,000 items to about 865ms at 1 million items.

Node: Single JSON parse is faster at 10,000 items (7ms vs. 20ms). At 100,000 items the two are effectively tied (129ms vs. 128ms), and at 1 million items JSONL parse finishes sooner (1441ms) than single JSON parse (2218ms).


These results do show a slight performance preference for a single JSON array in Python and Go, with Node having some interesting behavior at larger sizes. Still, JSONL brings important benefits for our use case:

  • Streaming: We can process data line by line without loading everything into memory at once.

  • Incremental Processing: It's simpler to parse each entry independently, which can help when partial consumption is needed.

  • Flexibility: Adding new records becomes easier—just append another line.

  • Line-Based Handling: Tools and standard Unix utilities can read and write line-delimited entries naturally.

Given these advantages, the small performance tradeoff is acceptable for scenarios where streaming and incremental processing matter. I have opened digitalbazaar/cel-spec#12 as well to attempt to move that specification in at least an array data model direction.
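
For anyone who wants to reproduce this kind of comparison, the sketch below shows one way it could be set up in Python. It is not the script used for the numbers above: the record shape produced by `make_items`, the helper names, and the single-run timing approach are all assumptions, so the timings it prints will differ.

```python
import json
import time


def make_items(n):
    # Synthetic records; this shape is an assumption, not the thread's test data.
    return [{"id": i, "name": f"item-{i}", "tags": ["a", "b"]} for i in range(n)]


def parse_single_json(text):
    # One parser call over the whole array.
    return json.loads(text)


def parse_jsonl(text):
    # One parser call per non-empty line. Split on "\n" only, because
    # str.splitlines() also breaks on characters that may appear inside JSON strings.
    return [json.loads(line) for line in text.split("\n") if line]


def time_ms(fn, text):
    # Wall-clock time of a single call, in milliseconds.
    start = time.perf_counter()
    result = fn(text)
    return result, (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    for n in (10_000, 100_000):
        items = make_items(n)
        as_json = json.dumps(items)
        as_jsonl = "\n".join(json.dumps(item) for item in items)
        parsed_json, json_ms = time_ms(parse_single_json, as_json)
        parsed_jsonl, jsonl_ms = time_ms(parse_jsonl, as_jsonl)
        assert parsed_json == parsed_jsonl  # both inputs describe the same data
        print(f"--- Testing with {n} items ---")
        print(f"Single JSON parse: {json_ms:.0f}ms")
        print(f"JSONL parse: {jsonl_ms:.0f}ms")
```

A single timed run like this is noisy; repeating each measurement and taking the best or median value would give steadier numbers.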

@filip26
Author

filip26 commented Jan 14, 2025

Sorry, @brianorwhatever, but your comparison of parsing speeds between JSON objects and JSON Lines isn't relevant to the issue I raised.

There is no valid justification for maintaining both JSON and JSON Lines, as it only serves to introduce additional complexity.

@brianorwhatever
Contributor

@filip26 I have listed 4 reasons that justify JSON Lines. The comparison of parsing speeds is in response to "improve processing speed" in the issue title.

@filip26
Author

filip26 commented Jan 14, 2025

@brianorwhatever Clearly, we’re not on the same page when it comes to computer science and engineering. I interpret your argument as an attempt to justify introducing it, though I’m not sure why. I raised this issue to improve webvh - take it or leave it.

@andrewwhitehead
Contributor

@brianorwhatever I think native JSON parsing in Python is still notoriously slow, it might be more fair to use an additional library like orjson.

@brianorwhatever
Contributor

FWIW, others, both artificial and human, are on the same page.

@filip26
Author

filip26 commented Jan 14, 2025

@brianorwhatever You're comparing JSON objects and JSON Lines, and that’s exactly my point. Choose one - JSON or JSON Lines - not both, to keep things simple. Having both adds unnecessary complexity without providing any real value.

If you dislike JSON objects, consider using JSON arrays (just add the comma) and stick with JSON. That approach would be far better than forcing everyone to adopt and maintain a relatively obscure technology designed for a completely different purpose.

I’ve also noticed - though perhaps I’m mistaken - a tendency to view the world through the lens of a single programming language. Let’s also consider portability; after all, there are approximately 900 active programming languages to account for.

@brianorwhatever
Contributor

brianorwhatever commented Jan 14, 2025

Ok, now we are getting somewhere, although I'm not sure I understand. JSON Lines is exactly that: lines of JSON. You can't have JSON Lines without JSON. Further, we are using Data Integrity and specifically eddsa-jcs-2022, which I don't see changing.

So I think what you are arguing for is changing the file from something that looks like this (with a .jsonl extension)

{"versionId":"1-QmNt8Q34JdjyfshJkoLZhJdx725QfBeekiNJhwZ7KUP4N2","versionTime":"2025-01-10T19:27:34Z","parameters":{"method":"did:webvh:0.5","scid":"QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP","updateKeys":["z6Mkt8ZfdufKQWY1svrZenBeXsTvWjukWWTNbDAUytNgTDR8"],"portable":false,"nextKeyHashes":[],"witnesses":[],"witnessThreshold":0,"deactivated":false},"state":{"@context":["https://www.w3.org/ns/did/v1","https://w3id.org/security/multikey/v1"],"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","controller":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","assertionMethod":["did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#ytNgTDR8","did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#mRpKhRQf"],"verificationMethod":[{"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#ytNgTDR8","controller":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","type":"Multikey","publicKeyMultibase":"z6Mkt8ZfdufKQWY1svrZenBeXsTvWjukWWTNbDAUytNgTDR8"},{"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#mRpKhRQf","controller":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","type":"Multikey","publicKeyMultibase":"z6Mkvqc9EJp3YQWupmSrCEko12aq8WtUUiW65iFfmRpKhRQf"}]},"proof":[{"type":"DataIntegrityProof","cryptosuite":"eddsa-jcs-2022","verificationMethod":"did:key:z6Mkt8ZfdufKQWY1svrZenBeXsTvWjukWWTNbDAUytNgTDR8#z6Mkt8ZfdufKQWY1svrZenBeXsTvWjukWWTNbDAUytNgTDR8","created":"2025-01-10T19:27:34Z","proofPurpose":"assertionMethod","proofValue":"z6wvuMGVbY29jAGpj6rSKwF4zg3cUWC4rKAi4tZbJZnu67vC3Zg143r5WMNeH28oZUYG9xBgqDqLy3GZS7SA9EDK"}]}
{"versionId":"2-QmYJji9MhMNMwjWpcR7PqfcEgN3wGxCaBzfHrDe1LMxXjC","versionTime":"2025-01-10T19:27:34Z","parameters":{"witnesses":[],"witnessThreshold":0},"state":{"@context":["https://www.w3.org/ns/did/v1","https://w3id.org/security/multikey/v1"],"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","controller":["did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com"],"assertionMethod":["did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#VSqMo7Va","did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#nVmTqzjV"],"verificationMethod":[{"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#VSqMo7Va","controller":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","type":"Multikey","publicKeyMultibase":"z6Mkf8NpNCYVtmgxeamda3mszJTqDur6TpJnyQUiVSqMo7Va"},{"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#nVmTqzjV","controller":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","type":"Multikey","publicKeyMultibase":"z6MkoADLG9zCh4CymnUZnDahLzx2pfyNkCzxKQFqnVmTqzjV"}]},"proof":[{"type":"DataIntegrityProof","cryptosuite":"eddsa-jcs-2022","verificationMethod":"did:key:z6Mkvqc9EJp3YQWupmSrCEko12aq8WtUUiW65iFfmRpKhRQf#z6Mkvqc9EJp3YQWupmSrCEko12aq8WtUUiW65iFfmRpKhRQf","created":"2025-01-10T19:27:34Z","proofPurpose":"assertionMethod","proofValue":"z5TGXBbSbjgvJ4eSRSAYVBMffosJtWpxUimwSmMokfzeoynvnEv9sBPRYxEMRjihZUWCWcdLuwWj5utjyL8JxpGhz"}]}

To something that looks like this (with a .json extension)

[
  {"versionId":"1-QmNt8Q34JdjyfshJkoLZhJdx725QfBeekiNJhwZ7KUP4N2","versionTime":"2025-01-10T19:27:34Z","parameters":{"method":"did:webvh:0.5","scid":"QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP","updateKeys":["z6Mkt8ZfdufKQWY1svrZenBeXsTvWjukWWTNbDAUytNgTDR8"],"portable":false,"nextKeyHashes":[],"witnesses":[],"witnessThreshold":0,"deactivated":false},"state":{"@context":["https://www.w3.org/ns/did/v1","https://w3id.org/security/multikey/v1"],"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","controller":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","assertionMethod":["did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#ytNgTDR8","did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#mRpKhRQf"],"verificationMethod":[{"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#ytNgTDR8","controller":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","type":"Multikey","publicKeyMultibase":"z6Mkt8ZfdufKQWY1svrZenBeXsTvWjukWWTNbDAUytNgTDR8"},{"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#mRpKhRQf","controller":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","type":"Multikey","publicKeyMultibase":"z6Mkvqc9EJp3YQWupmSrCEko12aq8WtUUiW65iFfmRpKhRQf"}]},"proof":[{"type":"DataIntegrityProof","cryptosuite":"eddsa-jcs-2022","verificationMethod":"did:key:z6Mkt8ZfdufKQWY1svrZenBeXsTvWjukWWTNbDAUytNgTDR8#z6Mkt8ZfdufKQWY1svrZenBeXsTvWjukWWTNbDAUytNgTDR8","created":"2025-01-10T19:27:34Z","proofPurpose":"assertionMethod","proofValue":"z6wvuMGVbY29jAGpj6rSKwF4zg3cUWC4rKAi4tZbJZnu67vC3Zg143r5WMNeH28oZUYG9xBgqDqLy3GZS7SA9EDK"}]},
  {"versionId":"2-QmYJji9MhMNMwjWpcR7PqfcEgN3wGxCaBzfHrDe1LMxXjC","versionTime":"2025-01-10T19:27:34Z","parameters":{"witnesses":[],"witnessThreshold":0},"state":{"@context":["https://www.w3.org/ns/did/v1","https://w3id.org/security/multikey/v1"],"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","controller":["did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com"],"assertionMethod":["did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#VSqMo7Va","did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#nVmTqzjV"],"verificationMethod":[{"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#VSqMo7Va","controller":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","type":"Multikey","publicKeyMultibase":"z6Mkf8NpNCYVtmgxeamda3mszJTqDur6TpJnyQUiVSqMo7Va"},{"id":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com#nVmTqzjV","controller":"did:webvh:QmVt4QVuJbvMSuSHLae6kJwh9k9hZFV58yZZ4XKCxZa5RP:example.com","type":"Multikey","publicKeyMultibase":"z6MkoADLG9zCh4CymnUZnDahLzx2pfyNkCzxKQFqnVmTqzjV"}]},"proof":[{"type":"DataIntegrityProof","cryptosuite":"eddsa-jcs-2022","verificationMethod":"did:key:z6Mkvqc9EJp3YQWupmSrCEko12aq8WtUUiW65iFfmRpKhRQf#z6Mkvqc9EJp3YQWupmSrCEko12aq8WtUUiW65iFfmRpKhRQf","created":"2025-01-10T19:27:34Z","proofPurpose":"assertionMethod","proofValue":"z5TGXBbSbjgvJ4eSRSAYVBMffosJtWpxUimwSmMokfzeoynvnEv9sBPRYxEMRjihZUWCWcdLuwWj5utjyL8JxpGhz"}]}
]

Your argument is:

  • .jsonl is an obscure file format
  • parsing the data is more complicated
  • parsing the data performs worse

My argument is:

  • Although it is obscure, it is extremely simple
  • I don't think it's that complicated; it just requires splitting on \n
  • the parsing performance might be slightly worse, but the memory requirements can be lower as you don't have to load the entire log into memory every time
  • streaming is impossible with JSON
  • entries can be accessed and manipulated by line
  • writing a new log entry can be an append operation and doesn't require a full file rewrite like in JSON
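
The append and memory points in the list above can be sketched roughly as follows. This is a minimal illustration and not code from any did:webvh implementation: the helper names `append_entry` and `iter_entries` are made up, and the log is assumed to be a local UTF-8 file.

```python
import json


def append_entry(path, entry):
    # Serialize compactly so the entry occupies exactly one line, then append.
    # json.dumps escapes newlines inside strings, so the line cannot be split.
    line = json.dumps(entry, separators=(",", ":"))
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def iter_entries(path):
    # Yield one parsed entry at a time; only the current line is held in memory.
    with open(path, "r", encoding="utf-8", newline="\n") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                yield json.loads(line)
```

With a JSON array the same update would typically mean reading the file, adding the element, and writing the whole array back, or seeking to the closing bracket and patching it by hand.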

Does that sum it up? I'd be interested in hearing any implementer's point of view on whether JSON Lines is a dealbreaker when looking to implement did:webvh.

@filip26
Author

filip26 commented Jan 14, 2025

As an implementer 😉, yes, this is indeed a bit of a blocker:

  1. I don’t want to preprocess data just to replace \n with , - it’s redundant, negates whatever minor benefits JSON Lines might provide, and makes things even worse.
  2. I don’t want to introduce or rely on an unmaintained JSON Lines implementation.
  3. I don’t want to implement or maintain JSON Lines myself.
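
The preprocessing step described in point 1 would look roughly like the sketch below: one extra pass over the text to build a JSON array, followed by a single parser call. The function names are hypothetical, and the sketch assumes every non-empty line is a complete JSON value.

```python
import json


def jsonl_to_array_text(jsonl_text):
    # Extra pass over the input: keep non-empty lines, join them with commas,
    # and wrap the result in brackets so it becomes one JSON array document.
    lines = [line for line in jsonl_text.split("\n") if line.strip()]
    return "[" + ",".join(lines) + "]"


def parse_via_array(jsonl_text):
    # The rewritten text is then handed to an ordinary JSON parser.
    return json.loads(jsonl_to_array_text(jsonl_text))
```

The rewrite touches every byte of the log before parsing begins, which is the redundancy the point refers to.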

@brianorwhatever
Contributor

haha yes I meant other implementers as I suspected that was your answer.

All 3 of these points are essentially the same point, as "implementation" and "preprocessing" are the same thing. There isn't anything more to maintain other than splitting the file on newline characters. After that you have an array of strings to be processed as JSON and verified as per the spec. Any programming language that can read a file and split a string can easily do this.
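
As a rough sketch of that split-then-parse step in Python (the function name `parse_log` and the line-numbered error message are my own additions, and the per-entry verification required by the spec is deliberately left out):

```python
import json


def parse_log(text):
    # Split the log on newline characters and parse each non-empty line as JSON.
    entries = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue
        try:
            entries.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report which line failed so a broken entry is easy to locate.
            raise ValueError(f"line {number}: invalid JSON ({exc.msg})") from exc
    return entries
```

Each returned entry would still need to be verified as the spec describes; this only covers getting from file contents to parsed entries.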

I agree - if it were more work than this it wouldn't be worth the benefits (minor to you, major to me).

I am interested in seeing how this shakes out in CEL (see digitalbazaar/cel-spec#3 and digitalbazaar/cel-spec#12) and hopefully we can some day align this spec with that one 😄

Thanks for the discussion

@filip26
Author

filip26 commented Jan 14, 2025

@brianorwhatever It’s not the same 😉 - one point is about time complexity, while the other two focus on portability, implementation and maintenance costs.

How many languages have solid support for JSON Lines? That’s why I call it obscure: it offers little value, has limited support, and is only valid for a narrow set of use cases. webvh is not one of them.

[Updated after the comment below: Obviously, I meant libraries/packages/etc. - simply existing solid implementations to use. This is getting ridiculous. I’m sorry, but I don’t understand the strong pushback. My intention is solely to help and make this easier for others to adopt - it should be a primary goal for any specification.]

@brianorwhatever
Contributor

0 languages have support for JSON Lines. They don't need it. We could probably write the required code for the top 20 programming languages in the time we've spent on this thread 🥲

@brianorwhatever
Contributor

In any software system there is a point where new software needs to be written and libraries aren't necessary. I am proposing this lives in that area, and I don't believe it takes up too much of it. I acknowledge the slight performance cost of JSONL, but I currently believe the benefits it gives (easy appending, streaming support, line-based tools) are worth it.


3 participants