Add client JSON storage API endpoint #1405

cpeel · 2024-12-21T18:45:44Z

We want the new proofreading interface to be able to persist client configuration details across browsers and devices. This requires that some configuration values are stored server-side. To enable this, allow clients to store opaque JSON blobs on the server, up to one blob per client per user.

Please read the updates to SETUP/API.md on important caveats about this endpoint and ensure we're OK with this from a server-side perspective as well as the client side.

This PR introduces a new generalized JsonStorage class and database table with an ApiClientStorage layer on top of that wired into the API. When JSON blobs are updated in the JsonStorage class their timestamp field in the database is also updated. This field isn't currently made available via the code right now but seems like it would be useful if, for instance, we want to clean up client blobs older than some period of time. I'm open to thoughts on this.

curl examples to test this:

export API_KEY=review

# Save a json blob
curl -i -H "Accept: application/json" -H "X-API-KEY: $API_KEY" \
    -X PUT -d '{"key": 1}' "https://www.pgdp.org/~cpeel/c.branch/client-storage/api/v1/storage/clients/newpi"

# Fetch a json blob
curl -i -H "Accept: application/json" -H "X-API-KEY: $API_KEY" \
    -X GET "https://www.pgdp.org/~cpeel/c.branch/client-storage/api/v1/storage/clients/newpi"

# Delete a json blob
curl -i -H "Accept: application/json" -H "X-API-KEY: $API_KEY" \
    -X DELETE "https://www.pgdp.org/~cpeel/c.branch/client-storage/api/v1/storage/clients/newpi"

This is likely to have minor merge conflicts with #1398 and I'll resolve those after that gets merged.

70ray · 2024-12-22T09:58:02Z

SETUP/API.md

+
+Some important notes about this feature:
+* Client storage is one blob per user per client. Said another way: clients are
+  only able to store one blob per `clientid` and that blob is only for the user


Shouln't this be 'one blob per userid ?

It's both. It's one per clientid and by the user that is authenticated. I'll change it slightly to maybe make it clearer.

pinc/JsonStorage.inc

api/v1_storage.inc

pinc/ApiClientStorage.inc

70ray

This looks like a good overall structure, just a few comments about details.

jchaffraix · 2024-12-22T16:38:52Z

api/dp-openapi.yaml

+          description: JSON blob that was persisted
+          content:
+            application/json:
+              schema:
+                type: object


Wouldn't it make sense to return the timestamp (something the client could use for caching) rather than the JSON blob that was sent by the client?

The timestamp is arguably just as useless as the JSON blob because it's always now(). I was uncertain what a good return value would be so I modeled what we do for PUTs for word lists -- and what the product at my job does for PUTs and PATCHes -- which is to return the object that was actually serialized.

Open to ideas / thoughts on what API clients should "expect" from a successful PUT / PATCH / etc beyond the HTTP status code.

The timestamp is arguably just as useless as the JSON blob because it's always now().

Sure as it represents the last modified timestamp from the server's perspective.

However I do think there is a lot more value to returning this compared to the blob that was sent and is known to the client. The timestamp would allow the client to implement some caching (e.g. don't check with the server if the timestamp is fresh enough), lower the bandwidth needed (by return the JSON object on a get request if the timestamp doesn't match, useful for mobile devices) or potentially even allow for delta (this would need to switch to some event based logic). Arguably, neither of those scenarios applies here, but those are enabled by returning the timestamp 😄

Open to ideas / thoughts on what API clients should "expect" from a successful PUT / PATCH / etc beyond the HTTP status code.

I don't think we need to return anything from the API on success: the return code says if it was successful and we can return an optional error field to return the cause of errors to the client. Those 2 are what a minimal client would expect. We can always add more as we understand our needs more.

api/v1_storage.inc

pinc/ApiClientStorage.inc

jchaffraix · 2024-12-22T16:49:00Z

pinc/JsonStorage.inc

+     */
+    public function get(string $setting): ?string
+    {
+        $value = null;


We should move defining $value where need it before the loop on l.69.

api/v1_validators.inc

jchaffraix · 2024-12-22T17:01:19Z

SETUP/API.md

+To enable this feature, add a string for the client to the
+`_API_CLIENT_STORAGE_KEYS` configuration setting and have the client use that
+string with the endpoint as the `clientid`.


I am having a hard time understanding the purpose of the API client storage keys. If the end goal is to allow any client to store this configuration down the road (which means that a user could trivially get one of our keys), is this some temporary measure to prevent arbitrary storage from storing JSON in our DB?

You have the gist of it, but it's not temporary.

Keep in mind that every user has de facto API access via PHP sessions -- that's how the Page Browser users the API and how the new proofreading UI will too. If we don't have some restriction, any logged in user could fill up our server with random JSON blobs. The clientid means that each user can have only at most len(API_CLIENT_STORAGE_KEYS) JSON blobs and those have a maximum size.

Perhaps "client" is the wrong word for this? I am using it in the "JS client that uses the APIs". Would "app" or some other word be clearer?

I'm having trouble visualizing it, too. Part of the discussion has centered around not only persisting settings across browsers and devices, but allowing different browsers on different devices to have settings specific to those devices (for example, if one device has a large monitor, the user may prefer the side-by-side version of the interface, but prefer top-to-bottom on a device with a smaller monitor).

My initial impression of "per client per user" was that multiple devices was covered because each user might be using different clients, thus, each user might have multiple entries, so I'm wondering if I'm just misunderstanding the terminology.

The JS client (ie: new proofreading interface) will need to figure out which settings it should store in the local browser cache (and only available to that browser -- like window size) and which one it should store on the server (and available to any browser device -- like font family). All this endpoint does is handle saving and returning whatever the JS client sends it. We have to figure out the other parts but that's all done within the JS client.

The clientid means that each user can have only at most len(API_CLIENT_STORAGE_KEYS) JSON blobs and those have a maximum size.

I am not seeing any validation on the number of stored blobs, we only check that the client id is allow-listed but besides that the constraints on the DB would allow unbounded storage. Those are trusted clients so it doesn't seem to be a concern, just curious about this.

Perhaps "client" is the wrong word for this? I am using it in the "JS client that uses the APIs". Would "app" or some other word be clearer?

Mhh, is the idea to separate our browser users from other "app" that programmatically call our API?

You have the gist of it, but it's not temporary.

OK, as a side question since we need a logged in users to store the information, would it make sense to use the user rather than an extra key? Or are we thinking that some "apps" will backfill/mutate a lot of users' keys? I am probably missing something rather obvious here as I don't have much context on the new UI.

cpeel

Thanks for the questions and updates. I'll get the (excellent) code suggestions done later today.

api/v1_storage.inc

pinc/JsonStorage.inc

cpeel · 2024-12-22T17:46:23Z

SETUP/API.md

+To enable this feature, add a string for the client to the
+`_API_CLIENT_STORAGE_KEYS` configuration setting and have the client use that
+string with the endpoint as the `clientid`.


You have the gist of it, but it's not temporary.

Keep in mind that every user has de facto API access via PHP sessions -- that's how the Page Browser users the API and how the new proofreading UI will too. If we don't have some restriction, any logged in user could fill up our server with random JSON blobs. The clientid means that each user can have only at most len(API_CLIENT_STORAGE_KEYS) JSON blobs and those have a maximum size.

cpeel · 2024-12-22T17:48:54Z

api/dp-openapi.yaml

+          description: JSON blob that was persisted
+          content:
+            application/json:
+              schema:
+                type: object


The timestamp is arguably just as useless as the JSON blob because it's always now(). I was uncertain what a good return value would be so I modeled what we do for PUTs for word lists -- and what the product at my job does for PUTs and PATCHes -- which is to return the object that was actually serialized.

Open to ideas / thoughts on what API clients should "expect" from a successful PUT / PATCH / etc beyond the HTTP status code.

pinc/ApiClientStorage.inc

SETUP/tests/ci_configuration.sh

cpeel · 2024-12-23T00:04:37Z

PR feedback so far incorporated and resolved the merge conflicts with the documents API PR.

api/v1_storage.inc

70ray

This is great. I think it could be slightly simplified.

70ray · 2024-12-25T12:20:27Z

api/ApiRouter.inc

@@ -73,7 +76,35 @@ class ApiRouter
        throw new InvalidAPI();
    }

-    public static function get_router()
+    public function request(bool $raw = false)


This function could stay in index.php

I moved this here to keep the encoding and decoding together. And because I had to move the encoding here this came along with it.

That seems reasonable. I've had some further thoughts about this and deleted my other ananswered comments which had overlooked a change you made in index.php. It's difficult to see the whole picture in this github view so I dowloaded the files to look at them more carefully.

70ray · 2024-12-25T12:22:24Z

api/ApiRouter.inc

+        }
+    }
+
+    public function response(bool $raw = false): string


We wouldn't need this if it was incorporated as above.

Because I wanted to keep the encoding and decoding together and they got moved in here, breaking them out into clear functions makes it clearer what's going on IMHO.

jchaffraix

Sorry for delay in the reply and merry Christmas! 🎄

jchaffraix · 2024-12-25T16:24:58Z

SETUP/API.md

+To enable this feature, add a string for the client to the
+`_API_CLIENT_STORAGE_KEYS` configuration setting and have the client use that
+string with the endpoint as the `clientid`.


The clientid means that each user can have only at most len(API_CLIENT_STORAGE_KEYS) JSON blobs and those have a maximum size.

I am not seeing any validation on the number of stored blobs, we only check that the client id is allow-listed but besides that the constraints on the DB would allow unbounded storage. Those are trusted clients so it doesn't seem to be a concern, just curious about this.

Perhaps "client" is the wrong word for this? I am using it in the "JS client that uses the APIs". Would "app" or some other word be clearer?

Mhh, is the idea to separate our browser users from other "app" that programmatically call our API?

You have the gist of it, but it's not temporary.

OK, as a side question since we need a logged in users to store the information, would it make sense to use the user rather than an extra key? Or are we thinking that some "apps" will backfill/mutate a lot of users' keys? I am probably missing something rather obvious here as I don't have much context on the new UI.

jchaffraix · 2024-12-25T16:33:19Z

api/dp-openapi.yaml

+          description: JSON blob that was persisted
+          content:
+            application/json:
+              schema:
+                type: object


The timestamp is arguably just as useless as the JSON blob because it's always now().

Sure as it represents the last modified timestamp from the server's perspective.

However I do think there is a lot more value to returning this compared to the blob that was sent and is known to the client. The timestamp would allow the client to implement some caching (e.g. don't check with the server if the timestamp is fresh enough), lower the bandwidth needed (by return the JSON object on a get request if the timestamp doesn't match, useful for mobile devices) or potentially even allow for delta (this would need to switch to some event based logic). Arguably, neither of those scenarios applies here, but those are enabled by returning the timestamp 😄

Open to ideas / thoughts on what API clients should "expect" from a successful PUT / PATCH / etc beyond the HTTP status code.

I don't think we need to return anything from the API on success: the return code says if it was successful and we can return an optional error field to return the cause of errors to the client. Those 2 are what a minimal client would expect. We can always add more as we understand our needs more.

70ray

It seems to me that to receive and return raw data we should really intervene in index.php in the functions api_get_request_body() and api_output_response() rather than make any changes in ApiRouter.
The function api_get_request_body() is easy by adding a 'raw' parameter as you have done but api_output_response() is more difficult. We could set a global variable to tell api_output_response() to send raw data but adding more global variables isn't something we want to do. Perhaps make a static API class to put the variable in. Would this be possible?

cpeel · 2024-12-26T15:43:32Z

We could set a global variable to tell api_output_response() to send raw data but adding more global variables isn't something we want to do. Perhaps make a static API class to put the variable in. Would this be possible?

That's exactly why I put it in ApiRouter -- to avoid a global variable or adding a new artificial class with static variables. To me it made sense to put it in the router because the router is accepting a request and returning a response.

70ray · 2024-12-26T16:50:53Z

We could set a global variable to tell api_output_response() to send raw data but adding more global variables isn't something we want to do. Perhaps make a static API class to put the variable in. Would this be possible?

That's exactly why I put it in ApiRouter -- to avoid a global variable or adding a new artificial class with static variables. To me it made sense to put it in the router because the router is accepting a request and returning a response.

Fair enough.
Another way would be to return an array with a key "raw_data" from our function. Then api_output_response() could check if its input was an array and if it had this key, otherwise encode the data. But this is a bit hacky.

70ray · 2024-12-26T16:57:57Z

api/ApiRouter.inc

@@ -55,15 +57,16 @@ class ApiRouter
            throw new MethodNotAllowed();
        }
        $function = $url_map["endpoint"][$method];
-        return $function($method, $data, $query_params);
+        $this->_response = $function($method, $data, $query_params);
+        return $this->_response;


Do you need this line now?

cpeel · 2024-12-26T21:25:53Z

I am not seeing any validation on the number of stored blobs, we only check that the client id is allow-listed but besides that the constraints on the DB would allow unbounded storage. Those are trusted clients so it doesn't seem to be a concern, just curious about this.

I'm really not doing a good job of describing this which means that I've named things incredibly poorly in the code and I need to figure out how to fix this, my apologies.

Let me zoom out just a little: to use the DP API you must be authenticated with an API key (or a valid PHP session which is a valid proxy) -- we do not allow unauthenticated access. So every request to the API is mapped to a user / username.

We expect the API to be used by two primary consumers:

JS code like the existing Page Browser and upcoming new proofreading interface
tech-savvy project managers, post-processors, and perhaps others who want to use it for their own use or for automated systems

The endpoint in question is geared to enabling the new proofreading interface (that is: client) with the ability to store arbitrary (but well-structured) configuration information to persist across different browsers and devices. Fundamentally we want to allow a client to store one and only one blob per user. Because all endpoints are authenticated when they PUT to v1/storage/clients/:client we have that 1:1 mapping between the user (API key or PHP session) and the client (clientid).

We have an allow-list for valid clientids to prevent a user from storing blobs for arbitrary clientid strings.

We can't rely on clientid to be a secret and known only to the JS code because it's available in the browser so we need to assume these values are effectively public. This means that any authenticated user can store any structurally-valid value at v1/storage/clients/:clientid outside of the JS code that it was intended for. That's just something the client is going to have to deal with.

So in summary, right now the maximum number of blobs that could be stored in this interface is num_users * len(API_CLIENT_STORAGE_KEYS). The maximum size of a blob is not well defined but probably in the 16MB range, although I expect clients will store a lot less than that. If a user wanted to store a lot more information for their clientid record they could but it would only impact them -- they can't store information for other users.

cpeel · 2024-12-30T19:34:34Z

I'm going to do some slight renaming of some things in this to remove client as that's an unnecessary source of confusion. I'll work on that over the next couple of days.

cpeel · 2025-01-14T01:34:03Z

Much later than I planned, but here's a small refactor. I've removed "client" everywhere and just went with "storage key". Hopefully that removes some confusion around the naming.

Functionally this hasn't changed:

Admins can configure a storage key that API users can use to set, get, and delete valid JSON blobs.
API users can only store one blob per storage key and users can only get, set, and delete their own blobs.

The new JS proofreading interface can use this to store per-user configuration information across browsers / computers. It could also be used for another client (such as the Page Browser, for instance) for the same purpose with a second storage key.

After #1411 goes in (and maybe #1407) I'll rebase this to fix any conflicts.

Add a generic JSON-based storage class and an extension of it that will be used in the API to store blobs from clients. This is a similar structure to the Settings class / user_settings table but enforces a JSON value. This creates a shortcut in the API to allow JSON input to bypass being deserialized and the output to bypass being serialized and allow the routing function to manage that.

cpeel · 2025-01-18T00:30:09Z

This has been rebased and the commits all squashed down into one. The sandbox has been updated and can be tested with:

export API_KEY=review

# Save a json blob
curl -i -H "Accept: application/json" -H "X-API-KEY: $API_KEY" \
    -X PUT -d '{"key": 1}' "https://www.pgdp.org/~cpeel/c.branch/client-storage/api/v1/storage/newpi"

# Fetch a json blob
curl -i -H "Accept: application/json" -H "X-API-KEY: $API_KEY" \
    -X GET "https://www.pgdp.org/~cpeel/c.branch/client-storage/api/v1/storage/newpi"

# Delete a json blob
curl -i -H "Accept: application/json" -H "X-API-KEY: $API_KEY" \
    -X DELETE "https://www.pgdp.org/~cpeel/c.branch/client-storage/api/v1/storage/newpi"

cpeel requested review from jchaffraix, chrismiceli, 70ray and srjfoo December 21, 2024 18:45

cpeel self-assigned this Dec 21, 2024

70ray reviewed Dec 22, 2024

View reviewed changes

jchaffraix reviewed Dec 22, 2024

View reviewed changes

cpeel commented Dec 22, 2024

View reviewed changes

SETUP/tests/ci_configuration.sh Outdated Show resolved Hide resolved

cpeel force-pushed the client-storage branch from ac44218 to 974c9b6 Compare December 23, 2024 00:00

70ray reviewed Dec 23, 2024

View reviewed changes

api/v1_storage.inc Outdated Show resolved Hide resolved

api/v1_storage.inc Outdated Show resolved Hide resolved

70ray reviewed Dec 25, 2024

View reviewed changes

jchaffraix reviewed Dec 25, 2024

View reviewed changes

70ray reviewed Dec 26, 2024

View reviewed changes

70ray approved these changes Jan 15, 2025

View reviewed changes

chrismiceli approved these changes Jan 15, 2025

View reviewed changes

jchaffraix approved these changes Jan 17, 2025

View reviewed changes

cpeel force-pushed the client-storage branch from ba7ea11 to 719dd7e Compare January 18, 2025 00:27

srjfoo approved these changes Jan 18, 2025

View reviewed changes

cpeel merged commit e2919d8 into DistributedProofreaders:master Jan 18, 2025
12 checks passed

cpeel deleted the client-storage branch January 18, 2025 19:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add client JSON storage API endpoint #1405

Add client JSON storage API endpoint #1405

cpeel commented Dec 21, 2024

70ray Dec 22, 2024

cpeel Dec 22, 2024

70ray left a comment

jchaffraix Dec 22, 2024

cpeel Dec 22, 2024

jchaffraix Dec 25, 2024

jchaffraix Dec 22, 2024

jchaffraix Dec 22, 2024

cpeel Dec 22, 2024

cpeel Dec 22, 2024

srjfoo Dec 23, 2024

cpeel Dec 23, 2024

jchaffraix Dec 25, 2024

cpeel left a comment

cpeel Dec 22, 2024

cpeel Dec 22, 2024

cpeel commented Dec 23, 2024

70ray left a comment

70ray Dec 25, 2024

cpeel Dec 25, 2024

70ray Dec 26, 2024

70ray Dec 25, 2024

cpeel Dec 25, 2024

jchaffraix left a comment

jchaffraix Dec 25, 2024

jchaffraix Dec 25, 2024

70ray left a comment

cpeel commented Dec 26, 2024

70ray commented Dec 26, 2024 •

edited

Loading

70ray Dec 26, 2024

cpeel commented Dec 26, 2024

cpeel commented Dec 30, 2024

cpeel commented Jan 14, 2025

cpeel commented Jan 18, 2025

Add client JSON storage API endpoint #1405

Add client JSON storage API endpoint #1405

Conversation

cpeel commented Dec 21, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

70ray left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cpeel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cpeel commented Dec 23, 2024

70ray left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jchaffraix left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

70ray left a comment

Choose a reason for hiding this comment

cpeel commented Dec 26, 2024

70ray commented Dec 26, 2024 • edited Loading

Choose a reason for hiding this comment

cpeel commented Dec 26, 2024

cpeel commented Dec 30, 2024

cpeel commented Jan 14, 2025

cpeel commented Jan 18, 2025

70ray commented Dec 26, 2024 •

edited

Loading