diff --git a/ZEP000-lfs.md b/ZEP000-lfs.md
new file mode 100644
index 0000000..f32aa82
--- /dev/null
+++ b/ZEP000-lfs.md
@@ -0,0 +1,186 @@
+---
+title: Zeebe LFS Add On
+authors:
+  - pihme
+reviewers:
+  - TBD
+approvers:
+  - TBD
+editor: TBD
+creation-date: 2020-07-13
+last-updated: 2020-07-13
+status: provisional
+---
+
+# Summary
+[summary]: #summary
+
+This ZEP proposes an add-on that offloads the storage of large data to an external system (e.g. Google Cloud Storage, AWS S3). Zeebe is relieved of handling large data and can deliver better performance.
+
+# Motivation
+[motivation]: #motivation
+
+Zeebe is not ideal for handling or storing large amounts of data. Customers sometimes want to use large amounts of data within a workflow.
+
+The large data could be offloaded to an external system, with Zeebe handling only a small reference to the data.
+
+This reference will be small enough that Zeebe can be configured to use small message sizes to achieve better performance.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+There are at least two ways this could be implemented:
+
+## I) Implementation as part of the Gateway
+
+*Idea*
+* The gateway inspects the data stream for variables that exceed a certain length
+* Such variables are stored in an external storage, and Zeebe replaces them internally with a reference
+* When variables are requested by a client, the reference is resolved and the content returned
+* When exporting, the reference is resolved and the content returned
+
+*Pro*
+* Fully transparent to the clients
+* Could be offered as an added-value feature to paying customers
+* Data can be evicted from external storage when log entries are deleted
+
+*Contra*
+* We break the design constraint of being independent of external systems.
+* Gateway bandwidth is still hit by large data
+* Probably not scalable to very large data (> 1 GB)
+* The problem of large data is only moved along to the exporters
+
+
+
+## II) Implementation as part of the Client API
+
+*Idea*
+* The client API is extended to allow setting large variables
+* These are then streamed to external storage on the client side, and the client puts the replacing identifier or URL in as the variable content
+* Gateway, Zeebe, and exporters only see the reference to the object in external storage, never the real data
+
+*Pro*
+* Could be interesting for users in terms of security and privacy (Zeebe will handle a reference to the data, but never see the actual data; all authorization and authentication are handled on the client side)
+* More flexible/expandable if users have strong preferences about which storage to use
+* Can support large data of arbitrary size
+* Gateway/exporter bandwidth never sees the large data
+
+*Contra*
+* Needs to be implemented for each client API we support
+* No full audit trail, because the external system is out of our control. We may not even be able to access it
+* Workers need to be aware that variable content might be just a reference to data stored elsewhere. This is something we could add as part of the client API, but then again we have to do it for every client we support.
+* Susceptible to user error (e.g. if a developer sets the big variable directly, and doesn't use the API for setting large variables, then Zeebe might choke on it)
+* Unclear how data in external storage can be evicted (Zeebe can give hints, e.g. after a workflow instance is closed, but someone has to respect those hints)
+
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+* Increases the complexity of the solution
+* Offloading large data to an external system will (likely) prevent us from using the data in FEEL expressions
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+- Why is this design the best in the space of possible designs?
+- What other designs have been considered and what is the rationale for not choosing them?
+- What is the impact of not doing this?
+
+# Prior art
+[prior-art]: #prior-art
+
+The proposal is inspired by Git LFS (https://git-lfs.github.com/). The author has no personal experience using Git LFS.
+
+# Out of scope
+[out-of-scope]: #out-of-scope
+
+Call out anything which is explicitly not part of this ZEP.
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+- Does the idea have merit?
+- Which implementation approach to pursue?
+- How to handle eviction of data in the external storage?
+
+
+# Future possibilities
+[future-possibilities]: #future-possibilities
+
\ No newline at end of file