[RFC]: Storage Optimization for OPEA Workloads proposal #1118
Comments
I would still like the OPEA project to use Fluid, or any existing caching solution. Fluid is an existing project with a wide community, targeting data-intensive (e.g., AI) applications. The OPEA project should focus on creating Fluid configurations and deployments for each use case. I don't see the added value of creating an OPEA CSI caching solution if there is already a solution the project can leverage.
@poussa Thanks much for the comments! Let me add some clarifications for reference: this RFC does NOT propose creating an OPEA CSI caching solution; instead, it proposes a way to manage existing caching solutions (e.g. Fluid - https://github.com/fluid-cloudnative/fluid/, Rok - https://www.arrikto.com/rok-data-management-platform/, etc.) and data storage solutions (e.g. local, NFS, Ceph, etc.). Based on the experience from our POC (Fluid + NFS/Ceph), we think the proposal can bring the following value to OPEA users:
Priority
P2-High
OS type
Ubuntu
Hardware type
Xeon-GNR
Running nodes
Multiple Nodes
Description
Storage Optimization for OPEA Workloads proposal
This RFC proposes a solution to manage distributed storage and optimize data access performance for OPEA workloads by leveraging third-party solutions.
Author(s)
qwren
ichbinblau
hhb584520
airren
hualongfeng
majianpeng
hle2
Status
Under Review
Objective
Goals:
Non-Goals:
Motivation
AI applications typically require large amounts of data that can be shared, e.g., models, training data, and context data in RAG systems. However, there are no components in OPEA to manage this data sharing, so OPEA users need to (1) explicitly manage (e.g., create, delete) their own data and (2) create Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) and address sharing issues themselves.
There is no effective way in OPEA to address the data access latency and high bandwidth challenges that stem from pulling remote data, caused by the separation of compute and storage. Solutions like Fluid help accelerate data access for data-intensive applications, but they require the OPEA user to understand the backend storage system and are usually hard to configure.
There is no support in OPEA for offline environments or for handling intermittent network connectivity.
Design Proposal
We follow the Kubernetes CSI specification to implement an OPEA CSI driver that communicates with the K8s API Server and kubelet to manage PVs/PVCs for OPEA workloads. The CSI driver manages the backend storage solution through the IDataEngine interface and the cache solution through the ICacheEngine interface, as shown in the following diagram:
API Definitions
The OPEA CSI Driver uses IDataEngine to manage plugins for local or distributed storage solutions. The engineParam parameter (which includes information such as the URL and the security information needed to communicate with the storage solution) is used to configure the plugin. The Create/DeleteStorage interface is invoked by the CSI plugin to create/delete the actual storage when required; the storage is then used by the cache engine (when applicable) to manage data for OPEA workloads.
The OPEA CSI Driver uses ICacheEngine to manage plugins for cache solutions. The engineParam parameter (which includes information on how to configure the cache engine) is used to configure the plugin. The Create/DeleteCache interface is invoked by the CSI plugin to create/delete a cache on the local node; the cache information is then used to generate a PV for OPEA workloads.
Workflow:
The System Admin is expected to deploy the distributed storage solution and the cache solution in the cluster, and then create a StorageClass (as shown below) to tell the OPEA CSI Driver which storage and cache solutions to use and how to use them. Each plugin can define its own CRD for the dataEngine and cacheEngine parameters; these are translated into the engineParam parameter used by the CSI Driver to configure the DataEngine and CacheEngine plugins.
If no dataEngine parameter is configured, or no data storage solution is deployed, the CSI Driver falls back to using the node's local disk for data access by OPEA workloads.
If no cacheEngine parameter is configured, or no cache solution is deployed, the CSI Driver falls back to creating PVs for OPEA workloads without a cache.
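A minimal StorageClass sketch is shown below. The provisioner name and the dataEngine/cacheEngine values are illustrative assumptions, not names defined by this proposal:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: opea-storage              # illustrative name
provisioner: csi.opea.io          # assumed OPEA CSI driver name
parameters:
  # Both parameters are optional; see the fallback behavior described above.
  dataEngine: "ceph-data-engine"      # plugin-defined reference (e.g. a CRD instance) for the storage backend
  cacheEngine: "fluid-cache-engine"   # plugin-defined reference (e.g. a CRD instance) for the cache solution
reclaimPolicy: Delete
volumeBindingMode: Immediate
```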
The OPEA user then creates a PVC (PersistentVolumeClaim), as shown below, to use the OPEA storage solution.
The PVC should define an annotation with the "opea-storage-key" key to enable data sharing between workloads. The OPEA CSI Driver maintains a "PV Meta Cache" internally: it first checks whether the key for the PVC already exists, and if so, it reuses the previously created remote storage for the workloads.
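A minimal PVC sketch, assuming the StorageClass above; the "opea-storage-key" annotation is the key described in this proposal, while the PVC name, key value, access mode, and size are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: opea-model-pvc                    # illustrative name
  annotations:
    opea-storage-key: "shared-model-data" # PVCs with the same key reuse the same remote storage
spec:
  storageClassName: opea-storage          # the StorageClass sketched above
  accessModes:
    - ReadWriteMany                       # assumption: data is shared across workloads
  resources:
    requests:
      storage: 50Gi
```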
The general flow is shown in the diagram below:
Alternatives Considered
Using the Fluid solution directly
Fluid provides a CSI Driver so that it can be used by k8s workloads; however, it:
Compatibility
N/A; the solution is only used when the user defines a PVC explicitly.
Miscellaneous
List other information users and developers may care about, such as:
Staging plan