-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Design for Upload Speed Test with BSL Config for Object Storage #1558
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,203 @@ | ||||||||
# Upload Speed Test for Backup Storage Location (Object Storage) | ||||||||
|
||||||||
## Abstract | ||||||||
|
||||||||
This document presents the design of a Custom Resource Definition (CRD) and its controller to test the upload speed from an OpenShift cluster to cloud object storage. | ||||||||
The design leverages the BackupStorageLocation configuration from the OADP operator’s DPA CRD to specify object storage locations for testing. The controller will use | ||||||||
these configurations to authenticate and perform uploads, measuring speed and reporting results through the CRD status. | ||||||||
|
||||||||
## Background | ||||||||
|
||||||||
We have observed several instances in customer environments where backup or restore operations became stalled or significantly delayed, but the root cause was not immediately apparent. | ||||||||
These issues could stem from poor network connectivity, cloud storage throttling, or misconfigurations in BackupStorageLocation (BSL) settings. Since backup performance plays a critical | ||||||||
role in disaster recovery strategies, it is essential to have a reliable way to test upload speeds. | ||||||||
|
||||||||
This Upload Speed Test CRD will help in such scenarios by proactively measuring the upload speed from the OpenShift cluster to the cloud storage configured in OADP’s BSL. By identifying | ||||||||
performance bottlenecks ahead of backup or restore operations, administrators can take corrective actions to avoid failures during critical operations. | ||||||||
|
||||||||
## Goals | ||||||||
|
||||||||
- Measure upload speeds for a given BackupStorageLocation (BSL) config settings and credentials | ||||||||
- Measure upload speeds from the OpenShift cluster to the configured object storage. | ||||||||
- Update the test results in the status of the UploadSpeedTest CRD for visibility and tracking. | ||||||||
- Enable smooth integration with OADP operator-managed backup storage location configurations. | ||||||||
|
||||||||
|
||||||||
## Non-Goals | ||||||||
|
||||||||
- This design will not create or modify BackupStorageLocation entries in OADP. | ||||||||
- It will not implement download or latency tests, focusing solely on upload speed. | ||||||||
- Scheduling of recurring tests is not supported in the initial version. | ||||||||
|
||||||||
## High-Level Design | ||||||||
Components involved and their responsibilities: | ||||||||
|
||||||||
- UploadSpeedTest (UST) CRD: | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reading whole design, would make sense to change CRD name to something more generic (NetworkSpeedTest?). If more features, like latency and download speed, are added, the CRD name would not be telling users all things it can do? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I had thought about that, Any Network speed would consist of upload as well as download speed, the risk here with |
||||||||
- Captures the BSL configuration, test file size and test timeout used for testing directly in the CRD. | ||||||||
- Holds the test results, including success/failure and upload speed. | ||||||||
- UploadSpeedTest (UST) Controller: | ||||||||
- Watches for changes to the UploadSpeedTest CRs and executes uploads based on the provided BSL configuration. | ||||||||
- Uses appropriate cloud SDKs (AWS, Azure, or GCP) to perform uploads. | ||||||||
- Updates the status of UST CR with upload speed, upload success/failure and error message if any. | ||||||||
- OADP/DPA-UST Integration: | ||||||||
- Fetches BackupStorageLocation (BSL) details from the OADP operator’s DPA CR. | ||||||||
weshayutin marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||
- Populates the UploadSpeedTest CR with relevant configurations from the BSL. | ||||||||
- Creates the UST CR for the BSL configured in DPA instance. | ||||||||
## Detailed Design | ||||||||
|
||||||||
#### Proposed Upload Speed Test CRD: | ||||||||
|
||||||||
The UploadSpeedTest CRD structure includes fields to capture backup storage location configurations, upload test parameters, and the location of cloud provider secrets. | ||||||||
|
||||||||
```yaml | ||||||||
apiVersion: oadp.openshift.io/v1alpha1 | ||||||||
kind: UploadSpeedTest | ||||||||
metadata: | ||||||||
name: my-upload-speed-test | ||||||||
spec: | ||||||||
backupLocation: | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we want UST to be able to be used independently (in addition to via DPA) by providing the BSLConfig. Thinking of the the use-case:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So we do not want to reference by name to an already existing BSL? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah because we already have the functionality to test a particular BSL via DPA, proposed here . Why keep it bi-directional ! I believe this keeps the CRD simpler. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So an upstream velero user would not be able to use this without copying the whole bsl.spec? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps that was not the plan. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not an upstream feature :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds like an easy add. If it does not complicate our designs, I would propose we do things in an upstream usable way. It'd be much cooler to say "you can try our team's tool to debug your bsl speed" |
||||||||
velero: | ||||||||
provider: aws # Cloud provider type (aws, azure, gcp) | ||||||||
default: true | ||||||||
objectStorage: | ||||||||
bucket: sample-bucket | ||||||||
prefix: velero | ||||||||
config: | ||||||||
region: us-east-1 | ||||||||
profile: "default" | ||||||||
insecureSkipTLSVerify: "true" | ||||||||
credential: | ||||||||
name: cloud-credentials # Secret containing cloud credentials | ||||||||
key: cloud | ||||||||
uploadSpeedTestConfig: | ||||||||
fileSize: "100MB" # File size for the test upload | ||||||||
testTimeout: "60s" # Maximum duration allowed for the test | ||||||||
kaovilai marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||
``` | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could we change here to only have 2 fields in spec (
Having BSL spec copied here makes it harder for user to manually create one (even thinking about DPA integration, if BSL spec changes, UST must change. Having just name reference would not simplify workflow?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah I was thinking about the exact same thing. Making There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done, removed |
||||||||
|
||||||||
#### Changes to DPA CRD BSL Configuration spec: | ||||||||
|
||||||||
The DPA CRD configuration includes fields to enable and configure the Upload Speed Test (UST) within the BSL configuration. | ||||||||
|
||||||||
```yaml | ||||||||
apiVersion: oadp.openshift.io/v1alpha1 | ||||||||
kind: DataProtectionApplication | ||||||||
metadata: | ||||||||
name: sample-dpa | ||||||||
spec: | ||||||||
backupStorageLocations: | ||||||||
- name: aws-backup-location | ||||||||
BSLSpec: ... | ||||||||
uploadSpeedTest: | ||||||||
enabled: true # Flag to enable upload speed test | ||||||||
fileSize: "10MB" # Size of the file to be uploaded | ||||||||
testTimeout: "60s" # Timeout for the upload test | ||||||||
``` | ||||||||
|
||||||||
UST controller workflow: | ||||||||
|
||||||||
Reconciliation loop: | ||||||||
- Watches for UploadSpeedTest CRD instances. | ||||||||
- Fetches Cloud Provider configurations and credentials | ||||||||
- Performs upload speed test | ||||||||
- Calculates upload speed and update UST status | ||||||||
|
||||||||
Upload execution: | ||||||||
- Uses the appropriate cloud SDK (AWS, Azure, GCP) based on the BSL configuration. | ||||||||
- Initializes the cloud provider client dynamically using a CloudProvider interface, allowing extension to new providers without changes to the controller logic. | ||||||||
- Performs the upload and measures speed in Mbps, calculating speed using the formula: | ||||||||
``` | ||||||||
Speed (Mbps) = (FileSize (bytes) * 8)/(uploadDuration(ms) * 1000) | ||||||||
``` | ||||||||
- Records whether the upload succeeded or failed. | ||||||||
|
||||||||
Status Update: | ||||||||
- Updates the UploadSpeedTest CR status with: | ||||||||
- LastTested: The timestamp of the last test. | ||||||||
- Status: Result of the test operation (e.g., “Complete” or “Failed”). | ||||||||
- SpeedMbps: Calculated upload speed. | ||||||||
- ErrorMessage: Details of any error encountered during the test. | ||||||||
|
||||||||
Flow Diagram: | ||||||||
|
||||||||
![Upload Speed Test Flow Diagram](../images/ust.jpg) | ||||||||
|
||||||||
## Implementation | ||||||||
|
||||||||
#### Dynamic Provider Initialization: | ||||||||
- The `initializeProvider` function in UST reconcile dynamically selects the correct cloud provider (AWS, Azure, GCP) based on the configuration in the UploadSpeedTest CR. | ||||||||
- Credentials are fetched from the specified secret in the CR and used to initialize the cloud provider client. | ||||||||
|
||||||||
#### CloudProvider Interface: | ||||||||
- Each provider (AWS, Azure, GCP) implements the CloudProvider interface with its own logic for the UploadTest method. | ||||||||
- This allows the controller to remain agnostic to provider-specific implementation details. | ||||||||
```go | ||||||||
package cloudprovider | ||||||||
|
||||||||
import ( | ||||||||
"context" | ||||||||
oadpv1alpha1 "github.com/openshift/oadp-operator/api/v1alpha1" | ||||||||
"time" | ||||||||
) | ||||||||
|
||||||||
type CloudProvider interface { | ||||||||
UploadTest(ctx context.Context, ust *oadpv1alpha1.UploadSpeedTest, fileSize int64, testTimeout time.Duration) (int64, error) | ||||||||
} | ||||||||
``` | ||||||||
|
||||||||
#### Calculation and Status Update: | ||||||||
- The upload speed is calculated and recorded in the CR’s status for user visibility. | ||||||||
- Error handling is centralized in the controller, which records error messages and sets the status to “Failed” in case of errors. | ||||||||
|
||||||||
### UST CRD Spec | ||||||||
|
||||||||
```go | ||||||||
// UploadSpeedTestSpec defines the desired state of UploadSpeedTest | ||||||||
type UploadSpeedTestSpec struct { | ||||||||
// BackupLocation defines relevant configuration of the object storage/backup storage location to be tested | ||||||||
BackupLocation BackupLocation `json:"backupLocation"` | ||||||||
|
||||||||
// UploadSpeedTestConfig defines the parameters for testing upload speed | ||||||||
UploadSpeedTestConfig UploadSpeedTestConfig `json:"uploadSpeedTestConfig"` | ||||||||
} | ||||||||
|
||||||||
type UploadSpeedTestConfig struct { | ||||||||
// size of the file to be used for test | ||||||||
FileSize string `json:"fileSize"` | ||||||||
|
||||||||
// timeout value for the upload test operation | ||||||||
TestTimeout string `json:"testTimeout"` | ||||||||
} | ||||||||
``` | ||||||||
|
||||||||
### UST CRD Status | ||||||||
|
||||||||
```go | ||||||||
// UploadSpeedTestStatus defines the observed state of UploadSpeedTest | ||||||||
type UploadSpeedTestStatus struct { | ||||||||
// Timestamp of the last upload speed test | ||||||||
LastTested metav1.Time `json:"lastTested"` | ||||||||
|
||||||||
// Status of the test operation - Complete, Failed | ||||||||
Status string `json:"phase"` | ||||||||
|
||||||||
// Upload Speed of the operation | ||||||||
SpeedMbps int64 `json:"speed"` | ||||||||
|
||||||||
// Details of any error encountered | ||||||||
ErrorMessage string `json:"errorMessage"` | ||||||||
} | ||||||||
``` | ||||||||
|
||||||||
**Note:** | ||||||||
- We are targeting this feature for OADP 1.5 | ||||||||
- The implementation would be done in small phases: | ||||||||
1. First phase would independent introduction of UST CRD and controller (only for AWS provider) | ||||||||
1. Then next would be enabling integration with OADP/DPA | ||||||||
1. Followed by remaining cloud provider Azure and GCP | ||||||||
|
||||||||
## Future Scope | ||||||||
|
||||||||
- Recurring Tests: Support for recurring tests could be added by integrating with a scheduling system. | ||||||||
- Enhanced Metrics: Consider additional metrics like latency or download speed. | ||||||||
|
||||||||
## Open Questions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we intend to only measure BSLs created by DPA and not all BSLs in OADP namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So these are 2 separate things:
So all the scenarios are covered I believe.