
Node-agent high memory usage #8582

Open
RobKenis opened this issue Jan 6, 2025 · 19 comments

@RobKenis

RobKenis commented Jan 6, 2025

What steps did you take and what happened:

Velero is installed with nodeAgent enabled. Backup storage location is configured to use Azure Blob Storage.

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-that-uses-a-lot-of-memory
  namespace: velero
spec:
  backupName: external-backup-1
  excludedResources:
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io
  - csinodes.storage.k8s.io
  - volumeattachments.storage.k8s.io
  - backuprepositories.velero.io
  - policies.rabbitmq.com
  existingResourcePolicy: update
  hooks: {}
  includedNamespaces:
  - custom-namespace
  itemOperationTimeout: 48h0m0s
  uploaderConfig:
    parallelFilesDownload: 16

The memory request for node-agent is set to 5Gi. Normally, the limit is also set to 5Gi. To avoid the Pod getting OOMKilled, I have removed the limit for this restore.

What did you expect to happen:

Memory usage stays around 5Gi, but below it.

Anything else you would like to add:

[image attachment]

Environment:

  • Velero version (use velero version): v1.15.0
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version): v1.28.2+k3s1
  • Kubernetes installer & version: v1.28.2+k3s1
  • Cloud provider or hardware configuration: Azure VM
  • OS (e.g. from /etc/os-release): AlmaLinux 9.5 (Teal Serval)

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Contributor

Lyndon-Li commented Jan 7, 2025

The recommended configuration is to set "no limit", i.e. Best Effort.
Depending on the complexity and scale of the data being backed up, node-agent may use a lot of memory during backup/restore (fs-backup), consumed by the fs-uploader and the repository. That memory comes down to process heap memory and system page cache.
After the backup/restore, the system page cache may not be reclaimed if your node has enough memory. This is system behavior and outside Velero's control.
After the backup/restore, most of the process heap memory is released, but a small proportion is retained to be reused by subsequent backups/restores as a performance enhancement.

@Lyndon-Li
Contributor

Another recommendation is to use data mover backup/restore over fs-backup:

  1. Data mover backup/restore allocates the memory in a dedicated pod, and it is fully released after the backup/restore (since 1.15)
  2. Data mover backup/restore is more consistent than fs-backup

@RobKenis
Author

RobKenis commented Jan 7, 2025

How would removing the memory limit impact the system as a whole? Could this mean that the node goes OOM when the Velero node agent uses too much memory?

The reason we would like to set memory limits is that we are running on a resource-constrained system and want to avoid impacting other services.
During restore, other services are scaled down, so this is less of an issue, but during backup it would be nice to set a reasonable limit so the node agent still gets OOMKilled if it goes over the limit instead of impacting other services.

@Lyndon-Li
Contributor

How would removing the memory limit impact the system as a whole? Could this mean that the node goes OOM when the Velero node agent uses too much memory?

That depends on the complexity and scale of the data being backed up. Most probably the node's memory will not run out, but the system cache will be reclaimed when memory is tight. However, this would also impact other workloads running on the same node.

@Lyndon-Li
Contributor

The reason we would like to set memory limits is because we are running on a resource constrained system and we would like to avoid impact on other services

If so, data mover is also recommended, because you can customize which nodes the data mover should/should not run on; you cannot do this for fs-backup.

@RobKenis
Author

RobKenis commented Jan 7, 2025

That depends on the complexity and scale of the data being backed up. Most probably the node's memory will not run out, but the system cache will be reclaimed when memory is tight. However, this would also impact other workloads running on the same node.

The volume that causes the most issues seems to be a volume that contains around 1TB of small files, between 500K and 5M in size.

@msfrucht
Contributor

msfrucht commented Jan 8, 2025

That isn't surprising. The worst-case scenario for deduplication-based backup and restore software, such as the restic and kopia used in Velero, is a large number of small files.

Red Hat's recommendation for a normal-sized config is a 16GB request and a 32GB limit for restic. Your usage requirements are in line with expectations.

Kopia usually uses fewer resources than restic in Velero 1.15. Switching will require new backups, as kopia and restic repositories are not compatible and have no migration path. You can check which one is in use by inspecting the BackupRepository object.
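For example (assuming the default velero install namespace), the repository type can be read from spec.repositoryType of the BackupRepository objects:

kubectl -n velero get backuprepositories.velero.io \
  -o custom-columns=NAME:.metadata.name,TYPE:.spec.repositoryType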

@RobKenis
Author

Out of curiosity, I have been playing around with GOMEMLIMIT and GOGC to see if I can lower the overall memory usage. I have been able to lower the peak usage from 25G to 15G. After a restore is done, the Go runtime keeps 8G, which is higher than expected.
I understand that this memory usage is to be expected for my use case of lots of small files, but the systems we deploy on don't really allow it, because it would impact other services running on those systems.
Do you have recommendations to lower the peak memory usage, even if they impact Velero performance?
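For reference, these Go runtime knobs can be set as environment variables on the node-agent DaemonSet. A minimal sketch, assuming the default velero namespace; the values are only illustrative, and a Helm-managed install would need the equivalent chart values instead, since a direct edit is overwritten on upgrade:

kubectl -n velero set env daemonset/node-agent GOMEMLIMIT=12GiB GOGC=50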

@msfrucht
Contributor

If the repository is restic, then setting parallelFilesUpload might improve the memory usage at the cost of performance.

Kopia is set to do parallel uploads equal to the number of CPUs. Lowering the number of parallel upload streams would lower the memory usage at the cost of performance. I don't know if setting a CPU limit would change that reporting.

@Lyndon-Li I don't suppose you've tested that? A brief check using nproc reported the number of CPU cores regardless of CPU request and limit. That isn't necessarily the same as Go.

@Lyndon-Li
Contributor

I don't know if setting a cpu limit would change that reporting

No, the number of CPUs reported by Golang is always the number of CPU cores on the node; the cgroup CPU limit doesn't affect it. So always use the backup parameter --parallel-files-upload to change the number of parallel uploads, for example as sketched below.
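A minimal sketch (the backup name and the value 4 are only examples):

velero backup create example-backup \
  --include-namespaces custom-namespace \
  --parallel-files-upload 4

or, equivalently, in the Backup resource:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: example-backup
  namespace: velero
spec:
  includedNamespaces:
  - custom-namespace
  uploaderConfig:
    parallelFilesUpload: 4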

@Lyndon-Li
Contributor

@RobKenis
Could you also share the number of CPU cores in your nodes?

@RobKenis
Author

RobKenis commented Jan 13, 2025

@Lyndon-Li I am testing on a system with 16 cores and 128GB of memory

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                AuthenticAMD
  Model name:             AMD EPYC 7763 64-Core Processor
    CPU family:           25
    Model:                1
    Thread(s) per core:   2
    Core(s) per socket:   8
    Socket(s):            1
    Stepping:             1
    BogoMIPS:             4890.85
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
               total        used        free      shared  buff/cache   available
Mem:           125Gi        51Gi       4.8Gi       2.0Gi        72Gi        74Gi
Swap:             0B          0B          0B

@msfrucht
Contributor

The option --parallel-files-upload doesn't affect Kopia, only Restic. https://github.com/vmware-tanzu/velero/blob/release-1.15/pkg/uploader/kopia/snapshot.go#L100

Should that be a separate issue to have this option apply to Kopia?

@Lyndon-Li
Contributor

The option --parallel-files-upload doesn't affect Kopia, only Restic

No, on the contrary, it works for the Kopia path only:

curPolicy.UploadPolicy.MaxParallelFileReads = newOptionalInt(parallelUpload)

@Lyndon-Li
Contributor

@RobKenis

I am testing on a system 16 cores and 128GB of memory

Then the default concurrency in your env is 16.
Also note that the pattern of your current data is the typical pattern that consumes high memory during backup, restore and repo maintenance.

So here are the recommendations, all in all:

  1. Data mover backup/restore should be used over fs-backup
  2. You could reduce the memory usage of backup/restore by controlling the concurrency (--parallel-files-upload), but that will significantly slow down the performance; on the other hand, you cannot control the memory usage during repo maintenance. So you could try this, but it is not our recommendation.
  3. We recommend you prepare one or more dedicated nodes for data mover backup/restore and repo maintenance, so that data mover pods and repo maintenance pods run only on those nodes and are free to use memory. You could also block other workloads from running on those nodes, so they won't be affected by a lack of memory (see the sketch below).
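As a rough sketch of recommendation 3: data mover pods can be constrained to specific nodes through the node-agent configuration ConfigMap described in the data-movement node selection docs. The label key/value and ConfigMap name below are assumptions; check the docs for your Velero version for the exact flag that points the node-agent at this ConfigMap.

{
  "loadAffinity": [
    {
      "nodeSelector": {
        "matchLabels": {
          "velero.example/dedicated": "data-mover"
        }
      }
    }
  ]
}

# saved as node-agent-config.json, then:
kubectl -n velero create configmap node-agent-config --from-file=node-agent-config.json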

@RobKenis
Author

@Lyndon-Li I understand the need for data mover; this would resolve a big part of the problem. From what I understand, this requires a CSI driver to create Volume Snapshots. Is this also a possible solution when using Local Volumes, since we don't use a CSI driver?

@msfrucht
Contributor

@Lyndon-Li Thanks for the correction, useful to know. @RobKenis This means you can set this value below the CPU count of your system, and it should reduce memory usage when using Kopia, at the cost of performance.

RobKenis pushed a commit to RobKenis/velero that referenced this issue Jan 15, 2025
This allows us to enable the profiler endpoints on both the
server and the node agent.
This helps me in troubleshooting the high memory usage when
restoring lots of small files.

Refs: vmware-tanzu#8582

Signed-off-by: Rob Kenis <[email protected]>
@RobKenis
Author

@Lyndon-Li @msfrucht I lowered the number of parallel files using the following config in the Restore resource.

uploaderConfig:
    parallelFilesDownload: 1

This makes the restore a lot slower, but memory usage still rises to a high level.

I would like to get more insight into this, but it seems I cannot enable profiling endpoints on the node agent, only on the velero server. Could you please have a look at PR #8618 to enable profiling on the node agent?
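For reference, once a profiler endpoint is enabled on the node-agent (as that PR intends), a heap profile could be collected roughly like this; the port and pod name are assumptions based on the velero server's default profiler address:

kubectl -n velero port-forward <node-agent-pod> 6060:6060
go tool pprof http://localhost:6060/debug/pprof/heap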

@Lyndon-Li
Contributor

See this comment #8582 (comment). The system page cache takes a lot of memory while the fs-uploader reads/writes files. The cache memory won't be aggressively reclaimed as long as there is enough memory on the node, even after the backup/restore completes (since you are using fs-backup and node-agent).

See these recommendations #8582 (comment) for the solution.
